C# Comparison Help

Cactusmania

Hi guys,
I am new to C# and I need some help.
I have two RichTextBox controls whose text comes from .doc, rich text format, or plain text files. How do I highlight the exact differences between them, and also highlight the line on which each difference occurs?
A source code example would be appreciated.
 
Well, why don't you start by breaking it down? You have a couple of different tasks. First you need to identify the differences, then you need to highlight them. Highlighting them is easy and I'm going to guess you can handle that on your own. (If I'm wrong, by all means say so.) The real challenge here, then, is identifying the differences.

This can be pretty complicated, and the best place to start is to clearly define the expected behavior. Things get tricky if you have different formats. Should different formatting be highlighted? Will you be allowed to compare rich text to plain text? If so, would you want to examine formatting at all? Do you want to do a line-by-line comparison? What are the considerations for character positions? (If a single word is different in the middle of a sentence all following characters will be shifted left or right.)

What if we compare these two lines of text:
Code:
Lazy brown fox
Lazy brown dog
Should the difference be:
Code:
Lazy brown [COLOR="Red"]dog[/COLOR]
or
Code:
Lazy brown [COLOR="red"]d[/COLOR]o[COLOR="red"]g[/COLOR]

I think the hard part here isn't coming up with the code, but coming up with a clear, precise specification. From there the code should naturally follow.
 
Basically, I want to highlight the line on which the difference occurred and then, as in your example, the entire word (dog). Any suggestions?
 
Well, will only one word ever be different? Could there be two non-adjacent differing words? What happens when we make the following comparisons:
Code:
I have t[COLOR="Red"]en dollar[/COLOR]s
--------
I have t[COLOR="Red"]wenty daffodil[/COLOR]s.
Will you match whole words?
Code:
one [COLOR="red"]two[/COLOR] [U]three[/U] [COLOR="red"]four[/COLOR] five
--------
one [COLOR="red"]four[/COLOR] [U]three[/U] [COLOR="Red"]two[/COLOR] five
Does your program need to be smart enough to pick out partial matches within a difference?
Code:
A
B
C
D
----------
[COLOR="red"]B
C
D[/COLOR]
This is why I was asking about a line by line comparison. The two listings here are very similar, but with a line-by-line comparison, no line is a match (A and B, B and C, C and D). These aren't corner cases, these are the most basic aspects of the functionality and they need to be well defined before they can be implemented.
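To see the problem concretely, here is a small sketch of a strictly positional comparison, where line i on the left is only ever compared with line i on the right. The class and method names (PositionalCompare, MismatchedLines) are made up for illustration, and treating lines past the end of the shorter listing as differences is an assumption, not a requirement from this thread.

```csharp
using System;
using System.Collections.Generic;

static class PositionalCompare
{
    // Returns the 0-based indices at which the two listings differ when
    // line i is compared only with line i. Lines beyond the end of the
    // shorter listing are counted as differences (an assumption).
    public static List<int> MismatchedLines(string[] left, string[] right)
    {
        var mismatches = new List<int>();
        int longer = Math.Max(left.Length, right.Length);

        for (int i = 0; i < longer; i++)
        {
            bool bothPresent = i < left.Length && i < right.Length;
            if (!bothPresent || left[i] != right[i])
                mismatches.Add(i);
        }
        return mismatches;
    }
}
```

Calling MismatchedLines(new[] { "A", "B", "C", "D" }, new[] { "B", "C", "D" }) reports every index (0, 1, 2 and 3) as different, even though the only real change is that "A" was removed.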

That was my point about coming up with a specification for your comparison behavior. It wouldn't be wise to begin coding when you haven't worked out exactly what the code will do. Likewise, I would hate to start giving you some code samples and then have them not do what you need. What are you comparing? Code listings? Encyclopedia articles?

I don't know what you have tried here, or whether you have ever done any string manipulation or comparison. What particular difficulty are you having? I know you are new to C#, but how new are you to programming? I'm not trying to be difficult, but with no background and such an open-ended question it's hard to make useful suggestions.

A basic starting point would be to look at the strings one character at a time. Start at the beginning and go through the strings until you find a difference. If you get to the end, they match. If not, remember where the mismatch starts and move on to step two. Begin comparing the strings at their ends, one character at a time and move backwards until you find a difference. Now you know where the difference starts and where it ends.

Code:
int FindFirstDifferentChar(string A, string B) {
    // Returns the index of the first different char, or -1 if they are the same

    // You can't look past the end of a string, so we will stop at the end of the shorter one
    int shorterStringLen = Math.Min(A.Length, B.Length);

    // Look for a differing character
    for(int CharIndex = 0; CharIndex < shorterStringLen; CharIndex++) {
        if(A[CharIndex] != B[CharIndex])
            return CharIndex; // if two chars don't match, return their index
    }

    // Now, we know that both strings match up to the end of the shorter string.
    // If they are the same length then they are identical.
    if(A.Length == B.Length) return -1; // -1 means no difference

    // Otherwise we would consider the extra chars in the longer string
    // to be a difference
    return shorterStringLen; // Difference starts at end of shorter string
}
This would be the first step.
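The second step (walking backwards from the ends) could look roughly like the sketch below. The names DiffRange, CommonPrefixLength and CommonSuffixLength are made up for illustration. CommonPrefixLength does the same scan as FindFirstDifferentChar but returns the shared length rather than -1, and the suffix scan is capped so it never reaches back into the prefix, which matters for inputs like "aa" versus "a".

```csharp
using System;

static class DiffRange
{
    // Number of leading characters the two strings have in common.
    public static int CommonPrefixLength(string a, string b)
    {
        int limit = Math.Min(a.Length, b.Length);
        int i = 0;
        while (i < limit && a[i] == b[i]) i++;
        return i;
    }

    // Number of trailing characters the two strings have in common,
    // not counting any character already claimed by the common prefix.
    public static int CommonSuffixLength(string a, string b, int prefix)
    {
        int limit = Math.Min(a.Length, b.Length) - prefix;
        int n = 0;
        while (n < limit && a[a.Length - 1 - n] == b[b.Length - 1 - n]) n++;
        return n;
    }
}
```

For "I have ten dollars" and "I have twenty daffodils" the prefix length is 8 and the suffix length is 1, so the differing spans are a.Substring(8, a.Length - 8 - 1), which is "en dollar", and the matching call on b, which is "wenty daffodil".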
 
Right, so what have you got so far?

Ok, so I can highlight the diffs on files that have the same number of lines.
The code is attached.
I want to know how I can highlight the diffs on files that have a different number of lines, so that blank lines would also be highlighted.
Help is appreciated.

Code:
      private void button1_Click(object sender, EventArgs e)
      {
            string[] textFromRTB1 = richTextBox1.Lines;
            string[] textFromRTB2 = richTextBox2.Lines;
            richTextBox1.Clear();
            richTextBox2.Clear();
            int linesCount = 0;
       
            if (textFromRTB1.Length != textFromRTB2.Length)
            {
                  linesCount = Math.Min(textFromRTB1.Length, textFromRTB2.Length);
                  MessageBox.Show("Boxes have different number of lines.", "Information", MessageBoxButtons.OK, MessageBoxIcon.Information);
            }
            else
                  linesCount = textFromRTB1.Length;
       
            bool checker = true;
            for (int i = 0; i < linesCount; i++)
            {
                  checker = String.Equals(textFromRTB1[i], textFromRTB2[i]);
       
                  if (!checker)
                  {
                        richTextBox1.SelectionBackColor = Color.Red;
                        richTextBox2.SelectionBackColor = Color.Red;
                        richTextBox1.SelectedText = textFromRTB1[i] + "\n";
                        richTextBox2.SelectedText = textFromRTB2[i] + "\n";
                  }
                  else
                  {
                        richTextBox1.SelectionBackColor = Color.Green;
                        richTextBox2.SelectionBackColor = Color.Green;
                        richTextBox1.SelectedText = textFromRTB1[i] + "\n";
                        richTextBox2.SelectedText = textFromRTB2[i] + "\n";
                  }
            }
      }
 
I don't think this is as straightforward as it seems.

You need to find or create an algorithm to identify arbitrary insertions and removals from a list. I don't know an algorithm to do this off the top of my head.

Here's what I mean.
Code:
1:    abc    abc
2:    123    [COLOR="Red"]987[/COLOR]
3:    xyz    123
4:           [COLOR="Red"]qwe[/COLOR]
5:           xyz

How is your program going to tell the difference between:
  • Removing "123" and inserting "987/123/qwe"
  • Inserting "987" and "qwe" separately

I'm guessing you'll need to analyze it recursively, i.e. search from the beginning and end of the lists for where a difference begins and ends, then search within the result for where a similarity begins/ends, then search within there to find where a difference would begin/end, ad infinitum (until you find a range that entirely matches or is entirely different).


Here's an example. We'll use the two lists above. First we look for a difference. Start at the beginning. Both lists start with "abc." Good. On the next line, line 2 in both lists, we have "123" and "987". Those are different.

Now, from the bottom up: the last line is "xyz" in both lists. Good. The next one up is "123" in the left list on line 2 and "qwe" in the right list on line 4. Those are different.

Now we know line 2 on the left list is different from lines 2 to 4 on the right list.
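That trimming step might be sketched like this. LineRangeTrim and DifferingRange are hypothetical names, and the indices are 0-based, so "line 2" in the listing above is index 1 here.

```csharp
using System;

static class LineRangeTrim
{
    // Drops the matching lines at the top and bottom of both lists.
    // What remains is left[Start..LeftEnd) and right[Start..RightEnd),
    // using 0-based indices with exclusive end positions.
    public static (int Start, int LeftEnd, int RightEnd) DifferingRange(
        string[] left, string[] right)
    {
        // Skip the lines that match from the top.
        int limit = Math.Min(left.Length, right.Length);
        int start = 0;
        while (start < limit && left[start] == right[start]) start++;

        // Skip the lines that match from the bottom, without crossing
        // back over the part already matched from the top.
        int leftEnd = left.Length;
        int rightEnd = right.Length;
        while (leftEnd > start && rightEnd > start
               && left[leftEnd - 1] == right[rightEnd - 1])
        {
            leftEnd--;
            rightEnd--;
        }
        return (start, leftEnd, rightEnd);
    }
}
```

For the two lists above it returns (1, 2, 4): index 1 on the left ("123") against indices 1 through 3 on the right ("987", "123", "qwe").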

We can examine these ranges to find a similarity. This is what we now have:
Code:
[COLOR="Red"]2:    123    987
3:           123
4:           qwe[/COLOR]
In this case it's obvious what matches, but in a more complicated scenario the match could be buried anywhere within two longer lists. This is where I can't help you, because I don't know how to find such a match.

I could sit down and try to figure it out, or do lots of research, but I don't know how long that would take and I don't know that I would have any more success than you would doing the same.
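For what it's worth, one well-known way to find such buried matches is to compute the longest common subsequence (LCS) of the two line lists and treat everything outside it as an insertion or removal. This is a different technique from the recursive search described above, and the sketch below is only a minimal illustration: the names LcsDiff and Diff and the "  " / "- " / "+ " prefixes are made up for the example, and the table needs memory proportional to the product of the two line counts, so it may not suit very large files.

```csharp
using System;
using System.Collections.Generic;

static class LcsDiff
{
    // Returns one entry per output line: "  x" for a line present in both
    // lists, "- x" for a line only in left, "+ x" for a line only in right.
    public static List<string> Diff(string[] left, string[] right)
    {
        int n = left.Length, m = right.Length;

        // lcs[i, j] = length of the longest common subsequence of
        // left[i..] and right[j..], filled from the bottom-right corner.
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                lcs[i, j] = left[i] == right[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the top-left, emitting matches and edits.
        var result = new List<string>();
        int a = 0, b = 0;
        while (a < n && b < m)
        {
            if (left[a] == right[b]) { result.Add("  " + left[a]); a++; b++; }
            else if (lcs[a + 1, b] >= lcs[a, b + 1]) { result.Add("- " + left[a]); a++; }
            else { result.Add("+ " + right[b]); b++; }
        }
        while (a < n) result.Add("- " + left[a++]);
        while (b < m) result.Add("+ " + right[b++]);
        return result;
    }
}
```

On the abc/123/xyz example this yields "  abc", "+ 987", "  123", "+ qwe", "  xyz", i.e. it reads the change as two separate insertions. The lines marked "-" or "+" are the ones you would highlight in the left and right boxes respectively.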
 