How to get a regex match without characters before and after counted (C#)

aewarnick

I am making a method that will get exact matches from a string. When you double-click a word in VS to highlight it, the selection does not include the dot before or after it. Likewise, this method searches for matches that are separated by symbols or whitespace.

Code:
Regex.Matches(text, @"[\s\.\<\>\?\\\!\*\(\)\/\-\+\=]"+find+@"[\s\.\<\>\?\\\!\*\(\)\/\-\+\=]");

When a match is made, it includes the special characters in the Index and Length properties. My question is, how do I search for those characters around a match but not have them counted as part of the match value?

Also, I would really like to use the C# tag instead of the Code tag to post my C# code. How do I do that?
 
I think you're going to want to use Groups with the regular expression. The problem is that a Regular expression by itself just tells you whether or not there's a match, and optionally returns a list of matches. Since you really want to find a match based on delimiters, you need a way to specify which parts of the match you want.

Here's a new expression that includes a group named MyMatch. It's the same expression as yours, but instead of just inserting your "find" string in the middle, I've put it inside the named group, MyMatch. The foreach loop then goes through the matches and pulls out the group by name.
Code:
MatchCollection mc = Regex.Matches(text, 
    @"[\s\.\<\>\?\\\!\*\(\)\/\-\+\=]*(?<MyMatch>" + 
    find + 
    @")[\s\.\<\>\?\\\!\*\(\)\/\-\+\=]");
foreach(Match m in mc)
	Debug.WriteLine(m.Groups["MyMatch"].Value);
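
To tie this back to the Index and Length question: the group carries its own Index and Length, separate from the whole match. Here is a minimal sketch, assuming C# 9 or later (top-level statements). FindWordSpans is a hypothetical helper name, the delimiter class is the same set written with fewer escapes, and Regex.Escape is an addition so that special characters in find are taken literally.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: (index, length) of each captured word, delimiters excluded.
static List<(int Index, int Length)> FindWordSpans(string text, string find)
{
    // Same delimiter set as above; most characters need no escape inside [...].
    const string delims = @"[\s.<>?\\!*()/\-+=]";
    // Regex.Escape keeps characters in 'find' from being read as regex syntax.
    string pattern = delims + "*(?<MyMatch>" + Regex.Escape(find) + ")" + delims;

    var result = new List<(int Index, int Length)>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        Group g = m.Groups["MyMatch"];   // the group's own position, not m.Index
        result.Add((g.Index, g.Length));
    }
    return result;
}

foreach (var span in FindWordSpans(".joe. (joe)", "joe"))
    Console.WriteLine($"index {span.Index}, length {span.Length}");
// prints "index 1, length 3" and then "index 7, length 3"
```

The whole match (m.Index, m.Length) still covers the delimiters; only the group's values are limited to the word itself.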

-Nerseus
 
That is good but I don't really understand how it works. Do you feel like explaining why these things are done:

1. Why is the * quantifier placed before the grouping?
2. The question mark?
3. Why is the grouping's closing parenthesis placed right before the start of the 2nd set of special characters?


And does anyone know of a good e-book or web page on advanced regular expressions? My book covers only very basic stuff.

Also, please, someone tell me how to post code in a C# block and not just a regular code block on this forum.
 
1. Why is the * quantifier placed before the grouping?
The * indicates that the previous character(s) should repeat 0 or more times. If you have the expression:
a[c]*d
it would match (0 or more of the character 'c'):
ad
acd
accd
acccd

You could use the expression:
a[cd]*e to match 0 or more of either the character 'c' or 'd'. The following would match:
ae
ace
ade
acce
adde
acde
accddccdce

The + is used to indicate 1 or more. So if the expression were "a[c]+" then "ad" would not match, but "ac", "acc", and "accc" would match.
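
A quick way to check these cases is a small sketch like the following, assuming C# 9 or later (top-level statements). FullMatch is a hypothetical helper that wraps the pattern in ^ and $ so the whole input has to match.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the whole input matches the pattern.
static bool FullMatch(string input, string pattern) =>
    Regex.IsMatch(input, "^(?:" + pattern + ")$");

Console.WriteLine(FullMatch("ad", "a[c]*d"));          // True  (zero c's)
Console.WriteLine(FullMatch("acccd", "a[c]*d"));       // True  (three c's)
Console.WriteLine(FullMatch("accddccdce", "a[cd]*e")); // True  (any mix of c and d)
Console.WriteLine(FullMatch("ad", "a[c]+"));           // False (+ needs at least one c)
Console.WriteLine(FullMatch("accc", "a[c]+"));         // True
```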

2. The question mark?
The question mark, where I put it, is part of the Grouping syntax. You make a group like this:
(?<MatchName>expression to match)

3. Why is the grouping's closing parenthesis placed right before the start of the 2nd set of special characters?
Look at 2. above. The ending paren is just part of the group syntax and must come after the regular expression that represents what the group should be. Essentially you're trying to find a word (just a set of characters in a particular order) and assign that word to a group name. So if you're looking for the word "Item", the group should look like:
(?<MatchName>Item)

If you wanted to make it more generic, you could use any regular expression syntax in place of "Item", such as [\w]+, which would find one or more word characters (letters, digits or underscore). Here's the new group:
(?<MatchName>[\w]+)
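
As a small illustration of reading that group back, here is a sketch assuming C# 9 or later (top-level statements). FirstWord is a hypothetical helper name, and the sample text is made up.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: value of the MatchName group for the first match, or "" if none.
static string FirstWord(string text)
{
    Match m = Regex.Match(text, @"(?<MatchName>[\w]+)");
    return m.Success ? m.Groups["MatchName"].Value : "";
}

Console.WriteLine(FirstWord("...Item_1 + other")); // Item_1
```

The underscore and the digit are both part of the result because \w covers them as well as letters.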

I find MS's help on regular expressions VERY thorough, though you may have to jump through a number of links in the help to find a good sample of what you need. I'd use the help's search option for regular expression examples. The normal F1-linked help from Visual Studio will show mostly the Regex object help and I can never remember which links to click to get to the "good stuff".

-Nerseus
 
I changed the code so that there is much less writing:

C#:
MatchCollection mc = Regex.Matches(text, 
	@"[\D & \W]" + "(?<F>" + find + ")" + 
	@"[\D & \W]");
This is what I send to it:
C#:
a.Regexes.FindExact("gi2joe jgljkasjoejlasdfu*+-/joe....joe(&& kjlaksdgk)()*&%joe--++++halgu", "joe");
The joe in here: jgljkasjoejlasdfu should not be matched but it is!! I think that the & is not working in the expression. Did I do that right? I also tried many other things to get it to work even using && but nothing works.
 
I did some testing and found out that a digit is also a word character. The method works now. Did I use the correct syntax to say "If char is not a digit and not a word" with this: \D & \W?
 
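I don't think so. \W by itself says match anything *except* digits, lower/upper case letters and underscore. By using \W you don't need to also use \D, it's redundant. Inside square brackets everything is "this OR this", and the space and the & are just literal characters, so [\D & \W] matches any non-digit, a space, an ampersand, or any non-word character. Letters are non-digits, which is why the joe inside jgljkasjoejlasdfu matched. If you're trying to do "this AND this", the ampersand won't do it, that's more of C# syntax.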

I think all you want/need is:
@"[\W]*(?<F>" + find + ")[\W]*"

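This says match 0 or more non-word characters (anything other than letters, digits and underscore) right before your "find" letters, then 0 or more non-word characters on the right-hand side. Because * also allows zero delimiters, the results below assume the pattern has to cover the whole string (anchor it with ^ and $); in an unanchored search, "joejoe", "ajoe" and "joea" would still contain a match. So: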
".joe." will match
"#$%(*(joe@$^" will match
"joe" will match
"joejoe" will not match
".joe" will match
"joe." will match
"ajoe" will not match
"joea" will not match

This is all a guess - I'm not testing any of this :)
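
One alternative that is not used elsewhere in this thread is lookaround, which checks the neighbouring characters without making them part of the match. A minimal sketch, assuming C# 9 or later (top-level statements): FindExactIndexes is a hypothetical helper, and it treats "delimiter" as "anything that is not a word character", which is an assumption about what counts as a separator.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: start index of every occurrence of 'find' that has
// no word character directly before or after it.
static List<int> FindExactIndexes(string text, string find)
{
    // (?<!\w) : no word character immediately before (start of string is fine).
    // (?!\w)  : no word character immediately after (end of string is fine).
    string pattern = @"(?<!\w)" + Regex.Escape(find) + @"(?!\w)";

    var indexes = new List<int>();
    foreach (Match m in Regex.Matches(text, pattern))
        indexes.Add(m.Index);   // m.Length is simply find.Length here
    return indexes;
}

string sample = "gi2joe jgljkasjoejlasdfu*+-/joe....joe(&& kjlaksdgk)()*&%joe--++++halgu";
Console.WriteLine(string.Join(", ", FindExactIndexes(sample, "joe"))); // 28, 35, 57
```

With this form the match's own Index and Length already exclude the surrounding characters, and "2joe" and the joe inside "jgljkasjoejlasdfu" are skipped.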

-Ner
 
I found that when I used the * before the group, 2joe matched when it should not have. But maybe I am mistaken.

My question is, how can I say
if char is not whitespace AND not word?
 
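You could use "[^\s\w]" for non-whitespace and non-word (alpha+numeric). The ^ at the start negates the whole class; something like "[\S\W]" would mean non-whitespace OR non-word, which matches every character. If you want 0 or more to match, use "[^\s\w]*". If you want to match at least one but maybe many more, use "[^\s\w]+". If you want to match 1 to 10, use "[^\s\w]{1,10}".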

For the record, "*" is short for "{0,}" and "+" is short for "{1,}". If you want a specific number of characters, you can put the count in curly braces (or use a range like {2,5}).
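
To see the difference between a negated class and a class that lists two negated shorthands, here is a small sketch assuming C# 9 or later (top-level statements). Runs is a hypothetical helper that returns every match value, and the sample text is made up.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: the text of every match of 'pattern' in 'text'.
static string[] Runs(string text, string pattern) =>
    Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

// Neither whitespace nor word character: only the punctuation runs are found.
Console.WriteLine(string.Join(" | ", Runs("a--b  c+++d", @"[^\s\w]+")));     // -- | +++
// Same class, but at most two characters per match.
Console.WriteLine(string.Join(" | ", Runs("a--b  c+++d", @"[^\s\w]{1,2}"))); // -- | ++ | +
// Non-whitespace OR non-word accepts every character, so there is one big match.
Console.WriteLine(Runs("a--b  c+++d", @"[\S\W]+").Length);                    // 1
```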

-Ner
 
If it's not there, that's 0 matches, which * allows. For example, say you might have this:
...Dan.Jones
or
.Dan.Jones
or
Dan.Jones

To get the first name, you could use:
"^[^\s\w]*[\w]+[^\s\w]+..."

So, find 0 or more non-whitespace non-alpha non-digit chars, followed by a word character. You want 0 or more because you don't know if the string starts with a non word or a word character (and it might be more than one non-word character).
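
A minimal sketch of the first part of that expression, assuming C# 9 or later (top-level statements). FirstName is a hypothetical helper, and the named group First is added here only so the name can be read back; the rest of the expression is left out.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: skip any leading punctuation, then capture the first word.
static string FirstName(string text)
{
    Match m = Regex.Match(text, @"^[^\s\w]*(?<First>[\w]+)");
    return m.Success ? m.Groups["First"].Value : "";
}

foreach (string s in new[] { "...Dan.Jones", ".Dan.Jones", "Dan.Jones" })
    Console.WriteLine(FirstName(s)); // Dan, for each of the three inputs
```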

-ner
 