Regular Expression with string exclusions

david7777 · February 17, 2005

I am trying to build a regular expression, but need help...

"<b>This</b> is some foobar text <b>bundled</b> up in a string..."

In the above text I want to find all occurrences of "b", except where the text is "<b>This</b>" or "<b>bundled</b>".

So I don't want to find any b's (case insensitive) in "<b>This</b>" or "<b>bundled</b>", but only in "foobar".

The reason for this is that I am trying to create a .NET function to highlight my search results. My search page can take several words, and will search for all of the words. The search is done on database records with no HTML tags or anything. I can do a standard replace, which works fine, but as soon as it comes to highlighting more than one word per record, I get problems.

Let me explain:

The record string is: "This is some foobar text bundled up in a string..."

I want to search for "t b".

First I will highlight all occurrences of "t" (case insensitive) by replacing each one with "<b>t</b>".

So all occurrences of "t" will now be bold. The HTML will be: "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring..."

Now I need to go and highlight all occurrences of "b". But now there are HTML tags in the text which include "b"s, so I get unwanted results: the "b" inside each <b> and </b> tag is matched too, which breaks the markup.

So my solution would be to make sure that all HTML tags are ignored?

Any help would be great!

HJB417 · February 17, 2005

(?i)(?<!\</?)b(?!>)

david7777 · February 18, 2005

Thanks HJB417, but it doesn't seem to work... Any other ideas?

HJB417 · February 18, 2005

foreach (Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
    Console.WriteLine("Match found at index {0}.", match.Index);

Match found at index 23.
Match found at index 35.

index 23 is the b of foobar.

index 35 is the b of bundled.

Isn't that what you want?
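
A minimal sketch of how the pattern above might be applied to the highlighting step itself rather than just to finding indexes. It assumes the only markup in the string is the <b> and </b> inserted by an earlier pass; the names TagSafeHighlight and HighlightLetter are made up for illustration and do not come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagSafeHighlight
{
    // Hypothetical helper: wraps every "b" that is not the name of a <b> or </b> tag.
    public static string HighlightLetter(string html)
    {
        // (?i)        case-insensitive matching
        // (?<!\</?)   the "b" is not preceded by "<" or "</"
        // (?!>)       the "b" is not followed by ">"
        // $0 in the replacement stands for the whole match, so the original case is kept.
        return Regex.Replace(html, @"(?i)(?<!\</?)b(?!>)", "<b>$0</b>");
    }

    public static void Main()
    {
        // The record after the "t" pass from the original question.
        string afterT = "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring...";
        Console.WriteLine(HighlightLetter(afterT));
        // <b>T</b>his is some foo<b>b</b>ar <b>t</b>ex<b>t</b> <b>b</b>undled up in a s<b>t</b>ring...
    }
}
```

Note that the lookarounds only protect the tag names; text that is already inside a bold element is still matched.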

Richard Crist · February 19, 2005


If I understand your goal correctly:

* you have a list of words to be highlighted

* for each word in the list you will search for the word in another text area that contains no HTML tags to start with

* each time you find a word you want to highlight it with <b>...</b>

If this is correct then I propose the following:

For a given word "testword" search for

([^>])(testword)([^<])

replace with

\1<b>\2</b>\3

This says: find "testword" where it does not have a ">" immediately before it and does not have a "<" immediately after it (meaning it is not already wrapped in an HTML tag), using parentheses so that the three parts of the match can be referred to by number. Then replace the found text, inserting the bold start and bold end tags around the word.

Doing this for each word in the word list will work ok because as you search you are ignoring any candidate results that already have HTML around them.

I hope that I understand your question and that this answer helps in some way. :)
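
A rough sketch of the search-and-replace described in this post, written with .NET's replacement syntax ($1, $2, $3 instead of \1, \2, \3). The class and method names are hypothetical. One thing the sketch makes visible: because both [^>] and [^<] must consume a real character, a word at the very start or end of the string is not matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NeighbourCheck
{
    // Hypothetical helper: bolds "word" unless a ">" directly precedes it
    // or a "<" directly follows it.
    public static string Bold(string text, string word)
    {
        string pattern = "([^>])(" + Regex.Escape(word) + ")([^<])";
        return Regex.Replace(text, pattern, "$1<b>$2</b>$3", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Bold("This is some foobar text", "foobar"));
        // This is some <b>foobar</b> text

        Console.WriteLine(Bold("some <b>foobar</b> text", "foobar"));
        // some <b>foobar</b> text   (already wrapped, left alone)

        Console.WriteLine(Bold("foobar text", "foobar"));
        // foobar text   (no character before the word, so [^>] cannot match)
    }
}
```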

seve7_wa · February 23, 2005

If you know your search text will not contain "b>" unless it is from your highlighting, try a negative lookahead:

(b(?!\>))

matches the "b", only, in "abc" and "<b >" but not "<b>", "<\b>" nor "<tab>".

Another try might be:

(?:(\<\S\>)|(b))

and reference the second $parameter in this group, not the first.
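
A sketch of that second idea in .NET: let the first alternative swallow tags, and only act when the second group took part in the match. Note one deliberate change, which is an assumption of this sketch and not part of the post: \<\S\> only covers tags with a single character between the brackets (so <b> but not </b>), so the sketch uses <[^>]*> for the tag alternative. The names TagAwareHighlighter and Highlight are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareHighlighter
{
    // Hypothetical helper: bolds "term" wherever it occurs outside a tag.
    public static string Highlight(string html, string term)
    {
        // Group 1 consumes whole tags; group 2 is the search term in ordinary text.
        string pattern = "(<[^>]*>)|(" + Regex.Escape(term) + ")";
        return Regex.Replace(
            html,
            pattern,
            // Only wrap when the second group matched; tags are returned unchanged.
            m => m.Groups[2].Success ? "<b>" + m.Value + "</b>" : m.Value,
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string record = "This is some foobar text bundled up in a string...";
        string result = Highlight(Highlight(record, "t"), "b");
        Console.WriteLine(result);
        // <b>T</b>his is some foo<b>b</b>ar <b>t</b>ex<b>t</b> <b>b</b>undled up in a s<b>t</b>ring...
    }
}
```

Because the tag alternative comes first and consumes the whole tag, the "b" inside <b> and </b> is never offered to the second group.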

John_0025 · February 24, 2005

foreach(Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
Console.WriteLine("Match found at index {0}.", match.Index);

index 23 is the b of foobar.

index 35 is the b of bundled.

Isn't that what you want?

As I understand it he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold :D

seve7_wa · February 24, 2005

As I understand it he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold :D

Hmmm, maybe: ...(typing and thinking at the same time)

(MATCHME)(?!\s*\<\/(?=[^\>]+\>))

Would catch any MATCHME not immediately followed by a closing tag.

(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))

Should match any MATCHME plus other text until a tag, not immediately followed by a closing tag. Only MATCHME itself would get a $parameter assignment.

The subexpressions are (I'm thinking):

(MATCHME)
    the text to match
(?:[^\<]*)
    followed by all text until a probable tag, not captured
(?:(MATCHME)(?:[^\<]*))
    the (captured) text to match, and all other text until a probable tag

\s*\<\/
    any spaces until a probable closing tag, and that probable closing tag's start ...
(?=[^\>]+\>)
    something, provided this in fact is the body of a closing tag
(?!\s*\<\/(?=[^\>]+\>))
    something, provided it is not followed by (optional spaces and) a closing tag
...

So if the negative lookahead (which looks at the very next tag after the MATCHME, checking for a closing tag) finds no closing tag, the MATCHME is stored.

Some problems are:

If the text ever contained a "<" that was not a part of some tag, this could fail.

And because of the "plus other text until a tag" subexpression, this regex would have to be reapplied to each line until it fails. But if someone knows a way to have a non-consuming & non-capturing subexpression, this would work in one go.

Also, MATCHME must not be allowed WITHIN a tag, either.

Hope this helps :)
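
On the wish above for a non-consuming, non-capturing subexpression: lookaheads such as (?=...) and (?!...) are zero-width, so they test the text that follows without consuming it. A rough sketch along those lines, assuming the only tags are simple, non-nested highlight tags and that a stray "<" or ">" never appears in the text. The pattern and the names LookaheadOnly and FindOutsideTags are illustrative, not taken from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadOnly
{
    // Hypothetical helper: indexes of "term" that are neither inside a tag
    // nor inside the text of a tagged element.
    public static int[] FindOutsideTags(string html, string term)
    {
        string pattern = Regex.Escape(term)
            + @"(?![^<>]*>)"   // not inside a tag: no ">" ahead before the next "<"
            + @"(?![^<]*</)";  // the next tag ahead is not a closing tag
        MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.IgnoreCase);
        int[] indexes = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            indexes[i] = matches[i].Index;
        return indexes;
    }

    public static void Main()
    {
        string html = "<b>This</b> is some foobar text <b>bundled</b> up in a string...";
        foreach (int index in FindOutsideTags(html, "b"))
            Console.WriteLine("Match found at index {0}.", index);
        // Match found at index 23.
    }
}
```

Since nothing after the term is consumed, every occurrence can be found in a single pass.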

HJB417 · February 24, 2005

As I understand it he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold :D

good call.. I give up =/

seve7_wa · February 24, 2005

good call..

Hey, you're good at this though!

With these regexes, in an OR-condition where either side would match, are they evaluated until true from left to right?

i.e.:

input: foobar

expression: (o*|ob)

always returns: oo

If so, I think maybe I've got it. It should match MATCHME, so long as it is not in a tag, nor between tags:

(?:\<[^\>]*(?:MATCHME)[^\>]*\>)|(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))

The left side of the OR-condition will consume MATCHMEs within a tag but not set $parameter, while the second will match MATCHMEs not between tags and set $parameter to this particular MATCHME within the text.

To use it, execute a regex where for each match you test $parameter[\B], ensure it is set, before you replace the text.

I'm saying 'it will' a lot, but I don't know for sure. I'm escaping the "\<" and such whether or not I have to, because frankly I'm not a VB scripter.
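
A sketch for checking the combined expression above in .NET, with MATCHME set to "b" and the test string used earlier in the thread. Traced by hand, the expression as posted still appears to capture the "b" of "bundled": when the negative lookahead rejects the position just before </b>, [^\<]* can give back one character and the lookahead then passes. The second pattern in the sketch swaps in an atomic group, (?>...), which .NET supports and which forbids that backtracking; this variant is a suggestion of the sketch, not something proposed in the thread. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupTest
{
    const string TagPart = @"(?:\<[^\>]*(?:b)[^\>]*\>)";
    const string Lookahead = @"(?!\s*\<\/(?=[^\>]+\>))";

    // The expression as posted, with MATCHME replaced by "b".
    public const string Posted = TagPart + @"|(?:(b)(?:[^\<]*))" + Lookahead;

    // Same expression, but the trailing text is an atomic group (assumption of this sketch).
    public const string Atomic = TagPart + @"|(?:(b)(?>[^\<]*))" + Lookahead;

    // Returns the index of group 1 for every match in which group 1 is set.
    public static List<int> CapturedIndexes(string html, string pattern)
    {
        List<int> result = new List<int>();
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
            if (match.Groups[1].Success)
                result.Add(match.Groups[1].Index);
        return result;
    }

    public static void Main()
    {
        string html = "<b>This</b> is some foobar text <b>bundled</b> up in a string...";
        Console.WriteLine(string.Join(", ", CapturedIndexes(html, Posted)));  // 23, 35
        Console.WriteLine(string.Join(", ", CapturedIndexes(html, Atomic)));  // 23
    }
}
```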

HJB417 · February 25, 2005

I use http://www.regexlib.com/retester.aspx to validate my regex. There's a switch (RightToLeft) to reverse the order of the processing.
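
On the earlier left-to-right question, a small sketch that can be run instead of (or alongside) the online tester. In .NET the alternatives are tried in the order written at each starting position, and the earliest starting position wins overall. That is also why (o*|ob) on "foobar" first reports an empty match at index 0: o* is allowed to match zero characters in front of the "f". The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrder
{
    // Returns the text of the first match only.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine("[{0}]", FirstMatch("foobar", "o|oo"));     // [o]
        Console.WriteLine("[{0}]", FirstMatch("foobar", "oo|o"));     // [oo]
        Console.WriteLine("[{0}]", FirstMatch("foobar", "(o*|ob)"));  // []
    }
}
```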
