Regular Expression with string exclusions

david7777

I am trying to build a regular expression, but need help...

"<b>This</b> is some foobar text <b>bundled</b> up in a string..."

In the above text I want to find all occurrences of "b", except where the text is "<b>This</b>" or "<b>bundled</b>".

So I don't want to find any b's (case insensitive) in "<b>This</b>" or "<b>bundled</b>", but only in "foobar".

The reason for this is that I am trying to create a .NET function to highlight my search results. My search page can take several words and will search for all of them. The search is done on database records that contain no HTML tags. A standard replace works fine for a single word, but as soon as I highlight more than one word per record, I get problems.

Let me explain:
The record string is: "This is some foobar text bundled up in a string..."
I want to search for "t b".

First I will highlight all occurrences of "t" (case insensitive) with "<b>t</b>".

So all occurrences of "t" will now be bold. The HTML will be: "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring..."

Now I need to go and highlight all occurrences of "b". By now there are HTML tags in the text which include "b"s, so I get unwanted results: "<<b>b</b>>T</<b>b</b>>his is some foo<b>b</b>ar <<b>b</b>>t</<b>b</b>>ex<<b>b</b>>t</<b>b</b>> <b>b</b>undled up in a s<<b>b</b>>t</<b>b</b>>ring..."
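
The corruption is easy to reproduce outside .NET. Below is a minimal Python sketch in which Python's `re` module stands in for the .NET `Regex` class (an assumption made only for illustration) and `naive_highlight` is a made-up helper name. It runs the two replacements one after the other:

```python
import re

def naive_highlight(text, term):
    # Wrap every case-insensitive occurrence of term in <b>...</b>,
    # with no awareness of tags that are already present in text.
    return re.sub("(" + re.escape(term) + ")", r"<b>\1</b>", text, flags=re.IGNORECASE)

record = "This is some foobar text bundled up in a string..."
after_t = naive_highlight(record, "t")
after_b = naive_highlight(after_t, "b")

print(after_t)
print(after_b)  # the second pass also rewrites the "b" inside every <b> and </b>
```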

So my solution would be to make sure that all HTML tags are ignored?

Any help would be great!
 
C#:
foreach(Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
	Console.WriteLine("Match found at index {0}.", match.Index);

Output of the above code:
Match found at index 23.
Match found at index 35.

index 23 is the b of foobar.
index 35 is the b of bundled.

Isn't that what you want?
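
For anyone trying this outside .NET: the lookbehind `(?<!\</?)` is variable-width, which .NET accepts but Python's `re` module rejects. A rough Python equivalent, assuming two fixed-width lookbehinds are an acceptable substitute, reports the same two indexes:

```python
import re

text = "<b>This</b> is some foobar text <b>bundled</b> up in a string..."

# (?<!<)   not directly preceded by "<"   (rules out <b>)
# (?<!</)  not directly preceded by "</"  (rules out </b>)
# (?!>)    not directly followed by ">"
pattern = re.compile(r"(?<!<)(?<!</)b(?!>)", re.IGNORECASE)

indexes = [m.start() for m in pattern.finditer(text)]
print(indexes)
```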
 

If I understand your goal correctly:
* you have a list of words to be highlighted
* for each word in the list you will search for the word in another text area that contains no HTML tags to start with
* each time you find a word you want to highlight it with <b></b>

If this is correct then I propose the following:

For a given word "testword" search for

([^>])(testword)([^<])

replace with

\1<b>\2</b>\3

This says: find "testword" where it is not preceded by ">" and not followed by "<" (meaning it is not already wrapped in an HTML tag), using parentheses so that the replacement can refer back to each part of the match by number. Then replace the found text, inserting the opening and closing bold tags at the proper positions.

Doing this for each word in the word list will work OK because, as you search, you ignore any candidate results that already have HTML around them.
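
As a rough illustration, here is the same search and replace as a Python sketch. Python's `re` writes the back-references as `\1`, `\2`, `\3`; the word `foobar` and the helper name `wrap_word` are just examples. Two things to keep in mind: the pattern needs one character on each side of the word, so a word at the very start or end of the text is not found, and a one-letter term such as "b" can still match inside a tag name.

```python
import re

def wrap_word(text, word):
    # ([^>])  one character before the word that is not ">"
    # (word)  the word itself
    # ([^<])  one character after the word that is not "<"
    pattern = "([^>])(" + re.escape(word) + ")([^<])"
    return re.sub(pattern, r"\1<b>\2</b>\3", text)

once = wrap_word("This is some foobar text", "foobar")
twice = wrap_word(once, "foobar")  # the second pass finds nothing new to wrap

print(once)
print(twice)
```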

I hope that I understand your question and that this answer helps in some way. :)
 
If you know your search text will not contain "b>" unless it is from your highlighting, try a negative lookahead:
Code:
(b(?!\>))
matches the "b" in "abc" and "<br>", but not the one in "<b>", "</b>" or "<tab>".
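
A quick way to check those cases is a small Python sketch (Python's `re` supports this lookahead; the sample strings follow the list above):

```python
import re

pattern = re.compile(r"(b(?!>))")  # a "b" that is not directly followed by ">"

samples = ["abc", "<br>", "<b>", "</b>", "<tab>"]
results = {s: bool(pattern.search(s)) for s in samples}

for s, matched in results.items():
    print(s, "->", "match" if matched else "no match")
```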

Another try might be:
Code:
(?:(\<[^\>]+\>)|(b))
and reference the second $parameter (capture group) of this pattern, not the first.
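
Here is a sketch of how that second pattern could be used, written in Python rather than .NET. The tag branch is written as `<[^>]+>` so that closing tags such as `</b>` are also swallowed by the first group, and the replacement callback (`wrap_if_text` is a made-up name) only rewrites matches where the second group is set.

```python
import re

# Group 1: a whole tag, matched first so its letters are never seen by group 2.
# Group 2: the search term when it occurs outside a tag.
pattern = re.compile(r"(<[^>]+>)|(b)", re.IGNORECASE)

def wrap_if_text(match):
    if match.group(2) is not None:
        return "<b>" + match.group(2) + "</b>"
    return match.group(0)  # a tag: put it back unchanged

html = "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring..."
result = pattern.sub(wrap_if_text, html)
print(result)
```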
 
HJB417 said:
C#:
foreach(Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
	Console.WriteLine("Match found at index {0}.", match.Index);



index 23 is the b of foobar.
index 35 is the b of bundled.

Isn't that what you want?

As I understand it, he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold :D
 
John_0025 said:
As I understand it, he doesn't want to match the b in bundled because it is between
the bold tags. Apart from being a bit untidy, is there anything wrong with having
bold tags within bold tags? Does it become ULTRA bold :D

Hmmm, maybe: ...(typing and thinking at the same time)

Code:
(MATCHME)(?!\s*\<\/(?=[^\>]+\>))
Would catch any MATCHME not immediately followed by a closing tag.



Code:
(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))
Should match with any MATCHME plus other text until a tag, not
immediately followed by a closing tag. Only MATCHME itself would get
a $parameter assignment.



The subexpressions are (I'm thinking):
Code:
(MATCHME)
    the text to match
(?:[^\<]*)
    followed by all text until a probable tag, not captured
(?:(MATCHME)(?:[^\<]*))
    the (captured) text to match, and all other text until a probable tag

\s*\<\/
    any spaces until a probable closing tag, and that probable closing tag's start ...
(?=[^\>]+\>)
    something, provided this is in fact the body of a closing tag
(?!\s*\<\/(?=[^\>]+\>))
    something, provided it is not followed by (optional spaces and) a closing tag
...
So if the lookahead (which looks at the very next tag after the
MATCHME) does not find a closing tag there, the MATCHME is stored.

Some problems are:
If the text ever contained a "<" that was not part of some tag, this
could fail.
And because of the "plus other text until a tag" subexpression, this
regex would have to be reapplied to each line until it no longer
matches. But if someone knows a way to have a non-consuming &
non-capturing subexpression, this would work in one go.
Also, MATCHME must not be allowed to match WITHIN a tag.
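
On the wish for a non-consuming, non-capturing subexpression: a lookahead is already both, so one option is to move the "other text until a tag" scan inside it. The Python sketch below assumes well-formed, non-nested highlight tags. The extra lookahead for "not within a tag" is an addition, and `find_outside_tags` is a made-up helper name.

```python
import re

def find_outside_tags(text, term):
    # (?![^<>]*>)  the next angle bracket is not ">", so we are not inside a tag
    # (?![^<]*</)  the next "<" does not start a closing tag, so we are not between tags
    pattern = re.compile(
        "(" + re.escape(term) + r")(?![^<>]*>)(?![^<]*</)",
        re.IGNORECASE,
    )
    return [m.start(1) for m in pattern.finditer(text)]

text = "<b>This</b> is some foobar text <b>bundled</b> up in a string..."
positions = find_outside_tags(text, "b")
print(positions)
```

Because nothing after the term is consumed, one pass over the text is enough. Keeping the scan inside the lookahead also avoids the situation where a consuming `[^<]*` gives characters back through backtracking until the negative lookahead passes.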

Hope this helps :)
 
John_0025 said:
As I understand it, he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold :D
good call.. I give up =/
 
HJB417 said:
good call..

Hey, you're good at this though!

With these regexes, in an OR-condition where either side could match, are the alternatives evaluated left to right until one matches?
i.e.:
input: foobar
expression: (o*|ob)
always returns: oo
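
In backtracking engines the alternatives are tried left to right at each starting position, and the first one that lets the overall match succeed wins. A small Python sketch illustrates this (Python's `re` is used here on the assumption that .NET orders alternation the same way). Note that `o*` can also match the empty string, so on "foobar" the very first match is an empty one at position 0, before the "f".

```python
import re

# Same two alternatives, opposite order, same input.
left_first = re.match(r"o|ob", "obar").group(0)
right_first = re.match(r"ob|o", "obar").group(0)
print(left_first, right_first)

# o* succeeds with zero characters at position 0, so the search stops there.
first = re.search(r"(o*|ob)", "foobar")
print(repr(first.group(0)), first.start())

# With o+ an empty match is impossible, so the first match is the run of o's.
run = re.search(r"(o+|ob)", "foobar").group(0)
print(run)
```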

If so, I think maybe I've got it. It should match MATCHME, so long as it is neither in a tag nor between tags:

Code:
(?:\<[^\>]*(?:MATCHME)[^\>]*\>)|(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))
The left side of the OR-condition will consume MATCHMEs within a tag but not set $parameter, while the second will match MATCHMEs not between tags and set $parameter to this particular MATCHME within the text.

To use it, execute the regex and, for each match, test $parameter to ensure it is set before you replace the text.
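
Putting that procedure together, a sketch in Python might look like the following. It is not the .NET function itself: `highlight` is a made-up name, the "between tags" test is done with the lookahead `(?![^<]*</)` rather than the consuming version above, and it assumes the only tags in the text are the non-nested `<b>` tags added by earlier passes.

```python
import re

def highlight(text, terms):
    for term in terms:
        # Left branch: swallow a whole tag without capturing anything.
        # Right branch: capture the term unless the next "<" starts a closing tag.
        pattern = re.compile(
            "<[^>]*>|(" + re.escape(term) + ")(?![^<]*</)",
            re.IGNORECASE,
        )

        def replace(match):
            if match.group(1) is None:  # a tag was matched: leave it alone
                return match.group(0)
            return "<b>" + match.group(1) + "</b>"

        text = pattern.sub(replace, text)
    return text

record = "This is some foobar text bundled up in a string..."
highlighted = highlight(record, ["t", "b"])
print(highlighted)
```

Running `highlight` again on its own output leaves the text unchanged, since every earlier hit now sits between tags.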

I'm saying 'it will' a lot, but I don't know for sure. I'm escaping the "\<" and such whether or not I have to, because frankly I'm not a VB scripter.
 