Xtreme .Net Talk


Posted

I am trying to build a regular expression, but need help...

 

"<b>This</b> is some foobar text <b>bundled</b> up in a string..."

 

In the above text I want to find all occurrences of "b", except where the text is "<b>This</b>" or "<b>bundled</b>".

 

So I don't want to find any "b"s (case insensitive) in "<b>This</b>" or "<b>bundled</b>", but only in "foobar".

 

The reason for this is that I am trying to create a .NET function to highlight my search results. My search page can take several words and will search for all of them. The search is done on database records with no HTML tags or anything. I can do a standard replace, which works fine, but as soon as it comes to highlighting more than one word per record, I get problems.

 

Let me explain:

The record string is: "This is some foobar text bundled up in a string..."

I want to search for "t b".

 

First I will highlight all occurrences of "t" (case insensitive) by replacing each with "<b>t</b>".

 

So all occurrences of "t" will now be bold. The HTML will be: "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring..."

 

Now I need to highlight all occurrences of "b". But the text now contains HTML tags that themselves include "b"s, so I get unwanted results: "<<b>b</b>>T</<b>b</b>>his is some foo<b>b</b>ar <<b>b</b>>t</<b>b</b>>ex<<b>b</b>>t</<b>b</b>> <b>b</b>undled up in a s<<b>b</b>>t</<b>b</b>>ring..."

 

So my solution would be to make sure that all html tags are ignored..?

 

Any help would be great!

Posted

// Needs: using System; using System.Text.RegularExpressions;
foreach (Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
    Console.WriteLine("Match found at index {0}.", match.Index);

 

Match found at index 23.

Match found at index 35.

 

index 23 is the b of foobar.

index 35 is the b of bundled.

 

Isn't that what you want?
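
For what it's worth, the same pattern can be dropped into Regex.Replace to do the two-step highlighting from the first post. This is only a sketch: the HighlightSketch class and the HighlightOutsideTags method are names made up for illustration, swapping the literal b for an escaped search term is my own assumption, and it presumes the only markup in the text is the <b> and </b> added by earlier passes.

```csharp
using System;
using System.Text.RegularExpressions;

static class HighlightSketch
{
    // Hypothetical helper: wraps each occurrence of term in <b></b>,
    // skipping a term that sits directly after "<" or "</" or directly before ">".
    public static string HighlightOutsideTags(string text, string term)
    {
        string pattern = @"(?<!</?)" + Regex.Escape(term) + @"(?!>)";
        // $0 is the whole match, so the original casing is kept.
        return Regex.Replace(text, pattern, "<b>$0</b>", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string text = "This is some foobar text bundled up in a string...";
        foreach (string term in new[] { "t", "b" })
            text = HighlightOutsideTags(text, term);
        Console.WriteLine(text);
    }
}
```

With the record string from the first post this should print "<b>T</b>his is some foo<b>b</b>ar <b>t</b>ex<b>t</b> <b>b</b>undled up in a s<b>t</b>ring...", so the tags from the "t" pass are left alone.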

Posted


 

If I understand your goal correctly:

* you have a list of words to be highlighted

* for each word in the list you will search for the word in another text area that contains no HTML tags to start with

* each time you find a word you want to highlight it with <b></b>

 

If this is correct then I propose the following:

 

For a given word "testword" search for

 

([^>])(testword)([^<])

 

replace with

 

$1<b>$2</b>$3

 

This says: find "testword" where it does not have a ">" immediately before it and does not have a "<" immediately after it (meaning it is not already wrapped in an HTML tag), using parentheses to capture the pieces so the replacement can refer back to them by number. Then replace the found text, inserting the bold start and end tags at the proper positions.

 

Doing this for each word in the word list will work ok because as you search you are ignoring any candidate results that already have HTML around them.
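
Here is a rough sketch of that search and replace in C#. The GroupReplaceSketch class and Bold method are hypothetical names, and the case-insensitive option is my assumption based on the original question. In .NET the replacement string refers to the groups as $1, $2 and $3.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupReplaceSketch
{
    // Hypothetical helper: bolds word unless it is directly preceded by ">"
    // or directly followed by "<".
    public static string Bold(string text, string word)
    {
        string pattern = "([^>])(" + Regex.Escape(word) + ")([^<])";
        // Groups 1 and 3 are the neighbouring characters, put back unchanged.
        return Regex.Replace(text, pattern, "$1<b>$2</b>$3", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string once = Bold("a testword here", "testword");
        Console.WriteLine(once);                         // a <b>testword</b> here
        Console.WriteLine(Bold(once, "testword"));       // unchanged: already wrapped
        Console.WriteLine(Bold("testword", "testword")); // unchanged: no neighbouring characters
    }
}
```

One thing to watch: because [^>] and [^<] each have to consume a character, a word at the very start or very end of the text is not matched by this pattern.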

 

I hope that I understand your question and that this answer helps in some way. :)

Posted

If you know your search text will not contain "b>" unless it is from your highlighting, try a negative lookahead:

(b(?!\>))

matches only the "b" in "abc" and "<br>", but not in "<b>", "</b>" or "<tab>".

 

another try might be:

(?:(\<\S\>)|(b))

and reference the second $parameter (capture group) of this expression, not the first.
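
A minimal sketch of how the "two groups, use the second" idea might look in C#, using a replace callback. Note that I have swapped in <[^>]+> for the tag side, because \S only matches a single character and so would not cover "</b>"; that substitution, the AlternationSketch class and the Highlight method are all my own assumptions, not part of the suggestion above.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationSketch
{
    // Hypothetical helper: group 1 swallows whole tags, group 2 is the term.
    public static string Highlight(string html, string term)
    {
        string pattern = "(<[^>]+>)|(" + Regex.Escape(term) + ")";
        return Regex.Replace(
            html,
            pattern,
            // Only wrap when the second group took part in the match;
            // a matched tag is written back unchanged.
            m => m.Groups[2].Success ? "<b>" + m.Value + "</b>" : m.Value,
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string html = "<b>T</b>his is some foobar <b>t</b>ex<b>t</b> bundled up in a s<b>t</b>ring...";
        Console.WriteLine(Highlight(html, "b"));
    }
}
```

Because a tag is consumed as a whole by the first alternative, the "b" inside it is never offered to the second one.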

Posted
foreach (Match match in Regex.Matches("<b>This</b> is some foobar text <b>bundled</b> up in a string...", @"(?i)(?<!\</?)b(?!>)"))
    Console.WriteLine("Match found at index {0}.", match.Index);

 

 

 

index 23 is the b of foobar.

index 35 is the b of bundled.

 

Isn't that what you want?

 

As I understand it, he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold? :D

Posted
As I understand it, he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold? :D

 

Hmmm, maybe: ...(typing and thinking at the same time)

 

(MATCHME)(?!\s*\<\/(?=[^\>]+\>))

Would catch any MATCHME not immediately followed by a closing tag.

 

 

 

(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))

Should match with any MATCHME plus other text until a tag, not immediately followed by a closing tag. Only MATCHME itself would get a $parameter assignment.

 

 

 

The subexpressions are (I'm thinking):

(MATCHME)
    the text to match
(?:[^\<]*)
    followed by all text until a probable tag, not captured
(?:(MATCHME)(?:[^\<]*))
    the (captured) text to match, and all other text until a probable tag
\s*\<\/
    any spaces until a probable closing tag, and that probable closing tag's start ...
(?=[^\>]+\>)
    something, provided this in fact is the body of a closing tag
(?!\s*\<\/(?=[^\>]+\>))
    something, provided it is not followed by (optional spaces and) a closing tag
...

So if the closing-tag check inside the negative lookahead (which looks at the very next tag after the MATCHME) fails to match, the MATCHME is stored.

Some problems are:

If the text ever contained a "<" that was not part of some tag, this could fail.

And because of the "plus other text until a tag" subexpression, this regex would have to be reapplied to each line until it fails. But if someone knows a way to have a non-consuming & non-capturing subexpression, this would work in one go.

Also, MATCHME has to not be allowed WITHIN a tag, too.
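
To see the first expression in action, here is a small sketch in C#. The ClosingTagLookaheadSketch class and FindOutsideClosingTag method are hypothetical names, and I am assuming a case-insensitive match as in the original question. It also shows the last problem listed above.

```csharp
using System;
using System.Text.RegularExpressions;

static class ClosingTagLookaheadSketch
{
    // Hypothetical helper: indexes of term occurrences that are not
    // immediately followed (ignoring whitespace) by a closing tag.
    public static int[] FindOutsideClosingTag(string html, string term)
    {
        string pattern = "(" + Regex.Escape(term) + @")(?!\s*</(?=[^>]+>))";
        MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.IgnoreCase);
        int[] indexes = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            indexes[i] = matches[i].Index;
        return indexes;
    }

    static void Main()
    {
        // Only the second "bundled" is reported, at index 19.
        foreach (int index in FindOutsideClosingTag("<b>bundled</b> and bundled", "bundled"))
            Console.WriteLine(index);

        // The "b" inside each tag name still matches: indexes 1 and 6.
        foreach (int index in FindOutsideClosingTag("<b>x</b>", "b"))
            Console.WriteLine(index);
    }
}
```

So on its own the lookahead protects text that sits right before a closing tag, but not the tag names themselves.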

 

Hope this helps :)

Posted
As I understand it, he doesn't want to match the b in bundled because it is between the bold tags. Apart from being a bit untidy, is there anything wrong with having bold tags within bold tags? Does it become ULTRA bold? :D

good call.. I give up =/

Posted
good call..

 

Hey, you're good at this though!

 

With these regexes, in an OR-condition where either side would match, are the alternatives evaluated from left to right until one matches?

i.e.:

input: foobar

expression: (o*|ob)

always returns: oo
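
As far as I know, the .NET engine does try alternatives from left to right and takes the first one that lets the overall match succeed. A quick way to check is a sketch like the following (the AlternationOrderSketch class and FirstMatch method are hypothetical names). Note that o* can also match nothing at all, so with (o*|ob) the very first match on "foobar" is an empty one at index 0, before the "oo" is reached.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationOrderSketch
{
    // Returns the text of the first match of pattern in input
    // (an empty string if nothing matched or the match was empty).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        Console.WriteLine(FirstMatch("foobar", "o|oo"));   // o   (left alternative wins)
        Console.WriteLine(FirstMatch("foobar", "oo|o"));   // oo  (left alternative wins)

        // o* is allowed to match zero characters, so the first match is empty.
        Match m = Regex.Match("foobar", "(o*|ob)");
        Console.WriteLine("{0} at {1}", m.Length, m.Index); // 0 at 0
    }
}
```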

 

If so, I think maybe I've got it. It should match MATCHME, so long as it is not in a tag, nor between tags:

 

(?:\<[^\>]*(?:MATCHME)[^\>]*\>)|(?:(MATCHME)(?:[^\<]*))(?!\s*\<\/(?=[^\>]+\>))

The left side of the OR-condition will consume MATCHMEs within a tag but not set $parameter, while the second will match MATCHMEs not between tags and set $parameter to this particular MATCHME within the text.

 

To use it, execute the regex and, for each match, test that $parameter is set before you replace the text.

 

I'm saying 'it will' a lot, but I don't know for sure. I'm escaping the "\<" and such whether or not I have to, because frankly I'm not a VB scripter.
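
For comparison, here is a much simpler variant of the same "consume on the left, capture on the right, then test the group" idea, sketched in C#. It is not the expression above: the left side just swallows a whole <b>...</b> span, which assumes the only markup is non-nested bold tags added by earlier highlighting passes. The SkipBoldSpansSketch class and Highlight method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class SkipBoldSpansSketch
{
    public static string Highlight(string html, string term)
    {
        // Left side: a whole <b>...</b> span, consumed without a capture.
        // Right side: the term, captured in group 1.
        string pattern = "<b>.*?</b>|(" + Regex.Escape(term) + ")";
        return Regex.Replace(
            html,
            pattern,
            // Wrap only when group 1 is set; bold spans are written back as-is.
            m => m.Groups[1].Success ? "<b>" + m.Value + "</b>" : m.Value,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        string html = "<b>This</b> is some foobar text <b>bundled</b> up in a string...";
        Console.WriteLine(Highlight(html, "b"));
    }
}
```

On the string from the first post this should print "<b>This</b> is some foo<b>b</b>ar text <b>bundled</b> up in a string...", i.e. only the "b" in "foobar" gets wrapped.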
