Advanced Regular Expressions problem

NiallWaller · Apr 11, 2003

Need a regular expression to strip out the text inside an anchor tag (<a> here </a>).

Sounds simple, but the simple answer won't allow for nested tags inside the <a> tag...

Something like this:
<a [^>]*>(.*)</a>

where the .* obviously stands in for what I'm looking for.

but needs to allow for nested tags like:
<a href=blah> <b>text1</b> </a>
picking up "<b>text1</b>"

And find adjacent <a> tags separately:
<a href=blah>text1</a>text2<a href=blah>text3</a>
picking up "text1" and "text3" separately

All being used in .NET, so lookahead assertions are a possibility if anyone knows a solution using them...

Any help appreciated
Thanks,
Niall

philprice · Apr 12, 2003

I would do this.

First, make the regex object global, multiline and ignore-case, then use something like this expression:

<a.*?>(.+?)</a>

That will pick out the blocks. The ? after the .+ means "non-greedy", so it won't try to be clever and match up to the last instance of </a>. I'm not sure how the regex object in .NET works, but it should pick up content that spans more than one line; if not, check which options you might want to use (in .NET, RegexOptions.Singleline is the one that lets . match newlines).
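To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs both patterns from this thread against the two sample strings from the question. It uses Python's re module purely as a stand-in for the .NET Regex class (an assumption; the pattern syntax used here is the same in both), with re.DOTALL playing the role of RegexOptions.Singleline. The variable names are made up for illustration.

```python
import re

# re.DOTALL is Python's counterpart of .NET's RegexOptions.Singleline:
# it lets "." match newlines, so an anchor may span several lines.
FLAGS = re.IGNORECASE | re.DOTALL

# Greedy pattern from the original question.
greedy = re.compile(r"<a [^>]*>(.*)</a>", FLAGS)
# Non-greedy pattern suggested in the reply.
lazy = re.compile(r"<a.*?>(.+?)</a>", FLAGS)

adjacent = "<a href=blah>text1</a>text2<a href=blah>text3</a>"
nested = "<a href=blah> <b>text1</b> </a>"

# The greedy group runs on to the LAST </a> in the input.
greedy_hits = greedy.findall(adjacent)
# The lazy group stops at the FIRST </a> after each opening tag.
lazy_hits = lazy.findall(adjacent)
# Nested markup is kept; only the surrounding whitespace is trimmed here.
nested_hits = [s.strip() for s in lazy.findall(nested)]

print(greedy_hits)
print(lazy_hits)
print(nested_hits)
```

With the greedy pattern the two adjacent anchors come back as one long capture; with the non-greedy one they come back as "text1" and "text3". Note that <a.*?> would also accept tags that merely start with "a" (such as <abbr>), so something like <a\b[^>]*> may be a safer opening in real input.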

To be honest, it's really not hard; you've made it sound worse than it is. Oh, and don't use * too frequently: it can match nothing at all, which is usually not what you want. Use + instead.
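A small sketch of what "match nothing" means in practice, again using Python's re module as a stand-in for .NET (an assumption) and a made-up empty anchor as input:

```python
import re

# Hypothetical input: an anchor with no content between the tags.
empty_anchor = "<a name=top></a>"

# With *, the group is allowed to be empty, so the anchor still matches
# and the capture is an empty string.
star_hits = re.findall(r"<a.*?>(.*?)</a>", empty_anchor)

# With +, the group needs at least one character, so there is no match.
plus_hits = re.findall(r"<a.*?>(.+?)</a>", empty_anchor)

print(star_hits)
print(plus_hits)
```

One caveat worth checking against your own data: if an empty anchor is immediately followed by another anchor, the + version is forced to take at least one character and its capture can run on into the next tag, so + is not automatically the safer choice.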

Bucky · Apr 12, 2003

I think what he's looking for is a way to strip out HTML tags from
a string. There is an article here about how to accomplish the
task. Granted the article is for classic ASP, but the Regexp class
works similarly in the .NET framework.
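The linked article is not reproduced here, so the following is not its method, only a rough sketch of one common way to strip tags with a regular expression. It is written in Python as a stand-in for .NET (an assumption), and strip_tags is a hypothetical helper name.

```python
import re

# Matches anything that looks like a tag: "<", one or more non-">" chars, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html: str) -> str:
    """Remove tag-like substrings and return the remaining text."""
    return TAG_PATTERN.sub("", html)


print(strip_tags("<a href=blah> <b>text1</b> </a>").strip())
print(strip_tags("<a href=blah>text1</a>text2<a href=blah>text3</a>"))
```

This is fragile on real-world HTML (a ">" inside an attribute value, comments, script blocks), and it removes all tags rather than returning the contents of each <a> separately, which is what the original question asked for.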
