HTML Tag Stripper

amir100

Hi all,

I need a favor here. Stripping HTML tags from input is, in my opinion, a necessity in web applications.

When I used PHP, I found a function to do that. But currently I'm using ASP.NET, so I'm looking for something similar in the .NET Framework. I've searched the .NET Framework, this forum, and the internet. The best I can find is a user-defined function.

My question is ... isn't there a built-in function in .NET to handle tag stripping, or do I have to use a Regex to handle it?

Thanks.
 
Personally I use a regex; I don't know of any function in the framework that can do this.

The function is below:

Code:
    Public Function StripHTMLTags(ByVal HTML As String) As String
        Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
        Return StripTag.Replace(HTML, "")
    End Function

And don't forget

Code:
Imports System.Text.RegularExpressions
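
A quick way to try the function is a small console program. The sketch below wraps the same pattern in a module so it can run on its own; the module name and the sample string are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripDemo
    ' Same pattern as the function above: "<", one or more non-">" characters, ">".
    Public Function StripHTMLTags(ByVal HTML As String) As String
        Return Regex.Replace(HTML, "<[^>]+>", "")
    End Function

    Sub Main()
        Dim sample As String = "<p>Hello <b>world</b>, see <a href=""x.html"">this</a>.</p>"
        Console.WriteLine(StripHTMLTags(sample)) ' Hello world, see this.
    End Sub
End Module
```

Tags with attributes are removed as well, because the pattern accepts anything between the opening "<" and the next ">".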

Andy
 
To a_jam_sandwich:

I've tried your code. It works well; it really removes the HTML tags. But I notice that the Regex isn't actually matching HTML tags as such: it removes every substring that starts with the character "<" and ends with the character ">" from the input.

I wonder: if the input contains an arithmetic expression, i.e. one with a "less than" and a "greater than" sign, then the Regex will remove all the characters between them, right? Even if there's a whitespace character between the "<" and ">", your code still removes them.

So I was thinking of another Regex, here it is:
Code:
"<(/){0,1}[\\w]+>"

The Regex should match anything that starts with a "<", followed by an optional "/", then followed by one or more "\w" characters, and ends with a ">". But just to be sure ... is my Regex really correct?
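
One way to check a pattern like this is to run it against a few sample tags. The sketch below assumes the pattern is used from VB.NET, where string literals do not treat "\" as an escape character, so the doubled backslash reaches the regex engine as written. The single-backslash variant is a hypothetical rewrite for comparison, not something posted in this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternCheck
    Sub Main()
        ' In a VB.NET literal "[\\w]" is passed through unchanged, and the regex
        ' engine reads it as "a backslash or the letter w".
        Dim asPosted As String = "<(/){0,1}[\\w]+>"
        ' Hypothetical single-backslash variant: optional "/", then word characters.
        Dim singleSlash As String = "</?\w+>"

        Console.WriteLine(Regex.IsMatch("<b>", asPosted))                   ' False
        Console.WriteLine(Regex.IsMatch("<w>", asPosted))                   ' True
        Console.WriteLine(Regex.IsMatch("</b>", singleSlash))               ' True
        Console.WriteLine(Regex.IsMatch("<img src=""x"" />", singleSlash))  ' False
    End Sub
End Module
```

Even the single-backslash variant only matches bare tags such as <b> or </b>; anything with attributes or a self-closing slash slips through.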

Still, thanks for the suggestion ... it inspired me :D.

To Robby:

From your post, I can tell what you're suggesting, but in my case I'm trying to remove the tags completely. Still, thank you for the suggestion. At least now I know how to store tags without worrying that the browser will interpret them as HTML tags rather than as a plain "<" and ">".

Thank you all.
 
Just realized something ... my Regex wouldn't work :D ... I guess a_jam_sandwich's Regex suits better. But I'm still thinking 'bout the arithmetic expression.

For example, if I have this kind of web page:
<html>
<body>
I have brownies < 100. But she has brownies > 200. Bla bla bla ...
</body>
</html>

a_jam_sandwich's code will result in:
I have brownies 200. Bla bla bla ...

See what I mean? Since there's a possibility that I will be stripping tags from input containing arithmetic expressions, this concerns me a lot.
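
The effect is easy to reproduce in a few lines. This is a minimal sketch using the "<[^>]+>" pattern from earlier in the thread on a one-line version of the brownies page; the module name is made up.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BrowniesDemo
    Sub Main()
        Dim sample As String = "<p>I have brownies < 100. But she has brownies > 200.</p>"
        ' Everything from the bare "<" up to the bare ">" is treated as one "tag".
        Dim stripped As String = Regex.Replace(sample, "<[^>]+>", "")
        Console.WriteLine(stripped) ' I have brownies  200.
    End Sub
End Module
```

The two spaces in the output are all that is left of the text that sat between the "less than" and "greater than" signs.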

Any other suggestions perhaps?

~sigh :D
 
I know what you're saying, but you have to take into account that a tag such as <img src="fdjdfjkdjkfd" /> contains multiple words, and you may have a class or style attribute, etc. That's why mine strips the lot. It relies on the user not typing a raw > or <, and typing &gt; or &lt; instead, when a greater-than or less-than sign is needed.

You could write a small function to change every < or > that has whitespace on both sides to &lt; or &gt;, e.g.

Code:
    Public Function StripHTMLTags(ByVal HTML As String) As String
        Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
        ' Encode stand-alone "<" and ">" first so they are not mistaken for tags.
        Return StripTag.Replace(FixNumericalTags(HTML), "")
    End Function

    Public Function FixNumericalTags(ByVal HTML As String) As String
        Dim StripTag As Regex
        Dim NewHTML As String = HTML

        ' A "<" with whitespace on both sides becomes the entity &lt; (spaces kept).
        StripTag = New Regex("\s\<\s", RegexOptions.IgnoreCase)
        NewHTML = StripTag.Replace(NewHTML, " &lt; ")

        ' A ">" with whitespace on both sides becomes the entity &gt; (spaces kept).
        StripTag = New Regex("\s\>\s", RegexOptions.IgnoreCase)
        NewHTML = StripTag.Replace(NewHTML, " &gt; ")
        Return NewHTML
    End Function
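
To see the idea end to end, here is a self-contained sketch of the same approach run on the brownies example. It is a variation, not the code above: the helper name EscapeLooseBrackets is hypothetical, and it uses lookarounds so that the whitespace around each sign is left exactly as it was.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapeThenStrip
    ' Hypothetical helper: encode "<" and ">" that stand alone between whitespace.
    Function EscapeLooseBrackets(ByVal source As String) As String
        ' The lookbehind/lookahead test the neighbours without consuming them.
        Dim result As String = Regex.Replace(source, "(?<=\s)<(?=\s)", "&lt;")
        Return Regex.Replace(result, "(?<=\s)>(?=\s)", "&gt;")
    End Function

    Function StripHTMLTags(ByVal html As String) As String
        ' Escape first, then strip whatever still looks like a tag.
        Return Regex.Replace(EscapeLooseBrackets(html), "<[^>]+>", "")
    End Function

    Sub Main()
        Dim sample As String = "<p>I have brownies < 100. But she has brownies > 200.</p>"
        Console.WriteLine(StripHTMLTags(sample))
        ' I have brownies &lt; 100. But she has brownies &gt; 200.
    End Sub
End Module
```

This only helps when the signs are surrounded by whitespace; input such as "a<b and c>d" is still ambiguous and would be stripped.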

Thanks

Andy
 
To be honest, if a real "<" or ">" character exists in an HTML page, the visible text between them would disappear, as your browser would parse it as an HTML tag, decide it's an unknown tag, and drop it.

The CORRECT procedure, as a_jam_sandwich says, is to use the &lt; and &gt; entities to show a "<" or ">" character in your page; with those in place, the regex would work perfectly.

B.
 
Hmmm ...

To a_jam_sandwich:

That's good code you've got there. You're trying to replace all occurrences of "\s\<\s" and "\s\>\s" with the corresponding HTML entities. But shouldn't you call FixNumericalTags before stripping the tags? 'Coz if I strip the tags first, then I won't have any "<" or ">" left in the input, right?

Btw, why don't we use HTML Encode like Robby suggested?

Anyway, allow me to restate my case. I need to strip HTML tags from user input. That input goes into the database to be shown on another page; I'm building a web-based discussion center. I don't want people to enter HTML tags. But if there were a discussion about arithmetic expressions, I would expect them to use "<" and ">". That's where the dilemma comes in. I explained it in my previous post.

Anything ... anyone?
 
Hmm, there's one thing with Robby's encoding method. Correct me if I'm wrong, but doesn't HtmlEncode encode all of the input? That would mean text like

Code:
the big dog

Would become ...
Code:
the%20big%20dog
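
For comparison, the %20 form is what URL encoding produces; HTML encoding leaves spaces alone and only rewrites characters that are significant in markup. The sketch below uses System.Net.WebUtility.HtmlEncode and Uri.EscapeDataString as stand-ins, which is an assumption on my part: in an ASP.NET page of this era, Server.HtmlEncode or HttpUtility.HtmlEncode would be the usual calls.

```vbnet
Imports System
Imports System.Net

Module EncodingDemo
    Sub Main()
        Dim sample As String = "the big dog <b>barks</b>"

        ' HTML encoding: spaces stay, "<" and ">" become entities.
        Console.WriteLine(WebUtility.HtmlEncode(sample))
        ' the big dog &lt;b&gt;barks&lt;/b&gt;

        ' URL encoding: this is where %20 comes from.
        Console.WriteLine(Uri.EscapeDataString("the big dog"))
        ' the%20big%20dog
    End Sub
End Module
```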

If you want you could just use a replace command in your text to just replace all < or >

Code:
Public Function FixTags(HTML As String) As String
    Dim CorrectedHTML As String = HTML
    CorrectedHTML = CorrectedHTML.Replace("<", "&lt;")
    CorrectedHTML = CorrectedHTML.Replace(">", "&gt;")
    Return CorrectedHTML
End Function
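
A replace-based escaper like this usually also wants to handle "&", and the order of the replacements matters. The following is a minimal sketch under that assumption; EscapeHtml is a hypothetical name, not a framework function.

```vbnet
Imports System

Module EscapeOrderDemo
    ' Hypothetical variant of a replace-based escaper that also handles "&".
    Function EscapeHtml(ByVal source As String) As String
        ' "&" goes first; otherwise the "&" inside "&lt;" would be escaped again.
        Return source.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;")
    End Function

    Sub Main()
        Console.WriteLine(EscapeHtml("a < b && c > d"))
        ' a &lt; b &amp;&amp; c &gt; d
    End Sub
End Module
```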

Hope this helps

Andy
 
Yes, but this would be extremely slow compared to regular expressions; it's easier to read for someone who doesn't know regular expressions, though.
 
A closing statement :D

Thanks for all the suggestions, guys. But I kinda figured out another way to handle my needs. I made use of the validateRequest feature provided by .NET to handle "dangerous input" such as HTML tags.

With it, the input is filtered: if .NET detects anything "dangerous", it reports an error and stops the process. After filtering, all remaining "<" and ">" are replaced by "&lt;" and "&gt;", respectively.

It's a great feature. Still, as a precaution, I also use a Regex to remove the "<" and ">" manually.
 