Xtreme .Net Talk


Posted

Hi all,

 

I need a favor here. As you can see, stripping HTML tags from input is, in my opinion, a necessity in web applications.

 

When I used PHP, I found a function to do that. But currently I'm using ASP.NET, so I'm looking for something similar in the .NET Framework. I've searched the .NET Framework, this forum, and the internet. The best I can find is a user-defined function.

 

My question is ... isn't there a built-in function in .Net to handle Tag Stripping or do I have to use Regex to handle it?

 

Thanks.

Amir Syafrudin
Posted (edited)
Personally I use regex; I don't know of any function in the framework that can do this.

 

The function is below.

 

   Public Function StripHTMLTags(ByVal HTML As String) As String
       Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
       Return StripTag.Replace(HTML, "")
   End Function

 

And don't forget

 

Imports System.Text.RegularExpressions

 

Andy

Edited by a_jam_sandwich
Code today gone tomorrow!
Posted

To a_jam_sandwich:

 

I've tried your code. It works well; it really removes the HTML tags. But I notice that the Regex isn't actually matching HTML tags as such: it removes every substring that starts with the character "<" and ends with the character ">" from the input.

 

I wonder: if the input contains an arithmetic expression, i.e. one with a "less than" and a "greater than" sign, then the Regex will remove all the characters between them, right? Even if there's a white space character after the "<" or before the ">", your code still removes them.

 

So I was thinking of another Regex. Here it is:

"<(/){0,1}[\\w]+>"

 

The Regex should match anything that starts with a "<", followed by an optional "/", then followed by one or more "\w" characters, and ends with a ">". But just to be sure ... is my Regex really correct?

 

Still, thanks for the suggestion ... it inspired me :D.

 

To Robby:

 

From your post, I can tell what you're suggesting. But in my case, I'm trying to remove the tags completely. Still, I thank you for the suggestion. At least now I know how to store tags without worrying that the browser will interpret them as HTML tags rather than as a plain "<" and ">".

 

Thank you all.

Amir Syafrudin
Posted

I just realized something ... my Regex wouldn't work :D ... I guess a_jam_sandwich's Regex suits better. But I'm still thinking about the arithmetic expression.

 

For example, if I have this kind of web page:

<html>

<body>

I have brownies < 100. But she has brownies > 200. Bla bla bla ...

</body>

</html>

 

a_jam_sandwich's code will result in:

 

I have brownies 200. Bla bla bla ...

 

 

See what I mean? Since there's a possibility that I will be stripping tags from input containing arithmetic expressions, this concerns me a lot.
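
The effect is easy to reproduce in a console program. This is only a minimal sketch, assuming the same pattern as a_jam_sandwich's function; the module name StripDemo and the one-line sample string are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripDemo
    Sub Main()
        ' Sample input modeled on the page above, flattened to one line (hypothetical).
        Dim input As String = "<body>I have brownies < 100. But she has brownies > 200.</body>"

        ' Same pattern as StripHTMLTags: "<", one or more characters that are not ">", then ">".
        Dim stripped As String = Regex.Replace(input, "<[^>]+>", "")

        ' The span "< 100. But she has brownies >" fits the pattern as well,
        ' so it is removed together with the real tags.
        Console.WriteLine(stripped)
    End Sub
End Module
```

The pattern has no way to tell a comparison sign from a tag delimiter, so everything from the "<" to the next ">" is dropped.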

 

Any other suggestions perhaps?

 

~sigh :D

Amir Syafrudin
Posted (edited)

I know what you're saying, but you have to take into account that a tag such as <img src="fdjdfjkdjkfd" /> contains multiple words, and you may have a class or style attribute, etc. That is why mine strips the lot. It relies on the user not typing a raw < or >, and instead typing &lt; or &gt; when a less-than or greater-than sign is needed.

 

You could write a small function to change every < or > that has whitespace on both sides to &lt; or &gt;, e.g.

 

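   Public Function StripHTMLTags(ByVal HTML As String) As String
       ' Escape stand-alone comparison signs first, then strip whatever still looks like a tag.
       Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
       Return StripTag.Replace(FixNumericalTags(HTML), "")
   End Function

   Public Function FixNumericalTags(ByVal HTML As String) As String
       Dim StripTag As Regex
       Dim NewHTML As String = HTML

       ' A "<" with whitespace on both sides is treated as a less-than sign.
       ' Note that the surrounding whitespace is consumed by the match.
       StripTag = New Regex("\s\<\s", RegexOptions.IgnoreCase)
       NewHTML = StripTag.Replace(NewHTML, "&lt;")

       ' Same for a ">" with whitespace on both sides.
       StripTag = New Regex("\s\>\s", RegexOptions.IgnoreCase)
       NewHTML = StripTag.Replace(NewHTML, "&gt;")
       Return NewHTML
   End Function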

 

Thanks

 

Andy

Edited by a_jam_sandwich
Code today gone tomorrow!
Posted

To be honest, if a real "<" or ">" character exists in an HTML page, the visible text between them would disappear, as your browser would parse it as an HTML tag, decide it's an unknown tag, and drop it.

 

The CORRECT procedure, as a_jam_sandwich says, is to use the &lt; and &gt; entities to show a "<" or ">" character in your page, and therefore the regex would work perfectly.

 

B.

Posted

Hmmm ...

 

To a_jam_sandwich:

 

That's good code you've got there. You're trying to replace all occurrences of "\s\<\s" and "\s\>\s" with the corresponding HTML code. But shouldn't you call FixNumericalTags before stripping the tags? 'Coz if I strip the tags first, then I won't have any "<" or ">" left in the input, right?

 

Btw, why don't we use HTML Encode like Robby suggested?

 

Anyway, allow me to restate my case. I need to strip HTML tags from user input. That input will go into the database to be shown on another page. I'm building a web-based discussion center. I don't want people to enter HTML tags. But if there were a discussion about arithmetic expressions, I would expect them to use "<" and ">". Here's where the dilemma comes in. I've explained it in my previous post.

 

Anything ... anyone?

Amir Syafrudin
Posted

Hmm, there's one thing with Robby's encoding method. Correct me if I'm wrong, but doesn't HtmlEncode encode all the input? This would mean text like

 

the big dog

 

Would become ...

the%20big%20dog
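
For what it's worth, the %20 form is what URL encoding produces; HTML encoding only rewrites the characters that are special in markup and leaves spaces alone. Here is a minimal sketch of the difference, assuming System.Net.WebUtility is available (.NET Framework 4.0 or later; older ASP.NET code would call HttpUtility.HtmlEncode from System.Web instead). The module name EncodeDemo and the sample strings are made up for illustration.

```vbnet
Imports System
Imports System.Net

Module EncodeDemo
    Sub Main()
        ' HTML encoding leaves ordinary text, including spaces, untouched.
        Console.WriteLine(WebUtility.HtmlEncode("the big dog"))   ' the big dog

        ' Only markup-significant characters are turned into entities.
        Console.WriteLine(WebUtility.HtmlEncode("brownies < 100"))   ' brownies &lt; 100

        ' Percent-escaping of spaces comes from URL encoding, shown here with Uri.EscapeDataString.
        Console.WriteLine(Uri.EscapeDataString("the big dog"))   ' the%20big%20dog
    End Sub
End Module
```

So HTML-encoding the input keeps the text readable while "<" and ">" are shown as literal characters instead of being parsed as tags.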

 

If you want, you could just use a replace command on your text to replace all < or > with &lt; or &gt;

 

Public Function FixTags(ByVal HTML As String) As String
   ' Replace the raw signs with their HTML entities.
   Dim CorrectedHTML As String = HTML
   CorrectedHTML = CorrectedHTML.Replace("<", "&lt;")
   CorrectedHTML = CorrectedHTML.Replace(">", "&gt;")
   Return CorrectedHTML
End Function

 

Hope this helps

 

Andy

Code today gone tomorrow!
Posted

Yes, but this would be extremely slow compared to regular expressions; it is easier to read for someone who doesn't know regular expressions, though.

  • 2 weeks later...
Posted

A closing statement :D

 

Thanks for all the suggestions, guys. But I kind of figured out another way to handle my needs. I made use of the validateRequest feature provided by .NET to handle "dangerous input" such as HTML tags.

 

Using it ... the tags are filtered ... if .NET detects anything "dangerous", it reports an error and stops the process. After filtering, all remaining "<" and ">" are replaced by "&lt;" and "&gt;", respectively.

 

It's a great feature. But still, as a precaution, I also use some Regex to remove the "<" and ">" manually.

Amir Syafrudin
