amir100, posted June 15, 2005

Hi all,

I need a favor here. Stripping HTML tags from input is, in my opinion, a basic need in a web application. When I used PHP, I found a function that does it. But currently I'm using ASP.NET, so I'm looking for something similar in the .NET Framework. I've searched the .NET Framework, this forum, and the internet. The best I can find is a user-defined function.

My question is: isn't there a built-in function in .NET to handle tag stripping, or do I have to use Regex to handle it?

Thanks.

Amir Syafrudin

a_jam_sandwich, posted June 15, 2005 (edited)

Personally I use regex; I don't know of any function in the framework that can do this. The function is below:

    Public Function StripHTMLTags(ByVal HTML As String) As String
        Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
        Return StripTag.Replace(HTML, "")
    End Function

And don't forget:

    Imports System.Text.RegularExpressions

Andy

Edited June 15, 2005 by a_jam_sandwich

Code today gone tomorrow!

Robby (Moderator), posted June 15, 2005

You can always use HTMLEncode. It will convert this...

    <img>

to this...

    &lt;img&gt;

Visit...Bassic Software

amir100 (Author), posted June 16, 2005

To a_jam_sandwich:

I've tried your code. It works well; it really removes the HTML tags. But I notice that the Regex isn't actually removing HTML tags as such: it removes every substring that starts with the character "<" and ends with the character ">". I wonder, if the input contains an arithmetic expression, i.e. one with a "less than" and a "greater than" sign, then the Regex will remove all the characters between them, right? Even if there's a whitespace character between the "<" and ">", your code still removes them.

So I was thinking of another Regex. Here it is:

    "<(/){0,1}[\\w]+>"

The Regex should match anything that starts with a "<", followed by an optional "/", then one or more "\w" characters, and ends with a ">". But just to be sure ... is my Regex really correct? Still, thanks for the suggestion ... it inspired me :D.

To Robby:

From your post, I can tell what you're suggesting. But in my case, I'm trying to remove the tags completely. Still, thank you for the suggestion. At least now I know how to store tags without worrying that the browser will interpret them as HTML tags rather than as a plain "<" and ">".

Thank you all.

Amir Syafrudin

amir100 (Author), posted June 16, 2005

    So I was thinking of another Regex. Here it is: "<(/){0,1}[\\w]+>"

Just realized something ... my Regex wouldn't work :D ... I guess a_jam_sandwich's Regex suits better. But I'm still thinking about the arithmetic expression. For example, if I have this kind of web page:

    <html>
    <body>
    I have brownies < 100. But she has brownies > 200. Bla bla bla ...
    </body>
    </html>

a_jam_sandwich's code will result in:

    I have brownies 200. Bla bla bla ...

See what I mean? Since there's a possibility that I will be stripping tags from input containing arithmetic expressions, this concerns me a lot. Any other suggestions perhaps? ~sigh :D

Amir Syafrudin

a_jam_sandwich, posted June 16, 2005 (edited)

I know what you're saying, but you have to take into account that a tag such as <img src="fdjdfjkdjkfd" /> contains multiple words; it may have a class or style attribute, etc. That's why mine strips the lot. It relies on the user typing &gt; or &lt;, not > or <, when a greater-than or less-than sign is needed.

You could write a small function to change every < or > that has whitespace on both sides to &lt; or &gt;, e.g.

    Public Function StripHTMLTags(ByVal HTML As String) As String
        Dim StripTag As New Regex("<[^>]+>", RegexOptions.IgnoreCase)
        ' Escape the stand-alone signs first, then strip what still looks like a tag.
        Return StripTag.Replace(FixNumericalTags(HTML), "")
    End Function

    Public Function FixNumericalTags(ByVal HTML As String) As String
        Dim StripTag As Regex
        Dim NewHTML As String = HTML
        ' A "<" or ">" with whitespace on both sides is treated as a comparison sign.
        StripTag = New Regex("\s\<\s", RegexOptions.IgnoreCase)
        NewHTML = StripTag.Replace(NewHTML, " &lt; ")
        StripTag = New Regex("\s\>\s", RegexOptions.IgnoreCase)
        NewHTML = StripTag.Replace(NewHTML, " &gt; ")
        Return NewHTML
    End Function

Thanks

Andy

Edited June 16, 2005 by a_jam_sandwich

Code today gone tomorrow!

penfold69, posted June 16, 2005

To be honest, if a real "<" or ">" character exists in an HTML page, the visible text between them would disappear, as your browser would parse it as an HTML tag, decide it's an unknown tag, and drop it.

The CORRECT procedure, as a_jam_sandwich says, is to use the &lt; and &gt; entities to show a "<" or ">" character in your page, and therefore the regex would work perfectly.

B.

a_jam_sandwich, posted June 16, 2005

Sorry, I should have reposted this as new code; see above.

Code today gone tomorrow!

amir100 (Author), posted June 17, 2005

Hmmm ...

To a_jam_sandwich:

That's good code you've got there. You're trying to replace all occurrences of "\s\<\s" and "\s\>\s" with the corresponding HTML code. But shouldn't you call FixNumericalTags before stripping the tags? Because if I strip the tags first, then I won't have any "<" or ">" left in the input, right?

By the way, why don't we use HTMLEncode like Robby suggested?

Anyway, allow me to restate my case. I need to strip HTML tags from user input. That input will go into the database to be shown on another page. I'm building a web-based discussion center, and I don't want people to enter HTML tags. But if there were a discussion concerning arithmetic expressions, I would expect them to use "<" and ">". That's where the dilemma comes from, as I explained in my previous post.

Anything ... anyone?

Amir Syafrudin

a_jam_sandwich, posted June 17, 2005

Hmm, there's one thing with Robby's encoding method. Correct me if I'm wrong, but doesn't HTMLEncode encode all the input? This would mean text like

    the big dog

would become ...

    the%20big%20dog

If you want, you could just use a Replace call on your text to replace every < or >:

    Public Function FixTags(HTML As String) As String
        Dim CorrectedHTML As String = HTML
        CorrectedHTML = CorrectedHTML.Replace("<", "&lt;")
        CorrectedHTML = CorrectedHTML.Replace(">", "&gt;")
        Return CorrectedHTML
    End Function

Hope this helps

Andy

Code today gone tomorrow!

Robby (Moderator), posted June 17, 2005

Hmmmmm, Andy; that is not correct. "the big dog" will not become "the%20big%20dog".

Visit...Bassic Software

bri189a, posted June 17, 2005

    Public Function FixTags(HTML As String) As String
        Dim CorrectedHTML As String = HTML
        CorrectedHTML = CorrectedHTML.Replace("<", "&lt;")
        CorrectedHTML = CorrectedHTML.Replace(">", "&gt;")
        Return CorrectedHTML
    End Function

Yes, but this would be extremely slow compared to regular expressions; it's easier to read for someone who doesn't know regular expressions, though.

a_jam_sandwich, posted June 18, 2005

Lol, sorry about that Robby, I got confused with something else :). I misread it as URLEncode, my bad.

Andy

Code today gone tomorrow!

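For anyone who trips over the same mix-up, here is a minimal sketch of the difference between the two encodings. It is a standalone console module rather than code from the thread, and it assumes a framework version where System.Web.HttpUtility and Uri.EscapeDataString are both available; inside an ASP.NET page, Server.HtmlEncode would be the usual way to reach the HTML encoder.

```vbnet
Imports System
Imports System.Web

Module EncodeDemo
    Sub Main()
        ' HTML encoding only rewrites characters that are special in HTML.
        Console.WriteLine(HttpUtility.HtmlEncode("<img>"))        ' &lt;img&gt;
        Console.WriteLine(HttpUtility.HtmlEncode("the big dog"))  ' the big dog

        ' URL (percent) encoding is the one that turns spaces into %20.
        Console.WriteLine(Uri.EscapeDataString("the big dog"))    ' the%20big%20dog
    End Sub
End Module
```

Spaces pass through HTML encoding untouched, which is why "the big dog" stays readable.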
amir100 (Author), posted July 1, 2005

A closing statement :D

Thanks for all the suggestions, guys. But I kind of figured out another way to handle my needs. I made use of the validateRequest feature provided by .NET to handle "dangerous input" such as HTML tags. With it, the tags are filtered: if .NET detects anything "dangerous", it reports the error and stops the process. After filtering, all filtered "<" and ">" are replaced by "&lt;" and "&gt;", respectively. It's a great feature. Still, as a precaution, I used some Regex to remove the "<" and ">" manually.

Amir Syafrudin

Robby (Moderator), posted July 1, 2005

That's exactly what HTMLEncode does.

Visit...Bassic Software
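
To tie the thread together, here is a rough sketch of one way to combine the two ideas discussed above: strip what looks like a tag, then HTML-encode whatever is left so that stray "<" and ">" show up as text. The helper name StripThenEncode and the pattern "</?[A-Za-z][^>]*>" are assumptions made for illustration, not something posted in the thread. The pattern only treats "<" as the start of a tag when a letter (or "/" plus a letter) follows it directly, so "x < 100" survives, but an expression written without spaces such as "a<b and c>d" would still be eaten.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports System.Web

Module SanitizeDemo
    ' Hypothetical helper, not from the thread.
    ' Step 1: drop anything shaped like a tag ("<" or "</" followed directly by a letter).
    ' Step 2: HTML-encode the remainder so leftover "<" and ">" display as plain text.
    Function StripThenEncode(ByVal input As String) As String
        Dim withoutTags As String = Regex.Replace(input, "</?[A-Za-z][^>]*>", "")
        Return HttpUtility.HtmlEncode(withoutTags)
    End Function

    Sub Main()
        Dim post As String = "<b>I have brownies < 100.</b> But she has brownies > 200."
        Console.WriteLine(StripThenEncode(post))
        ' I have brownies &lt; 100. But she has brownies &gt; 200.
    End Sub
End Module
```

Because the encoding step runs last, anything the pattern misses still ends up escaped rather than rendered, so the regex only has to be good enough for tidiness, not for safety.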