Finding text in HTML document

amir100 · March 20, 2007

Hi all.

Can anyone suggest a way to find specific text in an HTML document without confusing it with HTML tags and their attributes? For instance, if I want to find the word "body", I need to skip the <body> tag.

Any help would be appreciated.

I've done some reading, but I can only find methods that strip the HTML tags, and in my case stripping tags is out of the question. I have to preserve the original HTML document.

amir100 · March 23, 2007

Okay. At this point I've managed to skip the <head> part of the HTML document. I went straight to process the document after the <body> tag. It's a straightforward solution. But I can't figure out any other way. :D

Anyway, the question persists. How do I tell whether the text I found is part of an HTML tag? For instance, if I search for "mytzixklyptomic", how do I know that an occurrence of the word is not part of an HTML tag?

Anyone?

Even the slightest hint would help. So please help me. :D

MrPaul · March 23, 2007

Text parsing

For instance, if I search for "mytzixklyptomic", how do I know that an occurrence of the word is not part of an HTML tag?

Regex is not my forte, but logically you could determine whether a given substring is within an HTML tag by looking backwards and forwards from the substring position for > and < characters.

For example, if a < is encountered before a > when searching backwards from the substring (e.g. using LastIndexOfAny), then the substring is likely to be within an HTML tag. In the same way, when searching forwards past the substring (e.g. using IndexOfAny), a > will be encountered before a < if the substring is within an HTML tag. This of course assumes the HTML is well formed.

'Assume the portion of text you wish to test is located at testPos
Dim pos As Integer

pos = html.LastIndexOfAny(New Char() {">"c, "<"c}, testPos)
If (pos > -1) AndAlso (html(pos) = "<"c) Then
   'Probably within a tag

   pos = html.IndexOfAny(New Char() {">"c, "<"c}, testPos)
   If (pos > -1) AndAlso (html(pos) = ">"c) Then
       'Definitely within a tag
   Else
       'Malformed HTML?
   End If
Else
   'Not within a tag
End If

This could probably be performed more elegantly and robustly using regex, but it's an idea.
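
For what it's worth, one way the regex variant might look is sketched below in C#. It is only a sketch under the same well-formed-HTML assumption, not something tested against real pages: the class and method names (RegexTagCheck, FindOutsideTags) are made up for illustration, and the negative lookahead `(?![^<>]*>)` rejects a match whenever the next bracket after it is a closing >.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexTagCheck
{
    // Hypothetical helper: returns the positions of 'word' that are not inside a tag.
    public static List<int> FindOutsideTags(string html, string word)
    {
        // Regex.Escape keeps characters such as '.' or '$' in the search word literal.
        // (?![^<>]*>) fails the match if a '>' follows before any other bracket,
        // which is the "looking forwards" half of the check described above.
        string pattern = Regex.Escape(word) + @"(?![^<>]*>)";

        var positions = new List<int>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

With the input `<a title="body">body</a>` and the word `body`, this returns a single position, 16: the attribute value at index 10 is skipped and only the link text is reported.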

Good luck :cool:

amir100 · March 26, 2007

Re: Text parsing

Thx for replying, MrPaul. I'd thought of the same solution. Still, you've managed to provide the technical detail on how to accomplish it. :D

I'll give it a shot.

Thx again.

amir100 · March 26, 2007

Re: Text parsing

Almost forgot.

This wouldn't work if I'm dealing with documents containing mathematical equations that use < and >, right?

text here < text here ... mytext here ... text here > text here

I bet mytext would be considered part of an HTML tag. CMIIW.

PlausiblyDamp · March 26, 2007

Re: Text parsing

If the document has been properly created then various characters should have been encoded, i.e. < and > would be &lt; and &gt;, so in that case it should still be OK.

Unfortunately if the document does contain such symbols in an un-encoded form it will make parsing of the file very difficult indeed.
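
As a small illustration of what "encoded" means here, the sketch below uses the standard .NET method System.Net.WebUtility.HtmlEncode (not something mentioned in this thread, and the sample string is invented) to show what literal < and > look like once a document has been produced properly.

```csharp
using System;
using System.Net;

class EncodeDemo
{
    static void Main()
    {
        // Hypothetical sample text containing literal comparison operators.
        string raw = "if a < b and b > c";

        // HtmlEncode replaces markup-significant characters with entities,
        // so they can no longer be mistaken for tag delimiters.
        string encoded = WebUtility.HtmlEncode(raw);

        Console.WriteLine(encoded); // prints: if a &lt; b and b &gt; c
    }
}
```

Once the text is encoded like this, a bracket-based check only ever sees the < and > that belong to real tags.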

amir100 · March 26, 2007

Re: Text parsing

The code from MrPaul works perfectly. It is a bit straightforward, but right now I'll go with that. Thx again to MrPaul.

It is true, just like PlausiblyDamp said, that if my HTML document were properly created then a document with equations wouldn't be a problem.

The complete idea for finding text in an HTML document would be:

- Skip past the <body> tag.

- From that point, start finding the text.

- Use the method proposed by MrPaul to determine whether the text is a part of an HTML Tag or not.
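
The three steps above could be sketched in C# roughly as follows. This is only an illustration: HtmlTextFinder and FindInBody are made-up names, the tag check is the backward half of MrPaul's bracket test, and it assumes well-formed HTML with any literal < and > in the text already encoded.

```csharp
using System;
using System.Collections.Generic;

static class HtmlTextFinder
{
    // Same heuristic as the bracket check above: if the nearest bracket
    // to the left of pos is '<', pos is taken to be inside a tag.
    static bool IsInsideTag(string html, int pos)
    {
        int prev = html.LastIndexOfAny(new[] { '<', '>' }, pos);
        return prev >= 0 && html[prev] == '<';
    }

    // Hypothetical helper: positions of 'word' after <body> that are not inside a tag.
    public static List<int> FindInBody(string html, string word)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(word)) return hits;

        // Step 1: skip past the opening <body ...> tag (start at 0 if there is none).
        int start = 0;
        int bodyTag = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (bodyTag >= 0)
        {
            int close = html.IndexOf('>', bodyTag);
            start = close >= 0 ? close + 1 : html.Length;
        }

        // Step 2: search for the text from that point on.
        int pos = html.IndexOf(word, start, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            // Step 3: keep the hit only if it is not part of a tag.
            if (!IsInsideTag(html, pos)) hits.Add(pos);
            pos = html.IndexOf(word, pos + word.Length, StringComparison.OrdinalIgnoreCase);
        }
        return hits;
    }
}
```

For a document such as `<html><head><title>body</title></head><body class="body"><p>The body text</p></body></html>`, searching for "body" should report one hit only, the word inside the paragraph.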

Well then. That wraps it up.

Thank you all.

mskeel · March 26, 2007

Re: Text parsing

Sorry for not thinking of this sooner but what about using a DOM parser to extract the data?

Using a DOM parser would negate issues with regards to stray < and > scattered throughout the text.
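
To give a rough idea of what that could look like, here is a minimal C# sketch using System.Xml.XmlDocument. It is an assumption-heavy illustration rather than a finished solution: it only works if the page is well-formed XHTML without a DOCTYPE or HTML-only entities such as &nbsp;, and the names TextNodeReader and ReadTextNodes are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class TextNodeReader
{
    // Hypothetical helper: returns the content of every text node, in document order.
    public static List<string> ReadTextNodes(string xhtml)
    {
        // LoadXml throws XmlException if the markup is not well formed.
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var texts = new List<string>();
        // "//text()" selects text nodes only, so tag names and
        // attribute values never show up in the result.
        foreach (XmlNode node in doc.SelectNodes("//text()"))
        {
            texts.Add(node.Value);
        }
        return texts;
    }
}
```

Given `<html><body class="body"><p>8 &gt; 9, <b>body</b> text</p></body></html>`, this yields the three strings "8 > 9, ", "body" and " text". The entity is decoded by the parser, and the body tag and its class attribute are not part of the result.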

amir100 · March 26, 2007

Re: Text parsing

I've never really used a DOM parser before. I don't even know which one you're referring to, mskeel. :D In any case, I don't think a DOM parser is really what I need. Care to explain how I would use a DOM parser in my case?

mskeel · March 27, 2007

This code still has a few problems in that > will work, but < will not. Those should be converted into &gt; and &lt; anyway. You can probably work around this, but I'm just using the default behaviors. The code should look familiar to this.

This is pretty quick and dirty, but it demonstrates the basics of how you might use a Document Object Model parser to extract the information you want.

Let me know if you have any questions.

GetText.zip

amir100 · March 28, 2007

This feels nostalgic. The first time I used an XmlTextReader, it sure gave me a hard time. But after looking at the link you provided, it turns out that an XmlTextReader is really that simple to use. I guess at that time I was really lacking references.

Anyway about your code. I have to say that your code works fine. I don't really have anything to ask. I get the big picture of your code.

[CS]

private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    // Append each text element, followed by a visible separator.
    this.textBox1.Text += e.UserState.ToString() + " ---- ";
}

[/CS]

I altered your code a bit. As you can see above, I've added a simple concatenation to mark off every text element that you process. Using your sample.html, I got this result after running your code.

Tri-Corner Humor Web Shoppe ---- Home of the original HA! HA! Guy Whiteboard ----
In the year of our Lord two thousand and seven, we at Tri-Corner Humor bear somber witness to the future of door enhancements: the world's first HA! HA! Guy Whiteboard!
---- See HA! HA! Whiteboards in action! ---- Win free whiteboards throughout the month of April by sending questions to the HA! HA! Guy!! ----
A ---- blockbuster Internet phenomenon ---- spanning more than two years, HA! HA! Guy's iron grip on the groin of our collective imagination is as strong as ever. The secret to the Quaker's staying power is his unique delivery, one which is certain to ---- delight even the most cynical jerkface. ---- Regular checkups help detect polyps! ----
Like having your very own incarnation of the Dalai Lama, a HA! HA! Guy Whiteboard is there when you need a little extra something in a delicate situation. ---- This offering is your ticket to a new world, one unshackled from the demands of flowery diplomacy and tact. ---- We envision a future where every major business, political, and medical transaction occurs through the ritual exchange of HA! HA!s.
---- He'll understand. ----
Before this product was available you would have had to hot glue a laptop to your door if you wanted to use the HA! HA! Guy to let your roommate know you ate the last yogurt and his cobra escaped while you were playing with it. Now you can spare the laptop and spoil yourself with this handsome offering for only ---- $9.95! ---- Yes! Look me in the eye and tell me you haven't spent more and gotten a whole lot less. ---- We promise that this will be the best online whiteboard impulse buy you will ever make! ----

When you order now you will receive:
---- One HA! HA! Guy Whiteboard. ---- One stylish black whiteboard marker. ---- Two heavy duty epoxy strips for affixing your whiteboard to ---- consenting ---- surfaces. ---- Packing material suitable for preserving your whiteboard in "mint" condition for collector's purposes. ----

this is a test for stray greater thans: 8 > 9 8 > 9 8 > 9

---- Shopping Cart ---- | ---- Policies ---- | ---- About ---- | ---- 2007 Tri-corner Humor. All Rights Reserved. ----

As I said earlier, your code works fine as a parser for well-formed HTML. But this is not what I really need.

I need a library that finds the right word or phrase in an HTML document and replaces it with an appropriate replacement. I must do that without changing anything else in the HTML document.

Here's an example. Say I want to change all occurrences of "when" in your sample.html. Then my library has to produce a sample.html with the "when" words already replaced.

I'm thinking of possibilities to use your code to develop the library I need. I'm kinda stuck here. Any idea on how to achieve my goal using your code?
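
To make the goal concrete, here is a rough C# sketch of the behaviour I'm after. It does not use mskeel's code: it is a plain scanner along the lines of the bracket check discussed earlier, the names HtmlTextReplacer and ReplaceOutsideTags are invented, and it assumes well-formed HTML where every < in the document starts a tag.

```csharp
using System;
using System.Text;

static class HtmlTextReplacer
{
    // Hypothetical helper: replaces 'find' only in the text between tags.
    public static string ReplaceOutsideTags(string html, string find, string replacement)
    {
        if (string.IsNullOrEmpty(find)) return html;

        var result = new StringBuilder(html.Length);
        int pos = 0;
        while (pos < html.Length)
        {
            int tagStart = html.IndexOf('<', pos);
            if (tagStart < 0) tagStart = html.Length;

            // Text run between tags: the only place where replacement happens.
            string text = html.Substring(pos, tagStart - pos);
            result.Append(text.Replace(find, replacement));
            if (tagStart == html.Length) break;

            // Copy the tag itself, from '<' through the next '>', unchanged.
            int tagEnd = html.IndexOf('>', tagStart);
            if (tagEnd < 0) tagEnd = html.Length - 1; // unterminated tag: copy the rest as-is
            result.Append(html, tagStart, tagEnd - tagStart + 1);
            pos = tagEnd + 1;
        }
        return result.ToString();
    }
}
```

Calling ReplaceOutsideTags on `<p class="when">Tell me when, <b>when</b>!</p>` with "when" and "WHEN" gives `<p class="when">Tell me WHEN, <b>WHEN</b>!</p>`, so the attribute is untouched and every tag is copied through byte for byte. Known gaps in this sketch: it matches case-sensitively, it also matches inside longer words such as "whenever", and it treats the contents of script and style elements as ordinary text.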
