HTML Parser

philprice

Hey,

Is there a parser for HTML that works like, say, XPath? I'm currently using a regex like:

Visual Basic:
        ' Only matches when type, title and href appear in exactly this order.
        Dim strLinkRSSPattern As String = "<link.*?" & _
                                          "type=""application/rss\+xml""" & _
                                          "\stitle=""RSS""\shref=""(.+?)"".*?>"

Which is obviously a) lame and b) won't work on all layouts.

I basically just want to pick out the href if certain attributes in <link .. > match. I can't pass it to the XML parser because HTML is more liberal and less strict than XML.
 
You can use XPath if the HTML is well-formed or is XHTML compliant.

Well-formed examples:
Code:
<link type="application/rss+xml" href="www.rss.com/rss.xml" />

<html>
    <head>
        <title>Demo title</title>
    </head>
    <body>
    Demo text
    </body>
</html>

<p>Text<br />More Text</p>

Malformed examples:
Code:
<link type=application/rss+xml href="www.rss.com/rss.xml">

<html>
    <head>
        <title>Demo title</title>
    </head>
    <body>
    <b><i>Demo text</b></i>
    </body>
</html>

<p>Text<br>More Text
If the HTML is malformed you're out of luck when it comes to built-in framework classes.
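
For the well-formed case, here is a rough sketch of what the XPath route could look like with XmlDocument. The module name and the sample markup are made up for illustration, and it assumes the page has no default XHTML namespace.

```vb
Imports System
Imports System.Xml

Module XPathLinkDemo
    Sub Main()
        ' Sample input invented for this sketch; it must be well-formed XML.
        Dim html As String = _
            "<html><head>" & _
            "<title>Demo title</title>" & _
            "<link type=""application/rss+xml"" title=""RSS"" href=""www.rss.com/rss.xml"" />" & _
            "</head><body>Demo text</body></html>"

        Dim doc As New XmlDocument()
        doc.LoadXml(html) ' throws XmlException if the input is malformed

        ' Select every <link> whose type matches, whatever the attribute order.
        Dim nodes As XmlNodeList = doc.SelectNodes("//link[@type='application/rss+xml']")
        For Each node As XmlNode In nodes
            Dim href As XmlAttribute = node.Attributes("href")
            If Not href Is Nothing Then
                Console.WriteLine(href.Value)
            End If
        Next
    End Sub
End Module
```

If the page declares xmlns="http://www.w3.org/1999/xhtml" on the root element, a plain //link will not match anything, and you would need an XmlNamespaceManager with a prefix in the query.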
 
Is it possible to just parse one node? If I pull out the line I want to parse, is that possible? It reduces the likelihood of malformedness.
 
Are you getting the HTML code and parsing out all of the links? I just wrote a program that does that and once it gets all links on a page it gets all the links on all of those pages and so on. Basically it never ends.
 
I'm parsing <link .. > tags for href data (NOT <a href="">) when the attributes match certain things. It's not as easy as <a href>, where you can basically just say "look for <a ANYTHING href="LINK" ANYTHING>", because I need to know the attributes for this. The only way I can think to do it in regex is to check whether it's a link tag, then pull out each bit separately, one parse at a time, with the regex object.
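
The two-pass idea described above could be sketched roughly like this: one regex finds each <link> tag, a second one walks its attributes, so attribute order and quoting style stop mattering. FindFeedHref, the module name, and both patterns are hypothetical names and choices made for this sketch, not anything from an existing library.

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module LinkTagRegexDemo
    ' Pass 1: find each <link ...> tag. Pass 2: read its attributes one by one.
    Private ReadOnly TagPattern As New Regex("<link\b[^>]*>", RegexOptions.IgnoreCase)
    ' Attribute value may be double-quoted, single-quoted, or unquoted.
    Private ReadOnly AttrPattern As New Regex( _
        "([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))")

    ' Returns the href of the first <link> whose type matches, or Nothing.
    Function FindFeedHref(ByVal html As String, ByVal wantedType As String) As String
        For Each tag As Match In TagPattern.Matches(html)
            Dim attrs As New Hashtable()
            For Each attr As Match In AttrPattern.Matches(tag.Value)
                ' Only one of groups 2-4 takes part in a match; the others are empty.
                Dim value As String = attr.Groups(2).Value & attr.Groups(3).Value & attr.Groups(4).Value
                attrs(attr.Groups(1).Value.ToLower()) = value
            Next
            Dim typeValue As String = DirectCast(attrs("type"), String)
            If Not typeValue Is Nothing AndAlso String.Compare(typeValue, wantedType, True) = 0 Then
                Return DirectCast(attrs("href"), String)
            End If
        Next
        Return Nothing
    End Function

    Sub Main()
        ' href comes first and type is unquoted, which the single-regex approach would miss.
        Dim html As String = "<link href=""www.rss.com/rss.xml"" type=application/rss+xml title='RSS'>"
        Console.WriteLine(FindFeedHref(html, "application/rss+xml"))
    End Sub
End Module
```

This still breaks on a tag with a literal > inside a quoted attribute value, so it is a patch for common layouts and not a real HTML parser.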
 
There's no reason why you can't just do what you proposed: Cut out a single node and parse through it. I haven't had much time to play with the .NET XML libs, but equivalent Java libraries will usually throw a handy exception if the XML (HTML, in your case) is malformed.

When this happens, I'm pretty sure you're stuck dealing with the malformed HTML yourself: Either through some sort of OO parsing or a lengthy run of regexes. :)
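
On the .NET side, a minimal sketch of the "cut out one node and try it" approach might look like the following. XmlDocument.LoadXml throws XmlException on malformed input, so you can catch that and fall back to your own parsing. TryGetAttribute and the module name are hypothetical helpers written for this example.

```vb
Imports System
Imports System.Xml

Module SingleNodeDemo
    ' Returns the named attribute from a single tag, or Nothing if the tag
    ' is not well-formed XML or does not carry that attribute.
    Function TryGetAttribute(ByVal tag As String, ByVal name As String) As String
        Dim doc As New XmlDocument()
        Try
            doc.LoadXml(tag)
        Catch ex As XmlException
            Return Nothing ' malformed: the caller falls back to hand-rolled parsing
        End Try
        Dim attr As XmlAttribute = doc.DocumentElement.Attributes(name)
        If attr Is Nothing Then Return Nothing
        Return attr.Value
    End Function

    Sub Main()
        ' Self-closed and fully quoted: this one parses.
        Console.WriteLine(TryGetAttribute( _
            "<link type=""application/rss+xml"" href=""www.rss.com/rss.xml"" />", "href"))

        ' Unquoted value and no closing slash: LoadXml throws, so we get Nothing back.
        Dim result As String = TryGetAttribute( _
            "<link type=application/rss+xml href=""www.rss.com/rss.xml"">", "href")
        Console.WriteLine(result Is Nothing)
    End Sub
End Module
```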

If this is something you're seriously trying to implement, I would recommend trying it on live webpages and having a look at what sort of HTML gives you trouble, then patching over these. It's pretty hacky, but I don't know of a cleaner way to do it.

.steve
 