Xtreme .Net Talk

Posted
(?<=href=")\S+?(?=") will extract everything in href="..."

What else do you need?

 

 

Well, I'm still kind of learning Regex. I tried searching MSDN for the syntax or some beginner examples of Regex for VB.NET but haven't found any. Could you give me a small example of how I'd use that with a string? I'm just curious, not asking you to write the whole code (I'd never do that), just a small example that I could work with.

Posted

Check the sticky for info and tools; I use Regex Master.

Regex regex = new Regex(@"(?<=href="")\S+?(?="")");
Match match = regex.Match(htmlDocument);

 

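Now you can use the match object to iterate through all the matches: Match returns only the first one, so keep calling match.NextMatch() for as long as match.Success is true.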
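
A minimal sketch of that iteration, for anyone who wants something to paste and play with. The class name LinkExtractor, the method ExtractHrefs and the sample HTML in the usage line are made up for illustration; RegexOptions.IgnoreCase is an addition so that HREF= is also found. It uses Regex.Matches, which returns every match at once instead of stepping with NextMatch(). Like the pattern above, it only sees double-quoted href values that contain no whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above: lookbehind for href=", a lazy run of
    // non-whitespace characters, then a lookahead for the closing quote.
    // IgnoreCase is an assumption added here so HREF=" matches too.
    private static readonly Regex HrefRegex =
        new Regex(@"(?<=href="")\S+?(?="")", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value found in the html string.
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            links.Add(m.Value);
        }
        return links;
    }
}
```

Calling LinkExtractor.ExtractHrefs(htmlDocument) gives a list you can bind to a list box or filter further, for example with EndsWith(".jpg").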

"Who is John Galt?"
  • 4 weeks later...
Posted

You could just use the Document Object Model (you need to add a reference to Microsoft MSHTML):

 

Dim I As Integer                  ' loop counter (was declared As Object)
Dim WDoc As HTMLDocument
Dim Wlval As HTMLAnchorElement
Dim nelements As Integer
Dim sHref As String
Dim sTitle As String
Dim sText As String

WDoc = WebBrowser1.Document
nelements = WDoc.links.length

' Walk every link in the loaded document
For I = 0 To nelements - 1
    Wlval = WDoc.links.item(I)
    sHref = Wlval.href
    sText = Wlval.outerText
    sTitle = Wlval.title
    lstBox1.Items.Add(sHref)
    ' To see if it ends with .jpg, you could just do the following:
    If sHref.EndsWith(".jpg") Then
        lstBox2.Items.Add(sHref)
    End If
Next

 

By using this you can get so much information about a webpage :)

Posted
Anybody know of a good regex function for extracting links from HTML code? I'm finding it hard with the various ways to display links.

 

Here are a couple of good ones from http://www.regular-expressions.info/.

That is a great reference for both new and experienced Regex users.

 

// \b stops a tagName of "b" from also matching <body>
regexObj = new Regex("<" + tagName + "\\b[^>]*>(.*?)</" + tagName + ">");

matchObj = regexObj.Match(search);

 

You could then loop through the matches.

 

This next one does the same thing for any tag name. It uses a backreference (\1) so the closing tag has to match whatever opening tag was found; the text inside the tags is captured in group 2.

 

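// IgnoreCase is needed because [A-Z] alone would only match uppercase tag names
regexObj = new Regex(@"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", RegexOptions.IgnoreCase);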

matchObj = regexObj.Match(search);
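
To make the group numbering concrete, here is a small sketch that assumes the tags are not nested. TagMatcher and FindTags are hypothetical names, and the \b, RegexOptions.IgnoreCase and RegexOptions.Singleline are additions on my part (so lowercase tags match, <b does not match <body>, and the content may span lines). Group 1 is the tag name that \1 refers back to; group 2 is the text between the tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagMatcher
{
    // Group 1 = tag name, \1 = the same name again in the closing tag,
    // group 2 = everything between the opening and closing tag (lazy).
    private static readonly Regex TagRegex = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns (tag name, inner text) pairs in document order.
    public static List<KeyValuePair<string, string>> FindTags(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        // Step through the matches with NextMatch(), as described above.
        for (Match m = TagRegex.Match(html); m.Success; m = m.NextMatch())
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```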

 

You could write a generic print module to show all the results like this:

 

private void printMatch()
{
    // Regex.Match constructs and returns a Match object.
    // You can query this object to get all possible information about the match.
    while (matchObj.Success)
    {
        Console.WriteLine("Match offset: " + matchObj.Index.ToString() + "\r\n");
        Console.WriteLine("Match length: " + matchObj.Length.ToString() + "\r\n");
        Console.WriteLine("Matched text: " + matchObj.Value + "\r\n");

        if (matchObj.Groups.Count > 1)
        {
            // matchObj.Groups[0] holds the entire regex match, also held by
            // matchObj itself. The other Group objects hold the matches for
            // capturing parentheses in the regex.
            for (int i = 1; i < matchObj.Groups.Count; i++)
            {
                // Index into the collection; the [i] was missing in the original post
                Group g = matchObj.Groups[i];
                if (g.Success)
                {
                    Console.WriteLine("Group " + i.ToString() +
                        " offset: " + g.Index.ToString() + "\r\n");
                    Console.WriteLine("Group " + i.ToString() +
                        " length: " + g.Length.ToString() + "\r\n");
                    Console.WriteLine("Group " + i.ToString() +
                        " text: " + g.Value + "\r\n");
                }
                else
                {
                    Console.WriteLine("Group " + i.ToString() +
                        " did not participate in the overall match\r\n");
                }
            }
        }
        else
        {
            Console.WriteLine("no backreferences/groups");
        }

        // Get the next match
        matchObj = matchObj.NextMatch();
    }
}

 

Neither of these gets tags within tags. To do that you would need to loop through the captured groups and run the regex again on the text each one holds.
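
One way to read that, as a rough sketch rather than a definitive solution: search the captured inner text again, recursively. NestedTagFinder and Collect are hypothetical names, and the regex options are my own additions. A known limitation is that a tag nested inside a tag of the same name (a div inside a div) still breaks, because the lazy match stops at the first closing tag; a real HTML parser is usually more robust for that.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedTagFinder
{
    // Same backreference pattern as above: group 1 = tag name, group 2 = inner text.
    private static readonly Regex TagRegex = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: appends every tag name found in html to names,
    // descending into the inner text of each match before moving on.
    public static void Collect(string html, List<string> names)
    {
        for (Match m = TagRegex.Match(html); m.Success; m = m.NextMatch())
        {
            names.Add(m.Groups[1].Value);
            // Group 2 is the text between the tags; search it for nested tags.
            Collect(m.Groups[2].Value, names);
        }
    }
}
```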
