neodammer Posted August 8, 2005
Anybody know of a good regex function for extracting links from HTML code? I'm finding it hard with the various ways to display links.
Enzin Research and Development
IngisKahn Posted August 8, 2005
(?<=href=")\S+?(?=") will extract everything in href="...". What else do you need?
"Who is John Galt?"
neodammer Posted August 8, 2005
Well, I'm still kind of learning regex. I tried searching MSDN for the syntax or some beginner examples of Regex for VB.NET but haven't found any. Could you give me a small example of how I'd use that with a string? Just curious. I'm not asking you to write the whole code (I'd never do that), just a small example that I could work with.
IngisKahn Posted August 8, 2005
Check the sticky for info and tools; I use Regex Master.

    Regex regex = new Regex(@"(?<=href="")\S+?(?="")");
    Match match = regex.Match(htmlDocument);

Now you can use the match object to iterate through all the matches (call match.NextMatch() for as long as match.Success is true).
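For anyone following along, here is a minimal sketch of how the snippet above could be wrapped so that it returns every match rather than only the first. The names HrefExtractor and ExtractLinks and the sample HTML are made up for illustration, and RegexOptions.IgnoreCase is an addition so that HREF="..." is picked up too. It uses Regex.Matches, which returns all matches at once, instead of a NextMatch loop.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as in the post above: everything between href=" and the next ".
    // IgnoreCase is an assumption added here, not part of the original snippet.
    private static readonly Regex HrefPattern =
        new Regex(@"(?<=href="")\S+?(?="")", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the value of every double-quoted href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Sample input, invented for this example.
        string html = "<a href=\"http://example.com/a.html\">A</a> <a href=\"pic.jpg\">B</a>";
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Because the pattern relies on \S and a double quote, it will not pick up single-quoted or unquoted href values, or URLs that contain spaces.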
neodammer Posted August 9, 2005
Ahh, C# is good. I'll try to port it over to VB.NET. Thanks man, you rock. :cool: Just curious, wouldn't that take every link on the page? I guess that works. I will figure out how to include just the links with .jpg endings; it shouldn't be too hard.
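One way the .jpg filtering could be done is to keep the regex from earlier in the thread as it is and check each match afterwards. This is only a sketch: JpgLinkFilter and ExtractJpgLinks are hypothetical names, and the case-insensitive comparison is an assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class JpgLinkFilter
{
    // Hypothetical helper: collects href values that end in .jpg (any letter case).
    public static List<string> ExtractJpgLinks(string html)
    {
        var jpgLinks = new List<string>();
        // Same href pattern as posted earlier in the thread.
        foreach (Match m in Regex.Matches(html, @"(?<=href="")\S+?(?="")"))
        {
            if (m.Value.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            {
                jpgLinks.Add(m.Value);
            }
        }
        return jpgLinks;
    }

    public static void Main()
    {
        // Sample input, invented for this example.
        string html = "<a href=\"page.html\">page</a> <a href=\"photo.JPG\">photo</a>";
        foreach (string link in ExtractJpgLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

A link such as pic.jpg?size=large would be skipped by this check, since the string no longer ends in .jpg.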
decrypt Posted August 31, 2005
You could just use the Document Object Model (you need to add a reference to Microsoft MSHTML):

    ' Requires "Imports mshtml" at the top of the file.
    ' Assumes WebBrowser1 is a browser control whose Document property
    ' returns the MSHTML document, and that the page has finished loading.
    Dim I As Integer
    Dim WDoc As HTMLDocument
    Dim Wlval As HTMLAnchorElement
    Dim nelements As Integer
    Dim sHref As String
    Dim sTitle As String
    Dim sText As String

    WDoc = WebBrowser1.Document
    nelements = WDoc.links.length

    For I = 0 To nelements - 1
        Wlval = WDoc.links.item(I)
        sHref = Wlval.href
        sText = Wlval.outerText
        sTitle = Wlval.title
        lstBox1.Items.Add(sHref)

        ' To see if it ends with .jpg, you could just do the following.
        ' Note that EndsWith is case-sensitive, so ".JPG" will not match.
        If sHref.EndsWith(".jpg") Then
            lstBox2.Items.Add(sHref)
        End If
    Next

By using this you can get so much information about a webpage :)
MHOWLAND Posted September 4, 2005
Here are a couple of good ones from http://www.regular-expressions.info/. That is a great reference for new and experienced regex users alike.

    createRegexObj("<" + tagName + "[^>]*>(.*?)</" + tagName + ">");
    matchObj = regexObj.Match(search);

You could then loop through the matches. This next one does the same thing but uses a backreference (\1) to match the closing tag, and captures the text inside the tags in the second group.

    createRegexObj(@"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>");
    matchObj = regexObj.Match(search);

Note that [A-Z] only matches lowercase tag names if the regex is created with RegexOptions.IgnoreCase. You could write a generic print module to show all the results like this:

    private void printMatch()
    {
        // Regex.Match constructs and returns a Match object.
        // You can query this object to get all possible information about the match.
        while (matchObj.Success)
        {
            Console.WriteLine("Match offset: " + matchObj.Index.ToString() + "\r\n");
            Console.WriteLine("Match length: " + matchObj.Length.ToString() + "\r\n");
            Console.WriteLine("Matched text: " + matchObj.Value + "\r\n");
            if (matchObj.Groups.Count > 1)
            {
                // matchObj.Groups[0] holds the entire regex match, also held by
                // matchObj itself. The other Group objects hold the matches for
                // capturing parentheses in the regex.
                for (int i = 1; i < matchObj.Groups.Count; i++)
                {
                    Group g = matchObj.Groups[i];
                    if (g.Success)
                    {
                        Console.WriteLine("Group " + i.ToString() + " offset: " + g.Index.ToString() + "\r\n");
                        Console.WriteLine("Group " + i.ToString() + " length: " + g.Length.ToString() + "\r\n");
                        Console.WriteLine("Group " + i.ToString() + " text: " + g.Value + "\r\n");
                    }
                    else
                    {
                        Console.WriteLine("Group " + i.ToString() + " did not participate in the overall match\r\n");
                    }
                }
            }
            else
            {
                Console.WriteLine("no backreferences/groups");
            }
            // Get the next match
            matchObj = matchObj.NextMatch();
        }
    }

Neither of these gets tags nested within tags. You would need to loop through the backreferences to do that.
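To make the backreference version concrete, here is a small self-contained sketch. It does not use createRegexObj; it builds the Regex directly, and the options IgnoreCase and Singleline are assumptions added so that lowercase tags and content spanning several lines are matched. TagMatcher, FindTags and the sample HTML are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagMatcher
{
    // Group 1 is the tag name, \1 requires the same name in the closing tag,
    // group 2 is the text between the opening and closing tags.
    private static readonly Regex TagPattern = new Regex(
        @"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns one "name: inner text" entry per match.
    public static List<string> FindTags(string html)
    {
        var results = new List<string>();
        for (Match m = TagPattern.Match(html); m.Success; m = m.NextMatch())
        {
            results.Add(m.Groups[1].Value + ": " + m.Groups[2].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Sample input, invented for this example.
        string html = "<p>Some <b>bold</b> text</p><a href=\"x.html\">a link</a>";
        foreach (string entry in FindTags(html))
        {
            Console.WriteLine(entry);
        }
    }
}
```

With this sample input the nested b element is not reported as a match of its own; it only shows up inside the captured text of the enclosing p element, which is the "tags within tags" limitation mentioned above.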
mark007 Posted September 6, 2005
Check out the cs and vb tags - they format your code and make reading much easier. :)
Please check the Knowledge Base before you post. "Computers are useless. They can only give you answers." - Pablo Picasso The Code Net