trend, Posted July 30, 2005

Hello,

I need to take a string (4k characters or so), search for "return ss('go to", and take the URL right before it. Here is the text I will be looking through:

    tons of code <a id=aw5 href=/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')" tons of code

So basically I need to pull out

    /url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7

and store that in a string. (I say search for "return ss('go to" because I do not know how to make VB search for onMouseOver="return ss('go to, since there is an extra " in the middle.)

Any ideas?

thanks
lee

trend (Author), Posted July 30, 2005

I guess the problem I am having is that I don't know how to find out where the href=/url starts. There are multiple href=/url occurrences in the string.

DiverDan (*Experts*), Posted July 30, 2005

Have you tried breaking the string into an array?

    Dim FindURL() As String = tonsofcode.Split("href=")
    Dim strUrl As String = FindURL(1)

Member, in good standing, of the elite fraternity of mentally challenged programmers. Dolphins Software

trend (Author), Posted July 31, 2005

DiverDan wrote:
> Have you tried breaking the string into an array?
> Dim FindURL() As String = tonsofcode.Split("href=")
> Dim strUrl As String = FindURL(1)

Not exactly sure how to implement what you are saying. I tried:

    Private Sub GetPageHTTP(ByVal URL As String)
        'Ex.: dim s As string = GetPageHTTP("http://www.uol.com.br")
        'Ex.: dim x As string = GetPageHTTP("http://www.microsoft.com/")
        'Ex.: dim x As string = GetPageHTTP("http://www.planet-source-code.com/vb/default.asp?lngWId=10")
        Dim wc As New System.Net.WebClient
        Dim s As System.IO.Stream = wc.OpenRead(URL)
        Dim r As String
        Dim sr As System.IO.StreamReader = New System.IO.StreamReader(s, System.Text.Encoding.UTF7, False)
        r = sr.ReadToEnd()
        Dim FindURL() As String = r.Split("href=/url?")
        Dim strUrl As String = FindURL(8)
        MsgBox(strUrl)
    End Sub

And I definitely do not get anything out that I need :/

After looking at the web page a little longer, I could just search for "href=/url?" and take everything after that until there is whitespace. How would I do something like this? Or would it still be best to convert everything to an array?

thanks
Lee

DiverDan (*Experts*), Posted July 31, 2005 (edited)

Did you read and understand my post? What's FindURL(8)? This probably caused the program to crash, as there are only two values in the array. It appears that you also need to remove the " onMouseOver=..." section, so just split the text again. Cut and paste this code without changes.

    Private Sub GetPageHTTP(ByVal URL As String)
        'Ex.: dim s As string = GetPageHTTP("http://www.uol.com.br" )
        'Ex.: dim x As string = GetPageHTTP("http://www.microsoft.com/" )
        'Ex.: dim x As string = GetPageHTTP("http://www.planet-source-code.com/vb/default.asp?lngWId=10" )
        Dim wc As New System.Net.WebClient
        Dim s As System.IO.Stream = wc.OpenRead(URL)
        Dim r As String
        Dim sr As System.IO.StreamReader = New System.IO.StreamReader(s, System.Text.Encoding.UTF7, False)
        r = sr.ReadToEnd()
        Dim FindURL() As String = r.Split("href=")
        Dim strUrl As String = FindURL(1) 'not 8
        'This will return "/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')"
        'Now split it again to remove the " onMouseOver" section
        FindURL = strUrl.Split(" onMouseOver")
        strUrl = FindURL(0) 'not 8 or 1
        'This will return "/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7"
        'regardless of the text length both before "href=" and after the " onMouseOver" section.
        MsgBox(strUrl)
    End Sub

If the " onMouseOver" section is always there, this snippet will work. If some other " on" type command is there instead, split the last array with (" on") in place of (" onMouseOver"). The idea is to find "SET" points in the text string that will always be present and split at those points. You can check whether the split point exists in the string using the .IndexOf method.

    If strUrl.IndexOf(" on") >= 0 Then
        'Split the text here
    End If

Edited July 31, 2005 by DiverDan

Member, in good standing, of the elite fraternity of mentally challenged programmers. Dolphins Software
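
A note on splitting at a multi-character marker such as "href=": the one-argument `String.Split` takes characters, not a string, and as far as I can tell VB with Option Strict Off quietly narrows a string argument to its first character, so `Split("href=")` would cut the text at every "h". The sketch below splits on the whole marker using the VB runtime's `Strings.Split` instead. The module name, the `FirstHref` helper, and the sample text are all made up for illustration.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitSketch
    ' Hypothetical helper: returns the text between the first "href=" and the
    ' following " on", or "" when "href=" does not occur at all.
    Function FirstHref(ByVal html As String) As String
        ' Strings.Split accepts a String delimiter and splits on the whole marker.
        Dim parts() As String = Strings.Split(html, "href=")
        If parts.Length < 2 Then Return ""

        Dim url As String = parts(1)
        ' Cut at the first " on..." attribute, if there is one.
        Dim cut As Integer = url.IndexOf(" on")
        If cut >= 0 Then url = url.Substring(0, cut)
        Return url
    End Function

    Sub Main()
        ' Sample text is an assumption; note the doubled "" inside the literal.
        Dim sample As String = "tons of code <a id=aw5 href=/url?sa=l&q=http://www.example.com&num=7 onMouseOver=""return ss('go to www.example.com')"" tons of code"
        Console.WriteLine(FirstHref(sample)) ' prints /url?sa=l&q=http://www.example.com&num=7
    End Sub
End Module
```

This only handles the first "href=" in the text; pages with several links need a loop over the remaining array elements.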

trend (Author), Posted July 31, 2005 (edited)

I tried FindURL(0) before and didn't get anything, so I changed it to 8 because I was lost. With FindURL(0) I get this: http://ezinksystems.com/images/try2.gif

(By the way, I tried with this URL too: http://www.google.com/search?hl=en&q=dell and got the same results.)

I am stumped again. Any ideas?

thanks!

Edited July 31, 2005 by trend

DiverDan (*Experts*), Posted July 31, 2005

What this code does is look for the "href=" index and the " onMouseOver" index, as your sample indicated. But now the conditions have changed. (This has become a moving target, which is impossible to deal with.)

So, if I understand correctly, you want the string to contain:

    "/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7"

or something similar, from an unknown string of unknown length. Is that correct?

Do all of the URL strings begin with "<a id=aw5 href=", or at least contain "href=" preceding the "/url?"

Can the wanted URL string end right after the "href=" prefix is stripped out, or is there always a " onMouse...something", or at least " on", following?

In other words, what are the conditions and possibilities of the URL string? This must be known before anything will work consistently.

Member, in good standing, of the elite fraternity of mentally challenged programmers. Dolphins Software

trend (Author), Posted August 1, 2005

Oh ok. Well, let me phrase my problem a different way.

I have 4k of HTML, something to this effect:

    tons of code <a id=aw5 href=/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ_H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')" tons of code

I want to pull out all the contents between "href=/url?" and "onMouseOver="return ss('go to " (and there will be multiple instances of the above).

In this example, I would want this pulled out:

    sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ_H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7

Dan, thanks for all the help!

Lee

/edit/ There appeared to be spaces in the URL. In reality there are not. I am not sure how they got there (I must have added them by mistake).

DiverDan (*Experts*), Posted August 1, 2005

Hi Lee,

Do all the instances of "sa=...num=7" begin with the prefix "href=/url?" and end with the suffix " on"? If they do, then this technique will work.

Member, in good standing, of the elite fraternity of mentally challenged programmers. Dolphins Software

trend (Author), Posted August 1, 2005

DiverDan wrote:
> Hi Lee, Do all the instances of "sa=...num=7" begin with "href=/url?" ?? If they do then this technique will work.

Yes, all instances begin with "href=/url" (some instances are "bs=...num=2", FYI, but I don't think this is going to matter in your solution).

thanks!
Lee

DiverDan (*Experts*), Posted August 1, 2005

What's this going to be used for? It seems that, given the source of a search engine results page, this code will return all the listed URLs, which could result in SPAM.

Sorry, but I think I'll stop here.

Member, in good standing, of the elite fraternity of mentally challenged programmers. Dolphins Software

trend (Author), Posted August 1, 2005

I will PM because this is confidential.

ALEX_0077, Posted August 1, 2005

If it's not for spam, and is indeed a legit use (I'm not saying it IS for illegitimate use), catch every instance of the <A></A> HTML tags, then split those by the href="" values, and take everything between the first and last quote marks.

Me = 49% Linux, 49% Windows, 2% Hot gas. ...Bite me. My Site: www.RedPierSystems.net -.net, php, AutoCAD VBA, Graphics Design

trend (Author), Posted August 2, 2005 (edited)

ALEX_0077 wrote:
> If it's not for spam, and is indeed a legit use (I'm not saying it IS for illegitimate use), catch every instance of the <A></A> HTML tags, then split those by the href="" values, and take everything between the first and last quote marks.

Hah, yeah. I can agree it sounds shady, especially since I cannot provide too much detail. But I promise it is legit. I even told one member my idea, and he stopped helping and is going to run with my idea (hah, hopefully not, but that is what it is looking like now).

Anyway, why wouldn't I just look for every instance of "href=/url" and "onMouseOver="return ss('go to " (how would I deal with the " in the middle?) and then I have exactly the info I need.

thanks
Lee

/edit/ I guess I could use something like:

    'expects an --Imports System.Text.RegularExpressions-- at the top
    'Dim rx As New Regex("\<font.*\>(.+?) .+?;(.+)\</font\>")
    Dim rx As New Regex("href=/url?(?<link>.*)\s*onMouseOver=")
    Dim m As Match = rx.Match(r)
    'If the match succeeded (m.Success), m will contain the groups inside
    'the Groups property
    'NB: group 0 contains the entire match (which in this case is the entire string)
    'so the index to start here would be 1
    MsgBox("result 1: " + m.Groups(1).Value)
    MsgBox("result 2: " + m.Groups(2).Value)

but I cannot find the exact regex string to make it work. I do not think the ending condition is being satisfied (because result 1 captures everything from href=/url? to the end). Ideas?

Edited August 2, 2005 by trend

trend (Author), Posted August 2, 2005

Looks like:

    Dim rx As New Regex("href=/url?((.|\n)*?)onMouseOver=")
    Dim m As Match = rx.Match(r)
    'If the match succeeded (m.Success), m will contain the groups inside
    'the Groups property
    'NB: group 0 contains the entire match (which in this case is the entire string)
    'so the index to start here would be 1
    MsgBox("result 1: " + m.Groups(1).Value)
    MsgBox("result 2: " + m.Groups(2).Value)

works to capture a single link like I want! But it will only capture a single instance in the HTML file (even though there is more than one). Any ideas?
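
`Regex.Match` returns only the first match, which would explain getting a single link. Below is a minimal sketch that uses `Regex.Matches` to walk every occurrence. It also escapes the `?` after `/url` (unescaped, it makes the preceding `l` optional rather than matching a literal question mark, so the `?` ends up inside the capture) and reads the result from a named lazy group. The module name, the `ExtractLinks` helper, and the sample HTML are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexSketch
    ' Hypothetical helper: collects the text between "href=/url?" and the
    ' whitespace that precedes "onMouseOver=" for every link in the input.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)
        ' \? matches a literal question mark; .*? is lazy so it stops at the
        ' nearest onMouseOver=; Singleline lets "." match newlines as well.
        Dim rx As New Regex("href=/url\?(?<link>.*?)\s+onMouseOver=", RegexOptions.Singleline)
        For Each m As Match In rx.Matches(html)
            results.Add(m.Groups("link").Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Sample HTML is made up; "" is how a double quote is written in a VB literal.
        Dim sample As String = _
            "<a id=aw5 href=/url?sa=l&q=http://www.example.com&num=7 onMouseOver=""return ss('go to www.example.com')"">" & _
            " tons of code " & _
            "<a id=aw6 href=/url?bs=l&q=http://www.example.org&num=2 onMouseOver=""return ss('go to www.example.org')"">"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

One caveat with the lazy group: if some `href=/url?` link has no `onMouseOver=` after it, the match would run on to the next link's `onMouseOver=`, so the pattern relies on every link having both markers.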

IngisKahn, Posted August 2, 2005

I'm not sure what the problem is here... String.Substring and String.IndexOf are all you need.

Anyway, if you're using RegEx then .Matches will return a collection.

"Who is John Galt?"
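
A rough sketch of the `IndexOf`/`Substring` approach, looping so that every occurrence is collected. It also shows one way to search for the marker that contains a double quote: inside a VB string literal, a `"` is written as `""`. The module name, the `ExtractBetween` helper, and the sample HTML are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module IndexOfSketch
    ' Hypothetical helper: returns every piece of text found between a prefix
    ' marker and the next suffix marker, scanning the input left to right.
    Function ExtractBetween(ByVal html As String, ByVal prefix As String, ByVal suffix As String) As List(Of String)
        Dim results As New List(Of String)
        Dim pos As Integer = 0
        While True
            Dim startIdx As Integer = html.IndexOf(prefix, pos)
            If startIdx < 0 Then Exit While        ' no more prefixes
            startIdx += prefix.Length              ' skip past the prefix itself
            Dim endIdx As Integer = html.IndexOf(suffix, startIdx)
            If endIdx < 0 Then Exit While          ' prefix without a closing suffix
            ' Trim drops the space that separates the URL from onMouseOver.
            results.Add(html.Substring(startIdx, endIdx - startIdx).Trim())
            pos = endIdx + suffix.Length           ' continue after this suffix
        End While
        Return results
    End Function

    Sub Main()
        ' The "" below is a single double-quote character in the actual string.
        Dim suffix As String = "onMouseOver=""return ss('go to "
        Dim sample As String = _
            "<a id=aw5 href=/url?sa=l&q=http://www.example.com&num=7 onMouseOver=""return ss('go to www.example.com')"">" & _
            " tons of code " & _
            "<a id=aw6 href=/url?bs=l&q=http://www.example.org&num=2 onMouseOver=""return ss('go to www.example.org')"">"
        For Each link As String In ExtractBetween(sample, "href=/url?", suffix)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Compared with a regular expression, this keeps the markers as plain strings, so characters such as `?` need no escaping.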