Xtreme .Net Talk

I need to search a string and get the word before a string. How do I do this?



Posted

Hello, I need to take a string (4k characters or so), search for "return ss('go to" and take the URL right before it.

 

Here is the txt I will be looking through:

 

tons of code
<a id=aw5 href=/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')"
tons of code

So basically I need to strip out

/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7

and store that in a string.

 

 

 

(I say search for "return ss('go to" because I do not know how to make VB search for "onMouseOver="return ss('go to " because of the extra " in the middle.)
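
On the extra quote: in a VB string literal, a double quote is written by doubling it, so the full marker can be searched for directly. A minimal sketch, assuming .NET 2.0 or later; the module name `QuoteDemo`, the function `FindMarker` and the shortened sample HTML are illustrative only and not from the original page.

```vbnet
Imports System

Module QuoteDemo
    ' Returns the position of the full marker, or -1 if it is absent.
    Function FindMarker(ByVal html As String) As Integer
        ' "" inside a VB string literal stands for a single " character.
        Dim marker As String = "onMouseOver=""return ss('go to "
        Return html.IndexOf(marker, StringComparison.Ordinal)
    End Function

    Sub Main()
        ' Shortened, hypothetical sample of the anchor tag from the post.
        Dim html As String = "<a id=aw5 href=/url?sa=l&num=7 onMouseOver=""return ss('go to www.uis-insurance.com')"""
        Console.WriteLine(FindMarker(html))
    End Sub
End Module
```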

 

Any ideas?

 

 

thanks

lee

Posted
I guess the problem I am having is that I don't know how to find out where the href=/url starts. There are multiple href=/url in the string.
  • *Experts*
Posted

Have you tried breaking the string into an array?

Dim FindURL() As String = tonsofcode.Split(New String() {"href="}, StringSplitOptions.None)

Dim strUrl As String = FindURL(1)

Member, in good standing, of the elite fraternity of mentally challenged programmers.

 

Dolphins Software

Posted
Not exactly sure how to implement what you are saying. I tried:

 

   Private Sub GetPageHTTP(ByVal URL As String)
       'Ex.: dim s As string = GetPageHTTP("http://www.uol.com.br")
       'Ex.: dim x As string = GetPageHTTP("http://www.microsoft.com/")
       'Ex.: dim x As string = GetPageHTTP("http://www.planet-source-code.com/vb/default.asp?lngWId=10")
       Dim wc As New System.Net.WebClient
       Dim s As System.IO.Stream = wc.OpenRead(URL)
       Dim r As String
       Dim sr As System.IO.StreamReader = New System.IO.StreamReader(s, System.Text.Encoding.UTF7, False)
       r = sr.ReadToEnd()


       Dim FindURL() As String = r.Split("href=/url?")
       Dim strUrl As String = FindURL(8)

       MsgBox(strUrl)

   End Sub

 

 

And I definitely do not get anything out that I need :/

 

After looking at the webpage a little longer, I could just search for "href=/url?" and take everything after that up to the next whitespace.

 

How would I do something like this? Or would it still be best to convert all to an array?
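
One way to do the "search for the prefix, then read up to whitespace" idea without an array is a loop over IndexOf and Substring. This is only a sketch, assuming .NET 2.0 or later; the names `UrlScan` and `ExtractQueries` and the two-link sample string are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module UrlScan
    ' Collects the text that follows each "href=/url?" up to the next whitespace.
    Function ExtractQueries(ByVal html As String) As List(Of String)
        Const Prefix As String = "href=/url?"
        Dim stops() As Char = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim results As New List(Of String)
        Dim pos As Integer = html.IndexOf(Prefix, StringComparison.Ordinal)
        While pos >= 0
            Dim start As Integer = pos + Prefix.Length
            ' The value ends at the first whitespace character, or at the end of the text.
            Dim finish As Integer = html.IndexOfAny(stops, start)
            If finish < 0 Then finish = html.Length
            results.Add(html.Substring(start, finish - start))
            ' Continue searching after the value just read.
            pos = html.IndexOf(Prefix, finish, StringComparison.Ordinal)
        End While
        Return results
    End Function

    Sub Main()
        ' Hypothetical sample with two links.
        Dim html As String = "<a href=/url?sa=l&num=7 onMouseOver=x> <a href=/url?bs=2&num=2 onMouseOver=y>"
        For Each q As String In ExtractQueries(html)
            Console.WriteLine(q)
        Next
    End Sub
End Module
```

For the sample string this prints sa=l&num=7 and bs=2&num=2, one per line.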

 

 

thanks

Lee

  • *Experts*
Posted (edited)

Did you read and understand my post? What's FindURL(8)? This probably caused the program to crash, as there are only two values in the array.

It appears that you also need to remove the " onMouseOver=..." section. So just split the text again.

Cut and paste this code without changes.

   Private Sub GetPageHTTP(ByVal URL As String)
       'Ex.: GetPageHTTP("http://www.uol.com.br")
       'Ex.: GetPageHTTP("http://www.microsoft.com/")
       'Ex.: GetPageHTTP("http://www.planet-source-code.com/vb/default.asp?lngWId=10")
       Dim wc As New System.Net.WebClient
       Dim s As System.IO.Stream = wc.OpenRead(URL)
       Dim r As String
       Dim sr As System.IO.StreamReader = New System.IO.StreamReader(s, System.Text.Encoding.UTF7, False)
       r = sr.ReadToEnd()

       'Pass the delimiter as a String array so the whole text "href=" is the split point
       '(a plain String argument may be treated as characters instead of one delimiter).
       Dim FindURL() As String = r.Split(New String() {"href="}, StringSplitOptions.None)
       Dim strUrl As String = FindURL(1) 'not 8

       'For the sample line this returns "/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')"

       'Now split it again to remove the " onMouseOver" section
       FindURL = strUrl.Split(New String() {" onMouseOver"}, StringSplitOptions.None)
       strUrl = FindURL(0) 'not 8 or 1

       'This will return "/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7" regardless of the text length both before "href=" and after the " onMouseOver" section.

       MsgBox(strUrl)

   End Sub

If the " onMouseOver" section is always there, this snippet will work. If some other " on..." event attribute follows instead, split the last array with (" on") in place of (" onMouseOver"). The idea is to find set points in the text string that will always be present and split at those points. You can check whether a split point exists in the string using the .IndexOf method.

 

If strUrl.IndexOf(" on") >= 0 Then

'Split the text here

End If
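
Putting the split points and the IndexOf guard together might look like the sketch below. It assumes .NET 2.0 or later, where String.Split accepts a String array. The array form matters because, as far as I know, a plain String argument on the .NET Framework of that time is converted to characters rather than used as one delimiter. The names `SplitDemo` and `UrlBetweenMarkers` and the sample line are illustrative only.

```vbnet
Imports System

Module SplitDemo
    ' Returns the text between the first "href=" and the following " on",
    ' or an empty string when either split point is missing.
    Function UrlBetweenMarkers(ByVal html As String) As String
        If html.IndexOf("href=", StringComparison.Ordinal) < 0 Then Return ""
        ' A String() separator makes Split match the whole delimiter text.
        Dim parts() As String = html.Split(New String() {"href="}, StringSplitOptions.None)
        Dim tail As String = parts(1)
        If tail.IndexOf(" on", StringComparison.Ordinal) < 0 Then Return ""
        Return tail.Split(New String() {" on"}, StringSplitOptions.None)(0)
    End Function

    Sub Main()
        ' Hypothetical, shortened sample line.
        Dim html As String = "<a id=aw5 href=/url?sa=l&num=7 onMouseOver=x>"
        Console.WriteLine(UrlBetweenMarkers(html))
    End Sub
End Module
```

Note that this only looks at the first "href=" in the text, so it suits a single tag rather than a whole page.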

Edited by DiverDan


Posted (edited)

I tried FindURL(0) before and didn't get anything, so I changed it to 8 because I was lost.

 

With FindURL(0) I get this:

 

http://ezinksystems.com/images/try2.gif

 

(By the way, I tried with this URL too:

http://www.google.com/search?hl=en&q=dell and got the same results.)

I am stumped again. Any ideas?

 

thanks!

Edited by trend
  • *Experts*
Posted

What this code does is look for the "href=" index and the " onMouseOver" index, as your sample indicated. But now the conditions have changed. (This has now become a moving target, which is impossible to deal with.)

 

So....

If I understand correctly you want the string to contain:

"/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZH4_____wGYAdpRqgEEMk5SU8gBAQ&num=7"

 

or something similar from an unknown string, with an unknown length. Is that correct?

 

Do all of the URL strings begin with "<a id=aw5 href=" or at least contain "href=" preceding the "/url?"?

Can the wanted URL string end after the "href=" prefix is stripped out, or is there always a " onMouse...something" or at least " on" following?

In other words, what are the conditions and possibilities of the URL string? This must be known before anything will consistently work.


Posted

Oh ok.

 

Well, let me phrase my problem a different way.

 

I have 4k of HTML. Something to this effect:

 

tons of code
<a id=aw5 href=/url?sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ_H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7 onMouseOver="return ss('go to www.uis-insurance.com')"
tons of code

 

I want to strip out all the contents between "href=/url?" and "onMouseOver="return ss('go to "

(and there will also be multiple instances of the above.)

In this example, I would want this stripped out:

 

sa=l&q=http://www.uis-insurance.com&ai=BAUQ5nBPrQuP9O7aeaJfTqOMJ0vqDCuqvqKQBwte0CrD6TxAFGAcoCDgASIo5UNyesZ_H4_____wGYAdpRqgEEMk5SU8gBAQ&num=7

 

 

Dan, thanks for all the help!

 

Lee

 

 

/edit/ There appeared to be spaces in the URL. In reality there are none. I am not sure how they got there (I must have added them by mistake).

  • *Experts*
Posted

Hi Lee,

Do all the instances of "sa=...num=7" begin with the prefix "href=/url?" and end with the suffix " on"??

If they do then this technique will work.


Posted
Hi Lee,

Do all the instances of "sa=...num=7" begin with "href=/url?" ??

If they do then this technique will work.

 

 

Yes, all instances begin with "href=/url" (some instances are "bs=...num=2", FYI, but I don't think this is going to matter in your solution).

 

 

thanks!

Lee

  • *Experts*
Posted

What's this going to be used for?

 

It seems that from viewing the code of a search engine web page this code will return all the listed URLs...which could result in SPAM.

 

Sorry, but I think I'll stop here.


Posted
If it's not for spam, and is indeed a legit use (I'm not saying it IS for illegitimate use), catch every instance of the <A></A> HTML tags, then split those by the href="" values, and get everything between the first and last instance of the " " quotes.

Me = 49% Linux, 49% Windows, 2% Hot gas.

 

...Bite me.

 

My Site: www.RedPierSystems.net

-.net, php, AutoCAD VBA, Graphics Design

Posted (edited)

 

Hah, yeah, I can agree it sounds shady, especially since I cannot provide too much detail. But I promise it is legit. I even told one member my idea, and he stopped helping and is going to run with my idea (hah, hopefully not, but that is what it is looking like now).

 

Anyway, why wouldn't I just look for:

every instance of "href=/url" and "onMouseOver="return ss('go to " (how would I deal with the " in the middle?)

And then I have exactly the info I need.

 

thanks

Lee

 

/edit/ I guess I could use something like:

 

'expects an --Imports System.Text.RegularExpressions-- at the top 
       'Dim rx As New Regex("\<font.*\>(.+?) .+?;(.+)\</font\>")

       Dim rx As New Regex("href=/url?(?<link>.*)\s*onMouseOver=")
       Dim m As Match = rx.Match(r)

       'If the match succeeded (m.Success), m will contain the groups inside 
       'the Groups property 
       'NB: group 0 contains the entire match (which in this case is the entire string) 
       'so the index to start here would be 1 
       MsgBox("result 1: " + m.Groups(1).Value)
       MsgBox("result 2: " + m.Groups(2).Value)

 

But I cannot find the exact regex string to make this work. I do not think the ending condition is being satisfied (because result 1 captures everything from href=/url? to the end).
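
A likely reason for the over-long capture is that .* is greedy: it runs on to the last "onMouseOver=" it can reach instead of stopping at the first. The sketch below compares a greedy and a lazy version of the pattern on a made-up two-link line. It also escapes the question mark as \?, since a bare ? in a regex makes the preceding character optional rather than matching a literal "?". The names `GreedyDemo` and `FirstLink` are illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Returns the named group "link" from the first match, or "" if there is none.
    Function FirstLink(ByVal html As String, ByVal pattern As String) As String
        Dim m As Match = Regex.Match(html, pattern)
        If m.Success Then Return m.Groups("link").Value
        Return ""
    End Function

    Sub Main()
        ' Hypothetical sample with two links on one line.
        Dim html As String = "<a href=/url?sa=1 onMouseOver=x> <a href=/url?bs=2 onMouseOver=y>"
        ' Greedy: .* extends to the last "onMouseOver=" on the line.
        Console.WriteLine(FirstLink(html, "href=/url\?(?<link>.*)\s*onMouseOver="))
        ' Lazy: .*? stops at the first "onMouseOver=".
        Console.WriteLine(FirstLink(html, "href=/url\?(?<link>.*?)\s*onMouseOver="))
    End Sub
End Module
```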

 

ideas?

Edited by trend
Posted

Looks like:

       Dim rx As New Regex("href=/url\?((.|\n)*?)\s*onMouseOver=")

       Dim m As Match = rx.Match(r)

       'If the match succeeded (m.Success), m will contain the groups inside 
       'the Groups property 
       'NB: group 0 contains the entire match (which in this case is the entire string) 
       'so the index to start here would be 1 
       MsgBox("result 1: " + m.Groups(1).Value)
       MsgBox("result 2: " + m.Groups(2).Value)

works to capture a single link like I want! But it will only capture a single instance in the HTML file (even though there is more than one).

 

Any ideas?

Posted

I'm not sure what the problem is here...

String.Substring and String.IndexOf are all you need. Anyway, if you're using RegEx then .Matches will return a collection.
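
As a rough illustration of the .Matches suggestion: Regex.Matches returns every non-overlapping match, so looping over the collection gives one entry per link. This sketch assumes each link has the form href=/url?... followed by whitespace and onMouseOver=, as described earlier in the thread. The names `AllLinks` and `ExtractLinks` and the sample text are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module AllLinks
    ' Returns the query part of every href=/url? link that is followed by onMouseOver=.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        ' \? is a literal "?"; \S+ is the run of non-whitespace characters after it.
        Dim rx As New Regex("href=/url\?(?<link>\S+)\s+onMouseOver=")
        For Each m As Match In rx.Matches(html)
            links.Add(m.Groups("link").Value)
        Next
        Return links
    End Function

    Sub Main()
        ' Hypothetical sample with two links on separate lines.
        Dim html As String = "<a href=/url?sa=l&num=7 onMouseOver=x>" & Environment.NewLine &
                             "<a href=/url?bs=2&num=2 onMouseOver=y>"
        For Each link As String In ExtractLinks(html)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```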

"Who is John Galt?"
