Xtreme .Net Talk


Posted

This is for learning purposes and doesn't have to work perfectly. I would like to create a simplified web spider, but I'm running into a few problems and questions.

 

- I can't seem to read an HTML page into a string. My program does nothing for a while, then an error message appears saying the connection was closed and it could not connect to the remote server. Is this my PC, or am I forgetting something? Does anyone have working code for this?

 

- If I ever manage to successfully read in an HTML file, what would be the best way to collect all the <a> tags from it? I want to use each of these links as the next URL to visit.

 

- If I collect all of the links from a page and then do the same for those pages, chances are my program will end up in a never-ending loop. For example, if 4 pages share the same links (a menu), my program will keep visiting the same pages. What would be a good and fast way to check which URLs I have already visited?

 

It's for a school assignment, so it doesn't have to work really well, just well enough.

  • *Experts*
Posted

Well, for the first question, use a WebClient:

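' WebClient must be created with New before it can be used
Dim wc As New System.Net.WebClient()
Dim bhtml() As Byte, html As String

' The address needs a scheme such as http://
bhtml = wc.DownloadData("http://www.somesite.com")
html = System.Text.Encoding.ASCII.GetString(bhtml)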

For your second question, there are a few choices. You can go with some simple link parsing using .IndexOf calls, .Substring calls, and the like, but I would use regular expressions (the System.Text.RegularExpressions namespace) for something like this. Regular expressions are a sort of search language: you write a pattern, and the engine attempts to find matches for it. They take a while to master, but they can save a lot of time once you learn them. Read about them on MSDN.
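
To make the regular expression idea concrete, here is a rough sketch of pulling href values out of <a> tags. It assumes .NET 2.0 or later (for List(Of String)); the ExtractLinks function, the pattern, and the sample HTML are made up for illustration, and the pattern only handles quoted href values, so treat it as a starting point rather than a complete HTML parser.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Hypothetical helper: returns the href value of every <a> tag in the HTML.
    ' Simplification: only href values wrapped in single or double quotes match.
    Function ExtractLinks(ByVal html As String) As List(Of String)
        Dim links As New List(Of String)
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ' Group 1 is the text between the quotes
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main()
        ' Made-up sample page with one absolute and one relative link
        Dim sample As String = "<p><a href=""http://www.somesite.com/a.html"">A</a> " & _
                               "<A class='x' HREF='/b.html'>B</A></p>"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

For the sample string this prints http://www.somesite.com/a.html and /b.html. Relative links such as /b.html would still need to be resolved against the page's own address before visiting them, for example with New Uri(baseUri, link).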

 

As for your third question, you could use an ArrayList, I believe. Like this:

Dim visited As New ArrayList()

'After visiting:
visited.Add("http://www.somesite.com")

'To check if it's been visited:
If visited.Contains("http://www.somesite.com") Then
    'Already visited, so skip this URL
End If
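
Since the question asked for something fast: ArrayList.Contains scans the whole list on every check, so a hash-based collection may scale better on larger crawls. Below is a sketch of the visit loop using a Dictionary as the visited set and a Queue of pending URLs, which is a swapped-in alternative to the ArrayList and assumes .NET 2.0 or later. GetLinks is a hypothetical stand-in that returns hard-coded links instead of downloading anything, so the example runs without a network connection.

```vbnet
Imports System
Imports System.Collections.Generic

Module CrawlSketch
    ' Hypothetical stand-in for "download the page and extract its links".
    ' The three fake pages all link back to each other, like a shared menu.
    Function GetLinks(ByVal url As String) As String()
        Select Case url
            Case "http://www.somesite.com/"
                Return New String() {"http://www.somesite.com/a", "http://www.somesite.com/b"}
            Case "http://www.somesite.com/a"
                Return New String() {"http://www.somesite.com/", "http://www.somesite.com/b"}
            Case Else
                Return New String() {"http://www.somesite.com/"}
        End Select
    End Function

    ' Visits every reachable URL exactly once and returns the visit order.
    Function Crawl(ByVal startUrl As String) As List(Of String)
        Dim visited As New Dictionary(Of String, Boolean)  ' keys are visited URLs
        Dim order As New List(Of String)
        Dim pending As New Queue(Of String)
        pending.Enqueue(startUrl)

        While pending.Count > 0
            Dim url As String = pending.Dequeue()
            If Not visited.ContainsKey(url) Then
                visited.Add(url, True)
                order.Add(url)
                For Each link As String In GetLinks(url)
                    ' Only queue links that have not been visited yet
                    If Not visited.ContainsKey(link) Then pending.Enqueue(link)
                Next
            End If
        End While
        Return order
    End Function

    Sub Main()
        For Each url As String In Crawl("http://www.somesite.com/")
            Console.WriteLine("Visiting " & url)
        Next
    End Sub
End Module
```

Even though the fake pages link in a circle, the loop ends after three visits. On .NET 3.5 or later, HashSet(Of String) would express the same idea more directly.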

Posted

Thanks, I'll look into it.

 

However, I still can't get the HTML reading to work. Even with your code I get the following error:

An unhandled exception of type 'System.Net.WebException' occurred in system.dll

 

Additional information: The underlying connection was closed: Unable to connect to the remote server.

 

I have internet on this PC via cable (and I'm behind a router). What should I do to fix this?

Posted

I have http:// in front of it, and I tried multiple URLs. I even tried specifying the full path to an HTML file (so .../index.html), with no luck.

I'm using a proxy; perhaps that's the problem? Can anyone help? What can I do to fix this? Even the sample code from MSDN gives me the same error.

Posted

I have a similar problem when using the WebClient class, except my error message is "The underlying connection was closed: The remote name could not be resolved."

 

I believe it to be a security/configuration issue, as our internet connection goes through one of our company's proxy servers and prompts for a username and password. I have to enter these every time I open a browser and attempt to connect to the internet, but when I use the WebClient class the dialog asking for the username and password doesn't appear.
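
If the proxy really is the cause, one thing that might be worth trying is handing the proxy address and credentials to the WebClient explicitly, since it will not show a login dialog by itself. This is only a sketch: it assumes .NET 2.0 or later (where WebClient has a Proxy property and DownloadString), and the proxy address, user name, and password below are placeholders, not real values.

```vbnet
Imports System
Imports System.Net

Module ProxyDownload
    Sub Main()
        ' Placeholder proxy address and credentials; replace with your own.
        Dim proxy As New WebProxy("http://proxy.example.com:8080")
        proxy.Credentials = New NetworkCredential("username", "password")

        Dim wc As New WebClient()
        wc.Proxy = proxy

        Try
            ' Download the page as a string through the configured proxy
            Dim html As String = wc.DownloadString("http://www.somesite.com/")
            Console.WriteLine(html.Length.ToString() & " characters downloaded")
        Catch ex As WebException
            ' Connection, name resolution, and proxy failures all surface here
            Console.WriteLine("Download failed: " & ex.Message)
        Finally
            wc.Dispose()
        End Try
    End Sub
End Module
```

If the proxy accepts your Windows login, setting proxy.Credentials = CredentialCache.DefaultCredentials instead of a hard-coded user name and password may be enough.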

TT

(*_*)

 

There are 10 types of people in this world;

those that understand binary and those that don't.
