web spider questions

dotnetnoob

Newcomer · Joined Jan 26, 2003 · Messages: 3
This is for learning purposes and doesn't have to function really well... What I would like to do is create a simplified web spider, but I'm running into a few problems/questions.

- I can't seem to read an HTML page into a string. My program does nothing for a while, then an error message appears saying the connection was closed and it could not connect to the remote server. Is it my PC, or am I forgetting something? Does anyone have working code for this?

- If I ever manage to successfully read in an HTML file, what would be the best way to collect all the <a> tags out of it? I want to use all of those links as the next URLs to visit...

- If I collect all of the links from a page and then do the same for those pages, chances are my program ends up in a never-ending loop: for example, four pages that all share the same menu of links, and my program will keep on visiting the same pages. What would be a good and fast way to check which URLs I have already visited?

It's for a school assignment, so it doesn't have to work really well... just well enough.
 
Well, for the first question, use a WebClient:
Visual Basic:
Dim wc As New System.Net.WebClient()
Dim bhtml() As Byte, html As String

'DownloadData wants a full URI, including the http:// part
bhtml = wc.DownloadData("http://www.somesite.com")
html = System.Text.Encoding.ASCII.GetString(bhtml)
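Note that DownloadData throws a System.Net.WebException if the download fails (connection closed, name not resolved, and so on), so while you're testing it might help to wrap the call in a Try...Catch so you can see the full message:
Visual Basic:
Try
    bhtml = wc.DownloadData("http://www.somesite.com")
Catch ex As System.Net.WebException
    'connection closed, remote name not resolved, proxy refused, etc.
    Console.WriteLine(ex.Message)
End Try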
For your second question, there are a few choices. You could get by with some simple link parsing using IndexOf and Substring calls and the like, but I would use regular expressions (the System.Text.RegularExpressions namespace) for something like this. A regular expression is a sort of search language: you write a pattern, and it will find every match in the text. It's a complex thing to master, but it can save a lot of time once you learn it. Read about it in the MSDN.
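If you want a rough sketch to start from (assuming the page source is already in a string called html, as above; the pattern is deliberately simple and will miss unusual markup, but it's probably enough for a school assignment):
Visual Basic:
Imports System.Text.RegularExpressions   'at the top of the file

'html holds the page source downloaded earlier
Dim linkPattern As String = "<a[^>]+href\s*=\s*[""']?([^""' >]+)"
Dim m As Match

For Each m In Regex.Matches(html, linkPattern, RegexOptions.IgnoreCase)
    'Groups(1) is the captured href value, i.e. the next URL to visit
    Console.WriteLine(m.Groups(1).Value)
Next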

As for your third question, you could use an ArrayList, I believe. Like this:
Visual Basic:
Dim visited As New ArrayList()

'After visiting a page, remember its URL:
visited.Add("http://www.somesite.com")

'To check if it's been visited:
If visited.Contains("http://www.somesite.com") Then
    'skip it, it has already been crawled
End If
 
Thanks, I'll look into it.

However, I still can't get the HTML reading to work. Even with your code I get the following error:
An unhandled exception of type 'System.Net.WebException' occurred in system.dll

Additional information: The underlying connection was closed: Unable to connect to the remote server.

I have internet on this PC via cable (and I'm behind a router)... what should I do to fix this?
 
I have http:// in front of it, and I tried multiple URLs... I even tried specifying a full path to an HTML file (so .../index.html), but no luck.
I'm using a proxy... perhaps that's the problem? Please, anyone? What can I do to fix this? Even the sample code from MSDN gives me the same error.
 
I have a similar problem when using the WebClient class, except my error message is "The underlying connection was closed: The remote name could not be resolved."

I believe it to be a security/configuration issue, as our internet connection goes through one of our company's proxy servers and prompts for a username and password. I have to enter these every time I open a browser and attempt to connect to the internet, but when I use the WebClient class the dialog asking for the username and password never appears.
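One workaround that might be worth trying (just a sketch: "proxyserver", the port, and the username/password below are placeholders for your own settings) is to pass the proxy credentials along with the request yourself instead of relying on the browser prompt:
Visual Basic:
'Set up the proxy with explicit credentials
Dim proxy As New System.Net.WebProxy("http://proxyserver:8080", True)
proxy.Credentials = New System.Net.NetworkCredential("username", "password")

Dim request As System.Net.HttpWebRequest = _
    CType(System.Net.WebRequest.Create("http://www.somesite.com"), System.Net.HttpWebRequest)
request.Proxy = proxy

Dim response As System.Net.HttpWebResponse = _
    CType(request.GetResponse(), System.Net.HttpWebResponse)
Dim reader As New System.IO.StreamReader(response.GetResponseStream())
Dim html As String = reader.ReadToEnd()

reader.Close()
response.Close()
The same WebProxy object can be reused for every request, so a spider would only have to set it up once.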
 