dotnetnoob Posted January 26, 2003

This is for learning purposes and does not have to function really well. What I would like to do is create a simplified web spider, but I'm running into a few problems/questions.

- I can't seem to read an HTML page into a string. My program does nothing, then an error message appears saying the connection was closed and it could not connect to the remote server. Is this my PC, or am I forgetting something? Does anyone have working code for this?
- If I ever manage to successfully read in an HTML file, what would be the best way to collect all the <a> tags out of it? I want to use all of these links as the next URLs to visit.
- If I collect all of the links from a page and then do the same for those pages, chances are high that my program ends up in a never-ending loop. For example, with 4 pages that share the same links (a menu), my program will keep visiting the same pages. What would be a good and fast way to check which URLs I have already visited?

It's for a school assignment, so it doesn't have to work really well, just well enough.
*Experts* Volte Posted January 26, 2003

Well, for the first question, use a WebClient:

    Dim wc As New System.Net.WebClient()
    Dim bhtml() As Byte, html As String
    ' Download the raw bytes of the page
    bhtml = wc.DownloadData("www.somesite.com")
    ' Decode the bytes into a string
    html = System.Text.Encoding.ASCII.GetString(bhtml)

For your second question, there are a few choices. You can go with some simple link parsing using .IndexOf and .Substring calls and the like, but I would use regular expressions (System.Text.RegularExpressions) for something like this. Regular expressions are a sort of search language: you enter a pattern, and the engine attempts to find text that matches it. They are hard to master, but they can save a lot of time once you learn them. Read about them in MSDN.

As for your third question, you could use an ArrayList, I believe. Like this:

    Dim visited As New System.Collections.ArrayList()

    'After visiting:
    visited.Add("http://www.somesite.com")

    'To check if it's been visited:
    If visited.Contains("http://www.somesite.com") Then
        ' Already visited, so skip this URL
    End If
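To make the regular-expression and visited-list suggestions above more concrete, here is a minimal sketch. It is not code from this thread: the names LinkSketch, ExtractLinks, and FilterUnvisited are made up for illustration, the pattern is deliberately naive (it only handles quoted href values), and it swaps the ArrayList for a HashSet(Of String), which assumes a newer .NET version than the one available when this thread was written.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkSketch
    ' Naive pattern (an assumption, not a full HTML parser): captures the
    ' quoted href value of each <a> tag and ignores unquoted attributes.
    Private ReadOnly HrefPattern As New Regex(
        "<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase)

    ' Returns the absolute http/https URLs linked from html,
    ' resolving relative links against baseUri.
    Function ExtractLinks(html As String, baseUri As Uri) As List(Of String)
        Dim links As New List(Of String)()
        For Each m As Match In HrefPattern.Matches(html)
            Dim absolute As Uri = Nothing
            If Uri.TryCreate(baseUri, m.Groups(1).Value, absolute) Then
                ' Skip mailto:, javascript: and other non-web schemes.
                If absolute.Scheme = Uri.UriSchemeHttp OrElse
                   absolute.Scheme = Uri.UriSchemeHttps Then
                    links.Add(absolute.AbsoluteUri)
                End If
            End If
        Next
        Return links
    End Function

    ' Records each link in visited and returns only the ones not seen before.
    Function FilterUnvisited(links As IEnumerable(Of String),
                             visited As HashSet(Of String)) As List(Of String)
        Dim fresh As New List(Of String)()
        For Each link As String In links
            ' HashSet.Add returns False when the item is already present.
            If visited.Add(link) Then fresh.Add(link)
        Next
        Return fresh
    End Function
End Module
```

A spider loop would call ExtractLinks on each downloaded page and push only the result of FilterUnvisited onto its to-do queue, so a menu shared by several pages is followed once. The set lookup stays fast as the list of visited URLs grows, whereas ArrayList.Contains scans every entry.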
dotnetnoob (Author) Posted January 26, 2003

Thanks, I'll look into it. However, I still can't get the HTML reading to work. Even with your code I get the following error:

    An unhandled exception of type 'System.Net.WebException' occurred in system.dll
    Additional information: The underlying connection was closed: Unable to connect to the remote server.

I have internet on this PC via cable (and I'm behind a router). What should I do to fix this?
*Experts* Volte Posted January 26, 2003

Oh, add 'http://' in front of the URL, and make sure the address is valid.
dotnetnoob (Author) Posted January 26, 2003

I have http:// in front of it, and I tried multiple URLs. I even tried specifying a full path to an HTML file (so .../index.html), with no luck. I'm using a proxy; perhaps that's the problem?

Can anyone help? What can I do to fix this problem? Even the sample code from MSDN gives me the same error.
TechnoTone Posted January 27, 2003

I have a similar problem when using the WebClient class, except my error message is "The underlying connection was closed: The remote name could not be resolved."

I believe it to be a security/configuration issue, as our internet connection goes through one of our company's proxy servers, which prompts for a username and password. I have to enter these every time I open a browser and attempt to connect to the internet, but when I use the WebClient class the dialog asking for the username and password doesn't appear.

TT (*_*)
There are 10 types of people in this world; those that understand binary and those that don't.
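If the proxy really is the cause of the errors described above, one thing to try is giving the request an explicit proxy address and credentials instead of relying on the browser's login dialog. The sketch below is an untested assumption about these posters' setups, not a confirmed fix: the module and function names are hypothetical, and the proxy address, user name, and password must be replaced with real values from your network administrator.

```vbnet
Imports System
Imports System.Net

Module ProxySketch
    ' Builds a request that is routed through an explicit, authenticated proxy.
    ' Nothing is sent over the network until GetResponse() is called on the result.
    Function CreateProxiedRequest(url As String, proxyAddress As String,
                                  user As String, password As String) As HttpWebRequest
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)

        ' proxyAddress is a placeholder such as "http://proxy.example.local:8080".
        Dim proxy As New WebProxy(proxyAddress)
        proxy.Credentials = New NetworkCredential(user, password)

        request.Proxy = proxy
        Return request
    End Function
End Module
```

To read the page, call GetResponse() on the returned request and wrap the response's GetResponseStream() in a System.IO.StreamReader. HttpWebRequest is used here rather than WebClient because its Proxy property can be set per request.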