MoPraL
Posted September 19, 2004

Hello all,

I'm working in Visual C# Express 2005 (Whidbey) and I want to scan a plain-text file (mainly .html and other source pages) to extract every URI/URL, whether relative or absolute. Example:

    <a href="/james/photos.htm">James's Pics</a>
    <a href="http://www.jamespics.com">...</a>

I want to get every URI on the page. So far I only have a pattern for matching a URI, and it doesn't work: it throws an exception.

    string pattern = @"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]
     +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1
     }|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9
     ]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1
     }[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1
     -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-
     9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a
     ero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\
     ?\'\\\+&%\$#\=~_\-]+))*$";
    Regex regexp = new Regex(pattern);
    MatchCollection mc = regexp.Matches("http://www.yahoo.co.uk/nostream.php?acting=lolz", 0);
    foreach (Match match in mc)
    {
        MessageBox.Show(match.Value);
    }

And the exception:

    parsing \"^(http|https|ftp)\\://([a-zA-Z0-9\\.\\-]+(\\:[a-zA-Z0-9\\.&%\\$\\-]\r\n +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1\r\n }|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9\r\n ]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1\r\n }[0-9]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1\r\n -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\\-]+\\.)*[a-zA-Z0-\r\n 9\\-]+\\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a\r\n ero|coop|museum|[a-zA-Z]{2}))(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\\r\n ?\\'\\\\\\+&%\\$#\\=~_\\-]+))*$\" - [x-y] range in reverse order.
So, does anyone have a solution that gets all URIs? That means every absolute style (http(s)://(www.)yahoo.co.uk/dir/page.php?var=none&var2=LOL, http://www.hey-you.com/mister/james.php?page=pics, etc.) as well as relative URIs (e.g. ./dir/page.php?lol=no), so what we really need is the CONTENT part of href="CONTENT".

I've tried, but I have never succeeded in writing my own regexp :s

Thanks a lot!
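A note on the exception quoted above: the message shows `\r\n` sequences inside the pattern, so the verbatim string apparently spans several source lines, and those line breaks become part of the regex. One of them lands inside a character class, `[a-zA-Z0-\r\n 9\-]`, which turns `0-9` into a range from '0' down to a carriage return. That is the most likely source of "[x-y] range in reverse order". The sketch below reproduces the effect in isolation; the class and method names are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReversedRangeDemo
{
    // Hypothetical helper: returns true when the Regex constructor
    // rejects the pattern with an ArgumentException.
    public static bool FailsToParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

Calling `ReversedRangeDemo.FailsToParse("[a-zA-Z0-\r\n 9\\-]")` should return true, while the same class without the line break, `"[a-zA-Z0-9\\-]"`, should return false. Keeping the pattern on one source line, or building it from pieces joined with `+`, avoids the stray line breaks.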
Arch4ngel
Posted September 19, 2004

Once you have the whole HTML file in a string variable, you can run the following regular expression over it:

    Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");

I used a specialized function in my program that only looked for JPEG and GIF files, so here is that function. You only have to change the specialized part (the extension test near the end of the loop):

    using System;
    using System.Collections;
    using System.IO;
    using System.Text.RegularExpressions;

    private string[] GetImgListFromHtml(string html)
    {
        // Match every <a ...> tag that carries a quoted href attribute.
        // IgnoreCase matches HREF as well as href without lowercasing the
        // HTML, so the returned URLs keep their original case.
        Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>", RegexOptions.IgnoreCase);
        ArrayList found = new ArrayList();

        foreach (Match m in reg.Matches(html))
        {
            string elem = m.Value;

            // Locate the attribute and note which quote character delimits it.
            char quote = '\'';
            int hrefStart = elem.IndexOf("href='", StringComparison.OrdinalIgnoreCase);
            if (hrefStart == -1)
            {
                quote = '"';
                hrefStart = elem.IndexOf("href=\"", StringComparison.OrdinalIgnoreCase);
            }
            if (hrefStart == -1) continue;

            // "href=" plus the opening quote is 6 characters long, so the
            // value starts at hrefStart + 6 for both quote styles.
            int valueStart = hrefStart + 6;
            int valueEnd = elem.IndexOf(quote, valueStart);
            if (valueEnd == -1) continue;
            string url = elem.Substring(valueStart, valueEnd - valueStart);

            // Specialized part: keep only JPEG and GIF links.
            string ext = Path.GetExtension(url).ToLower();
            if (ext == ".jpg" || ext == ".gif")
                found.Add(url);
        }

        return (string[])found.ToArray(typeof(string));
    }

N.B.: Sorry if the code isn't perfect. I was only trying to make it work, and it does work without any problem.

Let me know how it goes.
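For the original question (every href value, relative or not), the extension filter can be dropped and the value read straight from a capture group. What follows is a minimal sketch of that idea, not the code from the post above: `HrefExtractor`, `ExtractLinks` and the `baseUri` parameter are hypothetical names, and it assumes you know the address the page was loaded from so that relative links can be resolved with `Uri.TryCreate`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // href, optional whitespace around '=', then a quoted value.
    // Group 1 captures the opening quote and \1 requires the same
    // character to close the value; the value itself is group "url".
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*([\"'])(?<url>.*?)\\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every href value found in the HTML, resolved against
    // baseUri when it is relative. Values that cannot be turned into
    // a URI are skipped.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        List<string> links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, m.Groups["url"].Value, out resolved))
                links.Add(resolved.AbsoluteUri);
        }
        return links;
    }
}
```

With a made-up base address such as `new Uri("http://example.com/dir/index.html")`, a link like `href="/james/photos.htm"` would come back as an absolute URI on example.com, while an already absolute link is returned essentially unchanged. Unquoted attribute values are not handled here; for messy real-world pages an HTML parser is probably a safer choice than a regex.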