Hello all,


I'm working with Visual C# Express 2005 (Whidbey) and I want to scan a plain-text file (mainly .html and other page sources) to extract every URI/URL in it, whether relative or absolute.

Example:


<a href="/james/photos.htm">James's Pics</a>

<a href="http://www.jamespics.com">...</a>


I want to get every URI on the page. However, I only have a pattern for matching absolute URIs, and it does not work; it throws an exception:

            string pattern = @"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]

           Regex regexp = new Regex(pattern);
           MatchCollection mc = regexp.Matches("http://www.yahoo.co.uk/nostream.php?acting=lolz", 0);

           foreach(Match match in mc)

And the exception:

parsing \"^(http|https|ftp)\\://([a-zA-Z0-9\\.\\-]+(\\:[a-zA-Z0-9\\.&%\\$\\-]\r\n            +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1\r\n            }|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9\r\n            ]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1\r\n            }[0-9]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1\r\n            -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\\-]+\\.)*[a-zA-Z0-\r\n            9\\-]+\\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a\r\n            ero|coop|museum|[a-zA-Z]{2}))(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\\r\n            ?\\'\\\\\\+&%\\$#\\=~_\\-]+))*$\" - [x-y] range in reverse order.


So, does anyone have a solution to get all URIs, in every style (for example http(s)://(www.)yahoo.co.uk/dir/page.php?var=none&var2=LOL or http://www.hey-you.com/mister/james.php?page=pics), as well as relative URIs (for example ./dir/page.php?lol=no)? In other words, I need the CONTENT of every href="CONTENT".


I've tried, but I have never succeeded in writing my own regexp :s


Thanks a lot!


Once you have the whole HTML file in a string variable, you can run the following regular expression against it:

[b]Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");[/b]
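If you want the href values themselves, relative or absolute, one option is to put a capture group around the value and let System.Uri resolve relative links against the address of the page. This is only a minimal sketch: the class name HrefExtractor, the method GetLinks, the slightly stricter pattern and the example.com base address are all assumptions made for illustration, not something taken from my program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HrefExtractor
{
    // Group 1 remembers the opening quote; the named group "url" captures
    // everything up to the matching closing quote (\1).
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(?<url>.*?)\\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every href found in html, resolved against baseUri.
    public static List<Uri> GetLinks(string html, Uri baseUri)
    {
        List<Uri> links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri resolved;
            // TryCreate handles absolute and relative values alike and
            // returns false instead of throwing on malformed input.
            if (Uri.TryCreate(baseUri, m.Groups["url"].Value, out resolved))
                links.Add(resolved);
        }
        return links;
    }

    public static void Main()
    {
        string html =
            "<a href=\"/james/photos.htm\">James's Pics</a>" +
            "<a href='http://www.jamespics.com'>...</a>";
        // Assumed address of the page being scanned.
        Uri page = new Uri("http://www.example.com/dir/index.htm");

        foreach (Uri link in GetLinks(html, page))
            Console.WriteLine(link.AbsoluteUri);
    }
}
```

Unlike the function further down, this version does not lower-case the HTML, so case-sensitive paths keep their original spelling.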


In my program I used a specialized function that only looks for JPEG and GIF files. Here it is; you only have to change the specialized part (in bold):


using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

private string[] GetImgListFromHtml(string html)
{
    // Match every <a ...> tag that carries an href attribute.
    Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");
    // Note: ToLower() also lower-cases the URLs that are returned.
    MatchCollection ms = reg.Matches(html.ToLower());
    ArrayList arr = new ArrayList();
    for (int i = 0; i < ms.Count; i++)
    {
        string elem = ms[i].Value;
        // Find the attribute and remember which quote character delimits it.
        char quote = '\'';
        int ihrefB = elem.IndexOf("href='", 1);
        if (ihrefB == -1)
        {
            ihrefB = elem.IndexOf("href=\"", 1);
            quote = '"';
        }
        // "href=" plus the opening quote is 6 characters long.
        int ihrefE = elem.IndexOf(quote, ihrefB + 6);
        if (ihrefE == -1)
            continue; // opening and closing quotes differ: skip this tag
        elem = elem.Substring(ihrefB + 6, ihrefE - ihrefB - 6);
        string ext = Path.GetExtension(elem);
[b]        if (ext == ".jpg" || ext == ".gif")
            arr.Add(elem);[/b]
    }
    // Only the links that passed the test above are returned.
    return (string[])arr.ToArray(typeof(string));
}


N.B.: Sorry if the code is not perfect. I was only trying to make it work, and it has worked for me without any problem.
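
As for the exception in the question: judging from the message, the pattern still contains the line breaks and indentation of the source file (the \r\n followed by spaces), probably because the verbatim @"..." string was wrapped over several lines. One of those breaks falls inside the character class [a-zA-Z0-9\-], right after "0-", so the engine reads a range from '0' to carriage return. Carriage return sorts before '0', which would explain the "[x-y] range in reverse order" message. The sketch below reproduces this with a shortened character class instead of the full pattern; the class name ReverseRangeDemo and the helper Compiles are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class ReverseRangeDemo
{
    // Returns true when the pattern compiles, false when Regex rejects it.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // What the engine receives when the string is wrapped after "0-":
        // the class now contains the range '0' to '\r', which is reversed.
        string wrapped = "[a-zA-Z0-\r\n            9\\-]+";

        // The same class written on one line.
        string joined = "[a-zA-Z0-9\\-]+";

        Console.WriteLine("wrapped compiles: " + Compiles(wrapped));
        Console.WriteLine("joined compiles:  " + Compiles(joined));
    }
}
```

If the long pattern has to be split for readability, concatenating several string literals with + keeps the line breaks out of the pattern itself.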


Let me know how it goes.



