Xtreme .Net Talk


Posted

Hello to all,

 

I'm working in Visual C# 2005 Express (Whidbey) and I want to scan a plain-text file (mostly .html and other source pages) and extract every URI/URL in it, whether relative or absolute.

Example:

 

<a href="/james/photos.htm">James's Pics</a>

<a href="http://www.jamespics.com">...</a>

 

I want to get every URI on the page! But I only have this pattern for matching a URI, and it doesn't work; it throws an exception:

            string pattern = @"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]
           +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1
           }|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9
           ]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1
           }[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1
           -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-
           9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a
           ero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\
           ?\'\\\+&%\$#\=~_\-]+))*$";

           Regex regexp = new Regex(pattern);
           MatchCollection mc = regexp.Matches("http://www.yahoo.co.uk/nostream.php?acting=lolz", 0);

           foreach(Match match in mc)
           {
               MessageBox.Show(match.Value);
           }

And the exception:

parsing \"^(http|https|ftp)\\://([a-zA-Z0-9\\.\\-]+(\\:[a-zA-Z0-9\\.&%\\$\\-]\r\n            +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1\r\n            }|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9\r\n            ]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1\r\n            }[0-9]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1\r\n            -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\\-]+\\.)*[a-zA-Z0-\r\n            9\\-]+\\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a\r\n            ero|coop|museum|[a-zA-Z]{2}))(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\\r\n            ?\\'\\\\\\+&%\\$#\\=~_\\-]+))*$\" - [x-y] range in reverse order.

 

So, does anyone have a solution that gets all URIs in every style (http(s)://(www.)yahoo.co.uk/dir/page.php?var=none&var2=LOL, http://www.hey-you.com/mister/james.php?page=pics, etc.) as well as relative URIs (e.g. ./dir/page.php?lol=no)? In other words, we need to get the CONTENT part of each href="CONTENT".

 

I've tried, but I never succeeded in writing my own regexp :s

 

Thanks a lot!
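
A note on the exception above: it is most likely caused by the line breaks inside the verbatim (@"...") string. A verbatim string keeps every line break and indent as part of the pattern, so the character class that was meant to read [a-zA-Z0-9\-] becomes [a-zA-Z0- followed by a line break, and the range from '0' to the line-break character runs backwards. Here is a minimal sketch that reproduces the error with a shortened pattern and avoids it by joining the pieces with +. The class and member names (ReverseRangeDemo, Broken, Joined, Compiles) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReverseRangeDemo
{
    // The line break is part of this verbatim string, so the character class
    // contains a range from '0' to the line-break character (reversed order).
    public const string Broken = @"[a-zA-Z0-
            9\-]+";

    // Same class, but the two pieces are concatenated, so no line break
    // ends up inside the pattern.
    public const string Joined = @"[a-zA-Z0-" +
                                 @"9\-]+";

    // Returns true if the Regex constructor accepts the pattern,
    // false if it rejects it with a parsing error.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Calling ReverseRangeDemo.Compiles(ReverseRangeDemo.Broken) should give false and Compiles(Joined) true. RegexOptions.IgnorePatternWhitespace would not help here, because whitespace inside a character class is always taken literally. Also note that the ^ and $ anchors make the original pattern match only when the whole input is a single URL, so it cannot pick URLs out of a larger page.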

Posted

Once you have your whole HTML file in a string variable, you could run it through the following regular expression:

Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");
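
If you want the href content directly, relative or absolute, a capturing group can save the string searching afterwards. This is only a sketch under my own assumptions: the pattern is a variant of the one above, not the same expression, and the names HrefScanner, GetHrefs and the group name "url" are invented for the example. It only handles quoted href values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefScanner
{
    // Variant pattern (an assumption, not the expression from the post):
    // the named group "url" captures what sits between the quotes of href.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the href value of every <a> tag found in the HTML text,
    // in document order, without changing the case of the URLs.
    public static List<string> GetHrefs(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

With the two example links from the first post, GetHrefs should return /james/photos.htm and http://www.jamespics.com. A regular expression is still a rough tool for HTML, so unquoted attributes or unusual markup will be missed.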

 

In my program I used a specialized function that only looked for JPEG and GIF files, so here is that function; you only have to change the specialized part:

 

// Needs: using System; using System.Collections.Generic; using System.IO; using System.Text.RegularExpressions;
private string[] GetImgListFromHtml(string html)
     {
       // IgnoreCase replaces html.ToLower(), which would also lower-case the returned URLs.
       Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>", RegexOptions.IgnoreCase);
       List<string> ret = new List<string>();
       foreach( Match m in reg.Matches(html) )
       {
         string elem = m.Value;
         // Look for href='...' first, then href="...", and remember the quote character used.
         char quote = '\'';
         int ihrefB = elem.IndexOf("href='", StringComparison.OrdinalIgnoreCase);
         if( ihrefB == -1 )
         {
           ihrefB = elem.IndexOf("href=\"", StringComparison.OrdinalIgnoreCase);
           quote = '"';
         }
         // The value starts after the 6 characters of href=' or href=" and ends at the matching quote.
         int ihrefE = elem.IndexOf(quote, ihrefB + 6);
         if( ihrefE == -1 )
           continue; // mismatched quotes: skip this tag
         elem = elem.Substring(ihrefB + 6, ihrefE - ihrefB - 6);
         // Specialized part: keep only JPEG and GIF links. Change this test to keep other links.
         string ext = Path.GetExtension(elem);
         if( string.Equals(ext, ".jpg", StringComparison.OrdinalIgnoreCase) ||
             string.Equals(ext, ".gif", StringComparison.OrdinalIgnoreCase) )
           ret.Add(elem);
       }
       return ret.ToArray();
     }
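
To make "changing the specialized part" concrete, here is one possible way to pull the extension test out into its own helper, so the same link list can be filtered for any set of extensions. It is a sketch, not part of my program: LinkFilter, GetUrlExtension and FilterByExtension are hypothetical names, and the helper cuts off the query string and fragment by hand instead of relying on Path.GetExtension.

```csharp
using System;
using System.Collections.Generic;

static class LinkFilter
{
    // Returns the lower-cased extension of the URL's path part (for example ".php"),
    // ignoring any ?query or #fragment, or "" when the last segment has no dot.
    // Caveat: for a host-only URL such as http://www.jamespics.com this yields ".com".
    public static string GetUrlExtension(string url)
    {
        int cut = url.IndexOfAny(new char[] { '?', '#' });
        string path = cut >= 0 ? url.Substring(0, cut) : url;
        int slash = path.LastIndexOf('/');
        int dot = path.LastIndexOf('.');
        return dot > slash ? path.Substring(dot).ToLowerInvariant() : "";
    }

    // Keeps only the links whose extension is one of 'extensions' (e.g. ".jpg", ".gif").
    public static List<string> FilterByExtension(IEnumerable<string> links, params string[] extensions)
    {
        List<string> kept = new List<string>();
        foreach (string link in links)
        {
            string ext = GetUrlExtension(link);
            if (Array.IndexOf(extensions, ext) >= 0)
            {
                kept.Add(link);
            }
        }
        return kept;
    }
}
```

For example, FilterByExtension(links, ".jpg", ".gif") keeps the image links, while FilterByExtension(links, ".php", ".htm") keeps pages instead. The extensions passed in are expected in lower case.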

 

N.B.: Sorry if the code isn't perfect; I was only trying to get it working. It does work without any problem for me, though.

 

Let me know how it goes.
