Regex Question

darknuke · Feb 17, 2004

I am trying to get data between HTML tags, but I am doing it wrong, as it is returning nearly all the source.

I modified the MSDN example to try and do it, but no success...

Visual Basic:

        Dim r As System.Text.RegularExpressions.Regex
        Dim m As System.Text.RegularExpressions.Match

        r = New System.Text.RegularExpressions.Regex("<td.*>(.*)</td>", _
             System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

        m = r.Match(inputString)

        While m.Success
            MsgBox(m.Groups(1).Value.ToString)
            m = m.NextMatch()
        End While

I am trying to get what is in between <td (attributes here)> and </td>... what am I doing wrong?

HJB417 · Feb 18, 2004

Would you consider using MSHTML as an alternative?

darknuke · Feb 18, 2004

Where can I find that?

Hamburger1984 · Feb 18, 2004

Visual Basic:

        Dim r As System.Text.RegularExpressions.Regex
        Dim m As System.Text.RegularExpressions.Match

        r = New System.Text.RegularExpressions.Regex("<td[^>]*>([^<]+)</td>", _
             System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

        m = r.Match(inputString)

        While m.Success
            MsgBox(m.Groups(1).Value.ToString)
            m = m.NextMatch()
        End While

^^ try that ^^

Hope this helps!

Andreas
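
A note on why the first pattern over-matches: `.*` is greedy, so `<td.*>` runs on to the last `>` that still lets the rest of the pattern succeed, and a single match can swallow several cells. The negated classes `[^>]` and `[^<]` cannot cross a tag boundary. A minimal sketch comparing the two, assuming a made-up one-line input; the module and function names are hypothetical:

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Returns the full text of every match, so over-matching is easy to see.
    Public Function MatchedText(ByVal pattern As String, ByVal input As String) As String()
        Dim found As New ArrayList
        Dim m As Match = Regex.Match(input, pattern, RegexOptions.IgnoreCase)
        While m.Success
            found.Add(m.Value)
            m = m.NextMatch()
        End While
        Return DirectCast(found.ToArray(GetType(String)), String())
    End Function

    Sub Main()
        ' Hypothetical input: two plain-text cells on one line.
        Dim html As String = "<td class=""a"">one</td><td class=""b"">two</td>"

        ' Greedy pattern from the original question.
        For Each s As String In MatchedText("<td.*>(.*)</td>", html)
            Console.WriteLine("greedy:  " & s)
        Next

        ' Negated character classes stop at the next tag boundary.
        For Each s As String In MatchedText("<td[^>]+>([^<]+)</td>", html)
            Console.WriteLine("negated: " & s)
        Next
    End Sub
End Module
```

On this input the greedy pattern produces one match spanning both cells, while the negated-class pattern produces one match per cell.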

darknuke · Feb 18, 2004

One problem: there are HTML tags between the <td> tags.

How can I include anything that appears between the tags?
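
One regex-only way to capture markup inside a cell is a lazy quantifier with `RegexOptions.Singleline`, so that `.` also matches line breaks and the capture stops at the nearest `</td>`. This is a sketch under assumptions: the sample HTML, module name, and function name are made up, and the approach breaks on nested tables because the first inner `</td>` ends the match.

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module CellContents
    ' "<td\b[^>]*>" matches the opening tag with any attributes;
    ' "(.*?)" lazily captures everything up to the nearest closing tag.
    Private ReadOnly CellPattern As New Regex( _
        "<td\b[^>]*>(.*?)</td\s*>", _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Returns the trimmed inner HTML of every <td> cell found in html.
    Public Function InnerHtmlOfCells(ByVal html As String) As String()
        Dim cells As New ArrayList
        Dim m As Match = CellPattern.Match(html)
        While m.Success
            cells.Add(m.Groups(1).Value.Trim())
            m = m.NextMatch()
        End While
        Return DirectCast(cells.ToArray(GetType(String)), String())
    End Function

    Sub Main()
        ' Hypothetical row with tags inside the cells and line breaks between them.
        Dim html As String = _
            "<tr>" & vbCrLf & _
            " <td valign=""top"">" & vbCrLf & _
            "  <a href=""item1.html"">First item</a>" & vbCrLf & _
            " </td>" & vbCrLf & _
            " <td><b>Second</b> item</td>" & vbCrLf & _
            "</tr>"

        For Each cell As String In InnerHtmlOfCells(html)
            Console.WriteLine("[" & cell & "]")
        Next
    End Sub
End Module
```

For pages where cells contain further tables, a DOM parser is the more robust choice.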

HJB417 · Feb 18, 2004

Can you post the HTML code, and then tell us which portions of it you want?

darknuke · Feb 18, 2004

Trying to get eBay listings:

(I have already looked at the eBay developer SDK and at the example source.)

Stuff like (full source is huuuge):

Code:

<tr bgcolor="#eeeeee">
 <td valign="center" align="middle" width="12%" rowspan="">
   <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
   <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a></td>
 <td valign="top">
  <font size="3">
  <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
  </font>?
  <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
  <br>
  <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif"></td>

The regex should return:

Code:

 <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
   <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a>

and

Code:

  <font size="3">
  <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
  </font>?
  <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
  <br>
  <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif">

HJB417 · Feb 18, 2004

Do you know C#?

What I would do is use MSHTML to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that will download a webpage, remove the scripting, and create an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.

The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, which you compile into a DLL with the C# compiler and reference from your project. Or you can try converting the C# code to VB.NET.

darknuke · Feb 19, 2004

I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package.

HJB417 · Feb 19, 2004

darknuke said:
I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package.

Sure you do, Project -> Add Reference -> COM -> Microsoft HTML Object Library.

darknuke · Feb 19, 2004

How can I put a string (HTML) into the HTMLDocumentClass class?

HJB417 · Feb 19, 2004

darknuke said:
How can I put a string (HTML) into the HTMLDocumentClass class?

You need to save the contents of the string to disk first.
C# code:

Code:

// Cast the document to its IPersistFile COM interface so it can load from disk.
HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
pf.Load(filename, 0);
// Loading is asynchronous: pump messages until the document has finished parsing.
while(htmlDoc.body == null)
	System.Windows.Forms.Application.DoEvents();
while(htmlDoc.readyState != "complete")
	System.Windows.Forms.Application.DoEvents();

darknuke · Feb 19, 2004

darknuke said:
I don't have C#, I got VB.NET 2003 in a stand-alone package.

*screams wildly* I don't know what I'm doing wrong

Visual Basic:

        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

I get an error on the pf.Load line (unhandled exception: object reference not set to an instance of an object).

HJB417 · Feb 22, 2004

Can you do

Code:

Dim htmlDoc As New mshtml.HTMLDocumentClass

?

darknuke · Feb 22, 2004

Would someone please show me an example use of MSHTML that is in VB.NET (that does not require me to use a browser control, if possible)?

Hamburger1984 · Feb 23, 2004

change this:

Visual Basic:

        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

..to this!:

Visual Basic:

        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        ' Added line: pf must point at the document before Load is called.
        pf = CType(htmlDoc, System.Runtime.InteropServices.UCOMIPersistFile)
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

Hope this helps!

Andreas

fblanco · Mar 4, 2005

Remove scripting code

Hi,

If you are still offering, I would be grateful to receive the C# code you mentioned below.

Many thanks.

HJB417 said:
Do you know C#?

What I would do is use MSHTML to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that will download a webpage, remove the scripting, and create an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.

The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, which you compile into a DLL with the C# compiler and reference from your project. Or you can try converting the C# code to VB.NET.

HJB417 · Mar 4, 2005

You will need to add references to mshtml and System.Windows.Forms.

Code:

using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using mshtml;

namespace HB.Net
{
	/// <summary>
	/// Creates a managed wrapper for a <see cref="mshtml.HTMLDocument"/> object.
	/// </summary>
	public class HtmlDocument : IDisposable
	{

		private bool _deleteWhenDone;

		/// <summary>
		/// The underlying <see cref="mshtml.HTMLDocument"/>.
		/// </summary>
		public readonly HTMLDocument MsHtmlDoc;
		
		/// <summary>
		/// The file path of the downloaded html document.
		/// </summary>
		public readonly string LocalPath;

		private bool _disposed;

		/// <summary>
		/// The content of the webpage.
		/// </summary>
		public readonly string AsciiData;
		
		private static readonly Regex ScriptParser;
		private static readonly Regex FileExtRemover;
		
		static HtmlDocument()
		{
			string[] tags = new string[] {"script", /*"style", */"object", "head", "map", "iframe", "javascript"};
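			// Match an opening tag (with any attributes), its content, and the matching closing tag.
			string scriptParserPattern = @"<(" + string.Join("|", tags) + @")\b[^>]*>.*?</\1\s*>";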
			ScriptParser = new Regex(scriptParserPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
			FileExtRemover = new Regex(@"\.\w+$", RegexOptions.Compiled);
		}

		/// <summary>
		/// Creates a <see cref="HtmlDocument"/> from the binary data of a webpage.
		/// </summary>
		/// <param name="data">The binary data of the webpage</param>
		/// <param name="removeScripting">true to remove javascript.</param>
		public HtmlDocument(byte[] data, bool removeScripting)
			: this(CreateFile(data), true, removeScripting)
		{
		}

		public HtmlDocument(string html, bool removeScripting)
			: this(html, removeScripting, Encoding.ASCII)
		{
		}

		public HtmlDocument(string html, bool removeScripting, Encoding encoding)
			: this(encoding.GetBytes(html), removeScripting)
		{
		}

		/// <summary>
		/// Creates a <see cref="HtmlDocument"/> from a webpage file.
		/// </summary>
		/// <param name="filename">The file path of the webpage.</param>
		/// <param name="deleteFile">true to delete the file on dispose.</param>
		/// <param name="removeScripting">set to true to remove script tags.</param>
		public HtmlDocument(string filename, bool deleteFile, bool removeScripting)
		{
			_disposed = false;
			_deleteWhenDone = deleteFile;
			LocalPath = filename;
			try
			{
				if(removeScripting)
					Preparse(filename);
				MsHtmlDoc = CreateHTMLDocument(out AsciiData);
			}
			catch
			{
				try
				{
					File.Delete(filename);
				}
				catch{}
				throw;
			}
		}

		/// <summary>
		/// Deletes the webpage and closes the underlying <see cref="mshtml.HTMLDocument"/> object.
		/// </summary>
		public void Dispose()
		{
			if(_disposed)
				return;
			MsHtmlDoc.close();
			if(_deleteWhenDone)
			{
				try
				{
					File.Delete(LocalPath);
				}
				catch{}
			}
			_disposed = true;
			GC.SuppressFinalize(this);
		}

		~HtmlDocument()
		{
			try
			{
				Dispose();
			}
			catch{}
		}

		/// <summary>
		/// Creates a HTMLDocument.
		/// </summary>
		private HTMLDocumentClass CreateHTMLDocument(out string asciiData)
		{
			byte[] _htmlData;
			FileStream file = File.OpenRead(LocalPath);
			try
			{
				_htmlData = new byte[file.Length];
				for(int read = 0; read < file.Length;)
					read+=file.Read(_htmlData, read, (int)(file.Length - read));
			}
			finally
			{
				file.Close();
			}
			HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
			try
			{
				System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
				pf.Load(LocalPath, 0);
				while(htmlDoc.body == null)
					System.Windows.Forms.Application.DoEvents();
				while(htmlDoc.readyState != "complete")
					System.Windows.Forms.Application.DoEvents();
				asciiData = Encoding.ASCII.GetString(_htmlData);
			}
			catch(Exception e)
			{
				htmlDoc.close();
				throw new ApplicationException("An error occurred while creating a mshtml.HTMLDocumentClass object.", e);
			}
			return htmlDoc;
		}

		/// <summary>
		/// Removes scripting from an HTML file.
		/// </summary>
		/// <param name="filename">The path of the file.</param>
		public void Preparse(string filename)
		{
			//read in txt file
			TextReader file = File.OpenText(filename);
			string text = null;
			try
			{
				text = file.ReadToEnd();
				text = ScriptParser.Replace(text, "");
			}
			finally
			{
				file.Close();
			}
			TextWriter output = File.CreateText(filename);
			try
			{
				output.Write(text);
				output.Flush();
			}
			finally
			{
				output.Close();
			}
		}

		private static string CreateTempHtmlFile()
		{
			while(true)
			{
				string filename = Path.GetTempFileName();
				try
				{
					string htmlFileName = FileExtRemover.Replace(filename, ".html");
					File.Move(filename, htmlFileName);
					return htmlFileName;
				}
				catch
				{
					File.Delete(filename);
				}
			}
		}

		/// <summary>
		/// Creates a html file from an array of bytes.
		/// </summary>
		/// <param name="data">The array of bytes to create the data from.</param>
		private static string CreateFile(byte[] data)
		{
			string filename = CreateTempHtmlFile();
			FileStream file = File.OpenWrite(filename);
			try
			{
				file.Write(data, 0, data.Length);
				file.Flush();
				return filename;
			}
			catch
			{
				try
				{
					File.Delete(filename);
				}
				catch{}
				throw;
			}
			finally
			{
				file.Close();
			}
		}

		/// <summary>
		/// Returns the content of the html document.
		/// </summary>
		/// <returns>The content of the html document.</returns>
		[System.Diagnostics.DebuggerStepThrough]
		public override string ToString()
		{
			return AsciiData;
		}
	}
}

edit: cleaned up the code.
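
For VB.NET readers, the script-stripping step can be tried on its own, without MSHTML. The sketch below uses a pattern along the same lines as `ScriptParser`: an alternation of tag names, a backreference for the closing tag, and `Singleline` so the content may span lines. The module name, function name, and sample input are hypothetical, and the tag list is a trimmed assumption rather than the author's exact list.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ScriptStripper
    ' Elements whose whole content should be dropped (assumed list).
    Private ReadOnly Tags As String() = {"script", "object", "head", "map", "iframe"}

    ' "\1" refers back to whichever tag name the opening tag matched.
    Private ReadOnly BlockPattern As New Regex( _
        "<(" & String.Join("|", Tags) & ")\b[^>]*>.*?</\1\s*>", _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Returns html with every listed element, and its content, removed.
    Public Function StripBlocks(ByVal html As String) As String
        Return BlockPattern.Replace(html, "")
    End Function

    Sub Main()
        ' Hypothetical page with a head section and an inline script.
        Dim page As String = _
            "<html><head><title>t</title></head><body>" & _
            "<SCRIPT type=""text/javascript"">var a = 1;</SCRIPT>" & _
            "<p>kept</p></body></html>"

        Console.WriteLine(StripBlocks(page))
    End Sub
End Module
```

Like any regex over HTML, this is only approximate: a script whose text contains a literal closing tag would be cut short.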
