Xtreme .Net Talk

Posted (edited)

I am trying to get data between HTML tags, but I am doing it wrong, as it is returning nearly all the source. :(


I modified the MSDN example to try and do it, but no success...


       Dim r As System.Text.RegularExpressions.Regex
       Dim m As System.Text.RegularExpressions.Match

       r = New System.Text.RegularExpressions.Regex("<td.*>(.*)</td>", _
            System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

       m = r.Match(inputString)

       While m.Success
           m = m.NextMatch()
       End While


I am trying to get what is in between <td (attributes here)> and </td>... what am I doing wrong?

Edited by darknuke

This is only a test of the emergency broadcast system

This is a product of hysterical mass confusion

A ship of fools adrift on the sea of our pollution

Rudderless and powerless on the sea of our delusion

pennywise - this is only a test

Where can I find that...


       Dim r As System.Text.RegularExpressions.Regex
       Dim m As System.Text.RegularExpressions.Match

       r = New System.Text.RegularExpressions.Regex("<td[^>]+>([^<]+)</td>", _
            System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

       m = r.Match(inputString)

       While m.Success
           m = m.NextMatch()
       End While


^^ try that ^^


Hope this helps!
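
For anyone comparing the two patterns: the loop in both snippets only walks the matches and never reads the captured text. Below is a minimal sketch that collects `m.Groups(1).Value` for each match and runs both patterns over the same input. The module name, the `CaptureAll` helper, and the sample row are made up for illustration and are not from the eBay page.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module TdCaptureDemo

    ' Returns the text captured by group 1 for every match of pattern in input.
    Function CaptureAll(ByVal pattern As String, ByVal input As String) As ArrayList
        Dim results As New ArrayList
        Dim m As Match = Regex.Match(input, pattern, RegexOptions.IgnoreCase)
        While m.Success
            results.Add(m.Groups(1).Value)
            m = m.NextMatch()
        End While
        Return results
    End Function

    Sub Main()
        ' Hypothetical sample row with two cells on one line.
        Dim html As String = _
            "<tr><td class=""a"">one</td><td class=""b"">two</td></tr>"

        ' Greedy ".*" runs to the end of the line and backtracks, so a single
        ' match swallows both cells and group 1 holds only the last one.
        For Each s As String In CaptureAll("<td.*>(.*)</td>", html)
            Console.WriteLine("greedy:  [" & s & "]")
        Next

        ' "[^>]+" cannot run past the end of the opening tag and "[^<]+"
        ' cannot run into the next tag, so each cell is matched on its own.
        For Each s As String In CaptureAll("<td[^>]+>([^<]+)</td>", html)
            Console.WriteLine("negated: [" & s & "]")
        Next
    End Sub

End Module
```

Note that `[^>]+` needs at least one character after `<td`, so a bare `<td>` with no attributes is skipped; `[^>]*` would accept both forms.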



Posted (edited)
One problem: there are HTML tags between the <td> tags :( ... How can I include anything that appears between the tags?

Edited by darknuke

Posted (edited)

Trying to get eBay listings:


(I have looked at the eBay developer SDK and for example source already)


Stuff like (full source is huuuge):


<tr bgcolor="#eeeeee">
<td valign="center" align="middle" width="12%" rowspan="">
  <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
  <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a></td>
<td valign="top">
 <font size="3">
 <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
 <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
 <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif"></td>


The regex should return:


 <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
  <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a>




 <font size="3">
 <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
 <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
 <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif">

Edited by darknuke
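
One possible way to get that result with a regex alone is a lazy quantifier plus `RegexOptions.Singleline`, so that `.` also matches line breaks and the capture stops at the first `</td>`. This is only a sketch: the module name, the `ExtractCells` helper, and the sample markup are invented for illustration, and the pattern gives wrong results as soon as a cell contains a nested table with its own `</td>`.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module TdInnerHtmlDemo

    ' Lazy "(.*?)" stops at the first </td>; Singleline lets "." cross
    ' line breaks; "\b" keeps the pattern from matching tags such as <tdx>.
    Private ReadOnly CellPattern As New Regex("<td\b[^>]*>(.*?)</td>", _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Returns the markup found between each <td ...> and its closing </td>.
    Function ExtractCells(ByVal html As String) As ArrayList
        Dim cells As New ArrayList
        For Each m As Match In CellPattern.Matches(html)
            cells.Add(m.Groups(1).Value)
        Next
        Return cells
    End Function

    Sub Main()
        ' Hypothetical markup shaped like the listing rows above.
        Dim nl As String = Environment.NewLine
        Dim html As String = _
            "<tr bgcolor=""#eeeeee"">" & nl & _
            "<td valign=""top"">" & nl & _
            " <a href=""item1.html""> First item </a>" & nl & _
            " <img src=""a.gif""></td>" & nl & _
            "<td><b>Second</b> item</td>" & nl & _
            "</tr>"

        For Each cell As String In ExtractCells(html)
            Console.WriteLine("---")
            Console.WriteLine(cell)
        Next
    End Sub

End Module
```

Each captured string keeps the inner tags and whitespace exactly as they appear in the source, so call `Trim()` on the values if the leading line break gets in the way.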

Posted (edited)

Do you know C#?


What I would do is use mshtml to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a webpage, removes the scripting, and creates an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.


The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, and you use the C# compiler to build the DLL and reference it from your project. Or you can try to convert the C# code to VB.NET.

Edited by HJB417

I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package.

I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package.


Sure you do, Project -> Add Reference -> COM -> Microsoft HTML Object Library.

How can I put a string (HTML) into the HTMLDocumentClass class...

How can I put a string (HTML) into the HTMLDocumentClass class...


You need to save the contents of the string to disk first.

C# code

HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
pf.Load(filename, 0);
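// Wait until MSHTML has finished loading. The loop bodies were missing from
// the post; pumping messages with DoEvents is an assumption.
while(htmlDoc.body == null)
	System.Windows.Forms.Application.DoEvents();
while(htmlDoc.readyState != "complete")
	System.Windows.Forms.Application.DoEvents();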

Posted (edited)
I don't have C#, I got VB.NET 2003 in a stand-alone package.




*screams wildly* I don't know what I'm doing wrong :(


       Dim htmlDoc As New mshtml.HTMLDocument
       Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
       pf.Load("c:\eBay.html", 0)
       htmlDoc = pf


I get an error on the pf.Load line... (unhandled exception; object reference not set to an instance of an object)

Edited by darknuke

Would someone please show me an example use of MSHTML that is in VB.NET (that does not require me to use a browser control, if possible)?


change this:

       Dim htmlDoc As New mshtml.HTMLDocument
       Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
       pf.Load("c:\eBay.html", 0)
       htmlDoc = pf


..to this!:


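       Dim htmlDoc As New mshtml.HTMLDocument
       Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
       ' pf was never assigned before; point it at the document object first.
       pf = CType(htmlDoc, System.Runtime.InteropServices.UCOMIPersistFile)
       pf.Load("c:\eBay.html", 0)
       ' htmlDoc and pf refer to the same COM object, so nothing is assigned back.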


Hope this helps!
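
For the earlier request for a VB.NET example without a browser control, here is a rough sketch. It swaps in a different technique from the one in this thread: instead of saving to disk and calling `UCOMIPersistFile.Load`, it feeds the string to the document through `IHTMLDocument2.write`, then walks the `<td>` elements with `getElementsByTagName`. It assumes the COM reference "Microsoft HTML Object Library" has been added and that MSHTML has finished parsing by the time `close()` returns, which seems to hold for plain static markup but is not something I have checked against the eBay page. The module and function names are made up.

```vbnet
Imports System
Imports System.Collections

Module MsHtmlCellDemo

    ' Parses an HTML string with MSHTML and returns the innerHTML of every
    ' <td>. Needs the COM reference "Microsoft HTML Object Library" (mshtml).
    Function InnerHtmlOfCells(ByVal html As String) As ArrayList
        Dim doc As New mshtml.HTMLDocument
        Dim doc2 As mshtml.IHTMLDocument2 = CType(doc, mshtml.IHTMLDocument2)
        doc2.write(html)   ' hand the markup to the parser
        doc2.close()       ' signal the end of the document

        Dim doc3 As mshtml.IHTMLDocument3 = CType(doc, mshtml.IHTMLDocument3)
        Dim results As New ArrayList
        For Each cell As mshtml.IHTMLElement In doc3.getElementsByTagName("td")
            results.Add(cell.innerHTML)
        Next
        Return results
    End Function

    ' MSHTML is an apartment-threaded COM component, hence STAThread.
    <STAThread()> _
    Sub Main()
        ' Hypothetical markup, not the real listing page.
        Dim html As String = _
            "<html><body><table><tr>" & _
            "<td><a href=""item1.html"">First</a></td>" & _
            "<td><b>Second</b></td>" & _
            "</tr></table></body></html>"

        For Each cellHtml As String In InnerHtmlOfCells(html)
            Console.WriteLine(cellHtml)
        Next
    End Sub

End Module
```

MSHTML returns its own normalized markup from `innerHTML`, so tag case and attribute quoting may differ from the source text. Because the document is really parsed, nested tables are handled, which the regex approach cannot do reliably.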



  • 1 year later...

Remove scripting code




If you are still offering, I would be grateful to receive the c# code you mentioned below.


Many thanks.


Do you know C#?


What I would do is use mshtml to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a webpage, removes the scripting, and creates an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.


The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, and you use the C# compiler to build the DLL and reference it from your project. Or you can try to convert the C# code to VB.NET.

Posted (edited)

You will need to add references to mshtml and System.Windows.Forms.


using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using mshtml;

// Note: the braces, try/finally blocks and a few statements were lost when this
// post was extracted. They are restored below; every guessed line is marked
// "assumed".
namespace HB.Net
{
/// <summary>
/// Creates a managed wrapper for a <see cref="mshtml.HTMLDocument"/> object.
/// </summary>
public class HtmlDocument : IDisposable
{
	private bool _deleteWhenDone;

	/// <summary>
	/// The underlying <see cref="mshtml.HTMLDocument"/>.
	/// </summary>
	public readonly HTMLDocument MsHtmlDoc;
	/// <summary>
	/// The file path of the downloaded html document.
	/// </summary>
	public readonly string LocalPath;

	private bool _disposed;

	/// <summary>
	/// The content of the webpage.
	/// </summary>
	public readonly string AsciiData;
	private static readonly Regex ScriptParser;
	private static readonly Regex FileExtRemover;

	static HtmlDocument()
	{
		string[] tags = new string[] {"script", /*"style", */"object", "head", "map", "iframe", "javascript"};
		// Only the tag name is captured, so </\1> matches the closing tag
		// (the posted pattern captured "script>" and never matched "</script>").
		// [^>]* also allows attributes, e.g. <script type="...">.
		string scriptParserPattern = @"<(" + string.Join("|", tags) + @")\b[^>]*>.*?</\1\s*>";
		ScriptParser = new Regex(scriptParserPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
		FileExtRemover = new Regex(@"\.\w+$", RegexOptions.Compiled);
	}

	/// <summary>
	/// Creates a <see cref="HtmlDocument"/> from the binary data of a webpage.
	/// </summary>
	/// <param name="data">The binary data of the webpage</param>
	/// <param name="removeScripting">true to remove javascript.</param>
	public HtmlDocument(byte[] data, bool removeScripting)
		: this(CreateFile(data), true, removeScripting)
	{
	}

	public HtmlDocument(string html, bool removeScripting)
		: this(html, removeScripting, Encoding.ASCII)
	{
	}

	public HtmlDocument(string html, bool removeScripting, Encoding encoding)
		: this(encoding.GetBytes(html), removeScripting)
	{
	}

	/// <summary>
	/// Creates a <see cref="HtmlDocument"/> from a webpage file.
	/// </summary>
	/// <param name="filename">The file path of the webpage.</param>
	/// <param name="deleteFile">true to delete the file on dispose.</param>
	/// <param name="removeScripting">set to true to remove script tags.</param>
	public HtmlDocument(string filename, bool deleteFile, bool removeScripting)
	{
		_disposed = false;
		// The post assigned true here, which ignored the deleteFile parameter.
		_deleteWhenDone = deleteFile;
		LocalPath = filename;
		// Assumed: lines were lost around this point; the Preparse call is
		// inferred from the removeScripting parameter.
		if(removeScripting)
			Preparse(filename);
		MsHtmlDoc = CreateHTMLDocument(out AsciiData);
	}

	/// <summary>
	/// Deletes the webpage and closes the underlying <see cref="mshtml.HTMLDocument"/> object.
	/// </summary>
	public void Dispose()
	{
		if(_disposed)
			return;
		_disposed = true;
		// Assumed: the rest of the body was lost; it is reconstructed from
		// the summary comment above.
		if(MsHtmlDoc != null)
			MsHtmlDoc.close();
		if(_deleteWhenDone && File.Exists(LocalPath))
			File.Delete(LocalPath);
	}

	/// <summary>
	/// Creates a HTMLDocument.
	/// </summary>
	private HTMLDocumentClass CreateHTMLDocument(out string asciiData)
	{
		byte[] _htmlData;
		FileStream file = File.OpenRead(LocalPath);
		try
		{
			_htmlData = new byte[file.Length];
			for(int read = 0; read < file.Length;)
				read+=file.Read(_htmlData, read, (int)(file.Length - read));
		}
		finally
		{
			file.Close();
		}
		HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
		try
		{
			System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
			pf.Load(LocalPath, 0);
			// Assumed: the loop bodies were lost; pumping messages is a guess
			// based on the System.Windows.Forms reference this class needs.
			while(htmlDoc.body == null)
				System.Windows.Forms.Application.DoEvents();
			while(htmlDoc.readyState != "complete")
				System.Windows.Forms.Application.DoEvents();
			asciiData = Encoding.ASCII.GetString(_htmlData);
		}
		catch(Exception e)
		{
			throw new ApplicationException("An error occurred while creating a mshtml.HTMLDocumentClass object.", e);
		}
		return htmlDoc;
	}

	/// <summary>
	/// Removes scripting from a html file.
	/// </summary>
	/// <param name="filename">The path of the file.</param>
	public void Preparse(string filename)
	{
		//read in txt file
		TextReader file = File.OpenText(filename);
		string text = null;
		try
		{
			text = file.ReadToEnd();
			text = ScriptParser.Replace(text, "");
		}
		finally
		{
			file.Close();
		}
		//write the stripped text back over the same file
		TextWriter output = File.CreateText(filename);
		try
		{
			output.Write(text);
		}
		finally
		{
			output.Close();
		}
	}

	private static string CreateTempHtmlFile()
	{
		string filename = Path.GetTempFileName();
		string htmlFileName = FileExtRemover.Replace(filename, ".html");
		File.Move(filename, htmlFileName);
		return htmlFileName;
	}

	/// <summary>
	/// Creates a html file from an array of bytes.
	/// </summary>
	/// <param name="data">The array of bytes to create the data from.</param>
	private static string CreateFile(byte[] data)
	{
		string filename = CreateTempHtmlFile();
		FileStream file = File.OpenWrite(filename);
		try
		{
			file.Write(data, 0, data.Length);
			return filename;
		}
		finally
		{
			file.Close();
		}
	}

	/// <summary>
	/// Returns the content of the html document.
	/// </summary>
	/// <returns>The content of the html document.</returns>
	public override string ToString()
	{
		return AsciiData;
	}
}
}

edit: cleaned up the code.

Edited by HJB417
