darknuke
Posted February 18, 2004 (edited)

I am trying to get the data between HTML tags, but I am doing it wrong, as it is returning nearly all of the source. :( I modified the MSDN example to try to do it, but no success...

    Dim r As System.Text.RegularExpressions.Regex
    Dim m As System.Text.RegularExpressions.Match
    r = New System.Text.RegularExpressions.Regex("<td.*>(.*)</td>", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)
    m = r.Match(inputString)
    While m.Success
        MsgBox(m.Groups(1).Value.ToString)
        m = m.NextMatch()
    End While

I am trying to get what is in between <td (attributes here)> and </td>... what am I doing wrong?

Edited February 18, 2004 by darknuke

This is only a test of the emergency broadcast system
This is a product of hysterical mass confusion
A ship of fools adrift on the sea of our pollution
Rudderless and powerless on the sea of our delusion
pennywise - this is only a test

HJB417
Posted February 18, 2004

Would you consider using MSHTML as an alternative?

darknuke
Posted February 18, 2004

Where can I find that?

Hamburger1984
Posted February 18, 2004

    Dim r As System.Text.RegularExpressions.Regex
    Dim m As System.Text.RegularExpressions.Match
    r = New System.Text.RegularExpressions.Regex("<td[^>]+>([^<]+)</td>", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)
    m = r.Match(inputString)
    While m.Success
        MsgBox(m.Groups(1).Value.ToString)
        m = m.NextMatch()
    End While

^^ Try that ^^

Hope this helps!
Andreas

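A note on why the first pattern returns so much: `.*` is greedy, so `<td.*>` runs on to the last `>` that still lets the rest of the pattern match, and everything in between is swallowed. The sketch below compares the two approaches on a made-up two-cell row (the sample string and module name are assumptions for illustration only). It uses `[^>]*` rather than `[^>]+` so that a bare `<td>` with no attributes also matches, which is a small variation on the pattern posted above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusNegatedClass
    Sub Main()
        ' Hypothetical two-cell row, made up for this illustration.
        Dim html As String = "<tr><td class=""a"">one</td><td class=""b"">two</td></tr>"

        ' Greedy: the .* in "<td.*>" runs as far right as it can, swallowing the
        ' first cell, so the single match spans both cells and group 1 is "two".
        Dim greedy As Match = Regex.Match(html, "<td.*>(.*)</td>", RegexOptions.IgnoreCase)
        Console.WriteLine("greedy: " & greedy.Groups(1).Value)

        ' Negated classes: [^>] cannot run past the end of the opening tag and
        ' [^<] stops at the next tag, so each cell is matched on its own
        ' ("one", then "two").
        Dim cells As MatchCollection = _
            Regex.Matches(html, "<td[^>]*>([^<]*)</td>", RegexOptions.IgnoreCase)
        For Each m As Match In cells
            Console.WriteLine("cell: " & m.Groups(1).Value)
        Next
    End Sub
End Module
```

The trade-off is that `[^<]*` refuses to cross any tag, so a cell that contains markup of its own will not match at all.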
darknuke
Posted February 19, 2004 (edited)

One problem: there are HTML tags between the <td> tags :( ... How can I include anything that appears between the tags?

Edited February 19, 2004 by darknuke

HJB417
Posted February 19, 2004

Can you post the HTML code and tell us which portions of it you want?

darknuke
Posted February 19, 2004 (edited)

Trying to get eBay listings (I have already looked at the eBay developer SDK and its example source). Stuff like this (the full source is huge):

    <tr bgcolor="#eeeeee">
    <td valign="center" align="middle" width="12%" rowspan="">
    <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
    <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a></td>
    <td valign="top">
    <font size="3">
    <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
    **NEW** Disney Toy Story 2 Activity Studio CD
    </a>
    </font>?
    <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
    <br>
    <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif"></td>

The regex should return:

    <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
    <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a>

and

    <font size="3">
    <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=3660198719&category=51347">
    **NEW** Disney Toy Story 2 Activity Studio CD
    </a>
    </font>?
    <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
    <br>
    <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif">

Edited February 19, 2004 by darknuke

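If staying with regular expressions, one possible way to keep the nested tags is a lazy capture, `(.*?)`, combined with `RegexOptions.Singleline` so that `.` also matches line breaks. This is only a sketch: the sample string is a shortened, made-up stand-in for the listing markup, and the approach breaks down as soon as a cell contains a nested table (the lazy match stops at the inner `</td>`).

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module LazyCellContents
    Sub Main()
        ' Shortened, hypothetical stand-in for the listing markup.
        Dim html As String = _
            "<tr><td width=""12%""><a href=""item1""><img src=""thumb.jpg""></a></td>" & vbCrLf & _
            "<td valign=""top""><font size=""3"">" & vbCrLf & _
            "<a href=""item1"">Toy Story 2 CD</a></font><br></td></tr>"

        ' \b keeps "<td" from matching other tag names that start with "td".
        ' (.*?) is lazy, so it stops at the first </td> it can reach.
        ' Singleline lets "." match line breaks inside a cell.
        Dim cellPattern As New Regex("<td\b[^>]*>(.*?)</td>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        For Each m As Match In cellPattern.Matches(html)
            Console.WriteLine("--- cell ---")
            Console.WriteLine(m.Groups(1).Value.Trim())
        Next
    End Sub
End Module
```

For markup as irregular as a full listing page, an HTML parser is usually the more robust choice.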
HJB417
Posted February 19, 2004 (edited)

Do you know C#? What I would do is use MSHTML to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a web page, removes the scripting, and creates an HTML document. From there you could call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.

The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, which you compile into a DLL with the C# compiler and reference from your project. Or you can try to convert the C# code to VB.NET.

Edited February 19, 2004 by HJB417

darknuke
Posted February 19, 2004

I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package.

HJB417
Posted February 19, 2004

darknuke wrote: "I don't have MSHTML as far as I know. I don't have C#, I got VB.NET 2003 in a stand-alone package."

Sure you do: Project -> Add Reference -> COM -> Microsoft HTML Object Library.

darknuke
Posted February 19, 2004

How can I put a string (HTML) into the HTMLDocumentClass class?

HJB417
Posted February 20, 2004

darknuke wrote: "How can I put a string (HTML) into the HTMLDocumentClass class?"

You need to save the contents of the string to disk. C# code:

    HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
    System.Runtime.InteropServices.UCOMIPersistFile pf =
        (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
    pf.Load(filename, 0);
    // Pump messages until the document has finished loading.
    while(htmlDoc.body == null)
        System.Windows.Forms.Application.DoEvents();
    while(htmlDoc.readyState != "complete")
        System.Windows.Forms.Application.DoEvents();

darknuke
Posted February 20, 2004 (edited)

"I don't have C#, I got VB.NET 2003 in a stand-alone package." :D

*screams wildly* I don't know what I'm doing wrong :(

    Dim htmlDoc As New mshtml.HTMLDocument
    Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
    pf.Load("c:\eBay.html", 0)
    htmlDoc = pf

I get an error on the pf.Load line (unhandled exception: object reference not set to an instance of an object).

Edited February 20, 2004 by darknuke

HJB417
Posted February 23, 2004

Can you do this?

    Dim htmlDoc As New mshtml.HTMLDocumentClass

darknuke
Posted February 23, 2004

Would someone please show me an example use of MSHTML that is in VB.NET (that does not require me to use a browser control, if possible)?

Hamburger1984
Posted February 23, 2004

Change this:

    Dim htmlDoc As New mshtml.HTMLDocument
    Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
    pf.Load("c:\eBay.html", 0)
    htmlDoc = pf

...to this:

    Dim htmlDoc As New mshtml.HTMLDocument
    Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
    ' pf was declared but never assigned, hence the null reference on Load.
    pf = CType(htmlDoc, System.Runtime.InteropServices.UCOMIPersistFile)
    pf.Load("c:\eBay.html", 0)
    htmlDoc = pf

Hope this helps!
Andreas

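Putting the pieces from this thread together, a VB.NET version without a browser control might look roughly like the sketch below. It assumes a COM reference to the Microsoft HTML Object Library and a reference to System.Windows.Forms, that the page has already been saved to the path shown, and the .NET 1.1 interface name `UCOMIPersistFile` used earlier in the thread. The module name is made up, and the code has not been verified against a live page.

```vbnet
Imports System
Imports System.Runtime.InteropServices
Imports System.Windows.Forms

Module MshtmlCells
    <STAThread()> _
    Sub Main()
        ' Assumes the page was saved to this path beforehand.
        Dim htmlDoc As New mshtml.HTMLDocumentClass
        Dim pf As UCOMIPersistFile = CType(htmlDoc, UCOMIPersistFile)
        pf.Load("c:\eBay.html", 0)

        ' Loading is asynchronous: pump messages until parsing has finished,
        ' mirroring the wait loops in the C# snippet posted earlier.
        While htmlDoc.body Is Nothing
            Application.DoEvents()
        End While
        While htmlDoc.readyState <> "complete"
            Application.DoEvents()
        End While

        ' innerHTML is everything between <td ...> and </td>, nested tags included.
        For Each cell As mshtml.IHTMLElement In htmlDoc.getElementsByTagName("td")
            Console.WriteLine(cell.innerHTML)
        Next
    End Sub
End Module
```

Note that `getElementsByTagName("td")` returns every cell in the document, including cells of nested and layout tables, so some filtering is still needed to isolate the listing rows.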
fblanco
Posted March 4, 2005

Remove scripting code

Hi,

If you are still offering, I would be grateful to receive the C# code you mentioned below. Many thanks.

HJB417 wrote: "Do you know C#? What I would do is use MSHTML to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a web page, removes the scripting, and creates an HTML document. From there you could call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row. The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, which you compile into a DLL with the C# compiler and reference from your project. Or you can try to convert the C# code to VB.NET."

HJB417
Posted March 4, 2005 (edited)

You will need to add references to mshtml and System.Windows.Forms.

    using System;
    using System.Diagnostics;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;
    using mshtml;

    namespace HB.Net
    {
        /// <summary>
        /// Creates a managed wrapper for a <see cref="mshtml.HTMLDocument"/> object.
        /// </summary>
        public class HtmlDocument : IDisposable
        {
            private bool _deleteWhenDone;

            /// <summary>
            /// The underlying <see cref="mshtml.HTMLDocument"/>.
            /// </summary>
            public readonly HTMLDocument MsHtmlDoc;

            /// <summary>
            /// The file path of the downloaded html document.
            /// </summary>
            public readonly string LocalPath;

            private bool _disposed;

            /// <summary>
            /// The content of the webpage.
            /// </summary>
            public readonly string AsciiData;

            private static readonly Regex ScriptParser;
            private static readonly Regex FileExtRemover;

            static HtmlDocument()
            {
                string[] tags = new string[] {"script", /*"style", */"object", "head", "map", "iframe", "javascript"};
                // Matches an opening tag (with optional attributes) through its closing tag.
                string scriptParserPattern = @"<(" + string.Join("|", tags) + @")\b[^>]*>.*?</\1\s*>";
                ScriptParser = new Regex(scriptParserPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
                FileExtRemover = new Regex(@"\.\w+$", RegexOptions.Compiled);
            }

            /// <summary>
            /// Creates a <see cref="HtmlDocument"/> from the binary data of a webpage.
            /// </summary>
            /// <param name="data">The binary data of the webpage</param>
            /// <param name="removeScripting">true to remove javascript.</param>
            public HtmlDocument(byte[] data, bool removeScripting) : this(CreateFile(data), true, removeScripting)
            {
            }

            public HtmlDocument(string html, bool removeScripting) : this(html, removeScripting, Encoding.ASCII)
            {
            }

            public HtmlDocument(string html, bool removeScripting, Encoding encoding) : this(encoding.GetBytes(html), removeScripting)
            {
            }

            /// <summary>
            /// Creates a <see cref="HtmlDocument"/> from a webpage file.
            /// </summary>
            /// <param name="filename">The file path of the webpage.</param>
            /// <param name="deleteFile">true to delete the file on dispose.</param>
            /// <param name="removeScripting">set to true to remove script tags.</param>
            public HtmlDocument(string filename, bool deleteFile, bool removeScripting)
            {
                _disposed = false;
                // Honor the caller's choice instead of always deleting the file.
                _deleteWhenDone = deleteFile;
                LocalPath = filename;
                try
                {
                    if(removeScripting)
                        Preparse(filename);
                    MsHtmlDoc = CreateHTMLDocument(out AsciiData);
                }
                catch
                {
                    if(deleteFile)
                    {
                        try
                        {
                            File.Delete(filename);
                        }
                        catch{}
                    }
                    throw;
                }
            }

            /// <summary>
            /// Deletes the webpage and closes the underlying <see cref="mshtml.HTMLDocument"/> object.
            /// </summary>
            public void Dispose()
            {
                if(_disposed)
                    return;
                MsHtmlDoc.close();
                if(_deleteWhenDone)
                {
                    try
                    {
                        File.Delete(LocalPath);
                    }
                    catch{}
                }
                _disposed = true;
                GC.SuppressFinalize(this);
            }

            ~HtmlDocument()
            {
                try
                {
                    Dispose();
                }
                catch{}
            }

            /// <summary>
            /// Creates a HTMLDocument.
            /// </summary>
            private HTMLDocumentClass CreateHTMLDocument(out string asciiData)
            {
                byte[] _htmlData;
                FileStream file = File.OpenRead(LocalPath);
                try
                {
                    _htmlData = new byte[file.Length];
                    for(int read = 0; read < file.Length;)
                        read+=file.Read(_htmlData, read, (int)(file.Length - read));
                }
                finally
                {
                    file.Close();
                }

                HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
                try
                {
                    System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
                    pf.Load(LocalPath, 0);
                    // Pump messages until the document has finished loading.
                    while(htmlDoc.body == null)
                        System.Windows.Forms.Application.DoEvents();
                    while(htmlDoc.readyState != "complete")
                        System.Windows.Forms.Application.DoEvents();
                    asciiData = Encoding.ASCII.GetString(_htmlData);
                }
                catch(Exception e)
                {
                    htmlDoc.close();
                    throw new ApplicationException("An error occurred while creating a mshtml.HTMLDocumentClass object.", e);
                }
                return htmlDoc;
            }

            /// <summary>
            /// Removes scripting from a html file.
            /// </summary>
            /// <param name="filename">The path of the file.</param>
            public void Preparse(string filename)
            {
                //read in txt file
                TextReader file = File.OpenText(filename);
                string text = null;
                try
                {
                    text = file.ReadToEnd();
                    text = ScriptParser.Replace(text, "");
                }
                finally
                {
                    file.Close();
                }

                TextWriter output = File.CreateText(filename);
                try
                {
                    output.Write(text);
                    output.Flush();
                }
                finally
                {
                    output.Close();
                }
            }

            private static string CreateTempHtmlFile()
            {
                while(true)
                {
                    string filename = Path.GetTempFileName();
                    try
                    {
                        string htmlFileName = FileExtRemover.Replace(filename, ".html");
                        File.Move(filename, htmlFileName);
                        return htmlFileName;
                    }
                    catch
                    {
                        File.Delete(filename);
                    }
                }
            }

            /// <summary>
            /// Creates a html file from an array of bytes.
            /// </summary>
            /// <param name="data">The array of bytes to create the data from.</param>
            private static string CreateFile(byte[] data)
            {
                string filename = CreateTempHtmlFile();
                FileStream file = File.OpenWrite(filename);
                try
                {
                    file.Write(data, 0, data.Length);
                    file.Flush();
                    return filename;
                }
                catch
                {
                    try
                    {
                        File.Delete(filename);
                    }
                    catch{}
                    throw;
                }
                finally
                {
                    file.Close();
                }
            }

            /// <summary>
            /// Returns the content of the html document.
            /// </summary>
            /// <returns>The content of the html document.</returns>
            [System.Diagnostics.DebuggerStepThrough]
            public override string ToString()
            {
                return AsciiData;
            }
        }
    }

Edit: cleaned up the code.

Edited March 4, 2005 by HJB417

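For readers working in VB.NET, the wrapper could be consumed roughly as sketched below, following the table, row, and cell walk described earlier in the thread. This assumes the C# class has been compiled into a DLL that the VB project references alongside mshtml. The file path and module name are made up, and deciding which table actually holds the listings is left out.

```vbnet
Imports System
Imports System.IO

Module ListingRows
    <STAThread()> _
    Sub Main()
        ' Hypothetical path: read the saved page into a string.
        Dim html As String = Nothing
        Dim reader As StreamReader = File.OpenText("c:\eBay.html")
        Try
            html = reader.ReadToEnd()
        Finally
            reader.Close()
        End Try

        ' True asks the wrapper to strip script, object, head, map and iframe blocks.
        Dim doc As New HB.Net.HtmlDocument(html, True)
        Try
            ' Walk every table, then its rows, then each row's cells.
            For Each tableElem As Object In doc.MsHtmlDoc.getElementsByTagName("TABLE")
                Dim table As mshtml.IHTMLTable = CType(tableElem, mshtml.IHTMLTable)
                For Each rowElem As Object In table.rows
                    Dim row As mshtml.IHTMLTableRow = CType(rowElem, mshtml.IHTMLTableRow)
                    For Each cell As mshtml.IHTMLElement In row.cells
                        Console.WriteLine(cell.innerHTML)
                    Next
                Next
            Next
        Finally
            ' Closes the MSHTML document and deletes the temporary file.
            doc.Dispose()
        End Try
    End Sub
End Module
```

Try/Finally is used instead of a Using block because VB.NET 2003, the version mentioned in this thread, does not have the Using statement.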