rfazendeiro, posted November 24, 2008 (edited November 24, 2008)

I wrote a web application that lets users search for words in files. The files can be .doc, .pdf, .txt, etc. The search engine works, but the other day a user reported a bug: the search does not look inside files that are attached to a .doc file. I've been searching the web and can't find anything about this. Can anyone tell me how I can extract an attachment from a Word document, especially a PDF file? Thanks.
rfazendeiro (Author), posted December 9, 2008

Hello again :) I've managed to extract attachments from a .doc file when they are .doc, .xls, or .ppt files, but I'm really lost on how to extract PDF files. Any help would be really appreciated. Thank you all.
PlausiblyDamp (Administrators), posted December 9, 2008

What code are you using to extract the other attachment types? What exactly happens when you try with PDF files?
rfazendeiro (Author), posted December 9, 2008

Extracting Office files is actually pretty easy, because there is interop support for those kinds of files. Here's a sample of how I take the Office documents out of a Word document:

    using System;
    using System.Runtime.InteropServices;

    public static void SearchFileAttachments(Uri file)
    {
        object missing = Type.Missing;
        object fileName = file.ToString();
        object VerbIndex = Microsoft.Office.Interop.Word.WdOLEVerb.wdOLEVerbOpen;

        Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(
            ref fileName, ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing, ref missing);

        try
        {
            docs.Activate();

            // Each embedded object shows up as an inline shape with an OLE ProgID.
            foreach (Microsoft.Office.Interop.Word.InlineShape inlineShape in docs.InlineShapes)
            {
                if (inlineShape.OLEFormat.ProgID != null)
                {
                    switch (inlineShape.OLEFormat.ProgID)
                    {
                        case "PowerPoint.Show.8":
                            Microsoft.Office.Interop.PowerPoint.Application powerpoint = new Microsoft.Office.Interop.PowerPoint.Application();
                            try
                            {
                                powerpoint.WindowState = Microsoft.Office.Interop.PowerPoint.PpWindowState.ppWindowNormal;

                                // Open the embedded object in its host application,
                                // then attach to that running instance.
                                inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                                powerpoint = Marshal.GetActiveObject("PowerPoint.Application") as Microsoft.Office.Interop.PowerPoint.Application;

                                if (powerpoint != null)
                                {
                                    Guid guid = Guid.NewGuid();
                                    string presentationName = guid + ".ppt";
                                    powerpoint.ActivePresentation.SaveAs(
                                        presentationName,
                                        Microsoft.Office.Interop.PowerPoint.PpSaveAsFileType.ppSaveAsPresentation,
                                        Microsoft.Office.Core.MsoTriState.msoTrue);
                                }
                            }
                            catch (Exception)
                            {
                                // exception code here
                            }
                            finally
                            {
                                if (powerpoint != null)
                                {
                                    powerpoint.ActivePresentation.Close();
                                    powerpoint.Quit();
                                }
                            }
                            break;

                        case "Excel.Sheet.8":
                            Microsoft.Office.Interop.Excel.Application excel = new Microsoft.Office.Interop.Excel.Application();
                            try
                            {
                                // Keep Excel out of the user's sight while it is open.
                                excel.Visible = false;
                                excel.ScreenUpdating = false;
                                excel.Left = -1000;

                                inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                                excel = Marshal.GetActiveObject("Excel.Application") as Microsoft.Office.Interop.Excel.Application;

                                if (excel != null)
                                {
                                    Guid guid = Guid.NewGuid();
                                    object workBookName = guid + ".xls";
                                    excel.ActiveWorkbook.SaveAs(
                                        workBookName, missing, missing, missing, missing, missing,
                                        Microsoft.Office.Interop.Excel.XlSaveAsAccessMode.xlNoChange,
                                        missing, missing, missing, missing, missing);
                                }
                            }
                            catch (Exception)
                            {
                                // exception code here
                            }
                            finally
                            {
                                if (excel != null)
                                {
                                    excel.Workbooks.Close();
                                    excel.Quit();
                                }
                            }
                            break;

                        case "Word.Document.8":
                            Microsoft.Office.Interop.Word.Application wordDocument = new Microsoft.Office.Interop.Word.Application();
                            try
                            {
                                wordDocument.Visible = false;
                                wordDocument.ScreenUpdating = false;
                                wordDocument.Left = -1000;

                                inlineShape.OLEFormat.DoVerb(ref VerbIndex);
                                wordDocument = Marshal.GetActiveObject("Word.Application") as Microsoft.Office.Interop.Word.Application;

                                if (wordDocument != null)
                                {
                                    Guid guid = Guid.NewGuid();
                                    object wordDocumentName = guid + ".doc";
                                    wordDocument.ActiveDocument.SaveAs(
                                        ref wordDocumentName, ref missing, ref missing, ref missing,
                                        ref missing, ref missing, ref missing, ref missing,
                                        ref missing, ref missing, ref missing, ref missing,
                                        ref missing, ref missing, ref missing, ref missing);
                                }
                            }
                            catch (Exception)
                            {
                                // exception code here
                            }
                            finally
                            {
                                if (wordDocument != null)
                                {
                                    object saveChanges = Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges;
                                    object originalFormat = Microsoft.Office.Interop.Word.WdOriginalFormat.wdWordDocument;
                                    object routeDocument = true;
                                    wordDocument.ActiveDocument.Close(ref saveChanges, ref originalFormat, ref routeDocument);
                                    //wordDocument.Quit(ref saveChanges, ref originalFormat, ref routeDocument);
                                }
                            }
                            break;

                        default:
                            break;
                    }
                }
            }
        }
        catch (Exception)
        {
            // exception code here
        }
        finally
        {
            // Always close the source document and quit Word without saving.
            object saveChanges = Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges;
            object originalFormat = Microsoft.Office.Interop.Word.WdOriginalFormat.wdWordDocument;
            object routeDocument = true;
            docs.Close(ref saveChanges, ref originalFormat, ref routeDocument);
            word.Quit(ref saveChanges, ref originalFormat, ref routeDocument);
        }
    }

For the Microsoft Office attachment types I have support by referencing Microsoft.Office.Interop.Excel.dll, Microsoft.Office.Interop.PowerPoint.dll and Microsoft.Office.Interop.Word.dll, but I have no such luck with PDFs. I have tried importing a DLL from Acrobat, but with no success. Any idea how to extract the PDF?

As a side note, if it's a PDF, inlineShape.OLEFormat.ProgID is AcroExch.Document.X, where "X" is the version of the PDF file.
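Because the PDF ProgID carries a version suffix, an exact-match switch like the one above cannot catch it. A minimal sketch of one way to classify the ProgIDs mentioned in this post follows. The class and method names (EmbeddedObjectKinds, ExtensionFor) are hypothetical, and it only recognises the four ProgIDs named here; it does not extract anything.

```csharp
using System;

public static class EmbeddedObjectKinds
{
    // Returns a file extension for the ProgIDs mentioned in this thread,
    // or null for anything else (including a null or empty ProgID).
    public static string ExtensionFor(string progId)
    {
        if (string.IsNullOrEmpty(progId))
        {
            return null;
        }

        // The PDF ProgID is "AcroExch.Document.X" with a varying version
        // suffix, so it is matched by prefix rather than exactly.
        if (progId.StartsWith("AcroExch.Document", StringComparison.OrdinalIgnoreCase))
        {
            return ".pdf";
        }

        switch (progId)
        {
            case "Word.Document.8": return ".doc";
            case "Excel.Sheet.8": return ".xls";
            case "PowerPoint.Show.8": return ".ppt";
            default: return null;
        }
    }
}
```

With a helper like this, the loop could check the returned extension before the switch and route ".pdf" objects to whatever PDF extraction ends up working.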
rfazendeiro (Author), posted December 11, 2008

I've been frantically searching the web for a solution to this problem (getting the PDF out of a Word attachment) and found this code. Its author says he's able to get 95% of Word attachments out, but the problem is that it's written in Perl :/ I've been trying to convert it to C# without much success. Can anyone here help translate this to C#?

    $byte = "";
    $buffer = "";
    #$infh = new FileHandle;
    #sysopen $infh, "$explodeinto/$inname", O_RDONLY;

Open the infh filehandle with the "inname" file containing the OLE object.

    sysseek $infh, 6, SEEK_SET; # Skip 1st 6 bytes

Skip the first 6 bytes; these appear to be useless.

    $outname = "";
    $finished = 0;
    $length = 0;
    until ($byte eq "\0" || $finished || $length>1000) {
      # Read a C-string into $outname
      sysread($infh, $byte, 1) or $finished = 1;
      $outname .= $byte;
      $length++;
    }

Read a null-terminated string of bytes; this becomes the output filename.

    next OLEFILE if $length>1000; # Bail out if it went wrong

If the filename was way too long, this is probably corrupt.

    $finished = 0;
    $byte = 1;
    $length = 0;
    until ($byte eq "\0" || $finished || $length>1000) {
      # Throw away a C-string
      sysread($infh, $byte, 1) or $finished = 1;
      $length++;
    }

Throw away the next null-terminated string of bytes.

    next OLEFILE if $length>1000; # Bail out if it went wrong

If the string was way too long, this is probably corrupt.

    sysseek $infh, 4, Fcntl::SEEK_CUR or next OLEFILE; # Skip next 4 bytes

Skip the next 4 bytes of the file.

    sysread $infh, $number, 4 or next OLEFILE;
    $number = unpack 'V', $number;

Read the next 4 bytes into a 4-byte int called "$number".

    #print STDERR "Skipping $number bytes of header filename\n";
    if ($number>0 && $number<1_000_000) {
      sysseek $infh, $number, Fcntl::SEEK_CUR; # Skip the next bit of header (C-string)
    } else {
      next OLEFILE;
    }

If $number is a reasonable size, skip that many bytes of the file.

    sysread $infh, $number, 4 or next OLEFILE;
    $number = unpack 'V', $number;

Read the next 4 bytes into a 4-byte int called "$number". This is the length of the real embedded file we want to extract.

    #print STDERR "Reading $number bytes of file data\n";
    sysread $infh, $buffer, $number if $number>0 && $number < $size; # Sanity check

Read $number bytes into a chunk of memory that is at least $number bytes long. As a sanity check, make sure the number of bytes we ask it to read is less than the total length of the input file.

    $outfh = new FileHandle;
    $outsafe = $this->MakeNameSafe($outname, $explodeinto);
    sysopen $outfh, "$explodeinto/$outsafe", (O_CREAT | O_WRONLY) or next OLEFILE;

Create an output file whose filename is a sanitised, safe version of the filename we read at the top of this bit of code.

    if ($number>0 && $number<1_000_000_000) { # Number must be reasonable!
      syswrite $outfh, $buffer, $number or next OLEFILE;
    }
    close $outfh;

If the output file is less than 1 GB long, write out the data we just read. This creates the file containing the embedded file we wanted to extract. Then close that output file.

Thank you to all.
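A rough C# sketch of the Perl steps above might look like the following. It is not a tested drop-in replacement. It assumes you already have the raw OLE object data as a seekable Stream (how to get that out of Word is not covered by the Perl), and that the embedded filename is plain ASCII. The names EmbeddedFile, OlePackageReader and TryExtract are made up for illustration. Unlike the Perl, it drops the trailing NUL from the filename and gives up if the stream ends in the middle of a string.

```csharp
using System;
using System.IO;
using System.Text;

public sealed class EmbeddedFile
{
    public string Name;   // filename stored in the header (not yet sanitised)
    public byte[] Data;   // raw bytes of the embedded file
}

public static class OlePackageReader
{
    // Reads a NUL-terminated string of at most maxLength bytes.
    // Returns null if the stream ends first or no NUL is found in time.
    private static string ReadCString(Stream input, int maxLength)
    {
        var bytes = new MemoryStream();
        for (int i = 0; i < maxLength; i++)
        {
            int b = input.ReadByte();
            if (b < 0) return null;                       // end of stream
            if (b == 0) return Encoding.ASCII.GetString(bytes.ToArray());
            bytes.WriteByte((byte)b);
        }
        return null;                                      // too long, probably corrupt
    }

    // Follows the header layout described by the Perl code.
    // Returns null whenever one of its sanity checks fails.
    public static EmbeddedFile TryExtract(Stream input)
    {
        if (!input.CanSeek || input.Length < 6) return null;

        var reader = new BinaryReader(input);
        input.Seek(6, SeekOrigin.Begin);                  // skip the first 6 bytes

        string name = ReadCString(input, 1000);           // output filename
        if (name == null) return null;
        if (ReadCString(input, 1000) == null) return null; // second string is discarded

        try
        {
            input.Seek(4, SeekOrigin.Current);            // skip the next 4 bytes

            // BinaryReader reads little-endian, like Perl's unpack 'V'.
            uint skip = reader.ReadUInt32();
            if (skip == 0 || skip >= 1000000) return null;
            input.Seek(skip, SeekOrigin.Current);         // skip the rest of the header

            uint size = reader.ReadUInt32();              // length of the embedded file
            if (size == 0 || size >= input.Length || size >= 1000000000) return null;

            byte[] data = reader.ReadBytes((int)size);
            if (data.Length != size) return null;         // stream was truncated

            return new EmbeddedFile { Name = name, Data = data };
        }
        catch (EndOfStreamException)
        {
            return null;                                  // header ran past the end
        }
    }
}
```

The result could then be saved with something like File.WriteAllBytes(Path.Combine(outputFolder, Path.GetFileName(result.Name)), result.Data), where outputFolder is whatever directory you choose. Path.GetFileName is only a stand-in for the MakeNameSafe step in the Perl, which is not shown in the post.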