I need a toolkit for extracting embedded OLE objects from MS Word, Excel, PowerPoint, and PDF files

Hi

Thanks for your request. You can use the following code to extract OLE objects from a Word document.

public void TestExtractingOLEobjects()
{
    Document doc = new Document(@"053_88927_RogerHarvest\TestOleExtraction.doc");
    NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
    int i = 0;

    foreach (Shape shape in shapes)
    {
        // Not every shape is an OLE object; OleFormat is null for ordinary shapes.
        if (shape.OleFormat == null)
            continue;

        // Save the OLE object to disk. To save to a stream instead, use OleFormat.Save(Stream stream).
        shape.OleFormat.Save(@"053_88927_RogerHarvest\test" + i.ToString());
        i++;
    }
}

Regarding the other formats, I have notified my colleagues about this thread. They will reply to you shortly.

Best regards.

That's great, thank you!

I would like to download an evaluation version. Which one should I pick for the .Net 2.0 framework?

Hi

Please download the latest version of Aspose.Words.

Best regards.

Hi Roger,

Aspose.Cells is capable of extracting OLE objects from Excel spreadsheets.

Here is some sample code:

// Instantiate a Workbook object and open the template file.
Workbook workbook = new Workbook();
workbook.Open(@"D:\test\MyTemplateFile.xls");
OleObjects oles = workbook.Worksheets[0].OleObjects;
for (int i = 0; i < oles.Count; i++)
{
    OleObject ole = oles[i];
    // Verbatim string, so the backslashes are not treated as escape sequences.
    string fileName = @"d:\test\tole" + i + ".";
    // Pick a file extension that matches the embedded object's type.
    switch (ole.FileType)
    {
        case OleFileType.Doc:
            fileName += "Doc";
            break;
        case OleFileType.Xls:
            fileName += "Xls";
            break;
        case OleFileType.Ppt:
            fileName += "Ppt";
            break;
        case OleFileType.Pdf:
            fileName += "Pdf";
            break;
        default:
            fileName += "data";
            break;
    }
    // Write the embedded object's raw data to disk and close the file afterwards.
    using (FileStream file = File.Create(fileName))
    {
        byte[] data = ole.ObjectData;
        file.Write(data, 0, data.Length);
    }
}

You can download the latest version of Aspose.Cells (4.3) from:
https://downloads.aspose.com/cells/net

For our wiki docs about how to use Aspose.Cells APIs, please check:
https://docs.aspose.com/cells/net/

And for featured demos, please check:
https://github.com/aspose-cells/Aspose.Cells-for-.NET

Feel free to contact us any time for further information or technical issues on the Aspose.Cells forum: http://www.aspose.com/Community/Forums/19/ShowForum.aspx

Thank you.

Hi Roger,

I don’t think PDF files contain OLE objects, so there is nothing for us to extract.

Thanks again for all the help and suggestions. Today I downloaded Words and I'm trying it out. I am primarily interested in extracting embedded files (Excel and PowerPoint docs, etc., embedded in a Word doc). Is there any way to tell what the filename of the embedded file is? This usually appears in the icon that represents the embedded file in the Word doc.

Thanks!

No, unfortunately, the file names of embedded documents are not stored in the Word document file. What you see is just an icon in which the name, or part of the name, appears as a raster image. The name itself is not stored inside the document; I am absolutely certain about this.

That, however, is true only of OLE2 embeddings such as Excel and Word files. OLE1 embeddings (files in a format that is not OLE compatible), such as text, images or PDF, have the file name stored inside the embedding. However, we do not yet provide a means of extracting them, as there has been no demand for such a feature so far.

Hope this answers your question,

Thanks Vladimir! May I officially express some interest in extracting the filenames for OLE1 embeddings? For the OLE2 embeddings, I can check the ProgId from the OleFormat object to at least work out a viable file extension in most cases. But the OLE1 embeddings all have "Package" as their ProgId, be they MSG files, text documents, or whatever, which makes giving them a viable extension very difficult.

Thanks again. The Words API is very impressive.
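
For anyone following along, the ProgId check described above could be sketched roughly like this. This is only an illustration, not part of the Aspose API: the ProgIdExtensions class and GuessExtension method are hypothetical names, the table lists just a few common Office ProgIds, and ".bin" is an arbitrary fallback for anything unrecognized (including OLE1 "Package" objects).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Aspose.Words): guesses a file extension
// from the ProgId string reported for an embedded OLE object.
public static class ProgIdExtensions
{
    // A few well-known ProgIds; extend this table as needed.
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Word.Document.8", ".doc" },
            { "Word.Document.12", ".docx" },
            { "Excel.Sheet.8", ".xls" },
            { "Excel.Sheet.12", ".xlsx" },
            { "PowerPoint.Show.8", ".ppt" },
            { "PowerPoint.Show.12", ".pptx" }
        };

    public static string GuessExtension(string progId)
    {
        // Arbitrary fallback when there is no ProgId at all.
        if (string.IsNullOrEmpty(progId))
            return ".bin";

        string ext;
        if (Known.TryGetValue(progId, out ext))
            return ext;

        // Acrobat ProgIds carry a version suffix, so match on the prefix.
        if (progId.StartsWith("AcroExch.Document", StringComparison.OrdinalIgnoreCase))
            return ".pdf";

        // "Package" (OLE1) and anything else unknown ends up here.
        return ".bin";
    }
}
```

Assuming the ProgId is read from the OleFormat object as described above, the result can simply be appended to the file name passed to OleFormat.Save. OLE1 "Package" objects still fall through to the fallback, which is exactly the limitation mentioned above.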

Yes, we already have it logged as a feature request and will implement it in one of the next versions.

Meanwhile you can use the workaround code given in

It shows how to extract an OLE1 object together with its file name. OLE1 object extraction is already implemented as OleFormat.Save, so you only need the beginning of the code, where the original file name is extracted in three different formats.

Hope this helps,

Great, I will give that a try, thank you! By the way, I've been investigating it myself, and it looks like the filename that is displayed in the image representation of the OLE object in Word is actually embedded as text in the metafile. If you call ImageData.Save and examine the resulting WMF file in a text editor, you will see something like ""CRT05737 -VBRP 2
/ f B data Spec.doc", which looks like the vector graphics instructions for drawing that text. This appears for both OLE1 and OLE2 object types. I'm trying to find a way to parse it now, but it would be great to have it built into the SDK.

Thanks again!

Thank you for noticing it. Yes, it may work for objects embedded as icons. But as far as I can see, it still will not work for objects embedded as controls.

I have also checked MS Word 2007 DOCX behavior for embedded documents - it does not even attempt to save an original name - the files are stored as Microsoft_Office_Excel_97-2003_Worksheet2.xls, Microsoft_Office_Word_97_-_2003_Document4.doc, etc.

But we can still try to extract names for OLE1 objects and for OLE2 objects embedded as icons. It will be a convenient addition to the API. I will notify you as soon as it is implemented in a released version.

Best regards,

Thanks again! I'm only interested in the original filenames of objects embedded as icons anyway, so this will be great. I got my code for parsing the filename out of the metafile working. In case it's of any help to you, I've pasted it below. It's an adaptation of the sample code from the EnumerateMetafileProc documentation in the .NET online help.

Imports System
Imports System.Diagnostics
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.Runtime.InteropServices
Imports System.Text

Public Class MetafileParser

    Private metafile1 As Metafile
    Private metafileDelegate As Graphics.EnumerateMetafileProc
    Private destPoint As Point
    Private MetafileText As String = ""

    ' Returns the text drawn by the ExtTextOut records of the given metafile.
    Public Function GetMetafileText(ByVal wmf As Metafile) As String
        metafile1 = wmf
        MetafileText = ""
        metafileDelegate = New Graphics.EnumerateMetafileProc(AddressOf MetafileCallback)
        destPoint = New Point(20, 10)

        ' Enumerating a metafile needs a Graphics target; a scratch bitmap is used here.
        Using bmp As New Bitmap(metafile1)
            Using G As Graphics = Graphics.FromImage(bmp)
                G.EnumerateMetafile(metafile1, destPoint, metafileDelegate)
            End Using
        End Using

        Return MetafileText
    End Function

    ' Called once per metafile record; collects the text of ExtTextOut records.
    Private Function MetafileCallback(ByVal recordType As EmfPlusRecordType, _
            ByVal flags As Integer, ByVal dataSize As Integer, _
            ByVal data As IntPtr, ByVal callbackData As PlayRecordCallback) As Boolean

        Dim dataArray As Byte() = Nothing

        If data <> IntPtr.Zero Then
            ' Copy the unmanaged record to a managed byte buffer
            ' that can be used by PlayRecord.
            dataArray = New Byte(dataSize - 1) {}
            Marshal.Copy(data, dataArray, 0, dataSize)
        End If

        If recordType = EmfPlusRecordType.WmfExtTextOut AndAlso dataArray IsNot Nothing Then
            ' Keep only the bytes above the control-character range.
            Dim sb As New StringBuilder
            For X As Integer = 0 To dataArray.Length - 1
                If dataArray(X) > 31 Then
                    sb.Append(Chr(dataArray(X)))
                End If
            Next

            Dim Record As String = sb.ToString
            Debug.Print(Record)

            ' Strip the record prefixes observed in front of the text.
            If Record.StartsWith("""fB") Then
                Record = Record.Substring(3)
            ElseIf Record.StartsWith("/fB") Then
                Record = Record.Substring(3)
            ElseIf Record.StartsWith("$fB") Then
                Record = Record.Substring(3)
            ElseIf Record.StartsWith("/*fB") Then
                Record = Record.Substring(4)
            ElseIf Record.StartsWith("/'fB") Then
                Record = Record.Substring(4)
            End If

            ' Remove any remaining double quotes.
            If Record.Contains(Chr(34)) Then
                Record = Record.Replace(Chr(34), "")
            End If

            MetafileText = MetafileText & Record
        End If

        metafile1.PlayRecord(recordType, flags, dataSize, dataArray)
        Return True
    End Function

End Class

Thanks for the code. We will try to implement this functionality as soon as possible.

Best regards,

Hello again. I had to put this project aside for a little while, but I am picking it up again. Can you tell me if the filename extraction has been incorporated yet?

Thanks!

RH

Hi

Thanks for your request. Unfortunately, this feature is not implemented yet. It will be implemented in one of the future releases.

However, OleFormat has a SourceFullName property, which gets or sets the path and name of the source file for a linked OLE object.

Best regards.

Hi,

So I gather from this thread the following:

Aspose.Words can be used to extract OLE objects from Microsoft Word.
Aspose.Cells can be used to extract OLE objects from Microsoft Excel.
There are no OLE objects in Adobe PDFs, so Aspose doesn’t offer a solution for this.

However no mention was made of a solution for PPT files. Can someone help me out with this, and let me know if I’m understanding all of this correctly?

Thanks,
Tom

Hi

Thanks for your request. I think that you can use Aspose.Slides for this. Please see the following link.

I hope this could help you. Please let me know if you would like to know anything else. I will be happy to help you.

Best regards.

Thanks for the quick response. And as for the PDF issue, if the embedded objects are not OLE, then what are they? Is there a resource you can point me to?

Currently we can only extract attachments from a PDF document. Please refer to ExtractAttachment.

I’m still a little confused as to how to use this tool to extract OLE objects. The situation is that I have PPT files that can have any number of embedded files (PDFs, Excel spreadsheets, other PPT files) in any of the slides. I have no idea where or when these embedded objects will occur. The example code seems to look specifically for a chart on a specific slide. Is there any way to simply loop through every slide, saving all of the embedded objects to their own files? Thanks.