Embedded OLE objects

In my programming task, I want to extract an embedded object that was inserted as a link or as an icon, and then save it somewhere else.

As far as I know, there are three ways to insert an OLE object: 1. insert the content, 2. insert as an icon, 3. insert as a link.

I used the OleFormat property of the Shape object and then checked the IsLink property of the OleFormat, as follows:

asposeDoc = new Document(doc.FileFullPath);
NodeCollection shapes = asposeDoc.GetChildNodes(NodeType.Shape, true);
foreach (Shape shape in shapes)
{
    // OleFormat is null for shapes that are not OLE objects.
    if (shape.OleFormat != null && shape.OleFormat.IsLink)
    {
        .....
    }
}

The above code successfully detects an OLE object that was inserted as a link. Is there a way to identify that an OLE object is embedded as an icon?

Thanks!

Another related question: how would I differentiate whether an embedded object is the actual content or merely an icon? For example, I could embed an Excel spreadsheet by placing the Excel file inside the Word doc as an icon, or I could insert the spreadsheet itself into my document. How would I programmatically differentiate these two scenarios? I do not need to use Aspose.Words to create different kinds of embedded objects; all I want is to know what type of embedded object an existing document has.

Thanks and appreciate your help!

You can determine whether a shape is a linked OLE object by using the Shape.OleFormat.IsLink property. However, we have not yet figured out how to tell whether an OLE object is shown as content or as an icon. That requires some additional research into the binary format, which we will probably do, but later. I have logged this feature request in our defect base as issue #1403.

Best regards,

Thanks for your reply.

I have another question. I have two Word docs: one has an OLE object inserted as a link, the other has an OLE object inserted as an icon, and both refer to the same Word doc. I was using OleFormat.Save() to save the embedded object to another place. If the OLE object is an icon, I can save it and later open the extracted Word document without problems. However, if the OLE object is a link, the extracted document is no longer readable in Word (Word pops up the “File Conversion” dialog box asking for the text encoding). Here is my code:

// Dispose the stream so the extracted file is fully flushed and closed.
using (FileStream fileStream = new FileStream(Path.Combine(doc.FileDir, "test1.doc"), FileMode.Create))
{
    shape.OleFormat.Save(fileStream);
}

Did I do something wrong?

Another question: is there a way to know the underlying file name of the OLE object?

Thanks much.

This is probably because, for a linked file, the file contents are not stored in the document; only some link information is stored. Generally, the contents of an OLE object embedded inside a Word document are a black box to us. The format is completely undocumented and would take a lot of effort to figure out, so we are not going to decipher it in the near future. The underlying name of the OLE object is sometimes exposed and sometimes not. Try saving the document to WordprocessingML to check what information is actually available and what is packed inside the OLE binary package. Generally, what can be seen in WordML is also available via the Aspose.Words API.

Best regards,

We cannot extract the following information about OLE objects from DOC files:

AsIcon, IconLabel, IconPath and IconIndex

We don’t know where this is stored in the DOC files.

This is a new question regarding extracted embedded objects.

The scenario is: I embedded a text file in a Word doc and then extracted it using Aspose.Words. When I view the extracted file in a text editor, it shows the text, but with unrecognized characters around it. Is there a way to extract the text file cleanly?

I have attached the Word doc and the extracted text file (I gave it extension of “unknown”).

The text file embedded: TextFileToEmbed.txt

The Word doc: File1_Embedded_Unknown.doc

The extracted file: File1_Embedded_Unknown_0.txt

Thanks!

Thanks for providing the sample files. I can see the problem. Please give me some time to work out recommendations on how to get this working. I will probably be able to do this tomorrow.

Best regards,

Vladimir Averkin
Developer/Technical Support
Aspose Auckland team

It seems that extracting the embedded file requires some changes in our API. I have logged this in our defect base as issue #1576. We will probably implement it in the next hotfix version.

I think a similar issue happens in Aspose.Slides as well. Shall I contact them separately, or will this cover it?

Thank you!

I think it is better to contact the Aspose.Slides team directly by posting this issue on the Aspose.Slides forum.

Starting from version 4.2.3, a new method is exposed on the OleFormat object. You can use it to extract named parts from the OLE object data. The following code shows how this method can be used to extract the text content of a text file embedded in a Word document:

// Requires: using System.IO; using System.Text;
private string ExtractEmbeddedText(Shape shape)
{
    MemoryStream stream = shape.OleFormat.GetOleEntry("\x0001Ole10Native");
    BinaryReader reader = new BinaryReader(stream, Encoding.ASCII);
    // Read the total length.
    int totalLength = reader.ReadInt32();
    // Read the first header; it should be 02 00 (its meaning is unknown to us).
    int header1 = reader.ReadInt16();
    // Read file name 1.
    string filename1 = ReadAsciizString(reader);
    // Read file name 2.
    string filename2 = ReadAsciizString(reader);
    // Read the second header; it should be 00 00 03 00 (its meaning is unknown to us).
    int header2 = reader.ReadInt32();
    // Read the length of the file name that follows.
    int filenameLength = reader.ReadInt32();
    // Read file name 3.
    // It seems fine to read it as a null-terminated string instead of using the preceding length.
    string filename3 = ReadAsciizString(reader);
    // Read the length of the file content.
    int contentLength = reader.ReadInt32();
    // Read the content, one ASCII character per byte.
    StringBuilder sb = new StringBuilder();
    while (contentLength-- != 0)
        sb.Append(reader.ReadChar());
    return sb.ToString();
}

/// <summary>
/// Read null-terminated ASCII string.
/// </summary>
/// <param name="reader">Reader positioned at the start of the string.</param>
private string ReadAsciizString(BinaryReader reader)
{
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        char ch = reader.ReadChar();
        if (ch == '\0')
            break;
        else
            sb.Append(ch);
    }
    return sb.ToString();
}
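
For binary attachments (a .zip, for example) the same walk over the stream can return raw bytes instead of a string. The following is only a sketch under the assumption that the data follows the field layout read by ExtractEmbeddedText above; `Ole10NativePayload` and `Extract` are hypothetical names, not part of the Aspose.Words API, and whether "file name 1" really holds the original file name should be checked against your own documents.

```csharp
using System;
using System.IO;
using System.Text;

public static class Ole10NativePayload
{
    // Hypothetical helper: takes the raw bytes of the "\x0001Ole10Native" entry
    // and returns the embedded file content unchanged.
    // Assumes the same field order as ExtractEmbeddedText.
    public static byte[] Extract(byte[] ole10Native, out string fileName)
    {
        using (var reader = new BinaryReader(new MemoryStream(ole10Native), Encoding.ASCII))
        {
            reader.ReadInt32();            // total length
            reader.ReadInt16();            // first header (02 00)
            fileName = ReadAsciiz(reader); // file name 1
            ReadAsciiz(reader);            // file name 2
            reader.ReadInt32();            // second header (00 00 03 00)
            reader.ReadInt32();            // length of file name 3
            ReadAsciiz(reader);            // file name 3
            int contentLength = reader.ReadInt32();
            byte[] content = reader.ReadBytes(contentLength);
            if (content.Length != contentLength)
                throw new InvalidDataException("Stream ended before the payload was complete.");
            return content;
        }
    }

    // Reads bytes up to (and consuming) the terminating zero byte.
    private static string ReadAsciiz(BinaryReader reader)
    {
        var sb = new StringBuilder();
        for (byte b = reader.ReadByte(); b != 0; b = reader.ReadByte())
            sb.Append((char)b);
        return sb.ToString();
    }
}
```

The bytes of the entry could be obtained with `shape.OleFormat.GetOleEntry("\x0001Ole10Native").ToArray()`, and the result written with `File.WriteAllBytes`.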

The new version of Aspose.Words is available at:
https://downloads.aspose.com/words/net

Hope this will help to resolve your problem.

I’ve created a simple project that extracts both OLE1 and OLE2 objects properly. Extracting OLE2 objects is easy in the current Aspose.Words version, but extracting OLE1 objects requires the code that miklovan provided above (which I have included in the project). In the next version of Aspose.Words we will make sure to add some useful methods to make extraction of OLE objects more straightforward.

Hi,

I have a Word document template file with an embedded Excel sheet. I would like to make a change to the embedded Excel sheet and then save the Word document. Is this possible?

Thanks,
Adam.

You can extract the embedded Excel file from the document using Aspose.Words, but you won’t be able to embed the changed file back into the document. That is not supported in Aspose.Words yet.

I am also using Aspose.Words to extract embedded OLE objects from a Word document, looping over the shapes and checking Shape.OleFormat as described. Here is my problem: the shapes are not recognized unless I delete the pages preceding my objects. I am attaching two files. In the first one, no node is recognized as a shape; in the second, I removed the preceding pages, and the nodes are recognized as shapes.

Aspose.Words.Document doc = new Aspose.Words.Document(filename, Aspose.Words.LoadFormat.Auto, "");
Aspose.Words.NodeCollection col = doc.GetChildNodes(Aspose.Words.NodeType.Any, true);
if (col.Count > 0)
{
    foreach (Aspose.Words.Node node in col)
    {
        if (node.NodeType == Aspose.Words.NodeType.Shape)
        {
            Aspose.Words.Drawing.Shape shape = (node as Aspose.Words.Drawing.Shape);
            if (shape.OleFormat != null)
                MessageBox.Show(shape.OleFormat.ProgId.ToString());
            else
                MessageBox.Show("Node Type = " + node.NodeType.ToString());
        }
    }
}

Hi

Thanks for your request. I used the following code for testing, and everything seems to work fine. I successfully extracted OLE objects from both documents.

//Open document
Document doc = new Document(@"Test184\00016268.doc");
//get collection of shapes
NodeCollection shapes = doc.GetChildNodes(NodeType.Shape, true);
int i = 0;
//Loop through all shapes
foreach (Shape shape in shapes)
{
    if (shape.OleFormat != null)
    {
        //Extract OLE object
        if (shape.OleFormat.ProgId == "Excel.Sheet.8")
        {
            shape.OleFormat.Save(String.Format(@"Test184\out\_{0}.xls", i));
            i++;
        }
    }
}
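
If a document contains more than one kind of object, the single ProgId check above could be generalized with a lookup table. This is only a sketch: `OleProgIds` and `ExtensionFor` are hypothetical names, and the table lists a few common Office ProgIds that you would extend for your own documents.

```csharp
using System;
using System.Collections.Generic;

public static class OleProgIds
{
    // A few common Office ProgIds; extend as needed for your documents.
    private static readonly Dictionary<string, string> Extensions =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Excel.Sheet.8", ".xls" },
            { "Excel.Sheet.12", ".xlsx" },
            { "Word.Document.8", ".doc" },
            { "Word.Document.12", ".docx" },
            { "PowerPoint.Show.8", ".ppt" },
            { "PowerPoint.Show.12", ".pptx" }
        };

    // Returns the file extension for a known ProgId, or null when the ProgId
    // is unknown (for example "Package", which says nothing about the content).
    public static string ExtensionFor(string progId)
    {
        if (progId == null)
            return null;
        string extension;
        if (Extensions.TryGetValue(progId, out extension))
            return extension;
        return null;
    }
}
```

Inside the loop, a non-null result from `OleProgIds.ExtensionFor(shape.OleFormat.ProgId)` could be appended to the output file name passed to `shape.OleFormat.Save`.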

I use the latest version of Aspose.Words (6.1.0) for testing.

Best regards.

Hi,

After printing out every ProgId, I found that both .txt and .zip files have the same ProgId, “Package”. Is there any way to extract them with different extensions? Thanks a lot!

Hi

Thanks for your request. You can try creating your own logic to determine the file format of an embedded OLE object. For example, for Word formats you can try using the Document.DetectFileFormat method.
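
One possible form of such logic is to look at the first bytes of the extracted content. The sketch below is an illustration only: `PayloadSniffer` and `GuessExtension` are hypothetical names, not Aspose.Words API, and the text check is a simple heuristic that will misclassify some files. It assumes you already have the raw bytes of the embedded file, not the whole OLE package.

```csharp
using System;

public static class PayloadSniffer
{
    // Guesses a file extension from well-known leading signatures.
    public static string GuessExtension(byte[] data)
    {
        if (data == null || data.Length == 0)
            return ".bin";
        // "PK\x03\x04": ZIP archive (also used by .docx/.xlsx/.pptx containers).
        if (StartsWith(data, 0x50, 0x4B, 0x03, 0x04))
            return ".zip";
        // OLE2 compound file (legacy .doc/.xls/.ppt); refine further if needed.
        if (StartsWith(data, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
            return ".cfb";
        // Heuristic: no control bytes other than tab, LF and CR means "text".
        if (LooksLikeText(data))
            return ".txt";
        return ".bin";
    }

    private static bool StartsWith(byte[] data, params byte[] magic)
    {
        if (data.Length < magic.Length)
            return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i])
                return false;
        return true;
    }

    private static bool LooksLikeText(byte[] data)
    {
        foreach (byte b in data)
            if (b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D)
                return false;
        return true;
    }
}
```

Signatures are checked before the text heuristic so that a short binary header is never mistaken for text.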

Best regards.

Thanks Alexey,

One more question: can Aspose also recover the OLE object’s original file name? Or must I implement my own naming logic, such as attachment_1.xxx, attachment_2.xxx?

Thanks and regards,
Bearyung