The ways to retrieve text from the document are:
- Use Document.Save with SaveFormat.Text to save as plain text into a file or stream.
- Use Node.ToTxt. Internally, this saves the node as text into a memory stream and returns the resulting string.
- Use Node.GetText to retrieve text with all Microsoft Word control characters including field codes.
- Implement a custom DocumentVisitor to perform customized extraction.
Using Node.GetText
A Word document can contain control characters that designate special elements such as a field, end of cell, end of section, etc. The full list of possible Word control characters is defined in the ControlChar class. The Node.GetText method returns text with all of the control characters present in the node.
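To make the control characters easier to picture, here is a minimal sketch in plain Java that replaces them with readable tags. It does not call Aspose.Words. The class name ControlCharDemo, the describe helper, and the sample string are hypothetical, and the string is hand-written to resemble what Node.GetText might return for a paragraph containing a PAGE field, not captured library output. The character codes follow the values defined in the ControlChar class.

```java
final class ControlCharDemo {
    // Replaces each known Word control character with a readable tag.
    // Illustrative helper only, not part of Aspose.Words.
    static String describe(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\u0013': out.append("<FIELD_START>"); break; // start of a field
                case '\u0014': out.append("<FIELD_SEP>"); break;   // separates field code from field result
                case '\u0015': out.append("<FIELD_END>"); break;   // end of a field
                case '\u0007': out.append("<CELL>"); break;        // end of a table cell
                case '\f': out.append("<BREAK>"); break;           // section or page break
                case '\r': out.append("<PARA>"); break;            // end of a paragraph
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample: a paragraph with a PAGE field whose result is "1".
        String raw = "Page \u0013 PAGE \u00141\u0015 of report\r";
        System.out.println(describe(raw));
        // prints: Page <FIELD_START> PAGE <FIELD_SEP>1<FIELD_END> of report<PARA>
    }
}
```

Both the field code (PAGE) and the field result (1) are present in the raw string, which is why text obtained this way usually needs further filtering before it is shown to a user.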
Using SaveFormat.Text
This example saves the document as follows:
- Filters out field characters and field codes, as well as shape, footnote, endnote and comment references.
- Replaces end of paragraph ControlChar.Cr characters with ControlChar.CrLf combinations.
- Uses UTF-8 encoding.
Example ExtractContentSaveAsText
Shows how to save a document in TXT format.
[C#]
// Load the source document.
Document doc = new Document(MyDir + "Document.doc");
// The .txt extension selects the plain text save format.
doc.Save(MyDir + "Document.ConvertToTxt Out.txt");
[Visual Basic]
' Load the source document.
Dim doc As Document = New Document(MyDir & "Document.doc")
' The .txt extension selects the plain text save format.
doc.Save(MyDir & "Document.ConvertToTxt Out.txt")
[Java]
// Load the source document.
Document doc = new Document(getMyDir() + "Document.doc");
// The .txt extension selects the plain text save format.
doc.save(getMyDir() + "Document.ConvertToTxt Out.txt");
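The conversions listed under Using SaveFormat.Text can be approximated on a raw string in plain Java. The following is a rough sketch under stated assumptions, not the way Aspose.Words implements its text export. The class PlainTextSketch and its methods are hypothetical, and the input is assumed to be text of the kind Node.GetText returns. The sketch handles only field characters, field codes, paragraph ends, and the encoding. Shape, footnote, endnote and comment references are out of scope.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

final class PlainTextSketch {
    // Drops field characters and field codes, keeps field results,
    // and expands each paragraph end (CR) to CR+LF.
    static String toPlainText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its field code.
        Deque<Boolean> fields = new ArrayDeque<>();
        int inCode = 0; // number of open fields whose code is being read
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\u0013') {            // field start: code follows
                fields.push(Boolean.TRUE);
                inCode++;
            } else if (c == '\u0014') {     // field separator: result follows
                if (!fields.isEmpty() && fields.peek()) {
                    fields.pop();
                    fields.push(Boolean.FALSE);
                    inCode--;
                }
            } else if (c == '\u0015') {     // field end
                if (!fields.isEmpty() && fields.pop()) {
                    inCode--;               // field had no separator, so no result
                }
            } else if (inCode == 0) {
                if (c == '\r') {
                    out.append("\r\n");     // paragraph end becomes CR+LF
                } else {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }

    // Encodes the filtered text as UTF-8 bytes, ready to write to a file or stream.
    static byte[] toUtf8(String raw) {
        return toPlainText(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "Page \u0013 PAGE \u00141\u0015 of 5\rEnd\r";
        System.out.print(toPlainText(raw));
    }
}
```

The stack keeps nested fields from leaking their codes into the output: text is emitted only when no enclosing field is still in its code portion.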