Aspose.Words can be used not only to create Microsoft Word documents, by building them dynamically or merging templates with data, but also to parse documents and extract individual elements such as headers, footers, paragraphs, tables, and images. Another possible task is finding all text that has a specific formatting or style.
Use the DocumentVisitor class to implement this usage scenario. This class corresponds to the well-known Visitor design pattern. With DocumentVisitor, you can define and execute custom operations that require enumeration over the document tree.
DocumentVisitor provides a set of VisitXXX methods that are invoked when a particular document element (node) is encountered. For example, VisitParagraphStart is called when the beginning of a text paragraph is found, and VisitParagraphEnd is called when the end of a text paragraph is found. Each VisitXXX method accepts the object that was encountered, so you can use it as needed (for example, to retrieve its formatting). Both VisitParagraphStart and VisitParagraphEnd, for instance, accept a Paragraph object.
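As an illustration of using the object passed to a VisitXXX method, the following minimal sketch collects the text of all bold runs. The names BoldTextCollector and BoldTextDemo are hypothetical and are not part of Aspose.Words. The sample document is built in memory with DocumentBuilder only so that the sketch does not depend on a file.

```csharp
using System;
using System.Text;
using Aspose.Words;

// Hypothetical visitor (not part of Aspose.Words): gathers the text of bold runs.
public class BoldTextCollector : DocumentVisitor
{
    private readonly StringBuilder mBoldText = new StringBuilder();

    public string GetBoldText()
    {
        return mBoldText.ToString();
    }

    // The Run passed in is the node being visited, so its formatting
    // can be inspected directly.
    public override VisitorAction VisitRun(Run run)
    {
        if (run.Font.Bold)
            mBoldText.Append(run.Text);

        return VisitorAction.Continue;
    }
}

public static class BoldTextDemo
{
    // Builds a small in-memory document (an assumption made for this sketch)
    // and returns the bold text found in it.
    public static string CollectFromSample()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Write("plain ");
        builder.Font.Bold = true;
        builder.Write("bold");
        builder.Font.Bold = false;
        builder.Write(" plain again");

        BoldTextCollector collector = new BoldTextCollector();
        doc.Accept(collector);
        return collector.GetBoldText();
    }
}
```

The same approach should work for other formatting checks, for example testing run.Font.Italic or the style of the parent paragraph instead of run.Font.Bold.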
Each VisitXXX method returns a VisitorAction value that controls the enumeration of nodes. You can request to continue the enumeration (VisitorAction.Continue), skip the current node but continue the enumeration (VisitorAction.SkipThisNode), or stop the enumeration of nodes (VisitorAction.Stop).
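The effect of the three return values can be sketched with a visitor that looks for the first paragraph styled as Heading 1. FirstHeadingFinder and FirstHeadingDemo are hypothetical names, and the in-memory sample document is an assumption made to keep the sketch self-contained.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Tables;

// Hypothetical visitor: finds the text of the first "Heading 1" paragraph
// that is not inside a table.
public class FirstHeadingFinder : DocumentVisitor
{
    public string HeadingText { get; private set; }

    public override VisitorAction VisitTableStart(Table table)
    {
        // Skip the table and everything inside it, then carry on
        // with the node that follows the table.
        return VisitorAction.SkipThisNode;
    }

    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        if (paragraph.ParagraphFormat.StyleIdentifier == StyleIdentifier.Heading1)
        {
            // GetText includes the paragraph break character, so trim it.
            HeadingText = paragraph.GetText().Trim();

            // Nothing else is needed, so end the whole enumeration.
            return VisitorAction.Stop;
        }

        // Not a heading: keep enumerating.
        return VisitorAction.Continue;
    }
}

public static class FirstHeadingDemo
{
    public static string FindInSample()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Writeln("Intro text");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading1;
        builder.Writeln("Chapter One");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        builder.Writeln("Body text");

        FirstHeadingFinder finder = new FirstHeadingFinder();
        doc.Accept(finder);
        return finder.HeadingText;
    }
}
```

Returning VisitorAction.Stop as soon as the heading is found avoids walking the rest of the document, which can matter for large files.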
These are the steps you should follow to programmatically determine and extract various parts of a document:
- Create a class derived from DocumentVisitor.
- Override and provide implementations for some or all of the VisitXXX methods to perform some custom operations.
- Call Node.Accept on the node that you want to start the enumeration from. For example, if you want to enumerate the whole document, use Document.Accept.
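The steps above can be sketched as follows. ParagraphCounter and ParagraphCounterDemo are hypothetical names used only for illustration, and the choice of counting paragraphs is an assumption made to keep the visitor small.

```csharp
using System;
using Aspose.Words;

// Step 1: derive a class from DocumentVisitor.
public class ParagraphCounter : DocumentVisitor
{
    public int Count { get; private set; }

    // Step 2: override only the VisitXXX methods that are needed.
    public override VisitorAction VisitParagraphStart(Paragraph paragraph)
    {
        Count++;
        return VisitorAction.Continue;
    }
}

public static class ParagraphCounterDemo
{
    // Step 3: call Accept on the node where the enumeration should start.
    // Passing a Document enumerates the whole document; passing any other
    // node enumerates only that node and its children.
    public static int CountIn(Node startNode)
    {
        ParagraphCounter counter = new ParagraphCounter();
        startNode.Accept(counter);
        return counter.Count;
    }
}
```

For example, ParagraphCounterDemo.CountIn(doc) counts paragraphs in the whole document, including any in headers and footers, while ParagraphCounterDemo.CountIn(doc.FirstSection.Body) counts only those in the main text of the first section.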
DocumentVisitor provides default implementations for all of the VisitXXX methods, so a new visitor only needs to override the methods it actually requires.
This example shows how to use the Visitor pattern to add new operations to the Aspose.Words object model. In this case, we create a simple document converter into a text format.
Example ExtractContentDocToTxtConverter
[C#]
public void ToText()
{
    // Open the document we want to convert.
    Document doc = new Document(MyDir + "Visitor.ToText.doc");

    // Create an object that inherits from the DocumentVisitor class.
    MyDocToTxtWriter myConverter = new MyDocToTxtWriter();

    // This is the well known Visitor pattern. Get the model to accept a visitor.
    // The model will iterate through itself by calling the corresponding methods
    // on the visitor object (this is called visiting).
    //
    // Note that every node in the object model has the Accept method so the visiting
    // can be executed not only for the whole document, but for any node in the document.
    doc.Accept(myConverter);

    // Once the visiting is complete, we can retrieve the result of the operation,
    // that in this example, has accumulated in the visitor.
    Console.WriteLine(myConverter.GetText());
}

/// <summary>
/// Simple implementation of saving a document in the plain text format. Implemented as a Visitor.
/// </summary>
public class MyDocToTxtWriter : DocumentVisitor
{
    public MyDocToTxtWriter()
    {
        mIsSkipText = false;
        mBuilder = new StringBuilder();
    }

    /// <summary>
    /// Gets the plain text of the document that was accumulated by the visitor.
    /// </summary>
    public string GetText()
    {
        return mBuilder.ToString();
    }

    /// <summary>
    /// Called when a Run node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitRun(Run run)
    {
        AppendText(run.Text);

        // Let the visitor continue visiting other nodes.
        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FieldStart node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldStart(FieldStart fieldStart)
    {
        // In Microsoft Word, a field code (such as "MERGEFIELD FieldName") follows
        // after a field start character. We want to skip field codes and output field
        // result only, therefore we use a flag to suspend the output while inside a field code.
        //
        // Note this is a very simplistic implementation and will not work very well
        // if you have nested fields in a document.
        mIsSkipText = true;

        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FieldSeparator node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldSeparator(FieldSeparator fieldSeparator)
    {
        // Once reached a field separator node, we enable the output because we are
        // now entering the field result nodes.
        mIsSkipText = false;

        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a FieldEnd node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitFieldEnd(FieldEnd fieldEnd)
    {
        // Make sure we enable the output when reached a field end because some fields
        // do not have field separator and do not have field result.
        mIsSkipText = false;

        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when visiting of a Paragraph node is ended in the document.
    /// </summary>
    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        // When outputting to plain text we output Cr+Lf characters.
        AppendText(ControlChar.CrLf);

        return VisitorAction.Continue;
    }

    public override VisitorAction VisitBodyStart(Body body)
    {
        // We can detect beginning and end of all composite nodes such as Section, Body,
        // Table, Paragraph etc and provide custom handling for them.
        mBuilder.Append("*** Body Started ***\r\n");

        return VisitorAction.Continue;
    }

    public override VisitorAction VisitBodyEnd(Body body)
    {
        mBuilder.Append("*** Body Ended ***\r\n");

        return VisitorAction.Continue;
    }

    /// <summary>
    /// Called when a HeaderFooter node is encountered in the document.
    /// </summary>
    public override VisitorAction VisitHeaderFooterStart(HeaderFooter headerFooter)
    {
        // Returning this value from a visitor method causes visiting of this
        // node to stop and move on to visiting the next sibling node.
        // The net effect in this example is that the text of headers and footers
        // is not included in the resulting output.
        return VisitorAction.SkipThisNode;
    }

    /// <summary>
    /// Adds text to the current output. Honours the enabled/disabled output flag.
    /// </summary>
    private void AppendText(string text)
    {
        if (!mIsSkipText)
            mBuilder.Append(text);
    }

    private readonly StringBuilder mBuilder;
    private bool mIsSkipText;
}
[Visual Basic]
Public Sub ToText()
    ' Open the document we want to convert.
    Dim doc As Document = New Document(MyDir & "Visitor.ToText.doc")

    ' Create an object that inherits from the DocumentVisitor class.
    Dim myConverter As MyDocToTxtWriter = New MyDocToTxtWriter()

    ' This is the well known Visitor pattern. Get the model to accept a visitor.
    ' The model will iterate through itself by calling the corresponding methods
    ' on the visitor object (this is called visiting).
    '
    ' Note that every node in the object model has the Accept method so the visiting
    ' can be executed not only for the whole document, but for any node in the document.
    doc.Accept(myConverter)

    ' Once the visiting is complete, we can retrieve the result of the operation,
    ' that in this example, has accumulated in the visitor.
    Console.WriteLine(myConverter.GetText())
End Sub

''' <summary>
''' Simple implementation of saving a document in the plain text format. Implemented as a Visitor.
''' </summary>
Public Class MyDocToTxtWriter
    Inherits DocumentVisitor

    Public Sub New()
        mIsSkipText = False
        mBuilder = New StringBuilder()
    End Sub

    ''' <summary>
    ''' Gets the plain text of the document that was accumulated by the visitor.
    ''' </summary>
    Public Function GetText() As String
        Return mBuilder.ToString()
    End Function

    ''' <summary>
    ''' Called when a Run node is encountered in the document.
    ''' </summary>
    Public Overrides Function VisitRun(ByVal run As Run) As VisitorAction
        AppendText(run.Text)

        ' Let the visitor continue visiting other nodes.
        Return VisitorAction.Continue
    End Function

    ''' <summary>
    ''' Called when a FieldStart node is encountered in the document.
    ''' </summary>
    Public Overrides Function VisitFieldStart(ByVal fieldStart As FieldStart) As VisitorAction
        ' In Microsoft Word, a field code (such as "MERGEFIELD FieldName") follows
        ' after a field start character. We want to skip field codes and output field
        ' result only, therefore we use a flag to suspend the output while inside a field code.
        '
        ' Note this is a very simplistic implementation and will not work very well
        ' if you have nested fields in a document.
        mIsSkipText = True

        Return VisitorAction.Continue
    End Function

    ''' <summary>
    ''' Called when a FieldSeparator node is encountered in the document.
    ''' </summary>
    Public Overrides Function VisitFieldSeparator(ByVal fieldSeparator As FieldSeparator) As VisitorAction
        ' Once reached a field separator node, we enable the output because we are
        ' now entering the field result nodes.
        mIsSkipText = False

        Return VisitorAction.Continue
    End Function

    ''' <summary>
    ''' Called when a FieldEnd node is encountered in the document.
    ''' </summary>
    Public Overrides Function VisitFieldEnd(ByVal fieldEnd As FieldEnd) As VisitorAction
        ' Make sure we enable the output when reached a field end because some fields
        ' do not have field separator and do not have field result.
        mIsSkipText = False

        Return