Find and replace: block of text spanning multiple lines

nexcom.as · January 23, 2012, 10:04am

Hello,

I’ve run into some issues while trying to find and replace a block of text inside a Word document.
The block I am trying to replace is essentially some custom domain-specific language syntax, for which I have a translator. In the end, what I need is to find the block of domain syntax code, interpret and translate it, and finally replace the entire text block in the Word document with the new value returned from my translator.

The content of the Word file looks somewhat like this, where the first line (Hello World) represents some static text in the document that should not be replaced or removed. The actual domain-specific code is enclosed in %Tag% markers, as shown below:

Hello World
%Tag:PropertyGridBegin%
%Tag:PropertyGrid.CustomerName%
%Tag:PropertyGrid.City%
%Tag:PropertyGridEnd%

I initially worked my way through the “Find and highlight” example. The issue with the sample code I’ve been working on is that when I use the replace evaluator I don’t get the entire block I need, i.e. the whole domain code segment. I would imagine that the regular expression needed to find the block in the example above could be as simple as:

%ETRAY:PropertyGridBegin.+?%ETRAY:PropertyGridEnd%

However, this only gives me back the node for the %Tag:PropertyGridBegin% line rather than the entire block segment. I suspect that the regex may not be sufficient, but I still can’t wrap my head around why it should only give back a node containing the first line. This is a problem, since the next Replacing call will still contain the remaining lines of the now fragmented domain code segment, which causes my interpreter to crash.

I would appreciate it if you could give me some pointers on how to solve this particular problem.

Kind regards,
S.Engel

imran.rafique · January 24, 2012, 2:24am

Hi Søren,

Thanks for your inquiry. Based on your code segments, you can extract the block with:

Regex regex = new Regex(@"(?<title>%(.*?)End%)", RegexOptions.IgnoreCase);

An exception will be thrown if captured or replacement strings contain one or more special characters: paragraph break, cell break, section break, field start, field separator, field end, inline picture, drawing object, footnote.
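
A minimal sketch of how a named-group pattern of this shape behaves on a plain .NET string, outside Aspose.Words. It assumes the document text uses "\r" for paragraph breaks, as the match text quoted later in this thread suggests. The class name TagBlockDemo and the method FindBlock are hypothetical names for this illustration only. In .NET, "." matches every character except "\n", so the lazy ".*?" is able to run across "\r" breaks until it reaches the first "End%".

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagBlockDemo
{
    // Named group "title" captures everything from the first '%' up to the first "End%".
    private static readonly Regex BlockRegex =
        new Regex(@"(?<title>%(.*?)End%)", RegexOptions.IgnoreCase);

    // Returns the first begin-to-end block found in text, or an empty string if there is none.
    public static string FindBlock(string text)
    {
        Match match = BlockRegex.Match(text);
        return match.Success ? match.Groups["title"].Value : string.Empty;
    }
}
```

Calling TagBlockDemo.FindBlock on the sample content from the first post (with "\r" between the lines) returns the four tag lines as one string and leaves "Hello World" out of the match.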

Hope this will help.

nexcom.as · January 24, 2012, 7:49am

Hi Imran,

Thanks for your reply. I’ve tried the regex you supplied, which is more reasonable than the expression I gave.

However, the code is still not giving me the correct answer, which troubles me a bit. Once the replace method is called during the replace action in my evaluator, I only get back the node of the first Run, e.g. the one which has the “Begin” in its string according to the example of the domain language I supplied in my original inquiry.

I would expect that, given the regex returns a result, I would somehow get access to the document nodes covered by that result, since I need to replace these nodes with something else. In my case, I look for segments of domain code, extract them from the Word document, translate them into something useful, and finally replace that block within the Word document with the value returned from my translator. However, I am currently stuck on the last bit, since I can’t figure out how to replace all the nodes in the Word document that are contained in the regex result. As a note, each line of the code segment is wrapped in its own paragraph, which may be why I only get back the first single Run.

Also, the exception you mentioned, does this mean that I cannot have a construct like this:

%Tag:GridBegin%
image
%Grid.City = 1%
%Tag:GridEnd%

where ‘image’ indicates a bitmap image inserted in the Word document. The image is not something I need for my translator/parser, but something which needs to stay as it is within the document, i.e. something I don’t want to remove. An example could be, considering the construct above, that the City value gets translated into ‘London’, so the final result should be

image
London

Any advice on this?

Thanks, Søren

imran.rafique · January 24, 2012, 11:32am

Hi Søren,

Thanks for your inquiry. Please try this code in a new application.

Document doc = new Document("D:/temp/textreplace.docx");
Regex regex = new Regex(@"(?<title>%(.*?)End%)", RegexOptions.IgnoreCase);
// doc.Range.Replace(regex, new ReplaceEvaluator(GetID), true);
doc.Range.Replace(regex, new InsertDocumentAtReplaceHandler(), true);
doc.Save("D:/temp/replacetextOut.docx");

public class InsertDocumentAtReplaceHandler : IReplacingCallback
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        String title = e.Match.Groups["title"].Value.Trim();
        DocumentBuilder builder = new DocumentBuilder((Document)e.MatchNode.Document);
        // Pass title variable to interpret translator 
        e.Replacement = translator(title);
        return ReplaceAction.Replace;
    }
}

The regular expression will extract the following code snippets one by one from the Word document.
%Tag:PropertyGridBegin%
%Tag:PropertyGrid.CustomerName%
%Tag:PropertyGrid.City%
%Tag:PropertyGridEnd%

OR

%Tag:GridBegin%
image
%Grid.City = 1%
%Tag:GridEnd%
> Also, the exception you mentioned, does this mean…

An exception may be thrown when the translator returns a string containing a paragraph break, cell break, section break, field start, field separator, field end, inline picture, drawing object or footnote.

In case of any ambiguity, please share more details, such as the Word document and your code snippet.

nexcom.as · January 25, 2012, 3:10am

Hi Imran,

Thanks for your reply! I now got the replacement part of my code working, the example you provided helped. Basically, I had misunderstood how I should use the ReplaceAction, which is now clear.
However, as you said, the exception is thrown during the actual replacement, which isn’t a good thing.

To give an example, let’s say that the text I found in the match looks like this

“%Tag:PropertyGridBegin%\r\v %Tag:PropertyGrid.CityCode%\r\v%Tag:PropertyGridEnd%”

After being translated, it comes back as the following

“\r\v 90210\r\v”

As you can see, the begin and end tags are completely removed, and the CityCode is replaced with the value 90210. However, on the replacement it gives me the following exception: “The match includes one or more special or break characters and cannot be replaced.”

To this I have two questions

First and foremost, what is causing this to happen, and how can I solve it?
If I were to have embedded items, say images, within the code block, e.g. an image between the begin and end of my code block, how can I preserve this during my translation - is that even possible? If so, how would I go about doing that? I can imagine a simple replace won’t work, since I actually need to keep the image node etc.

Thanks,
Søren

imran.rafique · January 25, 2012, 4:59am

Hi Søren,

Thanks for your details.

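>> First and foremost, what is causing this to happen, and how can I solve it?

The translated string "\r\v 90210\r\v" still contains paragraph break (\r) and line break (\v) characters, which a replacement string may not contain. Strip them before assigning the result to e.Replacement. Note that String.Replace returns a new string, so the result must be assigned back:

ObjString = ObjString.Replace(ControlChar.LineBreak, " ");
ObjString = ObjString.Replace(ControlChar.Cr, " ");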
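
The clean-up step can be sketched on plain strings as follows. This is only an illustration under stated assumptions: the literals "\v" and "\r" stand in for ControlChar.LineBreak and ControlChar.Cr so that the snippet runs without Aspose.Words, the final Trim() is an addition that was not part of the suggestion above, and ReplacementSanitizer and Sanitize are hypothetical names.

```csharp
using System;

public static class ReplacementSanitizer
{
    // Replaces line breaks ("\v") and paragraph breaks ("\r") with spaces,
    // then trims the padding this leaves at both ends.
    // string.Replace returns a new string, so the calls are chained on the result.
    public static string Sanitize(string translated)
    {
        return translated
            .Replace("\v", " ")
            .Replace("\r", " ")
            .Trim();
    }
}
```

With the translated value from the previous post, ReplacementSanitizer.Sanitize("\r\v 90210\r\v") yields "90210", which no longer contains break characters.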

>> If I were to have embedded items, say images, within the code block, e.g. an image between the begin and end of my code block, how can I preserve this during my translation - is that even possible? If so, how would I go about doing that? I can imagine a simple replace won’t work, since I actually need to keep the image node etc.

public class InsertDocumentAtReplaceHandler : IReplacingCallback
{
    ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
    {
        String title = e.Match.Groups["title"].Value.Trim();
        string[] words = title.Split('\r');
        string ExtractString = "";

        for (int i = 0; i < words.Length; i++)
        {
            if (Regex.IsMatch(words[i], @"%(.*?)%"))
            {
                ExtractString = ExtractString + words[i];
            }
        }
        // ExtractString now contains only the tag lines; images and other content are skipped.
        // Pass ExtractString (not the raw match) to the translator.
        e.Replacement = translator(ExtractString);
        return ReplaceAction.Replace;
    }
}
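
The line filtering inside the callback can be tried on its own with a plain string. The sketch below is a variation on it and rests on assumptions: the pattern is anchored with ^ and $ so that only whole-line tags are kept, the kept lines are joined with "\r" instead of being concatenated directly, and the word "image" is just a placeholder for non-tag content. TagLineFilter and ExtractTagLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagLineFilter
{
    // Splits a matched block on paragraph breaks and keeps only lines of the form %...%.
    public static string ExtractTagLines(string block)
    {
        var kept = new List<string>();
        foreach (string line in block.Split('\r'))
        {
            string trimmed = line.Trim();
            if (Regex.IsMatch(trimmed, @"^%(.*?)%$"))
            {
                kept.Add(trimmed);
            }
        }
        // Join with "\r" so the translator still sees one tag per line.
        return string.Join("\r", kept);
    }
}
```

For the input "%Tag:GridBegin%\rimage\r%Grid.City = 1%\r%Tag:GridEnd%" the middle "image" line is dropped and the three tag lines are returned. Note that this only decides what is handed to the translator; it does not by itself keep an image node in the document.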

nexcom.as · January 30, 2012, 5:42am

Hi Imran,

I got it working, thanks for your support on my problem.

Kind regards,
Søren

imran.rafique · January 30, 2012, 6:17am

Hi Søren,

Thank you very much for your feedback. It is great to hear that your issue has been resolved. You are always welcome and please feel free to ask if you have any query in future.
We always welcome constructive response from our customers.

robertal · April 2, 2012, 8:57pm

Can you please share how you got it working? I think I have a similar problem.

Thanks!

imran.rafique · April 3, 2012, 3:21am

Hi Rob,

Thanks for the inquiry. First off, I suggest you go through the whole thread, where I have shared code snippets. In case of any ambiguity, please share your scenario along with your findings.

robertal · April 3, 2012, 12:31pm

Hi,

My problem is with capturing text that contains paragraph breaks. If you look at the thread I started with post 371222, you’ll see exactly what I am talking about. Reviewing the code snippet you provided, I do not see how one can match the desired text if there are paragraph breaks. I have tried putting the regex into a regex tester, and the only match I get is the last line containing the “End”. Any help you can provide will be greatly appreciated.

Thank you.
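
One possible explanation for the regex tester result, offered as a sketch and not a diagnosis: in .NET, "." matches every character except "\n". If the tester separates lines with "\n", the lazy ".*?" cannot cross a line, so only the line that itself contains "End%" matches. With "\r" separators (the break character shown in the match text earlier in this thread), or with RegexOptions.Singleline, the whole block matches. LineBreakDemo and FirstMatch are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakDemo
{
    // Returns the first match of the begin-to-end pattern, or an empty string if none is found.
    public static string FirstMatch(string text, RegexOptions options)
    {
        Match m = Regex.Match(text, @"%(.*?)End%", options);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For example, with the three lines "%Tag:PropertyGridBegin%", "%Tag:PropertyGrid.City%" and "%Tag:PropertyGridEnd%" joined by "\n", FirstMatch with RegexOptions.None returns only the last line, while RegexOptions.Singleline returns all three lines. Joined by "\r", RegexOptions.None already returns all three. This concerns the regex only; whether the Replace call accepts a match that contains breaks is a separate matter.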

imran.rafique · April 3, 2012, 11:38pm

Hi Rob,
Thanks for your request. Unfortunately, you cannot replace text with paragraph breaks. An exception is thrown if a captured or replacement string contains one or more special characters: paragraph break, cell break, section break, field start, field separator, field end, inline picture, drawing object, footnote.
https://docs.aspose.com/words/java/find-and-replace/

robertal · April 4, 2012, 2:24pm

Is it possible, at least, to highlight several paragraphs selected between two tags? Or is highlighting limited to just text contained in runs with any of the aforementioned special characters?

Another option, I guess, is to go run by run. Could you provide me with an easy way to traverse the nodes? Basically I’ll start at one point and then progress along the nodes until I find my desired keyword. I will then go to the next node. If the third node contains my second keyword, I will then know that the second node is the one to be changed. I can then change the text in the second node as needed. I am assuming that the layout of the nodes per my example above is correct, or would there be more nodes that I need to account for?

Also, I would like to put in a request that the replace feature be modified to allow for special characters. Based on searching the forums, there seems to be a few folks needing this.

Thanks.

imran.rafique · April 9, 2012, 3:13am

Hi Rob,

Thanks for the details.

> Another option, I guess, is to go run by run. Could you provide me with an easy way to traverse the nodes? Basically I’ll start at one point and then progress along the nodes until I find my desired keyword. I will then go to the next node. If the third node contains my second keyword, I will then know that the second node is the one to be changed. I can then change the text in the second node as needed. I am assuming that the layout of the nodes per my example above is correct, or would there be more nodes that I need to account for?

Please note that all text of the document is stored in runs of text. The Run class represents a run of characters with the same font formatting, and it can only be a child of a Paragraph instance. A Paragraph can have many instances of the Run class. Also, each paragraph ends with a paragraph break character, and this is how you can iterate through the Run nodes.

NodeCollection runs = doc.GetChildNodes(NodeType.Run, true);
foreach (Run run in runs)
{
    if (run.Text.EndsWith(ControlChar.ParagraphBreak))
        run.Text = ""; // Set the text as needed
}
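
The run-by-run idea from the question (find the first keyword, then the second, and change what lies between them) can be modelled with plain strings. This is a hypothetical sketch: the list entries stand in for the Text of consecutive Run nodes, and RunWalkDemo and ReplaceBetween are made-up names. In a real document a keyword may be split across several runs, which this simplified model does not handle.

```csharp
using System;
using System.Collections.Generic;

public static class RunWalkDemo
{
    // Replaces every entry strictly between the first entry containing startKeyword
    // and the next entry containing endKeyword. Returns true if both keywords were found.
    public static bool ReplaceBetween(IList<string> runTexts, string startKeyword,
                                      string endKeyword, string newText)
    {
        int start = -1;
        for (int i = 0; i < runTexts.Count; i++)
        {
            if (start < 0)
            {
                // Still looking for the first keyword.
                if (runTexts[i].Contains(startKeyword)) start = i;
            }
            else if (runTexts[i].Contains(endKeyword))
            {
                // Found the second keyword: overwrite everything in between.
                for (int j = start + 1; j < i; j++) runTexts[j] = newText;
                return true;
            }
        }
        return false;
    }
}
```

With Aspose.Words the same loop would presumably run over the collection returned by doc.GetChildNodes(NodeType.Run, true) and assign run.Text instead of a list entry.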

Moreover, please refer to the sample code in our documentation to highlight a word or phrase:
https://docs.aspose.com/words/net/find-and-replace/

We will consider improving our Replace method in order to allow replacing special characters. Your request has been linked to the appropriate issue as WORDSNET-1252. We will let you know once it is resolved.

David_LeBow · May 27, 2012, 5:44pm

Hello. I’ve got a very similar problem and I KNOW that the answer is in this thread somewhere, but I just can’t seem to tease it out. I have a document with begin/end keyword pairs, as above.
Under certain circumstances, I’d like to delete (or replace with “”, I suppose) the entire block between the begin and end keyword inclusive. For example:
«%ObjectStart_01»
…all manner of things between here, could be pictures, texts, line breaks, spanning pages, anything …
«%ObjectEnd_01»
Unfortunately, I don’t know enough about regular expressions to see how @"(?<title>%(.*?)End%)" gets the job done in the above case, nor how to adapt it for my case and how to get over the fact that there are ‘special characters’ (breaks, etc.) in the middle. I also can’t wrap my mind around how the regular expression selects the entire area.
Using the Word object model, I searched for the first keyword, selected, then extended the selection to the second keyword and replaced it with “”. That’s really all I’m looking for here, but of course I can’t define a ‘Range’ spanning the given area.
All help GREATLY appreciated. You know… yesterday…
Thanks,
David

awais.hafeez · May 30, 2012, 2:51am

Hi David,

Thanks for your inquiry. Maybe you can use the following regular expression to extract the text between the start/end tags:

Regex regex = new Regex(@"(?<=«%ObjectStart_01»).*(?=\«%ObjectEnd_01»)",
RegexOptions.IgnoreCase);
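
A small sketch of what this lookbehind/lookahead pattern returns on a plain string. The assumptions are labelled: RegexOptions.Singleline is added here so that "." also crosses "\n" (it was not in the expression above), the sample text is invented, and BetweenTagsDemo and Extract are hypothetical names. The lookarounds assert that the tags are present without making them part of the match, so only the content between the tags is returned.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BetweenTagsDemo
{
    // (?<=...) requires the start tag before the match, (?=...) requires the end tag after it.
    private static readonly Regex Between = new Regex(
        @"(?<=«%ObjectStart_01»).*(?=«%ObjectEnd_01»)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between the tags, or an empty string if the pair is not found.
    public static string Extract(string text)
    {
        Match m = Between.Match(text);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Because ".*" is greedy, the match extends to the last end tag in the text when several are present.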

Moreover, to extract all other document elements (e.g. Shapes, Tables, Paragraphs etc.) enclosed between these Start and End keywords, I think you need to implement the following workflow:

1. Find the node which represents the starting keyword, i.e. «%ObjectStart_01»
2. Find the node which represents the ending keyword, i.e. «%ObjectEnd_01»
3. Use the code suggested in this article to extract the content between the start and end nodes

I hope this helps. Please let us know if you need code for points 1, 2 and 3 above. We’ll create a sample for you.

Best Regards,

David_LeBow · July 1, 2012, 4:14pm

The trick is… I’m not trying to extract the text (e.g.: for use elsewhere)… I’m trying to make it disappear - i.e., delete it (or replace it with nothingness).
I’m using the following Regex to find the start/end area:
regexKeyword =
new Regex(@"(?»%ObjectStart_(?[0-1][0-9])»(.*?)»%ObjectEnd_\k»)", RegexOptions.IgnoreCase);
The idea would be that this would find the following document areas:
»%ObjectStart_01» … through … »%ObjectEnd_01»
then…
»%ObjectStart_02» … through … »%ObjectEnd_02»
then…
»%ObjectStart_03» … through … »%ObjectEnd_03»
etc.
My GOAL is to then replace each selection with “”, but of course I get “The match includes one or more special or break characters and cannot be replaced.” …
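
For reference, the numbered begin/end idea can be sketched on a plain string. Several things here are assumptions: the group name "num" is hypothetical (chosen for this illustration), the « and » delimiters follow the earlier example in this thread, and RegexOptions.Singleline is added so that "." also crosses "\n". NumberedBlockRemover and RemoveBlocks are made-up names. The backreference \k<num> makes each end tag carry the same number as its start tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberedBlockRemover
{
    // Matches «%ObjectStart_NN» ... «%ObjectEnd_NN» where both NN are the same two digits.
    private static readonly Regex Block = new Regex(
        @"«%ObjectStart_(?<num>[0-1][0-9])».*?«%ObjectEnd_\k<num>»",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Deletes every matching block, tags included, from a plain string.
    public static string RemoveBlocks(string text)
    {
        return Block.Replace(text, string.Empty);
    }
}
```

This works on a string only. Inside Range.Replace, a match that contains breaks still raised the exception quoted above at the time of this thread.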

awais.hafeez · July 3, 2012, 8:24pm

Hi David,

Thanks for the additional information. Unfortunately, there is no simple way to remove/replace content between placeholders. However, you can try using the technique suggested in the following code to be able to remove content between such placeholders:

Public Sub Test001()
Dim doc As Document = New Document("C:\Temp\in.doc")
Dim startFinder As PlaceholderFinder = New PlaceholderFinder(doc, True, "{start}")
Dim endFinder As PlaceholderFinder = New PlaceholderFinder(doc, True, "{end}")
Dim startNode As Node = startFinder.FindPlaceholder()
Dim endNode As Node = endFinder.FindPlaceholder()
RemoveSequence(startNode, endNode)
startNode.Remove()
endNode.Remove()
doc.Save("C:\Temp\out.doc")
End Sub
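''' <summary>
''' Removes all nodes between the start and end nodes, excluding the start and end nodes themselves.
''' </summary>
''' <param name="startNode">The start node.</param>
''' <param name="endNode">The end node.</param>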
Private Sub RemoveSequence(ByVal startNode As Node, ByVal endNode As Node)
Dim curNode As Node = startNode.NextPreOrder(startNode.Document)
While ((curNode IsNot Nothing) AndAlso (Not curNode.Equals(endNode)))
'Move to next node
Dim nextNode As Node = curNode.NextPreOrder(startNode.Document)
'Check whether current contains end node
If (curNode.IsComposite) Then
If (Not DirectCast(curNode, CompositeNode).GetChildNodes(NodeType.Any, True).Contains(endNode) And _
Not DirectCast(curNode, CompositeNode).GetChildNodes(NodeType.Any, True).Contains(startNode)) Then
nextNode = curNode.NextSibling
curNode.Remove()
End If
Else
curNode.Remove()
End If
curNode = nextNode
End While
End Sub
Private Class PlaceholderFinder
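''' <summary>
''' Creates a new instance of PlaceholderFinder.
''' </summary>
''' <param name="doc">Document in which to find the placeholder.</param>
''' <param name="isStart">Set this flag if the captured string is the start of the region.</param>
''' <param name="placeHolder">Placeholder.</param>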
Public Sub New(ByVal doc As Document, ByVal isStart As Boolean, ByVal placeHolder As String)
mDoc = doc
mIsStart = isStart
mPlaceHolder = placeHolder
End Sub
Public Function FindPlaceholder() As Node
Dim myRegex As Regex = New Regex(mPlaceHolder)
mDoc.Range.Replace(myRegex, New ReplaceEvaluator(AddressOf ReplaceEvaluatorFindPlaceholder), True)
Return mPlaceholderNode
End Function
''' <summary>
''' This method is called by the Aspose.Words find and replace engine for each match.
''' It stores the placeholder node if one is found.
''' </summary>
Private Function ReplaceEvaluatorFindPlaceholder(ByVal sender As Object, ByVal e As ReplaceEvaluatorArgs) As ReplaceAction
' This is a Run node that contains either the beginning or the complete match.
Dim currentNode As Node = e.MatchNode
' The first (and may be the only) run can contain text before the match,
' in this case it is necessary to split the run.
If e.MatchOffset > 0 Then
currentNode = SplitRun(CType(currentNode, Run), e.MatchOffset)
End If
' This array is used to store all nodes of the match.
Dim runs As ArrayList = New ArrayList()
' Find all runs that contain parts of the match string.
Dim remainingLength As Integer = e.Match.Value.Length
Do While (remainingLength > 0) AndAlso (currentNode IsNot Nothing) AndAlso (currentNode.GetText().Length <= remainingLength)
runs.Add(currentNode)
remainingLength = remainingLength - currentNode.GetText().Length
' Select the next Run node.
' Have to loop because there could be other nodes such as BookmarkStart etc.
Do
currentNode = currentNode.NextSibling
Loop While (currentNode IsNot Nothing) AndAlso (currentNode.NodeType <> NodeType.Run)
Loop
' Split the last run that contains the match if there is any text left.
If (currentNode IsNot Nothing) AndAlso (remainingLength > 0) Then
SplitRun(CType(currentNode, Run), remainingLength)
runs.Add(currentNode)
End If
If (mIsStart) Then
mPlaceholderNode = CType(runs.Item(0), Node)
Else
mPlaceholderNode = CType(runs.Item(runs.Count - 1), Node)
End If
' Signal to the replace engine to stop searching
Return ReplaceAction.Stop
End Function
''' <summary>
''' Splits text of the specified run into two runs.
''' Inserts the new run just after the specified run.
''' </summary>
Private Shared Function SplitRun(ByVal run As Run, ByVal position As Integer) As Run
Dim afterRun As Run = CType(run.Clone(True), Run)
afterRun.Text = run.Text.Substring(position)
run.Text = run.Text.Substring(0, position)
run.ParentNode.InsertAfter(afterRun, run)
Return afterRun
End Function
Private mPlaceholderNode As Node
Private mDoc As Document
Private mIsStart As Boolean
Private mPlaceHolder As String
End Class

I hope this helps.

Best Regards,

aspose.notifier · August 9, 2016, 11:09pm

The issues you have found earlier (filed as WORDSNET-1252) have been fixed in this .NET update and this Java update.

This message was posted using Notification2Forum from Downloads module by aspose.notifier.