Extract Content between Pages

adam.skelton · April 17, 2011, 5:11pm

Hi Alicia,

I suspect this is happening for the reason stated in my last post: the table contained within the page spans multiple pages.

I think I have a nice solution to this, and I will provide you with some code within a day.

Thanks,

AGontarek · April 18, 2011, 11:02am

Thanks! I am using the version that I downloaded with the link provided:
https://releases.aspose.com/words/net/

Is this the correct version?

rumata · April 18, 2011, 7:08pm

Hello,

Thank you for additional information.
Yes, the link is right. The latest version of our product is 9.8.0.0.
Please wait a little longer, Adam will give you the code.

AGontarek · April 21, 2011, 11:56am

Any luck with this?

Thanks for your help!

adam.skelton · April 21, 2011, 11:38pm

Hi there,

Thanks for your inquiry.

I’m afraid this is still a work in progress. I haven’t had time to finish coding it yet. I will look into doing this over the weekend. I appreciate your patience.

Thanks,

adam.skelton · April 25, 2011, 10:35am

Hi Alicia,

Thanks for waiting.

Please find attached an updated version of the PageNumberFinder class. This update includes a new method, SplitNodesAcrossPages, that you can use to extract pages into a separate document properly.

You can use code like the example below to extract pages to an external document. The SplitNodesAcrossPages method splits any section whose content spans multiple pages into separate sections, one per page. You can then extract each page by extracting its section and inserting it into a new document.

Document doc = new Document("Document.docx");
// Set up the document which pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();

PageNumberFinder finder = new PageNumberFinder(doc);

// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);

// Copy all content including headers and footers from the specified pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPage(3, 5, NodeType.Section);

foreach (Section section in pageSections)
    dstDoc.AppendChild(section);

dstDoc.Save(dataDir + "Document Out.docx");

If you have any issues, please attach your document here for testing.

Thanks,

AGontarek · April 26, 2011, 8:56am

Thank you, Adam.

I copied your code and replaced the old PageFinder.cs with the one you supplied. When I run the code, however, I am getting the following error: “The newChild was created from a different document than the one that created this node.” on this line:

dstDoc.AppendChild(section)

Have I forgotten a step?

alexey.noskov · April 26, 2011, 2:35pm

Hi

Thanks for your request. I think you should just use NodeImporter in this case to import each section into the destination document:

Document doc = new Document("Document.docx");
// Set up the document which pages will be copied to. Remove the empty section.
Document dstDoc = new Document();
dstDoc.RemoveAllChildren();

PageNumberFinder finder = new PageNumberFinder(doc);

// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);

// Copy all content including headers and footers from the specified pages into the destination document.
NodeImporter importer = new NodeImporter(doc, dstDoc, ImportFormatMode.UseDestinationStyles);
for (int page = 3; page <= 5; page++)
{
    List<Node> pageSections = finder.RetrieveAllNodesOnPage(page, true, NodeType.Section);
    foreach (Section section in pageSections)
    {
        dstDoc.AppendChild(importer.ImportNode(section, true));
    }
}

dstDoc.Save(dataDir + "Document Out.docx");

Best regards,

AGontarek · July 28, 2011, 9:27am

Can you provide me with any information as to when this feature will be supported? Is there a scheduled release date?

AGontarek · July 28, 2011, 9:30am

AndreyN:
Hi

Thanks for your request. A Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically, there is no “Page” concept in a Word document.

Aspose.Words uses our own Rendering Engine to lay out documents into pages, and we have plans to expose layout information. Your request has been linked to the appropriate issue. You will be notified as soon as this feature is supported.

Also, as a workaround, I think you can try using the PageNumberFinder class suggested by Adam in this thread:

https://forum.aspose.com/t/58199

Best regards,

I was speaking in reference to the above post…

alexey.noskov · July 28, 2011, 9:32am

Hi

Thanks for your request. Unfortunately, the issue is not planned yet, so I cannot provide you with a reliable estimate regarding this feature. We will consider exposing the layout information of nodes in the future, but no timeframe is available yet.

Best regards,

AGontarek · September 21, 2011, 8:29am

The code provided seems to be working to my specifications, with just one flaw. The page numbers are being reset in the cloned documents. I need to retain the original page numbers. Is this possible?

Thank you so much for all of the assistance.

adam.skelton · September 21, 2011, 9:45pm

Hi there,

Thanks for your inquiry.

Could you please attach your input document and the code that allows me to reproduce the issue? I will take a closer look into this for you.

Thanks,

AGontarek · September 23, 2011, 8:19am

Here is the code and a sample:

public static Document ExtractContentBetweenPages(Document srcDoc, int fromPage, int toPage)
{
    // Set up the document which pages will be copied to. Remove the empty section.
    Document dstDoc = new Document();
    dstDoc.RemoveAllChildren();
    PageNumberFinder finder = new PageNumberFinder(srcDoc);
    // Split nodes which are found across pages.
    finder.SplitNodesAcrossPages(true);
    // Copy all content including headers and footers from the specified pages into the destination document.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.UseDestinationStyles);
    for (int page = fromPage; page <= toPage; page++)
    {
        List<Node> pageSections = finder.RetrieveAllNodesOnPage(page, true, NodeType.Section);
        foreach (Section section in pageSections)
        {
            // Import the section first; appending a node created by another document throws an exception.
            dstDoc.AppendChild(importer.ImportNode(section, true));
        }
    }
    return dstDoc;
}

adam.skelton · September 24, 2011, 3:59am

Thanks for this additional information.

This was a minor bug, which I have now fixed. Please try downloading the class again.

Thanks,

AGontarek · September 27, 2011, 10:37am

Thank you for this, it seems to have fixed the problem!

I am seeing one other problem now, however. The code (as a whole) does not seem to work with DOCX files. I get the attached error when debugging.

alexey.noskov · September 27, 2011, 2:45pm

Hi

Thanks for your request. Could you please attach a sample document that causes this problem?

Best regards,

AGontarek · September 28, 2011, 11:11am

Simple test document attached. This produces the error that I included in the above post.

adam.skelton · September 29, 2011, 8:58am

Hi there,

Thanks for your inquiry.

I can’t reproduce any problem on my side. Make sure that the values you pass to your method are within the valid page range (1 to 4 in the case of your document).

Thanks,

AGontarek · September 29, 2011, 2:17pm

I am definitely using a valid page range. May I ask, are you running this code in a Windows Form or a Web form? Because it actually does work in a Windows form…however, I need it to work in a Web form. It doesn’t make any sense to me why the DOCX files don’t work in both…the code is identical. I must be missing SOMETHING.