Splitting a large PDF file into individual files causes an OutOfMemoryException

everbank · August 25, 2014, 12:22pm

Hi There,

I have a use case where we receive a large PDF file (700 to 800 MB) containing 400k pages, and the task is to split the file into individual pages. I get an out-of-memory exception after splitting around 100k pages. Please suggest any alternatives or workarounds.

My code to split the PDF file is below:

    public int SplitPdfFileByPageDelimiter(string outputFilesDir, string pageDelimiterString, int extractStringLength)
    {
        if (string.IsNullOrEmpty(pageDelimiterString))
        {
            throw new ArgumentNullException("pageDelimiterString", "[SplitPdfFileByPageDelimiter()]: delimiterString for spliting pdf file is null or empty.");
        }

        // open document
        var pdfDocument = new Aspose.Pdf.Document(this.InputFileStream);

        if (string.IsNullOrEmpty(outputFilesDir))
        {
            throw new Exception("OutputFileDirectory missing");
        }

        if (!Directory.Exists(outputFilesDir))
        {
            Directory.CreateDirectory(outputFilesDir);
        }

        var outputFileName = Path.GetFileName(this.InputFilePath);

        if (string.IsNullOrEmpty(outputFileName))
        {
            outputFileName = string.Format("{0}.pdf", Path.GetDirectoryName(outputFilesDir));
        }

        var outputFileFormat = string.Concat(Path.GetFileNameWithoutExtension(outputFileName), "_{0}", Path.GetExtension(outputFileName));

        var docCount = 1;
        var resetCount = 0;
        var document = new Aspose.Pdf.Document();

        // loop through all the pages
        foreach (Page pdfPage in pdfDocument.Pages)
        {
            resetCount++;
            document.Pages.Add(pdfPage);

            var textFragmentAbsorber = new TextFragmentAbsorber();
            pdfPage.Accept(textFragmentAbsorber);

            var textFragmentCollection = textFragmentAbsorber.TextFragments;

            foreach (TextFragment textFragment in textFragmentCollection)
            {
                var pdfFileUniqueIdIndex = textFragment.Text.IndexOf(pageDelimiterString);

                if (pdfFileUniqueIdIndex <= -1)
                {
                    continue;
                }

                pdfFileUniqueIdIndex += pageDelimiterString.Length;
                var pdfFileUniqueId = extractStringLength > 0 ? textFragment.Text.Mid(pdfFileUniqueIdIndex, extractStringLength).Trim() : docCount.ToString();

                if (string.IsNullOrWhiteSpace(pdfFileUniqueId))
                {
                    pdfFileUniqueId = string.Format("NoPdfFileUniqueId.{0}", Guid.NewGuid().ToString());
                }

                var pdfFileFullName = Path.Combine(outputFilesDir, string.Format(outputFileFormat, pdfFileUniqueId));

                if (!File.Exists(pdfFileFullName))
                {
                    document.Save(pdfFileFullName);
                }

                document.FreeMemory();
                document.Dispose();
                docCount += 1;
                document = new Aspose.Pdf.Document();
                break;
            }

            pdfPage.FreeMemory();

            // After every 200 pages let the process to sleep for couple of secs.
            if (resetCount >= 200)
            {
                pdfDocument.FreeMemory();
                Console.WriteLine(string.Format("[SplitPdfFileByPageDelimiter()]: Going to sleep at document count : [{0}]", docCount));
                Thread.Sleep(2000);
                resetCount = 0;
            }
        }

tilal.ahmad · August 26, 2014, 10:44am

Hi Dustin,

Thanks for your inquiry. Aspose.Pdf uses memory for processing, and its performance depends on the system's resources and on the size and contents of the file. As you are processing a large PDF file, you may try increasing the memory of your system; hopefully that will improve the situation. We have also logged an investigation ticket, PDFNEWNET-37392, and will share our findings as soon as possible.

We are sorry for the inconvenience caused.

Best Regards,

tilal.ahmad · August 26, 2014, 11:01am

Hi Dustin,

In addition to the above, if you are using an older version of Aspose.Pdf for .NET, please download and try the latest version, Aspose.Pdf for .NET 9.5.0. Hopefully it will also improve memory management.

Best Regards,

everbank · August 26, 2014, 11:52am

Yes, I am processing a file with 200k pages. I have 8 GB of memory on the server. Do you think that is enough?

Thanks,

everbank · August 26, 2014, 2:09pm

Also, the individual files produced by the split are larger than the corresponding pages in the original file. Please let us know if you have any suggestions for reducing the size of the individual files.

tilal.ahmad · August 27, 2014, 10:50am

Hi Dustin,

everbank:

Yes, I am processing a file with 200k pages. I have 8 GB of memory on the server. Do you think that is enough?

Thanks,

Thanks for your inquiry. As you are getting an OutOfMemoryException with large PDF files, you may want to increase the server's memory, which should improve processing in your workflow. We will also update you as soon as our development team has investigated the issue.

Best Regards,

tilal.ahmad · August 27, 2014, 10:54am

Hi Dustin,

everbank:
Also, the individual files produced by the split are larger than the corresponding pages in the original file. Please let us know if you have any suggestions for reducing the size of the individual files.

Thanks for your inquiry. Please check the following documentation link on optimizing PDF file size. Hopefully it will help you accomplish the task.

Optimize PDF file size.
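
As a rough illustration of the kind of step that documentation covers (this is a sketch, not code taken from the linked page): a page copied into a new document can bring along resources, such as fonts or images, that the single page never uses. Calling `Document.OptimizeResources()` before saving asks Aspose.Pdf to drop unused resources. The names `SinglePageSaver` and `SavePageOptimized` are hypothetical, introduced only for this example, and how much the output shrinks (if at all) depends on the file and the library version.

```csharp
using Aspose.Pdf;

public static class SinglePageSaver
{
    // Hypothetical helper (not from the thread): copies one page of an
    // already-open source document into its own file.
    public static void SavePageOptimized(Document source, int pageNumber, string outputPath)
    {
        // Dispose each single-page document as soon as it has been written.
        using (var single = new Document())
        {
            // Aspose.Pdf page indexes are 1-based.
            single.Pages.Add(source.Pages[pageNumber]);

            // Ask the library to remove resources the page does not use
            // before the file is written.
            single.OptimizeResources();

            single.Save(outputPath);
        }
    }
}
```

Inside the split loop this would be called once per output file, for example `SinglePageSaver.SavePageOptimized(pdfDocument, 1, pdfFileFullName);`. The optimization pass costs extra processing time per file, so it is worth measuring on a small sample first.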

Please feel free to contact us for any further assistance.

Best Regards,