Convert PDF to HTML

When I convert a PDF to HTML, the resulting HTML file is much larger than the original PDF file. The PDF and HTML files are attached.

The PDF file is 60 KB, and the HTML file is 800 KB.

Hi Dalonggeng,


Thanks for contacting support.

I have tested the scenario using Aspose.Pdf for .NET 10.3.0 in a Visual Studio 2010 application running on Windows 7 (x64). In my test, the resulting HTML file is 67 KB and the folder containing its resources is just 100 KB. Could you please share your code snippet so that we can test the scenario again in our environment?

[C#]

using Aspose.Pdf;

// Load the source PDF and save it as HTML with the default options.
Document document = new Document(@"C:\pdftest\pdftestExport\104kb.pdf");
document.Save(@"C:\pdftest\pdftestExport\104kb_converted.html", SaveFormat.Html);

Hi,

Thanks for your reply.
We need to convert the PDF to a single HTML file.
My code is below.

[C#]
using Aspose.Pdf;

Document exportDoc = new Document(@"d:/pdftest/104kb.pdf");
HtmlSaveOptions newOptions = new HtmlSaveOptions();

// Embed fonts, images and CSS into the single output HTML file.
newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
newOptions.RemoveEmptyAreasOnTopAndBottom = true;
newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
// Write every font in all supported output formats.
newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;

exportDoc.Save(@"d:/pdftest/104kb.html", newOptions);

Hi Dalonggeng,


Thanks for sharing the resource files.

I have tested the scenario and was able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-38684. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Hi Dalonggeng,


Thanks for your patience. We have investigated the issue and found that it is not a bug: the output does not contain any irrelevant information.

The main reason for the large output is the fonts of the source PDF. It uses the standard set of PDF fonts (“Times New Roman”, “Helvetica”, “Arial”). Every PDF viewer must provide these fonts, so they are not stored in the PDF document itself. On export, however, the font data must be included in the output so that browsers render it correctly, so it is written into the resulting HTML.

Also, you are using the “SaveInAllFormats” font saving mode, so the font data is tripled (once for each output font format and URL).
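As a rough sketch of that point, and assuming a single font format is enough for the browsers you target, the font saving mode in the posted code could be switched to one format while keeping the single-file output. The choice of `AlwaysSaveAsWOFF` and the class name `SingleFontFormatExport` are assumptions for illustration; this has not been checked against the attached file, and the paths are the ones from the posted code.

```csharp
using Aspose.Pdf;

class SingleFontFormatExport
{
    static void Main()
    {
        Document exportDoc = new Document(@"d:/pdftest/104kb.pdf");
        HtmlSaveOptions options = new HtmlSaveOptions();

        // Still produce one self-contained HTML file.
        options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;

        // Assumption: one font format is sufficient, so each font is written once
        // instead of once per format as with SaveInAllFormats.
        options.FontSavingMode = HtmlSaveOptions.FontSavingModes.AlwaysSaveAsWOFF;

        exportDoc.Save(@"d:/pdftest/104kb.html", options);
    }
}
```

How much this saves depends on how much of the output is font data, so the result should be measured rather than assumed.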

Moreover, you are embedding all resources into the output HTML. In that case the binary data of fonts and images is also Base64-encoded and stored as text in the main HTML, which enlarges the output file even more (Base64 encoding increases the size of binary data by about one third).
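To illustrate the Base64 overhead only, here is a small standalone sketch that does not use Aspose.Pdf. The 60 KB buffer is a hypothetical stand-in for embedded font or image data, and the class and method names are made up for this example. Base64 turns every 3 bytes into 4 text characters, so the encoded text is about one third larger than the binary input.

```csharp
using System;

public class Base64Overhead
{
    // Number of Base64 characters (including padding) for n bytes of binary data.
    public static long EncodedLength(long n)
    {
        return 4 * ((n + 2) / 3);
    }

    public static void Main()
    {
        // Hypothetical 60 KB binary payload standing in for embedded font data.
        byte[] payload = new byte[60 * 1024];

        string encoded = Convert.ToBase64String(payload);

        Console.WriteLine("Binary size:  " + payload.Length + " bytes");
        Console.WriteLine("Base64 size:  " + encoded.Length + " characters");
        Console.WriteLine("Formula size: " + EncodedLength(payload.Length) + " characters");
    }
}
```

For this payload, 61440 bytes become 81920 characters, so a font that is embedded in three formats and then Base64-encoded adds roughly four times its original size to the HTML file.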


Please feel free to contact us for any further assistance.


Best Regards,