
Support for getting information from existing pdf document.

Last post 06-28-2007, 7:14 PM by GeorgieYuan. 45 replies.
Page 3 of 4 (46 items)
  •  06-11-2007, 10:11 PM 80172 in reply to 78835

    Re: Support for getting information from existing pdf document.

    Attachment: Present (inaccessible)

    I was using PdfExtractor.ExtractText() to extract text from a PDF that contains only one sentence, and it took about 20-30 seconds. Is the performance of this method a known issue? The PDF I tested is attached.

    Thanks.

     
  •  06-11-2007, 10:19 PM 80174 in reply to 80172

    Re: Support for getting information from existing pdf document.

    That will definitely be too much of a performance hit for us. Will you be able to come up with a solution for this within the next 2 weeks?

    Thanks
     
  •  06-11-2007, 10:37 PM 80176 in reply to 80172

    Re: Support for getting information from existing pdf document.

    I tested this PDF and the text was extracted within one second. Are you sure you are using the latest version of Aspose.Pdf.Kit?
    Tommy Wang
    Lead Developer
    Aspose Changsha Team
     
  •  06-12-2007, 11:28 AM 80266 in reply to 80176

    Re: Support for getting information from existing pdf document.

    Thanks for your reply.

    I made a mistake by testing it in debug mode.

    Is extracting text per page going to be supported anytime soon? Also, is the bug I reported in an earlier post still being investigated (ExtractText() throws an exception on the example file I gave you)?

     

     
  •  06-12-2007, 12:31 PM 80274 in reply to 80266

    Re: Support for getting information from existing pdf document.

    Hi,

    We certainly plan to support this in the near future, but extracting text per page is still under development. You can try it, but it currently has a few limitations. You can use it like this:

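    // Create the extractor and bind it to the source PDF.
    PdfExtractor m_pdfExtractor = new PdfExtractor();
    m_pdfExtractor.BindPdf(@"D:\AsposeTest\File1_NonSearch.pdf");

    // StartPage and EndPage select the page range (here, page 1 only).
    m_pdfExtractor.StartPage = 1;
    m_pdfExtractor.EndPage = 1;

    // Extract the text for the selected range.
    m_pdfExtractor.ExtractText();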
     
    Regarding the second point, the bug: our developers are working hard to find the root cause. As Georgie has already said, it is difficult to give an ETA for this problem, but I will reconfirm and get back to you.
     
    Thanks.
     
     
     
  •  06-12-2007, 11:17 PM 80329 in reply to 80274

    Re: Support for getting information from existing pdf document.

    Hi,

    We will provide a .NET 2.0 version of Aspose.Pdf.Kit that supports extracting text per page before tomorrow.

    The ExtractText bug with PDF files that don't contain text has not been fixed yet. We are working hard to solve this problem, but we cannot give an ETA at this time.



    Georgie Yuan
    Lead Developer
    Aspose Changsha Team
     
  •  06-13-2007, 12:48 AM 80337 in reply to 80329

    Re: Support for getting information from existing pdf document.

    Attachment: Present (inaccessible)
    Hi,

    The attachment is a .NET 2.0 version of Aspose.Pdf.Kit. Please try it.

    Best Regards.

    Georgie Yuan
    Lead Developer
    Aspose Changsha Team
     
  •  06-13-2007, 10:00 AM 80414 in reply to 80337

    Re: Support for getting information from existing pdf document.

    Great, we will test this out later today and let you know.  Thanks
     
  •  06-14-2007, 12:43 PM 80587 in reply to 80337

    Re: Support for getting information from existing pdf document.

    We tested the new DLL, and here is what we found:

    1. Documents with text:
    This seems to work well for getting the text. However, it seems that PdfExtractor.HasNextPageText() only works if you have extracted the text for the current page. Is this true? That is, it seems we should be able to do the following:

     // Starting at 0 because we want to know whether pages 1 through pageCount have text
     for (int i = 0; i < pageCount; i++)
     {
         extractor.StartPage = i;
         extractor.EndPage = i;

         bool nextPageHasText = extractor.HasNextPageText();
     }

    But this only seems to work if we call ExtractText() before calling HasNextPageText(). We have cases where we only want to know whether there is text and don't need to extract it. Let me know if I am just setting it up incorrectly.
    2. Documents with no text:

    We have some documents that have no text, but Extract() and GetText() return "blanks" and HasNextPageText() returns true. I have attached an example.
          
     
  •  06-14-2007, 7:53 PM 80613 in reply to 80587

    Re: Support for getting information from existing pdf document.

    Hi,

    I didn't see the attached example. Please attach it again. Meanwhile, I will discuss the possibility of the first page problem with the developers.

    Thanks.

    Adeel Ahmad
    Support Developer
    Aspose Changsha Team
    http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

     
  •  06-14-2007, 9:14 PM 80627 in reply to 80613

    Re: Support for getting information from existing pdf document.

    Attachment: Present (inaccessible)
    Here is the attachment.

    Thanks
     
  •  06-16-2007, 9:01 PM 80816 in reply to 80627

    Re: Support for getting information from existing pdf document.

    Hi,

    I should clarify that HasNextPageText() indicates whether the current page number has reached the last page. It does not indicate whether the next page contains text. Used together, GetNextPageText() and HasNextPageText() let you iterate through the pages and get each page's text, whether or not a page contains any text. So:
    1. We can't check whether a given page contains text.
    2. HasNextPageText() returns true whenever the current page is not the last page, regardless of whether the next page contains any text.
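
    For readers who only need to know which pages contain text, the clarification above suggests a workaround: walk the pages with HasNextPageText() and GetNextPageText(), then test whether each page's extracted text is blank. The sketch below is a minimal illustration and is not Aspose code. IPageTextSource and FakePageTextSource are hypothetical stand-ins that only mirror the two method names mentioned in this thread, so the example runs without Aspose.Pdf.Kit. The Stream parameter and the UTF-8 encoding are assumptions; the real PdfExtractor signatures and output encoding should be checked against the library's documentation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for the two PdfExtractor methods named in this thread.
// The real signatures may differ; treat this interface as an assumption.
public interface IPageTextSource
{
    bool HasNextPageText();               // true while pages remain
    void GetNextPageText(Stream output);  // writes the next page's text
}

// In-memory fake so the sketch runs without Aspose.Pdf.Kit.
public sealed class FakePageTextSource : IPageTextSource
{
    private readonly string[] pages;
    private int next;

    public FakePageTextSource(params string[] pages) { this.pages = pages; }

    public bool HasNextPageText() { return next < pages.Length; }

    public void GetNextPageText(Stream output)
    {
        // UTF-8 is an assumption made for this fake only.
        byte[] bytes = Encoding.UTF8.GetBytes(pages[next++]);
        output.Write(bytes, 0, bytes.Length);
    }
}

public static class PageTextCheck
{
    // Returns one flag per page: true when the page yielded non-whitespace text.
    public static List<bool> PagesWithText(IPageTextSource source)
    {
        List<bool> flags = new List<bool>();
        while (source.HasNextPageText())
        {
            using (MemoryStream buffer = new MemoryStream())
            {
                source.GetNextPageText(buffer);
                string text = Encoding.UTF8.GetString(buffer.ToArray());
                flags.Add(text.Trim().Length > 0);
            }
        }
        return flags;
    }

    public static void Main()
    {
        // Three fake pages: real text, whitespace only, and empty.
        IPageTextSource source = new FakePageTextSource("Hello", "   ", "");
        foreach (bool hasText in PagesWithText(source))
            Console.WriteLine(hasText);   // prints True, False, False
    }
}
```

    With the real library, a thin adapter from PdfExtractor to this interface would presumably be needed, and ExtractText() would still have to be called first, as reported earlier in this thread. This approach still extracts every page, so it answers "does this page have text?" but does not avoid the extraction cost.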

    Best Regards.

    Georgie Yuan
    Lead Developer
    Aspose Changsha Team
     
  •  06-25-2007, 2:52 PM 81697 in reply to 77759

    Re: Support for getting information from existing pdf document.

    Attachment: Present (inaccessible)
    All,

    We are still having an issue with the attached document. It causes Aspose.Pdf to throw the following exception when trying to extract text:

    TestCase 'FTI.NFP.Processing.Test.FileProcessing.PdfProcessorTest.AsposeBug1'
    failed: System.IO.IOException: Unexpected character
        at xfc3c9f4b173edcf4.x29772c3c37b291f5.parseCOSName()
        at xfc3c9f4b173edcf4.x29772c3c37b291f5.parseCOSDictionary()
        at xfc3c9f4b173edcf4.x1fd8ff6bce9da376.x261dc983f778e30a()
        at xfc3c9f4b173edcf4.x1fd8ff6bce9da376.parse()
        at xfc3c9f4b173edcf4.x521bb537cf4901b7.get_StreamTokens()
        at xfc3c9f4b173edcf4.xdf01e4a3bbbc619b.processSubStream(x7316f1f64cf72e86 aPage, xd7c4e1d53048f9ce resources, x521bb537cf4901b7 cosStream)
        at xfc3c9f4b173edcf4.xdf01e4a3bbbc619b.processStream(x7316f1f64cf72e86 aPage, xd7c4e1d53048f9ce resources, x521bb537cf4901b7 cosStream)
        at xfc3c9f4b173edcf4.x50a55e0dd0902139.processPage(x7316f1f64cf72e86 page, x521bb537cf4901b7 content)
        at xfc3c9f4b173edcf4.x50a55e0dd0902139.processPages(IList pages)
        at xfc3c9f4b173edcf4.x50a55e0dd0902139.writeText(x830721144d567671 doc, StreamWriter outputStream)
        at xfc3c9f4b173edcf4.x59ea5a64f22abd3a.extractText(Stream input, String password, Int32 startPage, Int32 endPage)

     
  •  06-25-2007, 10:59 PM 81735 in reply to 81697

    Re: Support for getting information from existing pdf document.

    Hi,

    I am able to reproduce this error. We need some time to investigate this issue and will let you know as soon as a solution is found. Sorry for the inconvenience.

    Thanks.

    Adeel Ahmad
    Support Developer
    Aspose Changsha Team
    http://www.aspose.com/Wiki/default.aspx/Aspose.Corporate/ContactChangsha.html

     
  •  06-28-2007, 4:02 PM 82225 in reply to 80274

    Re: Support for getting information from existing pdf document.

    Hi,

    I have tested with version 2.5.2.0. It looks like the ExtractText() method extracts text for the whole PDF anyway, rather than only the pages between the specified StartPage and EndPage. Since this can really hurt performance when extracting a large PDF, is there a way to extract page by page?

    Thanks!

     

     