Performance Issue

Last post 03-05-2009, 4:10 AM by jmsm. 8 replies.
  •  03-02-2009, 8:25 AM 167632

    Performance Issue Java

    Attachment: Present (inaccessible)
    Hi there. I am using Aspose.Words for Java and I am having performance issues with documents of about 50k words.

    Attached are 3 files:
    • total.doc - around 50k words
    • half.doc - around 25k words
    • AsposeTestC.txt - my test class
    Processing "total.doc" took about 25 minutes, and "half.doc" took about 5 minutes.

    My goal is to go through each word in a Word document and process it.
    In the attached class I am only counting words.

    My question is: why does it take so long to process these files, and why does a document with 2x the words take 5x longer to process? Am I doing something wrong?

    The pattern tries to represent a word in the Portuguese language.


    Thanks for the support

    Filed under: performance
     
  •  03-02-2009, 2:55 PM 167711 in reply to 167632

    Re: Performance Issue

    Hi

     

    Thanks for your request. Your regular expression matches all words in the document. The following regular expression matches the same.

     

    Pattern.compile("[\\w-]+")

     

    Could you please explain what your goal is? Maybe there is another, more efficient way to achieve this.

    The current implementation just iterates through all the words in the document.

     

    Best regards.
    Alexey Noskov
    Developer/Technical Support
    Aspose Auckland Team
     
  •  03-03-2009, 3:12 AM 167827 in reply to 167711

    Re: Performance Issue

    Hi

    Thanks for the reply. The pattern must include the other characters to match a word in the Portuguese language. For instance, "coração" is the Portuguese word for "heart", and a pattern with only "\w" doesn't match it.
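To illustrate the point about "\w", here is a minimal sketch in plain Java (no Aspose calls). By default, `\w` in `java.util.regex` covers only `[a-zA-Z_0-9]`, so accented letters fall outside it. The class name, the helper `isWholeWord`, and the `\p{L}`-based alternative are assumptions introduced for illustration; they are not from the attached test class.

```java
import java.util.regex.Pattern;

public class PortugueseWordPattern {
    // ASCII-only word pattern: without UNICODE_CHARACTER_CLASS,
    // \w means [a-zA-Z_0-9] and does not include accented letters.
    static final Pattern ASCII_WORD = Pattern.compile("[\\w-]+");

    // Alternative (an assumption, not the poster's pattern):
    // \p{L} matches any Unicode letter, \p{Nd} any decimal digit.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{Nd}_-]+");

    // Hypothetical helper: true if the pattern matches the entire string.
    static boolean isWholeWord(Pattern pattern, String s) {
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "coração" written with Unicode escapes to avoid source-encoding issues.
        String heart = "cora\u00e7\u00e3o";
        System.out.println(isWholeWord(ASCII_WORD, heart));   // false
        System.out.println(isWholeWord(UNICODE_WORD, heart)); // true
    }
}
```

Listing the accented characters explicitly in the character class, as the original pattern does, is equally valid; `\p{L}` just avoids having to enumerate them.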

    What I am trying to do (and have achieved) is: for each word in a document, test whether it is a known word and, if so, replace it with its value.


    Thanks

     
  •  03-03-2009, 4:12 AM 167843 in reply to 167827

    Re: Performance Issue

    Hi

     

    Thank you for the additional information. Maybe it is better to convert your document to TXT format (just a string), then search for Portuguese words in this string. When you find a known word in the string, you can find the same word in the document and replace it. If you need, I can try to create a code example for you.

     

    Best regards.


    Alexey Noskov
    Developer/Technical Support
    Aspose Auckland Team
     
  •  03-03-2009, 5:48 AM 167860 in reply to 167843

    Re: Performance Issue

    It would be great if I could have some sample code, because I don't understand what you are suggesting.


    thanks
     
  •  03-03-2009, 7:50 AM 167880 in reply to 167860

    Re: Performance Issue

    Hi

     

    Thanks for your request. Maybe the following code could help you to achieve what you need:

     

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import com.aspose.words.Document;
    import com.aspose.words.Range;

    // This is our dictionary that contains known words
    Map<String, String> dict = new HashMap<String, String>();
    dict.put("UNIVERSIDADE", "This is new value");
    dict.put("CNICA", "This is new value");
    dict.put("nomes", "This is new value");

    // Open document
    Document doc = new Document("C:\\Temp\\half.doc");

    // Get document range
    Range reprange = doc.getRange();  //.replace(regex, new MyReplaceEvaluator(), true);

    // Compile pattern (word characters plus Portuguese accented letters and hyphen)
    Pattern regex = Pattern.compile("[\\wçÇàÀáÁéÉíÍóÓúÚãÃâô-]+");

    // Create matcher over the plain text of the document
    Matcher match = regex.matcher(doc.toTxt());

    while (match.find())
    {
        // Replace known word with its value in the document
        String word = match.group();
        if (dict.get(word) != null)
            reprange.replace(word, dict.get(word), true, true);
    }

    doc.save("C:\\Temp\\out.doc");
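One possible refinement of the loop above, sketched in plain Java under the assumption that the document text is already available as a String: collect each known word only once, so the replace call runs once per distinct word instead of once per occurrence. The class and method names (`KnownWordScan`, `findKnownWords`) are hypothetical and not part of Aspose.Words.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KnownWordScan {
    // Hypothetical helper: returns every dictionary word that occurs in the
    // text, each listed once, in order of first appearance.
    static Set<String> findKnownWords(String text, Pattern wordPattern,
                                      Map<String, String> dict) {
        Set<String> found = new LinkedHashSet<String>();
        Matcher m = wordPattern.matcher(text);
        while (m.find()) {
            String word = m.group();
            if (dict.containsKey(word)) {
                found.add(word); // a Set ignores repeated occurrences
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<String, String>();
        dict.put("nomes", "This is new value");

        // Stand-in for the plain text of the document.
        String text = "os nomes e os nomes";
        Pattern words = Pattern.compile("[\\p{L}-]+");

        for (String word : findKnownWords(text, words, dict)) {
            // In the thread's example, this is where the document-level
            // replace (reprange.replace(word, ..., true, true)) would go.
            System.out.println(word + " -> " + dict.get(word));
        }
    }
}
```

Whether this makes a measurable difference depends on how often known words repeat in the document; it has not been timed against the files from this thread.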

     

    Best regards.


    Alexey Noskov
    Developer/Technical Support
    Aspose Auckland Team
     
  •  03-04-2009, 5:47 AM 168082 in reply to 167880

    Re: Performance Issue

    Thank you very much for the reply.

    The example you gave is indeed very very fast.

    Thanks a lot.

    Just one more question. Is it possible, in ReplaceEvaluator.replace, to get the page and the line of the match?
     
  •  03-04-2009, 6:23 AM 168088 in reply to 168082

    Re: Performance Issue

    Hi

     

    Thanks for your request. An MS Word document is a flow document and does not contain any information about its layout into lines and pages. That's why there is no way to get the page or line number of the match.

     

    Best regards.


    Alexey Noskov
    Developer/Technical Support
    Aspose Auckland Team
     
  •  03-05-2009, 4:10 AM 168295 in reply to 168088

    Re: Performance Issue

    Yes, I understand. That info depends on the Word version, styles, etc.

    Many thanks for the support.
     