Developers sometimes need to count the number of words in a PDF document. Aspose.Pdf.Kit provides the GetWordCount method of the PdfExtractor class for this purpose.
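Just follow these four steps:
- Create a PdfExtractor object.
- Bind the input PDF document to the extractor by calling the BindPdf method.
- Call the ExtractText method to extract the text from the PDF document.
- Call the GetWordCount method to get the word count of the input PDF file.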
Code Snippet
[C#]
//Instantiate PdfExtractor object
PdfExtractor extractor = new PdfExtractor();
//Bind the input PDF document to extractor
extractor.BindPdf(".\\text.pdf");
//Extract text from the PDF document
extractor.ExtractText();
//Call GetWordCount method to get word count of the input PDF file
int wordCount = extractor.GetWordCount();
[VB.NET]
'Instantiate PdfExtractor object
Dim extractor As PdfExtractor = New PdfExtractor()
'Bind the input PDF document to extractor
extractor.BindPdf(".\text.pdf")
'Extract text from the PDF document
extractor.ExtractText()
'Call GetWordCount method to get word count of the input PDF file
Dim wordCount As Integer = extractor.GetWordCount()
The GetWordCount method of the PdfExtractor class may not give the desired output for some languages, such as Chinese. In that case, an alternative method can be applied. Follow these steps:
- Create a PdfExtractor object by calling its empty constructor.
- If the PDF document is password-protected, provide the password using the Password property of the PdfExtractor class.
- Bind the PDF document to the extractor by calling the BindPdf method, which takes the input PDF file path or a stream as its argument.
- Call the ExtractText method to extract the text from the PDF document.
- Call the GetText method to store the extracted text in a memory stream.
- Create a StreamReader object and use it to read all the text from the memory stream into a string.
- Create an array of char that contains all the characters used to separate words.
- Call the Split method of the String class, passing the array of characters as the separators, and store the result in an array of string.
- Get the Length property of the array of string. This length is the word count.
[C#]
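//Requires using System; and using System.IO; in addition to the Aspose.Pdf.Kit namespace
//Instantiate PdfExtractor object
PdfExtractor extractor = new PdfExtractor();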
//Bind the input PDF document to extractor
extractor.BindPdf("Extract.pdf");
//Extract text from the PDF document
extractor.ExtractText();
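//Store the extracted text in a memory stream
MemoryStream mem = new MemoryStream();
extractor.GetText(mem);
//Rewind the stream, then read all of the text into a string
mem.Seek(0, SeekOrigin.Begin);
StreamReader reader = new StreamReader(mem);
string text = reader.ReadToEnd();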
//Characters that separate one word from the next
char[] charsToSplit = new char[] { ' ', ':', ';', ',', '.', '-', '\r', '\n', '\t' };
//Split the text on the separator characters, discarding empty entries
string[] tmp = text.Split(charsToSplit, StringSplitOptions.RemoveEmptyEntries);
//The number of resulting entries is the word count
int NoOfWords = tmp.Length;
[VB.NET]
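'Requires Imports System and Imports System.IO in addition to the Aspose.Pdf.Kit namespace
'Instantiate PdfExtractor object
Dim extractor As New PdfExtractor()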
'Bind the input PDF document to extractor
extractor.BindPdf("Extract.pdf")
'Extract text from the PDF document
extractor.ExtractText()
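'Store the extracted text in a memory stream
Dim mem As New MemoryStream()
extractor.GetText(mem)
'Rewind the stream, then read all of the text into a string
mem.Seek(0, SeekOrigin.Begin)
Dim reader As New StreamReader(mem)
Dim [text] As String = reader.ReadToEnd()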
'Characters that separate one word from the next
Dim charsToSplit() As Char = {" "c, ":"c, ";"c, ","c, "."c, "-"c, ControlChars.Cr, ControlChars.Lf, ControlChars.Tab}
'Split the text on the separator characters, discarding empty entries
Dim tmp As String() = [text].Split(charsToSplit, StringSplitOptions.RemoveEmptyEntries)
'The number of resulting entries is the word count
Dim NoOfWords As Integer = tmp.Length
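The splitting step used above can also be isolated into a small helper, which makes it easy to check against plain strings before running it on text extracted from a PDF. The following is only a minimal sketch and is not part of Aspose.Pdf.Kit: the WordCounter class and its CountWords methods are hypothetical names introduced for illustration, and the separator set is simply copied from the example above. The stream overload assumes a seekable stream and takes the text encoding from the caller.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of Aspose.Pdf.Kit.
public static class WordCounter
{
    // Separator characters copied from the example above.
    private static readonly char[] Separators =
        { ' ', ':', ';', ',', '.', '-', '\r', '\n', '\t' };

    // Counts the non-empty entries left after splitting on the separators.
    public static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Reads a seekable stream from its beginning and counts the words in it.
    // Disposing the reader also closes the underlying stream.
    public static int CountWords(Stream stream, Encoding encoding)
    {
        stream.Seek(0, SeekOrigin.Begin);
        using (var reader = new StreamReader(stream, encoding))
        {
            return CountWords(reader.ReadToEnd());
        }
    }
}
```

With such a helper, the memory stream filled by GetText could be passed to the stream overload together with the encoding of the extracted text. The result depends entirely on the separator set, so the array may need extra punctuation characters for the language of the document.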