DOC to TXT and font size

Hi!

I want to extract the words (each word separately) from a DOC file, together with their font sizes, and write them to a TXT file. Can anyone help me?

Bisior.

You could save the document to HTML and extract the text and font sizes from that.
I’ve done something similar to extract the plain text from formatted paragraphs.
Example code:

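// Requires: using System.IO; using System.Text;
Document document = new Document(documentPath);
string html;
using (MemoryStream memory = new MemoryStream())
{
    // Save the document as HTML into memory instead of a file.
    document.Save(memory, SaveFormat.FormatHtml);

    // ToArray() returns only the bytes written; GetBuffer() also returns unused padding.
    // UTF-8 is assumed here; ASCII would turn non-ASCII characters into '?'.
    html = Encoding.UTF8.GetString(memory.ToArray());
}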

The font sizes will be in the style attributes of the span tags.
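To show how the font sizes could be pulled out of that HTML, here is a minimal sketch. It assumes the saved markup looks like <span style="...; font-size:12pt; ...">text</span> with sizes in points and no tags nested inside the span; the exact markup the library produces is not confirmed here. The SpanFontSizes class and its Extract method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class SpanFontSizes
{
    // Assumed markup: <span style="...; font-size:12pt; ...">some text</span>
    // Spans that contain nested tags are not matched by this simple pattern.
    static readonly Regex SpanPattern = new Regex(
        @"<span[^>]*style\s*=\s*""[^""]*font-size\s*:\s*(?<size>[\d.]+)pt[^""]*""[^>]*>(?<text>[^<]*)</span>",
        RegexOptions.IgnoreCase);

    // Returns one (word, font size in points) pair per word found in a styled span.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            // Turn entities such as &amp; back into plain characters.
            string text = WebUtility.HtmlDecode(m.Groups["text"].Value);

            // A null separator makes Split break on any whitespace.
            foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                result.Add(new KeyValuePair<string, string>(word, m.Groups["size"].Value));
        }
        return result;
    }
}
```

A regular expression is enough for a quick extraction like this; for arbitrary HTML a real parser would be the safer choice.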

Hi Bisior,
There’s another way to achieve this goal: simply use the IDocumentVisitor interface. Have a look at this code sample, which extracts the words and their font sizes from a .doc file and writes them to a .txt file, as you requested:


public void ExtractWordsAndSizes()
{
    Document doc = new Document("original.doc");
    WordExtractor extractor = new WordExtractor("words.txt");

    // The document calls back into the visitor for every element it contains.
    doc.Accept(extractor);
}



// Requires: using System; using System.IO;
class WordExtractor : IDocumentVisitor
{
    string fileName;
    StreamWriter writer;
    bool isExtracting;

    public WordExtractor(string fileName)
    {
        if (fileName == null)
            throw new ArgumentNullException("fileName");

        this.fileName = fileName;
    }

    public void DocumentStart(Document doc)
    {
        try
        {
            // Open the output file and write the table header.
            writer = new StreamWriter(fileName);
            writer.WriteLine("Word Font size");
            writer.WriteLine("--------------------------------------");
        }
        catch (IOException)
        {
            // Process error
        }
    }

    public void DocumentEnd()
    {
        // The writer is null if the output file could not be opened.
        if (writer == null)
            return;

        writer.Flush();
        writer.Close();
    }

    public void ParagraphStart(ParagraphFormat paragraphFormat)
    {
        isExtracting = true;
    }

    public void ParagraphEnd()
    {
        isExtracting = false;
    }

    public void RunOfText(Font font, string text)
    {
        // Only extract text that belongs to a paragraph, and only if the file is open.
        if (!isExtracting || writer == null)
            return;

        try
        {
            // Drop carriage returns, then split the run into words on spaces.
            string[] words = text.Replace("\r", "").Split(' ');

            foreach (string word in words)
            {
                if (word != String.Empty)
                    writer.WriteLine(word.PadRight(36) + font.Size);
            }
        }
        catch (IOException)
        {
            // Process error
        }
    }

    // Other interface methods were skipped
}
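The sample splits each run on the space character only, so tabs and other whitespace stay attached to the words. As a possible variation, the sketch below splits on any whitespace and builds the same padded output line. WordLines and Format are hypothetical helper names, not part of the library, and the font size is assumed to be a numeric value in points.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class WordLines
{
    // Splits a run's text on any whitespace and formats one output line per word.
    public static List<string> Format(string runText, double fontSize)
    {
        var lines = new List<string>();

        // A null separator makes Split break on spaces, tabs, and line breaks alike.
        string[] words = runText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Invariant culture keeps the decimal point stable across machines.
        string size = fontSize.ToString(CultureInfo.InvariantCulture);

        foreach (string word in words)
            lines.Add(word.PadRight(36) + size);

        return lines;
    }
}
```

Inside RunOfText, each returned line could then be passed to writer.WriteLine. Note that a run is a stretch of uniformly formatted text, so a word whose formatting changes partway through will arrive as two separate runs.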



Thank you very much, Dmitry. Your solution works very well.