
Vladimir Averkin

  • Funny: How the new powerful cryptography implemented in Word 2007 turns it into a perfect tool for document password removal.

    Today I will talk a bit about Word 2007, its new document format DOCX, and its not very clever implementation of an old MS Word feature called Document Protection.

    So what is document protection, essentially? Invoked via Review | Protect Documents | Restrict Formatting and Editing in MS Word 2007, it allows you to set different types of protection, including Read-only, Tracked changes, Comments and Filling in forms. It is very convenient for preventing undesired changes to a document template: it lets users edit only certain things in the document or, in read-only mode, disallows editing altogether. It comes as a feature of the DOCX format. In the \word\settings.xml package part it looks like this:

    <w:documentProtection w:edit="forms" w:enforcement="1" w:cryptProviderType="rsaFull" w:cryptAlgorithmClass="hash" w:cryptAlgorithmType="typeAny" w:cryptAlgorithmSid="4" w:cryptSpinCount="50000" w:hash="0AMSgIVdSif6F5unNC/Lk3rBvr4=" w:salt="m3sJnUyPgf0hUjz+U1Sdxg=="/>

    Of course, this type of document protection can be easily stripped away by removing the line above from the Settings package part. So it is just a safeguard against fools, meant to prevent unintended, accidental changes to documents. That is in fact explicitly stated in the Office Open XML spec:

    Document protection is a set of restrictions used to prevent unintentional changes to all or part of a WordprocessingML document - since this protection does not encrypt the document, malicious applications may circumvent its use. This protection is not intended as a security feature and may be ignored.

    This document protection can be enforced with a password, but again, that is just a safeguard against fools. As the line that sets the password is exposed in plain text in WordML and RTF, you can easily remove the protection if you know what to search for and where.
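The removal described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the element is self-closing as in the sample shown earlier, the settings part is UTF-8 encoded, and the function names `strip_document_protection` and `unprotect_docx` are made up for this example.

```python
import io
import re
import zipfile

# Matches a self-closing <w:documentProtection .../> element with any attributes.
_PROTECTION_RE = re.compile(r"<w:documentProtection\b[^>]*/>")


def strip_document_protection(settings_xml: str) -> str:
    """Return the settings XML with any <w:documentProtection/> element removed."""
    return _PROTECTION_RE.sub("", settings_xml)


def unprotect_docx(data: bytes) -> bytes:
    """Copy a DOCX package held in memory, rewriting only word/settings.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == "word/settings.xml":
                # Assumption: the part is UTF-8 encoded.
                text = payload.decode("utf-8")
                payload = strip_document_protection(text).encode("utf-8")
            dst.writestr(item, payload)
    return out.getvalue()
```

All other package parts are copied unchanged, so the document content itself is not touched.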

    And now let's look at the DOCX implementation. Have you noticed these impressive cryptography attributes? They offer 11 built-in hashing algorithms with the possibility of extending them to an infinite number, using an arbitrary number of spins (hashing rounds) and a salt (an extra string initially added to the password) to make it even more secure. Just look at what they write about the need to extend the set of hashing algorithms:

    This extensibility affords the fact that with exponentially increasing computing power, documents created in the future will likely need to utilize as yet undefined hashing algorithms in order to remain secure.

    Isn't it (mmm, how to say it mildly) not very clever to use this cryptographic mumbo jumbo to protect things that essentially cannot be protected? But the funniest thing is still ahead.

    In MS Word 2003 they used a simple hashing algorithm for it and everything was fine. In Word 2003 XML it looked like this:

    <w:documentProtection w:edit="forms" w:enforcement="on" w:unprotectPassword="64CEED7E" />

    Of course, you could still go on and remove that line to strip away the protection. But who cares? And that at least required some special knowledge from the person who decided to do this.

    In Word 2007 it was decided that the hashing algorithm was not secure enough. (Bwahaha! As if it was ever meant to be secure.) So all this fearsome cryptography stuff appeared. But to maintain compatibility with the previous version, the old hash stayed. The new hashing is layered on top of the old hash: the old hash, calculated and represented as a hex string like "7EEDCE64", is used as the source for a secondary, very cryptographic and very secure, hashing.
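The layering described above can be illustrated with a minimal Python sketch. It is not a reproduction of what Word computes. SHA-1 is used on the assumption that w:cryptAlgorithmSid="4" denotes it. The UTF-16 encoding of the hex string and the way the round counter is mixed in are also assumptions, so the result is not expected to match the w:hash value in the sample. The name `layered_hash` is made up for this example.

```python
import base64
import hashlib


def layered_hash(legacy_hash_hex: str, salt: bytes, spin_count: int) -> str:
    """Salted, iterated SHA-1 over the legacy hash string, returned as Base64."""
    # Initial round: hash the salt followed by the legacy hash string.
    # Assumption: the hex string is fed in as UTF-16LE text.
    digest = hashlib.sha1(salt + legacy_hash_hex.encode("utf-16-le")).digest()
    # Spin rounds: rehash the previous digest together with a round counter.
    # Assumption: the counter is appended as a 4-byte little-endian integer.
    for i in range(spin_count):
        digest = hashlib.sha1(digest + i.to_bytes(4, "little")).digest()
    return base64.b64encode(digest).decode("ascii")


# Example call with the salt and spin count from the sample element:
# layered_hash("7EEDCE64", base64.b64decode("m3sJnUyPgf0hUjz+U1Sdxg=="), 50000)
```

Whatever the exact byte layout, the input to the new hashing is the old 32-bit hash, not the password itself.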

    But what does hashing mean? From a hash you can never recover the source string, because a hash is an incomplete reflection of the initial information, like a checksum or CRC code. It cannot be used to restore the hashed string. So with only the secondary hash stored in a DOCX document, you will never be able to find out the primary hash, which is what the Word 2003 formats store. And that means that the password hash stored in Word 2007 format can never be reversed to its original form and will inevitably be lost when the document is saved to the old format.

    So with all this wise and clever cryptography implemented, you no longer need any 3rd party editors or tools to remove passwords from a document. With MS Word 2007 you can remove the document password in seconds just by saving the document to DOC format, closing Word and then opening the saved DOC file again. You can also use Word 2007 to remove the password from the old DOC format. For this you need to open the DOC file in Word 2007, save it as DOCX, close Word, open the file again and repeat the steps described above.

    How all that makes sense and why all this stuff was developed and implemented is unclear. Maybe it was intended to be used later, with Information Rights Management software. I have not played with this technology yet, but at first glance it does not seem a very convenient option, as all verification is done server-side.

  • WordML export is now supported in Aspose.Words.

    Hello Everybody,

     

    This is my first entry in this blog, so let me introduce myself. My name is Vladimir Averkin, and I am a development and support engineer in the Aspose Auckland team. Besides providing all kinds of technical support for Aspose.Words users on the forums, my primary responsibility is the development of the XML-related features of Aspose.Words.

     

    Today I would like to introduce the first major result of my development efforts: the brand new Aspose.Words DOC to WordML conversion capability.

     

    Some history first.

     

    WordML is yet another MS Word document format. It was first introduced in MS Word 2003 and was intended to be the first truly open and fully documented format for Word documents. The drive for open and interchangeable document standards began in the late 90s, and Microsoft has been continuously criticized for the closed nature of MS Word formats ever since.

     

    MS already had an open and decently documented format by this time: RTF. But it had several major drawbacks. It was still a proprietary, corporate standard, which means that it was created and maintained inside Microsoft and could be changed or updated by MS at any time without anyone else's approval. An even more serious drawback was that DOC to RTF conversion was not lossless. Some formatting was irreversibly lost when converting to RTF. The third problem was that RTF required a special reader/writer, and creating one took quite a large development effort from any implementer. Still, it was a very good effort for its time and RTF eventually became quite popular in document processing applications.

     

    So WordML became a second attempt at creating an open document format. Its creation began when the industry's inclination toward XML had become quite evident, and the choice of XML markup for the new format was natural and well-founded. XML handling was already natively supported on many platforms, and XSLT, the technology for processing XML content, had already become quite popular. Also, from the very beginning WordML was intended to support lossless conversion from DOC. All features existing in the DOC format were to be built into WordML. That was not an easy task at the time, as the DOC format had already become very large and complex.

     

    Microsoft was quite successful in the task of creating a new lossless format, based on widely adopted XML and suitable for XSLT processing. The task of creating a well-documented and MS-independent format was not fulfilled, though. The documentation for WordML is largely incomplete. Document properties, styles, lists, paragraphs and tables are all described, although very briefly, and the examples and usage recommendations are scarce. Graphics remained absolutely undocumented. Except for the reference to the VML description in MSDN there is no documentation for graphics at all. Even the XML schema is incomplete. For example, the XML elements for charting are entirely missing even from the WordML XML schema. The reference to the VML description in MSDN helps, but not much, as VML in WordML is wildly different from the VML described in that article. Some constructs look similar enough, but in no way could this be regarded as a reliable documentation source.

     

    There were also large chunks of document data that MS found difficult or unnecessary to translate to XML form, and they were included in WordML as is, as big piles of Base64-encoded data. That included embedded OLE objects, VBA macros and OCX controls, which could probably be left behind during document processing. But that also included all imaging data, which was tightly and sometimes incoherently packed inside ‘w:bindata’ chunks with no documentation or tip on how to extract it from there. All in all, that meant that WordML was suitable enough for processing formatted text data using conventional XML+XSLT means but was entirely unsuitable for handling any graphics or OLE information stored inside the document.
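The Base64 layer of those chunks, at least, can be peeled off with standard tools. Here is a minimal Python sketch of that first step. It assumes the element is spelled w:binData, carries a w:name attribute, and lives in the WordprocessingML 2003 namespace. It says nothing about what the decoded bytes contain or whether they form a usable image. The function name `extract_bin_data` is made up for this example.

```python
import base64
import xml.etree.ElementTree as ET

# Assumption: this is the namespace bound to the w: prefix in WordML 2003 files.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_bin_data(wordml):
    """Map each w:binData element's w:name to its Base64-decoded bytes."""
    root = ET.fromstring(wordml)
    blobs = {}
    for index, node in enumerate(root.iter(f"{{{W_NS}}}binData")):
        # Fall back to a positional key when the w:name attribute is absent.
        name = node.get(f"{{{W_NS}}}name", f"blob{index}")
        # b64decode ignores the line breaks Word inserts into the payload.
        blobs[name] = base64.b64decode(node.text or "")
    return blobs
```

Each decoded blob could then be written to disk or inspected further, but making sense of its contents is exactly the part that is undocumented.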

     

    All of that contributes to the fact that there are no reliable tools that can render complex WordML files with decent fidelity or convert them to other formats. The only instrument that did that job well enough was MS Word 2003. But even MS Word 2003 fails to render some of its own WordML files correctly. Try, for example, converting the Microsoft RTF specification document into WordML and then closing and reopening it in MS Word. You will see some serious table rendering problems there. The good news is that these problems seem to be corrected in the new MS Word 2007.

     

    With all of the above said, we are extremely proud to announce that Aspose.Words now features the first full-scale non-Microsoft DOC to WordML conversion. To try it for yourself you can download the new Aspose.Words 4.0 from here, install it and run the DocumentExplorer demo source-code project, which provides, among other features, the ability to open DOC files and save them in WordML.

     

    We have thoroughly checked the conversion quality on several hundred tests, and we have put major effort into ensuring not only that the resulting WordML files look in MS Word exactly like the original documents, but also that the inner XML itself matches the one created by MS Word almost exactly.

     

    With this new and exciting capability Aspose.Words can be used in existing WordML-based document processing solutions as a preprocessor for incoming DOC files, thus eliminating the need for MS Word automation.

     

    The conversion is also extremely fast and can crunch about 2-3 megabytes of complex, mixed-content documents per second.

     

    I will be extremely happy if you try this new WordML export capability for yourself and let me know what you think. Please be sure to report any bugs or deficiencies you find. Your feedback on this is highly welcome!