DOCX2HTM for DOS
-----------------------

This is a quick and dirty attempt to grab the "important" stuff from one of
Microsoft's .DOCX files. All necessary files are in the .zip package.

Note: Updated versions of DOCX2HTM.EXE, REL2DOC.EXE and TABL2DOC.EXE:
    http://www.ausreg.com/files/docx2htm/docx2htm.exe
    http://www.ausreg.com/files/docx2htm/rel2doc.exe
    http://www.ausreg.com/files/docx2htm/tabl2doc.exe

-----------------------------------------------------------------------------

Syntax:
    DOCX2HTM.EXE filename.doc(x)
    then: REL2DOC.EXE   (adds links to any graphics)
    then: TABL2DOC.EXE  (renders tables)

Requires:
Fold.exe, texrep.exe and either 7za.exe or unzip.exe, in the same directory
or on the path. docx2htm.dat MUST be in the same directory.

Use:
Place the offending .doc(x) file in the same directory and, from the
command line, type:
    docx2htm filename
    rel2doc
    tabl2doc

Action:
Docx2htm.exe uses 7za.exe or unzip.exe to unzip the in-file, with NO
directory structure. It calls texrep.exe and fold.exe to separate out the
XML tags and to shorten long lines, then goes through the file DOCUMENT.XML
line by line, editing some of the XML tags into usable HTML tags, and writes
them to the output file DOC2HTM.HTM.

Rel2doc.exe searches DOCUMENT.REL for graphic files and their ID-tags, then
places their links into DOC2HTM.HTM.

Tabl2doc.exe searches DOC2HTM.HTM for tables and attempts to transcode them
into HTML. It is mostly successful with simple tables, but is unable, at
this stage, to do a perfect job on tables within tables.

NOTE: For large input files, say 200,000 KB and larger, this process is
unacceptably slow. There may be 100,000 or more tags to check, change or
delete, so it is never going to be a quick job. However, improving the speed
is near the top of my todo list.

Further Development:
REL2DOC.EXE is a temporary kludge while this function is still being
developed. Once it is deemed stable, its code will be merged into
DOCX2HTM.EXE.
TABL2DOC.EXE: same as REL2DOC.EXE above.

The clarity of the derived HTM file depends on the way in which OOXML tags
are matched to HTML tags. At this stage, known XML tags and some font-size
tags are being processed. The number of these will increase as my
understanding of this strange language expands. The success (?) of this is
going to be limited by the range and complexity of this awful file format,
which includes a lot of stuff that really only has a place in word
processing, spreadsheets, and other such junk.

Ron Clarke
ron@ausreg.com
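
Appendix: a rough sketch of the Action steps

For readers who want to see the idea behind the Action section in code, here
is a minimal Python sketch. It is NOT the source of DOCX2HTM.EXE and makes no
claim about how the DOS programs work internally. The function names
(read_document_xml, split_tags, convert) and the handful of tags handled are
assumptions chosen for illustration. It only shows the general approach:
pull word/document.xml out of the zip, put each tag on its own, then keep
paragraph breaks, line breaks and text while dropping everything else.

```python
import re
import zipfile

# One token per tag or per run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")
# Captures the element name of an opening or closing tag, e.g. "w:p".
NAME = re.compile(r"</?\s*([\w:]+)")


def read_document_xml(docx):
    """Return the main document part of a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def split_tags(xml_text):
    """Separate tags and text runs, roughly one 'line' per item."""
    return TOKEN.findall(xml_text)


def convert(xml_text):
    """Map a few OOXML tags to HTML (illustrative subset) and drop the rest."""
    out = []
    in_text = False  # True while inside a <w:t> element
    for tok in split_tags(xml_text):
        if not tok.startswith("<"):
            # Text is kept only when it sits inside <w:t>...</w:t>.
            if in_text:
                out.append(tok)
            continue
        match = NAME.match(tok)
        if match is None:
            continue  # XML declaration, comments and similar
        name = match.group(1)
        closing = tok.startswith("</")
        self_closing = tok.endswith("/>")
        if name == "w:t":
            in_text = not closing and not self_closing
        elif name == "w:p":
            if self_closing:
                out.append("<p></p>")
            else:
                out.append("</p>" if closing else "<p>")
        elif name == "w:br":
            out.append("<br>")
        # Every other tag is silently deleted.
    return "".join(out)
```

Typical use would be something like convert(read_document_xml("report.docx")),
where report.docx is a placeholder file name. A regular expression is not a
real XML parser; it is used here only because it mirrors the tag-by-tag
editing described above. Bold, fonts, graphics and tables are left out
entirely.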