What is Word XML format?

From Wikipedia, the free encyclopedia

WordProcessingML

Filename extension: .XML (XML document)
Developed by: Microsoft
Type of format: Document file format
Extended from: XML, DOC

DataDiagramingML

Filename extensions: .VDX (XML Drawing), .VSX (XML Stencil), .VTX (XML Template)
Developed by: Microsoft
Type of format: Diagramming vector graphics
Extended from: XML, VSD, VSS, VST

SpreadsheetML

Filename extension: .XML (XML Spreadsheet)
Developed by: Microsoft
Type of format: Spreadsheet
Extended from: XML, XLS

The Microsoft Office XML formats are XML-based document formats (or XML schemas) introduced in versions of Microsoft Office prior to Office 2007. Microsoft Office XP introduced a new XML format for storing Excel spreadsheets and Office 2003 added an XML-based format for Word documents.

These formats were succeeded by Office Open XML (ECMA-376) in Microsoft Office 2007.

File formats

  • Microsoft Office Word 2003 XML Format — WordProcessingML or WordML (.XML)
  • Microsoft Office Excel 2002 and Excel 2003 XML Format — SpreadsheetML (.XML)
  • Microsoft Office Visio 2003 XML Format — DataDiagramingML (.VDX, .VSX, .VTX)
  • Microsoft Office InfoPath 2003 XML Format — XML FormTemplate (.XSN) (Compressed XML templates in a Cabinet file)

Limitations and differences with Office Open XML

Besides differences in the schema, there are several other differences between the earlier Office XML schema formats and Office Open XML.

  • Whereas the data in Office Open XML documents is stored in multiple parts and compressed in a ZIP file conforming to the Open Packaging Conventions, Microsoft Office XML formats are stored as plain, single, monolithic XML files (making them quite large compared to OOXML and the Microsoft Office legacy binary formats). Embedded items such as pictures are stored as binary-encoded blocks within the XML. In Office Open XML, by contrast, parts such as the headers, footers, and comments of a document are stored separately.
  • XML Spreadsheet documents cannot store Visual Basic for Applications macros, auditing tracer arrows, chart and other graphic objects, custom views, drawing object layers, outlining, scenarios, shared workbook information and user-defined function categories.[1] In contrast, the newer Office Open XML formats support full document fidelity.
  • Poor backward compatibility with the version of Word/Excel prior to the one in which they were introduced. For example, Word 2002 cannot open Word 2003 XML files unless a third-party converter add-in is installed.[2] Microsoft has released a Word 2003 XML Viewer which allows WordProcessingML files saved by Word 2003 to be viewed as HTML from within Internet Explorer.[3] For Office Open XML, Microsoft provides converters for Office 2003, Office XP and Office 2000.
  • Office Open XML formats are also defined for PowerPoint 2007, equation editing (Office MathML), vector drawing, charts and text art (DrawingML).
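
The packaging difference described above can be checked programmatically. The following is a minimal Python sketch, not an official detection method: it treats a ZIP archive containing a [Content_Types].xml part as an Office Open XML package and otherwise looks at the root element of a single XML file. The function name classify_office_file and the returned labels are illustrative assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Root elements of the flat, single-file formats shown in this article.
FLAT_ROOTS = {
    "{http://schemas.microsoft.com/office/word/2003/wordml}wordDocument": "WordProcessingML",
    "{urn:schemas-microsoft-com:office:spreadsheet}Workbook": "SpreadsheetML",
}


def classify_office_file(data: bytes) -> str:
    """Classify raw file bytes (hypothetical helper, not part of any library)."""
    stream = io.BytesIO(data)
    # Office Open XML: a ZIP package that carries a [Content_Types].xml part.
    if zipfile.is_zipfile(stream):
        with zipfile.ZipFile(stream) as package:
            if "[Content_Types].xml" in package.namelist():
                return "Office Open XML package"
        return "unknown"
    # Earlier formats: one monolithic XML file, identified by its root element.
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "unknown"
    return FLAT_ROOTS.get(root.tag, "unknown")
```

Reading a file in binary mode and passing its bytes to classify_office_file is enough to tell the two families apart; anything else is reported as "unknown".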

Word XML format example

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
   xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
   xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
   xmlns:o="urn:schemas-microsoft-com:office:office"
   w:macrosPresent="no"
   w:embeddedObjPresent="no"
   w:ocxPresent="no"
   xml:space="preserve">
  <o:DocumentProperties>
    <o:Title>This is the title</o:Title>
    <o:Author>Darl McBride</o:Author>
    <o:LastAuthor>Bill Gates</o:LastAuthor>
    <o:Revision>1</o:Revision>
    <o:TotalTime>0</o:TotalTime>
    <o:Created>2007-03-15T23:05:00Z</o:Created>
    <o:LastSaved>2007-03-15T23:05:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>6</o:Words>
    <o:Characters>40</o:Characters>
    <o:Company>SCO Group, Inc.</o:Company>
    <o:Lines>1</o:Lines>
    <o:Paragraphs>1</o:Paragraphs>
    <o:CharactersWithSpaces>45</o:CharactersWithSpaces>
    <o:Version>11.6359</o:Version>
  </o:DocumentProperties>
  <w:fonts>
    <w:defaultFonts
       w:ascii="Times New Roman"
       w:fareast="Times New Roman"
       w:h-ansi="Times New Roman"
       w:cs="Times New Roman" />
  </w:fonts>

  <w:styles>
    <w:versionOfBuiltInStylenames w:val="4" />
    <w:latentStyles w:defLockedState="off" w:latentStyleCount="156" />
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal" />
      <w:rPr>
        <wx:font wx:val="Times New Roman" />
        <w:sz w:val="24" />
        <w:sz-cs w:val="24" />
        <w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA" />
      </w:rPr>
    </w:style>
    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:name w:val="heading 1" />
      <wx:uiName wx:val="Heading 1" />
      <w:basedOn w:val="Normal" />
      <w:next w:val="Normal" />
      <w:rsid w:val="00D93B94" />
      <w:pPr>
        <w:pStyle w:val="Heading1" />
        <w:keepNext />
        <w:spacing w:before="240" w:after="60" />
        <w:outlineLvl w:val="0" />
      </w:pPr>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial" />
        <wx:font wx:val="Arial" />
        <w:b />
        <w:b-cs />
        <w:kern w:val="32" />
        <w:sz w:val="32" />
        <w:sz-cs w:val="32" />
      </w:rPr>
    </w:style>
    <w:style w:type="character" w:default="on" w:styleId="DefaultParagraphFont">
      <w:name w:val="Default Paragraph Font" />
      <w:semiHidden />
    </w:style>
    <w:style w:type="table" w:default="on" w:styleId="TableNormal">
      <w:name w:val="Normal Table" />
      <wx:uiName wx:val="Table Normal" />
      <w:semiHidden />
      <w:rPr>
        <wx:font wx:val="Times New Roman" />
      </w:rPr>
      <w:tblPr>
        <w:tblInd w:w="0" w:type="dxa" />
        <w:tblCellMar>
          <w:top w:w="0" w:type="dxa" />
          <w:left w:w="108" w:type="dxa" />
          <w:bottom w:w="0" w:type="dxa" />
          <w:right w:w="108" w:type="dxa" />
        </w:tblCellMar>
      </w:tblPr>
    </w:style>
    <w:style w:type="list" w:default="on" w:styleId="NoList">
      <w:name w:val="No List" />
      <w:semiHidden />
    </w:style>
  </w:styles>
  <w:docPr>
    <w:view w:val="print" />
    <w:zoom w:percent="100" />
    <w:doNotEmbedSystemFonts />
    <w:proofState w:spelling="clean" w:grammar="clean" />
    <w:attachedTemplate w:val="" />
    <w:defaultTabStop w:val="720" />
    <w:punctuationKerning />
    <w:characterSpacingControl w:val="DontCompress" />
    <w:optimizeForBrowser />
    <w:validateAgainstSchema />
    <w:saveInvalidXML w:val="off" />
    <w:ignoreMixedContent w:val="off" />
    <w:alwaysShowPlaceholderText w:val="off" />
    <w:compat>
      <w:breakWrappedTables />
      <w:snapToGridInCell />
      <w:wrapTextWithPunct />
      <w:useAsianBreakRules />
      <w:dontGrowAutofit />
    </w:compat>
  </w:docPr>
  <w:body>
    <wx:sect>
      <w:p>
        <w:r>
          <w:t>This is the first paragraph</w:t>
        </w:r>
      </w:p>
      <wx:sub-section>
        <w:p>
          <w:pPr>
            <w:pStyle w:val="Heading1" />
          </w:pPr>
          <w:r>
            <w:t>This is a heading</w:t>
          </w:r>
        </w:p>
        <w:sectPr>
          <w:pgSz w:w="12240" w:h="15840" />
          <w:pgMar w:top="1440"
                   w:right="1800"
                   w:bottom="1440"
                   w:left="1800"
                   w:header="720"
                   w:footer="720"
                   w:gutter="0" />
          <w:cols w:space="720" />
          <w:docGrid w:line-pitch="360" />
        </w:sectPr>
      </wx:sub-section>
    </wx:sect>
  </w:body>
</w:wordDocument>
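
Because the whole document is one XML file, its text can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree and assumes the Word 2003 namespace from the example above; the helper name wordml_paragraphs and the shortened SAMPLE document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w" prefix in the Word 2003 example above.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Shortened copy of the example document (body only), for illustration.
SAMPLE = (
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">'
    "<w:body><wx:sect>"
    "<w:p><w:r><w:t>This is the first paragraph</w:t></w:r></w:p>"
    '<wx:sub-section><w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>This is a heading</w:t></w:r></w:p></wx:sub-section>"
    "</wx:sect></w:body></w:wordDocument>"
)


def wordml_paragraphs(xml_text):
    """Return the text of every w:p element, in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is the concatenation of its w:t elements.
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


print(wordml_paragraphs(SAMPLE))
```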

Excel XML spreadsheet example

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="https://www.w3.org/TR/html401/">
<Worksheet ss:Name="CognaLearn+Intedashboard">
<Table>
<Column ss:Index="1" ss:AutoFitWidth="0" ss:Width="110"/>
<Row>
<Cell><Data ss:Type="String">ID</Data></Cell>
<Cell><Data ss:Type="String">Project</Data></Cell>
<Cell><Data ss:Type="String">Reporter</Data></Cell>
<Cell><Data ss:Type="String">Assigned To</Data></Cell>
<Cell><Data ss:Type="String">Priority</Data></Cell>
<Cell><Data ss:Type="String">Severity</Data></Cell>
<Cell><Data ss:Type="String">Reproducibility</Data></Cell>
<Cell><Data ss:Type="String">Product Version</Data></Cell>
<Cell><Data ss:Type="String">Category</Data></Cell>
<Cell><Data ss:Type="String">Date Submitted</Data></Cell>
<Cell><Data ss:Type="String">OS</Data></Cell>
<Cell><Data ss:Type="String">OS Version</Data></Cell>
<Cell><Data ss:Type="String">Platform</Data></Cell>
<Cell><Data ss:Type="String">View Status</Data></Cell>
<Cell><Data ss:Type="String">Updated</Data></Cell>
<Cell><Data ss:Type="String">Summary</Data></Cell>
<Cell><Data ss:Type="String">Status</Data></Cell>
<Cell><Data ss:Type="String">Resolution</Data></Cell>
<Cell><Data ss:Type="String">Fixed in Version</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="Number">0000033</Data></Cell>
<Cell><Data ss:Type="String">CognaLearn Intedashboard</Data></Cell>
<Cell><Data ss:Type="String">janardhana.l</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">normal</Data></Cell>
<Cell><Data ss:Type="String">text</Data></Cell>
<Cell><Data ss:Type="String">always</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">GUI</Data></Cell>
<Cell><Data ss:Type="String">2016-10-14</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">public</Data></Cell>
<Cell><Data ss:Type="String">2016-10-14</Data></Cell>
<Cell><Data ss:Type="String">IE8 browser_Modules screen tool tip text is shown twice</Data></Cell>
<Cell><Data ss:Type="String">new</Data></Cell>
<Cell><Data ss:Type="String">open</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
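
A spreadsheet in this format can be read back row by row with a few lines of code. The following Python sketch is one possible approach under simplifying assumptions: it ignores the ss:Index attribute that can skip cells, and it converts only cells typed as Number. The function name read_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Default namespace of the Workbook element in the example above.
SS = "urn:schemas-microsoft-com:office:spreadsheet"


def read_rows(xml_data):
    """Return the worksheet rows as lists of cell values."""
    root = ET.fromstring(xml_data)
    rows = []
    for row in root.iter(f"{{{SS}}}Row"):
        values = []
        for cell in row.findall(f"{{{SS}}}Cell"):
            data = cell.find(f"{{{SS}}}Data")
            text = (data.text or "") if data is not None else ""
            # ss:Type tells the reader how to interpret the cell text.
            if data is not None and data.get(f"{{{SS}}}Type") == "Number" and text:
                values.append(float(text))
            else:
                values.append(text)
        rows.append(values)
    return rows
```

Applied to the example above, the first row yields the column headings as strings and the second row starts with the issue ID as a number.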

See also

  • List of document markup languages
  • Comparison of document markup languages

References

  1. ^ "Features and limitations of XML Spreadsheet format". Archived from the original on 2007-10-09. Retrieved 2007-11-01.
  2. ^ "Polar WordML add-in". Archived from the original on 2009-04-11. Retrieved 2007-11-01.
  3. ^ Word 2003 XML Viewer
  • Overview of Office 2003 Developer Technologies
  • Office 2003 XML. ISBN 0-596-00538-5

External links

  • MSDN: XML Spreadsheet Reference
  • MSDN: Word 2003 XML Reference
  • Lawsuit about XML patent

Contents

  • 1 Should I remove custom XML data in word?
  • 2 What does removing custom XML data do?
  • 3 Where can I find XML data in a word document?
  • 4 How do I use XML in word?
  • 5 What is XML used for?
  • 6 How do you remove XML from a word document?
  • 7 Can Word documents be traced?
  • 8 How do I disinfect a word document?
  • 9 How do you clean up a word document?
  • 10 Does Microsoft Office use XML?
  • 11 What is word custom XML data?
  • 12 Can you convert XML to word?
  • 13 Is XML a software?
  • 14 How does XML describe data?
  • 15 What is XML with example?
  • 16 What are the benefits of using XML?
  • 17 How do I get rid of custom XML in Word?
  • 18 How do I remove metadata from Word 2020?
  • 19 How do I turn off metadata in Word?
  • 20 Can you see who has opened a Word document?

Should I remove custom XML data in word?

After Microsoft was required to drop the custom XML feature, any version of Word that opens a document in which custom XML is already present must remove the custom XML from the file. The removal has nothing to do with safety, so the bottom line is that custom XML in a Word document is not a security risk to anyone.

What does removing custom XML data do?

Microsoft produced (against its will) a version of Office 2007 without the Custom XML feature. This means a document with Custom XML code can still be read by Office, but as soon as you save the document, all the Custom XML code is removed.

Where can I find XML data in a word document?

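A .docx file is a ZIP archive, so first open it as one (for example, by changing the file name extension to .zip). Double-click the folder you wish to inspect (for example, word), then double-click the file you wish to inspect (for example, document.xml). The file last selected should now appear in an Internet Explorer tab.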
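
The same inspection can be scripted. This is a minimal Python sketch using the standard zipfile module; the helper names list_parts and read_part are hypothetical, and the customXml/ folder mentioned in the usage note is where Word typically keeps custom XML parts, which is an assumption you should verify against your own files.

```python
import io
import zipfile


def list_parts(docx_bytes, prefix=""):
    """Return the names of the package parts that start with the given folder prefix."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(name for name in package.namelist() if name.startswith(prefix))


def read_part(docx_bytes, name="word/document.xml"):
    """Return one part of the package as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")
```

For example, after reading a document's bytes with open(path, "rb").read(), calling list_parts(data, "customXml/") lists any custom XML parts, and read_part(data) returns the main document XML.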

How do I use XML in word?

  1. Open Windows Explorer and browse to the location of the XML file.
  2. Right-click the file and select Open With, then choose Notepad or Microsoft Office Word from the list of available programs to open the XML file.

What is XML used for?

The Extensible Markup Language (XML) is a simple text-based format for representing structured information: documents, data, configuration, books, transactions, invoices, and much more. It was derived from an older standard format called SGML (ISO 8879), in order to be more suitable for Web use.

How do you remove XML from a word document?

  1. In the XML Structure pane, select the Show XML tags in the document check box.
  2. In the document, rest the mouse pointer on a Start of Tag Name or End of Tag Name tag.
  3. Right-click, and then click Remove Tag Name tag to remove the tag without deleting its content. Note: Each element has a start tag and an end tag.

Can Word documents be traced?

You can set Word for the Web to track changes for all users who are collaborating on the document or to track just your changes. On the Review tab, go to Tracking. In the Track Changes drop-down list, do one of the following: To track only the changes you make to the document, select Just Mine.

How do I disinfect a word document?

Word Document Sanitization Basic Procedure

  1. Create a copy of the document.
  2. Turn off reviewing features and remove associated data.
  3. Review and delete sensitive content.
  4. Check redacted content and run document inspector.
  5. Verify Acrobat conversion settings and convert.

How do you clean up a word document?

Clean metadata in Microsoft Word

  1. Open the Microsoft Word document you would like to clean.
  2. Click the DocsCorp tab in the toolbar.
  3. Click ‘Clean’ in the cleanDocs section.

Does Microsoft Office use XML?

Starting with the 2007 Microsoft Office system, Microsoft Office uses the XML-based file formats, such as .docx, .xlsx, and .pptx. These formats and file name extensions apply to Microsoft Word, Microsoft Excel, and Microsoft PowerPoint.

What is word custom XML data?

Custom XML parts were introduced in Microsoft Word 2007 along with the Open XML formats. A custom XML part of a Word document is used to store custom data, and, not surprisingly, the data format is XML. Custom controls are a set of individual objects used to control and customize the content of the document.

Can you convert XML to word?

Yes. Using the “Open” dialog, Word can open any xml file you want. Then using the “Save As” dialog, you can save the result to a Word docx.

Is XML a software?

XML is a software- and hardware-independent tool for storing and transporting data.

How does XML describe data?

XML can use a Document Type Definition (DTD) to describe the data: a DTD defines and constrains the structure of a document. The Extensible Stylesheet Language (XSL) is used to transform and render the XML document.

What is XML with example?

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML

Filename extension: .xml
Developed by: World Wide Web Consortium
Type of format: Markup language
Extended from: SGML
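
As a small illustration of "human-readable and machine-readable", the following Python sketch parses a made-up XML note with the standard library. The element names (note, to, body) and the date attribute are invented for this example and do not come from any Office format.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML document: one root element, one attribute, two children.
NOTE = """<note date="2007-03-15">
  <to>Reader</to>
  <body>XML stores data as nested, named elements.</body>
</note>"""

root = ET.fromstring(NOTE)

# Each child element becomes a (tag, text) pair.
summary = {child.tag: child.text for child in root}

print(root.tag, root.get("date"))
print(summary)
```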

What are the benefits of using XML?

Advantages of XML

  • XML uses human, not computer, language. XML is readable and understandable, even by novices, and no more difficult to code than HTML.
  • XML is completely compatible with Java™ and 100% portable. Any application that can process XML can use your information, regardless of platform.
  • XML is extendable.

How do I get rid of custom XML in Word?

Click File > Info > Remove Personal Information. Click the Personal Information tab. Select the Remove these items from the document check box.

How do I remove metadata from Word 2020?

Removing Metadata From Word Using a Mac

  1. Open the file you would like to remove metadata from.
  2. Click on the “Tools” menu and select the “Protect Document” option.
  3. In the “Protect Document” window check the box next to “Remove personal information from this file on save”
  4. Finish working on your document and then save.

How do I turn off metadata in Word?

To view the Personal Information click on Show All Properties to the right. In Office 2007 click on the Office Button, Prepare and then Inspect Document. To view the Personal Information before removing it click on Prepare and then Document Properties. If Word finds metadata, it will prompt you to Remove All.

Can you see who has opened a Word document?

Click the File tab of the ribbon and then click Info | Properties | Advanced Properties. Word displays the Properties dialog box. The dialog box then displays the statistics for your document. Click OK when you are done reviewing the statistics.

What is Microsoft Word XML format?

Office Open XML (also informally known as OOXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by the Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.

Can I create an XML file in word?

Creating sophisticated, multi-purpose XML files can involve highly technical processes that are designed by experienced systems analysts and application developers. However, with Word 2007, anyone can participate in these processes by creating a Word document and then saving it as an XML file.

What is an Office Open XML file?

Office Open XML, also known as OpenXML or OOXML, is an XML-based format for office documents, including word processing documents, spreadsheets, presentations, as well as charts, diagrams, shapes, and other graphical material (file name extensions .docx, .xlsx, and .pptx).

What is Excel XML file?

Excel has a defined XML schema that defines the contents of an Excel workbook, including XML tags that store all workbook information, such as data and properties, and define the overall structure of the workbook. Custom applications can use this Excel macro-enabled Office XML Format File.

How do I add an XML file to Word?

To add an XMLNode control to a document:

  1. If the Developer tab is not visible, you must first show it. For more information, see How to: Show the developer tab on the ribbon.
  2. In the XML group, click Schema. The Templates and Add-ins dialog box opens.

How do I open a XML file?

XML files are encoded in plain text, so you can open them in any text editor and read them clearly. Right-click the XML file and select “Open With.” This will display a list of programs to open the file in. Select “Notepad” (Windows) or “TextEdit” (Mac).

Is XML Excel format?

Starting with the 2007 Microsoft Office system, Microsoft Office uses the XML-based file formats, such as .docx, .xlsx, and .pptx. These formats and file name extensions apply to Microsoft Word, Microsoft Excel, and Microsoft PowerPoint. This article discusses key benefits of the format, describes the file name extensions and discusses how you can share Office files with people who are using earlier versions of Office.

In this article

What are the benefits of Open XML Formats?

What are the XML file name extensions?

Can different versions of Office share the same files?

What are the benefits of Open XML Formats?

The Open XML Formats include many benefits — not only for developers and the solutions that they build, but also for individual people and organizations of all sizes:

  • Compact files     Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip compression technology to store documents, offering potential cost savings as it reduces the disk space required to store files and decreases the bandwidth needed to send files via e-mail, over networks, and across the Internet. When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again. You do not have to install any special zip utilities to open and close files in Office.

  • Improved damaged-file recovery     Files are structured in a modular fashion that keeps different data components in the file separate from each other. This allows files to be opened even if a component within the file (for example, a chart or table) is damaged or corrupted.

  • Support for advanced features     Many of the advanced features of Microsoft 365 require the document to be stored in the Open XML format. AutoSave and the Accessibility Checker, to give two examples, can only work on files that are stored in the modern Open XML format.

  • Better privacy and more control over personal information     Documents can be shared confidentially, because personally identifiable information and business-sensitive information, such as author names, comments, tracked changes, and file paths can be easily identified and removed by using Document Inspector.

  • Better integration and interoperability of business data     Using Open XML Formats as the data interoperability framework for the Office set of products means that documents, worksheets, presentations, and forms can be saved in an XML file format that is freely available for anyone to use and to license, royalty free. Office also supports customer-defined XML Schemas that enhance the existing Office document types. This means that customers can easily unlock information in existing systems and act upon it in familiar Office programs. Information that is created within Office can be easily used by other business applications. All you need to open and edit an Office file is a ZIP utility and an XML editor.

  • Easier detection of documents that contain macros     Files that are saved by using the default "x" suffix (such as .docx, .xlsx, and .pptx) cannot contain Visual Basic for Applications (VBA) macros and XLM macros. Only files whose file name extension ends with an "m" (such as .docm, .xlsm, and .pptm) can contain macros.
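
The naming rule in the last point lends itself to a simple check. The sketch below is only an illustration of that rule, using the extensions listed in this article; the function name macro_policy and its return strings are assumptions, and a file name extension alone proves nothing about the actual file contents.

```python
from pathlib import PurePath

# Extensions taken from the Word, Excel, and PowerPoint tables in this article.
MACRO_FREE = {".docx", ".dotx", ".xlsx", ".xltx", ".pptx", ".potx", ".ppsx", ".sldx"}
MACRO_ENABLED = {
    ".docm", ".dotm", ".xlsm", ".xltm", ".xlam",
    ".pptm", ".potm", ".ppam", ".ppsm", ".sldm",
}


def macro_policy(filename):
    """Say what the file name extension implies about macros."""
    extension = PurePath(filename).suffix.lower()
    if extension in MACRO_ENABLED:
        return "may contain macros"
    if extension in MACRO_FREE:
        return "macro-free"
    # Binary or unrelated formats are outside the rule described above.
    return "unknown"
```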

Before you decide to save the file in a binary format, read Can different versions of Office share the same files?

How do I convert my file from the old binary format to the modern Open XML format?

With the file open in your Office app, click File > Save as (or Save a copy, if the file is stored on OneDrive or SharePoint) and make sure the Save as type is set to the modern format.

This will create a new copy of your file, in the Open XML format.

What are the XML file name extensions?

By default, documents, worksheets, and presentations that you create in Office are saved in XML format with file name extensions that add an "x" or an "m" to the file name extensions that you are already familiar with. The "x" signifies an XML file that has no macros, and the "m" signifies an XML file that does contain macros. For example, when you save a document in Word, the file now uses the .docx file name extension by default, instead of the .doc file name extension.

When you save a file as a template, you see the same kind of change. The template extension used in earlier versions is there, but it now has an "x" or an "m" on the end. If the file contains code or macros, you must save it by using the new macro-enabled XML file format, which adds an "m" for macro to the file extension.

The following tables list all the default file name extensions in Word, Excel, and PowerPoint.

Word

XML file type                 Extension
Document                      .docx
Macro-enabled document        .docm
Template                      .dotx
Macro-enabled template        .dotm

Excel

XML file type                 Extension
Workbook                      .xlsx
Macro-enabled workbook        .xlsm
Template                      .xltx
Macro-enabled template        .xltm
Non-XML binary workbook       .xlsb
Macro-enabled add-in          .xlam

PowerPoint

XML file type                 Extension
Presentation                  .pptx
Macro-enabled presentation    .pptm
Template                      .potx
Macro-enabled template        .potm
Macro-enabled add-in          .ppam
Show                          .ppsx
Macro-enabled show            .ppsm
Slide                         .sldx
Macro-enabled slide           .sldm
Office theme                  .thmx

Can different versions of Office share the same files?

Office lets you save files in the Open XML Formats and in the binary file format of earlier versions of Office and includes compatibility checkers and file converters to allow file-sharing between different versions of Office.

Opening existing files in Office     You can open and work on a file that was created in an earlier version of Office, and then save it in its existing format. Because you might be working on a document with someone who uses an earlier version of Office, Office uses a compatibility checker that verifies that you have not introduced a feature that an earlier version of Office does not support. When you save the file, the compatibility checker reports those features to you and then lets you remove them before continuing with the save.

With approximately one billion people using Microsoft Office, the DOCX format is the most popular de facto standard for exchanging document files between offices. Its closest competitor — the ODT format — is only supported by Open/LibreOffice and some open source products, making it far from standard. The PDF format is not a competitor because PDFs can’t be edited and they don’t contain a full document structure, so they can only take limited local changes like watermarks, signatures, and the like. This is why most business documents are created in the DOCX format; there’s no good alternative to replace it.

While DOCX is a complex format, you may want to parse it manually for simpler tasks such as indexing, converting to TXT and making other small modifications. I’d like to give you enough information on DOCX internals so you don’t have to reference the ECMA specifications, a massive 5,000-page manual.

The best way to understand the format is to create a simple one-word document with MS Word and observe how editing the document changes the underlying XML. You’ll face some cases where the DOCX doesn’t format properly in MS Word and you don’t know why, or come across instances when it’s not evident how to generate the desired formatting. Seeing and understanding exactly what’s going on in the XML will help with that.

I worked for about a year on a collaborative DOCX editor, CollabOffice, and I want to share some of that knowledge with the developer community. In this article I will explain the DOCX file structure, summarising information that is scattered over the internet. This article is an intermediary between the huge, complex ECMA specification and the simple internet tutorials currently available. You can find the files that accompany this article in the toptal-docx project on my github account.

A Simple DOCX file

A DOCX file is a ZIP archive of XML files. If you create a new, empty Microsoft Word document, write a single word ‘Test’ inside and unzip its contents, you will see the following file structure:

Our brand new test DOCX structure.

Even though we’ve created a simple document, the save process in Microsoft Word has generated default themes, document properties, font tables, and so on, in XML format.

All the files inside a DOCX are XML files, even those with the ".rels" extension.
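
One way to see this for yourself is to parse every part of the archive and look at its root element. The following Python sketch does that with the standard library; the function name root_tags is hypothetical. Parts that are not well-formed XML (an embedded image in a richer document, for instance) are reported as None.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def root_tags(docx_bytes):
    """Map each part name in the package to the tag of its XML root element."""
    tags = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            try:
                tags[name] = ET.fromstring(package.read(name)).tag
            except ET.ParseError:
                tags[name] = None  # not well-formed XML
    return tags
```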

To start, let us remove the unused stuff and focus on document.xml, which contains the main text elements. When you delete a file, make sure you have deleted all the relationship references to it from the other XML files. Here is a code-diff example of how I’ve cleared dependencies to app.xml and core.xml. If you have any unresolved or missing references, MS Word will consider the file broken.

Here’s the structure of our simplified, minimal DOCX document (and here’s the project on github):

Our simplified DOCX structure.

Let’s break it down by file from here, from the top:

_rels/.rels

This defines the reference that tells MS Word where to look for the document contents. In this case, it references word/document.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
                 Target="word/document.xml"/>
</Relationships>
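
A reader of the package can follow this reference in code. The sketch below, in Python with the standard library, returns the Target of the officeDocument relationship; the function name main_document_target is an illustrative assumption, while the namespace and relationship type are the ones shown in the file above.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def main_document_target(rels_xml):
    """Return the Target of the officeDocument relationship, or None if absent."""
    root = ET.fromstring(rels_xml)
    for relationship in root.findall(f"{{{REL_NS}}}Relationship"):
        if relationship.get("Type") == OFFICE_DOCUMENT:
            return relationship.get("Target")
    return None
```

Given the contents of _rels/.rels shown above, the function returns "word/document.xml".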

_rels/document.xml.rels

This file defines references to resources, such as images, embedded in the document content. In the package it sits in the word/_rels folder, next to the document.xml part it describes. Our simple document has no embedded resources, so the Relationships element is empty:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
</Relationships>

[Content_Types].xml

[Content_Types].xml contains information about the types of media inside the document. Since we only have text content, it’s pretty simple:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
   <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
   <Default Extension="xml" ContentType="application/xml"/>
   <Override PartName="/word/document.xml"
             ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>

document.xml

Finally, here is the main XML with the document’s text content. I have removed some of the namespace declarations for clarity, but you can find the full version of the file in the github project. In that file you’ll find that some of the namespace references in the document are unused, but you shouldn’t delete them because MS Word needs them.

Here’s our simplified example:

<w:document>
   <w:body>
       <w:p w:rsidR="005F670F" w:rsidRDefault="005F79F5">
           <w:r><w:t>Test</w:t></w:r>
       </w:p>
       <w:sectPr w:rsidR="005F670F">
           <w:pgSz w:w="12240" w:h="15840"/>
           <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720"
                    w:gutter="0"/>
           <w:cols w:space="720"/>
           <w:docGrid w:linePitch="360"/>
       </w:sectPr>
   </w:body>
</w:document>

The main node <w:document> represents the document itself, <w:body> contains paragraphs, and nested within <w:body> are page dimensions defined by <w:sectPr>.

w:rsidR is an attribute that you can ignore; it’s used by MS Word internals.
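
To tie the parts together, here is a minimal Python sketch that writes such a package with the standard zipfile module. It is an illustration under assumptions, not the author's tooling: the function name build_minimal_docx is hypothetical, the w namespace URI is the standard WordprocessingML one that was omitted from the listing above, the section properties are left out, and the empty document.xml.rels part is skipped because nothing is referenced.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType='
    '"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type='
    '"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"'
    ' Target="word/document.xml"/>'
    "</Relationships>"
)

DOCUMENT_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:body>"
    "</w:document>"
)


def build_minimal_docx(text):
    """Return the bytes of a one-paragraph DOCX package (sketch only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        # Escape &, < and > so the text stays well-formed inside w:t.
        package.writestr("word/document.xml", DOCUMENT_TEMPLATE.format(text=escape(text)))
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .docx extension gives you something to experiment with; whether a particular Word version accepts it without complaint is worth checking rather than assuming.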

Let’s take a look at a more complex document with three paragraphs. I have highlighted the XML with the same colors on the screenshot from Microsoft Word, so you can see the correlation:

Complex paragraph example with styling.

<w:p w:rsidR="0081206C" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t xml:space="preserve">This is our example first paragraph. It's default is left aligned, and now I'd like to introduce</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t>some bold</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:b/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t xml:space="preserve"> text</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
           <w:color w:val="000000"/>
       </w:rPr>
       <w:t xml:space="preserve">, </w:t>
   </w:r>
   <w:proofErr w:type="gramStart"/>
   <w:r>
       <w:t xml:space="preserve">and also change the</w:t>
   </w:r>
   <w:r w:rsidRPr="00E10CAE">
       <w:rPr>
           <w:rFonts w:ascii="Impact" w:hAnsi="Impact"/>
       </w:rPr>
       <w:t>font style</w:t>
   </w:r>
   <w:r>
       <w:rPr>
           <w:rFonts w:ascii="Impact" w:hAnsi="Impact"/>
       </w:rPr>
       <w:t xml:space="preserve"> </w:t>
   </w:r>
   <w:r>
       <w:t>to 'Impact'.</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00E10CAE" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t>This is new paragraph.</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00E10CAE" w:rsidRPr="00E10CAE" w:rsidRDefault="00E10CAE">
   <w:r>
       <w:t>This is one more paragraph, a bit longer.</w:t>
   </w:r>
</w:p>

Paragraph Structure

A simple document consists of paragraphs. A paragraph consists of runs (a series of characters with the same font, color, etc.), and runs consist of text elements (such as <w:t>). A <w:t> tag may have several characters inside, and there might be a few <w:t> tags in the same run.
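The paragraph, run and text hierarchy can be walked with a few lines of Python. This is only a sketch using the standard library; the sample XML is a shortened stand-in for a real document.xml, and the bold check merely tests for the presence of <w:b> inside <w:rPr> (it ignores styles and any w:val attribute).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened sample, not a complete document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Some </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>This is new paragraph.</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def paragraphs(document_root):
    """Return one list per paragraph, holding (text, is_bold) per run."""
    result = []
    for p in document_root.iterfind("w:body/w:p", NS):
        runs = []
        for r in p.iterfind("w:r", NS):
            # A run may hold several <w:t> elements; join their text.
            text = "".join(t.text or "" for t in r.iterfind("w:t", NS))
            # Naive check: only local run properties are inspected.
            is_bold = r.find("w:rPr/w:b", NS) is not None
            runs.append((text, is_bold))
        result.append(runs)
    return result


for runs in paragraphs(ET.fromstring(SAMPLE)):
    print(runs)
```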

Again, we can ignore <w:rsidR>.

Text properties

Basic text properties are font, size, color, style, and so on. There are about 40 tags that specify text appearance. As you can see in our three paragraph example, each run has its own properties inside <w:rPr>, specifying <w:color>, <w:rFonts> and boldness <w:b>.

An important thing to note is that properties make a distinction between the two groups of characters, normal and complex script (Arabic, for instance), and that the properties have a different tag depending on which type of character it’s affecting.

Most normal script property tags have a matching complex script tag with an added “Cs” suffix specifying that the property is for complex scripts. For example: <w:i> (italic) becomes <w:iCs>, and the bold tag for normal script, <w:b>, becomes <w:bCs> for complex script.

Styles

There’s an entire toolbar in Microsoft Word dedicated to styles: normal, no spacing, heading 1, heading 2, title, and so on. These styles are stored in /word/styles.xml (note: in the first step in our simple example, we removed this XML from DOCX. Make a new DOCX to see this).

Once you have text defined as a style, you will find reference to this style inside the paragraph properties tag, <w:pPr>. Here’s an example where I’ve defined my text with the style Heading 1:

<w:p>
   <w:pPr>
       <w:pStyle w:val="Heading1"/>
   </w:pPr>
   <w:r>
       <w:t>My heading 1</w:t>
   </w:r>
</w:p>

and here is the style itself from styles.xml:

<w:style w:type="paragraph" w:styleId="Heading1">
   <w:name w:val="heading 1"/>
   <w:basedOn w:val="Normal"/>
   <w:next w:val="Normal"/>
   <w:link w:val="Heading1Char"/>
   <w:uiPriority w:val="9"/>
   <w:qFormat/>
   <w:rsid w:val="002F7F18"/>
   <w:pPr>
       <w:keepNext/>
       <w:keepLines/>
       <w:spacing w:before="480" w:after="0"/>
       <w:outlineLvl w:val="0"/>
   </w:pPr>
   <w:rPr>
       <w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi"
                 w:cstheme="majorBidi"/>
       <w:b/>
       <w:bCs/>
       <w:color w:val="365F91" w:themeColor="accent1" w:themeShade="BF"/>
       <w:sz w:val="28"/>
       <w:szCs w:val="28"/>
   </w:rPr>
</w:style>

The w:style/w:rPr/w:b XPath specifies that the font is bold, and w:style/w:rPr/w:color indicates the font color. <w:basedOn> instructs MS Word to use the “Normal” style for any missing properties.

Property Inheritance

Text properties are inherited. A run has its own properties (w:p/w:r/w:rPr/*), but it also inherits properties from its paragraph (w:p/w:pPr/*), and both can reference style properties from /word/styles.xml.

<w:r>
 <w:rPr>
   <w:rStyle w:val="DefaultParagraphFont"/>
   <w:sz w:val="16"/>
 </w:rPr>
 <w:tab/>
</w:r>

Paragraphs and runs start with default properties: w:styles/w:docDefaults/w:rPrDefault/*
and w:styles/w:docDefaults/w:pPrDefault/*. To get the end result of a character’s properties you should:

  1. Use default run/paragraph properties
  2. Append run/paragraph style properties
  3. Append local run/paragraph properties
  4. Append result run properties over paragraph properties

When I say “append” B to A, I mean to iterate through all B properties and override all A’s properties, leaving all non-intersecting properties as-is.
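Here is a small Python sketch of the four steps, with properties modeled as plain dictionaries. The property names and values are invented for illustration, and the helper names (append, resolve) are mine; a real implementation would first have to extract these dictionaries from styles.xml and document.xml.

```python
def append(a, b):
    """Append B to A: B's properties override A's, the rest of A is kept."""
    merged = dict(a)
    merged.update(b)
    return merged


def resolve(defaults, style, local):
    """Steps 1-3: start from defaults, then apply style, then local properties."""
    return append(append(defaults, style), local)


# Illustrative values only.
paragraph_props = resolve(
    defaults={"jc": "left", "sz": 20},
    style={"sz": 28, "color": "365F91"},
    local={},
)
run_props = resolve(
    defaults={"sz": 22},
    style={},
    local={"sz": 16},
)

# Step 4: the resolved run properties go over the paragraph properties.
final_props = append(paragraph_props, run_props)
print(final_props)
```

Toggle properties such as bold do not follow this simple override rule; they are discussed separately.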

One more place where default properties may be located is in the <w:style> tag with w:type="paragraph" and w:default="1". Note that characters themselves inside a run never have a default style, so <w:style w:type="character" w:default="1"> doesn’t actually affect any text.

Characters in a run can inherit from its paragraph and both can inherit from styles.xml.


Toggle properties

Some of the properties are “toggle” properties, such as <w:b> (bold) or <w:i> (italic); these attributes behave like an XOR operator.

This means if the parent style is bold and a child run is bold, the result will be regular, non-bold text.

You have to do lots of testing and reverse-engineering to handle toggle attributes correctly. Take a look at section 17.7.3 of the ECMA-376 Open XML specification to get the formal, detailed rules for toggle properties.

Toggle properties are the most complex for a layouter to handle correctly.
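The XOR behavior described above can be sketched in Python as follows. This is a deliberate simplification that only models "each level that sets the property flips the result"; the specification has additional cases (for example, how document defaults and direct formatting are treated) that this sketch ignores.

```python
from functools import reduce
from operator import xor


def resolve_toggle(levels):
    """Combine a toggle property across style levels with XOR.

    levels: one boolean per level in the hierarchy (parent first),
    True when that level turns the toggle property on.
    """
    return reduce(xor, levels, False)


# Parent style is bold, child is bold too: the result is non-bold.
print(resolve_toggle([True, True]))
# Only the parent style is bold: the text stays bold.
print(resolve_toggle([True, False]))
```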

Fonts

Fonts follow the same common rules as other text attributes, but font property default values are specified in a separate theme file, referenced under word/_rels/document.xml.rels like this:

<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>

Based on the above reference, the default font name will be found in word/theme/theme1.xml, inside an <a:theme> tag, under the a:themeElements/a:fontScheme/a:majorFont or a:minorFont tag.
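As a rough illustration, the theme fonts can be read like this in Python. The sample theme below is heavily trimmed and the typeface names in it are placeholders; I am assuming the name sits in the typeface attribute of the <a:latin> child, which is where it appears in the theme files I have looked at.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A}

# Trimmed sample; a real theme1.xml contains much more.
THEME_XML = f"""<a:theme xmlns:a="{A}" name="Office Theme">
  <a:themeElements>
    <a:fontScheme name="Office">
      <a:majorFont><a:latin typeface="Cambria"/></a:majorFont>
      <a:minorFont><a:latin typeface="Calibri"/></a:minorFont>
    </a:fontScheme>
  </a:themeElements>
</a:theme>"""


def theme_fonts(theme_root):
    """Return the Latin typeface of the major and minor theme fonts."""
    scheme = theme_root.find("a:themeElements/a:fontScheme", NS)
    fonts = {}
    for kind in ("majorFont", "minorFont"):
        latin = scheme.find(f"a:{kind}/a:latin", NS)
        fonts[kind] = latin.get("typeface") if latin is not None else None
    return fonts


print(theme_fonts(ET.fromstring(THEME_XML)))
```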

The default font size is 10 points, unless the w:docDefaults/w:rPrDefault tag is missing, in which case it is 11 points.

Text alignment

Text alignment is specified by a <w:jc> tag with four w:val modes available: "left", "center", "right" and "both".

"left" is the default mode; text is started at the left of paragraph rectangle (usually the page width). (This paragraph is aligned to the left, which is standard.)

"center" mode, predictably, centers all characters inside the page width. (Again, this paragraph exemplifies centered alignment.)

In "right" mode, paragraph text is aligned to the right margin. (Notice how this text is aligned to the right side.)

"both" mode puts extra spacing between words so that lines get wider and occupy the full paragraph width, with the exception of the last line which is left aligned. (This paragraph is a demonstration of that.)

Images

DOCX supports two sorts of images: inline and floating.

Inline images appear inside a paragraph along with the other characters; <w:drawing> is used instead of <w:t> (text). You can find the image ID with the following XPath:

w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed

The image ID is used to look up the filename in the word/_rels/document.xml.rels file, and it should point to a GIF/JPEG file inside the word/media subfolder. (See the GitHub project’s word/_rels/document.xml.rels file, where you can see the image ID.)
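The lookup can be sketched in Python like this. Both XML samples are minimal stand-ins that I wrote for the example (the rId5 ID and the media/image1.jpeg target are made up), and inline_image_targets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL = "http://schemas.openxmlformats.org/package/2006/relationships"
NS = {"w": W, "wp": WP, "a": A, "pic": PIC, "r": R}

# Minimal made-up samples of document.xml and document.xml.rels.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:wp="{WP}" xmlns:a="{A}"
    xmlns:pic="{PIC}" xmlns:r="{R}">
  <w:body><w:p><w:r><w:drawing><wp:inline><a:graphic><a:graphicData>
    <pic:pic><pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill></pic:pic>
  </a:graphicData></a:graphic></wp:inline></w:drawing></w:r></w:p></w:body>
</w:document>"""

RELS_XML = f"""<Relationships xmlns="{REL}">
  <Relationship Id="rId5"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="media/image1.jpeg"/>
</Relationships>"""

BLIP_PATH = ".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip"


def inline_image_targets(document_root, rels_root):
    """Map each inline image's r:embed ID to its target in the rels file."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.findall(f"{{{REL}}}Relationship")
    }
    embed_attr = f"{{{R}}}embed"
    return [
        targets.get(blip.get(embed_attr))
        for blip in document_root.iterfind(BLIP_PATH, NS)
    ]


print(inline_image_targets(ET.fromstring(DOCUMENT_XML), ET.fromstring(RELS_XML)))
```

The targets are relative to the word folder, so media/image1.jpeg here would correspond to word/media/image1.jpeg inside the archive.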

Floating images are placed relative to paragraphs with text flowing around them. (Here’s the GitHub project sample document with a floating image.)

Floating images use <wp:anchor> instead of <wp:inline> inside <w:drawing>, so if you delete any text inside <w:p>, be careful with the anchors if you don’t want the images removed.

Inline vs. floating.

MS Word’s image options refer to image alignment as “text wrapping mode”.

Tables

XML tags for tables are similar to HTML table markup: <w:tbl> is the same as <table>, <w:tr> matches <tr>, etc.

<w:tbl>, the table itself, has table properties <w:tblPr>, and each column property is presented by <w:gridCol> inside <w:tblGrid>. Rows follow one by one as <w:tr> tags and each row should have same number of columns as specified in <w:tblGrid>:

<w:tbl>
 <w:tblPr>
   <w:tblW w:w="5000" w:type="pct" />
 </w:tblPr>
 <w:tblGrid><w:gridCol/><w:gridCol/></w:tblGrid>
 <w:tr>
   <w:tc><w:p><w:r><w:t>left</w:t></w:r></w:p></w:tc>
   <w:tc><w:p><w:r><w:t>right</w:t></w:r></w:p></w:tc>
 </w:tr>
</w:tbl>

The table width can be specified in the <w:tblW> tag, and column widths belong on the <w:gridCol> tags. If you don’t define them, MS Word will use its internal algorithms to find the optimal width of columns for the smallest effective table size.
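Because the structure mirrors HTML tables so closely, reading cell text is straightforward. The following Python sketch uses the same table as above; note that it collects all <w:t> text below a cell, so text from a nested table would be merged into the outer cell.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

TABLE_XML = f"""<w:tbl xmlns:w="{W}">
  <w:tblPr>
    <w:tblW w:w="5000" w:type="pct"/>
  </w:tblPr>
  <w:tblGrid><w:gridCol/><w:gridCol/></w:tblGrid>
  <w:tr>
    <w:tc><w:p><w:r><w:t>left</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>right</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""


def grid_column_count(tbl):
    """Number of columns declared in <w:tblGrid>."""
    return len(tbl.findall("w:tblGrid/w:gridCol", NS))


def table_rows(tbl):
    """Return the text of every cell, row by row."""
    rows = []
    for tr in tbl.iterfind("w:tr", NS):
        cells = []
        for tc in tr.iterfind("w:tc", NS):
            cells.append("".join(t.text or "" for t in tc.iterfind(".//w:t", NS)))
        rows.append(cells)
    return rows


tbl = ET.fromstring(TABLE_XML)
print(grid_column_count(tbl), table_rows(tbl))
```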

Units

Many XML attributes inside DOCX specify sizes or distances. While they’re integers inside the XML, they all have different units so some conversion is necessary. The topic is a complicated one, so I’d recommend this article by Lars Corneliussen on units in DOCX files. The table he presents is useful, though with a small misprint: inches should be pt/72, not pt*72.

Here’s a cheat sheet:

COMMON DOCX XML UNIT CONVERSIONS

Unit                     Conversion    Example      Tags using this
20th of a point (dxa)                  11906        pgSz/pgMar/w:spacing
Points                   dxa/20        595.3
Inches                   pt/72         8.27…
Centimeters              in*2.54       21.00086…
Font half size           pt/144        4.135        w:sz
EMU                      in*914400     7562088      wp:extent, a:ext
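The same conversions, written as small Python helpers. The function names are mine. For font sizes I am relying on my understanding that w:sz holds half-points (so w:sz w:val="28" means 14 pt), which is an assumption on top of the cheat sheet rather than something taken from it.

```python
def dxa_to_pt(dxa):
    """Twentieths of a point (dxa) to points."""
    return dxa / 20


def pt_to_inches(pt):
    """Points to inches (72 points per inch)."""
    return pt / 72


def inches_to_cm(inches):
    """Inches to centimeters."""
    return inches * 2.54


def inches_to_emu(inches):
    """Inches to EMU (914400 EMU per inch)."""
    return round(inches * 914400)


def half_points_to_pt(half_points):
    """Font size as stored in w:sz (assumed half-points) to points."""
    return half_points / 2


page_width_pt = dxa_to_pt(11906)
page_width_in = pt_to_inches(page_width_pt)
print(page_width_pt, page_width_in, inches_to_cm(page_width_in))
print(inches_to_emu(pt_to_inches(dxa_to_pt(1440))))  # one inch margin in EMU
```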

Tips for Implementing a Layouter

If you want to convert a DOCX file (to PDF, for instance), draw it on a canvas, or count the number of pages, you’ll have to implement a layouter. A layouter is an algorithm for calculating character positions from a DOCX file.

This is a complex task if you need 100 percent fidelity rendering. The amount of time needed to implement a good layouter is measured in man-years, but if you only need a simple, limited one, it can be done relatively quickly.

A layouter fills a parent rectangle, which is usually the rectangle of the page. It adds words from a run one by one. When the current line overflows, it starts a new one. If the paragraph is too tall for the parent rectangle, it’s wrapped to the next page.
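To show the basic idea, here is a toy greedy layouter in Python. It is nowhere near a real one: it assumes a caller-supplied measure function for text width, a single fixed line height, no hyphenation, and it splits pages by whole lines. All names and parameters are invented for this sketch.

```python
def layout(words, measure, line_width, line_height, page_height):
    """Greedy layout: returns a list of pages, each a list of lines,
    each line a list of words."""
    lines, current, used = [], [], 0.0
    space = measure(" ")
    for word in words:
        width = measure(word)
        needed = width if not current else used + space + width
        if current and needed > line_width:
            # The current line overflows: start a new one.
            lines.append(current)
            current, used = [word], width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    # Lines that do not fit in the page rectangle move to the next page.
    lines_per_page = max(1, int(page_height // line_height))
    return [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]


# Pretend every character is one unit wide.
pages = layout("aaa bb cccc d".split(), lambda s: float(len(s)),
               line_width=6, line_height=10, page_height=10)
print(pages)
```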

Here are some important things to keep in mind if you decide to implement a layouter:

  • The layouter should take care of text alignment and text floating over images
  • It should be capable of handling nested objects, such as nested tables
  • If you want to provide full support for floating images, you’ll have to implement a layouter with at least two passes: the first pass collects the floating images’ positions and the second fills the empty space with text characters.
  • Be aware of indentations and spacings. Each paragraph has spacing before and after, specified by the w:before and w:after attributes of the w:spacing tag. Note that line spacing is specified by w:line, but this is not the size of the line as one may expect. To get the size of the line, take the current font height, multiply by w:line and divide by 12.
  • DOCX files contain no information about pagination. You won’t find the number of pages in the document unless you calculate how much space you need for each line to ascertain the number of pages. If you need to find exact coordinates of each character on the page, be sure to take into account all spacings, indentations and sizes.
  • If you implement a full-featured DOCX layouter that handles tables, note the special cases when tables span multiple pages. A cell which causes a page overflow also affects other cells.
  • Creating an optimal algorithm for calculating table column widths is a challenging math problem, and word processors and layouters usually use suboptimal implementations. I propose using the algorithm from the W3C HTML table documentation as a first approximation. I haven’t found a description of the algorithm used by MS Word, and Microsoft has fine-tuned the algorithm over time, so different versions of Word may lay out tables slightly differently.

If something is unclear: reverse-engineer the XML!

When it’s not obvious how this or that XML tag works inside MS Word, there are two main approaches to figuring it out:

  • Create the desired content step-by-step. Start with a simple docx file. Save each step to its own file, as in 1.docx, 2.docx, for example. Unzip each of them and use a visual diff tool for folder comparison to see which tags appear after your changes. (For a commercial option, try Araxis Merge, or for a free option, WinMerge.)

  • If you generate a DOCX file that MS Word doesn’t like, work backwards. Simplify your XML step by step. At some point you will learn which change MS Word found incorrect.
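If you prefer a script to a visual diff tool, the first approach can be roughly automated in Python with the standard library. This sketch pretty-prints one part from each archive and prints a unified diff. The make_docx helper only exists to build in-memory demo archives (they are not valid DOCX files); in practice you would pass paths such as 1.docx and 2.docx instead.

```python
import difflib
import io
import zipfile
from xml.dom import minidom

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_docx(document_xml):
    """Demo helper: an in-memory ZIP holding only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buf.seek(0)
    return buf


def pretty_part(docx, part="word/document.xml"):
    """Read one part from a DOCX (path or file object) as indented lines."""
    with zipfile.ZipFile(docx) as archive:
        raw = archive.read(part)
    return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_docx(old, new, part="word/document.xml"):
    """Unified diff of the same part in two DOCX files."""
    return list(difflib.unified_diff(
        pretty_part(old, part), pretty_part(new, part),
        fromfile="old", tofile="new", lineterm=""))


STEP_1 = f'<w:p xmlns:w="{W}"><w:r><w:t>Test</w:t></w:r></w:p>'
STEP_2 = f'<w:p xmlns:w="{W}"><w:r><w:rPr><w:b/></w:rPr><w:t>Test</w:t></w:r></w:p>'

for line in diff_docx(make_docx(STEP_1), make_docx(STEP_2)):
    print(line)
```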

DOCX is quite complex, isn’t it?

It is complex, and Microsoft’s license forbids using MS Word on the server side for processing DOCX; this is pretty standard for commercial products. Microsoft has, however, provided an XSLT file to handle most DOCX tags, but it won’t give you 100 percent or even 99 percent fidelity. Features such as text wrapping around images are not supported, but you will be able to support the majority of documents. (If you don’t need complexity, consider using Markdown as an alternative.)

If you have a sufficient budget (there is no free DOCX rendering engine), you may want to use commercial products such as Aspose or docx4j. The most popular free solution for converting between DOCX and other formats, including PDF, is LibreOffice. Unfortunately, LibreOffice has many small conversion bugs, and since it’s a sophisticated, open-source C++ product, fixing fidelity issues is slow and difficult.

Alternatively, if you find DOCX layouting too complicated to implement yourself, you can also convert it to HTML and use a browser to render it. You can also consider one of Toptal’s freelance XML developers.

DOCX Resources for further reading

  • ECMA DOCX specification
  • OpenXML library for DOCX manipulation from C#. It doesn’t contain information on layouting or rendering code, but offers a class hierarchy matching each possible XML node in DOCX.
  • You can always search or ask on Stack Overflow with keywords like docx4j, OpenXML and docx; there are people in the community who are knowledgeable.
