Leveraging the XML Features of Microsoft Office Word 2003...
With the ability to save as and read from XML, you can create sophisticated documents by processing and manipulating XML.
Word 2003 introduced the ability to save conventional Word document (.doc) files in XML format. You can retrieve the information inside a Word 2003 document by using XPath queries and a little logic. This is particularly useful because the .doc file format, which is still present in Word 2003, is essentially a proprietary binary format from which information is difficult to extract.
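As a rough illustration of retrieving information with XPath, the following sketch loads a tiny stand-in for a Word 2003 XML document and prints the text of every run in the body. The inline XML is a simplified assumption rather than real Word output, and the module name is made up for the example.

```vb
Imports System
Imports System.Xml

Module ExtractTextSketch
    Sub Main()
        ' Simplified stand-in for a Word 2003 XML document (assumed structure).
        Dim sXml As String = _
            "<w:wordDocument xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml"">" & _
            "<w:body>" & _
            "<w:p><w:r><w:t>Dear customer,</w:t></w:r></w:p>" & _
            "<w:p><w:r><w:t>Thank you for your order.</w:t></w:r></w:p>" & _
            "</w:body>" & _
            "</w:wordDocument>"

        Dim oDoc As New XmlDocument
        oDoc.LoadXml(sXml)

        ' The w prefix must be registered before it can be used in XPath.
        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)
        oNSMgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml")

        ' Select every text element below the document body.
        Dim oTexts As XmlNodeList = oDoc.SelectNodes("//w:body//w:t", oNSMgr)
        For i As Integer = 0 To oTexts.Count - 1
            Console.WriteLine(oTexts(i).InnerText)
        Next
    End Sub
End Module
```

For a real document, you would replace LoadXml with Load and the path of a file saved from Word 2003 as XML.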
This feature also lets you have users enter data into an XML document without their being aware of it. You annotate a document with an XML schema and then protect the document so that the user can add or edit information only in specific locations. When the user saves the document, the data is written directly to an XML document, which another application or a database can then easily consume.
Another benefit of using XML with Word 2003 documents is the ability to transform the XML into other formats. An XSLT stylesheet provided by Microsoft takes a Word 2003 XML document and transforms it into an HTML document for viewing in a Web browser. Earlier versions of Word could also save a .doc file as HTML; the distinguishing factor here is that, by designing your own XSLT, you decide how your Word document is presented on an HTML page.
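To give a feel for a custom transformation, here is a minimal sketch that applies a hand-written stylesheet to a small Word-style XML fragment and prints the resulting HTML. The stylesheet is a hypothetical example (it is not the one Microsoft provides), the input is a simplified stand-in, and XslCompiledTransform requires .NET Framework 2.0 or later.

```vb
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Xsl

Module TransformSketch
    Sub Main()
        ' Hypothetical stylesheet: each Word paragraph becomes an HTML <p>.
        Dim sXslt As String = _
            "<xsl:stylesheet version=""1.0""" & _
            " xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""" & _
            " xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""" & _
            " exclude-result-prefixes=""w"">" & _
            "<xsl:output method=""html""/>" & _
            "<xsl:template match=""/"">" & _
            "<html><body><xsl:apply-templates select=""//w:p""/></body></html>" & _
            "</xsl:template>" & _
            "<xsl:template match=""w:p"">" & _
            "<p><xsl:value-of select="".""/></p>" & _
            "</xsl:template>" & _
            "</xsl:stylesheet>"

        ' Simplified stand-in for a Word 2003 XML document.
        Dim sWordXml As String = _
            "<w:wordDocument xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml"">" & _
            "<w:body><w:p><w:r><w:t>Hello from Word.</w:t></w:r></w:p></w:body>" & _
            "</w:wordDocument>"

        Dim oTransform As New XslCompiledTransform
        oTransform.Load(XmlReader.Create(New StringReader(sXslt)))

        ' Transform into an in-memory writer and show the HTML.
        Dim oWriter As New StringWriter
        oTransform.Transform(XmlReader.Create(New StringReader(sWordXml)), Nothing, oWriter)
        Console.WriteLine(oWriter.ToString())
    End Sub
End Module
```

A production stylesheet would also need to map run and paragraph properties to HTML formatting; this sketch only carries the text across.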
This feature lends itself to several uses. You can create documents from the data within an application; for example, letters or other document templates can be filled in programmatically. You can send Word 2003 documents to a client workstation over the wire as XML and have them correctly interpreted there as Word 2003 documents. You can also return Word 2003 documents from Web services. The rest of this article walks through the steps that produce the final document for use in an application.
Step I: - Creating a Schema
The very first step in this process is to create a schema for the data that you want to insert into the Word 2003 document template. A schema is not strictly necessary, but the document is simpler to work with if you apply one. If you don't have a schema, you can always use a feature such as bookmarks, which are rendered as XML like the following:
|
<aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="ContactName"/>
<w:p>
  <w:r>
    <w:t>[ContactName]</w:t>
  </w:r>
  <aml:annotation aml:id="0" w:type="Word.Bookmark.End"/>
</w:p>
|
Notice how the bookmark, named ContactName in this example, is delimited by two empty annotation elements. The only things that distinguish these elements are the type attribute values of Word.Bookmark.Start and Word.Bookmark.End. This is slightly more complex than applying a schema to the document, which produces the XML in the following snippet:
|
<ns0:ContactName>
  <w:p>
    <w:r>
      <w:t>[ContactName]</w:t>
    </w:r>
  </w:p>
</ns0:ContactName>
|
Because we are starting from scratch, the schema approach is the slightly easier way to go. There are certainly situations, however, in which you are migrating from an earlier version of Word and your documents are already marked up with bookmarks. As you can see, it is still possible to use the bookmarks; it is just a little more work than using an attached schema.
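For the bookmark route, one possible approach is to locate the start marker by name and then replace the first text element that follows it. The sketch below does this on a fragment modeled on the bookmark snippet above. The aml namespace URI shown is the one commonly seen in Word 2003 files, but treat it as an assumption and check it against your own document; the module name and sample value are illustrative.

```vb
Imports System
Imports System.Xml

Module BookmarkSketch
    Sub Main()
        ' Fragment modeled on the bookmark markup; namespace URIs are assumptions.
        Dim sXml As String = _
            "<w:body xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""" & _
            " xmlns:aml=""http://schemas.microsoft.com/aml/2001/core"">" & _
            "<aml:annotation aml:id=""0"" w:type=""Word.Bookmark.Start"" w:name=""ContactName""/>" & _
            "<w:p><w:r><w:t>[ContactName]</w:t></w:r>" & _
            "<aml:annotation aml:id=""0"" w:type=""Word.Bookmark.End""/></w:p>" & _
            "</w:body>"

        Dim oDoc As New XmlDocument
        oDoc.LoadXml(sXml)

        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)
        oNSMgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml")
        oNSMgr.AddNamespace("aml", "http://schemas.microsoft.com/aml/2001/core")

        ' Find the named start marker, then the first w:t after it in document order.
        Dim sQuery As String = _
            "//aml:annotation[@w:type='Word.Bookmark.Start'][@w:name='ContactName']" & _
            "/following::w:t[1]"
        Dim oText As XmlNode = oDoc.SelectSingleNode(sQuery, oNSMgr)

        If Not oText Is Nothing Then
            oText.InnerText = "Rene Phillips"
        End If

        Console.WriteLine(oDoc.OuterXml)
    End Sub
End Module
```

This only replaces the first text element after the marker, so a bookmark that spans several runs would need extra handling.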
Let's have a look at a simple schema created from the Northwind Customers table in SQL Server.
A simple XML schema based upon Northwind's Customers table.
|
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           targetNamespace="http://schemas.eps-software.com/NWindTest"
           xmlns:eps="http://schemas.eps-software.com/NWindTest">
  <xs:element name="Address">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="60" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="City">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="15" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="ContactName">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="30" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="ContactTitle">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="30" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="Country">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="15" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="Fax">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="24" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="PostalCode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="10" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="Region">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="15" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
|
This simple schema points out another advantage to using a schema-based approach: Word 2003 enforces the restrictions defined in the schema for the document. Any violations appear as errors in Word 2003's task pane feature, but you can also validate the document against the schema with any XML validation tool.
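Validation outside Word can be sketched with the .NET validating reader. The example below uses a trimmed copy of the schema above (only the City element) and a deliberately over-long value, then counts the validation errors reported. It validates an isolated element rather than a full Word 2003 document, and it assumes .NET Framework 2.0 or later for XmlReaderSettings; the module and handler names are made up.

```vb
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Schema

Module ValidateSketch
    Private nErrors As Integer = 0

    Sub Main()
        ' Trimmed copy of the schema: City is limited to 15 characters.
        Dim sSchema As String = _
            "<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema""" & _
            " elementFormDefault=""qualified""" & _
            " targetNamespace=""http://schemas.eps-software.com/NWindTest"">" & _
            "<xs:element name=""City""><xs:simpleType>" & _
            "<xs:restriction base=""xs:string"">" & _
            "<xs:maxLength value=""15""/>" & _
            "</xs:restriction>" & _
            "</xs:simpleType></xs:element>" & _
            "</xs:schema>"

        ' Sample instance that deliberately violates the maxLength facet.
        Dim sInstance As String = _
            "<City xmlns=""http://schemas.eps-software.com/NWindTest"">" & _
            "A city name that is far too long</City>"

        Dim oSettings As New XmlReaderSettings
        oSettings.ValidationType = ValidationType.Schema
        oSettings.Schemas.Add(Nothing, XmlReader.Create(New StringReader(sSchema)))
        AddHandler oSettings.ValidationEventHandler, AddressOf OnValidationEvent

        ' Reading the document to the end triggers validation.
        Dim oReader As XmlReader = XmlReader.Create(New StringReader(sInstance), oSettings)
        While oReader.Read()
        End While
        oReader.Close()

        Console.WriteLine("Validation errors: " & nErrors.ToString())
    End Sub

    ' Called once for each schema violation that the reader finds.
    Private Sub OnValidationEvent(ByVal sender As Object, ByVal e As ValidationEventArgs)
        nErrors += 1
        Console.WriteLine(e.Message)
    End Sub
End Module
```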
The schema that you create can be as simple or as complex as you like. What is important is how to mark up the Word 2003 document with this schema so that you get the desired XML output from your application.
Step II: - Making a Word 2003 document
Now that the schema is done, let's apply it to a Word 2003 document. Start by creating or opening a document in Word 2003 with the desired text. You can highlight or otherwise mark the locations for XML placeholders in your document so that you can find them easily when the time comes to edit the document. My convention is to write the node names into the text of the document and surround them with square brackets (e.g., [ContactName]). These become the placeholders for the schema elements in the document.
To apply schemas follow the steps:
Tips for Saving as XML
To make things a little cleaner in the XML output, you should ensure that you either spell everything correctly or that you ignore any spelling errors flagged by Word 2003. If you leave in something that the Word 2003 spelling checker doesn't like, the resultant XML looks similar to the following snippet:
|
<ns0:ContactName>
  <w:p>
    <w:r>
      <w:t>[</w:t>
    </w:r>
    <w:proofErr w:type="spellStart" />
    <w:r>
      <w:t>ConyactName</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r>
      <w:t>]</w:t>
    </w:r>
  </w:p>
</ns0:ContactName>
|
As you can see, with the proofing errors, this changes the expected XML, because Word 2003 has embedded some proofErr elements. Once you handle the spelling errors (e.g., right-click the error in the document and choose "Ignore All"), the XML appears as shown in this snippet:
|
<ns0:ContactName>
  <w:p>
    <w:r>
      <w:t>[ContactName]</w:t>
    </w:r>
  </w:p>
</ns0:ContactName>
|
Also, be aware of where your paragraph marks appear in relation to your applied schema elements. In the snippet shown above, the [ContactName] text appears on a line all by itself. This places a paragraph element (the w:p element) completely within the ContactName element.
If, on the other hand, you placed ContactName on the same line as some other text or another element, the paragraph element won't appear within the ContactName element but outside of it. Because my document contains both of these examples, the code will have to handle both situations appropriately.
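If a template has already been saved with proofing markup in it, the cleanup can also be attempted in code instead of in Word. The following sketch removes the proofErr elements from a paragraph and folds the text of the split runs back into the first run, using the misspelled placeholder from earlier as input. It is an illustrative assumption, not part of the article's program, and it presumes that all runs in the paragraph share the same formatting, because the later runs are discarded.

```vb
Imports System
Imports System.Xml

Module ProofErrSketch
    Sub Main()
        ' Stand-in for the placeholder that the spelling checker split up.
        Dim sXml As String = _
            "<ns0:ContactName xmlns:ns0=""http://schemas.eps-software.com/NWindTest""" & _
            " xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml"">" & _
            "<w:p><w:r><w:t>[</w:t></w:r>" & _
            "<w:proofErr w:type=""spellStart""/>" & _
            "<w:r><w:t>ConyactName</w:t></w:r>" & _
            "<w:proofErr w:type=""spellEnd""/>" & _
            "<w:r><w:t>]</w:t></w:r></w:p>" & _
            "</ns0:ContactName>"

        Dim oDoc As New XmlDocument
        oDoc.LoadXml(sXml)
        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)
        oNSMgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml")

        Dim oParas As XmlNodeList = oDoc.SelectNodes("//w:p[w:proofErr]", oNSMgr)
        For iPara As Integer = 0 To oParas.Count - 1
            Dim oPara As XmlNode = oParas(iPara)

            ' Remove the proofing markers, walking backwards by index.
            Dim oErrs As XmlNodeList = oPara.SelectNodes("w:proofErr", oNSMgr)
            For iErr As Integer = oErrs.Count - 1 To 0 Step -1
                oPara.RemoveChild(oErrs(iErr))
            Next

            Dim oRuns As XmlNodeList = oPara.SelectNodes("w:r", oNSMgr)
            If oRuns.Count > 1 Then
                Dim oFirstText As XmlNode = oRuns(0).SelectSingleNode("w:t", oNSMgr)
                If Not oFirstText Is Nothing Then
                    ' Collect the text of all runs in document order.
                    Dim sText As String = ""
                    For iRun As Integer = 0 To oRuns.Count - 1
                        sText &= oRuns(iRun).InnerText
                    Next
                    ' Drop every run except the first, then store the joined text.
                    For iRun As Integer = oRuns.Count - 1 To 1 Step -1
                        oPara.RemoveChild(oRuns(iRun))
                    Next
                    oFirstText.InnerText = sText
                End If
            End If
        Next

        Console.WriteLine(oDoc.OuterXml)
    End Sub
End Module
```

Note that this joins the runs but does not correct the misspelling itself; the placeholder text is replaced later by the merge in any case.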
Opening the XML File
Now that you've saved the document as XML, you can see the document on your hard drive with its XML extension. When you double-click it, it opens up within Word 2003, not in your associated program for XML files (which is, by default, Microsoft Internet Explorer). This is because there is a processing instruction at the top of the XML document that declares the ProgID to use when opening this XML file, as shown in this snippet:
|
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?mso-application progid="Word.Document"?>
<w:wordDocument . . .
|
If you comment out the second line of this document and then save it, you no longer launch Word 2003 when double-clicking the XML file. I found this useful during testing so that I could quickly view the XML produced by saving the Word 2003 document as XML.
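The same toggle can be done in code rather than by hand. This sketch removes the mso-application processing instruction from a small stand-in document and then puts it back in front of the root element. The inline XML is a simplified assumption and the module name is made up.

```vb
Imports System
Imports System.Xml

Module ProgIdSketch
    Sub Main()
        ' Simplified stand-in for the top of a Word 2003 XML file.
        Dim sXml As String = _
            "<?xml version=""1.0""?>" & _
            "<?mso-application progid=""Word.Document""?>" & _
            "<w:wordDocument xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""/>"

        Dim oDoc As New XmlDocument
        oDoc.LoadXml(sXml)

        ' Remove the instruction so the file opens in the default XML viewer.
        Dim oPI As XmlNode = _
            oDoc.SelectSingleNode("/processing-instruction('mso-application')")
        If Not oPI Is Nothing Then
            oDoc.RemoveChild(oPI)
        End If
        Console.WriteLine(oDoc.OuterXml)

        ' Add it back so the file opens in Word 2003 again.
        Dim oNewPI As XmlProcessingInstruction = _
            oDoc.CreateProcessingInstruction("mso-application", "progid=""Word.Document""")
        oDoc.InsertBefore(oNewPI, oDoc.DocumentElement)
        Console.WriteLine(oDoc.OuterXml)
    End Sub
End Module
```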
Step III: - Creating the output
Now that the template has been defined and annotated as desired, you can write a small program to read data from an XML file and merge this data with the template. I've used a console application (as I don't need a GUI) and chose Visual Basic .NET as the language.
Let's first have a look at the XML that we will merge with the document. It contains a single record from the Northwind database on SQL Server. Save this XML file as NWData.xml.
The XML data to be merged with the marked-up document.
|
<?xml version="1.0" encoding="utf-8" ?>
<results>
  <customers>
    <CustomerID>OLDWO</CustomerID>
    <CompanyName>Old World Delicatessen</CompanyName>
    <ContactName>Rene Phillips</ContactName>
    <ContactTitle>Sales Representative</ContactTitle>
    <Address>2743 Bering St.</Address>
    <City>Anchorage</City>
    <Region>AK</Region>
    <PostalCode>99508</PostalCode>
    <Country>USA</Country>
    <Phone>(907) 555-7584</Phone>
    <Fax>(907) 555-2880</Fax>
  </customers>
</results>
|
The code (the complete listing appears at the end of the article) uses the XmlDocument class from the .NET Framework to do the bulk of the work. The code starts by loading both the data file and the Word 2003 template file into separate XML DOM objects. The Word 2003 document (saved as XML) is loaded through a method of a class instantiated as the oProcess object.
|
Dim oProcess As New WordXMLTest
Dim sDocPath As String
Dim sDataPath As String
Dim sSaveFile As String

sDocPath = "sample2.xml"
sDataPath = "NWdata.xml"
sSaveFile = "OutFile.xml"

Try
    'load the WordXML into a DOM
    oProcess.LoadFile(sDocPath)

    'load data into DOM
    xmlDataDoc.Load(sDataPath)
|
Next, select the nodes from the data document with a simple XPath query, and iterate through them with a For-Next loop. Note that this code assumes that only a single customer record exists in the XML file. If there are multiple customers, add an outer loop to iterate through each customer record.
|
    'iterate through data nodes
    xmlNodes = xmlDataDoc.SelectNodes("/results/customers/*")

    'replace Word doc area with data
    If Not xmlNodes Is Nothing Then
        For i = 0 To xmlNodes.Count - 1
            xmlNode = xmlNodes(i)
            sNodeName = xmlNode.Name
            sNewText = xmlNode.InnerText
            oProcess.ProcessNodes(sNodeName, sNewText)
        Next
    End If
|
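The outer loop mentioned above for multiple customers could look roughly like the following sketch. It only prints what would be merged, so that it runs on its own; the comments mark where the article's oProcess calls would go. The second customer record and the OutFile1.xml, OutFile2.xml naming pattern are invented for illustration.

```vb
Imports System
Imports System.Xml

Module MultiCustomerSketch
    Sub Main()
        ' Two records; the second one is made-up sample data.
        Dim sData As String = _
            "<results>" & _
            "<customers><CustomerID>OLDWO</CustomerID>" & _
            "<ContactName>Rene Phillips</ContactName></customers>" & _
            "<customers><CustomerID>DEMO1</CustomerID>" & _
            "<ContactName>Sample Contact</ContactName></customers>" & _
            "</results>"

        Dim xmlDataDoc As New XmlDocument
        xmlDataDoc.LoadXml(sData)

        ' Outer loop: one pass per customer record.
        Dim xmlCustomers As XmlNodeList = xmlDataDoc.SelectNodes("/results/customers")
        For iCust As Integer = 0 To xmlCustomers.Count - 1
            ' One output file per customer (naming pattern is an assumption).
            Dim sSaveFile As String = "OutFile" & (iCust + 1).ToString() & ".xml"
            Console.WriteLine("Customer " & (iCust + 1).ToString() & " -> " & sSaveFile)

            ' A real merge would create a new WordXMLTest and load the template here.

            ' Inner loop: the fields of the current customer.
            Dim xmlFields As XmlNodeList = xmlCustomers(iCust).SelectNodes("*")
            For iField As Integer = 0 To xmlFields.Count - 1
                ' A real merge would call:
                ' oProcess.ProcessNodes(xmlFields(iField).Name, xmlFields(iField).InnerText)
                Console.WriteLine("  " & xmlFields(iField).Name & " = " & xmlFields(iField).InnerText)
            Next

            ' A real merge would then call oProcess.save(sSaveFile).
        Next
    End Sub
End Module
```

Loading a fresh copy of the template for each customer keeps the output documents independent of one another.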
The desired node name and the new text are passed to the ProcessNodes method as parameters. A separate method is used because, in the template, the ContactName element is present in two locations within the document, and the code should ensure that both of these locations are replaced with the same name.
So, in the ProcessNodes method, the specified node name is used to create XPath queries to retrieve lists of matching nodes. Then each query is executed with the SelectNodes method on the Word 2003 XML DOM object, oXMLWordDoc.
|
Public Sub ProcessNodes(ByVal sNodeName As String, ByVal sNodeValue As String)
    'replace the node(s) in the document
    'with the specified value
    Dim oNodeList As XmlNodeList

    'get nodes that have
    'embedded paragraph marks
    oNodeList = oXMLWordDoc.SelectNodes("//ns0:" + sNodeName + "//w:p", oNSMgr)
|
The interesting part of the code is the XPath queries. There are two of them, to ensure that you catch all of the nodes with the specified node name. Because some of the schema elements contain an entire paragraph and others are embedded within a paragraph, there is one query for each situation.
|
    If Not oNodeList Is Nothing Then
        FillNodes(oNodeList, sNodeValue)
    End If

    'get nodes that do NOT have
    'embedded paragraph marks
    oNodeList = oXMLWordDoc.SelectNodes("//ns0:" + sNodeName, oNSMgr)
    If Not oNodeList Is Nothing Then
        FillNodes(oNodeList, sNodeValue)
    End If
|
Because the queries use namespace prefixes, the SelectNodes method must be given an XmlNamespaceManager object, which is part of .NET's System.Xml namespace; otherwise, the query fails with an error. The XmlNamespaceManager object, stored in a public field of the WordXMLTest class, is populated within the New method, so it is ready as soon as the WordXMLTest class is instantiated.
The namespace URIs come directly from the Word 2003 XML file and may vary depending upon the target namespace declared in your schema and what Word 2003 assigns as a prefix to your schema.
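Since the URIs can vary, one option is to read the namespace declarations from the document itself instead of hard-coding them. The sketch below copies every prefixed declaration on the root element into an XmlNamespaceManager. It assumes that the declarations of interest sit on the root element, which you should confirm in your own file; the inline XML is a simplified stand-in.

```vb
Imports System
Imports System.Xml

Module NamespaceSketch
    Sub Main()
        ' Simplified stand-in for the root of a Word 2003 XML file.
        Dim sXml As String = _
            "<w:wordDocument" & _
            " xmlns:w=""http://schemas.microsoft.com/office/word/2003/wordml""" & _
            " xmlns:ns0=""http://schemas.eps-software.com/NWindTest"">" & _
            "<w:body/>" & _
            "</w:wordDocument>"

        Dim oDoc As New XmlDocument
        oDoc.LoadXml(sXml)

        Dim oNSMgr As New XmlNamespaceManager(oDoc.NameTable)

        ' Namespace declarations appear in the DOM as attributes whose prefix is "xmlns".
        For Each oAttr As XmlAttribute In oDoc.DocumentElement.Attributes
            If oAttr.Prefix = "xmlns" Then
                oNSMgr.AddNamespace(oAttr.LocalName, oAttr.Value)
                Console.WriteLine(oAttr.LocalName & " = " & oAttr.Value)
            End If
        Next

        ' The registered prefixes can now be used in XPath queries.
        Dim oBody As XmlNode = oDoc.SelectSingleNode("/w:wordDocument/w:body", oNSMgr)
        Console.WriteLine(Not oBody Is Nothing)
    End Sub
End Module
```

The XPath queries still need to use whichever prefix Word 2003 assigned to your schema, so this removes the hard-coded URIs but not the prefix names.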
The FillNodes method referenced in the ProcessNodes method receives a node list object and a new node value as parameters. It changes the contents of the specified nodes on the oXMLWordDoc object.
|
Private Sub FillNodes(ByVal oNodeList As XmlNodeList, ByVal sNodeValue As String)
    Dim i As Integer
    Dim oXMLNode, oInnerNode As XmlNode

    For i = 0 To oNodeList.Count - 1
        oXMLNode = oNodeList(i)
        'find the text element of the first run
        oInnerNode = oXMLNode.SelectSingleNode("w:r/w:t", oNSMgr)
        If Not oInnerNode Is Nothing Then
            oInnerNode.InnerText = sNodeValue
        End If
    Next
End Sub
|
The replacement actually occurs on the text between the <w:t> and </w:t> tags that appear within the specified node object. This ensures that no formatting is lost, as font and paragraph properties are specified in the elements that surround the <w:t> element.
The last bit is to take the modified XML and save it to disk with a different file name so that it can be viewed. This is done by calling the Save method on the Word 2003 XML DOM object:
|
    'write out the new Doc file.
    oProcess.save(sSaveFile)
. . .
Class WordXMLTest
    Public oXMLWordDoc As New XmlDocument
    Public oNSMgr As New XmlNamespaceManager(oXMLWordDoc.NameTable)

    Public Sub save(ByVal sFileName As String)
        oXMLWordDoc.Save(sFileName)
    End Sub
|
The Final Output
After running the program, you should now be able to double-click the output file and see the output in Word 2003, as shown below.
If you double-click the output XML file and it doesn't load in Word 2003, most likely you have commented out the processing instruction in your template file so that you could view the XML in your registered XML application. Simply remove the comment so that the processing instruction becomes active again, allowing the document to open directly in Word 2003.
Another trick for ensuring that the document opens in Word 2003 is to force a DOC extension on the final output of the program. For example, to force the OutFile.xml file to open in Word 2003, rename the file as OutFile.xml.doc.
You have seen how to take a Word document and process it using XML, a new feature of Word 2003. By marking up the desired document with an associated XML schema and saving it as XML, you've exposed the contents of the document through XML. With a little processing, the Word 2003 XML file is easily merged with XML data and can act as a template for a multitude of documents.
The complete VB.NET code that merges the XML with the marked-up document.
|
Imports System
Imports System.Xml

'module name assumed; the original listing omits the opening Module line
Module Module1

    Sub Main()
        Dim oProcess As New WordXMLTest
        Dim sDocPath, sDataPath, sSaveFile As String
        Dim sNodeName, sNewText As String
        Dim xmlDataDoc As New XmlDocument
        Dim xmlNodes As XmlNodeList
        Dim xmlNode As XmlNode
        Dim i As Integer

        sDocPath = "sample2.xml"
        sDataPath = "NWdata.xml"
        sSaveFile = "OutFile.xml"

        Try
            'load the WordXML into a DOM
            oProcess.LoadFile(sDocPath)

            'load data into DOM
            xmlDataDoc.Load(sDataPath)

            'iterate through data nodes
            xmlNodes = xmlDataDoc.SelectNodes("/results/customers/*")

            'replace Word doc area with data
            If Not xmlNodes Is Nothing Then
                For i = 0 To xmlNodes.Count - 1
                    xmlNode = xmlNodes(i)
                    sNodeName = xmlNode.Name
                    sNewText = xmlNode.InnerText
                    oProcess.ProcessNodes(sNodeName, sNewText)
                Next
            End If

            'write out the new Doc file.
            oProcess.save(sSaveFile)
        Catch oExc As Exception
            MsgBox(oExc.Message, MsgBoxStyle.Critical, "Error")
        End Try
    End Sub

End Module

Class WordXMLTest
    Public oXMLWordDoc As New XmlDocument
    Public oNSMgr As New XmlNamespaceManager(oXMLWordDoc.NameTable)

    Public Sub New()
        'add the schema's namespace to a name space manager
        LoadNS("ns0", "http://schemas.eps-software.com/NWindTest")
        LoadNS("w", "http://schemas.microsoft.com/office/word/2003/wordml")
    End Sub

    Public Sub LoadFile(ByVal sFilePath As String)
        oXMLWordDoc.Load(sFilePath)
    End Sub

    Private Sub LoadNS(ByVal sPrefix As String, ByVal sURI As String)
        oNSMgr.AddNamespace(sPrefix, sURI)
    End Sub

    Public Sub save(ByVal sFileName As String)
        oXMLWordDoc.Save(sFileName)
    End Sub

    Public Sub ProcessNodes(ByVal sNodeName As String, ByVal sNodeValue As String)
        'replace node(s) in document with value
        Dim oNodeList As XmlNodeList

        'gets nodes that have embedded paragraph marks
        oNodeList = oXMLWordDoc.SelectNodes("//ns0:" + sNodeName + "//w:p", oNSMgr)
        If Not oNodeList Is Nothing Then
            FillNodes(oNodeList, sNodeValue)
        End If

        'gets nodes that do NOT have
        'embedded paragraph marks
        oNodeList = oXMLWordDoc.SelectNodes("//ns0:" + sNodeName, oNSMgr)
        If Not oNodeList Is Nothing Then
            FillNodes(oNodeList, sNodeValue)
        End If
    End Sub

    Private Sub FillNodes(ByVal oNodeList As XmlNodeList, ByVal sNodeValue As String)
        Dim i As Integer
        Dim oXMLNode, oInnerNode As XmlNode

        For i = 0 To oNodeList.Count - 1
            oXMLNode = oNodeList(i)
            'find the text element of the first run
            oInnerNode = oXMLNode.SelectSingleNode("w:r/w:t", oNSMgr)
            If Not oInnerNode Is Nothing Then
                oInnerNode.InnerText = sNodeValue
            End If
        Next
    End Sub
End Class
|
You may download the code here.