Published on ONLamp.com (http://www.onlamp.com/)


Processing XML with Xerces and the DOM

by Q Ethan McCallum
09/08/2005

From data storage to data exchange and from Perl to Java, it's rare to write software these days and not bump into XML. Adding XML capabilities to a C++ application, though, usually involves coding around a C-based API.

Even the cleanest C API takes some work to wrap in C++, often leaving you to choose between writing your own wrappers (which eats up time) or using third-party wrappers (which means one more dependency). Adopt the Xerces-C++ parser and you can skip these middlemen. This mature, robust toolkit is portable C++ and is available under the flexible Apache Software License (version 2.0).

Xerces' benefits extend beyond its C++ roots. It gives you a choice of SAX and DOM parsers, and supports XML namespaces. It also provides validation by DTD and XML schema, as well as grammar caching for improved performance.

This article uses the context of loading, modifying, and storing an XML config file to demonstrate Xerces-C++'s DOM side. My first example shows some raw code for reading XML. Then I revise it a couple of times to address deficiencies. The last example demonstrates how to modify the XML document and write it back out to disk. Along the way, I've made some helper classes that make using Xerces a little easier. My next article will cover SAX and validation.

I compiled the sample code under Fedora Core 3/x86 using Xerces-C++ 2.6.0 and GCC 3.4.3.

A Quick DOM Primer

The Document Object Model (DOM) is a specification for XML parsing designed with portability in mind. That is, whether you're using Perl or Java or C++, the high-level DOM concepts are the same. This eases the learning curve when moving between DOM toolkits. (Of course, implementations are free to add special features and convenience above and beyond the requirements of the spec.)

DOM represents an XML document as a tree of nodes (Xerces class DOMNode). Consider Figure 1, an XML document of some airport information. DOM sees the entire document as a document node (DOMDocument), the only child of which is the root <airports> element node (DOMElement). Were there any document type declarations or comments at this level, they would also be child nodes of the document node.

Figure 1. The DOM of an XML document

The <airport> element is a child node of <airports>. Its only attribute, name, is an attribute node (DOMAttr). <airport> children include the <aliases>, <location>, and <comment> elements. <comment> has a child text node (DOMText), which contains the string "Terminal 1 has a very 1970's sci-fi decor."

DOM even makes XML comments available as nodes (DOMComment). The example comment block is another <airports> child node.

There are several other nodes between the elements, too: each chunk of white space (such as that between </location> and <comment>) is its own text node. It's a text node of white space, but it's still a valid node to the DOM.
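To make the tree shape concrete, here is a minimal sketch that models a fragment of the airport document using only the standard library. Node, NodeKind, and buildAirport() are hypothetical stand-ins invented for illustration, not Xerces classes. The point is that the white space between elements shows up as child nodes of its own:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for DOM node kinds; these are not Xerces types.
enum class NodeKind { Element , Text , Comment } ;

struct Node {
  NodeKind kind ;
  std::string value ;  // tag name for elements, content for text and comments
  std::vector< std::unique_ptr< Node > > children ;

  Node( NodeKind k , const std::string& v ) : kind( k ) , value( v ) {}

  // Appends a new child and returns a non-owning pointer to it.
  Node* add( NodeKind k , const std::string& v ){
    children.push_back( std::unique_ptr< Node >( new Node( k , v ) ) ) ;
    return children.back().get() ;
  }
} ;

// Counts the immediate children of the given kind.
std::size_t countChildren( const Node& parent , NodeKind kind ){
  std::size_t count = 0 ;
  for( const auto& child : parent.children ){
    if( child->kind == kind ){ ++count ; }
  }
  return count ;
}

// Builds a fragment shaped like:
//   <airport>
//     <location/>
//     <comment>Terminal 1 ...</comment>
//   </airport>
std::unique_ptr< Node > buildAirport(){
  std::unique_ptr< Node > airport( new Node( NodeKind::Element , "airport" ) ) ;

  airport->add( NodeKind::Text , "\n  " ) ;   // white space before <location>
  airport->add( NodeKind::Element , "location" ) ;
  airport->add( NodeKind::Text , "\n  " ) ;   // white space before <comment>

  Node* comment = airport->add( NodeKind::Element , "comment" ) ;
  comment->add( NodeKind::Text , "Terminal 1 has a very 1970's sci-fi decor." ) ;

  airport->add( NodeKind::Text , "\n" ) ;     // white space before </airport>

  // two elements plus three white-space text nodes
  assert( airport->children.size() == 5 ) ;
  return airport ;
}
```

Code that expects only elements among a node's children will trip over those three text nodes, which is why the later examples check node types.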

You can create, change, or remove nodes on this object representation of your document, then write the whole thing--comments included--back to disk as well-formed XML.

DOM requires that the parser load the entire document into memory at once, which can make handling large documents very memory intensive. For small to midsize XML documents, though, DOM offers portable read/modify/write capabilities to structured data when a full relational database (such as PostgreSQL or MySQL) is overkill.

First Look at Xerces Code

I prefer to explain this with source code. I will share some code excerpts inline, but as always, the complete source code for the examples is available for download.

The program step1 represents a portion of a fictitious report viewer. The config file tracks the time of its most recent modification, the user's login and password to the report system, and the last reports the user ran. Here's a sample of the config file:

<config lastupdate="1114600280">

  <login user="some name" password="cleartext" />

  <reports>
    <report tab="1" name="Report One" />
    <report tab="2" name="Report Two" />
    <report tab="3" name="Third Report" />
    <report tab="4" name="Fourth Report" />
    <report tab="5" name="Fifth Report" />
  </reports>

</config>

(Xerces also supports XML namespaces, though the sample code doesn't use them.)

The first thing to notice about step1 is the number of #included headers. Xerces has several header files, roughly one per class or concept. Some such projects have one master header file that includes the others. You could write one yourself, but including just the headers you need may speed up your build process.

Most Xerces constructs exist under the xercesc C++ namespace. You're certainly welcome to put using namespace directives in your code; but following good C++ form, the sample code explicitly states the namespace where needed.

main() calls routines to initialize and tear down the Xerces library:

xercesc::XMLPlatformUtils::Initialize();

// ... regular program ...

xercesc::XMLPlatformUtils::Terminate();

Your code must call Initialize() before using any Xerces classes. In turn, attempts to use Xerces classes after the call to Terminate() are invalid and will likely end in a segmentation fault. Initialize() may throw an exception, so I've wrapped it in a try/catch block. Notice the call to XMLString::transcode() in the catch section:

}catch( xercesc::XMLException& e ){

  char* message = xercesc::XMLString::transcode( e.getMessage() ) ;

  std::cerr << "XML toolkit initialization error: "
        << message
        << std::endl
  ;

  xercesc::XMLString::release( &message ) ;
...
Xerces uses its own UTF-16 character type, XMLCh, instead of plain char or std::string. The function transcode() converts between char* and XMLCh* strings. It relies on the caller to free the memory it allocates for strings, hence the call to XMLString::release().
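One way to keep the setup and teardown calls paired is a small guard object whose constructor and destructor make the two calls. The sketch below is not part of the sample code, and it uses a stand-in namespace (fakelib) so that it compiles without Xerces. In a real program, the two stand-in functions would be XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate(). LibraryGuard and runWithLibrary() are hypothetical names:

```cpp
#include <cassert>

// Stand-ins for a library's global setup and teardown calls.
// The counter exists only so the example can observe the state.
namespace fakelib {
  int activeCount = 0 ;
  void initialize(){ ++activeCount ; }
  void terminate(){ assert( activeCount > 0 ) ; --activeCount ; }
  bool ready(){ return activeCount > 0 ; }
}

// Hypothetical guard: sets up in the constructor, tears down in the destructor.
class LibraryGuard {
  public:
  LibraryGuard(){ fakelib::initialize() ; }
  ~LibraryGuard(){ fakelib::terminate() ; }

  // non-copyable, so teardown runs exactly once per guard
  LibraryGuard( const LibraryGuard& ) = delete ;
  LibraryGuard& operator=( const LibraryGuard& ) = delete ;
} ;

// Reports whether the library is usable while the guard is alive.
bool runWithLibrary(){
  LibraryGuard guard ;
  // Anything that uses the library should be declared after the guard,
  // so that it is destroyed before the guard tears the library down.
  return fakelib::ready() ;
}
```

Because locals are destroyed in reverse order of declaration, objects created after the guard go away before the teardown call runs.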

main() itself is very small. The XMLConfigData class encapsulates most of the Xerces calls. The accessors and mutators roughly match the contents of the XML file:

class XMLConfigData {

  public:
  
  void load() throw( std::runtime_error ) ;
  void commit() throw( std::runtime_error ) ;

  std::ostream& print( std::ostream& s ) const ;

  const std::string& getLastUpdate() const throw() ;

  void setLastUpdate( const std::string& in ) ;
  const std::string& getLoginUser() const throw() ;

  void setLoginUser( const std::string& in ) ;
  void setLoginPassword( const std::string& in ) ;

  int getReportCount() const throw() ;
  void addReport( const std::string& report ) ;

} ;

In addition to the configuration properties, the XMLConfigData constructor initializes two Xerces-related objects. The first is a XercesDOMParser named parser_. As its name implies, XercesDOMParser parses an XML document into a tree of DOMNode structures.

The second such object, tags_, is of the custom TagNames class. This convenience class holds XMLCh versions of the element and attribute names, so that code can address them uniformly without repeated calls to XMLString's transcode() and release(). It may be tempting to make these values static constants, but static objects are typically initialized before main() runs and C++ offers no guarantee on initialization order across translation units, so there would be no way to ensure that nothing uses the Xerces classes before the call to XMLPlatformUtils::Initialize().
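The following sketch illustrates the reasoning behind keeping the names in an ordinary object rather than in static constants. It uses only the standard library. The fakelib namespace stands in for a toolkit that must be initialized before use, and this TagNames is a hypothetical simplification, not the class from the sample code:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for a library that must be initialized before any other call.
namespace fakelib {
  bool initialized = false ;
  void initialize(){ initialized = true ; }

  // Any conversion requires prior initialization.
  std::u16string convert( const std::string& in ){
    if( ! initialized ){
      throw std::logic_error( "library used before initialize()" ) ;
    }
    return std::u16string( in.begin() , in.end() ) ;  // ASCII-only widening
  }
}

// Hypothetical TagNames-style holder. The names are converted in the
// constructor, so conversion happens when an instance is created
// rather than during static initialization.
struct TagNames {
  const std::u16string TAG_LOGIN ;
  const std::u16string ATTR_LASTUPDATE ;

  TagNames()
    : TAG_LOGIN( fakelib::convert( "login" ) ) ,
      ATTR_LASTUPDATE( fakelib::convert( "lastupdate" ) )
  {}
} ;

// Returns true if constructing a TagNames object fails.
bool constructionFails(){
  try {
    TagNames tags ;
    (void) tags ;
    return false ;
  } catch( const std::logic_error& ){
    return true ;
  }
}
```

Had the names been static members, their conversion would have run before main() had any chance to call the initialization routine.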

All the magic happens in XMLConfigData::load(). It first configures the parser object to disable validation:

parser_.setValidationScheme(
  xercesc::XercesDOMParser::Val_Never ) ;

parser_.setDoSchema( false ) ;
parser_.setLoadExternalDTD( false ) ;

(I'll revisit validation in my next article.)

The call to parser_.parse() parses the XML document into a DOMDocument object called xmlDoc. step1 passes parse() a filename, so behind the scenes Xerces creates a LocalFileInputSource to read data from a local file. parse() is overloaded to accept other input, such as a buffer of memory (MemBufInputSource), standard input (StdInInputSource), and data loaded via URL (URLInputSource).

step1 calls DOMDocument::getDocumentElement to fetch the top-level element (here, <config>) as a DOMElement object:

DOMElement* elementConfig = xmlDoc->getDocumentElement() ;

It then calls DOMElement::getAttribute() to fetch some attribute values:

// "tags_.ATTR_LASTUPDATE" is the
// XMLCh* version of "lastupdate"

const XMLCh* lastUpdateXMLCh =
  elementConfig->getAttribute( tags_.ATTR_LASTUPDATE ) ;

Given the document's tree structure, many nodes have child nodes. DOMNode::getChildNodes() returns a DOMNodeList, which is useful to iterate through those immediate children:

xercesc::DOMNodeList* children =
  elementConfig->getChildNodes() ;

const XMLSize_t nodeCount = children->getLength() ;
        
for( XMLSize_t ix = 0 ; ix < nodeCount ; ++ix ){
  xercesc::DOMNode* currentNode = children->item( ix ) ;
  // ... do something with currentNode ...
}

Use getChildNodes() to walk the document, one level at a time. Call DOMElement's getTagName() function to see the name of the current element:

if( XMLString::equals(
  tags_.TAG_LOGIN ,
  element->getTagName()
) ){
  // ... it's a <login> tag ...

(Of course, getTagName() is a member of DOMElement, so it applies only to element-type nodes.)

Remember, nodes can be more than elements. Be careful to avoid blindly downcasting a DOMNode to a subclass thereof. Compare a DOMNode constant with the result of a node's getNodeType() member to determine its type:

if( DOMNode::ELEMENT_NODE == currentNode->getNodeType() ){
  DOMElement* currentElement =
        dynamic_cast< DOMElement* >( currentNode ) ;

In theory, you could just dynamic_cast<> the node to an element and check the return value for NULL. Explicitly checking the node type is more in tune with DOM style. (Remember, it's a standard meant to work with several different languages.) It's also a little more verbose, which serves as a maintenance hint.
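Here is a self-contained sketch of the check-then-cast pattern. Node, Element, and Text are hypothetical miniatures of the DOMNode hierarchy, written so the example compiles without Xerces. The numeric values match the DOM specification's constants for element and text nodes:

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of the DOMNode / DOMElement relationship.
class Node {
  public:
  enum NodeType { ELEMENT_NODE = 1 , TEXT_NODE = 3 } ;
  virtual ~Node(){}
  virtual NodeType getNodeType() const = 0 ;
} ;

class Element : public Node {
  public:
  explicit Element( const std::string& tag ) : tag_( tag ) {}
  NodeType getNodeType() const override { return ELEMENT_NODE ; }
  const std::string& getTagName() const { return tag_ ; }

  private:
  std::string tag_ ;
} ;

class Text : public Node {
  public:
  NodeType getNodeType() const override { return TEXT_NODE ; }
} ;

// Returns the tag name for element nodes, or an empty string otherwise.
std::string tagNameOf( const Node* node ){
  if( Node::ELEMENT_NODE == node->getNodeType() ){
    const Element* element = dynamic_cast< const Element* >( node ) ;
    assert( element != nullptr ) ;  // the type check and the cast should agree
    return element->getTagName() ;
  }
  return std::string() ;
}
```

A white-space text node passed to tagNameOf() simply yields an empty string instead of a bad cast.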

Whereas getChildNodes() returns all child nodes, you can pass a filter to the parent DOMDocument's createTreeWalker() or createNodeIterator() to limit the nodes you see. A DOMTreeWalker lets you navigate the document hierarchy by sibling or parent/child relationship. A DOMNodeIterator, by comparison, is more like a database result set or cursor: you can scroll forward or backward over the list of returned nodes.

Walking node children is a brute-force means to find elements of interest. You can also call DOMElement::getElementsByTagName() to fetch a list of descendant elements of a certain name. step1 uses this to find the <report> child elements of the <reports> element:

// tags_.TAG_REPORT is the XMLCh* version of "report"

xercesc::DOMNodeList* reportNodes =
  element->getElementsByTagName( tags_.TAG_REPORT ) ;

  // ... iterate through the node list ...

Though it returns a DOMNodeList, Xerces guarantees that getElementsByTagName() will return only element nodes. As such, it's safe to dynamic_cast<>() the node list's items to DOMElement without first checking their type.
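The difference between walking immediate children and searching descendants can be sketched with a toy tree. Element, collectByTagName(), and buildConfig() below are hypothetical names, and the tree holds only elements for brevity. The search visits nodes in document order, which is the order the DOM specifies for getElementsByTagName():

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical element type; a real DOM tree also holds text and comments.
struct Element {
  std::string tag ;
  std::vector< Element > children ;
  explicit Element( const std::string& t ) : tag( t ) {}
} ;

// Collects every descendant of 'root' whose tag matches, in document order.
void collectByTagName(
  const Element& root ,
  const std::string& tag ,
  std::vector< const Element* >& out
){
  for( const Element& child : root.children ){
    if( child.tag == tag ){ out.push_back( &child ) ; }
    collectByTagName( child , tag , out ) ;  // descend past immediate children
  }
}

// Builds <config><login/><reports><report/><report/></reports></config>
Element buildConfig(){
  Element config( "config" ) ;
  config.children.push_back( Element( "login" ) ) ;

  Element reports( "reports" ) ;
  reports.children.push_back( Element( "report" ) ) ;
  reports.children.push_back( Element( "report" ) ) ;
  config.children.push_back( reports ) ;

  return config ;
}
```

Starting the search at <config> still finds both <report> elements, even though they are grandchildren rather than immediate children.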

Character Conversion

That was fine as a first try at using Xerces, but step1 still leaves room for improvement. (The code is certainly demo quality, but that's another issue.) The primary headache was the memory management: XMLString::transcode()'s dynamic memory allocation leaves a lot of cleanup work for the developer. I'd be surprised if the sample code doesn't risk a memory leak somewhere.

Two helper classes demonstrated in step2 use C++'s Resource Acquisition Is Initialization (RAII) idiom to ease the pain: they take ownership of loose pointers allocated by transcode(), and release() those strings in their destructors. Because C++ promises to call the destructor as the object goes out of scope, without developer intervention, it's useful for such fire-and-forget tactics.

The first helper object is StringManager. Its convert() method calls transcode() behind the scenes to convert between char* and XMLCh* strings:

StringManager sm ;
const XMLCh* someTag = sm.convert( "someTag" ) ;

When sm goes out of scope, its destructor calls XMLString::release() on all of the strings it created in convert(). StringManager is convenient when a block of code requires several loose string conversions.
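For readers who want to see the shape of such a class, here is a rough sketch of the idea, not the StringManager from the sample code. It compiles without Xerces: fakeTranscode() stands in for XMLString::transcode(), delete[] stands in for XMLString::release(), and the liveStrings counter exists only so the cleanup is observable:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef char16_t WideChar ;  // stand-in for a UTF-16 type such as XMLCh

// Stand-in for a transcode() call: allocates a wide copy (ASCII only)
// that the caller is responsible for freeing.
WideChar* fakeTranscode( const char* in ){
  const std::size_t len = std::strlen( in ) ;
  WideChar* out = new WideChar[ len + 1 ] ;
  for( std::size_t ix = 0 ; ix < len ; ++ix ){
    out[ ix ] = static_cast< WideChar >( in[ ix ] ) ;
  }
  out[ len ] = 0 ;
  return out ;
}

int liveStrings = 0 ;  // illustration only: counts strings not yet freed

// Hypothetical sketch of a StringManager-style owner.
class StringManagerSketch {
  public:
  StringManagerSketch(){}
  StringManagerSketch( const StringManagerSketch& ) = delete ;
  StringManagerSketch& operator=( const StringManagerSketch& ) = delete ;

  // Frees every string handed out by convert().
  ~StringManagerSketch(){
    for( std::size_t ix = 0 ; ix < owned_.size() ; ++ix ){
      if( owned_[ ix ] != nullptr ){
        delete[] owned_[ ix ] ;
        --liveStrings ;
      }
    }
  }

  // Converts a string and keeps ownership of the result.
  const WideChar* convert( const char* in ){
    owned_.push_back( nullptr ) ;          // reserve the slot first, so a
    owned_.back() = fakeTranscode( in ) ;  // failed push_back cannot leak
    ++liveStrings ;
    return owned_.back() ;
  }

  private:
  std::vector< WideChar* > owned_ ;
} ;

// Returns the number of live strings while the manager is still in scope.
int convertTwoAndCount(){
  StringManagerSketch sm ;
  sm.convert( "config" ) ;
  sm.convert( "login" ) ;
  return liveStrings ;
}
```

The returned pointers stay valid only as long as the manager object does, so the manager should outlive every use of the strings it hands out.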

The second helper class is DualString. It takes ownership of a single string, and lets code address the same logical character sequence as either C or Xerces character types:

// constructor is overloaded to accept const XMLCh* as well
DualString TAG_CONFIG( "config" ) ;
someXercesFunction( TAG_CONFIG.asXMLString() ) ;
someCFunction( TAG_CONFIG.asCString() ) ;

Giving credit where it's due, I didn't create DualString. I based it on the StrX class used in some of the Xerces sample code.
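As a rough illustration of the interface, the sketch below keeps both forms of a string side by side. It is a simplification under stated assumptions: DualStringSketch and asWideString() are hypothetical names, the class stores std::string and std::u16string values instead of owning transcode() buffers, and the conversion handles ASCII only:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch of a DualString-style class.
class DualStringSketch {
  public:
  // Builds both forms from a narrow string (ASCII-only widening).
  explicit DualStringSketch( const std::string& narrow )
    : narrow_( narrow ) , wide_( narrow.begin() , narrow.end() )
  {}

  // Builds both forms from a wide string; non-ASCII becomes '?'.
  explicit DualStringSketch( const std::u16string& wide )
    : narrow_() , wide_( wide )
  {
    for( char16_t c : wide ){
      narrow_.push_back( c < 0x80 ? static_cast< char >( c ) : '?' ) ;
    }
  }

  const char* asCString() const { return narrow_.c_str() ; }
  const char16_t* asWideString() const { return wide_.c_str() ; }

  private:
  std::string narrow_ ;
  std::u16string wide_ ;
} ;

// Lets the object go straight to an output stream.
std::ostream& operator<<( std::ostream& s , const DualStringSketch& d ){
  return s << d.asCString() ;
}

// Formats a label, to show the stream operator in use.
std::string describe( const DualStringSketch& d ){
  std::ostringstream out ;
  out << "tag: " << d ;
  return out.str() ;
}
```

Both accessors return pointers into the object's own storage, so they remain valid for as long as the object lives and need no separate cleanup.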

You can pass DualString directly to an output stream, so it's nice to use for one-offs such as printing the message from a Xerces exception:

}catch( xercesc::XMLException& e ){

  std::cerr << "XML toolkit teardown error: "
        << DualString( e.getMessage() )
        << std::endl
  ;
}

Of course, be sure to instantiate neither DualString nor StringManager on the heap via new; because you must explicitly delete such objects to run their destructors, that would defeat the automatic nature of the RAII technique.

You've probably noticed that step2 is a little more compact than step1--it replaces the awkward transcode()-temporary-release() dance with DualString and StringManager--but the real benefit is the lack of explicit memory management. These helper classes let you focus on using Xerces in your app, rather than pointer wrangling.

Some XMLString functions take a custom MemoryManager object that handles allocation. As an alternative to StringManager and DualString, you could create a memory manager and pass it to all XMLString calls. Either way works; I just find the simple helper classes more convenient.

Another Helper Class: Finding Elements

step1 is also a little rough in its search for elements: it walks the tree and chooses a path of action based on the name of the current element. This code is already messy, and in a complex document it would be much worse.

step3 moves element-finding logic into a separate helper class called ElementFinder. Instead of walking the tree, code can call ElementFinder functions. For example,

finder_.getConfigElement()

fetches the top-level <config> element.

This version of ElementFinder uses the same tree-walking logic as before; but the move to a separate class isn't about meaningless code shuffling. Hiding this behavior in another class makes it easy to switch from the ElementFinder code's brute-force method to XPath. (The Xerces-C++ FAQ suggests using Xalan or Pathan for true XPath searches.)

Don't be fooled--the Xerces API docs include classes for limited XPath features (DOMXPathEvaluator and DOMXPathExpression), but they're just placeholders. These classes' methods throw exceptions when called.
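To show why the extra class pays off, here is a small sketch of the idea using a toy tree instead of Xerces. ElementFinderSketch and Element are hypothetical, and getLoginElement() is an accessor assumed for illustration. Callers see only the named accessors, so the private lookup could later change from a brute-force walk to some other search without touching them:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical element type standing in for a DOM element.
struct Element {
  std::string tag ;
  std::vector< Element > children ;
  explicit Element( const std::string& t ) : tag( t ) {}
} ;

// Hypothetical finder: hides how elements are located.
class ElementFinderSketch {
  public:
  explicit ElementFinderSketch( Element& root ) : root_( root ) {}

  Element* getConfigElement(){ return &root_ ; }
  Element* getLoginElement(){ return findChild( "login" ) ; }
  Element* getReportsElement(){ return findChild( "reports" ) ; }

  private:
  // Brute-force lookup among the root's immediate children.
  // Returns a null pointer when no child has the given tag.
  Element* findChild( const std::string& tag ){
    for( std::size_t ix = 0 ; ix < root_.children.size() ; ++ix ){
      if( root_.children[ ix ].tag == tag ){
        return &root_.children[ ix ] ;
      }
    }
    return nullptr ;
  }

  Element& root_ ;
} ;
```

The finder keeps a reference to the tree, so the tree must outlive the finder.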

Modifying the DOM Tree in Memory

Having in-memory, tree-style access to an XML document is useful for plucking out pieces of data; but half of the benefit of DOM is the ability to modify that tree and save it back to a file. The program step4 rounds out the examples by altering the in-memory tree and saving it back to the original config file. All of this happens in the XMLConfigData::commit() method.

commit() first calls updateObject() and updateXML(). These update the last-modified date and sync the object with the backing DOM tree, respectively. updateXML() involves updating some attributes, replacing a node, and adding some new nodes.

Updating an attribute is similar to getting its value: call the element node's setAttribute() member. This excerpt sets the <config> element's lastupdate attribute:

xercesc::DOMElement* configElement =
  finder_.getConfigElement() ;

configElement->setAttribute(
  finder_.ATTR_LASTUPDATE_.asXMLString() ,
  sm.convert( getLastUpdate() )
) ;

The sample code doesn't update the <login> tag's user or password attributes, but those would follow the same formula.

Updating the <reports> node takes a little more work. You could delete all of the child <report> elements and create new ones. However, for purely illustrative purposes, step4 takes the long route: it creates a new <reports> element, populates that element with new <report> children, and then swaps the old <reports> for the new.

The parent document owns all nodes, by default. To create an element, call the parent document's createElement() member:

xercesc::DOMElement* newReportsElement =
  xmlDoc->createElement( finder_.TAG_REPORTS_.asXMLString() )  ;

Next, create new <report> elements and add them under the new <reports> element:

for( ... each report in the XMLConfigData object ...){
  xercesc::DOMElement* element =
    xmlDoc->createElement( ... ) ;

  newReportsElement->appendChild( element ) ;
}

Finally, swap the old and new elements:

xercesc::DOMElement* oldReportsElement =
  finder_.getReportsElement() ;

configElement->replaceChild(
  newReportsElement ,
  oldReportsElement
) ;

You don't have to free the oldReportsElement pointer explicitly, as the parent DOMDocument still owns it.
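The ownership rule is easier to see in a toy model. In the sketch below, a hypothetical Document class owns every element it creates, while parent/child links are plain non-owning pointers. Swapping a child therefore changes the tree without freeing anything. None of these types are Xerces classes. Unlike the DOM's replaceChild(), which raises an exception when the old child is not found, this version returns a null pointer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical element: children are non-owning links.
struct Element {
  std::string tag ;
  std::vector< Element* > children ;
  explicit Element( const std::string& t ) : tag( t ) {}
} ;

// Hypothetical document: owns every element it creates.
class Document {
  public:
  Element* createElement( const std::string& tag ){
    nodes_.push_back( std::unique_ptr< Element >( new Element( tag ) ) ) ;
    return nodes_.back().get() ;
  }

  std::size_t ownedCount() const { return nodes_.size() ; }

  private:
  std::vector< std::unique_ptr< Element > > nodes_ ;
} ;

// Puts newChild where oldChild was. Returns oldChild on success,
// or a null pointer if oldChild is not a child of parent.
Element* replaceChild(
  Element& parent , Element* newChild , Element* oldChild
){
  std::vector< Element* >::iterator pos = std::find(
    parent.children.begin() , parent.children.end() , oldChild ) ;
  if( pos == parent.children.end() ){ return nullptr ; }
  *pos = newChild ;
  return oldChild ;  // detached from the tree, still owned by the document
}
```

After the swap, the old element is unreachable from the tree but is cleaned up along with the document, which mirrors the behavior described above.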

Writing XML

The last part of commit() takes care of saving the DOM tree back to disk, using a LocalFileFormatTarget object. Xerces also supports storing XML in a memory buffer (MemBufFormatTarget) and writing to standard output (StdOutFormatTarget). You're free to implement your own FormatTarget class for custom output.

A DOMWriter object is responsible for writing out the data. step4 configures the DOMWriter to add spacing and formatting to make the document more human-readable:

xercesc::DOMWriter* writer =  ... create new writer ...
writer->setFeature(
  xercesc::XMLUni::fgDOMWRTFormatPrettyPrint ,
  true
);

Finally, step4 calls the writer to write out the document:

writer->writeNode( outfile , *xmlDoc ) ;

Note that because the parent document does not own the LocalFileFormatTarget and DOMWriter, the code deletes them explicitly.

If you check step4's output, you'll notice the in-memory DOM tree has become well-formed XML. Furthermore, the file is an accurate representation of the DOM tree managed by the program: unmodified nodes remain as is, including comments. (Remember, comments are valid XML constructs; they're just not valid elements.)
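To illustrate what a pretty-print option changes, here is a toy serializer written against a hypothetical Element struct. It is a sketch of the general technique, not of how DOMWriter works internally, and it skips details a real writer handles, such as escaping attribute values and preserving text and comment nodes:

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element with attributes and child elements only.
struct Element {
  std::string tag ;
  std::vector< std::pair< std::string , std::string > > attrs ;
  std::vector< Element > children ;
  explicit Element( const std::string& t ) : tag( t ) {}
} ;

// Writes 'e' and its descendants; 'pretty' adds indentation and newlines.
// Attribute values are written as is (no escaping in this sketch).
void writeElement(
  std::ostream& out , const Element& e , bool pretty , int depth = 0
){
  const std::string indent =
    pretty ? std::string( depth * 2 , ' ' ) : std::string() ;
  const std::string newline = pretty ? "\n" : "" ;

  out << indent << '<' << e.tag ;
  for( std::size_t ix = 0 ; ix < e.attrs.size() ; ++ix ){
    out << ' ' << e.attrs[ ix ].first
        << "=\"" << e.attrs[ ix ].second << '"' ;
  }

  if( e.children.empty() ){
    out << "/>" << newline ;
    return ;
  }

  out << '>' << newline ;
  for( std::size_t ix = 0 ; ix < e.children.size() ; ++ix ){
    writeElement( out , e.children[ ix ] , pretty , depth + 1 ) ;
  }
  out << indent << "</" << e.tag << '>' << newline ;
}

// Convenience wrapper that returns the serialized text.
std::string toXml( const Element& root , bool pretty ){
  std::ostringstream out ;
  writeElement( out , root , pretty ) ;
  return out.str() ;
}
```

With the flag off, the whole tree lands on one line; with it on, each nesting level is indented by two spaces.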

That's all for Xerces-C++ and DOM. My next article will show Xerces's SAX side and explain XML validation using DTD and schema.

Q Ethan McCallum grew from curious child to curious adult, turning his passion for technology into a career.


Copyright © 2009 O'Reilly Media, Inc.