Published on ONLamp.com (http://www.onlamp.com/)


Processing XML with Xerces and SAX

by Q Ethan McCallum
11/10/2005

In my previous article, I introduced the Xerces-C++ XML toolkit and explained how to use Xerces for DOM parsing. This time, I'll explain Xerces SAX parsing, plus error handling and validation.

SAX and DOM offer very different approaches to reading XML. Many people say the difference between the two is just about memory efficiency--but it's also a matter of control. With SAX you pluck out exactly what you want from the XML document, instead of wandering the DOM tree.

Xerces provides customizable error handling for both types of parsing. This means you can (in limited fashion) tell the parser how to react to certain types of problems.

Finally, there's validation using DTD and XML schema. (Come to think of it, validation is the reason for a lot of error handling.) Letting the parser handle the validation means your code can (safely) assume a certain document structure. That keeps your code clean, because it can focus on the business logic behind what's in the XML instead of watching out for missing elements.

I compiled the sample code under Fedora Core 3/x86 using Xerces-C++ 2.6.0 and GCC 3.4.3. The code uses the helper classes described in the previous article, but you don't need to understand them to follow this article.

SAX at a High Level

Whereas DOM gives you an object graph of the entire document at once, SAX parsing is a streaming model: it reads a document sequentially and hands your code chunks of data to process. You have to catch items of interest when they appear, because you can't rewind the stream to get them later. It's not unlike scrolling movie credits: the information is gone once it rolls off the screen. SAX is reminiscent of parsing with lex and yacc, though it's much simpler: the rules of XML define much of the grammar for you.

Because you don't load the entire document into memory, SAX can be quite resource-efficient. On the other hand, it requires more elbow grease than DOM: it's up to your code to churn the stream of data into usable objects.

This makes SAX feel lower-level than DOM, similar to how people compare memory management in C and Java. It's true that SAX puts you closer to the raw parsing, but that's one reason to choose SAX over DOM. What if you have a large document and you're interested in only a few particular elements? DOM's overhead is wasteful in that case. SAX lets you sift through the data and extract just what you need.

SAX Mechanics

At the code level, a SAX parser interprets an XML document as a series of events. The most common events are element start tags and end tags and the body content between them.

Each event is tied to a callback function on a handler object--this is the piece you write--that you register with the parser. When a parser encounters an element's start tag, for example, it passes the element and any attributes to the start tag callback. When it's parsing the body text between the start and end tags, it passes chunks of content to the body content callback.

In fact, the handler does all the work. The main() function of the sample program step1 is very short. There's just enough code to create a parser, assign a handler, and run:

// ... skipping basic Xerces setup explained
// in the previous article ...

xercesc::SAX2XMLReader* p =
  xercesc::XMLReaderFactory::createXMLReader();

xercesc::ContentHandler* h =
  new SimpleContentHandler() ;

p->setContentHandler( h ) ;

// ... set some options on the parser ...

p->parse( xmlFile ) ;

SAX handler classes implement the ContentHandler interface. Here it is, abbreviated:

class ContentHandler {

  virtual void startDocument() = 0 ;
  virtual void endDocument() = 0 ;

  virtual void startElement(
    const XMLCh* const uri,
    const XMLCh* const localname,
    const XMLCh* const qname,
    const Attributes&  attrs
  ) = 0 ;

  virtual void endElement(
    const XMLCh* const uri,
    const XMLCh* const localname,
    const XMLCh* const qname
  ) = 0 ;

  virtual void characters(
    const XMLCh* const chars,
    const unsigned int length
  ) = 0 ;

  virtual void startPrefixMapping(
    const XMLCh* const prefix,
    const XMLCh* const uri
  ) = 0 ;

  virtual void endPrefixMapping(
    const XMLCh* const prefix
  ) = 0 ;

  virtual void processingInstruction(
    const XMLCh* const target,
    const XMLCh* const data
  ) = 0 ;

  // ... a couple of other member functions ...
} ;

Most of the methods are self-explanatory. When the parser encounters body content between a start and end tag, it calls characters(). There may be several calls for the same element's body content, because the parser makes no guarantee that it will hand you all of that content at once. The parser's character buffer may be several kilobytes in size, so in practice you'll often see just one call per element--but that's not a safe assumption. Do yourself a favor and stash characters() data in a buffer, such as a std::ostringstream.

ContentHandler is a pure virtual class, or interface, and it can be tedious to implement all its methods if you're interested only in certain events. (Most developers care only about startElement(), endElement(), and characters().) Xerces includes a convenience class called DefaultHandler that provides do-nothing implementations of ContentHandler's methods. Your handler can simply inherit from DefaultHandler and override the methods of interest.

The stub program step1 demonstrates this. It uses a simple "talking" handler, the callback methods of which announce when they are called. Here's a sample XML file and the corresponding output from step1.

[begin: parse] [setDocumentLocator()] [startDocument()]
<airports> [startElement( "airports" )] [characters( XMLCh[11] ) ]
<airport name="CDG"> [startElement( "airport" )] "name" => "CDG" (type: CDATA) [characters( XMLCh[14] ) ]
<aliases> [startElement( "aliases" )] [characters( XMLCh[13] ) ]
<alias>Charles de Gaulle airport</alias> [startElement( "alias" )] [characters( XMLCh[25] ) ] [endElement( "alias" )] [characters( XMLCh[13] ) ]
<alias>Roissy airport</alias> [startElement( "alias" )] [characters( XMLCh[14] ) ] [endElement( "alias" )] [characters( XMLCh[10] ) ]
</aliases> [endElement( "aliases" )] [characters( XMLCh[14] ) ]
<location>Paris, France</location> [startElement( "location" )] [characters( XMLCh[13] ) ] [endElement( "location" )] [characters( XMLCh[14] ) ]
<comment> Terminal 3 has a very 1970s sci-fi decor </comment> [startElement( "comment" )] [characters( XMLCh[61] ) ] [endElement( "comment" )] [characters( XMLCh[11] ) ]
</airport> [endElement( "airport" )] [characters( XMLCh[11] ) ]
<!-- ... other airport defs ... -->
</airports> [characters( XMLCh[8] ) ] [endElement( "airports" )]
[endDocument()] [end: parse] [done]

This code is rather dull, even for a learning device. What it demonstrates, though, is that startElement() and endElement() announce the name of the current element. You can use this to create a selective handler, one that reports only a certain subset of SAX events.

Suppose that the sample file included definitions for several airports, and you wanted to see the names and aliases for each one. Your handler's startElement() method would look something like this:

void startElement( ... element name , attributes , etc .. ){

  if( ... element name is "airport" ... ){

    ... print the element's "name" attribute ...

  } else if( ... element name is "alias" ... ){

    ... print the alias ...

  } else {

    ... ignore it ...

  }

}

Such a handler would show:

CDG
        Charles de Gaulle airport
        Roissy airport
...

(I'm cheating here, because I don't show how to capture the elements' body content. I'll explain that shortly.)

Another point made evident by the previous example is that SAX, unlike DOM, doesn't make XML comments available to your code. In the sample XML file, the phrase ... other airport defs ... is invisible to SAX handlers.

Also notice that the phrase Terminal 3 has a very 1970s sci-fi decor has only 42 characters, yet the characters() callback reports 61. The remaining characters constitute white space between the phrase and its surrounding tags. If I were to do anything useful with the <comment> element in the example, I would have to trim the white space from either end.

From XML to Objects

Programs don't usually work with raw XML. Instead, they use XML as a storage and transport medium and parse it into objects. That is to say, an XML document typically reflects some data structure(s).

The sample program step2 is a more practical example, and it shows how to use SAX to turn XML into objects. As a bonus, it demonstrates SAX's power to extract small pieces of information from a large document.

The XML Package Metadata project defines an XML format that describes all the RPMs in a yum repository. step2's job is to extract the name, license, and vendor of each package defined in the file.

Instead of unmarshaling the entire document--roughly 5MB--you can use SAX to extract only the pieces of interest. Think of the package metadata file as a database and the handler as the query.

To quote a colleague, a SAX handler is just an event-driven state machine. Parsing a document into usable objects thus requires objects that you can assemble in stages, and a plan to match states--SAX events--to calls on those objects.

It's easier to start with the objects. Ideally, they can be simple data-holder classes with accessors (getXyz) and mutators (setXyz) to get and set properties, respectively.

step2 uses the RPMInfo object:

class RPMInfo {

  const std::string& getName() const throw() ;
  void setName( const std::string& in ) ;

  const std::string& getVersion() const throw() ;
  void setVersion( const std::string& in ) ;

  const std::string& getLicense() const throw() ;
  void setLicense( const std::string& in ) ;

  const std::string& getVendor() const throw() ;
  void setVendor( const std::string& in ) ;

} ;

Next, the plan of attack. At a high level, the plan to parse the package metadata file is as follows:

namespace   element    action

{default}   <package>  rpm = new RPMInfo()
{default}   <name>     rpm->setName()
rpm         <license>  rpm->setLicense()
rpm         <vendor>   rpm->setVendor()

This isn't a completely accurate depiction of what happens, though. step2 looks at the body content of the <name>, <license>, and <vendor> elements. Unlike DOM parsing, where you get the entire element at once--body content and all--SAX requires a little more creativity. Capturing body content requires you to use start and end tags to mark state, then collect what you need in the characters() callback.

In other words, a handler must keep track of the current tag so that it can decide what to do. There are many ways to do this. I use a function to match element names to symbolic constants, and push those constants onto a std::stack of int. In turn, each handler callback is a switch() block:

switch( ... top element on stack ... ){

  case ELEMENT_LICENSE:
    ... handle case for
      <license> element ...

  case ELEMENT_VENDOR:
    ... handle case for
      <vendor> element ...

  ... and so on ...
}

The following table matches SAX events to handler actions:

startElement() Note the element name, and put its corresponding numeric constant on the stack. If it's a <package> start tag, create a new RPMInfo object and assign it to a member variable.
characters() If the current element (based on the stack) is the target, stash the buffer content in a std::ostringstream.
endElement() Based on the current element, pass the contents of the std::ostringstream to one of the current RPMInfo object's mutators. Pop the tag stack and clear the buffer used by characters().

Generally speaking, start tag events are a good place to create a new object. End tag events are a place to wrap up: assign the character buffer to some property, pop the tag stack, and so on.

Finally, note that step2 requires you to cheat; you must uncompress the package metadata file before you can parse it. This data is usually compressed, but writing code to decompress the file before feeding it to Xerces is beyond the scope of this article. I've included a small sample file for demonstrative purposes.

Error Handling

The previous example is incomplete, or at least optimistic. What happens if the input is missing a closing tag, or otherwise not well-formed XML? By default, the parser throws an exception and stops reading the document.

Xerces offers limited control over what to do in the event of an error. You can assign a custom error handler object to the parser, which gives you the chance to report a more useful error message (such as one that includes the filename and location of the error) or, in some cases, ignore the error altogether. The sample program step3 demonstrates a simple error handler.

Custom error handlers implement the ErrorHandler interface:

class ErrorHandler {

  virtual void warning( const SAXParseException& e ) = 0 ;

  virtual void error( const SAXParseException& e ) = 0 ;

  virtual void fatalError( const SAXParseException& e ) = 0 ;

  virtual void resetErrors() = 0 ;

} ;

The first three methods are callbacks, which fire in the event of a warning, parsing error, and fatal parsing error, respectively. resetErrors() is called at the end of each parse to give the object a chance to refresh itself. For example, if you track the number of errors you encounter, you can use resetErrors() to reset that counter to zero.

Writing a custom ErrorHandler doesn't offer too much in the way of recovery, though. You can choose to swallow the exceptions instead of rethrowing them; but consider whether the document is worth parsing after you get an error.

You can still use the error handler to tell you where the error occurred. (Hint: assign the same Locator to both the parser and your custom error handler.) It's much nicer to report "Error in file X, line Y, column Z" instead of just "Your very large document has an error. Good luck."

Validation

Note the difference between well-formed and valid XML. The first is purely structural ("Does every start tag have a corresponding end tag?"), and the parser handles it automatically. The second is specific to your application ("Does the <airport> element exist?"), and it is your responsibility to handle it, at least in part.

For example, the basic error handling in step3 will force the parser to halt if the document is not well-formed (not structurally sound); but if a required element is missing, step3 will blindly pass half-formed RPMInfo objects around the rest of the app. Code that handles RPMInfo objects shouldn't have to worry about that. At the same time, how does a parser know when an XML document is suitable for your purposes?

Your side of the valid-document bargain is to provide a DTD or schema/XSD (collectively, grammars) that defines how your XML document should look. Grammars are a contract between your code and the incoming XML documents. Assign a grammar to the parser, and it will enforce that contract for you.

Declare a document's grammar in its XML prolog. For example, the following code excerpt declares a DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE
  metadata
  SYSTEM
  "http://linux.duke.edu/projects/metadata/dtd/primary.dtd">

... rest of XML document ...

The parser will ignore this unless you explicitly tell it to load external grammars and enable validation. For a Xerces SAX parser, that is:

// enable schema
saxParser->setFeature(
  xercesc::XMLUni::fgXercesSchema ,
  true
) ;

// load external DTDs
saxParser->setFeature(
  xercesc::XMLUni::fgXercesLoadExternalDTD ,
  true
) ;

// enable validation
saxParser->setFeature(
  xercesc::XMLUni::fgSAX2CoreValidation ,
  true
) ;

The code is very similar for DOM parsers:

// load external DTDs
domParser->setLoadExternalDTD( true ) ;

// enable schema
domParser->setDoSchema( true ) ;

// enable validation
domParser->setDoValidation( true ) ;

XML validation makes your code cleaner, because the code can focus on the task at hand, knowing for certain that the contract's invariants hold ("There will always be a <foo/> element"). And if you don't already account for document structure in your code, validation lets you sleep easier in spite of the lack of explicit checks.

Entity Resolvers

It's easy enough to point to a DTD or schema at a known local path, but you may have noticed that many DTDs and schemas reference off-site URLs. You certainly don't want to hit the internet every time you parse a document. Besides, those URLs serve more as unique identifiers than as download locations; some of them don't resolve to anything at all. How does this work?

Xerces handles this with an entity resolver (Xerces class EntityResolver for SAX and XMLEntityResolver for DOM). When the parser encounters an entity in the document, it asks the resolver where to find it. The default resolver just tries to load entities from whatever location is specified; custom entity resolvers match the incoming name to some other resource--local file, in-memory document, alternate URL--and return that instead.

The EntityResolver class has a simple interface:

class EntityResolver {
  xercesc::InputSource* resolveEntity (
    const XMLCh* const publicId ,
    const XMLCh* const systemId
  )
} ;

Its resolveEntity() callback method returns an InputSource from which the parser can read the grammar. For example, a LocalFileInputSource reads from an on-disk file. A MemBufInputSource loads data from an in-memory buffer.

(The code for DOM is only slightly different; it takes an XMLResourceIdentifier object instead of the public ID/system ID pair.)

The sample program step4 demonstrates this with the resolver SimpleEntityResolver, which uses an internal map to match grammar URIs to local resources. Note that resolveEntity() returns a new InputSource each time it is called; the parser takes ownership of the pointer. resolveEntity() returns NULL if it cannot find the document.

Conclusion

Xerces-C++ is a robust, feature-filled XML parser toolkit. These two articles have introduced the basics of using Xerces, but by no means do they cover everything you need to know. They should, however, serve as a starting point for further exploring the product documentation and trying your own experiments.

Q Ethan McCallum grew from curious child to curious adult, turning his passion for technology into a career.


Copyright © 2009 O'Reilly Media, Inc.