ONJava.com -- The Independent Source for Enterprise Java

Simple XML Parsing with SAX and DOM

Unmarshalling with SAX

SAX, the Simple API for XML, is a traditional, event-driven parser. It reads the XML document incrementally and calls a callback function in the application code whenever it recognizes a token. Callback events are generated for the beginning and end of the document, the beginning and end of each element, and so on. The callbacks are declared in the interface org.xml.sax.ContentHandler, which every SAX-based document handler class must implement; implementing them is the responsibility of the application programmer. Often, the application does not care about some of the events reported by the SAX parser. For these cases, there is a convenience class, org.xml.sax.helpers.DefaultHandler, which provides empty implementations for all functions declared in ContentHandler; custom classes simply extend DefaultHandler and override only those callbacks in which they are interested. This is done in the code below.

At the heart of a program (or class) that uses the SAX parser typically lies a stack. Whenever an element starts, a new data object of the appropriate type is pushed onto the stack. Later, when the element closes, the topmost object on the stack is complete and can be popped. Unless it was the root element (in which case the stack is empty after the pop), the popped object is a child of the object that now occupies the top of the stack, and can be inserted into that parent. This process corresponds to the shift-reduce cycle of bottom-up parsers. Note that the requirement that XML elements must not overlap is crucial for this idiom to work.

Example 1. Unmarshalling with SAX.

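import java.util.Stack;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class SaxCatalogUnmarshaller extends DefaultHandler {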
    private Catalog catalog;

    private Stack stack;
    private boolean isStackReadyForText;

    private Locator locator;

    // ----- 

    public SaxCatalogUnmarshaller() {
	stack = new Stack();
	isStackReadyForText = false;
    }

    public Catalog getCatalog() { return catalog; }

    // ----- callbacks: -----

    public void setDocumentLocator( Locator rhs ) { locator = rhs; }

    // ----- 

    public void startElement( String uri, String localName, String qName,
			      Attributes attribs ) {

	isStackReadyForText = false;

	// if next element is complex, push a new instance on the stack
	// if element has attributes, set them in the new instance
	if( localName.equals( "catalog" ) ) {
	    stack.push( new Catalog() );

	}else if( localName.equals( "book" ) ) {
	    stack.push( new Book() );

	}else if( localName.equals( "magazine" ) ) {
	    stack.push( new Magazine() );

	}else if( localName.equals( "article" ) ) {
	    stack.push( new Article() );
	    String tmp = resolveAttrib( uri, "page", attribs, "unknown" );
	    ((Article)stack.peek()).setPage( tmp );
	}
	// if next element is simple, push StringBuffer 
	// this makes the stack ready to accept character text
	else if( localName.equals( "title" ) || localName.equals( "author" ) ||
		 localName.equals( "name"  ) || localName.equals( "headline" ) ) {
	    stack.push( new StringBuffer() );
	    isStackReadyForText = true;
	}
	// if none of the above, it is an unexpected element
	else{
	    // do nothing
	}
    }

    // ----- 

    public void endElement( String uri, String localName, String qName ) {

	// recognized text is always content of an element
	// when the element closes, no more text should be expected
	isStackReadyForText = false;

	// pop stack and add to 'parent' element, which is next on the stack
	// important to pop stack first, then peek at top element!
	Object tmp = stack.pop();
	if( localName.equals( "catalog" ) ) {
	    catalog = (Catalog)tmp;
	}else if( localName.equals( "book" ) ) {
	    ((Catalog)stack.peek()).addBook( (Book)tmp );

	}else if( localName.equals( "magazine" ) ) {
	    ((Catalog)stack.peek()).addMagazine( (Magazine)tmp );
	}else if( localName.equals( "article" ) ) {
	    ((Magazine)stack.peek()).addArticle( (Article)tmp );
	}
	// for simple elements, pop StringBuffer and convert to String
	else if( localName.equals( "title" ) ) {
	    ((Book)stack.peek()).setTitle( tmp.toString() );

	}else if( localName.equals( "author" ) ) {
	    ((Book)stack.peek()).setAuthor( tmp.toString() );

	}else if( localName.equals( "name" ) ) {
	    ((Magazine)stack.peek()).setName( tmp.toString() );

	}else if( localName.equals( "headline" ) ) {
	    ((Article)stack.peek()).setHeadline( tmp.toString() );
	}
	// if none of the above, it is an unexpected element:
	// necessary to push popped element back!
	else{
	    stack.push( tmp );
	}
    }

    // -----
    public void characters( char[] data, int start, int length ) {

	// if stack is not ready, data is not content of recognized element
	if( isStackReadyForText ) {
	    ((StringBuffer)stack.peek()).append( data, start, length );
	}else{
	    // read data which is not part of recognized element
	}
    }

    // -----
    private String resolveAttrib( String uri, String localName, 
			          Attributes attribs, String defaultValue ) {
	String tmp = attribs.getValue( uri, localName );
	return (tmp!=null)?(tmp):(defaultValue);
    }
}

Of the various callback methods declared in the ContentHandler interface, only four are implemented here. In unmarshalling a document, we are primarily interested in the contents that are encoded in it. Therefore, the relevant events are the beginning and end of an element, and the occurrence of raw character data inside an element. We also implement the setDocumentLocator() method. Although not used in the application code, it can be very helpful in debugging. The org.xml.sax.Locator interface acts like a cursor, pointing to the position in the XML document where the last event occurred. It provides useful methods such as getLineNumber() and getColumnNumber().
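
To make the role of the Locator more concrete, here is a minimal sketch of a handler that records the line on which each start tag is reported. The class name LocatingHandler and its helper trace() are hypothetical and not part of the article's example; the sketch assumes the parser comes from JAXP's SAXParserFactory and guards against a parser that supplies no locator at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records where each start tag was reported.
class LocatingHandler extends DefaultHandler {
    private Locator locator;
    private final List<String> log = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator rhs) { locator = rhs; }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        // A parser is not obliged to supply a locator, so guard against null.
        int line = (locator != null) ? locator.getLineNumber() : -1;
        log.add(qName + "@" + line);
    }

    // Parses the given XML text and returns one "name@line" entry per element.
    static List<String> trace(String xml) {
        LocatingHandler handler = new LocatingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.log;
    }

    public static void main(String[] args) {
        System.out.println(trace("<catalog>\n<book/>\n</catalog>"));
    }
}
```

The same locator could be consulted inside an error branch of startElement() or endElement() to report exactly where an unexpected element was found.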

Start of Element

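When the startElement() function is called, the SAX parser passes it a number of arguments. The first three are (in order): the namespace URI, the local name, and the qualified name (the name with its prefix, if any) of the element. With the SAX2 default settings, namespace processing is enabled, so the URI and the local name are always supplied, while the qualified name is optional. Be aware that a parser obtained through JAXP's SAXParserFactory is not namespace-aware unless setNamespaceAware(true) has been called on the factory; without namespace processing, the local name may be reported as an empty string. Since the catalog document does not introduce any XML namespaces, we only use the local name in the present application.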
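
Which of the three name arguments is actually filled in depends on how the parser was configured. The following sketch, using a hypothetical NameProbe class, captures the names reported for the root element with namespace processing switched on or off through JAXP's SAXParserFactory. It is an illustration under that assumption, not part of the article's example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the names reported for the root element.
class NameProbe extends DefaultHandler {
    String localName;           // local name as reported by the parser
    String qName;               // qualified name as reported by the parser
    private boolean seenRoot = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        if (!seenRoot) {
            this.localName = localName;
            this.qName = qName;
            seenRoot = true;
        }
    }

    // Parses the XML text with or without namespace processing.
    static NameProbe probe(String xml, boolean namespaceAware) {
        NameProbe handler = new NameProbe();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        NameProbe aware = probe("<catalog/>", true);
        NameProbe unaware = probe("<catalog/>", false);
        System.out.println("aware:   local=[" + aware.localName + "]");
        System.out.println("unaware: qName=[" + unaware.qName + "]");
    }
}
```

A handler that compares local names, like the one in Example 1, therefore needs a namespace-aware parser; a handler meant to work with either configuration could fall back to the qualified name when the local name is empty.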

The last argument holds the attributes of the present element (if any) in a specific container, which allows retrieval of the attributes by their names, as well as iteration over all attributes using an integer index.
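
Both access styles of the Attributes container can be sketched as follows. The class AttributeDump is hypothetical; it uses the qualified-name lookup getValue(String) and the index-based getQName(int) and getValue(int), and it assumes a default (not namespace-aware) JAXP parser, for which qualified names are available. SAX does not guarantee the order in which attributes are listed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: reads attributes by name and by index.
class AttributeDump extends DefaultHandler {
    final List<String> seen = new ArrayList<>();
    String page;    // value of the "page" attribute, or null if absent

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        // Retrieval by (qualified) name; returns null if there is no match.
        page = attribs.getValue("page");

        // Iteration over all attributes using an integer index.
        for (int i = 0; i < attribs.getLength(); i++) {
            seen.add(attribs.getQName(i) + "=" + attribs.getValue(i));
        }
    }

    static AttributeDump dump(String xml) {
        AttributeDump handler = new AttributeDump();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        AttributeDump d = dump("<article page=\"12\" lang=\"en\"/>");
        System.out.println(d.page + " " + d.seen);
    }
}
```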

Elements are recognized by their local names. If the current element is a complex element, an object of the appropriate type is instantiated and pushed onto the stack. If the current element is simple, a new StringBuffer is pushed onto the stack instead, ready to accept character data.

Finally, the <article> element has an attribute, which is read from the attribs argument and inserted into the newly created article object on top of the stack. The attribute is extracted using the convenience function resolveAttrib(), which returns the attribute value or a default text, if the attribute is missing.

End of Element

The endElement() function is called with essentially the same arguments as the startElement() function; only the list of attributes is missing. For every recognized element, the topmost object on the stack is popped, converted to the proper type, and inserted into its parent, which now occupies the top of the stack. Only the root element, which has no parent, is treated differently: it is stored as the finished catalog. For an unrecognized element, nothing was pushed in startElement(), so the popped object is pushed back unchanged.
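
Stripped of the catalog-specific types, the push-on-start, pop-on-end idiom can be sketched with a generic node class. TreeBuilder and its nested Node type are hypothetical stand-ins for Catalog, Book, and the other classes of the example, and the sketch treats every element as a complex element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds a generic element tree using a stack.
class TreeBuilder extends DefaultHandler {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Deque<Node> stack = new ArrayDeque<>();
    private Node root;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        stack.push(new Node(qName));            // element opens: push
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Node done = stack.pop();                // element closes: pop first
        if (stack.isEmpty()) {
            root = done;                        // the root has no parent
        } else {
            stack.peek().children.add(done);    // then attach to the parent
        }
    }

    static Node build(String xml) {
        TreeBuilder handler = new TreeBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.root;
    }

    public static void main(String[] args) {
        Node root = build("<catalog><book/><magazine><article/></magazine></catalog>");
        System.out.println(root.name + " has " + root.children.size() + " children");
    }
}
```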

Raw Text

Finally, the callback function named characters() is called when the parser encounters raw text. It is passed a char array, containing the actual data, as well as a position at which to start reading and the length of data to be read from the array. Of course, it is illegal to access the data array outside of those boundaries. The implementation of the callback method inserts the data into the StringBuffer on the stack.

The way the characters() function is called by the underlying SAX parser often leads to some initial confusion, for two reasons. Firstly, there is no guarantee that a stretch of contiguous data results in only a single call to characters() -- it would be perfectly legal for the parser to invoke the callback function for each individual character of text! Although this is certainly an extreme scenario, it is quite common for text with embedded entity references to result in several calls to characters(): one for the text before the reference, a separate call for the entity itself, and finally, one for the remaining text. This is the reason that a StringBuffer is pushed on the stack if a simple element is encountered when reading the example document. (In fact, using a StringBuffer with the characters() callback function is a common idiom when using the SAX API.)
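
The buffering idiom can be tried in isolation with the following sketch. The class TextCollector is hypothetical; it appends every chunk to a StringBuilder and counts the calls to characters(). How many calls a given parser makes for the same text is implementation-dependent, so only the accumulated text should be relied upon.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates text delivered in one or more chunks.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();
    int calls = 0;  // number of times characters() was invoked

    @Override
    public void characters(char[] data, int start, int length) {
        calls++;
        // Only the range [start, start + length) may be read.
        text.append(data, start, length);
    }

    static TextCollector collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TextCollector c = collect("<title>Tom &amp; Jerry</title>");
        System.out.println("[" + c.text + "] in " + c.calls + " call(s)");
    }
}
```

Converting the buffer to a String only in endElement(), as Example 1 does, makes the result independent of how the parser splits the text.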

The second reason that characters() can lead to confusion is that it is called for all text characters encountered by the parser, including whitespace, even the whitespace between element tags (such as newlines and tabs). This is surprising, since ContentHandler defines a special callback method ignorableWhitespace(), taking the same arguments as characters(). However, without a DTD or XML Schema, this method is never called, since there is no way for the parser to distinguish whether some whitespace is ignorable or not. In the present example program, the boolean flag isStackReadyForText serves to distinguish the text content of a simple element from such whitespace between tags. The stack only becomes ready to accept text when a simple element has started and before it has ended.
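
The effect of such a flag can be seen by collecting the text twice, once unconditionally and once only while inside a simple element. The class WhitespaceDemo and its field names are hypothetical, and the sketch assumes a document without a DTD, parsed by a default JAXP parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: contrasts filtered and unfiltered character data.
class WhitespaceDemo extends DefaultHandler {
    final StringBuilder all = new StringBuilder();        // every character reported
    final StringBuilder titleOnly = new StringBuilder();  // text inside <title> only
    private boolean readyForText = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attribs) {
        readyForText = qName.equals("title");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        readyForText = false;   // no more text expected once an element closes
    }

    @Override
    public void characters(char[] data, int start, int length) {
        all.append(data, start, length);
        if (readyForText) {
            titleOnly.append(data, start, length);
        }
    }

    static WhitespaceDemo run(String xml) {
        WhitespaceDemo handler = new WhitespaceDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        WhitespaceDemo d = run("<book>\n  <title>X</title>\n</book>");
        System.out.println("unfiltered length: " + d.all.length());
        System.out.println("filtered: [" + d.titleOnly + "]");
    }
}
```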
