ONJava.com -- The Independent Source for Enterprise Java

Simple XML Parsing with SAX and DOM

Unmarshalling with DOM

The Document Object Model (DOM) describes an XML document as a tree-like structure, with every XML element being a node in the tree. A DOM-based parser reads the entire document, and (at least in principle) forms the corresponding document tree in memory. The DOM tree is formed from classes that all implement the org.w3c.dom.Node interface. This interface provides functions to walk or modify the tree (such as getChildNodes(), or appendChild() and removeChild()), and, of course, methods to query each node for its name and value.
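To make the tree model concrete, here is a minimal sketch that parses a small XML string and walks the resulting tree recursively through the Node interface. It assumes the JAXP DocumentBuilderFactory bundled with the JDK rather than the Xerces classes used in this article; the class name DomWalkSketch and the sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkSketch {

    // Assumption: use the JDK's bundled JAXP parser to build the DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Depth-first walk: record each element name, indented by its depth.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<catalog><book><title>Learning XML</title></book><magazine/></catalog>");
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        System.out.print(out);
    }
}
```

Text nodes are visited by the same recursion but are not printed, because only ELEMENT_NODE children pass the type check.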



The present unmarshalling code does not need to modify the DOM tree. The tree traversal itself is essentially recursive: the root node is unmarshalled, then each of its child nodes (which are either of type book or magazine), and, in the case of the magazine, its children (article). Whenever a child node has been unmarshalled, the resulting object representation of that node is inserted into the parent object.

Example 2. Unmarshalling with DOM.


import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomCatalogUnmarshaller {

    public DomCatalogUnmarshaller() { }

    // -----

    public Catalog unmarshallCatalog( Node rootNode ) {
	Catalog c = new Catalog();

	Node n;
	NodeList nodes = rootNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "book" ) ) {
		    c.addBook( unmarshallBook( n ) );
		    
		}else if( n.getNodeName().equals( "magazine" ) ){
		    c.addMagazine( unmarshallMagazine( n ) );
		    
		}else{
		    // unexpected element in Catalog
		}
	    }else{
		// unexpected node-type in Catalog
	    }
	}
	return c;
    }

    // -----

    private Book unmarshallBook( Node bookNode ) {
	Book b = new Book();

	Node n;
	NodeList nodes = bookNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "author" ) ){
		    b.setAuthor( unmarshallText( n ) );

		}else if( n.getNodeName().equals( "title" ) ){
		    b.setTitle( unmarshallText( n ) );

		}else{
		    // unexpected element in Book
		}
	    }else{
		// unexpected node-type in Book
	    }
	}
	return b;
    }

    // -----

    private Magazine unmarshallMagazine( Node magazineNode ) {
	Magazine m = new Magazine();

	Node n;
	NodeList nodes = magazineNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "name" ) ) {
		    m.setName( unmarshallText( n ) );

		}else if( n.getNodeName().equals( "article" ) ) {
		    m.addArticle( unmarshallArticle( n ) );

		}else{
		    // unexpected element in Magazine
		}
	    }else{
		// unexpected node-type in Magazine
	    }
	}
	return m;
    }

    // -----

    private Article unmarshallArticle( Node articleNode ) {
	Article a = new Article();

	// An Element always returns a (possibly empty) attribute map, so the
	// default value also applies when the article has no attributes at all.
	a.setPage( unmarshallAttribute( articleNode, "page", "unknown" ) );
	
	Node n;
	NodeList nodes = articleNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "headline" ) ) {
		    a.setHeadline( unmarshallText( n ) );

		}else{
		    // unexpected element in Article
		}
	    }else{
		// unexpected node-type in Article
	    }
	}
	return a;
    }
    
    // -----

    private String unmarshallText( Node textNode ) {
	StringBuilder buf = new StringBuilder();

	Node n;
	NodeList nodes = textNode.getChildNodes();

	for( int i=0; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.TEXT_NODE ) {
		buf.append( n.getNodeValue() );
	    }else{
		// expected a text-only node!
	    }
	}
	return buf.toString();
    }

    // -----

    private String unmarshallAttribute( Node node, 
    	String name, String defaultValue ){
	Node n = node.getAttributes().getNamedItem( name );
	return (n!=null)?(n.getNodeValue()):(defaultValue);
    }
}

There are subtypes of the Node interface representing elements, text, comments, entities, and many others. The tree model, by which each part of the document is represented as a Node, is followed very consistently. Character data, for instance, is considered a child of its enclosing Element and is represented by its own Text instance, which has to be queried using getNodeValue() to find the actual string.
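As a small illustration of this point, the sketch below parses a single element and reads its character data from the Text child rather than from the element itself. It assumes the JDK's JAXP DocumentBuilderFactory for parsing; TextChildSketch and firstTextChildValue are hypothetical names. The final call uses getTextContent(), a DOM Level 3 convenience method that the code in this article does not rely on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextChildSketch {

    // Assumption: parse with the JDK's bundled JAXP parser.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Returns the value of the first child if it is a Text node, else null.
    static String firstTextChildValue(Element element) {
        Node child = element.getFirstChild();
        if (child != null && child.getNodeType() == Node.TEXT_NODE) {
            return child.getNodeValue();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Element title = parseRoot("<title>Learning XML</title>");
        // The element itself carries no value; the string lives in its Text child.
        System.out.println("element value: " + title.getNodeValue());
        System.out.println("text child:    " + firstTextChildValue(title));
        // DOM Level 3 shortcut that concatenates the text of all descendants.
        System.out.println("text content:  " + title.getTextContent());
    }
}
```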

The Node supertype offers getNodeName(), getNodeValue(), and getAttributes() to provide access to information about a Node instance without having to downcast it.

Not all three of these methods make sense for every node type, however. For instance, only an Element can have attributes; for all other Node subtypes, getAttributes() returns null. For Element nodes, getNodeName() returns the tag name, but getNodeValue() returns null. In contrast, for a Text node, getNodeValue() returns the character data, while getNodeName() returns the fixed string "#text". The W3C DOM specification contains a table detailing the behavior of all three methods for every possible node type.
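The behavior of the three accessors can be checked with a few lines of code. The following sketch prints name, value, and attribute information for an Element, a Text, and an Attr node. It assumes the JDK's JAXP parser; NodeAccessorSketch and describe() are names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeAccessorSketch {

    // Assumption: parse with the JDK's bundled JAXP parser.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Summarize the three generic accessors for any kind of node.
    static String describe(Node n) {
        String attrs = (n.getAttributes() == null)
                ? "null"
                : n.getAttributes().getLength() + " attribute(s)";
        return n.getNodeName() + " | " + n.getNodeValue() + " | " + attrs;
    }

    public static void main(String[] args) throws Exception {
        Element article = parseRoot("<article page=\"5\">Hi</article>");
        System.out.println(describe(article));                          // Element
        System.out.println(describe(article.getFirstChild()));          // Text
        System.out.println(describe(article.getAttributeNode("page"))); // Attr
    }
}
```

For this input, the three lines read "article | null | 1 attribute(s)", "#text | Hi | null", and "page | 5 | null".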

In the present program, we are only interested in three kinds of nodes: those representing elements, text, and attributes. All of the unmarshalling functions are very similar to each other. They accept the topmost node of the subtree they are to unmarshall as an argument. Then they create an object representing the current node and iterate over its child nodes, unmarshalling each in turn. If a child node describes a complex element, the node is passed on to the appropriate unmarshalling function, depending on the element name. A child node of type TEXT_NODE describes a simple element, and the node value is simply the character data.
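One practical consequence of iterating over all child nodes is worth a quick experiment: in an indented document, the whitespace between elements is itself reported as Text nodes, so the "unexpected node-type" branches in Example 2 are reached even for well-formed input. The sketch below counts the children of a small indented catalog by node type. It assumes a non-validating JAXP parser from the JDK and no DTD; WhitespaceNodeSketch and countChildren() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceNodeSketch {

    // Assumption: parse with the JDK's bundled, non-validating JAXP parser.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Count the direct children of a node that have the given node type.
    static int countChildren(Node parent, short nodeType) {
        int count = 0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == nodeType) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Element catalog = parseRoot("<catalog>\n  <book/>\n  <magazine/>\n</catalog>");
        System.out.println("elements: " + countChildren(catalog, Node.ELEMENT_NODE));
        System.out.println("text:     " + countChildren(catalog, Node.TEXT_NODE));
    }
}
```

Here the two elements are surrounded by three whitespace-only Text nodes, which an unmarshaller would normally skip rather than report.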

Nodes describing attributes are a bit different, since attributes are not really part of the document's tree structure: attributes are not proper children of the elements in which they are contained. They therefore cannot be reached by tree-walking operations; instead, the Node interface provides a getAttributes() method, which returns a NamedNodeMap, a collection of name/value pairs containing the attributes. Again, we provide a convenience function that returns a default value in case no attribute can be found for the given name.
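The following sketch shows both halves of this statement: the attribute does not appear among the child nodes, but it can be looked up through the attribute map. It assumes the JDK's JAXP parser; AttributeSketch and attributeOrDefault() are illustrative names, and the null check covers node types for which getAttributes() returns null.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeSketch {

    // Assumption: parse with the JDK's bundled JAXP parser.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Look up an attribute by name, falling back to a default value.
    static String attributeOrDefault(Node node, String name, String defaultValue) {
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs == null) {
            return defaultValue;
        }
        Node attr = attrs.getNamedItem(name);
        return (attr != null) ? attr.getNodeValue() : defaultValue;
    }

    public static void main(String[] args) throws Exception {
        Element article = parseRoot(
            "<article page=\"5\"><headline>H</headline></article>");
        // Only <headline> is a child; the page attribute is not in this list.
        System.out.println("children: " + article.getChildNodes().getLength());
        System.out.println("page:     " + attributeOrDefault(article, "page", "unknown"));
        System.out.println("author:   " + attributeOrDefault(article, "author", "unknown"));
    }
}
```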

The Driver

Finally, we need a driver class, containing static void main(). The main() function reads the API to use (SAX or DOM) and the name of the XML file from the command line. It creates an org.xml.sax.InputSource from the filename. This class is acceptable to both SAX and DOM as an encapsulation of an XML document. Then it creates instances of the appropriate parser and unmarshaller classes and passes the input file to them. Finally, it prints the contents of the created objects to standard output.

Example 3. Driver class.


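import java.io.File;
import java.io.FileInputStream;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Driver {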
    
    public static void main( String[] args ) {
	Catalog catalog = null;

	try {
	    File file = new File( args[1] );
	    InputSource src = new InputSource( new FileInputStream( file ) );
	
	    if( args[0].equals( "SAX" ) ) {
		System.out.println( "--- SAX ---" );

		SaxCatalogUnmarshaller saxUms = new SaxCatalogUnmarshaller();

		XMLReader rdr = XMLReaderFactory.
		    createXMLReader( "org.apache.xerces.parsers.SAXParser" );
		rdr.setContentHandler( saxUms );
		rdr.parse( src );

		catalog = saxUms.getCatalog();

	    }else if( args[0].equals( "DOM" ) ) {
		System.out.println( "--- DOM ---" );

		DomCatalogUnmarshaller domUms = new DomCatalogUnmarshaller();

		org.apache.xerces.parsers.DOMParser prsr = 
		    new org.apache.xerces.parsers.DOMParser();
		prsr.parse( src );
		Document doc = prsr.getDocument();
		
		catalog = domUms.unmarshallCatalog( doc.getDocumentElement() );

	    }else{
                System.err.println( "Usage: SAX|DOM filename" );
                System.exit(1);   // non-zero exit status signals incorrect usage
            }

	    System.out.println( catalog.toString() );

	}catch( Exception exc ) {
	    System.out.println( "Usage: SAX|DOM filename" );
	    System.err.println( "Exception: " + exc );
	}
    }
}

SAX and DOM are interface specifications. Implementations of these interfaces are available from various sources (both commercial and free), and it is part of the driver's responsibility to load the specific parser class. The code above uses the Apache Xerces implementations of the SAX and DOM specifications; these are freely available, open source, high-quality implementations. Be sure that the corresponding classes are included in your CLASSPATH.

The SAX specification contains a factory class that can be used to select which SAX parser implementation will be used. After obtaining an XMLReader instance from the factory, we need to register our SAX unmarshaller with it as the application-specific content handler. Finally, we can retrieve the unmarshalled objects from the unmarshaller instance.
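The same three steps (obtain a reader, register a handler, collect the result from the handler) can be tried without the unmarshaller classes. The sketch below is an assumption-laden stand-in: it obtains the XMLReader through the JDK's JAXP SAXParserFactory instead of naming the Xerces class, and it uses a hypothetical ElementNameCollector handler that merely records element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxDriverSketch {

    // Hypothetical handler: records the name of every element it sees.
    static class ElementNameCollector extends DefaultHandler {
        private final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            names.add(qName);
        }

        List<String> getNames() {
            return names;
        }
    }

    static List<String> collectElementNames(String xml) throws Exception {
        // Assumption: let JAXP pick the parser instead of naming a class.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ElementNameCollector handler = new ElementNameCollector();
        reader.setContentHandler(handler);                     // register the handler
        reader.parse(new InputSource(new StringReader(xml)));  // run the parse
        return handler.getNames();                             // retrieve the result
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectElementNames(
            "<catalog><book/><magazine><article/></magazine></catalog>"));
    }
}
```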

As opposed to SAX, the DOM specification covers only the tree representation of the XML document. Instantiating and using the parser is not covered by DOM itself, so the driver above names the specific implementation directly in the application code. After the input document has been parsed, the resulting DOM tree can be retrieved from the parser using the getDocument() method, which returns a Document instance. The Document interface extends the Node interface and represents the root node of the document. It is then used with the appropriate unmarshaller class, similar to the SAX case.
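For comparison, here is a sketch of obtaining a Document without referring to a parser class by name. It relies on the JAXP DocumentBuilderFactory from javax.xml.parsers, which is an assumption on my part and not what the driver above does; JaxpDomSketch is a made-up class name. The output also shows that the Document is itself a Node, distinct from the root element it contains.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class JaxpDomSketch {

    // Assumption: JAXP selects whichever DOM parser is available at run time.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<catalog><book/></catalog>");
        // The Document is a Node of its own type, with a fixed node name.
        System.out.println(doc.getNodeType() == Node.DOCUMENT_NODE);
        System.out.println(doc.getNodeName());
        // The root element is what the unmarshaller receives.
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```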

Conclusion

It bears repeating that the code above is for instructional purposes only. It ignores many XML structures (such as namespaces, entities, and, of course, constraints), as well as more advanced features of the parser classes (such as additional SAX callback handlers, or more powerful ways to walk and modify a DOM tree). But the most immediate omission concerns the handling of unexpected elements and similar errors. The locations in the code where these conditions should be handled are clearly marked. It can be enlightening to insert some logging code and then observe the behavior of the program after some "errors" (such as unexpected elements) have been introduced into the XML document. Finally, the document structure has been hard-coded into the program. A real-world application would need greater flexibility, or at least better diagnostics.
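As one possible starting point for such an experiment, the sketch below collects a diagnostic message for every child element whose name is not in a set of expected names. It is only an illustration: UnexpectedElementSketch, findUnexpected(), and the message format are invented here, parsing again assumes the JDK's JAXP parser, and a real unmarshaller would more likely log from inside its existing else branches.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnexpectedElementSketch {

    // Assumption: parse with the JDK's bundled JAXP parser.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // One message per child element whose name is not in the expected set.
    static List<String> findUnexpected(Node parent, Set<String> expected) {
        List<String> messages = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && !expected.contains(n.getNodeName())) {
                messages.add("unexpected element <" + n.getNodeName()
                        + "> in <" + parent.getNodeName() + ">");
            }
        }
        return messages;
    }

    public static void main(String[] args) throws Exception {
        // The <isbn> element is the deliberately introduced "error".
        Element book = parseRoot(
            "<book><author>A</author><isbn>123</isbn><title>T</title></book>");
        Set<String> expected = new HashSet<>(Arrays.asList("author", "title"));
        for (String message : findUnexpected(book, expected)) {
            System.err.println(message);
        }
    }
}
```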

I hope to have demonstrated how to use either API to parse a simple XML document and turn its data into a set of Java objects. The example application is simple, but it should be enough to get you started. The references contain additional resources.

References

Books

  • Brett McLaughlin: Java & XML, 2nd edition, O'Reilly (2001)
  • Erik T. Ray: Learning XML, 1st edition, O'Reilly (2001)
  • David Brownell: SAX2, 1st edition, O'Reilly (2002)

Philipp K. Janert is a software project consultant, server programmer, and architect.

