Parsing and Processing Large XML Documents with Digester Rules

by Eugene Kuleshov
09/01/2004

XML is commonly used for integration with third-party applications or web services, especially those running on non-Java platforms. On the other hand, if the code runs in a managed environment (e.g., a J2EE container) under a large number of concurrent client requests, it is very important to reduce runtime resource usage and to minimize the performance impact of the components that do the XML processing. Of course, this must be carefully profiled, but in order to minimize memory requirements, in most cases it is not a good idea to handle XML using in-memory representations such as DOM or JDOM.

Applications based on SAX or the newer StAX API can process documents iteratively during parsing. The SAX API is very mature, is part of the standard JAXP API, and is supported by many tools and frameworks. It also allows you to chain handlers together in order to implement sophisticated transformations and processing rules.

SAX is based on an event-driven model, where the parser or the previous filter in a chain calls a provided ContentHandler instance for each parsing event (such as the start or end of an element). The ContentHandler implementation therefore has to keep track of the current processing state itself, which makes it quite complex and difficult to maintain. However, the Jakarta Digester component provides an extensible ContentHandler implementation that helps to separate the processing logic from the parser.
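
To illustrate the problem Digester addresses, here is a plain SAX handler sketch (not from the article; the element names are purely illustrative) that has to track its position in the document by hand just to count row elements inside table elements:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch of a stateful SAX handler; element names are illustrative.
public class RowCountingHandler extends DefaultHandler {
  private boolean inTable;  // state the handler must maintain itself
  private int rowCount;

  public void startElement(String uri, String localName,
      String qName, Attributes attributes) {
    if ("table".equals(qName)) {
      inTable = true;
    } else if (inTable && "row".equals(qName)) {
      rowCount++;
    }
  }

  public void endElement(String uri, String localName, String qName) {
    if ("table".equals(qName)) {
      inTable = false;
    }
  }

  public int getRowCount() {
    return rowCount;
  }
}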

Using Digester

Let's take a simple example. Imagine a raw database reporting or export/import tool that must be able to load an arbitrarily large number of rows into a database from a large XML document.

The core class of the Digester framework is Digester, which implements SAX's ContentHandler and provides an internal stack that can be used to store intermediate data during processing. Here is a simple DBLoader class that illustrates typical usage of Digester for loading an XML document from a given Reader.

import java.io.Reader;
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.digester.Digester;
import org.apache.commons.digester.RuleSet;
import org.xml.sax.SAXException;

public class DBLoader {
  private Digester digester;

  public DBLoader(RuleSet ruleSet) {
    digester = new Digester();
    digester.addRuleSet(ruleSet);
  }

  public void load(Connection connection,
      Reader reader) throws DBLoaderException {
    // processing context shared with the rules through Digester's stack
    Map ctx = new HashMap();
    ctx.put("CONNECTION", connection);
    digester.push(ctx);
    try {
      digester.parse(reader);

    } catch (SAXException ex) {
      // unwrap the original exception thrown by a rule, if any
      Exception ex2 = ex.getException() == null
          ? ex : ex.getException();
      throw new DBLoaderException(ex2);

    } catch (Exception ex) {
      throw new DBLoaderException(ex);

    } finally {
      // reset the stack, even if processing was aborted
      digester.clear();
    }
  }
}

The Digester instance is initialized with a RuleSet, which defines a set of rules and their mapping to XML elements, as required by the processing logic. For a complex RuleSet, this initialization could be an expensive operation, and because rules do not keep state information, the Digester instance is configured once and then reused for multiple calls. Note that a single Digester instance can't be used from multiple threads.
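
For example, a caller would typically build the loader once up front and reuse it for each document. The following usage sketch assumes the DBUnitRuleSet defined later in the article; the DataSource and the file list are illustrative and not part of the article's code.

// Usage sketch: the Digester inside DBLoader is configured once and
// reused for every document (needs java.io, java.sql and javax.sql imports).
public void loadAll(DataSource dataSource, File[] files)
    throws Exception {
  DBLoader loader = new DBLoader(new DBUnitRuleSet());
  Connection connection = dataSource.getConnection();
  try {
    for (int i = 0; i < files.length; i++) {
      loader.load(connection, new FileReader(files[i]));
    }
  } finally {
    connection.close();
  }
}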

A map with processing context attributes is pushed onto Digester's stack before parsing. It can be pre-populated with properties required for processing, taken from runtime information, configuration files, etc. In the example above, the JDBC connection is stored in this context. The context map is then used by the processing rules to capture information from the input XML file. It could also be used to store processing errors or to collect processing results.
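
Inside a rule, that map can be retrieved from the bottom of Digester's stack. The Rule implementations themselves are covered later; the sketch below (a hypothetical ContextAwareRule, not part of the article) only shows how a rule could look up the connection placed there by load().

import java.sql.Connection;
import java.util.Map;

import org.apache.commons.digester.Digester;
import org.apache.commons.digester.Rule;
import org.xml.sax.Attributes;

// Hypothetical rule, shown only to illustrate access to the context map.
public class ContextAwareRule extends Rule {
  public void begin(String namespace, String name,
      Attributes attributes) throws Exception {
    // the context map was pushed first, so it sits at the bottom of the stack
    Digester d = getDigester();
    Map ctx = (Map) d.peek(d.getCount() - 1);
    Connection connection = (Connection) ctx.get("CONNECTION");
    // ... use the connection, or put captured data back into ctx
  }
}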

Digester automatically wraps any exception thrown from the rules in a SAXException (even a RuntimeException), so it is necessary to catch SAXException and unwrap the exception originally thrown by the processing code.

Note that the clear() method is called in the finally block in order to clean up the Digester instance and release its resources, even if XML processing was terminated by an error.

Implementing RuleSet

The actual processing logic is defined in Digester's rules. A collection of rules for a particular XML format can be grouped into a RuleSet. The RuleSet must implement the addRuleInstances() method, which adds all of the rules to the Digester instance.

Usually, Digester is configured with the predefined common rules, or loads rule definitions from an XML-based configuration file. However, for finer control over XML processing, it is often preferable to implement custom rules.
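
For comparison, here is roughly what the stock-rule approach would look like for the normalized layout introduced in the next section. This is only a sketch: the Table and Row beans, their addColumn(), addRow(), and addValue() methods, and the reader variable are hypothetical, and the approach materializes the entire object graph in memory, which is exactly what custom rules let us avoid.

// Stock-rule sketch for the normalized layout (hypothetical Table/Row beans;
// Table is assumed to have a name property for addSetProperties).
Digester d = new Digester();
d.addObjectCreate("dataset", ArrayList.class);            // root: list of tables
d.addObjectCreate("dataset/table", Table.class);
d.addSetProperties("dataset/table");                      // maps the name attribute
d.addSetNext("dataset/table", "add");
d.addCallMethod("dataset/table/column", "addColumn", 0);  // element body text
d.addObjectCreate("dataset/table/row", Row.class);
d.addSetNext("dataset/table/row", "addRow");
d.addCallMethod("dataset/table/row/value", "addValue", 0);
List tables = (List) d.parse(reader);                     // whole graph in memory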

As an illustration, we can use the dataset layouts supported by the DBUnit testing framework. One of these layouts is a traditional, normalized XML structure, defined by the following DTD:

<!ELEMENT dataset (table+)>
<!ELEMENT table (column+, row+)>
<!ATTLIST table name CDATA #REQUIRED>
<!ELEMENT column (#PCDATA)>
<!ELEMENT row (value+)>
<!ELEMENT value (#PCDATA)>

For example:

<dataset>
  <table name="TABLE1">
    <column>col1</column>
    <column>col2</column>
    <row>
      <value>1</value>
      <value>11</value>
    </row>
    <row>
      <value>2</value>
      <value>22</value>
    </row>
  </table>
</dataset>

The following DBUnitRuleSet class illustrates a RuleSet that can be used to process the XML document above. As you can see, each element has a custom Rule assigned to it that processes data from the corresponding part of the XML document.

public final class DBUnitRuleSet 
      extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/table", 
       new TableRule());
    d.addRule("dataset/table/column", 
       new TableColumnRule());
    d.addRule("dataset/table/row", 
       new TableRowRule());
    d.addRule("dataset/table/row/value", 
       new TableRowValueRule());
  }
  ...

DBUnit also has a more efficient flat XML layout. Unfortunately, it can't be described by a fixed DTD, because the element and attribute names depend on the actual table and column names. Here is how the same data appears in this layout.

<dataset>
  <TABLE1 col1="1" col2="11"/>
  <TABLE1 col1="2" col2="22"/>
</dataset>

Because each table row is represented as a single element in the XML file, one custom Rule is sufficient to handle this format. However, to match all of the table elements, it is necessary to use RegexRules with a matcher that supports wildcards. The DBLoader constructor would then look like the following.

  ...
  public DBLoader( RuleSet ruleSet) {
    digester = new Digester();
    RegexMatcher m = new SimpleRegexMatcher();
    digester.setRules(new RegexRules(m));
    digester.addRuleSet( ruleSet);
  }
  ...

DBUnitFlatRuleSet can then use wildcard patterns when assigning rules.

public class DBUnitFlatRuleSet 
      extends RuleSetBase {
  public void addRuleInstances( Digester d) {
    d.addRule("dataset/*", new FlatTableRule());
  }
  ...
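
The rule matched by the wildcard pattern receives the element name and attributes of every row element under dataset. The article's FlatTableRule implementation is not reproduced here; the hypothetical rule below (named FlatRowInsertRule to avoid confusion) only sketches the kind of processing such a rule performs, with the actual INSERT handling left as a comment.

import org.apache.commons.digester.Rule;
import org.xml.sax.Attributes;

// Sketch only; not the article's FlatTableRule implementation.
// A wildcard-matched rule sees the element name (the table name in the
// flat layout) and its attributes (the column names and values).
public class FlatRowInsertRule extends Rule {
  public void begin(String namespace, String name,
      Attributes attributes) throws Exception {
    String table = name;                        // e.g. "TABLE1"
    for (int i = 0; i < attributes.getLength(); i++) {
      String column = attributes.getQName(i);   // e.g. "col1"
      String value = attributes.getValue(i);    // e.g. "1"
      // build and execute (or batch) an INSERT for this row,
      // using the connection from the context map as shown earlier
    }
  }
}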
