XSLT Processing with Java
The approach
It turns out that writing a SAX parser is quite easy (our examples use SAX 2). All a SAX parser does is read an XML file top to bottom and fire event notifications as various elements are encountered. In our custom parser, we will read the CSV file top to bottom, firing SAX events as we read the file. A program listening to those SAX events will not realize that the data file is CSV rather than XML; it sees only the events. Figure 5-4 illustrates the conceptual model.
In this model, the XSLT processor interprets the SAX events as XML data and uses a normal stylesheet to perform the transformation. The interesting aspect of this model is that we can easily write custom SAX parsers for other file formats, making XSLT a useful transformation language for just about any legacy application data.
In SAX, org.xml.sax.XMLReader is a
standard interface that parsers must implement. It works in conjunction with
org.xml.sax.ContentHandler, which is the interface
that listens to SAX events. For this model to work, your XSLT processor must
implement the ContentHandler interface so it can
listen to the SAX events that the XMLReader
generates. In the case of JAXP, javax.xml.transform.sax.TransformerHandler is used for
this purpose.
Obtaining an instance of TransformerHandler requires a few extra programming
steps. First, create a TransformerFactory as usual:
TransformerFactory transFact = TransformerFactory.newInstance( );
As before, the TransformerFactory is
the JAXP abstraction to some underlying XSLT processor. This underlying
processor may not support SAX features, so you have to query it to determine
if you can proceed:
if (transFact.getFeature(SAXTransformerFactory.FEATURE)) {
If this returns false, you are out of
luck. Otherwise, you can safely downcast to a SAXTransformerFactory and construct the TransformerHandler instance:
SAXTransformerFactory saxTransFact =
(SAXTransformerFactory) transFact;
// create a ContentHandler, don't specify a
// stylesheet. Without a stylesheet, raw
// XML is sent to the output.
TransformerHandler transHand = saxTransFact.newTransformerHandler( );
In the code shown here, a stylesheet was not specified. JAXP
defaults to the identity transformation stylesheet, which means that the SAX
events will be "transformed" into raw XML output. To specify a stylesheet that
performs an actual transformation, pass a Source to
the method as follows:
Source xsltSource = new StreamSource(myXsltSystemId);
TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);
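The steps above can be pulled together into one small program. The following is only a sketch: the class name IdentityHandlerSketch, the run( ) method, and the <greeting> element are invented for illustration. It creates a TransformerHandler without a stylesheet and fires a few SAX events at it by hand, so the identity transformation simply writes the events out as XML.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name; not part of the original example code.
class IdentityHandlerSketch {

    /** Fires a few hand-written SAX events at an identity TransformerHandler. */
    static String run() throws Exception {
        TransformerFactory transFact = TransformerFactory.newInstance();
        if (!transFact.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException(
                    "SAX-based transformations are not supported");
        }
        SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;

        // No stylesheet: the handler performs the identity transformation.
        TransformerHandler transHand = saxTransFact.newTransformerHandler();

        // The Result must be set before the first event arrives.
        StringWriter out = new StringWriter();
        transHand.setResult(new StreamResult(out));

        // Pretend to be a parser: emit <greeting>hello</greeting>.
        AttributesImpl noAttrs = new AttributesImpl();
        char[] text = "hello".toCharArray();
        transHand.startDocument();
        transHand.startElement("", "greeting", "greeting", noAttrs);
        transHand.characters(text, 0, text.length);
        transHand.endElement("", "greeting", "greeting");
        transHand.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Because the handler only ever sees events, it makes no difference whether they come from an XML parser, a CSV reader, or, as here, a handful of direct method calls.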
Detailed CSV to SAX design
Before delving into the complete example program, let's step back and look at a more detailed design diagram. The conceptual model is straightforward, but quite a few classes and interfaces come into play. Figure 5-5 shows the pieces necessary for SAX-based transformations.
This design certainly appears more complex than the previous
approaches, but it is similar in many ways. In previous approaches, we used the
TransformerFactory to create instances of Transformer; in the SAX approach, we start with
SAXTransformerFactory, a subclass of TransformerFactory. Before any work can
be done, you must verify that your particular implementation supports
SAX-based transformations. The reference implementation of JAXP does support
this, although other implementations are not required to do so. In the
following code fragment, the getFeature method of
TransformerFactory will return true if you can safely downcast to a SAXTransformerFactory instance:
TransformerFactory transFact = TransformerFactory.newInstance( );
if (transFact.getFeature(SAXTransformerFactory.FEATURE)) {
// downcast is allowed
SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
If getFeature returns false, your only option is to look for an implementation
that does support SAX-based transformations. Otherwise, you can proceed to
create an instance of TransformerHandler:
TransformerHandler transHand = saxTransFact.newTransformerHandler(myXsltSource);
This object now represents your XSLT stylesheet. As Figure 5-5 shows, TransformerHandler extends org.xml.sax.ContentHandler, so it knows how to listen to
events from a SAX parser. The series of SAX events will provide the "fake XML"
data, so the only remaining piece of the puzzle is to set the Result and tell the SAX parser to begin parsing. The
TransformerHandler also provides a reference to a
Transformer, which allows you to set output
properties such as the character encoding, whether to indent the output, or any
other attribute of <xsl:output>.
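The remaining wiring might look like the following sketch. It assumes a standard JAXP XML parser as a stand-in for the custom XMLReader, and the class name SaxPipelineSketch, the transform( ) method, and the embedded stylesheet are all hypothetical. The output properties are set through getTransformer( ) before the Result is attached, and the parser is then told to begin parsing.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical class name; not part of the original example code.
class SaxPipelineSketch {

    // Hypothetical stylesheet: copies the text of each <value> into an <item>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><items>"
      + "<xsl:for-each select='//value'>"
      + "<item><xsl:value-of select='.'/></item>"
      + "</xsl:for-each>"
      + "</items></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        TransformerFactory transFact = TransformerFactory.newInstance();
        if (!transFact.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException(
                    "SAX-based transformations are not supported");
        }
        SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
        TransformerHandler transHand = saxTransFact.newTransformerHandler(
                new StreamSource(new StringReader(XSLT)));

        // Output properties correspond to attributes of <xsl:output>.
        Transformer trans = transHand.getTransformer();
        trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Attach the Result, then let the parser drive the handler.
        StringWriter out = new StringWriter();
        transHand.setResult(new StreamResult(out));

        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        reader.setContentHandler(transHand);
        reader.parse(new InputSource(new StringReader(xml)));

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(
                "<line><value>Burke</value><value>Eric</value></line>"));
    }
}
```

Swapping in a custom reader should only require changing the line that obtains the XMLReader; the handler, Result, and output properties stay the same.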
Writing the custom parser
Writing the actual SAX parser sounds harder than it really is.
The process basically involves implementing the org.xml.sax.XMLReader interface, which provides numerous
methods you can safely ignore for most applications. For example, when parsing
a CSV file, it is probably not necessary to deal with namespaces or
validation. The code for AbstractXMLReader.java is
shown in Example 5-5. This is an abstract class that provides basic implementations of
every method in the XMLReader interface except for
the parse( ) method. This means that all you need
to do to write a parser is create a subclass and override this single method.
Example 5-5: AbstractXMLReader.java
package com.oreilly.javaxslt.util;

import java.io.IOException;
import java.util.*;
import org.xml.sax.*;

/**
 * An abstract class that implements the SAX2
 * XMLReader interface. The intent of this class
 * is to make it easy for subclasses to act as
 * SAX2 XMLReader implementations. This makes it
 * possible, for example, for them to emit SAX2
 * events that can be fed into an XSLT processor
 * for transformation.
 */
public abstract class AbstractXMLReader implements org.xml.sax.XMLReader {
    private final Map<String, Boolean> featureMap = new HashMap<>( );
    private final Map<String, Object> propertyMap = new HashMap<>( );
    private EntityResolver entityResolver;
    private DTDHandler dtdHandler;
    private ContentHandler contentHandler;
    private ErrorHandler errorHandler;

    /**
     * The only abstract method in this class. Derived classes can parse
     * any source of data and emit SAX2 events to the ContentHandler.
     */
    public abstract void parse(InputSource input) throws IOException,
            SAXException;

    // Features that were never set are reported as false.
    public boolean getFeature(String name)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        Boolean featureValue = this.featureMap.get(name);
        return (featureValue == null) ? false
                : featureValue.booleanValue( );
    }

    public void setFeature(String name, boolean value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        this.featureMap.put(name, Boolean.valueOf(value));
    }

    // Properties that were never set are reported as null.
    public Object getProperty(String name)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        return this.propertyMap.get(name);
    }

    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        this.propertyMap.put(name, value);
    }

    public void setEntityResolver(EntityResolver entityResolver) {
        this.entityResolver = entityResolver;
    }

    public EntityResolver getEntityResolver( ) {
        return this.entityResolver;
    }

    public void setDTDHandler(DTDHandler dtdHandler) {
        this.dtdHandler = dtdHandler;
    }

    public DTDHandler getDTDHandler( ) {
        return this.dtdHandler;
    }

    public void setContentHandler(ContentHandler contentHandler) {
        this.contentHandler = contentHandler;
    }

    public ContentHandler getContentHandler( ) {
        return this.contentHandler;
    }

    public void setErrorHandler(ErrorHandler errorHandler) {
        this.errorHandler = errorHandler;
    }

    public ErrorHandler getErrorHandler( ) {
        return this.errorHandler;
    }

    // Convenience overload that delegates to the abstract parse method.
    public void parse(String systemId) throws IOException, SAXException {
        parse(new InputSource(systemId));
    }
}
Creating the subclass, CSVXMLReader,
involves overriding the parse( ) method and
actually scanning through the CSV file, emitting SAX events as elements in the
file are encountered. While the SAX portion is very easy, parsing the CSV file
is a little more challenging. To make this class as flexible as possible, it
was designed to parse through any CSV file that a spreadsheet such as
Microsoft Excel can export. For simple data, your CSV file might look like
this:
Burke,Eric,M
Burke,Jennifer,L
Burke,Aidan,G
The XML representation of this file is shown in Example
5-6. The only real drawback here is that CSV files are strictly
positional, meaning that names are not assigned to each column of data. This
means that the XML output merely contains a sequence of three <value> elements for each line, so your stylesheet
will have to select items based on position.
Example 5-6: Example XML output from CSV parser
<?xml version="1.0" encoding="UTF-8"?>
<csvFile>
  <line>
    <value>Burke</value>
    <value>Eric</value>
    <value>M</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Jennifer</value>
    <value>L</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Aidan</value>
    <value>G</value>
  </line>
</csvFile>
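For simple data like this, the event-firing logic at the heart of a parse( ) override could be sketched as follows. This is not the CSVXMLReader implementation: the class name SimpleCsvEventsSketch and its emit( ) and toXml( ) methods are hypothetical, fields are split naively on commas, and an identity TransformerHandler is assumed to be available to turn the events back into XML text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name; not part of the original example code.
class SimpleCsvEventsSketch {
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    /** Reads simple CSV (no quoted fields) and fires SAX events for it. */
    static void emit(BufferedReader csv, ContentHandler ch)
            throws IOException, SAXException {
        ch.startDocument();
        ch.startElement("", "csvFile", "csvFile", NO_ATTRS);
        String line;
        while ((line = csv.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            ch.startElement("", "line", "line", NO_ATTRS);
            // Naive split: breaks on every comma, keeps empty fields.
            for (String field : line.split(",", -1)) {
                ch.startElement("", "value", "value", NO_ATTRS);
                ch.characters(field.toCharArray(), 0, field.length());
                ch.endElement("", "value", "value");
            }
            ch.endElement("", "line", "line");
        }
        ch.endElement("", "csvFile", "csvFile");
        ch.endDocument();
    }

    /** Serializes the events with an identity TransformerHandler. */
    static String toXml(String csv) throws Exception {
        // Assumes the factory supports SAXTransformerFactory.FEATURE;
        // otherwise this cast fails with a ClassCastException.
        SAXTransformerFactory saxTransFact =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler identity = saxTransFact.newTransformerHandler();
        identity.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.setResult(new StreamResult(out));
        emit(new BufferedReader(new StringReader(csv)), identity);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Burke,Eric,M\nBurke,Jennifer,L\nBurke,Aidan,G\n"));
    }
}
```

The naive split is the weak point of this sketch: it treats every comma as a field separator, which is exactly what stops working once fields are quoted.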
One enhancement would be to design the CSV parser so it could accept a list of meaningful column names as parameters, which could then be used in the generated XML. Another option would be to write an XSLT stylesheet that transforms this initial output into another form of XML that uses meaningful column names. To keep the code example relatively manageable, these features were omitted from this implementation.
The CSV file format does have some complexities that must be considered, however. For example, fields that contain commas must be surrounded with quotes:
"Consultant,Author,Teacher",Burke,Eric,M
Teacher,Burke,Jennifer,L
None,Burke,Aidan,G
To further complicate matters, fields may also contain quote characters ("). In this case, each quote is doubled, much as you use a double backslash (\\) in Java to represent a single backslash. In the following example, the first column contains one quote character, so the entire field is quoted and the embedded quote is doubled:
"test""quote",Teacher,Burke,Jennifer,L
This would be interpreted as:
test"quote,Teacher,Burke,Jennifer,L
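One way to handle both rules is a small character-by-character scanner that tracks whether it is currently inside a quoted field. The sketch below is an illustration only, not the parsing code from CSVXMLReader; the class name CsvLineSplitterSketch and the splitLine( ) method are hypothetical, and it assumes a field never spans more than one line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; not part of the original example code.
class CsvLineSplitterSketch {

    /** Splits one CSV line into fields, honoring quotes and doubled quotes. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: one literal quote
                        i++;                 // skip the second quote
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are literal in here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // end of field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field on the line
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(
                splitLine("\"Consultant,Author,Teacher\",Burke,Eric,M"));
        System.out.println(
                splitLine("\"test\"\"quote\",Teacher,Burke,Jennifer,L"));
    }
}
```

Each string returned by splitLine( ) would then become the text of one <value> element.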

