ONJava.com -- The Independent Source for Enterprise Java
oreilly.comSafari Books Online.Conferences.

advertisement

AddThis Social Bookmark Button

Reading and Writing Excel Files with POI
Pages: 1, 2

Reading Document Properties with HPSF

Users of Microsoft Word, Excel, or PowerPoint can attach additional information to their documents using the File->Properties... menu. Properties include a document's title, a summary, a category, and keywords. The application itself adds further information: the last author, the date and time of the last modification, the date and time the document was printed, and so on.



Properties are stored separately from the document itself. As you know, an OLE 2 CDF file is a container consisting of directories and files internally, and POIFS gives you access to these files. The files are also called streams. Document properties are stored in streams of their own within a POIFS filesystem. Consider a Word document: while you see a simple file named myfile.doc on your hard disk, internally, it consists of a file called WordDocument and two files called SummaryInformation and DocumentSummaryInformation. There are usually other files, but we won't care about them here.

Can you guess what these streams contain? Yes, WordDocument contains the text you edited with Word. The document properties are stored in the SummaryInformation and DocumentSummaryInformation streams. It would be too simple to have all of the properties in a single stream, so Microsoft decided to use two of them. To make things even more complicated, these streams' names start with the octal \005 character. This is a non-printable character and is omitted above.

The nice thing with the standard Microsoft-defined properties is that they don't care whether the main document is a Word document, an Excel file, a PowerPoint presentation, or some other document. Once you know how to read the properties of an Excel file, you can retrieve the properties of all other documents, too.

Luckily, you don't have to figure out too much, because your Java program can extract these information using POI's HPSF package. HPSF stands for Horrible Property Set Format and is POI's implementation for reading properties. Writing is not yet supported.

HPSF has a simple API for reading Microsoft's standard properties and a more complicated but general API for reading any sort of property sets. We'll focus on the simple API here because it should suffice for most applications.

A Sample Application

Let's have a look at a sample Java program that reads the "title" property of an OLE 2 CDF document:

import java.io.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;

/**
 * <p>Sample application showing how to read a OLE 2 document's
 * title. Call it with the document's file name as command line
 * parameter.</p>
 *
 * @author Rainer Klute (klute@rainer-klute.de)
 */

public class ReadTitle
{
    public static void main(String[] args) throws IOException
    {
        final String filename = args[0];
        POIFSReader r         = new POIFSReader();
        r.registerListener(new MyPOIFSReaderListener(),
            "\005SummaryInformation");
        r.read(new FileInputStream(filename));
    }

    static class MyPOIFSReaderListener implements POIFSReaderListener
    {
        public void processPOIFSReaderEvent(POIFSReaderEvent event)
        {
            SummaryInformation si = null;
            try
            {
                si = (SummaryInformation)
                PropertySetFactory.create(event.getStream());
            }
            catch (Exception ex)
            {
                throw new RuntimeException
                    ("Property set stream \"" + event.getPath() +
                        event.getName() + "\": " + ex);
            }

            final String title = si.getTitle();

            if (title != null)
                System.out.println("Title: \"" + title + "\"");
            else
                System.out.println("Document has no title.");
            }
        }
}

The main() method uses POIFS' eventing filesystem to read a stream named \005SummaryInformation from the OLE 2 document named on the command line. When the POIFSReader encounters this stream, it transfers control to the processPOIFSReaderEvent() method of MyPOIFSReaderListener. You've already learned the details of this mechanism from the first article of this series.

What does processPOIFSReaderEvent() do? As a parameter, it essentially gets an input stream containing some of the document's properties, including the title. To access the properties, the application must create a PropertySet instance from the contents of the input stream, as in the following statement:

si = (SummaryInformation) PropertySetFactory.create(event.getStream());

The statement consists of three steps:

  • event.getStream() retrieves the input stream from the POIFSReaderEvent passed in by the POIFSReader.
  • The class PropertySetFactory's static method create() is called with the input stream as parameter. As the name implies, PropertySetFactory is a factory class with a "machine" to transform an input stream into a PropertySet instance. This machine is the create() method.
  • We cast the PropertySet returned from create() to SummaryInformation. The PropertySet class has all of the mechanisms to read property sets in general. SummaryInformation is a subclass of PropertySet with special knowledge about Microsoft's standard properties.

Since several things can go wrong, the statement is encapsulated in a try block. The sample application simply catches all exceptions. A production program will likely differentiate between several possible causes for exceptions. Besides any I/O exceptions, an HPSF-specific exception could occur. For example, the NoPropertySetStreamException is thrown if the input stream does not contain a property set or if the property set is faulty.

A quite unlikely but still possible error is that the \005SummaryInformation stream contains a valid property set, but not the summary information property set. If this should ever happen, the cast to SummaryInformation is not possible and a ClassCastException is thrown.

Once the SummaryInformation instance is established, the rest is simple. The application just calls getTitle() and prints the result.

Besides getTitle(), SummaryInformation contains other convenience methods, including getApplicationName(), getAuthor(), getCharCount(), and getCreateDateTime(). The HPSF Javadoc documentation describes them in more detail.

The Document Summary Information

Unfortunately, not all properties reside in the summary information property set. Many, but not all, OLE 2 files have an additional property set. It is called the document summary information and resides in a stream named \005DocumentSummaryInformation. This property set holds properties like the document's category, the number of multimedia clips in a PowerPoint presentation, and more.

To access this property set, your application should proceed as shown above, except that you should register for the \005DocumentSummaryInformation stream. You probably want to register for both summary information and document summary information. As before, you should also hand over the input stream containing the document summary information to PropertySetFactory.create(). However, the factory method will return a DocumentSummaryInformation object, not a SummaryInformation instance. If you registered for both types of property sets, you should check the type of the returned object. Either use Java's instanceof operator or the query methods isSummaryInformation() and isDocumentSummaryInformation(). Remember that create() always returns a PropertySet, so you can call these methods on it. The PropertySet class provides two methods, because the property set could be custom-defined .

More Property Sets

If you want to deal with custom-defined property sets, or if you want to read user-defined properties from the standard property sets, you will have to use the general property set API. This is more complicated than the API outlined above and is not covered by this article. To learn about it, you should read the HPSF HOW-TO in the POI documentation and consult the Javadocs.

Conclusion

This article covered HSSF and how to output to Excel. We also covered HPSF and how to read property sets and document summary information. Next time, we'll show how to transform from XML to Excel format using the HSSF Serializer and Cocoon, as well as the current HDF and Word format.

References

Avik Sengupta is a committer on the Jakarta POI project. He is currently Director of Technology at Itellix Software Solutions.

Andrew Oliver is a professional cat herder who moonlights as a software developer, and runs a small consultancy with an ugly retro-looking web site that one day will be updated when he and his compadres aren't too busy working with clients.

Rainer Klute is a committer on the Jakarta POI project.

Appendix 1

Example 1. To create and return a workbook from a servlet

package org.apache.poi.hssf.usermodel.examples;

import java.io.*;
import java.net.*;
import javax.servlet.*;
import javax.servlet.http.*;
import org.apache.poi.hssf.usermodel.*;

public class HSSFCreate extends HttpServlet {
    public void init(ServletConfig config) throws ServletException {
        super.init(config);  
    }

    public void destroy() {
    }

    /** Processes requests for both HTTP GET and POST methods.
     * @param request servlet request
     * @param response servlet response
     */

    protected void processRequest(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {

        response.setContentType("application/vnd.ms-excel");
        HSSFWorkbook wb = new HSSFWorkbook();
        HSSFSheet sheet = wb.createSheet("new sheet");

        // Create a row and put some cells in it. Rows are 0 based.
        HSSFRow row     = sheet.createRow((short)0);

        // Create a cell and put a value in it.
        HSSFCell cell   = row.createCell((short)0);

        cell.setCellValue(1);

        // Or do it on one line.
        row.createCell((short)1).setCellValue(1.2);
        row.createCell((short)2).setCellValue("This is a string");
        row.createCell((short)3).setCellValue(true);
        // Write the output 
        OutputStream out = response.getOutputStream();
        wb.write(out);
        out.close();
    }

    /** Handles the HTTP <code>GET</code> method.
     * @param request servlet request
     * @param response servlet response
     */

    protected void doGet(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {
        processRequest(request, response);
    }

    /** Handles the HTTP POST method.
     * @param request servlet request
     * @param response servlet response
     */

    protected void doPost(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {
        processRequest(request, response);
    }

    /** Returns a short description of the servlet.
     */

    public String getServletInfo() {
       return "Example to create a workbook in a servlet using HSSF";
    }
}

Return to ONJava.com.