ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)
 See this if you're having trouble printing code examples


Expiring Data with Hashbelts

by William Grosso, author of Java RMI
02/13/2002

In the first article in this series, I explained why it is sometimes necessary to expire data, and discussed the standard approaches for doing so. The second article discussed a new approach to data expiration, called the Hashbelt algorithm, and documented a Java library which implements it. In this article I will show you how to use the hashbelt algorithm by using two distinct examples: implementing session keys and reimplementing the RemoteStubCache class from my previous articles on command objects in RMI. By the end of this article, you should feel comfortable using hashbelts in your code and understand when it is appropriate to do so.

About the Source Code

The source code for both the hashbelt framework and the examples used in this article can be downloaded from here. Please feel free to use and extend the source code for this article. But, please, give credit where credit is due and do not remove either the licensing information or my name from the source code. Moreover, please send me e-mail if you come up with any interesting ideas for using or extending the framework; I'm planning on maintaining this code and keeping it up to date for the foreseeable future.

Note: In order to use these classes, you must have downloaded and installed the generics specification. Furthermore, the jar files that come with the generics distribution must be the first items on your classpath. The batch file assumes the jars are in the same directory as the batch file -- if this is incorrect, you should edit the part of the batch file reading:

J-Xbootclasspath/p:./javac.jar -bootclasspath ./collect.jar

An Overview of the Process

Download the example files from this article here.

Before we dive into the examples, I want to outline what's involved in using hashbelts in an application. Incorporating hashbelts into an application is actually quite easy, and involves the the following three steps:

  1. Choose a subclass of AbstractHashbeltExpirationSystem (either StandardHashbeltExpirationSystem or UpdatingHashbeltExpirationSystem).
  2. Choose an type of container to use. All of the implementations of HashbeltContainer have the same semantics; choosing the appropriate container is simply an optimization and will depend on how you expect the system to be used.
  3. Write an expiration handler. Since an expiration system can store any type of object, and is a library being provided without any knowledge of your code, it makes sense that you have to provide the actual code that performs the expiration.

We're going to follow this procedure for two different examples: session keys and remote stub caches. The discussion for session keys will be a little abstract (without knowing more about the whole application, it's hard to write a concrete example that includes expiration handlers). The discussion of remote stub caches, however, will be very concrete.

Session Keys

Session keys have been used as an example throughout these articles. Thus far, I've relied on the assumption that you probably have at least a vague intuition about what a session key is. However, now that we're actually getting down to coding, precise definitions are important.

A session key is a string, usually randomly generated, which has no value other than to identify a particular client to a server. The session key is generated by the server at the end of an authentication (or identification) sequence and sent to the client. All subsequent requests from a client are accompanied by the session key. The server often uses the session key to look up information related to the client, such as permission levels or shopping cart contents. Depending on the application, the session key expires after either a fixed period of time or a period of inactivity.

In our case, we're going to implement session keys that expire after a period of inactivity. This is how Web servers handle session information and is the most common way of using session keys (session keys that expire after a fixed period of time are often used by authentication systems and protocols such as Kerberos). Our session keys, and the associated session objects, therefore have the following characteristics:

The first of these bullet points immediately implies that we will be using a subclass of UpdatingHashbeltExpirationSystem as our expiration system. The second bullet point implies that HashlistBasedHashbeltContainer is a reasonable choice for a container class (we will be iterating through the expired objects and might want the iteration to be reasonably fast). This is all encapsulated in the code for SessionExpirationSystem:

  package grosso.sessions;
  import grosso.expiration.hashbelt.containers.*;
  import grosso.expiration.hashbelt.*;
  import grosso.expiration.*;

  import java.util.*;

  /*
    SessionExpirationSystem

    Extends UpdatingHashbeltExpirationSystem so that active sessions get
    their expiration delayed.

    Uses HashlistBasedHashbeltContainer because it seems like a good compromise
    choice.
  */

  public class SessionExpirationSystem extends UpdatingHashbeltExpirationSystem<sessionkey, session> {
    protected  HashbeltContainer<sessionkey,session> getNewContainer() {
      return new HashlistBasedHashbeltContainer<sessionkey, session>();
    }
  }

This actually has more import statements than lines of code! One reason this is so simple is that it uses the default number of containers and the default rotation time defined in AbstractHashbeltExpirationSystem. Recall that these were defined as follows:

  protected static final int DEFAULT_NUMBER_OF_CONTAINERS = 5;   protected static final long ONE_MINUTE = 60 * 1000;   protected static final long DEFAULT_ROTATION_TIME = ONE_MINUTE;

These are almost certainly not the best choice for session keys. In practice, I like to start by assuming there are 11 containers. A first pass at the rotation time is simply to divide the required expiration time by 10; this will ensure that every key, no matter when it is first generated, will have at least the required expiration time (some keys will have much longer).

Once you've estimated the number of containers and the rotation time, you will need to balance the following three forces:

The first two of these forces imply that rotation times should be fairly large relative to the expected time between uses of a session key. That is, if lookups using a particular session key occur every three minutes, on average, then a rotation time of one minute is bad, because keys will frequently make it into the fourth or fifth container (and, therefore, using the keys to get session information will be more expensive than is necessary).

On the other hand, the third force implies that rotation times should be as small as possible (and the number of containers should be increased). Using a short rotation time means that sessions will be expired closer to the required expiration time.

Let's make this more concrete by considering a Web server. Suppose our requirement is that sessions must be kept for four hours (if the user hasn't come back in four hours, we assume that they're done). And suppose that, when a user is active, they fetch a new page every minute or so. If we start by using the "divide by ten" rule of thumb, we arrive at: 11 containers and a rotation time of 24 minutes.

This is actually quite nice. The rotation time is huge compared to the time between fetches. On average, we're going to see the following access pattern for a session object:

and so on, until the user is no longer accessing the session. From a performance point of view, this is astonishingly close to simply using an instance of Hashtable.

Related Reading

Java RMI
By William Grosso

On the other hand, session objects won't be expired at the four-hour mark. A session object might, in fact, live for almost four-and-a-half hours after it is last accessed. If this half-hour is significant, we might need to increase the number of containers (and thereby reduce the rotation time).

Of course, this whole discussion is predicated on a rather strange assumption. Four hours is just too short a time for a Web-based session key. For example, according to Joel Spolsky's book on user interface design, Amazon.com keeps its sessions (e.g. the contents of the shopping carts) around for 90 days.

You probably can't keep all sessions from the past 90 days in memory; it's just too much data. A two-part strategy like the following is a much better idea:

This solution has some fairly nice benefits: when a session is active, it gets pulled into memory. When a session isn't active, it gets written out to disk. The session information is kept for a long time in a persistent data store and, while the session is being used, the information is kept in memory, and the expiration system is fast.

Using a Hashbelt with RemoteStubCache

Our second example of using hashbelts involves rebuilding the RemoteStubCache class from the second article from my earlier series on command objects.

Also by William Grosso

The Hashbelt Data Structure

Introducing Automatic Data Expiration

Java RMI: Serialization

RemoteStubCache solves a fairly common problem in client-server applications. The problem is that a naive use of naming services doubles the number of remote calls being made. This happens because programmers want to use a naming service to store the stubs to servers; doing so leads to a nice clean architecture and allows, for example, servers to be dynamically redeployed without changing client code. The obvious way to use a naming service is to write code using the following schema:

  RemoteServer server = Naming.lookup(logical_name);   server.methodCall(arguments);

The problem is that each of these lines involves a remote call. The first line fetches a stub from the naming service and the second line actually calls the server.

Fortunately, there's an obvious solution. Transform the first line into a local call by using a RemoteStubCache. Instead of fetching the stub each time, simply fetch the stub the first time and store it locally. The second time the translator is used, the stub is already available locally and, hence, only one remote method invocation is necessary in most cases.

Here's the implementation of RemoteStubCache that I used for that series of articles.

  import java.rmi.*;
  import java.rmi.server.*;
  import java.util.*;

  /*
    Holds stubs to remote RMI servers.
  */

  public class RemoteStubCache {
    private static Hashtable<serverdescription, remotestub> _serverDescriptionsToStubs= new Hashtable<serverdescription, remotestub>();

    public static RemoteStub getStubToRemoteObject(ServerDescription serverDescription) throws ServerUnavailable {
      RemoteStub returnValue = _serverDescriptionsToStubs.get(serverDescription);
      if (null == returnValue) {
        returnValue = serverDescription.getStub() ;
        if (null!= returnValue) {
          _serverDescriptionsToStubs.put(serverDescription, returnValue);
        }
        else {
          throw new ServerUnavailable();
        }
      }
      return returnValue;
    }
  
    public static void removeStubFromCache(ServerDescription serverDescription) {
      _serverDescriptionsToStubs.remove(serverDescription);
    }
  }

This is, by and large, a fine stub cache for simple applications. However, it can cause more problems if you use advanced RMI features like Activation or actively take advantage of the distributed garbage collector because stubs keep references to servers. An example scenario will illustrate the problem. Suppose the people building the servers for a system make the following three design decisions (all of which are quite reasonable):

What happens to these architectural decisions when the client decides to cache stubs? They all break. Badly. The servers are launched correctly. But, since the client is keeping stubs around indefinitely, the client maintains connections to the servers, and the distributed garbage collector never thinks the server is unreferenced. The server never stores its data out to a relational database, nor does it ever shut down.

This is bad.

Fortunately, the solution is simple: the client shouldn't simply store the data in a global cache; the client should expire stubs that it hasn't used recently. Here's the code for an implementation of RemoteStubCache that uses a hashbelt to get rid of old stubs:

  package grosso.rmi;
  import grosso.expiration.*;
  import grosso.expiration.hashbelt.*;
  import grosso.expiration.hashbelt.handlers.*;
  import grosso.expiration.hashbelt.containers.*;

  import java.rmi.*;
  import java.rmi.server.*;
  import java.util.*;

  /*
    Holds stubs to remote RMI servers.
  */

  public class RemoteStubCache {
    /*
      In previous versions of this, we used a Hashtable to store our stubs.
      Now we automatically expire stubs using a cache.
      We use an updatingHashbeltExpirationSystem (so stubs are reinserted into the
      front of the conveyor belt every time we use them).
      We use a NullExpirationHandler, because the only expiration logic we need is
      done by the garbage collector.

      And we expire our stubs if they've been inactive for more than 30 minutes
    */

    private static long FIVE_MINUTES = 5 * 60 * 1000;
    private static long ROTATION_TIME = FIVE_MINUTES;
    private static int NUMBER_OF_CONTAINERS = 6;
    private static RemoteStubCacheExpirationSystem _expirationSystem =  new RemoteStubCacheExpirationSystem();;

    public static RemoteStub getStubToRemoteObject(ServerDescription serverDescription) throws ServerUnavailable {
      RemoteStub returnValue = _expirationSystem.get(serverDescription);
      if (null == returnValue) {
        returnValue = serverDescription.getStub() ;
        if (null!= returnValue) {
          _expirationSystem.put(serverDescription, returnValue);
        }
        else {
          throw new ServerUnavailable();
        }
      }
      return returnValue;
    }

    public static void removeStubFromCache(ServerDescription serverDescription) {
      _expirationSystem.remove(serverDescription);
    }

    private static class RemoteStubCacheExpirationSystem extends UpdatingHashbeltExpirationSystem<serverdescription, remotestub> {
      public RemoteStubCacheExpirationSystem() {
        super(NUMBER_OF_CONTAINERS, ROTATION_TIME);
      }
    
      protected HashbeltContainer<serverdescription, remotestub> getNewContainer() {
        return new StandardHashbeltContainer<serverdescription,remotestub>();
      }
    }
  }

Notice that this is entirely a change to the implementation of RemoteStubCache; we simply used an inner class that extended UpdatingHashbeltExpirationSystem instead of Hashtable to implement the cache. None of the code that used RemoteStubCache needed to change at all.

Note: This change to RemoteStubCache is a change that's often worth making in your code. Changing a global hashtable to an expiration system is easy to do and allows resources (whether distributed or local) to be released.

Summary And Conclusion

We've now spent three articles, and about 50 pages in total (if you printed these articles out), discussing data expiration. In the first article, we discussed various scenarios where data expiration is necessary, and then covered some common solutions to the problem. The second article introduced the hashbelt algorithm and code library, and this article has focussed on examples drawn from real-world applications.

In both the first and second articles, we stated a list of requirements for any data expiration system. Let's wrap things up by revisiting those requirements and making sure that hashbelts meet them.

It's official: Hashbelts are a new data structure that expire data in a way that meets all our requirements! At this point you should be a true believer in hashbelts. If you're not, or if you have questions, email me and we can continue the conversation. Otherwise, feel free to use and extend the source code for this article (and send me e-mail if you come up with any interesting ideas for extending the library; I'm planning on maintaining this code and keeping it up to date for the foreseeable future).

William Grosso is a coauthor of Java Enterprise Best Practices.


Return to ONJava.com.

Copyright © 2009 O'Reilly Media, Inc.