ONJava.com -- The Independent Source for Enterprise Java
oreilly.comSafari Books Online.Conferences.

advertisement

AddThis Social Bookmark Button

Expiring Data with Hashbelts

by William Grosso, author of Java RMI
02/13/2002

In the first article in this series, I explained why it is sometimes necessary to expire data, and discussed the standard approaches for doing so. The second article discussed a new approach to data expiration, called the Hashbelt algorithm, and documented a Java library which implements it. In this article I will show you how to use the hashbelt algorithm by using two distinct examples: implementing session keys and reimplementing the RemoteStubCache class from my previous articles on command objects in RMI. By the end of this article, you should feel comfortable using hashbelts in your code and understand when it is appropriate to do so.

About the Source Code

The source code for both the hashbelt framework and the examples used in this article can be downloaded from here. Please feel free to use and extend the source code for this article. But, please, give credit where credit is due and do not remove either the licensing information or my name from the source code. Moreover, please send me e-mail if you come up with any interesting ideas for using or extending the framework; I'm planning on maintaining this code and keeping it up to date for the foreseeable future.

Note: In order to use these classes, you must have downloaded and installed the generics specification. Furthermore, the jar files that come with the generics distribution must be the first items on your classpath. The batch file assumes the jars are in the same directory as the batch file -- if this is incorrect, you should edit the part of the batch file reading:

J-Xbootclasspath/p:./javac.jar -bootclasspath ./collect.jar

An Overview of the Process

Download the example files from this article here.

Before we dive into the examples, I want to outline what's involved in using hashbelts in an application. Incorporating hashbelts into an application is actually quite easy, and involves the the following three steps:

  1. Choose a subclass of AbstractHashbeltExpirationSystem (either StandardHashbeltExpirationSystem or UpdatingHashbeltExpirationSystem).
  2. Choose an type of container to use. All of the implementations of HashbeltContainer have the same semantics; choosing the appropriate container is simply an optimization and will depend on how you expect the system to be used.
  3. Write an expiration handler. Since an expiration system can store any type of object, and is a library being provided without any knowledge of your code, it makes sense that you have to provide the actual code that performs the expiration.

We're going to follow this procedure for two different examples: session keys and remote stub caches. The discussion for session keys will be a little abstract (without knowing more about the whole application, it's hard to write a concrete example that includes expiration handlers). The discussion of remote stub caches, however, will be very concrete.

Session Keys

Session keys have been used as an example throughout these articles. Thus far, I've relied on the assumption that you probably have at least a vague intuition about what a session key is. However, now that we're actually getting down to coding, precise definitions are important.

A session key is a string, usually randomly generated, which has no value other than to identify a particular client to a server. The session key is generated by the server at the end of an authentication (or identification) sequence and sent to the client. All subsequent requests from a client are accompanied by the session key. The server often uses the session key to look up information related to the client, such as permission levels or shopping cart contents. Depending on the application, the session key expires after either a fixed period of time or a period of inactivity.

In our case, we're going to implement session keys that expire after a period of inactivity. This is how Web servers handle session information and is the most common way of using session keys (session keys that expire after a fixed period of time are often used by authentication systems and protocols such as Kerberos). Our session keys, and the associated session objects, therefore have the following characteristics:

  • A session key has a limited time until expiration. Each time the key is used, the time until expiration is extended.
  • The session key has associated resources and will require expiration.
  • A session key tends to be very frequently accessed while a client is active. After the client stops being active for a while, the session key is unlikely to be used again.

The first of these bullet points immediately implies that we will be using a subclass of UpdatingHashbeltExpirationSystem as our expiration system. The second bullet point implies that HashlistBasedHashbeltContainer is a reasonable choice for a container class (we will be iterating through the expired objects and might want the iteration to be reasonably fast). This is all encapsulated in the code for SessionExpirationSystem:

  package grosso.sessions;
  import grosso.expiration.hashbelt.containers.*;
  import grosso.expiration.hashbelt.*;
  import grosso.expiration.*;

  import java.util.*;

  /*
    SessionExpirationSystem

    Extends UpdatingHashbeltExpirationSystem so that active sessions get
    their expiration delayed.

    Uses HashlistBasedHashbeltContainer because it seems like a good compromise
    choice.
  */

  public class SessionExpirationSystem extends UpdatingHashbeltExpirationSystem<sessionkey, session> {
    protected  HashbeltContainer<sessionkey,session> getNewContainer() {
      return new HashlistBasedHashbeltContainer<sessionkey, session>();
    }
  }

This actually has more import statements than lines of code! One reason this is so simple is that it uses the default number of containers and the default rotation time defined in AbstractHashbeltExpirationSystem. Recall that these were defined as follows:

  protected static final int DEFAULT_NUMBER_OF_CONTAINERS = 5;   protected static final long ONE_MINUTE = 60 * 1000;   protected static final long DEFAULT_ROTATION_TIME = ONE_MINUTE;

These are almost certainly not the best choice for session keys. In practice, I like to start by assuming there are 11 containers. A first pass at the rotation time is simply to divide the required expiration time by 10; this will ensure that every key, no matter when it is first generated, will have at least the required expiration time (some keys will have much longer).

Once you've estimated the number of containers and the rotation time, you will need to balance the following three forces:

  • Most of the lookups using the key to an active session ought to involve only one container. Active session keys ought to be in the first container.
  • Active session keys should be in the second container for very brief intervals (after the containers rotate) and should rarely be in the third container.
  • If keys are allocating expensive resources, you need to expire them quickly.

The first two of these forces imply that rotation times should be fairly large relative to the expected time between uses of a session key. That is, if lookups using a particular session key occur every three minutes, on average, then a rotation time of one minute is bad, because keys will frequently make it into the fourth or fifth container (and, therefore, using the keys to get session information will be more expensive than is necessary).

On the other hand, the third force implies that rotation times should be as small as possible (and the number of containers should be increased). Using a short rotation time means that sessions will be expired closer to the required expiration time.

Let's make this more concrete by considering a Web server. Suppose our requirement is that sessions must be kept for four hours (if the user hasn't come back in four hours, we assume that they're done). And suppose that, when a user is active, they fetch a new page every minute or so. If we start by using the "divide by ten" rule of thumb, we arrive at: 11 containers and a rotation time of 24 minutes.

This is actually quite nice. The rotation time is huge compared to the time between fetches. On average, we're going to see the following access pattern for a session object:

  • 12 accesses of the session object while it is in the first container.
  • 1 access of the session object while it is in the second container.
  • 23 or 24 accesses of the session object while it is in the first container.
  • 1 access of the session object while it is in the second container.

and so on, until the user is no longer accessing the session. From a performance point of view, this is astonishingly close to simply using an instance of Hashtable.

Related Reading

Java RMI
By William Grosso

On the other hand, session objects won't be expired at the four-hour mark. A session object might, in fact, live for almost four-and-a-half hours after it is last accessed. If this half-hour is significant, we might need to increase the number of containers (and thereby reduce the rotation time).

Of course, this whole discussion is predicated on a rather strange assumption. Four hours is just too short a time for a Web-based session key. For example, according to Joel Spolsky's book on user interface design, Amazon.com keeps its sessions (e.g. the contents of the shopping carts) around for 90 days.

You probably can't keep all sessions from the past 90 days in memory; it's just too much data. A two-part strategy like the following is a much better idea:

  • Use a four-hour expiration time for the in-memory session objects.
  • When an in-memory session object is expired, write it out to a database along with the date.
  • When a session object is requested, first look in memory. If that lookup fails, attempt to retrieve the session from the database.
  • Once a day, delete old sessions from the database.

This solution has some fairly nice benefits: when a session is active, it gets pulled into memory. When a session isn't active, it gets written out to disk. The session information is kept for a long time in a persistent data store and, while the session is being used, the information is kept in memory, and the expiration system is fast.

Pages: 1, 2

Next Pagearrow