Java RMI: Serialization
Versioning Classes
A few pages back, I described the serialization mechanism:
The serialization mechanism automatically, at runtime, converts class objects into metadata so instances can be serialized with the least amount of programmer work.
This is great as long as the classes don't change. When classes change, the metadata, which was created from obsolete class objects, accurately describes the serialized information. But it might not correspond to the current class implementations.
The Two Types of Versioning Problems
There are two basic types of versioning problems that can occur.
The first occurs when a change is made to the class hierarchy (e.g., a superclass is added or removed). Suppose, for example, a personnel application made use of two serializable classes: Employee and Manager (a subclass of Employee). For the next version of the application, two more classes need to be added: Contractor and Consultant. After careful thought, the new hierarchy is based on the abstract superclass Person, which has two direct subclasses: Employee and Contractor. Consultant is defined as a subclass of Contractor, and Manager is a subclass of Employee. See Figure 10-8.
While introducing Person is probably good object-oriented design, it breaks serialization. Recall that serialization relies on the class hierarchy to define the data format.
The second type of versioning problem arises from local changes to a serializable class. Suppose, for example, that in our bank example, we want to add the possibility of handling different currencies. To do so, we define a new class, Currency, and change the definition of Money:
public class Money extends ValueObject {
    public float amount;
    public Currency typeOfMoney;
}
This completely changes the definition of Money but doesn't change the object hierarchy at all.
The important distinction between the two types of versioning problems is that the first type can't really be repaired. If you have old data lying around that was serialized using an older class hierarchy, and you need to use that data, your best option is probably something along the lines of the following:
- Using the old class definitions, write an application that deserializes the data into instances and writes the instance data out in a neutral format, say as tab-delimited columns of text.
- Using the new class definitions, write a program that reads in the neutral-format data, creates instances of the new classes, and serializes these new instances.
The second type of versioning problem, on the other hand, can be handled locally, within the class definition.
How Serialization Detects When a Class Has Changed
In order for serialization to gracefully detect when a versioning problem has occurred, it needs to be able to detect when a class has changed. As with all the other aspects of serialization, there is a default way that serialization does this. And there is a way for you to override the default.
The default involves a hashcode. Serialization creates a single hashcode, of type long, from the following information:
- The class name and modifiers
- The names of any interfaces the class implements
- Descriptions of all methods and constructors except private methods and constructors
- Descriptions of all fields except private static and private transient fields
This single long, called the class's stream unique identifier (often abbreviated suid), is used to detect when a class changes. It is an extraordinarily sensitive index. For example, suppose we add the following method to Money:
public boolean isBigBucks( ) {
    return _cents > 5000;
}
We haven't changed, added, or removed any fields; we've simply added a method with no side effects at all. But adding this method changes the suid. Prior to adding it, the suid was 6625436957363978372L; afterwards, it was -3144267589449789474L. Moreover, if we had made isBigBucks( ) a protected method, the suid would have been 4747443272709729176L.
TIP: These numbers can be computed using the serialver program that ships with the JDK. For example, these were all computed by typing serialver com.ora.rmibook.chapter10.Money at the command line for slightly different versions of the Money class.
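The same value can also be read from inside a program. The following is a minimal sketch, assuming a hypothetical stand-in for Money that declares no serialVersionUID; SuidDemo and suidOf are names invented for illustration, while ObjectStreamClass.lookup( ) and getSerialVersionUID( ) are part of java.io.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SuidDemo {
    // Hypothetical stand-in for the chapter's Money class.
    // No serialVersionUID is declared, so the default hash is computed.
    static class Money implements Serializable {
        protected int _cents;
    }

    // Returns the suid that serialization would use for the given class.
    static long suidOf(Class<?> type) {
        ObjectStreamClass descriptor = ObjectStreamClass.lookup(type);
        if (descriptor == null) {
            // lookup() returns null for classes that are not serializable.
            throw new IllegalArgumentException(type + " is not serializable");
        }
        return descriptor.getSerialVersionUID();
    }

    public static void main(String[] args) {
        // The printed value depends on the exact shape of the class.
        System.out.println(suidOf(Money.class) + "L");
    }
}
```

Adding or removing a non-private method in this stand-in and rerunning the program should print a different number, which is the sensitivity described above.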
The default behavior for the serialization mechanism is a classic "better safe than sorry" strategy. The serialization mechanism uses the suid, which defaults to an extremely sensitive index, to tell when a class has changed. If it has, the serialization mechanism refuses to create instances of the new class using data that was serialized with the old classes.
Implementing Your Own Versioning Scheme
While this is reasonable as a default strategy, it would be painful if serialization didn't provide a way to override the default behavior. Fortunately, it does. Serialization uses the default suid only if a class definition doesn't provide one. That is, if a class definition includes a static final long named serialVersionUID, then serialization will use that static final long value as the suid. In the case of our Money example, if we included the line:
private static final long serialVersionUID = 1;
in our source code, then the suid would be 1, no matter how many changes we made to the rest of the class.
Explicitly declaring serialVersionUID allows us to change the class, and add convenience methods such as isBigBucks( ), without losing backwards compatibility.
TIP: serialVersionUID doesn't have to be private. However, it must be static, final, and long.
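As a quick check of this behavior, the sketch below pins the suid of a hypothetical Money variant and reads it back through ObjectStreamClass. The class ExplicitSuidDemo and the method declaredSuid( ) are illustrative names, not part of the book's example code.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class ExplicitSuidDemo {
    // Hypothetical variant of Money with an explicitly declared suid.
    static class Money implements Serializable {
        private static final long serialVersionUID = 1;
        protected int _cents;

        // A convenience method; with the suid pinned it no longer affects the value.
        public boolean isBigBucks() {
            return _cents > 5000;
        }
    }

    // Reads back the suid that serialization will use for Money.
    static long declaredSuid() {
        return ObjectStreamClass.lookup(Money.class).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(declaredSuid()); // prints 1
    }
}
```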
The downside to using serialVersionUID is that, if a significant change is made (for example, if a field is added to the class definition), the suid will not reflect this difference. This means that the deserialization code might not detect an incompatible version of a class. Again, using Money as an example, suppose we had:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    protected int _cents;
}
and we migrated to:
public class Money extends ValueObject {
    private static final long serialVersionUID = 1;
    public float amount;
    public Currency typeOfMoney;
}
The serialization mechanism won't detect that these are completely incompatible classes. Instead, when it tries to create the new instance, it will throw away all the data it reads in. Recall that, as part of the metadata, the serialization algorithm records the name and type of each field. Since it can't find the fields during deserialization, it simply discards the information.
The solution to this problem is to implement your own versioning inside of readObject( ) and writeObject( ). Your writeObject( ) method should begin by writing a version number:
private void writeObject(java.io.ObjectOutputStream out) throws IOException {
    out.writeInt(VERSION_NUMBER);
    ....
}
In addition, your readObject( ) code should start with a switch statement based on the version number:
private void readObject(java.io.ObjectInputStream in) throws IOException,
        ClassNotFoundException {
    int version = in.readInt( );
    switch(version) {
        // version specific demarshalling code.
        ....
    }
}
Doing this will enable you to explicitly control the versioning of your class. In addition to the added control you gain over the serialization process, there is an important consequence you ought to consider before doing this. As soon as you start to explicitly version your classes, defaultWriteObject( ) and defaultReadObject( ) lose a lot of their usefulness.
Trying to control versioning puts you in the position of explicitly writing all the marshalling and demarshalling code. This is a trade-off you might not want to make.
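To make the trade-off concrete, here is one way the two fragments above might be filled in. This is a sketch under stated assumptions, not the book's implementation: the field layout (cents plus a currency code), the meaning of versions 1 and 2, and the "USD" default are all hypothetical, and the fields are marked transient only to make it explicit that every value is written by hand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class VersionedMoneyDemo {
    static class Money implements Serializable {
        private static final long serialVersionUID = 1;
        private static final int VERSION_NUMBER = 2; // hypothetical current format

        // transient: the methods below marshal every field explicitly.
        transient int cents;
        transient String currencyCode;

        Money(int cents, String currencyCode) {
            this.cents = cents;
            this.currencyCode = currencyCode;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.writeInt(VERSION_NUMBER); // the version always goes first
            out.writeInt(cents);
            out.writeUTF(currencyCode);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            int version = in.readInt();
            switch (version) {
                case 1: // assumed older format: cents only
                    cents = in.readInt();
                    currencyCode = "USD"; // assumed default for old data
                    break;
                case 2: // current format: cents followed by currency code
                    cents = in.readInt();
                    currencyCode = in.readUTF();
                    break;
                default:
                    throw new IOException("Unknown Money format version " + version);
            }
        }
    }

    // Serializes to memory and reads the copy back.
    static Money roundTrip(Money original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Money) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Money copy = roundTrip(new Money(1250, "EUR"));
        System.out.println(copy.cents + " " + copy.currencyCode); // prints 1250 EUR
    }
}
```

Note that neither method calls defaultWriteObject( ) or defaultReadObject( ); every new field means another line in writeObject( ) and another case in the switch.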
Performance Issues
Serialization is a generic marshalling and demarshalling algorithm, with many hooks for customization. As an experienced programmer, you should be skeptical--generic algorithms with many hooks for customization tend to be slow. Serialization is not an exception to this rule. It is, at times, both slow and bandwidth-intensive. There are three main performance problems with serialization: it depends on reflection, it has an incredibly verbose data format, and it is very easy to send more data than is required.
Serialization Depends on Reflection
The dependence on reflection is the hardest of these to eliminate. Both serializing and deserializing require the serialization mechanism to discover information about the instance it is serializing. At a minimum, the serialization algorithm needs to find out things such as the value of serialVersionUID, whether writeObject( ) is implemented, and what the superclass structure is. What's more, using the default serialization mechanism (or calling defaultWriteObject( ) from within writeObject( )) will use reflection to discover all the field values. This can be quite slow.
TIP: Setting serialVersionUID is a simple, and often surprisingly noticeable, performance improvement. If you don't set serialVersionUID, the serialization mechanism has to compute it. This involves going through all the fields and methods and computing a hash. If you set serialVersionUID, on the other hand, the serialization mechanism simply looks up a single value.
Serialization Has a Verbose Data Format
Serialization's data format has two problems. The first is all the class description information included in the stream. To send a single instance of Money, we need to send all of the following:
- The description of the ValueObject class
- The description of the Money class
- The instance data associated with the specific instance of Money
This isn't a lot of information, but it's information that RMI computes and sends with every method invocation. (Recall that RMI resets the serialization mechanism with every method call.) Even if the first two bullets comprise only 100 extra bytes of information, the cumulative impact is probably significant.
The second problem is that each serialized instance is treated as an individual unit. If we are sending large numbers of instances within a single method invocation, then there is a fairly good chance that we could compress the data by noticing commonalities across the instances being sent.
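The overhead of the class descriptions is easy to measure. The sketch below serializes one instance of a hypothetical two-class hierarchy to memory and compares the stream length with the four bytes of actual instance data; StreamSizeDemo and serializedSize( ) are illustrative names, and the exact byte count will vary with class and package names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class StreamSizeDemo {
    // Hypothetical stand-ins for the chapter's ValueObject and Money classes.
    static class ValueObject implements Serializable {
        private static final long serialVersionUID = 1;
    }

    static class Money extends ValueObject {
        private static final long serialVersionUID = 1;
        protected int _cents = 1250; // the only instance data: one 4-byte int
    }

    // Serializes the value into memory and returns the stream length in bytes.
    static int serializedSize(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("instance data bytes: " + Integer.BYTES);
        System.out.println("serialized bytes:    " + serializedSize(new Money()));
    }
}
```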
It Is Easy to Send More Data Than Is Required
Serialization is a recursive algorithm. You pass in a single object, and all the objects that can be reached from that object by following instance variables are also serialized. To see why this can cause problems, suppose we have a simple application that uses the Employee class:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
}
In a later version of the application, someone adds a new piece of functionality. As part of doing so, they add a single additional field to Employee:
public class Employee implements Serializable {
    public String firstName;
    public String lastName;
    public String socialSecurityNumber;
    public Employee manager;
}
What happens as a result of this? On the bright side, the application still works. After everything is recompiled, the entire application, including the remote method invocations, will still work. That's the nice aspect of serialization--we added new fields, and the data format used to send arguments over the wire automatically adapted to handle our changes. We didn't have to do any work at all.
On the other hand, adding a new field redefined the data format associated with Employee. Because serialVersionUID wasn't defined in the first version of the class, none of the old data can be read back in anymore. And there's an even more serious problem: we've just dramatically increased the bandwidth required by remote method calls.
Suppose Bob works in the mailroom, and we serialize the object associated with Bob. In the old version of our application, the data for serialization consisted of:
- The class information for Employee
- The instance data for Bob
In the new version, we send:
- The class information for Employee
- The instance data for Bob
- The instance data for Sally (who runs the mailroom and is Bob's manager)
- The instance data for Henry (who is in charge of building facilities)
- The instance data for Alison (Director, Corporate Infrastructure)
- The instance data for Mary (VP in charge of IT)
And so on...
The new version of the application isn't backwards-compatible because our old data can't be read by the new version of the application. In addition, it's slower and is much more likely to cause network congestion.
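The growth in stream size can be observed directly. The following sketch, assuming a trimmed-down hypothetical Employee with only a name and a manager reference, serializes Bob once on his own and once with his management chain attached; ReachabilityDemo and serializedSize( ) are names invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ReachabilityDemo {
    // Hypothetical, trimmed-down version of the Employee class above.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1;
        public String firstName;
        public Employee manager;

        Employee(String firstName, Employee manager) {
            this.firstName = firstName;
            this.manager = manager;
        }
    }

    // Serializes the object graph rooted at 'root' and returns its size in bytes.
    static int serializedSize(Object root) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(root);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Employee mary = new Employee("Mary", null);
        Employee alison = new Employee("Alison", mary);
        Employee henry = new Employee("Henry", alison);
        Employee sally = new Employee("Sally", henry);
        Employee bob = new Employee("Bob", sally);
        Employee bobAlone = new Employee("Bob", null);

        System.out.println("Bob alone:      " + serializedSize(bobAlone) + " bytes");
        System.out.println("Bob with chain: " + serializedSize(bob) + " bytes");
    }
}
```

Every manager reachable from Bob is written to the stream, so the second number grows with the depth of the reporting chain.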
The Externalizable Interface
To solve the performance problems associated with making a class Serializable, the serialization mechanism allows you to declare that a class is Externalizable instead. When ObjectOutputStream's writeObject( ) method is called, it performs the following sequence of actions:
- It tests to see if the object is an instance of Externalizable. If so, it uses externalization to marshall the object.
- If the object isn't an instance of Externalizable, it tests to see whether the object is an instance of Serializable. If so, it uses serialization to marshall the object.
- If neither of these two cases applies, an exception is thrown.
Externalizable is an interface that consists of two methods:
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out) throws IOException;
These have roughly the same role that readObject( ) and writeObject( ) have for serialization. There are, however, some very important differences. The first, and most obvious, is that readExternal( ) and writeExternal( ) are part of the Externalizable interface. An object cannot be declared to be Externalizable without implementing these methods.
However, the major difference lies in how these methods are used. The serialization mechanism always writes out class descriptions of all the serializable superclasses. And it always writes out the information associated with the instance when viewed as an instance of each individual superclass.
Externalization gets rid of some of this. It writes out the identity of the class (which boils down to the name of the class and the appropriate serialVersionUID). It also stores the superclass structure and all the information about the class hierarchy. But instead of visiting each superclass and using that superclass to store some of the state information, it simply calls writeExternal( ) on the local class definition. In a nutshell: it stores all the metadata, but writes out only the local instance information.
TIP: This is true even if the superclass implements Serializable. The metadata about the class structure will be written to the stream, but the serialization mechanism will not be invoked. This can be useful if, for some reason, you want to avoid using serialization with the superclass. For example, some of the Swing classes, while they claim to implement Serializable, do so incorrectly (and will throw exceptions during the serialization process). (JTextArea is one of the most egregious offenders.) If you really need to use these classes, and you think serialization would be useful, you may want to think about creating a subclass and declaring it to be Externalizable. Instances of your class will be written out and read in using externalization. Because the superclass is never serialized or deserialized, the incorrect code is never invoked, and the exceptions are never thrown.
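The following sketch shows what this means in practice for a small hierarchy. It assumes a hypothetical Serializable superclass ValueObject with a single label field and an Externalizable subclass Money; because only writeExternal( ) is called, the subclass has to write the superclass's state itself. ExternalizableDemo, the label field, and the accessor names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ExternalizableDemo {
    // Hypothetical superclass. It is Serializable, but that machinery is
    // never invoked for instances of the Externalizable subclass below.
    static class ValueObject implements Serializable {
        private static final long serialVersionUID = 1;
        protected String label = "";
    }

    static class Money extends ValueObject implements Externalizable {
        private static final long serialVersionUID = 1;
        private int cents;

        // Externalization creates instances through a public no-arg constructor.
        public Money() { }

        Money(String label, int cents) {
            this.label = label;
            this.cents = cents;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(label); // superclass state, written by hand
            out.writeInt(cents); // local state
        }

        @Override
        public void readExternal(ObjectInput in)
                throws IOException, ClassNotFoundException {
            // Values must be read in the order they were written.
            label = in.readUTF();
            cents = in.readInt();
        }

        String getLabel() { return label; }
        int getCents() { return cents; }
    }

    // Writes the object to memory and reads a copy back.
    static Money roundTrip(Money original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Money) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Money copy = roundTrip(new Money("petty cash", 1250));
        System.out.println(copy.getLabel() + ": " + copy.getCents()); // prints petty cash: 1250
    }
}
```

If the writeUTF( ) and readUTF( ) calls for label were removed, the copy would come back with the superclass's default empty label, since nothing else stores that field.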
