Published on ONJava.com (http://www.onjava.com/)
 See this if you're having trouble printing code examples

Scaling Enterprise Java on 64-bit Multi-Core X86-Based Servers

by Michael Juntao Yuan, Dave Jaffe

Multi-core and 64-bit CPUs are the hottest commodities in the enterprise server market these days. In recent years, as the cost and power requirement of faster CPU clock speeds has increased, the growth in raw clock speed (usually measured in megahertz) of single CPUs has slowed down. Hardware manufacturers continue to improve X86-based server performance by increasing both the multitasking capability and internal data bandwidth. Both Intel and Advanced Micro Devices are shipping 64-bit processors with two internal CPU cores, and quad core processors are soon to follow. Ninth-generation servers from Dell exploit this new generation of chips. The PowerEdge 1955 blade server, for example, supports up to two 64-bit dual core processors in a blade configuration, with up to ten such blades in a 7-rack unit (12.25") chassis.

However, those new generations of servers also pose new challenges for the software. For instance, to take advantage of the multi-core CPUs, the software application must be able to execute tasks in parallel across the CPUs; to take advantage of the 64-bit memory bandwidth, the application must also be able to manage a large amount of memory efficiently. As a key software platform on enterprise servers, Java Enterprise Edition (Java EE) is on the forefront of this multi-core, 64-bit revolution. Java EE developers must adapt to those challenges to make the most out of hardware investment.

When Java first came out in 1997, the state-of-the-art PC had a single CPU with a less than 300MHz clock speed and less than 64MB of RAM. The first Java applications were mostly on the client side. High-performance multitasking and large memory handling were clearly not the priority for Java designers at that time. But as Java became widely adopted for server-side applications, things started to change. Web applications are inherently multithreaded since each web request can be handled in a separate thread, parallel to other requests. The latest Java platform has greatly improved performance on modern server hardware.

In this article, we look at the current state of enterprise Java and analyze the challenges it faces with the new generation of servers. Based on our experience working on Java EE applications running on the JBoss Application Server in the Dell Scalable Enterprise Technology Center, we provide solutions and tips to scale your Java EE applications to the latest server hardware.

Tune the JVM

The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.

First, you should allocate as much memory as possible to the JVM using the -Xms<size> (minimum memory) and -Xmx<size> (maximum memory) flags. For instance, the -Xms1g -Xmx1g tag allocates 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM would limit the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve the slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM wouldn't need to dynamically change the heap size at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all RAM on the server. Otherwise, the JVM would only be able to utilize 2GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0.

With a large heap memory, the garbage collection (GC) operation could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multiple gigabyte heap. In JDK 1.3 and earlier, GC is a single threaded operation, which stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but it also results in very poor performance on multi-CPU computers since all other CPUs must wait in idle while one CPU is running at 100% to free up the heap memory space. It is crucial that we select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable. So, we strongly recommend you upgrade to JDK 5.0. Using the command line flags, you can choose from the following two GC algorithms. Both of them are optimized for multi-CPU computers.

Furthermore, there are a few more JVM parameters you can tune to optimize the GC operations.

Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.

Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.

Use New Platform APIs

Besides the JVM, the Java platform libraries have also gone through extensive changes to accommodate the newer server hardware. We strongly recommend you upgrade your application to JDK 5.0+ in order to take advantage of all the performance enhancements built into the platform. Three new library APIs introduced in the last two major versions of the JDK are of particular importance for multi-CPU computers.

You should write new applications and upgrade older applications to use the concurrency, NIO, and logging APIs whenever possible. If you cannot upgrade, you should use alternative open source libraries that provides similar features. For instance, Doug Lea's util.concurrent library has many of the same features as the JDK 5.0 concurrency API; furthermore, the Apache Log4j library is comparable to the JDK 1.4+ logging library.

Optimize Your Code

In the previous sections, we discussed the general guidelines to build and run high-performance Java EE applications for multiple CPU and large memory servers. However, each application is unique with its own performance requirements and bottlenecks. The only way to make sure that your application is optimized for your hardware is through extensive performance testing. In this section, we cover some of the basic techniques to diagnose performance problems in your application.

It's beyond the scope of this article to cover performance-testing tools and frameworks. In our tests, we used Grinder, an open source performance-testing framework in Java. It can simulate hundreds of thousands of concurrent users across multiple testing computers and gather statistics on a central console. It provides a utility for you to record your test scripts by going through your web application in a browser. The generated script is in Jython, and you can easily modify it to suit your own needs.

As we discussed before, tuning GC operations in the JVM is crucial for performance. The easiest way to see the effects of various GC algorithm parameters is to monitor the time the application spends on GC throughout the load testing. There are two simple ways to do it.

Thumbnail, click for full-size image.

Figure 1. JConsole in JDK 5.0 (click for full-size image)

To pinpoint the exact location of a memory leak, you can use an application profiler. The JBoss Profiler is an open source profiler for applications inside the JBoss Application Server.

When the application is fully loaded, the CPU should run between 80% and 100% of its capacity. If the CPU usage is substantially lower, you should look for other bottlenecks, such as whether the network or disk I/O is saturated. However, an underutilized CPU could also indicate contention points inside the application. For instance, as we mentioned before, if there is a synchronized block on the critical path of multiple threads (e.g., a code block frequently accessed by most requests), the multiple CPUs would not be fully utilized. To find those contention points, you can do a thread dump when the server is fully loaded:

The thread dump prints out detailed information (stack trace with source code line numbers) about all current threads in the server. If all the request-handling threads are waiting at the same point, it would indicate a contention point, and you can go back to the code to fix it.

Sometimes, the contention point is not in the application but in the application server itself. Most Java EE application servers have not completely evolved their code base to take advantage of JDK 5.0 APIs, especially the concurrent utility libraries. In this case, it is crucial to choose an open source application server, such as the JBoss Application Server, where you can make changes to the server code.

Collapse the Tiers

Traditionally, Java EE had been designed for the multitiered architecture. This architecture envisions that the web server, servlet container, EJB server, and database server each runs on its own physical computer, and those computers are tied together through remote call protocols on the local network.

But with the new generation of more powerful server hardware, a single computer is powerful enough to run all those components for a medium-sized website. Running everything on the same physical machine is much more efficient than the distributed architecture described above. All communications are now interthread communications that can be handled efficiently by the same operating system or even inside the same JVM in many cases. It eliminates the expensive object serialization requirements and high network latency associated with remote calls. Furthermore, since different components tend to use different kind of server resources (e.g., the database is heavy on disk usage while Java EE is CPU-intensive), the integrated stack helps us to balance the server usage and reduce overall contention points.

Thumbnail, click for full-size image.

Figure 2. Choose between call-by-reference and call-by-value in the JBoss AS installer (click for full-size image)

The JBoss Application Server has built-in optimizations to support the single JVM deployment. For instance, by default, JBoss AS makes call-by-reference method calls from the servlet to the EJB objects. Call-by-reference can be up to ten times faster than the standard Java EE call-by-value approach, because call-by-value requires object serialization and is primarily for remote calls across JVMs. Figure 2 shows that you can choose from the two call-isolation methods.

The JBoss Web Server project goes one step further and builds native Apache web server functionalities directly into the Java EE servlet container. It allows much tighter integration between components when deployed on the same server, and hence could deliver much better performance than older Java EE servers.

With the entire middleware stack running on the same physical server, we also drastically simplify deployment and management. When you need to scale the application up, you simply add a load balancer, move the shared database server to a different computer, and then add any number of server nodes with the integrated middleware stack (see Figure 3).

The load balanced architecture
Figure 3. The load-balanced architecture

All web requests are made against the load balancer, which then forwards the requests to application servers in a manner that ensures all nodes have similar numbers of request per unit time. The load balancer should be configured to forward all requests from the same user session to the same node (i.e., use sticky session). Most Java EE application servers also support automatic state replication between the nodes to avoid application state loss when a node fails. For instance, the JBoss AS supports several state-replication strategies including buddy replication, in which each node can choose a "buddy" as its failover. Such load balancing and state replication would be very difficult to deploy and manage if we had three or four tiers of servers.

Virtualize the Hardware

The new multi-core 64-bit servers are capable of running heavy-load web applications. But for small web applications, they can be overkill. To fully utilize server capabilities, we sometimes run multiple small websites on the same physical server. Of course, you can deploy multiple applications on the same Java EE container. But to achieve the optimal performance, stability, and manageability, we often wish to run each application in its own Java EE container. How do we do that?

Technically, the primary challenge to run multiple Java EE server instances on the same physical server is to avoid the port conflicts. Like any other server application, the Java EE application server listens on TCP/IP ports to provide services. For instance, the HTTP service listens for web requests on the 80 or 8080 port; the RMI service listens for RMI invocation requests on the 4444 port; the naming service listens on the 1099 port; etc. For the server to run properly, it must obtain exclusive control over the port it listens to. So, when you start multiple server instances on the same computer, you are likely to get port conflict errors. There are three ways to avoid port conflicts.

To achieve optimal results, we should run no more server instances than the number of physical CPUs in the server; otherwise, the server instances would wait on one another to use the CPUs, creating more contention points. We should also keep the memory allocation for each server instance at around 1GB for optimal GC results.


Michael Yuan would like to thank Phillip Thurmond of Red Hat for reviewing this article and providing helpful suggestions.


Michael Juntao Yuan specializes in lightweight enterprise / web application, and end-to-end mobile application development.

Dave Jaffe is an engineer in Dell's Scalable Enterprise Technology Center.

Return to ONJava.com.

Copyright © 2009 O'Reilly Media, Inc.