Memory Contention in J2EE Applications for Multiprocessor Platforms
by Ramchandar Krishnamurthy, Deepak Goel11/10/2004
With the need for highly scalable J2EE applications in the enterprise environment, parallel processing of threads is required on multi-processor platforms. The memory requirements in the JVM heap for the processing of these threads and concurrent processing have caused to create performance and scalability bottlenecks in the deployment of these J2EE applications. This article explores the issue of synchronization of threads while accessing the memory within the JVM heap on a multi-processor platform for a J2EE application.
|
Related Reading
|
Memory Requirement of J2EE Applications
J2EE applications are currently being deployed in the enterprise environment that require thousands (if not millions) of customers requiring data and its processing every second. This huge requirement of data from so many concurrent customers has created a demand for larger heap size, and hence more RAM. With more RAM, which has helped in increasing the heap size available for J2EE applications, and multiple processors available for simultaneous processing of threads, the bottleneck has shifted to how the memory is accessed by these threads and the time being spent in this access.
Multi-Processor Platforms
As the number of processors is increased to resolve the issue of scalability, more and more threads are being simultaneously processed within the system. These threads require memory for the processing of data, the creation of new Java objects, and other Java operations. As multiple threads are processed in the processors, there is a need to maintain the coherency and the sanctity of the data in the system. The threads in these processors are reading and writing to the memory simultaneously. Hence there is a need to synchronize these threads so that they do not read incorrect data or overwrite each other's data. In Figure 1 below, the idea of single-processor versus multi-processor platforms accessing memory is shown. On a single-processor platform, there is only one thread executing at any given moment of time and therefore no need for synchronization. However, with multiple threads executing simultaneously on the multi-processor platform, the access to the memory needs to be synchronized, which results in contention and bottlenecks.

Figure 1. Single-processor versus multi-processor platforms accessing memory
The "Thread Local Area" attempts to solve this issue by pre-allocating a small amount of memory in the JVM heap for each Java thread. However, this memory space is found to be not sufficient for J2EE applications with high memory requirements.
Experiments and Observations
Experiments were done for a J2EE application with the code shown below on an eight-CPU platform.
String mem = request.getParameter("memory");
if (mem != null && !mem.equals("")) {
int mem_kbytes = Integer.parseInt(mem);
byte[] i = new byte[mem_kbytes*1024];
}
For every test, these J2EE threads were asked to create Java object of a particular memory size. A multi-user load was fired and the response times, throughput, and resource utilizations were observed for these tests. The JVM heap size was kept to a large value so that the frequency of garbage collections was kept to a minimum. We have observed that even with increased loads, the ratio of the "GC Pause Time" to the "Total time the test has been run" is about 1:35, and therefore considered its effect as small in this experiment. The results are shown in Figure 2.

Figure 2. Results of shared memory access on multi-processor platform
The environment used in this experiment is as follows:
- Application server: WebLogic Application Server 7
- JVM: Sun JVM 1.3
- Operating system: Windows 2000 Advanced Server
- Hardware: Intel Dell PowerEdge 8450 (Eight Intel Xeon 800MHz Processors, 4GB RAM)
- Network: 100Mbps Cisco dedicated network
- Load-testing tool: WebLoad
It was observed that as the load was increased for a Java object of a particular memory size, the rate of increase in the throughput was not as the same as that of the rate of increase in server-CPU utilization. Also, the increase in the response time was not linear, but exponential. These experiments clearly show that high memory requirements of these J2EE threads clearly create a contention in the system and acts as a bottleneck in the scaling of the J2EE applications. It was also observed that the time it takes for these threads to access the memory and create objects was added to the processor utilizations. This can be seen from the increase in the service demands for increasing loads, as in the chart in Figure 3 for the creation of a Java object of a particular memory size.

Figure 3. Increase in service demand versus load for Java objects of different memory sizes
Scale Out
As the memory contention is inherent in the system architecture, the best way to alleviate this problem is to add one more system box to the deployment configuration (scale out) with load balancing, as seen in the figure below. Scale up will not help, as adding more processors to the system will only increase the memory contention by allowing more parallel threads in your application to process, which will not help in increasing the throughput and response time.

Figure 4. Scale out of application server machine
Infrastructure Sizing
The time it takes to access memory by the threads in the processor gets added to the processor time, and hence the processor utilization. With more threads in the multiple processors waiting to access the memory, the processors do not get utilized effectively. The normal response of adding more processors to increase the output will not help in this case and will lead to ineffective capacity planning of the J2EE applications. Quantitative or simulation models for predicting performance and infrastructure need to take this into account.
The Memory Tuning of Your Applications
One of the most important steps in reducing this problem is to tune your J2EE application for unnecessary and loitering objects. Any of the memory debuggers can help you identify memory bottlenecks and achieve optimization; one example is JProbe. Memory debuggers give the memory allocation map of the application, indicating where the objects are being created and how much memory is being used. They also point out loitering objects, which contribute to the memory leak in the application. Object reuse also needs to be considered as a strategy for memory optimization. This can be done with the help of caching frameworks like Apache JCS. These caching frameworks help in storing the objects specified in the caching policy in your application for a particular period of time and allowing the reuse of the same.
Conclusion
The need to maintain the sanctity of the data in multi-processor environment causes synchronizations in the processing of the threads, which attribute to the bottleneck in the J2EE applications with high memory requirement. Some of the possible solutions to the memory contention include:
- Scale out by adding one or more system boxes to the deployment configuration.
- Tuning your applications for unnecessary objects and hence achieving optimization in memory contention.
Ramchandar Krishnamurthy is a senior technical architect in the Software Engineering and Technology Labs of Infosys Technologies Limited.
Deepak Goel is presently tinkering on a product in the artificial intelligence space.
Return to ONJava.com
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 6 of 6.
-
Web traffic
2005-03-04 23:41:03 JimAskwell [Reply | View]
Hi!
Did not you simulate any incomimg web traffic?did you not generate any incoming http traffic? or how did you test the network part receiving data?
Also : what does 'getParameter("memory"); means? did you fill a value in that parameter somewhere?
-
JVM: Sun JVM 1.3
2004-11-15 10:25:04 whirlycott [Reply | View]
I would have guessed that 1.4 and 1.5 would have been more relevant, especially 1.5 since it uses a new memory model.
More importantly, do your charts really indicate anything at all? How can we know what affect the additional CPUs are having when there is no comparison given for a single/dual/quad CPU boxes? This doesn't indicate anything more than "under load, things go slower."
I would also have been interested in seeing how different operating systems might handle memory contention issues.
Your suggestion to scale out doesn't address the possibility that you might need to create shared locks across the network when different nodes are trying to get exclusive resource to something, like a networked cache.
There's also a typo in your chart. The last value in column 1 ought to be 7500.
What is "Service Demand" and how is it measured?
-
JVM: Sun JVM 1.3
2004-11-22 03:20:01 DeepakGoel [Reply | View]
1. Service Demand is the amount of resource spent on a single request. Here we are looking at the CPU service demand which can be arrived with the following formula:
CPU Service Demand = Utilization
----------------------------
Number of Request Served/sec
In the chart 3 we see that the service demand is increasing with increasing load. This is due to the memory contention. Normally for most of the applications the service demand should remain more or less constant with increasing load.
2. We have tried with 2/4/6/8 CPU's and the behavior is very much the same. The effect of increasing load is that all the processors in the system would have threads running in them simultaneously which will bring about this contention.
3. We have tried with JVM 1.4 and the behavior is very much the same.
4. We have also run this on Windows and Solaris platforms. The memory contention remains.
-
Too simplistic
2004-11-11 01:52:43 jonathanrowe [Reply | View]
Sorry, but this benchmark is way too simplistic:
1) On a 32bit windows box, 3gb to 4gb are always reserved for memory mapping. Therefore, a windows box with 3gb and one with 4gb will both still have only 3gb available to the OS. I believe this limitation is imposed by the BIOS.
2) The Xeon MP does not scale very well running most types of application, even the latest XeonMP's have a shared 400mhz FSB (3.2gb/s of bandwidth) - Intel have tried to improve scalability by increasing the L3 cache size, the latest Xeon Mp's have upto 4mb of cache. The Xeon 800 has an FSB of 133mhz (1.06gb/s), sharing this bus between 8 processors is like trying to suck a bucket of water through a straw! (scuse the analogy). Remember also that all I/O goes through the FSB, a 15k SCSI drive or a single gigabit ethernet LAN is capable of bursting at 80mb/s (8% of your FSB bandwidth!).
3) Also windows does not scale well at this level, try running this on an OS that really scales (solaris, some linux, AIX etc).





Also we used a dedicated network backbone to eliminate the possibility of a network bottleneck.