Published on ONJava.com (http://www.onjava.com/)

Web Development in Heavy Traffic

by Pier Fumagalli

It happens from time to time: you spend a few years working on one particular aspect of a problem, you come to believe you are "experienced" in it, and then, once your environment changes, you notice that you had been looking at it with the eyes of a blind man.

It happened to me recently. Since 1997, I have been working on the problems related to the integration of Web servers (especially the Apache Web server) with servlet containers (notably, Apache JServ and Tomcat), and I thought I had become an expert in that field. And then I went through the common round of layoffs, and had to look for a new job.

I ended up, luckily, at VNU Business Publications, in London, on its Web team. VNU has been a news company for 20-odd years; not a small Internet startup with a revolutionary idea, nor a software giant with thousands of engineers. Our business is simple: we sell information, through newspapers and magazines and on our Web site.

Given our relatively simple business plan (you wish), our problem is even simpler: our Web site has a lot of traffic (around 130 servlet requests per second at peak times) and it needs to be up and running 24 hours a day, 365 days a year. It's quite different from what I was used to: the soft and nice world of research far from practical use.

To handle this kind of load, of course, your infrastructure needs to be almost perfect. You need to have the best Web server, the best servlet container, the best of everything. I concentrate my efforts on the first two: Web server (Apache) and servlet container (Tomcat).

But instead of speaking about these issues now (you can come to the O'Reilly Open Source Conference and browse around at my talk on Wednesday), I would like to address a topic related to high-load Web sites and servlets.

The topic I want to address is how to design (or adapt) your Web application for deployment under high loads, and I want to share some of the obvious (and not obvious) performance enhancements we currently use on our news site.

Static Content

One of the Tomcat developers recently pointed out to me that the Apache module I am developing (mod_webapp) "still forwards requests for static webapp resources to Tomcat." Needless to say, it does but, if you see it from my perspective, this is indeed a pretty low priority for us.

VNU handles several million requests each day through its Web servers, and static content accounts for approximately 70 percent of our bandwidth. In our case, the problem would be that if every single image, PDF, or CSS file had to go through the servlet engine to be processed and delivered, we would overload the input/output of our Java Virtual Machine.

We use JSPs to style our content, so the solution we adopted was simple: we invented a new tag. In our JSPs, instead of the HTML <img> tag, we use a very small custom tag: <vnu:img>. Our tag takes the SRC argument, and prints out a different (real HTML this time) <img> tag. For example, if in our JSPs you see something like:

<vnu:img src="/v6_logo_main.gif" ...>

our custom tag handler would convert it to something like:

<img src="http://images.vnunet.com/v6_image/v6_logo_main.gif" ...>

Why did we do that? It is pretty easy to explain. First of all images.vnunet.com is another Web server. Simply enough, all of our static traffic is handled by another instance of Apache, finely tuned to deliver only files, without the additional load of servlet requests, CGIs, Perl, or PHP (you name it).

Another advantage of this custom tag is that we can easily scale. If the amount of static data served becomes too much to be handled by one server, we can promptly put up a second one and modify our tag library to "rewrite" the SRC attribute, one time with images.vnunet.com and another with images2.vnunet.com, without having to load balance our servers or modify our routers' configuration.
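The rewriting step such a tag performs can be sketched in plain Java. This is a minimal illustration and not VNU's actual tag handler: the class name ImageUrlRewriter, its constructor arguments, and the choice of picking a host by hashing the path are all assumptions. Hashing is used here instead of strict alternation so that a given image always resolves to the same host, which keeps browser and proxy caches effective.

```java
import java.util.List;

/**
 * Sketch of the URL rewriting a tag such as <vnu:img> could delegate to.
 * The class name, arguments, and hashing strategy are illustrative assumptions.
 */
final class ImageUrlRewriter {
    private final List<String> hosts;  // for example images.vnunet.com, images2.vnunet.com
    private final String prefix;       // for example /v6_image

    ImageUrlRewriter(List<String> hosts, String prefix) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = List.copyOf(hosts);
        this.prefix = prefix;
    }

    /** Maps a site-relative SRC such as /v6_logo_main.gif to an absolute URL. */
    String rewrite(String src) {
        String path = src.startsWith("/") ? src : "/" + src;
        // The same path always maps to the same host.
        int index = Math.floorMod(path.hashCode(), hosts.size());
        return "http://" + hosts.get(index) + prefix + path;
    }

    public static void main(String[] args) {
        ImageUrlRewriter rewriter =
            new ImageUrlRewriter(List.of("images.vnunet.com"), "/v6_image");
        // prints http://images.vnunet.com/v6_image/v6_logo_main.gif
        System.out.println(rewriter.rewrite("/v6_logo_main.gif"));
    }
}
```

Adding a second server then means changing only the list of hosts handed to the rewriter; the JSP pages themselves stay untouched.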

But probably the biggest advantage we can get out of a simple idea like this is that we can distribute our static content geographically. If, for example, one day we decided to use a service like Akamai, again, the only place throughout our entire application where I have to change something is our tag library, without even thinking about the thousands of JSP pages sitting on my server.

We use JSP, but such a simple approach can be used with any templating language. If you have static HTML files, the only thing you have to do is parse them once and rewrite what needs to be rewritten; a simple sed script does the trick.
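For static HTML, the same one-off rewrite can be sketched in Java with a regular expression in place of sed. The class name StaticHtmlRewriter and the base URL passed in are assumptions for illustration, and a regular expression is only adequate for simple, well-formed <img> tags; it is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: prefix site-relative <img> sources in static HTML with a base URL. */
final class StaticHtmlRewriter {
    // Group 1: the tag up to and including src="
    // Group 2: a site-relative path (one leading slash, not a // URL)
    // Group 3: the closing quote
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*?\\bsrc=\")(/[^\"/][^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String base) {
        Matcher matcher = IMG_SRC.matcher(html);
        // quoteReplacement guards against '$' or '\' appearing in the HTML.
        return matcher.replaceAll(match -> Matcher.quoteReplacement(
            match.group(1) + base + match.group(2) + match.group(3)));
    }

    public static void main(String[] args) {
        String html = "<img src=\"/v6_logo_main.gif\" alt=\"logo\">";
        System.out.println(rewrite(html, "http://images.vnunet.com/v6_image"));
    }
}
```

Sources that are already absolute are left alone, so running the rewrite twice over the same file does no harm.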

Resources and Caches

Another interesting topic to keep in mind when deploying a large-scale application is how resource-hungry it is: you have to consider what impact each new component is going to have on your overall infrastructure.

Let's look again at a practical example. VNU delivers news on the Web. All of our articles are nicely stored in a huge Oracle database, we have a library of several years accessible online at all times, and at the rate at which James Middleton writes articles, I believe that our Sun StorEdge is going to be full in a few months.

Our idea is that each new story, each new article, might have references in the past and, although most of those references are quite recent, some of them go back quite a long way. If you browse our site and check out the related articles at the bottom of every page, you will understand what I mean.

If, for every article you click, we had to query the database, we would probably need a Sun E10000 by now, only to pass the articles back to our Web server. But a couple of years ago, a colleague of mine named Julian Mitchell came up with the idea of caching articles in the servlet container itself.

Brilliant solution: for a few pounds, we added some extra RAM onto our Web server and gave it all to the Java Virtual Machine. Every time a user looks for an article, it is not retrieved from the database itself most of the time, but rather from the cache stored in the Virtual Machine memory, and we hit the database only if the article is not cached. You can even see now when we restart our application: the site, for the first four to five minutes, is quite slow, as we are effectively refilling the cache of articles.
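The read-through behaviour described above can be sketched as a small in-memory cache. This is an illustration under stated assumptions, not VNU's implementation: the class name ArticleCache, the loader function standing in for the database query, and the least-recently-used eviction policy are all hypothetical choices.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch of a bounded read-through cache with least-recently-used eviction. */
final class ArticleCache<K, V> {
    private final Function<K, V> loader;  // stands in for the database query
    private final Map<K, V> entries;

    ArticleCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first in line for eviction.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, loading and storing it on a miss. */
    synchronized V get(K id) {
        V cached = entries.get(id);
        if (cached != null) {
            return cached;
        }
        // Simplification: the lock is held while loading, so misses are serialized.
        V loaded = loader.apply(id);
        if (loaded != null) {
            entries.put(id, loaded);
        }
        return loaded;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A cache like this starts empty after every restart, which is exactly why the first few minutes are slow: every request is a miss until the popular articles have been loaded again.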

Several products now incorporate this feature as the default. If you look throughout Apache, for example, you will notice that Cocoon has an "embedded" cache storing the result of each atomic operation (XML parsing, XSLT processing, everything) so that it is able to deliver the quite-heavy-to-process XML-based content in basically no time.

If you are going to implement a cache for some part of your application, make sure you know exactly how much RAM you can spare for it, and that you have the right tools for measuring how much memory you are using at any given point. Several books explain different caching algorithms. My favorites are those about microprocessor hardware architecture: every microprocessor has some sort of cache and, since it needs to be implemented in hardware, the algorithms are usually pretty small and functional.
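One way to measure heap use from inside the virtual machine is the standard java.lang.management API. The sketch below is only an illustration: the class name HeapBudget and the idea of checking a byte budget before growing a cache are assumptions, and the figures the JVM reports are approximate because they include garbage not yet collected.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Sketch: ask the JVM how much of its heap is in use. */
final class HeapBudget {
    /** Fraction of the maximum heap currently used, or -1 if no maximum is reported. */
    static double usedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() <= 0 ? -1.0 : (double) heap.getUsed() / heap.getMax();
    }

    /** True if the heap could, on paper, take extraBytes more without exceeding its maximum. */
    static boolean hasRoomFor(long extraBytes) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() > 0 && heap.getUsed() + extraBytes <= heap.getMax();
    }

    public static void main(String[] args) {
        System.out.println("heap used fraction: " + usedFraction());
    }
}
```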

Of course, VNU's cache is designed to deal with articles; Cocoon's is optimised for XML content. The Java Platform one day will probably have a caching engine in itself; JSR-107 is aimed exactly at that. It specifies API and semantics for temporary, in-memory caching of Java objects, including object creation, shared access, spooling, invalidation, and consistency across JVMs.

Funnily enough, the JSR was submitted by Oracle. (Maybe they had the same problem we had on our site?)

Tuning and Monitoring the Virtual Machine

The virtual machine is the most critical piece in the overall problem of deploying large-scale Web applications.

Apache, for example, goes to a great deal of trouble to make sure that it won't crash. The architecture of Apache 2.0 was designed to include threads for performance reasons (each one of your processes can serve several concurrent requests), but at the same time to guarantee that if one of the Apache processes dies, you won't have to wait for its restart to continue serving requests. It is usual to see configurations with 4, 8, or 16 Apache processes, each one of them using 64, 128, or 256 threads.

The virtual machine is just one big process with hundreds of threads processing requests but, if this one goes down, you have to wait until it comes back up again and, during that downtime, requests obviously cannot be processed.

One way to overcome this is to load-balance several JVMs with the same set of Web applications deployed, but this is not an easy thing to do.

A much simpler approach is to separate virtual applications across several containers. For example, run each of your applications in a different servlet container, in a different Java Virtual Machine.

This gives you fine-grained control over each individual component of your Web site: you can individually control, for example, how much memory each application requires in comparison to the number of requests (see how well your application reacts to spikes of traffic), or what its overall impact on your OS will be (top can tell you all about it).

The advantage is that, if one of these falls over, your site (or most parts of it) will still be up and running, and it will take much less time to restart a VM holding one single application, rather than a VM containing four, five, or six of them (less memory, fewer classes to load, fewer JSPs to recompile).

One other important thing to monitor in your virtual machine is the number of file descriptors. As with every other process, the Java Virtual Machine has a limit imposed by the operating system on the number of file descriptors (files and sockets for the most part) it can open at the same time.

Things like Lucene (search engine), JDBC connection pools, and client connections from the Web servers can greatly vary the number of file descriptors opened at any given time by the Virtual Machine.

Given that most operating systems are quite conservative about the number of file descriptors each process can open (usually it is 256 or 512), you want to make sure that your limit is high enough for your Web application. ulimit is a great but sometimes forgotten utility: if you start seeing IOExceptions mentioning that the VM cannot open a file that is actually on the disk and has the right permissions, this limit is most likely the cause.
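The descriptor count can also be read from inside the virtual machine. The sketch below assumes a Unix-like system and a JDK that provides com.sun.management.UnixOperatingSystemMXBean (HotSpot-based JDKs do); on other platforms it simply reports nothing. The class name FileDescriptorCheck is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Sketch: compare open file descriptors against the per-process limit. */
final class FileDescriptorCheck {
    /** Returns {open, max}, or null when the JVM does not expose these counters. */
    static long[] openAndMax() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] counts = openAndMax();
        if (counts == null) {
            System.out.println("file descriptor counters not available on this JVM");
        } else {
            System.out.println("open=" + counts[0] + " max=" + counts[1]);
        }
    }
}
```

Logging these two numbers regularly shows how close the application gets to its limit during traffic spikes, before the IOExceptions start.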

When Things Go Wild

If something bad can happen, it will. Murphy's Law is an everyday reality for those involved with production servers.

The first thing to stress is the importance of logging. Logging is quite an expensive operation, and sometimes someone turns it off for performance's sake. This is the first big mistake you can make, because without logs, you will not know what has actually happened if something crashes. And make sure you have relevant logging details.

Another common mistake is to be overloaded by logging information. In some situations I've had the not-so-pleasant experience of visiting, if something crashes, someone will have to go (manually) through several megabytes of log files just to figure out at what time things went bad. A good approach is to make sure that each one of your log files contains relevant information. It is a good idea to split log files into several different categories, each one for a particular area you want to focus on, as it is easier to combine log files than to split them.

Make sure also that your logs can be easily parsed. Most of the time, the data is too much to be analyzed manually, but a couple of good Perl or bash scripts, with an introductory book on statistical analysis, can do marvels that you never imagined possible. For instance, remember that if you log exception stack traces, these are in a very ugly format, and are not well suited, for example, to be in the same file as your Apache error logs.
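As an illustration of parsable logs, here is a sketch of a java.util.logging formatter that writes one tab-separated line per record and folds any stack trace into that same line. The class name OneLineFormatter and the column layout are assumptions; the logger name doubles as the log category, so records can be routed to separate files by attaching a handler per logger.

```java
import java.time.Instant;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

/** Sketch: timestamp, level, category, message, and optional exception on a single line. */
final class OneLineFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        StringBuilder line = new StringBuilder();
        line.append(Instant.ofEpochMilli(record.getMillis())).append('\t')
            .append(record.getLevel().getName()).append('\t')
            .append(record.getLoggerName()).append('\t')
            .append(flatten(formatMessage(record)));
        Throwable thrown = record.getThrown();
        if (thrown != null) {
            // Keep the stack trace, but on the same line, so one record is always one line.
            line.append('\t').append(flatten(thrown.toString()));
            for (StackTraceElement frame : thrown.getStackTrace()) {
                line.append(" | at ").append(frame);
            }
        }
        return line.append(System.lineSeparator()).toString();
    }

    /** Replaces tabs and line breaks so they cannot break the column layout. */
    private static String flatten(String text) {
        return text == null ? "" : text.replaceAll("[\\r\\n\\t]+", " ");
    }
}
```

With every record on one line, counting errors per minute or per category becomes a job for a one-line script rather than a manual read through the file.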

Another thing to remember is to monitor resources. A spike in traffic (the usual one at 10 PM when all geeks turn on their computers and look for news) can put your entire Web server at risk. Things that in the past few months had quite some relevance for us were:

RAM. On the overall system, RAM consumption might vary quite a lot, so be sure to monitor it and not swap too much.

Swap. As before, the more swap is in use, the slower performance is. On a side note, remember that at least on Solaris, /tmp is mounted on your swap, so it's usually a bad idea to store 500 megabytes of tarballs in there just because it's fast.

Web Server Connections. Remember to monitor the state of the Web server, how many processes and threads (the -L option for ps under Solaris tells you about threads) are active, how many clients have active requests, and what they are requesting (Apache's mod_status can tell you all about this). About access logging, please do not trust services such as WebTrends or NetTraffic. They are marvelous marketing tools, but not even close to being a reliable way to figure out the activity on your Web server.

Network traffic. How much data are you sending or receiving from the network? This is essential to know, for example, in case of denial-of-service attacks.

If you collect all of this information in a timely manner (for example, we monitor each one of these parameters every minute), when something goes wrong you will have the situation pretty much under control. You will start to have a rough idea of what actually happened, at exactly what time, and more or less why.
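Periodic sampling of this kind can be sketched with a scheduled task. The example below covers only what the virtual machine can report about itself (heap use, thread count, system load average); system RAM, swap, Web server connections, and network traffic need operating system tools or Apache's mod_status. The class name ResourceSampler and the output format are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: write one tab-separated line of JVM measurements at a fixed interval. */
final class ResourceSampler {
    static String sample() {
        Runtime runtime = Runtime.getRuntime();
        long usedHeap = runtime.totalMemory() - runtime.freeMemory();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        // getSystemLoadAverage returns a negative value where it is not available.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return Instant.now() + "\theap_used=" + usedHeap
            + "\tthreads=" + threads + "\tload_avg=" + load;
    }

    /** Starts sampling on a daemon thread; call shutdownNow() on the result to stop. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(task -> {
                Thread thread = new Thread(task, "resource-sampler");
                thread.setDaemon(true);  // never keeps the JVM alive on its own
                return thread;
            });
        scheduler.scheduleAtFixedRate(
            () -> System.out.println(sample()), 0, period, unit);
        return scheduler;
    }
}
```

Calling ResourceSampler.start(1, TimeUnit.MINUTES) at application startup would match the one-minute interval mentioned above; in production the line would go to a dedicated log file rather than standard output.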

And then, given that you have all the logs available in a nice parsable format, and organized, you will be able to pinpoint exactly what caused the problem, and (hopefully) find a solution in no time.

But that's far from saying that you won't be called at 5:30am on a beautiful Sunday morning because the Web site is down (again).

Pier Fumagalli is an Apache Software Foundation member and active in the Jakarta and HTTPD/APR projects. He works for VNU Business Publications in London.

Copyright © 2009 O'Reilly Media, Inc.