The PHP Scalability Mythby Jack Herrington
PHP scales. There, I said it. The word on the street is that "Java scales and PHP doesn't." The word on the street is wrong, and PHP needs someone to stand up and tell the truth: that it does scale.
Those with a closed mind can head straight to the inevitable flame war located at the end of this article. Those with an open mind who are interested in taking their web development skills and putting them to use building applications in the cross-platform, easy-to-write, easy-to-maintain, scalable, and robust PHP platform, but were hesitant because of the scalability myth, should read on. It starts by looking at the term scalability.
What is Scalability?
There are a number of different aspects of scalability. It always starts with performance, which is what we will cover in this article. But it also covers issues such as code maintainability, fault tolerance, and the availability of programming staff.
These are all reasonable issues, and should be covered whenever you are choosing the development platform for any large project. But in order to convey a convincing argument in this small space, I need to reduce the term scalability to its core concern: performance.
Language and Database Performance
Both Java and PHP run in virtual machines, which means that neither perform as well as compiled C or C++. In the great language shootout, Java beat PHP on most of the performance benchmarks, even substantially on some. However, overall the two languages were not an order of magnitude different. In addition, an older version of PHP was used in the test, and substantial performance improvements have been made since and are continuing to be made.
Another area of performance concern is in the connection to the database. This is a misnomer, however, as the majority of the time spent in a database query is on the database server end, processing the query, and the transmit time to marshal the data between the server and the client. PHP's connectivity to the database consists of either a thin layer on top of the C data access functions, or a database abstraction layer called PEAR::DB. There is nothing to suggest that there is any PHP-specific database access performance penalty.
Yet another area of efficiency concern is in the connection between the language and the web server. In the CGI model, the program or interpreter is booted on each request. In the in-process model, the interpreter stays around after each request. One of the original Java-versus-scripting-languages (e.g. PHP) benchmarks pit in-process Java against CGI invocation on the server. In the CGI model, each page incurred the overhead of the startup and shutdown of the interpreter. Even at the time, the comparison was unfair, as production machines used server-scripting extensions (such as PHP), which run in-process and stay loaded between each page fetch. There is no performance penalty for loading the interpreter and compiled pages remain in memory.
With these basic efficiency questions out of the way, it's time to look at the overall architecture of the web application.
There are three basic web architectures in common use today: two-tier, logical three-tier, and physical three-tier. Engineers give them different names and slightly different mechanics, so to be clear about what I mean, I will illustrate the three architectures.
A two-tier application web application has one tier for the display and application logic, connected to a second tier, which is the database server. Two-tier web applications perform best because they require the least processing for a single request.
A three-tier architecture separates the display and application logic into two separate tiers. To the left, a physical three-tier architecture puts the three tiers onto three separate machines or processes. To the right, a logical three-tier architecture has the application and business logic layer together in one process, but still divided by an API layer.
Three-tier web applications trade some performance hit for better code factoring. With the business logic separated from the user interface, it is possible to access the application logic from multiple directions. For example, we could expose the layer through web services.
J2EE Web Server Architecture
Perhaps the second most contentious part of this article is my definition of a J2EE web server application architecture. Externally to the Java community, the application structure looks clear: JSPs talk to EJBs, which talk to the database. Within the Java community, the standard J2EE topology is anything but clear. A comparison is only valid between two things, so to decide whether "Java scales and PHP doesn't," I need to be clear about what a Java web application server is.
I'll take the two most common interpretations of J2EE architecture. The first is Sun's EJB 1.0 architecture, and the second is the EJB 2.0 architecture. Shown below is Sun's EJB 1.0 architecture for web application servers:
This is classic physical three-tier architecture, and it pays the performance price. I've highlighted the portions of the architecture that involve network traffic, either via database connection, an overhead shared by PHP, or by Remote Method Invocation (RMI), an overhead not shared by PHP.
To be fair, the connection between the web server and the servlet engine can be avoided with modern application servers and web servers, such as Tomcat and Apache 2.0. At the time when the first versions of the JSP and EJB standards were released, the prevalent web server was (and still is) Apache 1.x, which had a process model that was not compatible with Java's threading model. This meant that a small stub was required on the web server side to communicate with the servlet engine. The remains a non-trivial performance overhead for those that decide to pay it, and was a significant performance overhead when the first scalability comparisons were made.
A much more significant source of overhead was in the RMI connection between the servlet engine and the EJB layer. A page showing ten fields from twenty objects would make two hundred RMI calls before it was completed. This overhead was removed with the EJB 2.0 standard, which introduced local interfaces. This topology is shown below:
This is logical three-tier architecture. The web server box has been removed because more recent web servers are not separated from the servlet code (e.g. Tomcat, Apache 2.0, etc.). As you will see when we compare this model to the PHP model, EJB 2.0 moved Java web application server development closer to the successful, and scalable, PHP model.
PHP Web Server Architecture
PHP has always been capable of running the gamut between a two-tier architecture and a logical three-tier architecture. Early versions of PHP could abstract the business and database access logic into a functional second tier. More recent versions can abstract the business logic with objects, and with PHP 5, these objects can use public, protected, and private access control.
A modern PHP architecture, strikingly similar to the EJB 2.0 model, is shown below:
This is logical three-tier architecture, and this is how modern PHP applications are written. As with Java web servers, the PHP code is in-process with the web server, so there is no overhead in the server talking to the PHP code.
The PHP page acts as a broker between second-tier business objects and Smarty templates, which format the page for presentation. As with the JSP "best practice," the Smarty templates are only capable of displaying the data presented, with rudimentary looping and conditional control structures.
But this is all about the design of the server. What about the architecture of the application itself?
Stateful and Stateless Architecture
The lack of an external, stateful object store, where the application can hold session state, is often voiced as a scalability concern. PHP can use the database as the back-end session store. There is little performance difference, because a network access is required in both cases. An argument can be made that the external object store allows for any arbitrary data to be stored conveniently; however, this is easily offset by the fact that the object store itself is a single point of failure. If the object store is replicated across multiple web servers, that becomes an issue of data replication and cache coherency, which is a very complex problem.
Another familiar Java pattern is the use of a local persistent object store on each web server. The user is limited to a single server by use of sticky sessions on the router. The same could be done in PHP: a local, persistent data store. But this is an anti-pattern anyway, because a sticky session-based server pool is prone to overloading of a single web server. Or should the server go down, the result is the denial of service to a group of customers.
The ideal multi-server model is a pod architecture, where the router round-robins each of the machines and there is only a minimal session store in the database. Transient user interface information is stored in hidden variables on the web page. This allows for the user to run multiple web sessions against the server simultaneously, and alleviates the "back button issue" in web user interfaces.
This section has covered some very complex issues in web application server design. Scalability is mainly about the architecture of the application layer, and there is no one true panacea architecture that will work for all application architectures. The key to success is not in any particular technology, but in simplifying your server model and understanding all of the components of the application layer, from the HTML and HTTP on the front end to the SQL in the back end. Both PHP and Java are flexible enough to create scalable applications for those who spend the time to optimize their application architecture.
The Convergence of Web Application Architecture
This article started by asserting that PHP scales. When the tag-line "Java scales and scripting languages don't" was born, it was based on EJB 1.0, an architecture that most Java architects would consider absurd, based on its high overhead. Based on EJB 1.0, Java's performance was much worse than that of scripting languages. It is only the addition of local interfaces in EJB 2.0 that makes the J2EE architecture perform well.
The argument for PHP scalability is further simplified, however, by the fact that both PHP and J2EE architecture (as well as others) are converging on the same design. And if "J2EE scales" given this simpler, logical three-tier architecture, then it follows that PHP does as well.
The performance principle for scalability is simple: if you want to scale, then you have to serve web pages quickly. To serve web pages quickly, you either have to do less, or do what you do faster. Faster is a non-starter, because Java is not so much faster than PHP that it makes much of a difference. Doing less is the key. By using a logical three-tier architecture and by reducing the number of queries and commands sent to the database, we can get web applications that scale, both in Java and in PHP.
For the open-minded developer, there is a world of applications that can be built quickly, cheaply, robustly, and scalably with PHP. Services such as Amazon, Yahoo, Google, and Slashdot have known about scripting languages for years and used them effectively in production. Yahoo even adopted PHP as its language of choice for development. Don't believe the hype in the white papers that says that PHP isn't for real applications or doesn't scale.
I'm sure that what I have said in this article will be picked to death and ridiculed by some. I stand by what I have said. The idea that PHP does not scale is clearly false at the performance level. In fact, we should have never even gotten to the point where this article was necessary, because as engineers, we should recognize that the argument that one language clearly "scales better" than another is, on its face, ridiculous. As engineers and architects, we need to look objectively at technologies and use a factual and rational basis to make technology decisions.
Still not convinced? Consider JSR 223, the effort to turn PHP into the front end for J2EE by porting it to Java. If PHP on top of Java is scalable, then why isn't PHP on top of C?
Author's Note: These views are my own and have nothing to do with those of Macromedia.
Jack Herrington is an engineer, author and presenter who lives and works in the Bay Area. His mission is to expose his fellow engineers to new technologies. That covers a broad spectrum, from demonstrating programs that write other programs in the book Code Generation in Action. Providing techniques for building customer centered web sites in PHP Hacks. All the way writing a how-to on audio blogging called Podcasting Hacks.
Return to ONJava.com.