Creation time: 2008-11-09 01:12:51 Last modified: 2008-11-09 01:12:51
This article was published in the 11th issue of Programmer magazine, 2008
PHP Meditation Six: Drupal performance issues
Zuo Qinghou
Drupal is a PHP-based open-source CMS and one of the most technically accomplished PHP applications I have seen. Its architecture is excellent: a micro-kernel plus plugins gives it great extensibility, which takes Drupal far beyond the scope of an ordinary CMS. In this sense it seems more appropriate to call Drupal a Web OS. There is a lot to say about Drupal, and I may write a dedicated article about it later. What I want to discuss here is something everyone in the Drupal community runs into but not everyone understands clearly: Drupal's performance problems.
Because of a customer's requirements, I once ran a fairly comprehensive test of Drupal. The environment consisted of two servers (DB server + web server), with a hardware configuration of a single CPU and 4 GB of memory. The database held several thousand node records. Using JMeter, I tested read and write operations on different pages, for logged-in and anonymous users, under various scenarios (with the various cache modules switched on and off).
The results may not match what many people would expect. The two main findings are:
1. The performance gap between logged-in and anonymous users is very large. On the same page, a logged-in user generally gets no more than 20 RPS (requests per second), while an anonymous user with the cache enabled gets more than 100 RPS, and with a file-based cache the figure can even exceed 300.
2. Database pressure is relatively small. Because Drupal stores a large amount of configurable content in the database, it is easy to get the impression that Drupal must be very demanding on the database. In fact, the load on the DB server was quite low (CPU below 10%) both with and without caching, while CPU usage on the web server was above 80%. Timing every DB query confirmed this: the total query execution time was only a small fraction of the page generation time.
After repeated testing and reflection, I came to some conclusions. Clearly, when many logged-in users hit Drupal concurrently, the bottleneck is the CPU time spent executing Drupal's code, not the database or anything else. The cause lies in both PHP's own execution model and the way Drupal is implemented. When Drupal builds a non-cached page, however simple the page is, it runs a complete bootstrap process. Even with only the minimum set of modules enabled, that process loads dozens of PHP files and executes thousands of lines of PHP code. PHP's execution model also means that no PHP code or objects stay resident in memory, so every request must go through full initialization. Anonymous users are fast because Drupal does not run the full bootstrap when it serves a cached page: it first checks whether the page is cached, reads it from the cache, and finishes. Of course that is fast.
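The two request paths can be put into a toy model. The sketch below is in Python and is not Drupal code. The three millisecond costs are hypothetical values picked only so that the output lands near the magnitudes reported above, and the function names are made up for illustration.

```python
# Toy model of the two request paths. All costs are hypothetical CPU
# milliseconds chosen for illustration; they are not measurements.
FULL_BOOTSTRAP_MS = 45.0  # assumed: load and run dozens of PHP files
PAGE_BUILD_MS = 5.0       # assumed: build the page after bootstrap
CACHE_LOOKUP_MS = 3.0     # assumed: find the cached page and send it


def request_cost_ms(logged_in: bool, page_cached: bool) -> float:
    """CPU time one request consumes on the web server in this model."""
    if not logged_in and page_cached:
        # Anonymous user, cached page: skip the full bootstrap.
        return CACHE_LOOKUP_MS
    # Everyone else pays for the complete bootstrap on every request,
    # because nothing stays resident between requests.
    return FULL_BOOTSTRAP_MS + PAGE_BUILD_MS


def max_rps(cost_ms: float, cpus: int = 1) -> float:
    """Upper bound on requests per second when CPU is the only limit."""
    return cpus * 1000.0 / cost_ms


if __name__ == "__main__":
    print("logged-in:", max_rps(request_cost_ms(True, False)))
    print("anonymous, cached:", max_rps(request_cost_ms(False, True)))
```

In this model the database does not appear at all, which is the point: if CPU time per request is the only limit, throughput depends on how much code runs per request and on how many CPUs run it.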
Given this conclusion, several things can be explained:
1. Why Drupal's performance differs so little across environments. Whether it runs on two servers, a single server, or even a virtual machine with very little memory, the RPS for a logged-in user is always between 10 and 20. Whether the database holds a few hundred or hundreds of thousands of records also makes little difference, because the bottleneck is not the database or memory but the execution of the code.
2. Why an opcode cache such as APC or XCache gives such a large performance boost. In my own virtual machine environment, RPS was raised from around 12. The reason is that it cuts the time spent executing the PHP code.
Based on this conclusion, the measures for improving Drupal's performance for logged-in users can be sorted into those without and those with an obvious effect.
No obvious effect:
1. Adding memory. With only 10 or so concurrent requests, even if each request takes 20 MB of memory, the total is only about 200 MB.
2. Separating the DB server from the web server, or upgrading the DB server. A mid-range MySQL server can easily cope with a concurrency of 200 to 300, and the DB server is practically idle when concurrency is only 10 or so.
3. Changing the underlying software, such as moving from Windows to Linux, from Apache to lighttpd, or from MySQL to another database. Moving from Windows to Linux does give a noticeable boost (because PHP runs more efficiently on Linux than on Windows). The other changes may make things somewhat faster, but they bring no significant improvement because the bottleneck is not there.
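The memory argument in item 1 is simple arithmetic. A minimal Python sketch follows, using the figures from the text (10 concurrent requests, 20 MB each, a 4 GB server); the function name is hypothetical.

```python
def total_request_memory_mb(concurrent_requests: int, mb_per_request: int) -> int:
    """Memory held by all in-flight requests at the same moment."""
    return concurrent_requests * mb_per_request


SERVER_MEMORY_MB = 4 * 1024  # the 4 GB test server described earlier

used_mb = total_request_memory_mb(10, 20)
share_of_server = used_mb / SERVER_MEMORY_MB

if __name__ == "__main__":
    print(used_mb, "MB in use")
    print(f"{share_of_server:.1%} of server memory")
```

At this level of concurrency the requests occupy only a small share of the installed memory, so adding more cannot help.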
Obvious effect:
1. Using an opcode cache such as APC or XCache. This speeds things up several times over. Presumably everyone already does this.
2. Adding CPUs to the web server. A dual-core machine is certainly faster than a single-core one, and 4 CPUs are certainly much faster than 2.
3. Using a configuration of multiple web servers plus a single DB server, to spread the code execution load across the web servers. As mentioned above, a single DB server can easily cope with a concurrency of 200 or more, which means it can in theory support more than 10 web servers.
4. Using an engine like Quercus, which compiles the PHP code to Java and runs it in a Java VM. In theory this should help a great deal, for two reasons: first, Java is more efficient than PHP; second, Java code can be cached and does not need to be reloaded on every request. Here is one test result: http://www.workhabit.org/resin-backed-php-drives-4x-performance-improvements-drupal. Drupal gets a 4x performance boost under Quercus, but that is about the same as what Drupal achieves with APC/eAccelerator enabled, so it probably has little practical value.
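The estimate in item 3 can be checked with a short calculation. This Python sketch assumes that each web server generates roughly as many concurrent database clients as it handles concurrent requests, taken here as 10 to 20 to match the logged-in figures above. That assumption and the function name are mine, not from the tests.

```python
def web_servers_supported(db_max_concurrency: int, per_web_server_concurrency: int) -> int:
    """How many web servers one DB server can feed before it saturates."""
    return db_max_concurrency // per_web_server_concurrency


if __name__ == "__main__":
    # DB server copes with a concurrency of 200; each web server adds 10 to 20.
    print(web_servers_supported(200, 20))
    print(web_servers_supported(200, 10))
```

With a DB limit of 200 and 20 concurrent requests per web server, the result is 10 web servers; lighter per-server load or a higher DB limit gives more, which is consistent with "more than 10".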
Another line of attack is to optimize the code itself.
Using the cache API is basically pointless, because for logged-in users Drupal never calls it. It has been suggested on Drupal.org that even for logged-in users many pages are not customized, which means they could be cached. But Drupal provides no such mechanism. For a logged-in user, Drupal runs the complete bootstrap process even if the page only prints "Hello world", so there is in practice no way to cache a whole page in the logged-in state.
As of the current version (Drupal 6.4), the only cache feature Drupal offers for logged-in users is the block cache, which lets selected blocks be cached. The block cache can improve efficiency considerably when a block takes up a lot of server time. However, it has no effect on the bootstrap process, so it is powerless when the bottleneck is the bootstrap itself.
In the drupal.org community, caching for logged-in users has been hotly debated. The basic conclusion is that, because of Drupal's architecture, there is no good solution at the moment, and one can only hope for an improvement in a future release.
I studied Drupal's bootstrap process and found one possible approach: implement hook_boot, which is the earliest hook executed during bootstrap. When it is called, most of the bootstrap process has not yet run. Inside hook_boot, check whether the current page should be served from the cache; if so, read the cache directly, output the page, and then call exit() to force the request to end. This is feasible in theory, but it is too much of a hack.
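The control flow of this idea can be modelled in a few lines. The sketch below is Python, not Drupal code: the phase names, the PAGE_CACHE dictionary, the is_cacheable() rule and the example paths are all hypothetical, and an exception stands in for PHP's exit().

```python
class RequestFinished(Exception):
    """Stands in for PHP's exit(): stops all further processing."""

    def __init__(self, body: str):
        super().__init__(body)
        self.body = body


PAGE_CACHE: dict[str, str] = {}   # path -> rendered page
executed_phases: list[str] = []   # which phases ran for the last request


def is_cacheable(path: str) -> bool:
    # Assumed rule: only pages that are not customized per user.
    return path.startswith("/info/")


def boot_hook(path: str) -> None:
    """Runs early, before the expensive part of the bootstrap."""
    if is_cacheable(path) and path in PAGE_CACHE:
        raise RequestFinished(PAGE_CACHE[path])


def run_phase(name: str) -> None:
    executed_phases.append(name)


def handle_request(path: str) -> str:
    executed_phases.clear()
    try:
        run_phase("early setup")
        boot_hook(path)               # may end the request right here
        run_phase("load all modules")
        run_phase("build page")
        body = f"<html>page for {path}</html>"
        if is_cacheable(path):
            PAGE_CACHE[path] = body
        return body
    except RequestFinished as done:
        return done.body


if __name__ == "__main__":
    handle_request("/info/about")
    print(executed_phases)   # first hit runs every phase
    handle_request("/info/about")
    print(executed_phases)   # second hit stops after the early phase
```

The second request for a cacheable page never reaches the expensive phases, which is exactly the saving being sought. It also shows why the approach is a hack: everything that normally runs after the hook, including any cleanup other modules expect, is skipped.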
So much for Drupal. What about the performance of other PHP frameworks, especially the semi-official Zend Framework? Searching the Internet, I found a set of PHP framework comparison benchmarks at http://www.avnetlabs.com/php/php-framework-comparison-benchmarks. According to that report, the Zend Framework reaches only 10% of the performance of plain PHP, and not even 3% without APC. Of course, the data in the report is not necessarily exhaustive, and the Zend Framework's performance will vary from one setting to another. Still, it seems safe to conclude that the Zend Framework lags far behind baseline PHP.
Why do the mainstream PHP frameworks have this performance problem? It is not hard to understand. Recall the discussion of PHP's working model in the first part of the PHP Meditations series: PHP has no memory-resident process, so on every request all objects must be initialized again, and a significant amount of time goes into executing that code. This does not matter when a PHP program is just a simple script, but in a structurally complex framework the problem becomes very prominent, because thousands of lines of code are executed again for every request. Unless a later version of PHP improves this mechanism, the problem cannot be solved completely.
So does this mean that PHP is only good for small sites and cannot serve large, high-traffic ones? Of course not. In fact, PHP is used extensively at Yahoo and many other well-known giant websites. The reason is that PHP is used there only as a content generator: the generated content is converted to static text, and the vast majority of users browse cached static text, which has nothing to do with the performance of the PHP program. However, when users do not merely browse but interact with the site frequently, PHP's performance not only cannot match C or Java, it cannot even match fellow scripting languages such as Python and Ruby. In other words, PHP is better suited to content publishing sites such as news portals than to Web 2.0 applications.
As this series of articles draws to a close, we have seen the limitations of PHP. People who love PHP may feel frustrated by this. However, it does not detract from PHP's reputation as a good language. As the saying goes, a foot can prove too short and an inch can prove long enough. For the tools we know and love, we should understand their limitations, which also helps us use them more effectively.