This article shares my experience with website performance optimization. Performance optimization is a complex topic involving many factors; here I simply walk through the whole process as I understand it and the considerations at each stage. The coverage is broad but mostly introductory, so if you have opinions to share with me, that would be very welcome. However you choose to read it, I hope you gain something.
In my view, website performance problems show up in a site's "availability" (often translated as "usability", though that translation falls short). I used to think that if a site was completely or partially unavailable, that was a functional problem, while slow responses and poor behavior under load were performance problems. Later I realized that performance issues cover a much broader range. I cannot give a precise definition, but many website problems not caused by business-logic errors can be considered performance issues, such as poor scalability or a single point of failure.
In the initial stage of website performance optimization, the so-called "first realm", the work is local: locate, analyze, and fix. Only after-the-fact "optimization" is considered, which is how most people initially understand performance work. A problem occurs, it turns out to have little to do with business logic, and you start trying to find its cause and resolve it.
Whether the site is unresponsive, slow to respond, or shows an unusually volatile response curve, you can start by asking yourself a few questions about CPU usage:
Is the system busy, judging from CPU usage? If it is busy, what is it doing, and why? (Typical problem: an infinite loop caused by using HashMap unsafely under concurrency.) If the system is idle, where is the bottleneck? (Typical problem: IO that never responds.) If the response fluctuates, is there a period, and what is it? (Typical problem: connections fill up quickly, and each cycle a batch of them times out and disconnects together.) If the response fluctuates, what is the system doing when performance hits a trough? Is there background CPU usage, i.e. CPU consumption observed even without load? (Typical problem: a scheduled task consuming too many system resources.)
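The HashMap problem above is worth illustrating. On older JVMs, concurrent resizing of a plain HashMap could corrupt its internal linked lists and leave a thread spinning at 100% CPU. A minimal sketch of the safe alternative, using ConcurrentHashMap (class and method names here are my own):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SafeConcurrentMap {
    // Writing to a plain HashMap from many threads is undefined behavior;
    // ConcurrentHashMap makes the same access pattern safe.
    static Map<Integer, Integer> build(int threads, int perThread) {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, base + i);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return map;
    }
}
```

With a plain HashMap the same code could lose entries or, on old JVMs, hang; with ConcurrentHashMap every put survives.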
The situations vary, but in short, CPU usage is the core signal. High CPU utilization means system resources are being fully used, which may mean the system is doing real work; if not, you need to look for another bottleneck. Combine snapshots of processes and threads to narrow the scope of the problem initially. Low CPU usage is mostly easier to handle: you only need to look for the other system bottleneck. High CPU usage is often harder to pin down, although there are methods. I will not spend much ink here on specific localization techniques; I may introduce them in detail later.
For a site that is just starting to do performance optimization, here are some things you can do that have an immediate effect:
For mature technologies, the technical community and official documentation often provide white papers or tuning guides for the technology; consult them. Examples are Struts2's official Performance Tuning guide and the Java 6 performance optimization white paper. Platform and virtual machine tuning. This is necessary for projects that run on a platform or virtual machine; JVM parameters alone can have a significant impact on the system. Examples include Linux connection-management parameters and the JVM's heap-sizing parameters. Front-end review. This means evaluating existing pages with tools such as PageSpeed and YSlow, and against common industry rules and experience, such as Yahoo's front-end performance optimization rules.
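When tuning JVM parameters, it helps to confirm from inside the process which flags actually took effect. A minimal sketch using the standard management beans (class and method names are my own):

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmSettings {
    // The raw JVM arguments (e.g. -Xmx, -XX:... flags) this process started with.
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Max heap in bytes, as bounded by -Xmx (or the platform default).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("JVM args: " + inputArguments());
        System.out.println("Max heap: " + maxHeapBytes() / (1024 * 1024) + " MB");
    }
}
```

Logging this at startup catches the common mistake of tuning flags that never reach the production launch script.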
If you need systematic guidance, you may wish to refer to this map (click here to download the full-size image and Mmap file: Site_performance_practice_road_map):
Judging by the tools used, locating performance problems is largely a matter of locating problems at the operating-system and virtual-machine level. By approach, the techniques can be divided into:
Interception: take a snapshot of the system at some level and analyze it, using stack-dump and analysis tools such as jstack, jmap, kill -3, MAT, and Heap Analyzer. Monitoring: watch changes in the system, or even the data flow, with tools such as JProfiler, JConsole, jstat, and BTrace. Autopsy: the system has already gone down, but it left behind some "incriminating evidence" to analyze afterwards. The best known is the hs_err_pid.log that the JVM may leave when it crashes, or a generated crash dump file.
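The "interception" category can also be done from inside the JVM. A minimal sketch of a programmatic thread-stack snapshot, roughly what jstack gives you externally (names are my own):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadSnapshot {
    // One summary line per live thread: name, state, and top stack frame.
    static List<String> capture() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
            lines.add(info.getThreadName() + " [" + info.getThreadState() + "] at " + top);
        }
        return lines;
    }
}
```

Exposing such a snapshot through an admin endpoint avoids needing shell access to run jstack in an emergency.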
Here is a look at some common classes of problems and how to locate them:
First class: the request gets no response, and the browser waits forever.
Localization method: use kill -3 or jstack to analyze the thread stack first, and find the currently blocked threads.
Commonly caused by: an external interface call that never returns, or network IO that blocks without responding; deadlock; infinite loops.
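Deadlocks in particular can be detected programmatically, which is handy in a health check. A minimal sketch, using the same detection that JConsole performs:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    // Number of threads currently deadlocked on monitors or ownable
    // synchronizers; 0 means no deadlock was found.
    static int deadlockedThreadCount() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }
}
```

A nonzero count from a periodic health check is a strong signal to capture a full thread dump immediately.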
Second class: downtime; the process dies.
Localization methods (this class of problem is generally harder to locate):
(1) Look for a JVM log such as hs_err_pidXXX.log
(2) Use JVM parameters to write a dump file when the JVM crashes
(3) Look at the last entries in catalina.out
Commonly caused by: JDK bugs (I have run into JIT problems of this kind several times); problems in native DLL calls.
Third class: request response time is long.
Localization method: use kill -3 or jstack to analyze the thread stack first, see which operations most threads are stuck on, and then refine the analysis.
Commonly caused by: insufficient memory, visible as continuous full GC; network congestion; a bottleneck in the load-generating client, such as LoadRunner; a database bottleneck, which a DB snapshot can analyze further.
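Continuous full GC is a symptom you can also watch from inside the application, not only from gc logs. A minimal sketch of sampling collector totals (names are my own; a fast-growing delta between samples signals GC pressure):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Total collection count and total collection time (ms) across all
    // collectors; compare two samples taken a minute apart to see the rate.
    static long[] totals() {
        long count = 0, timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc.getCollectionCount() > 0) {
                count += gc.getCollectionCount();
                timeMs += gc.getCollectionTime();
            }
        }
        return new long[] { count, timeMs };
    }
}
```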
Fourth class: TPS is low, TPS gradually decreases, or TPS oscillates too widely.
Localization methods (this class is the most common, and its localization methods are also the most complex):
First see whether CPU utilization can rise as the load increases. If it cannot, look for other bottlenecks: network, memory, disk, and so on. If CPU usage does rise, observe whether there is background CPU consumption when there is no load (for example, a background timer-task thread consuming a lot of CPU). If not, you can try tools like JProfiler, combined with thread analysis and business analysis, to find the hotspots.
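Background CPU consumers can often be found by ranking threads by consumed CPU time. A minimal sketch using ThreadMXBean (thread CPU time must be supported and enabled on the JVM; names are my own):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CpuHotThreads {
    // Pairs each live thread id with its total consumed CPU time (ns),
    // sorted descending; the top entries are the candidates to inspect.
    static List<long[]> topByCpu() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        List<long[]> result = new ArrayList<>();
        for (long id : bean.getAllThreadIds()) {
            long cpu = bean.getThreadCpuTime(id); // -1 if unsupported or disabled
            if (cpu >= 0) {
                result.add(new long[] { id, cpu });
            }
        }
        result.sort(Comparator.comparingLong((long[] a) -> a[1]).reversed());
        return result;
    }
}
```

Cross-referencing the top thread ids with a thread-stack snapshot usually identifies the offending timer task or loop.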
Commonly caused by: interference from other business threads; memory leaks; connection handles running out; a low cache hit rate, and so on.
Well, let's stop here for now and look at the second realm. Reaching this level means you can step outside the "optimize after the fact" process and formally, comprehensively consider performance factors during design and coding, for example:
Reduce time-triggered container management in favor of capacity- or quantity-triggered management. For example, suppose I put some data into a buffer: one design flushes it to storage every 10 minutes, another flushes when the data reaches 10 MB in size. Which scheme do you usually think is more reliable? Manage threads uniformly. In my experience, out of 10 direct uses of threads or thread pools, 5 usually turn out to be problematic. Avoid synchronous Ajax, which makes the browser appear frozen until the response returns. Analyze the use of synchronization and locks. Even in some well-known open-source libraries, we have more than once found unreasonable synchronization designs: lots of data guarded by a single global synchronized block (a "centrality" at the performance-design level). It becomes a bottleneck, and changing it is hard to even start, which is very troublesome.
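The capacity-triggered buffer above can be sketched like this (class and method names are my own; the flush target is a stand-in for real storage):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SizeTriggeredBuffer<T> {
    private final int capacity;
    private final Consumer<List<T>> flushTarget;
    private final List<T> items = new ArrayList<>();
    private int flushCount = 0;

    SizeTriggeredBuffer(int capacity, Consumer<List<T>> flushTarget) {
        this.capacity = capacity;
        this.flushTarget = flushTarget;
    }

    // Flushing by size bounds memory use regardless of traffic rate,
    // which a fixed 10-minute timer cannot guarantee.
    synchronized void add(T item) {
        items.add(item);
        if (items.size() >= capacity) {
            flushTarget.accept(new ArrayList<>(items));
            items.clear();
            flushCount++;
        }
    }

    synchronized int flushes() {
        return flushCount;
    }
}
```

Under a traffic spike, the timer design accumulates unbounded data for up to 10 minutes; this design's memory footprint stays bounded by the capacity.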
For immature teams, it is recommended to have experienced programmers check the design documents and code for performance issues, and to keep a list of common problems for reference and learning.
Another obvious feature of the second realm is validating performance goals and performance issues by the middle of the software process at the latest:
Performance prototype analysis. At the initial system design stage, to assess the performance of a system, build a prototype with similar performance characteristics and run performance tests and evaluations on it; design changes forced by performance problems at this point have little impact. As far as I know, very few projects actually do this. In most teams, an architect with experience and the authority to be heard relies on that experience to avoid major performance-driven changes (or else, bluntly, the team takes a fall and goes through a painful "refactoring"). Automated performance test validation. This step must accompany coding to make a real difference, so that performance issues are identified as early as possible. Design- and code-level reviews. In practice, functional problems get plenty of consideration and are exposed early; the real danger is the often-neglected non-functional problems, such as performance.
Finally, the third realm. Teams that reach this level treat performance as an essential factor from the early planning phase: not just estimates from experience, but data-driven theoretical design, such as performance modeling, which derives performance targets from market size, business volume, service level, and so on, and then produces a reasonable architecture design against those requirements.
Many things are involved here. Beyond the data, there is a lot of thinking to do. For a website, you might ask the following questions:
How much data will there be, and what kind of storage should I design? What are the consistency requirements? What are the real-time requirements, and how much data latency can users accept? To what extent does the site need to consider scalability? Which data-processing flows carry performance risk, what is the scale of the volume, and how will those problems be solved? What is the time consumption of the main business operations, and what kind of business flow do I need to design to meet it?
All performance issues, like all other non-functional issues, are to some degree trade-offs, so the better designer needs to think through the solutions to these problems; the factors that performance drags into planning are simply too many to enumerate.
There is a lot to discuss about anticipating performance issues at planning time. Here are some common approaches to consider:
1. Clustering: the most basic way to scale out. Load balancing spreads single-node pressure across multiple nodes, improving both the system's load capacity and its stability. 2. Reverse proxy: a large internet site can hardly avoid introducing a reverse proxy for static resources; using a servlet container to serve static images and text is extravagant. Apache, Nginx, and Squid are all excellent solutions. 3. Page staticization: for internet applications "cache is king", and this may be the scheme with the most visible benefit among them. Generating and serving static pages effectively reduces system load; Web 2.0 applications usually see a somewhat lower cache hit rate. 4. Database optimization: user traffic is hard to satisfy with hardware reinforcement alone. From basic table splitting and SQL tuning to vertical and horizontal sharding, these have become almost indispensable, along with switching to inexpensive storage solutions, using NoSQL databases, and so on. 5. CDN: a content delivery network serves user demand at the wide-area network level, avoiding the bottlenecks and links on the internet that can hurt the speed and stability of data transfer, and improving user experience. 6. Distributed storage: the explosion of massive information, especially Web 2.0 data, needs inexpensive storage. A distributed storage system guarantees high-throughput reads and writes and massive data capacity; real-time behavior is less important there. 7. Data caching: distinct from page caching, data caching usually includes persistence-layer caches and caches of external interface calls. It reduces all kinds of I/O calls and improves average user response time.
8. Functional clustering: the initial cluster has identical nodes, which is simple and controllable; but as the product grows more complex and user access pressure increases, a purely homogeneous cluster cannot solve every problem and generates a lot of redundant processing logic. Functional clustering groups nodes that perform different functions and coordinates them. 9. Page partitioning: essential for a large website. The goal is page staticization; separating dynamic and static regions means that serving a user request only requires a simple aggregation operation. 10. Page-fragment generation and aggregation, split apart: many frequently accessed, relatively static page fragments usually only need to be generated once, on a timer or on an event trigger, and the generation can even be scheduled for the system's light load hours at night. On each user request, the static fragments only need to be aggregated into a full page (with the dynamic parts added). 11. Isolation: isolating and backing up parts of a complex system mainly addresses stability. Keeping each unit "simple" and breaking the whole into pieces makes each unit easier to develop and productize independently. 12. Better aggregation: introduce a high-performance server-side page aggregation mechanism (in our experience, the performance of generic SSI and ESI is lacking), or even client-side aggregation: send the presentation template to the client, then fetch JSON (or another simple format) data streams via Ajax requests and aggregate the final page on the client. The advantage is that server-side pressure is shared with the client. 13. Componentization as services: turning components into services makes it easy to process them in parallel and improves overall response speed.
The service pattern can follow the SOA approach, with a high-performance ESB used for service orchestration and task dispatch.
To reach the third realm, you must also be able to predict performance problems. This requires a mature monitoring system that watches for changes in the system and reacts as early as possible.
For example, during a major national event, user volume spikes. The monitoring system should promptly recognize an obviously jumping user-traffic curve (say, consecutive readings above a threshold, with the curve's slope also above a threshold), raise an alarm, and automatically scale out to meet the potential risk. All of this is built on routinely collected business operation data, which then needs data mining to surface the key signals.
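The jump-detection rule described above can be sketched as follows; the window and thresholds are illustrative, and the names are my own:

```java
public class JumpDetector {
    // True if the last `window` samples all exceed `level` and each
    // consecutive step rises by more than `minSlope` -- the "jump"
    // pattern a monitoring system should alarm on.
    static boolean isJump(double[] samples, int window, double level, double minSlope) {
        if (samples.length < window + 1) {
            return false;
        }
        for (int i = samples.length - window; i < samples.length; i++) {
            if (samples[i] <= level || samples[i] - samples[i - 1] <= minSlope) {
                return false;
            }
        }
        return true;
    }
}
```

Requiring both conditions avoids alarming on a high but flat plateau, or on a brief spike that immediately falls back.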
Another example: for internet applications, "cache is king". Cache design can to a large extent determine the success or failure of an application (unless you are rich enough to take the unconventional route of leaning on massive CDN capacity, heh). Cache design must consider cache size, tiering, queues, hit-rate calculation, life cycle, update and eviction, data distribution, data consistency, and data persistence. These are easy to overlook for people who focus only on page effects and functionality, but if you want to be a good designer, you need to accumulate this kind of thinking.
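As one concrete slice of that list, here is a minimal sketch of a size-bounded LRU cache that tracks its own hit rate, built on java.util.LinkedHashMap (class and method names are my own):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> {
    private final Map<K, V> map;
    private long hits = 0, misses = 0;

    LruCache(int maxEntries) {
        // accessOrder=true turns LinkedHashMap into an LRU structure;
        // removeEldestEntry bounds the cache size.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(K key) {
        V value = map.get(key);
        if (value == null) misses++; else hits++;
        return value;
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    // Hit rate is a first-class metric: a persistently low rate means the
    // cache is mis-sized or the access pattern defeats it.
    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    synchronized int size() {
        return map.size();
    }
}
```

Production caches add expiry, tiering, and persistence on top of this core, but hit-rate tracking should be there from day one.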
Think big. Here is a real example. We found that OGNL in our page templates performed poorly (because of double reflection), so across the project we changed most OGNL expressions to EL expressions. It cost a lot of time and energy, and performance did improve, but by how much? About 30%. A steady improvement, not very disruptive to the system, but nothing to be complacent about, and we lost some of OGNL's flexibility. Afterwards we changed our line of thinking and looked at the whole: page partitioning plus a custom caching framework that introduced page-level caching. The wrapping was somewhat complex, but with this architectural evolution, page generation (or partial generation) hit cache files directly, and performance leapt by 600%~800%. This is "think big": seeing the project's structure from a high vantage point requires sufficient vision, experience, and accumulation, but it can bring dramatic results.
Capacity design usually requires headroom above the expected peak. For systems with extreme performance demands, such as flash sales and panic buying, a question often arises: with such large capacity designed in, the business volume is relatively small most of the time, so what do we do about the wasted resources? (Off topic: this is probably the original reason Amazon got involved in cloud storage and cloud computing.)
Let's look at an example of performance-driven architecture evolution:
Initially, only the application server and DB server are separated, using a simple Jetty container; the system's bottleneck is on the DB side. Simple is beautiful: the site has just launched, and being accessible at all is king:
The system evolves continuously as the business develops.
One day we find that user pressure keeps growing until it finally becomes unbearable; the system repeatedly reaches the edge of collapse, and the existing hardware and architecture can hardly support the existing business. So we make this change:
In this change, we did these important things: 1. Introduced a full-page cache. For internet applications cache is king, and full-page caching has an immediate effect. 2. Abstracted page presentation into "themes" and separated it from page data; this also introduced the concept of "aggregation", laying the groundwork for further evolution. 3. To relieve the database bottleneck, used RAC mode to cluster the persistence layer. 4. For nearly immutable static resources such as JS, CSS, and images, introduced a reverse proxy to serve them preferentially.
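A full-page cache like the one in step 1 can be sketched as a TTL-bounded map from URL to rendered page. In a real system this sits in a servlet filter, but the core idea is just this (names, the injected clock, and the renderer are my own):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class FullPageCache {
    private static class Entry {
        final String html;
        final long expiresAt;
        Entry(String html, long expiresAt) {
            this.html = html;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> pages = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<String, String> renderer; // fallback: render from scratch

    FullPageCache(long ttlMillis, Function<String, String> renderer) {
        this.ttlMillis = ttlMillis;
        this.renderer = renderer;
    }

    // Serve from cache while fresh; re-render and re-cache on miss or expiry.
    String get(String url, long now) {
        Entry e = pages.get(url);
        if (e != null && now < e.expiresAt) {
            return e.html;
        }
        String html = renderer.apply(url);
        pages.put(url, new Entry(html, now + ttlMillis));
        return html;
    }
}
```

Taking `now` as a parameter keeps the sketch testable; a servlet filter would pass System.currentTimeMillis() and write the HTML to the response.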
The website continues to develop quietly and evolve silently.
Finally, one day, user traffic surges: PV reaches the millions, Web 2.0 business grows, the cache hit rate keeps dropping, the CPU becomes the bottleneck, and access is abnormally slow. Time to take the knife out again:
This architecture refactoring did a few important things: 1. Static resources (especially downloadable files) moved to a CDN to relieve pressure. 2. Requests split into a main request, asynchronous data requests, and static-resource requests. The main request simply fetches the invariant parts of the page (template plus static data); dynamic data arrives as asynchronous JSON and is aggregated in the browser with JavaScript. This step moves some aggregation work to the client, easing server pressure. 3. Real page aggregation for display split from page generation, to keep user response fast. 4. A multi-level cache introduced (memcached for in-memory object collections, message caching at the interface layer, file caching for pages, and so on). With clear layering, the whole system is also easy to split into several sub-components that run independently and simply. 5. The database further split, with reads separated from writes. 6. Pages divided into blocks. This is a common feature of large Web 2.0 sites: a page always has some fixed parts, which should be cacheable to disk as page fragments, so that each page generation only needs to attend to the sections that change.
And on it goes, and on ...
Traffic increases dozens of times over, the cluster reaches a three-digit server count for the first time, the system turns unstable, speed drops again, problem localization becomes extremely difficult, and everything starts to feel confusing again.
This time, inevitably, another architectural adjustment. The primary goals are to improve system stability through isolation and decoupling, and to make product management more convenient: 1. Adopt an SOA layout overall, clustering by function and defining each functional cluster as a "service"; services are accessed internally through REST-style interfaces. The service-driving and orchestration engine (in the ESB role) periodically writes static data that can be generated in advance to shared storage. 2. Clarify the aggregation logic: aggregate static data on the server side as much as possible, reducing the client's data-request traffic. 3. Introduce NoSQL databases and inexpensive storage, appropriately giving up consistency as a compromise for massive data. 4. Develop a feature-pack deployment engine for core business functions (based on OSGi): business customization only requires developing a feature pack in the defined format, which can then add business capability without restarting the application.