Java Application Performance Tuning Practices

Source: Internet
Author: User

Java application performance optimization is a perennial topic. Based on personal experience, the author divides Java performance optimization into four layers: the application layer, the database layer, the framework layer, and the JVM layer. This article introduces Java performance diagnostic tools and approaches, and then presents performance optimization cases from the Sogou business platform for reference.

Java application performance optimization is a perennial topic. Typical performance problems include slow page responses, interface timeouts, high server load, low concurrency, and frequent database deadlocks. In today's fast and rough Internet development model, performance problems of all kinds surface as traffic grows and code becomes bloated. Java applications have many potential bottlenecks: system factors such as disk, memory, and network I/O, as well as the application code, JVM GC, the database, the cache, and so on. Based on personal experience, the author divides Java performance optimization into four layers: the application layer, the database layer, the framework layer, and the JVM layer, as shown in Figure 1.

Figure 1. Layered model of Java performance optimization

Optimization becomes harder from one layer to the next, and each layer involves different knowledge and solves different problems. The application layer requires understanding the code logic and using Java thread stacks to locate the problematic line of code. The database layer requires analyzing SQL, locating deadlocks, and so on. The framework layer requires reading the source code and understanding how the framework works. The JVM layer requires a thorough understanding of the GC types and their mechanisms, and of what the various JVM parameters do.

There are two basic approaches to analyzing Java performance problems: on-site analysis and post-mortem analysis. On-site analysis preserves the failing state and then uses diagnostic tools to locate the problem. It has a larger impact on production and is not appropriate in some scenarios, especially for user-critical online services. Post-mortem analysis collects as much on-site data as possible, restores the service immediately, and then analyzes and reproduces the problem from the collected data afterwards. Below we start with performance diagnostic tools and then share some cases and practices from the Sogou business platform.

Performance diagnostic tools

Performance diagnostics come in two kinds. One diagnoses systems and code that are already known to have performance problems. The other tests a system before launch to determine whether its performance meets the requirements for going live. This article focuses on the former. The latter can be done with performance testing tools such as JMeter and is outside the scope of this article. For Java applications, performance diagnostic tools fall into two levels: the OS level and the Java application level (including application code diagnostics and GC diagnostics).

OS Diagnostics

OS diagnostics focus on three areas: CPU, memory, and I/O.

CPU Diagnostics

For the CPU, the main concerns are the load average, CPU utilization, and context switches.

The top command shows the system load average and CPU usage. Figure 2 shows the state of a system as reported by top.

Figure 2. top command example

The load average consists of three numbers: 63.66, 58.39, and 57.18, which are the machine's load over the past 1 minute, 5 minutes, and 15 minutes. As a rule of thumb, a value below 0.7 times the number of CPU cores means the system is working normally. If the value exceeds that, or even reaches four or five times the number of cores, the system load is clearly too high. In Figure 2, the 15-minute load is already 57.18 and the 1-minute load is 63.66 on a 16-core system, which indicates a load problem that is still getting worse, so the specific cause needs to be located.
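The rule of thumb above can be sketched in Java. This is only an illustration under stated assumptions: the class name LoadCheck and the method isOverloaded are hypothetical, and the 0.7 factor is the empirical threshold quoted above, not a hard limit. OperatingSystemMXBean.getSystemLoadAverage() returns the 1-minute load average, or a negative value on platforms where it is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

class LoadCheck {
    // Empirical factor from the text: load below 0.7 * cores is considered normal.
    static final double NORMAL_FACTOR = 0.7;

    /** Returns true when the load average exceeds 0.7 * CPU cores. */
    public static boolean isOverloaded(double loadAverage, int cpuCores) {
        return loadAverage > NORMAL_FACTOR * cpuCores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute average; negative if unavailable
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else {
            System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                    load, cores, isOverloaded(load, cores));
        }
        // The figures from the example: 1-minute load 63.66 on a 16-core machine.
        System.out.println("example overloaded: " + isOverloaded(63.66, 16));
    }
}
```

A check like this only flags that the load is high; finding the cause still requires the tools described in the rest of this section.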

The vmstat command shows the number of CPU context switches, as shown in Figure 3:

Figure 3. vmstat command example

Context switches occur mainly in the following scenarios: 1) the time slice runs out and the CPU schedules the next task as normal; 2) the task is preempted by a higher-priority task; 3) the running task blocks on I/O, is suspended, and the CPU switches to the next task; 4) user code voluntarily suspends the current task and yields the CPU; 5) the task fails to acquire a contended resource and is suspended; 6) a hardware interrupt occurs. Java thread context switches come mainly from contention for shared resources. Locking a single object is rarely a system bottleneck unless the lock granularity is very coarse. However, in a frequently executed code block that locks several objects in succession, the resulting context switches can become a bottleneck. For example, our system once ran log4j 1.x under heavy concurrent logging, which caused frequent context switching and large numbers of blocked threads, and system throughput dropped sharply. The relevant code is shown in Listing 1. Upgrading to log4j 2.x solved the problem.

Listing 1. log4j 1.x synchronized code snippet

    for (Category c = this; c != null; c = c.parent) {
        // Protected against simultaneous call to addAppender, removeAppender,...
        synchronized (c) {
            if (c.aai != null) {
                writes += c.aai.appendLoopOnAppenders(event);
            }
            // ...
        }
    }


Memory

From the operating system's perspective, the main memory concern is whether the application process has enough memory. The free -m command shows memory usage. The top command shows the virtual memory (VIRT) and physical memory (RES) used by a process, and from the formula VIRT = SWAP + RES you can estimate how much swap space the application uses. Heavy use of swap hurts Java application performance because disk is far slower than memory, so the swappiness value should be set as low as possible.

I/O

I/O includes disk I/O and network I/O, and in general the disk is the more likely bottleneck. The iostat command shows disk read and write activity, and the CPU's I/O wait time indicates whether disk I/O is healthy. If disk I/O stays high, the disk is too slow or faulty and has become a performance bottleneck, which calls for optimizing the application or replacing the disk.

Besides the commonly used top, ps, vmstat, and iostat commands, other Linux tools such as mpstat, tcpdump, netstat, pidstat, and sar can also diagnose system problems. Brendan Gregg has summarized the performance diagnostic tools for the different Linux subsystems, as shown in Figure 4, for your reference.

Figure 4. Linux performance observation tools

Java Application Diagnostic Tools

Application Code Diagnostics

Application code performance problems are a comparatively easy class of problems to solve. If application-level monitoring and alerting already point to the problematic function or code, you can go straight to the code. Otherwise, top plus jstack will reveal the problematic thread stack and lead you to the offending code. For more complex code with a lot of logic, printing performance logs with a stopwatch is usually enough to locate most application code performance problems.
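As an illustration of the stopwatch approach, the following is a minimal hand-rolled sketch based on System.nanoTime(). Libraries such as Spring and Apache Commons Lang ship their own StopWatch classes. The class StageTimer and the stage names below are hypothetical and are used only to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

class StageTimer {
    // Insertion-ordered so the log line lists stages in execution order.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();
    private String current;
    private long startedAt;

    /** Starts timing a named stage, closing the previous stage if it is still open. */
    public void start(String stage) {
        stop();
        current = stage;
        startedAt = System.nanoTime();
    }

    /** Stops the running stage, if any, and records its duration. */
    public void stop() {
        if (current != null) {
            elapsedNanos.put(current, System.nanoTime() - startedAt);
            current = null;
        }
    }

    /** Formats a single log line with one "stage=Nms" entry per recorded stage. */
    public String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : elapsedNanos.entrySet()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(e.getKey()).append('=')
              .append(TimeUnit.NANOSECONDS.toMillis(e.getValue())).append("ms");
        }
        return sb.toString();
    }

    public Map<String, Long> elapsedNanos() {
        return elapsedNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        StageTimer timer = new StageTimer();
        timer.start("loadData");   // hypothetical stage names
        Thread.sleep(20);
        timer.start("render");
        Thread.sleep(5);
        timer.stop();
        System.out.println(timer.summary());
    }
}
```

Logging one summary line per request keeps the overhead low while still showing which stage dominates the total time.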

Commonly used Java application diagnostics cover threads, the stack and heap, GC, and so on.


jstack

The jstack command is usually used together with top: locate the Java process and its threads with top -H -p <pid>, then export the thread stacks with jstack -l <pid>. Because thread stacks are transient, take several dumps, typically three, at intervals of about 5 seconds. Convert the thread ID that top reports to hexadecimal and match it against the nid field in the Java thread dump to find the stack of the problematic thread.

Figure 5. Viewing long-running Java threads with top -H -p

As shown in Figure 5, thread 24985 has been running for a long time and may be problematic. Converting 24985 to hexadecimal gives 0x6199. Looking up nid 0x6199 in the Java thread dump yields the stack shown in Figure 6, which pinpoints the problem.

Figure 6. Viewing the thread stack with jstack
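The decimal-to-hexadecimal step is easy to get wrong by hand. A minimal sketch of the conversion in Java follows. The class and method names are hypothetical, and the value 24985 is the thread ID from the example above.

```java
class ThreadIdHex {
    /** Converts a decimal thread ID reported by top into the nid form used in jstack output. */
    public static String toNid(long osThreadId) {
        return "0x" + Long.toHexString(osThreadId);
    }

    public static void main(String[] args) {
        // Thread 24985 from the top output corresponds to nid=0x6199 in the jstack dump.
        System.out.println(toNid(24985));
    }
}
```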


JProfiler

JProfiler is a powerful tool that can analyze CPU, heap, and memory usage, as shown in Figure 7. Combined with a load testing tool, it can sample the code and collect statistics on where time is spent.

Figure 7. Memory analysis with JProfiler

GC Diagnostics

Java GC relieves programmers of the risks of managing memory manually, but the application pauses caused by GC are another problem that has to be solved. The JDK provides a range of tools for locating GC problems. The more commonly used ones are jstat and jmap, along with third-party tools such as MAT.


jstat

The jstat command prints GC details, the number of young GCs and full GCs, heap information, and so on. Its command format is

jstat -gcxxx -t <pid> <interval> <count>

as shown in Figure 8.

Figure 8. jstat command example
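Similar cumulative GC counters can also be read from inside the JVM through the standard java.lang.management API, which is convenient for feeding a monitoring system. The sketch below is not a replacement for jstat and does not reproduce its output columns. The class name GcSnapshot is hypothetical, and the collector names printed depend on which GC the JVM is running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcSnapshot {
    /** One line per collector: name, cumulative collection count, cumulative time in ms. */
    public static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: count=%d timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times, similar in spirit to jstat's <interval> <count> arguments.
        for (int i = 0; i < 3; i++) {
            for (String line : describe()) {
                System.out.println(line);
            }
            Thread.sleep(1000);
        }
    }
}
```

Because the counters are cumulative, the difference between two samples gives the number of collections and the pause time within that interval.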


jmap

The jmap command prints heap information for a Java process: jmap -heap <pid>. You can also dump the heap to a file with jmap -dump:file=xxx <pid> and then analyze heap usage further with other tools.


MAT

MAT is a powerful analysis tool for the Java heap. It produces intuitive diagnostic reports, its built-in OQL supports SQL-like queries against the heap, and its outgoing reference and incoming reference views make it possible to trace object references.

Figure 9. MAT example

Figure 9 shows MAT in use. MAT displays object sizes in two columns, shallow size and retained size. The former is the memory occupied by the object itself, excluding the objects it references. The latter is the sum of the shallow sizes of the object and the objects it references directly or indirectly, that is, the amount of memory GC frees once the object is reclaimed. The retained size is generally the one to watch. For Java applications with large heaps (tens of GB), MAT needs a lot of memory to open the dump. A local development machine usually has too little memory to open it, so we recommend installing a graphical environment and MAT on the server and viewing the dump remotely. Alternatively, run MAT from the command line on the server to build the heap index and copy the index to the local machine, although the heap information visible this way is limited.

To diagnose GC problems, we recommend adding -XX:+PrintGCDateStamps to the JVM parameters. Commonly used GC parameters are shown in Figure 10.

Figure 10. Common GC Parameters

For Java applications, top, jstack, jmap, and MAT together can locate most application and memory problems, and they are essential tools. In some cases, Java application diagnostics also need OS-level information, for which more comprehensive tools such as Zabbix (which integrates OS and JVM monitoring) can be used. In distributed environments, infrastructure such as distributed tracing systems also provides strong support for application performance diagnostics.
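When shell access for top and jstack is restricted, part of the same information can be gathered from inside the JVM. The following sketch uses the standard ThreadMXBean API to rank live threads by CPU time and print their stacks. It is an illustration rather than the procedure described above: the class name HotThreads is hypothetical, the IDs it prints are Java thread IDs rather than the OS thread IDs (nid) that top shows, and CPU time is reported as unavailable when the JVM does not support or enable thread CPU timing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HotThreads {
    /** Returns up to 'limit' live threads, ordered by CPU time (highest first). */
    public static List<ThreadInfo> topByCpu(int limit, int maxStackDepth) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuTimed = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        List<ThreadInfo> infos = new ArrayList<>();
        // Snapshot the CPU times once so the sort sees stable values.
        Map<Long, Long> cpuNanos = new HashMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds(), maxStackDepth)) {
            if (info == null) continue;   // the thread exited between the two calls
            infos.add(info);
            long id = info.getThreadId();
            cpuNanos.put(id, cpuTimed ? mx.getThreadCpuTime(id) : -1L);
        }
        infos.sort((a, b) -> Long.compare(cpuNanos.get(b.getThreadId()),
                                          cpuNanos.get(a.getThreadId())));
        return new ArrayList<>(infos.subList(0, Math.min(limit, infos.size())));
    }

    public static void main(String[] args) {
        for (ThreadInfo info : topByCpu(3, 10)) {
            System.out.println(info.getThreadName() + " (java id " + info.getThreadId()
                    + ", state " + info.getThreadState() + ")");
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

As with jstack, a single snapshot can mislead, so taking several samples a few seconds apart gives a more reliable picture.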


Performance Optimization Practices

Having introduced some commonly used performance diagnostic tools, we now draw on our practice in Java application tuning and share cases from the JVM layer, the application code layer, and the database layer.

JVM Tuning: GC Pain

When a system on the Sogou business platform was refactored, RMI was chosen as the internal remote call protocol. After launch, the system periodically stopped responding, with pauses ranging from a few seconds to tens of seconds. The GC logs showed that a full GC occurred every hour after the service started. Because the heap was configured to be large, each full GC paused the application for a long time, which significantly affected the online real-time service. There had been no periodic full GC before the refactoring, so we suspected the RMI framework. Reading the available documentation revealed that RMI's DGC (Distributed Garbage Collection) starts a daemon thread that periodically triggers a full GC to reclaim remote objects. Listing 2 shows the daemon code.

Listing 2. DGC daemon thread code

    private static class Daemon extends Thread {
        public void run() {
            for (;;) {
                // ...
                long d = maxObjectInspectionAge();
                if (d >= l) {
                    // Explicit GC request issued by the daemon
                    System.gc();
                    d = 0;
                }
                // ...
            }
        }
    }

Once the problem is located, fixing it is fairly easy. One option is to disable explicit System.gc() calls altogether by adding the -XX:+DisableExplicitGC parameter, but for systems that use NIO this carries a risk of off-heap memory overflow. The other option is to lengthen the full GC interval by increasing the -Dsun.rmi.dgc.server.gcInterval and -Dsun.rmi.dgc.client.gcInterval parameters, and at the same time to add -XX:+ExplicitGCInvokesConcurrent, which turns a fully stop-the-world full GC into a concurrent GC cycle. This reduces application pause time and does not affect NIO applications. As Figure 11 shows, the number of full GCs dropped significantly from March onward, after the adjustment.
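Flags like these are easy to mistype or to lose in a startup script, so it can help to verify at startup which ones actually reached the JVM. The sketch below reads the JVM input arguments through RuntimeMXBean and the DGC intervals through system properties. The class name GcFlagCheck and the fallback value -1 are assumptions made for this illustration, and no particular interval value is implied.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class GcFlagCheck {
    /** True when the exact JVM option, for example "-XX:+DisableExplicitGC", was passed. */
    public static boolean hasOption(List<String> jvmArgs, String option) {
        return jvmArgs.contains(option);
    }

    /** Reads a DGC interval (milliseconds) from a system property, or 'fallback' if unset. */
    public static long dgcInterval(String property, long fallback) {
        return Long.getLong(property, fallback);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String[] options = {"-XX:+DisableExplicitGC", "-XX:+ExplicitGCInvokesConcurrent"};
        for (String option : options) {
            System.out.println(option + " set: " + hasOption(jvmArgs, option));
        }
        String[] properties = {"sun.rmi.dgc.server.gcInterval", "sun.rmi.dgc.client.gcInterval"};
        for (String property : properties) {
            // -1 here only means the property was not set explicitly on the command line.
            System.out.println(property + " = " + dgcInterval(property, -1L));
        }
    }
}
```

Running the class with the same options as the production service shows at a glance whether both the GC flags and the -D properties were picked up.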

Figure 11. Full GC monitoring statistics

GC tuning is still necessary for applications with high concurrency and large data volumes, especially because the default JVM parameters usually do not meet business needs and require dedicated tuning. There is plenty of public material on interpreting GC logs, so this article does not repeat it. GC tuning has three basic goals:

(1) Reduce GC frequency, by increasing the heap space and reducing unnecessary object creation;

(2) Reduce GC pause time, by reducing the heap space or using the CMS GC algorithm;

(3) Avoid full GC, by adjusting the CMS trigger ratio, avoiding promotion failure and concurrent mode failure (allocate more space to the old generation and increase the number of GC threads to speed up collection), reducing the creation of large objects, and so on.

Application Layer Tuning: Sniffing Out Code Smells

Starting from the application layer code and analyzing the root causes of inefficient code is undoubtedly one of the best ways to improve the performance of a Java application.

After a routine release of a commercial advertising system (load-balanced with Nginx), the load on several machines rose sharply and CPU usage quickly reached 100%. We performed an emergency rollback and preserved the state of one of the servers with jmap and jstack.

Figure 12. Analyzing the heap dump with MAT

The preserved heap state is shown in Figure 12. MAT's analysis of the dump showed that the objects using the most memory were byte[] and java.util.HashMap$Entry, and that the java.util.HashMap$Entry objects contained a circular reference. Our initial conclusion was that an infinite loop had probably occurred during a HashMap put operation (the next references of java.util.HashMap$Entry 0x2add6d992cb8 and 0x2add6d992ce8 form a cycle). The relevant documentation confirms that this is a typical concurrency misuse. In brief, HashMap is not designed for concurrent use by multiple threads: when several threads call put at the same time, resizing the internal array can turn the HashMap's internal linked list into a ring, which results in an infinite loop.

The biggest change in this release was caching website data in memory to improve system performance, using a lazy loading mechanism, as shown in Listing 3.

Listing 3. Lazy loading code for website data

    private static Map<Long, UnionDomain> domainMap = new HashMap<Long, UnionDomain>();

    private boolean isResetDomains() {
        if (CollectionUtils.isEmpty(domainMap)) {
            // Get site details from the remote HTTP interface
            List<UnionDomain> newDomains = unionDomainHttpClient.queryAllUnionDomain();
            if (CollectionUtils.isEmpty(domainMap)) {
                domainMap = new HashMap<Long, UnionDomain>();
                for (UnionDomain domain : newDomains) {
                    if (domain != null) {
                        domainMap.put(domain.getSubdomainId(), domain);
                    }
                }
            }
            return true;
        }
        return false;
    }

As you can see, domainMap is a static shared resource of type HashMap. Under multithreaded access, its internal linked list can form a ring and cause an infinite loop.

The connection and access logs of the front-end Nginx showed that a large number of user requests had accumulated at Nginx while the system restarted. When the Resin container started, these requests flooded into the application, and many of them requested and initialized the website data at the same time, which triggered the HashMap concurrency problem. With the cause identified, the fix was relatively simple. The main solutions are:

(1) Use ConcurrentHashMap or a synchronized block to solve the concurrency problem above;

(2) Load the website cache before the system starts serving requests, and remove lazy loading;

(3) Replace the local cache with a distributed cache.
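A minimal sketch of solution (1) follows, combining ConcurrentHashMap with a synchronized initialization block so that only one thread loads the data. This is not the code that was actually deployed. The class DomainCache, the DomainSource interface, and the simplified UnionDomain type are hypothetical stand-ins for the types in Listing 3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DomainCache {
    /** Hypothetical stand-in for the UnionDomain type in Listing 3. */
    public static class UnionDomain {
        private final long subdomainId;
        public UnionDomain(long subdomainId) { this.subdomainId = subdomainId; }
        public long getSubdomainId() { return subdomainId; }
    }

    /** Hypothetical stand-in for the remote HTTP client. */
    public interface DomainSource {
        List<UnionDomain> queryAllUnionDomain();
    }

    // volatile: readers see either null or a fully populated map, never a partial one.
    private volatile Map<Long, UnionDomain> domainMap;
    private final DomainSource source;

    public DomainCache(DomainSource source) {
        this.source = source;
    }

    /** Lazily loads the cache once; concurrent callers wait instead of racing on put. */
    public Map<Long, UnionDomain> domains() {
        Map<Long, UnionDomain> local = domainMap;
        if (local == null) {
            synchronized (this) {
                local = domainMap;
                if (local == null) {
                    Map<Long, UnionDomain> fresh = new ConcurrentHashMap<>();
                    for (UnionDomain domain : source.queryAllUnionDomain()) {
                        if (domain != null) {
                            fresh.put(domain.getSubdomainId(), domain);
                        }
                    }
                    local = fresh;
                    domainMap = fresh;   // publish only after the map is complete
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        DomainCache cache = new DomainCache(() -> {
            List<UnionDomain> list = new ArrayList<>();
            list.add(new UnionDomain(1L));
            list.add(new UnionDomain(2L));
            return list;
        });
        System.out.println("cached domains: " + cache.domains().size());
    }
}
```

Calling domains() once during startup, before traffic is admitted, would also cover solution (2), since every later request then finds the map already populated.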

To locate bad code, besides conventional code review, tools such as MAT can help pinpoint system performance bottlenecks quickly to some extent. However, some problems are tied to a particular scenario or to particular business data, and finding their source requires supplementary code walkthroughs, performance testing tools, data simulation, or even diverting live traffic. Below are some characteristics of bad code that we have summarized, for your reference:

(1) Poor code readability and no basic coding conventions;

(2) Too many objects or very large objects created, memory leaks, and so on;

(3) Too many I/O stream operations, or streams that are never closed;

(4) Too many database operations, or transactions that run too long;

(5) Synchronization used in the wrong scenarios;

(6) Time-consuming operations inside loops, and so on.

Database Layer Tuning: Deadlock Nightmares

Most Java applications interact with a database, especially applications with strict data consistency requirements, and database performance directly affects the performance of the whole application. The Sogou business platform is the platform on which advertisers publish and deliver ads, and it has strict requirements for the timeliness and consistency of ad material. We have accumulated some experience in optimizing relational databases along the way.

For the ad material database, a high frequency of operations (especially through bulk material tools) can easily cause database deadlocks. A typical scenario is adjusting the bids of ad material. Customers tend to adjust bids frequently, which indirectly puts load on the database and also increases the likelihood of deadlocks. The following case of bid adjustment in an ad system on the Sogou business platform illustrates this.

Traffic to a commercial advertising system increased suddenly one day, which raised the system load and caused frequent database deadlocks. The deadlocked statements are shown in Figure 13.

Figure 13. Deadlock statements

The groupdomain table uses the MySQL InnoDB engine and has three single-column indexes: idx_groupdomain_accountid (accountid), idx_groupdomain_groupid (groupid), and PRIMARY (groupdomainid).

This scenario arises when group bids are updated, in a setting that involves groups, group industries (the groupindus table), and group sites (the groupdomain table). When a group bid is updated, every group industry bid that uses the group bid (flag isusegroupprice, where 1 means the group bid is used) must be updated as well. Likewise, every group site bid that uses the group industry bid (flag isuseindusprice, where 1 means the group industry bid is used) must also be updated. Because a group can contain up to 3,000 sites, the related records stay locked for a long time while a group bid is updated. The deadlock information above shows that both transaction 1 and transaction 2 chose the single-column index idx_groupdomain_accountid. With InnoDB's locking behavior, only one index is selected within a single transaction, and after locking through a secondary index the engine then tries to lock the primary key index as well. Further analysis shows that transaction 1 is requesting the lock on the secondary index idx_groupdomain_accountid that transaction 2 holds (lock range "space id 5726 page no 8658 n bits 824 index"), while transaction 2, which already holds that secondary index lock, is waiting for a lock on the primary key index PRIMARY. Transaction 1 is eventually rolled back because transaction 2 runs for too long or does not release its locks for a long time.

The access logs for that day showed that a customer had used a script to issue a large number of modifications to promotion group bids, which left many transactions waiting in turn for the previous transaction to release its lock on the primary key index PRIMARY. The root cause is that the MySQL InnoDB engine is limited in how it uses indexes, a problem that is far less prominent in Oracle. The remedy is to have each transaction lock as few records as possible, which greatly reduces the probability of deadlock. In the end we used a composite index on (accountid, groupid), which reduces the number of records a single transaction locks and also isolates the records of promotion groups under different plans, thereby reducing the probability of this kind of deadlock.

Generally speaking, database layer tuning covers the following aspects:

(1) Optimization at the SQL statement level: slow SQL analysis, index analysis and tuning, transaction splitting, and so on;

(2) Optimization at the database configuration level: field design, cache sizing, disk I/O and other database parameter tuning, defragmentation, and so on;

(3) Optimization at the database structure level: vertical and horizontal splitting of the database;

(4) Choosing a database engine or type suited to the scenario, such as introducing NoSQL.


Summary and Suggestions

Performance tuning also follows the 80/20 rule: 80% of performance problems come from 20% of the code, so optimizing the critical code yields twice the result with half the effort. At the same time, optimization should be done on demand, because over-optimization may introduce more problems. Java performance optimization requires not only understanding the system architecture and application code but also paying attention to the JVM layer and even the underlying operating system. In summary, it can be approached from the following angles:

1) Basic performance tuning

Basic performance here refers to upgrades and optimization at the hardware or operating system level, such as network tuning, operating system version upgrades, and hardware optimization. For example, using F5 load balancers, introducing SSD drives, or adopting the NIO improvements in newer Linux versions can greatly improve application performance;

2) Database performance optimization

This includes the usual transaction splitting, index tuning, SQL optimization, and the introduction of NoSQL. For example, introducing asynchronous processing when splitting transactions achieves eventual consistency, and introducing the right kind of NoSQL database for a specific scenario can greatly alleviate the shortcomings of traditional databases under high concurrency;

3) Application architecture optimization

Introduce new computing or storage frameworks and use their features to remove the computing bottlenecks of the existing cluster, or introduce a distributed strategy to scale computation and storage horizontally. Pre-computation and preprocessing, the typical practice of trading space for time, can also reduce system load to some extent;

4) Business-level optimization

Technology is not the only means of improving system performance. In many performance problems, a large part of the cause turns out to be a special business scenario. If the scenario can be avoided or adjusted on the business side, that is often the most effective solution.
