A complete report on middleware performance optimization from a business trip to Guangdong

Source: Internet
Author: User

XX System Performance Analysis and Optimization Report

I. Background

Since January 2014, the XX system has frequently suffered from suspended background threads under heavy access: SQL queries submitted to the Oracle database do not return results in time. Concurrency on the XX system reaches as many as 200 connections per node, and more than 800 connections in total. The WAS thread pool is not released in time and is eventually exhausted, so WAS has to be restarted regularly. Currently, the O&M monitoring and management system checks the system every 30 minutes and sends an SMS alarm if the system cannot be accessed, after which the system is restarted manually; this happened two or three times in the previous week. At the same time, the middleware logs were analyzed and the Oracle database was tuned, but with little effect.

The company takes the feedback from users and the project team very seriously. It arranged for senior R&D personnel and system architects to analyze the system logs and middleware logs sent back by the on-site project team and to look for hidden risks in the system, and it sent senior technical experts to troubleshoot on site and resolve the performance problems.

On April 14, 2014, I went to the site and discussed the performance of the XXX system with Director Li of XXX, General Manager Zeng of the company branch, and Xu, the project team director. In summary, the system shows the following performance problems and possible patterns:

1. After the system has been running for a while, access becomes slow and access errors occur. For details, see Figure 1-1.

Figure 1-1

2. When traffic increases rapidly, the number of JDBC data source connections rises quickly, requests slow down, response times lengthen, and queries go unanswered for long periods. For details, see Figure 1-2.

Figure 1-2

3. The servers consume 97% of CPU resources and 50% of memory resources. For details, see Figure 1-3.

Figure 1-3

The preceding three problems can only be resolved by manually restarting the middleware.

II. Cause Analysis Process

After arriving at the site, I analyzed the middleware logs and server parameter settings of the XX system and compared the system access logs of recent years by year, month, and day. The data shows that the access volume of the XXX system grows linearly and stays consistently high. Based on these factors, the main cause of the downtime was preliminarily identified: under highly concurrent user access, the WebSphere thread pool becomes very busy, JVM garbage collection cannot keep up, and eventually the middleware threads of the XXX system hang and the system goes down.

(1) Check server and middleware settings and optimize parameters

1. Check the Linux kernel parameter settings of the operating system. The kernel parameters limited the maximum available memory to 2 GB; this limit was raised to 32 GB.

2. Check and analyze the middleware logs from the last week. The logs mainly show that some data source connections were closed due to connection timeouts during peak access.

3. Check the middleware system configuration and change the JVM heap size from 1024 MB to an initial heap of 2048 MB and a maximum heap of 4096 MB.

4. In the application server's session management, raise the maximum number of in-memory sessions from 1000 to 2000, and reduce the session timeout from 30 minutes to 10 minutes so that unused resources are released sooner.

5. Count the queries that reference each data source, and on that basis increase the minimum and maximum number of connections in each data source's connection pool.

6. Adjust the timeout settings of the data source so that unused connections are reclaimed more quickly.

(2) Monitor system operation and analyze logs to pinpoint the cause of the problem

Monitor the XXX system based on the analysis above and verify the findings.

From April 15 to April 16, 2014, the operation of the XXX system was tracked and analyzed over two days, and an additional node was brought into operation to share the load. The monitoring results show:

1. When business concurrency is high (Am-Am, Am-Am), the XXX system keeps consuming connections from the pool of the resource library data source (JDBC/XXX), and the connections are released very slowly.

2. Since the middleware JVM memory was increased and a WAS instance was added, the system has remained stable.

3. System resource consumption: CPU usage is very low, memory usage is about 16 GB, and neither fluctuates much.

4. Modify the load policy of the server load balancer: change the original policy (shortest response, based on source IP address) to round robin. The number of requests handled by each of the five nodes is now roughly the same.

5. Ask the R&D personnel to check whether the program code leaks memory or leaves connections unclosed.

6. The middleware logs show that some query schemes are misconfigured and report errors when queried. A check of node 117 showed that its JDBC resources had still not been released on the 16th and 17th.

(1) The segment code in the XXX scheme cannot be queried; the AND keyword is missing from the concatenated SQL;

(2) The XXXZHM field in Table VW_001 does not exist;

(3) The XXXHM field of VW_002 does not exist;

(4) The XXX_ID field of VW_003 does not exist;

(5) The XXZH field of VW_004 does not exist;

(6) The XXXSFHM field of VW_005 does not exist;

The above is only part of the logged errors. The entire log must be analyzed on site until no scheme reports an error.
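The first error in the list above, a missing AND in concatenated SQL, is the kind of bug that disappears when the WHERE clause is assembled from a list of conditions rather than by hand. The following is a minimal sketch of that idea, not the XXX system's actual code: `build_query`, the view name `VW_DEMO`, and the column names are hypothetical, and the `?` placeholder style is an assumption that depends on the database driver.

```python
def build_query(table, filters):
    """Build a SELECT whose conditions are always joined by AND.

    `table` and the keys of `filters` are assumed to come from trusted
    scheme configuration, not from user input. The values are returned
    separately so they can be passed to the driver as bind parameters.
    """
    conditions = [f"{column} = ?" for column in filters]
    sql = f"SELECT * FROM {table}"
    if conditions:
        # join() inserts the separator between every pair of conditions,
        # so an AND cannot be forgotten when a new filter is added.
        sql += " WHERE " + " AND ".join(conditions)
    return sql, list(filters.values())


# Hypothetical scheme with two filters.
sql, params = build_query("VW_DEMO", {"SEGMENT_CODE": "A01", "STATUS": "1"})
print(sql)
print(params)
```

A similar check at scheme-configuration time could also compare the configured field names against the view's real columns, which would catch the "field does not exist" errors before a user runs the query.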

Based on the above analysis, the cause has largely been identified, and the following optimization measures were applied, focusing on the remaining nodes:

(1) Modify the memory limit in the kernel parameters of the servers; all four servers were modified.

(2) Modify the JVM heap size of the middleware: the initial heap is 2048 MB and the maximum heap is 4096 MB.

(3) Adjust the load policy of the hardware load balancer and set 10 connections as the starting point for round robin.
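To illustrate why the change of load policy evens out the node counts, here is a small simulation. It is only a sketch under stated assumptions: the node names, the client IP addresses, and the 70/30 traffic split are invented, and the source-IP policy is simplified to "hash the client IP to a node", which is not necessarily how the real load balancer implemented it.

```python
import zlib
from collections import Counter
from itertools import cycle

NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names


def by_source_ip(client_ips):
    """Simplified source-IP policy: every request from one IP hits one node."""
    return Counter(
        NODES[zlib.crc32(ip.encode()) % len(NODES)] for ip in client_ips
    )


def round_robin(client_ips):
    """Round robin: requests are handed to the nodes in turn."""
    rotation = cycle(NODES)
    return Counter(next(rotation) for _ in client_ips)


# Assumed traffic: 100 requests arriving through only two gateway addresses.
requests = ["10.0.0.1"] * 70 + ["10.0.0.2"] * 30

pinned = by_source_ip(requests)
balanced = round_robin(requests)
print("source IP  :", dict(pinned))
print("round robin:", dict(balanced))
```

With few distinct source addresses, the source-IP policy can use at most as many nodes as there are addresses, while round robin spreads the same requests evenly. The trade-off is that round robin gives up client-to-node affinity, so it assumes sessions do not have to stay on one node.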

(3) Comprehensive analysis of performance problems of the other three units

1. At Unit A, the main observation is that connections to some data sources cannot be released and accumulate over a long period of operation.

2. During peak business hours at Unit A, the maximum number of busy JDBC connections on the application server is 20, CPU usage exceeds 96%, and memory usage is 22 GB. Disk I/O is not significant, occupying only KB/s. After peak hours, resource usage falls back to its off-peak level.

3. Unit B's problems are mainly caused by network issues and by data source connections that cannot be released.

4. At Unit C, one node was examined; the cause is that data source connections cannot be released.
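All three units share the symptom that data source connections are not released. One common way this happens in application code is a query that raises an error before the connection is handed back. The sketch below shows the usual defence, returning the connection in a `finally` path. It is an illustration only: `TinyPool` and `run_query` are hypothetical, and an in-memory SQLite database stands in for the Oracle data source.

```python
import sqlite3
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size pool that tracks how many connections are busy."""

    def __init__(self, size):
        self._idle = [sqlite3.connect(":memory:") for _ in range(size)]
        self.busy = 0

    @contextmanager
    def connection(self):
        if not self._idle:
            raise RuntimeError("connection pool exhausted")
        conn = self._idle.pop()
        self.busy += 1
        try:
            yield conn
        finally:
            # Runs whether the query succeeded or raised, so a failing
            # query cannot keep the connection checked out.
            self._idle.append(conn)
            self.busy -= 1


pool = TinyPool(size=2)


def run_query(sql):
    with pool.connection() as conn:
        return conn.execute(sql).fetchall()


rows = run_query("SELECT 1")
try:
    # Stand-in for a scheme that references a field that does not exist.
    run_query("SELECT missing_column FROM missing_table")
except sqlite3.OperationalError as exc:
    print("query failed:", exc)

print("busy connections after both queries:", pool.busy)
```

In Java code running on WebSphere the equivalent pattern is closing the `Connection`, `Statement`, and `ResultSet` in a `finally` block.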

III. Measures for optimizing the system performance of Unit A, Unit B, and Unit C

1. View the middleware logs on each server and correct the failing schemes based on the error information. [Very important]

2. Set the JVM memory of the middleware to 2048 MB for the initial heap and 4096 MB for the maximum heap (provided that the operating system imposes no limit).

Unit A's server has 32 GB of memory, so we recommend adding another WebSphere profile (note that at Unit A the server CPU is currently under heavy pressure during peak business hours).

3. In session management, disable the overflow option so that in-memory sessions cannot overflow, and set the session timeout to 15 minutes.

4. Increase the maximum number of in-memory sessions from 1000 to 2000.

5. For each application, change the number of sessions from 1000 to 2000 and disable overflow.

6. Modify the timeout settings of the data source. Set the connection timeout, reap time, unused timeout, and aged timeout to 60, 60, 60, and 50 for high-traffic applications; to 120, 120, 120, and 300 for medium-traffic applications; and to 180, 180, 180, and 240 for low-traffic applications.

7. Set the Default thread pool to a minimum size of 20 and a maximum size of 100, and the WebContainer thread pool to a minimum size of 30 and a maximum size of 150. (The system middleware defaults are a minimum of 20 and a maximum of 100 for the Default pool, and a minimum of 30 and a maximum of 200 for WebContainer.)

8. Enable pass by reference for the ORB service under Container Services, and enable servlet caching in the web container.

9. After the changes, monitor the application's JDBC connections during peak business hours and check the middleware logs for errors.

10. We recommend installing the middleware fix pack, since the latest fix packs include some code-level optimizations. The latest fix pack for WebSphere 6.1 is 6.1.0.47, and the latest for WebSphere 6.0 is 6.0.2.43.
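The three sets of data source timeouts in item 6 are easy to mix up when they are applied by hand on several servers, so it can help to keep them in one lookup table. The sketch below does only that. The field names and the tier labels "high", "medium", and "low" are my reading of the four values and three traffic levels in item 6, not names confirmed by the report, and the units are left as given there.

```python
from typing import NamedTuple


class PoolTimeouts(NamedTuple):
    """The four data source timeout values from item 6 (field names assumed)."""
    connection_timeout: int
    reap_time: int
    unused_timeout: int
    aged_timeout: int


# Values copied from item 6; the tier labels are assumptions.
TIMEOUTS_BY_TRAFFIC = {
    "high": PoolTimeouts(60, 60, 60, 50),
    "medium": PoolTimeouts(120, 120, 120, 300),
    "low": PoolTimeouts(180, 180, 180, 240),
}


def timeouts_for(traffic):
    """Return the timeout set for a traffic tier, rejecting unknown tiers."""
    try:
        return TIMEOUTS_BY_TRAFFIC[traffic]
    except KeyError:
        known = ", ".join(sorted(TIMEOUTS_BY_TRAFFIC))
        raise ValueError(f"unknown traffic tier {traffic!r}; expected one of: {known}")


for tier in ("high", "medium", "low"):
    print(tier, timeouts_for(tier))
```

Rejecting unknown tiers with an explicit error, rather than falling back to a default, keeps a typo in a deployment script from silently applying the wrong timeouts.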

 
