Two or three things to know about back-end performance testing

Source: Internet
Author: User

On a dark and stormy night, a ride-hailing platform launched a big wave of promotions, and users scrambled to place orders. As a result, the platform's intelligent suggestion service could not withstand the surge and went down outright. Afterwards, the departments responsible for developing and operating the intelligent suggestion service met and decided: the intelligent suggestion service must undergo comprehensive, in-depth performance testing. Immediately! Right now! Right away!

So a big pile of questions landed on our heads: for a back-end service such as intelligent suggestion, which metrics should a performance test watch? What do these metrics mean? What are the acceptance criteria for them? This article works through the answers.

Overview

Different roles focus on different performance metrics. Callers of a back-end service interface generally care only about external metrics such as throughput and response time. The owner of the back-end service cares not only about the external metrics but also about internal metrics such as CPU, memory, and load.

For the ride-hailing platform, the concern is whether the external metrics of the intelligent suggestion service can withstand the traffic surge brought by the big wave of promotions. For those who develop, operate, and test the intelligent suggestion service, the concerns cover not only the external metrics but also internal metrics such as CPU, memory, and IO, plus operational matters such as deployment and the server's hardware and software configuration.

 

External indicators

From the outside, a performance test mainly focuses on the following three metrics:

    • Throughput: the number of requests or tasks the system can process per second.

    • Response time: the time the service takes to process a request or task.

    • Error rate: the percentage of requests in a batch that result in an error.

The response-time requirement depends on the specific service. For an intelligent suggestion service, the returned data is valid only briefly (the user may type one more letter and trigger a new request), so the real-time requirement is high and the response-time limit is generally within 100 ms. For a navigation-type service, the returned result lives much longer (it is used throughout the navigation process), so the upper limit on response time is generally 2-5 s.

Response-time statistics should be reported from several angles: the mean, the 90th percentile (P90), the 99th percentile (P99), and the overall distribution, not just the mean, because a mean alone can hide a long tail of slow requests.
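As an illustration, here is a minimal sketch that derives the mean, P90, and P99 from a log of per-request response times in milliseconds, one value per line (the file name rt_ms.log is hypothetical):

$ sort -n rt_ms.log | awk '{ a[NR] = $1; s += $1 }
    END { printf "mean=%.1f p90=%s p99=%s (ms)\n", s/NR, a[int(NR*0.90)], a[int(NR*0.99)] }'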

Throughput is affected by response time, server hardware and software configuration, network conditions, and many other factors:

    • The higher the throughput, the longer the response time.

    • The better the server hardware configuration, the higher the throughput.

    • The worse the network conditions, the lower the throughput.

At low throughput, the mean and distribution of response time are stable and do not fluctuate much.

At high throughput, response time increases with throughput; the growth may be linear or close to exponential. When throughput approaches the system's peak, response time surges.

The acceptable error rate depends on the specific implementation of the service. Typically, the rate of errors caused by external factors such as network timeouts should not exceed 5%, while the rate of errors caused by the service itself should not exceed 1%.

 

Internal indicators

 

From the server's perspective, a performance test focuses on CPU, memory, server load, network, disk IO, and so on.

CPU

All instruction execution and data processing in a back-end service depends on the CPU, so the service's CPU utilization plays a decisive role in its performance.

Linux reports CPU statistics along the following dimensions:

    • us: percentage of CPU time spent in user mode

    • sy: percentage of CPU time spent in system (kernel) mode

    • ni: percentage of CPU time spent on user processes whose priority has been adjusted with nice

    • id: percentage of CPU time spent idle

    • wa: percentage of CPU time spent waiting for IO completion

    • hi: percentage of CPU time spent servicing hard interrupts

    • si: percentage of CPU time spent servicing soft interrupts

Taking the top output of a server running an open-platform forwarding service as a reference, the CPU metrics are explained below.
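For orientation, the CPU summary line of top in batch mode looks like the following (the exact format varies with the top version, and the numbers here are illustrative, not from the original test):

$ top -b -n 1 | grep '%Cpu'
%Cpu(s): 12.3 us,  3.1 sy,  0.0 ni, 82.9 id,  1.2 wa,  0.0 hi,  0.5 si,  0.0 st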

us & sy: Back-end services spend most of their CPU time slices in user mode (us) and system mode (sy), so these two ratios are usually the highest. The two also affect each other: when the us ratio is high, the sy ratio is low, and vice versa. A high sy ratio usually means the service is switching frequently between user mode and kernel mode, and overall system performance drops. In addition, on servers with multi-core CPUs, CPU 0 is responsible for scheduling among the cores; if usage on CPU 0 is too high, scheduling efficiency across the other cores suffers. CPU 0 therefore deserves special attention during testing.
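To watch per-core usage (including CPU 0) during a test, one option is mpstat, assuming the sysstat package is installed:

# per-core CPU usage, sampled every second, five samples
$ mpstat -P ALL 1 5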

ni: Every Linux process has a priority, and higher-priority processes get to run first; this is called PRI. In addition, each process has a priority correction value, called the nice value of the process. Normally, the ni value of the service and of the server as a whole is not very high. If ni is high during testing, look for the cause in the server's Linux system configuration and in the run-time parameters of the service under test.

id: While a service runs online, some id headroom must be reserved to cope with sudden traffic surges. During performance testing, if id is already very low and throughput still will not rise, check the service's thread/process configuration, the server's system configuration, and so on.

wa: IO operations on disks and networks drive up the CPU's wa metric. Typically, network IO consumes little wa, while frequent disk reads and writes can make wa spike. If the service under test is not IO-intensive, check its log volume, data-loading frequency, and the like.

hi & si: A hard interrupt is an interrupt sent to the CPU by a peripheral; that is, the asynchronous signal that peripheral hardware sends to the CPU or memory is a hard-interrupt signal. A soft interrupt is an interrupt signal sent to the operating system kernel by software itself, usually raised by a hard-interrupt handler or by the process scheduler interrupting the kernel; this is what we commonly call a system call. During a performance test, hi will show some CPU usage, but it should not be high. For IO-intensive services, si takes a larger share of CPU.

Memory

The main purpose of memory monitoring during a performance test is to check how much the memory used by the service under test fluctuates.

Linux has several commands for getting the memory usage of a specified process; the most common is top. Its per-process output includes the following fields:

    • VIRT: the total virtual memory used by the process. It includes all code, data, and shared libraries, plus pages that have been swapped out; in other words, all the memory space the process has requested.

    • RES: the physical memory the process is using that has not been swapped out (stack and heap); memory that has actually been allocated to the process after being requested.

    • SHR: the total shared memory used by the process. This value only reflects memory that could be shared with other processes; it does not mean other processes are currently using it.

    • SWAP: the amount of the process's virtual memory that has been swapped out; space that was requested but is not in use, including stack, heap, and shared memory.

    • DATA: the physical memory the process occupies apart from executable code, that is, the total space requested for the process's stack and heap.

As these definitions show, the test should mainly monitor RES and VIRT; for multi-process services that use shared memory, SHR must be monitored as well.
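A minimal sketch for sampling these fields once per second, assuming GNU awk and noting that the column positions may differ across top versions (${pid} is a placeholder for the process ID under test):

# print a timestamp plus the VIRT/RES/SHR columns of the process row
while sleep 1; do
  top -b -n 1 -p ${pid} | tail -1 |
    awk '{ print strftime("%T"), "VIRT=" $5, "RES=" $6, "SHR=" $7 }'
done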

Load (server load)

In Linux, system load refers to the average length of the run queue, that is, the average number of processes waiting for a CPU.

From this definition, the optimal operating state of a server is one where every CPU core has a run queue of length 1: all active processes are running, and none is waiting. A server in this state is running at its load threshold.

As a rule of thumb, the server's load should stay at 70%~80% of the threshold; this uses most of the server's capacity while leaving some performance headroom to cope with traffic growth.

Linux provides many commands for viewing the system load; the most common are top and uptime.

top and uptime report the load in the same form: the system's average load over the last 1 minute, 5 minutes, and 15 minutes.

The load threshold equals the number of logical CPU cores on the server, which can be read as follows.
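Either of these commands reports the logical core count (and hence the threshold):

$ grep -c 'model name' /proc/cpuinfo
$ nproc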

During performance testing, system load is one of the most important indexes for evaluating overall system health. As a rule of thumb: under a stress test the system load should approach but not exceed the threshold, under a concurrency test it should not exceed 80% of the threshold, and under a stability test it should stay around 50% of the threshold.

Network

In performance testing, network monitoring mainly covers network traffic and network connection status.

Network traffic monitoring

Network traffic can be monitored with NetHogs. Like top, it is an interactive command that refreshes in real time.
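A minimal invocation (the interface name eth0 is an assumption; substitute the server's actual interface):

$ sudo nethogs eth0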

In a back-end service performance test, a service that returns text results usually does not need much attention on the traffic side.

Network connection status monitoring

Connection-status monitoring mainly watches for changes and anomalies in the network connection state. For services that use TCP, monitor changes in the service's established connections (TCP connections in the ESTABLISHED state). For HTTP services, monitor the network buffers of the processes under test, the number of connections in the TIME_WAIT state, and so on. Linux ships with commands such as netstat and ss that support all of the above; netstat, for example, can monitor the connections of a specified process by PID.
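As a sketch, the connection states on the service's listening port can be summarized like this (8080 is an assumed port; in ss -ant output the state is column 1 and the local address is column 4):

$ ss -ant | awk '$4 ~ /:8080$/ { s[$1]++ } END { for (k in s) print k, s[k] }'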

Disk IO

During a performance test, if the service reads and writes the disk too frequently, large numbers of requests end up waiting on IO, the system load rises, response time lengthens, and throughput drops.

Under Linux, the iostat command monitors disk status.
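A basic invocation, assuming the sysstat package is installed (1-second interval, 3 reports):

$ iostat 1 3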

    • tps: the number of transfers per second issued to the device. A "transfer" is one IO request to the device; multiple logical requests may be merged into a single IO request, and the size of a transfer is unspecified.

    • kB_read/s: the amount of data read from the device per second, in kilobytes.

    • kB_wrtn/s: the amount of data written to the device per second, in kilobytes.

    • kB_read: the total amount of data read, in kilobytes.

    • kB_wrtn: the total amount of data written, in kilobytes.

The default iostat output gives only the most basic statistics of system operation; for performance testing it does not tell us much more. Adding the -x parameter yields extended statistics.
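The extended form, with the same interval and count:

$ iostat -x 1 3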

    • rrqm/s: how many read requests for this device are merged per second (when a system call needs to read data, VFS sends the request to the filesystems; if a filesystem finds that different read requests target the same block of data, it merges them into one).

    • wrqm/s: how many write requests for this device are merged per second.

    • await: the average time to process each IO request, in milliseconds.

    • %util: the fraction of the statistics interval spent processing IO. For example, if the interval is 1 second and the device spends 0.8 s processing IO and 0.2 s idle, then %util = 0.8/1 = 80%. This reflects how busy the device is.

 

Common performance bottlenecks

    • Throughput hits its ceiling but system load does not reach the threshold: typically this is caused by allocating too few system resources to the service under test. If you see this during testing, check the service's resource configuration (threads, processes, and similar limits) to locate the cause.

    • CPU us and sy are not high, but wa is very high: if the service under test is disk-IO-intensive, a high wa is normal. If it is not, there are two likely causes. One is a problem in the service's disk read/write logic, with reads and writes too frequent or writes too large, for example an unreasonable data-loading strategy or excessive logging. The other is that the server is short of memory and the service keeps swapping pages in and out of the swap partition.

    • Memory keeps rising: at a fixed throughput, if memory usage keeps climbing, the service under test very likely has a memory leak, and a memory-checking tool such as Valgrind is needed to locate it.
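A typical Valgrind leak check looks like this (the binary name is a placeholder; build the service with debug symbols for useful stack traces):

$ valgrind --leak-check=full ./smart_tips_server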

 

An example

 

After the intelligent suggestion service went down, performance testing had to be carried out at once. Given the situation, the test results needed to cover both external and internal metrics.

[Figure: the architecture of the intelligent suggestion service and the function of each module]

As the architecture shows, the performance limit of the underlying data service behind the intelligent suggestion service had already been established before testing: 3500 QPS. Our task in this test was therefore to find the performance limit of each upstream module of the intelligent suggestion service, under the premise that the data service can sustain 3500 QPS.

[Figure: the complete flow of a back-end service performance test]

Preparation before testing:

    • Test data: since the intelligent suggestion service was already running online, this test used the service's logs from the day of the crash as test data.

    • QPS estimate: finding this number is the goal of the test.

    • Server configuration: use servers with the same hardware and software configuration as online.

Load-testing process:

We used JMeter to send the test data and simulate user requests; the original JMeter test configuration file is shown below. The performance-test configuration consists mainly of the data file (sharing between threads, behavior at end of file, etc.), throughput control, the HTTP sampler (domain name, port, HTTP method, request body, etc.), and the response assertion (verifying the contents of the returned result).
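For reference, a typical non-GUI JMeter run looks like this (the file names are placeholders):

$ jmeter -n -t smart_tips_plan.jmx -l result.jtl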

Data file configuration

Throughput control

HTTP request sampling

Response assertion

    • CPU

In Linux, commands such as sar, top, and ps can monitor CPU usage. In general, top is the most commonly used.

top is an interactive command: once run, it stays in the terminal and refreshes periodically. In a performance test, the following parameters make top run only once:

$ top -n 1 -b -p ${pid}

    • Server load

In Linux, the server load is obtained with the uptime command.
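Its output looks like the following (the numbers here are made up for illustration):

$ uptime
 14:12:31 up 10 days,  3:42,  2 users,  load average: 0.68, 0.72, 0.81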

Each column has the following meanings:

"Current time system run time logon user number of last 1 minutes, 5 minutes, 15 minutes average load"

    • Memory

In Linux, the top and ps commands can show the memory usage of a specified process, but the most accurate figures are in /proc/${pid}/status.
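A sketch of pulling just the fields we need (${pid} is a placeholder, and the values shown are illustrative):

$ grep -E 'VmSize|VmRSS|VmData' /proc/${pid}/status
VmSize:  1268244 kB
VmRSS:    321008 kB
VmData:  1129364 kB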

Of these fields, we focus on VmRSS, VmData, and VmSize.

    • Disk IO

Disk monitoring data is obtained with the iostat command, as described earlier.

Test report output

Once the monitoring metrics collected during the performance test have been tallied, the performance report can be produced.

Typically, the performance report includes the following:

    • Test conclusion: the maximum QPS, whether response time and the other metrics meet expectations, and deployment recommendations.

    • Test environment description: performance requirements, test server configuration, test data sources, test method, and so on.

    • Monitoring metrics statistics: response-time statistics, QPS, server metrics, and process metrics. Charts are recommended for presenting the statistics.

 

Conclusion


The test concluded that a single intelligent suggestion instance can sustain 300 QPS and the whole service line 1800 QPS, while the traffic on that dark and stormy night was probably 5000+ QPS. No wonder the intelligent suggestion service went down: the traffic was indeed far beyond the throughput capacity of the line.

In the end, the intelligent suggestion service applied for more servers to scale out, and the ride-hailing platform added rate limiting. Expanding capacity on one side and throttling traffic on the other ensured that, on the next dark and stormy night, the crowds hailing rides to meet for drinks, dinner, and dates got a smooth taxi experience and a better success rate for every rendezvous. A boundless good deed, one might say.
