Zheng: practice report based on records by Liu Qinhong and Shi Yong
Created 2015/8/13, last updated 2015/8/19
Keywords: load testing, tcpcopy, simulation testing, real-time traffic copying
Intended audience: technical staff
Outline:
- Why do simulation tests?
- How tcpcopy works
- Practice: topology of the simulation test
- Practice: procedure
- Problems you may encounter
  - ip_conntrack
  - Minor packet loss
  - Offline replay
  - No Layer 7 information is extracted
- Performance indicators to observe
0x00, Why do simulation tests?
Traditional offline load testing has a hard time simulating real traffic, and an especially hard time simulating the colorful mix of normal and abnormal traffic seen in production. A system that holds up well under offline load may therefore collapse suddenly online one day. A rebuilt system said to handle 5 times the traffic may go down completely when traffic merely doubles. But there are always more solutions than problems. Before a system rebuild or an important change goes live, you can copy real online traffic and replay it in real time, or even amplify it, and run a load test to evaluate the system's capacity. Conversely, if a performance bottleneck shows up online but is hard to reproduce offline, it is better to copy the real traffic to the offline environment and replay it there. The offline environment makes all kinds of troubleshooting convenient, and you can replay as many times as needed until the problem is found. So this time, before launching the Varnish-based static version of the product details page, we ran a simulated load test. How do we copy online traffic in real time? With tcpcopy.
In 2010, Wang Bin of NetEase's technology department developed tcpcopy, a TCP stream replay tool, building on work by Wang Bo. It was open-sourced in September 2011, and the current version is 1.0.0. Many companies base their simulated online testing on tcpcopy, such as Amoy.
tcpcopy is a request replication tool that works by replicating TCP packets. It copies online packets, modifies their TCP/IP header information, and sends them to the test server. This tricks the test server's TCP stack into treating them as genuine client traffic, which in turn lays a solid foundation for fooling the upper-layer application.
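The core of "modifying the TCP/IP header" is rewriting the destination address and port, then recomputing both checksums so the test server's stack accepts the packet. tcpcopy's own C code handles far more than this. The sketch below is a minimal illustration only. It assumes a plain IPv4/TCP packet with no fragmentation, and `rewrite_destination` and `checksum16` are hypothetical names.

```python
import struct


def checksum16(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ip_to_bytes(ip: str) -> bytes:
    return bytes(int(part) for part in ip.split("."))


def rewrite_destination(packet: bytes, new_dst_ip: str, new_dst_port: int) -> bytes:
    """Return a copy of an IPv4/TCP packet with the destination IP/port replaced
    and both the IP header checksum and the TCP checksum recomputed."""
    pkt = bytearray(packet)
    ihl = (pkt[0] & 0x0F) * 4                          # IP header length in bytes
    total_len = struct.unpack_from("!H", pkt, 2)[0]
    pkt[16:20] = ip_to_bytes(new_dst_ip)               # IP destination address
    struct.pack_into("!H", pkt, ihl + 2, new_dst_port) # TCP destination port
    # The IP header checksum covers the header only (checksum field zeroed first).
    struct.pack_into("!H", pkt, 10, 0)
    struct.pack_into("!H", pkt, 10, checksum16(bytes(pkt[:ihl])))
    # The TCP checksum covers a pseudo-header (src, dst, proto, length) plus the segment.
    tcp_len = total_len - ihl
    struct.pack_into("!H", pkt, ihl + 16, 0)
    pseudo = bytes(pkt[12:20]) + struct.pack("!BBH", 0, 6, tcp_len)
    struct.pack_into("!H", pkt, ihl + 16,
                     checksum16(pseudo + bytes(pkt[ihl:total_len])))
    return bytes(pkt)
```

The source IP is deliberately left alone. The test server therefore sees the real client address, which is why its responses must later be routed somewhere other than back to the client.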
0x01, How tcpcopy works
According to Wang Bin, server-side request replay generally falls into two kinds: offline replay and online real-time replication. Real-time request replication can be further divided into two categories: 1) request replication at the application layer, and 2) request replication based on low-level packets. Replicating at the application layer, for example server-side request copying, is relatively simple to implement but has several drawbacks:
1) the replicated requests traverse the entire protocol stack, so they easily consume the application's resources, such as precious connections;
2) the test is coupled to the production application, so it can easily affect the online system;
3) it is hard to support high-volume request replication;
4) it is hard to control network latency.
Replication based on low-level packets, by contrast, does not need to traverse the whole protocol stack. On the shortest path, request packets can be captured at the data link layer and sent from the data link layer. Alternatively, they can be captured at the IP layer and sent from the IP layer. Whichever path is taken, as long as the packets do not go through TCP, the impact on the online system is small. This is the basic idea of tcpcopy. tcpcopy has gone through three architectural revisions, from the traditional raw socket + iptables + netlink architecture to the new pcap + route architecture. Today's tcpcopy is divided into three roles:
- Online Server (OS): runs tcpcopy. It captures request packets at the data link layer (via the pcap interface) and sends packets out from the IP layer;
- Test Server (TS): the latest architecture moves intercept's work off the TS. The TS sets routing so that the tested application's response packets, which need to be captured, are routed to the AS;
- Assistant Server (AS): a standalone auxiliary server. In principle it must be an idle server on the same network segment as the TS. The AS intercepts response packets at the data link layer, extracts the useful information, and returns it to the tcpcopy process on the corresponding OS.
See Figure 1 for reference.
Figure 1: Data flow among the three roles
Capturing packets on the Online Server: in the new architecture, tcpcopy on the OS captures request packets through the raw socket input interface by default. Wang Bin recommends capturing with pcap instead. The install commands are:
./configure --enable-advanced --enable-pcap
make
make install
With pcap, packets can be filtered in kernel space; otherwise they can only be filtered in user space. A filter can also be set on the intercept or tcpcopy side with the -F parameter, which works like a tcpdump filter. This lets multiple instances cooperate on capturing, which scales better and suits extremely high-concurrency scenarios.
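One way several capture instances can "cooperate" is to give each a disjoint filter. The scheme below is a hypothetical example, not something tcpcopy generates. It assumes the filter option accepts standard pcap-filter syntax, as the text says, and it splits client connections by the low bits of the client's source port (`tcp[0:2]`). That way every packet of one connection lands in the same instance.

```python
def partition_filters(service_port: int, instances: int) -> list:
    """Build one pcap filter per capture instance (hypothetical scheme).
    Each instance sees a disjoint share of client connections, chosen by
    the low bits of the client's source port. `instances` must be a power of two."""
    if instances < 1 or instances & (instances - 1):
        raise ValueError("instances must be a power of two")
    mask = instances - 1
    return [
        f"tcp and dst port {service_port} and (tcp[0:2] & {mask}) = {k}"
        for k in range(instances)
    ]


def instance_for(src_port: int, instances: int) -> int:
    """Index of the filter from partition_filters() that matches a client port."""
    return src_port & (instances - 1)


for expr in partition_filters(80, 4):
    print(expr)
```

Splitting by connection rather than by packet matters: an instance that saw only some packets of a TCP session could not replay it coherently.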
To make pcap capture easier to understand, here is a brief description of how libpcap works.
Packet capture consists of three main parts:
- low-level packet capture;
- packet filtering in the middle layer;
- the user interface at the application layer.
This mirrors how the Linux operating system processes packets (network interface -> data link layer -> IP layer -> transport layer, and so on). The capture mechanism adds a bypass at the data link layer, which does not interfere with the system's own protocol stack processing. The bypass filters and buffers the packets sent and received through the Linux kernel and finally passes them directly to the upper-level application, as shown in Figure 2:
Figure 2: The three parts of libpcap
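A toy model can make the three parts and the "bypass" property concrete. In this model the normal stack always receives every frame, and the tap only makes filtered copies into a bounded buffer. Real libpcap runs a BPF filter in the kernel and uses kernel buffers. `BypassTap` and its plain-Python predicate are hypothetical stand-ins for illustration only. The model also shows one way capture-side drops arise: the buffer fills before the application reads it.

```python
from collections import deque


class BypassTap:
    """Toy model of libpcap's three parts: a tap at the link layer (capture),
    a filter (a Python predicate standing in for BPF), and a bounded buffer
    that the application drains (user interface)."""

    def __init__(self, predicate, buffer_size):
        self.predicate = predicate
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.dropped = 0

    def on_frame(self, frame, stack):
        stack.append(frame)               # the normal protocol stack always gets the frame
        if not self.predicate(frame):     # filtered out: no copy is made
            return
        if len(self.buffer) >= self.buffer_size:
            self.dropped += 1             # buffer full: the captured copy is lost
        else:
            self.buffer.append(frame)

    def read(self):
        """Application side: drain and return the captured frames."""
        frames = list(self.buffer)
        self.buffer.clear()
        return frames
```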
Sending packets on the Online Server: as shown in Figure 1, in both the new and the traditional architecture the OS sends packets through the raw socket output interface by default. The command is:
./tcpcopy -x 80-<test machine IP>:<test application port> -s <intercept server IP> -i eth0
Here the -i parameter specifies the network card from which pcap captures request packets. The new architecture can also send packets via pcap_inject (add --enable-dlinject at compile time).
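The `-x` argument packs three things together: the online port to copy, plus the test server's IP and port. The sketch below parses only the `port-IP:port` form used in this document. Accepting a comma-separated list for several targets is an assumption, and `Transfer` and `parse_transfer` are hypothetical names, not tcpcopy's C parser.

```python
from typing import List, NamedTuple


class Transfer(NamedTuple):
    local_port: int    # port on the online server whose traffic is copied
    target_ip: str     # test server IP
    target_port: int   # port of the application under test


def parse_transfer(spec: str) -> List[Transfer]:
    """Parse strings such as '80-10.0.0.52:80' (comma-separated for several)."""
    result = []
    for item in spec.split(","):
        local, _, target = item.partition("-")
        ip, _, port = target.rpartition(":")
        if not local or not ip or not port:
            raise ValueError(f"bad transfer spec: {item!r}")
        result.append(Transfer(int(local), ip, int(port)))
    return result


print(parse_transfer("80-10.0.0.52:80"))
```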
Routing response packets on the Test Server: add a static route on the test server so that the tested application's response packets are routed to the assistant server instead of being sent back toward the online server.
Capturing response packets on the Assistant Server: make sure routing (IP forwarding) is not enabled on the assistant server. Check with cat /proc/sys/net/ipv4/ip_forward, where 0 means it is off. The intercept process on the assistant server captures the test application's response packets via pcap, extracts their headers, and sends them to the tcpcopy process on the online server. This completes the replication of one request.
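The payload of a response is irrelevant to tcpcopy. What it needs is connection state from the headers (sequence and acknowledgement numbers, window, flags) so it can build the next packets of the replayed session. The exact message intercept sends is defined by tcpcopy itself. This sketch only shows the "strip the payload, keep the headers" idea, assuming IPv4/TCP, and `extract_tcp_headers` is a hypothetical name.

```python
import struct


def extract_tcp_headers(packet: bytes) -> dict:
    """Drop the TCP payload of an IPv4/TCP response packet and keep the
    header bytes plus the fields a replay client needs to track the session."""
    ihl = (packet[0] & 0x0F) * 4                       # IP header length
    total_len = struct.unpack_from("!H", packet, 2)[0]
    tcp = packet[ihl:]
    src_port, dst_port, seq, ack = struct.unpack_from("!HHII", tcp, 0)
    data_offset = (tcp[12] >> 4) * 4                   # TCP header length
    return {
        "headers": packet[: ihl + data_offset],        # IP + TCP headers only
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "flags": tcp[13],
        "window": struct.unpack_from("!H", tcp, 14)[0],
        "payload_len": total_len - ihl - data_offset,  # size of what was dropped
    }
```

Sending back only a few dozen header bytes per response keeps the AS-to-OS channel light even when responses are large.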
0x02, Practice: topology of the simulation test
The online environment topology for this simulation test is listed below. The environment is as follows:
- Online Server
  - 4 production Nginx servers
    - 172.16.***.110
    - 172.16.***.111
    - 172.16.***.112
    - 172.16.***.113
- Test Server
  - One Nginx in the mirror environment
- Assistant Server
  - One standalone server in the mirror environment
The topology is shown in Figure 3:
Figure 3: Topology of the simulation test
The data flow sequence in the load-testing environment is as follows:
Figure 4: Data flow sequence in the load-testing environment
0x03, Practice: procedure
The steps below are listed separately for the online server, the test server, and the assistant server.
3.1 Actions on the Online Server: download and install tcpcopy;
git clone http://github.com/session-replay-tools/tcpcopy
./configure
make && make install
After installation, the configuration summary is:
Configuration summary
  tcpcopy path prefix: "/usr/local/tcpcopy"
  tcpcopy binary file: "/usr/local/tcpcopy/sbin/tcpcopy"
  tcpcopy configuration prefix: "/usr/local/tcpcopy/conf"
  tcpcopy configuration file: "/usr/local/tcpcopy/conf/plugin.conf"
  tcpcopy pid file: "/usr/local/tcpcopy/logs/tcpcopy.pid"
  tcpcopy error log file: "/usr/local/tcpcopy/logs/error_tcpcopy.log"
Run the tcpcopy client. There are several options:
./tcpcopy -x 80-172.16.***.52:80 -s 172.16.***.53 -d    # copy all traffic
./tcpcopy -x 80-172.16.***.52:80 -s 172.16.***.53 -r 20 -d    # copy 20% of traffic
./tcpcopy -x 80-172.16.***.52:80 -s 172.16.***.53 -n 2 -d    # amplify traffic 2x
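How tcpcopy implements `-r` and `-n` internally is not covered here. The sketch below shows one plausible approach and is explicitly an assumption. Sampling should be decided per connection, not per packet, because a connection missing some of its packets cannot be replayed. Amplification needs each extra copy to look like a distinct connection, here by shifting the client source port. `keep_session`, `amplified_ports` and the 1024-port shift are hypothetical.

```python
import hashlib


def keep_session(client_ip: str, client_port: int, rate_percent: int) -> bool:
    """Decide once per connection whether to copy it (a stable hash of the
    client endpoint), so a kept connection is copied with all its packets."""
    key = f"{client_ip}:{client_port}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return bucket < rate_percent


def amplified_ports(client_port: int, copies: int, shift: int = 1024) -> list:
    """For n-fold amplification, give each copy a different source port so the
    test server sees distinct connections (wrap-around handling simplified)."""
    return [(client_port + i * shift) % 65536 for i in range(copies)]


print(amplified_ports(40000, 2))  # [40000, 41024]
```

Because the decision is per connection, `-r 20` in such a scheme yields about 20% of connections. That only approximates 20% of requests when connections carry different numbers of requests (keep-alive).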
3.2 Actions on the Test Server: add a static route:
route add -net 0.0.0.0/0 gw 172.16.***.53
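Why is a single default route enough? Assuming the copied requests keep the real client IPs (no client-IP rewriting), every response's destination is some Internet address. Longest-prefix match therefore selects 0.0.0.0/0, whose gateway is the assistant server. The directly connected subnet, which the kernel adds itself, still wins for local hosts. This is a minimal sketch: `next_hop` is a hypothetical helper, and 10.0.0.x addresses stand in for the masked 172.16.***.x ones.

```python
import ipaddress


def next_hop(dst: str, routes):
    """Longest-prefix match over (cidr, gateway) pairs; 'direct' = on-link."""
    addr = ipaddress.ip_address(dst)
    best = None
    for cidr, gateway in routes:
        network = ipaddress.ip_network(cidr)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, gateway)
    if best is None:
        raise LookupError(f"no route to {dst}")
    return best[1]


test_server_routes = [
    ("10.0.0.0/24", "direct"),     # connected subnet (added by the kernel)
    ("0.0.0.0/0", "10.0.0.53"),    # default route via the assistant server
]
print(next_hop("203.0.113.7", test_server_routes))  # 10.0.0.53
```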
3.3 Actions on the Assistant Server: download and install the intercept server;
git clone http://github.com/session-replay-tools/intercept
./configure
make && make install
After installation, the configuration summary is:
Configuration summary
  intercept path prefix: "/usr/local/intercept"
  intercept binary file: "/usr/local/intercept/sbin/intercept"
  intercept configuration prefix: "/usr/local"
  intercept configuration file: "/usr/local/intercept/"
  intercept pid file: "/usr/local/intercept/logs/intercept.pid"
  intercept error log file: "/usr/local/intercept/logs/error_intercept.log"
Run the intercept server:
./intercept -i eth0 -F 'tcp and src port' -d
Figure 5: Data flow between the production environment and the mirror environment
With reference to Figure 5, here is a brief explanation of how it works:
- tcpcopy copies requests on the specified port from the data link layer, then changes the destination IP and destination port.
- The modified packets are passed down to the data link layer for sending, and the TCP connection state is maintained.
- The packets travel from the online server to the test server over the data link layer.
- After the data link layer unpacks them, the requests reach the service port on which Nginx responds.
- When Nginx returns the result of the user's request, the response goes back down to the data link layer.
- The response is sent from the test server to the assistant server over the data link layer. Note: the test server has only one default route, and it points to the assistant server.
- The data reaches the assistant server and is intercepted by the intercept process.
- intercept filters out the relevant information and sends the request status back to tcpcopy on the online server, and the TCP connection is closed.
0x04, Problems you may encounter
Wang Bin himself has said that using tcpcopy well requires solid systems knowledge: how to capture packets efficiently, how to locate system bottlenecks, how to deploy the application under test, and how to capture and analyze packets. Common problems are:
1) the test system is not deployed properly and stays coupled to the online system;
2) system bottlenecks are ignored;
3) not knowing how to locate problems;
4) insufficient resources, causing problems through resource contention.
1) ip_conntrack
In June 2014, Tang Folin of Weibo said: "tcpcopy is an excellent traffic-diversion tool for online troubleshooting. But few users notice that turning on the tcpcopy service also turns on the ip_conntrack kernel module. This module tracks the state of every TCP connection, and its internal table has a size limit; once that limit is exceeded, all new connections fail."

Wang Bin replied: "Starting tcpcopy does not by itself enable the ip_conntrack kernel module. Enabling ip_conntrack is the user's own decision and has nothing to do with tcpcopy."

He also suggested: "When the number of connections is very large, ip_conntrack should be turned off; otherwise it will seriously hurt performance. As for tcpcopy, by default it sends packets from the IP layer, so it is subject to ip_conntrack interference, as the documentation also explains. You can instead build with --enable-dlinject to send packets and bypass ip_conntrack at the IP layer. If you do not see 'ip_conntrack: table full, dropping packet', you generally need not worry about ip_conntrack."

And: "If there are not many connections online, enabling ip_conntrack is no problem. If there are many, it is best to disable ip_conntrack or set NOTRACK for the online application's ports. At least the systems around me do this, for performance reasons, and it is also a good operations habit."
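Rather than waiting for "table full, dropping packet" in the kernel log, you can watch how full the tracking table is. On current kernels the counters are exposed as /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max; older kernels used ip_conntrack names. This is a minimal monitoring sketch: the 80% threshold is an arbitrary assumption, and the function names are hypothetical.

```python
from pathlib import Path

COUNT_FILE = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_FILE = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, maximum: int) -> float:
    """Fraction of the connection-tracking table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack_usage(count_file=COUNT_FILE, max_file=MAX_FILE):
    """Return the table usage, or None if conntrack is not loaded."""
    try:
        count = int(count_file.read_text().strip())
        maximum = int(max_file.read_text().strip())
    except FileNotFoundError:
        return None
    return conntrack_usage(count, maximum)


def should_warn(usage, threshold=0.8):
    """True when the table is close to full (threshold is an assumption)."""
    return usage is not None and usage >= threshold


print(should_warn(read_conntrack_usage()))
```

A `None` result means the module is not loaded, which on a busy online server is the situation Wang Bin recommends anyway.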
2) Minor packet loss
How can you tell how many packets tcpcopy drops? Wang Bin himself said that in some scenarios the pcap capture loss rate is much higher than that of raw socket capture. It is therefore best to use PF_RING to assist, or to capture with a raw socket. The loss rate should be measured in a test environment by sending a known number of requests and comparing, and the logs also need to be analyzed. This is still pending testing.
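The "send a known number of requests and compare" measurement reduces to simple arithmetic once `-r` and `-n` are taken into account. This sketch assumes you count requests sent at the online side and requests arriving at the test server, for example from its Nginx access log. The function names are hypothetical. With `-r` sampling the expected count is only approximate, so small loss figures should be read with that in mind.

```python
def expected_copies(sent: int, rate_percent: int = 100, multiplier: int = 1) -> float:
    """Requests the test server should see if nothing is lost (-r / -n style)."""
    return sent * rate_percent / 100 * multiplier


def loss_rate(sent: int, received: int, rate_percent: int = 100, multiplier: int = 1) -> float:
    """Fraction of expected copies that never reached the test server."""
    expected = expected_copies(sent, rate_percent, multiplier)
    if expected == 0:
        raise ValueError("nothing expected; check sent/rate_percent/multiplier")
    return max(0.0, 1 - received / expected)


print(round(loss_rate(10000, 1980, rate_percent=20), 4))  # 0.01
```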
3) Offline replay
tcpcopy has two modes of operation: 1) real-time copying of packets, and 2) offline request replay from capture files produced by tcpdump or similar tools. In this simulation test we did not successfully test the second mode; it is left for further study.
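Since offline mode was not tested here, the following is not tcpcopy's implementation. It only shows what replaying a tcpdump capture involves at the file level: read the classic pcap format that `tcpdump -w` writes, then preserve the original gaps between packets when replaying. `read_pcap` and `replay_delays` are hypothetical names. The nanosecond variant and pcapng are not handled.

```python
import struct


def read_pcap(data: bytes):
    """Yield (timestamp_seconds, packet_bytes) from a classic pcap file."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"                              # written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"                              # written on a big-endian host
    else:
        raise ValueError("not a classic microsecond pcap file")
    offset = 24                                   # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(endian + "IIII", data, offset)
        offset += 16                              # skip the per-record header
        yield ts_sec + ts_usec / 1e6, data[offset: offset + incl_len]
        offset += incl_len


def replay_delays(records):
    """Gaps between consecutive packets, used to pace an offline replay."""
    times = [ts for ts, _ in records]
    return [b - a for a, b in zip(times, times[1:])]
```

Replaying with these delays reproduces the original traffic shape. Dividing them by a factor would be a simple way to speed the replay up.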
4) No Layer 7 information is extracted
At the meeting, someone proposed distinguishing the copied traffic by domain name, to avoid bringing down services outside the scope of this load test. But tcpcopy copies at the IP layer and does not extract Layer 7 information. With tcpcopy deployed on our 4 Nginx servers, all traffic can only be copied wholesale to the Nginx in the mirror environment. On that Nginx, the server blocks for out-of-scope domains are either not configured or stopped, so traffic it cannot handle is simply discarded.
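The domain name lives only in the HTTP `Host` header, inside the TCP payload, above the IP/TCP headers tcpcopy works with. That is why domain-based filtering is out of reach. The sketch below shows what a Layer 7 filter would have to do for each connection. It is hypothetical, not a tcpcopy feature. It also cannot work for HTTPS, and it misses the header whenever it is not in the inspected segment.

```python
def http_host(payload: bytes):
    """Return the Host header of an HTTP/1.x request carried in a TCP payload,
    or None if it is not visible in this segment."""
    head = payload.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")
    for line in lines[1:]:                       # skip the request line
        name, colon, value = line.partition(b":")
        if colon and name.strip().lower() == b"host":
            return value.strip().decode("ascii", "replace")
    return None


print(http_host(b"GET /item/1 HTTP/1.1\r\nHost: item.example.com\r\n\r\n"))
```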
0x05, Performance indicators to observe
During the simulated load test, record the performance indicators of the test server and of the back-end projects under load. In this test we recorded the following:
- Java projects: request counts, response times, average response time, call success/failure, number of connections on the web port;
- Web container: threads, memory, etc.;
- Virtual machines: CPU usage, load average, I/O usage, etc.;
- Cache clusters such as Memcached/Redis: hit rate, etc.;
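An average response time on its own can hide a slow tail, so it helps to compute percentiles alongside it. The helpers below are hypothetical and not tied to any monitoring tool used in this test. They compute the average, a nearest-rank percentile, and a cache hit rate from raw samples and counters.

```python
import math


def average(values):
    return sum(values) / len(values)


def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def hit_rate(hits, misses):
    total = hits + misses
    return hits / total if total else 0.0


response_ms = [10, 20, 30, 40, 1000]
print(average(response_ms), percentile(response_ms, 50), percentile(response_ms, 99))
# 220.0 30 1000
```

In this example the average (220 ms) describes none of the actual requests, while the median and p99 show the typical and the worst case separately.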
Real-world load testing: a tcpcopy-based simulated load-testing scheme