Editor's note: High Availability Architecture shares and disseminates typical articles in the architecture field. This is the transcript of the talk given by Pang of Dataman Cloud's operations team at the "million concurrent" offline event in Beijing on March 27.
Not long ago, Dataman Cloud (Shurenyun), together with the OCP Lab of Tsinghua University's Institute for Interdisciplinary Information Sciences, successfully served one million concurrent HTTP requests with 10 OCP servers.
The goal of the experiment was to handle 1 million concurrent requests with a minimum of physical resources, and in doing so to push to its limit the high-load capacity of Dataman Cloud's DCOS (data center operating system), which is built on Mesos and Docker.
Tools and hardware for the million-concurrency load test
Load-testing tools
The load-testing tools chosen were the distributed tools Locust and Tsung.
Locust (http://locust.io/) is an easy-to-use distributed load-testing tool, used primarily for load testing of web sites.
Locust's official website compares its strengths and weaknesses with Apache JMeter and Tsung:

We have evaluated JMeter and Tsung; both are good tools. We have used JMeter many times, but its test scenarios have to be generated through tedious clicking in a GUI, and it must create a thread for every simulated user, so it is hard to emulate a massive number of concurrent users.

Tsung does not have this threading problem: it uses lightweight Erlang processes, so it can launch massive numbers of concurrent requests. But it shares JMeter's weakness in defining test scenarios; it describes test-user behavior in XML, and you can imagine how scary that is. And if you want to see any test results, you have to dig through a pile of result log files yourself...
Tsung is an Erlang-based open-source distributed multi-protocol load-testing tool that supports HTTP, WebDAV, SOAP, PostgreSQL, MySQL, LDAP, and Jabber/XMPP. Visit http://tsung.erlang-projects.org/ to learn more.
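For illustration, a minimal Tsung HTTP scenario of the XML-driven kind described above might look like the following sketch; the hosts, rates, and durations are placeholders, not the configuration used in this test:

```xml
<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="warning">
  <clients>
    <client host="localhost" use_controller_vm="true" maxusers="60000"/>
  </clients>
  <servers>
    <server host="target.example.com" port="80" type="tcp"/>
  </servers>
  <load>
    <!-- ramp up: start 1000 new users per second for 60 seconds -->
    <arrivalphase phase="1" duration="60" unit="second">
      <users arrivalrate="1000" unit="second"/>
    </arrivalphase>
  </load>
  <sessions>
    <session name="http-get" probability="100" type="ts_http">
      <request><http url="/" method="GET" version="1.1"/></request>
    </session>
  </sessions>
</tsung>
```

Everything, including user arrival rates and per-request behavior, is declared in this one XML file, which is exactly the verbosity the Locust comparison above complains about.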
Hardware configuration
OCP (Open Compute Project) is an open-source hardware project launched by Facebook, aiming to use open hardware technology to drive the development of IT infrastructure and meet the hardware needs of data centers.
The OCP hardware configuration for this experiment was as follows:

CPU: 2.20 GHz
Forwarding tier: dual CPU, 24 cores
Load-generating tier: dual CPU, 20 cores
Load-bearing tier: dual CPU, 16 cores
Memory: DDR3-1600, 128 GB
Network: 10-gigabit Ethernet
For the system and software environment, open-source container technology was adopted: at the software layer, all system-dependent software is packaged into Docker images.

The host environment therefore does not need a complex dependency setup for the hosted services; the application and its dependencies are encapsulated in the container. This is very convenient when migrating, greatly improves application portability, and makes migration and scaling easy.
How the million-concurrency load test was done
As already mentioned, the goal of this experiment was to handle 1 million concurrent requests with a minimum of physical resources. Several challenges were encountered:

How do we drive the load up to 1 million? In other words, what load-generation method do we use?

How many physical resources will ultimately be required? What is the processing capacity of each module in the architecture?
(Figure 1: the load-test architecture diagram; click the image to zoom to full screen)
In the figure, the red box marks the working cluster for this test, and the other box marks the auxiliary cluster for the experiment. It should be noted that "10 OCP servers carrying 1 million HTTP requests" means the forwarding-tier machines plus the load-bearing machines; it does not include the load-generating machines, because in a real-world scenario the load generators are the visiting users themselves.
The load test is described in detail below.
Basic Environment
Base deployment
Rack the experimental OCP hardware, power it up, build the network, and install the operating system;
The system used is CentOS 7.1, with the kernel upgraded to 3.19 as required;
Use Ansible to deploy the remaining base software on the machines, including installing Docker 1.9.1 (ext4 + overlay) and raising the default system limits (file handles, kernel parameters, and interrupt optimizations).
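For illustration, limit-raising of this kind typically involves settings like the following; these are typical example values for high-concurrency tests, not the ones actually used in the experiment:

```ini
# /etc/sysctl.conf (illustrative values only)
fs.file-max = 2000000                       # system-wide open-file limit
net.ipv4.ip_local_port_range = 1024 65535   # widen the ephemeral port range
net.core.somaxconn = 65535                  # deeper listen/accept queues
net.ipv4.tcp_tw_reuse = 1                   # reuse TIME_WAIT sockets for new connections

# /etc/security/limits.conf (illustrative)
* soft nofile 1000000
* hard nofile 1000000
```

Settings like these would normally be templated and pushed out by the same Ansible run that installs Docker.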
Deploying the Dataman Cloud cluster
With a standard system and Docker in place, the machines can be loaded with the Dataman Cloud platform. On site, Pang walked through the steps of building the Dataman Cloud platform and completed the installation within two minutes.
(Figure 2: the load-bearing tier; click the image to zoom to full screen)
Load-bearing tier design: the flash-sale (seckill) project
Once the Dataman Cloud cluster is installed, the application for the load test can be deployed.

The load-bearing tier was deployed first, using the Nginx + Lua combination. This pairing is common in high-load systems and is also the native module of Dataman Cloud's flash-sale project. The returned result is dynamic, uncached data, which guarantees the accuracy of the test. Since the flash-sale module is not the focus of this article, it is described only briefly below.
(Figure 3: analysis of the program's returned result)

In the flash-sale module, time is the timestamp taken automatically from the server, events holds the flash-sale business data, Event1 is a flash-sale activity, and 480,000 is the quantity in that flash-sale activity.
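Based on the fields just described, the returned payload might be shaped like this; the field names come from the figure description, and the timestamp value is illustrative:

```json
{
  "time": 1459052400,
  "events": {
    "Event1": 480000
  }
}
```

Because the timestamp and counts change on every request, the response cannot be served from a cache, which is what makes the measurement honest.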
(Figure: Nginx + Lua Optimization)
The figure above shows the Nginx optimization scheme, also drawn from solutions published online. The second box below covers Lua's own optimizations, which are critical to Lua's processing capacity.
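As an illustration of the kind of tuning the figure refers to, not the exact configuration used in the test, a high-concurrency Nginx + Lua setup typically includes directives like these:

```nginx
# Illustrative tuning only; values are common examples, not the tested config.
worker_processes     auto;        # one worker per CPU core
worker_rlimit_nofile 1000000;     # raise the per-worker file-descriptor limit

events {
    worker_connections 100000;    # connections each worker may hold
    multi_accept       on;        # accept as many new connections as possible at once
}

http {
    keepalive_requests 10000;     # reuse each client connection for many requests
    lua_code_cache     on;        # cache compiled Lua chunks; critical for Lua throughput
}
```

The lua_code_cache directive in particular is the classic "Lua's own optimization": with it off, every request recompiles the Lua scripts.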
Scenario A load test
Once the load-bearing tier is deployed, the load-generating tier can be set up. Locust has three strengths: it is distributed, simple to install, and offers a web UI.
(Figure: Scenario A; click the image to zoom to full screen)
Problems encountered with Locust during the test

Locust cannot load-test multiple server endpoints at once, so Mesos-DNS was added on top of it to present a unified interface to the Locust slaves. Locust's load script is tasks.py, and each slave must pull tasks.py from the config server before starting.
Next comes the forwarding tier, HAProxy, and behind it Nginx.
Locust test steps

Test single-core performance: approximately 500 requests/s
Test single-machine performance: 40 cores with hyper-threading, approximately 10,000 requests/s
Through testing, we found that Locust has three drawbacks:

low per-core load-generation capacity; poor support for hyper-threading; and an unstable master when a large number of slave nodes are connected.
(Figure: Locust slave resource usage)
As the figure shows, the load-generation resources were divided into two groups. Group A has 20 machines with 20 physical cores each, whose slaves can generate 200,000 requests/s. Group B has 15 machines with 16 physical cores each and can generate 120,000 requests/s, for a total load-generation capacity of 320,000 requests/s.
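As a quick arithmetic check of the 320,000/s (32w/s) total quoted above, assuming capacity is spread evenly across the machines in each group:

```python
# Reproducing the load-generator capacity figures from the text.
# Group A: 20 machines (20 physical cores each) at ~10,000 req/s per machine,
#          matching the single-machine Locust result above.
# Group B: 15 machines (16 physical cores each) at ~8,000 req/s per machine
#          (inferred from 120,000 / 15).
group_a = 20 * 10_000   # 200,000 req/s ("20w/s")
group_b = 15 * 8_000    # 120,000 req/s ("12w/s")
total = group_a + group_b
print(total)  # 320000
```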
(Figure: single-machine Nginx + Lua, host network)
Load test, step one

The measured capacity of a single Nginx + Lua instance (host network) is 197,000 requests/s.
Considering HAProxy optimization, individual tests were then run, and keep-alive was finally selected.
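For illustration, a minimal HAProxy configuration with HTTP keep-alive enabled might look like the following; the names, addresses, and timeouts are placeholders, not the configuration actually used:

```haproxy
# Illustrative haproxy.cfg fragment; placeholder values only.
defaults
    mode http
    option http-keep-alive        # reuse connections on both client and server side
    timeout http-keep-alive 10s
    timeout client  30s
    timeout server  30s

frontend ft_web
    bind *:80
    default_backend bk_nginx

backend bk_nginx
    balance roundrobin
    server nginx1 10.0.0.11:80 check
    server nginx2 10.0.0.12:80 check
```

Keep-alive avoids paying the TCP handshake cost on every request, which matters greatly at these request rates.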
Load test, step two

Without hyper-threading the result was 227,000 requests/s, with 95% of requests answered within 1 second. The test revealed heavy queueing at HAProxy and poor sustained throughput: serious congestion appeared within a few minutes. At the same time, a few CPUs ran at full load, showing that work was unevenly distributed and that some modules needed unified scheduling; as a result, HAProxy could not sustain high-performance processing.

With hyper-threading, single-machine performance attenuates somewhat compared with the non-hyper-threaded case: the result was 219,000 requests/s, with 95% of requests answered within one second. But far fewer requests queued at HAProxy, CPU load was evenly spread, and the test results were very stable.

HAProxy with hyper-threading and without Docker measured 270,000 requests/s. Docker does impose some attenuation, but 99% of requests were visibly handled within 1 second, which already meets the standard for enterprise use.
At this point the known figures are: total load-generation capacity 320,000/s; a single Nginx + Lua at 197,000/s; and a single forwarding-tier HAProxy peaking at 270,000/s. What, then, is the capacity of single-machine Nginx + Lua in NAT mode?

As seen earlier, the single-machine Nginx host-network test result was 197,000/s. With HAProxy in front and Nginx in NAT mode, the attenuated result was 143,000/s, with CPU load at almost 100%.
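As a quick check of the attenuation implied by the figures quoted above (a sketch, using only numbers from the text):

```python
# Quantifying the Docker and NAT attenuation from the quoted results.
ha_docker_ht = 219_000   # via HAProxy, hyper-threaded, in Docker
ha_no_docker = 270_000   # same setup without Docker
host_mode    = 197_000   # single-machine Nginx+Lua, Docker host network
nat_mode     = 143_000   # behind HAProxy with Nginx in Docker NAT mode

docker_attenuation = (ha_no_docker - ha_docker_ht) / ha_no_docker
nat_attenuation = (host_mode - nat_mode) / host_mode
print(f"{docker_attenuation:.1%}")  # 18.9%
print(f"{nat_attenuation:.1%}")     # 27.4%
```

So NAT networking costs noticeably more than Docker itself, which is why CPU saturates in NAT mode.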
(Figure: overall test results; click the image to zoom to full screen)
Because of Locust's poor support for hyper-threading and its own performance problems, the requirements could not be met with the existing hardware resources, so Tsung was used for the test instead.
Scenario B load test
For the complete test documentation, see: http://doc.shurenyun.com/practice/tsung_dataman.html
Switching to Scenario B, the push toward one million concurrent continued.
(Figure: Scenario B; click the image to zoom to full screen)
Architecture diagram explained: the Tsung master drives the slaves over SSH, and Erlang's EPMD handles communication between the cluster nodes.
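In Tsung, that master/slave relationship is declared in the clients section of the configuration; an illustrative sketch with placeholder hostnames and limits:

```xml
<!-- Illustrative only: hostnames and limits are placeholders. -->
<clients>
  <!-- the master logs in to each slave over SSH; EPMD then links the Erlang nodes -->
  <client host="tsung-slave-01" maxusers="50000" cpu="40"/>
  <client host="tsung-slave-02" maxusers="50000" cpu="40"/>
</clients>
```

The cpu attribute tells Tsung how many Erlang VMs to start per slave, which is how it spreads load across many cores.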
Steps performed:

First, Dockerize Tsung for Mesos.
Install SSH and install Tsung.
Prepare the Mesos configuration file.
Call the Dataman Cloud API to publish Tsung.

API call script:
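A hypothetical sketch of what such an API call script could look like. The endpoint, authorization header, and payload fields below are assumptions for illustration, not the actual Dataman Cloud API:

```python
# Hypothetical deployment script; the endpoint, token header, and payload
# field names are assumptions, NOT the real Dataman Cloud API.
import json
import urllib.request

def build_payload(slave_count: int) -> dict:
    """Assemble an app-deployment request for the Tsung slave containers."""
    return {
        "name": "tsung-slave",
        "image": "tsung:latest",   # image name assumed
        "instances": slave_count,
        "network": "host",         # host networking, as used in the test
    }

def publish(api_url: str, token: str, payload: dict) -> None:
    """POST the deployment request to the platform API."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )
    urllib.request.urlopen(req)

# 20 slaves, one per load-generating machine, as in the configuration below
payload = build_payload(20)
```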
Scenario B load-test configuration
Load-generating tier: Tsung client machines
Quantity: 20
CPU: 40 cores with hyper-threading; CPU consumption was not the bottleneck
Memory: 128 GB
Network: 10-gigabit Ethernet
Docker: host mode, 20 containers deployed (1 per machine)
Tsung controller: this machine's configuration could be much lower; one physical machine was picked at random for the test
Quantity: 1
CPU: 40 cores with hyper-threading
Memory: 128 GB
Network: 10-gigabit Ethernet
Docker: host mode, 1 container deployed
Forwarding tier: HAProxy
Quantity: 4
CPU: 48 cores with hyper-threading; CPU consumption was extremely high (the bottleneck)
Memory: 128 GB; memory consumption was close to 20 GB
Network: 10-gigabit Ethernet
Docker: host mode, 4 containers deployed (1 per machine)
Load-bearing tier: Nginx
Quantity: 6
CPU: 32 cores with hyper-threading; CPU consumption was extremely high (the bottleneck)
Memory: 128 GB; memory consumption was close to 10 GB
Network: 10-gigabit Ethernet
Docker: NAT mode, 48 containers deployed (8 per machine, equivalent to 480,000 concurrent connections; each machine processed about 140,000, for a total of roughly 800,000)
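As a quick check of the load-bearing figures quoted above, on the per-machine reading of the numbers:

```python
# Checking the load-bearing tier arithmetic from the configuration listing.
machines = 6
containers = machines * 8          # 8 Nginx containers per machine
per_machine = 140_000              # ~14w processed per machine
total = machines * per_machine
print(containers, total)           # 48 containers; 840,000, i.e. "roughly 800,000"
```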
Scenario B detailed report download

Million-concurrency stress test report:

http://qinghua.dataman-inc.com/report.html
In the end, Dataman Cloud successfully completed the million-concurrency stress test on the basis of Tsung, and the industry can readily draw on Dataman Cloud's million-concurrency practice when designing high-load systems. Click "Read the original" to see the detailed test parameters.
Related articles from High Availability Architecture
To discuss more Internet-architecture load-testing techniques, follow the public account to read further articles. When reprinting, please credit the High Availability Architecture public account "archnotes" and include the QR code below.
A Locust- and Tsung-based million-concurrency flash-sale load-test case [repost]