Article from: the TingYun (Listen to the Cloud) blog
As our business continues to grow, the number of applications we run has exploded, and with it the difficulty of managing them. Scaling out quickly while the business is growing this fast is a major challenge. Docker arrived at just the right time: with Docker we can scale out and scale in quickly, and the configuration stays uniform and less error-prone.
When choosing a Docker cluster manager we went back and forth; the two most popular options at the moment are Mesos and Kubernetes. In terms of functionality we lean toward Kubernetes: its container orchestration capabilities are stronger than Mesos's, and it provides a persistent storage solution that fits our scenario better. However, the Kubernetes network model is more complex than Mesos's, and whether it can meet our requirements under high concurrency is the key question.
Mesos itself does not deal with networking; with Marathon we can choose the host mode or bridge mode that Docker provides natively. In host mode the container shares the host's network stack, so network performance is the highest; bridge mode needs to be evaluated.
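To make the difference between the two modes concrete, here is a minimal sketch using the Docker SDK for Python. Marathon actually launches containers through its own app definitions, so this is purely illustrative; the image name and ports are placeholders, not the ones used in our test.

```python
import docker

client = docker.from_env()

# Host mode: the container shares the host's network stack, so there is no
# NAT and no port mapping; the app listens directly on the host's ports.
client.containers.run("example/app:latest", detach=True, network_mode="host")

# Bridge mode: the container gets its own network namespace behind docker0
# and traffic is NAT'ed; here the app's port 8080 inside the container is
# published on the host as port 8081 (placeholder values).
client.containers.run("example/app:latest", detach=True, ports={"8080/tcp": 8081})
```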
Kubernetes uses a flat network model that requires every pod to have a globally unique IP, so pods can communicate directly across hosts. The most mature solution at present is flannel. Flannel has several backends: UDP, VXLAN, host-gw, and aws-vpc. aws-vpc has platform restrictions, so we do not consider it, and UDP performance is poor, so we focus on testing VXLAN and host-gw. Published benchmarks show host-gw performing better than VXLAN, but host-gw requires the hosts to sit on the same layer-2 network, which many public cloud platforms cannot guarantee. Even if a public cloud can meet this requirement within a single data center, interconnection across data centers will not; so if VXLAN cannot meet our needs, multi-data-center interconnection would require multiple Kubernetes clusters, which increases management complexity.
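For reference, flannel picks its backend from a configuration stored in etcd, so switching between VXLAN and host-gw is a one-line change. Below is a minimal sketch, assuming an etcd v2 endpoint reachable through etcdctl and flannel's default /coreos.com/network/config key; the subnet is only an example, not the one used in this test.

```python
import json
import subprocess

# Write flannel's network configuration into etcd (v2 API) so that every
# flanneld in the cluster picks up the same backend.
flannel_config = {
    "Network": "10.1.0.0/16",          # example pod network, not our real one
    "Backend": {"Type": "vxlan"},      # switch to {"Type": "host-gw"} only when
                                       # the hosts share a layer-2 network
}

subprocess.run(
    ["etcdctl", "set", "/coreos.com/network/config", json.dumps(flannel_config)],
    check=True,
)
```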
To make the results realistic, we selected a high-concurrency system already in production for the evaluation. The machine configuration is 16C 32G, and the application itself has no performance problems. Because Kubernetes requires an interconnected pod network, we used two machines when testing Kubernetes.
In summary, the environments to be tested are as follows:
| Machine under test | Machine configuration | Number of machines |
| --- | --- | --- |
| K8s flannel VXLAN | 16C 32G | 2 |
| K8s flannel host-gw | 16C 32G | 2 |
| Mesos host mode | 16C 32G | 1 |
| Mesos bridge mode | 16C 32G | 1 |
| Public cloud virtual machine (control) | 16C 32G | 1 |
TingYun Server can monitor code-level response times, and by adding a header at the load balancer it can also measure the blocking time from the load balancer to the back-end real server. Using this capability, we can evaluate the blocking time under each of the network models above and find out whether the VXLAN model can meet our needs under high concurrency.
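The idea behind the header is simply to stamp each request with the time it left the load balancer, so the agent on the real server can subtract. Here is a minimal sketch of that calculation as a WSGI middleware, assuming the load balancer injects a hypothetical X-Request-Start header in microseconds; this illustrates the technique only and is not TingYun Server's actual interface.

```python
import time

class BlockingTimeMiddleware:
    """Derive blocking time from a load-balancer timestamp header (illustrative)."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Assumed header format: "X-Request-Start: t=<microseconds since epoch>"
        header = environ.get("HTTP_X_REQUEST_START", "")
        if header.startswith("t="):
            lb_stamp_us = int(header[2:])
            now_us = time.time() * 1_000_000
            blocking_ms = (now_us - lb_stamp_us) / 1000.0
            # A real agent would report this value to the APM backend.
            environ["app.blocking_time_ms"] = blocking_ms
        return self.app(environ, start_response)
```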
TingYun Network can simulate requests to the server from monitoring points across the country and aggregate the access times nationwide; we use it to check, from the client's perspective, whether putting these models into the production system has any impact.
Test method: we send the same traffic to each of the 7 machines through the load balancer provided by the public cloud, and compare their performance by using TingYun Server to monitor the blocking time of each model, while using TingYun Network to observe availability and access performance, to see whether any model's weaker network performance degrades the user experience from the client's perspective. As shown below:
We added these machines behind the load balancer.
The first machine, on port 8080, is the public cloud VM; the two machines on port 30099 are the VXLAN service; the two on port 30098 are the host-gw service; port 8081 is Docker bridge mode; and port 8080 is host mode.
Let's first look at TingYun Network to see whether the overall service is affected.
The chart above is the performance curve; there are fluctuations, but they are within the normal range. Ours is an HTTPS service, and the front-end load balancer handles SSL termination, which consumes some time.
Excluding a few monitoring points with network problems of their own, availability is basically 100%.
Next, look at TingYun Server.
The chart above shows the throughput rate; the average is 425,081 rpm. Across the 7 machines, that is an average of about 60,725 rpm per machine.
Next, the response time graph for the server:
The response time is approximately 0.67 seconds, which is basically consistent with the TingYun Network measurements above.
As the graph shows, most of the time is spent in blocking time, so next we break the blocking time down in detail to get at the network performance we want to evaluate.
Blocking time is defined as the time from the load balancer to the back-end real server. In our scenario it breaks down as follows:
K8s: blocking time = SSL decoding time + load balancer response time + time to forward to the back-end VM + flanneld packet forwarding time
Mesos bridge: blocking time = SSL decoding time + load balancer response time + time to forward to the back-end VM + native NAT forwarding time
Mesos host: blocking time = SSL decoding time + load balancer response time + time to forward to the back-end Docker container
Public cloud VM (control): blocking time = SSL decoding time + load balancer response time + time to forward to the back-end VM
The public cloud VM involves no Docker networking, so its Docker-related consumption can be treated as 0; we only need to compare the blocking time of each test case against it.
Subtracting the cloud VM's blocking time from that of each of the other machines gives the relative network overhead (a minimal sketch of this calculation follows the table below).
For the configurations that ran on two machines, we averaged the two.
| Machine under test | Average blocking time | Docker-related overhead |
| --- | --- | --- |
| K8s flannel VXLAN | 644.778 ms | 6.778 ms |
| K8s flannel host-gw | 641.585 ms | 3.585 ms |
| Mesos host mode | 650.021 ms | 12.021 ms |
| Mesos bridge mode | 643.67 ms | 5.67 ms |
| Public cloud virtual machine (control) | 638 ms | 0 |
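For clarity, the third column is just the subtraction described above. A minimal sketch reproducing it from the measured averages:

```python
# The public cloud VM is the baseline (no Docker networking involved), so the
# per-model network overhead is its average blocking time minus the baseline's.
BASELINE_MS = 638.0  # public cloud virtual machine (control)

avg_blocking_ms = {
    "K8s flannel VXLAN":   644.778,
    "K8s flannel host-gw": 641.585,
    "Mesos host mode":     650.021,
    "Mesos bridge mode":   643.670,
}

for model, blocking in avg_blocking_ms.items():
    overhead = blocking - BASELINE_MS
    print(f"{model}: overhead = {overhead:.3f} ms")
```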
In the table above, host mode surprisingly took the longest; this may be an isolated case. The other results are basically in line with our expectations. In particular, VXLAN mode averages only a little over 6 ms more than the public cloud virtual machine, and since it adapts to more network environments, it should be able to meet our needs.
What happens under even higher concurrency? Can we meet our performance requirements by scaling out?
On this basis, we tested 20 machines in K8s VXLAN mode against 32 cloud hosts in a scenario running at tens of millions of rpm, starting with the traffic split in half and gradually shifting more of it to K8s, observing both the performance impact and the stability. Those results will be revealed in the next installment.
Dear readers, when migrating applications to K8s, Mesos, or native Docker, you can likewise use TingYun's tools to test the impact of an architecture change on a real system.
Original link: http://blog.tingyun.com/web/article/detail/406
Using TingYun Server and TingYun Network to measure the performance of Kubernetes and Mesos under high concurrency