(Repost) Neutron network failure caused by a bulk restart

Incident review

The incident took place in the afternoon: we used Salt to update neutron.conf (log-related configuration items) across a cluster and batch-restarted neutron-openvswitch-agent (hereinafter referred to as neutron-ovs-agent). Soon afterwards, users reported that their cloud hosts had gone down.

On-site investigation showed that the cloud hosts were not actually down; only their network was unreachable. The OvS flow tables on most compute nodes were empty, and both Nova and Neutron were logging ERROR-level messages.

$ ovs-ofctl dump-flows br-bond
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=433.691s, table=0, n_packets=568733, n_bytes=113547542, idle_age=0, priority=1 actions=NORMAL
 cookie=0x0, duration=432.358s, table=0, n_packets=8418, n_bytes=356703, idle_age=0, priority=2,in_port=3 actions=drop

neutron-ovs-agent log:

DeviceListRetrievalError: Unable to retrieve port details for devices because of error: Remote error: TimeoutError QueuePool limit of size 10 overflow reached, connection timed out, timeout 10

neutron-server log:

File "/usr/lib64/python2.6/site-packages/sqlalchemy/pool.py", ... 'TimeoutError: QueuePool limit of size 10 overflow reached, connection timed out, timeout 10\n'

Nova log:

NeutronClientException: Request Failed: internal server error while processing your request.

From the above information we can draw the following conclusions:

    • The Neutron log indicates that neutron-ovs-agent failed to fetch virtual machine port information from neutron-server via RPC, because the number of connections between neutron-server and the database exceeded the connection pool limit.
    • The Nova log indicates that neutron-server could not respond to HTTP requests.
    • The emptied OvS flow tables are what cut the virtual machines off from the network.

The trigger

Before the in-depth analysis, some basic information about the cluster: it runs Icehouse and has 102 compute nodes, running Nova, Neutron, Glance, Ceilometer and other services. To avoid single points of failure we removed Neutron L3 and related services and use a pure layer-2 network; virtual machines reach the outside world through physical routers. In theory, no matter which service misbehaves or even which node goes down, the worst outcome should be that the OpenStack API becomes unavailable or a small number of virtual machines fail, while most VMs keep working.

Experience told us that multiple clusters using this network model had never had a failure of this scale in more than a year of operation. Since the log-related configuration items do not affect the flow tables, our intuition was that the bulk restart itself was what triggered the emptying of the OvS flow tables.

A pitfall in Neutron

Because the flow tables were wiped exactly when neutron-ovs-agent was restarted, and on a compute node only neutron-ovs-agent interacts with OvS, we went back and read through the neutron-ovs-agent restart process. Its logic is as follows (a rough code sketch follows the list):

    1. Clear all existing flow tables
    2. Fetch flow-table-related information from neutron-server via RPC
    3. Create the new flow tables
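
To make this concrete, here is a minimal Python sketch of that pre-Liberty restart sequence. It is not the actual Neutron code: the bridge name br-int, the ovs-ofctl calls driven through subprocess, and the get_flows_from_neutron_server() helper are illustrative stand-ins for the agent's internal OVS library and RPC client.

# Minimal sketch of the old restart logic; not the real neutron-ovs-agent code.
import subprocess

def get_flows_from_neutron_server():
    """Placeholder for the RPC call that asks neutron-server for port/flow info."""
    raise NotImplementedError

def restart_agent_old(bridge="br-int"):
    # Step 1: wipe every flow on the bridge -- from this point on, the VMs
    # behind this bridge have no connectivity.
    subprocess.check_call(["ovs-ofctl", "del-flows", bridge])
    # Step 2: fetch the information needed to rebuild the flows via RPC.
    # If this call fails (neutron-server busy, AMQP down, DB pool exhausted),
    # the bridge is left empty and the VMs stay isolated.
    flows = get_flows_from_neutron_server()
    # Step 3: reprogram the flows.
    for flow in flows:
        subprocess.check_call(["ovs-ofctl", "add-flow", bridge, flow])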

It is not hard to see that if step 2 fails, for example because neutron-server is busy, the message middleware is down, or the database is unavailable, the flow tables cannot be rebuilt and, in the worst case, the virtual machine network is paralysed. In fact, the community is aware of a similar problem: restarting neutron-ovs-agent causes a temporary network interruption.

Restarting neutron OvS agent causes network hiccup by throwing away all flows

The community's way of handling this is to add a configuration item, drop_flows_on_start, which defaults to False, to avoid the problem. The patch was merged in Liberty, and the restart logic becomes the following (again, a rough sketch follows the list):

    1. Tag the flows with a cookie
    2. Fetch the new flows and program them into OvS
    3. Delete the old flows identified by their cookies
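
Purely as an illustration, and assuming (as the dump-flows output above suggests) that the pre-existing flows carry the default cookie 0x0, one way to realize that sequence is sketched below. Again this is not the real Neutron code; the cookie value and the get_flows_from_neutron_server() helper are hypothetical.

# Minimal sketch of the cookie-based restart logic; not the real Neutron code.
import subprocess

def get_flows_from_neutron_server():
    """Same placeholder RPC helper as in the previous sketch."""
    raise NotImplementedError

def restart_agent_with_cookie(bridge="br-int"):
    run_cookie = 0x5a5a5a5a  # illustrative cookie identifying this agent run
    # Steps 1-2: install the new flows first, tagged with this run's cookie.
    # The pre-existing flows keep forwarding traffic in the meantime, so the
    # VMs never lose connectivity.
    for flow in get_flows_from_neutron_server():
        subprocess.check_call(
            ["ovs-ofctl", "add-flow", bridge, "cookie=%#x,%s" % (run_cookie, flow)])
    # Step 3: delete only the stale flows, identified by their old cookie
    # (here the default 0x0), using an exact cookie/mask match.
    subprocess.check_call(["ovs-ofctl", "del-flows", bridge, "cookie=0x0/-1"])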

In short, the way neutron-ovs-agent handles flow tables during a restart leaves a latent hazard: the flow tables of a compute node can be wiped clean, turning its virtual machines into isolated islands, and a variety of factors can trip this hazard.

Maximum number of connections

Now let us explain why the batch restart of neutron-ovs-agent tripped this hazard. During the restart, the Neutron log reported the following error:

'TimeoutError: QueuePool limit of size 10 overflow reached, connection timed out, timeout 10'

This log means that the number of connections between neutron-server and the database exceeded the limit of the client-side connection pool. When neutron-ovs-agent restarts in bulk, hundreds of concurrent RPC requests hit neutron-server asking for the information needed to rebuild the flow tables, and the number of database connections neutron-server needs far exceeds the pool limit of roughly 30 (pool size plus overflow). A large number of requests therefore fail, the compute nodes cannot obtain the flow-table information and cannot rebuild their flow tables, and so their flow tables stay empty.

Solving the TimeoutError caused by the QueuePool limit is straightforward: SQLAlchemy exposes two configuration items [1], and the following two parameters can be raised appropriately.

[database]
max_overflow =
max_pool_size =

Note: the MySQL server's default maximum number of connections is 100, so as the cluster grows you also need to raise the MySQL server's connection limit accordingly.
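
To make the failure mode concrete, here is a small self-contained sketch, outside OpenStack, of how SQLAlchemy's QueuePool produces exactly this TimeoutError once pool_size plus max_overflow connections are checked out. The connection URL and the numbers are placeholders chosen to mirror the error message above.

# Illustrative only: reproducing the QueuePool TimeoutError outside OpenStack.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql://user:password@127.0.0.1/neutron",  # placeholder connection URL
    pool_size=10,     # analogous to max_pool_size in neutron.conf [database]
    max_overflow=20,  # analogous to max_overflow
    pool_timeout=10,  # seconds to wait for a free connection
)

# Holding pool_size + max_overflow (here 30) connections exhausts the pool;
# the next checkout waits pool_timeout seconds and then raises
# sqlalchemy.exc.TimeoutError: "QueuePool limit of size 10 overflow ... reached".
held = [engine.connect() for _ in range(30)]
one_too_many = engine.connect()  # raises TimeoutError after 10 seconds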

Performance issues for processes

However, this did not fully resolve the problem. Note the Nova error log:

nova.compute.manager NeutronClientException: Request Failed: internal server error while processing your request.

This is an HTTP request from Nova to neutron-server, and Neutron returned an internal server error. Internal server error corresponds to HTTP status code 500, which means neutron-server could not respond to Nova's requests. Why not? During the bulk restart we observed that the CPU utilization of the neutron-server process was pinned at 100%, meaning neutron-server was fully occupied processing the flood of neutron-ovs-agent requests and had no capacity left to handle Nova's HTTP requests.

The solution to this problem is also simple: run more neutron-server worker processes. In fact, since Icehouse, Nova has used nova-conductor to absorb the large number of RPC requests from nova-compute (nova-compute accesses the database through nova-conductor), and nova-conductor runs multiple worker processes, with the number of workers equal to the number of logical cores of the server's CPU. We can do the same for neutron-server:

$ workers=`cat /proc/cpuinfo | grep processor | wc | awk '{print $1}'`

[DEFAULT]
api_workers = $workers
rpc_workers = $workers

Python Concurrency & IO

It is worth asking: are hundreds of concurrent requests really that many for a dual-socket, 12-core server with 24 hardware threads in total? The CPU utilization of the neutron-server process peaked at only 100%, which means neutron-server was using just one hardware thread of the server. Here we have to mention coroutines [2]: with this kind of pseudo-concurrent "user-space thread", only one coroutine is executing at any given moment, so at any time at most one CPU core is being used, and the processes of all OpenStack components are full of coroutines (with only one main thread). As a single process, therefore, neutron-server cannot exploit multiple cores to improve its ability to handle concurrency.
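
As a rough illustration of this cooperative model (a sketch, not OpenStack code; the pool size and request count are arbitrary): hundreds of eventlet green threads run inside one OS process, so no matter how many are spawned the process never uses more than one core.

# Illustrative sketch of eventlet green threads (coroutines) in one process.
import eventlet
eventlet.monkey_patch()  # swap blocking stdlib I/O for cooperative versions

def handle_request(i):
    # A green thread yields to the others whenever it hits green I/O;
    # eventlet.sleep() stands in for waiting on a socket or an RPC reply.
    eventlet.sleep(0.1)
    return "request %d handled" % i

pool = eventlet.GreenPool(size=1000)
results = list(pool.imap(handle_request, range(500)))
# 500 "concurrent" requests, yet they all share a single OS thread, so the
# process can never use more than 100% of one CPU core.
print(len(results))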

Traditional web servers such as Apache and Nginx use multiple processes and multiple threads to increase concurrency (context-switching overhead: process > thread > coroutine). So the question arises: within a single process, could we improve Neutron's concurrency by replacing coroutines with threads? The answer is no, and the root cause is Python's GIL [3], commonly known as the Python global interpreter lock. For a Python program, a single process can occupy only one physical thread at any moment, so only multiple processes can take full advantage of the server's many cores and hardware threads.
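
A small self-contained sketch of that point (timings are indicative only): a pure-Python CPU-bound loop gets no faster when spread across threads because of the GIL, but it does when spread across processes.

# Illustrative GIL demo: threads vs. processes on CPU-bound work.
import time
from threading import Thread
from multiprocessing import Process

def burn():
    # Pure-Python CPU-bound loop; the GIL is held while it runs.
    total = 0
    for i in range(10 ** 7):
        total += i

def timed(workers):
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == "__main__":
    # Takes roughly as long as running burn() four times in a row.
    print("4 threads:   %.2fs" % timed([Thread(target=burn) for _ in range(4)]))
    # Roughly 4x faster on a machine with at least four free cores.
    print("4 processes: %.2fs" % timed([Process(target=burn) for _ in range(4)]))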

Let's go a step further: if the CPU were powerful enough, would that solve neutron-server's single-process concurrency problem? I don't think so; it comes back to I/O. Although eventlet's monkey patch [4] replaces the standard library's blocking socket module with its own non-blocking version, any blocking I/O it cannot patch still blocks the entire process (the main thread). OpenStack accesses MySQL through the libmysqlclient C library, whose socket calls eventlet's monkey patch cannot reach, so CRUD operations against MySQL block the main thread. This means neutron-server is also prone to performance bottlenecks when accessing the database.

One way to relieve the above concurrency problems is to start more API workers and RPC workers; in newer OpenStack releases the default number of workers is the number of logical cores of the server. Another option is to serve the API with Apache instead of the built-in Python HTTP server, which the various components have gradually been adding support for.

References
      1. http://docs.sqlalchemy.org/en/rel_0_9/core/pooling.html
      2. http://www.dabeaz.com/coroutines/Coroutines.pdf
      3. http://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil
      4. http://stackoverflow.com/questions/11977270/monkey-patching-in-python-when-we-need-it

Original article: http://wsfdl.com/openstack/2015/10/10/%E4%B8%80%E6%AC%A1%E6%89%B9%E9%87%8F%E9%87%8D%E5%90%AF%E5%BC%95%E5%8F%91%e7%9a%84neutron%e7%bd%91%e7%bb%9c%e6%95%85%e9%9a%9c.html

