NetEase OpenStack Deployment and Operations in Practice

Source: Internet
Author: User
Tags: cloud hosting, truncated, haproxy, aws cloudformation

Since the inception of OpenStack in 2010, more than 200 companies have joined its projects, and more than 17,000 developers currently participate in them; both numbers keep growing. As an open source IaaS implementation, OpenStack is being adopted by more and more enterprises. The NetEase private cloud team shares here its experience building a cloud computing management platform on OpenStack, and looks forward to exchanging ideas with other OpenStack users.

This article introduces the cloud management platform that NetEase developed on top of OpenStack, together with the problems encountered and the experience gained during development, operation and maintenance. As a large Internet company, NetEase's IT infrastructure has to support production, development, testing and management workloads, and requirements change almost daily, so the internal infrastructure must be flexible and robust enough to meet the actual needs of the various departments and teams. Through this article, the NetEase private cloud platform team also hopes to communicate with the wider community of OpenStack users and share the results obtained in real projects.

About OpenStack

OpenStack is an open source IaaS implementation consisting of a number of interrelated sub-projects covering compute, storage and networking. Released under the Apache license, the project has attracted more than 200 companies since its inception in 2010, including AT&T, AMD, Cisco, Dell, IBM, Intel, Red Hat and others. More than 17,000 developers from 139 countries currently participate in the project, and the number keeps growing.

OpenStack is compatible with a subset of the AWS interfaces and also provides its own, more powerful OpenStack-style RESTful APIs. Compared with other open source IaaS offerings, its loosely coupled architecture, high scalability, distributed design, pure Python implementation and friendly, active community have made it very popular; the design summit held every six months attracts developers, vendors and customers from all over the world.

The main sub-projects of OpenStack are listed below; NetEase Private Cloud uses four of them: Nova, Glance, Keystone and Neutron.

    • Compute (Nova) provides compute virtualization services and is the heart of OpenStack, responsible for managing and creating virtual machines. It is designed to be easy to scale out, supports a variety of virtualization technologies, and can be deployed on standard hardware.
    • Object Storage (Swift) provides object storage services; it is a distributed, scalable, multi-replica storage system.
    • Block Storage (Cinder) provides block storage services, offering persistent block-level storage to OpenStack virtual machines. It supports a variety of storage backends, including Ceph and EMC.
    • Networking (Neutron) provides network virtualization services; it is a pluggable, scalable, API-driven service.
    • Dashboard (Horizon) provides a graphical console that allows users to easily access, use and maintain resources in OpenStack.
    • Image (Glance) provides the image service; it is designed to discover, register and deliver virtual machine disk and image files, and supports multiple storage backends.
    • Telemetry (Ceilometer) provides the usage metering service, which makes it easy to implement billing for OpenStack.
    • Orchestration (Heat) integrates many of the OpenStack components and, much like AWS CloudFormation, lets users manage resources through templates.
    • Database (Trove) is database-as-a-service built on OpenStack.

Overview of NetEase Private Cloud Platform

Figure 1: NetEase Private Cloud architecture

The NetEase private cloud platform is developed by the NetEase Hangzhou Research Institute. It mainly provides infrastructure resources, data storage and processing, application development and deployment, and operations management functions, to meet the company's product testing and go-live requirements.

Figure 1 shows the overall architecture of the NetEase private cloud platform. The platform can be divided into three categories of services: core infrastructure services (IaaS), basic platform services (PaaS), and operations management and support services. It currently comprises 15 services: cloud host (virtual machine), cloud network, cloud disk, object storage, object cache, relational database, distributed database, full-text search, message queue, video transcoding, load balancing, container engine, cloud billing, cloud monitoring and the management platform. The platform draws on the latest developments in cloud computing; the cloud host and cloud network services are built on the Keystone, Glance, Nova and Neutron components of the OpenStack community.

In order to integrate with the other services of the NetEase private cloud platform (cloud disk, cloud monitoring, cloud billing, etc.) and to meet the specific requirements of the company's products and of operations management, our team has independently developed more than 20 new features on top of the community OpenStack releases, including cloud host resource quality guarantees (compute, storage and network QoS), chunked image storage, cloud host heartbeat reporting, and tenant intranet isolation in the flat-DHCP network mode. At the same time, the team has accumulated deployment and operations guidelines, as well as experience upgrading to new community releases, from its daily work with OpenStack. Over more than two years of development, the private cloud OpenStack team has adhered to an open philosophy of "coming from the community, giving back to the community", actively contributing new features and bug fixes to help the OpenStack community grow. In those two years the team has submitted nearly 100 feature and bug-fix commits to the community and fixed more than 50 community bugs; these contributions span the Essex, Folsom, Havana, Icehouse and Juno releases.

Thanks to the increasingly stable and mature OpenStack, the private cloud platform has been running stably for more than two years, serving over 30 NetEase Internet and gaming products. From an application point of view, the OpenStack-based NetEase private cloud platform has achieved the following goals:

    1. Improved the utilization of the company's infrastructure resources, thereby reducing hardware costs. Taking physical server CPU utilization as an example, the private cloud platform raised average CPU utilization from 10% to 50%.
    2. Improved the level of automation in infrastructure resource management and operations, thereby reducing operation and maintenance costs. With web-based self-service resource requests and allocation, plus automatic deployment by the cloud platform, the number of system operations staff was reduced by 50%.
    3. Increased the flexibility of infrastructure resource usage, thereby improving resilience to fluctuations in product traffic. By virtualizing the physical infrastructure into a resource pool and combining effective capacity planning with on-demand use, the private cloud platform adapts well to bursts in product traffic.

Introduction to the NetEase OpenStack Deployment Reference Scheme

In the production environment, in order to balance performance and reliability, the Keystone backend uses MySQL to store user information and memcache to hold tokens. To reduce the access pressure on Keystone, the keystoneclient in all services (Nova, Glance, Neutron) is configured to use memcache as a token cache.
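The relevant fragments are sketched below (section and option names follow the Havana-era layout and may differ slightly in other releases; hosts, ports and passwords are placeholders):

# keystone.conf (sketch): user data in MySQL, tokens in memcache
[sql]
connection = mysql://keystone:KEYSTONE_DBPASS@DB_HOST/keystone
[token]
driver = keystone.token.backends.memcache.Token
[memcache]
servers = MEMCACHE_HOST1:11211,MEMCACHE_HOST2:11211

# nova.conf / glance-api.conf / neutron.conf (sketch): let the auth_token middleware cache validated tokens in memcache
[keystone_authtoken]
auth_host = KEYSTONE_VIP
memcached_servers = MEMCACHE_HOST1:11211,MEMCACHE_HOST2:11211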

NetEase private cloud needs to be deployed across multiple server rooms that are naturally isolated from each other geographically, which gives the upper-layer applications a natural form of disaster recovery. In addition, to meet its functional and operational requirements, NetEase private cloud has to support two network modes at the same time: nova-network and Neutron. In response to these requirements, we designed the enterprise multi-region deployment scheme shown in Figure 2. Overall, the regions are relatively independent of each other but can interoperate over the intranet. Each region contains a complete OpenStack deployment, so it can use its own image service and its own network mode; for example, region A can use nova-network while region B uses Neutron. To give users single sign-on, all regions share one Keystone. Regions are divided according to the dominant network mode and the geographical location.

Figure 2: Multi-region deployment approach

A typical OpenStack deployment divides the hardware into compute nodes and control nodes. To make full use of the hardware, we tried to make the deployment as symmetric as possible, so that taking any single node offline does not affect the overall service. We therefore divide the hardware into two categories: compute nodes and control compute nodes. Compute nodes run nova-network, nova-compute, nova-api-metadata and nova-api-os-compute. Control compute nodes run, in addition to all the services of a compute node, nova-scheduler, nova-novncproxy, nova-consoleauth, glance-api, glance-registry and Keystone, as shown in Figure 3.

The services exposed through external APIs are nova-api-os-compute, nova-novncproxy, glance-api and Keystone. These services are stateless and can easily be scaled out, so they are deployed behind HAProxy for load balancing and made highly available with keepalived. To guarantee service quality and ease of maintenance, we do not run the monolithic nova-api; instead, nova-api-os-compute and nova-api-metadata are managed separately. As external dependencies, NetEase private cloud deploys a highly available RabbitMQ cluster, master-slave MySQL, and a memcache cluster.
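As a minimal illustration, an haproxy.cfg fragment for one of these APIs might look roughly like the following (addresses, server names and check parameters are placeholders; glance-api, Keystone and novncproxy are configured analogously):

listen nova-api-os-compute
    bind API_VIP:8774
    balance roundrobin
    server ctl-compute-01 10.120.1.11:8774 check inter 2000 rise 2 fall 3
    server ctl-compute-02 10.120.1.12:8774 check inter 2000 rise 2 fall 3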

Figure 3: Compute nodes and control compute nodes

For network planning, NetEase private cloud mainly uses nova-network's FlatDHCPManager in multi-host mode and divides the network into several VLANs, used respectively for the virtual machine fixed-IP network, the intranet floating-IP network and the external network.
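The corresponding nova.conf fragment has roughly the following shape (interface and bridge names are placeholders):

network_manager = nova.network.manager.FlatDHCPManager
multi_host = True
flat_network_bridge = br100
# VLAN interface carrying the virtual machine fixed-IP network
flat_interface = eth1
# interface carrying the floating-IP / external traffic
public_interface = eth2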

For operations, NetEase uses its own monitoring and alerting platform, similar in function to Nagios but more powerful. The most important alerts are log monitoring and process monitoring: log monitoring ensures that service anomalies are discovered as early as possible, while process monitoring ensures the services keep running. In addition, NetEase private cloud uses Puppet for automatic deployment and StackTach to help locate bugs.

OpenStack Individual Component Configurations

OpenStack Havana has hundreds, if not thousands, of configuration items. Most of them can be left at their default values; merely understanding the meaning of every item would be enough to overwhelm operations staff, especially those not familiar with the source code. Some of the more critical configuration items used in NetEase's private cloud are listed below, together with an explanation of how they affect the functionality, security and performance of the services.

Nova Key Configuration

my_ip = <intranet address>

This item is used to generate the iptables rule on the host that forwards Nova metadata API requests. If it is misconfigured, virtual machines cannot retrieve EC2/OpenStack metadata from the well-known address 169.254.169.254. The iptables rule has the following shape:

-A nova-network-PREROUTING -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT \
    --to-destination ${my_ip}:8775

my_ip is also used during resize, cold migration and similar operations for data transfer with the destination host. Its default value is the host's external IP address, and it is recommended to change it to an intranet address to avoid potential security risks.

metadata_listen = <intranet address>

This is the IP address the nova-api-metadata service listens on. As can be seen from the iptables rule above, it is tied to my_ip, so keeping the two consistent is the wisest choice.
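In other words, on each compute node the two options point at the same intranet address, for example (the address is a placeholder):

# nova.conf on a compute node (sketch)
my_ip = 10.120.1.21
metadata_listen = 10.120.1.21
metadata_listen_port = 8775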

novncproxy_base_url = …
vncserver_proxyclient_address = ${private_ip_of_compute_host}
vncserver_listen = ${private_ip_of_compute_host}
novncproxy_host = ${private_ip_of_host}

We deploy novncproxy processes on only a subset of nodes and register them behind HAProxy, which makes the noVNC proxy highly available. Several HAProxy processes are in turn made highly available with keepalived, so only the keepalived-managed virtual IP address needs to be exposed externally.
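A sketch of the two pieces follows (the virtual IP, node addresses, interface and priority are placeholders, not our actual values):

# keepalived.conf (sketch): float a virtual IP across the HAProxy nodes
vrrp_instance VI_NOVNC {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        10.120.0.100
    }
}

# haproxy.cfg (sketch): forward the virtual IP to the novncproxy processes
listen novncproxy
    bind 10.120.0.100:6080
    balance source
    server ctl-compute-01 10.120.1.11:6080 check
    server ctl-compute-02 10.120.1.12:6080 check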

The benefits of this deployment approach are:

1) The noVNC proxy service is highly available.

2) The external network addresses of the cloud platform nodes are not exposed.

3) The noVNC proxy service is easy to scale out.

But there are also deficiencies:

1) The VNC server of each virtual machine listens on the intranet IP address of the compute node it runs on; if that host's network isolation is broken, the VNC endpoints of all virtual machines on it are exposed.

2) There is a problem with live migration, because the IP address that the VNC server listens on does not exist on the destination compute node. The Nova community has already solved this problem, and we believe the fix will land in the J (Juno) release soon.

resume_guests_state_on_host_boot = True

When the nova-compute process starts, it brings back up the virtual machines that should be running, i.e. those whose record in the Nova database says "running" but which are not actually running on the hypervisor. This item is especially useful when a compute node reboots: all virtual machines on the node are started automatically, saving the operations staff the time of doing it by hand.

api_rate_limit = False

Do not rate-limit API access. When rate limiting is enabled, the number of concurrent API accesses is capped, and the cap has to be chosen based on the cloud platform's access volume and the number and capacity of the API processes; with rate limiting turned off, API requests simply take longer to process under heavy concurrency.

osapi_max_limit = 5000

The maximum number of items the nova-api-os-compute API returns in a single response; if set too small, part of the response data is truncated.

scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, RamFilter, ComputeFilter, ImagePropertiesFilter, JsonFilter, EcuFilter, CoreFilter

The filters available to nova-scheduler. RetryFilter skips compute nodes on which a create has already been attempted and failed, preventing endless rescheduling loops. AvailabilityZoneFilter restricts scheduling to the availability zone specified by the user, so the user's virtual machine is not created in an unspecified AZ. RamFilter filters out compute nodes with insufficient free memory, and CoreFilter filters out nodes with insufficient CPU cores. EcuFilter is a filter we developed ourselves, matching our CPU QoS feature, to filter out nodes that do not have enough ECUs left. ImagePropertiesFilter filters out compute nodes that do not satisfy the image's requirements; for example, an image built for QEMU virtual machines cannot be used on an LXC compute node. JsonFilter allows custom node selection rules, such as never scheduling to certain AZs or placing a virtual machine in the same AZ as certain existing ones. Other filters can be enabled as needed.

running_deleted_instance_action = reap

The action nova-compute's periodic task takes when it finds a virtual machine that has been deleted in the database but still exists on the compute node's hypervisor (the so-called "orphan virtual machine" audit). The recommended values are log or reap. In log mode the operations staff have to find the orphan machines from the log records and handle them manually, which is the safer choice because it prevents a user's virtual machine from being wiped out by an unknown anomaly or bug in the Nova service; reap mode saves the operator's manual intervention time.

until_refresh = 5

The threshold for resynchronizing a user's recorded quota usage with the actual usage in the instances table, i.e. after how many updates to the user's quota usage the recorded value is forcibly synchronized with the actual usage.

max_age = 86400

The time-based synchronization interval between the recorded quota usage and the actual usage, i.e. if more than this many seconds have passed since the quota usage record was last updated, it is synchronized with the actual usage the next time it is updated.

As is well known, the open source Nova project still has many unresolved quota bugs. The two configuration items above solve, to a large extent, the problem of the recorded quota usage not matching the actual usage, but they also add some load on the database, so they need to be tuned according to the actual deployment.

Compute Node Resource Reservation

vcpu_pin_set = 4-$

The range of physical CPUs that virtual machine vCPUs may be bound to (here, CPU 4 through the last CPU); this prevents virtual machines from competing with host processes for CPU resources. The recommended practice is to reserve the first few physical CPUs for the host, give all the remaining CPUs to virtual machines, and use cgroups or kernel boot parameters to keep host processes off the CPUs dedicated to virtual machines.
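One possible way to do the latter with the cgroup v1 cpuset controller is sketched below (paths, CPU range and PID are illustrative; the isolcpus= kernel boot parameter is an alternative approach):

# confine host daemons to the reserved CPUs 0-3 via a cpuset cgroup
mkdir -p /sys/fs/cgroup/cpuset/host
echo 0-3 > /sys/fs/cgroup/cpuset/host/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/host/cpuset.mems
# move a host process (the PID shown is illustrative) into the group
echo 12345 > /sys/fs/cgroup/cpuset/host/tasks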

cpu_allocation_ratio = 4.0

Physical CPU overcommit ratio; the default is 16, and hyper-threads count as physical CPUs. The appropriate value has to be chosen based on the actual workload and the capacity of the physical CPUs.

ram_allocation_ratio = 1.0

Memory overcommit ratio; the default is 1.5. Memory overcommit is not recommended in production environments.

reserved_host_memory_mb = 4096

Memory reserved for the host; this part of memory cannot be used by virtual machines.

reserved_host_disk_mb = 10240

Disk space reserved for the host; this space cannot be used by virtual machines.
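As a rough worked example with hypothetical hardware, putting the items in this subsection together: on a host with 24 hyper-threaded cores and 128 GB of RAM, cpu_allocation_ratio = 4.0 allows the scheduler to place up to 24 × 4 = 96 vCPUs, while ram_allocation_ratio = 1.0 together with reserved_host_memory_mb = 4096 leaves roughly 131072 − 4096 = 126976 MB (about 124 GB) of memory that can be allocated to virtual machines.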

service_down_time = 120

The threshold, in seconds, after which a service is considered down: if the Nova service on a node has not reported a heartbeat to the database within this time, the API service treats it as offline. Setting it too short or too long leads to false judgements.

rpc_response_timeout = 300

The RPC call timeout. Because a single Python process cannot be truly concurrent, an RPC request may not be answered promptly, especially when the target node is running a lengthy periodic task, so the timeout has to balance how long callers can wait against how slowly responses may arrive.

multi_host = True

Whether to enable nova-network's multi-host mode; it must be set to True for a multi-node deployment.

Keystone

Keystone has fewer configuration items; the main decision is which backend driver to use for storing tokens, typically either an SQL database or memcache. SQL gives persistent storage, while memcache is faster; in particular, when a user changes their password all of that user's expired tokens have to be deleted, and in that scenario the speed difference between SQL and memcache is significant.

Glance

Glance configuration covers two services, glance-api and glance-registry:

workers = 2

The number of glance-api worker (child) processes. If set to 0 there is only the master process; if set to 2, the master process plus two child processes handle requests concurrently. The value should be chosen based on the compute capacity of the physical node and the request volume of the cloud platform.

api_limit_max = 1000

Has the same meaning as osapi_max_limit in the Nova configuration.

limit_param_default = 1000

The default number of items returned in a single response when the request does not specify a limit parameter. The default is 25; if set too small, response data may appear truncated.

Version Selection, Configuration and Performance Tuning of OpenStack's Underlying Software

Selection of Virtualization Technology

In the private cloud platform architecture, OpenStack relies on some underlying software, such as the virtualization software, the virtualization management software and the Linux kernel. The stability and performance of this software determine the stability and performance of the entire cloud platform, so its version selection and configuration tuning are also an important part of developing NetEase private cloud.

For the NetEase private cloud platform we chose KVM, the virtualization technology best integrated with the Linux kernel. Compared with Xen, KVM is much more closely tied to the Linux kernel and easier to maintain. Having chosen KVM, the virtualization management layer uses libvirt, the compute driver the OpenStack community maintains for KVM; libvirt is an open source virtualization management library with a very active community that supports a wide range of hypervisors, including KVM.
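In nova.conf this choice boils down to roughly the following (Havana-era naming; option locations changed in later releases):

compute_driver = nova.virt.libvirt.LibvirtDriver
libvirt_type = kvm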

On the host side, NetEase uses open source Debian as the host operating system, with the stable Debian wheezy APT sources; the KVM and libvirt packages are the versions from the Debian wheezy repositories:

qemu-kvm 1.1.2+dfsg-6+deb7u3
libvirt-bin 0.9.12

Kernel selection

In selecting the kernel, we mainly considered the following two factors:

    • Stability: stability has been a basic principle of NetEase private cloud development from the very beginning of the platform. Since we use Debian Linux, the native Debian kernel is, relatively speaking, the more stable choice and was therefore our first candidate.
    • Functional requirements: in NetEase's custom development, in order to guarantee the service quality of virtual machines, we implemented CPU QoS and disk QoS, which rely on the cpu and blkio cgroup subsystems in the kernel, so the corresponding cgroup options have to be enabled. In addition, after overall consideration, NetEase private cloud also supports LXC, a container-level virtualization technology, which depends not only on cgroups but also on the namespace features of the Linux kernel.

Taking these factors into account, we chose the Linux 3.10.40 kernel source from the Debian community, enabled the cpu/mem/blkio cgroup options as well as the user namespace options, and compiled a kernel matching the needs of NetEase private cloud. In practice, with this selection of underlying software, NetEase private cloud has run quite stably, and we update these components in a timely manner.
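The corresponding kernel configuration symbols are roughly the following (a sketch; exact option names and dependencies vary somewhat between kernel versions):

# cgroup controllers used by CPU and disk QoS
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_CPUSETS=y
CONFIG_MEMCG=y
CONFIG_BLK_CGROUP=y
CONFIG_BLK_DEV_THROTTLING=y
# namespaces needed by LXC
CONFIG_NAMESPACES=y
CONFIG_USER_NS=y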

Configuration Optimizations

Once the stability of NetEase private cloud was assured, we started performance tuning. Here we referred to some of IBM's best practices and optimized the configuration of CPU, memory, I/O and other aspects. Overall, while putting stability first, NetEase private cloud also actively borrows the industry's best practices to optimize the overall performance of the platform.

CPU Configuration Optimization

In order to guarantee the computing power of cloud hosts, NetEase private cloud implements CPU QoS, concretely by scheduling CFS time slices uniformly and by pinning processes to CPUs (process pinning).

Referring to IBM's analysis, we understood the pros and cons of process pinning and verified by testing that different binding schemes produce significantly different cloud host performance. For example, the performance of a cloud host whose 2 vCPUs are bound to non-hyper-threaded cores on different NUMA nodes differs by 30%~40% from one whose vCPUs are placed on a pair of adjacent hyper-threaded cores (measured with the SPEC CPU2006 suite). On the other hand, CPU0 handles interrupt requests and already carries a heavier load, so it is not suitable for cloud hosts. Combining these considerations with several rounds of test verification, we finally decided to reserve CPUs 0-3 and let the host kernel schedule the cloud hosts across the remaining CPUs. The final CPU configuration is as follows (libvirt XML):

<vcpu placement='static' cpuset='4-23'>1</vcpu>
<cputune>
    <shares>1024</shares>
    <period>100000</period>
    <quota>57499</quota>
</cputune>

Memory Configuration Optimization

In terms of memory configuration, NetEase private cloud's practice is to turn off KVM memory sharing (KSM) and enable transparent huge pages:

echo 0 > /sys/kernel/mm/ksm/pages_shared
echo 0 > /sys/kernel/mm/ksm/pages_sharing
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

Tests with SPEC CPU2006 show that these settings improve cloud host CPU performance by about 7%.

I/O Configuration Optimization

1) The configuration optimizations for disk I/O mainly include the following:

KVM disk cache mode: referring to IBM's analysis, NetEase private cloud uses the none cache mode.
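In the libvirt domain XML this corresponds to something like the following (the source path and image format are placeholders):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/var/lib/nova/instances/INSTANCE_UUID/disk'/>
  <target dev='vda' bus='virtio'/>
</disk>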

Disk I/O scheduler: at present the host disks in NetEase private cloud use the CFQ scheduling policy. In practice we found that with CFQ, lower-end disks are prone to excessively long I/O scheduling queues and 100% utilization. We will subsequently follow IBM's practice, tune the CFQ parameters and test the deadline scheduling policy.
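Switching a single disk to the deadline scheduler for such a test can be done at runtime, for example (the device name is illustrative):

cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler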

Disk I/O QoS: facing the increasingly acute shortage of disk I/O resources, NetEase private cloud implemented disk I/O QoS, mainly by setting the throttle parameters of the blkio cgroup. Because libvirt 0.9.12 applies disk I/O limits inside QEMU and suffers from I/O jitter there, our implementation writes to the cgroup directly through a command executed by Nova. We also developed and submitted a patch for the blkiotune throttle interface to the libvirt community (included since libvirt 1.2.2) to solve this problem properly.
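For reference, the underlying cgroup v1 interface looks roughly like this (the cgroup path, device major:minor numbers and the 50 MB/s limit are illustrative):

echo "253:0 52428800" > /sys/fs/cgroup/blkio/libvirt/qemu/instance-00000001/blkio.throttle.read_bps_device
echo "253:0 52428800" > /sys/fs/cgroup/blkio/libvirt/qemu/instance-00000001/blkio.throttle.write_bps_device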

2) Configuration optimization of network I/O

We mainly enabled vhost_net mode to reduce network latency and increase throughput.
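Enabling it amounts to loading the vhost_net kernel module on the host (modprobe vhost_net) and letting the virtio NIC use the vhost backend, for example in the libvirt interface definition (the bridge name is illustrative):

<interface type='bridge'>
  <source bridge='br100'/>
  <model type='virtio'/>
  <driver name='vhost'/>
</interface>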

Operation and maintenance experience

Usage experience

    • Bugs in open source software are unavoidable, but a new version is usually much better to work with than an old one, especially for a fast-growing project like OpenStack as it moved through the Essex, Folsom and Havana releases. We therefore recommend that OpenStack users follow the community releases promptly and stay in sync with the community.
    • Do not lightly apply so-called functional or performance "optimizations" to the community code, especially without first discussing them with community experts; otherwise such "optimizations" are likely to turn into points of failure or performance bottlenecks, and may ultimately leave you unable to stay in sync with the community. After all, the capability and knowledge of a single company or team, especially a small one, can hardly compare with the hundreds of experts of all kinds in the community.
    • Refer widely to the deployment schemes shared by large companies, and try not to work behind closed doors. Especially for open source software, with so many different companies, teams, usage scenarios and peripheral components, drawing on industry practice is the best approach.
    • Some details can be implemented in many ways, each with its own pros and cons; they must be fully reasoned about, analyzed, tested and verified before being deployed to the production environment.
    • All deployment schemes and feature designs must take smooth upgrades into account. Even if you are told that an upgrade may stop the service, this situation should still be avoided, because the impact of a service stop is hard to bound.

Operation and Maintenance guidelines

OpenStack is, in the end, a back-end system service, so the basic principles of system operations all apply. Below is a brief summary of experience drawn from problems encountered in actual operations:

      • Configuration defaults that do not match the actual environment cause all sorts of problems. Network-related settings in particular are strongly tied to the hardware, and the production and development environments use heterogeneous hardware, so some defaults that work in development are unusable in production. Countermeasure: every release must be tested in an environment with the same hardware as the online environment before going live.
      • Do capacity planning: keep the total allocated quota below the total capacity of the cloud platform, otherwise all sorts of problems will arise and operations and development will waste a lot of effort locating and analyzing them.
      • With so many configuration items it is easy to make mistakes, so changes must be carefully reviewed with the developers before going online, and should first be checked with Puppet's no-op mode (see the example after this list).
      • Plan the network well in advance: fixed IPs, floating IPs, the number of VLANs. Expanding the network later is difficult and risky, so planning ahead is the safest; as a principle, bigger is better than smaller, and more is better than fewer.
      • Do network isolation properly, otherwise the users' network security cannot be guaranteed.
      • Pay attention to information security. This is a perennial issue that every platform faces, but it still deserves attention: once there is a security hole, all virtual machines face a serious threat.
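The Puppet dry run mentioned above is simply the agent's no-op mode, which reports what would change without applying anything:

puppet agent --test --noop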
