The technical evolution of Meituan Cloud: make the cloud host stable first, then talk about the rest

Source: Internet
Author: User
Tags: execution, redis, web services, cloudstack, openvswitch

In the first half of 2013, Meituan launched its public cloud service, Meituan Cloud. The product was initially supported by Meituan's system operations team. In June 2014 the Meituan Cloud business was spun out of the system operations team as an independent unit dedicated to developing and operating cloud computing products.

Recently, an editor from InfoQ China spoke with three engineers from the Meituan Cloud business unit about the current state of the cloud, how it has evolved, and what is planned next. The following is based on that conversation.

Background

The now-independent Meituan Cloud business unit currently has more than 10 engineers. Although the department is independent, it still works closely with the system operations team. Two-thirds of the Meituan system operations team were originally development engineers, and all of the operations engineers can write code, so development and operations engineers are able to collaborate closely.

Work on the first version of Meituan Cloud began in July 2012, and it was built as a private cloud platform. The first version took about 2 months to develop, and it then took about 10 months to migrate all of Meituan's business onto the cloud platform. Today, apart from Hadoop and the databases, which still run on physical machines, all of Meituan's business runs on Meituan Cloud, and the internal development and test platforms run on Meituan's office cloud platform.

In May 2013, Meituan Cloud began offering public cloud services, and so far it offers only cloud host products. Object storage, Redis, MySQL, load balancing, monitoring, VPC and other services are also under development. Some of these are already used internally and by some test customers, and they will be opened to the public gradually according to product maturity and customer demand.

Stability is the core value of a public cloud service. Since the public cloud opened, the main work of the Meituan Cloud team has been improving the stability of the cloud host and refining cloud host templates, backup, monitoring, and security. Meituan Cloud expects to release its third version update in September 2014, continuing to polish the product details.

Technical framework

When Meituan Cloud made its initial technology selection, the team evaluated OpenStack, CloudStack, and Eucalyptus. The conclusion was to follow OpenStack for the overall framework design and CloudStack for the network architecture, to develop the main components in-house, and to build some components as secondary development on top of native OpenStack components.

The core cloud host management system is developed in-house and does not use Nova. It uses a three-tier Region-Zone-Cluster architecture to support large-scale cluster deployment across regions and multiple data centers, with KVM-based host virtualization and network virtualization based on Open vSwitch + OpenFlow.
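To make the three-tier layout concrete, here is a minimal sketch of a Region-Zone-Cluster hierarchy with a naive host lookup. All class names, fields, and the placement rule are hypothetical illustrations, not Meituan Cloud's actual data model; in particular, treating a zone as one data center is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    free_cpus: int


@dataclass
class Cluster:
    name: str
    hosts: List[Host] = field(default_factory=list)


@dataclass
class Zone:
    # Assumption: one zone corresponds to one data center.
    name: str
    clusters: Dict[str, Cluster] = field(default_factory=dict)


@dataclass
class Region:
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)


def pick_host(region: Region, zone_name: str, cpus: int) -> Optional[Host]:
    """Return the first host in the given zone with enough free CPUs, or None."""
    zone = region.zones.get(zone_name)
    if zone is None:
        return None
    for cluster in zone.clusters.values():
        for host in cluster.hosts:
            if host.free_cpus >= cpus:
                return host
    return None
```

Scoping the lookup to a single zone keeps placement decisions local to one data center, which is in the spirit of the cross-data-center deployments described above.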

Image management uses Glance, with some modifications such as distributed support for multiple data centers and image replacement.

Identity management uses Keystone, with some modifications such as performance improvements under high concurrency and integration with the Meituan account system.

Object storage uses Swift, but Swift had write-latency problems and weak operations and maintenance features, so it has also been modified. Swift currently serves business data on the order of tens of TB inside Meituan.

The reason we did not adopt OpenStack as a whole is that, when we evaluated it, its design seemed too far removed from Meituan's actual situation. For example, our network architecture would have needed significant changes, and shared storage support would have been required. At that time Meituan's existing business infrastructure was already largely fixed, and making such adjustments just to fit OpenStack was basically unacceptable. OpenStack also fell short of what we needed in many details. For example, its multi-zone design assumes that the networks between your data centers are fully interconnected, which does not match our network. We therefore redesigned the network, storage, and host management models around Meituan's existing host usage patterns and network architecture, and developed the virtualization management platform ourselves. As we grew from a single data center to multiple data centers, we also decoupled the zones from one another, for example by placing Glance service nodes in every data center to reduce dependence on the cross-data-center network. It is because of this development work that, after less than a year of refinement, we reached the goal of running Meituan's entire infrastructure on the private cloud.

Of course, because we do not use Nova, many components and features in the OpenStack community that depend on Nova cannot be used by us directly. However, compared with OpenStack's all-inclusive architecture, we insist on focusing on a single technology in each area, such as KVM for host virtualization and Open vSwitch + OpenFlow for network virtualization. This keeps the development and maintenance cost of the overall system relatively low, and lets us dig deep into the features of these technologies to squeeze the most out of the hardware. Because we control the code of the base system, we can also add new business features more efficiently (for example virtual IPs and USB key support) and carry out architecture upgrades (for example multi-data-center support). In addition, we have made changes to the OpenStack components we use, such as the Swift optimizations, and our technical committee is working out how to contribute them back upstream. Whether the community accepts our patches remains to be seen, and meeting business needs is still our priority.

As for other aspects, block storage lives on local SAS disks configured as local RAID. At present we use RAID5 for Meituan's own business and RAID10 for public cloud users. The reasoning is that Meituan's own business already has a fairly complete high-availability design at the application layer, so losing a single node does not affect the business. For public cloud users, however, a cloud host is a single point, so their hosts need stronger protection. RAID10 certainly costs more. We are also considering shared storage, provided its stability and performance problems can be solved first. SSDs will be used later to improve block storage performance.
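The cost difference between the two RAID levels comes down to usable capacity. The sketch below uses the standard capacity formulas for RAID5 (one disk's worth of parity) and RAID10 (every disk mirrored); the disk count and disk size in the example are illustrative assumptions, not figures from the interview.

```python
def raid5_usable_tb(disks: int, disk_tb: float) -> float:
    """Usable capacity of RAID5: one disk's worth of space goes to parity."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) * disk_tb


def raid10_usable_tb(disks: int, disk_tb: float) -> float:
    """Usable capacity of RAID10: half of the disks hold mirror copies."""
    if disks < 4 or disks % 2 != 0:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    return (disks // 2) * disk_tb


# Illustrative example: 8 disks of 1 TB each (assumed values).
print(raid5_usable_tb(8, 1.0))   # 7.0
print(raid10_usable_tb(8, 1.0))  # 4.0
```

With the same eight disks, RAID10 yields roughly 57% of the usable space of RAID5, which is the price paid for mirroring every disk.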

The network uses a distributed design. The hosts use OpenFlow, and layer-2 protocol handling is modified through OpenFlow so that each user has a separate flat network that is isolated from other users' networks. With DNS virtualization, different users can use the same hostname on their own private networks. Distributed DNS and DHCP are deployed on every host to decentralize the basic network services.
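The idea that different users can reuse the same hostname can be illustrated with a resolver whose records are keyed by tenant as well as by name. This is a minimal sketch only; the `TenantDNS` class and its methods are hypothetical and say nothing about how Meituan Cloud implements DNS virtualization.

```python
from typing import Dict, Optional, Tuple


class TenantDNS:
    """Toy resolver: records are scoped per tenant, so names never collide."""

    def __init__(self) -> None:
        # Key is (tenant, hostname); value is the private IP address.
        self._records: Dict[Tuple[str, str], str] = {}

    def register(self, tenant: str, hostname: str, ip: str) -> None:
        self._records[(tenant, hostname)] = ip

    def resolve(self, tenant: str, hostname: str) -> Optional[str]:
        # A tenant can only see records registered under its own scope.
        return self._records.get((tenant, hostname))


dns = TenantDNS()
dns.register("tenant-a", "db01", "10.0.0.5")
dns.register("tenant-b", "db01", "10.0.0.9")
print(dns.resolve("tenant-a", "db01"))  # 10.0.0.5
print(dns.resolve("tenant-b", "db01"))  # 10.0.0.9
```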

Operations

Meituan Cloud's operations follow the same approach as operations across Meituan as a whole. The operations philosophy described below applies both to Meituan Cloud and to all of Meituan.

The operations framework can be summarized as "five horizontal, three vertical". Horizontally, from the bottom up, there are five layers:

Physical layer, including data center networks and hardware. We are building out multiple data centers and a metropolitan area network to guarantee infrastructure stability from the lowest level. To cope with the operations cost of large-scale data center construction, we implemented web-managed automated bare-metal installation and deployment: once a server is racked, all remaining work is done automatically, and physical machines can be managed just like virtual machines.

System layer, including the operating system and virtualization. On top of virtualization we manage systems through templates (images), and we have partially customized the Linux kernel, for example with optimizations for OVS compatibility.

Service layer, including web servers, caching, databases and other basic services. We do unified configuration management based on Puppet, maintain our own software repositories, and have customized some packages. Unified configuration management avoids inconsistent changes, which protects cluster stability, and it also improves operations efficiency.

Logic layer, including business logic and data flow. The main work at this layer is releases and changes. In many other companies, business releases and database change management are handled by operations. We believe that collaboration between development and operations is costly, so we have been moving toward developer self-service: a code release platform and a database change platform keep development and operations loosely coupled. On the release platform, each application corresponds to a separate cluster. One developer, as the owner of the application, has the highest privileges, and the developers who are members of the application can release code themselves. The database change platform has a similar permission mechanism and builds in stability safeguards at the task execution level: for example, large change tasks are automatically scheduled to run at night, and tasks that drop a table first take a backup in the background.

Application layer, including the parts visible to users. Besides releases and changes similar to those at the logic layer, we have a unified front-end platform that provides separation of access traffic, behavior monitoring, and access control, which greatly benefits overall security.
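The safeguards of the database change platform described under the logic layer (large changes deferred to the night, table drops backed up first) can be sketched as a small planning function. The row threshold, the 02:00 window, and all names below are hypothetical assumptions chosen for illustration; the interview does not give the real rules.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta

LARGE_CHANGE_ROWS = 1_000_000    # hypothetical threshold for a "large" change
NIGHT_WINDOW_START = time(2, 0)  # hypothetical start of the nightly window


@dataclass
class ChangeTask:
    sql: str
    estimated_rows: int


@dataclass
class ExecutionPlan:
    backup_first: bool
    run_at: datetime


def next_night_window(now: datetime) -> datetime:
    """Return the next occurrence of the nightly window start after `now`."""
    candidate = datetime.combine(now.date(), NIGHT_WINDOW_START)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def plan_change(task: ChangeTask, now: datetime) -> ExecutionPlan:
    """Defer large changes to the night; back up before dropping a table."""
    drops_table = task.sql.strip().lower().startswith("drop table")
    is_large = task.estimated_rows >= LARGE_CHANGE_ROWS
    run_at = next_night_window(now) if is_large else now
    return ExecutionPlan(backup_first=drops_table, run_at=run_at)
```

Keeping the policy in one function means developers can submit changes themselves while the platform, not the individual, decides when and how safely they run.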

Vertically, three areas of work cut across all five layers:

Monitoring. Monitoring and alerting from the physical layer up to the service layer are followed up and responded to by operations. For the logic and application layers the approach is again developer self-service: operations provides a standardized API, and developers can create their own monitoring items, set alarm rules, and investigate issues. Handling after an alarm fires is automated in some cases and not yet in others. In particular, there is a vertical chain running between the infrastructure and the business. It involves building a business capacity model: how much load a given business generates for a given number of users, what the SLA should be at different load levels, and so on. Once these are established, handling can be automated.

Security. Very early on we deployed a unified secure access platform: all manual operations on production must go through a jump host, each person has an individual login account, and every production operation is recorded in an audit log. Further security work is handled by a dedicated information security team.

Process. Early on we built some simple processes on Jira, but they still need improvement. We are now developing a process control system tailored to our needs, again aiming for automation and self-service. We are connecting the whole flow end to end, from a business unit requesting VM resources through to business scale-out. In the future this can be done with a few simple operations in a web interface, and a service API will also be provided so that other business platforms can integrate with it. Once virtualization covers all lines of business, these things become easy to do.
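The self-service alarm rules mentioned under Monitoring can be pictured as a threshold check over recent samples. The sketch below is an assumption-laden illustration: the `AlarmRule` fields and the "N consecutive samples above a threshold" policy are hypothetical and are not taken from Meituan's monitoring API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlarmRule:
    metric: str        # name of the monitoring item, chosen by the developer
    threshold: float   # alarm when samples exceed this value
    consecutive: int   # how many consecutive samples must exceed it


def should_alarm(rule: AlarmRule, samples: List[float]) -> bool:
    """Fire only if the last `consecutive` samples all exceed the threshold."""
    if rule.consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < rule.consecutive:
        return False
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


rule = AlarmRule(metric="latency_ms", threshold=200.0, consecutive=3)
print(should_alarm(rule, [100.0, 250.0, 300.0, 210.0]))  # True
print(should_alarm(rule, [250.0, 300.0, 150.0]))         # False
```

Requiring several consecutive breaches is one common way to avoid alarming on a single noisy sample.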

In short, Meituan's overall operations philosophy is to keep the business running stably while pushing for comprehensive automation and self-service. Wherever development and operations need to communicate and collaborate, the work is done as far as possible through automated platforms that developers operate themselves. Beyond building the base environment and platforms, operations helps the business lines design highly available architectures, improves the operability of the code, and locates and solves all kinds of problems in the business.

Improvement and evolution

In the two years since Meituan Cloud first went into service, the biggest improvement has been the move from a single data center to multiple data centers, which went hand in hand with the build-out of Meituan's metropolitan area network.

When we had only a single data center, Meituan's business was hit early on by a carrier network outage that lasted several hours. The business was unavailable the whole time, which was very painful. The ideal for multi-data-center redundancy is that even if an entire data center loses power, the business is unaffected. Of course this requires 100% redundancy, which is relatively expensive. But Meituan is very willing to bear the cost of redundancy, because the loss from the business being unavailable far exceeds that cost. We now keep 50% redundancy in physical resources and generally reserve 30% redundancy in bandwidth.
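Why surviving a full data center outage needs 100% redundancy when there are two data centers can be worked out with a little arithmetic. The sketch below assumes equally sized data centers with evenly spread load, which is a simplification and not a description of Meituan's actual layout; the function name is hypothetical.

```python
def redundancy_to_survive_one_loss(sites: int) -> float:
    """Spare capacity each site needs, as a fraction of its normal load,
    so that the remaining sites can absorb the load of one failed site.

    Assumes equally sized sites with evenly spread load (a simplification).
    """
    if sites < 2:
        raise ValueError("at least two sites are needed to survive losing one")
    # Each surviving site takes on an extra 1 / (sites - 1) of its own load.
    return 1 / (sites - 1)


print(redundancy_to_survive_one_loss(2))  # 1.0, i.e. 100% redundancy
print(redundancy_to_survive_one_loss(3))  # 0.5, i.e. 50% redundancy
```

Under these assumptions, adding sites lowers the spare capacity each one has to carry, which is one reason multi-data-center designs become cheaper per site as they grow.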

Because Meituan has grown so quickly, last year we ran short of resources and hit a lot of pitfalls, so we began doing some long-term planning. Dual-data-center redundancy is now partly in place for Meituan's own business, and Meituan Cloud has two data centers. If a public cloud customer's business supports horizontal scaling, it can also be deployed across data centers. With this data-center-level high availability, stability improves substantially, the impact of network jitter on the business is greatly reduced, and the availability SLA can go higher than the current four nines. Some larger customers have higher requirements for service quality, so Meituan's metropolitan area network, and the future WAN, will also be shared with our public cloud customers.
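To put "four nines" in perspective, the sketch below converts an availability figure into yearly downtime and shows the theoretical effect of deploying across two data centers. The combined figure assumes the sites fail independently and that the service is up whenever at least one site is up; real-world failures are often correlated, so treat it as an optimistic upper bound rather than a promised SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability (for example 0.9999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(availability: float, sites: int) -> float:
    """Availability when the service is up if at least one site is up.

    Assumes sites fail independently, which is optimistic in practice.
    """
    return 1.0 - (1.0 - availability) ** sites


print(round(downtime_minutes_per_year(0.9999), 2))  # 52.56
print(f"{combined_availability(0.9999, 2):.8f}")    # 0.99999999
```

Four nines therefore allows a little under an hour of downtime per year for a single site.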

In addition, as mentioned above, our databases run on physical machines. These machines now use SSDs, and compared with the earlier setup of three 15,000 RPM SAS disks, read and write performance has improved to the point that the bottleneck is now the gigabit NIC, so we are also upgrading the network to 10 Gigabit. The database service will also be opened to public cloud users, on the same infrastructure that Meituan's own business uses.

Plans for the future

Because local storage is used, virtual machine migrations currently have to be done at night to reduce the impact on user services. To improve service availability, shared storage is a good option provided stability and performance can be guaranteed, so we are testing shared storage solutions over Gigabit networks. In addition, the underlying KVM virtualization we use does not yet have a hot-plug feature, which is something we plan to add.
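One reason migrations with local storage are disruptive is that the whole disk has to be copied over the network. The sketch below estimates the raw transfer time from disk size and link speed. The 100 GB disk and the 1 Gbit/s and 10 Gbit/s links are illustrative assumptions, and the `efficiency` parameter is a hypothetical knob for protocol overhead; it ignores disk throughput and dirty-page recopying.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Time to copy `size_gb` decimal gigabytes over a link of the given speed.

    `efficiency` is the usable fraction of the line rate (1.0 means ideal).
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_gb * 8e9  # 1 GB = 8e9 bits (decimal units)
    return bits / (link_gbit_per_s * 1e9 * efficiency)


# Illustrative example: a 100 GB local disk (assumed size).
print(transfer_seconds(100, 1))   # 800.0 seconds at 1 Gbit/s line rate
print(transfer_seconds(100, 10))  # 80.0 seconds at 10 Gbit/s line rate
```

With shared storage the disk contents do not have to move at all, which is why it makes migration so much less intrusive.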

Many customers now ask us when we will launch Redis and when we will launch a cloud database. Some customers need Redis and MongoDB, and web services want MySQL. Our plan is to have the DBA team provide a number of templates, essentially system images tailored for Redis and MySQL, so that customers can use them directly. These may be introduced in the next release.

We will also provide some infrastructure consulting services, partly as hands-on services delivered by engineers and partly by sharing our best practices online in the form of tools plus documentation. Meituan has now reached a scale in the tens of billions and has accumulated a great deal of experience internally. If we can pass this experience on to our customers, it can help them avoid many detours.
