How to build a top-notch cloud operations platform?

Source: Internet
Author: User
Tags snmp


Preface

Most people who work in operations have gone through the following progression:

First, we standardize operations. This is the stage where the quality of operations and maintenance improves.

Once standards are in place, growth in the number of nodes and in the volume of routine operations pushes us to build tools and automate step by step. This is the stage where operational efficiency improves.

But a large number of tools and automation scripts makes management harder: as staff change, or when a tool is maintained incorrectly, our automated operations tooling becomes unstable.

At this point we need a platform in which our operations tools and experience can accumulate, and through which we can achieve intelligent operations. So, starting from the needs and experience of operations, we set out to build an operations platform as a product.

Overview of the bank card organization's cloud operations platform

Let me first introduce how our IT systems were built. About 10 years ago we built a process platform based on ITIL; change, incident, problem, service and other processes all flow through that platform.

Five years ago we moved from an open platform to a cloud operations platform. In the process we built an IaaS virtualization resource platform and, in line with common industry practice, a CMDB to manage operational data.

After going into operation, however, we found that many needs were still unmet, mainly in three areas:

The first requirement is that, with the growing number of hardware and software nodes, daily operations need an efficient automation platform that adapts to varied operations scenarios and reduces repetitive work.

The second requirement is that the experience of operations staff be accumulated on a platform to form an intelligent scenario library, so that operations services and capabilities can be reused, improving overall operations quality and efficiency.

The third requirement is to inject these intelligent scenarios into traditional process-based operations, gradually shifting from reliance on human judgment and process decisions to machine analysis and judgment.

Based on these three requirements, we built an operations platform for large-scale cloud computing environments.

The cloud operations platform mainly addresses the following pain points:
    • Our company's internet business is growing very fast, and some marketing campaigns require a quick response from operations.
    • The amount of hardware has grown geometrically.
    • In recent years we have adopted a number of emerging open source technologies, which raises the technical demands on operations.
    • Operations tools are scattered and lack unified management.
    • There is no unified display of our operations data.
    • Headcount grows slowly, and manual operations raise security concerns in our audit process.

For these reasons, our vision for the operations platform is that the quality and scale of operations should not depend on the number of operations staff or on changes in their skills, so that both the scale and the quality of our operations stay under control.

What kind of product is the bank card organization's cloud operations platform?

Next, let us introduce the operations platform as a product. It has four main aspects:

The first is unified scheduling of resources. The platform integrates resources through the APIs that the resource platforms provide, including OpenStack, the database management platform, the container management platform, the distributed storage management platform, the network management platform, and the security management platform. The operations we use most often are all integrated into the operations platform, which simplifies our operations processes as far as possible and enables self-service operations.

The second is automated management. We want the operations platform to reduce manual work through automatic data collection, automatic application installation, automatic configuration and updates, automatic data analysis, automatic scaling, automatic backup and recovery, automatic fault handling, and so on.

The third is multidimensional visualization. Each role has its own view of the platform, so that operations are redefined by role: a network management view, a system administration view, a monitoring view, a report view, and so on. The reporting system and global data are unified, and customizable multidimensional reports are provided.

The last is high performance. We want the operations platform to support concurrent collection and execution across nodes.

Cloud operations platform construction scenarios

This is the scenario plan for our operations platform. At the bottom is our core scheduling module, which covers execution, collection, and integration with other processes. In the middle is the main body of the platform, which we call the operations OS. Chart management covers automated topology and custom reporting. Full lifecycle management means that the platform automates an application system's entire life, from going online to going offline.

Operational environment management and operations tools provide a convenient working environment for day-to-day operations, including backup alignment, job orchestration, and parameter management. With capacity management, we aim to aggregate monitoring data through the platform in order to achieve capacity control.

High-availability management provides unified management, availability monitoring, and automated availability drills for the components of our application systems at every level.

Important Scenario One: lifecycle management

The first scenario is lifecycle management. Our previous deployment process usually went like this: a developer wrote a requirements document and passed it through internal processes to the operations contact, who coordinated with resource managers to allocate resources and form a deployment plan. Finally, the deployment was carried out manually through a change.

This has two problems: information may be distorted as it is passed along, and the cycle is long. We want to use the cloud operations platform to achieve parameter-level electronic delivery and automated deployment. That is, users select the required components and state their resource requirements on the platform, and our administrators assign and confirm the actual deployment resources.

Finally, the platform performs the deployment automatically, and our specifications and standards are enforced automatically during deployment.

Important Scenario Two: operational environment management

The second scenario is operational environment management. It covers resource items such as CPU, memory, IP, ports, and access relationships, as well as the items our operations staff care about: scheduled tasks, backup policies, and startup items. We manage the operating environment through the cloud operations platform, replacing the original Excel spreadsheets and automating setup.

Important Scenario Three: continuous deployment management

The third scenario is continuous deployment management. The traditional deployment method runs into several problems:
    • Application versions are handed over manually several times via the version server.
    • Application configuration and maintenance scripts have no unified standard.
    • The parameters of each environment are maintained by hand in spreadsheets and modified by hand for each environment.
    • The installation process depends on the experience of the staff carrying out the change.
    • There is no uniform standard for exception alarms, and rollback methods are inconsistent.

To this end, we defined standards for continuous release and enforce them through the platform:
    • A unified version delivery path and standardized versions.
    • A configuration difference library for the production, testing, and development environments, from which the platform automatically generates the corresponding parameters for each environment.
    • A standardized application deployment process: the installation order across multiple nodes can be freely orchestrated, installation proceeds in that order, exceptions raise standard alarms, and on failure the rollback runs in reverse order.
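A minimal Python sketch of two of these standards follows: generating per-environment parameters from a difference library, and installing nodes in order with a reverse-order rollback on failure. The parameter names, host names, and the `install`/`rollback` callables are hypothetical placeholders, not the platform's actual data model or API.

```python
# Hypothetical data: shared defaults plus per-environment differences.
BASE_PARAMS = {"db_host": "db.dev.example", "log_level": "DEBUG", "workers": 2}
ENV_DIFFS = {
    "development": {},
    "testing": {"db_host": "db.test.example"},
    "production": {"db_host": "db.prod.example", "log_level": "INFO", "workers": 8},
}


def render_params(env):
    """Generate the parameters for one environment from base + differences."""
    if env not in ENV_DIFFS:
        raise KeyError(f"unknown environment: {env}")
    return {**BASE_PARAMS, **ENV_DIFFS[env]}


def deploy_in_order(nodes, install, rollback):
    """Install node by node in the given order.

    On the first failure, print an alarm line and roll back the nodes that
    were already installed, in reverse order. Returns True on success and
    False if a rollback was performed.
    """
    done = []
    for node in nodes:
        try:
            install(node)
        except Exception as exc:
            print(f"ALARM: install failed on {node}: {exc}")
            for finished in reversed(done):
                rollback(finished)
            return False
        done.append(node)
    return True
```

In this sketch only the nodes that installed successfully are rolled back; whether the failed node itself also needs cleanup depends on how the real install step behaves.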

Important Scenario Four: operating environment maintenance

The fourth scenario is the integration of common operations tools: application restart, health check, isolation, and recovery tools; physical server checks, with automatic registration into OpenStack or other resource management platforms after automated installation; network equipment health checks; and regular security checks. We have integrated all of these tools into our cloud operations platform.

Important Scenario Five: application portrait

The fifth scenario is the application portrait for operations. An application usually has many elements, and getting to know them all is difficult. The architecture of an application, for example, may be known only to a few designers or key developers, and even what they remember is not necessarily accurate.

Many parameters can only be checked on the server. Application versions, parameter changes, and maintenance records have to be traced through change records, and the capacity of each tier has to be checked with the relevant specialist team. In general, the state of an application is not well known, and finding out takes a great deal of effort.

In the cloud operations platform, we combine the management tools mentioned earlier with capacity management and high-availability management and present them in a single portrait view. From the change and maintenance history, the application capacity, and the high-availability information, the platform can also calculate the application's operational maturity.

Cloud operations platform technical solution

At the hardware asset level we obtain state and perform operations through SNMP tools. At the virtual resource level we currently manage through the interfaces provided by OpenStack and other management platforms. Linux and applications are managed through our own core scheduling system.

As for how the platform itself is deployed: apart from the cache and MySQL, all components are container-based. The front end uses Apache, HAProxy, and Keepalived; the back end uses JBoss, RabbitMQ, Ansible, and ZooKeeper; data is stored in MySQL, Redis, Ceph, and so on. In addition, a security service module checks for high-risk operations.
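The article does not say how the security service module decides that an operation is high-risk. One simple way to picture such a check is pattern matching on the command before it is dispatched. The sketch below assumes shell commands as input, and the pattern list is an illustrative assumption, not the platform's actual rule set.

```python
import re

# Hypothetical rule set: patterns for commands treated as high-risk.
HIGH_RISK_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),      # recursive delete of the root directory
    re.compile(r"\bmkfs(\.\w+)?\b"),          # formatting a filesystem
    re.compile(r"\bdd\b.*\bof=/dev/"),        # raw write to a block device
    re.compile(r"\b(shutdown|reboot)\b"),     # taking the host down
]


def is_high_risk(command):
    """Return True if the command matches any high-risk pattern."""
    return any(pattern.search(command) for pattern in HIGH_RISK_PATTERNS)
```

A real module would likely also consider who is asking, which hosts are targeted, and whether an approval exists; this sketch only covers the text match.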

Business process flow

This is our specific business process. On the left is the cloud operations platform interface. An operations request is encapsulated as a message and placed in the message queue. The schedule module receives the message and, according to the scheduling algorithm, automatically assigns it to an Ansible node. The Ansible node connects to the target server over SSH to execute the task and asynchronously returns the execution result to the message queue.
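The flow above can be sketched in a few lines of Python. This is only an in-process illustration: `queue.Queue` stands in for the message queue, the executor calls a plain function instead of running Ansible over SSH, everything runs synchronously, and round-robin assignment is a placeholder for the scheduling algorithm. All names are hypothetical.

```python
import queue
import uuid

request_queue = queue.Queue()  # stands in for the message queue (e.g. RabbitMQ)
result_queue = queue.Queue()   # executors post their results here


def submit(action, targets):
    """Platform side: wrap an operations request as a message and enqueue it."""
    message = {"id": str(uuid.uuid4()), "action": action, "targets": list(targets)}
    request_queue.put(message)
    return message["id"]


def make_executor(name, run):
    """Executor node: run the action on each target, then post the result."""
    def executor(message):
        output = {target: run(message["action"], target) for target in message["targets"]}
        result_queue.put({"id": message["id"], "node": name, "output": output})
    return executor


class Scheduler:
    """Takes messages off the request queue and assigns them to executors."""

    def __init__(self, executors):
        self._executors = list(executors)
        self._next = 0

    def dispatch_one(self):
        message = request_queue.get()
        # Round-robin is a placeholder for the real scheduling algorithm.
        executor = self._executors[self._next % len(self._executors)]
        self._next += 1
        executor(message)
```

A caller would `submit(...)` a request, let the scheduler `dispatch_one()`, and later match the entry read from `result_queue` to the request by its `id`.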

Schedule scheduling algorithm and Ansible distributed architecture

For the schedule scheduling algorithm, we take into account that our production environment has many partitions. Each target machine is automatically tagged with a zone based on its IP. When schedule picks up a message, it splits the message by tag and target machine into several finer-grained messages, and each Ansible node subscribes to and handles its own messages.

We also modified Ansible so that every task has a unique ID and a message is returned once processing completes. This enables concurrent, asynchronous execution of multiple tasks.
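As a rough illustration of the splitting step, the Python sketch below tags each target by the network its IP falls in and emits one sub-message per zone, each with its own unique task ID. The zone map, the message fields, and the `zone-default` fallback are assumptions made for the example.

```python
import ipaddress
import uuid
from collections import defaultdict

# Hypothetical zone map: network -> zone tag.
ZONES = {
    ipaddress.ip_network("10.1.0.0/16"): "zone-a",
    ipaddress.ip_network("10.2.0.0/16"): "zone-b",
}


def zone_of(ip):
    """Derive the zone tag of a target machine from its IP address."""
    address = ipaddress.ip_address(ip)
    for network, tag in ZONES.items():
        if address in network:
            return tag
    return "zone-default"


def split_by_zone(message):
    """Split one request into per-zone sub-messages, each with a unique task ID."""
    groups = defaultdict(list)
    for ip in message["targets"]:
        groups[zone_of(ip)].append(ip)
    return [
        {
            "task_id": str(uuid.uuid4()),   # lets results be matched asynchronously
            "parent_id": message["id"],
            "zone": zone,
            "action": message["action"],
            "targets": ips,
        }
        for zone, ips in sorted(groups.items())
    ]
```

An executor for a given zone would then subscribe only to sub-messages carrying its own zone tag and report back using the `task_id`.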

Visualization of data

For data visualization, collectors gather information and synchronizers pull information from other platforms. The data is stored in the core database, compared against the threshold library to raise alarms, and analyzed for performance using the analysis function library. Reports are then generated so that operations staff can manage visually.
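The threshold comparison step might look something like the following Python sketch. The metric names, the two-level warning/critical scheme, and the limit values are assumptions for illustration; the article does not describe the structure of the threshold library.

```python
# Hypothetical threshold library: metric -> (warning limit, critical limit).
THRESHOLDS = {
    "cpu_percent": (80.0, 95.0),
    "disk_percent": (85.0, 95.0),
}


def check_sample(host, metric, value):
    """Return an alarm dict if the value crosses a threshold, else None."""
    limits = THRESHOLDS.get(metric)
    if limits is None:
        return None  # no threshold defined for this metric
    warning, critical = limits
    if value >= critical:
        level = "critical"
    elif value >= warning:
        level = "warning"
    else:
        return None
    return {"host": host, "metric": metric, "value": value, "level": level}


def evaluate(samples):
    """Compare collected (host, metric, value) samples against the thresholds."""
    alarms = []
    for host, metric, value in samples:
        alarm = check_sample(host, metric, value)
        if alarm is not None:
            alarms.append(alarm)
    return alarms
```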

Results of the bank card organization's cloud operations platform

Here are the results of the platform build. Some parts of the platform are fully built, and some features are still in development. This is our actual production platform, which covers several thousand virtual servers. In the information center view we first see a machine room, then its cabinets, and then which servers are installed in each cabinet.

This is the automatic topology page for switch/F5, physical server, and virtual server. We collect switch and F5 information via SNMP, physical machine information via Ansible, and virtual machine information via OpenStack, and then generate the topology automatically from this information.
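Once the three sources have been collected, merging them into a topology is mostly a join on the physical server. The Python sketch below assumes the collected data has already been reduced to simple pairs; the input shapes and names are hypothetical, and the actual SNMP, Ansible, and OpenStack collection is not shown.

```python
def build_topology(switch_links, vm_placements):
    """Merge collected facts into a switch -> physical server -> VMs tree.

    switch_links:  (switch, physical_server) pairs, e.g. derived from SNMP data
    vm_placements: (physical_server, vm) pairs, e.g. from the virtualization API
    """
    vms_by_host = {}
    for host, vm in vm_placements:
        vms_by_host.setdefault(host, []).append(vm)

    topology = {}
    for switch, host in switch_links:
        # Sort VM names so the rendered topology is stable between runs.
        topology.setdefault(switch, {})[host] = sorted(vms_by_host.get(host, []))
    return topology
```

For example, with links `[("sw1", "phy1")]` and placements `[("phy1", "vm-a")]`, the result is `{"sw1": {"phy1": ["vm-a"]}}`.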

Data synchronization lets you customize the schedule on which data is captured.

This is an actual backup management function. On the platform we can select the appropriate servers and run self-service scheduled or immediate backups.

Self-service startup item management.

Self-service scheduled task management.




Original address of this article: http://www.linuxprobe.com/make-cloud-devops.html
