About me: a self-described ordinary guy
Career: five years as an operations engineer at the "Chrysanthemum factory", a humble install technician managing 14,000 servers
Work philosophy: do simple things simply; do not overcomplicate them
Motto: in all things, act with good intentions
I had the privilege of talking with Mr. Zhuhua, a senior advisor at Hewlett Packard Enterprise (HPE), and recorded some takeaways from the exchange here.
1. Given a brand-new user environment, how do you quickly build an operations and maintenance (O&M) system and manage it efficiently?
The core attribute of a public cloud platform is that it provides shared resources as a service.
1.1 Rapidly building the O&M system
1) Establish O&M standards
2) Establish O&M processes
3) Establish an O&M monitoring system (network monitoring, hardware status, business status, resource utilization, etc.)
4) Establish a CMDB (including a host configuration library, a network configuration library, a machine-room configuration library, etc.)
The CMDB is the core component that links the systems above together.
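To make the "CMDB as the link" idea a bit more concrete, here is a minimal sketch that models the three configuration libraries as plain Python records that reference one another. All class and field names are my own assumptions for illustration, not taken from any particular CMDB product.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """One entry in the machine-room configuration library."""
    name: str
    location: str


@dataclass
class Network:
    """One entry in the network configuration library."""
    cidr: str
    vlan: int
    room: str  # refers to Room.name


@dataclass
class Host:
    """One entry in the host configuration library."""
    hostname: str
    ip: str
    business: str
    room: str  # refers to Room.name


@dataclass
class CMDB:
    """Hypothetical in-memory CMDB holding the three libraries."""
    rooms: dict = field(default_factory=dict)
    networks: list = field(default_factory=list)
    hosts: dict = field(default_factory=dict)

    def add_room(self, room):
        self.rooms[room.name] = room

    def add_network(self, network):
        if network.room not in self.rooms:
            raise ValueError(f"unknown room: {network.room}")
        self.networks.append(network)

    def add_host(self, host):
        # Refuse hosts that point at a room the CMDB does not know about,
        # so the libraries stay consistent with each other.
        if host.room not in self.rooms:
            raise ValueError(f"unknown room: {host.room}")
        self.hosts[host.hostname] = host

    def hosts_in_room(self, room_name):
        return sorted(h.hostname for h in self.hosts.values() if h.room == room_name)


cmdb = CMDB()
cmdb.add_room(Room(name="room-a", location="example-city"))
cmdb.add_network(Network(cidr="10.0.0.0/24", vlan=100, room="room-a"))
cmdb.add_host(Host("web-01", "10.0.0.11", "web", "room-a"))
cmdb.add_host(Host("db-01", "10.0.0.21", "database", "room-a"))
print(cmdb.hosts_in_room("room-a"))
```

Because every host and network record points back at a room, monitoring and deployment tools can query one place instead of keeping their own inventories.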
1.2 Efficient O&M management
1) Skeleton: a reasonable division of labor, responsibilities, and KPIs
2) Blood: professional processes and standards
3) Interface: good service awareness and skills
2. Software code and versions are spread across many branches worldwide. How do you manage them quickly and conveniently while avoiding errors?
Use Git for source and version control: http://git.oschina.net/progit/
3. How is the monitoring system linked with SaltStack, and how do both link with the CMDB?
The monitoring system continuously collects system performance metrics and uploads them to the CMDB, which visualizes them from its own database.
SaltStack collects basic server information and uploads it to the CMDB, which visualizes it in the same way.
When the monitoring system detects that a business has hit a bottleneck, it raises an early warning, SaltStack collects the resource allocation information, and the CMDB handles the overall dynamic allocation of resources.
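The division of labor in that last step can be sketched roughly as follows. This is only an illustration under my own assumptions: the metric names, thresholds, and inventory records are made up, and no real monitoring or SaltStack API is called.

```python
def find_bottlenecks(metrics, thresholds):
    """Monitoring step: return the metrics at or above their warning threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    }


def pick_spare_host(inventory, needed_cpus):
    """CMDB step: choose an idle host with enough CPUs, or None if there is none."""
    candidates = [h for h in inventory if h["idle"] and h["cpus"] >= needed_cpus]
    # Prefer the smallest host that fits, to keep larger ones free.
    return min(candidates, key=lambda h: h["cpus"], default=None)


# Hypothetical sample data: in practice the metrics would come from the
# monitoring system and the inventory from what the batch tool collected.
metrics = {"cpu_percent": 93.0, "mem_percent": 61.0}
thresholds = {"cpu_percent": 85.0, "mem_percent": 90.0}
inventory = [
    {"host": "spare-01", "cpus": 32, "idle": True},
    {"host": "spare-02", "cpus": 16, "idle": True},
    {"host": "busy-01", "cpus": 16, "idle": False},
]

alerts = find_bottlenecks(metrics, thresholds)
target = pick_spare_host(inventory, needed_cpus=16) if alerts else None
if alerts:
    print("early warning:", alerts)
    print("scale out to:", target["host"] if target else "no spare host")
```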
4. A large system or product needs software releases and patches. How do you upgrade quickly without stopping the business?
Spacewalk can manage patches, logins, and updates.
For kernel-level patches, the view in our discussion was that there was no generally available way to avoid downtime at the time; Alibaba reportedly had a solution, but it was not open source.
For ordinary business packages, host the update source on Git and push updates in a unified way through a batch management platform such as SaltStack or Ansible.
Of course, updates should also be rolled out in batches by machine room and business type, and the update process should be visualized.
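A minimal sketch of the batching idea, assuming each host record carries a room and a business type (the field names and sample hosts are hypothetical). The actual push to each batch would be handed to SaltStack or Ansible; that part is not shown.

```python
from collections import defaultdict
from itertools import islice


def plan_batches(hosts, batch_size):
    """Group hosts by (room, business), then split each group into batches."""
    groups = defaultdict(list)
    for host in hosts:
        groups[(host["room"], host["business"])].append(host["name"])

    plan = []
    for key in sorted(groups):
        names = iter(sorted(groups[key]))
        while True:
            batch = list(islice(names, batch_size))
            if not batch:
                break
            plan.append((key, batch))
    return plan


# Hypothetical inventory for illustration.
hosts = [
    {"name": "a1", "room": "room-a", "business": "push"},
    {"name": "a2", "room": "room-a", "business": "push"},
    {"name": "a3", "room": "room-a", "business": "push"},
    {"name": "b1", "room": "room-b", "business": "gateway"},
]

plan = plan_batches(hosts, batch_size=2)
for (room, business), batch in plan:
    print(room, business, batch)
```

Updating one batch at a time keeps the remaining hosts of the same business serving traffic, and the plan itself is a natural thing to display when visualizing progress.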
5. How do you get a device's information into the CMDB so that the device can go online quickly?
Our approach was to expose a CMDB API, build a device information collection script into the customized OS or system image, and have it upload the configuration to the CMDB automatically once the device comes online.
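A collection script of this kind might look roughly like the sketch below. The endpoint URL and the JSON field names are assumptions for illustration; a real CMDB API would define its own. The request is only built here, not sent.

```python
import json
import platform
import socket
import urllib.request

# Hypothetical endpoint; replace with whatever the CMDB API actually exposes.
CMDB_URL = "http://cmdb.example.internal/api/v1/devices"


def collect_device_info():
    """Gather a few basic facts using only the standard library."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "arch": platform.machine(),
    }


def build_request(info, url=CMDB_URL):
    """Prepare (but do not send) a JSON POST carrying the device record."""
    body = json.dumps(info).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


info = collect_device_info()
request = build_request(info)
print(request.get_method(), request.full_url)
# To actually upload, a first-boot script would call:
#     urllib.request.urlopen(request, timeout=10)
```

Running the script from a first-boot hook means a machine registers itself as soon as it has an OS and a network, with no manual data entry.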
6. How do you choose hardware and the system to install for a given kind of application or service?
CPU-intensive: cloud computing, Hadoop, Spark
Memory-intensive: data storage, caching, middleware (MySQL, Redis, MongoDB, nginx, Tomcat)
Disk-intensive: log archiving, security auditing (ELK, rsync server, OSSEC)
7. Do you have to run the full business process on every server that goes online? The business logic in the middle is very complex; how do you avoid that?
Rather than running the whole process, refine it by business type, such as CPU-intensive, memory-intensive, or storage-intensive.
8. What does a health inspection mainly cover, and which open source frameworks are available?
Network health: Cacti, SmokePing
Hardware health: Zabbix, Nagios
Business health: Zabbix, Nagios
9. What tools are there for cloud computing platforms (OpenStack, other automation tools)?
Commercial options include VMware vCenter and Red Hat RHEV; open source options include OpenStack.
10. How are application upgrades automated?
11. What languages do you usually use for monitoring scripts?
Shell, Python, Perl
Operation and maintenance project experience
Building the Foshan disaster recovery machine room
Late March 2015: the disaster recovery machine room was put on the agenda
April 2015: equipment purchasing began
May 2015: equipment racked; bare metal only, with no OS and no network
The task handed down was to deploy the entire base environment within half a month. There were 360 bare-metal machines at that point, and I handled the disaster recovery work essentially on my own.
Of these:
RH2285: 160 units
RH2288 V2: 100 units
RH2288 V3: 100 units
E9000: 12 sets
More than 1,000 virtual machines also had to be created.
Relying on people to record and allocate resources by hand, the task could not have been finished in half a month; the record keeping and allocation alone would have been a struggle. Three further points also had to be considered:
1. The hosts had to be assigned to more than 10 businesses, including the game gateway, up, affection online, PUSH, every day browser, hispace, cloud+, the cloud storage platform, mobile album, and so on
2. Each business sits in a different network segment
3. Each business is CPU-intensive, memory-intensive, or disk-intensive
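To give a feel for why hand allocation does not scale here, the sketch below hands out VM addresses from one subnet per business using Python's standard ipaddress module. The business names, subnets, and VM names are hypothetical; this is not the tooling actually used on the project.

```python
import ipaddress


class SubnetAllocator:
    """Hand out unused host addresses from one subnet per business."""

    def __init__(self, subnets):
        # subnets: mapping of business name -> CIDR string
        self._pools = {
            business: iter(ipaddress.ip_network(cidr).hosts())
            for business, cidr in subnets.items()
        }
        self.assigned = {}  # vm name -> address

    def allocate(self, business, vm_name):
        try:
            address = next(self._pools[business])
        except StopIteration:
            raise RuntimeError(f"subnet for {business} is exhausted") from None
        self.assigned[vm_name] = str(address)
        return str(address)


# Hypothetical segments, one per business.
allocator = SubnetAllocator({"push": "10.10.1.0/24", "gateway": "10.10.2.0/24"})
first = allocator.allocate("push", "push-vm-001")
second = allocator.allocate("push", "push-vm-002")
third = allocator.allocate("gateway", "gw-vm-001")
print(first, second, third)
```

With the assignments kept in one structure, the same record can be written to the inventory instead of being typed into a spreadsheet afterwards.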
A departmental meeting was convened to work through the pain points:
1. Automated deployment of servers and virtual machines
Four years of installation work had built up some experience here.
Server deployment was already largely automated with DHCP + TFTP + HTTPD (PXE and Cobbler), with the boot files deployed through custom server Autoinstall.xml and AUTOINSTALL.KS files.
After power-on, the machine is installed and an IP is assigned automatically; the disks are then formatted and mounted automatically. Reinstalling business services after a fault had also been handled this way before.
Solution: place all resource pool servers in a single installation VLAN.
Virtualization uses the open source Xen platform; templates are built and then deployed by automation scripts written in Shell and Python.
2. Automatic allocation of storage and IP resources
Pain point one: the E9000 chassis were paired with S3900 storage, which required RAID and LUNs to be configured manually and then assigned to the business blades.
Solution: standardize subsequent procurement on RH-series servers.
3. Automatic updating of server and resource information tables
Pain point two: when the scale was small and servers were few, Excel served as the inventory tool. By mid-2015, however, the number of OS instances had exceeded 5,000 and there were already 8 machine rooms.
By March 2016 the count had exceeded 14,000.
Solution: a CMDB built on Python + MySQL archives server and network information. It exposes a CMDB API; a device information collection script built into the customized OS or system image uploads the configuration to the CMDB automatically once the device is online.
4. Automated business deployment
Pain point three: over the years, deployment moved from manual installation, to script-based installation, to automated deployment built by customizing open source tools.
Due to security concerns, SaltStack, an open source deployment tool with a client/server (C/S) architecture, was not used.
Solution:
Deploy each business with Ansible, an open source deployment tool
Each machine room defines one playbook (YML) covering multiple businesses
Each playbook maps to roles
Each role maps to one or more tasks (sets of scripts)
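The room, playbook, role, and task hierarchy above maps onto files on disk. The sketch below lists the files such a layout would imply, assuming one playbook file per room and Ansible's usual convention of keeping a role's tasks in roles/<role>/tasks/main.yml. The room and role names are made up for illustration.

```python
from pathlib import PurePosixPath


def ansible_layout(rooms):
    """Return the files implied by a mapping of room -> list of role names."""
    paths = set()
    for room, roles in rooms.items():
        # One playbook per machine room.
        paths.add(PurePosixPath(f"{room}.yml"))
        for role in roles:
            # A role's tasks conventionally live in roles/<role>/tasks/main.yml.
            paths.add(PurePosixPath("roles") / role / "tasks" / "main.yml")
    return sorted(str(p) for p in paths)


# Hypothetical rooms and roles.
rooms = {"room-a": ["gateway", "push"], "room-b": ["push"]}
layout = ansible_layout(rooms)
for path in layout:
    print(path)
```

Note that the shared "push" role appears only once: roles are reusable across rooms, which is what keeps per-room playbooks short.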
5. Account control and information security
As the team grew, permissions proliferated, so permission control and account reclamation became very important; security is no trivial matter.
Deploy LDAP plus a bastion host (UMA) for account management and for recording and analyzing daily operations
Deploy an OSSEC + ELK intrusion detection system to collect and retain system logs and analyze daily activity, so that security incidents can be traced and located
This article is from the "June" blog; reprinting is declined.
Efficient operations in 11 questions (a heart-to-heart with a senior HPE advisor)