Transferred from: https://tech.youzan.com/youzan-mysql-auto-ops-road/
First, preface
In the Internet age, the scale of business is often exploding. Tasks such as fast instance delivery, database optimization, and backup management have higher demands on DBAs, and simply managing those dozens of sets of db with memory is no longer applicable. So how to manage the backup, metadata, timed script and fast instance delivery of these instances is an urgent problem to be solved.
II. Standardization of databases
In the implementation of the automatic operation of MySQL, the most painful is nothing more than the directory of the Unity, the confusion of the configuration file and the DB host is not standard, and these non-standard environment will make the road of automation operation is heavy Thorn. So first of all, we have the corresponding DB host and directory standardization, the previous non-compliant with the standard host and instance of the transformation.
- All instances on a machine are differentiated by ports, such as my3306,my3307, in a unified directory. Then create the corresponding data directory, log directory, run file directory, and so on under my3306
- One profile per instance, with the exception ofthe server ID, bufferpool_size, and other parameters that remain consistent
- Consistent MySQL software catalog and version on the online environment
Third, the road of automatic operation and maintenance of the first phase
At the outset, we need to get to the top of the current issue: backup. For DBAs, backups are more important than everything. If the DBA knows nothing about the backup of the database that it maintains, then one day, you will encounter a disaster of data loss. Therefore, we started the first phase of the work, ZANDB backup monitoring System. The main functions it implements are:
- Real-time view of the backup situation, the number of instances that should be backed up, completed instances
- Show the time spent on each backup
- View backup statistics for the last 5 days, such as total number, size, etc.
Iv. automatic operation and maintenance of the road phase two
After the implementation of the ZANDB backup monitoring system, we began to design the ZANDB two phase design and development work.
In the process of designing zandb, we divided the main functions into seven parts: backup management, instance management, host management, task management, meta-data management, log management, daily maintenance.
1. Mission system
In order to implement the backup, metadata, and timing scripts, we must have a robust task scheduling system. The task system supports several types of tasks: daily Scheduled Tasks, scheduled tasks per week, scheduled tasks per month, and support for repetitive tasks of a certain interval.
The task system is carried out by a task agent and dispatching system, and the task scheduling system records all the tasks and the time policy of the host under the task.
Through the task system, we completely removed the crontab script on the DB host, modifying the task execution time, policy, and whether it was necessary to execute it easily.
2. Backup Management
On the basis of the first issue, we perfected the backup system. By combining with the task system, it is easy to set up the backup time and the backup instance, whether the backup is needed, etc., and ensure the backup operation during the low peak period of the business.
The backup operation was performed by the agent, and the backup failed to set the state through the corresponding callback address. If there is a backup failed instance on a host, you can view its backup error log directly in the backup system, perform the retry, eliminating the pain of frequent login to the DB host.
At the same time, the backup system performs a daily checksum operation on the backup of the core database. If it is found that the backup check failed, through the alarm platform trigger or SMS alarm, convenient maintenance personnel first know if there is a situation of backup failure.
3. Host Management
The host's metadata is the basis for the DB instance. When the host is added, ZANDB can automatically connect Zabbix to get host information, such as disk size, disk free space, memory free space, etc.
4. Instance Management
To maximize the performance of the host, we have deployed multiple instances on a single host, so the host and instance are one-to-many relationships.
Through the instance management system, we can realize the following functions:
- View the current list of instances, get the current data size of the instance, log size, master-slave status, whether there are slow-scan, kill SQL, instance history information performance information, etc.
- Added a single instance, a pair of instances, for an instance/a pair of master-slave additions from the library. The process of adding a new instance is through the Rsync standard database template and then using the MY.CNF template rendering to generate a standard my.cnf configuration file, which can be viewed through the process system and supported by failed retries.
- The master-slave verification of the instance. In MySQL master-slave replication, it is possible that master-slave replication errors, master-slave switching, or software bugs, and other causes of inconsistent data. In order to detect inconsistencies in the data earlier, it is necessary to perform a master-slave consistency check on the core database every day to avoid the impact on the line.
- Instance split to split multiple schemas previously in the same instance into different instances
- Each day, the metadata of the instance is taken as snapshot, such as slow-scan data, data directory size, etc., to facilitate the analysis of historical data of the instance.
5. Log Management
SQL is the most common database operation, and it is the DBA's responsibility to optimize SQL. Faced with so many instances, if there is not a log system, only by logging on to each DB host to find a slow check will be a very painful thing. In order to liberate the DBA's hands, at the same time better find and optimize slow log, to ensure the stability of the DB, Zandb log system was born.
During the first instance metadata collection, the data for slow-scan and killed SQL is counted and then updated to the instance's metadata. Get the top slow check of yesterday by using the slow-check information of the instance meta-data.
So how to get a slow check? Of course, through the great agent to get the slow check log. The slow check will be rotate every day, resulting in a new slow log file. When the system is going to get the slow-check details, it will parse the slow log file by calling Pt-query-digest, and cache the results and return them. The next time the system gets a slow check, if it is found that the date of the slow check already has the results of analysis, directly returned.
At the same time, the log management also contains the kill SQL Top case, and the slow check is similar.
6. Meta-Data management
Meta data management includes Binlog metadata, primary key overflow check, shard information, and so on.
Through Binlog metadata management, the Binlog information management of all instances is realized. Binlog metadata records each binlog start and end time of an instance, binlog the length of time, and can quickly locate a log when data recovery occurs.
Through the primary key overflow check, we can find out in time which table's primary key self-increment has reached the critical value, avoids because the primary key self-increment overflow cannot insert causes the fault.
Due to the large amount of core library data, such as transaction, and other related information, it is necessary to find the corresponding instance according to the partition key. We have developed a Shard metadata query function, as long as the database name, Shard ID and number of shards can be quickly positioned to an instance, greatly facilitate the DBA, to achieve the rapid localization of the problem.
7. Daily Maintenance
Routine maintenance is performed mainly through the agent, including batch execution SQL, batch modification configuration, etc.
Bulk execution of SQL is the selection of a batch of instances to perform maintenance of SQL. For example, you need to modify the value of a parameter in memory, or get the value of a parameter. This SQL only allows maintenance-related, and DML is not allowed to execute.
The bulk modification configuration is similar to the modification configuration of the execution SQL type, but the modification configuration is synchronized with the change configuration file, permanently, and also modifies memory, such as adjusting the slow check time.
V. Outlook
The complete zandb system is a python Django + percona-toolkit + Agent + front-end related technology, while leveraging the cache Redis and MySQL backend db, the entire system is a simple technology stack, the implementation of the function for the present is more practical. Follow-up Database performance diagnosis, automatic analysis of database slow-check, access to key information, automated library and other functions. With the deepening of automation, it is believed that DBAs will have fewer manual repetitions and devote a limited amount of time to more valuable things.
In the absence of any special instructions, this document is copyrighted and licensed by the author and the technical team, using the Attribution-NonCommercial 4.0 International license.
Reproduced please specify: from the likes of the technical team blog http://tech.youzan.com/youzan-mysql-auto-ops-road/
There are ways to-zandb the MySQL Automation operation and Maintenance