How to Realize the Separation of Database Storage and Compute at Low Cost

Source: Internet
Author: User
Keywords: database, storage, Kubernetes, expansion
Challenges

The silent cost of storage
Every database product faces this problem, because storage resources must be provisioned in advance for the data that may be generated in the future. It is like paying up front for goods you might buy later: for companies and organizations it increases operating costs, and since the pre-purchased resources cannot be used immediately, waste is inevitable.

Although the elastic database delivers resources in a streaming fashion, it only reduces the delivery granularity from 1 TB to 64 GB. In practice, elastic databases merely lower the cost of the database service; they do not completely solve the problem of silent costs and resource islands (resources that cannot be shared).

In the past, the database storage resources needed by each application had to be estimated in advance, and the database service delivered all of the storage at once. This is an all-or-nothing delivery model: either every storage resource is delivered, or none is.

The elastic database instead uses fine-grained, streaming delivery: part of the storage is delivered initially, and when disk usage reaches 80%, the next batch of storage is requested and delivered. This amounts to delivery in small batches, but it still cannot completely avoid the silent cost of storage, that is, storage that has been provided but is not fully used for a long time. In addition, because each database instance has exclusive storage, the disk IO load differs from instance to instance: IO on some instances is extremely tight while on others it is almost idle.

Pain of expansion
Although the elastic database scales online without users noticing, it is still a traditional database service. When an instance reaches its hardware limit, or the host it runs on cannot provide additional hardware resources, the instance still has to be migrated and expanded:

Apply for a container on a host that meets the resource requirements of the expansion.

Copy the existing backup files from the original container to the new container, start the MySQL service on top of that data, establish a replication relationship with the master, and then replay the incremental binlog.

When the new slave has caught up with the master, switch over to it. If the original instance is the master, perform a master-slave switchover first and then delete the original instance; otherwise, delete the original instance directly.

As this process shows, migration and expansion are complicated and require transmitting a large amount of data (the full existing data plus the incremental binlog), which takes a long time. Under a heavy write load, the newly added slave may never catch up with the master; the expansion then never completes, the remaining disk space of the original instance is exhausted, and the database goes down and can no longer serve requests.

In our experience, 80% of expansions and splits are caused by growing data volume and insufficient local disk. Although the amount of data keeps increasing, the system mostly accesses hot data (recently generated data, whose volume does not change much); cold data (old historical data) is accessed far less often. Frequent expansion is therefore unnecessary for data that is rarely accessed: if the local disk could simply be large enough, these expansions could be avoided.

The difficulty of operation and maintenance
Although the elastic database has automatic failure recovery and handling, the coupling of database and storage means that when a database instance fails, a new instance must be added to recover from the failure. This process is as complicated as the worst-case expansion scenario described above: it involves a large amount of data transmission, and in extreme cases (heavy writes) the newly added instance may never catch up with the existing master.


Solution
To solve the above three problems, we started the Disca project. By integrating ChubaoFS, Kubernetes, MyRocks, and Vitess, it achieves the separation of storage and compute in an extremely simple way.

ChubaoFS

ChubaoFS is a distributed file system designed for large-scale container platforms. It provides both an object storage protocol and a file system protocol over the same data, and has the following characteristics:

Scalability: ChubaoFS uses a distributed metadata subsystem to provide higher scalability.

Multi-tenancy: Under highly concurrent multi-tenant workloads, it supports random and sequential reads and writes of both large and small files.

Strongly consistent replication: Different replication protocols are used depending on how a file is written, ensuring consistency between replicas.

POSIX-compatible interface: Simplifies the development of upper-layer applications and lowers the learning curve for new users. At the same time, ChubaoFS relaxes some POSIX consistency semantics to balance the performance of file and metadata operations.

S3-compatible interface: Provides an object storage interface compatible with Amazon S3 that can be used alongside the POSIX interface, so one copy of data can be accessed through multiple interfaces depending on the scenario, helping users cope with increasingly complex storage requirements.

Kubernetes

Kubernetes is an automated container orchestration, scheduling, and management system open-sourced by Google. It has the following characteristics:

Portability: Supports public cloud, private cloud, hybrid cloud, and multi-cloud deployments.

Extensibility: Modular, pluggable, mountable, and composable.

Automation: Automatic deployment, automatic restart, automatic replication, and automatic scaling up and down.

MyRocks

MyRocks is a MySQL storage engine based on RocksDB. It is well suited to high-speed storage devices such as SSDs, and compared with InnoDB it saves storage space and delivers efficient write performance.

Vitess

Vitess is a database clustering system for fast online elastic scaling of MySQL. Its advantages include fast online scaling, failure self-healing, built-in connection pooling, and query caching.


Overall structure
A conventional elastic database architecture uses local storage: the console interacts with the Kubernetes API server and directly creates Pods to deploy the database service. The shortcomings of this architecture are:

Whenever the cluster is expanded or a failure is recovered, because each database instance uses local storage, a new database instance must be added and its data files and binlogs transferred over the network. This generates a large amount of network traffic, which ultimately affects the timeliness and stability of expansion and failure recovery.

Using local disks leads to different disk loads (capacity and IO) across instances: storage on some instances is in short supply while on others it sits idle. Storage resources cannot be shared, and "resource islands" appear.

Disca was created to make up for these shortcomings. Compared with a conventional elastic database, it makes the following changes:

Instead of using the host's local disk, ChubaoFS is used, so that each database instance is completely decoupled from the local disk.

Switch from MySQL's original InnoDB engine to the MyRocks engine. Through testing and online experience we found that ChubaoFS performs best under batched append-write workloads, and MyRocks, built on an LSM tree, writes data exactly that way, in batched append/overwrite fashion. Combining the two yields the greatest performance advantage.

Abandon the original approach in which database instances ran in plain Pods and all clustering, expansion, and scheduling work was handled entirely by the console. Instead, make full use of Kubernetes for service orchestration and scheduling: manage, orchestrate, and schedule database instances through StatefulSets, and bind each database instance to its data through PVCs, thereby ensuring that data is not lost.


Implementation plan
Transparent integration of ChubaoFS

To make ChubaoFS transparent to users, it must exist as a remote distributed storage service that Kubernetes can recognize automatically. ChubaoFS therefore implements the Kubernetes CSI (Container Storage Interface), achieving seamless integration with Kubernetes and dynamic, Pod-based mounting.
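On the Kubernetes side, this integration is exposed through a StorageClass backed by the ChubaoFS CSI driver. The following is only a minimal sketch: the driver name and the parameter shown are illustrative assumptions, so the real values should be taken from the ChubaoFS CSI documentation for the version in use.

    # Hypothetical StorageClass backed by the ChubaoFS CSI driver
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: chubaofs-sc                 # illustrative name
    provisioner: csi.chubaofs.com       # assumed CSI driver name; check the ChubaoFS CSI docs
    reclaimPolicy: Retain               # keep the volume (and its data) when the claim is deleted
    parameters:
      # illustrative parameter; the real driver defines its own keys
      masterAddr: "chubaofs-master.chubaofs.svc.cluster.local:17010"

PVCs that reference such a StorageClass are provisioned and mounted into Pods dynamically by the CSI driver, so the database containers never need to know that the storage is a remote file system.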

Storage sharing

When ChubaoFS is used, the requested size of each volume is only recorded as a quota; resources are not allocated in advance, nor is there a hard limit on the storage a user actually consumes. For example, if the total ChubaoFS storage space is 1 TB and ten database instances each apply for a PV (persistent volume) of 1 TB, that is acceptable as long as the total disk space actually used by the ten instances does not exceed 1 TB. The disk space actually used by each user is monitored centrally by the ChubaoFS resource management node; once total usage exceeds 80% of total capacity, the resource management node detects it immediately and raises an alarm.

Thanks to ChubaoFS's shared-disk design, Disca can request a large volume (2 TB by default) for each database instance without worrying about waste or about exceeding the cluster's total capacity. This greatly reduces the cases in which a database instance has to be split or expanded because of insufficient disk space.
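As a sketch, the claim for one database instance might look like the following; the claim name and the chubaofs-sc StorageClass are the same illustrative assumptions used above, and the 2 TB request is recorded as a quota rather than reserved up front.

    # Hypothetical claim for one database instance; 2Ti is a quota, not pre-allocated space
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-mysql-0                # illustrative name (matches a StatefulSet Pod)
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: chubaofs-sc     # assumed ChubaoFS-backed StorageClass
      resources:
        requests:
          storage: 2Ti                  # the default Disca volume size mentioned above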

Ensure data stickiness

A database service is different from an application service. An application's local storage usually holds only log files, which can be deleted after the owning Pod is restarted or rescheduled. A database service, by contrast, expects its existing data to remain available after a restart or rescheduling. This is data stickiness: when the Pod hosting a database instance is restarted or rescheduled, the instance's previous data is not lost, and when the database is restarted or rescheduled onto another Node, that data can be reused.

To achieve data stickiness, we use three Kubernetes resources: StatefulSet, PVC (persistent volume claim), and PV. Each Pod in a StatefulSet dynamically generates its own PV and PVC through volumeClaimTemplates, and the StatefulSet also guarantees the uniqueness and ordering of each Pod.
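A minimal sketch of this arrangement is shown below; the names, image, and replica count are illustrative, and the StorageClass is the assumed ChubaoFS-backed one from the earlier sketches.

    # Hypothetical StatefulSet whose Pods each get their own ChubaoFS-backed PVC
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: mysql                       # illustrative name
    spec:
      serviceName: mysql                # headless Service that gives each Pod a stable identity
      replicas: 2
      selector:
        matchLabels:
          app: mysql
      template:
        metadata:
          labels:
            app: mysql
        spec:
          containers:
            - name: mysql
              image: mysql:5.7          # illustrative image; Disca actually runs a MyRocks build
              volumeMounts:
                - name: data
                  mountPath: /var/lib/mysql
      volumeClaimTemplates:             # generates one PVC per Pod: data-mysql-0, data-mysql-1, ...
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: chubaofs-sc   # assumed ChubaoFS-backed StorageClass
            resources:
              requests:
                storage: 2Ti

If the Pod mysql-0 is restarted or rescheduled onto another Node, the StatefulSet recreates it under the same name and re-binds the existing claim data-mysql-0, so the remote volume, and the data on it, is reattached rather than recreated.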


