Data Models for Massive Data

Source: Internet
Author: User

Http://my.oschina.net/chenzuoping/blog/37747

Sharding nightmare

Many large websites in China have experience with sharding, which seems to be the most common way to scale MySQL-backed websites. Weibo.com, youku.com, and Douban.com all use sharding to improve performance. Sharding looks attractive, and judging from the user experience of Weibo, Youku, and Douban, it does work.

Sharding generally splits a table holding a huge dataset into multiple tables based on the primary key, or even splits one database into multiple databases. The read and write load is thereby spread across multiple tables or multiple machines, which improves performance.
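A minimal sketch of splitting by primary key, assuming a simple modulo rule. The shard count, the table naming scheme, and the function names are hypothetical and only illustrate the idea; real deployments may use ranges, lookup tables, or other rules.

```python
NUM_SHARDS = 4  # hypothetical number of sub-tables


def shard_table_for(user_id, base_name="user"):
    """Return the name of the sub-table that holds the row with this primary key."""
    return f"{base_name}_{user_id % NUM_SHARDS}"


def shard_rows(rows):
    """Group rows (dicts with an 'id' primary key) by the sub-table they belong to."""
    tables = {}
    for row in rows:
        tables.setdefault(shard_table_for(row["id"]), []).append(row)
    return tables
```

With this rule, the row with primary key 10 lands in `user_2`, because 10 modulo 4 is 2.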

Although sharding significantly improves performance, the problems it brings are just as obvious. The program logic changes, and programming becomes harder. After a table is sharded into multiple sub-tables, the program has to determine which sub-table holds the data it wants. Multi-table join queries become an almost impossible task, because the sub-table to be joined may be on another machine. The usual solution is to write a dedicated data access layer, but as sharding is applied again and again, the data access layer itself becomes very complex. Sharding itself is also complicated, and many issues must be considered when performing it. The data operations required by the business must still be possible after sharding, and incorrect sharding can have disastrous consequences for the system. Because of this complexity, unexpected errors are easy to introduce.
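The following toy sketch shows why joins get awkward once a data access layer hides the sub-tables. Everything here is an assumption for illustration: the `ShardedStore` class, the in-memory dicts standing in for sub-tables, and the sample rows are not from any real system.

```python
class ShardedStore:
    """Toy data access layer: hides which sub-table a row lives in."""

    def __init__(self, num_shards):
        # Each dict stands in for one sub-table, possibly on its own machine.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[key % len(self.shards)]

    def put(self, key, row):
        self._shard(key)[key] = row

    def get(self, key):
        return self._shard(key).get(key)


users = ShardedStore(2)
articles = ShardedStore(3)

users.put(1, {"name": "alice"})
users.put(2, {"name": "bob"})
articles.put(10, {"author_id": 1, "title": "Sharding"})
articles.put(11, {"author_id": 2, "title": "Entities"})


def articles_with_authors(article_ids):
    """Application-side 'join': the database can no longer do it in one query."""
    result = []
    for article_id in article_ids:
        article = articles.get(article_id)
        if article is None:
            continue
        # Second lookup, potentially against a different machine.
        author = users.get(article["author_id"])
        result.append((article["title"], author["name"] if author else None))
    return result
```

What would be a single JOIN on an unsharded database turns into one lookup per row per table, and every new sharding rule has to be taught to this layer.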

New Ideas

The problems MySQL runs into with sharding have pushed the industry to look for new and better solutions. Google and Amazon each offer their own: Bigtable and Dynamo. The two differ in many ways, but they share one important trait: both give up the relational model. As the analysis above shows, the relational model is hard to scale when data volumes become massive.

Both adopt an entity [1] model. An entity can be understood as a set of correlated attributes, or as an object that can take arbitrary attributes but has no methods. The role of an entity is to keep closely related data together and to separate unrelated data. The entity is the smallest unit of partitioning: as the data volume grows, the data is spread across machines, and different entities may land on different machines, but all attributes of the same entity must be stored on the same machine.

For example, a user A can be modeled as an entity, with the person's name (name), age (age), height (height), weight (weight), and other information expressed as attributes of that entity. If A writes an article, the article is also expressed as an attribute of the entity (article1), because an article written by A is closely related to A. If A writes a second article, it becomes the attribute article2. If A publishes a photo, the photo is likewise represented by an attribute photo1 of A, and a second photo by the attribute photo2.
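As a rough illustration, the entity for user A could be held as a plain dictionary of attributes. The helper names `new_entity` and `add_attribute` and the sample values are hypothetical; only the attribute names (name, age, height, weight, article1, photo1, and so on) follow the description above.

```python
def new_entity(**attributes):
    """An entity is just a bag of attributes with no methods."""
    return dict(attributes)


def add_attribute(entity, prefix, value):
    """Store value under the next free numbered attribute, e.g. article1, article2."""
    n = 1
    while f"{prefix}{n}" in entity:
        n += 1
    key = f"{prefix}{n}"
    entity[key] = value
    return key


# Made-up sample values for user A.
user_a = new_entity(name="A", age=30, height=175, weight=70)
add_attribute(user_a, "article", "My first article")   # stored as article1
add_attribute(user_a, "article", "My second article")  # stored as article2
add_attribute(user_a, "photo", "holiday.jpg")          # stored as photo1
```

Everything A owns ends up inside the one entity, so nothing about A has to be looked up elsewhere.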

The advantage of this is that all associated data is bound together. When the number of users grows to the point where a single server cannot store all the data or cannot handle the volume of requests, the entities can be spread across different servers. Since entities are almost completely unrelated to one another, spreading them across servers does not significantly affect the program.

This model is very simple and well suited to distributed storage. Automatic sharding also becomes possible, because relationships between entities need not be considered during sharding: data is distributed to different machines one whole entity at a time.
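Distribution in units of entities can be sketched as follows, assuming each entity has a string key and that a server is picked by hashing that key. The function names, the server names, and the choice of MD5 with a modulo are illustrative assumptions, not how Bigtable or Dynamo actually place data.

```python
import hashlib


def server_for(entity_key, servers):
    """Pick a server from a stable hash of the entity key."""
    digest = hashlib.md5(entity_key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def place_entities(entities, servers):
    """Assign each whole entity, with all of its attributes, to exactly one server."""
    placement = {server: {} for server in servers}
    for key, entity in entities.items():
        placement[server_for(key, servers)][key] = entity
    return placement
```

No entity is ever split, so the placement logic never has to look inside an entity or at any other entity. A plain modulo like this moves most entities when the server list changes, which is one reason real systems tend to use more careful schemes such as consistent hashing.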

 

The ideas above come from the paper "Life beyond Distributed Transactions". The paper also discusses message passing, state machines, and other issues, and is worth reading.
