Data Model of massive data

Last Update:2018-12-06 Source: Internet

Author: User

Http://my.oschina.net/chenzuoping/blog/37747

Sharding nightmare

Many large websites in China have experience with sharding, which seems to be the most common way to scale MySQL-backed sites. Weibo.com, youku.com, and Douban.com all use sharding to improve performance. Sharding looks attractive, and judging by the user experience on Weibo, Youku, and Douban, it does work.

Sharding generally splits a table with a huge dataset into multiple tables based on the primary key, or even splits a database into multiple databases. The read and write load is then spread across multiple tables or machines, which improves performance.
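The primary-key split described above can be sketched as follows. This is only an illustration: the `user` table, the four sub-tables `user_0` to `user_3`, and the modulo rule are assumptions made for the example, not the scheme used by any of the sites mentioned.

```python
# Hypothetical setup: one logical table "user" split into NUM_SHARDS sub-tables.
NUM_SHARDS = 4


def shard_table(user_id: int, base: str = "user") -> str:
    """Return the name of the sub-table that holds the row for user_id."""
    return f"{base}_{user_id % NUM_SHARDS}"


def build_select(user_id: int):
    """Build a parameterized query against the sub-table chosen by the key."""
    table = shard_table(user_id)
    return f"SELECT * FROM {table} WHERE id = %s", (user_id,)


print(shard_table(10))   # user_2
print(build_select(7))   # ('SELECT * FROM user_3 WHERE id = %s', (7,))
```

Every read and write has to go through a routing step like `shard_table`, which is where the extra program logic discussed next comes from.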

Although sharding significantly improves performance, the problems it brings are just as obvious. The program logic changes and becomes harder to write. After a table is sharded into multiple sub-tables, the program has to determine which sub-table holds the data it needs. Multi-table join queries become almost impossible, because the sub-table to be joined may be on another machine. The usual solution is to write a dedicated data access layer, but as sharding is repeated again and again, the data access layer itself becomes very complex. The sharding operation is also complicated and many issues must be considered when performing it. The data operations required by the business must still be possible after sharding, and an incorrect split can have disastrous consequences for the system. Because of this complexity, unexpected errors are easy to introduce.
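To make the join problem concrete, here is a toy data access layer, assuming two hypothetical tables (`user` and `article`) that are each sharded by their own primary key. In-memory dictionaries stand in for sub-tables that could live on different machines; all names are invented for the example.

```python
NUM_SHARDS = 2

# In-memory stand-ins for sub-tables that could live on different machines.
user_shards = [dict() for _ in range(NUM_SHARDS)]     # user_id -> name
article_shards = [dict() for _ in range(NUM_SHARDS)]  # article_id -> (author_id, title)


def put_user(user_id, name):
    user_shards[user_id % NUM_SHARDS][user_id] = name


def put_article(article_id, author_id, title):
    article_shards[article_id % NUM_SHARDS][article_id] = (author_id, title)


def get_user(user_id):
    return user_shards[user_id % NUM_SHARDS].get(user_id)


def articles_with_author_names():
    """Application-side stand-in for
    SELECT a.id, a.title, u.name FROM article a JOIN user u ON a.author_id = u.id
    """
    rows = []
    for shard in article_shards:  # scatter: visit every article shard
        for article_id, (author_id, title) in shard.items():
            # One extra lookup per row, possibly on a different shard.
            rows.append((article_id, title, get_user(author_id)))
    return sorted(rows)  # gather: merge the partial results


put_user(1, "alice")
put_user(2, "bob")
put_article(10, 1, "first")
put_article(11, 2, "second")
put_article(12, 1, "third")
print(articles_with_author_names())
# [(10, 'first', 'alice'), (11, 'second', 'bob'), (12, 'third', 'alice')]
```

What the database would do in a single join becomes a scatter over every article shard plus a per-row lookup on the user shards, and this logic has to be revisited each time the sharding layout changes.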

New Ideas

The problems MySQL runs into with sharding have pushed the industry to look for new and better solutions. Google and Amazon each offer their own, Bigtable and Dynamo respectively. Bigtable and Dynamo differ in many ways, but one important thing they share is that both give up the relational model. As the analysis above shows, the relational model is hard to scale out when data volumes become massive.

Both adopt an entity [1] model. An entity can be understood as a set of related attributes, or as an object that can take arbitrary attributes but has no methods. The role of an entity is to keep closely related data together and to separate unrelated data. The entity is the smallest unit of partitioning. When the data volume grows, the data is distributed across machines. Different entities may end up on different machines, but all attributes of the same entity must be stored on the same machine.

For example, a user A can be modeled as an entity, and the person's name (name), age (age), height (height), weight (weight), and other information can be expressed as attributes of that entity. If A writes an article, the article is also expressed as an attribute of the entity (article1), because an article written by A is closely related to A. If A writes a second article, a second attribute, article2, is created. If A publishes a photo, the photo is likewise represented by an attribute photo1 of A, and a second photo by the attribute photo2.
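The user A example can be written down as a plain attribute bag. This is a minimal sketch: the dictionary representation, the `add_attribute` helper, and all attribute values are made up for illustration and are not the storage format of Bigtable or Dynamo.

```python
# Hypothetical entity for user A: attributes only, no methods.
user_a = {"name": "A", "age": 30, "height": 175, "weight": 68}


def add_attribute(entity, prefix, value):
    """Store value under the next free numbered attribute, e.g. article1, article2."""
    n = 1
    while f"{prefix}{n}" in entity:
        n += 1
    key = f"{prefix}{n}"
    entity[key] = value
    return key


add_attribute(user_a, "article", "My first post")
add_attribute(user_a, "article", "My second post")
add_attribute(user_a, "photo", "beach.jpg")
print(sorted(user_a))
# ['age', 'article1', 'article2', 'height', 'name', 'photo1', 'weight']
```

Everything that belongs to A lives in one value, so reading A's profile, articles, and photos never requires touching another entity.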

The advantage of this is that all associated data is bound together. When the number of users grows to the point where a single server can no longer store all the data or handle the request load, the entities can be spread across different servers. Since entities are almost completely independent of one another, spreading them across servers has little effect on the program.

In short, the model is very simple and well suited to distributed storage. Sharding can also be automated, because relationships between entities do not need to be considered when splitting the data. Whole entities are simply distributed to different machines.
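Placing whole entities on machines by key can be sketched as below. The server names, the `EntityStore` class, and the hash-modulo placement rule are assumptions chosen to keep the example short; they are not how Bigtable or Dynamo actually place data.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical machine names


def server_for(entity_key: str) -> str:
    """Pick a server from the entity key alone, using a stable hash."""
    digest = hashlib.md5(entity_key.encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]


class EntityStore:
    """Toy store: each 'machine' is a dict mapping entity key to attributes."""

    def __init__(self):
        self.machines = {name: {} for name in SERVERS}

    def put(self, key, attrs):
        # The whole entity goes to one machine; attributes are never split up.
        self.machines[server_for(key)][key] = dict(attrs)

    def get(self, key):
        return self.machines[server_for(key)].get(key)


store = EntityStore()
store.put("user:A", {"name": "A", "article1": "My first post"})
store.put("user:B", {"name": "B", "photo1": "beach.jpg"})
print(server_for("user:A"), store.get("user:A"))
```

The placement function looks only at the entity key, which is why no relationship between entities has to be considered. A plain modulo rule like this one moves most entities when the number of servers changes, so production systems typically use a more careful placement scheme.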

The ideas above come from the paper "Life beyond Distributed Transactions". The paper also discusses messaging, state machines, and other issues, and is worth reading.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
