Common scalable architecture technology-data splitting

Last Update:2014-09-25 Source: Internet

Author: User

Tags mongodb sharding


Common scalable architecture technologies

-- Data splitting (sharding/partition)

1 Overview

I originally wanted to write an article on scalable architecture, but found there was too much to cover and it would take too long. So I will start with the topic everyone is most interested in, partitioning/sharding, for your reference.

As we know, to cope with ever-growing data we split it and store it in different databases. "Database" in this article does not mean a single physical database but a logical database (a group of databases, such as a master-slave pair).

There are two main methods:

Vertical split (vertical partition/sharding): data with different structures is stored in different databases.

Horizontal split (horizontal partition/sharding): data with the same structure is spread across different databases. This article focuses on this method.

2 Vertical split (vertical partition/sharding)

Vertical splitting is widely used. Its main purpose is to keep closely interdependent data in the same database. The main ways of applying it are:

Different applications use different databases: this is easy to understand. An enterprise usually has several applications, and some applications gradually evolve into two or more; giving each its own database is in effect a vertical split.

Different modules of an application use different databases: each module of the same application uses its own database and exposes loosely coupled APIs for access.
Different databases within the same application module: within some modules, data suited to relational queries is stored in a relational database, while data suited to a NoSQL database (such as a key-value store) is stored there, which makes it easier to scale. For example, a forum application could keep users' personal information in a relational database and keep data such as visit counts in a NoSQL store for easy scaling. Note: this is a hypothetical example, not a real case.

3 Horizontal split (horizontal partition/sharding)

Horizontal splitting is relatively complex. We start with the splitting strategies.

3.1 Horizontal splitting strategies fall into the following types:

1. Round-robin algorithm

As the name implies, data is stored on the database nodes in turn. For example, with two nodes N0 and N1, data0 is placed on N0, data1 on N1, data2 on N0, and so on.

This method is very easy to implement. For numeric keys we have N = key mod n, where key is the data key, n is the number of nodes, and N is the index of the node that stores the data. Non-numeric keys can first be converted to numbers, for example with a hash function f that distributes key values evenly, giving N = f(key) mod n.
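
The formula N = f(key) mod n can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and MD5 is an assumed choice for f (any hash that spreads keys evenly would do).

```python
import hashlib


def numeric_key(key):
    """Turn a key into an integer; plays the role of f (assumed helper)."""
    if isinstance(key, int):
        return key
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big")


def node_for(key, n):
    """Return N = f(key) mod n, the index of the node that stores the key."""
    return numeric_key(key) % n


# With n = 2, data0..data3 alternate between N0 and N1
print([node_for(k, 2) for k in range(4)])
# A non-numeric key is hashed first, then reduced modulo n
print(node_for("user:alice", 3))
```

The same key always lands on the same node as long as n stays the same, which is what lets the application compute the node without any lookup.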

This method has an obvious disadvantage: it does not cope well with changes in the set of data nodes, that is, secondary splitting is hard. Secondary splitting means splitting the data again, either because the data has outgrown the capacity of the existing databases and more databases must be added, or because a system failure has made some databases unavailable. For example, suppose there are two nodes, N0 and N1, and we now need to add a node N2. Because N = key mod n changes for most keys when n goes from 2 to 3, a large part of the data on N0 and N1 has to be migrated. This is a huge amount of work, and it may also force changes in the upper-layer applications. For example, data5 used to be stored on N1: an application that accessed it computed the node from key = 5 and queried N1. Once N2 is added, data5 moves to N2 (5 mod 3 = 2), and the application must query N2 instead. This looks simple, but in practice it often makes the application considerably more complex.
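
To get a feel for how much data has to move, the following sketch counts the keys whose node changes when n goes from 2 to 3. The helper name and the key range 0..599 are assumptions chosen for illustration.

```python
def moved_keys(keys, old_n, new_n):
    """Return the keys whose node index changes when the node count changes."""
    return [k for k in keys if k % old_n != k % new_n]


keys = range(600)
moved = moved_keys(keys, 2, 3)

# Share of keys that must be migrated: about two thirds
print(len(moved) / len(keys))
# data5: node 1 with two nodes, node 2 with three nodes
print(5 % 2, 5 % 3)
```

Only keys whose remainder modulo 6 is 0 or 1 stay where they are; everything else moves, including keys that move between the two existing nodes.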

2. Virtual sharding

To support secondary splitting, a mapping table from virtual shards to physical shards is inserted in the middle, so that changes to the physical databases do not force changes to the data access logic of upper-layer applications. Data objects are stored on virtual shards, and each virtual shard is resolved to its physical shard through the mapping table. Upper-layer applications therefore depend on virtual shards rather than physical shards. As long as there are enough virtual shards, applications never need to depend on the physical layout.
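
A minimal sketch of such a mapping table, assuming 8 virtual shards and hypothetical database names db0, db1 and db2. The key-to-virtual-shard rule never changes; only the table is edited when a database is added.

```python
VIRTUAL_SHARDS = 8  # assumed value, fixed for the lifetime of the application

# Mapping table: virtual shard -> physical database (hypothetical names)
shard_map = {v: "db0" if v < 4 else "db1" for v in range(VIRTUAL_SHARDS)}


def virtual_shard(key):
    """The application only ever computes the virtual shard."""
    return key % VIRTUAL_SHARDS


def database_for(key):
    """Resolve the virtual shard to a physical database via the table."""
    return shard_map[virtual_shard(key)]


before = database_for(13)  # key 13 -> virtual shard 5 -> db1

# Add a third database: reassign two virtual shards in the table.
# The data of those two virtual shards still has to be copied to db2.
shard_map[5] = "db2"
shard_map[7] = "db2"

after = database_for(13)  # same virtual shard 5, now resolved to db2
print(before, after)
```

Migration happens in units of whole virtual shards, so the amount of data that moves is decided by which table entries are reassigned rather than by the hash function.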

3. Consistent hashing algorithm

To avoid large-scale data migration when the number of databases changes, the consistent hashing algorithm was introduced. It was published in 1997 by David Karger and others in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". A simple Java implementation of consistent hashing is described here: http://weblogs.java.net/blog/2007/11/27/consistent-hashing

The main idea of consistent hashing is to leave the hash function itself unchanged. When a node is removed, its neighboring node takes over its range, so only the data of the removed node is migrated, and only to that neighbor. When a node is added, it takes over part of the range of its neighboring node, so only part of that neighbor's data is migrated to the new node.

Let's look at the implementation in more detail. The values produced by the hash function fall in a range [min, max]. We represent this range as a ring and map the hash value of each node onto the ring.

Suppose we have three nodes whose hash values are 1, 4 and 9. Data keys are mapped onto the same ring. The key of data A hashes to a value between 1 and 4, so A is stored on node 4, that is, on the first node clockwise from its position. In the same way, B is stored on node 9 and C is stored on node 1.

Now assume that node 4 becomes unavailable. Data A is migrated to node 9, and the data on the other nodes does not move.

If node 7 is added, part of the data on node 9 is migrated to node 7, and the data on the other nodes does not change.
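
The ring with nodes 1, 4 and 9 can be reproduced with a small sketch. It is a toy model under stated assumptions: node and key positions are given directly instead of being computed by a hash function, the class name is made up, and the positions of A, B, C and D (3, 6, 11 and 8) are chosen only so that they fall where the text places them.

```python
import bisect


class HashRing:
    """Toy consistent-hash ring; positions are supplied directly."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes)

    def node_for(self, position):
        # First node clockwise from the position; wrap around past the last node
        i = bisect.bisect_left(self.nodes, position)
        return self.nodes[i % len(self.nodes)]

    def add(self, node):
        bisect.insort(self.nodes, node)

    def remove(self, node):
        self.nodes.remove(node)


keys = {"A": 3, "B": 6, "C": 11, "D": 8}  # assumed positions on the ring


def placement(ring):
    return {name: ring.node_for(pos) for name, pos in keys.items()}


ring = HashRing([1, 4, 9])
initial = placement(ring)       # A on 4, B and D on 9, C on 1

ring.remove(4)
without_4 = placement(ring)     # only A moves, to node 9
ring.add(4)                     # restore the original ring

ring.add(7)
with_7 = placement(ring)        # only B moves, from node 9 to node 7
print(initial, without_4, with_7)
```

In both changes exactly one of the four keys moves, and it moves between neighboring nodes, which is the property the modulo scheme lacks. Real implementations usually hash node names to obtain positions and place each node at several points on the ring to even out the load.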

4. Splitting by data features

The most common approach is to split data by geographical location: based on the user's registration information or the IP address from which the data was submitted, we store the data in the database geographically closest to the user.

3.2 Practical application

In real applications these strategies are often combined, and more abstract interfaces are even provided so that developers can implement their own splitting methods. Here we describe the sharding approaches of MongoDB and Hibernate Shards.
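
As a rough sketch, feature-based splitting is a lookup from a data attribute to a database. The region names, database names and the `region` field below are all hypothetical; deriving a region from an IP address would need a geolocation step that is not shown here.

```python
# Hypothetical mapping from a user's region to the nearest database
REGION_DATABASES = {
    "asia": "db-asia",
    "europe": "db-europe",
    "americas": "db-americas",
}
DEFAULT_DATABASE = "db-asia"  # assumed fallback when the region is unknown


def database_for_user(user):
    """Pick the database from the region in the user's registration data."""
    return REGION_DATABASES.get(user.get("region"), DEFAULT_DATABASE)


print(database_for_user({"name": "alice", "region": "europe"}))
print(database_for_user({"name": "bob"}))
```

Unlike hashing, this kind of split does not balance load by itself: a region with many users produces a larger database.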

3.2.1 MongoDB sharding

MongoDB is a document-oriented NoSQL database whose query style is very similar to that of relational databases.

MongoDB stores data in units called chunks. The default chunk size is 64 MB, and each chunk holds the data within a certain key range. When a chunk grows beyond 64 MB, it is automatically split into two chunks. This resembles adding a node in consistent hashing, except that the new "node" is a chunk rather than a database; the range is divided in the same way as in consistent hashing. Each physical database (called a shard) contains multiple chunks. To balance the load, chunks are automatically migrated between shards so that they are evenly distributed across the databases.
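
The splitting behavior can be imitated with a toy model. This is not MongoDB code and uses no MongoDB API: a chunk is just a sorted list of keys, and a limit of 4 keys per chunk stands in for the 64 MB size limit (both are assumptions for illustration).

```python
MAX_KEYS_PER_CHUNK = 4  # stand-in for the 64 MB size limit (assumed)


def insert(chunks, key):
    """Insert a key into range-ordered chunks, splitting a chunk that overflows."""
    # Find the chunk whose range covers the key; the last chunk takes the rest
    for i, chunk in enumerate(chunks):
        if i == len(chunks) - 1 or key < chunks[i + 1][0]:
            break
    chunk.append(key)
    chunk.sort()
    if len(chunk) > MAX_KEYS_PER_CHUNK:
        # Split the range in the middle into two adjacent chunks
        mid = len(chunk) // 2
        chunks[i:i + 1] = [chunk[:mid], chunk[mid:]]


chunks = [[]]
for key in [10, 20, 30, 40, 50, 5, 45]:
    insert(chunks, key)
print(chunks)
```

After the fifth insert the single chunk splits into two ranges, and later keys go to whichever range covers them. Assigning chunks to shards and moving them for balance would be a separate step on top of this.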

3.2.2 Hibernate Shards

Hibernate Shards is an extension of Hibernate Core that encapsulates horizontal splitting for relational databases and reduces its complexity.

Hibernate Shards provides developers with abstract interfaces through which they can implement their own splitting strategies. To keep changes in the physical databases from forcing changes in the application, it uses virtual sharding.

For Chinese documentation on Hibernate Shards, see: http://redhat.iteye.com/blog/328032

3.3 Notes

After a database is split horizontally, queries become harder, especially aggregation queries. MongoDB offers map/reduce so that aggregation work can be done on each shard and the partial results combined.
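
The general pattern behind aggregating over split data can be sketched as follows. This is plain Python rather than MongoDB's map/reduce interface; the shard contents and the `visits` field are invented for the example.

```python
from functools import reduce

# Each inner list stands for the rows held by one shard (made-up data)
shards = [
    [{"user": "a", "visits": 3}, {"user": "b", "visits": 5}],
    [{"user": "c", "visits": 10}],
    [],
]


def map_shard(rows):
    """Runs on each shard: return a partial (sum, count) pair."""
    return (sum(r["visits"] for r in rows), len(rows))


def combine(left, right):
    """Merge two partial results."""
    return (left[0] + right[0], left[1] + right[1])


partials = [map_shard(rows) for rows in shards]
total, count = reduce(combine, partials, (0, 0))
average = total / count if count else None
print(total, count, average)
```

Each shard returns a sum and a count rather than its own average, because averaging per-shard averages would give the wrong result when shards hold different numbers of rows.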

4 Summary

For large-scale, scalable applications with massive data, data splitting is a key part of the architecture. When splitting data, we usually split vertically first and then horizontally.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
