Common Technologies for Scalable Architecture: Data Segmentation

Source: Internet
Author: User
Tags: mongodb, mongodb sharding
1. Introduction

I originally planned to write a full article on scalable architecture, but the topic turned out to be too large to finish quickly. So I am starting with the part people care about most, data segmentation (partitioning/sharding), for your reference.

To cope with ever-growing data, we split the data and store it in different databases. Unless stated otherwise, a "database" in this article means a logical database (a group of databases, such as a master-slave set) rather than a single physical database.

There are two main approaches:

Vertical segmentation (vertical partition/sharding): data in different formats is stored in different databases.

Horizontal segmentation (horizontal partition/sharding): data in the same format is split across different databases. This article focuses on this approach.

2. Vertical segmentation (vertical partition/sharding)

Vertical segmentation is very widely used. The main idea is to keep closely related data in the same database. It is mainly applied in the following ways:

Different applications use different databases: this is easy to understand. A business often has multiple applications, and some applications evolve into two or more applications. Giving each its own database is a form of vertical segmentation.

Different modules of the same application use different databases: each module gets its own database, and a loosely coupled API is provided to access them.

The same module of the same application uses different databases: in some applications, data suited to relational queries is kept in a relational database, while data suited to a NoSQL database (such as a key-value database) is kept in a NoSQL database so that it is easier to scale.

For example, a forum application could store personal user information (such as visit counts and profile data) in a relational database, and store posts and replies in a NoSQL database, which is easier to scale. Note that this is a hypothetical example for illustration, not a real system.

3. Horizontal segmentation (horizontal partition/sharding)

Horizontal segmentation is relatively complex. Here we mainly discuss horizontal segmentation strategies.

3.1 Horizontal segmentation strategies

1. Round-robin (polling type) algorithm

As the name suggests, data is stored on the database nodes in turn. For example, with two nodes N0 and N1, Data0 is placed on N0, Data1 on N1, Data2 on N0, and so on.

This approach is very easy to implement. For numeric keys we have n = key mod N, where key is the data's key, N is the number of nodes, and n is the index of the node that holds the data. Non-numeric keys can be converted to numeric ones, for example with a hash function f that distributes the key values evenly, giving n = f(key) mod N.
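The modulo rule above can be sketched in a few lines of Python. This is only an illustration: the function name shard_for and the use of MD5 as the hash function f are assumptions, not something the original article specifies.

```python
import hashlib


def shard_for(key, num_nodes):
    """Return the index of the node that should hold `key` (key mod N)."""
    if isinstance(key, int):
        numeric = key
    else:
        # Assumed choice of f: MD5 spreads non-numeric keys fairly evenly.
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        numeric = int.from_bytes(digest, "big")
    return numeric % num_nodes


# Numeric keys alternate between the two nodes N0 and N1.
print([shard_for(k, 2) for k in range(4)])  # [0, 1, 0, 1]

# A non-numeric key is hashed first, then reduced modulo the node count.
print(shard_for("user:alice", 2))
```

A stable hash such as MD5 is used here instead of Python's built-in hash(), because hash() on strings can differ between interpreter runs and would send the same key to different nodes.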

This approach has an obvious disadvantage: it does not handle changes in the set of data nodes well, that is, re-segmentation is hard. Re-segmentation is needed when data growth exceeds database capacity and databases must be added, or when a system failure makes some databases unavailable. For example, with two nodes N0 and N1, adding a node N2 requires migrating part of the data on N0 and N1 to N2. The workload is enormous, and the change propagates to the upper-layer application. Suppose Data5 was stored on N1: the application knows from key = 5 that the data lives on N1 and queries N1. After N2 is added and the data is migrated to N2, the application must query N2 instead. This seems simple, but in practice it often makes the application considerably more complex.
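To make the cost concrete, the following sketch counts how many keys change nodes when a modulo-based layout grows from two nodes to three. The helper name moved_keys and the sample key range are assumptions chosen for illustration.

```python
def moved_keys(keys, old_nodes, new_nodes):
    """Return the keys whose node index changes when the node count changes."""
    return [k for k in keys if k % old_nodes != k % new_nodes]


keys = range(12)
moved = moved_keys(keys, 2, 3)

print(moved)            # [2, 3, 4, 5, 8, 9, 10, 11]
print(len(moved), "of", len(keys), "keys change nodes")

# Data5 from the example: on N1 with two nodes, on N2 with three.
print(5 % 2, 5 % 3)     # 1 2
```

In this small sample, two thirds of the keys end up on a different node, and some of them move between the existing nodes N0 and N1 rather than only to the new node.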

2. Virtual sharding (virtual slicing) technique

To support re-segmentation without forcing the upper-layer application to change its data access logic whenever the physical databases change, a mapping table from virtual shards to physical shards is introduced in between. Data objects are stored on virtual shards, and each virtual shard finds its physical shard through the mapping table. The upper-layer application then depends only on virtual shards, not physical ones. As long as there are enough virtual shards, the application is insulated from changes to the physical databases.
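A minimal sketch of the mapping-table idea, under stated assumptions: the number of virtual shards (8), the database names db0, db1, and db2, and the function names are all hypothetical and only serve to show the indirection.

```python
NUM_VIRTUAL_SHARDS = 8  # assumed value; fixed for the lifetime of the system

# Mapping table: virtual shard -> physical database (names are hypothetical).
shard_map = {v: ("db0" if v < 4 else "db1") for v in range(NUM_VIRTUAL_SHARDS)}


def virtual_shard(key):
    """The application only ever computes the virtual shard of a key."""
    return key % NUM_VIRTUAL_SHARDS


def physical_db(key):
    """The mapping table resolves the virtual shard to a physical database."""
    return shard_map[virtual_shard(key)]


print(physical_db(5), physical_db(14))   # db1 db1

# Add a new database: only the mapping table changes, not the key logic.
shard_map[6] = "db2"
shard_map[7] = "db2"

print(physical_db(5), physical_db(14))   # db1 db2
```

Only the data in virtual shards 6 and 7 has to be migrated, and virtual_shard() gives the same answer before and after, so application code that depends on it does not change.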

3. Consistent hashing algorithm

Consistent hashing was introduced to avoid large-scale data migration when the number of databases changes. The algorithm was published by David Karger and others in 1997 in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". A simple Java implementation of consistent hashing is described here: http://weblogs.java.net/blog/2007/11/27/consistent-hashing

The main idea of consistent hashing is to leave the hash function itself unchanged. When a node is removed, its neighboring node takes over, so the data on the removed node migrates only to the adjacent node. When a node is added, it takes over only part of the adjacent node's data, so only a subset of the data near the new node is migrated to it.

Let's look at the implementation in detail. The hash function produces values in the interval [min, max], which we treat as a ring. Each node is mapped onto the ring by its hash value, as shown in the figure.

Suppose the value range is [1, 12] and there are three nodes at positions 1, 4, and 9. Data keys are mapped onto the same ring, and each key is stored on the first node found going clockwise. Key a falls between 1 and 4, so it is stored on node 4. Likewise, b is stored on node 9 and c on node 1.

If node 4 becomes unavailable, data a migrates to node 9, and the data on the other nodes does not move, as shown in the figure.

If node 7 is added, some of the data on node 9 migrates to node 7, and the data on the other nodes does not change, as shown in the figure.
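The ring example above can be traced with a short Python sketch. It places nodes and keys directly at ring positions instead of hashing them. The class name HashRing and the key positions (a at 2, b at 6, c at 11) are assumptions chosen to match the ranges described in the text, since the article gives only the node positions.

```python
from bisect import bisect_left, insort


class HashRing:
    """Ring of node positions; a key belongs to the first node clockwise."""

    def __init__(self, node_positions):
        self.nodes = sorted(node_positions)

    def add_node(self, position):
        insort(self.nodes, position)

    def remove_node(self, position):
        self.nodes.remove(position)

    def owner(self, key_position):
        # First node at or after the key; wrap around to the smallest node.
        i = bisect_left(self.nodes, key_position)
        return self.nodes[i % len(self.nodes)]


def placement(ring, keys):
    return {name: ring.owner(pos) for name, pos in keys.items()}


keys = {"a": 2, "b": 6, "c": 11}  # assumed positions on the [1, 12] ring

ring = HashRing([1, 4, 9])
print(placement(ring, keys))      # {'a': 4, 'b': 9, 'c': 1}

# Node 4 becomes unavailable: only a moves, to the next node clockwise.
ring.remove_node(4)
print(placement(ring, keys))      # {'a': 9, 'b': 9, 'c': 1}

# Starting again from the original ring, add node 7: only b moves.
ring = HashRing([1, 4, 9])
ring.add_node(7)
print(placement(ring, keys))      # {'a': 4, 'b': 7, 'c': 1}
```

A real deployment would compute each position with a hash function over the node name and the key, but the lookup rule is the same.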

4. Segmentation by data characteristics

The most common approach is to divide data by geographic location: based on the user's registration information or the IP address from which the data was submitted, the data is placed in the database geographically closest to the user.

3.2 Practical applications

In real applications these strategies are often combined, and more abstract interfaces are sometimes provided so that developers can implement their own segmentation methods. Below we describe the sharding approaches of MongoDB and Hibernate Shards.

3.2.1 MongoDB sharding

MongoDB is a document-based NoSQL database whose query capabilities are very close to those of relational databases.

MongoDB organizes data in structures called chunks. The default chunk size is 64 MB, and each chunk stores a certain range of the data. When a chunk grows beyond 64 MB, it splits into two chunks. This is comparable to adding a node in the consistent hashing algorithm, except that the node here is not a database. Each physical database (called a shard) contains multiple chunks. To achieve better load balancing, chunks migrate automatically between the physical databases so that their distribution across the databases stays balanced.

3.2.2 Hibernate Shards

Hibernate Shards is an extension layer on top of Hibernate Core that encapsulates and reduces the complexity of horizontal segmentation on relational databases.

Hibernate Shards provides abstract interfaces that let developers implement the segmentation strategy they want, and it uses the virtual sharding (virtual slicing) technique to keep changes in the physical databases from forcing changes in the application.

For the Chinese-language Hibernate Shards reference documentation, see: http://redhat.iteye.com/blog/328032

3.3 Issues to note

Splitting a database horizontally makes queries harder, especially aggregate queries. MongoDB uses Map/Reduce to perform aggregate queries more efficiently.

4. Summary

For large-scale, scalable applications with massive data, data segmentation is a key element that the architecture must address. In practice, we usually segment vertically first and then horizontally.
