Part V: Architecture. Chapter 16: MongoDB Sharding Architecture (Understanding Sharding)

Source: Internet
Author: User
Tags mongodb sharding mongodb support

1. Introduction

Sharding is a database clustering technique for scaling massive data sets horizontally: a collection's data is spread across the shard nodes, and a user can build a distributed MongoDB cluster with simple configuration. So we should first understand what a shard is and how sharding basically works.

2. What is a shard

Sharding is the method MongoDB uses to split a large collection across different servers (or a cluster). Although sharding has its roots in relational database partitioning, it is (like most aspects of MongoDB) a completely different thing.

Compared with any partitioning scheme you may have used, the biggest difference is that MongoDB does almost everything automatically. Once you tell MongoDB to distribute data, it automatically keeps the data balanced across servers. You do have to tell MongoDB to add a server to the cluster, but once you do, MongoDB makes sure the new server gets its fair share of the data.

Sharding is designed primarily to achieve three simple goals.

    • Make the cluster "invisible"
An application should only need to know that it is talking to an ordinary mongod instance. To achieve this, MongoDB ships with a dedicated routing process called mongos. mongos sits in front of the cluster and, to any application that connects to it, looks like an ordinary mongod server. mongos forwards each request to the right server or servers in the cluster, then assembles the responses and sends them back to the client, so the client does not need to know whether it is talking to a single server or a cluster. Because of the nature of a cluster, there are a few special cases where this abstraction leaks; they are covered later in this tutorial.
    • Ensure that the cluster is always readable and writable
No cluster can be guaranteed to stay up forever (consider a large-scale power outage), but under reasonable conditions there should never be a time when users cannot read or write data, and the cluster should tolerate as many node failures as possible before its functionality degrades noticeably. MongoDB ensures maximum uptime in several ways. Each part of the cluster can and should have redundant processes running on other servers (ideally in other data centers), so that when a process, machine, or data center goes down, the other replicas can immediately (and automatically) take over for the failed part and keep working. Migrating data from one server to another also poses an interesting challenge: how do you provide continuous and consistent access to data while it is in transit? There are good solutions to this, though the details are beyond the scope of this chapter; under the hood, MongoDB uses some clever tricks.
    • Make the cluster easy to scale
When the system needs more space or resources, it should be possible to add them. MongoDB supports expanding system capacity on demand. To achieve this, a cluster should be easy to use and easy to manage (otherwise adding new shards would not be so easy). MongoDB lets the application grow easily and naturally.
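The first goal can be pictured with a toy router. The sketch below is not how mongos is implemented; `ToyShard` and `ToyRouter` are hypothetical names, and plain Python lists stand in for servers. It only illustrates the idea that the client calls a single object, which sends the query to every shard and merges the partial results.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


class ToyShard:
    """Stands in for one mongod (or replica set) holding a subset of the data."""

    def __init__(self, name: str, docs: List[Document]) -> None:
        self.name = name
        self.docs = docs

    def find(self, predicate: Callable[[Document], bool]) -> List[Document]:
        return [doc for doc in self.docs if predicate(doc)]


class ToyRouter:
    """Hypothetical stand-in for mongos: one entry point in front of all shards."""

    def __init__(self, shards: List[ToyShard]) -> None:
        self.shards = shards

    def find(self, predicate: Callable[[Document], bool]) -> List[Document]:
        # Scatter the query to every shard, then gather the partial results.
        results: List[Document] = []
        for shard in self.shards:
            results.extend(shard.find(predicate))
        return results


router = ToyRouter([
    ToyShard("shard1", [{"username": "alice", "age": 31}]),
    ToyShard("shard2", [{"username": "mallory", "age": 25},
                        {"username": "nick", "age": 40}]),
])

# The caller never sees which shard a document came from.
over_30 = router.find(lambda doc: doc["age"] > 30)
names = sorted(doc["username"] for doc in over_30)
```

The calling code looks the same whether the router fronts one shard or many, which is the sense in which the cluster is "invisible".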

3. Understanding Sharding

To set up, manage, or debug a cluster, we need to understand how sharding basically works. The following parts walk through it.

3.1. Splitting Data

A shard is one or more servers in a cluster that are responsible for a subset of the data. For example, if a cluster stores 1 million documents representing the users of a site, one shard might hold the information for 200,000 of those users. A shard can consist of multiple servers. If it does, each server holds an identical copy of the shard's data, and in a production environment a shard is typically a replica set.
To recap, a shard holds a subset of the data, and if the shard contains multiple servers, each server has a full copy of that subset. To distribute data evenly across shards, MongoDB moves subsets of the data between shards, and it decides which data to move based on the shard key. For example, we might choose to split the users collection on the username field. MongoDB uses a range-based approach: it divides the data into chunks by ranges of the shard key, such as ["a", "f").
This section uses standard interval notation to describe ranges. "[" and "]" denote closed ends, while "(" and ")" denote open ends, so there are four possible kinds of interval:
    • x is in (a, b) if and only if a < x < b.
    • x is in (a, b] if and only if a < x <= b.
    • x is in [a, b) if and only if a <= x < b.
    • x is in [a, b] if and only if a <= x <= b.
MongoDB sharding describes ranges in the [a, b) form, so that is the form seen most often here. It can be read as "from and including a, up to but not including b".
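The half-open form makes it easy to decide which range a key belongs to, because every key falls into exactly one range. The following is a minimal sketch in plain Python, not MongoDB code: `LOWER_BOUNDS`, `UPPER_LIMIT`, and `range_index` are hypothetical names, and the four ranges are chosen only for illustration, assuming lowercase usernames.

```python
from bisect import bisect_right

# Lower bounds of four illustrative half-open ranges over usernames:
# ["a", "f"), ["f", "n"), ["n", "t"), ["t", "{")
LOWER_BOUNDS = ["a", "f", "n", "t"]
UPPER_LIMIT = "{"  # the character that follows "z" in ASCII


def range_index(username: str) -> int:
    """Return the index of the range [lo, hi) that contains username."""
    if not (LOWER_BOUNDS[0] <= username < UPPER_LIMIT):
        raise ValueError(f"{username!r} is outside the covered key space")
    # bisect_right counts the lower bounds that are <= username;
    # subtracting 1 turns that count into the index of the matching range.
    return bisect_right(LOWER_BOUNDS, username) - 1


first = range_index("alice")   # falls in ["a", "f")
boundary = range_index("f")    # the lower bound is included, so ["f", "n")
last = range_index("zed")      # falls in ["t", "{")
```

Because the lower bound is inclusive and the upper bound exclusive, the key "f" belongs to the second range and not the first, so adjacent ranges never overlap and never leave a gap.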
3.2. Distributing Data

The way MongoDB distributes data is not intuitive. To understand the rationale behind it, we start with a naive approach and then look for a better one when it runs into problems.
    • One range per shard
The simplest way to distribute data is to make each shard responsible for one range of data. With four shards, we might end up with the setup described below. In this example, we assume that all usernames start with a letter from a to z, so the whole key space can be written as the range ["a", "{"), where { is the character that follows z in ASCII.
The four shards cover the ranges ["a", "f"), ["f", "n"), ["n", "t"), and ["t", "{"). This scheme is very easy to understand, but it becomes inconvenient in a large or busy system. To see why, walk through what happens. Suppose many users register names whose first letter falls in ["a", "f"). Shard 1 grows large, so we need to move some of its documents to shard 2. We can adjust the ranges so that shard 1 covers (for example) ["a", "c") and shard 2 covers ["c", "n").
When shard 1 migrates some data to shard 2, the range of shard 1 shrinks and the range of shard 2 grows.
With one range per shard, moves cascade: data can only go to the neighboring server, even when that does not immediately improve the balance. What happens if we add a new shard? Suppose the cluster keeps running until each shard holds 500 GB of data, and then we add a new shard. Now we have to move 400 GB from shard 4 to shard 5, 300 GB from shard 3 to shard 4, 200 GB from shard 2 to shard 3, and 100 GB from shard 1 to shard 2: a full 1 TB of data moved!

When adding a server and rebalancing the cluster, we could reduce the amount of data to migrate by adding the new server in the middle of the cluster (between shard 2 and shard 3), but that would still require migrating 600 GB of data. As the number of shards and the amount of data grow, this cascade effect only gets worse, so MongoDB does not distribute data this way. Instead, it lets each shard hold multiple ranges.
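The 1 TB and 600 GB figures can be checked with a few lines of arithmetic. This is a minimal sketch under the one-range-per-shard assumption that data may only move between neighboring shards; `cascade_cost` is a hypothetical helper, not part of MongoDB.

```python
from itertools import accumulate
from typing import List


def cascade_cost(sizes_gb: List[int]) -> int:
    """Total GB moved to even out shards when data only moves between neighbors.

    sizes_gb lists the shards in key order. The amount that must cross the
    boundary after shard i is the absolute running surplus of shards 0..i.
    """
    target = sum(sizes_gb) // len(sizes_gb)  # assumes the total divides evenly
    surplus = [size - target for size in sizes_gb]
    running = list(accumulate(surplus))
    # The last running total is always 0 (nothing lies beyond the last shard).
    return sum(abs(amount) for amount in running[:-1])


# New empty shard appended after shard 4.
new_shard_at_end = cascade_cost([500, 500, 500, 500, 0])
# New empty shard inserted between shard 2 and shard 3.
new_shard_in_middle = cascade_cost([500, 500, 0, 500, 500])
```

Under these assumptions the first call gives 1000 GB and the second 600 GB, matching the numbers in the text, while the new shard itself only needs 400 GB.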
    • Multiple ranges per shard
Let us revisit the imbalance scenario with shard 1 and shard 2 holding 500 GB each and shard 3 and shard 4 holding 300 GB each. This time we allow each shard to hold more than one range. We can split the data on shard 1 into two ranges, one with 400 GB of data (say ["a", "d")) and another with 100 GB (["d", "f")). We do the same with shard 2 and get the ranges ["f", "j") and ["j", "n"). Now we can migrate the 100 GB in ["d", "f") from shard 1 to shard 4, and all documents in ["j", "n") from shard 2 to shard 3.
A range of data is called a chunk. When we split a chunk's range in two, one chunk becomes two chunks. If we add a new shard to the earlier cluster with 500 GB on each shard, MongoDB can take 100 GB off the top of each shard and move those chunks to the new shard. The new shard gets 400 GB while only the minimum amount of data moves: just 400 GB.
When a new shard is added, every shard can send data to it directly. This is how MongoDB distributes data. When a chunk grows too large, MongoDB automatically splits it into two smaller chunks, and if the shards become unbalanced, MongoDB restores the balance by migrating chunks.
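The chunk-based scheme can be sketched as a small simulation. This is a simplified model, not MongoDB's actual balancer: `split` and `balance` are hypothetical helpers, chunks are represented only by their size in GB, and the sizes follow the text's round numbers (real chunks are far smaller).

```python
from typing import Dict, List


def split(chunk_gb: int, max_gb: int) -> List[int]:
    """Halve a chunk repeatedly until every piece is at most max_gb."""
    if chunk_gb <= max_gb:
        return [chunk_gb]
    half = chunk_gb // 2
    return split(half, max_gb) + split(chunk_gb - half, max_gb)


def balance(shards: Dict[str, List[int]]) -> int:
    """Move chunks from the fullest shard to the emptiest one.

    Stops when moving one more chunk would no longer narrow the gap
    between the two. Returns the total number of GB moved.
    """
    moved = 0
    while True:
        fullest = max(shards, key=lambda name: sum(shards[name]))
        emptiest = min(shards, key=lambda name: sum(shards[name]))
        gap = sum(shards[fullest]) - sum(shards[emptiest])
        if not shards[fullest] or shards[fullest][-1] >= gap:
            return moved
        chunk = shards[fullest].pop()  # take one chunk "off the top"
        shards[emptiest].append(chunk)  # any shard can give to any other
        moved += chunk


# A chunk that grew to 400 GB is split into pieces of at most 100 GB.
pieces = split(400, 100)

# Four shards with 500 GB each, held as 100 GB chunks, plus a new empty shard.
cluster = {f"shard{i}": [100] * 5 for i in range(1, 5)}
cluster["shard5"] = []
moved_gb = balance(cluster)
sizes = {name: sum(chunks) for name, chunks in cluster.items()}
```

In this model each of the four original shards gives up a single 100 GB chunk, so only 400 GB moves in total, compared with 1 TB under the one-range-per-shard scheme.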
