MongoDB Shard Principle

Source: Internet
Author: User
Tags: ranges

Introduction to Shards

Sharding is a method of storing data across multiple machines; MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.

Purpose of Sharding

A database application with high data volume and throughput puts heavy pressure on a single machine: a large query load can exhaust the CPU, and a large data set strains single-machine storage, eventually exhausting the system's memory and shifting the pressure to disk I/O.

To solve these problems, there are two basic approaches: vertical scaling and sharding.

[Figure: a sharded collection (http://docs.mongoing.com/manual/_images/sharded-collection.png)]

Sharding provides a way to cope with high throughput and large data volumes.

    • Sharding reduces the number of requests each shard needs to process, so the cluster can increase its storage capacity and throughput by horizontal scaling.

      For example, when inserting a document, the application only needs to access the shard that stores that document.

    • Sharding reduces the amount of data each shard stores.

      For example, if a database holds 1TB of data and there are 4 shards, then each shard stores only 256GB of data; with 40 shards, each shard stores only about 25GB of data.
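The per-shard arithmetic above can be sketched in a few lines (a toy illustration; the function name is ours, and it assumes a perfectly even distribution, which real clusters only approximate):

```python
def per_shard_gb(total_gb, num_shards):
    """Data each shard stores, assuming a perfectly even distribution."""
    return total_gb / num_shards

print(per_shard_gb(1024, 4))   # 256.0 GB with 4 shards
print(per_shard_gb(1024, 40))  # 25.6 GB with 40 shards
```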

MongoDB Shards

MongoDB supports sharding through sharded clusters.

[Figure: http://docs.mongoing.com/manual/_images/sharded-cluster-production-architecture.png]

Diagram of a sample sharded cluster for production purposes. It contains exactly 3 config servers, 2 or more mongos query routers, and at least 2 shards; the shards are replica sets.

The cluster has the following components: shards, query routers (mongos), and config servers.

Shards store the data. To provide high availability and data consistency, in a production sharded cluster each shard is a replica set [1]. For more information on replica sets, see replica sets.

Query routers, or mongos instances, interface with client applications and direct operations to the appropriate shard or shards. The query router processes and targets operations to shards and then returns results to the clients. A sharded cluster can contain more than one query router to divide the client request load; a client sends its requests to one query router. Most sharded clusters have many query routers.

Config servers store the cluster's metadata. This data contains a mapping of the cluster's data set to the shards. The query router uses this metadata to target operations to specific shards. Production sharded clusters have exactly 3 config servers.

[1]

In test and development environments, each shard can be a single mongod rather than a replica set. In a production environment, exactly 3 config servers must be deployed.
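The division of labor among the three components can be sketched as a toy model (all names and data are illustrative, not MongoDB's actual internals): config metadata maps chunks to shards, shards hold the data, and a mongos-like router consults the metadata to forward each operation to one shard.

```python
# "Config server" state: which shard owns which chunk.
config_metadata = {"chunk-1": "shard-rs0", "chunk-2": "chunk-2-owner"}
config_metadata["chunk-2"] = "shard-rs1"

# "Shards": each holds only its part of the collection.
shards = {
    "shard-rs0": {"alice": {"balance": 10}},
    "shard-rs1": {"zoe": {"balance": 20}},
}

def chunk_for(key):
    # Stand-in for real chunk resolution (range- or hash-based, shown later).
    return "chunk-1" if key < "m" else "chunk-2"

def route_find(key):
    shard = config_metadata[chunk_for(key)]  # router consults the metadata
    return shards[shard].get(key)            # only that shard is contacted

print(route_find("alice"))  # {'balance': 10}
print(route_find("zoe"))    # {'balance': 20}
```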

Data partitioning

MongoDB shards data at the collection level: the documents in a collection are partitioned into multiple parts by the shard key.

Shard Keys

To shard a collection, you need to choose a shard key. The shard key is an indexed single field or compound field that every document in the collection must contain. MongoDB divides the collection's data into chunks according to the shard key and distributes the chunks evenly across all shards. To partition chunks by shard key, MongoDB uses either a range-based sharding method or a hash-based sharding method; see the shard key documentation for more information.

Range-based Sharding

With range-based sharding, MongoDB divides the data into parts according to ranges of shard key values. Suppose the shard key is a number: imagine a straight line from negative infinity to positive infinity, where each shard key value is a point on the line. MongoDB divides this line into shorter, non-overlapping segments called chunks, and each chunk contains data whose shard key falls within a certain range.

In a system that uses range-based sharding, documents with "close" shard key values are likely to be stored in the same chunk, and therefore on the same shard.

[Figure: http://docs.mongoing.com/manual/_images/sharding-range-based.png]

Diagram of the shard key value space segmented into smaller ranges, or chunks.

Hash-based Sharding

For hash-based sharding, MongoDB computes a hash of a field's value and uses this hash value to assign documents to chunks.

In a hash-based system, documents with "close" shard key values are unlikely to be stored in the same chunk, so the data is spread out more evenly.

[Figure: http://docs.mongoing.com/manual/_images/sharding-hash-based.png]

Diagram of the hash-based segmentation.
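A hash-based assignment can be sketched the same way (a toy model using MD5; MongoDB's actual hash function and chunk count are internal details, so the specifics here are illustrative). The point is that the mapping is deterministic but scatters adjacent keys:

```python
import hashlib

NUM_CHUNKS = 4  # illustrative; not a MongoDB constant

def hashed_chunk(key):
    """Deterministically map a shard key value to a chunk via a hash."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

# Adjacent keys usually land in different chunks:
print([hashed_chunk(k) for k in (41, 42, 43)])
```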

Performance comparison of range-based and hash-based sharding

The range-based sharding approach supports more efficient range queries: given a range of shard key values, the query router can easily determine which chunks store the requested data and forward the request to only the appropriate shards.

However, range-based sharding can result in uneven data distribution across shards, and sometimes this negative effect outweighs the benefit to query performance. For example, if the shard key field grows monotonically, all inserts within a given period fall into one fixed chunk, and therefore onto the same shard. In this case, a small subset of shards carries the majority of the cluster's data, and the system does not scale well.

In contrast, hash-based sharding guarantees a balanced distribution of data across the cluster, at the cost of range query performance. The randomness of the hash spreads data randomly across chunks, and therefore across shards. But because of that same randomness, a range query cannot determine which shards hold the data, and it usually must be broadcast to all shards to return the desired result.
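The trade-off can be made concrete with a toy comparison of query fan-out for a range query over shard keys 40..49 (illustrative chunk layouts only; real chunk resolution is MongoDB-internal):

```python
import hashlib

def range_chunk(key):
    # Range chunks: [0, 100) -> chunk 0, [100, 200) -> chunk 1.
    return 0 if key < 100 else 1

def hashed_chunk(key, num_chunks=4):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_chunks

query = range(40, 50)
print({range_chunk(k) for k in query})   # {0}: a single chunk is targeted
print({hashed_chunk(k) for k in query})  # usually several chunks: broadcast
```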

Customizing the distribution of data in a cluster using tags

MongoDB allows administrators to influence the cluster's balancing policy directly using tags. The administrator binds a tag to a range of shard key values and binds the tag to a shard; the balancer then migrates data matching the tag to the bound shard and ensures that data matching the tag stays on the corresponding shard.

Tags are the primary mechanism for controlling the balancer's behavior and the distribution of chunks. Typically, when you have multiple data centers, you use tags to customize the distribution of chunks in your cluster and improve the efficiency of data access across different geographies.

See tagged shards for more information.
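Tag-aware placement can be sketched as follows (a toy model; the tag ranges, key format, and shard names are all made up for illustration):

```python
# Each tag covers a half-open range [min, max) of shard key values and is
# pinned to a shard. '~' sorts after ASCII letters, so "EU" <= key < "EU~"
# matches any key with the "EU" prefix.
tag_ranges = {("EU", "EU~"): "shard-eu", ("US", "US~"): "shard-us"}

def place(key):
    for (lo, hi), shard in tag_ranges.items():
        if lo <= key < hi:
            return shard
    return "shard-default"  # untagged ranges may live on any shard

print(place("EU-berlin-001"))  # shard-eu
print(place("US-nyc-042"))     # shard-us
```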

Maintaining data balance

Inserting new data or adding new shards may cause an imbalance in the cluster's data, that is, some shards hold significantly more chunks than others.

MongoDB uses two background processes to maintain the balance of data in a cluster: splitting and the balancer.

Splitting

Splitting is a background task that prevents chunks from growing too large. When a chunk's size exceeds the configured chunk size, MongoDB splits it in two; insert and update operations trigger the splitting process. Splitting alters only metadata and is highly efficient: MongoDB does not migrate any data during a split, so splits have no impact on cluster performance.

[Figure: http://docs.mongoing.com/manual/_images/sharding-splitting.png]

Diagram of a shard with a chunk that exceeds the default chunk size and triggers a split of the chunk into two chunks.
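Splitting can be sketched as cutting one chunk's key range at a middle key (a toy model; real splits pick a split point from the chunk's actual key distribution, and "size" here is document count rather than bytes):

```python
def maybe_split(chunk, max_size):
    """Split a chunk at its median key if it exceeds max_size documents."""
    # chunk: {"range": (lo, hi), "docs": sorted list of shard key values}
    if len(chunk["docs"]) <= max_size:
        return [chunk]
    mid = chunk["docs"][len(chunk["docs"]) // 2]
    lo, hi = chunk["range"]
    # Only the metadata (the ranges) changes; no data leaves the shard.
    return [{"range": (lo, mid), "docs": [d for d in chunk["docs"] if d < mid]},
            {"range": (mid, hi), "docs": [d for d in chunk["docs"] if d >= mid]}]

big = {"range": (0, 100), "docs": [1, 5, 20, 40, 60, 90]}
print([c["range"] for c in maybe_split(big, max_size=4)])  # [(0, 40), (40, 100)]
```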

Balancing

The balancer is a background process that manages chunk migrations. The balancer can run on any of the query routers in a cluster.

When the data in the cluster becomes unbalanced, the balancer migrates chunks from the shard with the most chunks to the shard with the fewest. For example: if the collection users has 100 chunks on shard 1 and 50 chunks on shard 2, the balancer migrates chunks from shard 1 to shard 2 until the data is balanced.
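The balancing loop can be sketched as moving one chunk at a time from the most-loaded to the least-loaded shard until the spread falls within a threshold (a toy model; the threshold value and stopping rule are illustrative, not MongoDB's exact migration thresholds):

```python
def balance(chunks_per_shard, threshold=1):
    """Migrate chunks one at a time until shards are nearly even."""
    shards = dict(chunks_per_shard)
    migrations = 0
    while max(shards.values()) - min(shards.values()) > threshold:
        src = max(shards, key=shards.get)  # shard with the most chunks
        dst = min(shards, key=shards.get)  # shard with the fewest chunks
        shards[src] -= 1
        shards[dst] += 1
        migrations += 1
    return shards, migrations

print(balance({"shard1": 100, "shard2": 50}))
# ({'shard1': 75, 'shard2': 75}, 25)
```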

The balancer manages chunk migrations from the source shard to the target shard in the background. During a migration, the target shard first receives all the documents in the migrated chunk from the source shard; the target shard then applies any changes to the chunk that occurred on the source shard during the migration; finally, the metadata stored on the config servers is updated.

If an error occurs during the migration, the data on the source shard is not modified and the migration is aborted. After a migration completes successfully, MongoDB deletes the chunk's data on the source shard.

[Figure: http://docs.mongoing.com/manual/_images/sharding-migrating.png]

Diagram of a collection distributed across three shards. For this collection, the difference in the number of chunks between the shards reaches the migration threshold (in this case, 2) and triggers migration.

Adding or removing shards from a cluster

When a new shard is added to a cluster, it holds no chunks, which causes a data imbalance. MongoDB immediately begins migrating chunks to the new shard, and it takes some time for the cluster to reach a balanced state.

When you remove a shard, the balancer must migrate all of that shard's data to the other shards; the shard can be safely removed only after the migration completes and the metadata is updated.

Cluster components

Sharded clusters implement sharding. A sharded cluster consists of the following components:

    • Shards

      A shard is a MongoDB instance that stores part of a collection's data; each shard is a single mongod or a replica set. In a production environment, all shards should be replica sets. See shards for more information.

    • Config servers

      Each config server is a mongod that stores the cluster's metadata. The metadata stores the mapping of data to shards. See configuring the server for more information.

    • Query routers

      Each router is a mongos, which distributes read and write requests to the shards. Applications do not access the shards directly.

[Figure: http://docs.mongoing.com/manual/_images/sharded-cluster.png]

Diagram of a sharded cluster.

The basic unit of sharding in MongoDB is the collection, and a shard key can be set for each sharded collection.

Further Reading
    • Sharding

    • A shard is a mongod instance that holds a part of the sharded collection's data.

    • Config servers

    • Config servers hold the metadata about the cluster, such as the shard locations of the data.
