Key Points of MongoDB Design

Last Update:2018-12-05 Source: Internet

Author: User

From:

http://apan.me/index.php/2011/05/01/mongodb%E8%AE%BE%E8%AE%A1%E8%A6%81%E7%82%B9/

A while ago, while studying auto-scaling, I looked into MongoDB. This post briefly records the key points of its design.
The official MongoDB website describes it as follows: MongoDB is a high-performance, highly scalable, open-source, document-oriented database implemented in C++. Its main features are:

Document-oriented storage
Index support
High availability (replica sets)
Auto-sharding

Basic Concepts

Document: the basic unit of data, equivalent to a row in a relational database.
Collection: equivalent to a table in a relational database, but schema-free.
Namespace: each collection has a corresponding namespace.

Storage Engine

The storage engine is built on memory-mapped files (mmap). 32-bit machines are limited by their address space, so a single instance can hold only about 2 GB of data, while 64-bit machines are practically unlimited (terabytes). A 64-bit machine is therefore recommended for deployment.
Each database consists of one .ns metadata file and multiple data files (dbname.{0, 1, 2, ...}, with an auto-incrementing number as the extension). To keep small databases from wasting space, MongoDB's data files start at 16 MB by default and grow by a multiple with each new file, up to a maximum of 2 GB per data file.
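As a rough illustration of the growth rule described above, the following Python sketch lists the sizes of the first few data files. It assumes that growing "in multiples" means each new file is twice the size of the previous one; the helper name data_file_sizes is made up for this example and is not part of MongoDB.

```python
MB = 1024 * 1024
FIRST_FILE_SIZE = 16 * MB    # default initial size stated in the text
MAX_FILE_SIZE = 2048 * MB    # 2 GB cap stated in the text


def data_file_sizes(db_name, count):
    """Return (file name, size in bytes) for the first `count` data files.

    Assumption: every new file doubles in size until the 2 GB cap is hit.
    """
    files = []
    size = FIRST_FILE_SIZE
    for number in range(count):
        files.append((f"{db_name}.{number}", size))
        size = min(size * 2, MAX_FILE_SIZE)
    return files


for name, size in data_file_sizes("test", 9):
    print(f"{name}: {size // MB} MB")
```

Under these assumptions the cap is reached at the eighth file (test.7), and every later file stays at 2 GB.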

Each data file is organized into extents. A namespace can own multiple extents, which need not be contiguous. As with data files, the extent size of each namespace grows by a multiple with each new allocation. Each data file contains a special namespace named $freelist, which is mainly used to manage extent recycling (for example, when a collection is dropped).
The .ns file stores the metadata of every collection in the database, for example the blocks and indexes that make up each collection. The location of each of these pieces is represented by the following data structure:
struct DiskLoc
{
    int fileNum; // the 0 in test.0
    int offset;  // position in file
}; // 64 bits
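To make the 64-bit figure concrete, here is a small Python sketch that packs a (file number, offset) pair into 8 bytes with the standard struct module. The little-endian layout and the helper names are assumptions made for illustration; the text only states that the structure holds two ints.

```python
import struct

# Assumed layout: two little-endian signed 32-bit integers, 8 bytes in total.
DISKLOC_FORMAT = "<ii"


def pack_diskloc(file_num, offset):
    """Encode a (file number, offset) pair as 8 raw bytes."""
    return struct.pack(DISKLOC_FORMAT, file_num, offset)


def unpack_diskloc(raw):
    """Decode 8 raw bytes back into a (file number, offset) pair."""
    return struct.unpack(DISKLOC_FORMAT, raw)


raw = pack_diskloc(0, 4096)          # offset 4096 inside test.0
print(len(raw) * 8, "bits")
print(unpack_diskloc(raw))
```

A fixed-size pointer like this lets one record refer to another record in any data file of the same database.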
Because the implementation relies on mmap, all memory management is left to the operating system.
References: http://www.10gen.com/video/mongosv2010/storageengine

Disaster Recovery Backup

Basic Principles of replica sets

A replica set consists of exactly one master (the primary) and one or more slaves (secondaries). When the current primary becomes unavailable, the replica set elects one of the secondaries as the new primary.

Composition of replica set

Standard: a regular node that holds a full copy of the data and can be elected primary.
Passive: holds a full copy of the data but cannot be elected primary.
Arbiter: only votes and holds no data. It is generally used to judge the state of the network.

Oplog

Oplog is short for operation log. It is comparable to MySQL's binlog and is the core of MongoDB's replication mechanism. The oplog lives in a special database named local, in the collection oplog.$main (replica sets use oplog.rs). Its size can be set at startup with the --oplogSize parameter.
Each oplog entry contains the following fields:
ts: an 8-byte timestamp, made up of a 4-byte UNIX timestamp plus a 4-byte auto-incrementing counter. This value is very important: when a new primary is elected, the secondary with the largest ts is chosen.
op: a 1-byte operation type, for example i for insert and d for delete.
ns: the namespace the operation applies to.
o: the document corresponding to the operation.
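The ts field can be pictured as a single 64-bit number. The Python sketch below assumes the UNIX timestamp occupies the high 32 bits and the counter the low 32 bits, so that comparing two values orders them by time first and by counter second. The helper names are hypothetical.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_ts(seconds, counter):
    """Combine a 4-byte UNIX timestamp and a 4-byte counter into one integer.

    Assumption: seconds go in the high half, the counter in the low half.
    """
    if not (0 <= seconds <= COUNTER_MASK and 0 <= counter <= COUNTER_MASK):
        raise ValueError("both parts must fit in 32 bits")
    return (seconds << COUNTER_BITS) | counter


def split_ts(ts):
    """Return the (seconds, counter) pair stored in a combined value."""
    return ts >> COUNTER_BITS, ts & COUNTER_MASK


def freshest(last_ts_by_node):
    """Pick the node whose last applied ts is the largest."""
    return max(last_ts_by_node, key=last_ts_by_node.get)


secondaries = {
    "node-a": make_ts(1300000000, 7),
    "node-b": make_ts(1300000000, 9),   # same second, later operation
}
print(freshest(secondaries))
```

Two operations logged within the same second are still ordered, because the counter breaks the tie.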
Oplog entries are idempotent: as long as they are applied in the recorded order, they can be executed multiple times with the same result. For example, an increment ($inc) operation is recorded in the oplog as a set ($set) operation.
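The following Python sketch shows why recording an increment as a set makes replay safe. It is a toy model with in-memory dictionaries, not MongoDB code: the "u" operation type and the o2 field are assumptions loosely modeled on update entries (the text above only lists i and d), and both function names are made up.

```python
import copy


def log_increment(collection, doc_id, field, amount):
    """Turn an increment into a log entry that sets the resulting value."""
    new_value = collection[doc_id][field] + amount
    return {
        "op": "u",                      # assumed code for an update
        "ns": "test.counters",
        "o2": {"_id": doc_id},          # assumed: which document to update
        "o": {"$set": {field: new_value}},
    }


def apply_entry(collection, entry):
    """Apply one log entry; applying it again leaves the same state."""
    op, doc = entry["op"], entry["o"]
    if op == "i":
        collection[doc["_id"]] = dict(doc)
    elif op == "d":
        collection.pop(doc["_id"], None)
    elif op == "u":
        collection[entry["o2"]["_id"]].update(doc["$set"])


primary = {1: {"_id": 1, "hits": 5}}
secondary = copy.deepcopy(primary)

entry = log_increment(primary, 1, "hits", 1)
apply_entry(primary, entry)

# The secondary replays the same entry twice, for example after a retry.
apply_entry(secondary, entry)
apply_entry(secondary, entry)
print(primary == secondary)
```

Had the log stored the increment itself, the second replay would have pushed the secondary's counter one step ahead of the primary.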

Synchronization

When a new secondary starts, it performs a full sync from the primary, queries the primary's oplog, and

Election Algorithm

Query all others for their maxappliedoptime
Try to elect self if we have the highest time and can see a majority of nodes
- If a tie on highest time, delay a short random amount first
- Elect (selfid, maxoptime) msg -> others
If we get a msg and our time is higher, we send back no
We must get back a majority of yes
If a yes is sent, we respond no to all others for 1 minute. Electing ourself counts as a yes.
Repeat as necessary after a random sleep
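The core of these steps can be sketched in Python as follows. This is a simplified, single-process model under stated assumptions: optimes are plain integers, every visible node answers immediately, and the random delays, the one-minute "no" period after a yes, and the retry loop are left out. All function names are hypothetical.

```python
def can_stand_for_election(self_id, optimes, total_nodes):
    """Check the preconditions for trying to elect ourselves.

    optimes maps node id -> maxappliedoptime for every node we can see,
    including ourselves.
    """
    sees_majority = len(optimes) > total_nodes // 2
    has_highest = optimes[self_id] >= max(optimes.values())
    return sees_majority and has_highest


def vote(voter_optime, candidate_optime):
    """A voter answers no when its own optime is higher than the candidate's."""
    return "no" if voter_optime > candidate_optime else "yes"


def run_election(candidate, optimes, total_nodes):
    """Return True if the candidate collects a majority of yes votes."""
    if not can_stand_for_election(candidate, optimes, total_nodes):
        return False
    yes_votes = 1  # electing ourself counts as a yes
    for node, optime in optimes.items():
        if node != candidate and vote(optime, optimes[candidate]) == "yes":
            yes_votes += 1
    return yes_votes > total_nodes // 2


visible = {"a": 10, "b": 8, "c": 9}
print(run_election("a", visible, total_nodes=3))
print(run_election("b", visible, total_nodes=3))
```

The majority is counted against the full set size, not against the nodes that happen to be reachable, so a node cut off from its peers cannot elect itself.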

Auto-Scaling

Auto-sharding Mechanism

Config servers: store the metadata of the entire cluster, including basic information about each shard server and its chunks. Each config server holds a full copy of the metadata, and consistency across config servers is ensured through two-phase commit. If any config server is unavailable, the cluster's metadata becomes read-only and cannot be modified.
Router: receives client requests and forwards them to the appropriate shards. When necessary, the router also merges the result sets before returning them to the client.
Shard servers: the data servers. Each shard is generally a replica set.

Route table

Collection	Minkey	Maxkey	Location
Users	{name: MinKey}	{name: "Miller"}	Shard0
Users	{name: "Miller"}	{name: "Ness"}	Shard1
Users	{name: "Ness"}	{name: "Ogden"}	Shard2
Users	{name: "Ogden"}	{name: MaxKey}	Shard3
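A router can resolve a key to a shard with a binary search over the lower bounds of the ranges. The Python sketch below mirrors the boundaries in the table, with names capitalized and an empty string standing in for the lowest possible key; both choices, and the route function itself, are assumptions made for illustration.

```python
import bisect

# (lower bound, shard). Each range runs from its lower bound (inclusive)
# up to the next lower bound (exclusive); the last range is open-ended.
ROUTES = [
    ("", "Shard0"),        # "" sorts before every other string
    ("Miller", "Shard1"),
    ("Ness", "Shard2"),
    ("Ogden", "Shard3"),
]
LOWER_BOUNDS = [low for low, _ in ROUTES]


def route(name):
    """Return the shard whose range contains the given name."""
    index = bisect.bisect_right(LOWER_BOUNDS, name) - 1
    return ROUTES[index][1]


for name in ["Adams", "Miller", "Nelson", "Zimmer"]:
    print(name, "->", route(name))
```

A query on a single name touches one shard, while a range query that spans several boundaries must be sent to each matching shard and the results merged.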

Chunks

A chunk is a contiguous range of a collection's shard key space, covering [minkey, maxkey). When a chunk grows beyond the maximum chunk size, it is automatically split into two chunks, and chunks are migrated to other shards when necessary.
When choosing a shard key, consider whether the key can split the data evenly. For example, if a person's name is the key and many people share the same name, splitting and migration may fail.
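The problem with a low-variety key can be seen in a toy split routine. The Python sketch below picks a split point near the middle of the sorted key values and gives up when every document has the same key. It only illustrates the idea and is not how MongoDB actually chooses split points; split_chunk is a made-up name.

```python
def split_chunk(keys):
    """Split shard-key values into two ranges [min, split) and [split, max).

    Returns (left, right), or None when no split point exists because
    all documents share one key value.
    """
    keys = sorted(keys)
    split_point = keys[len(keys) // 2]
    if split_point == keys[0]:
        # The middle value equals the minimum, so the left side would be
        # empty. Fall back to the first larger value, if there is one.
        larger = [key for key in keys if key > keys[0]]
        if not larger:
            return None
        split_point = larger[0]
    left = [key for key in keys if key < split_point]
    right = [key for key in keys if key >= split_point]
    return left, right


print(split_chunk(["Adams", "Baker", "Clark", "Davis"]))
print(split_chunk(["Smith"] * 5))
```

A range boundary can only fall between two different key values, so a chunk filled with a single name keeps growing and cannot be rebalanced.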

Case studies

Foursquare crash event

Foursquare suffered an 11-hour outage in October 2010. The main cause was uneven data distribution across MongoDB shards. For a detailed analysis, see fenng's article: Foursquare's 11-hour downtime.
One addition: because MongoDB's storage engine uses mmap, all of its memory management is handled by the OS, and the OS manages memory in pages. If a page holds several small documents, deleting just one of them does not free the page's memory. As a result, during the incident it ultimately took nearly five hours to run repairDatabase.
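The page effect can be reproduced with a toy model in Python. The sketch assumes 4096-byte pages and fixed 512-byte documents laid out back to back, and counts which pages still hold at least one live document. The numbers and the helper name are illustrative assumptions, not measurements from the incident.

```python
PAGE_SIZE = 4096   # assumed OS page size
DOC_SIZE = 512     # assumed size of each small document


def pages_touched(live_offsets, doc_size=DOC_SIZE):
    """Return the set of page numbers that contain part of a live document."""
    pages = set()
    for offset in live_offsets:
        first = offset // PAGE_SIZE
        last = (offset + doc_size - 1) // PAGE_SIZE
        pages.update(range(first, last + 1))
    return pages


all_docs = [i * DOC_SIZE for i in range(16)]   # 16 documents fill 2 pages
every_other = all_docs[::2]                    # delete half, scattered
first_page_only = all_docs[:8]                 # delete all docs on page 1

print(len(pages_touched(all_docs)))
print(len(pages_touched(every_other)))
print(len(pages_touched(first_page_only)))
```

Deleting half of the documents at scattered positions leaves both pages in use, while deleting the same number of documents from one page frees it. Compacting the data brings the page count back down, which is why the repair was needed.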

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.