Key Points of MongoDB Design

Source: Internet
Author: User

From: http://apan.me/index.php/2011/05/01/mongodb%E8%AE%BE%E8%AE%A1%E8%A6%81%E7%82%B9/

A while ago, while studying auto-scaling, I looked into MongoDB. Here I briefly record the key points of its design.
The official MongoDB website describes it as follows: MongoDB is a high-performance, highly scalable, open-source document database implemented in C++. Its main features are:

  • Document-oriented storage
  • Index support
  • High availability (replica sets)
  • Auto-sharding
Basic Concepts
  • Document: the basic unit of data, equivalent to a row in a relational database;
  • Collection: equivalent to a table in a relational database, but schema-free;
  • Namespace: each collection has a corresponding namespace.
Storage Engine
  1. The storage engine is built on memory-mapped files (mmap). On 32-bit machines the address space limits a single instance to roughly 2 GB of data, while 64-bit machines are essentially unlimited (terabyte scale), so a 64-bit machine is recommended for deployment.
  2. Each database consists of one .ns metadata file and multiple data files (dbname.{0, 1, 2, ...}, with an incrementing number as the extension). To keep small databases from wasting space, data files start at 16 MB by default and double in size with each new file, up to a maximum of 2 GB per file.
  3. Each data file is organized into extents. A namespace can own multiple extents, and they need not be contiguous. As with data files, the extent size of a namespace also grows by doubling. Each database also has a special namespace named $freelist, which tracks extents for reuse (for example, after a collection is dropped).
  4. The .ns file stores the metadata of every collection in the database, such as which extents and indexes make up the collection. The location of each extent is represented by the following data structure:
    struct diskloc
    {
    int filenum; // the 0 in test.0
    int offset;  // position in file
    }; // 64 bits
  5. Because mmap is used, all memory management is left to the OS.
  6. Reference: http://www.10gen.com/video/mongosv2010/storageengine
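
The file growth rule and the 64-bit location record described above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for this example, the 16 MB starting size and 2 GB cap are the figures given in the text, and the little-endian layout of the two 32-bit integers is an assumption.

```python
import struct

MB = 1024 * 1024
MAX_FILE_SIZE = 2048 * MB  # 2 GB cap per data file, as stated in the text


def data_file_sizes(count, first_size=16 * MB):
    """Sizes of dbname.0, dbname.1, ... when each new file doubles up to the cap."""
    sizes = []
    size = first_size
    for _ in range(count):
        sizes.append(size)
        size = min(size * 2, MAX_FILE_SIZE)
    return sizes


def pack_diskloc(filenum, offset):
    """Pack (filenum, offset) into 8 bytes: two 32-bit signed ints.

    Little-endian byte order is an assumption made for this sketch.
    """
    return struct.pack("<ii", filenum, offset)


def unpack_diskloc(raw):
    """Inverse of pack_diskloc: returns the (filenum, offset) tuple."""
    return struct.unpack("<ii", raw)


sizes_in_mb = [size // MB for size in data_file_sizes(9)]
location = pack_diskloc(0, 4096)  # byte 4096 of test.0
```

With these assumptions, the eighth file already reaches the 2 GB cap and every later file stays at that size, while a location always fits in exactly 8 bytes.
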
Disaster Recovery and Backup

Basic Principles of Replica Sets

A replica set consists of exactly one primary (master) and one or more secondaries (slaves). When the current primary becomes unavailable, the replica set elects one of the secondaries as the new primary.

Composition of a Replica Set

  • Standard: a regular node that holds a full copy of the data and can be elected primary.
  • Passive: holds a full copy of the data but cannot be elected primary.
  • Arbiter: only votes and holds no data. It is generally used to help judge the network state.

Oplog

The oplog (short for operation log) is the equivalent of MySQL's binlog and is the core of MongoDB's replication mechanism. It lives in a special database named local, in the collection oplog.$main (replica sets use oplog.rs). Its size can be set at startup with the --oplogSize parameter.
Each oplog entry contains the following fields:
  • ts: an 8-byte timestamp, made up of a 4-byte UNIX timestamp plus a 4-byte incrementing counter. This value is very important: when a new primary is elected, the secondary with the largest ts is chosen.
  • op: a 1-byte operation type, for example i for insert and d for delete.
  • ns: the namespace the operation applies to.
  • o: the document for the operation.
Oplog entries are idempotent: replaying them any number of times gives the same result, as long as they are applied in the recorded order. For example, an increment ($inc) operation is recorded in the oplog as a set ($set) operation.
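
To make the idempotence point concrete, here is a minimal in-memory sketch in Python. It is not MongoDB's implementation: the functions apply_entry, replay, and log_increment are hypothetical, the store is a plain dict, ts is modelled as a (seconds, counter) tuple, and the o2 field that identifies the updated document is a simplification added for this example.

```python
def apply_entry(store, entry):
    """Apply one oplog-style entry to an in-memory store keyed by namespace and _id."""
    collection = store.setdefault(entry["ns"], {})
    op, doc = entry["op"], entry["o"]
    if op == "i":
        collection[doc["_id"]] = dict(doc)   # re-inserting overwrites, so replay is safe
    elif op == "d":
        collection.pop(doc["_id"], None)     # deleting a missing document is a no-op
    elif op == "u":
        collection[entry["o2"]["_id"]].update(doc["$set"])
    return store


def replay(store, oplog):
    """Apply all entries in ts order; ts is a (seconds, counter) tuple here."""
    for entry in sorted(oplog, key=lambda e: e["ts"]):
        apply_entry(store, entry)
    return store


def log_increment(current_doc, field, delta, ts, ns):
    """Record an increment as an absolute set of the resulting value."""
    new_value = current_doc.get(field, 0) + delta
    return {"ts": ts, "op": "u", "ns": ns,
            "o2": {"_id": current_doc["_id"]},
            "o": {"$set": {field: new_value}}}


oplog = [
    {"ts": (1304208000, 1), "op": "i", "ns": "test.users",
     "o": {"_id": 1, "visits": 0}},
    log_increment({"_id": 1, "visits": 0}, "visits", 1, (1304208000, 2), "test.users"),
]
once = replay({}, oplog)
twice = replay(replay({}, oplog), oplog)
```

Because the increment is stored as "set visits to 1" rather than "add 1 to visits", replaying the log twice leaves visits at 1. A log that stored the relative increment would give 2 on the second pass.
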

Synchronization

When a new secondary starts, it first performs a full sync of the data files from the primary, then queries the primary's oplog and applies the operations it finds there.

Election Algorithm

  • Query all other members for their maxAppliedOpTime
  • Try to elect self if we have the highest time and can see a majority of nodes
    • If there is a tie on the highest time, delay a short random amount first
    • Send elect(selfId, maxOpTime) to the others
  • If we receive an elect message and our time is higher, we reply no
  • We must get back a majority of yes votes
  • If we send a yes, we respond no to all others for 1 minute. Electing ourselves counts as a yes.
  • Repeat as necessary after a random sleep
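
The voting rules listed above can be sketched as a small simulation. This is a simplified model and not MongoDB's code: the Member class and try_elect function are invented for this example, time is passed in explicitly as seconds, and the random delays on ties and retries are left out.

```python
class Member:
    """One replica set member, reduced to what the voting rules need."""

    def __init__(self, name, optime):
        self.name = name
        self.optime = optime      # stands in for maxAppliedOpTime
        self.last_yes = None      # time of the most recent yes vote, if any

    def can_say_yes(self, now):
        """A member that said yes refuses everyone else for 60 seconds."""
        return self.last_yes is None or now - self.last_yes >= 60

    def handle_elect(self, candidate_optime, now):
        """Answer an elect message: True means yes, False means no."""
        if self.optime > candidate_optime or not self.can_say_yes(now):
            return False
        self.last_yes = now
        return True


def try_elect(candidate, reachable, set_size, now):
    """Return True if the candidate collects a majority of yes votes."""
    majority = set_size // 2 + 1
    if len(reachable) + 1 < majority:
        return False              # cannot see a majority of the set
    if any(peer.optime > candidate.optime for peer in reachable):
        return False              # someone else has a higher time
    if not candidate.can_say_yes(now):
        return False              # we already voted for another member
    candidate.last_yes = now      # electing ourselves counts as a yes
    yes_votes = 1 + sum(peer.handle_elect(candidate.optime, now)
                        for peer in reachable)
    return yes_votes >= majority


a, b, c = Member("a", 12), Member("b", 10), Member("c", 10)
first = try_elect(a, [b, c], set_size=3, now=0)       # a has the highest time
too_soon = try_elect(b, [c], set_size=3, now=10)      # b said yes 10 seconds ago
later = try_elect(b, [c], set_size=3, now=100)        # a is gone, a minute has passed
alone = try_elect(c, [], set_size=3, now=500)         # c cannot see a majority
```

The one-minute rule is what keeps two candidates from both gathering a majority in the same round: a member that has just said yes cannot say yes again, not even to itself.
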
Auto-scaling

Auto-sharding Mechanism

  • Config servers: store the metadata for the whole cluster, including basic information about each shard server and its chunks. Every config server holds a full copy of the metadata, and consistency across config servers is ensured through two-phase commit. If any config server is unavailable, the cluster metadata becomes read-only and cannot be modified.
  • Router: receives client requests and forwards them to the appropriate shards. When necessary, the router also merges the result sets before returning them to the client.
  • Shard servers: the data servers. Each shard is generally a replica set.
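
The router's "forward, then merge" role can be illustrated with a short Python sketch. The route_query function and the sample data are hypothetical: shards are plain lists of dicts, and the query is sent to every shard instead of only the ones the metadata points to.

```python
import heapq


def route_query(shards, predicate, sort_key):
    """Run the query on every shard and merge the sorted partial results."""
    partial_results = [
        sorted((doc for doc in docs if predicate(doc)), key=sort_key)
        for docs in shards.values()
    ]
    # Each partial result is already sorted, so an n-way merge is enough.
    return list(heapq.merge(*partial_results, key=sort_key))


shards = {
    "shard0": [{"name": "Adams", "age": 41}, {"name": "Baker", "age": 17}],
    "shard1": [{"name": "Miller", "age": 35}],
    "shard2": [{"name": "Nolan", "age": 29}, {"name": "Nash", "age": 52}],
}
adults_by_age = route_query(shards,
                            predicate=lambda doc: doc["age"] >= 18,
                            sort_key=lambda doc: doc["age"])
names = [doc["name"] for doc in adults_by_age]
```

Each shard filters and sorts its own data, so the router only has to interleave already-sorted lists and never re-sorts the full result.
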

Route table

Collection   MinKey            MaxKey            Location
Users        {Name: -∞}        {Name: Miller}    Shard0
Users        {Name: Miller}    {Name: Ness}      Shard1
Users        {Name: Ness}      {Name: Ogden}     Shard2
Users        {Name: Ogden}     {Name: +∞}        Shard3
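
A lookup against a route table like the one above amounts to a binary search over the chunk boundaries. The Python sketch below assumes the boundaries Miller, Ness, and Ogden from the table, treats the first and last ranges as open-ended, and compares names as plain strings. The locate function and ROUTE_TABLE layout are invented for this example.

```python
import bisect

# (upper bound, shard) per chunk, in key order. None stands for "no upper bound".
ROUTE_TABLE = [
    ("Miller", "Shard0"),   # everything below Miller
    ("Ness", "Shard1"),     # [Miller, Ness)
    ("Ogden", "Shard2"),    # [Ness, Ogden)
    (None, "Shard3"),       # Ogden and above
]


def locate(name):
    """Return the shard whose half-open range [min, max) contains name."""
    upper_bounds = [upper for upper, _ in ROUTE_TABLE[:-1]]
    # bisect_right sends a key equal to a boundary into the next chunk,
    # which matches the inclusive lower bound of a half-open range.
    index = bisect.bisect_right(upper_bounds, name)
    return ROUTE_TABLE[index][1]


targets = {name: locate(name) for name in ["Adams", "Miller", "Nz", "Ogden", "Zed"]}
```
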

Chunks

A chunk is a contiguous range of a collection's data, covering the half-open shard-key interval [MinKey, MaxKey). When a chunk grows past a size threshold (measured in MB), it is automatically split into two chunks, and chunks are migrated to other shards when necessary.
When choosing a shard key, consider whether the key can divide the data evenly. For example, if name is the shard key and many people share the same name, all of their documents fall into one chunk that cannot be split further, so splitting and migration may fail.
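
The shard key problem is easy to see in a toy version of chunk splitting. The split_chunk function below is a hypothetical sketch, not MongoDB's split logic: it works on a non-empty list of key values, picks a split point near the median, and keeps documents with the same key together in one chunk.

```python
def split_chunk(keys):
    """Split a non-empty list of shard-key values into two chunks.

    Returns (left, right), where every key in left is smaller than every key
    in right, or None when all documents share a single key value.
    """
    keys = sorted(keys)
    split_point = keys[len(keys) // 2]
    if split_point == keys[0]:
        # The median equals the smallest key, so move up to the next distinct key.
        larger = [key for key in keys if key > split_point]
        if not larger:
            return None           # one key value only: nothing to split on
        split_point = larger[0]
    left = [key for key in keys if key < split_point]
    right = [key for key in keys if key >= split_point]
    return left, right


even = split_chunk(["Adams", "Baker", "Clark", "Davis"])
skewed = split_chunk(["Smith", "Smith", "Smith", "Young"])
stuck = split_chunk(["Smith", "Smith", "Smith", "Smith"])
```

With well-spread names the chunk halves cleanly. With one dominant name the split is lopsided, and when every document has the same name there is no split point at all, so that chunk can only keep growing.
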

Case Studies

The Foursquare Outage

Foursquare suffered an 11-hour outage in October 2010. The main cause was that data was distributed unevenly across MongoDB's shards. For a detailed analysis, see fenng's article: Foursquare's 11-hour downtime.
One point to add: because the MongoDB storage engine uses mmap, memory management is left entirely to the OS, and the OS manages memory in pages. If a page holds several small documents, deleting only some of them does not free the page's memory. As a result, during the incident it ultimately took nearly five hours to run repairDatabase.
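
The page effect described above can be shown with simple arithmetic. The sketch below assumes 4 KB pages and 512-byte documents that never straddle a page boundary. Both figures and the resident_pages helper are made up for illustration.

```python
PAGE_SIZE = 4096   # assumed OS page size
DOC_SIZE = 512     # assumed document size, so 8 documents fit in one page


def resident_pages(doc_offsets, page_size=PAGE_SIZE):
    """Set of page numbers that still hold at least one live document."""
    return {offset // page_size for offset in doc_offsets}


all_docs = [i * DOC_SIZE for i in range(80)]   # 80 documents laid out back to back
survivors = all_docs[::8]                      # delete 7 of every 8 documents
compacted = [i * DOC_SIZE for i in range(len(survivors))]  # rewritten contiguously

pages_before = len(resident_pages(all_docs))
pages_after_delete = len(resident_pages(survivors))
pages_after_compaction = len(resident_pages(compacted))
```

Under these assumptions, deleting 87.5% of the documents frees no pages at all, because each page still holds one live document. Only rewriting the survivors next to each other, which is roughly what a repair or compaction pass does, brings the page count down.
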
