MongoDB Data Model


Note: This article is a translation of the official MongoDB documentation.

Introduction to Data Modeling

The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing a data model, always consider how the application will use the data (that is, the queries, updates, and processing of the data) as well as the inherent structure of the data itself.

Document structure

The key decision when designing a data model for a MongoDB application is the structure of the documents and how the application represents relationships between data. There are two tools that allow applications to represent these relationships: references and embedded documents.

References

References store relationships between data by including links or references from one document to data stored in another document. Applications can resolve these references to access the related data. Broadly speaking, these are normalized data models.

Embedded documents

Embedded documents capture relationships between data by storing related data in a single document structure. MongoDB documents allow a document structure to be embedded in a field or an array within a document. These "denormalized" data models allow applications to retrieve and manipulate related data in a single database operation.
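
For illustration (the collection and field names below are assumptions, not part of the original text), a patron document might embed its related address data directly:

{
   _id: "joe",
   name: "Joe Bookreader",
   address: { street: "123 Fake Street", city: "Faketon", zip: "12345" }
}

A single call such as db.patrons.findOne({ _id: "joe" }) then returns the patron together with the embedded address in one operation.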

Atomic nature of write operations

In MongoDB, write operations are atomic at the level of a single document; no single write operation atomically affects multiple documents or multiple collections. A denormalized data model with embedded data combines all of the related data for an entity in a single document. This facilitates atomic write operations, because a single write operation can insert or update the data for an entity. Normalized data splits data across multiple collections and requires multiple write operations that, taken together, are not atomic.

However, schemas that facilitate atomic writes may limit how applications can use the data or how they can modify it. The Atomicity Considerations documentation describes the challenge of designing a schema that balances flexibility and atomicity.

Document growth

Some updates, such as pushing elements to an array or adding new fields, can increase the size of the document.

For the MMAPv1 storage engine, if the document size exceeds the space allocated for that document, MongoDB relocates the document on disk. When using the MMAPv1 storage engine, the consideration of document growth can affect the decision to normalize or denormalize data. For more information on planning for and managing document growth with MMAPv1, see Document Growth Considerations.

Data Model Design

An effective data model supports your application needs. The key consideration for your document structure is to decide to embed or use references.

Embedded data model

With MongoDB, you can embed related data in a single structure or document. These schemas are often referred to as "denormalized" models and take advantage of MongoDB's rich documents.

The embedded data model allows applications to store related information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.

In general, use an embedded data model when:

    • You have "contains" relationships between entities. See Model One-to-One Relationships with Embedded Documents.
    • You have one-to-many relationships between entities. In these relationships, the "many" or child documents always appear with or are viewed in the context of the "one" or parent documents. See Model One-to-Many Relationships with Embedded Documents.

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedded data models also make it possible to update related data in a single atomic write operation.

However, embedding related data in a document can cause the document to grow after it is created. With the MMAPv1 storage engine, document growth can affect write performance and lead to data fragmentation.

As of 3.0.0, MongoDB uses power of 2 sized allocations as the default allocation strategy for MMAPv1 in order to account for document growth and minimize the likelihood of data fragmentation. In addition, documents in MongoDB must be smaller than the maximum BSON document size (16 megabytes). For bulk binary data, consider GridFS.

Normalized data model

In general, use a normalized data model when:

    • Embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
    • You need to represent more complex many-to-many relationships.
    • You need to model large hierarchical data sets.

References provide more flexibility than embedding. However, client-side applications must issue follow-up queries to resolve the references. In other words, normalized data models can require more round trips to the server.
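
As a sketch of that extra round trip (the collection and field names here are hypothetical), an application using references issues one query for the main document and a second query to resolve the reference:

// first query: fetch the book document
var book = db.books.findOne({ title: "MongoDB Notes" })

// second query: resolve the reference to the publisher document
var publisher = db.publishers.findOne({ _id: book.publisher_id })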

Operational factors and data models

Modeling application data for MongoDB depends on the data itself as well as on the characteristics of MongoDB itself. For example, different data models might allow applications to use more efficient queries, increase the throughput of insert and update operations, or distribute activity more effectively across a sharded cluster.

These factors are operational, that is, they address requirements that arise outside of the application but still affect the performance of MongoDB-based applications. When developing a data model, analyze all of your application's read and write operations in conjunction with the following considerations.

    • Document growth

Changed in version 3.0.0.

Some updates to documents can increase the size of the document. These updates include pushing elements to an array (that is, $push) and adding new fields to the document.

When you use the MMAPv1 storage engine, document growth can be a consideration for your data model. For MMAPv1, if the document size exceeds the space allocated for the document, MongoDB relocates the document on disk. However, with MongoDB 3.0.0, the default use of power of 2 sized allocations minimizes the occurrence of such re-allocations and allows for effective reuse of freed record space.

When using MMAPv1, if your application's updates frequently cause documents to grow beyond their allocated power of 2 record sizes, you may need to refactor the data model to use references between data in different documents rather than a denormalized data model.

You can also use a pre-allocation strategy to explicitly avoid document growth.
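
One common pre-allocation approach, sketched here with hypothetical collection and field names, is to insert the document at its eventual full size up front so that later updates overwrite existing space instead of growing the document:

// pre-allocate a full day of hourly counters at insert time
var hourly = {};
for (var h = 0; h < 24; h++) { hourly[h] = 0; }
db.page_stats.insert({ page: "/home", date: ISODate("2015-01-01"), hourly: hourly });

// later updates increment fields that already exist, so the document does not grow
db.page_stats.update(
    { page: "/home", date: ISODate("2015-01-01") },
    { $inc: { "hourly.13": 1 } }
);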

    • Atomicity

In MongoDB, operations are atomic at the document level. No single write operation can atomically change multiple documents. Operations that modify more than one document in a collection still operate on one document at a time. [1] Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents.
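
For example (hypothetical collection and field names), updating several fields of a single document is atomic, while moving a value between two documents requires two separate, individually atomic updates:

// atomic: both fields change together in one document
db.accounts.update({ _id: 1 }, { $set: { balance: 80, updated: new Date() } })

// not atomic as a pair: each document is updated independently
db.accounts.update({ _id: 1 }, { $inc: { balance: -20 } })
db.accounts.update({ _id: 2 }, { $inc: { balance: 20 } })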

A data model that embeds related data in a single document facilitates these atomic operations. For data models that store references between related pieces of data, the application must issue separate read and write operations to retrieve and modify these related pieces of data.

    • Sharding

MongoDB uses sharding to provide horizontal scaling. Sharded clusters support deployments with large data sets and high-throughput operations. Sharding allows users to partition a collection within a database and distribute the collection's documents across multiple mongod instances, or shards.

To distribute data and application traffic in a sharded collection, MongoDB uses the shard key. Selecting the proper shard key has significant implications for performance and can enable or prevent query isolation and increased write capacity. It is important to consider carefully the field or fields to use as the shard key.
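
A minimal sketch, assuming a database mydb and a users collection sharded on a user_id field:

// enable sharding for the database, then shard the collection on the chosen key
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { user_id: 1 })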

    • Indexes

Use indexes to improve performance for common queries. Build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
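
For example (collection and field names are assumptions), an index that supports a common filter plus a sorted result:

// index to support queries that filter by status and sort by creation date
db.orders.createIndex({ status: 1, created_at: -1 })

// this query can use the index for both the filter and the sort
db.orders.find({ status: "shipped" }).sort({ created_at: -1 })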

When you create an index, consider the following behavior of the index:

    1. Each index requires at least 8 KB of data space.
    2. Adding an index has some negative performance impact on write operations. For collections with a high write-to-read ratio, indexes are expensive, since each insert must also update every index.
    3. Collections with a high read-to-write ratio often benefit from additional indexes. Indexes do not affect un-indexed read operations.
    4. When active, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.

    • Large numbers of collections

In some cases, you might choose to store related information in more than one collection, rather than in a single collection.

Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

{log: "dev", Ts:...,info: ...}

{log: "Debug", Ts:...,info: ...}

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs_dev and logs_debug. The logs_dev collection would contain only the documents related to the dev environment.
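
For instance (the field values here are assumptions), the application writes each environment's entries to its own collection and queries only that collection:

db.logs_dev.insert({ ts: new Date(), info: "request served" })
db.logs_debug.insert({ ts: new Date(), info: "cache miss" })

// queries against the dev logs never scan debug documents
db.logs_dev.find({ ts: { $gte: ISODate("2015-01-01") } })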

In general, having a large number of collections carries no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.

When using models that have a large number of collections, consider the following behaviors:

    • Each collection has a minimum overhead of a few kilobytes.
    • Each index, including the index on _id, requires at least 8 KB of data space.
    • For each database, a single namespace file (that is, <database>.ns) stores all of the metadata for that database, and each index and collection has its own entry in the namespace file. MongoDB places limits on the size of namespace files.

MongoDB with the MMAPv1 storage engine has limits on the number of namespaces. You may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support. To get the current number of namespaces, run the following in the mongo shell:

db.system.namespaces.count()

The limit on the number of namespaces depends on the size of <database>.ns. The namespace file defaults to a size of 16 MB.

To change the size of new namespace files, start the server with the --nssize <new size MB> option. For existing databases, after starting the server with --nssize, run the db.repairDatabase() command from the mongo shell.
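
For example (the 32 MB value here is only an illustration), start the server with a larger namespace file size:

mongod --nssize 32

Then, for each existing database that should use the new size, run the following from the mongo shell:

db.repairDatabase()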

    • Collections that contain a large number of small documents

You should consider embedding for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider "rolling up" the small documents into larger documents that contain an array of embedded documents.

"Scrolling" These small documents into logical groupings means that queries that retrieve a set of documents involve sequential reads and fewer random disk accesses. In addition, "roll up" files and move public fields to larger documents are advantageous for these areas of indexing. A copy of the public field is reduced, and the key key entry in the corresponding index is reduced.

However, if you often only need to retrieve a subset of the documents within the group, then "rolling up" the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.
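
A minimal sketch of such a roll-up, with hypothetical field names: instead of storing one document per log line, related entries for a host and day are kept as an array of embedded documents.

{
   host: "web01",
   day: ISODate("2015-01-01"),
   entries: [
      { ts: ISODate("2015-01-01T00:00:05Z"), msg: "service started" },
      { ts: ISODate("2015-01-01T00:00:09Z"), msg: "listening on port 8080" }
   ]
}

Retrieving all of a host's entries for a day then reads one larger document sequentially instead of many small documents at random.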

    • Storage optimization for small documents

Each MongoDB document contains a certain amount of overhead. This overhead is normally insignificant, but it becomes significant if all of your documents are just a few bytes, as might be the case if the documents in your collection have only one or two fields.

Consider the following recommendations and policies to optimize storage utilization for these collections:

    • Explicitly use the _id field.

MongoDB clients automatically add an _id field to each document and generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller documents, this may take up significant space.

To optimize storage use, users can explicitly specify a value for the _id field when inserting documents into the collection. This strategy allows applications to store a value in the _id field that would otherwise have occupied space in another portion of the document.

You can store any value in the _id field, but because this value serves as the primary key for documents in the collection, it must uniquely identify them. If the field's value is not unique, it cannot serve as a primary key, as there would be collisions in the collection.
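
For example (assuming a collection of zip code documents), the application stores the value that would otherwise occupy a separate field directly in _id:

// instead of { _id: ObjectId(...), code: "10001", state: "NY" }
db.zipcodes.insert({ _id: "10001", state: "NY" })

// look up the document by its primary key directly
db.zipcodes.findOne({ _id: "10001" })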

    • Use shorter field names.

Note

Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents or where document overhead is not of significant concern. Shorter field names do not reduce the size of indexes, because indexes have a predefined structure.

In general, you do not need to use short field names.

MongoDB stores all field names in every document. For most documents, this represents a small fraction of the space used by the document; however, for small documents, the field names may represent a proportionally large amount of space. Consider a collection of small documents that resemble the following:

{ last_name: "Smith", best_score: 3.9 }

If you shorten the field named last_name to lname and the field named best_score to score, as follows, you could save 9 bytes per document.

{ lname: "Smith", score: 3.9 }

    • Embed documents

In some cases, you may want to embed documents in other documents and save on the per-document overhead. See Collections that contain a large number of small documents above.

    • Data lifecycle management

Data modeling decisions should consider data lifecycle management.

The Time to Live, or TTL, feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for only a limited period of time.

Additionally, if your application uses only recently inserted documents, consider using capped collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
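
Minimal sketches of both features, with assumed collection names and values: a TTL index expires documents a fixed time after their createdAt value, and a capped collection is created with a fixed maximum size.

// expire session documents one hour after their createdAt timestamp
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// create a 100 MB capped collection for recent log entries
db.createCollection("recent_logs", { capped: true, size: 104857600 })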
