6 important rules of thumb in MongoDB database design


Part 1

Original: 6 Rules of Thumb for MongoDB Schema Design: Part 1

By William Zola, Lead Technical Support Engineer at MongoDB

"I have a lot of experience with SQL, but I'm a beginner in MongoDB. How do I model a one-to-many relationship in MongoDB?" This is one of the most common questions I am asked.

I don't have a short answer, because there are many ways to do it. Below I'll walk through how to model one-to-many relationships.

This topic has enough material that I'll cover it in three parts. In the first part, I'll discuss three basic ways to model one-to-many relationships. In the second part, I'll cover more advanced schema designs, including denormalization and two-way referencing. In the last part, I'll review the full range of choices and give the factors to consider when making a decision.

Many beginners think that the only way to model one-to-many in MongoDB is to embed an array of subdocuments in the parent document, but that isn't true. Just because you can embed a document doesn't mean you should.

When you design a MongoDB schema, you need to ask yourself a question that never comes up with a relational database: what is the cardinality of the relationship? Put less formally, you need to characterize your one-to-N relationship with more nuance: is it one-to-few, one-to-many, or one-to-squillions? Depending on which one it is, you will model the relationship differently.

Basics: Modeling One-to-Few

An example of one-to-few is the set of addresses for a person. This is a good use case for embedding: you put the addresses in an array inside the person document.
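As a minimal sketch, the embedded design could look like the following. A plain Python dict stands in for the MongoDB document, and the field names and values are illustrative assumptions, not taken from a real schema.

```python
# A person document with the addresses embedded as an array of subdocuments.
# All names and values here are made up for illustration.
person = {
    "name": "Kate Monster",
    "addresses": [
        {"street": "123 Sesame St", "city": "Anytown", "cc": "USA"},
        {"street": "123 Avenue Q", "city": "New York", "cc": "USA"},
    ],
}

# Reading the person document returns every address; no second query is needed.
cities = [address["city"] for address in person["addresses"]]
print(cities)
```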

This design has all the pros and cons of embedding. The main advantage is that you don't need a separate query to get the embedded details. The main drawback is that you can't access the embedded details as stand-alone entities.

For example, if you are modeling a task-tracking system, each person has a number of tasks assigned to them. Embedding the tasks in the person document makes a query such as "show me all of yesterday's tasks" much harder than it needs to be. I will cover a more appropriate design for this use case in the next article.

Basics: One-to-Many

Take a replacement-parts ordering system as an example. Each product may have up to several hundred replacement parts, but never more than a couple of thousand. This is a good use case for referencing: you store the ObjectIDs of the parts in an array in the product document. (The examples here use 2-byte ObjectIDs because they are easier to read; real-world ObjectIDs are 12 bytes.)

Each part has its own document.

Each product also has its own document, which holds an array of ObjectID references to the parts that make up that product.
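A rough sketch of the two document shapes follows, again using Python dicts in place of real documents. The short string ids stand in for ObjectIDs, and every field name and value is an assumption made for illustration.

```python
# Hypothetical part documents; each part is a stand-alone document.
parts = [
    {"_id": "AAAA", "partno": "123-aff-456", "name": "#4 grommet", "qty": 94},
    {"_id": "F17C", "partno": "789-bcd-012", "name": "fan blade assembly", "qty": 12},
    {"_id": "D2AA", "partno": "345-efg-678", "name": "power switch", "qty": 40},
]

# Hypothetical product document; "parts" holds only references (ids), not the parts.
product = {
    "name": "left-handed smoke shifter",
    "manufacturer": "Acme Corp",
    "catalog_number": 1234,
    "parts": ["AAAA", "F17C", "D2AA"],
}

# Sanity check: every reference in the product points at an existing part.
known_ids = {part["_id"] for part in parts}
dangling = [ref for ref in product["parts"] if ref not in known_ids]
print(dangling)
```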

To retrieve all the parts of a particular product, you then use an application-level join.

For the query to run quickly, you need an index on products.catalog_number. The second lookup is also efficient, because parts._id is always indexed.

This style of referencing has a complementary set of advantages and disadvantages to embedding. Each part is a stand-alone document, so it is easy to search and update parts independently. The trade-off is that you need a second query to get the details of the parts for a product. (Hold that thought: we will come back to this issue when we discuss denormalization in Part 2.)

As a bonus, this schema lets individual parts be used by multiple products, so your one-to-N schema becomes an N-to-N schema without any need for a join table.
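The application-level join can be sketched as two lookups. The version below runs against in-memory lists rather than a live database, and the function name parts_for_product and all sample data are hypothetical.

```python
# In-memory stand-ins for the products and parts collections (illustrative data).
products = [
    {"name": "left-handed smoke shifter", "catalog_number": 1234,
     "parts": ["AAAA", "F17C", "D2AA"]},
]
parts = [
    {"_id": "AAAA", "name": "#4 grommet", "qty": 94},
    {"_id": "F17C", "name": "fan blade assembly", "qty": 12},
    {"_id": "D2AA", "name": "power switch", "qty": 40},
    {"_id": "B0B0", "name": "unrelated widget", "qty": 7},
]

def parts_for_product(catalog_number):
    # Lookup 1: fetch the product document by its catalog number.
    product = next(p for p in products if p["catalog_number"] == catalog_number)
    # Lookup 2: fetch every part whose _id appears in the product's parts array.
    wanted = set(product["parts"])
    return [part for part in parts if part["_id"] in wanted]

product_parts = parts_for_product(1234)
print([part["name"] for part in product_parts])
```

With a real driver, the first lookup would typically be a single-document find on products and the second a find on parts using an $in condition on _id.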

Basics: One-to-Squillions

An example of one-to-squillions is an event-logging system that collects log messages from different machines. Any given host can generate enough messages to overflow the 16 MB document size limit, even if all you store in the array is the ObjectID. This is the classic use case for "parent-referencing": you have one document for the host, and each log message document stores the ObjectID of that host.

To find the most recent 5,000 log messages for a host, you use a slightly different application-level join from the one in the second scenario.
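A minimal sketch of parent-referencing and the matching two-step lookup is shown below. It uses in-memory lists, integer timestamps for brevity, and a hypothetical helper named recent_messages; none of these names come from the original.

```python
# One document per host (illustrative data).
hosts = [
    {"_id": "AAAB", "name": "goofy.example.com", "ipaddr": "127.66.66.66"},
    {"_id": "AAAC", "name": "pluto.example.com", "ipaddr": "127.66.66.67"},
]

# Each log message stores a reference to its parent host in "host".
logmsg = [
    {"time": 1, "message": "disk is nearly full", "host": "AAAB"},
    {"time": 2, "message": "cpu is on fire!", "host": "AAAB"},
    {"time": 3, "message": "fan replaced", "host": "AAAC"},
    {"time": 4, "message": "cpu has cooled down", "host": "AAAB"},
]

def recent_messages(ipaddr, limit=5000):
    # Lookup 1: find the host document, assuming ipaddr is unique per host.
    host = next(h for h in hosts if h["ipaddr"] == ipaddr)
    # Lookup 2: find that host's messages, newest first, capped at `limit`.
    matching = [m for m in logmsg if m["host"] == host["_id"]]
    matching.sort(key=lambda m: m["time"], reverse=True)
    return matching[:limit]

latest = recent_messages("127.66.66.66", limit=2)
print([m["message"] for m in latest])
```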

So even at this basic level, there is more to think about when designing a MongoDB schema than when designing a comparable relational schema. You need to consider two factors:

Will the entities on the "N" side of the one-to-N ever need to stand alone?

What is the cardinality of the relationship: is it one-to-few, one-to-many, or one-to-squillions?

Based on these factors, you can pick one of the three basic one-to-N schema designs:

Embed the N side if the cardinality is one-to-few and there is no need to access the embedded objects outside the context of the parent object.

Use an array of references to the N-side objects if the cardinality is one-to-many, or if the N-side objects need to stand alone for any reason.

Use a reference to the one side in the N-side objects if the cardinality is one-to-squillions.

Part 2

Original: 6 Rules of Thumb for MongoDB Schema Design: Part 2

By William Zola, Lead Technical Support Engineer at MongoDB

In the previous article I covered the three basic schema designs (embedding, child-referencing, and parent-referencing) and the two factors to consider when choosing among them:

Will the entities on the "N" side of the one-to-N ever need to stand alone?

What is the cardinality of the relationship: is it one-to-few, one-to-many, or one-to-squillions?

With these basic techniques in hand, I'll move on to more advanced topics: two-way referencing and denormalization.

Two-Way Referencing

If you want to get a little fancier, you can store references on both the "one" side and the "many" side at the same time.

For example, go back to the task-tracking system from the previous article. There are two collections, one of person documents and one of task documents, with a one-to-N relationship from person to task. The application needs to get all the tasks owned by a person, so the person document holds an array of task ObjectIDs.

In some contexts, the application also needs to display a list of tasks, such as all the tasks in a multi-person project, and quickly find which person is responsible for each one. You can optimize this by adding a reference to the person in the task document.

This design has all the advantages and disadvantages of the one-to-many schema, plus some additions. The extra "owner" reference in the task document makes it quick and easy to find a task's owner. However, if you need to reassign a task to another person, you must update both the person documents and the task document. (Readers familiar with relational databases will notice that this makes it impossible to reassign a task with a single atomic update.) That is fine for a task-tracking system, but you need to consider whether your own use case can tolerate it.
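The sketch below illustrates two-way referencing and why reassignment is not atomic. Dicts keyed by id stand in for the two collections; the ids, field values, and the reassign helper are assumptions made for illustration.

```python
# Person documents hold an array of task ids (illustrative data).
people = {
    "AAF1": {"name": "Kate Monster", "tasks": ["ADF9", "AE02"]},
    "AAF2": {"name": "Rod Remington", "tasks": []},
}

# Task documents hold a back-reference to their owner.
tasks = {
    "ADF9": {"description": "Write lesson plan", "owner": "AAF1"},
    "AE02": {"description": "Grade homework", "owner": "AAF1"},
}

def reassign(task_id, new_owner_id):
    old_owner_id = tasks[task_id]["owner"]
    # Writes to the person documents: drop the task from the old owner
    # and add it to the new one.
    people[old_owner_id]["tasks"].remove(task_id)
    people[new_owner_id]["tasks"].append(task_id)
    # Separate write to the task document. Between the writes above and
    # this one, the two sides of the relationship disagree.
    tasks[task_id]["owner"] = new_owner_id

reassign("AE02", "AAF2")
print(people["AAF2"]["tasks"], tasks["AE02"]["owner"])
```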

Denormalizing in a One-to-Many Relationship

Denormalizing lets you avoid application-level joins on reads, at the price of more complex and more expensive updates. An example will make this clear.

Denormalizing from Many to One

In the products and parts example, you can denormalize the name of each part into the product's parts array. Without denormalization, that array holds only the ObjectIDs of the parts.

Denormalizing means that you don't need an application-level join to display all the part names for a product, but you still need that join if you need any other information about a part.

Getting the part names becomes easier, but the application-level join needs a little more client-side work than before, because you first have to pull the ObjectIDs out of the entries in the parts array.
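As a hedged sketch, the denormalized product document and the adjusted join might look like this. The "id" and "name" keys inside each parts entry are an assumed layout, and the data is illustrative.

```python
# Product document with the part name copied next to each reference.
product = {
    "name": "left-handed smoke shifter",
    "catalog_number": 1234,
    "parts": [
        {"id": "AAAA", "name": "#4 grommet"},
        {"id": "F17C", "name": "fan blade assembly"},
        {"id": "D2AA", "name": "power switch"},
    ],
}

# Stand-in for the parts collection, which still holds the full details.
parts = [
    {"_id": "AAAA", "name": "#4 grommet", "qty": 94},
    {"_id": "F17C", "name": "fan blade assembly", "qty": 12},
    {"_id": "D2AA", "name": "power switch", "qty": 40},
]

# Part names are available straight from the product: no join.
part_names = [entry["name"] for entry in product["parts"]]

# Anything else (here, quantity) still needs the join, and the ids must
# first be extracted from the entries.
part_ids = {entry["id"] for entry in product["parts"]}
quantities = {p["_id"]: p["qty"] for p in parts if p["_id"] in part_ids}
print(part_names, quantities)
```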

Denormalizing saves you a lookup on reads at the price of a more expensive update: if you denormalize the part name into the product document, then whenever you change the name of a part you must also update every product document that contains that part.

Denormalizing only makes sense when the ratio of reads to writes is high. If you read the denormalized data often but update it rarely, it is worth paying for slower and more complex updates in exchange for more efficient queries. As updates become more frequent relative to reads, the benefit of this design shrinks.

For example, suppose the part name changes rarely but the quantity in stock changes frequently. Then it makes sense to denormalize the part name into the product document, but not the quantity.

Also note that once you denormalize a field, updates to that field are no longer atomic. As with the two-way referencing example above, if you update the part name in the part document first and in the product documents afterwards, there is a short window in which the two disagree.
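To make the update cost concrete, here is a small sketch that counts how many documents a single rename has to touch. The rename_part helper and the sample data are hypothetical.

```python
# Stand-in for the parts collection, keyed by id.
parts = {"AAAA": {"name": "#4 grommet", "qty": 94}}

# Two products embed a copy of the part name; a third does not use the part.
products = [
    {"catalog_number": 1234, "parts": [{"id": "AAAA", "name": "#4 grommet"}]},
    {"catalog_number": 5678, "parts": [{"id": "AAAA", "name": "#4 grommet"}]},
    {"catalog_number": 9999, "parts": [{"id": "F17C", "name": "fan blade assembly"}]},
]

def rename_part(part_id, new_name):
    """Rename a part everywhere and return the number of documents written."""
    parts[part_id]["name"] = new_name
    writes = 1  # the part document itself
    for product in products:
        touched = False
        for entry in product["parts"]:
            if entry["id"] == part_id:
                entry["name"] = new_name
                touched = True
        if touched:
            writes += 1  # one more write per product holding a copy
    return writes

writes = rename_part("AAAA", "#4 grommet (steel)")
print(writes)
```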

Denormalizing from One to Many

You can also denormalize fields from the "one" side onto the "many" side.

However, if you denormalize the product name into the part documents, then updating the product name means updating every part that belongs to that product, which is significantly more expensive than updating a single product document. In this case it is even more important to consider the read-to-write ratio.
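A sketch of this direction, with the product name copied into each part document, might look as follows. The product_name and product_catalog_number field names are assumptions, as is the rename_product helper.

```python
product = {"name": "left-handed smoke shifter", "catalog_number": 1234}

# Each part carries a copy of its product's name and catalog number.
parts = [
    {"_id": "AAAA", "name": "#4 grommet",
     "product_name": "left-handed smoke shifter", "product_catalog_number": 1234},
    {"_id": "F17C", "name": "fan blade assembly",
     "product_name": "left-handed smoke shifter", "product_catalog_number": 1234},
    {"_id": "D2AA", "name": "power switch",
     "product_name": "left-handed smoke shifter", "product_catalog_number": 1234},
]

def rename_product(new_name):
    """Rename the product and return how many part documents had to change."""
    product["name"] = new_name
    touched = 0
    for part in parts:
        if part["product_catalog_number"] == product["catalog_number"]:
            part["product_name"] = new_name
            touched += 1
    return touched

touched = rename_product("ambidextrous smoke shifter")
print(touched)
```

One product rename fans out to every part of that product, which is why this direction is usually the more expensive of the two.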

Denormalizing in a One-to-Squillions Relationship

Denormalization can also be applied to the one-to-squillions example, the logging system. It works in either direction: you can copy information from the "one" side (the host document) into the log messages, or the other way around.

For example, you can denormalize the host's IP address into each log message.

Getting the most recent log messages for a given IP address then becomes simple: it takes one query instead of the previous two.

In fact, if there is only a small amount of information to store on the "one" side, you can even denormalize all of it onto the "squillions" side, merging the two objects.

On the other hand, you can also denormalize onto the "one" side. For example, to keep the last 1,000 log messages from a host in the host document, you can use the $each/$slice functionality introduced in MongoDB 2.4 to keep the list sorted and retain only the last 1,000 entries.

The log messages are saved in the logmsg collection as well as in the denormalized list in the host document. That way a message is not lost when it is trimmed out of the host's logmsgs array.

In queries that don't need the logmsgs array, use a projection specification (such as {_id: 1}) so that MongoDB does not return the whole host document; shipping 1,000 log messages over the network each time is a significant overhead.

As in the one-to-many case, consider the ratio of reads to updates carefully. Denormalizing log messages into the host document is a good decision only if the log messages are almost never updated.
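The single-step lookup can be sketched like this, with the ipaddr field copied into each log message. The data, the integer timestamps, and the recent_by_ip helper are illustrative assumptions.

```python
# Log messages carry both the parent reference and a copy of the host's IP.
logmsg = [
    {"time": 1, "message": "disk is nearly full", "host": "AAAB", "ipaddr": "127.66.66.66"},
    {"time": 2, "message": "cpu is on fire!", "host": "AAAB", "ipaddr": "127.66.66.66"},
    {"time": 3, "message": "fan replaced", "host": "AAAC", "ipaddr": "127.66.66.67"},
    {"time": 4, "message": "cpu has cooled down", "host": "AAAB", "ipaddr": "127.66.66.66"},
]

def recent_by_ip(ipaddr, limit=5000):
    # Single lookup: filter the log messages directly by the copied IP
    # address; the hosts collection is never consulted.
    matching = [m for m in logmsg if m["ipaddr"] == ipaddr]
    matching.sort(key=lambda m: m["time"], reverse=True)
    return matching[:limit]

latest = recent_by_ip("127.66.66.66", limit=2)
print([m["message"] for m in latest])
```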
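The capped list can be sketched in two parts: the shape of the update document, and a pure-Python emulation of its effect. The update document follows the $push modifiers as I understand them; push_with_slice is a hypothetical stand-in for what the server would do, and the cap is set to 3 instead of 1,000 to keep the demo short.

```python
# Assumed shape of the update sent to the server for one new message:
# append it, keep the array sorted by time, and retain the last 1,000.
new_msg = {"time": 6, "message": "example event"}
update_doc = {
    "$push": {
        "logmsgs": {"$each": [new_msg], "$sort": {"time": 1}, "$slice": -1000}
    }
}

def push_with_slice(array, new_items, keep_last):
    """Emulate $push with $each, an ascending $sort on time, and a negative
    $slice. Assumes keep_last is greater than zero."""
    array.extend(new_items)
    array.sort(key=lambda m: m["time"])
    del array[:-keep_last]  # drop everything except the newest entries

host = {"_id": "AAAB", "name": "goofy.example.com", "logmsgs": []}
logmsg = []  # the full log collection keeps every message

for t in range(1, 6):
    msg = {"time": t, "message": f"event {t}", "host": host["_id"]}
    logmsg.append(msg)                                    # write 1: full history
    push_with_slice(host["logmsgs"], [msg], keep_last=3)  # write 2: capped copy

print([m["time"] for m in host["logmsgs"]], len(logmsg))
```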

Summary

In this article, I covered options that complement the three basic designs of embedding, child-referencing, and parent-referencing.

You can use two-way referencing if it optimizes your schema, provided you are willing to give up atomic updates.

When you use references, you can denormalize data from the "one" side onto the "N" side, or from the "N" side onto the "one" side.

Consider the following factors when deciding whether to denormalize:

You cannot perform an atomic update on denormalized data.

Denormalization only makes sense when the ratio of reads to writes is high.

Part 3

Original: 6 Rules of Thumb for MongoDB Schema Design: Part 3

By William Zola, Lead Technical Support Engineer at MongoDB

This is the last article in the series. In the first article, I introduced three basic ways to model a one-to-many relationship. In the second article, I covered extensions to those basics: two-way referencing and denormalization.

Denormalization lets you avoid some application-level joins, at the price of more complex and more expensive updates. It is worthwhile when the denormalized fields are read much more often than they are updated.

If you haven't read the first two articles, I recommend reading them first.

Let's review these options

You can embed, reference from the "one" side, reference from the "N" side, or combine all three.

You can denormalize as many fields as you like onto the "one" side or the "N" side.

Here are the rules of thumb to remember:

1. Favor embedding unless there is a compelling reason not to.

2. Needing to access an object on its own is a compelling reason not to embed it.

3. Arrays should not grow without bound. If there are more than a few hundred documents on the "many" side, don't embed them; use an array of ObjectID references instead. If there are more than a few thousand, don't use an array of ObjectID references either. Which design to use depends on the size of the array.

4. Don't be afraid of application-level joins: if you index correctly and use projection (discussed in Part 2) to limit the results, application-level joins cost little more than server-side joins in a relational database.

5. Consider the read-to-write ratio when denormalizing. A field that is mostly read and seldom updated is a good candidate for denormalization.

6. How you model your data in MongoDB depends on how your application accesses it. Structure your data to match the ways your application reads and writes it.

Design Guide

When you model a one-to-many relationship in MongoDB, you have many options, so you need to think carefully about the structure of your data. The main questions to consider are:

What is the cardinality of the relationship: is it one-to-few, one-to-many, or one-to-squillions?

Do you need to access the objects on the "N" side on their own, or only in the context of the parent object?

What is the ratio of reads to updates for the field you are thinking of denormalizing?

Data Modeling Design Guide

For one-to-few, you can embed an array in the parent document.

For one-to-many, or when the "N"-side objects need to be accessed on their own, use an array of ObjectID references. You can also add a parent reference on the "N" side if that speeds up your data access.

For one-to-squillions, use a parent reference in the documents on the "N" side.

If you introduce denormalization into your design, make sure that the denormalized data is read much more often than it is updated, and that you don't need strong consistency, because denormalization makes updates to the duplicated fields slower and non-atomic.
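The guide above can be condensed into a small decision helper. This is only a sketch: the numeric thresholds are illustrative assumptions standing in for "a few hundred" and "a few thousand", and the function name choose_design is hypothetical.

```python
# Illustrative cut-offs only; the article gives rough magnitudes, not exact numbers.
EMBED_LIMIT = 200             # above this, stop embedding the N side
REFERENCE_ARRAY_LIMIT = 2000  # above this, stop keeping an array of ObjectIDs

def choose_design(n_side_count, needs_standalone_access):
    """Pick one of the three basic one-to-N designs.

    n_side_count: expected number of N-side objects per parent.
    needs_standalone_access: True if N-side objects are queried on their own.
    """
    if n_side_count > REFERENCE_ARRAY_LIMIT:
        return "parent reference on the N side"
    if needs_standalone_access or n_side_count > EMBED_LIMIT:
        return "array of ObjectID references"
    return "embed in the parent document"

print(choose_design(3, False))        # a person's addresses
print(choose_design(500, True))       # a product's parts
print(choose_design(5_000_000, True)) # a host's log messages
```

Denormalization is a separate decision layered on top of this choice and depends on the read-to-write ratio of the specific field.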
