Does MongoDB favor putting all data in a single collection?

Last Update:2018-01-26 Source: Internet

Author: User

Tags array length


That's not true.

A single document in a collection has a size limit, currently 16 MB, so you cannot cram everything into one document. If the document structure is too complex, it also hurts query and update efficiency and creates maintenance difficulties and operational risk. Have you ever slipped and accidentally overwritten a document with null? I have. If all of a person's information had been stored in that one place, it would have been quite painful.

The general principles are:

Group data according to how it is queried
- Put data that is often read together in the same place.
- Put logically related information together.
- Put data that needs map-reduce/aggregation together, since these operations can only work on a single collection.
Split according to the amount of data
- If an array in a document will keep growing, move its contents into a dedicated collection in which each record references the current document's primary key (like a 1..N foreign key in MySQL).
- If a document is nested too deeply (more than 2 levels), you should probably consider splitting it; otherwise performance and maintainability will suffer.
Design as if there were a table structure
- MongoDB has no concept of a table structure, but in practice a collection rarely holds documents with many different structures. If a document's structure keeps growing, consider how to abstract it into a uniform structure and move the parts that vary into other collections, with the documents referencing each other in a foreign-key style.

For example, when designing a user system, the user collection should hold commonly used information such as the name, as well as things that relate only to the user, such as lastloginat. The user's permissions could perhaps go there too. But do not put the user's login log there, because that kind of information keeps growing.
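As a rough sketch of that split, the snippet below uses plain JavaScript arrays in place of collections so that it runs without a server. Apart from lastloginat, every name here (users, loginLogs, uid, permissions, at, ip, recordLogin) is an assumption made up for illustration.

```javascript
// Hypothetical shapes: bounded data lives in the user document,
// unbounded data lives in its own collection.
const users = [
  {
    _id: "u1",
    name: "alice",
    lastloginat: new Date("2014-06-25T08:00:00Z"),
    permissions: ["read", "write"], // small, bounded list: fine to embed
  },
];

// Login log entries keep growing, so each one is its own document that
// points back at the user through uid (like a 1..N foreign key).
const loginLogs = [
  { _id: "l1", uid: "u1", at: new Date("2014-06-24T08:00:00Z"), ip: "10.0.0.1" },
  { _id: "l2", uid: "u1", at: new Date("2014-06-25T08:00:00Z"), ip: "10.0.0.2" },
];

// Record a login: update one field on the user and append one log document.
function recordLogin(uid, at, ip) {
  const user = users.find((u) => u._id === uid);
  user.lastloginat = at;
  loginLogs.push({ _id: "l" + (loginLogs.length + 1), uid, at, ip });
}

recordLogin("u1", new Date("2014-06-26T08:00:00Z"), "10.0.0.3");
const logsForUser = loginLogs.filter((l) => l.uid === "u1");
console.log(Object.keys(users[0]).length, logsForUser.length);
```

The user document keeps the same number of fields no matter how many times the user logs in; only the log collection grows.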

Whether relationships between users belong in the user collection is open to discussion. If you only need to store the relationship itself, recording the friends' UIDs is enough, and the number of friends is not large (a few hundred at most), then I tend to keep it in the same collection. If the relationship data itself is more complex, or the number of friends runs into the thousands, I tend to split it out.

In addition, MongoDB's official documentation on data model design is well worth reading carefully.

Answered June 26, 2014 by Huandu (8.5k reputation)

Original address: http://pwhack.me/post/2014-06-25-1 (please credit the source when reprinting)

This article is excerpted from Chapter 8 of "MongoDB: The Definitive Guide", which thoroughly answers the following two questions:

http://segmentfault.com/q/1010000000364944
http://segmentfault.com/q/1010000000364944

There are many ways to represent data, and one of the most important questions is how far the data should be normalized. Normalization spreads data across multiple collections that can reference each other. Although many documents may reference a piece of data, that data is stored in only one collection. So to modify it, you only need to modify the one document that stores it. However, MongoDB does not provide a join facility, so a join-style query across collections takes multiple queries.

Denormalization is the opposite of normalization: the data each document needs is embedded in the document itself. Each document has its own copy of the data instead of all documents referencing one shared copy. This means that if the information changes, every related document has to be updated, but a single query is enough to retrieve all the data.

Deciding when to normalize and when to denormalize is difficult. Normalization speeds up writes, and denormalization speeds up reads. You need to weigh the two carefully against the needs of your application.

Examples of data representations

Suppose you want to store student and course information. One representation is to use a students collection (each student is a document) and a classes collection (each course is a document), and then use a third collection, studentClasses, to store the relationship between students and courses.

> db.studentClasses.findOne({"studentId": id})
{
    "_id": ObjectId("..."),
    "studentId": ObjectId("..."),
    "classes": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

If you are more familiar with relational databases, you may have built this kind of join table before, although each of your result documents would probably hold only one student and one course (rather than a list of course "_id"s). Putting the courses in an array is a bit more MongoDB-style, but you would not usually store the data this way, because it takes many queries to get to the real information.

Suppose you want to find the courses a student has chosen. You need to find the student in the students collection, then query studentClasses for the course "_id"s, and then query the classes collection for the information you want. Finding the course information takes three queries to the server. You probably do not want to organize data this way in MongoDB, unless the student and course information changes frequently and read speed is not a concern.
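To make the three round trips concrete, here is a minimal sketch in plain JavaScript, with arrays standing in for the three collections so it runs without a server. The ids, the helper names findOne and find, and the queryCount counter are assumptions for illustration; each helper call represents one query sent to the server.

```javascript
// In-memory stand-ins for the three collections; ids are made up.
const students = [{ _id: "s1", name: "John Doe" }];
const classes = [
  { _id: "c1", class: "Trigonometry" },
  { _id: "c2", class: "Physics" },
];
const studentClasses = [{ _id: "sc1", studentId: "s1", classes: ["c1", "c2"] }];

let queryCount = 0;
// Each helper call counts as one round trip to the server.
function findOne(collection, predicate) {
  queryCount += 1;
  return collection.find(predicate) ?? null;
}
function find(collection, predicate) {
  queryCount += 1;
  return collection.filter(predicate);
}

function coursesForStudent(name) {
  // Query 1: look up the student.
  const student = findOne(students, (s) => s.name === name);
  // Query 2: look up the link document for that student.
  const link = findOne(studentClasses, (l) => l.studentId === student._id);
  // Query 3: fetch the referenced courses.
  return find(classes, (c) => link.classes.includes(c._id));
}

const courseNames = coursesForStudent("John Doe").map((c) => c.class);
console.log(courseNames, queryCount);
```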

If you embed the course references in the student document, you save one query:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

The "classes" field is an array holding the "_id"s of the courses John Doe is taking. When you need information about these courses, you can query the classes collection with these "_id"s. This takes only two queries. If the data does not need to be accessed instantly and does not change constantly ("constantly" is a stronger requirement than "often"), this is a good way to organize it.
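A small sketch of the two-query version, again in plain JavaScript with arrays in place of collections. The ids, the find helper, and queryCount are illustrative assumptions; the second lookup is the in-memory analogue of a query with the "$in" operator on "_id".

```javascript
// The student document now carries the course "_id" list itself.
const students = [{ _id: "s1", name: "John Doe", classes: ["c1", "c2"] }];
const classes = [
  { _id: "c1", class: "Trigonometry" },
  { _id: "c2", class: "Physics" },
  { _id: "c3", class: "Women in Literature" },
];

let queryCount = 0;
// Each call counts as one round trip to the server.
function find(collection, predicate) {
  queryCount += 1;
  return collection.filter(predicate);
}

// Query 1: fetch the student, including the embedded references.
const [student] = find(students, (s) => s.name === "John Doe");
// Query 2: one lookup for all referenced courses, comparable to
// db.classes.find({"_id": {"$in": student.classes}}).
const taken = find(classes, (c) => student.classes.includes(c._id));

console.log(taken.map((c) => c.class), queryCount);
```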

If you need to optimize read speed further, you can denormalize the data completely and store the course information as embedded documents in the "classes" field of the student document, so that a single query returns the student's course information:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        { "class": "Trigonometry", "credits": 3, "room": "204" },
        { "class": "Physics", "credits": 3, "room": "159" },
        { "class": "Women in Literature", "credits": 3, "room": "14b" },
        { "class": "AP European History", "credits": 4, "room": "321" }
    ]
}

The advantage of this approach is that a single query returns the student's course information. The disadvantages are that it takes more storage space and that keeping the data in sync is harder. For example, if Physics turns out to be worth 4 credits (not 3), every student document that includes the Physics course has to be updated, not just the one "Physics" document.
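The synchronization cost can be sketched as follows. This is plain JavaScript over an in-memory array, not driver code; the student names and the setCredits helper are made up for illustration. The point is only that one logical change touches every document holding a copy.

```javascript
// Fully denormalized: every student carries its own copy of each course.
const students = [
  {
    _id: "s1",
    name: "John Doe",
    classes: [
      { class: "Physics", credits: 3 },
      { class: "Trigonometry", credits: 3 },
    ],
  },
  { _id: "s2", name: "Jane Roe", classes: [{ class: "Physics", credits: 3 }] },
  { _id: "s3", name: "Sam Poe", classes: [{ class: "Trigonometry", credits: 3 }] },
];

// Change the credits of one course everywhere it was copied.
// Returns how many student documents had to be rewritten.
function setCredits(className, credits) {
  let modified = 0;
  for (const student of students) {
    let touched = false;
    for (const c of student.classes) {
      if (c.class === className && c.credits !== credits) {
        c.credits = credits;
        touched = true;
      }
    }
    if (touched) modified += 1;
  }
  return modified;
}

const modified = setCredits("Physics", 4);
console.log(modified);
```

With normalized data the same change would rewrite a single course document, regardless of how many students take the course.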

Finally, you can mix embedding and referencing: create an array of subdocuments that holds the commonly used information, and follow the reference to the actual document when you need more detail:

{
    "_id": ObjectId("..."),
    "name": "John Doe",
    "classes": [
        { "_id": ObjectId("..."), "class": "Trigonometry" },
        { "_id": ObjectId("..."), "class": "Physics" },
        { "_id": ObjectId("..."), "class": "Women in Literature" },
        { "_id": ObjectId("..."), "class": "AP European History" }
    ]
}

This is also a good choice, because the embedded information can change as requirements change: if you want to show more (or less) information on a page, you can put more (or less) of it in the embedded document.

Another important consideration is whether the information is updated more often or read more often. If the data is updated regularly, normalization is a good choice. If the data changes infrequently, it is not worth sacrificing read speed to optimize update efficiency.

For example, a textbook example of normalization is to keep users and user addresses in separate collections. However, addresses rarely change, so you should not slow down every query for the rare case in which someone changes address. In this case, the address should be embedded in the user document.

If you decide to use embedded documents, you need to set up a scheduled task (cron job) to ensure that every update you make is applied to all documents. For example, suppose you fan an update out to multiple documents and the server crashes before all of them have been updated. You need to be able to detect this problem and re-run the unfinished update.

In general, the more frequently data is generated, the less it should be embedded in other documents. If the number of embedded fields, or the embedded fields themselves, grow without bound, the data should be stored in a separate collection and accessed by reference rather than embedded. Information such as comment lists or activity lists should be kept in separate collections and should not be embedded in other documents.

Finally, if certain fields are an integral part of the document's data, they should be embedded in the document. If you often need to exclude a field when querying a document, that field should go in a different collection instead of being embedded in the current document.
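One possible shape for such a repair job is sketched below in plain JavaScript. It assumes, beyond what the text states, that each embedded copy keeps the "_id" of its canonical document so that stale copies can be detected by comparison. The collection contents and the repairEmbeddedCopies name are hypothetical.

```javascript
// Canonical course data (the single source of truth in this sketch).
const classes = [{ _id: "c2", class: "Physics", credits: 4 }];

// Embedded copies; s2 was missed because the fan-out was interrupted.
const students = [
  { _id: "s1", classes: [{ _id: "c2", class: "Physics", credits: 4 }] },
  { _id: "s2", classes: [{ _id: "c2", class: "Physics", credits: 3 }] },
];

// One pass of the hypothetical repair job: overwrite any embedded copy
// that differs from its canonical document. Running it again changes
// nothing, so it is safe to schedule repeatedly.
function repairEmbeddedCopies() {
  const canonical = new Map(classes.map((c) => [c._id, c]));
  const repaired = [];
  for (const student of students) {
    for (const copy of student.classes) {
      const source = canonical.get(copy._id);
      if (source && copy.credits !== source.credits) {
        copy.credits = source.credits;
        repaired.push(student._id);
      }
    }
  }
  return repaired;
}

const firstRun = repairEmbeddedCopies();
const secondRun = repairEmbeddedCopies();
console.log(firstRun, secondRun);
```

Making the job idempotent means a crash in the middle of the repair itself is handled by simply running it again.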

Better for embedding	Better for referencing
Small subdocuments	Large subdocuments
Data that does not change regularly	Data that changes frequently
Eventual consistency is acceptable	Intermediate states must be consistent
Documents that grow by a small amount	Documents that grow by a large amount
Data that usually needs a second query to fetch	Data that is usually excluded from the results
Fast reads	Fast writes

Suppose we have a users collection. Here are some fields it might need, and whether each should be embedded in the user document.

Account preferences

Account preferences are relevant only to a specific user and are likely to be queried along with the other user information in the document. So account preferences should be embedded in the user document.

Recent activity

This depends on how often recent activity grows and changes. If it is a fixed-length field (such as the last 10 events), it should be embedded in the user document.

Friends

Friend information should usually not be embedded in the user document, or at least not in full. A later section discusses social networking applications.

All user-generated content

It should not be embedded in the user document.
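A fixed-length embedded field can be sketched like this in plain JavaScript. The field name recentActivity, the event shape, and the addActivity helper are assumptions for illustration. On a real server, an update using "$push" with the "$each" and "$slice" modifiers is one way to get a similar capped array in a single operation.

```javascript
const MAX_RECENT = 10;
const user = { _id: "u1", username: "batman", recentActivity: [] };

// Append an event and keep only the newest MAX_RECENT entries, so the
// embedded array (and the user document) cannot grow without bound.
function addActivity(userDoc, event) {
  userDoc.recentActivity.push(event);
  if (userDoc.recentActivity.length > MAX_RECENT) {
    userDoc.recentActivity = userDoc.recentActivity.slice(-MAX_RECENT);
  }
}

for (let i = 1; i <= 25; i += 1) {
  addActivity(user, { seq: i, type: "login" });
}
console.log(user.recentActivity.length, user.recentActivity[0].seq);
```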

Cardinality

Cardinality is the number of references a collection holds to another collection. Common relationships are one-to-one, one-to-many, and many-to-many. Take a blog application as an example. Each post has one title, which is a one-to-one relationship. Each author can have more than one post, which is a one-to-many relationship. Each post can have multiple tags, and each tag can be used on multiple posts, so this is a many-to-many relationship.

In MongoDB, "many" can be split into two subcategories: "many" and "few". For example, authors and posts may be a one-to-few relationship: each author publishes only a few posts. Posts and tags may be a many-to-few relationship: there are probably more posts than tags. Posts and comments are a one-to-many relationship: each post can have many comments.

Once you have determined which relationships are "few" and which are "many", it is easier to trade off embedding against referencing. In general, "few" relationships work better with embedding, and "many" relationships work better with references.

Friends, followers, and other inconveniences

Keep your friends close and your enemies at a distance.

Many social applications need to link people, content, followers, friends, and so on. It is not easy to decide whether such highly connected data should be embedded or referenced. This section covers the considerations for social graph data. Typically, following, friending, or favoriting can be simplified to a publish-subscribe system: one user subscribes to notifications about another user. There are then two basic operations that need to be efficient: how to store the subscribers, and how to notify all subscribers of an event.

There are three common ways to implement subscriptions. The first is to embed the producer in the subscriber's document:

{
    "_id": ObjectId("..."),
    "username": "batman",
    "email": "[email protected]",
    "following": [
        ObjectId("..."),
        ObjectId("...")
    ]
}

Now, for a given user document, you can issue a query like db.activities.find({"user": {"$in": user["following"]}}) to find all the activities the user is interested in. However, to find everyone interested in a newly published activity, you would have to query the "following" field across all users.

The other way is to embed the subscribers in the producer's document:

{
    "_id": ObjectId("..."),
    "username": "joker",
    "email": "[email protected]",
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

When the producer publishes something new, we know immediately which users need to be notified. The downside is that to find the list of users a given user follows, you must query the entire users collection. The pros and cons of this approach are the reverse of those of the first approach.
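The mirror-image trade-off between the two embeddings can be sketched in plain JavaScript. The usernames beyond batman and joker, the use of usernames as "_id" values, and all four helper names are assumptions for illustration. Calls to filter stand for a scan over all users; calls to find stand for reading a single document.

```javascript
// Approach 1: each subscriber lists whom they follow.
const usersFollowing = [
  { _id: "batman", following: ["joker"] },
  { _id: "robin", following: ["joker", "batman"] },
  { _id: "joker", following: [] },
];
// Approach 2: each producer lists its followers (same relationships).
const usersFollowers = [
  { _id: "batman", followers: ["robin"] },
  { _id: "robin", followers: [] },
  { _id: "joker", followers: ["batman", "robin"] },
];

// "Who must be notified when `producer` publishes?"
// Approach 1 has to test the "following" array of every user.
function subscribersViaFollowing(producer) {
  return usersFollowing
    .filter((u) => u.following.includes(producer))
    .map((u) => u._id);
}
// Approach 2 reads one document.
function subscribersViaFollowers(producer) {
  return usersFollowers.find((u) => u._id === producer).followers;
}

// "Whom does `subscriber` follow?" is the mirror image.
function followedViaFollowing(subscriber) {
  return usersFollowing.find((u) => u._id === subscriber).following;
}
function followedViaFollowers(subscriber) {
  return usersFollowers
    .filter((u) => u.followers.includes(subscriber))
    .map((u) => u._id);
}

console.log(subscribersViaFollowing("joker"), subscribersViaFollowers("joker"));
console.log(followedViaFollowing("robin"), followedViaFollowers("robin"));
```

Both layouts answer both questions with the same result set; they differ only in which question is a single-document read and which is a scan.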

Both approaches also share another problem: they make the user document larger and more volatile. The "following" and "followers" fields often do not even need to be returned: how often do you query the follower list? If users frequently follow or unfollow people, this can also cause a lot of fragmentation. So the final option normalizes the data further and keeps the subscription information in a separate collection to avoid these drawbacks. Normalizing to this degree may be a bit excessive, but it is useful for fields that change frequently and do not need to be returned with the other fields of the document. Normalizing the "followers" field in this way makes sense.

One collection stores the relationship between publishers and subscribers, and its documents may look like this:

{
    "_id": ObjectId("..."),   // the "_id" of the user being followed
    "followers": [
        ObjectId("..."),
        ObjectId("..."),
        ObjectId("...")
    ]
}

This keeps the user documents lean, but it takes an extra query to get the follower list. Because the size of the "followers" array changes often, you can enable "usePowerOf2Sizes" on this collection to ensure that the users collection stays as small as possible. If you store the followers collection in another database, you can also compact it without much impact on the users collection.
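A minimal sketch of the separate-collection layout, in plain JavaScript with arrays in place of collections. The follow and unfollow helpers, the extra username alfred, and the use of usernames as "_id" values are assumptions for illustration.

```javascript
// User documents stay small and stable.
const users = [{ _id: "joker", username: "joker" }];
// One document per followed user, keyed by that user's "_id".
const followers = [{ _id: "joker", followers: ["batman", "robin"] }];

// Add a follower, creating the relationship document on first use.
function follow(followedId, followerId) {
  let doc = followers.find((f) => f._id === followedId);
  if (!doc) {
    doc = { _id: followedId, followers: [] };
    followers.push(doc);
  }
  if (!doc.followers.includes(followerId)) doc.followers.push(followerId);
}

// Remove a follower; a missing relationship document is a no-op.
function unfollow(followedId, followerId) {
  const doc = followers.find((f) => f._id === followedId);
  if (doc) doc.followers = doc.followers.filter((id) => id !== followerId);
}

follow("joker", "alfred");
unfollow("joker", "batman");

// The extra query mentioned above: the follower list lives elsewhere.
const jokerFollowers = followers.find((f) => f._id === "joker").followers;
console.log(Object.keys(users[0]), jokerFollowers);
```

All the churn from following and unfollowing lands on the relationship document, and the user document is never rewritten.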

Dealing with the Wil Wheaton effect

Whatever strategy you use, embedding only works when the number of subdocuments or references is not especially large. For very famous users, the document that stores the follower list may overflow. One solution is to use "continuation" documents when necessary. For example:

> db.users.find({"username": "wil"})
{
    "_id": ObjectId("..."),
    "username": "wil",
    "email": "[email protected]",
    "tbc": [
        ObjectId("123"), // just for example
        ObjectId("456")  // same as above
    ],
    "followers": [ ... ]
}
{
    "_id": ObjectId("123"),
    "followers": [ ... ]
}
{
    "_id": ObjectId("456"),
    "followers": [ ... ]
}

In this case, your application needs logic to fetch the data from the documents listed in the "tbc" (to be continued) array.

A final word

There is no silver bullet.
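That application-side logic might look roughly like the sketch below. It is plain JavaScript with a Map standing in for lookups by "_id"; the follower ids f1 to f5 and the allFollowers helper are made up for illustration, and the ids "123" and "456" follow the example above.

```javascript
// Documents keyed by "_id"; the main document lists its continuation
// documents in "tbc".
const usersById = new Map([
  ["wil", { _id: "wil", username: "wil", tbc: ["123", "456"], followers: ["f1", "f2"] }],
  ["123", { _id: "123", followers: ["f3", "f4"] }],
  ["456", { _id: "456", followers: ["f5"] }],
]);

// Collect followers from the main document, then from each continuation
// document listed in "tbc", in order. Missing continuations are skipped.
function allFollowers(userId) {
  const main = usersById.get(userId);
  if (!main) return [];
  const result = [...main.followers];
  for (const contId of main.tbc ?? []) {
    const cont = usersById.get(contId);
    if (cont) result.push(...cont.followers);
  }
  return result;
}

const followersOfWil = allFollowers("wil");
console.log(followersOfWil);
```

Each continuation document costs one more lookup, so this only pays off for the few users whose follower list would otherwise not fit in one document.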

Answered June 26, 2014 by Portwatcher (1.9k reputation)

If the business constantly needs to query the relationships between users, it is better to keep the relationships in a separate collection.

https://segmentfault.com/q/1010000000589390


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
