MongoDB Paging Scheme (paging method for general databases) (2012-11-06 17:59:55)
Tags: mongodb, paging, database, page jump
Category: MongoDB
If you reprint, please cite the source: http://blog.sina.com.cn/s/blog_56545fd30101442b.html
MongoDB's paging performance is one of the most widely criticized problems. With a large data set, jumping to a deep page (say, past page 100,000) may keep the user waiting for tens of seconds (an exaggerated figure). If you are interested, go read the article Paging & Ranking with Large Offsets: MongoDB vs Redis vs PostgreSQL.
Have you been disappointed by MongoDB's performance, and are you now full of reverence for Redis?
In fact, this comparison is completely unfair.
First, take a look at Redis. Anyone who knows even a little about NoSQL knows that it is a completely in-memory storage system. It is not exactly a database, because it lacks the most basic feature of a database: persistence. Like memcached, Redis is a storage system that keeps all of its data in memory for caching, so its performance is memory performance.
MongoDB, like every other database, writes its data to disk. Although MongoDB also keeps some hot data in memory, when facing tens of millions or even billions of records it is impossible to hold all of the hot data in memory, so if a query matches too much data, most of the performance bottleneck will likely be page swapping, that is, reading data from disk.
Finally, go measure the paging performance of relational databases yourself: it is hard to find one whose paging performance is better than MongoDB's.
Of course, this is not meant to excuse MongoDB. I think 10gen should find some way to optimize the performance degradation after a large offset, although I have not thought of one myself.
Analysis of database performance bottleneck
The reason is that the user's query requires the database to filter or sort by certain criteria, which is very time-consuming. For example:
db.user.find({age: {$gt: 20, $lt: 30}}).sort({registdate: 1, name: -1})
If you run this query against Google's user data (said to be in the billions), and assume the data is stored in MongoDB (in reality it is BigTable), it will match hundreds of millions of records. For this, MongoDB needs a compound index, something like {age: 1, registdate: 1, name: -1}. It first filters the data by the age condition {age: {$gt: 20, $lt: 30}}, then sorts the result by the two criteria {registdate: 1, name: -1}. This sort order is likely to differ from the original index order, so MongoDB has to spend time on the sort operation. More importantly, because memory is limited, the sorted result cannot be kept in memory, or even recorded, so the sort has to be redone on every query (this is normal). If you are only asking for the first 100 records this is fine: MongoDB may only need to sort a few hundred records before it can return the first 100 in order.
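For reference, a minimal sketch (not from the original post) of building such a compound index and running the query in the 2012-era mongo shell; ensureIndex has since been renamed createIndex in newer MongoDB versions:
db.user.ensureIndex({age: 1, registdate: 1, name: -1})
db.user.find({age: {$gt: 20, $lt: 30}}).sort({registdate: 1, name: -1}).limit(100)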
Let's shrink the data set and look at an example:
MongoDB consists of two shards, and the user collection has the index {age: 1, registdate: 1, name: -1}. Suppose we query the first three records, that is:
db.user.find({age: {$gt: 20, $lt: 30}}).sort({registdate: 1, name: -1}).limit(3)
Then, ideally (I am not sure about the details of MongoDB's query mechanism), MongoDB only needs three steps through the index to return the result (figure omitted; its sequence numbers marked the order in which the cursor jumped).
But if the user jumps pages, for example:
db.user.find({age: {$gt: 20, $lt: 30}}).sort({registdate: 1, name: -1}).skip(3).limit(2)
then MongoDB has to walk through the first 5 sorted records in order to return records 4 and 5 to the user (figure omitted; again, the sequence numbers marked the order in which the cursor jumped).
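One way to see this effect for yourself (my suggestion, not from the original): in the 2.x-era mongo shell, append explain() to the query and compare the nscanned value with and without skip(); the number of index entries scanned grows with the offset.
db.user.find({age: {$gt: 20, $lt: 30}}).sort({registdate: 1, name: -1}).skip(3).limit(2).explain()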
A Google example
First, let's look at how big brother Google does it.
The search keyword: NBA
The results page shows the number of results found, along with the pagination navigation bar at the bottom (screenshots omitted).
Clicking through the pages manually is very fast, but that is only paging within 1,000 results, which has no reference value, so I kept turning pages backwards until:
Well, this left me suddenly speechless ...
It only goes up to about page 70; everything after that is cut off. If you do not believe it, try it yourself: you will not get past page 80. Also note the reported number of search results:
About 675,000,000,000 results ...
Google uses the word "about"; the database certainly does not return this exact value.
In conclusion, Google's approach to paging is to use some method to obtain an approximate count of the matching results, roughly like this: match only the most important portion of the data first, then estimate the total number of matches from the proportion that portion represents. When rendering the results, only about 700 records are actually made reachable, because users will never flip through more pages than that.
Analysis of the need for paging
Before solving the paging problem, a few questions must be answered first:
Question 1: Is a paging feature necessary?
Yes, it is generally necessary.
Question 2: Is it necessary to be able to jump dozens of pages at once?
Well, there may be situations where that is needed.
Question 3: Is it really necessary to jump hundreds of pages at once?
For this one, it is really hard to come up with a corresponding requirement.
Requirement: Someone will raise a requirement like this: my company has tens of millions of item records, and I want to find an item that happens to be on page 10,000, so I need to be able to jump 10,000 pages at once.
Workaround: Is this really a requirement for the paging feature? How does the customer know the item is on page 10,000 of the system without paging there page by page? For the item to land on page 10,000, some attribute of the item must put it there, for example its ID. Suppose IDs are auto-incremented, the item's ID is 100002, the system displays 10 items per page, and the list is sorted in ascending order; then the item will appear around page 10,000. We may not know the ID exactly, but we usually have a rough idea, say ID 10000X, and then it can be found quickly with a condition like ID > 100000 && ID < 100010.
In conclusion, the way to handle this requirement is to tell the customer: I have a better way to find this item for you, namely a query instead of paging.
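As an illustration (the collection name item and field name id are my assumptions, not from the original), the query that replaces the 10,000-page jump could simply be:
db.item.find({id: {$gt: 100000, $lt: 100010}}).sort({id: 1})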
In many cases the requirement itself can be changed. If you simply do whatever the customer asks, you are not a product manager, just a mouthpiece. Does Google dare to give users only about 700 results because it does not have enough data? Of course not; it is a compromise between performance and user experience. In general, the first 10 pages of search results will solve the problem you care about, and even if they do not, Google believes that if you have not found what you want within 70 pages, you should revise the keywords and search again.
Solutions
Now let's get practical and stop the lip service.
Scenario 1: A Google-style compromise between performance and user experience
Note: Google's actual process is certainly not like this; it is more likely done with some kind of distributed computation. I only use the way the results are presented to the user to name this scenario.
First, make sure the user's query and sort criteria are indexed, and then:
db.test.find({"context": "NBA"}).sort({"date": -1}).limit(1000)
That is, query only the first 1000 records.
1) If fewer than 1000 results are returned, render them directly to the user.
2) Otherwise, estimate the total from the sort-field value of the last record returned. For this example, suppose the date of the last of the 1000 returned records is lastdate = "2012-11-05 08:00:00", the earliest data stored in your database dates back to origindate = "2010-01-01 08:00:00", and the current time (the time of the most recent record in the database) is nowdate = "2012-11-06 18:00:00". Then:
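Assuming (as the next paragraph notes) that the data is spread roughly evenly over time, the estimate presumably takes the form:
estimated total ≈ 1000 * (nowdate - origindate) / (nowdate - lastdate)
With the dates above, nowdate - lastdate is about 1.4 days and nowdate - origindate is about 1040 days, so the estimate works out to roughly 1000 * 1040 / 1.4, on the order of 730,000 matching records.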
This is only a rough formula; it assumes that the data is distributed roughly uniformly over time. In a real application it may not be. For example, if the data is sorted by user age and the amount of data is larger in the middle age ranges than at the two ends, the formula needs adjusting: give the data volume of each age segment its own weight.
Here is an example of such a distribution:
Age segment | User ratio
0~10 | 0%
10~20 | 6.4%
20~30 | 29.1%
30~40 | 46.9%
40~50 | 14.7%
50~60 | 2.9%
60+ | 0%
Suppose your site's users are distributed like this, and you run a query sorted by user age:
db.user.find({"name": /stand/, "age": {"$gt": 18}}).sort({age: 1}).limit(1000)
The age of the last record returned is 21, so you can roughly estimate the number of matches:
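A sketch of the weighted version (the lower bound of age 18 comes from the reconstructed query above and is an assumption): instead of assuming a uniform spread, weight by the ratios in the table, roughly
estimated total ≈ 1000 * (share of users with age > 18) / (share of users aged 18 to 21)
Reading the shares off the table gives a factor in the low twenties, which is in line with the "about 24 times" figure below.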
Doing such a simple calculation in memory is much faster than traversing roughly 24 times as much data.
Scenario 2: Limit how many pages can be turned at once
The scheme above is a bit of a trick; this one, by contrast, implements genuine paging over all of the data. The essence of the method is:
do not allow the user to jump more than, say, 20 pages in one action.
The 20-page limit is not fixed and can be adjusted according to performance. The reason this restriction works is that, as long as we know which data the user is currently looking at, the cost of paging a bounded number of pages forwards or backwards is controllable. Here is an example:
db.test.find().sort({date: -1}).limit(200)
With an index on date, this query is quite fast, because the data is already ordered and the cursor only needs to traverse the first 200 entries to return the result. Now look at the following query:
db.test.find({"date": {"$lt": 1352194000}}).sort({date: -1}).limit(200)
Note: date is stored as a timestamp in seconds.
This query is still very fast, because the cursor first locates the position of 1352194000 in the date index and then reads 200 records sequentially; its performance is similar to the previous query's.
Here's how to actually do it:
First, remove the ability to manually jump to an arbitrary page. When the user is currently on page 15, the navigation bar displays links for at most 20 pages (in fact, 10 pages is enough; that is what Baidu does).
Then you must be able to get hold of the sort-field value of the last record on the current page. For example, if the query is sorted by ID, you need to know the ID of the last record on this page; call it queryId.
Assuming each page shows 20 records and the user, currently on page 15, turns to page 24, then execute:
db.test.find({id: {$gt: queryId}}).sort({id: 1}).limit(180)
and take the last 20 records of the result: pages 16 through 24 are 9 pages, i.e. 9 * 20 = 180 records after queryId, and the final 20 of them make up page 24.
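A minimal sketch of the same idea as a mongo shell (JavaScript) helper; the function name and parameters are mine, not from the original post:
// jump from the page whose last id is lastId (currentPage) to targetPage
function fetchPage(lastId, currentPage, targetPage, pageSize) {
    var span = (targetPage - currentPage) * pageSize;        // e.g. (24 - 15) * 20 = 180
    var rows = db.test.find({id: {$gt: lastId}})
                      .sort({id: 1})
                      .limit(span)
                      .toArray();
    return rows.slice(Math.max(rows.length - pageSize, 0));  // keep only the last pageSize records
}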
In addition, adding buttons like "forward 20 pages" and "back 20 pages" to the navigation bar is perfectly feasible, and "first page" and "last page" can be implemented as well.
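For instance (my own illustration, with the same id sort assumed), "first page" and "last page" need no skip() at all: read the first page in ascending order, and read the last page with the sort reversed and then flip it back in memory:
db.test.find().sort({id: 1}).limit(20)
db.test.find().sort({id: -1}).limit(20).toArray().reverse()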
Scenario 3: Trading space for time
Simply put, cache the information needed by queries. There are two different caching approaches:
1. Cache all of the queried data. Whether this is feasible depends on how much data the queries cover and how much memory your server has.
2. Cache only the key index information, mainly the fields the user sorts by.
The second approach is illustrated below with two examples.
Example one:
Take a table with a unique index on ID as an example, where user queries are sorted by ID and no other sort order is offered. You can cache one entry for every 10 IDs in a suitable data structure; 10 is chosen because the most common page sizes are 10, 20, or 50 items per page. If you use a key-value caching solution such as memcached or Redis, you can store it like this:
key | value
4x0001 | 1
4x0011 | 2
4x0021 | 3
4x0031 | 4
... | ...
The key is the primary-key ID from the data table being queried, and the value is the page number (ordinal) used for paging. Stored in a scored data structure such as Redis's zset, looking up any page is constant-time.
Time: cache response time (constant) + database response time (constant).
Space: assuming 64-bit IDs and a data volume within 20 billion records, you need to cache roughly N * (64 + 32) / 10 bits, which is about N * 1.2 bytes; 100 million records take up about 120 MB.
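As an illustration only (the original gives no commands, and the key name page_index is mine), this mapping could be kept in a Redis sorted set, with the first ID of each group of 10 as the member and the page number as the score:
ZADD page_index 1 4x0001
ZADD page_index 2 4x0011
ZADD page_index 3 4x0021
To open page 3, look up its starting ID with ZRANGEBYSCORE page_index 3 3 (which returns 4x0021) and then, treating IDs as ordered strings, read the 10 records from that ID onwards in the mongo shell:
db.user.find({id: {$gte: "4x0021"}}).sort({id: 1}).limit(10)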
Example two:
The sortable fields are: price, quantity (num), and time (date). In addition, when sort-field values are equal, records are further sorted by ID.
Each sort order needs its own cached data structure. Taking price as an example, the key is [price]:[id] and the value is still the page number used for paging:
key | value
1:4x0219 | 1
1:4x9555 | 2
3:4x1500 | 3
6:4x3038 | 4
... | ...
The quantity and date fields are stored in the same way.
Time: cache response time (constant) + database response time (constant).
Space: the calculation is the same overall as in example one, but because of the extra key prefix each data structure is 1/3 to 2/3 larger, and the more sortable fields there are, the more data structures you need to store. In the best case, 100 million records with 3 sortable fields need about 480 MB.
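To round this off with an illustration of my own (the collection name item and the treatment of IDs as ordered strings are assumptions): once the cache says page 3 starts at the entry 3:4x1500, that is price 3 and ID 4x1500, the page itself can be fetched with a compound boundary condition that matches the (price, id) sort order:
db.item.find({$or: [{price: 3, id: {$gte: "4x1500"}}, {price: {$gt: 3}}]}).sort({price: 1, id: 1}).limit(10)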