Reposted from: http://quentinxxz.iteye.com/blog/2149440
1. Under normal circumstances, this requirement should not exist
First of all, we should be clear that the problem in the title is, in most cases, a false requirement and should not come up at all. For a database holding a very large volume of data, a full-table query is an operation that should normally never appear; an ordinary query should at least restrict its range or add a limit.
That said, my application scenario is building the full index for a search engine. This is the atypical case in which a full table scan is genuinely required, and we have no sorting requirement on the results.
2. Description of the situation
With such a large amount of data, the storage footprint is roughly 1 TB, so this is necessarily a distributed deployment: a sharded MongoDB cluster.
The company's servers are fairly powerful, about 24 cores and 96 GB of memory each, so readers using different machines may measure times that do not match my results.
3. First method: partitioning the scan by chunk information
Principle: in a sharded environment MongoDB organizes data in chunks, and all of the data in one chunk is guaranteed to reside on the same shard. Suppose we use _id as the shard key. To keep range lookups efficient, MongoDB places documents whose _id values fall within a given range into the same chunk, and through the facilities MongoDB itself provides we can conveniently obtain each chunk's minKey and maxKey. All data in the range [minKey, maxKey) is guaranteed to be in the same chunk, and therefore on the same shard.
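As an illustration, here is a minimal sketch (not the author's original code) of reading those chunk boundaries through mongos with PyMongo. It assumes a pre-5.0 config schema in which config.chunks documents carry an "ns" field; the mongos address and the namespace mydb.docs are placeholders.

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")   # hypothetical mongos address
config = mongos["config"]

def get_chunk_ranges(ns="mydb.docs"):
    """Yield (min_id, max_id, shard) for every chunk of the collection."""
    for chunk in config["chunks"].find({"ns": ns}):
        yield chunk["min"]["_id"], chunk["max"]["_id"], chunk["shard"]

for lo, hi, shard in get_chunk_ranges():
    print(shard, lo, hi)
```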
Practice:
1. First obtain the information for all chunks, i.e. each chunk's minKey and maxKey.
2. Distribute these chunk ranges across multiple worker threads.
3. Each thread performs a range query bounded by the minKey and maxKey of the chunk it was assigned.
4. Finally, merge the results.
A sketch of steps 1-3 is given below.
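Continuing the sketch above (reusing get_chunk_ranges and the same placeholder names), one possible way to fan the chunk ranges out to a thread pool looks like this; process_doc stands in for whatever per-document indexing work is done:

```python
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")   # hypothetical mongos address
coll = mongos["mydb"]["docs"]

def process_doc(doc):
    pass   # placeholder for the per-document indexing work

def scan_chunk(bounds):
    lo, hi, _shard = bounds
    n = 0
    # A range query bounded by one chunk's min/max is served by a single shard.
    for doc in coll.find({"_id": {"$gte": lo, "$lt": hi}}):
        process_doc(doc)
        n += 1
    return n

# get_chunk_ranges() is the helper from the previous sketch.
with ThreadPoolExecutor(max_workers=16) as pool:
    total = sum(pool.map(scan_chunk, get_chunk_ranges()))
print("scanned", total, "documents")
```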
The benefits of this approach:
1. Each range lookup covers a single chunk, so it runs against only one shard (that is, the same disk) rather than being scattered across several shards, which is efficient.
2. It is easy to parallelize with multiple threads, improving throughput.
I did not try this method myself; a predecessor at the company did, and the reported total time was 3 hours or less, which is the most desirable result.
4. Full-table scan after switching to a hashed shard key
The method above has a precondition: the sharding policy must be MongoDB's default range-based shard key (ascending _id), which guarantees that ascending _id values are laid out in chunks ordered by range. That strategy has an obvious drawback: every newly written document necessarily lands in the chunk holding the current maximum _id, so writes are distributed very unevenly.
As argued at the beginning, a full table scan should not be a requirement under normal circumstances, so the sharding policy should not be chosen with full-table-scan efficiency as the priority; in our situation write efficiency comes first. The shard key policy the company actually uses is the hashed shard key, which spreads writes nicely across the shards but cannot take advantage of range lookups, so the chunk-based scan described above can no longer be used.
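For reference, enabling such a hashed shard key looks roughly like the following; this is a minimal sketch run through mongos with PyMongo (database and collection names are placeholders), describing the write-friendly cluster configuration rather than the scan itself:

```python
from pymongo import MongoClient

admin = MongoClient("mongodb://mongos-host:27017")["admin"]   # hypothetical mongos
admin.command("enableSharding", "mydb")
admin.command("shardCollection", "mydb.docs", key={"_id": "hashed"})
```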
Procedure:
1. Obtain the global maximum _id (maxId) and the global minimum _id (minId).
2. Pick a step size, for example 5000, and slice [minId, maxId] into blocks of 5000 ids each (a logical "chunk" of our own definition); the first block covers [minId, minId + 5000), and so on.
3. Each thread performs a range query bounded by the minId and maxId of the logical chunk it is assigned.
4. Finally, merge the results.
A sketch of this id-slicing scan follows.
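A minimal sketch of the id-slicing scan, assuming numeric _id values as in the author's data; coll points at the collection through mongos, and process_doc is the same placeholder handler as in the earlier sketch:

```python
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

coll = MongoClient("mongodb://mongos-host:27017")["mydb"]["docs"]   # hypothetical
STEP = 5000

# The default _id index makes the global min and max cheap to read.
min_id = coll.find_one(sort=[("_id", 1)])["_id"]
max_id = coll.find_one(sort=[("_id", -1)])["_id"]

def scan_slice(lo):
    n = 0
    # With a hashed shard key this range query scatters to all shards.
    for doc in coll.find({"_id": {"$gte": lo, "$lt": lo + STEP}}):
        process_doc(doc)   # same placeholder handler as above
        n += 1
    return n

# One task per STEP-wide slice; for a sparse _id range this produces a huge
# number of mostly empty slices, which is exactly problem 2 described below.
with ThreadPoolExecutor(max_workers=16) as pool:
    total = sum(pool.map(scan_slice, range(min_id, max_id + 1, STEP)))
print("scanned", total, "documents")
```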
Problems with this procedure:
1. Each logical-chunk query hits multiple shards, so query efficiency drops sharply.
2. If the _id values are sparsely distributed, most of the work is wasted: with a step size of 5000, many of the ids in a block may simply not exist. A colleague pulled production data for me to test with: 20 million documents whose _id values, because of a calculation error, were spread from about 1 billion up to 13 billion. That is roughly (13 billion - 1 billion) / 5000, about 2.4 million logical chunks for only 20 million documents, so the threads spent almost all of their time running through empty chunks. When _id is sparsely distributed, this query method is completely unusable.
3. Even where _id values do exist, they are unevenly distributed: one logical chunk may be almost full with close to 5000 documents while another contains only a handful of valid ids, so the computing resources are used unevenly.
The final result was not ideal. My first runs took more than 16 hours, partly because of some join-like operations against other tables; after various adjustments, adding threads and removing some of those operations, it still needed almost 10 hours.
5. A more efficient full-table scan with a hashed shard key
The method above is clearly not good enough. Pulling the data alone would take 16 hours, and with the subsequent processing the total would be even longer, so our search-engine index could never be built within the day. For a while I had no better approach.
In the end my team lead solved the problem. Let me state the final result first: about 200,000 documents per second, finishing in roughly 3 hours.
The final approach is actually simple, but you have to give up multithreading (multiple cursors): at most one thread per shard.
Practice: use a single-threaded scan per shard, and do not add any query condition that might affect the ordering.
The point is to rely on MongoDB's natural order: the scan then reads each shard sequentially, which avoids the disk thrashing caused by random seeks.
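A minimal sketch of this final approach: at most one thread per shard, each reading its shard sequentially in natural order with no filter and no sort. The shard addresses are placeholders, and connecting directly to each shard rather than going through mongos is my own assumption about how "one thread per shard" is realized; the original post does not spell this out.

```python
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

SHARD_URIS = [                          # hypothetical shard addresses
    "mongodb://shard0-host:27018",
    "mongodb://shard1-host:27018",
    "mongodb://shard2-host:27018",
]

def process_doc(doc):
    pass                                # placeholder for per-document indexing work

def scan_shard(uri):
    coll = MongoClient(uri)["mydb"]["docs"]
    n = 0
    # No filter and no sort: the cursor returns documents in natural (on-disk)
    # order, so each shard is read sequentially instead of with random seeks.
    for doc in coll.find():
        process_doc(doc)
        n += 1
    return n

# One worker per shard, never more.
with ThreadPoolExecutor(max_workers=len(SHARD_URIS)) as pool:
    total = sum(pool.map(scan_shard, SHARD_URIS))
print("scanned", total, "documents")
```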