I ran a performance test of MongoDB at the hundred-million-record scale and above, covering the following items:
(All inserts were single-threaded; all reads were multi-threaded.)
1) Plain insert performance (each inserted document is about 1KB).
2) Bulk insert performance (using InsertBatch in the official C# client), to measure how much batching improves insert throughput.
3) Safe insert performance (using SafeMode.True to make sure each insert succeeded), to measure how much the safety guarantee costs; the three insert modes are sketched in the code after this list.
4) Querying on one indexed numeric column and returning 10 records (about 10KB), to measure indexed query performance.
5) Querying on two indexed numeric columns and returning 10 records (each record returning only two small fields, about 20 bytes), to measure the effect of the returned data volume and of multiple query conditions on performance.
6) Querying on one indexed numeric column, sorting by another indexed date field (the index is built descending and the sort is descending), skipping 100 records and returning 10, to measure the effect of skip and ordering on performance.
7) Querying 100 records (about 100KB) with no sort and no conditions, to measure the effect of a large result set on performance.
8) Statistics collected as the test progressed: total disk usage, index disk usage, and data disk usage.
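For reference, here is a minimal sketch of the three insert modes from tests 1-3 and of the indexes the later stats report (number, number1, date descending), assuming the legacy 1.x official C# driver API (MongoServer.Create, SafeMode). The connection string, field names, and document layout are illustrative assumptions, not the actual test code.

using System;
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class InsertSketch
{
    // Build a ~1KB document; the field names are assumptions chosen to match the indexes below.
    static BsonDocument MakeDoc(int n)
    {
        return new BsonDocument
        {
            { "number", n },
            { "number1", n },
            { "date", DateTime.UtcNow },
            { "payload", new string('x', 980) } // padding to bring the document to roughly 1KB
        };
    }

    static void Main()
    {
        // Hypothetical address of the storage server.
        var server = MongoServer.Create("mongodb://192.168.0.1:27017");
        var coll = server.GetDatabase("testdb").GetCollection<BsonDocument>("test");

        // Indexes matching the ones reported in the stats later in the article:
        // number ascending, number1 ascending, date descending.
        coll.EnsureIndex(IndexKeys.Ascending("number"));
        coll.EnsureIndex(IndexKeys.Ascending("number1"));
        coll.EnsureIndex(IndexKeys.Descending("date"));

        // Test 1: plain fire-and-forget insert.
        coll.Insert(MakeDoc(1));

        // Test 2: bulk insert of a batch of documents via InsertBatch.
        var batch = new List<BsonDocument>();
        for (int i = 0; i < 100; i++)
        {
            batch.Add(MakeDoc(i));
        }
        coll.InsertBatch(batch);

        // Test 3: safe insert -- wait for acknowledgement before returning.
        coll.Insert(MakeDoc(2), SafeMode.True);
    }
}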
Each test was run in two configurations: a single MongoDB process, and three MongoDB processes on the same server acting as shards (so each process could only use roughly 7GB of memory).
With sharding, even though the three processes sit on one machine, each shard queries its portion of the data in parallel and a mongos running on the other machine merges the results, so in theory performance can be somewhat higher in certain cases (the setup is sketched after the hardware note below).
Based on these assumptions, performance should drop in some cases and improve in others. So what did the final results look like?
Note: the storage server is an E5620 @ 2.40GHz with 24GB of memory running CentOS; the load-generating machine is an E5504 @ 2.0GHz with 4GB of memory running Windows Server 2003; the two are directly connected over gigabit NICs.
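The article does not show how the three shards were wired up; below is a hedged sketch of a setup that matches the description (three mongod processes on the storage server, one mongos, _id as the shard key, as noted later), again assuming the legacy C# driver and the 1.8-era addshard/enablesharding/shardcollection admin commands. All addresses and ports are assumptions, and starting the config server, mongod, and mongos processes is omitted.

using MongoDB.Bson;
using MongoDB.Driver;

class ShardingSetupSketch
{
    static void Main()
    {
        // Connect to the mongos on the load machine (address is an assumption).
        var mongos = MongoServer.Create("mongodb://192.168.0.2:27017");
        var admin = mongos.GetDatabase("admin");

        // Register the three mongod processes on the storage server as shards
        // (ports are assumptions).
        admin.RunCommand(new CommandDocument { { "addshard", "192.168.0.1:27017" } });
        admin.RunCommand(new CommandDocument { { "addshard", "192.168.0.1:27018" } });
        admin.RunCommand(new CommandDocument { { "addshard", "192.168.0.1:27019" } });

        // Enable sharding for the test database and shard the collection on _id,
        // the shard key mentioned in the results below.
        admin.RunCommand(new CommandDocument { { "enablesharding", "testdb" } });
        admin.RunCommand(new CommandDocument
        {
            { "shardcollection", "testdb.test" },
            { "key", new BsonDocument("_id", 1) }
        });
    }
}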
From the test, for the single-process configuration:
1) With non-safe inserts, MongoDB's insert performance is very high at first, but it drops once the data reaches about 20 million records, which is exactly when the server's 24GB of memory fills up (MongoDB keeps taking memory as the test runs, until the operating system's memory is fully occupied). This is MongoDB's memory-mapped storage at work: while all the data fits in memory it is fast, and once some data has to be swapped out to disk, performance degrades severely. (It is still not too bad: we index three columns, and even after memory is full it can insert about 2MB of data per second, versus about 25MB per second at the beginning.) Foursquare in fact also uses MongoDB as a memory database with persistence; they just did not handle sharding in time when they hit the memory bottleneck.
2) Bulk insert (InsertBatch) commits a batch of documents at a time, yet the gain over single inserts is not large. One reason is that network bandwidth has already become the bottleneck; the global write lock may be another.
3) Safe inserts are relatively stable and do not fluctuate much. I think this may be because a safe insert makes sure the data is actually persisted, instead of returning as soon as it is handed to memory.
4) Queries on a single indexed column stay quite stable. Do not underestimate this: it sustains 8,000-9,000 queries per second, each returning 10KB, or about 80MB of data queried per second, and it holds that level with 200 million records in the database. That performance is impressive. (The query shapes for tests 4-7 are sketched in the code after this list.)
5) Queries on two conditions returning small documents perform better overall than test 4; returning less data probably helps a lot, although performance fluctuates a bit more, perhaps because the extra condition adds a chance of paging data in from disk.
6) Queries on one column plus sort and skip become noticeably worse once the data grows (by then the index size exceeds memory; I don't know whether that is related). I guess the skip is what costs performance, though the gap is not particularly large.
7) Queries returning large result sets bottleneck at roughly 800 per second, again about 80MB of data per second, which further shows that with an index in place, sequential reads and conditional lookups perform similarly; here the bottleneck is IO and the network.
8) Throughout the run the indexes took up a considerable share of the total data volume. At about 140 million records the indexes alone could fill the whole memory, yet query performance stayed very high and insert performance was still decent. MongoDB's performance really is impressive.
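For completeness, a minimal sketch of how the query shapes in tests 4-7 look with the same legacy C# driver; the field names and lookup values are illustrative assumptions.

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class QuerySketch
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://192.168.0.1:27017"); // hypothetical address
        var coll = server.GetDatabase("testdb").GetCollection<BsonDocument>("test");

        // Test 4: one indexed condition, return 10 full documents (~10KB).
        var ten = coll.Find(Query.EQ("number", 12345)).SetLimit(10);

        // Test 5: two indexed conditions, return only two small fields per document.
        var small = coll.Find(Query.And(Query.EQ("number", 12345), Query.EQ("number1", 678)))
                        .SetFields(Fields.Include("number", "number1"))
                        .SetLimit(10);

        // Test 6: one indexed condition, sort by the descending date index,
        // skip 100 records, return 10.
        var sorted = coll.Find(Query.EQ("number", 12345))
                         .SetSortOrder(SortBy.Descending("date"))
                         .SetSkip(100)
                         .SetLimit(10);

        // Test 7: no condition, no sort, return 100 documents (~100KB).
        var hundred = coll.FindAll().SetLimit(100);

        // Cursors are lazy; enumerate them to actually run the queries.
        foreach (var doc in ten) { }
        foreach (var doc in small) { }
        foreach (var doc in sorted) { }
        foreach (var doc in hundred) { }
    }
}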
So how did the sharding configuration stand out:
1) Like the single-process configuration, non-safe inserts drop dramatically once memory is full. Safe inserts are much slower than the single process, but very stable.
2) Queries on one or two conditions are relatively stable. Single-condition queries run at about half the single-process rate, but multi-condition queries are sometimes even faster than the single process. I think this is because the matching data sometimes lives on two shards, so mongos queries both shards in parallel and then merges the results; since these queries return little data, the network is unlikely to be the bottleneck, which gives sharding a chance to shine.
3) For queries with order and skip, sharding falls behind. I think the main loss is in the ordering, because we did not shard on the sort field but used _id as the shard key, which makes sorting harder to push down.
4) For queries returning large result sets, sharding is actually not far behind the single process. I think the data forwarding may be one source of loss (although mongos runs locally on the client machine, the data still changes hands once).
5) Disk usage is about the same for both. Part of the difference may be that multiple processes each allocate a little extra space, and sharding sometimes uses a bit more disk than the single process (the cases where it appeared to use less were actually an early coding mistake that confused actual data size with disk file size).
The final shard distribution at the end of the test was as follows:
{
    "sharded": true,
    "ns": "testdb.test",
    "count": 209766143,
    "size": 214800530672,
    "avgObjSize": 1024.0000011441311,
    "storageSize": 222462757776,
    "nindexes": 4,
    "nchunks": 823,
    "shards": {
        "shard0000": {
            "ns": "testdb.test",
            "count": 69474248,
            "size": 71141630032,
            "avgObjSize": 1024.0000011515058,
            "storageSize": 74154252592,
            "numExtents": 65,
            "nindexes": 4,
            "lastExtentSize": 2146426864,
            "paddingFactor": 1,
            "flags": 1,
            "totalIndexSize": 11294125824,
            "indexSizes": {
                "_id_": 2928157632,
                "number_1": 2832745408,
                "number1_1": 2833974208,
                "date_-1": 2699248576
            },
            "ok": 1
        },
        "shard0001": {
            "ns": "testdb.test",
            "count": 70446092,
            "size": 72136798288,
            "avgObjSize": 1024.00000113562,
            "storageSize": 74154252592,
            "numExtents": 65,
            "nindexes": 4,
            "lastExtentSize": 2146426864,
            "paddingFactor": 1,
            "flags": 1,
            "totalIndexSize": 11394068224,
            "indexSizes": {
                "_id_": 2969355200,
                "number_1": 2826453952,
                "number1_1": 2828403648,
                "date_-1": 2769855424
            },
            "ok": 1
        },
        "shard0002": {
            "ns": "testdb.test",
            "count": 69845803,
            "size": 71522102352,
            "avgObjSize": 1024.00000114538,
            "storageSize": 74154252592,
            "numExtents": 65,
            "nindexes": 4,
            "lastExtentSize": 2146426864,
            "paddingFactor": 1,
            "flags": 1,
            "totalIndexSize": 11300515584,
            "indexSizes": {
                "_id_": 2930942912,
                "number_1": 2835243968,
                "number1_1": 2835907520,
                "date_-1": 2698421184
            },
            "ok": 1
        }
    },
    "ok": 1
}
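The block above looks like collstats output for the sharded collection gathered through mongos; a hedged sketch of pulling it from the C# side with the same legacy driver (the mongos address is an assumption):

using MongoDB.Bson;
using MongoDB.Driver;

class StatsSketch
{
    static void Main()
    {
        // Connect through the mongos so the per-shard breakdown is included (address assumed).
        var mongos = MongoServer.Create("mongodb://192.168.0.2:27017");
        var db = mongos.GetDatabase("testdb");

        // Run collstats on the sharded collection and dump the raw result,
        // which is the kind of document shown above.
        var stats = db.RunCommand(new CommandDocument { { "collstats", "test" } });
        System.Console.WriteLine(stats.Response.ToJson());
    }
}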
Although in the end, due to time constraints, the billion-record level was not tested, the data so far is already enough to show how strong MongoDB's performance is. Another reason is that in many cases the data will probably only grow to the point where we split the database, rather than letting a single database's indexes grow that large. A few issues that came up during testing are worth noting:
1) With a large amount of data, if the service is restarted, it can accept queries and modifications during the startup initialization phase, but performance is very poor, because MongoDB keeps loading data from disk into memory and the IO pressure is huge.
2) With a large amount of data, if the service is not shut down cleanly, the time MongoDB takes to repair the database on startup is considerable. The --dur option in 1.8 apparently solves this; officially it has no impact on reads and slightly lowers write speed. I will test it next.
3) When sharding is used, MongoDB occasionally splits and migrates data, and performance degrades sharply while this happens. It does not show up in the test charts (because each test runs many iterations), but from direct observation insert performance can drop to a few hundred per second while data is being moved. I actually think it may be better to partition data manually, or to split off a history database by hand, rather than relying on automatic sharding: putting data in the right place from the start is far more efficient than splitting and migrating it later. Personally I think a single MongoDB database should hold no more than about 100 million records; beyond that, partition manually.
4) Using multiple threads for inserts does not improve performance; it actually degrades it a little (and on the HTTP status interface you can see a large number of threads waiting).
5) Throughout the test, batch inserts occasionally hit a "connection was closed by the remote host" error. I suspect either MongoDB sometimes closes the connection when it is unstable, or the official C# client has a bug, but it only happened a few times and only when the data volume was particularly large.
Latest addition: after a few more days of testing, the data grew to 500 million records and total disk usage exceeded 500GB. Compared with 200 million records, all the numbers look similar, except that tests 6 and 7, once past the 200-million level, fluctuate by about 30% in a very regular cycle of roughly every 4 million records.
Performance test of "turn" MONGODB billion-level data volume