Performance testing of MongoDB at billion-level data volume


A performance test of MongoDB at billion-level data volume was carried out; the following items were tested:

(All inserts are single-threaded; all reads are multi-threaded. A sketch of these operations with the C# driver follows the list.)

1) Normal insert performance (each inserted document is about 1 KB).

2) Bulk insert performance (using the official C# client's InsertBatch), to measure how much batching improves insert throughput.

3) Safe insert (using the SafeMode.True switch to ensure each insert succeeds), to measure how much the safety guarantee costs in insert performance.

4) Querying on one indexed numeric column and returning 10 records (about 10 KB), to measure indexed query performance.

5) Querying on two indexed numeric columns and returning 10 records, each projecting only two small fields (about 20 bytes per record), to measure how the amount of returned data and the number of query conditions affect performance.

6) Querying on one indexed numeric column, sorting by a date field that has its own descending index, skipping 100 records, and returning 10 records, to measure the effect of skip and order-by on performance.

7) Querying 100 records (about 100 KB) with no sort and no conditions, to measure the effect of a large result set on query performance.

8) Statistics collected as the test progresses: total disk usage, index disk usage, and data disk usage.
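To make the test items above concrete, here is a minimal sketch of the corresponding operations using the 1.x API of the official C# driver mentioned above. The connection string, collection name, filler "payload" field, and sample values are assumptions made for illustration; the field names number, number1, and date are taken from the index names that appear in the sharding statistics later in the article.

using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class MongoLoadTestSketch
{
    static void Main()
    {
        // Hypothetical connection string; the article only says the storage server runs CentOS.
        var server = MongoServer.Create("mongodb://testserver:27017");
        var collection = server.GetDatabase("testdb").GetCollection<BsonDocument>("test");

        // Indexes implied by the index names in the sharding statistics:
        // number (ascending), number1 (ascending), date (descending).
        collection.EnsureIndex(IndexKeys.Ascending("number"));
        collection.EnsureIndex(IndexKeys.Ascending("number1"));
        collection.EnsureIndex(IndexKeys.Descending("date"));

        // A roughly 1 KB document; "payload" is a made-up filler field.
        Func<int, BsonDocument> makeDoc = i => new BsonDocument
        {
            { "number", i },
            { "number1", i },
            { "date", DateTime.UtcNow },
            { "payload", new string('x', 950) }
        };

        // 1) Normal (unacknowledged) insert.
        collection.Insert(makeDoc(1));

        // 2) Bulk insert via InsertBatch.
        var batch = new BsonDocument[100];
        for (int i = 0; i < batch.Length; i++)
        {
            batch[i] = makeDoc(i);
        }
        collection.InsertBatch(batch);

        // 3) Safe insert: wait for the server to acknowledge the write.
        collection.Insert(makeDoc(2), SafeMode.True);

        // 4) One indexed condition, return 10 full documents (~10 KB).
        var ten = collection.Find(Query.EQ("number", 42)).SetLimit(10);

        // 5) Two indexed conditions, return only two small fields per document.
        var small = collection.Find(Query.And(Query.EQ("number", 42), Query.EQ("number1", 42)))
                              .SetFields(Fields.Include("number", "number1"))
                              .SetLimit(10);

        // 6) One indexed condition, sort by date descending, skip 100, return 10.
        var paged = collection.Find(Query.EQ("number", 42))
                              .SetSortOrder(SortBy.Descending("date"))
                              .SetSkip(100)
                              .SetLimit(10);

        // 7) No condition, no sort, return 100 documents (~100 KB).
        var hundred = collection.FindAll().SetLimit(100);

        // Cursors are lazy; iterate them so the queries actually execute.
        foreach (var cursor in new[] { ten, small, paged, hundred })
        {
            foreach (var doc in cursor) { }
        }
    }
}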

Each test was run in two configurations: a single MongoDB process, and the same server running three MongoDB processes as shards (so each process could use only about 7 GB of memory).

For the sharded configuration, although the three processes sit on one machine, each process queries its portion of the data in parallel and a mongos running on another machine merges the results, so in theory performance should improve in some cases. Based on the assumptions above, my guess was that performance would degrade in some cases and improve in others; so what do the final results show? (A sketch of the corresponding sharding setup commands follows below.)
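The article does not give the exact deployment commands, but a three-shard setup like the one described above is normally assembled by sending the addshard, enablesharding, and shardcollection admin commands to a mongos. A rough sketch, again with the legacy C# driver; the host names and ports are hypothetical, and the _id shard key matches what the article says later was used.

using MongoDB.Bson;
using MongoDB.Driver;

class ShardingSetupSketch
{
    static void Main()
    {
        // Connect to the mongos running on the separate machine and use the
        // admin database, where the sharding commands live.
        var mongos = MongoServer.Create("mongodb://mongoshost:27017");
        var admin = mongos.GetDatabase("admin");

        // Register the three mongod processes on the storage server as shards
        // (the ports are hypothetical).
        admin.RunCommand(new CommandDocument("addshard", "storageserver:10001"));
        admin.RunCommand(new CommandDocument("addshard", "storageserver:10002"));
        admin.RunCommand(new CommandDocument("addshard", "storageserver:10003"));

        // Enable sharding on the test database and shard the collection on _id,
        // which is the shard key the article says was used.
        admin.RunCommand(new CommandDocument("enablesharding", "testdb"));
        admin.RunCommand(new CommandDocument
        {
            { "shardcollection", "testdb.test" },
            { "key", new BsonDocument("_id", 1) }
        });
    }
}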

Note: the storage server under test is an E5620 @ 2.40 GHz with 24 GB of memory running CentOS; the load-generating machine is an E5504 @ 2.0 GHz with 4 GB of memory running Windows Server 2003. Both have gigabit NICs and are directly connected.

As can be seen from this test, for the single-process configuration:

1) With non-safe inserts, MongoDB's insert performance is very high at the beginning, but after about 20 million records it drops sharply. That is the point at which the server's 24 GB of memory is essentially full (during the test MongoDB keeps taking memory until the operating system's memory is exhausted). In other words, because MongoDB uses memory-mapped files, it is very fast while all of the data fits in memory, but once some of the data has to be swapped out to disk, performance degrades severely.

(Even then the performance is not bad: since we index three columns, it can still insert about 2 MB of data per second after memory is full, compared with roughly 25 MB per second at the start.)

Foursquare actually uses MongoDB in a similar way, as an in-memory database with persistence; its troubles arose only because sharding was not handled well once the memory bottleneck was hit.

2) The bulk insert function actually commits a batch of documents at a time, but compared with single inserts the performance gain is not large; one reason is that network bandwidth has become the bottleneck, and another, I suspect, is the global write lock.

3) Safe inserts are comparatively stable and do not fluctuate much. I suspect this is because a safe insert waits for the server to acknowledge the write, rather than returning as soon as the data has been handed off to memory.

4) Queries with a single-column condition are quite stable in performance. Do not underestimate this: the system sustains 8,000-9,000 queries per second, each returning 10 KB, which is equivalent to querying 80 MB of data per second, and it maintains this level even after 200 million records. The performance is impressive. (A sketch of the kind of multi-threaded query loop behind these numbers follows this list.)

5) Queries with two-column conditions that return small documents perform better overall than 4), probably because returning less data helps considerably, but the performance also fluctuates more, perhaps because the extra condition adds one more chance of a page fault from disk.

6) For queries with a single-column condition plus sort and skip, performance becomes noticeably worse once the data volume grows large (at that point the index size exceeds memory; I do not know whether the two are related). I suspect the skip is what costs performance, but compared with 4) the gap is not very large.

7) For queries returning large results, the bottleneck is also about 800 queries per second, i.e. 80 MB of data, which further shows that with an index, sequential reads and conditional queries perform similarly; at this point the bottleneck is I/O and the network.

8) Throughout the test the indexes accounted for a significant share of the total data volume. Once the record count reached about 140 million, the indexes alone could fill the entire memory, yet query performance was still very high and insert performance was not bad either. MongoDB's performance really is impressive.
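For reference, here is a minimal sketch of the kind of multi-threaded read loop mentioned at the start (writes single-threaded, reads multi-threaded) that could produce the queries-per-second numbers above. The thread count, key range, and reporting loop are assumptions, not the author's actual harness.

using System;
using System.Threading;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class QueryBenchmarkSketch
{
    static long _queries;

    static void Main()
    {
        var server = MongoServer.Create("mongodb://testserver:27017");
        var collection = server.GetDatabase("testdb").GetCollection<BsonDocument>("test");

        // Several reader threads, each repeatedly running the one-condition,
        // 10-record query from test item 4. MongoCollection is thread-safe.
        for (int t = 0; t < 8; t++)
        {
            new Thread(() =>
            {
                var rng = new Random(Guid.NewGuid().GetHashCode());
                while (true)
                {
                    int key = rng.Next(0, 100000000);   // hypothetical key range
                    foreach (var doc in collection.Find(Query.EQ("number", key)).SetLimit(10)) { }
                    Interlocked.Increment(ref _queries);
                }
            }) { IsBackground = true }.Start();
        }

        // Print queries per second once a second.
        long last = 0;
        while (true)
        {
            Thread.Sleep(1000);
            long now = Interlocked.Read(ref _queries);
            Console.WriteLine("{0} queries/sec", now - last);
            last = now;
        }
    }
}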

Now let us see how the sharded configuration performed:

1) For non-safe inserts, as with the single-process configuration, performance drops dramatically once memory is full. Safe insert performance is much slower than in the single-process configuration, but very stable.

2) For one-condition and two-condition queries, performance is relatively stable; the one-condition query runs at about half the single-process rate, but the multi-condition query is sometimes even faster than the single-process one.

I think this is because at times the matching data blocks live on two shards, so mongos queries both shards in parallel and then merges the results; since the query returns little data, the network is unlikely to become a bottleneck, which gives sharding a chance to shine.

3) For order-by and skip queries, the sharded configuration falls clearly behind. I think the main loss is in the sort, because we did not use the sort field as the shard key but used _id instead, which makes this kind of ordering harder.

4) For queries returning a large volume of data, the sharded configuration is actually not far behind the single process; I think the extra data forwarding is the source of the loss (although mongos is located on the same machine, the data still changes hands once).

5) Disk space usage is about the same in the two configurations. Some of the difference is because multiple processes each pre-allocate a bit more space, so together they sometimes use a bit more disk than the single process (the cases where they use less than the single process are actually due to a coding error at the start of the test, which made the actual data size and the disk file size inconsistent).

The final shard distribution at the end of the test is as follows:

1{2 "sharded":true,3 " ns": "Testdb.test",4 "Count": 209766143,5 "Size": 214800530672,6 "Avgobjsize": 1024.0000011441311,7 "Storagesize": 222462757776,8 "Nindexes": 4,9 "Nchunks": 823,Ten "Shards": { One "shard0000": { A " ns": "Testdb.test", - "Count": 69474248, - "Size": 71141630032, the "Avgobjsize": 1024.0000011515058, - "Storagesize": 74154252592, - "numextents": - "Nindexes": 4, + "Lastextentsize": 2146426864, - "Paddingfactor": 1, + "Flags": 1, A "Totalindexsize": 11294125824, at "indexsizes": { - "_id_": 2928157632, - "number_1": 2832745408, - "Number1_1": 2833974208, - "date_-1": 2699248576 -}, in "OK": 1 -                 }, to "shard0001":{ + "NS":"Testdb.test", - "Count": 70446092, the "Size": 72136798288, * "Avgobjsize": 1024.00000113562, $ "Storagesize": 74154252592,Panax Notoginseng "numextents": - "Nindexes": 4, the "Lastextentsize": 2146426864, + "Paddingfactor": 1, A "Flags": 1, the "Totalindexsize": 11394068224, + "indexsizes": { - "_id_": 2969355200, $ "number_1": 2826453952, $ "Number1_1": 2828403648, - "date_-1": 2769855424 -}, the "OK": 1 -                 },Wuyi "shard0002":{ the "NS":"Testdb.test", - "Count": 69845803, Wu "Size": 71522102352, - "Avgobjsize": 1024.00000114538, About "Storagesize": 74154252592, $ "numextents": - "Nindexes": 4, - "Lastextentsize": 2146426864, - "Paddingfactor": 1, A "Flags": 1, + "Totalindexsize": 11300515584, the "indexsizes": { - "_id_": 2930942912, $ "number_1": 2835243968, the "Number1_1": 2835907520, the "date_-1": 2698421184 the}, the "OK": 1 -                 } in         }, the "OK": 1 the }
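The statistics above are the output of the collstats command (what the shell's db.test.stats() helper wraps), run through mongos so that the per-shard breakdown is included. A minimal way to pull the same document with the C# driver (connection details hypothetical):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

class CollStatsSketch
{
    static void Main()
    {
        var mongos = MongoServer.Create("mongodb://mongoshost:27017");
        var db = mongos.GetDatabase("testdb");

        // "collstats" returns the collection statistics shown above.
        var result = db.RunCommand(new CommandDocument("collstats", "test"));
        Console.WriteLine(result.Response.ToJson());
    }
}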

Although, due to time constraints, the test did not reach the full 1 billion records, the data already shows how strong MongoDB's performance is.

Another reason is that in many cases the data will probably only grow to the point where we would split it into separate databases anyway, so the indexes of any one database never get that large.

Several issues worth noting came up during testing:

1) With a large amount of data, if the service is restarted, then during the start-up initialization phase it can accept queries and modifications, but performance is poor, because MongoDB keeps swapping data from disk into memory and the I/O pressure is very high.

2) With a large amount of data, if the service is not shut down properly, MongoDB takes a very long time to repair the database on start-up. The --dur option introduced in 1.8 seems to solve this; according to the official documentation it has no impact on reads and only slightly lowers write speed. I will test it when I have time.

3) When sharding is used, MongoDB occasionally splits and migrates chunks, and performance is very poor while this happens. It does not show up in the test charts (because each test run covers many iterations), but from direct observation I could see insert performance drop to a few hundred per second while data was being moved. In fact, I think you could split the data manually, or manually move older data into a history database, rather than relying on this automatic sharding, because putting the data in the right place from the start is vastly more efficient than splitting and migrating it later (a sketch of manual pre-splitting follows this list). Personally I think a single MongoDB database should hold no more than about 100 million records; beyond that, split into separate databases manually.

4) For data insertion, using multiple threads does not improve performance; it even degrades it slightly (and on the HTTP status interface you can see a large number of threads waiting).

5) Throughout the test, bulk inserts occasionally failed with a "connection was closed by the remote host" error. I suspect either that MongoDB sometimes closes the connection when it is unstable, or that the official C# client has a bug, but it happened only a few times, and only when the data volume was particularly large.
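As a rough illustration of the manual splitting suggested in point 3, chunks can be pre-split at chosen boundaries and placed on specific shards with the split and moveChunk admin commands before the bulk load starts, so that little or no migration happens while inserts are running. The sketch below assumes a numeric shard key (the test itself sharded on _id, where pre-splitting is less natural) and hypothetical boundary values; the shard name comes from the statistics output above.

using MongoDB.Bson;
using MongoDB.Driver;

class PreSplitSketch
{
    static void Main()
    {
        var mongos = MongoServer.Create("mongodb://mongoshost:27017");
        var admin = mongos.GetDatabase("admin");

        // Pre-split the collection at chosen boundaries before loading.
        // The boundaries assume a numeric shard key rather than _id.
        long[] boundaries = { 100000000L, 200000000L };
        foreach (var middle in boundaries)
        {
            admin.RunCommand(new CommandDocument
            {
                { "split", "testdb.test" },
                { "middle", new BsonDocument("number", middle) }
            });
        }

        // Move the chunk containing a given key to a specific shard so the
        // data lands in the right place from the start.
        admin.RunCommand(new CommandDocument
        {
            { "moveChunk", "testdb.test" },
            { "find", new BsonDocument("number", 150000000L) },
            { "to", "shard0001" }
        });
    }
}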

Update:

After a few more days of testing, the data volume grew to 500 million records and total disk usage exceeded 500 GB. Compared with 200 million records, all performance numbers are about the same, except that in tests 6 and 7, once past the 200-million mark, performance fluctuates up and down by about 30% in a very regular cycle of every 4 million records.
