Performance Test of Hundreds of Millions of MongoDB Data Records

We tested MongoDB's performance with hundreds of millions of data records, covering the following items (a sketch of the corresponding client calls follows the list):

(All inserts are performed in a single thread; all reads are performed in multiple threads.)

1) Plain insert performance (each inserted record is about 1 KB).

2) Batch insert performance (using InsertBatch in the official C# client). This shows how much batch inserting improves throughput.

3) Safe insert performance (SafeMode.True is used to confirm that each insert succeeded). This shows the cost of acknowledged inserts.

4) Query on one indexed numeric column, returning 10 records (about 10 KB). This measures indexed query performance.

5) Query on two indexed numeric columns, returning 10 records (only two small fields, about 20 bytes per record). This measures the effect of returning a small amount of data and of adding one more query condition.

6) Query on one indexed numeric column, sorted by a separately indexed date field (the index is built in descending order and the sort is also descending), skipping 100 records and returning 10. This measures the impact of Skip and Order on performance.

7) Query returning 100 records (about 100 KB, at roughly 1 KB per record) with no conditions and no sorting. This measures the impact of returning a larger result set.

8) Track total disk usage, index disk usage, and data disk usage as the test progresses.
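As a point of reference (not the author's actual benchmark code), here is a minimal sketch of what these operations look like with the legacy 1.x official C# driver, the API generation that exposes InsertBatch and SafeMode. The database/collection name testdb.test and the field names Number, Number1 and Date are taken from the stats output later in the article; the host name, payload padding and query values are illustrative assumptions.

    using System;
    using System.Collections.Generic;
    using MongoDB.Bson;
    using MongoDB.Driver;
    using MongoDB.Driver.Builders;

    class MongoPerfSketch
    {
        // Build a ~1 KB test document; field names match the index names
        // shown in the stats output (Number_1, Number1_1, Date_-1).
        static BsonDocument MakeDoc(long i)
        {
            return new BsonDocument
            {
                { "Number", i },
                { "Number1", i * 2 },
                { "Date", DateTime.UtcNow },
                { "Payload", new string('x', 980) }   // pad the record to roughly 1 KB
            };
        }

        static void Main()
        {
            // Legacy 1.x driver style (MongoServer/SafeMode); newer drivers use MongoClient.
            var server = MongoServer.Create("mongodb://storage-server");   // hypothetical host name
            var coll = server.GetDatabase("testdb").GetCollection<BsonDocument>("test");

            // Indexes corresponding to Number_1, Number1_1 and Date_-1 in the stats output.
            coll.EnsureIndex(IndexKeys.Ascending("Number"));
            coll.EnsureIndex(IndexKeys.Ascending("Number1"));
            coll.EnsureIndex(IndexKeys.Descending("Date"));

            // 1) Plain (unacknowledged) insert.
            coll.Insert(MakeDoc(1));

            // 2) Batch insert: many documents in a single InsertBatch call.
            var batch = new List<BsonDocument>();
            for (long i = 2; i < 102; i++) batch.Add(MakeDoc(i));
            coll.InsertBatch(batch);

            // 3) Safe insert: wait for the server to acknowledge the write.
            coll.Insert(MakeDoc(1000), SafeMode.True);

            // 4) One indexed condition, return 10 full records (~10 KB).
            var t4 = coll.Find(Query.EQ("Number", 42L)).SetLimit(10);

            // 5) Two indexed conditions, return only two small fields.
            var t5 = coll.Find(Query.And(Query.EQ("Number", 42L), Query.EQ("Number1", 84L)))
                         .SetFields(Fields.Include("Number", "Number1"))
                         .SetLimit(10);

            // 6) One condition, descending sort on the indexed Date field, skip 100, take 10.
            var t6 = coll.Find(Query.EQ("Number", 42L))
                         .SetSortOrder(SortBy.Descending("Date"))
                         .SetSkip(100)
                         .SetLimit(10);

            // 7) No condition, no sort, return 100 records (~100 KB).
            var t7 = coll.FindAll().SetLimit(100);

            // Cursors are lazy; enumerate them to actually send the queries.
            foreach (var d in t4) { }
            foreach (var d in t5) { }
            foreach (var d in t6) { }
            foreach (var d in t7) { }
        }
    }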

Each test was run against both a single-process MongoDB instance and, on the same server, three MongoDB processes acting as shards (each process limited to roughly 7 GB of memory).

Although the sharded setup is just three processes on one machine, each process queries only part of the data in parallel, and a mongos running on another machine merges the results, so in theory performance could improve in some cases.
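The original post does not show how the three processes were wired together. Purely as an illustration, assuming three already-running mongod shard processes and a mongos router, the cluster-level commands could be issued through the same legacy C# driver (host names and ports are made up; in practice this is usually done from the mongo shell):

    using MongoDB.Bson;
    using MongoDB.Driver;

    class ShardingSetupSketch
    {
        static void Main()
        {
            // Connect to the mongos router (assumed to run on the load-generating machine).
            var mongos = MongoServer.Create("mongodb://load-machine:27017");   // hypothetical address
            var admin = mongos.GetDatabase("admin");

            // Register the three mongod processes on the storage server as shards.
            admin.RunCommand(new CommandDocument { { "addshard", "storage-server:27018" } });
            admin.RunCommand(new CommandDocument { { "addshard", "storage-server:27019" } });
            admin.RunCommand(new CommandDocument { { "addshard", "storage-server:27020" } });

            // Enable sharding on the test database and shard the collection on _id
            // (the article notes that _id, not the sort field, was used as the shard key).
            admin.RunCommand(new CommandDocument { { "enablesharding", "testdb" } });
            admin.RunCommand(new CommandDocument
            {
                { "shardcollection", "testdb.test" },
                { "key", new BsonDocument("_id", 1) }
            });
        }
    }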

Based on this, my guess was that performance would drop in some cases and improve in others. Let's see what the final results show.

Note: the storage server under test has an E5620 @ 2.40 GHz CPU, 24 GB of memory, and runs CentOS; the load-generating machine has an E5504 @ 2.0 GHz CPU, 4 GB of memory, and runs Windows Server 2003. The two machines are directly connected with gigabit NICs.

From this test, the observations for the single-process setup are as follows:

1) MongoDB's unsafe (unacknowledged) insert mode performs very well at first, but throughput drops sharply after about 20 million records, which is when the server's 24 GB of memory is essentially full (as the test progressed, MongoDB kept taking memory until the operating system's memory was exhausted). This is a consequence of MongoDB's memory-mapped storage: everything is fast while all data fits in memory, and performance falls off substantially once pages have to be swapped out to disk. (It is still not bad: with indexes on three columns, it can insert about 2 MB of data per second even after memory is full, versus about 25 MB per second at the start.) Foursquare effectively uses MongoDB as a persistent in-memory database, and ran into trouble when it failed to notice in time that it had hit the memory ceiling.

2) Batch insert submits a batch of documents in a single call, but performance is not much better than single inserts. First, network bandwidth had already become the bottleneck; second, I suspect the global write lock is also a factor.

3) Safe insert performance is relatively stable and does not fluctuate much. I suspect this is because safe insert effectively waits for the data to be persisted rather than just dumping it into memory.

4) Queries with a single indexed condition are quite stable. Don't underestimate this: at 10 KB returned per query, the throughput is equivalent to about 80 MB of data per second (on the order of 8,000 queries per second), and this level holds even beyond 200 million records. Impressive performance.

5) Queries with two conditions that return a small amount of data perform somewhat better overall than 4), probably because the smaller result set helps, but the fluctuations are also a bit larger; the extra condition means one more chance of having to page data in from disk.

6) For a single-condition query plus Sort and Skip, performance clearly deteriorates once the data volume gets large (around the point where the index size exceeds memory; I'm not sure whether that is related). My guess is that Skip costs a little, but overall it is not far off from 4).

7) For queries that return a larger result set, the bottleneck is again around 800 queries per second, i.e. about 80 MB of data per second. This further shows that with an index in place, sequential reads and conditional queries perform almost the same, and the real bottleneck is I/O and the network.

8) Throughout the test, the indexes account for a considerable proportion of the total data volume. Once the record count exceeds roughly 100 million, the indexes alone can fill all of memory, yet query performance remains very high and insert performance is still acceptable. MongoDB's performance really is very good.

Now the highlights of the sharded setup:

1) Unsafe insert behaves like the single-process configuration: performance drops sharply once memory is full. Safe insert is much slower than in the single-process case, but very stable.

2) One-condition and two-condition queries are both stable. Throughput sits at roughly half the single-process level, although in many cases it actually comes out a little higher than the single process. I think this is because the requested data sometimes lives on two shards, so mongos queries both in parallel and then merges the results; since the returned data volume is small, the network is unlikely to be the bottleneck, which gives sharding a chance to shine.

3) For queries with Order and Skip, the sharded setup noticeably lags behind. I think most of the loss is in the Order step, because the sort field is not the shard key (_id was used as the shard key), which makes the sort harder.

4) For queries returning a large amount of data, the sharded setup is not much different from the single process. I suspect the extra data forwarding costs something (even though mongos runs on the load-generating machine, the data still makes one extra hop).

5) Disk space usage is similar for the two setups. Some of the gap may come from multiple processes preallocating more space, which sometimes makes the sharded setup use more disk than the single process (the cases where it appeared to use less were actually early coding mistakes on my side, mixing up the actual data size and the on-disk file size).

The data distribution across the shards at the end of the test is as follows:

    {
        "sharded" : true,
        "ns" : "testdb.test",
        "count" : 209766143,
        "size" : 214800530672,
        "avgObjSize" : 1024.0000011441311,
        "storageSize" : 222462757776,
        "nindexes" : 4,
        "nchunks" : 823,
        "shards" : {
            "shard0000" : {
                "ns" : "testdb.test",
                "count" : 69474248,
                "size" : 71141630032,
                "avgObjSize" : 1024.0000011515058,
                "storageSize" : 74154252592,
                "numExtents" : 65,
                "nindexes" : 4,
                "lastExtentSize" : 2146426864,
                "paddingFactor" : 1,
                "flags" : 1,
                "totalIndexSize" : 11294125824,
                "indexSizes" : {
                    "_id_" : 2928157632,
                    "Number_1" : 2832745408,
                    "Number1_1" : 2833974208,
                    "Date_-1" : 2699248576
                },
                "ok" : 1
            },
            "shard0001" : {
                "ns" : "testdb.test",
                "count" : 70446092,
                "size" : 72136798288,
                "avgObjSize" : 1024.00000113562,
                "storageSize" : 74154252592,
                "numExtents" : 65,
                "nindexes" : 4,
                "lastExtentSize" : 2146426864,
                "paddingFactor" : 1,
                "flags" : 1,
                "totalIndexSize" : 11394068224,
                "indexSizes" : {
                    "_id_" : 2969355200,
                    "Number_1" : 2826453952,
                    "Number1_1" : 2828403648,
                    "Date_-1" : 2769855424
                },
                "ok" : 1
            },
            "shard0002" : {
                "ns" : "testdb.test",
                "count" : 69845803,
                "size" : 71522102352,
                "avgObjSize" : 1024.00000114538,
                "storageSize" : 74154252592,
                "numExtents" : 65,
                "nindexes" : 4,
                "lastExtentSize" : 2146426864,
                "paddingFactor" : 1,
                "flags" : 1,
                "totalIndexSize" : 11300515584,
                "indexSizes" : {
                    "_id_" : 2930942912,
                    "Number_1" : 2835243968,
                    "Number1_1" : 2835907520,
                    "Date_-1" : 2698421184
                },
                "ok" : 1
            }
        },
        "ok" : 1
    }
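Output like the above comes from MongoDB's collstats command (what db.test.stats() runs in the shell); issued through mongos, it includes the per-shard breakdown. A minimal sketch of retrieving it with the same legacy C# driver, again with an assumed host name:

    using MongoDB.Bson;
    using MongoDB.Driver;

    class StatsSketch
    {
        static void Main()
        {
            var mongos = MongoServer.Create("mongodb://load-machine:27017");   // hypothetical mongos address
            var db = mongos.GetDatabase("testdb");

            // collstats is what db.test.stats() calls under the hood.
            var result = db.RunCommand(new CommandDocument { { "collstats", "test" } });
            System.Console.WriteLine(result.Response.ToJson());
        }
    }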

Although, for lack of time, I did not push the test to 1 billion records, the results already show how capable MongoDB is. Another reason is that in practice we would usually split the data once it reaches the tens of millions rather than letting a single database's indexes grow that huge. A few issues worth noting from the test:

1) With a large data volume, a restarted service does accept queries and writes during its warm-up phase, but performance is poor at that point because MongoDB is constantly paging data in from disk and I/O pressure is very high.

2) With a large data volume, if the service is not shut down cleanly, MongoDB takes a considerable amount of time to repair the database on startup. The --dur (journaling) option introduced in 1.8 should address this; according to the official documentation, reads are unaffected and writes are only slightly slower. I will test it when I have time.

3) With sharding, MongoDB splits and migrates chunks from time to time, and this causes severe performance drops. It is not obvious in the test charts (each test point averages many iterations), but I could see insert throughput fall to a few hundred records per second during chunk migration. In practice I think you can partition data manually, by database or by history, instead of relying on automatic sharding; if the data is placed in the right location from the start, you avoid migrations of unknown efficiency. Personally I think a single MongoDB database is best kept to around 100 million records; beyond that, split it manually (see the pre-splitting sketch after this list).

4) For inserts, using multiple threads does not improve performance; it reduces it. (A large number of threads can also be seen waiting via MongoDB's HTTP interface.)

5) Over the whole test, batch inserts hit a handful of "connection closed by the remote host" errors. I suspect either instability in MongoDB under load or a bug in the official C# client, but it only happened a few times, and only at very large data volumes.
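On the manual-splitting point in 3): if one does stay with MongoDB's own sharding, chunks can at least be pre-split and placed by hand so that bulk loading does not trigger migrations later. Below is a hedged sketch using the standard split and moveChunk admin commands through the same legacy C# driver; the split point and target shard are illustrative, not values from the original test:

    using MongoDB.Bson;
    using MongoDB.Driver;

    class PreSplitSketch
    {
        static void Main()
        {
            var mongos = MongoServer.Create("mongodb://load-machine:27017");   // hypothetical mongos address
            var admin = mongos.GetDatabase("admin");

            // Split the collection at a chosen _id value (illustrative split point).
            var splitPoint = new ObjectId("4f0000000000000000000000");
            admin.RunCommand(new CommandDocument
            {
                { "split", "testdb.test" },
                { "middle", new BsonDocument("_id", splitPoint) }
            });

            // Move the chunk containing that value to a specific shard before loading data into it.
            admin.RunCommand(new CommandDocument
            {
                { "moveChunk", "testdb.test" },
                { "find", new BsonDocument("_id", splitPoint) },
                { "to", "shard0001" }
            });
        }
    }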

Final update: after a few more days of testing I pushed the data volume to 500 million records, taking up more than 500 GB of disk space. Performance was essentially the same as at 200 million records, except for tests 6 and 7: beyond roughly 200 million records, performance fluctuates about 30% up and down in a regular cycle of roughly every 4 million records.

Author: lovecindywang. The copyright of this article is shared by the author and the blog site. You are welcome to reprint it, but unless the author consents otherwise you must keep this statement and provide a link to the original article on the page; otherwise the author reserves the right to pursue legal liability.

