Performance Testing MongoDB with Hundreds of Millions of Records

We tested MongoDB's performance with hundreds of millions of records. The following items were tested (all inserts were performed in a single thread; all reads were performed in multiple threads):

1) Normal insert performance (each inserted record is about 1 KB).

2) Batch insert performance, using the official C# client's InsertBatch method. This tests how much batch insertion improves throughput.

3) Safe insert (the SafeMode.True switch is used so that each insert is confirmed to have succeeded). This tests the performance of safe inserts.

4) Query on an indexed numeric column, returning 10 records (about 10 KB). This tests indexed query performance.

5) Query on two indexed numeric columns, returning only two small fields (about 20 bytes) from each of 10 records. This tests the effect of returning a small amount of data, and of one extra query condition, on performance.

6) Query on an indexed numeric column, sorted in descending order by a second indexed date field, skipping 100 records and returning 10. This measures the impact of Skip and Order on performance.

7) Query returning 100 records (about 100 KB) with no sort and no conditions. This tests the impact of large result sets on performance.

8) Track total disk usage, index disk usage, and data disk usage as the test progresses.
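To make the test items concrete, here is a minimal setup sketch against the legacy official C# driver (the 1.x API that provides InsertBatch and SafeMode, which is what this test used). The server address, database, collection, and field names (n, m, date, payload) are my own illustrative assumptions, not the original test code:

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;

class PerfTestSetup
{
    static void Main()
    {
        // Storage server address and names below are assumptions for illustration.
        MongoServer server = MongoServer.Create("mongodb://192.168.1.10:27017");
        MongoDatabase db = server.GetDatabase("perftest");
        MongoCollection<BsonDocument> coll = db.GetCollection<BsonDocument>("records");

        // The test keeps indexes on three columns: two numeric columns and a date.
        coll.EnsureIndex(IndexKeys.Ascending("n"));
        coll.EnsureIndex(IndexKeys.Ascending("m"));
        coll.EnsureIndex(IndexKeys.Descending("date"));
    }
}
```

The insert and query sketches in the result discussion below all build on this setup.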

Each test was run in two configurations: a single-process MongoDB instance, and three MongoDB processes on the same server acting as shards (each process limited to about 7 GB of memory).

Although sharding here means three processes on one machine, during a query each process searches its own part of the data in parallel, and a mongos on another machine merges the results, so in theory performance could improve in some cases.

Based on this, my guess was that performance would decrease in some cases and improve in others. Let's look at the final test results.

Note: the storage server is an E5620 @ 2.40 GHz with 24 GB of memory, running CentOS. The load-generating machine is an E5504 @ 2.0 GHz with 4 GB of memory, running Windows Server 2003. The two machines are directly connected with Gigabit NICs.

From this test, the single-process results show the following:

1) MongoDB's non-safe insert shows very high performance at the start, but throughput drops sharply after about 20 million records. That is the point where the server's 24 GB of memory is essentially full (over the course of the test, MongoDB kept taking memory until the operating system's memory was exhausted). In other words, MongoDB's memory-mapped storage makes it very fast while all data fits in memory, but performance drops off badly once some data has to be swapped out to disk. Even so, the result is not too bad: with three columns indexed, we could still insert about 2 MB of data per second after memory filled up (at the start it was about 25 MB per second).
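A minimal sketch of the single-threaded, non-safe insert loop, continuing the setup above (the payload string is an assumed filler that pads each record to roughly 1 KB):

```csharp
var rand = new Random();
for (int i = 0; i < 200000000; i++)
{
    var doc = new BsonDocument
    {
        { "n", rand.Next(100000000) },        // first indexed numeric column
        { "m", rand.Next(100000000) },        // second indexed numeric column
        { "date", DateTime.UtcNow },          // indexed date column
        { "payload", new string('x', 1000) }  // filler so each record is ~1 KB
    };
    coll.Insert(doc);  // default non-safe mode: returns without waiting for an ack
}
```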

2) Batch insert really does submit a whole batch of data in a single call, yet its performance is not much better than single inserts. One reason is that network bandwidth had already become the bottleneck; another, I suspect, is the write lock.
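As a sketch, the batch variant buffers records and hands them to InsertBatch in groups (the batch size of 100 and the MakeDocument helper are hypothetical):

```csharp
// Requires: using System.Collections.Generic;
var batch = new List<BsonDocument>();
for (int i = 0; i < 200000000; i++)
{
    batch.Add(MakeDocument());  // hypothetical helper building the ~1 KB record above
    if (batch.Count == 100)
    {
        coll.InsertBatch(batch);  // one network round trip for the whole batch
        batch.Clear();
    }
}
```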

3) Safe insert is relatively stable and does not fluctuate much. I think this may be because a safe insert waits until the data is confirmed to be persisted, rather than just dropping it into memory.
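The safe insert differs only in passing the SafeMode.True switch, which makes the driver wait for the server to acknowledge each write before continuing (a getLastError round trip in this driver generation):

```csharp
coll.Insert(doc, SafeMode.True);  // blocks until the server confirms the write
```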

4) Queries with a single indexed condition have been quite stable in performance. Don't underestimate this: each query returning 10 KB at about 80 MB of data per second works out to roughly 8,000 queries per second, and the database maintains this level even past 200 million records. That is impressive performance.
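A sketch of this single-condition query, reusing the assumed field name n from the setup above:

```csharp
// Test 4): one indexed condition, 10 full records (~10 KB) returned.
var cursor = coll.Find(Query.EQ("n", rand.Next(100000000))).SetLimit(10);
foreach (BsonDocument doc in cursor)
{
    // Enumerating the cursor is what actually pulls the data off the wire.
}
```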

5) Queries with two conditions that return small records perform better overall than 4); the smaller amount of returned data presumably helps. But performance also fluctuates a bit more severely; the extra condition may add one more chance of having to page data in from disk.
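The two-condition, small-result query might look like this sketch; SetFields asks the server to return only the named fields:

```csharp
// Test 5): two indexed conditions, only two small fields (~20 bytes) per record.
var cursor = coll.Find(Query.And(
        Query.EQ("n", rand.Next(100000000)),
        Query.EQ("m", rand.Next(100000000))))
    .SetFields("n", "m")
    .SetLimit(10);
```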

6) For queries on one indexed column plus Sort and Skip, performance clearly deteriorates once the data volume is large (this happens around the point where the index size exceeds memory, though I don't know whether the two are connected). My guess is that Skip is relatively expensive, but the gap compared with 4) is not huge.
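A sketch of the Order-plus-Skip query, sorting in reverse on the indexed date field:

```csharp
// Test 6): one condition, descending sort on the date index, skip 100, take 10.
var cursor = coll.Find(Query.EQ("n", rand.Next(100000000)))
    .SetSortOrder(SortBy.Descending("date"))
    .SetSkip(100)
    .SetLimit(10);
```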

7) For queries returning large results, the bottleneck is likewise about 800 queries per second, i.e. 80 MB of data. This further shows that, once there is an index, sequential reads and conditional lookups perform about the same; the bottleneck is IO and the network.
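And the unconditional bulk read, which simply streams 100 records with no filter and no sort:

```csharp
// Test 7): no condition, no sort; 100 records (~100 KB) per query.
var cursor = coll.FindAll().SetLimit(100);
```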

8) Throughout the test, the indexes accounted for a considerable share of the total data volume. By the time the data reached 100 million records, the indexes alone could fill all of memory. Even then, query performance was still very high and insert performance was not bad; MongoDB's performance really is very good.

Now let's look at the highlights of the sharding setup:

1) Non-safe insert behaves like the single-process configuration: once memory is full, performance drops sharply. Safe insert is much slower than in the single-process setup, but very stable.

2) Queries with one condition and with two conditions are both stable in performance. Conditional query throughput is roughly half that of the single-process setup, yet in many cases it comes out a little higher than single-process. I think this may be because the target data sometimes spans two shards, so mongos queries both shards in parallel and then merges the results; since the returned data volume is small, the network is unlikely to become the bottleneck, which gives sharding a chance to stand out.

3) For Order and Skip queries, the sharding setup fares noticeably worse. I think most of the performance loss is probably in Order, because we did not use the shard key as the sort field (_id was used as the shard key), which makes sorting more difficult.

4) For queries returning a large amount of data, the sharding setup is not much different from single-process. I suspect data forwarding is one source of performance loss (although mongos runs on the load-generating machine, the data still has to be relayed one extra time).

5) Disk space usage is roughly the same in the two setups. Where sharding uses more space, the difference may be that multiple processes preallocate more space; where it appeared to use less, that was actually a coding error early on (the actual data size and the disk file size were mixed up).

Although, for lack of time, we did not get to test 1 billion records, the results already prove how capable MongoDB is. Another reason is that in many cases, once the data reaches the tens of millions, we would probably split the database anyway rather than let a single database's indexes grow enormous. The following issues came up during the test:

1) When the data volume is large and the service is restarted, queries and modifications are accepted during the startup initialization phase, but performance is poor at that time, because MongoDB is continuously swapping data from disk into memory and IO pressure is very high.

2) When the data volume is large, if the service is not shut down cleanly, MongoDB takes a considerable amount of time to repair the database at startup. The --dur option introduced in 1.8 seems to solve this problem; in a quick test, enabling dur did not noticeably affect insert or query performance.

3) With sharding, MongoDB splits and migrates data from time to time, and performance degrades severely while this happens. This is not visible in the test charts (each data point averages a large number of iterations), but from direct observation, insert performance could drop to as low as a few hundred records per second during data migration.

4) For inserts, multithreading does not improve performance; it actually reduces it (a large number of threads can be seen waiting on the network interface).

5) During the whole test run, batch insert hit a few "connection closed by the remote host" errors. We suspect either that MongoDB closed the connection due to instability, or that the official C# client has a bug; in any case it happened only a few times, and only at very large data volumes.
