Want to scale your database? First, you need to understand it.

As software developers, we value abstraction: the simpler the API, the more attractive it is to us. Arguably MongoDB's biggest advantage is its "elegant" API and its agility, which make the coding process unusually simple for developers.

However, when a MongoDB deployment runs into big-data scalability problems, developers need to understand what is happening underneath, recognize the potential problems, and solve them quickly. Without that understanding, you may end up choosing an inefficient solution and wasting time and money. This article focuses on how to find efficient solutions to the scalability problems of big data.

Defining the problem

First, we need to establish the context: this article is about MongoDB applications. That means we are looking at a distributed document store that supports secondary indexes and sharded clusters. For other NoSQL products, such as Riak or Cassandra, we could have a similar discussion about I/O bottlenecks, but this article concentrates on characteristics specific to MongoDB.

Second, what do these applications do? Online transaction processing (OLTP) or online analytical processing (OLAP)? This article discusses OLTP, because OLAP is still no small challenge for MongoDB, which largely cannot handle it.

Third, what is big data? Big data means we process and use more data than fits in a single machine's RAM. Some of the data stays in memory on the server, and the rest sits on disk and requires I/O to access. Note, however, that the issue is not simply a database that is larger than memory; what matters is that the frequently accessed and used data (sometimes called the "working set") is not small. For example, a disk may store several years of data while the application frequently accesses only the last day's worth; in that case the working set is small even though the database is large.
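A rough way to check whether the working set is likely to fit in memory is to compare the size of the frequently accessed collections (data plus indexes) with the server's RAM. Below is a minimal PyMongo sketch under assumed names (the events collection, the mydb database, and the connection string are hypothetical); collStats and serverStatus are standard MongoDB commands.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["mydb"]                                # hypothetical database name

# Size of the collection's data and indexes, in bytes.
stats = db.command("collStats", "events")          # hypothetical collection name
data_and_indexes = stats["size"] + stats["totalIndexSize"]

# Resident memory of the mongod process, reported in MB by serverStatus.
resident_mb = client.admin.command("serverStatus")["mem"]["resident"]

print(f"data + indexes: {data_and_indexes / 2**20:.0f} MB")
print(f"mongod resident memory: {resident_mb} MB")
# If the frequently accessed portion of the data is much larger than the
# available RAM, reads will regularly spill to disk and incur I/O.
```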

Fourth, what is the limiting factor for OLTP applications? In short, I/O. A hard drive can perform only a few hundred I/O operations per second, whereas RAM can sustain millions of accesses per second. This disparity is what turns I/O into the bottleneck for big-data applications.
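To make the disparity concrete, here is a back-of-the-envelope calculation with purely illustrative numbers (real figures vary by hardware): a spinning disk delivering a few hundred IOPS caps the rate of uncached point lookups far below what RAM could serve.

```python
# Illustrative numbers only; real figures vary by hardware.
disk_iops = 200                   # a spinning disk: a few hundred I/Os per second
ram_accesses_per_sec = 5_000_000  # RAM: millions of accesses per second

target_lookups_per_sec = 2_000    # desired uncached point lookups per second

# Each uncached point lookup costs roughly one disk I/O,
# so the number of disks needed scales with the target rate.
disks_needed = -(-target_lookups_per_sec // disk_iops)  # ceiling division -> 10
print(f"disks needed: {disks_needed}")
print(f"RAM headroom over the target: {ram_accesses_per_sec / target_lookups_per_sec:.0f}x")
```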

Finally, how do we resolve the I/O bottleneck? Analysis, formulas, and rules of thumb offer a variety of approaches, but a durable solution requires understanding: you have to study the I/O characteristics of the application before you can make the best design decisions.

Cost Model

The first step toward solving a future I/O bottleneck is knowing which database operations incur I/O. Whether in MongoDB or any other database, there are three basic operations:

Point Query: Finds a single document, retrieving it from a given location (on disk or in memory). With big data the document may not be in memory, so this operation may cost one I/O.

Range Query: Reads a large number of consecutive documents through an index. This is far more efficient than point queries, because the data being read is packed together on disk and can be pulled into memory with a minimal number of I/Os. A range query might retrieve 100 documents at the cost of a single I/O, whereas 100 point queries for the same 100 documents could require 100 I/Os, as the sketch after this list illustrates.

Write: Writes a document to the database. Databases like MongoDB incur I/O for this. Write-optimized data structures, such as those in TokuMX, need very little I/O: unlike MongoDB, a write-optimized structure can amortize its I/O across many inserts.
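To see the difference in practice, compare a batch of point lookups with a single range scan over an index. A minimal PyMongo sketch, with a hypothetical events collection and an assumed ts field: the range query reads index entries and documents that sit next to each other, while each point query may touch a different location on disk.

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
events = client["mydb"]["events"]                   # hypothetical collection
events.create_index([("ts", ASCENDING)])            # index that supports the range query

# 100 point queries: each _id lookup may land on a different page -> up to 100 I/Os.
some_ids = [doc["_id"] for doc in events.find({}, {"_id": 1}).limit(100)]
point_results = [events.find_one({"_id": _id}) for _id in some_ids]

# One range query: documents adjacent in the index are read together -> far fewer I/Os.
since = datetime.utcnow() - timedelta(days=1)
range_results = list(events.find({"ts": {"$gte": since}}).sort("ts", ASCENDING))
```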

After understanding how the three basic operations affect I/O, you also need to understand how MongoDB's statements map onto them. On top of these three basic operations, MongoDB builds four user-level operations:

Insert: Writes a new document into the database.

Query: Uses an index on the collection and performs a combination of range queries and point queries. If the index is a covering index or a clustered index, a range query alone is basically sufficient; otherwise, range queries are combined with point queries to fetch the documents.

Modify and Delete: These combine a query with writes. The query finds the documents that need to be updated or deleted, and writes then modify or remove them.

Now we have the cost model. To address an I/O bottleneck, however, you still need to know which operations in your application are driving the I/O, and that requires understanding the database's behavior. Does the I/O come from queries? If so, how does the query pattern affect I/O? Or does it come from modifications, and if so, is it the query phase or the write phase of those modifications that is responsible? Once you know which factors drive the I/O, you can work through the bottleneck step by step.
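One way to see which operations the database actually spends its time on is MongoDB's built-in profiler, combined with explain(). The sketch below uses PyMongo; the profiling level, slowms threshold, and the example query are assumptions, and in production you would normally profile selectively rather than at level 2.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
db = client["mydb"]                                 # hypothetical database name

# Level 2 profiles every operation; slowms is the threshold used at level 1.
db.command({"profile": 2, "slowms": 50})

# ... run the workload you want to inspect ...

# The profiler records operations in the system.profile collection.
for op in db["system.profile"].find().sort("millis", -1).limit(5):
    print(op.get("op"), op.get("ns"), op.get("millis"), "ms,",
          op.get("docsExamined"), "docs examined")

# explain() on a single query shows whether it walks an index or scans the collection.
plan = db["events"].find({"user_id": 42}).explain()  # hypothetical query
print(plan["queryPlanner"]["winningPlan"])            # look for IXSCAN vs COLLSCAN
```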

Assuming we understand the application's I/O characteristics, we can explore several ways to solve the problem. My preferred approach is this: first try to solve the problem in software, and only if that falls short, turn to hardware. After all, software is cheaper and easier to maintain.

Viable software solutions

A viable software solution is one that reduces the number of I/Os the application issues. For different bottlenecks, here are a few possible approaches:

Problem: Insert operations cause excessive I/O

Feasible solution: Use a write-optimized database such as TokuMX. One of its advantages is that it dramatically reduces the I/O required by write operations, because its indexes use fractal trees (TokuDB's improved fractal trees lower the I/O cost of an insert from O(log N) to O(log_B N)).

Problem: Query operations cause excessive I/O

Feasible solution: Use better indexes; reduce point queries and use range queries wherever possible. In my view this still comes down to "understanding your indexes": indexes are what reduce the I/O a query needs. This cannot be explained fully in a paragraph or two, but the main idea is as follows. To reduce the application's I/O, avoid retrieving each document with its own point query. Instead, use a covering or clustered index so that documents already examined during the query can be filtered out intelligently, and then report the results with a range query.
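As an illustration of replacing per-document point lookups with a covered range query: a compound index that contains every field the query filters on and returns lets MongoDB answer it from the index alone. A minimal PyMongo sketch with hypothetical field names (user_id, ts, amount); note that _id must be excluded from the projection for the query to be covered.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
orders = client["mydb"]["orders"]                   # hypothetical collection

# Compound index that covers both the filter and the returned fields.
orders.create_index([("user_id", ASCENDING), ("ts", ASCENDING), ("amount", ASCENDING)])

# Covered range query: filter and projection use only indexed fields,
# and _id is excluded, so no documents need to be fetched from disk.
cursor = orders.find(
    {"user_id": 42, "ts": {"$gte": 1700000000}},
    {"_id": 0, "user_id": 1, "ts": 1, "amount": 1},
).sort("ts", ASCENDING)

for row in cursor:
    print(row)
```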

Admittedly, good indexing is sometimes not enough. In an OLTP application where essentially every query is a point query (each one retrieves only a handful of documents), the I/O bottleneck remains even with entirely appropriate indexes. In that case, a hardware solution becomes necessary.

Of course, adding an index also raises the cost of inserts, because every insert must keep the index up to date; a write-optimized database can reduce that cost. This is why I keep stressing the need to understand the application: a solution that works for one application does not necessarily work for another.

Problem: Modify and delete operations cause excessive I/O

Feasible solution: Combine all of the above solutions

Modifications and deletions are more complicated because they combine a query with writes. Improving their performance requires a good understanding of where the cost lies. Which part of the operation incurs the I/O? The query? If so, improve the indexes. The writes, or both? In short, identify which part of the operation is responsible for the I/O and apply the corresponding solution.

A common mistake is to adopt a write-optimized database (such as TokuMX) without changing any indexes and expect it to eliminate the I/O bottleneck of modifications or deletions. A write-optimized database alone is not enough; the implicit query inside every modify/delete must be dealt with as well.
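In other words, the filter of an update or delete needs index support just like any other query; otherwise the query half of the operation scans the collection and that is where the I/O goes. A minimal PyMongo sketch with a hypothetical sessions collection and field names:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
sessions = client["mydb"]["sessions"]               # hypothetical collection

# Index the field used in update/delete filters so the query half
# of the operation does not scan the whole collection.
sessions.create_index([("last_seen", ASCENDING)])

cutoff = 1700000000                                  # hypothetical timestamp threshold

# Update: a query (find stale sessions via the index) plus writes (set a flag).
sessions.update_many({"last_seen": {"$lt": cutoff}}, {"$set": {"expired": True}})

# Delete: the same query cost, followed by the writes that remove the documents.
sessions.delete_many({"last_seen": {"$lt": cutoff}})
```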

Viable hardware solutions

As mentioned above, when software solutions cannot solve the problem, we turn to hardware. Let's look at the strengths and weaknesses of each option:

Buy as much RAM as possible and, if at all feasible, fit the working set into memory

Use SSDs to increase IOPS

Buy more servers and scale out, which includes:

Read scaling via replication

Sharding

Buy more memory

RAM is expensive, and the amount of memory a single machine can hold is limited. If the data is too large, keeping it all in RAM is simply not an option. This approach suits many applications, but here we are more interested in the applications it cannot help.

Using SSD storage

Admittedly, moving storage to SSDs is a very practical way to raise throughput. If I/O is the application's limiting factor, increasing IOPS (I/O operations per second) naturally increases throughput, and it is simple to deploy and use.

However, SSDs do not cost the same as hard disks. They definitely increase I/O throughput, and they are lighter, larger in capacity, and faster, but they are not cheap, so data compression is key to keeping the cost down. And although the hardware cost rises, that does not mean the management cost rises with it.

Read scaling via replication

When queries are the application's bottleneck, read scaling via replication is a very effective solution. The idea is as follows:

Use replication to maintain multiple copies of the data on separate machines

Distribute reads across those machines, which increases read throughput

Reading from a single machine and serving everything from it is a serious bottleneck. With multiple replicas, the application has more resources to draw on, so read throughput improves considerably.

If the bottleneck lies in inserts, updates, or deletes, replicas will not help much. Every write has to be applied to every member of the replica set, so each machine faces exactly the same write bottleneck as before.
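With PyMongo, directing reads at the secondaries is a matter of setting a read preference. A minimal sketch, assuming a hypothetical replica set named rs0 on three example hosts; keep in mind that secondary reads can be slightly stale and, as noted above, they do not relieve a write bottleneck.

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical replica-set members and name.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017",
    replicaSet="rs0",
)

# Route this collection's reads to secondaries when one is available,
# falling back to the primary otherwise.
events = client["mydb"].get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)

recent = list(events.find({"ts": {"$gte": 1700000000}}).limit(10))
```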

Sharding

Sharding partitions the data across different replica sets according to a shard key; each replica set in the cluster is responsible for a range of shard-key values. Write throughput increases because writes are spread across the cluster's replica sets. For applications with heavy write loads, sharding can be very effective.

Partitioning the data by ranges of the shard key lets range queries on the shard key target only the relevant shards. If you hash the shard key instead, every range query must run on all shards in the cluster, although a point query on the shard key still hits only one shard. Hashed shard keys spread the data evenly and are simple to use. If the solutions above are not enough to remove the application's bottleneck, sharding is where to invest further effort.
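For reference, enabling sharding and choosing a ranged versus hashed shard key are done with standard admin commands (the same ones the mongo shell's sh.shardCollection() wraps). A minimal PyMongo sketch against a hypothetical mongos router, with hypothetical database, collection, and field names:

```python
from pymongo import MongoClient

# Connect to the mongos query router of a hypothetical sharded cluster.
client = MongoClient("mongodb://mongos.example.com:27017")
admin = client.admin

# Allow collections in mydb to be sharded.
admin.command("enableSharding", "mydb")

# Ranged shard key: range queries on user_id target only the relevant shards.
admin.command("shardCollection", "mydb.orders", key={"user_id": 1})

# Hashed shard key: writes spread evenly, but range queries hit every shard.
admin.command("shardCollection", "mydb.events", key={"ts": "hashed"})
```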

In any case, sharding is a heavyweight solution, and it is very expensive. For a start, your hardware budget increases several times over: you need extra servers not just for the shards themselves but for a complete replica set behind each shard, and you also have to add and manage config servers. Given the cost, users should think carefully about whether sharding is really necessary; generally speaking, all of the other options above cost less.

Another big challenge is choosing a shard key. A good shard key has the following characteristics:

Most (if not all) queries use the shard key; any query that does not include the shard key must run on every shard, which can be costly.

The shard key should spread the writes well across the cluster's replica sets. If all writes land on the same replica set, that replica set becomes the write bottleneck, just as in an unsharded setup; using a timestamp as the shard key is exactly this kind of mistake.

These requirements are not easy to meet. Sometimes a good shard key simply does not exist, which makes sharding ineffective.

Summary

Many of these solutions work, but none of them is guaranteed, not even sharding. That is why this article keeps stressing the point: it is essential to understand the nature of your application. Tools can solve the problem, but how well the tools are used is up to the user.
