NoSQL theory: memory is the new hard disk, the hard disk is the new tape


 

"Memory is a new hard disk, and hard disk is a new tape" is from the Turing prize winner Jim Gray.

I. Preface

My understanding of this sentence: we should keep all random I/O in memory, and leave sequential I/O, the kind a tape handles well, to the hard disk (SSDs are not considered here).

If an application has not reached a certain scale, the two sentences above may sound too geeky. But today, when data volumes keep growing and the proportion of dynamic content keeps rising, ignoring this basic principle can be a disaster.

Today, let's talk about how this idea shows up in NoSQL products.

II. Implementation

Problem 1: Data loss during downtime

Let's look at several leading NoSQL representatives: Cassandra, MongoDB, and Redis. Almost all of them use the same storage pattern: write operations are performed in memory, and the in-memory data is written to disk periodically or when certain conditions are met. The advantage is that we make full use of random I/O in memory and avoid the random I/O bottleneck of writing directly to disk: seek time. The downside, of course, is that some data may be lost in the event of a crash or other failure.

There are two solutions to the data loss problem:

1. Record operation logs in real time

Generally, when a write operation arrives, the system first appends a record to the log file and then applies the write in memory. Because the log file is only ever appended to, it never produces a large amount of random I/O.
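
As a rough illustration, here is a minimal Python sketch of that write path: append to the log first, then update memory, and replay the log on startup to recover. The names (TinyStore, oplog.jsonl) are made up for the example and are not any product's actual code.

    import json
    import os

    class TinyStore:
        """Sketch: log-then-memory write path with crash recovery by replay."""

        def __init__(self, log_path="oplog.jsonl"):
            self.log_path = log_path
            self.mem = {}
            # Replay the log on startup to rebuild in-memory state after a crash.
            if os.path.exists(log_path):
                with open(log_path) as f:
                    for line in f:
                        op = json.loads(line)
                        self.mem[op["key"]] = op["value"]
            self.log = open(log_path, "a")

        def put(self, key, value):
            # 1) Sequential append to the log: cheap, no disk seek.
            self.log.write(json.dumps({"key": key, "value": value}) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())  # make the record durable before acking
            # 2) The random-access write happens only in memory.
            self.mem[key] = value

        def get(self, key):
            return self.mem.get(key)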

2. Quorum NRW

This approach assumes cluster-based storage. The principle is: in a cluster of N nodes, each write operation is considered successful once it has been synchronized to at least W nodes, and each read operation only needs to query R nodes to guarantee a correct result (if any of the R nodes has the data, the read succeeds; if none of the R nodes has it, the data does not exist). N, R, and W must satisfy R + W > N, which can also be read as R > N - W. This is not hard to see: if a write succeeds once at least W nodes have the data, then at most N - W nodes can be missing it. So if a read queries more than N - W nodes, at least one of the queried nodes is guaranteed to hold the data. Hence the requirement R > N - W.
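
The condition fits in a couple of lines; this is just an illustrative check, not any product's API:

    def quorum_ok(n, w, r):
        """The consistency condition above: any read quorum of R nodes
        must overlap any write quorum of W nodes, i.e. R + W > N."""
        return r + w > n

    # A write acknowledged by W nodes leaves at most N - W stale nodes,
    # so reading R > N - W nodes guarantees at least one fresh copy.
    assert quorum_ok(n=3, w=2, r=2)      # classic majority quorums
    assert not quorum_ok(n=3, w=1, r=1)  # reads may miss the latest write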

You may have noticed that to prevent data loss we are simply relying on redundant copies. Is writing data to multiple machines really faster than writing to a single machine's disk? Yes: compared with direct disk operations, memory operations across the network can be faster. The simplest example is the improved consistent hashing scheme (for more on consistent hashing, see here):

In Amazon's Dynamo paper, a key whose hash falls between nodes A and B is stored not only on node B; it is also replicated on nodes C and D, the next nodes clockwise around the ring. Of course, the benefit of doing this is not only the redundant backup mentioned above.
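
A minimal sketch of this placement rule, assuming a plain hash ring without virtual nodes and with illustrative names (Ring, preference_list), might look like this:

    import bisect
    import hashlib

    def _h(s):
        # Map a string onto the ring (a 32-bit hash space, for brevity).
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

    class Ring:
        """Dynamo-style placement: the first node clockwise from the key
        owns it, and the next distinct nodes hold the backup copies."""

        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((_h(n), n) for n in nodes)

        def preference_list(self, key):
            # First node whose position is at or after the key's position.
            idx = bisect.bisect_left(self.ring, (_h(key),)) % len(self.ring)
            owners = []
            for i in range(len(self.ring)):
                node = self.ring[(idx + i) % len(self.ring)][1]
                if node not in owners:
                    owners.append(node)
                if len(owners) == self.replicas:
                    break
            return owners

    # The key's owner plus two clockwise replicas, e.g. ['B', 'C', 'D'].
    print(Ring(["A", "B", "C", "D"]).preference_list("user:42"))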

In practice, the two solutions above are usually used together to ensure high data availability.

Problem 2: Memory capacity limits

When we use memory as if it were the hard disk, we inevitably run into capacity limits. This is also why the data mentioned above is flushed to disk periodically. Once the data in memory exceeds the available memory, something has to give: leaning heavily on swap is not what we intended, and it also loses the efficient random I/O that memory provides. Again there are two solutions:

1. Application-layer swap

Tokyo Cabinet and Redis both take this approach. Tokyo Cabinet improves I/O efficiency with mmap, but it only maps the header portion of the data file. Once the data file grows beyond the maximum mmap length (controlled by the xmsize parameter), the rest becomes plain, inefficient disk I/O. It therefore provides a memcached-like cache mechanism: configured by the rcnum parameter, it caches hot key-value data selected by an LRU mechanism, and this memory is completely separate from the memory occupied by mmap. Similarly, Redis added support for disk storage after version 2.0. Its mechanism is similar to Tokyo Cabinet's: it judges how hot data is from how it is accessed and tries to keep the hot data in memory.
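
To illustrate the hot-data idea only (this is not Tokyo Cabinet's or Redis's actual code; HotCache and its parameters are made up), a simple LRU cache sitting in front of an on-disk store could look like this:

    from collections import OrderedDict

    class HotCache:
        """Keep the hottest key/value pairs in memory; cold keys are
        evicted and must be served from the on-disk store instead."""

        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key):
            if key not in self.data:
                return None              # caller falls back to the disk store
            self.data.move_to_end(key)   # reads also count toward "hotness"
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value
            self.data.move_to_end(key)
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict the least recently used key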

2. Merge data of multiple versions

What is multi-version data merging? Bigtable, and its open-source counterpart Cassandra, periodically flush the data blocks in memory to disk. Suppose an update changes the value of keyA from valueA to valueB: must we clean out the old valueA when flushing to disk? If we had to find and remove it, we could no longer achieve sequential disk I/O, so Bigtable does not do this. No merge is performed at flush time; the in-memory data is written to disk as-is, which makes writes much cheaper. As a result, a value may exist in multiple versions, and these versions have to be merged at read time. So the second method is to write the operations of a period of time into a block (not necessarily a separate file), so that memory usage cannot grow without bound, and to merge data versions by reading across multiple blocks.
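
A very small sketch of this write/read pattern (illustrative only; LsmSketch, flush_threshold, and the in-memory blocks list stand in for real on-disk files):

    class LsmSketch:
        """Writes go to an in-memory table; when it grows too large it is
        dumped as-is into an immutable block (no merging at flush time);
        reads check the memtable and then the blocks from newest to oldest,
        so the latest version of a key wins."""

        def __init__(self, flush_threshold=4):
            self.flush_threshold = flush_threshold
            self.memtable = {}
            self.blocks = []   # newest block last; stands in for on-disk files

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self.blocks.append(self.memtable)  # sequential dump, no merge
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for block in reversed(self.blocks):    # newest version first
                if key in block:
                    return block[key]
            return None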

If the volume of data on disk is many times the memory capacity, there may be a great many data blocks. Do we have to scan every block to find the versions of a piece of data? Of course not. If you have read the Bigtable paper, you may remember that it uses the Bloom filter algorithm. Bloom filters are most widely used in search-engine crawlers, to decide whether a URL is already in the crawled set. The algorithm is not exact (data not in the set may be mistakenly reported as in the set, but the opposite error never happens), yet its time cost is just a few hash computations and its space cost is also very low. Bigtable uses a Bloom filter to decide whether a value might be in a given block. Because of the Bloom filter's properties, at worst we read a few extra blocks (with low probability) and we never skip a block that actually holds the data, so we can store far more data than fits in physical memory.
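
Here is a toy Bloom filter showing the property used here, that false positives are possible but false negatives are not; the sizes and hash choices are arbitrary, not Bigtable's:

    import hashlib

    class BloomFilter:
        """Membership tests may return false positives but never false
        negatives: at worst we read an extra block, never miss one."""

        def __init__(self, size=8192, hashes=3):
            self.size = size
            self.hashes = hashes
            self.bits = bytearray(size)

        def _positions(self, key):
            for i in range(self.hashes):
                digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = 1

        def might_contain(self, key):
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("user:42")
    assert bf.might_contain("user:42")   # always true for inserted keys
    # bf.might_contain("user:99") is usually False, but may be a false positive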

III. End

Well, that's all for now. I would be glad to exchange more thoughts and experience on how this principle is applied in NoSQL.

 
