Seek vs. Transfer
I have previously compared the B+ tree and the LSM tree:
http://www.cnblogs.com/fxjwind/archive/2012/06/09/2543357.html
That earlier post gives a good analysis of the essence of the B+ tree and the LSM tree (log-structured merge-tree): the trade-off between read and write efficiency, global order versus local order, and so on.
But I was not very familiar with the phrase "seek vs. transfer" before, so I will explain it in detail here.
A B+ tree also has to be stored on disk, and the unit of data exchange with the disk is the page. In the example in the book, when a page exceeds the configured size, it is split.
The problem is that logically adjacent pages are not necessarily adjacent on disk and may be far apart.
The issue here is that the new pages aren't necessarily next to each other on disk. So now if you ask to query a range from key 1 to key 3, it is going to have to read two leaf pages which could be far apart from each other.
Therefore, whether you read or write a node in a B+ tree, the first step is a disk seek to find the page where the node lives. This is inefficient, and as the data below shows, the problem only gets worse as disks grow.
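To make the seek cost concrete, here is a minimal toy sketch (my own illustration, not code from the book or from HBase; the offsets are made up) that models leaf pages addressed by disk offset: after a split, pages covering adjacent key ranges can end up far apart, so even a small range scan pays one seek per page.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model: each leaf page covers a key range and lives at some disk offset.
// After splits, pages with adjacent key ranges may sit at offsets far apart.
public class SeekCost {
    public static void main(String[] args) {
        // first key covered by a leaf page -> disk offset of that page (made-up numbers)
        TreeMap<Integer, Long> leafOffsets = new TreeMap<>(Map.of(
                1, 4_096L,          // page holding keys 1..2
                3, 73_400_320L));   // page holding keys 3..4, far away after a split

        // A range scan from key 1 to key 3 must read both pages; since they are
        // not adjacent on disk, each page read costs a separate seek.
        int seeks = leafOffsets.headMap(3, true).size();
        System.out.println("range scan 1..3 needs about " + seeks + " disk seeks");
    }
}
```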
While CPU, RAM, and disk size double every 18-24 months, the seek time remains nearly constant at around a 5% speed-up per year.
For reads, a buffer cache can partially hide this problem, but for random writes it cannot be avoided, and a large number of fragmented pages will be produced.
Therefore, the B+ tree is not suitable for workloads with a large number of random writes.
The LSM tree optimizes for random writes. Of course, random writes are hard to optimize on the disk itself, so the LSM tree buffers them in memory, keeps the buffered data sorted, and finally flushes it to disk in batches. In this way random writes are converted into sequential writes. For more on the LSM tree, see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html
This effectively avoids the seek problem: data is simply transferred to disk in sequential batches, which is why write efficiency is much higher.
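As a minimal sketch of this write path (my own illustration, not HBase's actual MemStore or flush code; class and file names are made up), random-order puts go into a sorted in-memory buffer, and once the buffer reaches a threshold it is written out as one sorted, sequential file:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the LSM write path: buffer and sort in memory, flush sequentially.
public class MemStoreSketch {
    private final TreeMap<String, String> memstore = new TreeMap<>(); // kept sorted by key
    private final int flushThreshold;
    private int flushCount = 0;

    MemStoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Random-order puts only touch memory.
    void put(String key, String value) throws IOException {
        memstore.put(key, value);
        if (memstore.size() >= flushThreshold) {
            flush();
        }
    }

    // The flush writes all buffered entries in key order as one sequential file.
    private void flush() throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : memstore.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Path file = Path.of("storefile-" + (flushCount++) + ".txt");
        Files.write(file, lines, StandardCharsets.UTF_8);
        memstore.clear();
    }

    public static void main(String[] args) throws IOException {
        MemStoreSketch store = new MemStoreSketch(3);
        store.put("zebra", "1");   // arrives in random key order...
        store.put("apple", "2");
        store.put("mango", "3");   // ...but is flushed as one sorted sequential file
    }
}
```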
The book gives some figures to illustrate the write efficiency of the LSM tree:
When updating 1% of entries (100,000,000), it takes:
• 1,000 days with random B-tree updates
• 100 days with batched B-tree updates
• 1 day with sort and merge
Of course, the cost is that global order can no longer be guaranteed, so reads become less efficient. This is addressed through merging (compaction) and Bloom filters.
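To illustrate how a Bloom filter helps reads, here is a hand-rolled sketch (not HBase's implementation; the hash choice is purely illustrative): before paying a disk seek on a store file, the reader asks that file's Bloom filter whether the key might be present, and a negative answer lets it skip the file entirely.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a 'false' answer means the key is definitely
// absent, so the corresponding store file can be skipped without a disk seek.
public class BloomSketch {
    private final BitSet bits;
    private final int size;

    BloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash positions per key (illustrative, not a production choice).
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024); // one filter per store file
        filter.add("row-42");
        System.out.println(filter.mightContain("row-42"));   // true
        System.out.println(filter.mightContain("row-9999")); // almost certainly false -> skip this file
    }
}
```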
The B+ tree is the traditional approach that supports full CRUD. It turns out that supporting the U and D (update and delete) is what makes a data system very complex; in essence, producing new data does not have to negate the existence of the old data.
Therefore, simplifying the model to C and R can greatly simplify the system and improve fault tolerance.
The above is my understanding; below is the original comparison from the book.
Comparing B+ trees and LSM-trees is about understanding where they have their relative strengths and weaknesses.
B+ trees work well until there are too many modifications, because they force you to perform costly optimizations to retain that advantage for a limited amount of time.
The more and faster you add data at random locations, the faster the pages become fragmented again. Eventually you may take in data at a higher rate than the optimization process takes to rewrite the existing files.
The updates and deletes are done at disk seek rates, and force you to use one of the slowest metrics a disk has to offer.
LSM-trees work at disk transfer rates and scale much better to handle vast amounts of data.
They also guarantee a very consistent insert rate, as they transform random writes into sequential ones using the log file plus in-memory store.
The reads are independent of the writes, so you also get no contention between these two operations.
The stored data is always in an optimized layout. So, you have a predictable and consistent bound on the number of disk seeks to access a key, and reading any number of records following that key doesn't incur any extra seeks. In general, what could be emphasized about an LSM-tree based system is cost transparency: you know that if you have five storage files, access will take a maximum of five disk seeks, whereas you have no way to determine the number of disk seeks an RDBMS query will take, even if it is indexed.
Storage
I have already taken notes on this part in an earlier post:
http://www.cnblogs.com/fxjwind/archive/2012/08/21/2649499.html
Write-ahead log
The region servers keep data in-memory until enough is collected to warrant a flush to disk, avoiding the creation of too many very small files. While the data resides in memory it is volatile, meaning it could be lost if the server loses power, for example. This is a typical problem, as explained in the section called "Seek vs. Transfer".
A common approach to solving this issue is write-ahead logging [87]:
Each update (also called an "edit") is written to a log, and only if that has succeeded is the client informed that the operation has succeeded.
The server then has the liberty to batch or aggregate the data in memory as needed.
In fact, the idea is very simple: we buffer data in memory and flush it to disk in batches, but data in memory is easily lost, so a WAL is used to solve that problem.
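Here is a minimal sketch of the write-ahead idea (my own simplification, not HBase code; names and the log format are made up): the edit is appended and synced to a log file first, and only then applied to the in-memory buffer and acknowledged.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Sketch of write-ahead logging: durable log append first, memory update second.
public class WalSketch implements AutoCloseable {
    private final FileOutputStream logStream;
    private final Writer logWriter;
    private final TreeMap<String, String> memstore = new TreeMap<>();

    WalSketch(String logPath) throws IOException {
        this.logStream = new FileOutputStream(logPath, true); // append mode
        this.logWriter = new OutputStreamWriter(logStream, StandardCharsets.UTF_8);
    }

    // Returns only after the edit is durable in the log; the caller can then
    // be told the operation succeeded, even though the data is still in memory.
    void put(String key, String value) throws IOException {
        logWriter.write(key + "\t" + value + "\n"); // 1. append the edit to the WAL
        logWriter.flush();
        logStream.getFD().sync();                   // 2. force it to disk
        memstore.put(key, value);                   // 3. only now update the memstore
    }

    @Override
    public void close() throws IOException {
        logWriter.close();
    }

    public static void main(String[] args) throws IOException {
        try (WalSketch wal = new WalSketch("region-wal.log")) {
            wal.put("row-1", "value-1"); // acknowledged only after the log write succeeds
        }
    }
}
```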
Overview
The WAL is the lifeline that is needed when disaster strikes. Similar to a binary log in MySQL, it records all changes to the data.
This is important in case something happens to the primary storage. If the server crashes, it can effectively replay the log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails, the whole operation must be considered a failure.
Since it is shared by all regions hosted by the same region server it acts as a central logging backbone for every modification.
All regions on a region server share one WAL. The Refinements section of the Bigtable paper describes how this log mechanism can be optimized.
HLog class
The class which implements the WAL is called HLog. When an HRegion is instantiated, the single HLog instance is passed in as a parameter to the constructor of HRegion. When a region receives an update operation, it can save the data directly to the shared WAL instance.
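To show what a single shared WAL instance per region server means, here is a simplified sketch (SharedLog and Region are stand-ins of my own, not the real HLog/HRegion constructors): the server creates one log object and hands the same reference to every region it opens.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the ownership structure: one log per server, shared by all regions.
public class RegionServerSketch {

    // Stand-in for HLog: a single append-only log shared by every region on the server.
    static class SharedLog {
        synchronized void append(String regionName, String key, String value) {
            System.out.printf("append edit: region=%s key=%s value=%s%n", regionName, key, value);
        }
    }

    // Stand-in for HRegion: receives the shared log instance through its constructor.
    static class Region {
        private final String name;
        private final SharedLog log;

        Region(String name, SharedLog log) {
            this.name = name;
            this.log = log;
        }

        void put(String key, String value) {
            log.append(name, key, value); // every region writes into the same WAL
        }
    }

    public static void main(String[] args) {
        SharedLog wal = new SharedLog();           // one instance per region server
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("users,aaa", wal)); // same WAL reference passed to each region
        regions.add(new Region("users,mmm", wal));
        regions.get(0).put("row-1", "v1");
        regions.get(1).put("row-9", "v9");
    }
}
```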
HLogKey class
Currently the WAL uses a Hadoop SequenceFile, which stores records as sets of key/value pairs.
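As a rough picture of what one WAL record carries (my own simplification; the field names are assumptions rather than the exact HLogKey layout), each entry pairs a key describing where and when the edit happened with the edit data itself:

```java
// Simplified model of a WAL record: a key describing the edit's origin plus the edit data.
// Field names here are illustrative; the real HLogKey is a Hadoop Writable.
public class LogRecordSketch {

    record LogKey(String regionName, String tableName, long sequenceNumber, long writeTime) {}

    record LogEntry(LogKey key, String rowKey, String value) {}

    public static void main(String[] args) {
        LogKey key = new LogKey("users,aaa", "users", 1042L, System.currentTimeMillis());
        LogEntry entry = new LogEntry(key, "row-1", "value-1");
        // In the real WAL such key/value pairs are appended to a Hadoop SequenceFile.
        System.out.println(entry);
    }
}
```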
Read path
This section is not very well organized. In short, the point is that reading a row of data is not a simple get but a scan: because of the LSM tree structure, the same row can be scattered across the memstore and several different store files.
For details, see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html
This shows the importance of continuous compaction; otherwise, faced with a large number of files, reads would be very slow. In addition, Bloom filters and timestamps are used to filter out files and further improve read efficiency.
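A small sketch of why a read is really a scan (illustrative only, with made-up data): the same row's cells may live in the memstore and in several store files, so a get has to merge sorted views of all of them, with newer sources taking precedence.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the LSM read path: one row's columns may be spread across the
// memstore and several store files, so reading the row merges all of them.
public class ReadPathSketch {
    public static void main(String[] args) {
        // Newest data still in memory.
        TreeMap<String, String> memstore = new TreeMap<>(Map.of("row-1:name", "Alice v2"));
        // Older flushes on disk, each one a sorted file (modelled here as maps).
        TreeMap<String, String> storeFile1 = new TreeMap<>(Map.of("row-1:name", "Alice v1"));
        TreeMap<String, String> storeFile2 = new TreeMap<>(Map.of("row-1:email", "a@example.com"));

        // Merge in priority order: newer sources win over older ones.
        TreeMap<String, String> row = new TreeMap<>();
        for (Map<String, String> source : List.of(storeFile2, storeFile1, memstore)) {
            row.putAll(source); // later (newer) sources overwrite older values
        }
        System.out.println("row-1 = " + row); // {row-1:email=a@example.com, row-1:name=Alice v2}
    }
}
```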
Region lookups
For the clients to be able to find the region server hosting a specific row key range, HBase provides two special catalog tables called -ROOT- and .META..
The -ROOT- table is used to refer to all regions in the .META. table.
The design considers only one root region, i.e., the root region is never split, to guarantee a three-level, B+-tree-like lookup scheme:
The first level is a node stored in ZooKeeper that contains the location of the root table's region, in other words the name of the region server hosting that specific region.
The second level is the lookup of a matching meta region from the -ROOT- table,
and the third is the retrieval of the user table region from the .META. table.
Refer to the Bigtable paper, Section 5.1 "Tablet Location"; the scheme is exactly the same.
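A sketch of that three-level lookup (a hand-written simplification with made-up server names and catalog contents, not the HBase client code): start from the ZooKeeper node, resolve the matching -ROOT- entry, then the matching .META. entry, and finally the user table region.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the three-level lookup. All server names and catalog contents are
// made up; the real client caches results and talks to region servers over RPC.
public class RegionLookupSketch {
    // Level 1: ZooKeeper node holding the location of the -ROOT- region.
    static final String ROOT_LOCATION = "rs1.example.com";

    // Level 2: -ROOT- maps .META. regions (by start key) to their servers.
    static final TreeMap<String, String> ROOT_TABLE =
            new TreeMap<>(Map.of("", "rs2.example.com"));

    // Level 3: .META. maps user table regions (by start key) to their servers.
    static final TreeMap<String, String> META_TABLE =
            new TreeMap<>(Map.of("", "rs3.example.com", "m", "rs4.example.com"));

    static String locate(String rowKey) {
        String rootServer = ROOT_LOCATION;                            // 1. ask ZooKeeper
        String metaServer = ROOT_TABLE.floorEntry(rowKey).getValue(); // 2. ask -ROOT- on rootServer
        String userServer = META_TABLE.floorEntry(rowKey).getValue(); // 3. ask .META. on metaServer
        System.out.println("via " + rootServer + " and " + metaServer);
        return userServer;                                            // server hosting the row's region
    }

    public static void main(String[] args) {
        System.out.println("row 'peter' is served by " + locate("peter")); // -> rs4.example.com
    }
}
```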