HBase Architecture Analysis (Part 1)

Tags: failover, MapR, Hadoop ecosystem

Original post: http://www.blogjava.net/DLevin/archive/2015/08/22/426877.html

Preface

We use the MapR distribution of the Hadoop ecosystem internally, which is how I came across the article An In-Depth Look at the HBase Architecture on MapR's official website. I originally wanted to translate the full text, but a literal translation would require too much wrangling over wording, so most of this article is written in my own words, combined with my understanding from other references and from reading the source code. Call it half translation, half original.

HBase Architecture Composition

HBase builds its cluster on a Master/Slave architecture and belongs to the Hadoop ecosystem. It consists of several types of nodes: the HMaster node, HRegionServer nodes, and a ZooKeeper cluster. At the lowest layer it stores its data in HDFS, so the overall architecture also includes HDFS components such as the NameNode and DataNodes, as shown below:

The HMaster node is responsible for:

    1. Managing HRegionServers to achieve load balancing.
    2. Managing and assigning HRegions, for example assigning new HRegions after an HRegion split, and migrating HRegions to other HRegionServers when an HRegionServer exits.
    3. Implementing DDL operations (Data Definition Language: creating and deleting namespaces and tables, adding and deleting column families, etc.).
    4. Managing metadata for namespaces and tables (actually stored on HDFS).
    5. Permission control (ACL).

The HRegionServer nodes are responsible for:

    1. Storing and managing local HRegions.
    2. Reading from and writing to HDFS to manage the data in the tables.
    3. Serving client reads and writes directly (the client locates the HRegion/HRegionServer for a given RowKey from the metadata, as described later).

The ZooKeeper cluster is a coordination service, used to:

    1. Store metadata for the whole HBase cluster and the cluster's state information.
    2. Implement failover between the active and standby HMaster nodes.

The HBase client communicates with the HMaster and HRegionServers via RPC; a single HRegionServer can host on the order of 1,000 HRegions. The underlying table data is stored in HDFS, and each HRegion tries to keep its data on the local DataNode to achieve data locality. Data locality cannot always be maintained, however; for example, after an HRegion is moved (e.g. following a split), locality is only restored after the next compaction. A minimal client interaction is sketched below.
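To make the client/RPC interaction concrete, here is a minimal sketch of a client put and get using the standard HBase Java API. The table name, column family, and ZooKeeper quorum are hypothetical placeholders, not values from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum; the client only needs ZooKeeper to bootstrap,
        // after that it talks to HRegionServers directly over RPC.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            // Write: goes to the WAL and MemStore of the HRegionServer owning this RowKey.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // Read: served by the same HRegionServer (BlockCache / MemStore / HFiles).
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
        }
    }
}
```

Note that the client never talks to the HMaster for reads and writes; the data path goes straight to the HRegionServers.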

In keeping with the half-translation spirit, here is the architecture diagram from "An In-Depth Look at the HBase Architecture":

This architecture diagram clearly shows that both the HMaster and the NameNode support multiple hot backups, coordinated through ZooKeeper. ZooKeeper itself is not as mysterious as it may sound: it is usually a cluster of three machines running a Paxos-like consensus algorithm internally, which can tolerate one of the three servers going down; with five machines it can tolerate two simultaneous failures. In general it tolerates the failure of fewer than half of its nodes, but adding more machines does not improve its write performance. RegionServers and DataNodes are typically placed on the same servers to achieve data locality.

HRegion

HBase uses the RowKey to split a table horizontally into multiple HRegions. From the HMaster's point of view, each HRegion records its StartKey and EndKey (the first HRegion's StartKey is empty and the last HRegion's EndKey is empty). Because RowKeys are sorted, the client can quickly determine which HRegion a given RowKey belongs to. HRegions are assigned to appropriate HRegionServers by the HMaster; each HRegionServer is then responsible for starting and managing its HRegions, communicating with clients, and performing the actual data reads and writes (via HDFS). Each HRegionServer can manage roughly 1,000 HRegions at the same time. (Where does this figure come from? I did not find the limit in the code; is it from experience, and does exceeding 1,000 cause performance problems? My feeling is that the number 1,000 comes from the BigTable paper, section 5 "Implementation": "Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server).") A sketch of the StartKey/EndKey check is shown below.
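As an illustration only (not HBase's actual internal code), the following sketch shows how a RowKey can be matched against a region's StartKey/EndKey pair, with empty byte arrays standing in for the unbounded first and last regions; the keys used are placeholders.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RegionRangeSketch {
    /**
     * Returns true if the row falls into the half-open range [startKey, endKey).
     * An empty startKey means "from the beginning of the table";
     * an empty endKey means "to the end of the table".
     */
    static boolean regionContains(byte[] startKey, byte[] endKey, byte[] row) {
        boolean afterStart = startKey.length == 0 || Bytes.compareTo(row, startKey) >= 0;
        boolean beforeEnd  = endKey.length == 0   || Bytes.compareTo(row, endKey) < 0;
        return afterStart && beforeEnd;
    }

    public static void main(String[] args) {
        byte[] start = Bytes.toBytes("m");   // hypothetical region boundaries
        byte[] end   = Bytes.toBytes("t");
        System.out.println(regionContains(start, end, Bytes.toBytes("panda"))); // true
        System.out.println(regionContains(start, end, Bytes.toBytes("zebra"))); // false
    }
}
```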

HMaster

HMaster has no single-point-of-failure problem: multiple HMasters can be started, and ZooKeeper's master election mechanism guarantees that only one HMaster is active at a time, while the others remain in hot-standby state. Typically two HMasters are started; the non-active HMaster periodically communicates with the active HMaster to obtain its latest state, so that its view stays up to date in real time, which also means that starting many HMasters increases the burden on the active HMaster. As introduced above, the HMaster is mainly responsible for HRegion allocation and management and for executing DDL operations (Data Definition Language: creating, deleting, and altering tables, etc.). It has two main responsibilities:

    1. Coordinating the HRegionServers
      1. Assigning HRegions at startup, and reassigning HRegions for load balancing or recovery after failures.
      2. Monitoring the state of all HRegionServers in the cluster (through heartbeats and by watching their state in ZooKeeper).
    2. Admin functions
      1. Creating, deleting, and modifying table definitions.

ZooKeeper: The Coordinator

ZooKeeper provides coordination services for the HBase cluster: it tracks the state of the HMaster and HRegionServers (available/alive, etc.) and notifies the HMaster when they go down, so that the HMaster can fail over to a standby HMaster, or reassign the HRegions hosted by a failed HRegionServer to other HRegionServers. The ZooKeeper cluster itself uses a consensus protocol (a Paxos-style protocol) to keep the state of its nodes consistent.

How the Components Work Together

ZooKeeper coordinates the shared state of all nodes in the cluster. The HMaster and each HRegionServer create ephemeral nodes after connecting to ZooKeeper, and a heartbeat mechanism keeps the session (and thus the ephemeral node) alive. If an ephemeral node expires (its session times out), the HMaster receives a notification and handles it accordingly.

In addition, the HMaster monitors HRegionServers joining and leaving the cluster by watching their ephemeral nodes in ZooKeeper (default: /hbase/rs/*). The first HMaster to connect to ZooKeeper creates an ephemeral node (default: /hbase/master) to represent the active HMaster; HMasters that join later watch that ephemeral node. If the currently active HMaster goes down, the node disappears, the other HMasters are notified, and one of them converts itself into the active HMaster. Before becoming active, each standby HMaster creates its own ephemeral node under /hbase/backup-masters/. A sketch of this ephemeral-node election pattern follows.
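The following is a minimal, generic sketch of the ephemeral-node election pattern described above, written directly against the ZooKeeper client API. It is not HBase's actual master-election code; the connection string and node path are placeholders, and a production version would also wait for the session to reach the connected state before acting.

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class MasterElectionSketch implements Watcher {
    private static final String MASTER_PATH = "/demo/master"; // placeholder path
    private final ZooKeeper zk;

    MasterElectionSketch(String connectString) throws Exception {
        // The session timeout drives how quickly a dead master's ephemeral node disappears.
        this.zk = new ZooKeeper(connectString, 30_000, this);
    }

    void tryToBecomeActive() throws KeeperException, InterruptedException {
        try {
            // Ephemeral node: removed automatically when this session dies.
            zk.create(MASTER_PATH, "me".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("I am the active master");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is active; watch the node so we are notified if it disappears.
            Stat stat = zk.exists(MASTER_PATH, this);
            if (stat == null) {
                tryToBecomeActive(); // it vanished between create() and exists()
            } else {
                System.out.println("Standing by as backup master");
            }
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted
                && MASTER_PATH.equals(event.getPath())) {
            try {
                tryToBecomeActive(); // the active master went away; try to take over
            } catch (Exception ignored) {
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new MasterElectionSketch("zk1:2181").tryToBecomeActive();
        Thread.sleep(Long.MAX_VALUE); // keep the session alive for the demo
    }
}
```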

The First Read and Write in HBase

Before HBase 0.96, HBase had two special tables: -ROOT- and .META. (following the BigTable design). The location of the -ROOT- table was stored in ZooKeeper; -ROOT- stored the RegionInfo of the .META. table and could only consist of a single HRegion, while the .META. table stored the RegionInfo of the user tables' HRegions and could be split into multiple HRegions. So on the first access to a user table the client had to: read from ZooKeeper which HRegionServer held the -ROOT- table; from that HRegionServer, look up (by TableName and RowKey) which HRegionServer held the relevant .META. region; read the .META. contents from that HRegionServer to obtain the location of the HRegion the request needs to access; and finally access that HRegionServer to get the requested data. That is three requests just to find the user table's location, and only the fourth request begins fetching the real data. Of course, to improve performance, the client caches the -ROOT- location and the contents of -ROOT-/.META., as shown below:

But even with a client cache, the initial phase needs three requests before the user table's actual location is known, which performs poorly. And is it really necessary to support that many HRegions? Perhaps for a company like Google, but for a typical cluster it does not seem necessary. In the BigTable paper, each METADATA row stores about 1KB of data; with a moderate tablet (HRegion) size of 128MB, the three-level schema can address 2^34 tablets (HRegions). Even after removing the -ROOT- table, 2^17 (131,072) HRegions can still be supported; if each HRegion is 128MB, that is 16TB, which may not seem large enough. But nowadays the maximum HRegion size is usually set much larger, for example 2GB; the meta region itself can then also grow to 2GB and hold about 2^21 entries (2GB at roughly 1KB per row), so the supported capacity becomes about 2^21 × 2GB = 4PB, enough for an ordinary cluster. So after HBase 0.96 the -ROOT- table was removed, leaving only this special catalog table, called the meta table (hbase:meta). It stores the location information of all user HRegions in the cluster, the ZooKeeper node (/hbase/meta-region-server) stores the location of the meta table directly, and the meta table, like the old -ROOT- table, is never split. With this change, the first time the client accesses a user table, the process becomes the following (a client-side lookup sketch follows the list):

    1. The client gets the location of hbase:meta (i.e. which HRegionServer hosts it) from ZooKeeper (/hbase/meta-region-server) and caches that location.
    2. From that HRegionServer, it queries hbase:meta for the HRegionServer hosting the user-table HRegion that contains the requested RowKey, and caches that location as well.
    3. It reads the row from the HRegionServer found in the previous step.
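As a hedged illustration of steps 1 and 2, the HBase client API exposes this cached region lookup through RegionLocator. The table name and RowKey below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("my_table"))) {
            // Internally this consults ZooKeeper for hbase:meta, queries hbase:meta,
            // and caches the resulting region location on the client side.
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row-001"));
            System.out.println("Row is served by:  " + loc.getHostnamePort());
            System.out.println("Region start key:  "
                    + Bytes.toStringBinary(loc.getRegionInfo().getStartKey()));
        }
    }
}
```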

From this process we see that the client caches this location information, but in step 2 it only caches the location of the HRegion for the current RowKey. So if the next RowKey to be queried is not in the same HRegion, the client has to query the hbase:meta region again. Over time, however, the client caches more and more location information, so that it rarely needs to look up hbase:meta again, unless an HRegion is moved because of a crash or a split, in which case the cached entry has to be refreshed by querying again.

The hbase:meta Table

The hbase:meta table stores the location information of all user HRegions. Its RowKey is tableName,regionStartKey,regionId,replicaId, and it has only one column family, info, which contains three columns:

    1. info:regioninfo - the RegionInfo in proto format: regionId, tableName, startKey, endKey, offline, split, replicaId.
    2. info:server - the server:port of the HRegionServer hosting this HRegion.
    3. info:serverstartcode - the start timestamp of that HRegionServer.
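For illustration (this is not from the original article), here is a minimal sketch that scans hbase:meta with the Java client and prints the info:server column for each region entry:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table meta = conn.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {
            for (Result row : scanner) {
                // RowKey: tableName,regionStartKey,regionId(,replicaId)
                String regionRow = Bytes.toStringBinary(row.getRow());
                byte[] server = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
                System.out.println(regionRow + " -> "
                        + (server == null ? "(unassigned)" : Bytes.toString(server)));
            }
        }
    }
}
```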

HRegionServer in Detail

An HRegionServer generally runs on the same machine as a DataNode, enabling data locality. An HRegionServer hosts multiple HRegions and consists of the WAL (HLog), the BlockCache, MemStores, and HFiles.

  1. The WAL, the Write-Ahead Log (called HLog in earlier versions), is a file on HDFS. As its name indicates, all writes are first guaranteed to be written to this log file before the MemStore is actually updated and the data is eventually written to HFiles. With this approach, if an HRegionServer goes down, we can still read the data back from the log file and replay all operations without losing any data. The log file is rolled periodically and old files are deleted (those whose contents have already been persisted to HFiles can be removed). WAL files are stored under /hbase/WALs/${HRegionServer_Name} (before 0.94 they were stored under /hbase/.logs/). Typically an HRegionServer has only one WAL instance, which means that all WAL writes on an HRegionServer are serialized (just as log4j's log writes are serialized), and this can certainly cause performance problems. Therefore, after HBase 1.0, HBASE-5699 implemented parallel writes to multiple WALs (MultiWAL), using multiple pipelines in HDFS, partitioned at single-HRegion granularity. For more on WALs, see Wikipedia's Write-Ahead Logging article. (By the way, the English Wikipedia can currently be accessed normally without any workarounds; is this an oversight of the GFW, or the new normal?) A small durability-related sketch follows this list.
  2. The BlockCache is the read cache, based on the principle of locality of reference (which also applies to CPUs and comes in two forms: spatial locality, meaning that if some data is needed at a given moment, the data near it is very likely to be needed soon as well; and temporal locality, meaning that once a piece of data has been accessed, it is very likely to be accessed again in the near future). Data is read ahead into memory to improve read performance. HBase provides two BlockCache implementations: the default on-heap LruBlockCache and the BucketCache (usually off-heap). BucketCache's raw performance is usually worse than LruBlockCache's; however, because of GC, LruBlockCache's latency becomes unstable, whereas BucketCache manages its block memory itself and needs no GC, so its latency is usually quite stable, which is why BucketCache is sometimes the better choice. The article BlockCache101 compares the on-heap and off-heap BlockCaches in detail.
  3. An HRegion is the representation, on an HRegionServer, of one region of a table. A table can have one or more regions; they can live on the same HRegionServer or be distributed across different HRegionServers, and one HRegionServer can host multiple HRegions belonging to different tables. An HRegion is made up of multiple Stores (HStore); each HStore corresponds to one column family of the table within this HRegion. In other words, each column family is a separate, centralized storage unit, so columns with similar IO characteristics are best placed in the same column family for efficient reading (the data-locality principle again, which also improves the cache hit rate). The HStore is the core of HBase storage and implements reads and writes against HDFS; an HStore consists of one MemStore and zero or more StoreFiles.
    1. The MemStore is the write cache (an in-memory sorted buffer). All data, after being written to the WAL, is written to the MemStore, and the MemStore flushes the data to the underlying HDFS files (HFiles) according to certain rules. Usually each column family of each HRegion has its own MemStore.
    2. HFiles (StoreFiles) store HBase's data (Cells/KeyValues). The data in an HFile is sorted by RowKey, Column Family, and Column, and Cells with identical values for all three are sorted by timestamp in descending order.
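As promised in the WAL item above, here is a small hedged sketch showing how a client can choose how strictly a Put is tied to the WAL via the Durability setting; the table and values are placeholders, and the default, USE_DEFAULT, simply honors the table's own configuration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurabilitySketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));

            // SYNC_WAL (sync the WAL edit to HDFS) is the usual safe choice;
            // SKIP_WAL trades durability for write speed and risks data loss on a crash.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}
```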


Although the diagram above shows the latest HRegionServer architecture (though not entirely precisely), I have always preferred the following diagram, even though it reflects the older 0.94 architecture.

Data Write Flow in an HRegionServer (Illustrated)

When the client initiates a Put request, it first determines from the hbase:meta table which HRegionServer the data should ultimately go to. The client then sends the Put request to that HRegionServer, which first writes the Put operation to its WAL log file (flushed to disk).

After the WAL log has been written, the HRegionServer finds the corresponding HRegion according to the TableName and RowKey in the Put, then finds the corresponding HStore according to the column family, and writes the Put into that HStore's MemStore. At this point the write has succeeded and the client is notified.

MemStore Flush

The MemStore is an in-memory sorted buffer. There is one MemStore per HStore, i.e. one instance per column family of each HRegion. Its data is ordered by RowKey, Column Family, and Column, and by timestamp in descending order, as follows:

Every Put/Delete request is first written to the MemStore; when the MemStore is full, it is flushed into a new StoreFile (whose underlying implementation is the HFile), so one HStore (column family) can have zero or more StoreFiles (HFiles). There are three situations that can trigger a MemStore flush, and note that the minimum flush unit is an HRegion rather than a single MemStore. It is said that this is one of the reasons the number of column families is limited, presumably because flushing too many column families together causes performance problems; the exact reason still needs to be verified. The three triggers are listed below, with a small configuration sketch afterwards.

  1. When the total size of all MemStores in one HRegion exceeds hbase.hregion.memstore.flush.size (default 128MB), all MemStores in that HRegion are flushed to HDFS.
  2. When the global MemStore size exceeds hbase.regionserver.global.memstore.upperLimit (default 40% of heap memory), the MemStores of all HRegions on the HRegionServer are flushed to HDFS. The flush order is by descending MemStore size (is the reference the sum of all MemStores of an HRegion, or its largest single MemStore? To be verified), until the overall MemStore usage drops below hbase.regionserver.global.memstore.lowerLimit (default 38% of heap memory).
  3. When the size of the WAL on the current HRegionServer exceeds hbase.regionserver.hlog.blocksize * hbase.regionserver.max.logs, MemStores of the HRegions on this HRegionServer are flushed to HDFS in time order: the HRegion with the oldest MemStore is flushed first, until the number of WAL files drops below hbase.regionserver.max.logs. It is said that the product of these two defaults is 2GB; checking the code, the default of hbase.regionserver.max.logs is 32, and hbase.regionserver.hlog.blocksize defaults to the HDFS block size (with the common 64MB HDFS block size, 32 × 64MB = 2GB). In any case, a flush caused by exceeding this limit is not a good thing, because it can cause long delays, hence the article's advice: "Hint: keep hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs just a bit above hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE." Also be aware that the description given there is wrong (even though it is in the official documentation).
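Purely as a hedged illustration of where these knobs live (the values are examples, not recommendations, and some property names quoted in the article have been renamed in newer HBase versions), the flush-related parameters above are ordinary Configuration/hbase-site.xml properties:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Per-HRegion flush threshold (trigger 1): 128MB by default.
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

        // Global MemStore limits (trigger 2), as fractions of the heap.
        conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.40f);
        conf.setFloat("hbase.regionserver.global.memstore.lowerLimit", 0.38f);

        // WAL-count limit (trigger 3).
        conf.setInt("hbase.regionserver.maxlogs", 32);

        System.out.println("flush.size = "
                + conf.getLong("hbase.hregion.memstore.flush.size", -1));
    }
}
```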

During a MemStore flush, some metadata is appended at the end of the HFile, including the largest WAL sequence number at flush time, to tell HBase how recent the data written to this StoreFile is and where recovery should start from. When an HRegion is started, these sequence numbers are read and the largest is taken as the starting sequence number for the next update.

HFile Format

HBase's data is stored in HFiles in the form of KeyValues (Cells). HFiles are generated during MemStore flushes. Because the Cells stored in the MemStore are already kept in the same sort order, the flush is a sequential write; there is no need to keep seeking the disk head, so we get the high write performance of sequential disk I/O.

The HFile design borrows from BigTable's SSTable and Hadoop's TFile. Over HBase's history, the HFile has gone through three versions: V2 was introduced in 0.92 and V3 in 0.98. First, let's look at the V1 format:

A V1 HFile consists of multiple Data Blocks, Meta Blocks, FileInfo, a Data Index, a Meta Index, and a Trailer. The Data Block is HBase's smallest storage unit, and the BlockCache mentioned earlier caches data at Data Block granularity. A Data Block consists of a magic number followed by a series of KeyValues (Cells); the magic number is a fixed random value that marks this as a Data Block, so that its format can be checked quickly and corruption detected. The Data Block size can be set when the column family is created (HColumnDescriptor.setBlocksize()); the default is 64KB. Large blocks favor sequential scans, small blocks favor random lookups, so it is a trade-off (a column-family sketch follows this paragraph). Meta Blocks are optional. FileInfo is a fixed-length block recording some meta-information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, etc. The Data Index and Meta Index record the starting offset, uncompressed size, and key (the starting RowKey) of each Data Block and Meta Block. The Trailer records the starting positions of the FileInfo, Data Index, and Meta Index blocks and the number of Data Index and Meta Index entries, among other things. FileInfo and the Trailer are fixed-length.
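As a hedged illustration of the block-size setting mentioned above (table and family names are placeholders; the HColumnDescriptor API shown is the classic one referenced in the text, while newer versions offer ColumnFamilyDescriptorBuilder):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HColumnDescriptor cf = new HColumnDescriptor("cf");
            // Smaller blocks help random point reads; larger blocks help sequential scans.
            cf.setBlocksize(16 * 1024);        // 16KB instead of the 64KB default
            cf.setBlockCacheEnabled(true);     // cache this family's data blocks

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
            table.addFamily(cf);
            admin.createTable(table);
        }
    }
}
```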

Each KeyValue inside an HFile is a simple byte array, but this byte array has a fixed internal structure with many fields. Let's look at its concrete layout:

It starts with two fixed-length numbers representing the length of the key part and the length of the value part, respectively. Next comes the key part, which begins with a fixed-length number for the RowKey length, followed by the RowKey, then a fixed-length number for the column family length, followed by the column family, then the qualifier, and finally two fixed-length fields for the Timestamp and the Key Type (Put/Delete). The value part has no such complex structure; it is pure binary data. As the HFile version evolved, the KeyValue (Cell) format changed very little, except that in V3 an optional tags array was added at the end. A small sketch of this layout follows.
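A minimal sketch (for illustration only) that builds a KeyValue with the client-side KeyValue class and prints the lengths of its parts, matching the layout described above; the row, family, qualifier, and value are placeholders.

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueLayoutSketch {
    public static void main(String[] args) {
        KeyValue kv = new KeyValue(
                Bytes.toBytes("row-001"),      // RowKey
                Bytes.toBytes("cf"),           // column family
                Bytes.toBytes("col"),          // qualifier
                System.currentTimeMillis(),    // timestamp
                KeyValue.Type.Put,             // key type (Put/Delete...)
                Bytes.toBytes("value"));       // value (plain bytes)

        // The serialized form starts with the key length and value length,
        // followed by the structured key part and the raw value bytes.
        System.out.println("key length    = " + kv.getKeyLength());
        System.out.println("value length  = " + kv.getValueLength());
        System.out.println("row length    = " + kv.getRowLength());
        System.out.println("family length = " + kv.getFamilyLength());
    }
}
```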

In practice, the HFile V1 format turned out to use a lot of memory, and the Bloom filters and Block Index could become very large, slowing down startup. The Bloom filter of a single HFile could grow to 100MB, which caused performance problems at query time, because each query had to load and consult that 100MB Bloom filter, adding significant latency. The Block Indexes on a single HRegionServer could grow to a total of 6GB, and the HRegionServer had to load all of this Block Index data at startup, increasing startup time. To address these problems, the HFile V2 format was introduced in version 0.92:

In this version, the Block Index and Bloom filter data are interleaved with the Data Blocks, which also reduces memory usage during writes. In addition, to speed up startup, a lazy-loading feature was introduced in this version: the index and Bloom blocks are only parsed when the HFile is actually used.

The V3 version makes no substantial changes compared to V2: it adds support for a tags array at the KeyValue (Cell) level and adds two tag-related fields to the FileInfo structure. For a detailed introduction to the evolution of the HFile format, you can refer to the reference here.

Looking at the HFile V2 format in more detail: it is a multi-level, B+-tree-like index, and with this design a lookup does not need to read the entire file:

The Cells in a Data Block are in ascending key order. Each block has its own leaf index, and the last key of each block is put into the intermediate index; the root index points to the intermediate index. At the end of the HFile there is a Bloom filter, used to quickly determine that a row is not present in a Data Block, and TimeRange information for queries that use time ranges. When an HFile is opened, this index information is loaded and kept in memory to speed up subsequent reads.

This article stops here for now; to be continued....

References:

https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
http://hbase.apache.org/book.html
http://www.searchtb.com/2011/01/understanding-hbase.html
http://research.google.com/archive/bigtable-osdi06.pdf
