HBase terminology used in this article:
Column-oriented (organized by columns rather than rows)
Row
Column family
Column
Cell
The biggest difficulty in understanding HBase (an open-source implementation of Google's Bigtable) is grasping its data model. First, HBase differs from a typical relational database: it is designed for storing unstructured data. Second, HBase is column-oriented rather than row-oriented.
Google's Bigtable paper explains clearly what Bigtable is:
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
The HBase architecture page of the Hadoop wiki mentions:
"HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so rows in the same table can have wildly varying columns, if the user likes."
Essentially, HBase and Bigtable are maps, the same concept as associative arrays in PHP, dictionaries in Python, Hashes in Ruby, or objects in JavaScript. Each row is itself a map, and within that map there can be further maps, one per column family. Fetching data works just like looking up a value in a map: given a row key you get that row's map, and given a column key (column family name + qualifier) you get the value, as sketched below.
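As a rough mental model only (this is not how HBase actually stores data), the table can be pictured as nested sorted maps in Java; the keys and values below are taken from the examples later in this article:

import java.util.TreeMap;

public class NestedMapModel {
    public static void main(String[] args) {
        // row key -> ("family:qualifier" -> value), both levels kept in sorted order
        TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();

        TreeMap<String, String> row = new TreeMap<>();
        row.put("A:foo", "y");
        row.put("B:", "w");
        table.put("aaaaa", row);

        // Reading works like a map lookup: row key first, then column key
        String value = table.get("aaaaa").get("A:foo");
        System.out.println(value); // prints "y"
    }
}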
Both HBase and BigTable are built on a distributed file system, so the underlying file storage can be distributed across the system's machines.
HBase can use Hadoop's Distributed File System (HDFS), Amazon's Simple Storage Service (S3), or the Kosmos File System (KFS), just as Bigtable uses the Google File System (GFS). Data is replicated across a number of nodes, much like data striped across a RAID system.
Unlike most map implementations, in HBase and Bigtable the key/value pairs are kept in strict alphabetical order. That means the row that comes right after the row with key "aaaaa" is the one with key "aaaab", while the row with key "zzzzz" is very far away. Because these systems are so large and distributed, this sorted property matters: rows with nearby keys are stored close together, so when you scan the table the rows you are most interested in are near each other. This is important when you choose row keys. Example: suppose your row keys are domain names. It is better to store them reversed ("com.jimbojw.www" rather than "www.jimbojw.com"), so that all the subdomains of a domain sort next to the domain itself. Note that sorting in HBase applies only to keys; values are not sorted.
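A minimal sketch of the reversed-domain row key convention described above; reverseDomain is a hypothetical helper for illustration, not part of any HBase API:

import java.util.Arrays;
import java.util.Collections;

public class RowKeys {
    // Hypothetical helper: turn "www.jimbojw.com" into "com.jimbojw.www",
    // so subdomains sort next to their parent domain.
    static String reverseDomain(String domain) {
        String[] parts = domain.split("\\.");
        Collections.reverse(Arrays.asList(parts));
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        System.out.println(reverseDomain("www.jimbojw.com"));  // com.jimbojw.www
        System.out.println(reverseDomain("mail.jimbojw.com")); // com.jimbojw.mail
    }
}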
In the following JSON data, the whole structure is a map, and each key in that map points to another map with the keys "A" and "B". Treating the whole structure as a table, it has the rows "1", "aaaaa", "aaaab", "xyz" and "zzzzz", and every row has an "A" and a "B" entry. In HBase terms, "A" and "B" are the column families.
{"1": {"A": "X", "B": "Z"}, "AAAAA": {"a": "Y", "B": "W"}, "Aaaab": {"a": "World", "B": "Ocean"}, "xyz": {"a": "H Ello "," B ":" There "}," Zzzzz ": {" A ":" Woot "," B ":" 1337 "}}
Each column family in HBase can contain any number of columns, identified by a qualifier (also called a label).
{"AAAAA": {"A": {"foo": "Y", "bar": "D"}, "B": {"": "W"}, "Aaaab": {"A": {"foo": "World", "bar": "Dominatio N "}," B ": {" ":" Ocean "}," Zzzzz ": {" A ": {" catch_phrase ":" Woot ",}" B ": {" ":" 1337 "}}}
In the example above, in row "aaaaa" the column family "A" contains two columns, "foo" and "bar", while the column family "B" has a single column whose qualifier is the empty string. When asking HBase for data you must supply the full column name in the form "<family>:<qualifier>". So rows "aaaaa" and "aaaab" both contain three columns: "A:foo", "A:bar" and "B:". Although the column families are fixed for a table, the qualifiers within a family can vary from row to row; for example, in row "zzzzz" the family "A" holds only the single column "A:catch_phrase". The final dimension is the timestamp: every value stored in HBase is versioned by a timestamp, and you can insert or retrieve data at a specific timestamp.
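A minimal sketch of fetching one value by its full "<family>:<qualifier>" name with the HBase Java client; the table name "mytable" is illustrative, and the exact client classes vary a little between HBase versions:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            // Row "aaaaa", column family "A", qualifier "foo" -- i.e. column "A:foo"
            Get get = new Get(Bytes.toBytes("aaaaa"));
            get.addColumn(Bytes.toBytes("A"), Bytes.toBytes("foo"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("A"), Bytes.toBytes("foo"));
            System.out.println(Bytes.toString(value)); // "y" with the example data
        }
    }
}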
{"AAAAA": {"A": {"foo": {: "Y", 4: "M"}, "Bar": {: "D",}}, "B": {"": {6: "W" 3: "O" 1: "W"}}}
Each column family can be configured with how many versions of each cell to keep. In the example above, the column "A:foo" in row "aaaaa" has two versions, at timestamps 15 and 4, and column "B:" has three versions, all listed in reverse timestamp order. A typical application simply asks for a cell without specifying a timestamp; in that case HBase returns the most recent version, i.e. the one with the highest timestamp. So getting "A:foo" returns "y", and getting "B:" returns "w". If the application supplies a timestamp with the request, HBase returns the newest version whose timestamp is less than or equal to the requested one. So asking for "A:foo" at timestamp 10 returns "m", and asking for it at timestamp 3 returns nothing (null).
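A sketch of retrieving a value at or before a given timestamp, again assuming the Java client and an illustrative table name; Get.setTimeRange takes an exclusive upper bound, so "timestamp <= 10" becomes the range [0, 11):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampGet {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            Get get = new Get(Bytes.toBytes("aaaaa"));
            get.addColumn(Bytes.toBytes("A"), Bytes.toBytes("foo"));
            // Only consider versions with timestamp <= 10 (upper bound is exclusive)
            get.setTimeRange(0, 11);
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("A"), Bytes.toBytes("foo"));
            // With the example data this returns the version written at timestamp 4: "m"
            System.out.println(value == null ? "null" : Bytes.toString(value));
        }
    }
}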
Each row can have multiple column families, each column family can contain any number of columns, and each cell can carry timestamps independent of the other columns. In a typical relational database the columns are defined when the table is created, and modifying the table structure later (for example adding a column) can be very painful. In HBase adding a new column, or even a new column family, is easy.
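A sketch of adding a column family to an existing table with the admin API, assuming the older HColumnDescriptor-based client (newer releases use ColumnFamilyDescriptorBuilder); the table and family names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class AddFamily {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Adding a whole new column family is a one-line schema change.
            // Individual columns need no schema change at all: writing a Put with
            // a previously unused qualifier is enough.
            admin.addColumn(TableName.valueOf("mytable"), new HColumnDescriptor("newFamily"));
        }
    }
}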
HBase Performance Options:
Just as a column's type in a relational database (CHAR, VARCHAR or TEXT) affects how data is stored and how it performs, HBase's column family options affect storage and performance too. All columns within the same column family share the same MAX_VERSIONS, MAX_LENGTH, COMPRESSION, IN_MEMORY and BLOOMFILTER settings; a configuration sketch follows the option descriptions below.
HBase uses Hadoop's MapFile to store data and indexes; a MapFile writes its data through a SequenceFile, and a SequenceFile lets you choose how the data is compressed. The MapFile index file is always block-compressed, while compression of the data file depends on your settings. By default HBase does not compress the data in a column family, but there are two compression options for a column family: block and record.
Block compression: suppose your columns contain large chunks of data and you only want to keep one version of each. In that case you might enable block compression for the column family, because this option compresses the data of several records together and achieves a better compression ratio.
Record compression: suppose you have many rows of data and you want to keep multiple versions of each column. You might enable record compression for the column family, because it compresses each record on its own and keeps the data for each column contiguous.
Although block compression gives a better compression ratio than record compression, record compression should in theory give faster access to a value, because the key (HStoreKey) is not compressed and only the value portion of the record needs to be decompressed.
If the BLOOMFILTER option is enabled for a column family, an in-memory structure lets HBase quickly determine whether the column being looked up exists in a given row at all, reducing disk I/O. If the column family holds a very large number of columns and each column contains very little data, the family may be a good candidate for a Bloom filter. The material on the use of Bloom filters in HBase describes their usage and error-rate behaviour in detail.
The IN_MEMORY option: if a column family is marked in-memory, its data is kept in memory, which gives faster reads and writes; disk access is simply no match for memory access. The drawback is that data kept in memory consumes memory, and it can interfere with HDFS backups, because the data is written out to disk less often than usual.
The MAX_LENGTH and MAX_VERSIONS column family settings matter for overall storage and performance, but rarely affect functionality in practice. They control how many versions of data each cell keeps (the default is 3) and how many bytes each version of a cell may hold (the default is the maximum 32-bit signed integer).
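A sketch of configuring these column family options when creating a table, again assuming the older HColumnDescriptor-based client; the block/record compression choice described above belongs to older HBase releases, so the sketch uses a compression algorithm setter as a stand-in, and MAX_LENGTH is omitted because it is not present in recent clients:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateConfiguredTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Column family "A" with the options discussed above
            HColumnDescriptor family = new HColumnDescriptor("A");
            family.setMaxVersions(3);                             // MAX_VERSIONS: keep up to 3 versions per cell
            family.setCompressionType(Compression.Algorithm.GZ);  // COMPRESSION (algorithm-based in recent releases)
            family.setInMemory(true);                             // IN_MEMORY: favour keeping this family's data in memory
            family.setBloomFilterType(BloomType.ROW);             // BLOOMFILTER: row-level Bloom filter

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}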
This article is essentially a translation of the following articles, with some paragraphs of my own organization added. I hope it is helpful:
Understanding HBase and BigTable
Understanding HBase column-family Configured Options