HBase terminology used in this article:
Column-based: column-oriented
Row: row
Column group: column family
Column: column
Unit: cell
The hardest part of understanding HBase (an open-source implementation modeled on Google's Bigtable) is grasping its data model. First, HBase differs from an ordinary relational database: it is suited to storing unstructured data. Second, HBase is column-oriented rather than row-oriented.
Google's Bigtable paper defines it as follows:
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
The HBase architecture page on the Hadoop wiki says:
"HBase uses a data model very similar to that of Bigtable. Users store data rows in labeled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes."
In essence, HBase and Bigtable are maps, the same abstraction as associative arrays (PHP), dictionaries (Python), hashes (Ruby), and objects (JavaScript). Each row is therefore a map, and that map can in turn contain further maps (one per column family). Retrieving a value works just like a lookup in a map: give a row key (to select that row's map), then a column key (column family name + qualifier) to get the data.
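The map-of-maps lookup described above can be sketched with nested Python dictionaries. This is only a toy illustration of the idea, not the HBase client API; the table contents and the `lookup` helper are hypothetical.

```python
# Toy model: row key -> column family -> qualifier -> value.
# The data and the helper name are invented for illustration only.
toy_table = {
    "row1": {"A": {"foo": "y", "bar": "d"}, "B": {"": "w"}},
}


def lookup(table, row_key, family, qualifier):
    """Return the value at (row, family, qualifier), or None if absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


found = lookup(toy_table, "row1", "A", "foo")      # existing cell
absent = lookup(toy_table, "row1", "A", "nope")    # unknown qualifier
```

Chaining `.get(..., {})` means a missing row, family, or qualifier simply yields `None` instead of raising an error, which mirrors the idea that a sparse table has no value for most possible cells.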
HBase and Bigtable are both built on distributed file systems, so the underlying file storage can be spread across the machines of the file system. HBase runs on Hadoop's Distributed File System (HDFS), Amazon's Simple Storage Service (S3), or the Kosmos Distributed File System (KFS); Bigtable uses the Google File System (GFS). Data is replicated to multiple nodes, much as data is stored redundantly in a RAID system.
Unlike most map implementations, HBase and Bigtable keep their key/value pairs in strict alphabetical order. This means the row with key "aaaaa" sits right next to the row with key "aaaab", and very far from the row with key "zzzzz". Because these systems are very large and distributed, this property matters. Keeping rows with similar keys physically close ensures that when you have to scan the table, the rows you are interested in are near one another. This is important when you choose a convention for row keys. For example, consider a table whose row keys are domain names. It is better to store them reversed (so "com.jimbojw.www" rather than "www.jimbojw.com"), because the rows for your subdomains will then be near the row for your primary domain. Note that in HBase only keys are sorted; values are not.
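The effect of reversing domain names can be sketched in a few lines of Python. Here `sorted()` on plain ASCII strings stands in for the sorted order of row keys; the `reverse_domain` helper and the extra domain names ("mail.jimbojw.com", "mail.xyz.com", "jimbojw.com") are hypothetical examples, not taken from the article.

```python
def reverse_domain(domain):
    """Turn 'www.jimbojw.com' into 'com.jimbojw.www'."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical row keys, used only for this illustration.
domains = ["www.jimbojw.com", "mail.xyz.com", "mail.jimbojw.com", "jimbojw.com"]

# Sorted as written, an unrelated domain lands between related rows.
plain_order = sorted(domains)

# Sorted after reversal, all jimbojw.com rows are adjacent.
reversed_order = sorted(reverse_domain(d) for d in domains)
```

In `plain_order`, "mail.xyz.com" falls between "mail.jimbojw.com" and "www.jimbojw.com". In `reversed_order`, the three "com.jimbojw" keys come first and "com.xyz.mail" comes last, so a scan over one domain and its subdomains touches neighboring rows only.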
In the following JSON data, the whole structure is a map, and each key in that map corresponds to an inner map with the keys "A" and "B". If we treat the data below as a table, it has the rows "1", "aaaaa", "aaaab", "xyz", and "zzzzz", and each row holds a map with "A" and "B". In HBase terminology, "A" and "B" are column families.
{
  "1" : { "A" : "x", "B" : "z" },
  "aaaaa" : { "A" : "y", "B" : "w" },
  "aaaab" : { "A" : "world", "B" : "ocean" },
  "xyz" : { "A" : "hello", "B" : "there" },
  "zzzzz" : { "A" : "woot", "B" : "1337" }
}
In HBase, a column family can contain many columns, each identified by a qualifier (or label).
{
  "aaaaa" : {
    "A" : { "foo" : "y", "bar" : "d" },
    "B" : { "" : "w" }
  },
  "aaaab" : {
    "A" : { "foo" : "world", "bar" : "domination" },
    "B" : { "" : "ocean" }
  },
  "zzzzz" : {
    "A" : { "catch_phrase" : "woot" },
    "B" : { "" : "1337" }
  }
}
In the preceding example, in row "aaaaa", column family "A" contains two columns, "foo" and "bar", while column family "B" has just one column, whose qualifier is the empty string (""). When you fetch data from HBase, you must give the complete column name in the form "<column family>:<qualifier>". So in the preceding example, rows "aaaaa" and "aaaab" each contain three columns: "A:foo", "A:bar", and "B:". Although the column families of a row are fixed, the qualifiers within a column family can differ from row to row: in row "zzzzz", column family "A" has only one column, "catch_phrase". The last dimension is the timestamp. All data stored in HBase is versioned by timestamp, and you can specify a timestamp when inserting or retrieving data.
{
  "aaaaa" : {
    "A" : {
      "foo" : { 15 : "y", 4 : "m" },
      "bar" : { 15 : "d" }
    },
    "B" : {
      "" : { 6 : "w", 3 : "o", 1 : "w" }
    }
  }
}
Each column family can specify how many versions of a cell are kept. In the preceding example, the "A:foo" column of row "aaaaa" holds two versions, with timestamps 15 and 4, stored in reverse timestamp order; "B:" holds three versions, also in reverse timestamp order. Usually an application simply asks for the data in a cell without giving a timestamp. In that case HBase returns the latest version, that is, the one with the largest timestamp: requesting "A:foo" returns "y", and requesting "B:" returns "w". If the application supplies a timestamp with the request, HBase returns the newest version whose timestamp is less than or equal to the requested one. In the preceding example, requesting "A:foo" with timestamp 10 returns "m"; with timestamp 3, it returns null.
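The version-selection rule just described can be sketched in Python over the same example row. This is a toy model of the rule, not how HBase implements it; the `get_versioned` helper and the dictionary layout are assumptions made for illustration.

```python
# Toy model of row "aaaaa": family -> qualifier -> {timestamp: value}.
row_aaaaa = {
    "A": {"foo": {15: "y", 4: "m"}, "bar": {15: "d"}},
    "B": {"": {6: "w", 3: "o", 1: "w"}},
}


def get_versioned(row, column, timestamp=None):
    """Look up a cell by its full column name, e.g. "A:foo".

    Without a timestamp, return the newest version. With one, return the
    newest version whose timestamp is <= the requested timestamp, or None
    if no such version exists.
    """
    # Split "family:qualifier" at the first colon; "B:" gives qualifier "".
    family, _, qualifier = column.partition(":")
    versions = row.get(family, {}).get(qualifier, {})
    eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
    if not eligible:
        return None
    return versions[max(eligible)]


latest_foo = get_versioned(row_aaaaa, "A:foo")
latest_b = get_versioned(row_aaaaa, "B:")
foo_at_10 = get_versioned(row_aaaaa, "A:foo", timestamp=10)
foo_at_3 = get_versioned(row_aaaaa, "A:foo", timestamp=3)
```

These four calls reproduce the cases in the text: "y" and "w" for the latest versions, "m" for "A:foo" at timestamp 10, and `None` at timestamp 3, since no version of "A:foo" is that old.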
Each row can have multiple column families, and each column family can contain any number of columns. Each column can have timestamps different from those of other columns. In a conventional database, the columns are defined when the table is created, and changing the table structure later (for example, adding a column) is difficult. In HBase, a new column can be added at any time simply by writing to a new qualifier, and a column family can be added by altering the table definition.