When we are unsure or disorganized about the data structure field, it is difficult to extract data by a concept. What database is it suitable for? The answer is what, if we use the traditional database, there must be extra fields, 10 No, 20, but this seriously affects the quality. And if the big database, PT-level data, this waste is more serious, then we should use what database? HBase has several good options, then we have the following problems with HBase:
What does 1.Column family represent?
2.HBase with row and column to determine a piece of data, the value of this data may have more than one version, why there are multiple versions?
3. Which version will be displayed when querying?
4. What are their storage types?
What type of 5.tableName is it?
What is the type of 6.RowKey and ColumnName?
What type of 7.Timestamp is it?
What type of 8.value is it?
introduction
There are many projects in the team that use HBase, and for business people, there is usually no need to build and maintain an hbase cluster environment from scratch, and it is not necessary to have a deep understanding of the architectural details (assigned to the HBase Cluster maintenance team). What is urgently needed is a quick understanding of basic technologies to solve business problems. Recently in the XX project rotation process, try to see hbase from the perspective of business personnel, some of the process is recorded, looking forward to a quick understanding hbase, mastering the relevant technology to carry out the work of the business people a little help. I think as a business development tester with HBase for the first time, he needs to grasp at least the following points:
In-depth understanding of htable, how to combine business design with high performance htable
Mastering the interaction with HBase, anyway, is inseparable from the data to change, through the HBase shell command and Java API are required
Master How to use MapReduce to analyze the data in HBase, hbase data is always analyzed, Using MapReduce is one of the ways
Master how to test HBase MapReduce, and it's not always written, no matter what it is, debug is needed, see how to debug in this machine
this series will be extended around the above points, If it is the HBase beginners recommend reading while practicing, for hbase more skilled, can be selected under, such as the attention of HBase MapReduce and its testing methods.
From an example,
The traditional relational database must be familiar to everyone, and we will use a simple example to illustrate the respective solutions and pros and cons of using RDBMS and hbase.
In the case of Bowen, the RDBMS table is designed as follows:
For ease of understanding, we use some data examples to
In the example above, we can design with HBase as follows
Also for ease of understanding, we use some data examples to mark some key concepts in red and later explain
htable some basic concepts
Row Key
Row primary Key, HBase does not support queries such as conditional queries and order BY, read records can only be scanned by row key (and its range) or full table, so row key needs to be designed according to the business to take advantage of its storage sorting characteristics (table is sorted by row key dictionary like 1, 10,100,11,2) improves performance.
column Family (family of columns)
declared at table creation time, each column family as a storage unit. In the previous example, an HBase table blog was designed with two column families: article and author.
column (columns)
each column of HBase belongs to a column family, prefixed by a column family name, such as Columns Article:title and article:content belong to the article column family, and Author:name and Author:nickname belong to the author column family.
column can be dynamically added without creating a table, the columns of the same column family are gathered on a single storage unit and sorted by column key, so the design should design a column with the same I/O characteristics in a column Family to improve performance. at the same time, it is important to note that this column can be added and deleted, which is quite different from our traditional database. So he fits into unstructured data.
Timestamp
HBase determines a single piece of data by row and column, the value of which may have multiple versions, the values of different versions are sorted in reverse chronological order, that is, the most recent data is first, and the latest version is returned by default when queried. As in the example above, row Key=1 has two versions of the Author:nickname value, corresponding to "one leaf crossing" and 1317180718830 corresponding "yedu" respectively. (Correspondence to the actual business can be understood to modify the nickname at some point as yedu, but the old value still exists). Timestamp defaults to the current time of the system (accurate to milliseconds), or you can specify this value when writing data.
Value
Each value is uniquely indexed by 4 keys, tablename+rowkey+columnkey+timestamp=>value, such as {tablename= ' blog ' In the previous example, rowkey= ' 1 ', columnname= ' Author:nickname ', timestamp= ' 1317180718830 '} The unique value to be indexed is "yedu".
Storage type
TableName is a string
RowKey and ColumnName are binary values (Java type byte[])
Timestamp is a 64-bit integer (Java type Long)
Value is a byte array (Java type byte[]).
Storage structure
The storage structure of htable can be simply understood as
That is, htable is automatically sorted by row key, each row contains any number of columns,columns sorted by column key, and each column contains any number of values. Understanding the storage structure will help in iterating through the results of the query.
Say what the situation requires hbase
Semi-structured or unstructured data
Data that is not deterministic or disorganized for data structure fields is difficult to extract by a concept that is suitable for hbase. In the example above, when business development needs to store author Email,phone,address information, the RDBMS requires downtime maintenance while hbase support dynamically increases.
Very sparse records
The number of rows in an RDBMS is fixed, and null columns waste storage space. As mentioned above, the null column of hbase is not stored, which saves space and improves read performance.
Multi-version data
As mentioned above, the value that is anchored to row key and column key can have any number of version values, so it is very convenient to use hbase for data that needs to store the change history. For example, the address of the author in the example above is subject to change, and business generally requires only the most recent values, but sometimes it may be necessary to query to historical values.
Very large data volume
When the data volume is getting larger, the RDBMS database can't hold up, there is a read-write separation strategy, through a master dedicated to write operations, multiple slave responsible for read operations, server cost multiplier. As the pressure increases, master can't hold up, at this time to separate the library, the data is not associated with the deployment, some join query can not be used, need to rely on the middle tier. As the amount of data increases further, the records of a table become larger, the query becomes very slow, and the tables are divided, such as by ID modulo into multiple tables to reduce the number of records in a single table. People who have experienced these things know how the process is going to be frustrating. With HBase, it's easy to add machines, HBase automatically scales horizontally, and seamless integration with Hadoop guarantees high performance (MapReduce) for data Reliability (HDFS) and massive data analytics.
Personal Opinion:
What does Column family stand for?
Column Family (Family of columns)
HBase determines a single piece of data by row and column, which may have multiple versions, why are there multiple versions?
Which version will be displayed when querying?
Ensure the data is not modified, always show the latest version when queried
HBase common sense and habse suitable for what scenario