We now have business requirements, including real-time statistics, for which HBase may be used, so we have reprinted an article on the subject.
Reproduced from: Taobao QA Team, original address: http://qa.taobao.com/?p=13850
Introduction
Many projects in the team use HBase. Business developers generally do not need to build or maintain an HBase cluster environment, nor do they need a deep understanding of the architecture details (these are handled by the HBase cluster maintenance team). What they urgently need is a quick grasp of the basic technologies in order to solve business problems. Recently, during my rotation on XX Project, I tried to look at HBase from a business developer's perspective and recorded some of the process, in the hope that it will help business developers understand HBase quickly and master the relevant technologies for their work. I think a business developer coming into contact with HBase for the first time needs to grasp at least the following:
A deep understanding of HTable and how to design a high-performance HTable based on the business.
How to interact with HBase. This is inseparable from adding, deleting, modifying, and querying data, so the HBase shell commands and Java APIs are required.
How to use MapReduce to analyze the data stored in HBase.
How to test HBase MapReduce jobs. You cannot simply write code without regard to correctness; debugging is required, so we will look at how to test and debug on a local machine.
This series will center on the points above and will be fairly long. If you are new to HBase, you are advised to read it through and practice along. If you are already familiar with HBase, you can read selectively, for example the parts on HBase MapReduce and how to test it.
Starting from an example
Everyone should be familiar with traditional relational databases. We will use a simple example to illustrate the solutions with an RDBMS and with HBase, along with their advantages and disadvantages.
Taking a blog as an example, the RDBMS table design is as follows:
For ease of understanding, here is some sample data:
The same example can be designed in HBase as follows:
For ease of understanding, here is some sample data, with some key concepts marked in red; these are explained later.
HTable Basic Concepts
Row key: The primary key of a row. HBase does not support conditional queries or ORDER BY queries. Records can be read only by row key, by a scan over a row key range, or by a full table scan. The row key therefore needs to be designed around the business so that it exploits the sorted storage (a table is sorted lexicographically by row key, so, for example, row key 10 sorts before row key 2) and improves performance.
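The sorting behaviour described above can be illustrated with a small Python sketch. It is not HBase client code: it only models a table as a sorted list of byte-string row keys, and the helper name range_scan is hypothetical. It shows that keys compare byte by byte rather than numerically, that zero-padding numeric keys to a fixed width (one possible design choice, assumed here) restores numeric order, and that a range scan is simply a slice of the sorted key space.

```python
import bisect

# Row keys are compared byte by byte, not numerically.
raw_keys = sorted([b"1", b"2", b"10", b"11", b"100"])
# raw_keys is now [b"1", b"10", b"100", b"11", b"2"]

# Zero-padding to a fixed width makes lexicographic order match numeric order.
padded_keys = sorted(str(n).zfill(5).encode() for n in (1, 2, 10, 11, 100))


def range_scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), mimicking a scan over a row-key range."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


print(raw_keys)
print(range_scan(padded_keys, b"00002", b"00011"))  # [b'00002', b'00010']
```

Because neighbouring keys are stored together, a key design that places records read together under a common prefix lets one range scan fetch them all.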
Column family: Column families are declared when the table is created, and each column family is a storage unit. In the example above, the HBase table blog has two column families: article and author.
Column: Each column in HBase belongs to a column family and is prefixed with the column family name. For example, the columns article:title and article:content belong to the article column family, and author:name and author:nickname belong to the author column family.
Columns can be added to a column family dynamically and do not have to be defined when the table is created. Columns of the same column family are stored together in one storage unit and sorted by column key, so columns with the same I/O characteristics should be placed in the same column family to improve performance.
Timestamp: HBase identifies a piece of data by row and column. That data may have multiple versions of its value, sorted in reverse chronological order, that is, newest first, and a query returns the latest version by default. In the example above, the author:nickname value for row key 1 has two versions: "yiye dujiang" at 1317180070811 and "yedu" at 1317180718830 (in business terms, the nickname was changed to yedu at some point, but the old value still exists). By default the timestamp is the current system time (accurate to milliseconds); you can also specify it when writing data.
Value: Each value is uniquely indexed by four keys: tablename + rowkey + columnkey + timestamp => value. For example, {tablename = 'blog', rowkey = '1', columnname = 'author:nickname', timestamp = 1317180718830} uniquely identifies the value "yedu".
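As a rough illustration of the four-part key and the default "latest version" behaviour, here is a minimal Python model. It is a sketch only and does not use the HBase API: the functions put, get, and versions_newest_first are hypothetical names, and the storage is a plain dictionary. The two nickname versions and their timestamps are the ones from the example in this article.

```python
# Model: (table, row key, column) -> {timestamp: value}. Not the HBase API.
cells = {}


def put(table, row, column, value, timestamp):
    """Store one version of a value under the four-part key."""
    cells.setdefault((table, row, column), {})[timestamp] = value


def get(table, row, column, timestamp=None):
    """Return the value at `timestamp`, or the newest version if it is omitted."""
    versions = cells[(table, row, column)]
    if timestamp is None:
        timestamp = max(versions)  # the newest version is the default
    return versions[timestamp]


def versions_newest_first(table, row, column):
    """List (timestamp, value) pairs in reverse chronological order."""
    return sorted(cells[(table, row, column)].items(), reverse=True)


put("blog", "1", "author:nickname", "yiye dujiang", 1317180070811)
put("blog", "1", "author:nickname", "yedu", 1317180718830)

print(get("blog", "1", "author:nickname"))                 # yedu
print(get("blog", "1", "author:nickname", 1317180070811))  # yiye dujiang
```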
Storage types:
Tablename is a string.
Rowkey and columnname are binary values (Java byte[]).
Timestamp is a 64-bit integer (Java long).
Value is a byte array (Java byte[]).
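Since row keys, column names, and values are plain byte arrays, it is up to the application to decide how to encode and decode them. The Python sketch below shows one possible convention; the choice of UTF-8 for text and an 8-byte big-endian layout for 64-bit integers is an assumption made for illustration, and the helper names are hypothetical.

```python
import struct


def encode_str(text):
    """Encode text as UTF-8 bytes (assumed convention)."""
    return text.encode("utf-8")


def encode_long(number):
    # ">q" is a big-endian signed 64-bit integer, the same width as a Java long.
    return struct.pack(">q", number)


def decode_long(data):
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">q", data)[0]


row_key = encode_str("1")
column = encode_str("author:nickname")
value = encode_str("yedu")
timestamp = encode_long(1317180718830)

print(len(timestamp))          # 8
print(decode_long(timestamp))  # 1317180718830
print(value.decode("utf-8"))   # yedu
```

The stored bytes carry no type information, so the reader has to apply the same convention the writer used.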
Storage structure: Put simply, an HTable is automatically sorted by row key. Each row contains any number of columns, which are automatically sorted by column key, and each column contains any number of values. Understanding this storage structure makes it easier to iterate over query results.
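The nesting described above (rows, then columns, then versioned values) can be sketched in Python as nested dictionaries that are walked in sorted order. This is only a model of the iteration order, not HBase code: build_table and iterate are hypothetical helpers, and the article:title and author:name values are made up for the illustration. Only the two nickname versions come from the example in this article.

```python
# Flat cells: (row key, column key, timestamp, value).
cells = [
    ("2", "author:name", 1317180070900, "example name"),    # made-up value
    ("1", "author:nickname", 1317180070811, "yiye dujiang"),
    ("1", "author:nickname", 1317180718830, "yedu"),
    ("1", "article:title", 1317180070811, "example title"),  # made-up value
]


def build_table(flat_cells):
    """Nest cells as row key -> column key -> {timestamp: value}."""
    table = {}
    for row, column, ts, value in flat_cells:
        table.setdefault(row, {}).setdefault(column, {})[ts] = value
    return table


def iterate(table):
    """Yield cells by row key, then column key, then newest version first."""
    for row in sorted(table):
        for column in sorted(table[row]):
            for ts in sorted(table[row][column], reverse=True):
                yield row, column, ts, table[row][column][ts]


for cell in iterate(build_table(cells)):
    print(cell)
```

Code that consumes a query result typically mirrors these three nested loops.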
When do we need HBase?
Semi-structured or unstructured data: HBase suits data whose structure or fields are not well defined, or too disorganized to fit a single schema. In the example above, when the business grows and needs to store the author's email, phone, and address, an RDBMS needs downtime for maintenance, while HBase supports adding columns dynamically.
Sparse records: In an RDBMS the number of columns in a row is fixed, so columns that are null waste storage space. As mentioned above, HBase does not store columns that are null, which saves space and improves read performance.
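The difference can be made concrete with a small Python sketch that counts storage slots. It is a simplified model, assuming each fixed-schema row reserves a slot for every column while the sparse layout keeps only the cells that hold a value; the column list, sample rows, and the to_sparse helper are all hypothetical.

```python
COLUMNS = ["name", "nickname", "email", "phone", "address"]

# Fixed-schema rows: every row carries every column, even when it is empty.
fixed_rows = [
    {"name": "a", "nickname": "b", "email": None, "phone": None, "address": None},
    {"name": "c", "nickname": None, "email": "c@example.com", "phone": None, "address": None},
]


def to_sparse(rows):
    """Keep only the cells that hold a value, keyed by (row index, column)."""
    return {
        (i, col): val
        for i, row in enumerate(rows)
        for col, val in row.items()
        if val is not None
    }


sparse_cells = to_sparse(fixed_rows)
fixed_slots = len(fixed_rows) * len(COLUMNS)
print(fixed_slots, len(sparse_cells))  # 10 4
```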
Multi-version data: As mentioned above, a value located by row key and column key can have any number of versions, so HBase is very convenient for data that needs a change history. In the example above, suppose the author's address changes. The business normally needs only the latest value, but may sometimes need to query historical values.
Large data volume: When the data volume grows beyond what a single RDBMS can support, read/write splitting is introduced: one master handles writes and multiple slaves handle reads, which increases the server cost. As the pressure keeps growing, the master can no longer cope, so the database has to be split and parts of the data deployed separately; some join queries then stop working and a middle layer is required. As the data keeps growing, a single table holds more and more records and queries become slow, so tables have to be split as well, for example by ID modulo into multiple tables to reduce the number of records in each. Anyone who has been through this knows how painful the process is. With HBase it is simple: you only need to add machines, and HBase splits and scales horizontally on its own. Its seamless integration with Hadoop provides data reliability (HDFS) and high-performance analysis of massive data (MapReduce).
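To give a feel for why manual table splitting is painful, here is a Python sketch of the ID-modulo scheme mentioned above. It models only the RDBMS-side routing, not HBase's automatic splitting; the table naming pattern blog_N and both helper functions are hypothetical.

```python
def shard_table(record_id, shard_count):
    """Route a record to a table by ID modulo (the table names are hypothetical)."""
    return f"blog_{record_id % shard_count}"


def moved_on_reshard(ids, old_count, new_count):
    """Return the IDs whose target table changes when the shard count changes."""
    return [i for i in ids if shard_table(i, old_count) != shard_table(i, new_count)]


ids = range(20)
moved = moved_on_reshard(ids, 4, 5)

print(shard_table(13, 4))                               # blog_1
print(len(moved), "of", len(ids), "records must move")  # 16 of 20 records must move
```

In this model, growing from four tables to five relocates most of the existing records, which is the kind of migration work that has to be planned and carried out by hand.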
Summary
This article introduced some basic concepts of HTable, using examples, illustrations, and a comparison with RDBMS to aid understanding. After reading it, you should have a basic understanding of the design and structure of HTable and of when to use HBase. The next article will reinforce this knowledge through hands-on exercises.