Learning Hadoop: HBase

Source: Internet
Author: User

1. What are the basic features of HBase?

2. What problems does HBase solve relative to a relational database?

3. What is the data model for HBase? How is it expressed, and what operations does it support?

4. Some concepts and principles of HBase schema design

5. What is the topology of HBase?

6. How does HBase compare with Cassandra?


1. What are the basic features of HBase?

HBase is an open-source implementation modeled on Google's Bigtable, with the following characteristics:

1) It runs on top of HDFS.

2) It is a distributed, column-oriented database.

3) It is designed for real-time reads and writes on large datasets.

Other features of HBase:

1) No real indexes: rows are stored in key order, so there is no index bloat problem.

2) Automatic partitioning: as a table grows, it is automatically split into regions that are distributed across the nodes.

3) Linear scaling with automatic rebalancing: when a RegionServer is started on a new node, regions are rebalanced automatically to spread the load.

4) Fault tolerance and support for commodity hardware, as in Hadoop.


2. What problems does HBase solve relative to a relational database?

The differences between HBase and a relational database come down to the respective strengths and weaknesses of each.

The weaknesses of a relational database:

1) It is difficult to scale.

2) It is complex to maintain.

HBase addresses the scaling problem: it gains linear scalability simply by adding nodes. It does not support SQL.

The differences between HBase and an RDBMS:

1) Table design: HBase tables can be very tall and very wide, and they scale well. The table schema is a direct reflection of the physical storage.

2) Topology: HBase tables are partitioned horizontally and replicated automatically across thousands of nodes.

3) Application development: developers bear more responsibility for using HBase's storage and retrieval methods correctly.

4) An RDBMS follows a fixed schema and rules such as Codd's 12 rules, emphasizing strongly consistent transactions, referential integrity, SQL support, and the independence of the logical form of the data from its physical form. It suits small to medium-sized data, but when data size and concurrency grow sharply, RDBMS performance drops significantly, and distributing the database is difficult because it means giving up many of the ease-of-use features of an RDBMS.


HBase suits tables with hundreds of millions or billions of rows. If you have only thousands or millions of rows, a traditional RDBMS is the better choice.

HBase also needs enough hardware: with fewer than about 5 nodes, it will not perform well.

If you are migrating from an RDBMS to HBase, you have to do without many of the extra features of an RDBMS, such as typed columns, secondary indexes, transactions, and advanced query languages.


3. What is the data model for HBase? What elements does it have, how is data stored, and so on.

1) Data model

Consider the following three tables.

The first is a sparse table. It is a virtual table: only a conceptual view, not an actual storage form, and it is derived from the other two tables.

The other two tables are real tables: physical views that show the actual storage form, in which data is stored by column family.

Row Key         Time Stamp   ColumnFamily contents          ColumnFamily anchor
"com.cnn.www"   t9                                          anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8                                          anchor:my.look.ca = "CNN.com"
"com.cnn.www"   t6           contents:html = "<html>..."
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."


Row Key         Time Stamp   ColumnFamily anchor
"com.cnn.www"   t9           anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8           anchor:my.look.ca = "CNN.com"


Row Key         Time Stamp   ColumnFamily contents
"com.cnn.www"   t6           contents:html = "<html>..."
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."
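
One way to picture the conceptual view is as a sorted, multidimensional map: row key, then column, then timestamp, then value. The following is a minimal sketch of that idea using only java.util. It is not the HBase API; the class SparseTableSketch and its methods are hypothetical names introduced for illustration, and values are plain strings rather than byte arrays.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative in-memory model: row -> column ("family:qualifier") -> timestamp -> value. */
final class SparseTableSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one version of a cell; absent cells simply take no space. */
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            // Versions are kept in descending timestamp order, newest first.
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest value of a column, or null if the cell is absent. */
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN");
        table.put("com.cnn.www", "contents:html", 3, "<html>v3");
        table.put("com.cnn.www", "contents:html", 6, "<html>v6");
        System.out.println(table.getLatest("com.cnn.www", "contents:html"));
    }
}
```

Because only the cells that were written exist in the map, the sparse conceptual table costs nothing for its empty positions.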


2) The basic elements of HBase:

Table, row, column, cell: the basic elements of a table.

Key: usually the row key, the element that uniquely identifies a row. Rows in a table are sorted by key, and the table is accessed by key.

Column family: all members of a column family share the same prefix. Column families must be defined in advance as part of the schema, but new members (columns) can be added to an existing family at any time.

Members of a column family are physically stored together. HBase is therefore better described as a column-family-oriented store (as the example above shows): storage and tuning are specified at the column family level. HBase tables resemble tables in an RDBMS, except that cells are versioned, rows are sorted, and clients can add columns to an existing column family.

Cell: a cell holds an uninterpreted array of bytes. Each cell carries version information, and HBase sorts versions in descending order, newest first.

Region: a horizontal partition of a table, and the smallest unit by which an HBase cluster distributes data. The rows of all regions together make up the contents of the table.

Locking: updating a row requires a lock on that row, which keeps row updates atomic.


3) What operations does the data model support?

Get, Scan, Put, and Delete. Respectively, they return the attributes of a specific row, return the attributes of multiple rows, insert data, and delete data.

All of these are applied through an HTable instance. The Get, Scan, Put, and Delete classes carry the corresponding parameters and attributes.

Take Scan as an example:


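import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

HTable htable = ...      // instantiate HTable

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
scan.setStartRow(Bytes.toBytes("row"));             // start key is inclusive
scan.setStopRow(Bytes.toBytes("row" + (char) 0));   // stop key is exclusive
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result ...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}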
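
The inclusive start key and exclusive stop key can be tried out without a cluster. The sketch below uses a java.util TreeMap as a stand-in for a table sorted by row key; ScanRangeSketch and its scan method are hypothetical names, and Java string ordering is assumed to match byte ordering, which holds for the ASCII keys used here.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative stand-in for a table: row key -> value, kept in sorted key order. */
final class ScanRangeSketch {
    /** Returns rows in [startRow, stopRow): start inclusive, stop exclusive. */
    static NavigableMap<String, String> scan(NavigableMap<String, String> table,
                                             String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("rov", "a");
        table.put("row", "b");
        table.put("row1", "c");
        table.put("rox", "d");

        // Appending a zero character gives the smallest key that sorts after "row",
        // so this range selects the single row "row" and nothing else.
        System.out.println(scan(table, "row", "row" + (char) 0).keySet());

        // A wider stop key turns the same call into a range scan.
        System.out.println(scan(table, "row", "rox").keySet());
    }
}
```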


4) How are the returned results sorted?

First by row, then by column family, then by column qualifier, and finally by timestamp (in reverse order, newest first).
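
This ordering can be written down as a comparator. The sketch below works on plain Java values; the nested Cell class is a hypothetical stand-in for a cell's coordinates, not HBase's own type, and it compares strings where HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CellOrderSketch {
    /** Hypothetical stand-in for the coordinates of one cell version. */
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "@" + timestamp;
        }
    }

    /** Row, then column family, then qualifier ascending; timestamp descending. */
    static final Comparator<Cell> ORDER =
            Comparator.comparing((Cell c) -> c.row)
                    .thenComparing((Cell c) -> c.family)
                    .thenComparing((Cell c) -> c.qualifier)
                    .thenComparing(Comparator.comparingLong((Cell c) -> c.timestamp).reversed());

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("com.cnn.www", "contents", "html", 3));
        cells.add(new Cell("com.cnn.www", "anchor", "cnnsi.com", 9));
        cells.add(new Cell("com.cnn.www", "contents", "html", 6));
        cells.sort(ORDER);
        cells.forEach(System.out::println);
    }
}
```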


5) Finally, HBase does not support joins.


6) MapReduce can be used together with HBase tables. By default, a MapReduce job over an HBase table is split by region: there is one map task per region of the table.
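
The one-map-per-region rule can be sketched as follows. This is not the HBase input format itself; RegionSplitSketch and splitsFor are hypothetical names, and each region is assumed to be described only by its start key, with an empty string standing for an open boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RegionSplitSketch {
    /** One split per region: [startKey, endKey), where "" means unbounded. */
    static List<String[]> splitsFor(List<String> sortedRegionStartKeys) {
        List<String[]> splits = new ArrayList<>();
        for (int i = 0; i < sortedRegionStartKeys.size(); i++) {
            String start = sortedRegionStartKeys.get(i);
            // A region ends where the next one starts; the last region is open-ended.
            String end = (i + 1 < sortedRegionStartKeys.size())
                    ? sortedRegionStartKeys.get(i + 1)
                    : "";
            splits.add(new String[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions, so three map tasks.
        for (String[] split : splitsFor(Arrays.asList("", "m", "t"))) {
            System.out.println("map task over [" + split[0] + ", " + split[1] + ")");
        }
    }
}
```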


4. Some concepts and principles of HBase schema design

1) Creating and updating the schema

You can create and edit HBase schemas with the HBase shell or with HBaseAdmin in the Java API.

In version 0.90.x, the table must be disabled before its column families can be modified; from version 0.92.x onward, online modification is supported.

Changes to tables and column families (such as region size or block size) take effect at the next major compaction, when the store files are rewritten.


2) Number of column families

- The fewer column families, the better. Add a second column family only when queries normally access one family or the other, not both at the same time.

- When a table has more than one column family, watch for large differences in cardinality. If family A has 1 million rows and family B has 1 billion rows, the rows of family A will be spread across many regions, which makes scans of family A less efficient.

- In addition, multiple column families add a lot of I/O load during flushes and compactions.


3) Row key design

A. Do not use monotonically increasing row keys (such as sequence numbers or timestamps): they defeat parallelism and concentrate the load on a single machine.

B. Locating a cell requires its row, column name, and timestamp. If these cell coordinates are large, they consume memory and can exhaust the index. The remedy: keep the column family name as short as possible (a single character such as "a"), keep attribute names short, keep row keys short but still readable and useful for access (row key length has no significant effect on data access), and store numbers as binary bytes rather than as digit strings (this saves space).

C. A reverse timestamp in the key helps find the most recent version of a value.

D. Row keys are scoped to column families, so the same row key can exist in different column families.

E. Row keys never change: to "change" one, the row has to be deleted and reinserted.
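
Points A to C can be illustrated with a small sketch. It uses only the Java standard library; RowKeySketch and its methods are hypothetical names, and the hash-derived bucket prefix ("salting") is one common way to avoid sequential keys, assumed here rather than taken from the text above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RowKeySketch {
    /** Reverse timestamp: newer events get smaller values, so they sort first. */
    static long reverseTimestamp(long timestampMillis) {
        return Long.MAX_VALUE - timestampMillis;
    }

    /** Prefixes a key with a bucket derived from its hash, spreading sequential keys. */
    static String saltedKey(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return (bucket < 10 ? "0" : "") + bucket + "-" + key;
    }

    /** Row key = entity id bytes followed by an 8-byte big-endian reverse timestamp. */
    static byte[] rowKey(String entityId, long timestampMillis) {
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(reverseTimestamp(timestampMillis))
                .array();
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("event-0001", 16));
        System.out.println(saltedKey("event-0002", 16));
        System.out.println(rowKey("sensor42", System.currentTimeMillis()).length + " bytes");
    }
}
```

The trade-off of a salted prefix is that a range scan over the original key order has to be issued once per bucket and the results merged.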


4) HBase supports anything that can be converted to a byte array: strings, numbers, complex objects, counters, and even images.

5) A column family can set a time to live (TTL); HBase automatically deletes data once it has expired.

6) Secondary indexes and queries: this is a broad topic, and it is best to consult the official documentation for your version.
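
As a rough illustration of why binary encoding saves space, the sketch below encodes the same number once as text and once as a fixed 8-byte value. It uses java.nio to stay self-contained; in an HBase client, the Bytes utility seen in the Scan example would normally do this job. ByteEncodingSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ByteEncodingSketch {
    /** Encodes a long as 8 big-endian bytes. */
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Decodes 8 big-endian bytes back into a long. */
    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    /** Encodes a string as UTF-8 bytes. */
    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        long n = 1234567890123456789L;
        System.out.println("as text:   " + encodeString(Long.toString(n)).length + " bytes");
        System.out.println("as binary: " + encodeLong(n).length + " bytes");
        System.out.println("round trip ok: " + (decodeLong(encodeLong(n)) == n));
    }
}
```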


5. What is the topology of HBase?

1) Topology: like HDFS with its master and slave nodes, and MapReduce with its JobTracker and TaskTrackers, HBase has a master and RegionServers.


2) What is the relationship between HBase and ZooKeeper?

HBase depends on ZooKeeper and, by default, manages a ZooKeeper instance itself. It uses ZooKeeper to coordinate the servers in the cluster. ZooKeeper holds vital information such as the location of the catalog tables and the address of the master, and if a server crashes, region assignment is coordinated through ZooKeeper.
The RegionServers are listed in HBase's conf/regionservers file, while the site configuration of the cluster is set in conf/hbase-site.xml and conf/hbase-env.sh. HBase follows Hadoop's conventions as far as possible.


3) How HBase manages its internal state:

HBase keeps special catalog tables named -ROOT- and .META., which maintain the list, location, and state of all regions on the cluster.

The -ROOT- table holds the list of regions of the .META. table, and the .META. table holds the list of all user regions.

So the lookup process is:

The client connects to ZooKeeper -> finds the location of the -ROOT- table -> finds the location of the .META. table -> finds the node, location, and state of the user region -> interacts directly with the RegionServer that hosts that region.
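
The last step of this lookup, finding the region that holds a given row, amounts to a floor search over region start keys. The following sketch shows that idea with a TreeMap; it is not HBase's client code, and RegionLookupSketch, its methods, and the server names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionLookupSketch {
    /** Region start key -> hosting server; "" is the start key of the first region. */
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    /** The region holding a row is the one with the greatest start key <= the row key. */
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers row " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch meta = new RegionLookupSketch();
        meta.addRegion("", "regionserver-1");
        meta.addRegion("m", "regionserver-2");
        meta.addRegion("t", "regionserver-3");
        System.out.println(meta.locate("com.cnn.www"));
        System.out.println(meta.locate("org.apache.www"));
    }
}
```

Once the client knows the hosting server, it can talk to that RegionServer directly without going through the master.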


HBase supports development in Java and with MapReduce.

HBase also provides Thrift, REST, and Avro interfaces. Each needs a corresponding gateway service, which acts as the client that talks to HBase on behalf of callers. Because this proxy has to relay requests and responses, these interfaces are slower than the Java client.

% hbase-daemon.sh start/stop rest/thrift/avro    # start or stop the corresponding service


4) Examples of using HBase:

1. Create a table

In the shell: create 'station', {NAME => 'info', VERSIONS => 1}

2. Load data

Copy the raw data into HDFS, then load it into HBase with a MapReduce job; this takes full advantage of the distributed model of the cluster.

There is only one HTable instance per task. By default, each HTable.put(put) performs the insert immediately, without buffering. To buffer writes, disable auto-flush with HTable.setAutoFlush(false) and set the write buffer size yourself.

3. Web queries

You can implement a web application directly on HBase's Java API. HTable.get() retrieves a row and can be restricted to a given column family. The result of the get is returned as a Result, which contains the data of the row.

HBase can also retrieve observations with a scanner, which returns results in order, much like a cursor in a traditional database: HTable.getScanner(scan).
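
The effect of a client-side write buffer can be sketched without HBase. The class below is a hypothetical model, not the HTable implementation: it collects payloads until a size limit is reached and counts each flush as one round trip to the server.

```java
import java.util.ArrayList;
import java.util.List;

final class WriteBufferSketch {
    private final long bufferLimitBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0; // stands in for round trips to the server

    WriteBufferSketch(long bufferLimitBytes) {
        this.bufferLimitBytes = bufferLimitBytes;
    }

    /** Buffers one write and flushes once the buffer reaches its limit. */
    void put(byte[] payload) {
        pending.add(payload);
        pendingBytes += payload.length;
        if (pendingBytes >= bufferLimitBytes) {
            flush();
        }
    }

    /** Sends whatever is buffered; a real client would ship the batch here. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushCount++;
        pending.clear();
        pendingBytes = 0;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        WriteBufferSketch buffer = new WriteBufferSketch(100);
        for (int i = 0; i < 25; i++) {
            buffer.put(new byte[10]);
        }
        buffer.flush(); // always flush the remainder before the task ends
        System.out.println("round trips: " + buffer.flushCount());
    }
}
```

With buffering, a task must flush explicitly at the end, otherwise the last partial batch is never sent.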


6. How does HBase compare with Cassandra?

Different applications call for different NoSQL databases: Cassandra, HBase, MongoDB, and Riak each have their own strengths and weaknesses. All of these databases are still evolving, and their characteristics change from version to version. The CAP theorem (consistency, availability, partition tolerance) offers an easy way to tell the two apart: HBase is usually described as favoring consistency, and Cassandra as favoring availability.

HBase is part of the Hadoop ecosystem and works with other frameworks such as Pig and Hive, whereas running MapReduce on Cassandra is relatively complicated. In general, Cassandra may be more efficient at storage, but HBase is stronger at data processing. HBase offers a shell and a web interface, whereas Cassandra has no shell support, only an API, which makes it less convenient to use than HBase.

When the Cassandra schema changes, a cluster restart is required. On the other hand, Cassandra claims that "write operations never fail", whereas in HBase they can. Use cases: Cassandra is best suited to small data centers (around hundreds of nodes) connected by high-speed fiber, while HBase suits slow and unpredictable Internet networks.

Other: someone else has written a very good summary of HBase performance tuning: http://rdc.taobao.com/team/jm/archives/975
