Differences between Hbase and Oracle

I. Why is HBase performance so good compared with Oracle?

HBase is very different from traditional relational databases; it was developed according to Bigtable. The Bigtable paper defines it as follows:
A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
That is, Bigtable is a sparse, distributed, persistent, multidimensional sorted map.
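This definition can be made concrete with a toy sketch in plain Python (an illustration of the data model only, not HBase code): the map is keyed by (row, column, timestamp), sparsity means absent cells cost nothing, and sortedness means iteration can follow key order.

```python
# Toy model of the Bigtable data model: a sparse map from
# (row, column, timestamp) to an uninterpreted string value.
bigtable = {}

# Sparse: only cells that actually exist take any space.
bigtable[("key1", "column-family1:column1", 1)] = "abc"
bigtable[("key1", "column-family1:column1", 2)] = "gdxdf"

# Sorted: iteration follows key order, which is what makes
# row-range scans efficient in the real system.
sorted_cells = sorted(bigtable)
```

The sortedness is the property that row-key design (discussed below) exploits.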

1. Data types. HBase has only a simple string type; all typing is handled by the user, and only strings are stored. Relational databases offer a rich choice of data types and storage methods.

2. Data operations. HBase operations involve only simple insertion, query, deletion, and clearing. Tables are independent of one another, with no complex relationships between them, so join operations are unnecessary. Traditional relational databases usually involve various functions and join operations.

3. Storage mode. HBase stores data by column: each column family is stored in several files of its own, and the files of different column families are kept separate. Traditional relational databases store data row by row, according to the table structure.
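The difference in layout can be sketched with a toy Python model (hypothetical structures for illustration, not HBase's actual file format): in a column-family layout, reading one family never touches the data of the others.

```python
# Row-oriented layout: reading a row touches all of its columns.
row_store = {
    "key1": {"grade": "5", "course:math": "97", "course:art": "87"},
    "key2": {"grade": "4", "course:math": "89", "course:art": "80"},
}

# Column-family-oriented layout: each family's data lives in its own
# "file", so a query over one family never reads the other families.
cf_store = {
    "grade":  {"key1": "5", "key2": "4"},
    "course": {"key1": {"math": "97", "art": "87"},
               "key2": {"math": "89", "art": "80"}},
}

def read_family(store, family):
    # Reads only the single family's data.
    return store[family]
```

A query that only needs grades reads one small structure instead of every column of every row.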

4. Data maintenance. In HBase, an update is not really an update: a new version is inserted for the corresponding primary key or column, and the old version is still retained.
So HBase inserts new data rather than replacing and modifying data in place, as traditional relational databases do.

5. Scalability. Distributed databases such as HBase and Bigtable were developed precisely for this purpose: hardware can easily be added or removed, and tolerance of hardware failures is high. Traditional relational databases usually need to add an intermediate layer to implement similar functions.

II. HBase divides its cache into three categories:

1. InMemory: blocks whose contents are expected to stay resident in memory.

2. Single: blocks that have been accessed once are placed here.

3. Multi: blocks that have been accessed more than once are placed here.

InMemory is easy to understand: some system metadata is accessed frequently and is small, so it is very reasonable to keep it resident in memory rather than let it be evicted by the cache size limit.

Single and Multi exist to limit the impact of scans. Imagine the cache size is set to 100 MB and one scan reads 200 MB of data: all the data in the cache would be removed, yet the blocks the scan brings into the cache will never be accessed a second time (by scan semantics), which is a waste.
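The promotion rule can be sketched as a small Python class (a simplified illustration, not HBase's actual LruBlockCache): a block enters the Single segment on first access and moves to Multi only on a second access, so a one-pass scan can only churn the Single segment.

```python
from collections import OrderedDict

class ScanResistantCache:
    """Toy sketch of a single/multi segmented LRU cache."""

    def __init__(self, capacity_per_segment=2):
        self.single = OrderedDict()  # blocks seen exactly once
        self.multi = OrderedDict()   # blocks seen more than once
        self.cap = capacity_per_segment

    def access(self, block):
        if block in self.multi:
            self.multi.move_to_end(block)        # refresh LRU position
        elif block in self.single:
            del self.single[block]               # promoted on second access
            self.multi[block] = True
            if len(self.multi) > self.cap:
                self.multi.popitem(last=False)   # evict oldest Multi block
        else:
            self.single[block] = True            # first access lands in Single
            if len(self.single) > self.cap:
                self.single.popitem(last=False)  # evict oldest Single block
```

A block accessed twice survives in Multi even while a large scan flushes the Single segment.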

1. Introduction

HBase is a distributed, column-oriented open-source database. It originated from Google's paper "Bigtable: A Distributed Storage System for Structured Data." HBase is an open-source implementation of Google Bigtable. It uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data in HBase, and ZooKeeper as a coordination service.

2. HBase table structure

HBase stores data in tables. A table consists of rows and columns, and columns are grouped into several column families.

Row Key   Column-family1        Column-family2                  Column-family3
          Column1    Column2    Column1   Column2   Column3     Column1
Key1      t1:abc
          t2:gdxdf
Key2
Key3
As shown above, key1, key2, and key3 are the unique row key values of three records, and column-family1, column-family2, and column-family3 are three column families, each containing several columns. For example, the column-family1 family contains two columns named column1 and column2. t1:abc and t2:gdxdf make up the cell uniquely identified by row key1 and column-family1:column1. The cell contains two values, abc and gdxdf, with different timestamps t1 and t2; HBase returns the value with the latest timestamp to the requester.

The specific meanings of these terms are as follows:

(1) Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

(1.1) access through a single row key

(1.2) access through a range of row keys

(1.3) full table scan

The row key can be any string (the maximum length is 64 KB; in practice, row keys are generally 10 to 100 bytes). Inside HBase, the row key is stored as a byte array.

Data is stored in the lexicographic (byte) order of the row key. When designing row keys, take full advantage of this sort order and store rows that are frequently read together next to each other. (Locality.)

Note:

Integers stored as strings sort lexicographically as 1, 10, 11, 12, ..., 19, 2, 20, 21, ..., 9, 90, 91, ..., 99. To maintain the natural order of integers, the row key must be left-padded with zeros.
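The effect of zero-padding is easy to demonstrate in a few lines of Python (an illustration of byte-order sorting, not HBase code):

```python
nums = [1, 2, 9, 10, 11, 21]

# Raw string keys sort lexicographically, breaking numeric order:
raw = sorted(str(n) for n in nums)        # ['1', '10', '11', '2', '21', '9']

# Left-padding with zeros restores the natural integer order:
padded = sorted(str(n).zfill(4) for n in nums)
```

The same byte-order comparison is what HBase applies to row keys, which is why padded keys scan in numeric order.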

A read/write operation on a single row is atomic (no matter how many columns are read or written at a time). This design decision makes it easy for users to reason about the program's behavior when performing concurrent update operations on the same row.

(2) column family

Each column in an HBase table belongs to a column family. A column family is part of the table's schema (whereas a column is not) and must be defined before the table is used. All column names are prefixed with their column family; for example, courses:history and courses:math both belong to the courses column family.

Access control and disk and memory usage statistics are all performed at the column-family level. In practical applications, permissions on column families help us manage different types of applications: we can allow some applications to add new basic data, some to read basic data and create derived column families, and some only to browse data (or, for privacy reasons, not even to browse all of the data).

(3) Cell

In HBase, the storage unit identified by a row and a column is called a cell. A cell is uniquely determined by {row key, column (= family + qualifier), version}. The data in a cell has no type and is stored as raw bytes.

(4) timestamp

Each cell can store multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds; it can also be assigned explicitly by the client. To avoid version conflicts, the application must generate unique timestamps itself. Within each cell, versions of the data are sorted in reverse chronological order, so the latest data comes first.
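These version semantics can be modeled in a few lines of Python (a toy model, not the HBase API): a cell is a map from timestamp to value, a put only adds a version, and reads return versions newest-first.

```python
def put(cell, value, ts):
    # An "update" only inserts a new version; old versions remain.
    cell[ts] = value

def get(cell, versions=1):
    # Versions are returned in reverse chronological order: latest first.
    return [cell[ts] for ts in sorted(cell, reverse=True)[:versions]]

cell = {}
put(cell, "abc", 100)    # written at t1 = 100
put(cell, "gdxdf", 200)  # written later at t2 = 200
```

A plain get sees only the newest value, matching the table example above where HBase returns the latest-timestamped value to the requester.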

To avoid the management burden (both storage and indexing) of too many data versions, HBase provides two version-recycling methods: keep only the last n versions of the data, or keep only the versions from a recent period (for example, the last seven days). Either can be set per column family.
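Both recycling policies can be sketched as a toy pruning function (illustration only; in HBase these limits are configured per column family and enforced internally, not called by the user):

```python
def prune(cell, max_versions=None, min_ts=None):
    """Keep only the newest `max_versions` versions and/or
    only versions at or after the `min_ts` cutoff (a toy TTL)."""
    keep = sorted(cell, reverse=True)      # newest timestamps first
    if max_versions is not None:
        keep = keep[:max_versions]         # policy 1: last n versions
    if min_ts is not None:
        keep = [ts for ts in keep if ts >= min_ts]  # policy 2: recent period
    return {ts: cell[ts] for ts in keep}
```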

3. Basic usage of HBase shell

HBase provides a shell for user interaction. Run the hbase shell command to enter the command-line interface, and run the help command to view usage information for the commands.

The usage of the HBase shell is demonstrated below with a student score table as an example.

Name   Grade   Course
               Math   Art
Tom    5       97     87
Jim    4       89     80

Here, grade is a column family with only a single column for the table; course is a column family with two columns, consisting of math and art. Of course, we can create more columns under course as needed, such as computer or physics, by adding them to the course column family.

(1) Create a table scores with two column families, grade and course:

hbase(main):001:0> create 'scores', 'grade', 'course'

You can use the list command to view the tables in the current HBase instance, and the describe command to view a table's structure. (Remember to enclose all table names and column names in quotation marks.)

(2) Insert values based on the designed table structure:

put 'scores', 'Tom', 'grade:', '5'
put 'scores', 'Tom', 'course:math', '97'
put 'scores', 'Tom', 'course:art', '87'
put 'scores', 'jim', 'grade', '4'
put 'scores', 'jim', 'course:math', '89'
put 'scores', 'jim', 'course:art', '80'

This makes the table structure quite free: sub-columns can be added within a column family very conveniently. If a column family has no sub-columns, the colon after its name can be included or omitted.

The put command is relatively simple; it has only this one form:

hbase> put 't1', 'r1', 'c1', 'value', ts1

Here t1 is the table name, r1 the row key, c1 the column name, and value the cell value; ts1 is the timestamp, which is generally omitted.

(3) Query data based on key values

get 'scores', 'jim'
get 'scores', 'jim', 'grade'

You may have noticed the pattern: an HBase shell operation is roughly the operation keyword, followed by the table name, row name, and column name; any further conditions are added in curly brackets.
The get command is used as follows:

hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']

(4) Scan all data

scan 'scores'

You can also specify modifiers: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS. With no modifiers, as in the example above, all data rows are displayed.

Example:

hbase> scan '.META.'

hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter (123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
A filter can be specified in two ways:

A. Using a filter string; more information on this is available in the filter language document attached to the HBASE-4176 JIRA issue.
B. Using the full package name of the filter.

There is also a CACHE_BLOCKS modifier that switches the scan cache on or off. It is enabled by default (CACHE_BLOCKS => true); you can choose to disable it (CACHE_BLOCKS => false).

(5) Delete specified data

delete 'scores', 'jim', 'grade'

delete 'scores', 'jim'
The delete command does not vary much either; there is only one form:

hbase> delete 't1', 'r1', 'c1', ts1

In addition, there is a deleteall command that can delete an entire row range; use it with caution!
If you need to empty a full table, use the truncate command. In fact, there is no direct full-table deletion command; truncate is a combination of the disable, drop, and create commands.

(6) Modify the table structure

disable 'scores'
alter 'scores', NAME => 'info'
enable 'scores'

Use the alter command as follows (if it does not succeed in your version, disable the table first with disable):
A. Change or add a column family:

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

B. Delete a column family:

hbase> alter 't1', NAME => 'f1', METHOD => 'delete'
hbase> alter 't1', 'delete' => 'f1'

C. You can also modify table attributes such as MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH:

hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '123'
D. You can add a table coprocessor:

hbase> alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'

Multiple coprocessors can be configured for a table, and a sequence number is automatically incremented to identify each one. To load a coprocessor (a filter program), the value must follow this format:

[Coprocessor jar file location] | class name | [priority] | [arguments]

E. Remove a table attribute or coprocessor as follows:

hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'

F. Multiple alter operations can be executed in a single command:

hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}

(7) Count the number of rows:

hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000

count is generally time-consuming; for large tables, counting via MapReduce is recommended. While scanning, the command caches rows as it goes (CACHE, 10 rows by default) and reports the current count at intervals (INTERVAL, every 1000 rows by default).
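The meaning of INTERVAL can be illustrated with a toy Python model of the count command (an illustration only, not the real implementation):

```python
def count_rows(rows, interval=1000):
    """Toy model of the shell `count`: scan every row,
    reporting progress every `interval` rows."""
    n = 0
    for row in rows:
        n += 1
        if n % interval == 0:
            print(f"Current count: {n}, row: {row!r}")
    return n
```

With interval=10 over 25 rows, progress is printed at rows 10 and 20 and the final count is returned.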


(8) disable and enable operations
Many operations require the table to be made unavailable first, such as the alter operation mentioned above; deleting a table also requires it. disable_all and enable_all can operate on multiple tables at once.

(9) Delete a table
First disable the table, then run the drop command:

drop 't1'

The above is a detailed explanation of some common commands. All of the HBase shell commands are listed below, divided into several command groups. As you can see, the command names are intuitive English; use help 'cmd' for the detailed usage of each command.

Command groups:
Group name: general
Commands: status, version

Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters

Group name: dml
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate

Group name: tools
Commands: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump

Group name: replication
Commands: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication

Group name: security
Commands: grant, revoke, user_permission


4. HBase shell scripts

Since these are shell commands, you can of course also write all of your HBase shell commands into a file and execute them in sequence, just like a Linux shell script. Write the commands into a file, then run:

$ hbase shell test.hbaseshell

Easy to use.
