Apache Phoenix is an open source SQL engine for HBase. Instead of the HBase client API, you can use the standard JDBC API to create tables, insert data, and query your HBase data.
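To make this concrete, here is a minimal JDBC sketch of my own (not from the official guide). It reuses the ZooKeeper quorum that appears in the examples later in this post; the example table and its contents are invented for illustration:

// Minimal Phoenix-over-JDBC sketch: with the Phoenix client jar on the
// classpath the driver registers itself, so only java.sql is needed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // JDBC URL format: jdbc:phoenix:<zookeeper quorum>[:<port>[:<znode parent>]]
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:phoenix:szb-l0023780:2181:/hbase114");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS example "
                + "(id BIGINT NOT NULL PRIMARY KEY, name VARCHAR)");
            stmt.executeUpdate("UPSERT INTO example VALUES (1, 'hello')");
            conn.commit(); // Phoenix connections do not auto-commit by default
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM example")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getString(2));
                }
            }
        }
    }
}

Note that Phoenix uses UPSERT rather than INSERT.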
To get up to speed with Apache Phoenix quickly, the official documentation offers a 15-minute introduction: http://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html
A friend of mine (thanks, blogger lnho2015) translated that page, and I have revised and extended the translation below.
1. Doesn't adding an extra layer between my program and HBase just slow things down?
Actually, no. Phoenix achieves performance as good as, and often better than, hand-written HBase client code (not to mention requiring far less code from you) by doing the following (see the EXPLAIN sketch after this list):
* Compiling your SQL queries into native HBase scans
* Determining the optimal start and stop keys for each scan
* Orchestrating the scans so they execute in parallel
* Bringing the computation to the data instead of moving the data
* Pushing the predicates in your WHERE clause down to server-side filters
* Executing aggregate queries through server-side hooks (called coprocessors)
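You don't have to take this on faith: Phoenix's EXPLAIN statement shows the plan a query compiles to. A minimal sketch using the us_population table created later in this post (the exact plan text varies by version):

-- Show the execution plan instead of running the query. Because state is the
-- leading primary-key column, the WHERE clause becomes the scan's start/stop
-- row keys rather than a full table scan.
EXPLAIN SELECT city, population FROM us_population WHERE state = 'CA';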
On top of that, we have added some interesting enhancements to optimize performance further (a DDL sketch follows this list):
* Secondary indexes to speed up queries on non-primary-key columns
* Statistics gathering to improve parallelization and guide the choice of execution plan
* A skip-scan filter to optimize IN, LIKE, and OR queries
* Optional salting of row keys to distribute write load evenly
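As a taste of what two of these features look like, here is a DDL sketch; the event_log table is invented for illustration:

-- Secondary index: lets queries that filter on the non-primary-key column
-- population avoid a full scan of the data table.
CREATE INDEX population_idx ON us_population (population);

-- Salted row key: SALT_BUCKETS prepends a hashed salt byte to each row key,
-- pre-splitting the table so sequential writes spread across region servers.
CREATE TABLE IF NOT EXISTS event_log (
    created_date DATE NOT NULL,
    event_id BIGINT NOT NULL
    CONSTRAINT pk PRIMARY KEY (created_date, event_id)
) SALT_BUCKETS=16;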
2. OK, so it's fast. But why SQL? Isn't that something from the 70s?
One way to look at it: give people something they already know. What better way to motivate them to use HBase than offering JDBC and SQL, which:
* Reduce the amount of code users need to write
* Make performance optimizations transparent to the user
* Make it easy to use and integrate the large number of existing tools
3. But how can SQL support my favorite HBase techniques?
Using Phoenix doesn't mean you will never see HBase again. SQL is just a way of expressing the functionality you want; you don't have to work out how to reproduce every HBase trick in SQL terms. Check whether the Phoenix features that already exist, or are in progress, cover the HBase usage you like. Have ideas of your own? We would love to hear them: write up an issue, or join our mailing list.
Enough talk. How do I get started?
Great. Just follow the installation guide below (my HBase cluster runs hbase-1.1.5):
* Download and extract the release archive (apache-phoenix-4.8.0-HBase-1.1-bin.tar.gz):
tar -zxvf apache-phoenix-4.8.0-HBase-1.1-bin.tar.gz
* Copy the Phoenix server jar matching your HBase version into the lib directory of every node in the cluster:
cp phoenix-4.8.0-HBase-1.1-server.jar /var/lib/kylin/hbase-1.1.5/lib/
Then distribute it to the lib directory of each HBase node in the cluster:
scp phoenix-4.8.0-HBase-1.1-server.jar kylin@szb-l0023776:/var/lib/kylin/hbase/lib
scp phoenix-4.8.0-HBase-1.1-server.jar kylin@szb-l0023777:/var/lib/kylin/hbase/lib
scp phoenix-4.8.0-HBase-1.1-server.jar kylin@szb-l0023778:/var/lib/kylin/hbase/lib
scp phoenix-4.8.0-HBase-1.1-server.jar kylin@szb-l0023779:/var/lib/kylin/hbase/lib
* Restart your HBase cluster:
stop-hbase.sh
start-hbase.sh
* Add the Phoenix client jar to the classpath of your HBase client (a sketch follows this list)
* Download and configure SQuirreL as your SQL client so you can run ad-hoc SQL queries against your HBase cluster
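For the classpath step, one common approach is shown below; the paths are assumptions based on where I unpacked the 4.8.0 archive, so adjust them to your own layout:

# Make the Phoenix client jar visible to JDBC applications on this machine;
# the location assumes the archive was unpacked under /var/lib/kylin.
export PHOENIX_HOME=/var/lib/kylin/apache-phoenix-4.8.0-HBase-1.1-bin
export CLASSPATH="$CLASSPATH:$PHOENIX_HOME/phoenix-4.8.0-HBase-1.1-client.jar"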
4. I don't want to download and set up anything else!
Fair enough. You can write plain SQL scripts and execute them with the command-line tools that ship with Phoenix, instead of setting up the SQL client mentioned above. Let's look at an example.
To start, navigate to the bin directory of your Phoenix installation.
4.1 First, create a us_population.sql file containing the following table definition
CREATE TABLE IF NOT EXISTS us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city)
);
4.2 Next, create a us_population.csv file containing the data for the table
NY,New York,8143197
CA,Los Angeles,3844829
IL,Chicago,2842518
TX,Houston,2016582
PA,Philadelphia,1463281
AZ,Phoenix,1461575
TX,San Antonio,1256509
CA,San Diego,1255540
TX,Dallas,1213825
CA,San Jose,912332
4.3 Finally, create a us_population_queries.sql file containing the aggregate query below
SELECT state AS "State", COUNT(city) AS "City Count", SUM(population) AS "Population Sum"
FROM us_population
GROUP BY state
ORDER BY SUM(population) DESC;
4.4 Execute all three scripts from the command line
Note: I specify the full ZooKeeper connection for HBase, including the host, port, and znode parent. If the znode parent is not given, it defaults to /hbase.
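As I understand it, the quorum argument follows the pattern host:port:/znode-parent, and shorter forms fall back to the defaults (verify against your Phoenix version):

# host only: port defaults to 2181, znode parent to /hbase
./psql.py szb-l0023780 us_population.sql
# fully qualified, as used in the run below
./psql.py szb-l0023780:2181:/hbase114 us_population.sql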
[kylin@szb-l0023780 bin]$ ./psql.py szb-l0023780:2181:/hbase114 us_population.sql us_population.csv us_population_queries.sql
no rows upserted
Time: 2.845 sec(s)
csv columns from database.
CSV Upsert complete. 10 rows upserted
Time: 0.129 sec(s)
St  City Count  Population Sum
--  ----------  --------------
NY  1           8143197
CA  3           6012701
TX  3           4486916
IL  1           2842518
PA  1           1463281
AZ  1           1461575
Time: 0.077 sec(s)
That's it: with only a few lines of SQL you have created your first Phoenix table, loaded data into it, and run an aggregate query.
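For interactive, ad-hoc queries (as opposed to scripted runs), the same bin directory also ships sqlline.py, a JDBC shell that takes the same quorum argument:

# Interactive SQL shell; type SQL at the prompt once it connects.
./sqlline.py szb-l0023780:2181:/hbase114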
4.5 Performance test script performance.py
[kylin@szb-l0023780 bin]$ ./performance.py
Performance script arguments not specified. Usage: performance.sh <zookeeper> <row count>
Example: performance.sh localhost 100000
Let's load 10 million rows and see how it performs. During execution the script creates an HBase table named performance_10000000, upserts 10 million records, and then runs several queries against it.
[kylin@szb-l0023780 bin]$ ./performance.py szb-l0023780:2181:/hbase114 10000000
Phoenix Performance Evaluation Script 1.0
-----------------------------------------
Creating Performance Table ...
no rows upserted
Time: 2.343 sec(s)
Query # 1 - Count - SELECT COUNT(1) FROM performance_10000000;
Query # 2 - Group By First PK - SELECT HOST FROM performance_10000000 GROUP BY HOST;
Query # 3 - Group By Second PK - SELECT DOMAIN FROM performance_10000000 GROUP BY DOMAIN;
Query # 4 - Truncate + Group By - SELECT TRUNC(DATE,'DAY') DAY FROM performance_10000000 GROUP BY TRUNC(DATE,'DAY');
Query # 5 - Filter + Count - SELECT COUNT(1) FROM performance_10000000 WHERE CORE < 10;
Generating and upserting data ...
csv columns from database.
CSV Upsert complete. 10000000 rows upserted
Time: 565.593 sec(s)

COUNT(1)
----------------------------------------
10000000
Time: 8.206 sec(s)

HO
--
CS
EU
NA
Time: 0.416 sec(s)

DOMAIN
----------------------------------------
Apple.com
Google.com
Salesforce.com
Time: 13.134 sec(s)

DAY
-----------------------
2016-08-30 00:00:00.000
2016-08-31 00:00:00.000
2016-09-01 00:00:00.000
2016-09-02 00:00:00.000
2016-09-03 00:00:00.000
2016-09-04 00:00:00.000
......
2016-12-18 00:00:00.000
2016-12-19 00:00:00.000
2016-12-20 00:00:00.000
2016-12-21 00:00:00.000
2016-12-22 00:00:00.000
2016-12-23 00:00:00.000
2016-12-24 00:00:00.000
Time: 12.852 sec(s)

COUNT(1)
----------------------------------------
200745
Time: 11.01 sec(s)
One last tip: if you want to count the total number of rows in an HBase table, do not use the HBase shell's count command; it is single-threaded and slow. Use the MapReduce-based RowCounter instead:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'performance_10000000'
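Of course, once Phoenix is installed, the same total is one SQL statement away, and as the performance run above showed, the aggregation executes in parallel on the region servers via coprocessors:

SELECT COUNT(1) FROM performance_10000000;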