1. Is there an efficient CSV export tool?
Phoenix provides a BulkLoad tool that lets users efficiently import large volumes of data into HBase through Phoenix. Does Phoenix also provide a tool class for efficiently exporting data to CSV?
Some readers may wonder whether they can simply export the data the usual HBase way: writing their own Java code, using HBase's native tool classes, or using the HBase loader provided by Pig. Whether that works depends on the data types of the columns in your Phoenix table. If a column does not use a VARCHAR, CHAR, or UNSIGNED_* type, or if the table is a salted table, exporting around Phoenix will inevitably produce incorrect data. The reason is that Phoenix serializes most data types to bytes differently from native HBase. For example, a Phoenix salted table inserts a hash value into the first byte of the rowkey to distribute data evenly across regions, so exporting with the regular HBase export tool is bound to yield incorrect rowkeys.
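To make the salting point concrete, here is a minimal sketch of a salted table definition (the table and column names are illustrative; SALT_BUCKETS is the Phoenix table option that enables salting):
CREATE TABLE USER (
    ID BIGINT NOT NULL PRIMARY KEY,
    NAME VARCHAR,
    AGE INTEGER
) SALT_BUCKETS = 4;
Every rowkey in this table carries one extra leading salt byte that Phoenix adds transparently; a native HBase export tool knows nothing about it and will emit it as part of the key.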
2. Pig Loader -- the best and only tool for exporting Phoenix data to CSV
Fortunately, Phoenix does officially provide an efficient export tool class, although it depends on Pig. And in our testing, it proved to be the only tool that can fully and correctly export Phoenix table data.
Introducing and using Pig is not the focus of this article; readers who have not encountered it before can look it up on Baidu or Google.
An introduction to the Phoenix Pig integration is available at:
https://phoenix.apache.org/pig_integration.html
That page describes two tools: one for importing large amounts of data, similar to the BulkLoad tool, and another for exporting massive amounts of data. Here we focus on the export side.
The export tool is called Pig Loader. According to the official site:
A Pig data loader allows users to read data from Phoenix backed HBase tables within a Pig script.
In other words, we can write a Pig script and use the loader class provided by the phoenix-pig module (Phoenix's Pig integration) to export massive amounts of data.
Pig Loader supports two forms of export:
2.1 Export using Table
The first is to export an entire table's data by specifying the HBase table name. For example, to export all records of the USER table, use the following script statement:
rows = load 'hbase://table/USER' using org.apache.phoenix.pig.PhoenixHBaseLoader('${zookeeper.quorum}');
${zookeeper.quorum} must be replaced with the ZooKeeper quorum hosts plus the port, e.g. master,slave1,slave2:2181 (or supplied at run time, as sketched below).
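The quorum can also be passed in through Pig's parameter substitution rather than by editing the script. A minimal sketch, assuming the load statement references the parameter as ${zk} (a name chosen here for illustration):
pig -x mapreduce -param zk=master,slave1,slave2:2181 example1.pig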
Of course we can also precisely control which columns of the table are exported:
rows = load 'hbase://table/USER/ID,NAME' using org.apache.phoenix.pig.PhoenixHBaseLoader('${zookeeper.quorum}');
The above statement exports all records of the USER table, but includes only the ID and NAME columns.
2.2 Export using Query
The other is to control the exported data by specifying a query statement:
rows = load 'hbase://query/SELECT ID,NAME FROM USER WHERE AGE > 50' using org.apache.phoenix.pig.PhoenixHBaseLoader('${zookeeper.quorum}');
Note
There are significant restrictions on exporting with a query statement: GROUP BY, LIMIT, ORDER BY, and DISTINCT cannot be specified, nor can aggregate functions such as COUNT or SUM. If you need an aggregate, compute it in Pig instead, as sketched below.
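For instance, since COUNT is not allowed inside the query string, one workaround is to pull the raw rows through the loader and aggregate them in Pig itself. A minimal sketch, reusing the USER table from the examples above:
rows = load 'hbase://query/SELECT ID FROM USER WHERE AGE > 50' using org.apache.phoenix.pig.PhoenixHBaseLoader('${zookeeper.quorum}');
-- GROUP ... ALL collects every row into a single bag so COUNT can run over it
grouped = GROUP rows ALL;
total = FOREACH grouped GENERATE COUNT(rows);
DUMP total;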
3. Usage examples
Here we demonstrate the two export methods through two complete usage examples: Example 1 exports by specifying the table, and Example 2 exports by specifying a query.
3.1 Example 1
vi example1.pig
REGISTER /data/phoenix-default/phoenix-4.6.0-HBase-1.0-client.jar;
-- load the entire USER table through Phoenix
rows = load 'hbase://table/USER' USING org.apache.phoenix.pig.PhoenixHBaseLoader('master,slave1,slave2:2181');
-- write the rows out as a comma-separated file
STORE rows INTO 'USER.csv' USING PigStorage(',');
Execute the shell command:
pig -x mapreduce example1.pig
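Note that because the job runs on MapReduce, 'USER.csv' is an HDFS output directory containing part files, not a single local file. A sketch of merging it into one local CSV (paths are placeholders):
hdfs dfs -getmerge USER.csv ./user.csv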
3.2 Example 2
vi example2.pig
REGISTER /data/phoenix-default/phoenix-4.6.0-HBase-1.0-client.jar;
-- load only the ID and NAME columns via a Phoenix query
rows = load 'hbase://query/SELECT ID,NAME FROM USER' USING org.apache.phoenix.pig.PhoenixHBaseLoader('master,slave1,slave2:2181');
-- write the rows out as a comma-separated file
STORE rows INTO 'USER.csv' USING PigStorage(',');
Execute the shell command:
pig -x mapreduce example2.pig