HBase Concept Learning (7): Integration of HBase and MapReduce


This article is based on the examples in the HBase authoritative guide (HBase: The Definitive Guide), but with slight differences.

The integration of HBase and MapReduce essentially means using HBase tables as the input or output of MapReduce jobs, or as a medium for sharing data between MapReduce jobs.

This article will explain two examples:

1. Read TXT text data stored on HDFS and store each line in an HBase table as a JSON string.

2. Read the JSON strings stored in the HBase table in step 1, parse them, and store the results in a new HBase table for querying.

This article details the source code and how to execute it, aiming to deepen understanding of the integration between HBase and MapReduce.

If you do not yet know how to build an HBase standalone environment based on HDFS, or how to execute MapReduce jobs, refer to these two articles first:

(1) HBase Environment Setup (1): Standalone Mode Based on the Hadoop File System in Ubuntu

(2) Hadoop Basic Learning (1): Analyze, Compile, and Execute the WordCount Word Frequency Statistics Program


1. Read TXT text data stored on HDFS and store each line in an HBase table as a JSON string.

Source code:

/**
 * @author Ji Yiqin
 * @date 2014-6
 * @reference HBase: The Definitive Guide, Chapter 7
 */
import java.io.IOException;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class HdfsToHbase {

    private static final Log LOG = LogFactory.getLog(HdfsToHbase.class);

    public static final String NAME = "ImportFromFile";

    public enum Counters { LINES }

    /**
     * Map class
     */
    static class ImportMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Writable> {

        private byte[] family = null;
        private byte[] qualifier = null;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Get the column name passed in through the configuration
            String columns = context.getConfiguration().get("conf.column");
            // Parse the column family and qualifier
            byte[][] columnsBytes = KeyValue.parseColumn(Bytes.toBytes(columns));
            family = columnsBytes[0];
            qualifier = columnsBytes[1];
            LOG.info("family: " + Bytes.toString(family)
                    + " qualifier: " + Bytes.toString(qualifier));
        }

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException {
            try {
                String lineStr = line.toString();
                byte[] rowkey = DigestUtils.md5(lineStr);
                // Construct the Put object
                Put put = new Put(rowkey);
                put.add(family, qualifier, Bytes.toBytes(lineStr));
                // Emit the Put object
                context.write(new ImmutableBytesWritable(rowkey), put);
                context.getCounter(Counters.LINES).increment(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Parses the command-line arguments into a CommandLine object
     * @param args
     * @return
     * @throws ParseException
     */
    private static CommandLine parseArgs(String[] args) throws ParseException {
        Options options = new Options();
        Option o = new Option("t", "table", true, "table to import into (must exist)");
        o.setArgName("table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("c", "column", true, "column to store row data into (must exist)");
        o.setArgName("family:qualifier");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("i", "input", true, "the directory or file to read from");
        o.setArgName("path-in-HDFS");
        o.setRequired(true);
        options.addOption(o);
        CommandLineParser parser = new PosixParser();
        CommandLine cmd = null;
        try {
            cmd = parser.parse(options, args);
        } catch (Exception e) {
            System.err.println("ERROR: " + e.getMessage() + "\n");
            HelpFormatter formatter = new HelpFormatter();
            formatter.printHelp(NAME + " ", options, true);
            System.exit(-1);
        }
        return cmd;
    }

    /**
     * Main function
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Parse the command-line arguments into a CommandLine object
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmd = parseArgs(otherArgs);
        // Retrieve the argument values
        String tableName = cmd.getOptionValue("t");
        String inputFileName = cmd.getOptionValue("i");
        String columnName = cmd.getOptionValue("c");
        conf.set("conf.column", columnName);

        Job job = new Job(conf, "Import from file " + inputFileName + " into table " + tableName);
        job.setJarByClass(HdfsToHbase.class);
        // Set the map and reduce classes
        job.setMapperClass(ImportMapper.class);
        job.setNumReduceTasks(0);
        // Set the key/value output types of the map stage
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Writable.class);
        // Set the output format of the job
        job.setOutputFormatClass(TableOutputFormat.class);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
        // Set the input path
        FileInputFormat.addInputPath(job, new Path(inputFileName));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The imported jar files include:


This was developed in Eclipse. The class is placed in the default package and exported as a normal JAR file.

Then start Hadoop and HBase with the commands start-all.sh and start-hbase.sh, respectively.


(1) First, log on to the HBase shell and create a table that contains only one column family:
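A minimal HBase shell sketch of this step; the table name 'article' and column family name 'data' are assumptions for illustration (the original post's screenshots carry the actual names):

create 'article', 'data'
list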



(2) Upload the TXT data to HDFS (the data is included in the source code package of the HBase authoritative guide).
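A sketch of the upload, assuming the sample file from the book's source package is named test-data.txt and an HDFS input directory of /input (both names are assumptions):

hadoop fs -mkdir /input
hadoop fs -put test-data.txt /input/test-data.txt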



(3) Then run the job:


The command specifies the class name that contains the main function, followed by the HBase table name, the HDFS file name, and the HBase table column name.
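For reference, the command might look roughly like this, assuming the program was exported as hdfstohbase.jar and using the table, file, and column names from the examples above (all of these names are assumptions):

hadoop jar hdfstohbase.jar HdfsToHbase -t article -i /input/test-data.txt -c data:json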

After the job starts, you can view its running status at http://localhost:50030/jobtracker.jsp.

Then you can log on to the HBase shell to view the number of rows in the article table, or use scan to print all rows.
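For example, in the HBase shell (the table name 'article' is an assumption):

count 'article'
scan 'article'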


2. Read the JSON strings stored in the HBase table in step 1, parse them, and store the results in a new HBase table for querying.

Source code:

/**
 * @author Ji Yiqin
 * @date 2014-6
 * @reference HBase: The Definitive Guide, Chapter 7
 */
import java.io.IOException;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class HbaseToHbase {

    private static final Log LOG = LogFactory.getLog(HbaseToHbase.class);

    public static final String NAME = "HbaseToHbase";

    public enum Counters { ROWS, COLS, ERROR, VALID }

    /**
     * Map class.
     * An HBase table is used as the input, so it extends TableMapper.
     */
    static class ParseMapper
            extends TableMapper<ImmutableBytesWritable, Writable> {

        private JSONParser parser = new JSONParser();
        private byte[] family = null;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            family = Bytes.toBytes(context.getConfiguration().get("conf.family"));
        }

        @Override
        public void map(ImmutableBytesWritable rowkey, Result columns, Context context)
                throws IOException {
            String value = null;
            try {
                String author = "null";
                Put put = new Put(rowkey.get());
                // Iterate over each column (here there is actually only one column, storing the JSON string)
                for (KeyValue kv : columns.list()) {
                    context.getCounter(Counters.COLS).increment(1);
                    value = Bytes.toStringBinary(kv.getValue());
                    // Parse the retrieved JSON string
                    JSONObject json = (JSONObject) parser.parse(value);
                    for (Object key : json.keySet()) {
                        Object val = json.get(key);
                        if (key.equals("author")) {
                            author = val.toString();
                        }
                        put.add(family, Bytes.toBytes(key.toString()), Bytes.toBytes(val.toString()));
                    }
                }
                // Emit with the parsed author as the row key
                context.write(new ImmutableBytesWritable(Bytes.toBytes(author)), put);
                context.getCounter(Counters.VALID).increment(1);
                LOG.info("Stored data for author " + author + "!");
            } catch (Exception e) {
                e.printStackTrace();
                System.err.println("Error: " + e.getMessage() + ", row: "
                        + Bytes.toStringBinary(rowkey.get()) + ", JSON: " + value);
                context.getCounter(Counters.ERROR).increment(1);
            }
        }
    }

    /**
     * Parses the command-line arguments into a CommandLine object
     * @param args
     * @return
     * @throws ParseException
     */
    private static CommandLine parseArgs(String[] args) throws ParseException {
        Options options = new Options();
        Option o = new Option("i", "input", true, "table to read from (must exist)");
        o.setArgName("input-table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("ic", "column", true, "column to read data from (must exist)");
        o.setArgName("family:qualifier");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("o", "output", true, "table to write to (must exist)");
        o.setArgName("output-table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("oc", "family", true, "column family to write data to (must exist)");
        o.setArgName("family");
        o.setRequired(true);
        options.addOption(o);
        CommandLineParser parser = new PosixParser();
        CommandLine cmd = null;
        try {
            cmd = parser.parse(options, args);
        } catch (Exception e) {
            System.err.println("ERROR: " + e.getMessage() + "\n");
            HelpFormatter formatter = new HelpFormatter();
            formatter.printHelp(NAME + " ", options, true);
            System.exit(-1);
        }
        return cmd;
    }

    /**
     * Main function
     * @param args
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmd = parseArgs(otherArgs);
        String inputTable = cmd.getOptionValue("i");           // HBase source table
        String outputTable = cmd.getOptionValue("o");          // HBase target table
        String inputColumn = cmd.getOptionValue("ic");         // column of the HBase source table
        String outputColumnFamily = cmd.getOptionValue("oc");  // column family of the HBase target table
        conf.set("conf.family", outputColumnFamily);

        // Provide a Scan instance to specify the column to scan
        Scan scan = new Scan();
        byte[][] colKey = KeyValue.parseColumn(Bytes.toBytes(inputColumn));
        scan.addColumn(colKey[0], colKey[1]);

        Job job = new Job(conf, "Parse data in " + inputTable + ", write to " + outputTable);
        job.setJarByClass(HbaseToHbase.class);
        // Configure the job to use HBase as both the input source and the output sink
        TableMapReduceUtil.initTableMapperJob(inputTable, scan, ParseMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob(outputTable, IdentityTableReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note:

(1) When using an HBase table as the input of a MapReduce job, the mapper must extend the TableMapper class, and a Scan instance must be provided to specify which records to scan as input.

(2) The reducer configured in this project is IdentityTableReducer. Like IdentityTableMapper, it simply passes key-value pairs on to the next stage without doing any real work, and it is not required for storing data in HBase tables. It can be replaced by a single statement: job.setNumReduceTasks(0).

In fact, while the job is running you should also see that the reduce progress stays at 0%.


The imported jar files include:



(1) Create an HBase table:
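A sketch of this step in the HBase shell; the table name authortable matches the scan command shown later in this post, while the column family name 'data' is an assumption:

create 'authortable', 'data'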



(2) Export the jar package:

Note: a third-party jar, json-simple, is introduced to parse the JSON strings.

The json-simple jar file can be downloaded here: http://www.java2s.com/Code/Jar/j/Downloadjsonsimple111jar.htm

Previously I got a copy from another website, which turned out to provide no parse(String) method, only parse(Reader). Converting the String to a StringReader and passing it in kept throwing errors in the job, so I switched to this version.

When a third-party jar is required by a MapReduce job, a ClassNotFoundException may be reported. The solutions include:

1. Deploy the dependent package to each TaskTracker.

This method is the easiest, but the package has to be deployed to every TaskTracker and may cause package pollution. For example, if application A and application B use the same library but with different versions, a conflict may occur.

2. Merge the dependent packages directly into the MapReduce job jar.

The problem with this method is that the merged jar may be large, and it is not convenient for upgrading the dependencies.

3. Use DistributedCache.

With this method, the packages are first uploaded to HDFS, which can be done once when the program starts, and the HDFS path is then added to the classpath when the job is submitted.
Demo:

$ bin/hadoop fs -copyFromLocal lib/protobuf-java-2.0.3.jar /myapp/protobuf-java-2.0.3.jar

// Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addFileToClassPath(new Path("/myapp/protobuf-java-2.0.3.jar"), job);

4. Finally, when there are many extension packages, method 3 becomes inconvenient. Let's take a look at another approach:

The Hadoop authoritative guide also describes how to handle jar packaging:

[Any dependent jar files must be packaged into the lib folder of the JAR file. (This is similar to a Java web application archive, or WAR file, except that in a WAR file the jar files are placed under the WEB-INF/lib subfolder.)]

I am using this fourth method: creating a lib directory under the project and putting json-simple-1.1.1.jar in it:


Then export:
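After exporting, you can verify that the dependency is bundled under lib/ inside the jar; the listing should include lib/json-simple-1.1.1.jar (the jar name hbasetohbase.jar is an assumption):

jar tf hbasetohbase.jar | grep json-simple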



(3) Run the job:
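A sketch of the command, assuming the jar name hbasetohbase.jar and the source table, column, and target column family names used in the earlier examples (all assumptions); the target table authortable matches the scan command below:

hadoop jar hbasetohbase.jar HbaseToHbase -i article -ic data:json -o authortable -oc data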


OK. Now you can log on with the HBase shell and use scan 'authortable' to view the parsed data.


