HBase Concept Learning (7): Integration of HBase and MapReduce


This article is based on the examples in the HBase authoritative guide (HBase: The Definitive Guide), but with slight differences.

The integration of HBase and MapReduce essentially means using HBase tables as the input or output of MapReduce jobs, or as a medium for sharing data between MapReduce jobs.

This article will explain two examples:

1. Read TXT text data stored on HDFS and store each line in an HBase table as a JSON string.

2. Read the JSON strings stored in the HBase table in step 1, parse them, and store the results in a new HBase table for querying.

This article details the source code and how to execute it, aiming to deepen understanding of the integration between HBase and MapReduce.

If you do not yet know how to build an HBase standalone environment based on HDFS, or how to execute MapReduce jobs, refer to these two articles first:

(1) HBase Environment Setup (1): Standalone Mode Based on the Hadoop File System in Ubuntu

(2) Hadoop Basic Learning (1): Analyze, Compile, and Execute the WordCount Word Frequency Statistics Program


1. Read TXT text data stored on HDFS and store each line in an HBase table as a JSON string.

Source code:

/**
 * @author Ji Yiqin
 * @date 2014-6
 * @reference HBase: The Definitive Guide, Chapter 7
 */
import java.io.IOException;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class HdfsToHbase {

    private static final Log LOG = LogFactory.getLog(HdfsToHbase.class);

    public static final String NAME = "ImportFromFile";

    public enum Counters { LINES }

    /**
     * Map class
     */
    static class ImportMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Writable> {

        private byte[] family = null;
        private byte[] qualifier = null;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Get the column name passed in through the configuration
            String columns = context.getConfiguration().get("conf.column");
            // Parse the column family and qualifier
            byte[][] columnsBytes = KeyValue.parseColumn(Bytes.toBytes(columns));
            family = columnsBytes[0];
            qualifier = columnsBytes[1];
            LOG.info("family: " + Bytes.toString(family)
                    + " qualifier: " + Bytes.toString(qualifier));
        }

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException {
            try {
                String lineStr = line.toString();
                byte[] rowkey = DigestUtils.md5(lineStr);
                // Construct the Put object
                Put put = new Put(rowkey);
                put.add(family, qualifier, Bytes.toBytes(lineStr));
                // Emit the Put object
                context.write(new ImmutableBytesWritable(rowkey), put);
                context.getCounter(Counters.LINES).increment(1);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Parses the command-line arguments into a CommandLine object
     * @param args
     * @return
     * @throws ParseException
     */
    private static CommandLine parseArgs(String[] args) throws ParseException {
        Options options = new Options();
        Option o = new Option("t", "table", true, "table to import into (must exist)");
        o.setArgName("table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("c", "column", true, "column to store row data into (must exist)");
        o.setArgName("family:qualifier");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("i", "input", true, "the directory or file to read from");
        o.setArgName("path-in-HDFS");
        o.setRequired(true);
        options.addOption(o);
        CommandLineParser parser = new PosixParser();
        CommandLine cmd = null;
        try {
            cmd = parser.parse(options, args);
        } catch (Exception e) {
            System.err.println("ERROR: " + e.getMessage() + "\n");
            HelpFormatter formatter = new HelpFormatter();
            formatter.printHelp(NAME + " ", options, true);
            System.exit(-1);
        }
        return cmd;
    }

    /**
     * Main function
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Parse the command-line arguments into a CommandLine object
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmd = parseArgs(otherArgs);
        // Retrieve the argument values
        String tableName = cmd.getOptionValue("t");
        String inputFileName = cmd.getOptionValue("i");
        String columnName = cmd.getOptionValue("c");
        conf.set("conf.column", columnName);

        Job job = new Job(conf, "Import from file " + inputFileName + " into table " + tableName);
        job.setJarByClass(HdfsToHbase.class);
        // Set the map and reduce classes
        job.setMapperClass(ImportMapper.class);
        job.setNumReduceTasks(0);
        // Set the key/value output types of the map stage
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Writable.class);
        // Set the output format of the job
        job.setOutputFormatClass(TableOutputFormat.class);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
        // Set the input path
        FileInputFormat.addInputPath(job, new Path(inputFileName));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The imported jar files include:


This was developed in Eclipse. The class is placed in the default package and exported as a normal JAR file.

Then start Hadoop and HBase with the commands start-all.sh and start-hbase.sh, respectively.


(1) First, log on to the HBase shell and create a table that contains only one column family:
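A minimal HBase shell sketch of this step; the table name 'article' and column family name 'data' are assumptions for illustration (the original post's screenshots carry the actual names):

create 'article', 'data'
list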



(2) Upload the TXT data to HDFS (the data is included in the source code package of the HBase authoritative guide).
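A sketch of the upload, assuming the sample file from the book's source package is named test-data.txt and an HDFS input directory of /input (both names are assumptions):

hadoop fs -mkdir /input
hadoop fs -put test-data.txt /input/test-data.txt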



(3) Then run the job:


The command specifies the class name that contains the main function, followed by the HBase table name, the HDFS file name, and the HBase table column name.
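For reference, the command might look roughly like this, assuming the program was exported as hdfstohbase.jar and using the table, file, and column names from the examples above (all of these names are assumptions):

hadoop jar hdfstohbase.jar HdfsToHbase -t article -i /input/test-data.txt -c data:json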

After the job starts, you can view its running status at http://localhost:50030/jobtracker.jsp.

Then you can log on to the HBase shell to view the number of rows in the article table, or use scan to print all rows.
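For example, in the HBase shell (the table name 'article' is an assumption):

count 'article'
scan 'article'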


2. Read the JSON strings stored in the HBase table in step 1, parse them, and store the results in a new HBase table for querying.

Source code:

/**
 * @author Ji Yiqin
 * @date 2014-6
 * @reference HBase: The Definitive Guide, Chapter 7
 */
import java.io.IOException;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class HbaseToHbase {

    private static final Log LOG = LogFactory.getLog(HbaseToHbase.class);

    public static final String NAME = "HbaseToHbase";

    public enum Counters { ROWS, COLS, ERROR, VALID }

    /**
     * Map class.
     * An HBase table is used as the input, so it extends TableMapper.
     */
    static class ParseMapper
            extends TableMapper<ImmutableBytesWritable, Writable> {

        private JSONParser parser = new JSONParser();
        private byte[] family = null;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            family = Bytes.toBytes(context.getConfiguration().get("conf.family"));
        }

        @Override
        public void map(ImmutableBytesWritable rowkey, Result columns, Context context)
                throws IOException {
            String value = null;
            try {
                String author = "null";
                Put put = new Put(rowkey.get());
                // Iterate over each column (here there is actually only one column, storing the JSON string)
                for (KeyValue kv : columns.list()) {
                    context.getCounter(Counters.COLS).increment(1);
                    value = Bytes.toStringBinary(kv.getValue());
                    // Parse the retrieved JSON string
                    JSONObject json = (JSONObject) parser.parse(value);
                    for (Object key : json.keySet()) {
                        Object val = json.get(key);
                        if (key.equals("author")) {
                            author = val.toString();
                        }
                        put.add(family, Bytes.toBytes(key.toString()), Bytes.toBytes(val.toString()));
                    }
                }
                // Emit with the parsed author as the row key
                context.write(new ImmutableBytesWritable(Bytes.toBytes(author)), put);
                context.getCounter(Counters.VALID).increment(1);
                LOG.info("Stored data for author " + author + "!");
            } catch (Exception e) {
                e.printStackTrace();
                System.err.println("Error: " + e.getMessage() + ", row: "
                        + Bytes.toStringBinary(rowkey.get()) + ", JSON: " + value);
                context.getCounter(Counters.ERROR).increment(1);
            }
        }
    }

    /**
     * Parses the command-line arguments into a CommandLine object
     * @param args
     * @return
     * @throws ParseException
     */
    private static CommandLine parseArgs(String[] args) throws ParseException {
        Options options = new Options();
        Option o = new Option("i", "input", true, "table to read from (must exist)");
        o.setArgName("input-table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("ic", "column", true, "column to read data from (must exist)");
        o.setArgName("family:qualifier");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("o", "output", true, "table to write to (must exist)");
        o.setArgName("output-table-name");
        o.setRequired(true);
        options.addOption(o);
        o = new Option("oc", "family", true, "column family to write data to (must exist)");
        o.setArgName("family");
        o.setRequired(true);
        options.addOption(o);
        CommandLineParser parser = new PosixParser();
        CommandLine cmd = null;
        try {
            cmd = parser.parse(options, args);
        } catch (Exception e) {
            System.err.println("ERROR: " + e.getMessage() + "\n");
            HelpFormatter formatter = new HelpFormatter();
            formatter.printHelp(NAME + " ", options, true);
            System.exit(-1);
        }
        return cmd;
    }

    /**
     * Main function
     * @param args
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmd = parseArgs(otherArgs);
        String inputTable = cmd.getOptionValue("i");           // HBase source table
        String outputTable = cmd.getOptionValue("o");          // HBase target table
        String inputColumn = cmd.getOptionValue("ic");         // column of the HBase source table
        String outputColumnFamily = cmd.getOptionValue("oc");  // column family of the HBase target table
        conf.set("conf.family", outputColumnFamily);

        // Provide a Scan instance to specify the column to scan
        Scan scan = new Scan();
        byte[][] colKey = KeyValue.parseColumn(Bytes.toBytes(inputColumn));
        scan.addColumn(colKey[0], colKey[1]);

        Job job = new Job(conf, "Parse data in " + inputTable + ", write to " + outputTable);
        job.setJarByClass(HbaseToHbase.class);
        // Configure the job to use HBase as both the input source and the output sink
        TableMapReduceUtil.initTableMapperJob(inputTable, scan, ParseMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob(outputTable, IdentityTableReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note:

(1) When using an HBase table as the input of a MapReduce job, the mapper must extend the TableMapper class, and a Scan instance must be provided to specify which records to scan as input.

(2) The reducer configured in this project is IdentityTableReducer. Like IdentityTableMapper, it simply passes key-value pairs on to the next stage without doing any real work, and it is not required for storing data in HBase tables. It can be replaced by a single statement: job.setNumReduceTasks(0).

In fact, while the job is running you should also see that the reduce progress stays at 0%.


The imported jar files include:



(1) Create an HBase table:
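A sketch of this step in the HBase shell; the table name authortable matches the scan command shown later in this post, while the column family name 'data' is an assumption:

create 'authortable', 'data'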



(2) Export the jar package:

Note: a third-party jar, json-simple, is introduced to parse the JSON strings.

The json-simple jar file can be downloaded here: http://www.java2s.com/Code/Jar/j/Downloadjsonsimple111jar.htm

Previously I got a copy from another website, which turned out to provide no parse(String) method, only parse(Reader). Converting the String to a StringReader and passing it in kept throwing errors in the job, so I switched to this version.

When a third-party jar is required by a MapReduce job, a ClassNotFoundException may be reported. The solutions include:

1. Deploy the dependent package to each TaskTracker.

This method is the easiest, but the package has to be deployed to every TaskTracker and may cause package pollution. For example, if application A and application B use the same library but with different versions, a conflict may occur.

2. Merge the dependent packages directly into the MapReduce job jar.

The problem with this method is that the merged jar may be large, and it is not convenient for upgrading the dependencies.

3. Use DistributedCache.

With this method, the packages are first uploaded to HDFS, which can be done once when the program starts, and the HDFS path is then added to the classpath when the job is submitted.
Demo:

$ bin/hadoop fs -copyFromLocal lib/protobuf-java-2.0.3.jar /myapp/protobuf-java-2.0.3.jar

// Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addFileToClassPath(new Path("/myapp/protobuf-java-2.0.3.jar"), job);

4. Finally, when there are many extension packages, method 3 becomes inconvenient. Let's take a look at another approach:

The Hadoop authoritative guide also describes how to handle jar packaging:

[Any dependent jar files must be packaged into the lib folder of the JAR file. (This is similar to a Java web application archive, or WAR file, except that in a WAR file the jar files are placed under the WEB-INF/lib subfolder.)]

I am using this fourth method: creating a lib directory under the project and putting json-simple-1.1.1.jar in it:


Then export:
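After exporting, you can verify that the dependency is bundled under lib/ inside the jar; the listing should include lib/json-simple-1.1.1.jar (the jar name hbasetohbase.jar is an assumption):

jar tf hbasetohbase.jar | grep json-simple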



(3) Run the job:
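A sketch of the command, assuming the jar name hbasetohbase.jar and the source table, column, and target column family names used in the earlier examples (all assumptions); the target table authortable matches the scan command below:

hadoop jar hbasetohbase.jar HbaseToHbase -i article -ic data:json -o authortable -oc data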


OK. Now you can log on with the HBase shell and use scan 'authortable' to view the parsed data.


