Recently, in a traffic-flow data analysis project, we had to handle a huge amount of urban traffic data. The data needed to be cleaned with MapReduce and imported into HBase for storage; a Hive external table was then associated with the HBase table so that the HBase data could be queried and statistically analyzed; the analysis results were saved into a Hive table; and finally Sqoop was used to export that table into MySQL. The overall process is roughly: raw data collection -> MapReduce cleaning -> HBase storage -> Hive external table association -> query and statistical analysis -> Hive result table -> Sqoop export -> MySQL.
Below I mainly cover the steps from associating the Hive table with HBase through exporting the Hive table to MySQL with Sqoop. Collecting the raw dataset, the MapReduce cleaning, and the web interface display are not described here.
I. HBase database table
hbase(main):003:0> list
TABLE
transtable
1 row(s) in 0.0250 seconds
=> ["transtable"]

hbase(main):004:0> describe 'transtable'
DESCRIPTION                                                                          ENABLED
 'transtable', {NAME => 'jtxx', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
 REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',                   true
 MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',
 BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.0480 seconds
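The creation command itself is not shown in this post; a minimal HBase shell sketch that would produce a table like this (default column-family options assumed) is:

hbase(main):001:0> create 'transtable', 'jtxx'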
This is an HBase table named "transtable" whose only column family is "jtxx". Some of the data in HBase looks like this:
hbase(main):008:0> get 'transtable', '32108417000000013220140317000701'
COLUMN                 CELL
 jtxx:cdbh             timestamp=1429597736296, value=03
 jtxx:clbj             timestamp=1429597736296, value=0
 jtxx:cllb             timestamp=1429597736296, value=0
 jtxx:cllx             timestamp=1429597736296, value=3
 jtxx:clsd             timestamp=1429597736296, value=127.00
 jtxx:hphm             timestamp=1429597736296, value=\xE8\x8B\x8FKYV152
 jtxx:wflx             timestamp=1429597736296, value=0
 jtxx:xsfx             timestamp=1429597736296, value=03
8 row(s) in 0.1550 seconds
II. Creating a Hive external table associated with the HBase table
create external table transjtxx_hbase (
  clxxbh string,
  xsfx string,
  cdbh string,
  hphm string,
  clsd string,
  cllx string,
  clbj string,
  cllb string,
  wflx string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,jtxx:xsfx,jtxx:cdbh,jtxx:hphm,jtxx:clsd,jtxx:cllx,jtxx:clbj,jtxx:cllb,jtxx:wflx")
tblproperties ("hbase.table.name" = "transtable");
The entries in hbase.columns.mapping must correspond to the column qualifiers under the column family of the HBase table transtable, with :key standing for the row key. Note that the table must be created as an external table here.
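The mapping is positional: the i-th Hive column is bound to the i-th entry of hbase.columns.mapping. Spelled out as HiveQL comments:

-- Hive column -> HBase cell (positional, per the mapping string above):
-- clxxbh -> :key       (the HBase row key; per this post, intersection number + date/time)
-- xsfx   -> jtxx:xsfx
-- cdbh   -> jtxx:cdbh
-- hphm   -> jtxx:hphm
-- clsd   -> jtxx:clsd
-- cllx   -> jtxx:cllx
-- clbj   -> jtxx:clbj
-- cllb   -> jtxx:cllb
-- wflx   -> jtxx:wflx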
To check whether the association succeeded, execute a statement that queries data from the HBase table; if rows come back, the association works:
hive> select * from transjtxx_hbase where clxxbh like '321084170000000132%';
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1428394594787_0007, Tracking URL = http://secondmgt:8088/proxy/application_1428394594787_0007/
Kill Command = /home/hadoopuser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1428394594787_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-04-21 17:27:18,136 Stage-1 map = 0%, reduce = 0%
2015-04-21 17:27:35,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.31 sec
MapReduce Total cumulative CPU time: 12 seconds 310 msec
Ended Job = job_1428394594787_0007
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 12.31 sec   HDFS Read: 256   HDFS Write: 636   SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 310 msec
OK
32108417000000013220140317000701    03    03    苏KYV152    127.00    3    0    0    0
32108417000000013220140317000705    02    03    苏KRU593    127.00    2    0    0    0
32108417000000013220140317000857    03    02    苏KYL920    28.00     4    0    0    0
32108417000000013220140317001145    02    02    苏K19V75    136.00    6    0    0    0
32108417000000013220140317001157    02    02    鲁QV0897    150.00    4    0    0    0
32108417000000013220140317001726    02    02    苏KL2938    23.00     1    0    0    0
32108417000000013220140317001836    02    02    苏J5S373    142.00    4    0    0    0
32108417000000013220140317001844    02    02    苏KK8332    158.00    3    0    0    0
32108417000000013220140317002039    03    02    苏KK8820    17.00     0    0    0    0
32108417000000013220140317002206    03    03    苏KK8902    32.00     4    0    0    0
Time taken: 36.018 seconds, Fetched: 10 row(s)
III. Creating a Hive table to store the query and statistical analysis results
Because this is a simulated environment, I simply create a table with the same fields and types as the Hive-associated table transjtxx_hbase to hold the query results, as follows:
hive> create table temptrans
    > (clxxbh string,
    > xsfx string,
    > cdbh string,
    > hphm string,
    > clsd string,
    > cllx string,
    > clbj string,
    > cllb string,
    > wflx string
    > );
OK
Time taken: 0.112 seconds
IV. Inserting query results into the Hive table
Of the four ways to load data into a Hive table, here I insert data through a SQL query statement (more on this in another of my blog posts on the four ways to load Hive table data and customizing export column separators). As an example, I take the vehicles passing a single intersection. Because clxxbh consists of the intersection number plus the date, OVERWRITE cannot be used here: the statement is executed repeatedly in a loop, one run per intersection, and data imported by earlier runs must be preserved, so INTO must be used, as follows (a looped version is sketched after the job output below):
hive> insert into table temptrans select * from transjtxx_hbase where clxxbh like '321084170000000133%';
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1428394594787_0009, Tracking URL = http://secondmgt:8088/proxy/application_1428394594787_0009/
Kill Command = /home/hadoopuser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1428394594787_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-04-21 19:05:03,398 Stage-1 map = 0%, reduce = 0%
2015-04-21 19:05:24,091 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.71 sec
MapReduce Total cumulative CPU time: 12 seconds 710 msec
Ended Job = job_1428394594787_0009
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2015-04-21_19-04-48_325_2835499611469580351-1/-ext-10000
Loading data to table hive.temptrans
Table hive.temptrans stats: [numFiles=2, numRows=12, totalSize=1380, rawDataSize=732]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 12.71 sec   HDFS Read: 256   HDFS Write: 815   SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 710 msec
OK
Time taken: 37.229 seconds
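Since only the leading intersection number changes between runs, the same statement can be driven by a loop. A minimal shell sketch, assuming the hive CLI is on the PATH (jkbh is just an illustrative variable name, and the two intersection IDs are the ones from this example):

for jkbh in 321084170000000132 321084170000000133; do
    hive -e "insert into table temptrans select * from transjtxx_hbase where clxxbh like '${jkbh}%';"
done

After the two batches have been loaded, temptrans contains: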
hive> select * from temptrans;
OK
32108417000000013220140317000701    03    03    苏KYV152    127.00    3    0    0    0
32108417000000013220140317000705    02    03    苏KRU593    127.00    2    0    0    0
32108417000000013220140317000857    03    02    苏KYL920    28.00     4    0    0    0
32108417000000013220140317001145    02    02    苏K19V75    136.00    6    0    0    0
32108417000000013220140317001157    02    02    鲁QV0897    150.00    4    0    0    0
32108417000000013220140317001726    02    02    苏KL2938    23.00     1    0    0    0
32108417000000013220140317001836    02    02    苏J5S373    142.00    4    0    0    0
32108417000000013220140317001844    02    02    苏KK8332    158.00    3    0    0    0
32108417000000013220140317002039    03    02    苏KK8820    17.00     0    0    0    0
32108417000000013220140317002206    03    03    苏KK8902    32.00     4    0    0    0
32108417000000013320140317000120    02    02    苏KRW076    0.00      7    0    0    0
32108417000000013320140317000206    00    02    苏AHF730    0.00      4    0    0    0
32108417000000013320140317000207    02    02    苏KYJ792    0.00      6    0    0    0
32108417000000013320140317000530    00    01    苏K53T85    0.00      1    0    0    0
32108417000000013320140317000548    03    01    苏KR0737    0.00      7    0    0    0
32108417000000013320140317000605    03    02    苏KYU203    0.00      1    0    0    0
32108417000000013320140317000659    01    02    苏K3R762    0.00      4    0    0    0
32108417000000013320140317001042    02    03    苏KYK578    0.00      6    0    0    0
32108417000000013320140317001222    02    03    苏KK8385    0.00      2    0    0    0
32108417000000013320140317001418    02    03    苏K26F89    0.00      7    0    0    0
32108417000000013320140317001538    02    03    苏KK8987    0.00      5    0    0    0
32108417000000013320140317001732    01    01    苏KYB127    0.00      7    0    0    0
Time taken: 0.055 seconds, Fetched: 22 row(s)
V. Creating the MySQL database and table that the data will be imported into
mysql> create database transport;
Query OK, 1 row affected (0.00 sec)

mysql> use transport;
Database changed

mysql> create table jtxx (
    -> clxxbh varchar(32) not null primary key,
    -> xsfx varchar(2),
    -> cdbh varchar(4),
    -> hphm varchar(32),
    -> clsd varchar(32),
    -> cllx varchar(2),
    -> clbj varchar(8),
    -> cllb varchar(8),
    -> wflx varchar(8)
    -> );
Query OK, 0 rows affected (0.04 sec)

mysql> show tables;
+---------------------+
| Tables_in_transport |
+---------------------+
| jtxx                |
+---------------------+
1 row in set (0.00 sec)

mysql> select * from jtxx;
Empty set (0.00 sec)
When creating the MySQL table, be sure that the field names correspond to those of the Hive table.
VI. Using Sqoop to export the Hive table data to MySQL
Running the Sqoop export command at first, without specifying the field delimiter, reports the following error:
15/04/21 19:38:52 INFO mapreduce.Job: Task Id : attempt_1428394594787_0010_m_000001_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.util.NoSuchElementException
        at java.util.ArrayList$Itr.next(ArrayList.java:834)
        at jtxx.__loadFromFields(jtxx.java:387)
        at jtxx.parse(jtxx.java:336)
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
        ... 10 more
The reason for this error is that the wrong field delimiter was assumed for the Hive table's data, so Sqoop could not parse the records. If the data is the rollup result of a MapReduce job run by Hive, the default delimiter is '\001'; if it was imported into HDFS from a plain file, the delimiter is usually '\t'. Here the data is the result of Hive's MapReduce analysis, so the delimiter is '\001'. Modify the Sqoop command to specify the delimiter explicitly:
[hadoopuser@secondmgt ~]$ sqoop-export --connect jdbc:mysql://secondmgt:3306/transport --username hive --password hive --table jtxx --export-dir /hive/warehouse/hive.db/temptrans --input-fields-terminated-by '\001'
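If you are unsure which delimiter a table's files actually use, you can dump a few bytes of the underlying HDFS file; the '\001' delimiter shows up as octal 001 in od -c output. A quick check, assuming a data file named 000000_0 under the table directory (the actual file names may differ):

hadoop fs -cat /hive/warehouse/hive.db/temptrans/000000_0 | head -n 1 | od -c | head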
Note:
With this Sqoop export command, execution succeeds as long as the corresponding MySQL table is empty. But once the table has data, that is, once some records in the Hive table being exported share a key with records already in the MySQL table, the job hangs and never makes progress. The Hadoop task interface then shows most of the memory occupied and the queue at 100%, as follows:
[hadoopuser@secondmgt ~]$ sqoop-export --connect jdbc:mysql://secondmgt:3306/transport --username hive --password hive --table jtxx --export-dir /hive/warehouse/hive.db/temptrans --input-fields-terminated-by '\001'
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail. Please set $HCAT_HOME to the root of your HCatalog installation.
15/04/21 20:08:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/04/21 20:08:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/04/21 20:08:28 INFO tool.CodeGenTool: Beginning code generation
15/04/21 20:08:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `jtxx` AS t LIMIT 1
15/04/21 20:08:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `jtxx` AS t LIMIT 1
15/04/21 20:08:29 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopuser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopuser/compile/67173774b957b511b4d62bc4ebe56e23/jtxx.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/04/21 20:08:30 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopuser/compile/67173774b957b511b4d62bc4ebe56e23/jtxx.jar
15/04/21 20:08:30 INFO mapreduce.ExportJobBase: Beginning export of jtxx
15/04/21 20:08:30 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopuser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopuser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/04/21 20:08:30 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/04/21 20:08:31 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/04/21 20:08:31 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/04/21 20:08:31 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/04/21 20:08:31 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/04/21 20:08:32 INFO input.FileInputFormat: Total input paths to process : 2
15/04/21 20:08:32 INFO input.FileInputFormat: Total input paths to process : 2
15/04/21 20:08:32 INFO mapreduce.JobSubmitter: number of splits:3
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/04/21 20:08:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
15/04/21 20:08:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/04/21 20:08:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/04/21 20:08:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
15/04/21 20:08:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/04/21 20:08:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1428394594787_0016
15/04/21 20:08:33 INFO impl.YarnClientImpl: Submitted application application_1428394594787_0016 to ResourceManager at secondmgt/192.168.2.133:8032
15/04/21 20:08:33 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1428394594787_0016/
15/04/21 20:08:33 INFO mapreduce.Job: Running job: job_1428394594787_0016
The Sqoop task cannot be completed; it stays stuck in the state above and never continues. Viewing the Hadoop task interface shows the queue 100% occupied:
Workaround:
1. See which jobs Hadoop is running: hadoop job -list
2. Kill the zombie job: hadoop job -kill [job-id]
3. Modify the Sqoop export command as follows:
sqoop-export --connect jdbc:mysql://secondmgt:3306/transport --username hive --password hive --table jtxx --update-key clxxbh --update-mode allowinsert --export-dir /hive/warehouse/hive.db/temptrans --input-fields-terminated-by '\001'
Two parameters were added: --update-key clxxbh and --update-mode allowinsert. The former says that if an imported record has the same key (clxxbh) as a record already in the MySQL database, that row is updated; the latter says that records which do not yet exist in the target table are inserted as well. In other words, existing rows are updated and new rows are inserted. The other value for --update-mode is updateonly, which only updates existing rows and never inserts new ones. For a detailed introduction see another blog post (Sqoop 1.4.4: exporting file datasets from HDFS to MySQL database tables).
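Conceptually, --update-key plus --update-mode allowinsert gives upsert semantics on clxxbh. As an illustration only, not necessarily the exact statement Sqoop generates, the effect on one of the sample rows corresponds to this MySQL statement:

insert into jtxx (clxxbh, xsfx, cdbh, hphm, clsd, cllx, clbj, cllb, wflx)
values ('32108417000000013220140317000701', '03', '03', '苏KYV152', '127.00', '3', '0', '0', '0')
on duplicate key update
  xsfx = values(xsfx), cdbh = values(cdbh), hphm = values(hphm), clsd = values(clsd),
  cllx = values(cllx), clbj = values(clbj), cllb = values(cllb), wflx = values(wflx);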
See also: Importing Hive statistical analysis results into a MySQL database table (1): the Sqoop import method