Alex 的 Hadoop 菜鳥教程: 第9課 Sqoop1 從Hbase或者Hive匯出mysql,hadoopsqoop1
今天講講怎麼用sqoop將Hbase或者Hive的東西匯出到mysql。不過事先要告訴大家
目前sqoop沒有辦法把資料直接從Hbase匯出到mysql。必須要通過Hive建立2個表,一個外部表格是基於這個Hbase表的,另一個是單純的基於hdfs的hive原生表,然後把外部表格的資料匯入到原生表(臨時),然後通過hive將暫存資料表裡面的資料匯出到mysql
資料準備mysql建立空表
CREATE TABLE `employee` ( `rowkey` int(11) NOT NULL, `id` int(11) NOT NULL, `name` varchar(20) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
注意:因為大家習慣性的把hive表用於映射Hbase的rowkey的欄位命名為key,所以在建立mysql的table的時候有可能也建立對應的key欄位,但是key是mysql的保留字,會導致insert語句無法插入的問題
Hbase建立employee表建立employee表,並插入資料
hbase(main):005:0> create 'employee','info'0 row(s) in 0.4740 seconds=> Hbase::Table - employeehbase(main):006:0> put 'employee',1,'info:id',10 row(s) in 0.2080 secondshbase(main):008:0> scan 'employee'ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 row(s) in 0.0610 secondshbase(main):009:0> put 'employee',1,'info:name','peter'0 row(s) in 0.0220 secondshbase(main):010:0> scan 'employee'ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 column=info:name, timestamp=1417591321072, value=peter 1 row(s) in 0.0450 secondshbase(main):011:0> put 'employee',2,'info:id',20 row(s) in 0.0370 secondshbase(main):012:0> put 'employee',2,'info:name','paul'0 row(s) in 0.0180 secondshbase(main):013:0> scan 'employee'ROW COLUMN+CELL 1 column=info:id, timestamp=1417591291730, value=1 1 column=info:name, timestamp=1417591321072, value=peter 2 column=info:id, timestamp=1417591500179, value=2 2 column=info:name, timestamp=1417591512075, value=paul 2 row(s) in 0.0440 seconds
建立Hive外部表格hive 有分為原生表和外部表格,原生表是以簡單檔案方式儲存在hdfs裡面,外部表格依賴別的架構,比如Hbase,我們現在建立一個依賴於我們剛剛建立的employee hbase表的hive 外部表格
hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name") > TBLPROPERTIES ("hbase.table.name" = "employee");OKTime taken: 0.324 secondshive> select * from h_employee;OK11peter22paulTime taken: 1.129 seconds, Fetched: 2 row(s)
建立Hive原生表這個hive原生表只是用於匯出的時候臨時使用的,所以取名叫 h_employee_export,欄位之間的分隔字元用逗號
CREATE TABLE h_employee_export(key INT, id INT, name STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
我們去看下實際儲存的文字檔是什麼樣子的
$ hdfs dfs -cat /user/hive/warehouse/h_employee_export/000000_01,1,peter2,2,paul
源Hive表匯入資料到暫存資料表
第一步先將資料從 h_employee(基於Hbase的外部表格)匯入到 h_employee_export(原生Hive表)
hive> insert overwrite table h_employee_export select * from h_employee;
hive> select * from h_employee_export;OK11peter22paulTime taken: 0.359 seconds, Fetched: 2 row(s)
我們去看下實際儲存的文字檔長什麼樣子
$ hdfs dfs -cat /user/hive/warehouse/h_employee_export/000000_01,1,peter2,2,paul
從Hive匯出資料到mysql
$ sqoop export --connect jdbc:mysql://localhost:3306/sqoop_test --username root --password root --table employee --m 1 --export-dir /user/hive/warehouse/h_employee_export/Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.Please set $HCAT_HOME to the root of your HCatalog installation.Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.Please set $ACCUMULO_HOME to the root of your Accumulo installation.14/12/05 08:49:35 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4-cdh5.0.114/12/05 08:49:35 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.14/12/05 08:49:35 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.14/12/05 08:49:35 INFO tool.CodeGenTool: Beginning code generation14/12/05 08:49:36 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 114/12/05 08:49:36 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 114/12/05 08:49:36 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduceNote: /tmp/sqoop-wlsuser/compile/d16eb4166baf6a1e885d7df0e2638685/employee.java uses or overrides a deprecated API.Note: Recompile with -Xlint:deprecation for details.14/12/05 08:49:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wlsuser/compile/d16eb4166baf6a1e885d7df0e2638685/employee.jar14/12/05 08:49:39 INFO mapreduce.ExportJobBase: Beginning export of employee14/12/05 08:49:41 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar14/12/05 08:49:43 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative14/12/05 08:49:43 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative14/12/05 08:49:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps14/12/05 08:49:43 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.111.78.111:803214/12/05 08:49:45 INFO input.FileInputFormat: Total input paths to process : 114/12/05 08:49:45 INFO input.FileInputFormat: Total input paths to process : 114/12/05 08:49:45 INFO mapreduce.JobSubmitter: number of splits:114/12/05 08:49:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1406097234796_003714/12/05 08:49:46 INFO impl.YarnClientImpl: Submitted application application_1406097234796_003714/12/05 08:49:46 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1406097234796_0037/14/12/05 08:49:46 INFO mapreduce.Job: Running job: job_1406097234796_003714/12/05 08:49:59 INFO mapreduce.Job: Job job_1406097234796_0037 running in uber mode : false14/12/05 08:49:59 INFO mapreduce.Job: map 0% reduce 0%14/12/05 08:50:10 INFO mapreduce.Job: map 100% reduce 0%14/12/05 08:50:10 INFO mapreduce.Job: Job job_1406097234796_0037 completed successfully14/12/05 08:50:10 INFO mapreduce.Job: Counters: 30File System CountersFILE: Number of bytes read=0FILE: Number of bytes written=99761FILE: Number of read operations=0FILE: Number of large read operations=0FILE: Number of write operations=0HDFS: Number of bytes read=166HDFS: Number of bytes written=0HDFS: Number of read operations=4HDFS: Number of large read operations=0HDFS: Number of write operations=0Job Counters Launched map tasks=1Data-local map tasks=1Total time spent by all maps in occupied slots (ms)=8805Total time spent by all reduces in occupied slots (ms)=0Total time spent by all map tasks (ms)=8805Total vcore-seconds taken by all map tasks=8805Total megabyte-seconds taken by all map tasks=9016320Map-Reduce FrameworkMap input records=2Map output records=2Input split bytes=144Spilled Records=0Failed Shuffles=0Merged Map outputs=0GC time elapsed (ms)=97CPU time spent (ms)=1360Physical memory (bytes) snapshot=167555072Virtual memory (bytes) snapshot=684212224Total committed heap usage (bytes)=148897792File Input Format Counters Bytes Read=0File Output Format Counters Bytes Written=014/12/05 08:50:10 INFO mapreduce.ExportJobBase: Transferred 166 bytes in 27.0676 seconds (6.1328 bytes/sec)14/12/05 08:50:10 INFO mapreduce.ExportJobBase: Exported 2 records.
注意
在這段日誌中有這樣一句話
14/12/05 08:49:46 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1406097234796_0037/
意思是你可以用瀏覽器訪問這個地址去看下任務的執行情況,如果你的任務長時間卡主沒結束就是出錯了,可以去這個地址查看詳細的錯誤記錄檔
查看結果
mysql> select * from employee;+--------+----+-------+| rowkey | id | name |+--------+----+-------+| 1 | 1 | peter || 2 | 2 | paul |+--------+----+-------+2 rows in set (0.00 sec)mysql>
匯入成功
注意匯出的時候最好加上
--input-null-string "\\\\N" --input-null-non-string "\\\\N"
變成
sqoop export --connect jdbc:mysql://localhost:3306/sqoop_test --username root --password root --table employee --m 1 --export-dir /user/hive/warehouse/h_employee_export/ --input-null-string "\\\\N" --input-null-non-string "\\\\N"
這兩句話的意思是如果遇到空值就插入null,要不然出現空值是會報錯的