Sqoop Commands: Importing MySQL Data to HDFS, HBase, and Hive



1. Test MySQL Connection

bin/sqoop list-databases --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111'
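To also confirm which tables the configured user can see, a minimal follow-up check (same connection settings, using the standard sqoop list-tables sub-command) might look like:

bin/sqoop list-tables --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111'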

2. Verifying SQL statements

bin/sqoop eval --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --query "SELECT * FROM tb_region WHERE region_id = '00a1719a489d4f49906a8ca9661ccbe8'"
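sqoop eval is also handy for a quick row-count sanity check before running a full import; a minimal sketch (the query itself is illustrative):

bin/sqoop eval --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --query "SELECT COUNT(*) FROM tb_region"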

3. Import to HDFS

3.1 Import

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --target-dir /sqoop/mysql/trade_dev/tb_region -m 5 --columns "code,name,category,farthercode,visible,regionlevel,region_id" --direct
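If downstream tools expect a specific field delimiter or NULL token, these can be set explicitly; a hedged variant of the import above (the tab delimiter and \N token are illustrative choices, --direct is omitted here for simplicity, and --delete-target-dir clears the output of a previous run):

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --target-dir /sqoop/mysql/trade_dev/tb_region --delete-target-dir -m 5 --fields-terminated-by '\t' --null-string '\\N' --null-non-string '\\N'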

3.2 Verification

hdfs dfs -cat /sqoop/mysql/trade_dev/tb_region/*01

4. Import to HBase

4.1 Create an HBase table

hbase shell
create 'Mysql_trade_dev', 'region'

4.2 Importing MySQL data to HBase

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --hbase-table Mysql_trade_dev --hbase-row-key region_id --column-family region

4.3 Verification

scan 'Mysql_trade_dev'
count 'Mysql_trade_dev'

5. Import to Hive

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --hive-import --create-hive-table --target-dir /user/hive/warehouse/tb_region --hive-table tb_region
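For repeated full reloads into an already-existing Hive table, --create-hive-table is dropped and --hive-overwrite is used instead; a hedged sketch (standard Sqoop flags, same table names as above; --delete-target-dir clears the temporary HDFS staging directory left by a previous run):

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --delete-target-dir --hive-import --hive-overwrite --hive-table tb_region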

6. Incremental Import to Hive

6.1 Initial import to HDFS

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username mysql --password 111111 --table TB_DICTIONARY -m 1 --target-dir /sqoop/mysql/trade_dev/tb_dic --incremental append --check-column DIC_ID


Returned log output:
16/09/07 10:27:06 INFO tool.ImportTool: --incremental append
16/09/07 10:27:06 INFO tool.ImportTool: --check-column DIC_ID
16/09/07 10:27:06 INFO tool.ImportTool: --last-value 287
16/09/07 10:27:06 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

6.2 Create a Hive external table

CREATE EXTERNAL TABLE tb_dic (
  dic_id INT,
  domain_id STRING,
  dic_type_id INT,
  description STRING,
  code INT,
  name STRING,
  mnemonic STRING,
  attribute STRING,
  mark_for_default INT,
  mark_for_delete INT,
  opt_counter INT,
  create_date STRING,
  create_by STRING,
  last_modified_date STRING,
  last_modified_by STRING,
  attribute1 INT,
  attribute2 INT,
  attribute3 STRING,
  attribute4 STRING,
  attribute5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/sqoop/mysql/trade_dev/tb_dic';

MySQL table definition:

DROP TABLE IF EXISTS `TB_DICTIONARY`;
CREATE TABLE `TB_DICTIONARY` (
  `DIC_ID` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Dictionary ID',
  `DOMAIN_ID` varchar(45) NOT NULL DEFAULT 'domain1' COMMENT 'Service domain ID',
  `DIC_TYPE_ID` int(11) NOT NULL COMMENT 'Dictionary type ID - foreign key - TB_DICTIONARY_TYPE',
  `DESCRIPTION` varchar(1024) NOT NULL COMMENT 'Explanation of the code value',
  `CODE` tinyint(2) NOT NULL COMMENT 'Code value',
  `NAME` varchar(45) NOT NULL COMMENT 'Meaning of the code value',
  `MNEMONIC` varchar(45) DEFAULT NULL COMMENT 'Mnemonic code',
  `ATTRIBUTE` varchar(45) DEFAULT NULL COMMENT 'Attribute of the current dictionary entry, e.g. the dimension type of a unit of measure',
  `MARK_FOR_DEFAULT` tinyint(2) NOT NULL DEFAULT '0' COMMENT 'Default flag (1 = default, 0 = non-default)',
  `MARK_FOR_DELETE` tinyint(2) NOT NULL DEFAULT '1' COMMENT 'Validity flag: 1 = valid, 0 = invalid',
  `OPT_COUNTER` int(5) DEFAULT NULL COMMENT 'Version management flag',
  `CREATE_DATE` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation date',
  `CREATE_BY` varchar(45) NOT NULL DEFAULT 'admin' COMMENT 'Creator ID',
  `LAST_MODIFIED_DATE` datetime DEFAULT NULL COMMENT 'Last modified date',
  `LAST_MODIFIED_BY` varchar(45) DEFAULT NULL COMMENT 'Modifier ID',
  `ATTRIBUTE1` int(11) DEFAULT NULL,
  `ATTRIBUTE2` int(11) DEFAULT NULL,
  `ATTRIBUTE3` varchar(45) DEFAULT NULL,
  `ATTRIBUTE4` varchar(45) DEFAULT NULL,
  `ATTRIBUTE5` date DEFAULT NULL,
  PRIMARY KEY (`DIC_ID`)
) ENGINE = InnoDB AUTO_INCREMENT = 290 DEFAULT CHARSET = utf8 COMMENT = 'Dictionary table - Li Siyu';

SET FOREIGN_KEY_CHECKS = 1;
6.3 Updating incremental data

bin/sqoop job --create incjob -- import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username mysql --password 111111 --table TB_DICTIONARY -m 1 --target-dir /sqoop/mysql/trade_dev/tb_dic --incremental append --check-column DIC_ID --last-value 287
bin/sqoop job --exec incjob
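Saved jobs persist the incremental state between runs, so the stored last-value can be inspected before the next execution; a quick check using the standard sqoop job sub-commands:

bin/sqoop job --list
bin/sqoop job --show incjob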

6.4 Verification

SELECT COUNT(*) FROM tb_dic;


Returned data:

First run:

Time taken: 0.068 seconds, Fetched: 489 row(s)

Second run:

Time taken: 0.068 seconds, Fetched: 490 row(s)

7. Full-Database Import Test

7.1 Create an HBase table


hbase shell
create 'new_table', 'data'


7.2 Importing MySQL data to HBase


bin/sqoop import-all-tables --connect jdbc:mysql://192.168.1.187:3306/new_schema --username mysql --password 111111 --hbase-create-table --hbase-table new_table --column-family data --hbase-bulkload


Note

A full-database import requires every table to have a primary key; otherwise the import fails with an error such as:

16/09/08 15:03:50 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-xdata/compile/070fa1eda1e77fc70eaa0c532cfa94b8/nopk.jar
16/09/08 15:03:50 ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table NOPK. Please specify one with --split-by or perform a sequential import with '-m 1'.


Alternatively, if every table has a column that can serve as the row key (such as an ID), add --hbase-row-key ID.
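If some tables satisfy neither condition, they can be excluded from the full-database import; a hedged sketch that skips the NOPK table from the error above using the standard --exclude-tables option:

bin/sqoop import-all-tables --connect jdbc:mysql://192.168.1.187:3306/new_schema --username mysql --password 111111 --exclude-tables NOPK --hbase-create-table --hbase-table new_table --column-family data --hbase-bulkload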


7.3 Verification


scan 'new_table'
count 'new_table'






1. Hadoop (distributed system infrastructure)

Hadoop is a distributed system infrastructure developed under the Apache Software Foundation.



Users can develop distributed programs without knowing the underlying details of the distributed system, taking advantage of the power of a cluster for high-speed computation and storage.

It is reliable, efficient, and scalable.

At the heart of Hadoop are YARN, HDFS, and MapReduce.



(Figure: the Hadoop ecosystem integrated with the Spark ecosystem.) Going forward, Hadoop and Spark will coexist, and both can be deployed on resource-management systems such as YARN and Mesos.

Each of these components is briefly described below; the details are covered in the follow-up series of posts.


2. HDFS (Hadoop Distributed File System)


HDFS derives from the Google GFS paper published in October 2003; HDFS is a clone of GFS.

HDFS is the basis for data storage management in the Hadoop ecosystem. It is a highly fault-tolerant system that can detect and respond to hardware failures, and it is designed to run on low-cost commodity hardware.

HDFS simplifies the file consistency model and, through streaming data access, provides high-throughput data access for applications with large data sets.

It provides a write-once, read-many access model; files are stored as blocks distributed across different physical machines in the cluster.


3. MapReduce (distributed computing framework)

Hadoop MapReduce derives from the Google MapReduce paper published in December 2004; it is a clone of Google's MapReduce.

MapReduce is a distributed computing model for processing large volumes of data. It hides the details of the distributed computing framework and abstracts computation into two phases, map and reduce: map applies an operation to each individual element of the data set and produces key-value pairs as intermediate results, and reduce aggregates all intermediate values that share the same key to obtain the final result.



MapReduce is ideal for data processing in a distributed, parallel environment consisting of a large number of computers.
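As a concrete illustration, the word-count example bundled with Hadoop can be run from the shell; a hedged sketch (the examples jar path and the HDFS input/output directories are assumptions that depend on the local installation):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000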



4. HBase (distributed column-oriented database)

HBase derives from the Google BigTable paper published in November 2006; HBase is a clone of Google BigTable.

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data, built on top of HDFS.

HBase uses the BigTable data model: an enhanced sparse, sorted map (key/value), where keys are composed of a row key, a column key, and a timestamp.

HBase provides random, real-time read/write access to large-scale data, and data stored in HBase can be processed with MapReduce, neatly combining data storage with parallel computing.



5. ZooKeeper (distributed coordination service)

ZooKeeper derives from the Google Chubby paper published in November 2006; ZooKeeper is a clone of Chubby.

It solves data-management problems in distributed environments: unified naming, state synchronization, cluster management, configuration synchronization, and so on.

Many Hadoop components depend on ZooKeeper, which runs on a computer cluster and is used to manage Hadoop operations.



6. Hive (data warehouse)

Open-sourced by Facebook, Hive was originally used for statistical analysis of massive structured log data.

Hive defines a SQL-like query language (HQL) and translates SQL statements into MapReduce jobs executed on Hadoop. It is typically used for offline analysis.

HQL is used to run queries over data stored on Hadoop; Hive lets developers unfamiliar with MapReduce write data queries, which are then translated into MapReduce jobs on Hadoop.
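A minimal sketch of running HQL from the shell against the tb_region table imported in section 5 (the grouping column is an illustrative choice):

hive -e "SELECT regionlevel, COUNT(*) FROM tb_region GROUP BY regionlevel"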



7. Pig (ad-hoc scripting)

Open-sourced by Yahoo!, Pig's design motivation was to provide a MapReduce-based ad-hoc data analysis tool (where computation happens at query time).

Pig defines a data-flow language, Pig Latin, which abstracts away the complexity of MapReduce programming; Pig comprises a runtime environment and the scripting language (Pig Latin) used to analyze Hadoop data sets.

Its compiler translates Pig Latin into a sequence of MapReduce programs, converting the script into MapReduce jobs executed on Hadoop. It is typically used for offline analysis.



8. Sqoop (data ETL/synchronization tool)

Sqoop is short for SQL-to-Hadoop; it is mainly used to transfer data between traditional databases and Hadoop. The import and export are essentially MapReduce programs, taking advantage of MapReduce's parallelism and fault tolerance.

Sqoop uses database technology to describe the data schema and transfers data between relational databases, data warehouses, and Hadoop.



9. Flume (log collection tool)

Flume is Cloudera's open-source log collection system, featuring distribution, high reliability, high fault tolerance, and easy customization and extension.

It abstracts the path data takes, from generation through transmission and processing to finally being written to a target, into a data stream; within that stream the data source supports custom senders in Flume, so data arriving over different protocols can be collected.

At the same time, Flume data streams provide simple processing of log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets.

In general, Flume is a scalable log collection system suited to large-scale, complex environments. It can also be used to collect other types of data.


10. Mahout (data mining algorithm library)

Mahout originated in 2008 as a sub-project of Apache Lucene; it developed considerably in a very short time and is now a top-level Apache project.

The main goal of Mahout is to create scalable implementations of classic machine learning algorithms, designed to help developers create intelligent applications more quickly and easily.

Mahout now includes a wide range of data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining.

In addition to the algorithms, Mahout includes input/output tools for data and data mining support infrastructure, such as integration with other storage systems like databases, MongoDB, or Cassandra.



11. Oozie (workflow scheduler)

Oozie is an extensible workflow system integrated into the Hadoop stack that coordinates the execution of multiple MapReduce jobs. It can manage a complex system whose execution is driven by external events, including the timing and availability of data.

An Oozie workflow is a set of actions placed in a control-dependency DAG (Directed Acyclic Graph) that specifies the order in which the actions execute, such as Hadoop Map/Reduce jobs, Pig jobs, and so on.

Oozie uses HPDL, an XML process definition language, to describe this graph.



12. YARN (distributed resource manager)

YARN is the next generation of MapReduce, i.e. MRv2, which evolved from the first generation of MapReduce primarily to address the original Hadoop's poor scalability and lack of support for multiple computing frameworks. YARN is the resource-management layer of the next-generation Hadoop computing platform: a general runtime framework that lets users write their own computing frameworks and run them in this environment. A user-written framework is a client-side library that is packaged when the application submits a job. The framework provides the following components:

- Resource management: includes application management and machine resource management

- Two-tier resource scheduling

- Fault tolerance: handled by each component

- Extensibility: scalable to tens of thousands of nodes



13. Mesos (distributed resource manager)

Mesos is a research project born at UC Berkeley that has since become an Apache project; some companies, such as Twitter, use Mesos to manage cluster resources.

Like YARN, Mesos is a unified resource management and scheduling platform and supports multiple computing frameworks, such as MapReduce and streaming.



14. Tachyon (distributed in-memory file system)

Tachyon is a memory-centric distributed file system with high performance and fault tolerance.

It provides reliable, memory-speed file sharing services for cluster frameworks such as Spark and MapReduce.

Tachyon was born in the AMPLab at UC Berkeley.



15. Tez (DAG computing model)

Tez is a recent open-source computing framework that supports DAG-shaped jobs. It derives directly from the MapReduce framework; its core idea is to split the map and reduce operations further:

map is split into Input, Processor, Sort, Merge, and Output, and reduce is split into Input, Shuffle, Sort, Merge, Processor, and Output, and so on.

These decomposed meta-operations can then be combined flexibly to produce new operations which, assembled by some control logic, can form a large DAG job.

Hive now supports both the MapReduce and Tez computing models; running on Tez can improve computational performance.



16. Spark (in-memory DAG computing model)

Spark is an Apache project advertised as "lightning-fast cluster computing". It has a thriving open-source community and is currently one of the most active Apache projects.

Spark began at UC Berkeley's AMP Lab as an open-source general parallel computing framework similar to Hadoop MapReduce.

Spark provides a faster, more general-purpose data processing platform. Compared with Hadoop, Spark can run programs up to 100x faster in memory, or 10x faster on disk.



17. Giraph (graph computing model)

Apache Giraph is a scalable, distributed iterative graph processing system built on the Hadoop platform, inspired by BSP (Bulk Synchronous Parallel) and Google's Pregel.

It originally came from Yahoo. In developing Giraph, Yahoo adopted the principles of Pregel, a large-scale graph system published by Google engineers in 2010. Later, Yahoo donated Giraph to the Apache Software Foundation.

Now anyone can download Giraph; it has become an Apache Software Foundation open-source project and has received many improvements contributed by Facebook.



18. GraphX (graph computing model)

Spark GraphX started as a distributed graph computing framework project at Berkeley's AMPLab; it is now integrated into the Spark runtime to provide BSP-style massively parallel graph computation.



19. MLlib (machine learning library)

Spark MLlib is a machine learning library that provides algorithms for classification, regression, clustering, collaborative filtering, and more.



20. Spark Streaming (stream computing model)

Spark Streaming supports real-time processing of streaming data, computing over real-time data in a micro-batch manner.



21. Kafka (distributed message queue)

Kafka is a messaging system open-sourced by LinkedIn in December 2010; it is used primarily for processing active streaming data.

Active streaming data is very common in web applications, including site page views, what users visited, what content they searched for, and so on.

This data is usually recorded in logs and then processed statistically at regular intervals.



22. Phoenix (HBase SQL interface)

Apache Phoenix is a SQL driver for HBase. Phoenix lets HBase be accessed through JDBC and transforms SQL queries into HBase scans and the corresponding operations.



23. Ranger (security management tool)

Apache Ranger is a permissions framework for Hadoop clusters that provides the ability to operate, monitor, and manage complex data permissions; it offers a centralized mechanism for managing all data permissions across the YARN-based Hadoop ecosystem.



24. Knox (Hadoop security gateway)

Apache Knox is a REST API gateway for accessing a Hadoop cluster. It provides a single access point for all REST access and handles 3A (Authentication, Authorization, Auditing) and SSO (single sign-on).



25. Falcon (data lifecycle management tool)

Apache Falcon is a new Hadoop-oriented data processing and management platform designed for data movement, data pipeline orchestration, lifecycle management, and data discovery. It enables end users to quickly "onboard" their data and the associated processing and management tasks onto a Hadoop cluster.



26. Ambari (installation, deployment, and configuration management tool)

Apache Ambari is a web-based tool for creating, managing, and monitoring Hadoop clusters, making Hadoop and related big-data software easier to use.






Hive originated at Facebook and plays the role of a data warehouse in Hadoop. It is built on top of a Hadoop cluster and provides a SQL-like interface for manipulating data stored on the cluster. You can use HiveQL to run SELECT, JOIN, and similar operations. If you have data-warehouse requirements, are good at writing SQL, and do not want to write MapReduce jobs, you can use Hive instead.
Hive's built-in data types fall into two main categories:
(1) basic data types;
(2) complex data types.
The basic data types are: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE.
(Table of basic-type byte sizes and the Hive versions that support them omitted.)

The complex types include ARRAY, MAP, STRUCT, and UNION, which are composed from the basic types.
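A small sketch of a table definition that exercises both basic and complex types (the table and column names are hypothetical, and a reasonably recent Hive version is assumed):

hive <<'EOF'
CREATE TABLE type_demo (
  id      INT,
  name    STRING,
  price   DECIMAL(10,2),
  created TIMESTAMP,
  tags    ARRAY<STRING>,
  attrs   MAP<STRING,STRING>,
  address STRUCT<city:STRING, zip:STRING>
);
EOF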
HBase
HBase runs on top of HDFS as a column-oriented database, providing the random read and write capability that HDFS lacks. HBase is modeled on Google BigTable and stores data as key-value pairs. The project's goal is to quickly locate and access the required data among billions of rows.
HBase is a database, a NoSQL database that provides read and write capabilities like other databases. Hadoop alone cannot meet real-time needs, but HBase can: if you need real-time access to some data, put it into HBase.
You can use Hive as a static data warehouse and HBase as the data store for data that will change. In Hive, a normal (managed) table is stored in HDFS, while an external table lets you specify the data's storage location, which can be an HDFS directory, Elasticsearch, or HBase.
When you use Sqoop to export data from MySQL into Hadoop, you need to decide whether to import it directly into Hive (as a normal table) or into HBase; Sqoop supports both kinds of import.
Test Sqoop
(Screenshot of the test command omitted.)
After the Sqoop statement above executes successfully, you can confirm that Sqoop works properly and that its connection to MySQL is normal.
Import data from MySQL to Hive using Sqoop
Using complex SQL
(Screenshot of the Sqoop command omitted.)
Note:
Because importing data from MySQL to Hive with Sqoop requires specifying --target-dir, the import creates a normal (managed) table and cannot create an external table.
A brief summary of the Sqoop execution process:
(Screenshot of the execution log omitted.)
As you can see, once --split-by is set, the job is partitioned by that column and the number of splits is set by -m (the default number of splits is 4 if -m 5 is not given). Testing shows that this more complex SQL statement is well supported by Sqoop; a free-form query sketch follows below.
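A hedged sketch of a free-form query import of this kind (the SELECT, target directory, and Hive table name are illustrative; with --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause and a --target-dir, and a parallel import additionally needs --split-by on a suitable column, so -m 1 keeps this one sequential):

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --query 'SELECT r.region_id, r.code, r.name FROM tb_region r WHERE $CONDITIONS' -m 1 --target-dir /sqoop/mysql/trade_dev/tb_region_query --hive-import --hive-table tb_region_query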
Adjusting the Hive data types
After the task above completes successfully, you can check how the column types in the Hive table relate to the corresponding MySQL columns:
(Screenshot of the type mapping omitted.)
You can see that MySQL's DECIMAL type becomes DOUBLE in Hive. If you need a different mapping, specify it at import time with --map-column-hive, as follows:
(Screenshot of the command omitted.)
The command above succeeds, but when the Hive column type is set to DECIMAL, the mapping MySQL DECIMAL(12,2) -> Hive DECIMAL causes the decimal places to be lost after the import.
Note:
A mapping that specifies the precision, such as cost="DECIMAL(10,2)", fails in Sqoop 1.4.5. This is a Sqoop 1.4.5 bug, see https://issues.apache.org/jira/browse/SQOOP-2103; it was fixed in version 1.4.7.
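A hedged sketch of forcing the mapping with --map-column-hive (the table TB_ORDER and its cost column are hypothetical, following the example in the text; the precision is deliberately omitted because of the SQOOP-2103 issue mentioned above):

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_ORDER --map-column-hive cost=DECIMAL --hive-import --hive-table tb_order -m 1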
Incremental updates
If the Sqoop statement above is executed twice, an error occurs on the second run:
(Screenshot of the error omitted.)
This means the corresponding storage directory already exists in HDFS, so an incremental sqoop-to-hive import is needed (see the --incremental append pattern in section 6 above).
Note:
Because Hive has no row key, its HDFS-based storage means that sqoop-to-hive imports can only append; update imports are not possible.
Import data from MySQL to HBase using Sqoop
Using complex SQL
(Screenshot of the Sqoop command omitted.)
The SQL statement above is fairly simple. Testing shows that Sqoop also supports more complex SQL statements very well, and the import works normally.
Incremental updates
After specifying the HBase row key, re-running the Sqoop statement that imports data from MySQL to HBase replaces the corresponding HBase rows, matched by the same row key value, with the updated data.
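A hedged sketch of a filtered import into the HBase table from section 4; because rows are matched by the region_id row key, re-running it overwrites the same HBase rows with fresh values (the --where predicate and column list are illustrative):

bin/sqoop import --connect jdbc:mysql://192.168.1.187:3306/trade_dev --username 'mysql' --password '111111' --table TB_REGION --where "regionlevel >= 2" --columns "region_id,code,name,regionlevel" --hbase-table Mysql_trade_dev --column-family region --hbase-row-key region_id -m 1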
Hive over HBase data
(Screenshot omitted.)
For more details on how Hive uses data stored in HBase, see the article "Executing SQL statements using Hive or Impala for data stored in HBase".
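For reference, a hedged sketch of mapping the HBase table from section 4 into Hive with the HBase storage handler (the Hive column list and the HBase column qualifiers are assumptions; the qualifier names must match whatever Sqoop actually wrote into the region column family):

hive <<'EOF'
CREATE EXTERNAL TABLE hbase_tb_region (
  region_id STRING,
  code      STRING,
  name      STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,region:CODE,region:NAME")
TBLPROPERTIES ("hbase.table.name" = "Mysql_trade_dev");
EOF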
About Sqoop2
Architecturally, Sqoop1 uses map-only jobs to move data between Hadoop (HDFS/HBase/Hive) and relational databases. Users interact with it via the command line, data transfer is tightly coupled to the data format, ease of use is poor, connector support for data formats is limited, security is weak, and the constraints on connectors are too rigid. Sqoop2 introduces a centralized service responsible for managing the complete MapReduce jobs, provides multiple ways to interact (CLI/Web UI/REST API), adds a permissions mechanism, and standardizes connectors, making it easier to use, more secure, and more focused.
In summary
Using Sqoop to import data from MySQL to HBase is easier than importing to Hive, and when Hive is used on top of HBase data there are no decimal-precision bugs and updates are well supported. It is therefore recommended to import data from MySQL into HBase with Sqoop rather than directly into Hive.
In testing, importing the test data set from MySQL to HBase with Sqoop took 7-12 minutes. Impala's query efficiency against HBase is lower than against HDFS.


