Tutorials | Import data from MySQL to Hive and HBase using Sqoop


Basic Environment

sqoop: sqoop-1.4.5+cdh5.3.6+78

hive: hive-0.13.1+cdh5.3.6+397

hbase: hbase-0.98.6+cdh5.3.6+115

Introduction to Sqoop, Hive, and HBase

Sqoop

Sqoop is an open source tool for transferring data between Hadoop and relational databases: it can import data from a relational database (e.g. MySQL, Oracle, Postgres) into Hadoop's HDFS, and it can also export HDFS data back into a relational database.

Hive

Developers who are familiar with SQL but do not want to write MapReduce in a programming language can use Hive for offline data processing and analysis. Hive is a Hadoop-based data warehousing tool that maps structured data files to database tables and provides simple SQL query functionality, translating SQL statements into MapReduce jobs. Note that Hive is currently suited to offline data processing; it is not appropriate for real-time online queries in production environments, for one simple reason: it is slow.

Hive originated at Facebook and plays the role of a data warehouse within Hadoop. It is built on top of a Hadoop cluster and provides a SQL-like interface to the data stored on that cluster. You can use HiveQL to run SELECTs, JOINs, and so on. If you have data warehousing needs, are good at writing SQL, and don't want to write MapReduce jobs, you can use Hive instead.
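
For a flavor of HiveQL, here is a minimal sketch run from the shell; the orders and users tables and their columns are hypothetical:

    # A HiveQL join, compiled by Hive into MapReduce jobs.
    hive -e "
      SELECT u.name, SUM(o.cost) AS total_cost
      FROM orders o
      JOIN users u ON o.user_id = u.id
      GROUP BY u.name;
    "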

Hive's built-in data types fall into two main categories:

(1) basic data types;

(2) complex data types.

The basic data types are: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, and DATE.

The sizes of these basic types, and the Hive versions in which the newer ones became available, are:

TINYINT: 1-byte signed integer
SMALLINT: 2-byte signed integer
INT: 4-byte signed integer
BIGINT: 8-byte signed integer
BOOLEAN: true/false
FLOAT: 4-byte single-precision floating point
DOUBLE: 8-byte double-precision floating point
STRING: character string
BINARY: byte array (since Hive 0.8.0)
TIMESTAMP: timestamp (since Hive 0.8.0)
DECIMAL: fixed-point decimal (since Hive 0.11.0; precision and scale parameters since 0.13.0)
CHAR: fixed-length string (since Hive 0.13.0)
VARCHAR: variable-length string (since Hive 0.12.0)
DATE: calendar date (since Hive 0.12.0)

The complex types are ARRAY, MAP, STRUCT, and UNION, which are composed from the basic types.
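
As an illustration, a table using the complex types could be declared as follows (a sketch with hypothetical names; UNIONTYPE is declared in the same way):

    # ARRAY, MAP, and STRUCT columns built from basic types.
    hive -e "
      CREATE TABLE example_complex (
        tags  ARRAY<STRING>,
        props MAP<STRING, INT>,
        addr  STRUCT<city:STRING, zip:STRING>
      );
    "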

HBase

HBase runs on top of HDFS as a column-oriented database, supplying the random read and write operations that HDFS itself lacks. It is modeled on Google's Bigtable and stores data as key-value pairs. The project's goal is to quickly locate and access the required data among billions of rows on the host cluster.

HBase is a NoSQL database that provides read and write capabilities like other databases. Hadoop alone cannot satisfy real-time needs, but HBase can: if you need real-time access to some of your data, store it in HBase.

You can use Hive as a static data warehouse and HBase as the store for data that will change. In Hive, a normal (managed) table is stored in HDFS, while creating an external table lets you specify where the data lives: an HDFS directory, Elasticsearch, or HBase.
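
For example, a minimal external table over an HDFS directory (path and columns are placeholders) looks like this; an HBase-backed external table is shown near the end of this article:

    # The table's data stays in the given HDFS directory;
    # dropping the table does not delete the data.
    hive -e "
      CREATE EXTERNAL TABLE static_orders (id INT, cost DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/static_orders';
    "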

When you use Sqoop to move data from MySQL into Hadoop, you need to decide whether to import directly into Hive (as a normal table) or to import into HBase; Sqoop supports both.

Test Sqoop

Running a simple Sqoop statement and checking its output lets you confirm that Sqoop works properly and that its connection to MySQL is normal.
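
A minimal connectivity check, assuming placeholder host and credentials, might look like this:

    # List the databases visible over the JDBC connection.
    sqoop list-databases \
        --connect jdbc:mysql://mysql-host:3306/ \
        --username sqoop_user \
        --password sqoop_pwd

If the command prints the databases on the MySQL server, Sqoop and its MySQL connection are working.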

Import data from MySQL to Hive using Sqoop

Using Complex SQL

Attention:

Because using Sqoop to import data from MySQL into Hive requires specifying --target-dir, the import produces a normal (managed) table; it cannot create an external table.
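
A free-form-query import along these lines is a sketch only; the connection details, query, tables, and paths are placeholders:

    # Import the result of a join into a managed Hive table.
    sqoop import \
        --connect jdbc:mysql://mysql-host:3306/testdb \
        --username sqoop_user \
        --password sqoop_pwd \
        --query 'SELECT o.id, o.cost, u.name FROM orders o JOIN users u ON o.user_id = u.id WHERE $CONDITIONS' \
        --split-by o.id \
        -m 5 \
        --target-dir /user/hive/warehouse/orders_import \
        --hive-import \
        --hive-table orders_import

Note that a free-form query must contain the literal token $CONDITIONS, which Sqoop replaces with each mapper's split predicate; the single quotes keep the shell from expanding it.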

Briefly, the Sqoop execution process works as follows:

As you can see, once --split-by is set, the job is partitioned on that column, and -m sets the number of map tasks (the default is 4 if -m 5 is not given). Testing shows that Sqoop supports this kind of more complex SQL statement well.

Adjusting the Hive data type

After the task above completes successfully, comparing the Hive table structure with the corresponding MySQL columns reveals how the types were mapped.

In particular, MySQL's DECIMAL type becomes the DOUBLE type in Hive. To control this, specify the mapping at import time with --map-column-hive, as follows.
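
For instance, appended to the import command above (cost is a hypothetical column name):

    # Force the cost column to DECIMAL instead of the default DOUBLE.
    sqoop import \
        --connect jdbc:mysql://mysql-host:3306/testdb \
        --username sqoop_user \
        --password sqoop_pwd \
        --query 'SELECT o.id, o.cost, u.name FROM orders o JOIN users u ON o.user_id = u.id WHERE $CONDITIONS' \
        --split-by o.id \
        -m 5 \
        --target-dir /user/hive/warehouse/orders_import \
        --hive-import \
        --hive-table orders_import \
        --map-column-hive cost=DECIMAL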

The command above succeeds, but because the Hive column type becomes plain DECIMAL, the mapping MySQL [decimal(12,2)] --> Hive [decimal] loses the digits after the decimal point once the import completes.

Attention:

A mapping that specifies the precision, such as cost=DECIMAL(10,2), fails in Sqoop 1.4.5. This is a Sqoop 1.4.5 bug (see https://issues.apache.org/jira/browse/SQOOP-2103 for details); it was fixed in version 1.4.7.

Incremental import

If the Sqoop statement above is executed a second time, the second run fails with an error.

The error means that the corresponding storage directory already exists in HDFS, so an incremental Sqoop-to-Hive import statement is needed here.
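
An append-mode incremental import is sketched below, assuming an auto-increment id column and placeholder values:

    # Import only rows whose id exceeds --last-value, appending to the Hive table.
    sqoop import \
        --connect jdbc:mysql://mysql-host:3306/testdb \
        --username sqoop_user \
        --password sqoop_pwd \
        --table orders \
        --hive-import \
        --hive-table orders_import \
        --incremental append \
        --check-column id \
        --last-value 1000

Each subsequent run should raise --last-value to the largest id already imported; Sqoop prints that value at the end of each run.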

Attention:

Because Hive has no row key and its data is stored as files in HDFS, a Sqoop-to-Hive import can only append rows; update-style imports are not possible.

Import data from MySQL to HBase using Sqoop

Using Complex SQL
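
A MySQL-to-HBase import of this general shape, with placeholder connection details, column family, and row key, might look like:

    # Write rows into HBase keyed on id; create the HBase table if absent.
    sqoop import \
        --connect jdbc:mysql://mysql-host:3306/testdb \
        --username sqoop_user \
        --password sqoop_pwd \
        --query 'SELECT id, cost, name FROM orders WHERE $CONDITIONS' \
        --split-by id \
        -m 5 \
        --hbase-table orders \
        --column-family cf \
        --hbase-row-key id \
        --hbase-create-table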

The SQL statement in the sketch above is fairly simple. Testing shows that Sqoop also supports more complex SQL statements very well; they import normally.

Incremental updates

Once the HBase row key is specified, executing the MySQL-to-HBase Sqoop statement again replaces the HBase rows that share the same row key, which effectively performs an update.

Using HBase data from Hive

For more details on how Hive consumes data stored in HBase, see the article "Executing SQL statements using Hive or Impala for data stored in HBase".
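
As a pointer, a common pattern is to map an external Hive table onto the HBase table; this sketch follows the placeholder names used above:

    # Each Hive column maps to the HBase row key or a cf:qualifier cell.
    hive -e "
      CREATE EXTERNAL TABLE orders_hbase (id INT, cost DOUBLE, name STRING)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:cost,cf:name')
      TBLPROPERTIES ('hbase.table.name' = 'orders');
    "

Queries against orders_hbase then read live data from HBase.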

About Sqoop2

Architecturally, Sqoop1 uses map-only jobs to transfer data between Hadoop (HDFS/HBase/Hive) and relational databases, and users drive it from the command line. Data transfer is tightly coupled to the data format, ease of use is poor, connector support for data formats is limited, security is weak, and the constraints on connectors are too rigid. Sqoop2 introduces a centralized service responsible for managing the complete MapReduce job and provides several ways to interact with it (CLI, web UI, REST API), along with a permission-management mechanism and standardized connectors, making it more user-friendly, more secure, and more focused.

Summing up

Importing data from MySQL to HBase with Sqoop is easier than importing into Hive; when Hive then consumes the HBase data, there is no decimal-precision bug, and updates are well supported. It is therefore recommended to use Sqoop to import data from MySQL into HBase rather than directly into Hive.

In testing, a Sqoop import of the dataset from MySQL to HBase took 7 to 12 minutes. Note, however, that Impala queries against HBase are not as efficient as queries against HDFS.

Summary

In the big data field, it is worth looking at how today's companies stand with respect to data. Some companies have plenty of data, but the data has little or no value; call these LU (large but unavailable) companies. Some companies do not have much data, but what they have is priceless; call these SV (small but valuable) companies. And some companies have large volumes of valuable data; call these LV (large and valuable) companies.

There are many LU companies in the market; their data may lack value because of the data itself or because of the product, and when they want to test the big data waters, they should screen carefully. There are also many SV companies; for them, good data analysts matter more than a big data platform, and before testing the big data waters they should make a clear estimate of their data volume. LV companies are the ones that most need to build a big data platform, yet they often lack the technology or the desire to do it, which is all the more regrettable.

In short, what you want to do is not necessarily what needs to be done.

End.

