Using Hadoop for analytics and data processing requires loading data into a cluster and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details like ensuring data consistency, the consumption of production system resources, and data preparation for provisioning downstream pipelines. Transferring data using scripts is inefficient and time consuming. Accessing data that resides on external systems directly from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on the project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external system onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.
What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced into different partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of the dataset. Each record is handled in a type-safe manner, because Sqoop uses the database metadata to infer the data types.
In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal is to give an overview of Sqoop operation without going into much detail about its advanced functionality.
Importing Data
The following command is used to import all data from a table called ORDERS in a MySQL database into the cluster:
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****
----
The various options used in this command are as follows:
- import: Instructs Sqoop to initiate an import.
- --connect <connect string>, --username <user name>, --password <password>: These are the parameters required to connect to the database. They are no different from the parameters you would use when connecting to the database via JDBC.
- --table <table name>: Specifies the table to import.
The import is done in the two steps depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. In the second step, Sqoop submits a map-only job to the Hadoop cluster; it is this job that performs the actual data transfer, using the metadata captured in the first step.
Figure 1: Sqoop Import Overview
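Since the actual transfer in the second step is performed by parallel map tasks, you can also tune the degree of parallelism. The sketch below is illustrative only: -m (or --num-mappers) and --split-by are standard Sqoop options, but ORDER_ID is an assumed column of the ORDERS table used here as the split key.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
-m 8 --split-by ORDER_ID   # launch 8 mappers, each importing one slice of ORDERS
----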
The imported data is saved in a directory on HDFS named after the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be written.
By default, these files contain comma-delimited fields, with new lines separating records. You can easily override this format by explicitly specifying the field separator and record terminator characters.
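For example, the following sketch combines an alternative HDFS directory with tab-separated fields; --target-dir, --fields-terminated-by, and --lines-terminated-by are standard Sqoop options, while the /user/arvind/ORDERS_TSV path is only an assumed location.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--target-dir /user/arvind/ORDERS_TSV \
--fields-terminated-by '\t' --lines-terminated-by '\n'   # tab-separated fields, newline-terminated records
----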
Sqoop also supports importing data in different data formats. For example, you can easily import data in Avro format by simply specifying the --as-avrodatafile option with the import command.
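A minimal sketch of such an Avro import, using the same ORDERS table as above:
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--as-avrodatafile   # write the imported records as Avro data files instead of text
----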
Sqoop provides many other options that can be used to further tune the import operation to suit your specific requirements.
Importing Data into Hive
In many cases, importing data into Hive amounts to running the import and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mappings and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition. All of this is done by simply specifying the --hive-import option on the command line.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
----
When you run a Hive import, Sqoop converts the data from the native data types of the external datastore into the corresponding Hive types, and automatically chooses the native delimiter set used by Hive. If the data being imported contains new lines or other Hive delimiter characters, Sqoop allows you to drop such characters so that the data is correctly populated for consumption in Hive.
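In practice this is done with the --hive-drop-import-delims option (a related --hive-delims-replacement option substitutes a replacement string instead); a minimal sketch:
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hive-import --hive-drop-import-delims   # strip \n, \r and \01 characters from string fields
----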
Once the import is complete, you can view and manipulate the table just like any other table in Hive.
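For example, assuming the import above created a Hive table named ORDERS, an ordinary Hive query works against it:
----
hive> SELECT COUNT(*) FROM ORDERS;
----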
Importing Data into HBase
You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this is done by specifying additional options that identify the HBase table and the column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hbase-create-table --hbase-table ORDERS --column-family mysql
----
The following are explanations of the various options on the command line:
- --hbase-create-table: This option instructs Sqoop to create the HBase table.
- --hbase-table: This option specifies the name of the HBase table to use.
- --column-family: This option specifies the name of the column family to populate.
The remaining options are the same as for normal import operations.
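To spot-check the result, you could scan the table from the HBase shell; this is purely illustrative and assumes the ORDERS table and mysql column family created by the command above:
----
hbase> scan 'ORDERS', {LIMIT => 5}   # show the first five rows that were imported
----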
Exporting Data
In some cases, data processed by Hadoop pipelines is needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing the example above, if the data generated by the pipeline on Hadoop corresponds to the ORDERS table in a database somewhere, you could populate it using the following command:
----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--export-dir /user/arvind/orders
----
The following are explanations of the various options:
- export: Instructs Sqoop to initiate an export.
- --connect <connect string>, --username <user name>, --password <password>: These are the parameters required to connect to the database. They are no different from the parameters you would use when connecting to the database via JDBC.
- --table <table name>: Specifies the table to be populated.
- --export-dir <directory path>: The HDFS directory from which data will be exported.
The export is done in the two steps depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Figure 2: Sqoop Export Overview
Some connectors support staging tables, which help isolate the production table from possible corruption in case the job fails for any reason. The staging table is first populated by the map tasks and then merged into the target table once all of the data has been delivered.
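With a connector that supports this, the staging table is named on the export command line. The sketch below assumes a staging table called ORDERS_STAGING that already exists in the database with the same structure as ORDERS:
----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--export-dir /user/arvind/orders \
--staging-table ORDERS_STAGING --clear-staging-table   # stage rows first, then merge into ORDERS
----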
Sqoop Connectors
Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or that do not support native JDBC. Connectors are plugin components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.
By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
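For instance, the MySQL fast-path connector is selected with the --direct option, which makes Sqoop use the database's native bulk tools rather than plain JDBC; a minimal sketch:
----
$ sqoop import --direct --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****   # --direct uses mysqldump for a high-throughput transfer
----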
Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.
Wrapping Up
In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, working with queries instead of tables, and more. We encourage you to try out Sqoop and give us your feedback.
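As one example of working with queries instead of tables, the --query option imports the result of an arbitrary SELECT statement. The sketch below is illustrative: --query, --split-by, and --target-dir are standard Sqoop options, the $CONDITIONS token is required so that Sqoop can inject its split predicates, and ORDER_ID is an assumed column of the ORDERS table.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--username test --password **** \
--query 'SELECT * FROM ORDERS WHERE ORDER_ID > 1000 AND $CONDITIONS' \
--split-by ORDER_ID --target-dir /user/arvind/orders_subset
----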
More information about Sqoop can be found at the following locations:
Project website: http://incubator.apache.org/sqoop
Wiki: https://cwiki.apache.org/confluence/display/sqoop
Project status: http://incubator.apache.org/projects/sqoop.html
Mailing lists: https://cwiki.apache.org/confluence/display/sqoop/mailing+lists