Using Sqoop on the Azure Cloud Platform to Import SQL Server 2012 Data Tables into Hive/HBase


My name is Farooq and I am with the HDInsight support team here at Microsoft. In this blog I'll give a brief overview of Sqoop on HDInsight and then use an example of importing data from a Windows Azure SQL Database table to an HDInsight cluster to demonstrate how you can get started with Sqoop on HDInsight.

What is Sqoop?

Sqoop is an Apache project and part of the Hadoop ecosystem. It allows data transfer between a Hadoop/HDInsight cluster and relational databases such as SQL Server, Oracle, MySQL, etc. Sqoop is a collection of related tools, for example import, export, list-all-tables, list-databases, etc. To use Sqoop, you specify the tool you want to use and the arguments that control the tool. For more information on Sqoop, please check the Sqoop User Guide.
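
As a quick illustration of the tool-plus-arguments pattern, you can use Sqoop's built-in help from the Hadoop Command Line on the cluster head node (described later in this blog) to see the available tools and the arguments each tool accepts. A minimal sketch:

    sqoop.cmd help

    sqoop.cmd help import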

When do I need to use Sqoop?

You need to use Sqoop if you are trying to import/export data between Hadoop and a relational database. HDInsight provides a full-featured Hadoop Distributed File System (HDFS) over Windows Azure Blob storage (WASB), and if you want to upload data to HDInsight or WASB from any other source, for example from your local computer's file system, then you should use one of the tools discussed in this article. The same article also discusses how to import data to HDFS from SQL Database/SQL Server using Sqoop. In this blog I'll elaborate on the same with an example and try to provide more detailed information along the way.

What do I need to do for Sqoop to work in my HDInsight cluster?

HDInsight 2.1 includes Sqoop 1.4.3. The Microsoft SQL Server Sqoop Connector for Hadoop is now part of Apache Sqoop 1.4, so there is no need to install the connector separately. All HDInsight clusters also have the Microsoft SQL Server JDBC driver installed; so all the components needed to transfer data between an HDInsight cluster and SQL Server are already installed on an HDInsight cluster and you do not have to install anything.
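
For a quick sanity check of what is installed, you can run Sqoop's version tool from the Hadoop Command Line on the head node. This is just a sketch; the exact bin folder path may differ slightly in your environment, as noted later in this blog.

    cd C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\bin

    sqoop.cmd version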

How can I run a Sqoop job?

With the HDInsight preview version, we could only run Sqoop commands from the Hadoop command line after establishing a Remote Desktop (RDP) session to the HDInsight cluster head node. However, the release version of the HDInsight SDK includes PowerShell cmdlets to run Sqoop jobs remotely. So we can:

    1. Run Sqoop jobs locally from HDInsight head node using Hadoop Command Line
    2. Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets

We recommend that you run your Sqoop commands remotely using the HDInsight SDK cmdlets. We'll discuss both options in detail. First, let's see how we can run Sqoop jobs locally from the HDInsight head node using the Hadoop Command Line.

Run Sqoop jobs locally from HDInsight head node using Hadoop Command Line

I am assuming you already have a Windows Azure SQL Database. If you don't have one and want to get one, follow the steps in this article. Let's follow the steps below to create a test table and populate it with some sample data in your Windows Azure SQL Database, which we'll import into our HDInsight cluster shortly. I'll show how to do this from the Windows Azure portal, but you can also connect to the Windows Azure SQL Database from SSMS and do the same.

Note: if you want to transfer data from a SQL Server in your own environment instead, then you need to change the Sqoop command with the appropriate connection information, and it should be very similar to the connection string I have provided later in this blog under the 'More sample Sqoop commands' section for SQL Server on Windows Azure VMs.
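
For reference, here is a sketch of what such a command might look like against an on-premises SQL Server; the host name, credentials, and database are placeholders, the server must be reachable from the cluster, and SQL authentication is assumed:

    sqoop.cmd import --connect "jdbc:sqlserver://<YourSQLServerHost>:1433;username=<SQLServerUsername>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1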

  1. Log in to your Windows Azure Portal, select 'SQL Databases' from the left, and click 'Manage' at the bottom.

  2. Provide your Windows Azure SQL Database user ID and password to log in, and then click 'New query' to open a new query window to run T-SQL queries.

  3. Copy and paste the following T-SQL query and execute it to create a test table, Table1.

    CREATE TABLE [dbo].[Table1] (
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL, -- column length assumed; the original did not specify one
        [LName] [nvarchar](50) NOT NULL,
        CONSTRAINT [Pk_table_4] PRIMARY KEY CLUSTERED
        (
            [ID] ASC
        )
    ) ON [PRIMARY]
    GO

  4. Run the following to populate Table1 with 4 rows.

    INSERT INTO [dbo].[Table1] VALUES (1, 'John', 'Doe'), (2, 'Harry', 'Hoe'), (3, 'Carla', 'Coe'), (4, 'Jackie', 'Joe');
    GO

  5. Now finally run the following T-SQL to make sure the table was populated with the sample data. You should see the four rows you just inserted.

    SELECT * FROM [dbo].[Table1]

Now let's follow the steps below to import the rows in Table1 into the HDInsight cluster.

  1. Log in to your HDInsight cluster head node via Remote Desktop (RDP) and double-click the 'Hadoop Command Line' icon on the desktop to open the Hadoop Command Line. RDP access is turned off by default, but you can follow the steps in this blog to enable RDP and then RDP to the head node of your HDInsight cluster.
  2. In the Hadoop Command Line, navigate to the "C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\bin" folder.

    Note: please verify the path of the Sqoop bin folder in your environment; it may vary slightly from version to version.

  3. Run the following Sqoop command to import all the rows of table "Table1" from the Windows Azure SQL Database "mfarooqsqldb" to the HDInsight cluster.

    sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1

    Once the command executes successfully, you should see the MapReduce job output in the Hadoop Command Line window.

  4. There are quite a number of tools available to upload/download and view data in WASB. Let's use the Azure Storage Explorer tool. You need to install the tool on your workstation and configure it for your cluster. Once that is done, open the tool and find the /user/hdp/SqoopImportTable1 folder. You should see 4 files, indicating that 4 map tasks were used. You can select a file and click the 'View' button to see the actual text data.
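
If you prefer not to install a separate tool, you can also inspect the imported files directly from the Hadoop Command Line on the head node. A minimal sketch; the part-m-* file names depend on how many map tasks ran:

    hadoop fs -ls /user/hdp/SqoopImportTable1

    hadoop fs -cat /user/hdp/SqoopImportTable1/part-m-00000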

Now let's export the same rows back to SQL Server from the HDInsight cluster. We'll use a different table with the same schema as 'Table1'; otherwise we would get a primary key violation error since the rows already exist in 'Table1'.

  1. Create an empty table 'Table2' with the same schema as 'Table1'.

    CREATE TABLE [dbo].[Table2] (
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL, -- column length assumed, matching Table1
        [LName] [nvarchar](50) NOT NULL,
        CONSTRAINT [pk_table_2] PRIMARY KEY CLUSTERED
        (
            [ID] ASC
        )
    ) ON [PRIMARY]
    GO

  2. Run the following Sqoop command from Hadoop command line.

    sqoop.cmd export --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table2 --export-dir /user/hdp/SqoopImportTable1 --input-fields-terminated-by ","
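
To verify the export without leaving the head node, you can run an ad-hoc query against the database with Sqoop's eval tool. This is only a sketch, reusing the same placeholder connection string as above:

    sqoop.cmd eval --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --query "SELECT COUNT(*) FROM Table2"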

More Sample Sqoop commands:

Import from a SQL Server on a Windows Azure VM:

sqoop.cmd import --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUsername>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table table_1 --target-dir /user/hdp/SqoopImportTable

Export to a SQL Server on a Windows Azure VM:

sqoop.cmd export --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUsername>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table table_2 --export-dir /user/hdp/SqoopImportTable2 --input-fields-terminated-by ","

Importing to Hive from a Windows Azure SQL Database:

C:\apps\dist\sqoop-1.4.2\bin>sqoop.cmd import --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUsername>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table1 --hive-import

Note: this would store the files under the hive/warehouse/tablename folder in HDFS (for example, hive/warehouse/table1/part-m-00000).
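
To spot-check a Hive import, you can view one of those files directly from the Hadoop Command Line; a sketch, assuming the default warehouse location mentioned in the note above:

    hadoop fs -cat /hive/warehouse/table1/part-m-00000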

Run Sqoop jobs remotely using HDInsight SDK PowerShell cmdlets

To use the HDInsight PowerShell tools, you need to install the Windows Azure PowerShell tools first and then install the HDInsight PowerShell tools. Then you need to prepare your workstation for the HDInsight SDK. Follow the detailed steps in this earlier blog post to install the tools and prepare your workstation to use the HDInsight SDK.

Once you have installed and configured the Windows Azure PowerShell tools and the HDInsight SDK, running a Sqoop job is very easy. Follow the steps below to import all the rows of table "Table2" from the Windows Azure SQL Database "mfarooqsqldb" to the HDInsight cluster.

  1. Open the Windows Azure PowerShell console on the workstation and run the following cmdlets one at a time.

    Note: you can also use Windows PowerShell ISE to type the code and run it all at once. PowerShell ISE makes editing easy, and you can open the tool from "C:\Windows\System32\WindowsPowerShell\v1.0\powershell_ise.exe".

  2. Set the variables for your Windows Azure subscription name and the HDInsight cluster name.

    $subscriptionName = "<WindowsAzureSubscriptionName>"

    $clusterName = "<HDInsightClusterName>"

    Select-AzureSubscription $subscriptionName

    Use-AzureHDInsightCluster $clusterName -Subscription $subscriptionName

  3. Define the Sqoop job we want to run. In this exercise we'll import all the rows of table "Table2", which we created earlier in the Windows Azure SQL Database.

    $sqoop = New-AzureHDInsightSqoopJobDefinition -Command "import --connect jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName> --table Table2 --target-dir /user/hdp/SqoopImportTable8"

  4. Run the Sqoop job that we just defined.

    $sqoopJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterName -JobDefinition $sqoop

  5. Run the following to wait for the completion or failure of the HDInsight job and show its progress.

    Wait-AzureHDInsightJob -Subscription $subscriptionName -WaitTimeoutInSeconds 3600 -Job $sqoopJob

  6. Run the following to retrieve the log output for a job from the storage account associated with a specified cluster.

    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subscriptionName -StandardError -JobId $sqoopJob.JobId

If the Sqoop job completes successfully, you should see the job status and output in your Windows Azure PowerShell command-line window.

Troubleshooting tips

When you run a Sqoop command, it runs a MapReduce job in the Hadoop cluster (map-only, with no reduce tasks). You can specify the number of map tasks; by default, four map tasks are used. There is no separate log file specific to Sqoop, so we need to troubleshoot Sqoop job failures or performance issues the same way as any other MapReduce job failure or performance issue, starting by checking the task logs. I plan to write more on how to troubleshoot Sqoop issues by focusing on some specific scenarios.
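
For example, to force the import in this blog to use a single map task (and therefore produce a single output file), you could add the -m option to the earlier import command. This is a sketch with the same placeholders as before (use a new --target-dir if the earlier one already exists):

    sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1 -m 1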

That's all for today, and I hope you found this blog useful. I look forward to your comments and suggestions.
