The distributed framework of Hadoop, Zookeeper and HBase was introduced in previous posts, where a distributed cluster of 11 nodes was built. So far, use of the HBase database has been limited to testing the basic insert, delete and update commands. To become more familiar with the distributed framework, this article describes how to import existing data from the relational database SQL Server into HBase.
To migrate data from the relational database into HBase we need the Sqoop tool. Sqoop is a standalone Apache project designed to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). Sqoop builds on the data warehouse tool Hive: it turns the data query into MapReduce tasks and moves the data through Hive. To complete this migration, the following preparations are therefore needed:
① Hive: apache-hive-2.1.1-bin.tar.gz
② Sqoop: sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
③ JDBC driver for SQL Server: sqljdbc_3.0.1301.101_enu.tar.gz
④ Connector between SQL Server and Sqoop: sqoop-sqlserver-1.0.tar.gz
====================== All of the following operations are performed on the master host as the root user ======================
1. Installing hive
① Create the hive directory
cd /home
mkdir hive
② Unpack the installation package (the package has been moved to /home/hive)
tar -zxvf apache-hive-2.1.1-bin.tar.gz
③ Set environment variables
vi /etc/profile
Append the following:
export HIVE_HOME=/home/hive/apache-hive-2.1.1-bin
export PATH=$HIVE_HOME/bin:$PATH
export PATH
export HCAT_HOME=$HIVE_HOME/hcatalog
④ Make the configuration take effect
source /etc/profile
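As an optional sanity check (assuming the paths above), you can verify that the shell now finds Hive:
echo $HIVE_HOME
hive --version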
2. Installing Sqoop
① Create the sqoop directory
cd /home
mkdir sqoop
② Unpack the installation package (the package has been moved to /home/sqoop)
tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
③ Set environment variables
vi /etc/profile
Append the following:
export SQOOP_HOME=/home/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export PATH=$SQOOP_HOME/bin:$PATH
export PATH
④ Make the configuration take effect
source /etc/profile
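A similar quick check for Sqoop (sqoop version also prints warnings about optional components such as Accumulo that are not installed; those warnings are dealt with in section 5):
echo $SQOOP_HOME
sqoop version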
3. Configuring the JDBC driver
① Unpack the package (location is optional)
tar -zxvf sqljdbc_3.0.1301.101_enu.tar.gz
② Copy the JDBC driver jar into Sqoop's lib directory
cp sqljdbc_3.0/enu/sqljdbc4.jar /home/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib
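A simple way to confirm that the driver is now visible to Sqoop (assuming the directory layout above):
ls $SQOOP_HOME/lib | grep sqljdbc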
4. Configuring the SQL Server connector for Sqoop
① Unpack the package (location is optional; here /home is used)
tar -zxvf sqoop-sqlserver-1.0.tar.gz
② Set environment variables
vi /etc/profile
Append the following:
export MSSQL_CONNECTOR_HOME=/home/sqoop-sqlserver-1.0/
Make the configuration take effect:
source /etc/profile
③ Register the connector with Sqoop
cd sqoop-sqlserver-1.0
./install.sh
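install.sh relies on the SQOOP_HOME and MSSQL_CONNECTOR_HOME variables set above. As a rough check (the exact jar name may vary), the connector jar should now show up in Sqoop's lib directory:
ls $SQOOP_HOME/lib | grep -i sqlserver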
5. Configuring Sqoop
The reason for this step is that Sqoop's default configuration checks for components we do not need. Its startup script, $SQOOP_HOME/bin/configure-sqoop, defines many parameters and environment variables that it expects to be configured in advance. Some of these we have already set up, but others are not needed here (and their effect is not yet clear to me), so to keep the runtime configuration check from failing, the straightforward approach is to disable that part of the check.
Comment out the Accumulo-related configuration: in the configure-sqoop file, comment out all the lines related to ACCUMULO_HOME, then save and exit. The block to comment out looks roughly like the sketch below.
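As an illustration only (the exact wording differs between Sqoop releases), the ACCUMULO_HOME check in configure-sqoop looks something like the following; prefixing each line with # disables it:
#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi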
6. Current environment variables
Note that the HBase-related environment variables, which had not been added before, have now been added as well.
The standard CentOS contents of /etc/profile (pathmunge, umask, HISTSIZE, the /etc/profile.d loop, etc.) are left unchanged; the customizations appended at the end of the file are:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.131.x86_64
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
#Hadoop
export HADOOP_HOME=/home/hadoop/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin
#Zookeeper
export ZOOKEEPER_HOME=/home/zookeeper/zookeeper-3.4.6/
export PATH=$ZOOKEEPER_HOME/bin:$PATH
export PATH
#HBase
export HBASE_HOME=/home/hbase/hbase-1.2.4
export PATH=$HBASE_HOME/bin:$PATH
export PATH
#Hive
export HIVE_HOME=/home/hive/apache-hive-2.1.1-bin
export PATH=$HIVE_HOME/bin:$PATH
export PATH
export HCAT_HOME=$HIVE_HOME/hcatalog
#Sqoop
export SQOOP_HOME=/home/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export PATH=$SQOOP_HOME/bin:$PATH
export PATH
export MSSQL_CONNECTOR_HOME=/home/sqoop-sqlserver-1.0/
7. Data Migration Experiment
Before executing the migration command, you need to create the target table in HBase:
# hbase shell
After entering the HBase shell, execute:
> create 'test', 'cf'
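Optionally, the new table can be checked from the same shell:
> list
> describe 'test'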
Then execute the following in the terminal:
# sqoop import --connect 'jdbc:sqlserver://<ip address>;username=<username>;password=<password>;database=<database>' --table <sql server table name> --hbase-table <hbase table name>
Note the following:
① If the SQL Server table has an ordinary single-column primary key, you can simply specify --hbase-row-key and run the command above; by default the import will be executed as multiple MapReduce tasks.
② If the SQL Server table has a composite (federated) primary key, this import runs into a problem: the query cannot be split on the primary key, so the work cannot be spread over multiple MapReduce tasks, and you must specify the parameter '-m 1' so that only one MapReduce task is used.
③ If there is no primary key but the data volume is large and must be divided among multiple MapReduce tasks, a split column has to be chosen so that the job can be partitioned on that column; in that case you must add '--split-by <id>' to the import command. (Example commands for cases ② and ③ are sketched below.)
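For illustration only (the connection details, table name and key columns are placeholders; the column family matches the 'cf' family created above, and --column-family is assumed to be needed for an HBase-targeted import), the commands for cases ② and ③ might look roughly like this:
Composite primary key, single map task:
sqoop import --connect 'jdbc:sqlserver://<ip address>;username=<username>;password=<password>;database=<database>' --table <sql server table name> --hbase-table test --column-family cf --hbase-row-key <id1>,<id2> -m 1
Large table split into several map tasks on a chosen column:
sqoop import --connect 'jdbc:sqlserver://<ip address>;username=<username>;password=<password>;database=<database>' --table <sql server table name> --hbase-table test --column-family cf --hbase-row-key <id1>,<id2> --split-by <id> -m 12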
In my actual run, the SQL Server table has a composite (federated) primary key. For the first import, a table of 100 data records, I followed the approach described in one of the reference articles below: I configured $SQOOP_HOME/conf/sqoop-site.xml, specified '--hbase-row-key <id1>,<id2>' and '-m 1', and the data was imported successfully. (Time taken: 22 sec)
The second attempt imported nearly 1.6 million data records from the database into HBase; I added the parameter '--split-by <id>' and specified '-m 12', and the data was again imported successfully. (Time taken: 17 min 25 sec)
8. Remaining problems
My understanding of how Sqoop splits the import into tasks is still not deep enough. Although the experiment was completed successfully, I did not run comparative experiments to determine which factors really affect performance.
9. Reference Articles
Sqoop User Guide (v1.4.6)
Using Sqoop to import data from a database into HDFS (CSDN blog)
Importing SQL Server data into HBase with Sqoop (CSDN blog, nma_123456)
Using Sqoop on CentOS to import data from SQL Server into HDFS or Hive (CSDN blog, Wang Weixian)
Parallel data import with Sqoop (CSDN blog, Dongjie Bookstore)
Simple control tips for Sqoop imports into HBase tables with federated primary keys (CSDN blog, Gang's technology blog)