The following shell script is used to import data from a database to HDFS using SQOOP:
# Oracle connection string, containing the Oracle address, port, and SID
CONNECTURL=jdbc:oracle:thin:@20.135.60.21:1521:DWRAC2
# Oracle username
ORACLENAME=kkkaa
# Oracle password
ORACLEPASSWORD=kkkaa123
# Name of the table to import from Oracle
oralceTableName=tt
# Columns of the Oracle table to import
columns=AREA_ID,TEAM_NAME
# HDFS path where the imported data will be stored
hdfsPath=apps/as/hive/$oralceTableName
# Run the import: load the data from Oracle into HDFS
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --num-mappers 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001'
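Once the script has run, a quick sanity check (assuming the hadoop client is on the PATH and the same relative hdfsPath as above) is to list the files Sqoop wrote and peek at a few records:
hadoop fs -ls apps/as/hive/tt
hadoop fs -cat 'apps/as/hive/tt/part-m-*' | head -5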
After the script is executed, the import is complete. The next step is to create a Hive external table and point its location at the HDFS path where the Oracle data was stored. Note that the data this script imports into HDFS is plain text, so when creating the Hive external table you should not declare the RCFile format; keep the default TextFile, with '\001' as the field delimiter. If the same table is imported multiple times, each run appends its data to the HDFS directory. A sample DDL is sketched below.
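For reference, here is a minimal sketch of such an external table, assuming the two columns from the script above with placeholder STRING types (adjust the column names, types, and absolute location to your environment):
CREATE EXTERNAL TABLE tt (
  area_id   STRING,   -- placeholder type; match the Oracle column
  team_name STRING    -- placeholder type; match the Oracle column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
-- The relative hdfsPath apps/as/hive/tt resolves under the importing user's HDFS home directory.
LOCATION '/user/<hdfs-user>/apps/as/hive/tt';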
Parallel import

Now assume that the following Sqoop command needs to import data from Oracle to HDFS:
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
Note the "-m" parameter in this command (short for --num-mappers): it controls how many parallel map tasks are used for the import. Its value here is 1, so the parallel function is not enabled. We can increase the value of "-m" to enable parallel import, as in the following command:
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 4 --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
Sqoop then starts four processes that import data concurrently. However, if the table being imported from Oracle has no primary key, the following error is reported:
ERROR tool.ImportTool: Error during import: No primary key could be found for table creater_user.popt_cas_redirect_his. Please specify one with --split-by or perform a sequential import with '-m 1'.
In this case, to make better use of Sqoop's parallel import, we need to understand how Sqoop implements it. Suppose the primary key of the Oracle table being imported in parallel is id and the number of parallel tasks is 4. Sqoop first executes the following query:
select min(id) as min, max(id) as max from table [where <condition>, if a where clause was specified];
This query returns the minimum and maximum values of the split column (id); suppose they are 1 and 1000, respectively. Sqoop then divides that range according to the requested degree of parallelism. In this example, the import is split into the following four queries, which run concurrently:
select * from table where id >= 1 and id < 250;
select * from table where id >= 250 and id < 500;
select * from table where id >= 500 and id < 750;
select * from table where id >= 750 and id <= 1000;
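By default Sqoop derives these boundaries itself, but Sqoop 1.x also accepts a user-supplied boundary query via --boundary-query, which can help when the min/max scan is expensive. A minimal sketch, assuming the same hypothetical id split column as above:
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 4 --table $oralceTableName --columns $columns --split-by id --boundary-query "select min(id), max(id) from $oralceTableName where data_desc='2011-02-26'" --where "data_desc='2011-02-26'" --fields-terminated-by '\001'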
Note that the split column works best when it is numeric and its values are evenly distributed. The example above also shows that when the table to be imported has no primary key, we must manually choose an appropriate split column and an appropriate degree of parallelism. Next, let's import creater_user.popt_cas_redirect_his from Oracle. This table has no primary key, so a split column must be chosen by hand. Looking at the fields of the table, ds_name seems like a possible split column, so I run the following SQL statement to verify that idea:
select min(ds_name), max(ds_name) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
The result is not ideal: min and max are equal, so all rows would fall into a single split, and this field is not suitable as a split column. Let's test another field, CLIENTIP:
select min(CLIENTIP), max(CLIENTIP) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
This result looks much better, so we use CLIENTIP as the split column and import the data in parallel with the following command:
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 12 --split-by CLIENTIP --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
This command completes in 20 minutes 35 seconds and imports 33,222,896 rows. If this split still does not fit our needs, we can also run several Sqoop commands at the same time, each selecting its own slice of the data in the --where clause. For example (a script-driven version of this idea is sketched after the commands):
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime < '10:00:00'"
sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime >= '10:00:00'"
Running these two commands concurrently achieves the effect of a parallel import.
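As a rough sketch of how this manual sharding can be driven from a script, the loop below launches one Sqoop command per time slice in the background and waits for all of them to finish. It reuses the connection variables from the script at the top and assumes, for illustration only, that logtime is stored as an 'HH:MM:SS' string; the slice boundaries are arbitrary examples:
#!/bin/bash
# Hypothetical time slices covering one day; tune the boundaries to the real data distribution.
SLICES="00:00:00-08:00:00 08:00:00-16:00:00 16:00:00-24:00:00"
for slice in $SLICES; do
  start=${slice%-*}   # text before the dash
  end=${slice#*-}     # text after the dash
  # One sequential (-m 1) import per slice; together the slices cover the whole day.
  sqoop import --append --connect $CONNECTURL --username $ORACLENAME --password $ORACLEPASSWORD --target-dir $hdfsPath --m 1 --table $oralceTableName --columns $columns --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime >= '$start' and logtime < '$end'" &
done
wait   # block until every background import has finished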