MongoDB: Synchronizing Data to Hive (II)
1. Overview
The previous article showed how to query MongoDB data by mapping it through a direct connection to MongoDB. That approach puts load on the online database, so this article introduces a second, BSON-file-based way: export the required collections to local files with mongoexport, put the exported files into the HDFS file system, and finally create the corresponding tables in Hive so the data can be queried with Hive SQL.
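At a glance the whole pipeline comes down to three steps. The following is a minimal sketch using the example host, credentials, database, collection and paths that appear later in this article (the Hive table itself is created in section 4):
#mongoexport -u huoqiu -p huoqiuapp -h 127.0.0.1:27017 -d saturn -c mycol -o /root/data/mycol   # 1. export the collection to a local file
#hdfs dfs -put /root/data/mycol /myjob/job1                                                     # 2. upload the file into HDFS
#hive -e "SELECT * FROM job1 LIMIT 10"                                                          # 3. query the table mapped onto that directory with Hive SQL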
2. Exporting the Data
Use the mongoexport command to export the required collections or fields. The commonly used forms are as follows:
1) Export a whole collection:
#mongoexport -u huoqiu -p huoqiuapp -h 127.0.0.1:27017 -d saturn -c mycol -o /root/data/mycol-$(date +%F_%H-%M-%S).json
-u: the user performing the export; this user must have read access to the database.
-p: the user's password.
-h: the database server address and port, in the form ip:port.
-d: the database name.
-c: the collection to export.
-o: the output file.
--type: the output format; JSON is the default.
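In practice it is worth checking that the export actually succeeded before moving on. A small sketch using the same example values as above:
out=/root/data/mycol-$(date +%F_%H-%M-%S).json
if mongoexport -u huoqiu -p huoqiuapp -h 127.0.0.1:27017 -d saturn -c mycol -o "$out"; then
  echo "exported $(wc -l < "$out") documents to $out"   # mongoexport writes one JSON document per line
else
  echo "mongoexport failed" >&2
fi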
2) Export only some fields of a collection.
For example, export just the id field of the mycol collection as a CSV file:
#mongoexport -u huoqiu -p huoqiuapp -h 127.0.0.1:27017 -d saturn -c mycol --type csv -f "id" -o /root/data/mycol-$(date +%F_%H-%M-%S).csv
-d: the database name
-c: the collection name
-o: the output file name
--type: the output format, JSON by default
-f: the fields to output; required when --type is csv, e.g. -f "field1,field2"
-q: a query filter, for example: -q '{"function": "test100"}'
#mongoexport -h 127.0.0.1:27017 -u huoqiu -p huoqiuapp -d saturn -c mycol --type csv -f id,function -q '{"function": "test100"}' -o /root/data/oplog.rs-$(date +%F_%H-%M-%S).csv
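The -q filter is a JSON document, so shell quoting matters: a common pattern is to wrap the whole filter in single quotes and use double quotes inside it. A minimal sketch (the output path here is just an example):
#mongoexport -h 127.0.0.1:27017 -u huoqiu -p huoqiuapp -d saturn -c mycol \
    --type csv -f id,function \
    -q '{"function": "test100"}' \
    -o /root/data/mycol_test100.csv
#head -1 /root/data/mycol_test100.csv    # the first line of the CSV output is the header row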
3) If MongoDB is deployed on different machines from Hadoop and Hive, the MongoDB tools also need to be installed on the Hadoop server. The mongod service does not have to run there; the installation is only needed so that the mongoexport command is available for copying the data.
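A quick way to confirm the tool is usable on the Hadoop node and can reach the remote MongoDB instance is something like the sketch below (the address and credentials are the ones used by the script in section 5; the output path is just a throwaway example):
#command -v mongoexport || echo "mongoexport is not installed on this node"
#mongoexport -u test -p testpwd -h 192.168.1.11:27017 -d saturn -c merchants -o /tmp/merchants_test.json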
3. Importing the Files into HDFS
1) First, create a directory in HDFS to hold the corresponding table files.
2) Note that each table needs its own directory.
3) The commands are as follows (the Hadoop bin directory has been added to the environment variables):
#hdfs dfs -mkdir /myjob
#hdfs dfs -mkdir /myjob/job1
!! Note that, without the -p option, HDFS directories have to be created one level at a time; hdfs dfs -mkdir -p /myjob/job1 creates the whole path in one go.
Upload the file to HDFS:
#hdfs dfs -put /data/job1 /myjob/job1
/data/job1 is the local path, i.e. the path of the exported MongoDB file.
/myjob/job1 is the HDFS path.
4) View the files that have been uploaded to HDFS:
#hdfs dfs -ls /myjob/job1
5) Modify the permissions:
#hdfs dfs -chmod 777 /myjob/job1
6) Fetch files out of HDFS:
#hdfs dfs -get /myjob/job1 /data/job1
7) Delete a file:
#hdfs dfs -rm /myjob/job1
Delete a directory:
#hdfs dfs -rm -r /myjob
The -r option removes the directory together with its contents; add -f to suppress the error message when the path does not exist.
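The steps above can be combined into a short helper; the following is a minimal sketch that assumes the same example paths used above (/data/job1 locally, /myjob/job1 in HDFS):
#!/bin/bash
# Sketch: create the HDFS directory, upload one exported file and verify it.
set -e
local_file=/data/job1
hdfs_dir=/myjob/job1

hdfs dfs -mkdir -p "$hdfs_dir"               # -p creates parent directories as needed
hdfs dfs -put -f "$local_file" "$hdfs_dir"   # -f overwrites an existing copy
hdfs dfs -chmod 777 "$hdfs_dir"
hdfs dfs -ls "$hdfs_dir"                     # confirm the upload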
4. Creating the Table in Hive
#hive
hive> CREATE TABLE IF NOT EXISTS ${table_name}
(
id string,
userid string,
.
.
.
)
COMMENT 'description'
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES ('mongo.columns.mapping' = '{the mapping of Hive fields to MongoDB fields}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION 'the HDFS directory';
LOCATION points at the HDFS directory that holds the BSON files, /myjob/job1 in the example above.
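As a concrete illustration, the DDL for the job1 data uploaded above might look like the following. The column list and the field names in the mapping (a MongoDB _id and a userId field) are only assumptions for the example; adapt them to the real collection:
CREATE TABLE IF NOT EXISTS job1
(
id string,
userid string
)
COMMENT 'job1 collection exported from MongoDB'
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES ('mongo.columns.mapping' = '{"id":"_id","userid":"userId"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/myjob/job1';
Once the table exists it can be queried with ordinary Hive SQL, for example: SELECT id, userid FROM job1 LIMIT 10;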
5. For convenience, the export from MongoDB to local files and the upload of those files into HDFS have been wrapped in a script.
#cat hdfs.sh
#!/bin/bash
# This script exports the collections listed below from MongoDB as files and uploads them to HDFS.
# The collections to export
list="
merchants
martproducts
products
coupons
couponlogs
reviews
orderoplogs
orders
"
# If a previous export file exists, delete it
for i in $list
do
  if [ -e /data/mongodata/$i ]; then
    rm -rf /data/mongodata/$i
    sleep 5s
  fi
done
# Export the data from MongoDB to the local disk
for a in $list
do
  nohup /data/mongodb/bin/mongoexport -u test -p testpwd -h 192.168.1.11:27017 -d saturn -c $a -o /data/mongodata/$a >> /data/nohup.out 2>&1 &
  #sleep 5m
done
# Delete the old files from HDFS
for b in $list
do
  nohup /data/hadoop-2.7.3/bin/hdfs dfs -rm /$b/$b >> /data/nohuprm.out 2>&1 &
done
# Upload the local files into HDFS
for c in $list
do
  /data/hadoop-2.7.3/bin/hdfs dfs -put /data/mongodata/$c /$c
  sleep 1m
done
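Note that the mongoexport and hdfs dfs -rm commands are started in the background with nohup ... &, so the later loops can begin before all exports have finished; the commented-out sleep 5m and the sleep 1m only paper over this. A more robust variant of the export loop (a sketch, not part of the original script) blocks with wait before touching HDFS:
for a in $list
do
  /data/mongodb/bin/mongoexport -u test -p testpwd -h 192.168.1.11:27017 -d saturn -c $a -o /data/mongodata/$a >> /data/nohup.out 2>&1 &
done
wait   # block here until every export has finished, then delete and re-upload the HDFS files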
6. Add the script to a scheduled job. There are two ways to do this: crontab or Jenkins.
1) Using crontab
#crontab -e
0 * * * * /data/hdfs.sh
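If you want to keep the script's output, a common variant (the log file path here is just an example) is to redirect it in the crontab entry:
0 * * * * /data/hdfs.sh >> /data/hdfs_cron.log 2>&1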
2) Using Jenkins
1. Create a project; the name is up to you.
2. Configure the build schedule (build periodically).
3. Add a build step that executes the script.
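As an illustration of steps 2 and 3 (the exact schedule is up to you), the "Build periodically" field accepts Jenkins cron syntax, and the build step can simply be an "Execute shell" step that calls the script:
# "Build periodically" schedule; H lets Jenkins spread the start minute
H * * * *
# "Execute shell" build step
/data/hdfs.sh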