Function: import files from HDFS into MongoDB via Spark SQL
The required jar packages are mongo-spark-connector_2.11-2.1.2.jar and mongo-java-driver-3.8.0.jar.
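If you build the job with sbt rather than copying the jars around, the same two libraries can be declared as dependencies. A sketch of a build.sbt fragment; the coordinates are the published Maven coordinates matching the jar names above:

libraryDependencies ++= Seq(
  "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.1.2",
  "org.mongodb" % "mongo-java-driver" % "3.8.0"
)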
The Scala code is as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import com.mongodb.spark._
import org.bson.Document
import com.mongodb.spark.config._

object Exec {
  def main(args: Array[String]) {
    if (args.length < 6) {
      System.err.println("Usage: Exec <hdfsServer> <logPath> <fileName> <mongoHost> <mongoDB> <mongoCollection>")
      System.exit(1)
    }
    val hdfsServer = args(0)      // e.g. "hdfs://master"
    val logPath = args(1)         // e.g. "/user/hdfs/log/"
    val fileName = args(2)        // e.g. 2017-05-04.txt
    val mongoHost = args(3)       // e.g. "10.15.22.22:23000"
    val mongoDB = args(4)         // Mongo database name
    val mongoCollection = args(5) // Mongo collection name
    try {
      val spark = SparkSession
        .builder()
        .master("local")
        .appName("SparkImportDataToMongo")
        .config("spark.debug.maxToStringFields", 1000) // value unreadable in the original; 1000 is a common choice
        .getOrCreate()
      import spark.implicits._
      // Each line of the log file is parsed as a JSON document
      val df = spark.read.json(hdfsServer + logPath + "/" + fileName)
      df.printSchema()
      // Append the rows to the target MongoDB collection via the connector's data source
      df.write.mode("append")
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("spark.mongodb.output.uri",
          "mongodb://" + mongoHost + "/" + mongoDB + "." + mongoCollection)
        .save()
    } catch {
      case ex: Exception =>
        println(ex.toString)
    }
  }
}
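The connector also ships a helper object, com.mongodb.spark.MongoSpark, which reads the output URI from the session configuration instead of a per-write option. A minimal sketch of the same write expressed that way, with the example argument values hard-coded for brevity (host, database, and collection names are the illustrative ones used above):

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

object ExecViaMongoSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("SparkImportDataToMongo")
      // MongoSpark.save() picks the target up from spark.mongodb.output.uri
      .config("spark.mongodb.output.uri", "mongodb://10.15.22.22:27017/mydb.data_default_test")
      .getOrCreate()

    val df = spark.read.json("hdfs://master/user/hdfs/log/2017-05-04.txt")
    // Equivalent to df.write.format("com.mongodb.spark.sql.DefaultSource")...save()
    MongoSpark.save(df.write.mode("append"))
  }
}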
Execute the following command in the Spark run directory:
./bin/spark-submit --master spark://11.12.13.14:7077 --class Exec /bigdata/spark-2.1.1-bin-hadoop2.6/examples/importdatatomongo.jar hdfs://master /user/hdfs/log 2017-05-04.txt 10.15.22.22:27017 mydb data_default_test
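This assumes the two connector jars are already on the classpath of the driver and executors. If they are not installed on the cluster, one way to supply them is the --jars flag (the /path/to locations are placeholders):

./bin/spark-submit --master spark://11.12.13.14:7077 \
  --jars /path/to/mongo-spark-connector_2.11-2.1.2.jar,/path/to/mongo-java-driver-3.8.0.jar \
  --class Exec /bigdata/spark-2.1.1-bin-hadoop2.6/examples/importdatatomongo.jar \
  hdfs://master /user/hdfs/log 2017-05-04.txt 10.15.22.22:27017 mydb data_default_test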
Run:
[root@master spark-2.1.1-bin-hadoop2.6]# ./bin/spark-submit --master spark://11.12.13.14:7077 --class Exec /bigdata/spark-2.1.1-bin-hadoop2.6/examples/importdatatomongo.jar hdfs://master /user/hdfs/log 2017-05-04.txt 10.15.22.22:27017 mydb data_default_test
18/07/20 23:41:13 INFO spark.SparkContext: Running Spark version 2.1.1
18/07/20 23:41:14 INFO spark.SecurityManager: Changing view acls to: root
18/07/20 23:41:14 INFO spark.SecurityManager: Changing modify acls to: root
18/07/20 23:41:14 INFO spark.SecurityManager: Changing view acls groups to:
18/07/20 23:41:14 INFO spark.SecurityManager: Changing modify acls groups to:
18/07/20 23:41:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
18/07/20 23:41:14 INFO util.Utils: Successfully started service 'sparkDriver' on port 24073.
18/07/20 23:41:14 INFO spark.SparkEnv: Registering MapOutputTracker
18/07/20 23:41:14 INFO spark.SparkEnv: Registering BlockManagerMaster
18/07/20 23:41:14 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/07/20 23:41:14 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/07/20 23:41:14 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-9c42a710-559b-4c97-b92a-58208a77afeb
18/07/20 23:41:14 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
18/07/20 23:41:14 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/07/20 23:41:14 INFO util.log: Logging initialized @1777ms
18/07/20 23:41:14 INFO server.Server: jetty-9.2.z-SNAPSHOT
18/07/20 23:41:14 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@...{/jobs,null,AVAILABLE,@Spark}
(further ContextHandler startup lines for /jobs/json, /jobs/job, /stages, /stages/stage, /stages/pool, /storage, /storage/rdd, /environment, /executors, /executors/threadDump, and /static omitted)
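To spot-check the import, the same connector can read the collection back into a DataFrame. A short sketch using the input-side counterpart of the output option above (the URI values are the example ones):

val readBack = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("spark.mongodb.input.uri", "mongodb://10.15.22.22:27017/mydb.data_default_test")
  .load()
readBack.printSchema()
println(s"Documents in collection: ${readBack.count()}")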