Spark/Hadoop Integration with MongoDB

Source: Internet
Author: User
Tags: mongodb query, mongodb update

MongoDB is a document database that can be used easily from most languages; its implementation is in C++. Tests by third parties suggest that MongoDB's query performance is better than that of many NoSQL databases currently on the market; the test write-up is linked below:

Http://www.kuqin.com/shuoit/20140928/342398.html
Below is a brief introduction to MongoDB:
First, MongoDB features
1. Schema-free, with support for dynamic queries and full indexing; objects and arrays embedded in documents can be queried easily (see the sketch after this list).
2. Collection-oriented storage, which makes it easy to store object-style data, including objects and arrays embedded within documents.
3. Efficient data storage, with support for binary data and large objects.
4. Supports replication and recovery: provides master-slave and master-master data replication, as well as replication between servers.
5. Automatic sharding for cloud-level scalability: supports horizontally scaled database clusters, and servers can be added dynamically.
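As an illustration of point 1, the following is a minimal sketch of querying a field inside an embedded object with the Casbah Scala driver; the test.people collection and the info.age field are invented for the example.

import com.mongodb.casbah.Imports._

// Query documents whose embedded info.age field is greater than 20,
// using dot notation to reach into the embedded object.
val client = MongoClient("master", 27017)
val people = client("test")("people")
val adults = people.find(MongoDBObject("info.age" -> MongoDBObject("$gt" -> 20)))
adults.foreach(println)
client.close()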
Second, suitable usage scenarios
1. Suitable as a persistent cache layer for an information infrastructure.
2. Suitable for real-time inserts, updates, and queries, with the replication and high scalability that real-time data storage requires.
3. MongoDB's BSON data format is well suited to storing and querying document-style data.
4. Suitable for databases made up of dozens or hundreds of servers, because MongoDB already contains built-in support for a MapReduce engine.
Third, unsuitable scenarios
1. Systems that require a high degree of transactionality.
2. Traditional business intelligence applications.
3. Complex cascading queries across documents (tables).
Now, facing the challenges of the big data era, here is how the Spark computing framework integrates with MongoDB.
Let's first talk about Spark reading from MongoDB. I usually use one of two approaches:

Spark integrated with MongoDB, reading from MongoDB:

// Scheme 1: read through the mongo-hadoop connector
import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://master:20000,slave1:20000,slave2:20000/yang.relation2")
mongoConfig.set("mongo.input.split_size", "8")                  // size (MB) of each input split; tune to your data (illustrative value)
mongoConfig.set("mongo.input.split.read_shard_chunks", "true")  // read shard chunks
// Read only the columns you need: 1 means the field is returned, 0 means it is not, just like MongoDB's projection
mongoConfig.set("mongo.input.fields", "{\"srcid\": 1, \"dstid\": 1}")
mongoConfig.set("mongo.input.query", "{\"dstid\": {\"$gt\": 0}}")
val readFile = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
readFile.count()

// Scheme 2: read through the Stratio spark-mongodb connector
import org.apache.spark.sql.SQLContext
import com.stratio.provider.mongodb._
import MongodbConfig._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}

val sqlContext = new SQLContext(sc)
val builder = MongodbConfigBuilder(Map(Host -> List("master:27017", "slave1:27017", "slave2:27017"),
  Database -> "graphdb", Collection -> "mongo", SamplingRatio -> 1.0, WriteConcern -> MongodbWriteConcern.Normal))
val mConf = builder.build()
val readFile2 = sqlContext.fromMongoDB(mConf)
readFile2.count()
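As a quick follow-up to Scheme 1, here is a minimal sketch of turning the (Object, BSONObject) pairs into a plain RDD of (srcid, dstid) tuples; the field names simply mirror the projection above, and the toString/toLong conversion is just one defensive way of handling ids stored under different numeric types.

// Pull the projected fields out of each BSONObject value.
val edges = readFile.map { case (_, doc) =>
  (doc.get("srcid").toString.toLong, doc.get("dstid").toString.toLong)
}
edges.take(5).foreach(println)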

Spark integrated with MongoDB, writing to MongoDB:
Scheme 1:

val mongoConfig = new Configuration()
mongoConfig.set("mongo.auth.uri", "mongodb://" + userName + ":" + pwd + "@" + hosts + "/admin")
mongoConfig.set("mongo.output.uri", "mongodb://" + hosts + "/graphdb.db_graph")
saveRdd.saveAsNewAPIHadoopFile("", classOf[Object], classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]], mongoConfig)

Scheme 2:

import MongodbConfig._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern, MongoClient}
import com.stratio.provider.mongodb._

val sqlContext = new SQLContext(sc)
val property = Array("id", "name", "age", "sex", "info")
// data: an RDD (or Seq) of case-class records to be written
val dataFrame = sqlContext.createDataFrame(data).toDF(property: _*)
val builder = MongodbConfigBuilder(Map(Host -> List("master:27017", "slave1:27017", "slave2:27017"),
  Database -> "test", Collection -> "test", SamplingRatio -> 1.0, WriteConcern -> MongodbWriteConcern.Normal))
val mongoConf = builder.build()
dataFrame.saveToMongodb(mongoConf, true)

Scheme 3: use the RDD's foreachPartition and establish a connection inside each partition to import the data. If the number of partitions is larger than the number of CPU cores allocated to Spark, many problems appear: for example, when viewing the MongoDB log, the mongos process sometimes hangs because MongoDB runs into trouble allocating read and write locks, and OOM errors occur ("unable to create a native thread"; this newbie is still working on that one). Be sure to create the connection inside foreachPartition, otherwise you will run into serialization problems. A minimal sketch follows.
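The following sketch of Scheme 3 assumes a hypothetical docs RDD whose elements are MongoDB documents (DBObject, e.g. built with Casbah's MongoDBObject) and a made-up test.people target collection; the point it illustrates is that the MongoClient is created inside foreachPartition, so it is never serialized with the closure.

import com.mongodb.ServerAddress
import com.mongodb.casbah.Imports._

// One connection per partition, created inside the closure.
docs.foreachPartition { partition =>
  val client = MongoClient(List(new ServerAddress("master", 27017),
                                new ServerAddress("slave1", 27017),
                                new ServerAddress("slave2", 27017)))
  val collection = client("test")("people")
  partition.foreach(doc => collection.insert(doc))  // write each document of this partition
  client.close()                                    // release the connection when the partition is done
}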

Hadoop integrated with MongoDB, updating MongoDB:
val mongoConfig = new Configuration()
mongoConfig.set("mongo.output.uri", "mongodb://master:27017/db.table")
saveRdd.saveAsNewAPIHadoopFile("", classOf[Object], classOf[MongoUpdateWritable],
  classOf[MongoOutputFormat[Object, MongoUpdateWritable]], mongoConfig)
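For completeness, here is a minimal sketch of how the saveRdd used above might be built as an RDD of (Object, MongoUpdateWritable) pairs. The relation RDD, the srcid field, and the $inc/$set modifiers are invented for the example, and the MongoUpdateWritable constructor arity differs between mongo-hadoop releases (older 1.x versions take four arguments without the final replace flag), so check the version you build against.

import org.apache.spark.rdd.RDD
import com.mongodb.BasicDBObject
import com.mongodb.hadoop.io.MongoUpdateWritable

// relation is a hypothetical RDD[Int] holding the srcid values whose documents we want to update.
val saveRdd: RDD[(Object, MongoUpdateWritable)] = relation.map { srcid =>
  val query = new BasicDBObject("srcid", srcid)                            // which documents to match
  val modifiers = new BasicDBObject("$inc", new BasicDBObject("weight", 1))
    .append("$set", new BasicDBObject("status", "seen"))                   // update operators, not a whole-document replacement
  (null, new MongoUpdateWritable(
    query,      // selection query
    modifiers,  // update modifiers
    true,       // upsert: insert if nothing matches
    false,      // multiUpdate: update only the first matching document
    false))     // replace: keep applying operators (omit this flag on older mongo-hadoop versions)
}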

When updating, this can be combined with MongoDB's numeric modifiers (such as $inc); when there is time I will share how those modifiers are used. I am a newbie, so if anything here is wrong please point it out; Xiao Yang here humbly thanks you experts.
