Spark/Hadoop Integration with MongoDB

Source: Internet
Author: User
Tags: mongodb query, mongodb update

MongoDB is a document database that can be used easily from most languages; its implementation is in C++. Tests by third parties suggest that MongoDB's query performance is better than that of many NoSQL databases currently on the market; the test write-up is linked below:

Http://www.kuqin.com/shuoit/20140928/342398.html
Below is a brief introduction to MongoDB:
First, MongoDB features
1. Schema-free, with support for dynamic queries and full indexing; objects and arrays embedded in documents can be queried easily (see the sketch after this list).
2. Collection-oriented storage, which makes it easy to store object-style data, including objects and arrays embedded within documents.
3. Efficient data storage, with support for binary data and large objects.
4. Supports replication and recovery: provides master-slave and master-master data replication, as well as replication between servers.
5. Automatic sharding for cloud-level scalability: supports horizontally scaled database clusters, and servers can be added dynamically.
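As an illustration of point 1, the following is a minimal sketch of querying a field inside an embedded object with the Casbah Scala driver; the test.people collection and the info.age field are invented for the example.

import com.mongodb.casbah.Imports._

// Query documents whose embedded info.age field is greater than 20,
// using dot notation to reach into the embedded object.
val client = MongoClient("master", 27017)
val people = client("test")("people")
val adults = people.find(MongoDBObject("info.age" -> MongoDBObject("$gt" -> 20)))
adults.foreach(println)
client.close()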
Second, suitable usage scenarios
1. Suitable as a persistent cache layer for an information infrastructure.
2. Suitable for real-time inserts, updates, and queries, with the replication and high scalability that real-time data storage requires.
3. MongoDB's BSON data format is well suited to storing and querying document-style data.
4. Suitable for databases made up of dozens or hundreds of servers, because MongoDB already contains built-in support for a MapReduce engine.
Third, unsuitable scenarios
1. Systems that require a high degree of transactionality.
2. Traditional business intelligence applications.
3. Complex cascading queries across documents (tables).
Now, facing the challenges of the big data era, here is how the Spark computing framework integrates with MongoDB.
Let's first talk about Spark reading from MongoDB. I usually use one of two approaches:

Spark integrated with MongoDB, reading from MongoDB:

// Scheme 1: read through the mongo-hadoop connector
import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://master:20000,slave1:20000,slave2:20000/yang.relation2")
mongoConfig.set("mongo.input.split_size", "8")                  // size (MB) of each input split; tune to your data (illustrative value)
mongoConfig.set("mongo.input.split.read_shard_chunks", "true")  // read shard chunks
// Read only the columns you need: 1 means the field is returned, 0 means it is not, just like MongoDB's projection
mongoConfig.set("mongo.input.fields", "{\"srcid\": 1, \"dstid\": 1}")
mongoConfig.set("mongo.input.query", "{\"dstid\": {\"$gt\": 0}}")
val readFile = sc.newAPIHadoopRDD(mongoConfig, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
readFile.count()

// Scheme 2: read through the Stratio spark-mongodb connector
import org.apache.spark.sql.SQLContext
import com.stratio.provider.mongodb._
import MongodbConfig._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}

val sqlContext = new SQLContext(sc)
val builder = MongodbConfigBuilder(Map(Host -> List("master:27017", "slave1:27017", "slave2:27017"),
  Database -> "graphdb", Collection -> "mongo", SamplingRatio -> 1.0, WriteConcern -> MongodbWriteConcern.Normal))
val mConf = builder.build()
val readFile2 = sqlContext.fromMongoDB(mConf)
readFile2.count()
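As a quick follow-up to Scheme 1, here is a minimal sketch of turning the (Object, BSONObject) pairs into a plain RDD of (srcid, dstid) tuples; the field names simply mirror the projection above, and the toString/toLong conversion is just one defensive way of handling ids stored under different numeric types.

// Pull the projected fields out of each BSONObject value.
val edges = readFile.map { case (_, doc) =>
  (doc.get("srcid").toString.toLong, doc.get("dstid").toString.toLong)
}
edges.take(5).foreach(println)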

Spark integrated with MongoDB, writing to MongoDB:
Scheme 1:

val mongoConfig = new Configuration()
mongoConfig.set("mongo.auth.uri", "mongodb://" + userName + ":" + pwd + "@" + hosts + "/admin")
mongoConfig.set("mongo.output.uri", "mongodb://" + hosts + "/graphdb.db_graph")
saveRdd.saveAsNewAPIHadoopFile("", classOf[Object], classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]], mongoConfig)

Scheme 2:

import MongodbConfig._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern, MongoClient}
import com.stratio.provider.mongodb._

val sqlContext = new SQLContext(sc)
val property = Array("id", "name", "age", "sex", "info")
// data: an RDD (or Seq) of case-class records to be written
val dataFrame = sqlContext.createDataFrame(data).toDF(property: _*)
val builder = MongodbConfigBuilder(Map(Host -> List("master:27017", "slave1:27017", "slave2:27017"),
  Database -> "test", Collection -> "test", SamplingRatio -> 1.0, WriteConcern -> MongodbWriteConcern.Normal))
val mongoConf = builder.build()
dataFrame.saveToMongodb(mongoConf, true)

Scheme 3: use the RDD's foreachPartition and establish a connection inside each partition to import the data. If the number of partitions is larger than the number of CPU cores allocated to Spark, many problems appear: for example, when viewing the MongoDB log, the mongos process sometimes hangs because MongoDB runs into trouble allocating read and write locks, and OOM errors occur ("unable to create a native thread"; this newbie is still working on that one). Be sure to create the connection inside foreachPartition, otherwise you will run into serialization problems. A minimal sketch follows.
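The following sketch of Scheme 3 assumes a hypothetical docs RDD whose elements are MongoDB documents (DBObject, e.g. built with Casbah's MongoDBObject) and a made-up test.people target collection; the point it illustrates is that the MongoClient is created inside foreachPartition, so it is never serialized with the closure.

import com.mongodb.ServerAddress
import com.mongodb.casbah.Imports._

// One connection per partition, created inside the closure.
docs.foreachPartition { partition =>
  val client = MongoClient(List(new ServerAddress("master", 27017),
                                new ServerAddress("slave1", 27017),
                                new ServerAddress("slave2", 27017)))
  val collection = client("test")("people")
  partition.foreach(doc => collection.insert(doc))  // write each document of this partition
  client.close()                                    // release the connection when the partition is done
}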

Hadoop integrated with MongoDB, updating MongoDB:
val mongoConfig = new Configuration()
mongoConfig.set("mongo.output.uri", "mongodb://master:27017/db.table")
saveRdd.saveAsNewAPIHadoopFile("", classOf[Object], classOf[MongoUpdateWritable],
  classOf[MongoOutputFormat[Object, MongoUpdateWritable]], mongoConfig)
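For completeness, here is a minimal sketch of how the saveRdd used above might be built as an RDD of (Object, MongoUpdateWritable) pairs. The relation RDD, the srcid field, and the $inc/$set modifiers are invented for the example, and the MongoUpdateWritable constructor arity differs between mongo-hadoop releases (older 1.x versions take four arguments without the final replace flag), so check the version you build against.

import org.apache.spark.rdd.RDD
import com.mongodb.BasicDBObject
import com.mongodb.hadoop.io.MongoUpdateWritable

// relation is a hypothetical RDD[Int] holding the srcid values whose documents we want to update.
val saveRdd: RDD[(Object, MongoUpdateWritable)] = relation.map { srcid =>
  val query = new BasicDBObject("srcid", srcid)                            // which documents to match
  val modifiers = new BasicDBObject("$inc", new BasicDBObject("weight", 1))
    .append("$set", new BasicDBObject("status", "seen"))                   // update operators, not a whole-document replacement
  (null, new MongoUpdateWritable(
    query,      // selection query
    modifiers,  // update modifiers
    true,       // upsert: insert if nothing matches
    false,      // multiUpdate: update only the first matching document
    false))     // replace: keep applying operators (omit this flag on older mongo-hadoop versions)
}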

When updating, this can be combined with MongoDB's numeric modifiers (such as $inc); when there is time I will share how those modifiers are used. I am a newbie, so if anything here is wrong please point it out; Xiao Yang here humbly thanks you experts.
