Using Spark with MongoDB


[Preface] MongoDB is the only NoSQL technology I have really gotten to grips with. I came across an article on using Spark with MongoDB and quickly translated it to sharpen my skills. Original: http://codeforhire.com/2014/02/18/using-spark-with-mongodb/

[Body]

Using Spark with MongoDB

Published 2014-02-18 by Sampo N.

I recently started investigating Apache Spark as a framework for data mining. Spark builds on Apache Hadoop and supports many operations beyond plain map-reduce. It also supports streaming data processing and iterative algorithms.

Since Spark builds on Hadoop and HDFS, it is compatible with any HDFS data source. Our servers use MongoDB, so we naturally chose the mongo-hadoop connector, which allows reading and writing data directly from a MongoDB collection.

However, figuring out how to configure and use mongo-hadoop with Spark was far from obvious (at least for a Spark beginner). After some experimentation, a fair amount of frustration, and an email to the Spark user list, I finally got it working in both Java and Scala. I am writing this tutorial to save you the trouble.

Read on for the details, and see the sample application code for a complete working example.

Version and APIs

The Hadoop ecosystem is full of different libraries, and the potential API conflicts between them can drive you crazy. The main API change came in Hadoop 0.20, where the old org.apache.hadoop.mapred API was replaced by the new org.apache.hadoop.mapreduce API. This change carries over into the dependent libraries: the mongo-hadoop connector provides both a com.mongodb.hadoop.mapred package (old API) and a com.mongodb.hadoop package (new API), and SparkContext offers both a hadoopRDD and a newAPIHadoopRDD method.

You need to carefully pick the matching version of each API. This is made trickier by the fact that in most cases the class names of the two APIs are identical and only the package names differ. If you run into a mysterious error, double-check that the APIs you use are consistent.
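For reference, here is how the two import flavors differ in code (illustrative only; the two classes share a simple name, so a single source file would use one or the other):

    // New-style API (org.apache.hadoop.mapreduce), paired with sc.newAPIHadoopRDD(...):
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    // Old-style API (org.apache.hadoop.mapred), paired with sc.hadoopRDD(...):
    // import com.mongodb.hadoop.mapred.MongoInputFormat;
    // import com.mongodb.hadoop.mapred.MongoOutputFormat;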

The example uses Hadoop 2.2.0 and the new API.

Library dependency

Apache Spark relies on numerous supporting libraries, from Apache Commons and Hadoop to slf4j and Jetty. Do not try to manage these dependencies by hand; use Maven, Ivy, SBT, or another build tool.

The example uses SBT and adds the Akka Maven repository. The Maven repository contains the mongo-hadoop connector for various Hadoop versions, but not for 2.2.0, so that connector is added separately.
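As a rough illustration, an SBT build for such a project might look like the sketch below. The resolver line matches the Akka repository mentioned above; the artifact versions are assumptions and the example project's actual build may differ.

    // build.sbt (illustrative sketch; versions are assumptions)
    name := "spark-mongodb-example"

    scalaVersion := "2.10.3"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      "org.mongodb" % "mongo-java-driver" % "2.11.4"
    )

    // The mongo-hadoop connector built against Hadoop 2.2.0 is added separately,
    // for example as an unmanaged jar under lib/.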

Using mongo-hadoop with Spark

The mongo-hadoop configuration parameters are passed in a Configuration object (from the Hadoop package). The most important parameters are mongo.input.uri and mongo.output.uri, which provide the MongoDB host, port, authentication, database, and collection names. Other configuration options can also be set, such as a Mongo query to limit the data read from the collection.
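A minimal sketch of setting up the configuration (the host, database, and collection names are the ones used by the example app; adjust as needed):

    import org.apache.hadoop.conf.Configuration;

    Configuration config = new Configuration();
    config.set("mongo.input.uri", "mongodb://localhost:27017/beowulf.input");
    config.set("mongo.output.uri", "mongodb://localhost:27017/beowulf.output");
    // Optionally restrict the data with a query; the key name below is taken from
    // the mongo-hadoop documentation and is shown here only as an illustration.
    // config.set("mongo.input.query", "{\"text\": {\"$exists\": true}}");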

Each Mongo collection is loaded as a separate RDD, using the SparkContext:

JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class, BSONObject.class);

Note that the new API is used here, and MongoInputFormat must be imported from com.mongodb.hadoop. With the old API, you would use the hadoopRDD method and com.mongodb.hadoop.mapred.MongoInputFormat instead.

The return type is a pair RDD: the first element of each pair is the ObjectId of the MongoDB document, and the second element is the BSON document itself.
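As a quick sanity check, a few pairs can be pulled back to the driver and inspected (a minimal sketch; rdd is the pair RDD loaded above):

    import org.bson.BSONObject;
    import scala.Tuple2;

    // Each record is an (ObjectId, BSON document) pair.
    for (Tuple2<Object, BSONObject> pair : rdd.take(5)) {
        Object id = pair._1();       // the document's _id
        BSONObject doc = pair._2();  // the document itself
        System.out.println(id + " -> " + doc);
    }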

To save an RDD to MongoDB, use the saveAsNewAPIHadoopFile method:

rdd.saveAsNewAPIHadoopFile("file:///bogus", Object.class, Object.class, MongoOutputFormat.class, config);

Only the last two parameters seem to be relevant (though the first must be a valid HDFS URI). The RDD must again be a pair RDD. However, there is a bug: the key cannot be an ObjectId instance. If you want to specify the document ID yourself, represent it as a String; if you want the Mongo driver to generate the ID automatically, pass null as the key (as is done in the example).
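Putting this together, the sketch below shows one way the results could be wrapped into BSON documents and written back. It assumes a Spark 1.x-style Java API (older releases spell mapToPair as map with a PairFunction), and counts stands for a hypothetical JavaPairRDD<String, Integer> of word counts; a sketch producing one appears in the example-app section below.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import org.bson.BSONObject;
    import org.bson.BasicBSONObject;
    import scala.Tuple2;
    import com.mongodb.hadoop.MongoOutputFormat;

    // Wrap each (word, count) pair in a BSON document; the null key lets the
    // Mongo driver generate the ObjectId, working around the bug noted above.
    JavaPairRDD<Object, BSONObject> output = counts.mapToPair(
            new PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
                public Tuple2<Object, BSONObject> call(Tuple2<String, Integer> wc) {
                    BSONObject doc = new BasicBSONObject();
                    doc.put("word", wc._1());
                    doc.put("count", wc._2());
                    return new Tuple2<Object, BSONObject>(null, doc);
                }
            });

    // The path is ignored by MongoOutputFormat but must be a valid HDFS URI.
    output.saveAsNewAPIHadoopFile("file:///bogus",
            Object.class, Object.class, MongoOutputFormat.class, config);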

Example app

The sample application implements a simple word count algorithm in both Java and Scala. It reads the data from the beowulf.input collection of a locally running MongoDB; the documents contain only a text field, and the word count operates on that field.

The results are stored in the output collection of the same beowulf database, with documents containing word and count fields.
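A condensed sketch of the counting step, again against a Spark 1.x-style Java API (older releases use map with a PairFunction instead of mapToPair, and newer ones have FlatMapFunction return an Iterator); it is not the example app's exact code:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.bson.BSONObject;
    import scala.Tuple2;

    // Extract the text field from each document and split it into words.
    JavaRDD<String> words = rdd
            .map(new Function<Tuple2<Object, BSONObject>, String>() {
                public String call(Tuple2<Object, BSONObject> doc) {
                    return (String) doc._2().get("text");
                }
            })
            .flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String line) {
                    return Arrays.asList(line.split(" "));
                }
            });

    // Count occurrences of each word.
    JavaPairRDD<String, Integer> counts = words
            .mapToPair(new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String word) {
                    return new Tuple2<String, Integer>(word, 1);
                }
            })
            .reduceByKey(new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer a, Integer b) {
                    return a + b;
                }
            });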

The example assumes that MongoDB is running locally and that Scala 2.10 and SBT are installed. You can then import the sample data, run the programs, and inspect the results with the following commands:

mongoimport -d beowulf -c input beowulf.json
sbt 'run-main JavaWordCount'
sbt 'run-main ScalaWordCount'
mongo beowulf --eval 'printjson(db.output.find().toArray())' | less
