Analyzing MongoDB Data using Hadoop MapReduce (1)


I recently looked into using Hadoop MapReduce to analyze data stored in MongoDB. I pieced together a working demo from examples found online; the process is described below.

Environment

    • Ubuntu 14.04 64bit
    • Hadoop 2.6.4
    • MongoDB 2.4.9
    • Java 1.8
    • mongo-hadoop-core-1.5.2.jar
    • mongo-java-driver-3.0.4.jar

Downloading and configuring mongo-hadoop-core-1.5.2.jar and mongo-java-driver-3.0.4.jar

    • Compile mongo-hadoop-core-1.5.2.jar:
      $ git clone https://github.com/mongodb/mongo-hadoop
      $ cd mongo-hadoop
      $ ./gradlew jar
      • Compilation takes a while; once it succeeds, mongo-hadoop-core-1.5.2.jar can be found under core/build/libs
    • Download mongo-java-driver-3.0.4.jar:
    • http://central.maven.org/maven2/org/mongodb/mongo-java-driver/3.0.4/
      mongo-java-driver-3.0.4.jar
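For the job to find the connector classes at runtime, both jars need to be on Hadoop's classpath. One common approach is to copy them into a lib directory Hadoop already scans; this is only a sketch, and the exact paths depend on your installation layout:

```shell
# Assumed paths -- adjust to your own installation.
# Copies both jars into a directory that is already on Hadoop's runtime classpath.
cp mongo-hadoop/core/build/libs/mongo-hadoop-core-1.5.2.jar \
   "$HADOOP_HOME/share/hadoop/common/lib/"
cp mongo-java-driver-3.0.4.jar \
   "$HADOOP_HOME/share/hadoop/common/lib/"
```

Restart Hadoop after copying so the daemons pick up the new jars.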

Data

  • Sample data
  • > db.in.find()
    { "_id" : ObjectId("5758db95ab12e17a067fbb6f"), "x" : "Hello World" }
    { "_id" : ObjectId("5758db95ab12e17a067fbb70"), "x" : "Nice to Meet You" }
    { "_id" : ObjectId("5758db95ab12e17a067fbb71"), "x" : "Good to See You" }
    { "_id" : ObjectId("5758db95ab12e17a067fbb72"), "x" : "World War 2" }
    { "_id" : ObjectId("5758db95ab12e17a067fbb73"), "x" : "See You Again" }
    { "_id" : ObjectId("5758db95ab12e17a067fbb74"), "x" : "Bye Bye" }
  • The final result
  • > db.out.find()
    { "_id" : "2", "value" : 1 }
    { "_id" : "Again", "value" : 1 }
    { "_id" : "Bye", "value" : 2 }
    { "_id" : "Good", "value" : 1 }
    { "_id" : "Hello", "value" : 1 }
    { "_id" : "Meet", "value" : 1 }
    { "_id" : "Nice", "value" : 1 }
    { "_id" : "See", "value" : 2 }
    { "_id" : "to", "value" : 2 }
    { "_id" : "War", "value" : 1 }
    { "_id" : "World", "value" : 2 }
    { "_id" : "You", "value" : 3 }
  • The goal is to count how often each word appears across the documents, storing each word as the key and its frequency as the value in MongoDB
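The counting logic itself can be sketched in plain Java, independent of Hadoop (the class name `WordCountSketch` is mine, for illustration only); tokenization is whitespace-delimited and case-sensitive, which is why "2" and "War" count as words:

```java
import java.util.*;

public class WordCountSketch {
    // Hypothetical helper mirroring the mapper/reducer logic:
    // split each document's "x" field on whitespace and tally each token.
    static Map<String, Integer> count(List<String> docs) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String doc : docs) {
            StringTokenizer itr = new StringTokenizer(doc);
            while (itr.hasMoreTokens()) {
                freq.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Hello World", "Nice to Meet You", "Good to See You",
            "World War 2", "See You Again", "Bye Bye");
        // Produces the same counts as the result collection shown above.
        System.out.println(count(docs));
    }
}
```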

Hadoop MapReduce Code

  • Mapreduce Code
    import java.util.*;
    import java.io.*;

    import org.bson.*;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, BSONObject value, Context context)
                    throws IOException, InterruptedException {
                System.out.println("key: " + key);
                System.out.println("value: " + value);
                StringTokenizer itr = new StringTokenizer(value.get("x").toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mongo.input.uri", "mongodb://localhost/testmr.in");
            conf.set("mongo.output.uri", "mongodb://localhost/testmr.out");
            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
      • Note: set mongo.input.uri and mongo.output.uri
        conf.set("mongo.input.uri", "mongodb://localhost/testmr.in");
        conf.set("mongo.output.uri", "mongodb://localhost/testmr.out");
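The URI format is mongodb://host[:port]/database.collection. If only a subset of documents should be processed, the mongo-hadoop connector also supports a mongo.input.query key that takes a JSON-encoded query document; this is a sketch to be verified against the connector version you built (the query shown is an example of mine, not from the original demo):

```java
// Sketch: inside main(), after creating conf.
// mongo.input.query restricts the input to documents matching a MongoDB query.
conf.set("mongo.input.uri", "mongodb://localhost:27017/testmr.in");
conf.set("mongo.output.uri", "mongodb://localhost:27017/testmr.out");
conf.set("mongo.input.query", "{\"x\": {\"$exists\": true}}");
```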
  • Compile
    • Compile the class
      $ hadoop com.sun.tools.javac.Main WordCount.java -Xlint:deprecation
    • Package the jar
      $ jar cf wc.jar WordCount*.class
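The compile step only works if the two MongoDB jars are visible to the compiler as well. A minimal sketch, assuming both jars sit in the current directory (adjust paths to wherever you placed them):

```shell
# Assumed jar locations -- adjust as needed.
# HADOOP_CLASSPATH makes the MongoDB classes visible to the compiler
# invoked through the `hadoop` wrapper.
export HADOOP_CLASSPATH="$(pwd)/mongo-hadoop-core-1.5.2.jar:$(pwd)/mongo-java-driver-3.0.4.jar"
hadoop com.sun.tools.javac.Main WordCount.java -Xlint:deprecation
jar cf wc.jar WordCount*.class
```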
  • Run
    • Start Hadoop first; the MapReduce job cannot run without it
      $ start-all.sh
    • Run the program
      $ hadoop jar wc.jar WordCount
  • View the results
  • $ mongo
    MongoDB shell version: 2.4.9
    connecting to: test
    > use testmr;
    switched to db testmr
    > db.out.find({})
    { "_id" : "2", "value" : 1 }
    { "_id" : "Again", "value" : 1 }
    { "_id" : "Bye", "value" : 2 }
    { "_id" : "Good", "value" : 1 }
    { "_id" : "Hello", "value" : 1 }
    { "_id" : "Meet", "value" : 1 }
    { "_id" : "Nice", "value" : 1 }
    { "_id" : "See", "value" : 2 }
    { "_id" : "to", "value" : 2 }
    { "_id" : "War", "value" : 1 }
    { "_id" : "World", "value" : 2 }
    { "_id" : "You", "value" : 3 }
    >

The above is a simple example. Next, I plan to use Hadoop MapReduce to process more complex data in MongoDB. Stay tuned, and if you have any questions, please ask in the comments ^_^

