Introduction to the Hadoop Driver for MongoDB

------------------------
1. Concepts

Hadoop is an Apache open-source distributed computing framework that includes HDFS, a distributed file system, and MapReduce, a distributed computation model. MongoDB is a document-oriented, distributed NoSQL database. This article introduces the MongoDB Hadoop driver, which lets us use MongoDB as the input source of MapReduce jobs and take full advantage of MapReduce to process and compute on MongoDB data.

2. The Hadoop driver for MongoDB

The current version of the Hadoop driver is still a test version and should not be used in a production environment.
You can download the latest driver package from https://github.com/mongodb/mongo-hadoop. Its dependencies are described below:

  • We recommend the latest version of Hadoop, 0.20.203, or Cloudera CDH3.
  • MongoDB 1.8+ is preferred.
  • In addition, the MongoDB Java driver must be 2.5.3+.

Some of its features:

  • It provides a Hadoop input and output adaptation layer for reading and writing data.
  • Most parameters are configurable and can be set in an XML configuration file: the fields to be queried, the query conditions, and the sort order can all be defined there (see the sketch after this list).
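
For illustration, here is a minimal sketch of setting the query, projected fields, and sort order programmatically instead of in the XML file. The MongoConfigUtil helpers are the same ones used in the WordCount example below, while the mongo.input.* configuration keys are assumptions about the driver's naming and may differ between versions:

import org.apache.hadoop.conf.Configuration;

import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoJobConfigSketch {
    public static void main(String[] args) {
        final Configuration conf = new Configuration();

        // Input and output collections, as in the WordCount example below.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost/test.in");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.out");

        // Assumed configuration keys: query condition, projected fields and sort order.
        // The same key/value pairs could equally be placed in the job's XML configuration file.
        conf.set("mongo.input.query", "{\"x\": {\"$exists\": true}}");
        conf.set("mongo.input.fields", "{\"x\": 1}");
        conf.set("mongo.input.sort", "{\"_id\": 1}");
    }
}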

Currently, the following functions are not supported:

  • Reading source data from multiple shards is not supported yet.
  • Data split operations are not supported yet.

3. Code Analysis
Below is the WordCount.java code from its examples:

// Add the test samples to the "in" collection of MongoDB's "test" database in advance:
//   db.in.insert({x: "Eliot was here"})
//   db.in.insert({x: "Eliot is here"})
//   db.in.insert({x: "Who is here"})
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class WordCount {

    private static final Log log = LogFactory.getLog(WordCount.class);

    // This is the map operation
    public static class TokenizerMapper extends Mapper<Object, BSONObject, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            System.out.println("key: " + key);
            System.out.println("value: " + value);
            // Split the "x" field into words on whitespace
            final StringTokenizer itr = new StringTokenizer(value.get("x").toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // the key is the word and the value is 1
            }
        }
    }

    // This is the reduce operation, used to count how often each word occurs
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up the values of the same word to get its frequency
            int sum = 0;
            for (final IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // the key is a single word, the value is its frequency
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // Define the input and output collections of the MongoDB database.
        // The local MongoDB instance is used, on the default port 27017.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost/test.in");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.out");
        System.out.println("Conf: " + conf);

        final Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        // Define the Mapper, Reducer and Combiner classes
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // Define the output key/value types of the Mapper and Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Define the InputFormat and OutputFormat classes
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
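
If the job completes successfully, the word counts are written back to the test.out collection of the local MongoDB instance, so the results can be inspected with db.out.find() in the mongo shell.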
4. Brief introduction to the split mechanism

Split operations across different shards are not implemented here; that is to say, only one map operation is generated even when the data is distributed across different shards.
Here I offer an idea for sharding support; if you are interested, we can discuss it.
We know that after a collection is sharded, a config database is created. It contains a collection named chunks, and each chunk records a start_row and an end_row; these chunks can be distributed across different shards. By analyzing the chunks collection we can obtain the chunk information on each shard and then combine the chunks on each shard into one InputSplit, which is the MongoInputSplit here. To do this, you only need to modify the getSplits method of the MongoInputFormat class so that it analyzes the chunks collection and obtains the shard information; this yields map operations over multiple splits. Each map then calls the mongos proxy service local to its shard, thus achieving the goal of moving the computation rather than moving the data.
These are just some of my own thoughts; if you are interested, let's discuss them together.
I will release a concrete implementation later.
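
In the meantime, here is a very rough sketch of the idea (not the forthcoming implementation mentioned above). It reads the standard config.chunks metadata through a mongos router and builds one split per shard. The ShardAwareMongoInputFormat class name and the MongoInputSplit constructor taking a shard name plus its chunk ranges are assumptions made only for this sketch:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.hadoop.MongoInputFormat;

// Hypothetical, shard-aware variant of MongoInputFormat; not part of the driver.
public class ShardAwareMongoInputFormat extends MongoInputFormat {

    @Override
    public List<InputSplit> getSplits(final JobContext context) {
        final List<InputSplit> splits = new ArrayList<InputSplit>();
        try {
            // Connect to the mongos router and read the chunk metadata that the
            // config database keeps for the sharded collection.
            final Mongo mongo = new Mongo("localhost", 27017);
            final DBCollection chunks = mongo.getDB("config").getCollection("chunks");

            // Group the chunk ranges (start_row/end_row, stored as min/max) by the shard that owns them.
            final Map<String, List<DBObject>> chunksByShard = new HashMap<String, List<DBObject>>();
            for (final DBObject chunk : chunks.find(new BasicDBObject("ns", "test.in"))) {
                final String shard = (String) chunk.get("shard");
                if (!chunksByShard.containsKey(shard)) {
                    chunksByShard.put(shard, new ArrayList<DBObject>());
                }
                chunksByShard.get(shard).add(chunk);
            }

            // Build one split per shard, so each map task reads only the chunks on its own
            // shard through the local mongos proxy: the computation moves, not the data.
            for (final Map.Entry<String, List<DBObject>> entry : chunksByShard.entrySet()) {
                // Hypothetical MongoInputSplit constructor: shard name plus its chunk ranges.
                splits.add(new MongoInputSplit(entry.getKey(), entry.getValue()));
            }
        } catch (final Exception e) {
            throw new RuntimeException(e);
        }
        return splits;
    }
}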

5. References

  • https://github.com/mongodb/mongo-hadoop
  • http://www.mongodb.org/display/DOCS/Java+Language+Center
