Introduction to MongoDB's Hadoop Driver

1. Some Concepts

Hadoop is an open-source distributed computing framework from Apache that includes the distributed file system HDFS and the MapReduce distributed computing model. MongoDB is a document-oriented distributed NoSQL database. This article introduces MongoDB's Hadoop driver, which lets MongoDB serve as a MapReduce input source, so that MapReduce can be used to process and compute over the data stored in MongoDB.


2. MongoDB's Hadoop Driver

The Hadoop driver is currently still a beta release and has not yet been used in real production environments.
You can download the latest driver package from https://github.com/mongodb/mongo-hadoop. Its dependencies are as follows:
* Hadoop: the latest 0.20.203 release is currently recommended, or Cloudera CDH3
* MongoDB: preferably 1.8+
* MongoDB Java driver: must be 2.5.3+

Some of its features:

* Provides a Hadoop input and output adaptation layer, so data can be read from and written to MongoDB.
* Exposes a large number of configurable parameters for reading and writing. These parameters can be set in an XML configuration file, where you can define the fields to return, the query conditions, the sorting strategy, and so on (a configuration sketch follows these lists).

Features that are not currently supported:

* Reading source data from a sharded cluster.
* Splitting the input data (no data split operations).
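As a quick illustration of those configuration parameters, the snippet below sets them programmatically on a Hadoop Configuration object. To my knowledge the mongo.input.* / mongo.output.* property names are the ones the driver reads, and the same name/value pairs could equally be placed in the job's XML configuration file; treat the exact keys as assumptions and check the project page above for the authoritative list.

    import org.apache.hadoop.conf.Configuration;

    public class MongoJobConfigSketch {

        public static Configuration buildConf() {
            final Configuration conf = new Configuration();
            // Input and output collections (database.collection) on the local MongoDB
            conf.set("mongo.input.uri", "mongodb://localhost/test.in");
            conf.set("mongo.output.uri", "mongodb://localhost/test.out");
            // Query condition as a JSON string (assumed property name)
            conf.set("mongo.input.query", "{\"x\": {\"$exists\": true}}");
            // Fields to return and sorting strategy (assumed property names)
            conf.set("mongo.input.fields", "{\"x\": 1}");
            conf.set("mongo.input.sort", "{\"x\": 1}");
            return conf;
        }
    }

The resulting Configuration can be handed directly to a Job, exactly as in the WordCount example below.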

3. Code Analysis
Run the WordCount.java code from the driver's examples:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    /**
     * Test data inserted into the "in" collection of the MongoDB "test" database:
     *   db.in.insert({x: "eliot is here"})
     *   db.in.insert({x: "eliot is here"})
     *   db.in.insert({x: "who is here"})
     */
    public class WordCount {

        private static final Log log = LogFactory.getLog(WordCount.class);

        // This is the map operation
        public static class TokenizerMapper extends Mapper<Object, BSONObject, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, BSONObject value, Context context)
                    throws IOException, InterruptedException {
                System.out.println("key: " + key);
                System.out.println("value: " + value);
                // Split the "x" field into words
                final StringTokenizer itr = new StringTokenizer(value.get("x").toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one); // the key here is the word, the value is 1
                }
            }
        }

        // This is the reduce operation, which counts how often each word occurs
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Add up the values of the same word to get its frequency
                int sum = 0;
                for (final IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result); // the key is a word, the value is its frequency
            }
        }

        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();
            // Define the input and output collections of the MongoDB database;
            // the local MongoDB instance is used here, on the default port 27017
            MongoConfigUtil.setInputURI(conf, "mongodb://localhost/test.in");
            MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.out");
            System.out.println("conf: " + conf);

            final Job job = new Job(conf, "word count");

            job.setJarByClass(WordCount.class);

            // Define the Mapper, Reducer and Combiner classes
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);

            // Define the output key/value types of the mapper and reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Define the InputFormat and OutputFormat types
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
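To try the job end to end, the sample documents can also be inserted, and the results inspected, with the plain MongoDB Java driver instead of the mongo shell. The small helper below is just a sketch of mine (the class name is illustrative; the collections match the job above): run the insert part before the job and the dump part after it.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.Mongo;

    // Helper sketch (not part of the driver): load the sample documents into
    // test.in before running the job, then print whatever the job wrote to test.out.
    public class WordCountData {

        public static void main(String[] args) throws Exception {
            final Mongo mongo = new Mongo("localhost", 27017);
            final DB test = mongo.getDB("test");

            // Same sample documents as in the WordCount comment above
            final DBCollection in = test.getCollection("in");
            in.insert(new BasicDBObject("x", "eliot is here"));
            in.insert(new BasicDBObject("x", "eliot is here"));
            in.insert(new BasicDBObject("x", "who is here"));

            // After the MapReduce job has finished, dump the output collection
            final DBCollection out = test.getCollection("out");
            for (final DBObject doc : out.find()) {
                System.out.println(doc);
            }

            mongo.close();
        }
    }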




4. A Brief Look at the Chunking Mechanism

The driver performs no split operation across different shards; in other words, for data distributed across different shards, only one map task is generated.
Here I offer an idea for splitting the input; anyone interested is welcome to discuss it.
As we know, when a collection is chunked, its metadata is kept in the config database, which contains a collection called chunks. Each chunk records a start_row and an end_row, and the chunks can be distributed across different shards. We can analyze the chunks collection to obtain this collection's chunk information on each shard, and then combine the chunk information on each shard into one InputSplit (the MongoInputSplit here). So we only need to modify the getSplits method of MongoInputFormat, adding analysis of the chunks collection to obtain the shard information; this gives us more than one split and therefore more than one map task. For each shard, its map task calls the mongos proxy service local to that shard, which achieves moving the computation rather than moving the data. A rough sketch of the idea follows.
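The sketch below is only an illustration of that idea, not the driver's implementation: it uses the plain MongoDB Java driver to read the chunks collection of the config database through a mongos and groups the chunk ranges (min/max, i.e. the start_row/end_row above) by the shard that owns them. A modified getSplits could then turn each chunk, or each per-shard group, into its own MongoInputSplit; the class name and the test.in namespace are just illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.Mongo;

    // Sketch only: group the chunk ranges of a sharded collection by the shard
    // that owns them, by reading the chunks collection of the config database.
    public class ChunkSplitSketch {

        public static Map<String, List<DBObject>> chunksByShard(String mongosHost, String namespace)
                throws Exception {
            final Mongo mongo = new Mongo(mongosHost, 27017); // connect via a mongos
            final DBCollection chunks = mongo.getDB("config").getCollection("chunks");

            final Map<String, List<DBObject>> byShard = new HashMap<String, List<DBObject>>();
            // Each chunk document records its namespace ("ns"), its owning shard ("shard"),
            // and the min/max bounds of the shard-key range it covers.
            for (final DBObject chunk : chunks.find(new BasicDBObject("ns", namespace))) {
                final String shard = (String) chunk.get("shard");
                if (!byShard.containsKey(shard)) {
                    byShard.put(shard, new ArrayList<DBObject>());
                }
                byShard.get(shard).add(chunk); // keep min/max for building a split later
            }
            mongo.close();
            return byShard;
        }

        public static void main(String[] args) throws Exception {
            // e.g. print how many chunks each shard owns for test.in
            for (Map.Entry<String, List<DBObject>> e : chunksByShard("localhost", "test.in").entrySet()) {
                System.out.println(e.getKey() + ": " + e.getValue().size() + " chunk(s)");
            }
        }
    }

Each per-shard group (or each individual chunk) would then become one split, and the record reader for that split would query the mongos local to the owning shard using the chunk's min/max range.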
These are just some of my ideas; friends who are interested are welcome to discuss them together.
I will post a concrete implementation later.


5. References

* https://github.com/mongodb/mongo-hadoop
* http://www.mongodb.org/display/DOCS/Java+Language+Center