Introduction to MongoDB's Hadoop Driver
------------------------
1. Some Concepts

Hadoop is an Apache open source distributed computing framework that includes the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing model. MongoDB is a document-oriented distributed NoSQL database. This article introduces MongoDB's Hadoop driver, which uses MongoDB as the input source for MapReduce, so that MapReduce can be used to process and compute over data stored in MongoDB.
2. MongoDB's Hadoop Driver

The Hadoop driver is currently still a beta version and has not yet been used in real production environments.
You can download the latest driver package from https://github.com/mongodb/mongo-hadoop. Here are some notes on its dependencies:
* Hadoop: the latest version, 0.20.203, is currently recommended, or Cloudera CDH3
* MongoDB: preferably version 1.8+
* MongoDB Java driver: must be 2.5.3+
Some of its features:
* Provides a Hadoop input and output adaptation layer, so that data can be read from and written to MongoDB.
* Exposes a large number of configurable read/write parameters, which can be set in an XML configuration file: the fields to query, the query conditions, the sorting strategy, and so on (a configuration sketch follows the list below).

Features that are not currently supported:
* Reading source data from shards is not supported
* Data split operations are not supported
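To make the configurable parameters mentioned above more concrete, here is a minimal sketch of setting them programmatically on the Hadoop Configuration object. The mongo.input.* / mongo.output.* property names follow the driver's naming convention but are assumptions on my part, so verify them against the mongo-hadoop documentation before relying on them:

import org.apache.hadoop.conf.Configuration;

public class MongoJobConfig {
    // Sketch only: the property names are assumed from the driver's mongo.input.*/mongo.output.*
    // convention and should be checked against the mongo-hadoop documentation.
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost/test.in");     // input collection
        conf.set("mongo.output.uri", "mongodb://localhost/test.out");   // output collection
        conf.set("mongo.input.query", "{\"x\": {\"$exists\": true}}");  // query condition, as JSON
        conf.set("mongo.input.fields", "{\"x\": 1}");                   // fields to return
        conf.set("mongo.input.sort", "{\"x\": 1}");                     // sorting strategy
        return conf;
    }
}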
3. Code Analysis
Let's walk through the WordCount.java code from the driver's examples.
/* Test data inserted into the "in" collection of the MongoDB "test" database:
 *   db.in.insert({x: "eliot is here"})
 *   db.in.insert({x: "eliot is here"})
 *   db.in.insert({x: "who is here"})
 */
public class WordCount {
    private static final Log log = LogFactory.getLog(WordCount.class);

    // The map operation
    public static class TokenizerMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, BSONObject value, Context context) throws IOException, InterruptedException {
            System.out.println("key: " + key);
            System.out.println("value: " + value);
            // Split the "x" field into words
            final StringTokenizer itr = new StringTokenizer(value.get("x").toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // the key here is the word, the value is 1
            }
        }
    }

    // The reduce operation, used to calculate the frequency of each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0; // add up the values emitted for the same word
            for (final IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // the key is a single word, the value is its frequency
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // Input and output collections of the local MongoDB instance (default port 27017)
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost/test.in");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.out");
        System.out.println("conf: " + conf);
        final Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        // Define the Mapper, Combiner and Reducer classes
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // Define the output key/value types of the mapper and reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Define the InputFormat and OutputFormat
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
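After the job finishes, its results are written to the test.out collection. As a quick, informal check (the exact shape of the output documents depends on the driver version, so treat this as a sketch rather than documented behavior), the collection can simply be dumped with the MongoDB Java driver:

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

// Sketch: print whatever the job wrote to test.out; each document should correspond
// to one distinct word together with its count.
public class PrintWordCounts {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017); // local MongoDB, default port
        DBCollection out = mongo.getDB("test").getCollection("out");
        for (DBObject doc : out.find()) {
            System.out.println(doc);
        }
        mongo.close();
    }
}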
4. A brief introduction to the chunking mechanism

The driver performs no split operation across different shards; in other words, for data distributed across multiple shards, only a single map operation is generated.
Here I offer an idea for how splitting could be done; anyone interested is welcome to discuss it.
As we know, when a collection is chunked, the cluster maintains a config database that contains a collection named chunks. Each document in chunks records one chunk's start_row and end_row, and these chunks may be distributed across different shards. We can analyze this chunk information for the collection on each shard and combine the chunks belonging to one shard into a single InputSplit, which is the MongoInputSplit here. So we would only need to modify the getSplits method of MongoInputFormat, adding an analysis of the chunks collection to obtain per-shard information, to end up with more than one split and therefore more than one map operation. Each map for a given shard could then call the local mongos proxy service, moving the computation to the data rather than moving the data.
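As a very rough sketch of this idea (this is my own illustration, not the driver's actual API; the class name and helper below are hypothetical), the chunk metadata could be read from the config database and grouped by shard; a modified getSplits() could then turn each shard's chunk list into one split:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

// Illustration only: collect the chunks of a namespace (e.g. "test.in") grouped by shard.
// A modified MongoInputFormat.getSplits() could build one InputSplit per shard from this map,
// restrict each split's query to the chunks' min/max ranges, and point it at a mongos that is
// local to that shard, so the computation moves to the data rather than the other way round.
public class ShardChunkAnalyzer {
    public static Map<String, List<DBObject>> chunksByShard(String ns) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017); // connect to a mongos of the cluster
        DBCollection chunks = mongo.getDB("config").getCollection("chunks");
        Map<String, List<DBObject>> byShard = new HashMap<String, List<DBObject>>();
        for (DBObject chunk : chunks.find(new BasicDBObject("ns", ns))) {
            String shard = (String) chunk.get("shard"); // the shard currently holding this chunk
            if (!byShard.containsKey(shard)) {
                byShard.put(shard, new ArrayList<DBObject>());
            }
            byShard.get(shard).add(chunk); // "min" and "max" are the chunk's start_row/end_row
        }
        mongo.close();
        return byShard;
    }
}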
These are just some of my thoughts; friends who are interested are welcome to discuss them with me. I will post a concrete implementation later.
5. For reference

* https://github.com/mongodb/mongo-hadoop
* http://www.mongodb.org/display/DOCS/Java+Language+Center