MapReduce with MongoDB and Python

"MapReduce with MongoDB and Python", from Artificial Intelligence in Motion (author: Marcel Pinheiro Caraciolo)

Hi all,

In this post, I'll demonstrate map-reduce with MongoDB and server-side JavaScript. Since I've been working with this technology recently, I thought it would be useful to present a simple example of how it works and how to integrate it with Python. But what is MongoDB? For those who don't know what MongoDB is or the basics of how to use it, it is important to first explain a little about the NoSQL movement. Currently, there are several databases that break with the requirements of traditional relational database systems. The main characteristics shared by many NoSQL databases are:
    • SQL commands are not used as the query API (examples of APIs and formats used include JSON, BSON, etc.).
    • Atomic operations are not guaranteed.
    • They are distributed and horizontally scalable.
    • Schemas do not have to be predefined (schemaless).
    • Data is stored in non-tabular form (e.g., key-value, object, graph).
Although it is not so obvious, NoSQL is an abbreviation of "Not only SQL". The effort and development behind this approach have been making a lot of noise since 2009. You can find more information about it here and here. It is important to note that non-relational databases are not a complete replacement for relational databases. You need to know the pros and cons of each approach and choose the most appropriate one for the scenario you are facing.

MongoDB is one of the most popular NoSQL databases today and is what this article will focus on. It is a schemaless, document-oriented, high-performance, scalable database that uses key-value concepts to store documents as JSON-structured documents. It also includes some relational database features, such as indexing and dynamic queries. It is used in production today by more than 40 websites, including web services such as SourceForge, GitHub, Electronic Arts and The New York Times.

One of the features I like best in MongoDB is map-reduce. In the next section I will explain how it works, illustrated with a simple example using MongoDB and Python. If you want to install MongoDB or get more information, you can download it here and read a nice tutorial here.

Map-Reduce

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google to support parallel computation over large data sets spread across clusters of computers. MapReduce is now considered a popular model in distributed computing, inspired by the map and reduce functions commonly used in functional programming. It can be considered 'data-oriented': it processes data in two primary steps, map and reduce. On top of that, the query is executed on several data sources simultaneously. The process of mapping the request of the input reader to the data set is called 'map', and the process of aggregating the intermediate results of the mapping function into a consolidated result is called 'reduce'. The MapReduce paper, which gives more details, can be read here.

Today there are several implementations of MapReduce, such as Hadoop, Disco, Skynet, etc. The most famous is Hadoop, an open-source project implemented in Java. MongoDB also has an implementation similar in spirit to Hadoop, with all input coming from a collection and output going to a collection. As a practical definition, map-reduce in MongoDB is useful for batch manipulation of data and for aggregation operations. In real-world scenarios, where you would have used GROUP BY in SQL, map-reduce is the equivalent tool in MongoDB. Now that we have introduced map-reduce, let's see how to access MongoDB from Python.
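To make the two steps and the GROUP BY comparison concrete, here is a minimal pure-Python sketch of the model. It does not use MongoDB at all; the `map_reduce` helper and the `orders` data are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map over every record, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # map step: each record yields zero or more (key, value) pairs
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce step: each key's list of values collapses to a single result
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical data, roughly: SELECT customer, SUM(amount) FROM orders GROUP BY customer
orders = [
    {"customer": "ana", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "ana", "amount": 7},
]

totals = map_reduce(
    orders,
    map_fn=lambda order: [(order["customer"], order["amount"])],
    reduce_fn=lambda key, values: sum(values),
)
print(totals)
```

The grouping key plays the role of the GROUP BY column, and the reduce function plays the role of the aggregate.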

PyMongo

PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. It is easy to install and use. See here how to install and use it.
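As a rough idea of what basic PyMongo usage looks like, the sketch below stores sentences as documents with a `text` field. It assumes a MongoDB server reachable at `mongodb://localhost:27017`; the database name `test`, the collection name `texts` and both helper names are assumptions made for this example, not something prescribed by PyMongo.

```python
def sentence_documents(sentences):
    """Wrap each non-empty sentence in a document with a 'text' field."""
    return [{"text": s.strip()} for s in sentences if s.strip()]


def store_sentences(sentences, uri="mongodb://localhost:27017"):
    """Replace the contents of test.texts with the given sentences (assumed names)."""
    # Imported here so sentence_documents() can be used without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    collection = client["test"]["texts"]
    collection.delete_many({})  # remove any existing data
    # insert_many() rejects an empty list, so pass at least one sentence
    result = collection.insert_many(sentence_documents(sentences))
    return len(result.inserted_ids)


if __name__ == "__main__":
    print(sentence_documents(["the quick brown fox", "", "how now brown cow\n"]))
```

Only `store_sentences` talks to the server; `sentence_documents` is plain Python and can be tried on its own.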

Map-reduce in action

Now let's see map-reduce in action. To demonstrate it, I've decided to use one of the classic problems solved with map-reduce: counting word frequencies across a series of sentences. It's a simple problem and well suited to a map-reduce query. I've decided to use two samples for this task. The first is a list of simple sentences that illustrates how map-reduce works. The second is Obama's 2009 speech at his election as president; it will be used to show a real example, demonstrated by the code. Consider the diagram below, which helps show how the map-reduce could be distributed. It shows four sentences that are split into words and grouped by the map function, and afterwards reduced independently (aggregation) by the reduce function. This is interesting because it means our query can be distributed across separate nodes (computers), resulting in faster processing of the word frequency count. It's also important to note that the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
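The distribution idea can be simulated in plain Python. In this sketch, four made-up sentences (not the ones from the original diagram) are split across two pretend nodes; each node counts its own words, and the partial counts are then merged. The function names are hypothetical.

```python
from collections import Counter


def count_on_node(sentences):
    """Map and locally reduce: count the words of the sentences held by one node."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            counts[word] += 1  # each word is emitted with a count of one
    return counts


def merge_partials(partials):
    """Final reduce: add up the partial counts coming from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts instead of replacing them
    return total


# Made-up sample sentences for illustration
sentences = [
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
    "the cow jumped",
]

nodes = [sentences[:2], sentences[2:]]  # two pretend nodes
distributed = merge_partials(count_on_node(node) for node in nodes)
single = count_on_node(sentences)
print(distributed == single, distributed["the"])
```

Because the partial counts can be merged in any grouping, the split across nodes does not change the final result.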

Map-reduce distribution

 

Some notes you need to know before developing your map and reduce functions:
    • The map-reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:
      for all k, vals: reduce(k, [reduce(k, vals)]) == reduce(k, vals)
    • Currently, the return value from a reduce function cannot be an array (it's typically an object or a number).
    • If you need to perform an operation only once, use a finalize function.
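The idempotence condition in the first note can be checked mechanically. Below is a small Python sketch with two hypothetical reducers: one sums its values and satisfies the condition, the other counts its values and breaks as soon as it is fed its own output.

```python
def good_reduce(key, values):
    """Sums the values, so re-reducing its own output changes nothing."""
    return sum(values)


def bad_reduce(key, values):
    """Counts the values, which breaks when it receives a previous result."""
    return len(values)


def is_idempotent(reduce_fn, key, values):
    """Check reduce(k, [reduce(k, vals)]) == reduce(k, vals) for one sample input."""
    once = reduce_fn(key, values)
    return reduce_fn(key, [once]) == once


print(is_idempotent(good_reduce, "the", [1, 1, 1]))
print(is_idempotent(bad_reduce, "the", [1, 1, 1]))
```

A check like this only covers the inputs it is given, so it is a sanity test rather than a proof.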
Let's go to the code now. For this task, I'll use the PyMongo framework, which has support for map-reduce. As I said earlier, the input text will be Obama's speech, which, by the way, has many repeated words. Take a look at the tag cloud of Obama's speech (a cloud of words in which each word's font size is based on its frequency).

 

Obama's speech in 2009

 

 

To write our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that are evaluated and run on the server. Here is our map function.

 

 

Wordmap.js

 

As you can see, the 'this' variable refers to the context from which the function is called. That is, MongoDB calls the map function on each document in the collection we are querying, with 'this' pointing to that document, so a document field such as 'text' can be accessed by calling this.text. The map function doesn't return a list; instead it calls an emit function, which it expects to be defined. The parameters of this function (key, value) will be grouped with the intermediate results from other map evaluations that have the same key (key, [value1, value2]) and passed to the reduce function that we will define now.
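The real map function runs as JavaScript on the server. As an illustration only, the Python analogue below mimics that behavior: `doc` stands in for `this`, and `emit` is supplied by a small hypothetical driver that groups the emitted values by key.

```python
import re
from collections import defaultdict


def word_map(doc, emit):
    """Python stand-in for the JavaScript map function; `doc` plays the role of `this`."""
    for word in re.findall(r"\w+", doc["text"]):
        emit(word, {"count": 1})  # emit every word with a count of one


def run_map(documents):
    """Call word_map on each document and group the emitted values by key."""
    grouped = defaultdict(list)

    def emit(key, value):
        grouped[key].append(value)

    for doc in documents:
        word_map(doc, emit)
    return dict(grouped)


grouped = run_map([{"text": "the cat"}, {"text": "the dog"}])
print(grouped)
```

Each key ends up paired with the list of values emitted for it, which is the (key, [value1, value2]) shape handed to reduce.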

 

Wordreduce.js

 

 

The reduce function must reduce a list of values of a chosen type to a single value of that same type; it must be associative and commutative, so that it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the PyMongo client, passing the map and reduce functions to the server.
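Here is a Python analogue of such a reduce function, again only as a sketch: it takes a list of `{"count": n}` values and returns a single value of the same shape, so its output can be fed back in alongside other values. The name `word_reduce` and the sample key are assumptions.

```python
def word_reduce(key, values):
    """Collapse a list of {'count': n} values into one value of the same shape."""
    return {"count": sum(value["count"] for value in values)}


values = [{"count": 1}] * 5

# Reduce everything in one call
all_at_once = word_reduce("the", values)

# Reduce two chunks separately, then reduce the partial results
in_two_steps = word_reduce(
    "the", [word_reduce("the", values[:2]), word_reduce("the", values[2:])]
)

# A different split, combined in the opposite order
regrouped = word_reduce(
    "the", [word_reduce("the", values[3:]), word_reduce("the", values[:3])]
)

print(all_at_once, in_two_steps, regrouped)
```

All three calls agree, which is what lets the engine group and re-reduce intermediate results however it likes.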

 

 

Mapreduce.py
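What follows is a minimal sketch of how such a script could look with a current PyMongo, not the original listing. It assumes a local server, a database named `test` and a collection named `texts`; the JavaScript bodies and the `run_word_count` helper are written for this example. PyMongo 4 removed the `Collection.map_reduce` helper, so the sketch sends the `mapReduce` command through `db.command`; MongoDB itself has deprecated map-reduce in favor of the aggregation pipeline since version 5.0, so treat this as illustrative.

```python
# JavaScript sources sent to the server (raw strings keep the regex backslash intact)
WORD_MAP_JS = r"""
function () {
    // 'this' is the document being mapped
    var words = this.text.match(/\w+/g);
    if (words === null) { return; }
    for (var i = 0; i < words.length; i++) {
        emit(words[i], {count: 1});
    }
}
"""

WORD_REDUCE_JS = r"""
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return {count: total};
}
"""


def run_word_count(lines, uri="mongodb://localhost:27017"):
    """Load lines into test.texts (assumed names) and return {word: count}."""
    # Imported lazily so the module can be loaded without PyMongo installed.
    from bson.code import Code
    from pymongo import MongoClient

    db = MongoClient(uri)["test"]
    db["texts"].delete_many({})  # remove any existing data
    db["texts"].insert_many([{"text": line} for line in lines if line.strip()])

    reply = db.command(
        "mapReduce", "texts",
        map=Code(WORD_MAP_JS),
        reduce=Code(WORD_REDUCE_JS),
        out={"inline": 1},  # return results in the reply instead of a collection
    )
    # JavaScript numbers come back as floats, hence the int() conversion
    return {row["_id"]: int(row["value"]["count"]) for row in reply["results"]}
```

Calling `run_word_count(open("speech.txt").readlines())`, with `speech.txt` as a placeholder file name, would return a dictionary mapping each word to its frequency, provided the server still accepts the `mapReduce` command.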

 

 

Let's see the result now:

 

And it works!

With map-reduce, the word frequency count is extremely efficient and performs even better in a distributed environment. This brief experiment shows the potential of the map-reduce model for distributed computing, especially on large data sets.

All code used in this article can be downloaded here.

My next posts will be about performance evaluation of machine learning techniques. Stay tuned!

Marcel Caraciolo

References

    • http://nosql.mypopescu.com/post/394779847/mongodb-tutorial-mapreduce
    • http://fredzvt.wordpress.com/2010/04/24/no-sql-mongodb-from-introduction-to-high-level-usage-in-csharp-with-norm/

 

 

 
