Map-reduce Getting Started

Recently, while rewriting Mahout source code, I felt that my Map-reduce skills were not deep enough, so I decided to study the topic systematically.

Map-reduce is really a programming paradigm, and the word-frequency (WordCount) program is the easiest way to explain the idea behind it.

Given a file with the following contents, count how often each word occurs.

Hello Angela

I Love You Angela

How are You Angela

Map (each word is emitted on its own line as a key,value pair)

hello,1

angela,1

i,1

love,1

you,1

angela,1

how,1

are,1

you,1

angela,1
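The map step above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Hadoop's API; the function name `map_words` and the decision to lowercase each word (to match the listing above) are assumptions made for the example.

```python
def map_words(line):
    """Emit a (word, 1) pair for every word in one input line."""
    # Lowercasing mirrors the listing above, where "Angela" becomes "angela".
    return [(word.lower(), 1) for word in line.split()]


lines = ["Hello Angela", "I Love You Angela", "How are You Angela"]

# Apply the map function to each line and flatten the results into one list.
pairs = [pair for line in lines for pair in map_words(line)]

for word, count in pairs:
    print(f"{word},{count}")
```

Note that the map function looks at one line at a time and keeps no state between lines, which is what lets many map tasks run in parallel.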

Reduce (records with the same key are grouped together)

hello,<1>

angela,<1,1,1>

i,<1>

love,<1>

you,<1,1>

how,<1>

are,<1>
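The grouping shown above can be sketched as follows. This is only an in-memory approximation with a hypothetical helper `group_by_key`; a real Hadoop job also sorts the keys before handing them to reduce, whereas this sketch keeps the order in which keys were first seen, to match the listing.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of all pairs that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # A plain dict keeps keys in first-seen order.
    return dict(groups)


pairs = [
    ("hello", 1), ("angela", 1),
    ("i", 1), ("love", 1), ("you", 1), ("angela", 1),
    ("how", 1), ("are", 1), ("you", 1), ("angela", 1),
]

grouped = group_by_key(pairs)
for key, values in grouped.items():
    print(key, values)
```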

Reducer output after processing

hello,1

angela,3

i,1

love,1

you,2

how,1

are,1
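The reduce step that produces these totals can be sketched like this. The function name `reduce_counts` is hypothetical, and the grouped input is written out by hand so the example stands on its own.

```python
def reduce_counts(key, values):
    """Sum the list of counts collected for one word."""
    return key, sum(values)


grouped = {
    "hello": [1], "angela": [1, 1, 1], "i": [1], "love": [1],
    "you": [1, 1], "how": [1], "are": [1],
}

# Call the reduce function once per key.
results = [reduce_counts(key, values) for key, values in grouped.items()]

for word, total in results:
    print(f"{word},{total}")
```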

As the example shows, both the map phase and the reduce phase take and produce data as key,value pairs. The key marks which records must be processed together. In the word-frequency example, the goal is to bring all records for the same word together and then count how many times that word appears.

Now that the idea behind Map-reduce is clear, let's look at how distributed Map-reduce works.

Hadoop (in its classic 1.x form) has two types of nodes: one JobTracker and a set of TaskTrackers.

The JobTracker schedules tasks on the TaskTrackers. If a task fails on one TaskTracker, the JobTracker reschedules it on another TaskTracker node.
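The rescheduling behaviour can be illustrated with a toy simulation. Everything here is hypothetical (the function names, the tracker names, and the way a failure is signalled); real Hadoop scheduling is considerably more involved, so treat this only as a sketch of the "fail on one node, rerun on another" idea.

```python
def run_with_retry(task, trackers):
    """Try the task on each tracker in turn until one succeeds."""
    for tracker in trackers:
        try:
            return tracker, task(tracker)
        except RuntimeError as err:
            # The task failed on this node, so move on to the next tracker.
            print(f"{tracker} failed ({err}); rescheduling")
    raise RuntimeError("task failed on every tracker")


def count_words_task(tracker):
    """Hypothetical task that always fails on the node named 'tracker-1'."""
    if tracker == "tracker-1":
        raise RuntimeError("simulated node failure")
    return len("Hello Angela".split())


winner, result = run_with_retry(count_words_task, ["tracker-1", "tracker-2"])
print(winner, result)
```

Rerunning is safe here because the task only reads its input and returns a value, so executing it twice gives the same answer.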

Hadoop divides the input data into splits, each of which is a large chunk of data. Each split is assigned to one map task, which processes the rows of that split in order.

In general, a reasonable split size is the size of an HDFS block, which defaults to 64 MB (128 MB from Hadoop 2.x on). This allows the map task to run on the node where its input data is stored, reducing the amount of data sent over the network.
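As a rough piece of arithmetic, the number of map tasks follows from the file size and the split size. The sketch below assumes one split per 64 MB block and a hypothetical helper `count_splits`; it is a simplified model and ignores the finer rules Hadoop applies when it actually computes splits.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above


def count_splits(file_size, split_size=BLOCK_SIZE):
    """Return how many splits (and so map tasks) a file of file_size bytes yields."""
    if file_size <= 0:
        return 0
    # Any leftover bytes still need a split of their own in this simple model.
    return math.ceil(file_size / split_size)


one_gb = 1024 * 1024 * 1024
print(count_splits(one_gb))      # 16
print(count_splits(one_gb + 1))  # 17
```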

If there are multiple reduce tasks, each map task partitions its output, and all data that falls into the same partition is handled by one reduce task. Of course, all records for a given key must end up in the same partition.
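One common way to guarantee that a key always lands in the same partition is to hash the key and take the result modulo the number of reduce tasks, which is also the principle behind Hadoop's default partitioner. The sketch below uses `zlib.crc32` as a stand-in hash; that choice and the name `partition_for` are assumptions for illustration.

```python
import zlib


def partition_for(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


pairs = [("hello", 1), ("angela", 1), ("you", 1), ("angela", 1), ("you", 1), ("angela", 1)]
num_reducers = 2

# Route every pair to the partition chosen by its key.
partitions = {index: [] for index in range(num_reducers)}
for key, value in pairs:
    partitions[partition_for(key, num_reducers)].append((key, value))

for index, records in partitions.items():
    print(index, records)
```

Because the partition depends only on the key, every map task sends a given word to the same reduce task without any coordination between the map tasks.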

Before the map output is sent to reduce, there can be a combine step, in effect a local reduce, that merges data locally to cut down the amount of data transferred. In many cases the combiner and the reducer can be the same class.
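For word counting, the same summing logic can serve as both combiner and reducer, because addition gives the same total no matter how the counts are grouped. The sketch below assumes two hypothetical map tasks and a helper named `sum_counts`; it is an in-memory illustration rather than Hadoop code.

```python
from collections import Counter


def sum_counts(pairs):
    """Add up the counts per word; usable as both combiner and reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


# Output of two hypothetical map tasks, one per input split.
map_output_a = [("hello", 1), ("angela", 1), ("i", 1), ("love", 1), ("you", 1), ("angela", 1)]
map_output_b = [("how", 1), ("are", 1), ("you", 1), ("angela", 1)]

# Combine locally on each map node, so fewer pairs cross the network.
combined_a = sum_counts(map_output_a)
combined_b = sum_counts(map_output_b)

# The reducer runs the same function over the already combined pairs.
final = dict(sum_counts(combined_a + combined_b))
print(len(map_output_a), "->", len(combined_a))
print(final)
```

This reuse would not be correct for an operation such as an average, where combining partial results needs extra information (for example a sum and a count).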

Author of this article: linger

This article link: http://blog.csdn.net/lingerlanlan/article/details/46713733


Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
