Python Development MapReduce Series (i) WordCount Demo


MapReduce is the heart of the Hadoop elephant: data processing in Hadoop is built on the MapReduce programming model. A MapReduce job typically splits the input dataset into independent chunks that are processed in a completely parallel manner by map tasks. The framework sorts the output of the map tasks and then feeds the results to the reduce tasks. The job's input and output are usually stored in a file system. Our programming work therefore centers on two stages: the mapper and the reducer.
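To make the flow concrete, here is a minimal in-memory sketch of the map, sort, and reduce steps in plain Python. No Hadoop is involved, and the function names are illustrative only, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a streaming mapper."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key,
    which is what the framework's shuffle/sort step guarantees."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["hello world", "hello hadoop"]
sorted_pairs = sorted(map_phase(lines))   # stands in for the shuffle/sort
print(dict(reduce_phase(sorted_pairs)))   # {'hadoop': 1, 'hello': 2, 'world': 1}
```

The sort in the middle is what lets the reducer work with only one group in memory at a time; the real framework does the same thing at scale.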

Below we develop a MapReduce program from scratch and run it on a Hadoop cluster.
Mapper code, map.py:

import sys

# read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    word_list = line.strip().split()
    for word in word_list:
        print('\t'.join([word, '1']))

Reducer code, reduce.py:

import sys

cur_word = None
sum = 0

# input arrives sorted by word, so identical words are adjacent;
# accumulate counts until the word changes, then emit the total
for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) < 2:
        continue
    word = ss[0].strip()
    count = ss[1].strip()
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        print('\t'.join([cur_word, str(sum)]))
        cur_word = word
        sum = 0
    sum += int(count)

print('\t'.join([cur_word, str(sum)]))

Resource file src.txt (for testing; when running on a cluster, remember to upload it to HDFS):

Hello        ni hao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni haoao ni haoni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao ni hao
dad would get out his mandolin and play for the family
Dad loved to play the mandolin for his family he knew we enjoyed singing
I had to mature into a man and have children of my own before I realized how much he had sacrificed
I had to,mature into a man and,have children of my own before. I realized how much he had sacrificed

First, debug locally to check that the results are correct. Enter the following command:

cat src.txt | python map.py | sort -k 1 | python reduce.py
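As an independent sanity check on that pipeline, the same counts can be computed in a single process with `collections.Counter`. This is a sketch that assumes plain whitespace tokenization, the same splitting the mapper performs:

```python
import collections

def word_count(path):
    """Count whitespace-separated tokens in a file, mirroring the
    map -> sort -> reduce pipeline in one process."""
    counter = collections.Counter()
    with open(path) as f:
        for line in f:
            counter.update(line.strip().split())
    return counter

# Usage: print the counts in sorted order, like the reducer output.
# for word, n in sorted(word_count("src.txt").items()):
#     print(word, n, sep="\t")
```

If the Counter totals disagree with the pipeline output, the discrepancy is in the mapper or reducer logic rather than in Hadoop itself.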

The command line outputs:

A	2
and	2
and,have	1
ao	1
before	1
before.	1
Children	2
Dad	2
enjoyed	1
family	2
for	2
get	1
had	4
hao
haoao	1
haoni	3
have	1
He	3
Hello	1
his	2
what	2
I	3
into	2
knew	1
loved	1
man	2
mandolin	2
mature	1
much	2
my	2
ni	2
out	1
own	2
play	2
realized	2
sacrificed	2
singing	1
the	2
to	2
to,mature	1
We	1
would	1

Local debugging shows the code is OK. Next, run it on the cluster. For convenience, I wrote a script, run.sh, to save some manual work.

HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_FILE_PATH="/home/input/src.txt"
OUTPUT_PATH="/home/output"

$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python reduce.py" \
        -file ./map.py \
        -file ./reduce.py

A walkthrough of the script:

HADOOP_CMD: path to the hadoop binary.
STREAM_JAR_PATH: path to the Hadoop Streaming JAR package.
INPUT_FILE_PATH: input path of the resource file on the cluster (HDFS).
OUTPUT_PATH: result output path on the cluster. (Note: this directory must not already exist, so the script deletes it first. On the very first run the directory does not exist yet and the delete command will report an error; you can ignore it, or create the output directory manually.)

The hadoop jar invocation follows a fixed format: it specifies the input and output paths, the mapper and reducer commands, and uses -file to ship our map.py and reduce.py to the cluster, because the other nodes do not have these files.

Enter the following command to count the records output by the reduce phase:

cat src.txt | python map.py | sort -k 1 | python reduce.py | wc -l

Command line output: 43

Open master:50030 in a browser to view the details of the task.

Kind      % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map       100.00%      2           0         0         2          0        0/0
reduce    100.00%      1           0         0         1          0        0/0

Under the Map-Reduce Framework counters, you can see:

Counter                  Map   Reduce   Total
Reduce output records    0     43       43

This demonstrates that the whole process succeeded. The development of our first Hadoop program is complete.
