Writing Hadoop MapReduce programs with PHP and Shell

Tags: stdin, hadoop, mapreduce, hadoop fs

Hadoop Streaming enables any executable program that supports standard I/O (stdin, stdout) to serve as a Hadoop mapper or reducer. For example:

The code is as follows:

hadoop jar hadoop-streaming.jar -input some_input_dir_or_file -output some_output_dir -mapper /bin/cat -reducer /usr/bin/wc

In this case, Unix/Linux's own cat and wc tools serve as the mapper and reducer. Isn't that rather magical?

If you are used to writing in a dynamic language, then writing MapReduce in that language is no different from your usual programming; Hadoop is simply the framework that runs it. Below I will show how to implement the word-counter MapReduce in PHP.

First, find the streaming jar

There is no hadoop-streaming.jar in the Hadoop root directory; because streaming is a contrib module, you have to look under contrib. Taking hadoop-0.20.2 as an example, it is here:

The code is as follows:
$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar
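If you run a different Hadoop release, one way to locate the jar is simply to search for it (the exact filename varies by version):

find $HADOOP_HOME -name "hadoop-*streaming*.jar"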

Second, write the mapper

Create a new wc_mapper.php and write the following code:

The code is as follows:

#!/usr/bin/php
<?php
// Read text from stdin, split each line into words,
// and remember every occurrence so it can be emitted as "word<TAB>1".
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096))
{
    // Split on non-word characters, dropping empty pieces
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
        $results[] = $word;
}
fclose($in);
foreach ($results as $key => $value)
{
    print "$value\t1\n";
}

The gist of this code is to find the words in each line of input text and to print them in the form:

Hello 1
World 1

This PHP is no different from what you usually write, right? It may feel a little unfamiliar in just two places:

PHP as an executable program

The first line, "#!/usr/bin/php", tells Linux to use the /usr/bin/php program as the interpreter for the code that follows. Anyone who has written a Linux shell script will recognize this style; it is the first line of every shell script: #!/bin/bash, #!/usr/bin/python, and so on.

With this line, once you have saved the file you can execute it directly, just like cat or grep: ./wc_mapper.php
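Note that the file must also have its executable bit set before this works, so run chmod once after saving:

chmod +x wc_mapper.php
./wc_mapper.php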

Receiving input via stdin

PHP supports several ways of receiving parameters. The one we know best is fetching parameters passed over the web from the superglobals $_GET and $_POST; next is fetching command-line arguments from $_SERVER['argv']. Here, we use standard input, stdin.

In practice, it works like this:

Type ./wc_mapper.php at a Linux console.

wc_mapper.php runs, and the console waits for keyboard input.

The user types text at the keyboard.

The user presses Ctrl + D to end the input; wc_mapper.php then executes the real business logic and prints the results.

So where is stdout? print itself already writes to stdout, no different from the web programs and CLI scripts we have written before.
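You can watch stdin and stdout work together by piping a line of text straight into the mapper:

echo "hello world hello" | ./wc_mapper.php

which prints:

hello 1
world 1
hello 1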

Third, write the reducer

Create a new wc_reducer.php and write the following code:

The code is as follows:

#!/usr/bin/php
<?php
// Read "word<TAB>count" lines from stdin and sum the counts per word.
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096))
{
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    if (!isset($results[$key]))
        $results[$key] = 0;
    $results[$key] += $value;
}
fclose($in);
ksort($results); // sort the words alphabetically before printing
foreach ($results as $key => $value)
{
    print "$key\t$value\n";
}

The gist of this code is to count how many times each word appears and to print the result in the form:

Hello 2
World 1
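Before involving Hadoop at all, you can simulate the whole job with an ordinary Unix pipeline, with sort standing in for the shuffle-and-sort phase that Hadoop performs between map and reduce. A local sanity check, assuming some *.txt files in the current directory:

cat *.txt | ./wc_mapper.php | sort | ./wc_reducer.php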

Fourth, run it with Hadoop

Upload the sample text to be counted

The code is as follows:

hadoop fs -put *.txt /tmp/input
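You can verify that the upload worked before launching the job:

hadoop fs -ls /tmp/input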

Run the PHP MapReduce program via streaming

The code is as follows:
hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper <absolute path of wc_mapper.php> -reducer <absolute path of wc_reducer.php>

Note:

The input and output directories are paths on HDFS.

The mapper and reducer are paths on the local machine. Be sure to write absolute paths, not relative ones, otherwise Hadoop will report an error saying it cannot find the MapReduce program.
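Putting this together, a concrete invocation might look like the following (the /home/hadoop/ paths are hypothetical; substitute wherever your scripts actually live):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
-input /tmp/input \
-output /tmp/output \
-mapper /home/hadoop/wc_mapper.php \
-reducer /home/hadoop/wc_reducer.php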

View Results

The code is as follows:
hadoop fs -cat /tmp/output/part-00000

Shell version of the Hadoop MapReduce program

The code is as follows:

#!/bin/bash -

# Load the configuration file
source './config.sh'

# Process command-line arguments
while getopts "d:" arg
do
    case $arg in
        d)
            date=$OPTARG
            ;;
        ?)
            echo "Unknown argument"
            exit 1
            ;;
    esac
done

# The default processing date is yesterday
default_date=`date -v-1d +%Y-%m-%d`

# Final processing date; exit if the date format is not valid
date=${date:-${default_date}}
if ! [[ "$date" =~ [12][0-9]{3}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01]) ]]
then
    echo "Invalid date (YYYY-MM-DD): $date"
    exit 1
fi

# Files to be processed (the 8th field of 'hadoop fs -ls' output is the path)
log_files=$(${hadoop_home}bin/hadoop fs -ls ${log_file_dir_in_hdfs} | awk '{print $8}' | grep $date)

# If the number of files to be processed is zero, exit
log_files_amount=$(($(echo "$log_files" | wc -l) + 0))
if [ $log_files_amount -lt 1 ]
then
    echo "No log files found"
    exit 0
fi

# Build the input file list
for f in $log_files
do
    input_files_list="${input_files_list} $f"
done

function map_reduce () {
    if ${hadoop_home}bin/hadoop jar ${streaming_jar_path} -input ${input_files_list} -output ${mapreduce_output_dir}${date}/${1}/ -mapper "${mapper} ${1}" -reducer "${reducer}" -file "${mapper}"
    then
        echo "streaming job done!"
    else
        exit 1
    fi
}

# Run the job once for each bucket
for bucket in ${bucket_list[@]}
do
    map_reduce $bucket
done
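The script reads its settings from ./config.sh, which is not shown above. Here is a minimal sketch of what such a file might define; every value below is a hypothetical placeholder to adapt to your own cluster:

#!/bin/bash -

# Hypothetical config.sh -- every value is a placeholder, not from the original article
hadoop_home="/usr/local/hadoop-0.20.2/"      # note the trailing slash: the script calls ${hadoop_home}bin/hadoop
streaming_jar_path="${hadoop_home}contrib/streaming/hadoop-0.20.2-streaming.jar"
log_file_dir_in_hdfs="/log/raw/"             # HDFS directory holding the raw log files
mapreduce_output_dir="/log/result/"          # output root; the script appends ${date}/${bucket}/
mapper="/home/hadoop/wc_mapper.php"          # absolute local path of the mapper
reducer="/home/hadoop/wc_reducer.php"        # absolute local path of the reducer
bucket_list=("bucket1" "bucket2")            # buckets that the script loops over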
