Hadoop Streaming lets any executable program that supports standard IO (stdin, stdout) act as the mapper or reducer of a Hadoop job. For example:
hadoop jar hadoop-streaming.jar -input some_input_dir_or_file -output some_output_dir -mapper /bin/cat -reducer /usr/bin/wc
In this case, almost magically, Unix/Linux's own cat and wc tools serve as the mapper and reducer.
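Since streaming simply wires the two programs together with pipes, you can preview what that job computes by running the same pair locally. Here cat passes the input through unchanged (an identity mapper) and wc counts lines, words, and bytes:

```shell
# cat acts as the identity mapper; wc "reduces" the stream
# to line, word, and byte counts
printf 'one two\nthree\n' | cat | wc
```
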
If you're used to writing MapReduce in a dynamic language, this is no different from your usual programming; Hadoop is just the framework that runs it. Below is a demonstration of word-count MapReduce written in PHP.
First, find the streaming jar
There is no hadoop-streaming.jar in the Hadoop root directory, because streaming is a contrib module, so look under contrib. Taking hadoop-0.20.2 as an example, it is here:
$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar
Second, write the mapper
Create a new wc_mapper.php with the following code:
#!/usr/bin/php
<?php
// Read lines from stdin, split each line into words,
// and emit "word<TAB>1" for every word found.
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096)) {
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $results[] = $word;
    }
}
fclose($in);
foreach ($results as $key => $value) {
    print "$value\t1\n";
}
The gist of this code: find the words in each line of input text and output them in the form
Hello 1
World 1
It's basically no different from the PHP you've written before, right? Perhaps two things look unfamiliar:
PHP as an executable program
The first line, "#!/usr/bin/php", tells Linux to use the /usr/bin/php program as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this notation; it is the first line of every shell script: #!/bin/bash, #!/usr/bin/python, and so on.
With this line, after saving the file and marking it executable (chmod +x wc_mapper.php), you can run wc_mapper.php directly, just like the cat or grep commands: ./wc_mapper.php
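The shebang mechanism works the same for any interpreter. A minimal stand-alone illustration (using /tmp/shebang_demo.sh as a throwaway path for this example):

```shell
# Create a tiny script whose first line names its interpreter,
# mark it executable, then run it directly like cat or grep:
printf '#!/bin/sh\necho "running via shebang"\n' > /tmp/shebang_demo.sh
chmod +x /tmp/shebang_demo.sh
/tmp/shebang_demo.sh
```
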
Using stdin to receive input
PHP supports receiving parameters in several ways. The most familiar is taking web-passed parameters from the $_GET and $_POST superglobals; next is taking command-line arguments from $_SERVER['argv']. Here, we use standard input, stdin.
In practice it works like this:
In a Linux console, enter ./wc_mapper.php
wc_mapper.php runs, and the console waits for keyboard input
The user types text on the keyboard
The user presses Ctrl+D to end the input; wc_mapper.php then executes the real business logic and prints its results
So where's stdout? print itself writes to stdout, no different from the way we've always written web programs and CLI scripts.
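To see the stdin/stdout contract without PHP at all: any filter that reads stdin and writes stdout will do. Here is a one-line awk stand-in for wc_mapper.php (splitting on whitespace rather than the PHP version's \W, a simplification for illustration):

```shell
# Emit "word<TAB>1" for every whitespace-separated word on stdin,
# mirroring what wc_mapper.php prints
printf 'hello world\nhello\n' | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}'
```

Feeding it through a pipe like this is exactly what Hadoop does when it runs the mapper.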
Third, write the reducer
Create a new wc_reducer.php with the following code:
#!/usr/bin/php
<?php
// Read "word<TAB>count" pairs from stdin, sum the counts per word,
// and print the sorted totals.
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096)) {
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    $results[$key] += $value;
}
fclose($in);
ksort($results);
foreach ($results as $key => $value) {
    print "$key\t$value\n";
}
The effect of this code is to count how many times each word appears and to output the totals in the form
Hello 2
World 1
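Hadoop streaming essentially runs mapper | sort | reducer: the sort step groups identical keys together before the reducer sees them. Before touching the cluster, the whole pair can be simulated end to end with standard tools; here is an awk sketch of both halves (whitespace splitting instead of \W, a simplification):

```shell
# mapper: emit "word<TAB>1" per word; sort: group identical keys;
# reducer: sum the counts per key, like wc_reducer.php
printf 'hello world\nhello\n' \
  | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' \
  | sort \
  | awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}' \
  | sort
```

If you have saved the two PHP scripts, the same shape works with them directly: ./wc_mapper.php piped through sort into ./wc_reducer.php.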
Fourth, run it with Hadoop
Upload the sample text to be counted:
hadoop fs -put *.txt /tmp/input
Run the PHP MapReduce program in streaming mode:
hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper <absolute path of wc_mapper.php> -reducer <absolute path of wc_reducer.php>
Attention:
The input and output directories are paths on HDFS.
The mapper and reducer are paths on the local machine. Be sure to write absolute paths, not relative paths, lest Hadoop report an error that it cannot find the MapReduce program.
View the results:
hadoop fs -cat /tmp/output/part-00000
Fifth, a shell version of the Hadoop MapReduce program
#!/bin/bash -

# Load the configuration file
source './config.sh'

# Process command-line arguments
while getopts "d:" arg
do
    case $arg in
        d)
            date=$OPTARG
            ;;
        ?)
            echo "Unknown argument"
            exit 1
            ;;
    esac
done

# The default processing date is yesterday
default_date=`date -v-1d +%Y-%m-%d`

# Final processing date; exit if the date is not well formed
date=${date:-${default_date}}
if ! [[ "$date" =~ [12][0-9]{3}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01]) ]]
then
    echo "Invalid date (YYYY-MM-DD): $date"
    exit 1
fi

# Files to process
log_files=$(${hadoop_home}bin/hadoop fs -ls ${log_file_dir_in_hdfs} | awk '{print $8}' | grep $date)

# If the number of files to process is zero, exit
log_files_amount=$(($(echo "$log_files" | wc -l) + 0))
if [ $log_files_amount -lt 1 ]
then
    echo "No log files found"
    exit 0
fi

# Build the input file list
for f in $log_files
do
    input_files_list="${input_files_list} $f"
done

function map_reduce() {
    if ${hadoop_home}bin/hadoop jar ${streaming_jar_path} -input ${input_files_list} -output ${mapreduce_output_dir}${date}/${1}/ -mapper "${mapper} ${1}" -reducer "${reducer}" -file "${mapper}"
    then
        echo "Streaming job done!"
    else
        exit 1
    fi
}

# Run once for each bucket
for bucket in ${bucket_list[@]}
do
    map_reduce $bucket
done
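The getopts loop at the top of the script can be exercised in isolation. A minimal sketch showing how -d fills the date variable (parse_date is a hypothetical helper name, not part of the script above):

```shell
# Parse -d from an argument list, as the script's option loop does
parse_date() {
  local OPTIND=1 date=""
  while getopts "d:" arg "$@"; do
    case $arg in
      d) date=$OPTARG ;;
    esac
  done
  echo "$date"
}
parse_date -d 2013-01-31   # prints 2013-01-31
```
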