Using PHP to write a mapreduce program for Hadoop

Source: Internet
Author: User
Tags hadoop fs

Hadoop itself is written in Java, so when it comes to writing MapReduce for Hadoop, people naturally think of Java.

But Hadoop ships with a contrib module called Hadoop Streaming, a small tool that adds streaming support to Hadoop: any executable program that supports standard I/O (stdin, stdout) can serve as a Hadoop mapper or reducer.

For example:

hadoop jar hadoop-streaming.jar -input some_input_dir_or_file -output some_output_dir -mapper /bin/cat -reducer /usr/bin/wc

In this case, Unix/Linux's own cat and wc tools act as the mapper and reducer. Magical, isn't it?
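To build some intuition, the job can be imitated on a single machine. The sketch below is only a rough approximation: it assumes that streaming feeds input lines to the mapper, sorts the mapper's output, and hands the sorted lines to one reducer. simulate_streaming is a hypothetical helper name, and cat and wc are looked up on PATH rather than at the fixed paths used in the command above.

```shell
# simulate_streaming is a hypothetical helper, not part of Hadoop.
# It chains: stdin -> mapper -> sort (stand-in for the shuffle) -> reducer.
simulate_streaming() {
    mapper="$1"
    reducer="$2"
    "$mapper" | LC_ALL=C sort | "$reducer"
}

# Three sample lines go through cat (mapper) and wc (reducer).
# wc prints the line, word, and byte counts of everything it received.
printf 'hello world\nhello hadoop\nbye\n' | simulate_streaming cat wc
```

Any two executables that read stdin and write stdout can be passed in place of cat and wc, which is the property the rest of this article relies on.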

If you are used to a dynamic language, you can write MapReduce in it just as you would write any other program; Hadoop is merely the framework that runs it. The following demonstrates a word-count MapReduce job written in PHP.

Find the streaming jar

There is no hadoop-streaming.jar in the Hadoop root directory. Because streaming is a contrib module, the jar lives under contrib. Taking hadoop-0.20.2 as an example, it is here:

$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar

Write mapper

Create a new wc_mapper.php and write the following code:

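#!/usr/bin/php
<?php
// Read text from standard input.
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false) {
    // Split on non-word characters and drop empty pieces.
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $results[] = $word;
    }
}
fclose($in);
// Emit one "word<TAB>1" line per word.
foreach ($results as $key => $value) {
    print "$value\t1\n";
}
?>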

Roughly speaking, this code extracts the words from each line of input text and outputs them in the following form, with a tab between the word and the count:

Hello 1

World 1
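The same input/output contract can be sketched with standard Unix tools, which makes the expected format easy to check by eye. This is a stand-in for illustration, not the PHP mapper itself: map_words is a hypothetical name, and the sketch assumes that words consist of letters, digits, and underscores.

```shell
# map_words: hypothetical stand-in that emits "word<TAB>1" for each word on stdin.
map_words() {
    # Turn every run of non-word characters into a single newline,
    # then append a tab and the count 1 to each non-empty line.
    tr -cs '[:alnum:]_' '\n' | awk 'NF { print $0 "\t1" }'
}

# Prints one tab-separated line for Hello and one for World.
printf 'Hello World\n' | map_words
```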

This is basically no different from the PHP you have written before, right? Two things may look a little unfamiliar:

PHP as an executable program

The first line, "#!/usr/bin/php", tells Linux to use the /usr/bin/php program as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this notation; it is the first line of every such script, for example #!/bin/bash or #!/usr/bin/python.

With this line in place, once the file is saved and marked executable (chmod +x wc_mapper.php), you can run wc_mapper.php directly, just like cat or grep: ./wc_mapper.php
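The shebang line only takes effect when the file also carries execute permission. Here is a minimal sketch of the whole mechanism, using a throwaway /bin/sh script as a stand-in so that it runs even where PHP is not installed. The file name hello_mapper and the variable demo_dir are made up for illustration.

```shell
# Create a scratch directory and a tiny script whose first line is a shebang.
demo_dir=$(mktemp -d)
cat > "$demo_dir/hello_mapper" <<'EOF'
#!/bin/sh
# Stand-in for wc_mapper.php: copy stdin to stdout unchanged.
cat
EOF

# Without the execute bit the file cannot be run directly by path.
chmod +x "$demo_dir/hello_mapper"

# Now it can be invoked like any other command.
echo 'hello world' | "$demo_dir/hello_mapper"
```

The PHP scripts in this article need the same treatment before ./wc_mapper.php will run.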

Using stdin to receive input

PHP can receive parameters in several ways. The most familiar is the $_GET and $_POST superglobals, which carry parameters passed through the web. Next is $_SERVER['argv'], which carries parameters passed on the command line. Here we use standard input, stdin.

In practice it behaves like this:

In the Linux console, enter ./wc_mapper.php

wc_mapper.php starts, and the console waits for keyboard input

The user types text on the keyboard

The user presses Ctrl+D to end the input; wc_mapper.php then runs the real business logic and prints the results

So where is stdout? print itself writes to stdout, exactly as it did when we wrote web programs and CLI scripts.
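Typing at the keyboard is only one way to supply stdin. The sketch below shows three others that are handy for testing. MAPPER is a placeholder variable introduced here: point it at ./wc_mapper.php once that file exists and is executable. It defaults to cat so that the sketch runs anywhere.

```shell
# MAPPER is a placeholder; it defaults to cat, which just echoes its input.
MAPPER="${MAPPER:-cat}"

# 1. Pipe: another command's stdout becomes the mapper's stdin.
printf 'hello world\n' | "$MAPPER"

# 2. Redirect: a file becomes the mapper's stdin.
sample=$(mktemp)
printf 'hello hadoop\n' > "$sample"
"$MAPPER" < "$sample"
rm -f "$sample"

# 3. Here-document: inline text becomes the mapper's stdin.
"$MAPPER" <<'EOF'
hello php
EOF
```

In all three cases the program sees end-of-input automatically, so no Ctrl+D is needed.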

Write Reducer

Create a new wc_reducer.php and write the following code:

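#!/usr/bin/php
<?php
// Read "word<TAB>count" lines from standard input.
$in = fopen("php://stdin", "r");
$results = array();
while (($line = fgets($in, 4096)) !== false) {
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    // Start each new word at 0 to avoid an undefined-index notice.
    if (!isset($results[$key])) {
        $results[$key] = 0;
    }
    $results[$key] += $value;
}
fclose($in);
// Sort by word, then emit one "word<TAB>total" line per word.
ksort($results);
foreach ($results as $key => $value) {
    print "$key\t$value\n";
}
?>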

This code counts how many times each word appears and outputs the totals in the following form:

Hello 2

World 1
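For comparison, the same aggregation can be sketched with awk and sort. This is a stand-in for illustration rather than the PHP reducer itself; reduce_counts is a hypothetical name, and the input is assumed to be tab-separated word and count pairs like those the mapper emits.

```shell
# reduce_counts: hypothetical stand-in that sums the counts per word.
reduce_counts() {
    # Field 1 is the word, field 2 the count. The totals are sorted by word,
    # mirroring the ksort() call in the PHP reducer.
    awk -F '\t' '{ total[$1] += $2 } END { for (w in total) print w "\t" total[w] }' |
        LC_ALL=C sort
}

# Hello appears twice in the input and World once.
printf 'Hello\t1\nWorld\t1\nHello\t1\n' | reduce_counts
```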

Use Hadoop to run

Upload the sample text to be counted

hadoop fs -put *.txt /tmp/input

Execute PHP mapreduce program in streaming mode

hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper (absolute path to wc_mapper.php) -reducer (absolute path to wc_reducer.php)

Note:

The input and output directories are paths on HDFS

The mapper and reducer are paths on the local machine. Be sure to write absolute paths, not relative ones; otherwise Hadoop will report an error saying it cannot find the MapReduce program
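One way to avoid typing the absolute paths by hand is to compute them and print the finished command for inspection before running it. This is a sketch under assumptions: abs_path is a hypothetical helper, both scripts are assumed to sit in the current directory, and the jar location is the hadoop-0.20.2 one given earlier. The command is only echoed, not executed.

```shell
# abs_path: hypothetical helper that makes a file path absolute.
abs_path() {
    printf '%s/%s\n' "$(cd "$(dirname "$1")" && pwd)" "$(basename "$1")"
}

# Print (rather than run) the streaming command so the paths can be checked.
# If HADOOP_HOME is unset, the literal text $HADOOP_HOME is printed instead.
echo hadoop jar "${HADOOP_HOME:-\$HADOOP_HOME}/contrib/streaming/hadoop-0.20.2-streaming.jar" \
    -input /tmp/input -output /tmp/output \
    -mapper "$(abs_path wc_mapper.php)" \
    -reducer "$(abs_path wc_reducer.php)"
```

Once the printed paths look right, the same line without the leading echo submits the job.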

View Results

hadoop fs -cat /tmp/output/part-00000
