Using PHP to Write a MapReduce Program for Hadoop
Hadoop Streaming
Although Hadoop is written in Java, it ships with a utility called Hadoop Streaming, which lets users write the map and reduce functions in any language.
The key to Hadoop Streaming is that it uses UNIX standard streams as the interface between your program and Hadoop. Any program that can read data from standard input and write data to standard output can therefore serve as the map or reduce function of a MapReduce job.
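To make the standard-stream contract concrete, here is a minimal sketch of a filter that reads lines from one stream and writes tab-separated records to another. The function name emitLines and the in-memory demo streams are assumptions for illustration only; under Hadoop Streaming the same function would be called with STDIN and STDOUT.

```php
<?php
// Hypothetical helper: copy each non-empty input line to the output
// as "line<TAB>1" and return how many records were written.
function emitLines($in, $out)
{
    $emitted = 0;
    while (($line = fgets($in)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        fwrite($out, $line . "\t1" . PHP_EOL);
        $emitted++;
    }
    return $emitted;
}

// Demo with in-memory streams standing in for STDIN and STDOUT.
$in = fopen('php://memory', 'r+');
fwrite($in, "hello world\n\nhello hadoop\n");
rewind($in);
$out = fopen('php://memory', 'r+');

$count = emitLines($in, $out);

rewind($out);
echo stream_get_contents($out);
```

Because the function only depends on stream handles, the same logic can be exercised locally and then run unchanged inside a streaming job.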
Example:
    bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
The jar that provides Hadoop Streaming here is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory: streaming is a contrib module, so the jar is found under the contrib directory (this applies to hadoop-0.20.2 as well), as in the command above. The options are:
-input: the HDFS path of the input files
-output: the HDFS path of the output directory
-mapper: the program that implements the map function
-reducer: the program that implements the reduce function
Mapper function
Create the mapper.php file and write the following code:
    #!/usr/local/php/bin/php
    <?php
    $word2count = array();
    // Input comes from STDIN (standard input).
    // You could also open it explicitly: $stdin = fopen('php://stdin', 'r');
    while (($line = fgets(STDIN)) !== false) {
        // Remove leading and trailing whitespace and convert to lowercase
        $line = strtolower(trim($line));
        // Split the line into words, dropping any empty strings
        $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
        // Increase counters
        foreach ($words as $word) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += 1;
        }
    }
    // Write the results to STDOUT (standard output).
    // What we output here will be the input for the
    // reduce step, i.e. the input for reducer.php
    foreach ($word2count as $word => $count) {
        // Tab-delimited
        echo $word, chr(9), $count, PHP_EOL;
    }
    ?>
Roughly speaking, this code finds the words in each line of the input text and outputs them in the form:
    hello 1
    world 1
where the word and its count are separated by a tab.
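The per-line logic can be tried on its own, without Hadoop. The following is a small sketch, not the mapper itself: the helper name countWords and the sample sentence are made up for illustration, and array_count_values stands in for the explicit counter loop.

```php
<?php
// Hypothetical helper mirroring the mapper's per-line logic.
function countWords($line)
{
    // \W matches any non-word character, so spaces and punctuation
    // act as separators; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    // Count how often each word occurs in this line.
    return array_count_values($words);
}

$counts = countWords("Hello world, hello Hadoop!");
foreach ($counts as $word => $count) {
    echo $word, "\t", $count, PHP_EOL;
}
```

For this sample line the sketch prints hello with a count of 2, followed by world and hadoop with a count of 1 each.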
This is basically no different from the PHP you have written before, although two things may look a little unfamiliar:
PHP as an executable program
The first line

    #!/usr/local/php/bin/php

tells Linux to use /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this notation: it is the first line of every such script, for example #!/bin/bash or #!/usr/bin/python.
With this line in place, once the file is saved and marked executable (chmod +x mapper.php), you can run mapper.php directly, just like cat or grep: ./mapper.php
Using stdin to receive input
PHP supports several ways of receiving parameters. The most familiar is reading parameters passed over the web from the $_GET and $_POST superglobals; the next is reading command-line arguments from $_SERVER['argv']. Here, standard input (stdin) is used instead.
In practice it works like this:
1) In the Linux console, enter ./mapper.php
2) mapper.php starts and the console waits for keyboard input
3) The user types text
4) The user presses Ctrl+D to end the input; mapper.php then finishes its real business logic and outputs the results
So where is stdout? print (like echo) already writes to stdout, exactly as in the web programs and CLI scripts we have written before.
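Typing test input by hand gets tedious, so it can be convenient to let a script read either from a file named on the command line or from stdin. The sketch below is one possible way to do that; the helper name openInput and the temporary demo file are assumptions, not part of the original scripts.

```php
<?php
// Hypothetical helper: choose the input stream for a CLI script.
// If a readable file path is given as the first argument, read from it;
// otherwise fall back to standard input.
function openInput(array $args)
{
    if (isset($args[1]) && is_readable($args[1])) {
        return fopen($args[1], 'r');
    }
    return fopen('php://stdin', 'r');
}

// Demo: write a temporary file and read it line by line,
// the same way the mapper reads STDIN.
$path = tempnam(sys_get_temp_dir(), 'mapper_demo_');
file_put_contents($path, "first line\nsecond line\n");

$in = openInput(array('mapper.php', $path));
$lines = array();
while (($line = fgets($in)) !== false) {
    $lines[] = trim($line);
}
fclose($in);
unlink($path);

print_r($lines);
```

In a real script the call would be openInput($_SERVER['argv']), so that running it with no arguments still reads from stdin as Hadoop Streaming expects.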
Reducer function
Create the reducer.php file and write the following code:
    #!/usr/local/php/bin/php
    <?php
    $word2count = array();
    // Input comes from STDIN
    while (($line = fgets(STDIN)) !== false) {
        // Remove leading and trailing whitespace
        $line = trim($line);
        // Skip lines that are not in "word<TAB>count" form
        if (strpos($line, chr(9)) === false) {
            continue;
        }
        // Parse the input we got from mapper.php
        list($word, $count) = explode(chr(9), $line);
        // Convert count (currently a string) to int
        $count = intval($count);
        // Sum counts
        if ($count > 0) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += $count;
        }
    }
    // Sort the words lexicographically.
    //
    // This step is not required; we just do it so that the
    // final output looks more like the official Hadoop
    // word count examples
    ksort($word2count);
    // Write the results to STDOUT (standard output)
    foreach ($word2count as $word => $count) {
        echo $word, chr(9), $count, PHP_EOL;
    }
    ?>
This code counts how many times each word appears and outputs the result in the form:
    hello 2
    world 1
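The reducer above keeps every word in an array until the input ends. As an alternative sketch (not the author's version), a reducer can rely on its input arriving sorted by key, which is how Hadoop Streaming normally delivers mapper output to a reducer, and hold only the running total for the current word. The function name reduceSorted and the in-memory demo data are assumptions for illustration.

```php
<?php
// Hypothetical alternative reducer: assumes the input is sorted by word,
// so all records for one word are adjacent.
function reduceSorted($in, $out)
{
    $current = null; // word whose counts are being summed
    $total = 0;
    while (($line = fgets($in)) !== false) {
        $parts = explode("\t", rtrim($line, "\r\n"), 2);
        if (count($parts) !== 2) {
            continue; // skip malformed lines
        }
        list($word, $count) = $parts;
        if ($word !== $current) {
            // A new word starts: flush the total of the previous one.
            if ($current !== null) {
                fwrite($out, $current . "\t" . $total . PHP_EOL);
            }
            $current = $word;
            $total = 0;
        }
        $total += intval($count);
    }
    // Flush the last word.
    if ($current !== null) {
        fwrite($out, $current . "\t" . $total . PHP_EOL);
    }
}

// Demo with in-memory streams standing in for STDIN and STDOUT.
$in = fopen('php://memory', 'r+');
fwrite($in, "hello\t1\nhello\t1\nworld\t1\n");
rewind($in);
$out = fopen('php://memory', 'r+');
reduceSorted($in, $out);
rewind($out);
$result = stream_get_contents($out);
echo $result;
```

The trade-off is that this version produces wrong totals if the input is not sorted, whereas the array-based reducer does not care about order but uses memory proportional to the number of distinct words.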
Running with Hadoop
Put the input file into HDFS:
    bin/hadoop dfs -put test.log test
Run the PHP programs to process the text (the PHP MapReduce program is executed through streaming):
    bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out
Note:
1) The input and output directories are paths on HDFS.
2) The mapper and reducer are paths on the local machine. Be sure to use absolute paths rather than relative ones, otherwise Hadoop may report that it cannot find the MapReduce program.
3) mapper.php and reducer.php must be copied to the same path on every DataNode server, and every server must have PHP installed at the same path.
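Notes 2 and 3 can be checked on each server before submitting a job. The following is a rough preflight sketch under the assumption of a Linux host; the function name checkScript and the temporary demo file are hypothetical, and in practice you would pass paths such as /usr/local/hadoop/mapper.php.

```php
<?php
// Hypothetical preflight check: returns a list of problems found
// for a streaming script path (an empty list means it looks fine).
function checkScript($path)
{
    $problems = array();
    if ($path === '' || $path[0] !== '/') {
        $problems[] = 'not an absolute path';
    }
    if (!is_file($path)) {
        $problems[] = 'file does not exist';
    } elseif (!is_executable($path)) {
        $problems[] = 'file is not executable';
    }
    return $problems;
}

// Demo with a temporary file standing in for mapper.php.
$path = tempnam(sys_get_temp_dir(), 'mapper_demo_');
$before = checkScript($path);   // temp files are created without execute permission
chmod($path, 0755);
clearstatcache();               // make sure the permission change is seen
$after = checkScript($path);
unlink($path);

echo 'before chmod: ', implode(', ', $before), PHP_EOL;
echo 'after chmod: ', count($after) === 0 ? 'ok' : implode(', ', $after), PHP_EOL;
```

This only checks the local machine it runs on, so it would need to be run on every server that executes the mapper and reducer.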
View the results
    bin/hadoop dfs -cat /tmp/out/part-00000