Hadoop Streaming
Although Hadoop is written in Java, it provides Hadoop Streaming, an API that lets you write map and reduce functions in any language.
The key to Hadoop Streaming is that it uses standard UNIX streams as the interface between your program and Hadoop. Any program that can read data from standard input and write data to standard output can therefore serve as the map or reduce function of a MapReduce program, whatever language it is written in.
For example:
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
Hadoop Streaming ships as the package hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory: streaming is a contrib module, so look for it under contrib. Taking hadoop-0.20.2 as an example, it is in the contrib/streaming directory. The options used in the command above are:
-input: Specifies the path of the input files on HDFS.
-output: Specifies the path of the output directory on HDFS.
-mapper: Specifies the program that implements the map function.
-reducer: Specifies the program that implements the reduce function.
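The stdin/stdout contract can be tried without a cluster. The sketch below is only an approximation of what a streaming job does (map, sort by key, reduce), written with standard UNIX tools. The functions mapper and reducer are hypothetical stand-ins, not the PHP scripts from this article, and the word splitting with tr is an assumption that differs slightly from a regular-expression split.

```shell
#!/bin/sh
# Stand-in mapper: read text on stdin, emit "word<TAB>1" for every word.
mapper() {
    tr -cs '[:alnum:]' '\n' |          # one word per line
        tr '[:upper:]' '[:lower:]' |   # lowercase everything
        awk 'NF { print $0 "\t1" }'    # skip empty lines, append the count
}

# Stand-in reducer: read sorted "word<TAB>count" lines, sum counts per word.
reducer() {
    awk -F '\t' '
        $1 != prev { if (NR > 1) print prev "\t" sum; prev = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) print prev "\t" sum }
    '
}

# Local approximation of a streaming job: map, sort by key, reduce.
printf 'Hello World\nhello hadoop\n' | mapper | LC_ALL=C sort | reducer
```

Run as is, this prints hadoop 1, hello 2 and world 1, one tab-separated pair per line. The same pipeline shape can be used to try real scripts locally by replacing the two functions with the mapper.php and reducer.php programs named in the command above.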
Mapper Function
Write the following code to the file mapper.php:
#!/usr/local/php/bin/php
<?php
$word2count = array();

// input comes from STDIN (standard input)
// (you could also open it yourself: $stdin = fopen('php://stdin', 'r');)
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word the first time it is seen
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}

// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>
In short, this code finds the words in each line of input text and prints each word with its count, separated by a tab, in this form (the words are lowercase because the mapper calls strtolower):
hello 1
world 1
This is basically no different from the PHP you have written before, right? Two things may look a little strange:
PHP as an executable program
The first line
#!/usr/local/php/bin/php
tells Linux to use /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this: the first line of a script usually looks like #!/bin/bash or #!/usr/bin/python.
With this line in place, once you save the file and make it executable (chmod +x mapper.php), you can run mapper.php directly, just like commands such as cat and grep: ./mapper.php
Use stdin to receive input
PHP supports several ways of receiving input. The most familiar is reading parameters passed through the Web from the $_GET and $_POST superglobals; the second is reading parameters passed on the command line from $_SERVER['argv']. Here the input comes from standard input (stdin).
Its usage is as follows:
1) Type ./mapper.php in the Linux console.
2) mapper.php starts running and the console waits for keyboard input.
3) Type some text on the keyboard.
4) Press Ctrl + D to end the input. mapper.php then finishes processing and prints the result.
So where is stdout? print (or echo) already writes to stdout, just as in the web programs and CLI scripts we have written before.
Reducer Function
Create the file reducer.php and write the following code:
#!/usr/local/php/bin/php
<?php
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php
    list($word, $count) = explode(chr(9), $line);
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing each word the first time it is seen
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>
This code adds up how many times each word appears in the mapper output and prints the totals, separated by a tab, in this form:
hello 2
world 1
Run with Hadoop
Put the file into Hadoop DFS:
bin/hadoop dfs -put test.log test
Run the PHP MapReduce program in streaming mode to process the text:
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out
Note:
1) The input and output directories are on HDFS.
2) The mapper and reducer are paths on the local machine. Be sure to write absolute paths rather than relative ones; otherwise Hadoop reports that the MapReduce program cannot be found.
3) mapper.php and reducer.php must be copied to the same path on all DataNode servers, and PHP must be installed on every server at the same installation path.
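Notes 2 and 3 can be checked before submitting the job. The following is a minimal sketch, assuming a POSIX shell; check_script is a hypothetical helper, not part of Hadoop, and it would have to be run on each node to cover note 3.

```shell
#!/bin/sh
# check_script is a hypothetical helper, not part of Hadoop.
# It succeeds only if its argument is an absolute path to an executable file.
check_script() {
    case "$1" in
        /*) ;;                                    # absolute path: keep going
        *)  echo "not an absolute path: $1" >&2
            return 1 ;;
    esac
    if [ ! -x "$1" ]; then
        echo "missing or not executable: $1" >&2
        return 1
    fi
    return 0
}

# Paths taken from the streaming command in this article.
for script in /usr/local/hadoop/mapper.php /usr/local/hadoop/reducer.php; do
    if check_script "$script"; then
        echo "ok: $script"
    fi
done
```

The check covers only the script paths. It does not verify that the PHP interpreter named on the first line of each script exists on the node.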
View results
bin/hadoop dfs -cat /tmp/out/part-00000