Hadoop Streaming lets any executable program that supports standard I/O (stdin, stdout) become the mapper or reducer. For example:
The code is as follows:
hadoop jar hadoop-streaming.jar -input SOME_INPUT_DIR_OR_FILE -output SOME_OUTPUT_DIR -mapper /bin/cat -reducer /usr/bin/wc
In this example, the cat and wc tools provided by Unix/Linux are used as the mapper and reducer. Amazing, isn't it?
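Because streaming does little more than wire the two programs together with pipes, with a sort between the map and reduce phases playing the role of the shuffle, that cat/wc job can be dry-run on one machine without Hadoop at all. A minimal sketch (the input lines are made up for illustration):

```shell
# Local dry run of the cat/wc streaming job:
# map (/bin/cat) -> shuffle (sort) -> reduce (/usr/bin/wc)
printf 'one two\nthree\n' | /bin/cat | sort | /usr/bin/wc
```

wc then reports the line, word, and character counts of the sorted mapper output: 2 lines, 3 words, 14 characters for this input.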
If you are used to some dynamic language, you can write MapReduce in it; it is no different from your previous programming, and Hadoop is just the framework that runs it. The following shows how to implement a word-counter MapReduce job in PHP.
1. Find the streaming jar
hadoop-streaming.jar is not in the Hadoop root directory. Streaming is a contrib module, so look under contrib. Taking hadoop-0.20.2 as an example, it is here:
$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar
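The exact path differs between Hadoop releases, so instead of hard-coding it, the jar can be located with find. A small sketch, assuming the HADOOP_HOME environment variable already points at the Hadoop install root:

```shell
# Locate the streaming jar anywhere under the Hadoop install tree.
# HADOOP_HOME is assumed to already point at the Hadoop root directory.
find "${HADOOP_HOME}" -name 'hadoop-*streaming*.jar'
```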
2. Write the mapper
Create a new wc_mapper.php file and write the following code:
#!/usr/bin/php
<?php
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096))
{
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
        $results[] = $word;
}
fclose($in);
foreach ($results as $key => $value)
{
    print "$value\t1\n";
}
The general meaning of this code is: find the words in each line of input text, and output them in the form
"Hello 1
World 1".
It is basically no different from ordinary PHP code, right? Two places may look a little strange to you:
PHP as an executable program
"#!/usr/bin/php" tells Linux to use the /usr/bin/php program as the interpreter for the code that follows. Anyone who has written Linux shell scripts will be familiar with this: the first line of every shell script looks the same, e.g. #!/bin/bash or #!/usr/bin/python.
With this line, after saving the file and making it executable (chmod +x wc_mapper.php), you can run wc_mapper.php directly like cat or grep: ./wc_mapper.php
Use stdin to receive input
PHP supports several ways of receiving input. The most familiar is reading parameters passed in over the web from the $_GET and $_POST superglobals; the second is reading command-line arguments from $_SERVER['argv']. Here, input is read from stdin.
Its usage is as follows:
On the Linux console, enter ./wc_mapper.php
wc_mapper.php runs and waits on the console for keyboard input
Type text on the keyboard
Press Ctrl + D to end the input; wc_mapper.php then executes the real business logic and outputs the result
So where is stdout? print itself writes to stdout, which is no different from the web programs and CLI scripts we have written before.
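Since the mapper's whole contract is just "read lines from stdin, print one word<TAB>1 line per word to stdout", the same behavior can be sketched with standard Unix tools, which is handy for eyeballing what wc_mapper.php should emit. This is only an illustrative stand-in, not the PHP script itself:

```shell
# Stand-in for wc_mapper.php: split stdin into words,
# then emit one "word<TAB>1" line per word.
printf 'Hello World\nHello Hadoop\n' \
  | tr -cs '[:alnum:]_' '\n' \
  | awk 'NF { print $0 "\t" 1 }'
```

Piping real input into ./wc_mapper.php should produce the same shape of output.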
3. Write the reducer
Create a new wc_reducer.php file and write the following code:
#!/usr/bin/php
<?php
$in = fopen("php://stdin", "r");
$results = array();
while ($line = fgets($in, 4096))
{
    list($key, $value) = preg_split("/\t/", trim($line), 2);
    if (!isset($results[$key]))
        $results[$key] = 0;
    $results[$key] += $value;
}
fclose($in);
ksort($results);
foreach ($results as $key => $value)
{
    print "$key\t$value\n";
}
The purpose of this code is to count how many times each word appears, and output the result in the form
"Hello 2
World 1".
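With both scripts in place, the whole job can be checked end to end on one machine, because Hadoop Streaming is conceptually just mapper | sort | reducer: locally, that is cat input.txt | ./wc_mapper.php | sort | ./wc_reducer.php. The sketch below simulates that pipeline with tool stand-ins for the two PHP scripts, so the data flow can be seen without PHP or Hadoop installed:

```shell
# Simulated streaming pipeline: mapper -> shuffle (sort) -> reducer.
# The tr/awk stages stand in for wc_mapper.php and wc_reducer.php.
printf 'Hello World\nHello Hadoop\n' \
  | tr -cs '[:alnum:]_' '\n' \
  | awk 'NF { print $0 "\t" 1 }' \
  | sort \
  | awk -F'\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }' \
  | sort
```

The final sort plays the role of ksort in wc_reducer.php; for this input the pipeline yields Hadoop 1, Hello 2, World 1.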
4. Run with Hadoop
Upload the sample text to be counted:
hadoop fs -put *.txt /tmp/input
Execute the PHP MapReduce program in streaming mode:
hadoop jar hadoop-0.20.2-streaming.jar -input /tmp/input -output /tmp/output -mapper /absolute/path/to/wc_mapper.php -reducer /absolute/path/to/wc_reducer.php
Note:
The input and output directories are paths on HDFS.
The mapper and reducer are paths on the local machine. Be sure to write absolute paths rather than relative paths; otherwise Hadoop reports an error saying that the MapReduce program cannot be found.
View the results:
hadoop fs -cat /tmp/output/part-00000
5. A shell version of the Hadoop MapReduce program
The code is as follows:
#!/bin/bash -

# Load the configuration file
source './config.sh'

# Process command-line arguments
while getopts "d:" arg
do
    case $arg in
        d)
            date=$OPTARG
            ;;
        ?)
            echo "unknown argument"
            exit 1
            ;;
    esac
done

# The default processing date is yesterday
default_date=$(date -v-1d +%Y-%m-%d)

# Final processing date; exit if the date format is invalid
date=${date:-${default_date}}
if ! [[ "$date" =~ [12][0-9]{3}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01]) ]]
then
    echo "invalid date (yyyy-mm-dd): $date"
    exit 1
fi

# Files to be processed
log_files=$(${hadoop_home}bin/hadoop fs -ls ${log_file_dir_in_hdfs} | awk '{print $8}' | grep $date)

# Exit if the number of files to be processed is zero
log_files_amount=$(echo "$log_files" | grep -c .)
if [ $log_files_amount -lt 1 ]
then
    echo "no log files found"
    exit 0
fi

# Build the input file list
for f in $log_files
do
    input_files_list="${input_files_list} $f"
done

function map_reduce() {
    if ${hadoop_home}bin/hadoop jar ${streaming_jar_path} -input ${input_files_list} -output ${mapreduce_output_dir}${date}/${1}/ -mapper "${mapper} ${1}" -reducer "${reducer}" -file "${mapper}"
    then
        echo "streaming job done!"
    else
        exit 1
    fi
}

# Process each bucket in turn
for bucket in ${bucket_list[@]}
do
    map_reduce $bucket
done