This article covers three questions:
1. How to pass parameters to the map and reduce functions when writing a MapReduce program in Java.
2. How to pass parameters to the map and reduce scripts when writing a MapReduce program with streaming (C++, Shell, Python).
3. How to pass a file or folder to the map and reduce scripts when writing a streaming MapReduce program: (1) streaming loading a single local file; (2) streaming loading multiple local files; (3) streaming loading a local directory; (4) reading an HDFS file in the MapReduce script when programming with streaming; (5) reading an HDFS directory in the MapReduce script when programming with streaming.
1. How to pass parameters to the map and reduce functions when writing a MapReduce program in Java
At first I tried to pass the parameters in the following way.
Declare two static variables in the main class, assign values to them in the main function, and try to read those values in the map and reduce functions. The code structure is similar to the following:
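(The original code is not shown; the following is a minimal reconstruction of the static-variable approach described above, using the new org.apache.hadoop.mapreduce API. Only one variable, maxscore, is shown for brevity, and the class and value names are illustrative.)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StaticParamJob {
    static int maxscore = 1;   // static variable, initial value 1

    public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // On the cluster this always sees the initial value 1, not the value set in main()
            context.write(new Text("maxscore=" + maxscore), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        maxscore = 100;   // assigned here, but only in the client JVM
        Configuration conf = new Configuration();
        Job job = new Job(conf, "static-param-demo");
        job.setJarByClass(StaticParamJob.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}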
When I committed the job to the cluster, I found that in the map and reduce functions the value of the static variable maxscore was always the initial value 1. I then tried assigning the variable in a static initializer block of the main class (since the static block runs before the code in main), which was also unsuccessful; the value of maxscore remained the initial 1. Running the same code on a single-machine Hadoop installation works normally, and the value of the variable can be obtained in the map function. The reason is that after the job is submitted to the Hadoop cluster, the Mapper and Reducer classes run on the individual TaskTrackers, in JVMs separate from the main class, with no interaction between them. So this naive way of passing parameters to the map and reduce functions does not work. One can think of workarounds, such as writing the parameters to an HDFS file and then reading that file in the run (or setup) method of the Mapper and Reducer classes and assigning the values to the appropriate variables, but this approach is rather cumbersome. The code is roughly as follows:
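(The original code for this workaround is not shown either; the sketch below assumes the parameter was first written by the client to a hypothetical HDFS path /params/maxscore.txt and is read back in the Mapper's setup method. Required imports: org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, java.io.*.)
public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private int maxscore = 1;

    @Override
    protected void setup(Context context) throws IOException {
        // Read the parameter back from the (hypothetical) HDFS file written by the client
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/params/maxscore.txt"))));
        maxscore = Integer.parseInt(in.readLine().trim());
        in.close();
    }

    // map() can now use maxscore as usual
}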
Although the above method works, it is not the standard way. The following are the common methods. (1) Passing parameters through the Configuration object: call the set method in the main function to set the parameter, for example:
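(A minimal sketch; the key name "maxscore" and its value are illustrative.)
Configuration conf = new Configuration();
conf.set("maxscore", "12989");          // key and value are both strings
Job job = new Job(conf, "param-demo");  // set the parameter before creating the Job, which copies the Configuration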
In the Mapper, get the Configuration of the current job and read the parameter, for example:
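(A minimal sketch of the Mapper side, matching the key set above; reading the task ID is shown as an example of the other job information available from the context.)
@Override
protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    String maxscore = conf.get("maxscore");                  // read the parameter set in main()
    String taskId = context.getTaskAttemptID().toString();   // e.g. the current task ID
}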
Note: the Context can be used to obtain a lot of information about the current job, such as the task ID obtained above.
(2) Using the org.apache.hadoop.io.DefaultStringifier class
Example:
In main:
Configuration conf = new Configuration();
Text maxscore = new Text("12989");
DefaultStringifier.store(conf, maxscore, "maxscore");
In this way, the Text object maxscore is stored in the Configuration object under the key "maxscore"; the load method is then called in the map and reduce functions to read the object back.
Getting it in the Mapper:
Configuration conf = context.getConfiguration();
Text out = DefaultStringifier.load(conf, "maxscore", Text.class);
Note that the object being passed must implement the serialization interface; Hadoop's serialization is implemented through the Writable interface.
Reference: http://blog.sina.com.cn/s/blog_6b7cf18f0100x9jg.html
2. How to pass parameters to the map and reduce scripts when writing a streaming program
You can set environment variables with streaming's -cmdenv option and then read those environment variables in the map and reduce scripts.
Refer to "Hadoop Streaming Advanced Programming":
http://dongxicheng.org/mapreduce/hadoop-streaming-advanced-programming/
(0) Job Submission script:
#!/usr/bin/env bash
# array is assumed to be defined earlier in the script with the four values
max_read_count=${array[0]}
min_read_count=${array[1]}
max_write_count=${array[2]}
min_write_count=${array[3]}
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=1 \
    -input $input \
    -output $output \
    -mapper $mapper_script \
    -file $map_file \
    -reducer $reducer_script \
    -file $reduce_file \
    -cmdenv "max_read_count=${array[0]}" \
    -cmdenv "min_read_count=${array[1]}" \
    -cmdenv "max_write_count=${array[2]}" \
    -cmdenv "min_write_count=${array[3]}"
# -cmdenv sets an environment variable for the tasks (here max_read_count, min_read_count,
# max_write_count and min_write_count); use -cmdenv once per variable.
(1) Python mapper.py
#!/usr/bin/env python
import sys
import os
min_r_count = float(os.environ.get('min_read_count'))    # read the environment variables set via -cmdenv
max_r_count = float(os.environ.get('max_read_count'))
min_w_count = float(os.environ.get('min_write_count'))
max_w_count = float(os.environ.get('max_write_count'))
(2) Shell mapper.sh
#!/usr/bin/env bash
while read line    # read a line from standard input
do
    a=$line
done
echo $min_read_count $max_read_count    # read the environment variables set via -cmdenv
(3) C++ mapper.c
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[], char *env[])
{
    double min_r_count;   /* declared but not used in this snippet */
    int i = 0;
    /* env[i] holds the environment variables; each entry has the form NAME=value,
       so the value has to be extracted from the string */
    for (i = 0; env[i] != NULL; i++)
    {
        if (strstr(env[i], "PATH=")) {
            char *p = NULL;
            p = strstr(env[i], "=");
            if ((p - env[i]) == 4)
                printf("%s\n", ++p);    /* value of the PATH environment variable */
        }
        if (strstr(env[i], "min_write_count=")) {
            char *p = NULL;
            p = strstr(env[i], "=");
            if ((p - env[i]) == strlen("min_write_count"))
                printf("%s\n", ++p);    /* value of the min_write_count environment variable */
        }
    }
    char eachline[200] = {0};
    while (fgets(eachline, 199, stdin))    /* read a line from stdin */
    {
        printf("%s", eachline);
    }
    return 0;
}
Note: the options to the hadoop command are positional, in the order bin/hadoop command [genericOptions] [commandOptions]. For streaming, -D belongs to the generic options (the options common to all Hadoop commands), so it must be written before the streaming options. All streaming options can be listed with: hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar -info
3. How to pass a file or folder to the map and reduce scripts when writing a streaming program
(1) Streaming: loading a single local file
Streaming supports the -file option: the local file (note: a local file) that follows -file is packaged into the job's jar file and shipped as part of the job submission. This makes it possible to access the packaged file in the MapReduce script as if it were a local file.
Example:
Job Submission File run.sh
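(The original run.sh is not shown; the following is a minimal sketch. The input/output paths and the mapper file name are placeholders; -file logs/wbscoretest.log matches the note below.)
#!/usr/bin/env bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "python mapper.py" \
    -file mapper.py \
    -file logs/wbscoretest.log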
mapper.py
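(Also a minimal sketch; the shipped file is opened by its bare name, as explained in the note below, and the mapper simply passes its input through.)
#!/usr/bin/env python
import sys
score_data = open('wbscoretest.log').read()   # the shipped file, opened by its bare name
for line in sys.stdin:
    sys.stdout.write(line)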
Note: when submitting the job, -file logs/wbscoretest.log is used to specify the file to load. In the map script you only need to read the file wbscoretest.log directly; you do not need to write logs/wbscoretest.log, because only the file wbscoretest.log is loaded, not the logs directory together with the wbscoretest.log file.
(2) Streaming: loading multiple local files
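(No example is given in the original for this case; a minimal sketch, assuming the same submission script as in (1), is to pass -file once per file. logs/another.log is a hypothetical second file.)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "python mapper.py" \
    -file mapper.py \
    -file logs/wbscoretest.log \
    -file logs/another.log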
(3) Streaming: loading a local directory (to load multiple directories, separate them with commas: -files dir1,dir2,dir3)
In my experiments, streaming's -file option cannot load a local directory.
We can instead use Hadoop's generic -files option to load a local directory; the MapReduce script can then access the loaded directory as if it were a local directory.
In practice, we use this method to load the word-segmentation dictionary when writing a word-segmentation MapReduce job.
Job Submission Script:
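(The original script is not shown; a minimal sketch follows, assuming a hypothetical local dictionary directory named dict_dir. Note that -files is a generic option and therefore comes before the streaming options.)
#!/usr/bin/env bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -files dict_dir \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "python mapper.py" \
    -file mapper.py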
Map script: reads the files under the directory.
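(A minimal sketch of the map script; dict_dir keeps its name in the task's working directory, and the dictionary file names inside it are whatever the directory contains.)
#!/usr/bin/env python
import os
import sys
# load every dictionary file shipped under dict_dir
words = set()
for name in os.listdir('dict_dir'):
    for line in open(os.path.join('dict_dir', name)):
        words.add(line.strip())
for line in sys.stdin:
    sys.stdout.write(line)   # ... use the dictionary as the job requires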
To load multiple directories:
Note: multiple directories must be separated by commas with no spaces in between, otherwise it will fail. This restriction is quite painful.
For example:
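(A sketch; only the -files line of the submission script above changes. dict_dir1, dict_dir2 and dict_dir3 are hypothetical directory names; note there are no spaces after the commas.)
    -files dict_dir1,dict_dir2,dict_dir3 \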
(4) Streaming: reading an HDFS file in the MapReduce script
Use the -files option followed by the HDFS path of the file that needs to be read. The file can then be accessed directly by its file name in the MapReduce script.
Job Submission Script:
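(The original script is not shown; a minimal sketch follows. The hdfs:// URI is a placeholder for the actual namenode address and file path.)
#!/usr/bin/env bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -files hdfs://namenode:9000/user/test/wbscoretest.log \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "python mapper.py" \
    -file mapper.py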
Map script:
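(A minimal sketch; the HDFS file shipped with -files appears in the task's working directory under its base name.)
#!/usr/bin/env python
import sys
data = open('wbscoretest.log').read()   # the HDFS file, accessed by its base name
for line in sys.stdin:
    sys.stdout.write(line)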
If a large file needs to be loaded, we can upload the file to HDFS first and then read it from HDFS in the MapReduce script.
(5) Streaming: reading an HDFS directory in the MapReduce script
Use the -files option followed by the HDFS directory you want to read. The directory can then be accessed in the MapReduce script as if it were a local directory.
Job Submission Script:
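(A minimal sketch; the hdfs:// URI is a placeholder, and tmp_kentzhan is the directory name used below.)
#!/usr/bin/env bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -files hdfs://namenode:9000/user/test/tmp_kentzhan \
    -input /user/test/input \
    -output /user/test/output \
    -mapper "python mapper.py" \
    -file mapper.py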
Map script: reads the tmp_kentzhan directory directly.
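(A minimal sketch; the directory keeps its name tmp_kentzhan in the task's working directory.)
#!/usr/bin/env python
import os
import sys
for name in os.listdir('tmp_kentzhan'):   # list the files in the shipped directory
    data = open(os.path.join('tmp_kentzhan', name)).read()
for line in sys.stdin:
    sys.stdout.write(line)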
That covers how to pass parameters to map and reduce scripts and how to load files and directories.