Duang~ it's been a long time since I updated this blog, and the reason is simple: an internship. Honestly, working here has made me feel hopelessly weak. Week one: configuring environments. Week two: data visualization, including some fancy Excel 2013 skills such as PivotTables and Power Map for drawing 3D maps. I had originally planned to build an interactive graphical interface with Matplotlib embedded in Tkinter, but the charts simply couldn't match Excel 2013, and since I haven't studied Tkinter or Matplotlib deeply enough to pull it off in a short time, last week was a disaster. Now, in week three, I'm starting on Hadoop. Most programs that run on Hadoop are written in Java, but after a week as a Java beginner I decisively chose to run Python on Hadoop instead. Yes, Python is a deep pit; please follow me into it, and follow along with this tutorial to learn how to write Hadoop MapReduce jobs in Python!
As for Hadoop itself, I recommend following an online tutorial to build a single-node or multi-node Hadoop platform on your own Linux machine; here I'm logging directly into a server, so the environment is already set up. As for MapReduce, I'm a novice and can only think of it in terms of "divide and conquer": the "map" step is the "division", splitting up the data, and the "reduce" step performs further operations on the map results. The example here is the classic Hadoop starter program, WordCount: first write a map program that splits the input text into individual words, then reduce those words, counting occurrences of the same word and outputting each distinct word with its frequency. That's the idea behind our simple program, so let's play!
Note: data is read and written through sys.stdin (standard input) and sys.stdout (standard output). Every script also needs to be made executable before it can run, for example with "chmod +x mapper.py" after creating the script below.
1. mapper.py

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:               # iterate over each line read from stdin
    line = line.strip()              # strip whitespace from both ends of the line
    words = line.split()             # split the sentence into single words on spaces
    for word in words:
        print('%s\t%s' % (word, 1)) # emit "word<TAB>1" for every word
```
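To sanity-check the mapper logic without Hadoop, here is a small pure-Python simulation of what it emits; the helper name and the sample input are mine, not part of the original script:

```python
def map_words(lines):
    """Simulate mapper.py: emit one 'word<TAB>1' string per word."""
    emitted = []
    for line in lines:
        for word in line.strip().split():
            emitted.append('%s\t%s' % (word, 1))
    return emitted

# a hypothetical two-line input
for pair in map_words(["foo foo quux", "labs foo"]):
    print(pair)
```

Each word produces its own `word\t1` line, duplicates included; merging the duplicates is entirely the reducer's job.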
2. reducer.py

```python
#!/usr/bin/env python
import sys

current_word = None    # the word currently being counted
current_count = 0      # running frequency of the current word
word = None

for line in sys.stdin:
    line = line.strip()                # strip whitespace from both ends of the line
    word, count = line.split('\t', 1)  # separate word and count on the tab

    try:
        count = int(count)             # convert the string '1' to the integer 1
    except ValueError:
        continue                       # skip malformed lines

    if current_word == word:           # if the incoming word equals the current word
        current_count += count         # add to its frequency
    else:
        if current_word:               # print the previous word and its frequency
            print('%s\t%s' % (current_word, current_count))
        current_count = count          # then make the incoming word current
        current_word = word

if current_word == word:               # don't forget to flush the last word
    print('%s\t%s' % (current_word, current_count))
```
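The reducer's single-pass bookkeeping can also be checked locally. This is a sketch of the same algorithm written as a function over (word, count) pairs; the function name and sample data are my own, for illustration only:

```python
def reduce_sorted(pairs):
    """Collapse sorted (word, count) pairs into per-word totals,
    using the same current-word bookkeeping as reducer.py."""
    totals = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count           # same word as before: accumulate
        else:
            if current_word is not None:     # flush the finished word
                totals.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:             # flush the final word
        totals.append((current_word, current_count))
    return totals

# sorting first puts identical words on adjacent lines, as Hadoop does
pairs = sorted([('foo', 1), ('bar', 1), ('foo', 1), ('zoo', 1)])
print(reduce_sorted(pairs))  # [('bar', 1), ('foo', 2), ('zoo', 1)]
```

Note that the logic only ever remembers one word at a time, which is why it depends on the input being sorted.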
Run the following script in the shell to see the output:
```shell
$ echo "foo foo quux labs foo bar zoo zoo hying" | /home/wuying/mapper.py | sort -k1,1 | /home/wuying/reducer.py
```

echo writes the quoted string to standard output, and the pipe character "|" passes it to mapper.py as input data; mapper.py's output is in turn sorted and fed into reducer.py. The sort -k1,1 step sorts the mapper's output in ascending order by the first column (the word), which puts identical words on adjacent lines before they reach the reducer.
Actually, I find the way reducer.py tracks word frequency a bit cumbersome. Storing the words in a dictionary, with each word as the key and its frequency as the value, and then counting from there feels like it would be more efficient. So here is the improved script:
mapper_1.py
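The script itself didn't survive in this post, so what follows is my reconstruction of the idea described above, not the original mapper_1.py: it reads the mapper's tab-separated output and accumulates counts in a dict keyed by word, so the input does not need to be sorted first.

```python
#!/usr/bin/env python
import sys

def count_pairs(lines):
    """Accumulate 'word<TAB>count' lines into a dict keyed by word."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        word, _, count = line.partition('\t')
        try:
            counts[word] = counts.get(word, 0) + int(count)
        except ValueError:
            continue                 # skip malformed counts
    return counts

if __name__ == '__main__':
    counts = count_pairs(sys.stdin)
    # print totals sorted by word, in the same tab-separated format
    for word in sorted(counts):
        print('%s\t%s' % (word, counts[word]))
```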
Written that way, however, it effectively makes two passes over the data, which seems less efficient. The key thing I didn't understand at first was the role of current_word and current_count: if they are literally just "the current word", how are they different from the word and count read in during the loop? The answer is that Hadoop sorts the mapper output before it reaches the reducer, so identical words arrive on adjacent lines; the reducer therefore only needs to remember one word at a time, adding to current_count while the incoming word matches and flushing the total the moment it changes.
Here's a look at the output of these scripts:
We can see that, given the same input data, the shell behaves differently with the two reducers: the second run did not sort the data first, which makes its output hard to interpret.
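That behaviour is exactly what you'd expect from a single-pass reducer: it only merges adjacent duplicates, so without the sort step repeated words are never combined. A small illustration (the function and data are mine, for demonstration only):

```python
def single_pass(pairs):
    """Merge only adjacent duplicate words, like reducer.py does."""
    merged = []
    for word, count in pairs:
        if merged and merged[-1][0] == word:
            merged[-1] = (word, merged[-1][1] + count)  # adjacent duplicate
        else:
            merged.append((word, count))                # new run starts
    return merged

pairs = [('foo', 1), ('bar', 1), ('foo', 1)]
print(single_pass(pairs))           # unsorted: [('foo', 1), ('bar', 1), ('foo', 1)]
print(single_pass(sorted(pairs)))   # sorted:   [('bar', 1), ('foo', 2)]
```

With unsorted input the two 'foo' entries stay separate; after sorting they sit next to each other and get merged, which is why the sort (shuffle) phase between map and reduce matters.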
Let Python code run on Hadoop!
First, prepare the input data
Next, download three books from Project Gutenberg:
```shell
$ mkdir -p tmp/gutenberg
$ cd tmp/gutenberg
$ wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
$ wget http://www.gutenberg.org/files/5000/5000-8.txt
$ wget http://www.gutenberg.org/ebooks/4300.txt.utf-8
```
Then upload these three books to the HDFs file system:
```shell
# create a folder for the input files under the user directory on HDFS
$ hdfs dfs -mkdir -p /user/$(whoami)/input
# upload the books into the input folder on HDFS
$ hdfs dfs -put tmp/gutenberg/* /user/$(whoami)/input
```
Next, find where your streaming jar file is stored; note that in version 2.6 it lives under the share directory. You can search for it from the Hadoop installation directory:
```shell
$ find / -name "*streaming*"
```
You will then find the hadoop-streaming*.jar file under the share folder:
The search may be a bit slow, so you're better off going directly to the directory that matches your version number to look for the streaming file. Since the path to this file is rather long, we can write it into an environment variable:
```shell
$ vi ~/.bashrc
# add the streaming jar path to the file
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
```
Because the command that launches the job through the streaming interface is long, create a shell script named run.sh to run it:
```shell
hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/$(whoami)/input/* \
    -output /user/$(whoami)/output
```
Then run "source run.sh" to execute the MapReduce job, and the results come out with a bang. A few special reminders:
1. The local input files must be uploaded to HDFS first, otherwise Hadoop cannot find your input;
2. You must have the right permissions, and you must create your personal folder on the HDFS system first, otherwise the job will be denied. Yes, these are the two mistakes that caused me so much pain on the server; asking people everywhere really is no substitute for calmly working it out yourself;
3. If it's your first time playing with Hadoop on a server, I suggest first setting up a pseudo-distributed configuration on your own virtual machine or Linux system before getting started; Hadoop will be much less of a headache that way. I didn't know that the server hadn't given me permission to run jobs, and only found my own mistake later by running the WordCount example in my own virtual machine.
Well then, barring accidents, the job will complete, and you can view the count results, for example with "hdfs dfs -cat /user/$(whoami)/output/part-*":
The word counts you see may differ from mine; that's because I switched to a different test document, so don't worry about it.
Again, thanks to the following documentation for support:
The classic tutorial on using Python with Hadoop
Hadoop Getting Started Tutorial blog
Streaming introduction
Life is long; walk on and cherish it. Keep going, all will be well. Just do it!
Let Python run on Hadoop.