Let Python run on Hadoop


This article walks through the classic Hadoop entry program, WordCount: a map program splits the input text into individual words, and a reduce program then aggregates those words, summing the counts of identical words and emitting each distinct word separately. The final output is the frequency of every word in the input.

Note: the scripts read and write data through sys.stdin (standard input) and sys.stdout (standard output). You must make every script executable before running it, otherwise you will not have execution permission; for example, run "chmod +x mapper.py" on the script below before executing it.
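
For example, assuming the two scripts below are saved as mapper.py and reducer.py in the current directory:

$ chmod +x mapper.py reducer.py    # grant execute permission to both scripts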

1. mapper.py

#!/usr/bin/env python
import sys

for line in sys.stdin:      # traverse each line read from standard input
    line = line.strip()     # remove leading and trailing whitespace
    words = line.split()    # split the sentence into single words by whitespace
    for word in words:
        print '%s\t%s' % (word, 1)

2. reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None    # the word currently being counted
current_count = 0      # running frequency of the current word
word = None

for line in sys.stdin:
    line = line.strip()                  # remove the leading and trailing whitespace
    word, count = line.split('\t', 1)    # separate word and count on the tab
    try:
        count = int(count)               # convert the string '1' to the integer 1
    except ValueError:
        continue                         # skip lines whose count is not a number
    if current_word == word:             # the read word equals the current word
        current_count += count           # add to its frequency
    else:
        if current_word:                 # a new word: print the finished word and its frequency
            print '%s\t%s' % (current_word, current_count)
        current_count = count            # start counting the newly read word
        current_word = word

if current_word == word:                 # print the last word
    print '%s\t%s' % (current_word, current_count)

Run the following pipeline in the shell to view the output:

Echo "foo quux labs foo bar zoo hying" |/home/wuying/mapper. py | sort-k 1, 1 |/home/wuying/CER Cer. py # echo outputs the following "foo ***" string and uses the pipeline operator "|" to output data as mapper. py script input data, and The mapper. py data is input to Cer. in py, the sort-k parameter sorts the output content of CER in ascending order according to the ASCII value of the first letter in the first column.

In fact, I find the way reducer.py processes the word frequencies a little cumbersome. It is more natural to store the words in a dictionary, with each word as the key and its frequency as the value, which makes the frequency counting more efficient. The improved script is as follows:

mapper_1.py
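
A minimal sketch of what such a dictionary-based mapper could look like, reconstructed from the description above rather than taken from the author's original code:

#!/usr/bin/env python
import sys

counts = {}                              # dictionary mapping word -> frequency
for line in sys.stdin:                   # first loop: accumulate counts in the dictionary
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1
for word, count in counts.items():       # second loop: emit one line per distinct word
    print '%s\t%s' % (word, count)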

However, written this way the script uses two loops, so it is not obviously faster. The subtle part is the role of current_word and current_count in the original reducer: because Hadoop sorts the mapper output before it reaches the reducer, current_word holds the word whose occurrences are currently being accumulated, while word and count are simply whatever was just read from the stream. The two differ exactly at the moment the sorted input moves on to a new word, which is when the finished count is printed.

Comparing the output of the scripts shows that, with the same input data and the same shell pipeline but different reducers, the dictionary-based version does not sort its output the way the original does, which is genuinely confusing at first. The original reducer's output only appears sorted because its input has already passed through the sort step.

Let Python code run on Hadoop!

I. Prepare input data

Next, download three books from Project Gutenberg:

$ mkdir -p tmp/gutenberg
$ cd tmp/gutenberg
$ wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
$ wget http://www.gutenberg.org/files/5000/5000-8.txt
$ wget http://www.gutenberg.org/ebooks/4300.txt.utf-8

Then upload the three books to the HDFS file system:

$ hdfs dfs -mkdir /user/$(whoami)/input    # create an input folder under your user directory on HDFS
$ hdfs dfs -put /home/wuying/tmp/gutenberg/*.txt /user/$(whoami)/input    # upload the documents to the input folder on HDFS
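
You can confirm the upload succeeded by listing the directory (assuming the paths used above):

$ hdfs dfs -ls /user/$(whoami)/input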

Next, find where your Hadoop streaming jar file is stored; note that in version 2.6 it is placed under the share directory. You can search the Hadoop installation directory for it:

$ cd $HADOOP_HOME
$ find ./ -name "*streaming*"

This turns up the hadoop-streaming*.jar file under the share folder. The search can be a little slow, so it is quicker to go directly to the directory matching your version number. Because the resulting file path is long, it is convenient to store it in an environment variable:

$ vi ~/.bashrc    # open the environment variable configuration file
# add the path of the streaming jar
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
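
To make the variable available in the current shell, reload the configuration and check it (a small sanity check, not part of the original walkthrough):

$ source ~/.bashrc
$ echo $STREAM    # should print the path of the streaming jar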

Because the full command for the streaming interface is long, it is convenient to put it in a shell script, here named run.sh:

hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/$(whoami)/input/*.txt \
    -output /user/$(whoami)/output
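
One caveat worth knowing: Hadoop refuses to start a job whose output directory already exists, so before rerunning run.sh you need to delete the previous output (the path below is the one used above):

$ hdfs dfs -rm -r /user/$(whoami)/output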

Then run "source run.sh" to launch the MapReduce job. The results are quite satisfying. A few notes:

1. You must upload the local input files to HDFS first; otherwise the job cannot find its input;

2. Make sure you have the necessary permissions, and be sure to create your personal folder under /user on HDFS, or access will be denied; having to ask the people around you for help is really no substitute for working it out yourself;

3. If it is your first time playing with Hadoop on a shared server, it is recommended to first set up a pseudo-distributed installation in your own virtual machine or Linux system and get familiar with Hadoop there. I once did not know that I was not authorized to run jobs on the ops server, so I reran the WordCount example in my own virtual machine to track down my mistake.

Well, barring accidents, the job will run to completion, and you can view the counting result in the following way:
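
For example, assuming the default part-file naming that streaming jobs produce:

$ hdfs dfs -ls /user/$(whoami)/output                 # list the result files
$ hdfs dfs -cat /user/$(whoami)/output/part-00000     # print the word counts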

That is all of the content in this article; I hope it helps you with Python programming on Hadoop.

