Let Python run on Hadoop


This article walks through the classic Hadoop entry program, WordCount: a map program splits the input text into individual words, and a reduce program then aggregates those words, summing the counts of identical words and emitting each distinct word separately. The final output is the frequency of every word in the input.

Note: the scripts read and write data through sys.stdin (standard input) and sys.stdout (standard output). You must make every script executable before running it, otherwise you will not have execution permission; for example, run "chmod +x mapper.py" on the script below before executing it.
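
For example, assuming the two scripts below are saved as mapper.py and reducer.py in the current directory:

$ chmod +x mapper.py reducer.py    # grant execute permission to both scripts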

1. mapper.py

#!/usr/bin/env python
import sys

for line in sys.stdin:      # traverse each line read from standard input
    line = line.strip()     # remove leading and trailing whitespace
    words = line.split()    # split the sentence into single words by whitespace
    for word in words:
        print '%s\t%s' % (word, 1)

2. reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None    # the word currently being counted
current_count = 0      # running frequency of the current word
word = None

for line in sys.stdin:
    line = line.strip()                  # remove the leading and trailing whitespace
    word, count = line.split('\t', 1)    # separate word and count on the tab
    try:
        count = int(count)               # convert the string '1' to the integer 1
    except ValueError:
        continue                         # skip lines whose count is not a number
    if current_word == word:             # the read word equals the current word
        current_count += count           # add to its frequency
    else:
        if current_word:                 # a new word: print the finished word and its frequency
            print '%s\t%s' % (current_word, current_count)
        current_count = count            # start counting the newly read word
        current_word = word

if current_word == word:                 # print the last word
    print '%s\t%s' % (current_word, current_count)

Run the following pipeline in the shell to view the output:

Echo "foo quux labs foo bar zoo hying" |/home/wuying/mapper. py | sort-k 1, 1 |/home/wuying/CER Cer. py # echo outputs the following "foo ***" string and uses the pipeline operator "|" to output data as mapper. py script input data, and The mapper. py data is input to Cer. in py, the sort-k parameter sorts the output content of CER in ascending order according to the ASCII value of the first letter in the first column.

In fact, I find the way reducer.py processes the word frequencies a little cumbersome. It is more natural to store the words in a dictionary, with each word as the key and its frequency as the value, which makes the frequency counting more efficient. The improved script is as follows:

mapper_1.py
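
A minimal sketch of what such a dictionary-based mapper could look like, reconstructed from the description above rather than taken from the author's original code:

#!/usr/bin/env python
import sys

counts = {}                              # dictionary mapping word -> frequency
for line in sys.stdin:                   # first loop: accumulate counts in the dictionary
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1
for word, count in counts.items():       # second loop: emit one line per distinct word
    print '%s\t%s' % (word, count)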

However, written this way the script uses two loops, so it is not obviously faster. The subtle part is the role of current_word and current_count in the original reducer: because Hadoop sorts the mapper output before it reaches the reducer, current_word holds the word whose occurrences are currently being accumulated, while word and count are simply whatever was just read from the stream. The two differ exactly at the moment the sorted input moves on to a new word, which is when the finished count is printed.

Comparing the output of the scripts shows that, with the same input data and the same shell pipeline but different reducers, the dictionary-based version does not sort its output the way the original does, which is genuinely confusing at first. The original reducer's output only appears sorted because its input has already passed through the sort step.

Let Python code run on Hadoop!

I. Prepare input data

Next, download three books from Project Gutenberg:

$ mkdir -p tmp/gutenberg
$ cd tmp/gutenberg
$ wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
$ wget http://www.gutenberg.org/files/5000/5000-8.txt
$ wget http://www.gutenberg.org/ebooks/4300.txt.utf-8

Then upload the three books to the HDFS file system:

$ hdfs dfs -mkdir /user/$(whoami)/input    # create an input folder under your user directory on HDFS
$ hdfs dfs -put /home/wuying/tmp/gutenberg/*.txt /user/$(whoami)/input    # upload the documents to the input folder on HDFS
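
You can confirm the upload succeeded by listing the directory (assuming the paths used above):

$ hdfs dfs -ls /user/$(whoami)/input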

Next, find where your Hadoop streaming jar file is stored; note that in version 2.6 it is placed under the share directory. You can search the Hadoop installation directory for it:

$ cd $HADOOP_HOME
$ find ./ -name "*streaming*"

This turns up the hadoop-streaming*.jar file under the share folder. The search can be a little slow, so it is quicker to go directly to the directory matching your version number. Because the resulting file path is long, it is convenient to store it in an environment variable:

$ vi ~/.bashrc    # open the environment variable configuration file
# add the path of the streaming jar
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
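
To make the variable available in the current shell, reload the configuration and check it (a small sanity check, not part of the original walkthrough):

$ source ~/.bashrc
$ echo $STREAM    # should print the path of the streaming jar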

Because the full command for the streaming interface is long, it is convenient to put it in a shell script, here named run.sh:

hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/$(whoami)/input/*.txt \
    -output /user/$(whoami)/output
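
One caveat worth knowing: Hadoop refuses to start a job whose output directory already exists, so before rerunning run.sh you need to delete the previous output (the path below is the one used above):

$ hdfs dfs -rm -r /user/$(whoami)/output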

Then run "source run.sh" to launch the MapReduce job. The results are quite satisfying. A few notes:

1. You must upload the local input files to HDFS first; otherwise the job cannot find its input;

2. Make sure you have the necessary permissions, and be sure to create your personal folder under /user on HDFS, or access will be denied; having to ask the people around you for help is really no substitute for working it out yourself;

3. If it is your first time playing with Hadoop on a shared server, it is recommended to first set up a pseudo-distributed installation in your own virtual machine or Linux system and get familiar with Hadoop there. I once did not know that I was not authorized to run jobs on the ops server, so I reran the WordCount example in my own virtual machine to track down my mistake.

Well, barring accidents, the job will run to completion, and you can view the counting result in the following way:
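
For example, assuming the default part-file naming that streaming jobs produce:

$ hdfs dfs -ls /user/$(whoami)/output                 # list the result files
$ hdfs dfs -cat /user/$(whoami)/output/part-00000     # print the word counts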

That is all of the content in this article; I hope it helps you with Python programming on Hadoop.

