Writing a WordCount program in Python and submitting it as a task

Source: Internet
Author: User

1. Write the WordCount program in Python and submit the task

Program: WordCount

Input: a text file that contains a large number of words

Output: each word in the file and its number of occurrences (frequency), sorted alphabetically by word, one word per line, with the word and its frequency separated by whitespace
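Before moving to MapReduce, it can help to have a single-process reference that produces the output described above, so the later results can be compared against it. The following is a minimal sketch, not part of the original exercise; the function name `word_count` and the sample sentence are made up for illustration.

```python
from collections import Counter


def word_count(text):
    """Return (word, count) pairs sorted alphabetically by word."""
    counts = Counter(text.split())  # split on any whitespace
    return sorted(counts.items())


if __name__ == "__main__":
    # Hypothetical sample input, used only to show the output layout.
    sample = "the quick brown fox jumps over the lazy dog the end"
    for word, count in word_count(sample):
        # One word per line, word and frequency separated by a tab.
        print("%s\t%d" % (word, count))
```

This only works while the whole text fits in memory, which is the limitation the map and reduce version removes.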

2. Write the map and reduce functions

mapper.py:

    #!/usr/bin/env python
    import sys

    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

reducer.py:

    #!/usr/bin/env python
    import sys

    current_word = None
    current_count = 0
    word = None

    # The input is sorted by word, so identical words arrive on adjacent lines.
    for line in sys.stdin:
        line = line.strip()
        try:
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            # Skip lines that are not in "word<TAB>count" form.
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Print the count for the last word.
    if current_word:
        print('%s\t%s' % (current_word, current_count))

3. Make the scripts executable

    chmod a+x /home/hadoop/wc/mapper.py
    chmod a+x /home/hadoop/wc/reducer.py

4. Test the code locally
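A local test usually means feeding text through the mapper, sorting the result, and feeding that into the reducer, because Hadoop Streaming sorts the mapper output by key before the reducer sees it. The sketch below imitates those three stages in one Python process; the function names (`map_words`, `reduce_counts`, `run_local`) are illustrative and do not come from the original scripts.

```python
from itertools import groupby


def map_words(lines):
    """Map stage: emit (word, 1) for every word on every line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce stage: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run map, sort, and reduce in sequence on an iterable of lines."""
    shuffled = sorted(map_words(lines))  # stands in for the sort/shuffle step
    return list(reduce_counts(shuffled))


if __name__ == "__main__":
    # Hypothetical sample lines, used only for demonstration.
    for word, count in run_local(["b a", "a c a"]):
        print("%s\t%d" % (word, count))
```

The sort in the middle matters: the reducer only compares each word with the previous one, so unsorted input would produce several partial counts for the same word.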

5. View the results

2. Using MapReduce to process a meteorological data set

Write a program to find the highest and lowest temperature for each day, and the highest and lowest temperature over the whole period

    1. The meteorological data set is available at ftp://ftp.ncdc.noaa.gov/pub/data/noaa
    2. Download the data for a particular year and month according to the last three digits of your student number (for example, student No. 201506110136 downloads the 2013 data whose file names start with 6; adapt this as needed depending on the data available)
      wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2013/6*
    3. Unzip the data set and save it in a text file
      zcat ftp.ncdc.noaa.gov/pub/data/noaa/2013/6*.gz > qxdatazwt.txt
    4. Analyze the format of the meteorological data
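The records in this data set are fixed-width text lines, so analyzing the format mostly comes down to finding the character positions of the fields that matter. The sketch below assumes the commonly used layout for these NOAA records: the observation date (YYYYMMDD) at positions 15 to 22, the signed air temperature in tenths of a degree Celsius at positions 87 to 91, and a quality code at position 92, with 9999 marking a missing reading. These offsets, the accepted quality codes, and the name `parse_record` are assumptions to check against the downloaded file, not something stated in the original article.

```python
MISSING = 9999  # assumed sentinel for "no temperature reading"


def parse_record(line):
    """Return (date, temperature) from one record, or None if unusable.

    date is a 'YYYYMMDD' string; temperature is in tenths of a degree Celsius.
    The field offsets are assumptions about the record layout.
    """
    if len(line) < 93:
        return None
    date = line[15:23]
    try:
        temp = int(line[87:92])  # int() accepts a leading '+' or '-'
    except ValueError:
        return None
    quality = line[92]
    if temp == MISSING or quality not in "01459":
        return None
    return date, temp


if __name__ == "__main__":
    # Synthetic record built to match the assumed offsets (not real data).
    record = "0" * 15 + "20130601" + "0" * 64 + "+0217" + "1"
    print(parse_record(record))
```

A mapper for the temperature task could call a function like this on every input line and emit the date and temperature separated by a tab.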

5. Write the map and reduce functions

    1. Make the scripts executable
      chmod a+x /home/hadoop/mapper.py
      chmod a+x /home/hadoop/wc/reducer.py
    2. Test the code locally
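The article does not show the map and reduce functions for the temperature task. As a rough sketch of the reduce side, assume the mapper emits lines of the form date, tab, temperature, and that the lines reach the reducer sorted by date (as Hadoop Streaming guarantees for keys). The names `read_pairs` and `daily_extremes` and the sample values are hypothetical.

```python
from itertools import groupby


def read_pairs(lines):
    """Parse 'date<TAB>temperature' lines, skipping malformed ones."""
    for line in lines:
        parts = line.strip().split("\t", 1)
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def daily_extremes(pairs):
    """Yield (date, highest, lowest) for each run of pairs with the same date."""
    for date, group in groupby(pairs, key=lambda kv: kv[0]):
        temps = [temp for _, temp in group]
        yield date, max(temps), min(temps)


if __name__ == "__main__":
    # Hypothetical mapper output, already sorted by date.
    sample = ["20130601\t217", "20130601\t-11", "20130602\t150"]
    for date, high, low in daily_extremes(read_pairs(sample)):
        print("%s\t%d\t%d" % (date, high, low))
```

In a real reducer script the sample list would be replaced by `sys.stdin`. The extremes for the whole period can then be taken as the maximum of the daily highs and the minimum of the daily lows.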

Run it on HDFS

Upload the previously downloaded text file to HDFS

Submit the job with the Hadoop Streaming command
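The exact command is not given in the article. A Hadoop Streaming job is normally started with `hadoop jar` followed by the streaming jar and the `-files`, `-input`, `-output`, `-mapper`, and `-reducer` options. The sketch below only assembles that argument list in Python; the jar location and the HDFS paths differ per installation and are placeholders, and the function name `streaming_command` is made up for illustration.

```python
import os


def streaming_command(streaming_jar, input_path, output_path,
                      mapper="mapper.py", reducer="reducer.py"):
    """Build the argument list for a Hadoop Streaming job.

    streaming_jar, input_path and output_path are supplied by the caller;
    nothing here knows where Hadoop is installed.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files must come before the streaming options.
        "-files", "%s,%s" % (mapper, reducer),
        "-input", input_path,
        "-output", output_path,
        # Shipped files are referred to by their base names on the cluster.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = streaming_command("/path/to/hadoop-streaming.jar",
                            "/user/hadoop/input", "/user/hadoop/output")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` command is available. The output directory must not already exist, otherwise the job is rejected.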

View the results
