1. Write the WordCount program in Python and submit the task
Program | WordCount
Input | A text file that contains a large number of words
Output | Each word in the file and its number of occurrences (frequency), sorted alphabetically by word, one word and its frequency per line, with a separator between the word and the frequency
2. Write the map function and the reduce function
    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word read from standard input
    import sys

    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts per word; the input must be sorted by word
    import sys

    current_word = None
    current_count = 0
    word = None

    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # skip lines whose count is not a number
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # emit the last word, if any input was read
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
3. Change the file permissions so the scripts are executable
    chmod a+x /home/hadoop/wc/mapper.py
    chmod a+x /home/hadoop/wc/reducer.py
4. Test the code on the local machine
5. View the results
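A local test usually amounts to feeding a sample file through the mapper, a sort, and the reducer. The sketch below reproduces that pipeline in plain Python to show what the sort between the two scripts contributes. The function names `map_words`, `reduce_counts`, and `run_local` are hypothetical and do not appear in the original scripts.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of consecutive pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Map, sort by key (a stand-in for the shuffle), then reduce."""
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(pairs))


if __name__ == "__main__":
    sample = ["hello hadoop", "hello python hello"]
    for word, count in run_local(sample):
        print("%s\t%s" % (word, count))
```

The sort step matters because the reducer only compares each word with the previous one, so equal words must arrive next to each other.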
2. Using MapReduce to process meteorological data sets
Write a program to find the highest and lowest temperature for each day, and the highest and lowest temperature over the whole period
- The meteorological data set is: ftp://ftp.ncdc.noaa.gov/pub/data/noaa
- Download the data for a year and month chosen by the last three digits of your student number (for example, student No. 201506110136 downloads the 2013 data whose file names start with 6; adapt this slightly depending on what data is actually available)
    wget -d --accept-regex=regex -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2013/6*
- Decompress the data set and save it in a text file
    zcat ftp.ncdc.noaa.gov/pub/data/noaa/2013/6*.gz > qxdatazwt.txt
- Analyze the format of the meteorological data
- Write the map function and the reduce function
- Change their permissions so the scripts are executable
    chmod a+x /home/hadoop/mapper.py
    chmod a+x /home/hadoop/wc/reducer.py
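The map and reduce logic for the temperature task can be sketched as follows. This is a minimal in-process illustration, not the original scripts: it assumes the fixed-width NOAA record layout in which the date (YYYYMMDD) sits at 0-based offsets 15-22, the signed air temperature in tenths of a degree Celsius at offsets 87-91 (with `+9999` meaning missing), and a quality code at offset 92. Check these offsets against the downloaded files before relying on them. The names `parse_record` and `daily_extremes` are hypothetical.

```python
def parse_record(line):
    """Return (date, temperature in tenths of a degree C), or None.

    Assumed layout (0-based offsets): date in 15-22, signed temperature
    in 87-91, quality code at 92.
    """
    if len(line) < 93:
        return None  # too short to contain a temperature field
    date, temp, quality = line[15:23], line[87:92], line[92]
    if temp == "+9999" or quality not in "01459":
        return None  # missing or suspect reading
    try:
        return date, int(temp)
    except ValueError:
        return None  # temperature field is not numeric


def daily_extremes(lines):
    """Map each line to (date, temp), then keep (highest, lowest) per date."""
    extremes = {}
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        date, temp = record
        high, low = extremes.get(date, (temp, temp))
        extremes[date] = (max(high, temp), min(low, temp))
    return extremes
```

In a streaming job, `parse_record` would correspond to the mapper (printing the date and temperature separated by a tab) and the per-date maximum and minimum to the reducer. Divide the values by 10 to get degrees Celsius under the assumed layout.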
Test the code on the local machine
Run it on HDFS
Upload the previously downloaded text file to HDFS
Submit the job with the Hadoop streaming command
View the results
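The upload and submit steps above come down to two command lines. The sketch below only builds the argument lists so they can be inspected before being passed to something like `subprocess.run`; it does not execute anything. The helper names are hypothetical, and the streaming jar location, the HDFS directories, and the script paths in the usage example are assumptions that depend on the installation.

```python
import os


def build_put_command(local_file, hdfs_dir):
    """Arguments for uploading a local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


def build_streaming_command(streaming_jar, mapper, reducer, hdfs_input, hdfs_output):
    """Arguments for submitting a Hadoop streaming job.

    The generic -files option ships both scripts to the task nodes, so
    -mapper and -reducer refer to them by file name only.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-files", "%s,%s" % (mapper, reducer),
        "-input", hdfs_input,
        "-output", hdfs_output,
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
    ]


if __name__ == "__main__":
    # All paths below are placeholders, not values from the original post.
    print(" ".join(build_put_command("qxdatazwt.txt", "/input")))
    print(" ".join(build_streaming_command(
        "/path/to/hadoop-streaming.jar",
        "/home/hadoop/wc/mapper.py",
        "/home/hadoop/wc/reducer.py",
        "/input",
        "/output",
    )))
```

The output directory must not exist before the job is submitted, otherwise Hadoop refuses to start it.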