Original: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/
Running a MapReduce program based on the RMM Chinese word segmentation algorithm on Hadoop
I know the title of this article sounds very "academic" and rather tacky, as if this were some impressive, or at least pretentious, paper! In fact, it is just an ordinary experiment report, and this article does not study the RMM Chinese word segmentation algorithm itself. The report was written for an experiment in my high-performance computing course, so what follows is an excerpt from it, shared as the personal experience I gained with Hadoop.
Experimental objectives
Learn to write a MapReduce program on Hadoop.
Use Hadoop distributed computing to count Chinese word frequencies in the novel The Heaven Sword and Dragon Saber, and compare which of the two women at Zhang Wuji's side, Zhou Zhiruo or Zhao Min, is "hotter" (mentioned more often) in the novel. (Why this novel? Because one of my roommates recently finished watching the TV adaptation and cannot stop talking about Alyssa Chia's Zhao Min, so the experiment also draws on daily university life...)
Experimental principle
The experiment relies on Hadoop's streaming mode: Hadoop Streaming lets Hadoop run MapReduce programs that are not written in Java.
To shorten the experiment, we write our mapper.py and reducer.py in Python, a language known for its development efficiency. We also use a small Chinese word segmentation module, smallseg.py, taken from http://code.google.com/p/smallseg/ (Apache License 2.0).
For the Chinese lexicon, we use the main dictionary main.dic provided by Sogou Labs and a suffix dictionary suffix.dic, both of which can be obtained from the smallseg project.
The input to the distributed computation is a text file of the novel. We downloaded the text from the Web and converted it to UTF-8 encoding to make word segmentation under Linux easier:
iconv -f gbk -t utf8 倚天屠龙记.txt > utf8.txt
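If iconv is not handy, the same conversion can also be done with a short Python script. Here is a minimal sketch, assuming the downloaded GBK file is named novel_gbk.txt (a placeholder name, not the actual filename):

#!/usr/bin/env python
# Minimal sketch: re-encode the downloaded GBK text as UTF-8, equivalent to
# the iconv command above. "novel_gbk.txt" is only a placeholder filename.
import codecs

src = codecs.open("novel_gbk.txt", "r", "gbk")
dst = codecs.open("utf8.txt", "w", "utf-8")
dst.write(src.read())  # the novel is only a few MB, so reading it at once is fine
src.close()
dst.close()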
Experimental environment
NameNode:
OS: Ubuntu 11.04
CPU: Intel Core i3
Memory: 512MB
IP: 125.216.244.28
DataNode1:
OS: Ubuntu 11.10
CPU: Intel Pentium 4
Memory: 512MB
IP: 125.216.244.21
DataNode2:
OS: Ubuntu 11.10
CPU: Intel Pentium 4
Memory: 512MB
IP: 125.216.244.22
Mapper Program
The following is the code for mapper.py.
#!/usr/bin/env python
from smallseg import SEG
import sys

seg = SEG()

for line in sys.stdin:
    wlist = seg.cut(line.strip())
    for word in wlist:
        try:
            print "%s\t1" % (word.encode("UTF8"))
        except:
            pass
smallseg is a Chinese word segmentation module based on the RMM (Reverse Maximum Matching) algorithm. The mapper's flow is very simple: it segments the Chinese content of each input line and outputs the result as word/count pairs. For every Chinese word, the output takes the following format,
word [tab] 1
Each word is emitted with a count of 1. The mapper does not aggregate the counts of words within a line; we leave that counting to the reducer program.
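As a side note, part of the counting could also be pushed into the mapper. Below is a minimal sketch of such an in-mapper combining variant; it is not the mapper.py used in this experiment, and it assumes the same smallseg setup. Since the reducer simply sums whatever counts it receives, it would work with this variant unchanged.

#!/usr/bin/env python
# Optional variant (not the mapper.py used in the experiment): aggregate counts
# in a dict inside the mapper before emitting, which reduces the amount of
# intermediate data Hadoop has to shuffle and sort.
from smallseg import SEG
import sys
from collections import defaultdict

seg = SEG()
counts = defaultdict(int)

for line in sys.stdin:
    for word in seg.cut(line.strip()):
        counts[word] += 1

for word, count in counts.iteritems():
    try:
        print "%s\t%d" % (word.encode("UTF8"), count)
    except:
        pass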
Reducer Program
The following is the code for reducer.py.
#!/usr/bin/env python
import sys

current_word, current_count, word = None, 1, None

for line in sys.stdin:
    try:
        line = line.rstrip()
        word, count = line.split("\t", 1)
        count = int(count)
    except:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print "%s\t%u" % (current_word, current_count)
        current_count, current_word = count, word

if current_word == word:
    print "%s\t%u" % (current_word, current_count)
The reducer reads word/count pairs from standard input and accumulates them. Because Hadoop has already sorted the keys for us, we only need to keep a running total for the current word; when a different word appears, we output the accumulated frequency of the previous word in the following format,
Word [tab] frequency
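Because streaming relies only on standard input and output, the whole mapper | sort | reducer chain can be rehearsed locally before submitting the job to the cluster. Here is a minimal sketch, assuming utf8.txt, mapper.py and reducer.py are in the current directory; local-out.txt is just a placeholder output name:

#!/usr/bin/env python
# Minimal local rehearsal of the streaming pipeline: pipe the input file
# through mapper.py, an external sort (standing in for Hadoop's shuffle and
# sort), and reducer.py.
import subprocess

src = open("utf8.txt")
dst = open("local-out.txt", "w")
mapper = subprocess.Popen(["python", "mapper.py"], stdin=src,
                          stdout=subprocess.PIPE)
sorter = subprocess.Popen(["sort"], stdin=mapper.stdout,
                          stdout=subprocess.PIPE)
reducer = subprocess.Popen(["python", "reducer.py"], stdin=sorter.stdout,
                           stdout=dst)
mapper.stdout.close()  # let sort see end-of-input when the mapper finishes
sorter.stdout.close()
reducer.wait()
src.close()
dst.close()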
Experimental steps
The experiment uses one NameNode and two DataNodes.
First, copy the required files to each host. These files are placed in the /home/hadoop/wc directory.
scp -r wc hadoop@125.216.244.28:.
scp -r wc hadoop@125.216.244.21:.
scp -r wc hadoop@125.216.244.22:.
Running the Hadoop Job
This job uses 3 mapper tasks and 2 reducer tasks. Because the word segmentation step is the most time-consuming part, we try to allocate as many mapper tasks as possible.
~/hadoop-0.20.203.0$ ./bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /home/hadoop/wc/mapper.py -reducer /home/hadoop/wc/reducer.py -input 2-in -output 2-out -jobconf mapred.map.tasks=3 -jobconf mapred.reduce.tasks=2
[...] WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/tmp/hadoop-unjar2897218480344074444/] [] /tmp/streamjob7946660914041373523.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 1
[...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_201112041409_0005
[...] INFO streaming.StreamJob: To kill this job, run:
[...] INFO streaming.StreamJob: /home/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=http://125.216.244.28:9001 -kill job_201112041409_0005
[...] INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201112041409_0005
[...] INFO streaming.StreamJob:  map 0%  reduce 0%
[...] INFO streaming.StreamJob:  map 9%  reduce 0%
[...] INFO streaming.StreamJob:  map 40%  reduce 0%
[...] INFO streaming.StreamJob:  map 67%  reduce 12%
[...] INFO streaming.StreamJob:  map 71%  reduce 22%
[...] INFO streaming.StreamJob:  map 100%  reduce 28%
[...] INFO streaming.StreamJob:  map 100%  reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_201112041409_0005
[...] INFO streaming.StreamJob: Output: 2-out
Map phase time: 41s
Reduce phase time: 21s
Total time: 62s
Calculation results
Copy the calculation results to the local file system.
./bin/hadoop dfs -get 2-out/part* ../wc/
To view some of the contents of part*:
~/wc$ tail part-00000
Dragon's 1
Longan 1
dragon Tiger 2
Dragon was 1
Dragon body 2
Long town 1
keel 1
Turtle Life 2
Tortoise Mountain 1
crack 1
~/wc$ tail part-00001
longmen 85
flamboyant 1
Dragon Drive 1
Turtle 3
Turtle 1
turtle two 1
turtle 1
holed 1
Turtle Snake 3
Next, the output files are merged and sorted by frequency. This step is fast and completes within 1 second.
~/wc$ cat part-00000 part-00001 | sort -rnk2,2 > sorted
~/wc$ head sorted
7157 Mowgli 4373 is 4199 channel 3465 3187 i 2516 he 2454 you 2318 this 1991 that 1776
We then filter out single-character words, since the experiment is only interested in people's names. (In the awk filter below, a single Chinese character is 3 bytes in UTF-8, so keeping words of at least 4 bytes removes all single-character words.)
~/wc$ cat sorted | awk '{if(length($1)>=4) print $0}' | head -n 50
Mowgli 4373 said 1584 Zhao 1227 Sheson 1173 self 1115 what 1034 Zhang Trixan 926 Kung Fu 8,671 777 We 767 Zhou Zhiruo 756 guru 739 laughs 693 Ming Party 6,851 sounds 670 634 Girls 612 Master 606 See only 590 Mowgli 576 shaolin 555 so 547 disciple 537 among 527 Yin Vegetarian 518 Yang Xiao 496 They 490 don't know 484 how 466 We 4,532 people 453 Call Road 4,502 people 445 today 443 thought 433 Zhang Sanfeng 425 channel 425 Yifu 412 out 402 Although 395 extermination division too 392 under 389 then 381 Lotus boat 374 Hearts 374 is 371 dare 371 Houlian 369 not 359 body 356
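For reference, the same merge, filter and frequency sort can also be written as a short Python script. Here is a minimal sketch, assuming part-00000 and part-00001 have been copied into the working directory:

#!/usr/bin/env python
# Minimal sketch: merge the two reducer outputs, drop single-character words
# (a Chinese character takes 3 bytes in UTF-8, so the >= 4 byte threshold is
# the same filter as the awk command above), and print the top 50 by frequency.
counts = []
for name in ("part-00000", "part-00001"):
    f = open(name)
    for line in f:
        word, count = line.rstrip("\n").split("\t", 1)
        if len(word) >= 4:
            counts.append((word, int(count)))
    f.close()

counts.sort(key=lambda pair: pair[1], reverse=True)
for word, count in counts[:50]:
    print "%s\t%d" % (word, count)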
Statistical charts
Conclusion
Zhao Min, with 1227 occurrences, beats Zhou Zhiruo's 756, so Zhao Min is "hotter" in The Heaven Sword and Dragon Saber than Zhou Zhiruo.
Through this experiment we gained some understanding of how Hadoop works, successfully designed and tested the mapper and reducer functions, and learned to use Hadoop for simple parallel computation. We also deepened our understanding of the differences between parallel and serial algorithms and of parallel algorithm design. In addition, the experiment strengthened our teamwork and improved our practical skills.