Original: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/
Running a MapReduce program based on the RMM Chinese word segmentation algorithm on Hadoop
I know the title of this article sounds very "academic" and rather tacky, as if this were some impressive, or at least pretentious, paper! In fact, it is just an ordinary experiment report, and this article does not study the RMM Chinese word segmentation algorithm itself. The report was written for an experiment in my high-performance computing course, so what follows is an excerpt from it, shared as the personal experience I gained with Hadoop.
Experimental objectives
Learn to write a MapReduce program on Hadoop.
Use Hadoop distributed computing to count Chinese word frequencies in the novel The Heaven Sword and Dragon Saber, and compare which of the two women at Zhang Wuji's side, Zhou Zhiruo or Zhao Min, is "hotter" (mentioned more often) in the novel. (Why this novel? Because one of my roommates recently finished watching the TV adaptation and cannot stop talking about Alyssa Chia's Zhao Min, so the experiment also draws on daily university life...)
Experimental principle
The experiment relies on Hadoop's streaming mode: Hadoop Streaming lets Hadoop run MapReduce programs that are not written in Java.
To shorten the experiment, we write our mapper.py and reducer.py in Python, a language known for its development efficiency. We also use a small Chinese word segmentation module, smallseg.py, taken from http://code.google.com/p/smallseg/ (Apache License 2.0).
For the Chinese lexicon, we use the main dictionary main.dic provided by Sogou Labs and a suffix dictionary suffix.dic, both of which can be obtained from the smallseg project.
The input to the distributed computation is a text file of the novel. We downloaded the text from the Web and converted it to UTF-8 encoding to make word segmentation under Linux easier:
iconv -f gbk -t utf8 倚天屠龙记.txt > utf8.txt
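If iconv is not handy, the same conversion can also be done with a short Python script. Here is a minimal sketch, assuming the downloaded GBK file is named novel_gbk.txt (a placeholder name, not the actual filename):

#!/usr/bin/env python
# Minimal sketch: re-encode the downloaded GBK text as UTF-8, equivalent to
# the iconv command above. "novel_gbk.txt" is only a placeholder filename.
import codecs

src = codecs.open("novel_gbk.txt", "r", "gbk")
dst = codecs.open("utf8.txt", "w", "utf-8")
dst.write(src.read())  # the novel is only a few MB, so reading it at once is fine
src.close()
dst.close()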
Experimental environment
NameNode:
OS: Ubuntu 11.04
CPU: Intel Core i3
Memory: 512MB
IP: 125.216.244.28
DataNode1:
OS: Ubuntu 11.10
CPU: Intel Pentium 4
Memory: 512MB
IP: 125.216.244.21
DataNode2:
OS: Ubuntu 11.10
CPU: Intel Pentium 4
Memory: 512MB
IP: 125.216.244.22
Mapper Program
The following is the code for mapper.py.
#!/usr/bin/env python
from smallseg import SEG
import sys

seg = SEG()

for line in sys.stdin:
    wlist = seg.cut(line.strip())
    for word in wlist:
        try:
            print "%s\t1" % (word.encode("UTF8"))
        except:
            pass
smallseg is a Chinese word segmentation module based on the RMM (Reverse Maximum Matching) algorithm. The mapper's flow is very simple: it segments the Chinese content of each input line and outputs the result as word/count pairs. For every Chinese word, the output takes the following format,
word [tab] 1
Each word is emitted with a count of 1. The mapper does not aggregate the counts of words within a line; we leave that counting to the reducer program.
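As a side note, part of the counting could also be pushed into the mapper. Below is a minimal sketch of such an in-mapper combining variant; it is not the mapper.py used in this experiment, and it assumes the same smallseg setup. Since the reducer simply sums whatever counts it receives, it would work with this variant unchanged.

#!/usr/bin/env python
# Optional variant (not the mapper.py used in the experiment): aggregate counts
# in a dict inside the mapper before emitting, which reduces the amount of
# intermediate data Hadoop has to shuffle and sort.
from smallseg import SEG
import sys
from collections import defaultdict

seg = SEG()
counts = defaultdict(int)

for line in sys.stdin:
    for word in seg.cut(line.strip()):
        counts[word] += 1

for word, count in counts.iteritems():
    try:
        print "%s\t%d" % (word.encode("UTF8"), count)
    except:
        pass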
Reducer Program
The following is the code for reducer.py.
#!/usr/bin/env python
import sys

current_word, current_count, word = None, 1, None

for line in sys.stdin:
    try:
        line = line.rstrip()
        word, count = line.split("\t", 1)
        count = int(count)
    except:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print "%s\t%u" % (current_word, current_count)
        current_count, current_word = count, word

if current_word == word:
    print "%s\t%u" % (current_word, current_count)
The reducer reads word/count pairs from standard input and accumulates them. Because Hadoop has already sorted the keys for us, we only need to keep a running total for the current word; when a different word appears, we output the accumulated frequency of the previous word in the following format,
Word [tab] frequency
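Because streaming relies only on standard input and output, the whole mapper | sort | reducer chain can be rehearsed locally before submitting the job to the cluster. Here is a minimal sketch, assuming utf8.txt, mapper.py and reducer.py are in the current directory; local-out.txt is just a placeholder output name:

#!/usr/bin/env python
# Minimal local rehearsal of the streaming pipeline: pipe the input file
# through mapper.py, an external sort (standing in for Hadoop's shuffle and
# sort), and reducer.py.
import subprocess

src = open("utf8.txt")
dst = open("local-out.txt", "w")
mapper = subprocess.Popen(["python", "mapper.py"], stdin=src,
                          stdout=subprocess.PIPE)
sorter = subprocess.Popen(["sort"], stdin=mapper.stdout,
                          stdout=subprocess.PIPE)
reducer = subprocess.Popen(["python", "reducer.py"], stdin=sorter.stdout,
                           stdout=dst)
mapper.stdout.close()  # let sort see end-of-input when the mapper finishes
sorter.stdout.close()
reducer.wait()
src.close()
dst.close()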
Experimental steps
The experiment uses one NameNode and two DataNodes.
First, copy the required files to each host. These files are placed in the /home/hadoop/wc directory.
scp -r wc hadoop@125.216.244.28:.
scp -r wc hadoop@125.216.244.21:.
scp -r wc hadoop@125.216.244.22:.
Running the Hadoop Job
This job uses 3 mapper tasks and 2 reducer tasks. Because the word segmentation step is the most time-consuming part, we try to allocate as many mapper tasks as possible.
~/hadoop-0.20.203.0$ ./bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /home/hadoop/wc/mapper.py -reducer /home/hadoop/wc/reducer.py -input 2-in -output 2-out -jobconf mapred.map.tasks=3 -jobconf mapred.reduce.tasks=2
[...] WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/tmp/hadoop-unjar2897218480344074444/] [] /tmp/streamjob7946660914041373523.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 1
[...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_201112041409_0005
[...] INFO streaming.StreamJob: To kill this job, run:
[...] INFO streaming.StreamJob: /home/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=http://125.216.244.28:9001 -kill job_201112041409_0005
[...] INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201112041409_0005
[...] INFO streaming.StreamJob:  map 0%  reduce 0%
[...] INFO streaming.StreamJob:  map 9%  reduce 0%
[...] INFO streaming.StreamJob:  map 40%  reduce 0%
[...] INFO streaming.StreamJob:  map 67%  reduce 12%
[...] INFO streaming.StreamJob:  map 71%  reduce 22%
[...] INFO streaming.StreamJob:  map 100%  reduce 28%
[...] INFO streaming.StreamJob:  map 100%  reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_201112041409_0005
[...] INFO streaming.StreamJob: Output: 2-out
Map phase time: 41s
Reduce phase time: 21s
Total time: 62s
Calculation results
Copy the calculation results to the local file system.
./bin/hadoop dfs -get 2-out/part* ../wc/
To view some of the contents of part*:
~/wc$ tail part-00000
Dragon's 1
Longan 1
dragon Tiger 2
Dragon was 1
Dragon body 2
Long town 1
keel 1
Turtle Life 2
Tortoise Mountain 1
crack 1
~/wc$ tail part-00001
longmen 85
flamboyant 1
Dragon Drive 1
Turtle 3
Turtle 1
turtle two 1
turtle 1
holed 1
Turtle Snake 3
Next, the output files are merged and sorted by frequency. This step is fast and completes within 1 second.
~/wc$ cat part-00000 part-00001 | sort -rnk2,2 > sorted
~/wc$ head sorted
7157 Mowgli 4373 is 4199 channel 3465 3187 i 2516 he 2454 you 2318 this 1991 that 1776
We then filter out single-character words, since the experiment is only interested in people's names. (In the awk filter below, a single Chinese character is 3 bytes in UTF-8, so keeping words of at least 4 bytes removes all single-character words.)
~/wc$ cat sorted | awk '{if(length($1)>=4) print $0}' | head -n 50
Mowgli 4373 said 1584 Zhao 1227 Sheson 1173 self 1115 what 1034 Zhang Trixan 926 Kung Fu 8,671 777 We 767 Zhou Zhiruo 756 guru 739 laughs 693 Ming Party 6,851 sounds 670 634 Girls 612 Master 606 See only 590 Mowgli 576 shaolin 555 so 547 disciple 537 among 527 Yin Vegetarian 518 Yang Xiao 496 They 490 don't know 484 how 466 We 4,532 people 453 Call Road 4,502 people 445 today 443 thought 433 Zhang Sanfeng 425 channel 425 Yifu 412 out 402 Although 395 extermination division too 392 under 389 then 381 Lotus boat 374 Hearts 374 is 371 dare 371 Houlian 369 not 359 body 356
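For reference, the same merge, filter and frequency sort can also be written as a short Python script. Here is a minimal sketch, assuming part-00000 and part-00001 have been copied into the working directory:

#!/usr/bin/env python
# Minimal sketch: merge the two reducer outputs, drop single-character words
# (a Chinese character takes 3 bytes in UTF-8, so the >= 4 byte threshold is
# the same filter as the awk command above), and print the top 50 by frequency.
counts = []
for name in ("part-00000", "part-00001"):
    f = open(name)
    for line in f:
        word, count = line.rstrip("\n").split("\t", 1)
        if len(word) >= 4:
            counts.append((word, int(count)))
    f.close()

counts.sort(key=lambda pair: pair[1], reverse=True)
for word, count in counts[:50]:
    print "%s\t%d" % (word, count)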
Statistical charts
Conclusion
Zhao Min, with 1227 occurrences, beats Zhou Zhiruo's 756, so Zhao Min is "hotter" in The Heaven Sword and Dragon Saber than Zhou Zhiruo.
Through this experiment we gained some understanding of how Hadoop works, successfully designed and tested the mapper and reducer functions, and learned to use Hadoop for simple parallel computation. We also deepened our understanding of the differences between parallel and serial algorithms and of parallel algorithm design. In addition, the experiment strengthened our teamwork and improved our practical skills.