Environment:
Host: Ubuntu 10.04
Hadoop version: 1.2.1
Development tool: Eclipse 4.4.0
Description:
Requirements: Given 6,428,632 raw records, count the users registered under each mailbox domain, then sort the domains by user count in descending order.
Analysis: Hadoop's built-in sort orders records by key. Sorting by value therefore requires a second sort pass.
Steps:
1. JOB1: Count the users of each registered mailbox domain. The output is sorted by key (the default) and stored in HDFS.
2. JOB2: Sort the output of JOB1 a second time, by value in descending order.
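The two passes can be sketched in plain Java, without the Hadoop API, to show the data flow. This is only an illustration, not the article's MapReduce code: the class name MailboxRanking, both method names, and the assumption that each input record is a bare email address are hypothetical. In an actual Hadoop job, one common way to get the second ordering is to swap key and value so that the count becomes the key, then apply a descending comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the two jobs; it does not use Hadoop.
final class MailboxRanking {

    // Pass 1 (mirrors JOB1): count users per mailbox domain.
    // A TreeMap keeps entries ordered by key, like the default key sort.
    static Map<String, Integer> countByDomain(List<String> emails) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String email : emails) {
            int at = email.lastIndexOf('@');
            if (at < 0) {
                continue; // skip malformed records without an '@'
            }
            String domain = email.substring(at + 1).trim().toLowerCase(Locale.ROOT);
            if (domain.isEmpty()) {
                continue; // skip records with an empty domain part
            }
            counts.merge(domain, 1, Integer::sum);
        }
        return counts;
    }

    // Pass 2 (mirrors JOB2): re-sort the output of pass 1 by count, largest first.
    // List.sort is stable, so domains with equal counts stay in key order.
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample addresses, for illustration only.
        List<String> sample = Arrays.asList(
                "a@qq.com", "b@163.com", "c@qq.com", "d@126.com", "e@qq.com", "f@163.com");
        for (Map.Entry<String, Integer> row : sortByCountDescending(countByDomain(sample))) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

The first method returns its result ordered by domain name, which is why a separate second pass is needed before the largest mailboxes appear at the top.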
Result output:
A total of 24 mailboxes have more than 10,000 users:
qq.com 1976196
163.com 1766927
126.com 807895
sina.com 351596
yahoo.com.cn 205491
hotmail.com 202948
gmail.com 186843
sohu.com 104736
yahoo.cn 87048
tom.com 72365
yeah.net 53295
21cn.com 50710
vip.qq.com 35119
139.com 29207
263.net 24779
sina.com.cn 19156
live.cn 18920
sina.cn 18601
yahoo.com 18454
foxmail.com 16432
163.net 15176
msn.com 14211
eyou.com 13372
yahoo.com.tw 10810
Source code:
JOB1: Count the users of each registered mailbox domain
Csdndata.java