Tags:
Internet traffic data
1363157985066  13726230503  00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com  24  27  2481  24681  200
1363157995052  13826544101  5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4  4  0  264  0  200
1363157991076  13926435656  20-10-7A-28-CC-0A:CMCC  120.196.100.99  2  4  132  1512  200
1363154400022  13926251106  5C-0E-8B-8B-B1-50:CMCC  120.197.40.4  4  0  240  0  200
1363157993044  18211575961  94-71-AC-CD-E6-18:CMCC-EASY  120.196.100.99  iface.qiyi.com  video site  15  12  1527  2106  200
1363157995074  84138413  5C-0E-8B-8C-E8-20:7DaysInn  120.197.40.4  122.72.52.12  20  16  4116  1432  200
1363157993055  13560439658  C4-17-FE-BA-DE-D9:CMCC  120.196.100.99  18  15  1116  954  200
1363157995033  15920133257  5C-0E-8B-C7-BA-20:CMCC  120.197.40.4  sug.so.360.cn  information security  20  20  3156  2936  200
1363157983019  13719199419  68-A1-B7-03-07-B1:CMCC-EASY  120.196.100.82  4  0  240  0  200
1363157984041  13660577991  5C-0E-8B-92-5C-20:CMCC-EASY  120.197.40.4  s19.cnzz.com  site analytics  24  9  6960  690  200
1363157973098  15013685858  5C-0E-8B-C7-F7-90:CMCC  120.197.40.4  rank.ie.sogou.com  search engine  28  27  3659  3538  200
1363157986029  15989002119  E8-99-C4-4E-93-E0:CMCC-EASY  120.196.100.99  www.umeng.com  site analytics  3  3  1938  180  200
1363157992093  13560439658  C4-17-FE-BA-DE-D9:CMCC  120.196.100.99  15  9  918  4938  200
1363157986041  13480253104  5C-0E-8B-C7-FC-80:CMCC-EASY  120.197.40.4  3  3  180  180  200
Each record has 11 tab-separated fields: timestamp, phone number, MAC address (with carrier), IP address, visited host, site category, upstream packet count, downstream packet count, upstream traffic, downstream traffic, and HTTP status code. Only field 2 (phone number) and fields 7-10 (the packet and traffic counts) are used below.
Requirement: for each user, compute the total upstream packet count, total downstream packet count, total upstream traffic, and total downstream traffic.
Analysis:
1. Sort the records by user.
2. Group the records of the same user from different time periods.
3. Add up the traffic of each user's records.
Below we solve this in two different ways; the approaches differ, but the underlying idea is the same, and that idea is what matters.
1. The plain approach (shell)
Since we only need each user's traffic, we first extract just the fields we care about: phone number, upstream packets, downstream packets, upstream traffic and downstream traffic.
cat file | awk -F'\t' '{print $2,$7,$8,$9,$10}'
Sort
After the awk step the phone number is the first field, so we sort on it:
cat file | awk -F'\t' '{print $2,$7,$8,$9,$10}' | sort -k 1
13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 15 9 918 4938
13560439658 18 15 1116 954
13600217502 18 138 1080 186852
13602846565 15 12 1938 2910
13660577991 24 9 6960 690
13719199419 4 0 240 0
13726230503 24 27 2481 24681
13760778710 2 2 120 120
13823070001 6 3 360 180
13826544101 4 0 264 0
13922314466 12 12 3008 3720
13925057413 69 63 11058 48243
13926251106 4 0 240 0
13926435656 2 4 132 1512
15013685858 28 27 3659 3538
15920133257 20 20 3156 2936
15989002119 3 3 1938 180
18211575961 15 12 1527 2106
18320173382 21 18 9531 2412
84138413 20 16 4116 1432
Grouping
Next, group the sorted data, i.e. bring the records of the same user together (no summation yet, just collecting them side by side), for example:
13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 [15 9 918 4938] [18 15 1116 954]
In the shell, this grouping can be done with awk associative arrays, keyed by the phone number.
Summing each user's traffic
cat HTTP_20130313143750.dat | sort -k2 | awk -F'\t' '{print $2,$7,$8,$9,$10}' | awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5}END{for(i in a)print(i,a[i],b[i],c[i],d[i])}' | sort
13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 33 24 2034 5892
13600217502 18 138 1080 186852
13602846565 15 12 1938 2910
13660577991 24 9 6960 690
13719199419 4 0 240 0
13726230503 24 27 2481 24681
13760778710 2 2 120 120
13823070001 6 3 360 180
13826544101 4 0 264 0
13922314466 12 12 3008 3720
13925057413 69 63 11058 48243
13926251106 4 0 240 0
13926435656 2 4 132 1512
15013685858 28 27 3659 3538
15920133257 20 20 3156 2936
15989002119 3 3 1938 180
18211575961 15 12 1527 2106
18320173382 21 18 9531 2412
84138413 20 16 4116 1432
The pipeline above runs awk twice and sort twice, which is unnecessary; tidied up, it becomes:
cat HTTP_20130313143750.dat | awk -F'\t' '{a[$2]+=$7;b[$2]+=$8;c[$2]+=$9;d[$2]+=$10}END{for(i in a)print(i,a[i],b[i],c[i],d[i])}' | sort
2. The MapReduce approach: the data here is small enough for a shell script, but as the volume grows the processing time increases and memory runs out; that is when MapReduce (MR) is needed.
From the definitions of MR's map and reduce functions it is easy to identify the input and output of each phase.
Map phase:
input: one line of the raw record;
output: key = user (phone number), value = the traffic figures for that time period.
Reduce phase:
input: the map output, grouped by user;
output: the summed traffic for each user.
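For example, phone number 13560439658 appears twice in the sample data, so the map phase emits two key/value pairs for it:
13560439658    18,15,1116,954
13560439658    15,9,918,4938
The reduce phase receives both values under the same key and writes out the sums:
13560439658    33,24,2034,5892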
Implementation
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    Text k2 = new Text();
    Text v2 = new Text();

    @Override
    protected void map(LongWritable k1, Text v1, Context context)
            throws IOException, InterruptedException {
        // Split the tab-separated record and pick out the fields we need.
        String[] splited = v1.toString().split("\t");
        String telno = splited[1];      // field 2: phone number
        String uppacknum = splited[6];  // field 7: upstream packet count
        String dwpacknum = splited[7];  // field 8: downstream packet count
        String uppayload = splited[8];  // field 9: upstream traffic
        String dwpayload = splited[9];  // field 10: downstream traffic
        String data = uppacknum + "," + dwpacknum + "," + uppayload + "," + dwpayload;
        k2.set(telno);
        v2.set(data);
        System.out.println(data);       // debug output only
        context.write(k2, v2);
    }
}
public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    Text v3 = new Text();

    @Override
    protected void reduce(Text k2, Iterable<Text> v2s, Context context)
            throws IOException, InterruptedException {
        long uppacknum = 0L;
        long dwpacknum = 0L;
        long uppayload = 0L;
        long dwpayload = 0L;
        // Add up the four figures across all records of this phone number.
        for (Text v2 : v2s) {
            String[] split = v2.toString().split(",");
            uppacknum += Long.parseLong(split[0]);
            dwpacknum += Long.parseLong(split[1]);
            uppayload += Long.parseLong(split[2]);
            dwpayload += Long.parseLong(split[3]);
        }
        String data = uppacknum + "," + dwpacknum + "," + uppayload + "," + dwpayload;
        System.out.println(data);       // debug output only
        v3.set(data);
        context.write(k2, v3);
    }
}
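The mapper and reducer above are only fragments; to actually run the job a driver is still needed. Below is a minimal sketch that is not part of the original post: the enclosing class name TrafficCount is assumed, as is the Hadoop 2.x Job API, with the input and output paths taken from the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;   // needed by the mapper above
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;   // needed by the mapper above
import org.apache.hadoop.mapreduce.Reducer;  // needed by the reducer above
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficCount {

    // MyMapper and MyReducer from above go here as static inner classes.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "traffic count");
        job.setJarByClass(TrafficCount.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Map and reduce both emit <Text, Text>, so one pair of output declarations is enough.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HTTP_20130313143750.dat on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}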
Here only Hadoop's built-in types are used; no custom serialization type is defined.
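For reference, a custom type would only have to implement Hadoop's Writable interface instead of packing the four numbers into a comma-separated Text. A rough sketch follows; the class name FlowWritable is made up here and is not used by the code above.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical value type carrying the four traffic figures.
public class FlowWritable implements Writable {
    private long upPackets;
    private long downPackets;
    private long upBytes;
    private long downBytes;

    public FlowWritable() {}  // Hadoop needs a no-arg constructor for deserialization

    public void set(long upPackets, long downPackets, long upBytes, long downBytes) {
        this.upPackets = upPackets;
        this.downPackets = downPackets;
        this.upBytes = upBytes;
        this.downBytes = downBytes;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization order...
        out.writeLong(upPackets);
        out.writeLong(downPackets);
        out.writeLong(upBytes);
        out.writeLong(downBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // ...must match deserialization order
        upPackets = in.readLong();
        downPackets = in.readLong();
        upBytes = in.readLong();
        downBytes = in.readLong();
    }

    @Override
    public String toString() {
        return upPackets + "," + downPackets + "," + upBytes + "," + downBytes;
    }
}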
The final result:
13480253104 3,3,180,180
13502468823  57,102,7335,110349
13560439658  33,24,2034,5892
13600217502  18,138,1080,186852
13602846565  15,12,1938,2910
13660577991  24,9,6960,690
13719199419  4,0,240,0
13726230503  24,27,2481,24681
13760778710  2,2,120,120
13823070001  6,3,360,180
13826544101  4,0,264,0
13922314466  12,12,3008,3720
13925057413  69,63,11058,48243
13926251106  4,0,240,0
13926435656  2,4,132,1512
15013685858  28,27,3659,3538
15920133257  20,20,3156,2936
15989002119  3,3,1938,180
18211575961  15,12,1527,2106
18320173382  21,18,9531,2412
84138413  20,16,4116,1432
Computing each phone number's daily internet traffic with MapReduce