Summing each user's Internet traffic for one day with MapReduce

Source: Internet
Author: User

Internet access log data (excerpt)

1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7a-28-cc-0a:cmcc 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5c-0e-8b-8b-b1-50:cmcc 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-ac-cd-e6-18:cmcc-easy 120.196.100.99 iface.qiyi.com video site 15 12 1527 2106 200
1363157995074 84138413 5c-0e-8b-8c-e8-20:7daysinn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 c4-17-fe-ba-de-d9:cmcc 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5c-0e-8b-c7-ba-20:cmcc 120.197.40.4 sug.so.360.cn Information Security 20 20 3156 2936 200
1363157983019 13719199419 68-a1-b7-03-07-b1:cmcc-easy 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5c-0e-8b-92-5c-20:cmcc-easy 120.197.40.4 s19.cnzz.com Site Statistics 24 9 6960 690 200
1363157973098 15013685858 5c-0e-8b-c7-f7-90:cmcc 120.197.40.4 rank.ie.sogou.com search engine 28 27 3659 3538 200
1363157986029 15989002119 e8-99-c4-4e-93-e0:cmcc-easy 120.196.100.99 www.umeng.com Site Statistics 3 3 1938 180 200
1363157992093 13560439658 c4-17-fe-ba-de-d9:cmcc 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5c-0e-8b-c7-fc-80:cmcc-easy 120.197.40.4 3 3 180 180 200

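Each record has 11 tab-separated columns. The ones used here are column 2 (phone number), columns 7 and 8 (upstream and downstream packet counts), and columns 9 and 10 (upstream and downstream traffic). Columns that have no value in a record, such as the host name or category, are empty.

Requirement: for each user, compute the total number of upstream packets, the total number of downstream packets, the total upstream traffic, and the total downstream traffic.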
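To make the column positions concrete, here is a minimal Java sketch that pulls the five needed fields out of one record. It assumes the fields are tab-separated, with the phone number in column 2 and the four counters in columns 7 to 10 (the positions used by the commands and code in this article). The class name `RecordFields` and the sample line, with its two empty columns, are illustrative assumptions.

```java
class RecordFields {
    // Returns {phone, upPackets, downPackets, upTraffic, downTraffic}.
    // Assumption: the record is tab-separated and has at least 10 columns.
    static String[] extract(String line) {
        String[] f = line.split("\t"); // empty columns in the middle are kept
        return new String[] { f[1], f[6], f[7], f[8], f[9] };
    }

    public static void main(String[] args) {
        // Hypothetical record: columns 5 and 6 are empty.
        String line = "1363157995052\t13826544101\t5C-0E-8B-C7-F1-E0:CMCC\t"
                + "120.197.40.4\t\t\t4\t0\t264\t0\t200";
        System.out.println(String.join(" ", extract(line)));
        // prints: 13826544101 4 0 264 0
    }
}
```

Java's `String.split` drops only trailing empty strings, so the empty columns in the middle of the record do not shift the indexes.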

Analysis:

1. Sort the records by user.

2. Group each user's records from different time periods together.

3. Add up each user's traffic.

We will solve this in two ways. The tools differ, but the underlying idea is the same, and the idea is what matters.

1. Common method (Shell method)

Since we want each user's traffic, we first extract only the fields we need: phone number, upstream packets, downstream packets, upstream traffic, and downstream traffic.

cat file | awk -F'\t' '{print $2,$7,$8,$9,$10}'

Sort

cat file | awk -F'\t' '{print $2,$7,$8,$9,$10}' | sort -k1,1

13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 15 9 918 4938
13560439658 18 15 1116 954
13600217502 18 138 1080 186852
13602846565 15 12 1938 2910
13660577991 24 9 6960 690
13719199419 4 0 240 0
13726230503 24 27 2481 24681
13760778710 2 2 120 120
13823070001 6 3 360 180
13826544101 4 0 264 0
13922314466 12 12 3008 3720
13925057413 69 63 11058 48243
13926251106 4 0 240 0
13926435656 2 4 132 1512
15013685858 28 27 3659 3538
15920133257 20 20 3156 2936
15989002119 3 3 1938 180
18211575961 15 12 1527 2106
18320173382 21 18 9531 2412
84138413 20 16 4116 1432

Group

Grouping the sorted data puts all of one user's records together on a single line. Nothing is added up yet; the records are simply placed side by side, for example:

13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 [15 9 918 4938] [18 15 1116 954]

In the shell, grouping can be done with the associative arrays in awk.
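The same grouping step can be sketched in Java with a map from phone number to a list of records, which plays the role of the awk array. This is only an illustration: the class name `GroupByUser` and its helper methods are hypothetical, and the input rows are assumed to be the space-separated five-field lines produced by the extraction step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupByUser {
    // Groups rows of the form "phone upPk downPk upBytes downBytes" by phone.
    // A TreeMap keeps the phone numbers in sorted (string) order.
    static Map<String, List<String>> group(List<String> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            int sp = row.indexOf(' ');
            String phone = row.substring(0, sp);
            groups.computeIfAbsent(phone, k -> new ArrayList<>())
                  .add(row.substring(sp + 1));
        }
        return groups;
    }

    // Formats one group as: phone [record] [record] ...
    static String format(String phone, List<String> records) {
        StringBuilder sb = new StringBuilder(phone);
        for (String r : records) {
            sb.append(" [").append(r).append("]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "13480253104 3 3 180 180",
                "13560439658 15 9 918 4938",
                "13560439658 18 15 1116 954");
        group(rows).forEach((phone, recs) -> System.out.println(format(phone, recs)));
        // prints:
        // 13480253104 [3 3 180 180]
        // 13560439658 [15 9 918 4938] [18 15 1116 954]
    }
}
```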

Sum each user's traffic

cat Http_20130313143750.dat | sort -k2 | awk -F'\t' '{print $2,$7,$8,$9,$10}' | awk '{a[$1]+=$2;b[$1]+=$3;c[$1]+=$4;d[$1]+=$5} END {for (i in a) print i,a[i],b[i],c[i],d[i]}' | sort

13480253104 3 3 180 180
13502468823 57 102 7335 110349
13560439658 33 24 2034 5892
13600217502 18 138 1080 186852
13602846565 15 12 1938 2910
13660577991 24 9 6960 690
13719199419 4 0 240 0
13726230503 24 27 2481 24681
13760778710 2 2 120 120
13823070001 6 3 360 180
13826544101 4 0 264 0
13922314466 12 12 3008 3720
13925057413 69 63 11058 48243
13926251106 4 0 240 0
13926435656 2 4 132 1512
15013685858 28 27 3659 3538
15920133257 20 20 3156 2936
15989002119 3 3 1938 180
18211575961 15 12 1527 2106
18320173382 21 18 9531 2412
84138413 20 16 4116 1432

The script above runs awk twice and sort twice, which is clearly unnecessary. A single awk pass over the raw file followed by one sort is enough:

cat Http_20130313143750.dat | awk -F'\t' '{a[$2]+=$7;b[$2]+=$8;c[$2]+=$9;d[$2]+=$10} END {for (i in a) print i,a[i],b[i],c[i],d[i]}' | sort

2. MapReduce method

With a small data set, a shell script is enough. With a large data set, processing takes much longer and the job can run out of memory, so MapReduce (MR) is needed.

From the definitions of the MR map and reduce functions, the inputs and outputs of the two stages are easy to identify.

Map stage

The input is one record (one line) at a time.

The output is the user's phone number together with that record's traffic for its time period.

Reduce stage

The input is the output of the map stage, grouped by user.

The output is each user's traffic totals.
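Before turning to Hadoop, the data flow of the two stages can be tried out in plain Java. The sketch below is not Hadoop code: `MapReduceSketch` and its `run` method are hypothetical stand-ins for the framework, which in a real job performs the grouping between map and reduce (the shuffle) itself. The sample lines assume tab-separated records with two empty columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {
    // Map step: one line in, one (phone, "upPk,downPk,upBytes,downBytes") pair out.
    static String[] map(String line) {
        String[] f = line.split("\t");
        return new String[] { f[1], f[6] + "," + f[7] + "," + f[8] + "," + f[9] };
    }

    // Reduce step: all values of one phone in, field-wise sums out.
    static String reduce(List<String> values) {
        long[] sum = new long[4];
        for (String v : values) {
            String[] p = v.split(",");
            for (int i = 0; i < 4; i++) {
                sum[i] += Long.parseLong(p[i]);
            }
        }
        return sum[0] + "," + sum[1] + "," + sum[2] + "," + sum[3];
    }

    // Stand-in for the framework: map each line, group by key, reduce each group.
    static Map<String, String> run(List<String> lines) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : lines) {
            String[] kv = map(line);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        Map<String, String> out = new TreeMap<>();
        shuffled.forEach((phone, values) -> out.put(phone, reduce(values)));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1363157993055\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t\t\t18\t15\t1116\t954\t200",
                "1363157992093\t13560439658\tC4-17-FE-BA-DE-D9:CMCC\t120.196.100.99\t\t\t15\t9\t918\t4938\t200");
        run(lines).forEach((phone, totals) -> System.out.println(phone + "\t" + totals));
        // prints: 13560439658	33,24,2034,5892
    }
}
```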

Implementation

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Both classes are static nested classes, so they sit inside an enclosing job class (not shown).
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    Text k2 = new Text();
    Text v2 = new Text();

    @Override
    protected void map(LongWritable k1, Text v1, Context context)
            throws IOException, InterruptedException {
        // Split the tab-separated record and keep only the fields we need.
        String[] splited = v1.toString().split("\t");
        String telNo = splited[1];
        String upPackNum = splited[6];
        String dwPackNum = splited[7];
        String upPayload = splited[8];
        String dwPayload = splited[9];
        String data = upPackNum + "," + dwPackNum + "," + upPayload + ","
                + dwPayload;
        k2.set(telNo);
        v2.set(data);
        System.out.println(data); // debug output
        // Emit (phone number, "upPackets,downPackets,upTraffic,downTraffic").
        context.write(k2, v2);
    }

}

public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    Text v3 = new Text();

    @Override
    protected void reduce(Text k2, Iterable<Text> v2s, Context context)
            throws IOException, InterruptedException {
        long upPackNum = 0L;
        long dwPackNum = 0L;
        long upPayload = 0L;
        long dwPayload = 0L;
        String data = "";
        for (Text v2 : v2s) {
            // Add this record's four counters to the running totals.
            String[] split = v2.toString().split(",");
            upPackNum += Long.parseLong(split[0]);
            dwPackNum += Long.parseLong(split[1]);
            upPayload += Long.parseLong(split[2]);
            dwPayload += Long.parseLong(split[3]);
            data = upPackNum + "," + dwPackNum + "," + upPayload + ","
                    + dwPayload;
            System.out.println(data); // debug output: running totals
        }
        // Emit (phone number, totals) once all of the user's records are summed.
        v3.set(data);
        context.write(k2, v3);
    }
}

This implementation uses only Hadoop's built-in types (the values are passed as comma-separated Text) and does not define its own serialization type.
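For comparison, a custom value type would carry the four counters as numbers instead of a comma-separated string. The sketch below is an assumption about what such a type could look like, not part of the original job. `TrafficValue` is a hypothetical name, and to stay runnable without Hadoop on the classpath it does not implement `org.apache.hadoop.io.Writable`. Its `write` and `readFields` methods have the same signatures as that interface, so in a real job the class would declare `implements Writable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TrafficValue {
    long upPackNum, dwPackNum, upPayload, dwPayload;

    TrafficValue() {} // no-arg constructor: an empty instance is filled by readFields

    TrafficValue(long upPackNum, long dwPackNum, long upPayload, long dwPayload) {
        this.upPackNum = upPackNum;
        this.dwPackNum = dwPackNum;
        this.upPayload = upPayload;
        this.dwPayload = dwPayload;
    }

    // Field-wise addition, as the reducer would do per record.
    void add(TrafficValue o) {
        upPackNum += o.upPackNum;
        dwPackNum += o.dwPackNum;
        upPayload += o.upPayload;
        dwPayload += o.dwPayload;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeLong(upPackNum);
        out.writeLong(dwPackNum);
        out.writeLong(upPayload);
        out.writeLong(dwPayload);
    }

    // Deserialize in exactly the order used by write.
    public void readFields(DataInput in) throws IOException {
        upPackNum = in.readLong();
        dwPackNum = in.readLong();
        upPayload = in.readLong();
        dwPayload = in.readLong();
    }

    @Override
    public String toString() {
        return upPackNum + "," + dwPackNum + "," + upPayload + "," + dwPayload;
    }

    // Helper for this sketch only: serialize to bytes and read back a copy.
    static TrafficValue roundTrip(TrafficValue v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            v.write(new DataOutputStream(bytes));
            TrafficValue copy = new TrafficValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrafficValue total = new TrafficValue(15, 9, 918, 4938);
        total.add(new TrafficValue(18, 15, 1116, 954));
        System.out.println(roundTrip(total)); // prints: 33,24,2034,5892
    }
}
```

With a type like this, the reducer would add numeric fields directly instead of splitting and parsing strings for every record.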

The final result

13480253104 3,3,180,180
13502468823 57,102,7335,110349
13560439658 33,24,2034,5892
13600217502 18,138,1080,186852
13602846565 15,12,1938,2910
13660577991 24,9,6960,690
13719199419 4,0,240,0
13726230503 24,27,2481,24681
13760778710 2,2,120,120
13823070001 6,3,360,180
13826544101 4,0,264,0
13922314466 12,12,3008,3720
13925057413 69,63,11058,48243
13926251106 4,0,240,0
13926435656 2,4,132,1512
15013685858 28,27,3659,3538
15920133257 20,20,3156,2936
15989002119 3,3,1938,180
18211575961 15,12,1527,2106
18320173382 21,18,9531,2412
84138413 20,16,4116,1432
