Big Data Combat: User Traffic Analysis System

Source: Internet
Author: User
Tags: iterable, hadoop fs

This article uses MapReduce in Hadoop to analyze user data: it computes, per mobile phone number, the uplink traffic, downlink traffic, and total traffic, and it can sort users by total traffic. It is a simple, approachable Hadoop project, intended mainly to deepen your understanding of MapReduce and its practical application. Download links for the source data and the system source code are provided at the end of the article.

This case is well suited to Hadoop beginners and to anyone who wants to get started with big data, cloud computing, data analytics, and related topics.

I. Data sources to be analyzed

The file to be analyzed is a text file containing a large number of user browsing records, including each user's mobile phone number, access time, machine serial number, access IP, visited site, uplink traffic, downlink traffic, total traffic, and other information. Only a small excerpt is shown here; a download link for the full file is provided at the end of the article.
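A hypothetical record in this format, tab-separated and with invented values (the real file may carry additional fields), might look like this:

1363157985066	13726230503	00-FD-07-A4-72-B8:CMCC	120.196.100.82	i02.c.aliimg.com	24	27	2481	24681	200

Note that the phone number is the second field, and the uplink and downlink traffic are the third- and second-from-last fields; the mapper code below relies on exactly this layout.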


II. Basic statistics

To calculate each user's uplink traffic, downlink traffic, and total traffic, we need a bean class to encapsulate the data. Create a new Java project and add the Hadoop jars to the build path (or build a MapReduce project directly), then create a FlowBean.java with three fields:
private long upFlow;
private long dFlow;
private long sumFlow;
Then generate the getters and setters, toString(), and the constructors (remember to also generate the empty constructor, or the subsequent run will fail: Hadoop creates the bean by reflection when deserializing). The complete code is as follows:
package cn.tf.flow;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long dFlow;
    private long sumFlow;

    public long getUpFlow() { return upFlow; }
    public void setUpFlow(long upFlow) { this.upFlow = upFlow; }
    public long getDFlow() { return dFlow; }
    public void setDFlow(long dFlow) { this.dFlow = dFlow; }
    public long getSumFlow() { return sumFlow; }
    public void setSumFlow(long sumFlow) { this.sumFlow = sumFlow; }

    public FlowBean(long upFlow, long dFlow) {
        this.upFlow = upFlow;
        this.dFlow = dFlow;
        this.sumFlow = upFlow + dFlow;
    }

    // Hadoop instantiates the bean by reflection during deserialization,
    // so the no-argument constructor is required
    public FlowBean() {
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // fields must be read in the same order write() emits them
        upFlow = in.readLong();
        dFlow = in.readLong();
        sumFlow = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(dFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public String toString() {
        return upFlow + "\t" + dFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // sort by total traffic, descending
        return this.sumFlow > o.getSumFlow() ? -1 : 1;
    }
}

Next comes the statistics code, in a new FlowCount.java. Here the mapper and reducer are written in the same class; in a stricter codebase they would be separate classes. In the mapper we pull out the phone number plus the uplink and downlink traffic, which sit third- and second-from-last on each line, hence the fields.length - 3 and fields.length - 2 indexing:
public static class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // turn this line of input into a String
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        try {
            if (fields.length > 3) {
                // grab the phone number and the up/down traffic fields
                String phone = fields[1];
                long upFlow = Long.parseLong(fields[fields.length - 3]);
                long dFlow = Long.parseLong(fields[fields.length - 2]);
                // emit this line's result: key is the phone number,
                // value is the traffic bean
                context.write(new Text(phone), new FlowBean(upFlow, dFlow));
            } else {
                return;
            }
        } catch (Exception e) {
            // silently skip malformed lines
        }
    }
}

In the reducer, the per-phone records are collated and summed:
public static class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        long upSum = 0;
        long dSum = 0;
        // sum the uplink and downlink traffic of all records for this phone
        for (FlowBean bean : values) {
            upSum += bean.getUpFlow();
            dSum += bean.getDFlow();
        }
        FlowBean resultBean = new FlowBean(upSum, dSum);
        context.write(key, resultBean);
    }
}


Finally, the job is configured and launched in the main method:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(FlowCount.class);
    job.setMapperClass(FlowCountMapper.class);
    job.setReducerClass(FlowCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(FlowBean.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FlowBean.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean res = job.waitForCompletion(true);
    System.exit(res ? 0 : 1);
}
Of course, you also need to create the /flow/data directory in your HDFS root and upload the source data file to it:

bin/hadoop fs -mkdir -p /flow/data
bin/hadoop fs -put http_20130313143750.dat /flow/data

Package the above MapReduce project into a jar file, then run it with Hadoop. For example, I put it at ~/hadoop/lx/flow.jar, and then, from the Hadoop installation directory, execute:

bin/hadoop jar ../lx/flow.jar cn.tf.flow.FlowCount /flow/data /flow/output

The results of the run are as follows:
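Each output line carries the phone number followed by FlowBean.toString(), i.e. uplink, downlink, and total traffic, tab-separated. A hypothetical line, with invented values consistent with the sample record above:

13726230503	2481	24681	27162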


During this whole process, YarnChild processes are running the tasks; they exit automatically once the whole job completes.
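If you want to observe this yourself, the JDK's jps tool lists the running JVM processes while the job executes. The PIDs below are invented and the exact process list depends on your cluster setup; this is only an illustration:

$ jps
2481 NameNode
2603 DataNode
2890 ResourceManager
3012 NodeManager
4117 YarnChild
4251 Jps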
III. Sorting by total traffic, from largest to smallest

With the basic version above working, sorting by total traffic is very simple. We create a new FlowCountSort.java.

The full code is as follows:

package cn.tf.flow;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountSort {

    public static class FlowCountSortMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // input is the summed output of FlowCount: phone, up, down, sum
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String phone = fields[0];
            long upSum = Long.parseLong(fields[1]);
            long dSum = Long.parseLong(fields[2]);
            FlowBean sumBean = new FlowBean(upSum, dSum);
            // the bean is the key, so the shuffle sorts by FlowBean.compareTo()
            context.write(sumBean, new Text(phone));
        }
    }

    public static class FlowCountSortReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
        // each incoming "group" is one phone's traffic bean plus its phone number
        @Override
        protected void reduce(FlowBean key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(values.iterator().next(), key);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(FlowCountSort.class);
        job.setMapperClass(FlowCountSortMapper.class);
        job.setReducerClass(FlowCountSortReducer.class);
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}

The heavy lifting is done by code already in FlowBean.java: the class implements the WritableComparable<FlowBean> interface and overrides the compareTo() method. Because the bean is the map output key, the shuffle phase sorts records with this method, and with the job's single (default) reduce task the output file comes out globally sorted.

@Override
public int compareTo(FlowBean o) {
    return this.sumFlow > o.getSumFlow() ? -1 : 1;
}
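As a side note (not part of the original project): the compareTo() above never returns 0, even when two totals are equal. A slightly more robust variant, sketched here, uses Long.compare and simply swaps the arguments to keep the descending order:

@Override
public int compareTo(FlowBean o) {
    // Long.compare returns 0 on ties instead of ordering them arbitrarily;
    // comparing o to this (rather than this to o) gives descending order
    return Long.compare(o.getSumFlow(), this.sumFlow);
}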
Package this into a jar the same way, then run it with Hadoop:

bin/hadoop jar ../lx/flowsort.jar cn.tf.flow.FlowCountSort /flow/output /flow/sortoutput
The result: the same per-user traffic lines as before, now ordered from largest to smallest total traffic.




IV. Classifying users by phone number region

The traffic summary now needs to be written to a different result file per province, which raises two issues:

1. How to make one MR job produce multiple result files. Principle: the number of result files is determined by the number of reduce tasks, in one-to-one correspondence. Practice: specify the number of reduce tasks in the code.


2. How to route each phone number's data to the correct file. Principle: send each phone number's data to the right reduce task, which then writes the right result file. This means customizing MapReduce's partitioning mechanism (by default, a KV pair is assigned to a partition by the key's hash code modulo the number of reduce tasks; a sketch of that default follows below). Practice: write a custom class that overrides MR's partitioning policy, i.e. a Partitioner implementation.
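For reference, the default mechanism just described is what Hadoop's built-in HashPartitioner implements. The sketch below mirrors that logic (the class name here is ours, for illustration only); masking with Integer.MAX_VALUE clears the sign bit so the index is always in [0, numReduceTasks):

import org.apache.hadoop.mapreduce.Partitioner;

// A sketch of Hadoop's default partitioning behavior (HashPartitioner):
// every KV pair goes to the reduce task selected by the key's hash.
public class DefaultPartitionSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // clear the sign bit so the partition index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}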

The main code is very similar to the sort job above; only the following two lines need to be added to the main method:

// specify the custom Partitioner, and a matching number of reduce tasks
job.setPartitionerClass(ProvincePartioner.class);
job.setNumReduceTasks(5);

Here we create a new ProvincePartioner.java to hold the number-classification logic:

public class ProvincePartioner extends Partitioner<Text, FlowBean> {

    private static HashMap<String, Integer> provinceMap = new HashMap<String, Integer>();

    static {
        provinceMap.put("135", 0);
        provinceMap.put("136", 1);
        provinceMap.put("137", 2);
        provinceMap.put("138", 3);
    }

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // the first three digits of the phone number decide the partition
        String prefix = key.toString().substring(0, 3);
        Integer partNum = provinceMap.get(prefix);
        if (partNum == null) {
            partNum = 4; // every other prefix goes to the last partition
        }
        return partNum;
    }
}
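Since five reduce tasks are configured, the job writes five result files, one per partition, using MapReduce's standard part-r-NNNNN naming. Assuming an output directory of /flow/provinceoutput (the directory name here is just a placeholder), the mapping looks like this:

/flow/provinceoutput/part-r-00000   (numbers starting with 135)
/flow/provinceoutput/part-r-00001   (136)
/flow/provinceoutput/part-r-00002   (137)
/flow/provinceoutput/part-r-00003   (138)
/flow/provinceoutput/part-r-00004   (all other prefixes)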

The job is executed the same way as before. From the run output we can see that 5 reduce tasks are started; because the data set here is small, only a single map task runs.



With that, the entire user traffic analysis system is complete. Thank you for your support!



Data Source: http://download.csdn.net/detail/sdksdk0/9545935

Source project address: https://github.com/sdksdk0/HDFS_MapReduce



