IP attribution query for Spark data analytics


Some time ago, the project leader asked for a real-time view of IP accesses from each province. To meet this requirement, Nginx logs are captured in real time by Flume/Logstash and produced to Kafka, then consumed and analyzed in real time by Spark, with the results saved to Redis/MySQL; finally, the front end displays them in real time on a Baidu ECharts map.
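
For the streaming half of that pipeline, a minimal Spark Streaming sketch that consumes the Nginx log topic from Kafka might look like the following (the broker address, topic name, and batch interval are assumptions; this part is not shown in the original post, and it uses the spark-streaming-kafka 0.8 direct API):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NginxLogStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NginxLogStreaming")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed broker address and topic name; adjust to the real cluster.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("nginx_log")

    // Direct stream: each record's value is one raw Nginx log line.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    // From here the same ip2Long/binarySearch logic as in the batch job below applies.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}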
First, there must be a rule table for IP attribution, stored either locally or distributed across multiple machines (for example, on HDFS). A section of the IP rule table is shown below:

1.0.1.0|1.0.3.255|16777472|16778239|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.2.2.0|1.2.2.255|16908800|16909055|Asia|China|Beijing|Beijing|Haidian|North Dragon Middle NET|110108|China|CN|116.29812|39.95931
1.2.4.0|1.2.4.255|16909312|16909567|Asia|China|Beijing|Beijing||China Internet Information Center|110100|China|CN|116.405285|39.904989
1.2.5.0|1.2.7.255|16909568|16910335|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.2.8.0|1.2.8.255|16910336|16910591|Asia|China|Beijing|Beijing||China Internet Information Center|110100|China|CN|116.405285|39.904989
1.2.9.0|1.2.127.255|16910592|16941055|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.3.0.0|1.3.255.255|16973824|17039359|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.4.1.0|1.4.3.255|17039616|17040383|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
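
Each record holds the start and end of an IP range in dotted form (fields 1 and 2) and in numeric form (fields 3 and 4), followed by continent, country, province, city, district, ISP, an administrative area code, country again, country code, and the longitude/latitude of the city. The numeric form folds the four octets into a long, shifting left eight bits per octet; a quick check (illustrative snippet, not from the original post):

// For 1.0.1.0: ((1 * 256 + 0) * 256 + 1) * 256 + 0 = 16777472,
// which matches field 3 of the first record above.
val ipNum = "1.0.1.0".split("[.]").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)
assert(ipNum == 16777472L)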

Local mode

import java.sql.{Connection, Date, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Compute the province of each access IP from a rule file.
 * Created by Tianjun on 2017/2/13.
 */
object IpLocation {

  // Convert a dotted-quad IP string to its numeric (Long) form.
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  // Binary-search the sorted rules for the range containing ip; returns -1 if no match.
  def binarySearch(lines: Array[(String, String, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1.toLong) && (ip <= lines(middle)._2.toLong)) {
        return middle
      }
      if (ip < lines(middle)._1.toLong) {
        high = middle - 1
      } else {
        low = middle + 1
      }
    }
    -1
  }

  // Write one partition's (province, count) pairs to MySQL.
  val data2MySQL = (iterator: Iterator[(String, Int)]) => {
    var conn: Connection = null
    var ps: PreparedStatement = null
    val sql = "INSERT INTO location_info (location, counts, access_date) VALUES (?, ?, ?)"
    try {
      conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/bigdata?useUnicode=true&characterEncoding=utf-8",
        "root", "123")
      iterator.foreach(line => {
        ps = conn.prepareStatement(sql)
        ps.setString(1, line._1)
        ps.setInt(2, line._2)
        ps.setDate(3, new Date(System.currentTimeMillis()))
        ps.executeUpdate()
      })
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (ps != null) ps.close()
      if (conn != null) conn.close()
    }
  }

  def main(args: Array[String]) {
    // Only needed to avoid a winutils error on Windows; not required on Linux.
    System.setProperty("hadoop.home.dir", "c:\\tianjun\\winutil\\")

    val conf = new SparkConf().setMaster("local").setAppName("IpLocation")
    val sc = new SparkContext(conf)

    // Load the IP rules (they could also come from HDFS or another shared store).
    val ipRulesRdd = sc.textFile("c://ip.txt").map(line => {
      val fields = line.split("\\|")
      val startNum = fields(2)
      val endNum = fields(3)
      val province = fields(6)
      (startNum, endNum, province)
    })

    // Collect all mapping rules to the driver, then broadcast them to the executors.
    val ipRulesArray = ipRulesRdd.collect()
    val ipRulesBroadcast = sc.broadcast(ipRulesArray)

    // Load the data to be processed; the second pipe-delimited field is the client IP.
    val ipsRdd = sc.textFile("c://log").map(line => {
      val fields = line.split("\\|")
      fields(1)
    })

    // Look up each IP's rule, then accumulate the results for each province.
    val result = ipsRdd.map(ip => {
      val ipNum = ip2Long(ip)
      val index = binarySearch(ipRulesBroadcast.value, ipNum)
      val info = ipRulesBroadcast.value(index) // (start num, end num, province)
      info
    }).map(t => (t._3, 1)).reduceByKey(_ + _)

    result.foreachPartition(data2MySQL)
    // println(result.collect().toBuffer)
    sc.stop()
  }
}
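
As a quick sanity check of the two helpers (a hypothetical snippet using two rules from the sample table above, not part of the original post):

// Rules as loaded by the job: (start num, end num, province).
val rules = Array(
  ("16777472", "16778239", "Fujian"),   // 1.0.1.0 - 1.0.3.255
  ("16779264", "16781311", "Guangdong") // 1.0.8.0 - 1.0.15.255
)

// 1.0.2.10 -> 16777738, which falls in the first range.
val idx = IpLocation.binarySearch(rules, IpLocation.ip2Long("1.0.2.10"))
println(rules(idx)._3) // prints "Fujian"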

As you can see, analyzing the data with Spark's operators is very easy. Spark's integration with Kafka, databases, and other systems is covered on the Spark website and is just as straightforward. Besides MySQL, the results can also be written to Redis, as sketched below.
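
A minimal sketch of the Redis variant, using the Jedis client (the host, port, and hash key name are assumptions for illustration; the original post only shows the MySQL path):

import redis.clients.jedis.Jedis

// Write one partition's (province, count) pairs into a Redis hash.
val data2Redis = (iterator: Iterator[(String, Int)]) => {
  // "localhost:6379" and the key "province_counts" are assumed values.
  val jedis = new Jedis("localhost", 6379)
  try {
    iterator.foreach { case (province, count) =>
      jedis.hincrBy("province_counts", province, count.toLong)
    }
  } finally {
    jedis.close()
  }
}

// Used exactly like data2MySQL: result.foreachPartition(data2Redis)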
Let's take a look at the results written to the database in this example:

+----+-----------+--------+---------------------+
| id | location  | counts | access_date         |
+----+-----------+--------+---------------------+
|  7 | Shaanxi   |   1824 | 2017-02-13 00:00:00 |
|  8 | Hebei     |    383 | 2017-02-13 00:00:00 |
|  9 | Yunnan    |    126 | 2017-02-13 00:00:00 |
| 10 | Chongqing |    868 | 2017-02-13 00:00:00 |
| 11 | Beijing   |   1535 | 2017-02-13 00:00:00 |
+----+-----------+--------+---------------------+

In this test, only about 4,700 Nginx log lines were intercepted, a file of roughly 1.9 MB.
