IP attribution query for Spark data analytics


Some time ago, the project leader asked for a real-time view of IP accesses from each province. To meet this requirement, Nginx logs are captured in real time by Flume/Logstash and produced to Kafka, then consumed and analyzed in real time by Spark, with the results saved to Redis/MySQL; finally, the front end displays them in real time on a Baidu ECharts map.
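
For the streaming half of that pipeline, a minimal Spark Streaming sketch that consumes the Nginx log topic from Kafka might look like the following (the broker address, topic name, and batch interval are assumptions; this part is not shown in the original post, and it uses the spark-streaming-kafka 0.8 direct API):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NginxLogStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NginxLogStreaming")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed broker address and topic name; adjust to the real cluster.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("nginx_log")

    // Direct stream: each record's value is one raw Nginx log line.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    // From here the same ip2Long/binarySearch logic as in the batch job below applies.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}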
First, there must be a rule table for IP attribution, stored either locally or distributed across multiple machines (for example, on HDFS). A section of the IP rule table is shown below:

1.0.1.0|1.0.3.255|16777472|16778239|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.2.2.0|1.2.2.255|16908800|16909055|Asia|China|Beijing|Beijing|Haidian|North Dragon Middle NET|110108|China|CN|116.29812|39.95931
1.2.4.0|1.2.4.255|16909312|16909567|Asia|China|Beijing|Beijing||China Internet Information Center|110100|China|CN|116.405285|39.904989
1.2.5.0|1.2.7.255|16909568|16910335|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
1.2.8.0|1.2.8.255|16910336|16910591|Asia|China|Beijing|Beijing||China Internet Information Center|110100|China|CN|116.405285|39.904989
1.2.9.0|1.2.127.255|16910592|16941055|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.3.0.0|1.3.255.255|16973824|17039359|Asia|China|Guangdong|Guangzhou||Telecom|440100|China|CN|113.280637|23.125178
1.4.1.0|1.4.3.255|17039616|17040383|Asia|China|Fujian|Fuzhou||Telecom|350100|China|CN|119.306239|26.075302
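
Each record holds the start and end of an IP range in dotted form (fields 1 and 2) and in numeric form (fields 3 and 4), followed by continent, country, province, city, district, ISP, an administrative area code, country again, country code, and the longitude/latitude of the city. The numeric form folds the four octets into a long, shifting left eight bits per octet; a quick check (illustrative snippet, not from the original post):

// For 1.0.1.0: ((1 * 256 + 0) * 256 + 1) * 256 + 0 = 16777472,
// which matches field 3 of the first record above.
val ipNum = "1.0.1.0".split("[.]").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)
assert(ipNum == 16777472L)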

Local mode

import java.sql.{Connection, Date, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Compute the province of each access IP from a rule file.
 * Created by Tianjun on 2017/2/13.
 */
object IpLocation {

  // Convert a dotted-quad IP string to its numeric (Long) form.
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  // Binary-search the sorted rules for the range containing ip; returns -1 if no match.
  def binarySearch(lines: Array[(String, String, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1.toLong) && (ip <= lines(middle)._2.toLong)) {
        return middle
      }
      if (ip < lines(middle)._1.toLong) {
        high = middle - 1
      } else {
        low = middle + 1
      }
    }
    -1
  }

  // Write one partition's (province, count) pairs to MySQL.
  val data2MySQL = (iterator: Iterator[(String, Int)]) => {
    var conn: Connection = null
    var ps: PreparedStatement = null
    val sql = "INSERT INTO location_info (location, counts, access_date) VALUES (?, ?, ?)"
    try {
      conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/bigdata?useUnicode=true&characterEncoding=utf-8",
        "root", "123")
      iterator.foreach(line => {
        ps = conn.prepareStatement(sql)
        ps.setString(1, line._1)
        ps.setInt(2, line._2)
        ps.setDate(3, new Date(System.currentTimeMillis()))
        ps.executeUpdate()
      })
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (ps != null) ps.close()
      if (conn != null) conn.close()
    }
  }

  def main(args: Array[String]) {
    // Only needed to avoid a winutils error on Windows; not required on Linux.
    System.setProperty("hadoop.home.dir", "c:\\tianjun\\winutil\\")

    val conf = new SparkConf().setMaster("local").setAppName("IpLocation")
    val sc = new SparkContext(conf)

    // Load the IP rules (they could also come from HDFS or another shared store).
    val ipRulesRdd = sc.textFile("c://ip.txt").map(line => {
      val fields = line.split("\\|")
      val startNum = fields(2)
      val endNum = fields(3)
      val province = fields(6)
      (startNum, endNum, province)
    })

    // Collect all mapping rules to the driver, then broadcast them to the executors.
    val ipRulesArray = ipRulesRdd.collect()
    val ipRulesBroadcast = sc.broadcast(ipRulesArray)

    // Load the data to be processed; the second pipe-delimited field is the client IP.
    val ipsRdd = sc.textFile("c://log").map(line => {
      val fields = line.split("\\|")
      fields(1)
    })

    // Look up each IP's rule, then accumulate the results for each province.
    val result = ipsRdd.map(ip => {
      val ipNum = ip2Long(ip)
      val index = binarySearch(ipRulesBroadcast.value, ipNum)
      val info = ipRulesBroadcast.value(index) // (start num, end num, province)
      info
    }).map(t => (t._3, 1)).reduceByKey(_ + _)

    result.foreachPartition(data2MySQL)
    // println(result.collect().toBuffer)
    sc.stop()
  }
}
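
As a quick sanity check of the two helpers (a hypothetical snippet using two rules from the sample table above, not part of the original post):

// Rules as loaded by the job: (start num, end num, province).
val rules = Array(
  ("16777472", "16778239", "Fujian"),   // 1.0.1.0 - 1.0.3.255
  ("16779264", "16781311", "Guangdong") // 1.0.8.0 - 1.0.15.255
)

// 1.0.2.10 -> 16777738, which falls in the first range.
val idx = IpLocation.binarySearch(rules, IpLocation.ip2Long("1.0.2.10"))
println(rules(idx)._3) // prints "Fujian"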

As you can see, analyzing the data with Spark's operators is very easy. Spark's integration with Kafka, databases, and other systems is covered on the Spark website and is just as straightforward. Besides MySQL, the results can also be written to Redis, as sketched below.
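
A minimal sketch of the Redis variant, using the Jedis client (the host, port, and hash key name are assumptions for illustration; the original post only shows the MySQL path):

import redis.clients.jedis.Jedis

// Write one partition's (province, count) pairs into a Redis hash.
val data2Redis = (iterator: Iterator[(String, Int)]) => {
  // "localhost:6379" and the key "province_counts" are assumed values.
  val jedis = new Jedis("localhost", 6379)
  try {
    iterator.foreach { case (province, count) =>
      jedis.hincrBy("province_counts", province, count.toLong)
    }
  } finally {
    jedis.close()
  }
}

// Used exactly like data2MySQL: result.foreachPartition(data2Redis)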
Let's take a look at the results written to the database in this example:

+----+-----------+--------+---------------------+
| id | location  | counts | access_date         |
+----+-----------+--------+---------------------+
|  7 | Shaanxi   |   1824 | 2017-02-13 00:00:00 |
|  8 | Hebei     |    383 | 2017-02-13 00:00:00 |
|  9 | Yunnan    |    126 | 2017-02-13 00:00:00 |
| 10 | Chongqing |    868 | 2017-02-13 00:00:00 |
| 11 | Beijing   |   1535 | 2017-02-13 00:00:00 |
+----+-----------+--------+---------------------+

In this test, only about 4,700 Nginx log lines were intercepted, a file of roughly 1.9 MB.
