* The goal is anti-scraping: we need real-time monitoring of visitor IPs based on the site's access log.
1. Kafka version: 0.10.0.0 (the latest at the time of writing)
2. Spark version: 1.6.1
3. Download the matching spark-streaming-kafka-assembly_2.10-1.6.1.jar and place it in the lib directory under the Spark installation directory.
4. Use Flume to write the Nginx access log to Kafka (details to be added in a later post; a minimal producer sketch is shown below).
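Since the Flume configuration is deferred, here is a minimal, hypothetical stand-in (using the kafka-python package, which is not part of the original setup) that pushes page events into the same topic. Events are sent as Python dict literals because the parse() function in the script below reads them back with eval() and expects 'ip' and 'tj-event' keys; the broker address and the example values are assumptions.

# Hypothetical stand-in for the Flume step (assumes the kafka-python package is installed).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka-ip:9092')   # replace with your broker list
event = {'ip': '1.2.3.4', 'tj-event': 'onload'}               # one record per page load
producer.send('statis-detailinfo-pageevent', str(event).encode('utf-8'))
producer.flush()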
5. Write a Python script named test_spark_collect_ip.py:
# coding:utf-8
__author__ = 'chenhuachao'
'''Use pyspark to consume from Kafka, count visitor IPs, and flag abusers in real time (anti-scraping).'''
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import redis
import datetime
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkConf, SparkContext


def parse(logstring):
    '''Extract the client IP from an "onload" page event.'''
    try:
        infodict = eval(logstring.encode('utf-8'))
        ip = infodict.get('ip')
        assert infodict['tj-event'] == 'onload'
        assert ip
        return (ip)
    except:
        return ()


def insert_redis(rdd):
    '''Write qualifying results (IPs seen at least 3 times in the window) to Redis.'''
    conn = redis.Redis(host='redis-ip', port=6380)   # replace with your Redis host
    for i, j in rdd.collect():
        print i, j
        if j >= 3 and i:                             # skip empty keys from failed parses
            key = 'cheating_ip_set_{0}'.format(datetime.datetime.now().strftime("%Y%m%d"))
            conn.sadd(key, i)
            conn.expire(key, 86400)


if __name__ == "__main__":
    topic = 'statis-detailinfo-pageevent'
    sc = SparkContext(appName="PYSPARK_KAFKA_STREAMING_CHC")
    ssc = StreamingContext(sc, 10)                   # 10-second batch interval
    checkpointDirectory = '/tmp/checkpoint/cp3'
    ssc.checkpoint(checkpointDirectory)
    kvs = KafkaUtils.createDirectStream(
        ssc, [topic],
        kafkaParams={"auto.offset.reset": "largest",
                     "metadata.broker.list": "kafka-ip:9092,kafka-ip:9092"})
    # kvs.map(lambda line: line[1]).map(lambda x: parse(x)).pprint()
    # This uses a sliding window (30-second window, sliding every 10 seconds); for the concept see
    # http://www.kancloud.cn/kancloud/spark-programming-guide/51567
    # Without a window it would be a plain per-batch count:
    # ipcount = kvs.map(lambda line: line[1]).map(parse).map(lambda ip: (ip, 1)).reduceByKey(lambda ips, num: ips + num)
    ipcount = kvs.map(lambda line: line[1]) \
                 .map(lambda x: parse(x)) \
                 .map(lambda ip: (ip, 1)) \
                 .reduceByKeyAndWindow(lambda ips, num: ips + num, None, 30, 10)
    # If an RDD is reused in several computations, cache() it first.
    # Each batch's windowed counts are collected on the driver and written to Redis,
    # in the spirit of wordCounts.foreachRDD(lambda rdd: rdd.foreach(sendRecord)) from the Spark docs.
    ipcount.foreachRDD(insert_redis)
    ssc.start()
    ssc.awaitTermination()   # keep the streaming job alive
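Once the job is running, the flagged IPs for the current day accumulate in a Redis set named cheating_ip_set_<YYYYMMDD>. The following is a minimal sketch of reading that set back, for example to feed an Nginx deny list; the host, port, and key prefix simply mirror the script above and should be adjusted to your environment.

# Read back today's flagged IPs from the set the streaming job maintains.
import datetime
import redis

conn = redis.Redis(host='redis-ip', port=6380)   # same Redis instance the job writes to
key = 'cheating_ip_set_{0}'.format(datetime.datetime.now().strftime("%Y%m%d"))
for ip in conn.smembers(key):
    print ip   # e.g. append "deny <ip>;" to an Nginx blacklist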
6. Run the job, shipping the Kafka assembly jar:
bin/spark-submit --jars lib/spark-streaming-kafka-assembly_2.10-1.6.1.jar test_spark_collect_ip.py
7. Output
[screenshot of the streaming job's console output]
8. For more information, see the Spark documentation: http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#module-pyspark.streaming.kafka
This article is from the "People on the Run" blog; please keep this source when republishing: http://leizhu.blog.51cto.com/3758740/1788742
Spark + Kafka + Redis: real-time statistics of website visitor IPs