The content of this lesson:
An advertising billing system is an essential feature of e-commerce. To prevent malicious ad clicks (suppose merchants A and B advertise at the same time and are competitors; if A uses a click bot to maliciously click B's ads, B's advertising budget will soon be exhausted), ad clicks must be checked against a blacklist.
You can use leftOuterJoin to correlate the click data with the blacklist data, then filter out the records that hit the blacklist.
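To make the leftOuterJoin-then-filter idea concrete before the full Spark program, here is a minimal sketch that emulates the same semantics on plain Scala collections (no Spark dependency; the names and sample data are illustrative, not from the original post):

```scala
object LeftOuterJoinDemo {
  // Ad clicks keyed by name: (name, rawLine), mirroring the DStream's map step
  val clicks: Seq[(String, String)] = Seq(
    ("Hadoop", "134343 Hadoop"),
    ("spark",  "343434 spark"),
    ("Java",   "3432777 Java")
  )

  // Blacklist keyed by name; true means the entry is active
  val blackList: Map[String, Boolean] = Map("Hadoop" -> true)

  // Emulate RDD.leftOuterJoin on collections: every left record is kept,
  // paired with Some(flag) when the key is blacklisted, None otherwise
  def leftOuterJoin[V, W](left: Seq[(String, V)],
                          right: Map[String, W]): Seq[(String, (V, Option[W]))] =
    left.map { case (k, v) => (k, (v, right.get(k))) }

  // Keep only clicks whose blacklist flag is absent or false
  def validClicks: Seq[String] =
    leftOuterJoin(clicks, blackList)
      .filter { case (_, (_, flag)) => !flag.getOrElse(false) }
      .map { case (_, (line, _)) => line }

  def main(args: Array[String]): Unit =
    validClicks.foreach(println) // the "Hadoop" click is dropped
}
```

On real RDDs, leftOuterJoin returns the same `(K, (V, Option[W]))` shape, so the filter predicate carries over unchanged.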
This article mainly introduces the use of the transform function of DStream.
Spark Streaming Code Implementation
package com.dt.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark online blacklist filtering, developed in Scala for a cluster.
 * @author DINGLQ
 * Background: in an ad-click billing system we filter out blacklisted clicks
 * online, protecting advertisers' interests so that only valid ad clicks are
 * billed. The same idea applies to anti-fraud scoring (or traffic) systems
 * that filter out invalid votes, ratings, or traffic.
 * Technique: use the transform API to perform join operations directly at the
 * RDD level.
 */
object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /**
     * Step 1: create the SparkConf object to set the runtime configuration of
     * the Spark program. For example, use setMaster to set the URL of the
     * master of the Spark cluster the program connects to; if it is set to
     * "local", the program runs locally, which suits beginners with very
     * limited machines (e.g. only 1 GB of memory).
     */
    val conf = new SparkConf().setAppName("OnlineBlackListFilter") // set the app name

    // set the batch interval to 30 seconds
    val ssc = new StreamingContext(conf, Seconds(30))

    /**
     * Blacklist data preparation. In practice the blacklist is usually
     * dynamic, e.g. kept in Redis or a database, and generating it often
     * involves complex business logic that varies case by case; but Spark
     * Streaming can access the complete blacklist on every batch.
     */
    // true means blacklisted; to disable an entry temporarily, set the value to false
    val blackList = Array(("Hadoop", true), ("Mathou", true))
    // turn the array into an RDD
    val blackListRDD = ssc.sparkContext.parallelize(blackList)

    val adsClickStream = ssc.socketTextStream("spark-master", 9999)

    /**
     * The format of each ad-click record here is: time name
     * The map operation below produces records in (name, "time name") format,
     * i.e. keyed by name.
     */
    val formattedAdsClickStream = adsClickStream.map(item => (item.split(" ")(1), item))

    val validClicked = formattedAdsClickStream.transform(userClickRDD => {
      /**
       * leftOuterJoin keeps all records of the left (user ad-click) RDD and
       * attaches the matching blacklist entry, if any.
       */
      val joinedBlackListRDD = userClickRDD.leftOuterJoin(blackListRDD)

      /**
       * The filter's input element is a tuple: (name, ("time name", Option[Boolean])).
       * The first element is the name; the second part of the second element
       * tells whether the leftOuterJoin found a blacklist entry for it. If it
       * did, the current ad click is blacklisted and must be filtered out;
       * otherwise it is a valid click and is kept.
       */
      joinedBlackListRDD.filter(joinedItem => !joinedItem._2._2.getOrElse(false))
    })

    // extract the original "time name" line from each surviving record
    val validClickedData = validClicked.map(validClick => validClick._2._1)
    validClickedData.print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
Package the program and upload it to the Spark cluster.
On the spark-master node, start nc:
# nc -lk 9999
Run the OnlineBlackListFilter program:
# /usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.streaming.OnlineBlackListFilter --master spark://spark-master:7077 ./spark.jar
Enter data on the nc side:
# nc -lk 9999
134343 Hadoop
343434 spark
3432777 Java
0983743 Hbase
893434 Mathou
Spark Streaming running results:
16/05/01 09:42:30 INFO scheduler.DAGScheduler: ResultStage 8 (print at OnlineBlackListFilter.scala:63) finished in 0.048 s
16/05/01 09:42:30 INFO scheduler.DAGScheduler: Job 3 finished: print at OnlineBlackListFilter.scala:63, took 0.111805 s
-------------------------------------------
Time: 1462066950000 ms
-------------------------------------------
3432777 Java
343434 spark
0983743 Hbase
As the results show, Hadoop and Mathou, the entries set in the blacklist, have been filtered out.
On the basis of this program, more complex business logic rules can be added to meet enterprise needs.
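One natural extension, given that the code comments note the blacklist is usually dynamic (kept in Redis or a database): reload it inside the transform closure so every batch joins against the latest version. The sketch below is illustrative only; loadBlacklist() is a hypothetical helper standing in for whatever store you use.

```scala
// Sketch: refresh the blacklist once per batch inside transform.
// loadBlacklist() is a hypothetical helper that would read Redis or a
// database and return Seq[(String, Boolean)].
val validClicked = formattedAdsClickStream.transform { userClickRDD =>
  val currentBlackList = loadBlacklist()
  val blackListRDD = ssc.sparkContext.parallelize(currentBlackList)
  userClickRDD.leftOuterJoin(blackListRDD)
    .filter { case (_, (_, flag)) => !flag.getOrElse(false) } // drop blacklisted clicks
    .map { case (_, (line, _)) => line }                      // keep the raw "time name" line
}
```

Because transform runs its body on the driver for each batch, the blacklist read happens once per batch interval, not once per record.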
Note:
1. DT Big Data Dream Factory public account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
This article is from the "Ding Dong" blog; please be sure to keep this source: http://lqding.blog.51cto.com/9123978/1769290
Lesson 94: Spark Streaming Implementation of Online Blacklist Filtering in an Ad Billing System