Spark Streaming: Online Blacklist Filter for Ad Clicks

Source: Internet
Author: User

Task

Online blacklist filter for ad clicks
Use
nc -lk 9999
to open a data-sending port, then enter some data, such as:

1375864674543 Tom
1375864674553 Spy
1375864674571 Andy
1375864688436 Cheater
1375864784240 Kelvin
1375864853892 Steven
1375864979347 John
Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineBlacklistFilter {
  def main(args: Array[String]): Unit = {
    /**
     * Step 1: Create a SparkConf object to hold the configuration of the Spark program at runtime.
     * "local" would run the program locally, which suits machines with very limited resources.
     */
    // Create the SparkConf object
    val conf = new SparkConf()
    // Set the application name; it is visible in the monitoring UI while the program runs
    conf.setAppName("OnlineBlacklistFilter")
    // This time the program runs on the Spark cluster
    conf.setMaster("spark://Master:7077")

    val ssc = new StreamingContext(conf, Seconds(30))

    /**
     * Blacklist data preparation. In practice the blacklist is usually dynamic, e.g. kept in Redis
     * or a database, and generating it often involves complex business logic that differs from case
     * to case. What matters is that Spark Streaming can access the complete blacklist every time it
     * processes a batch.
     */
    val blacklist = Array(("Spy", true), ("Cheater", true))
    val blacklistRDD = ssc.sparkContext.parallelize(blacklist, 8)

    val adsClickStream = ssc.socketTextStream("Master", 9999)

    /**
     * Each ad-click record here has the format: time name
     * The map operation below turns it into: (name, "time name")
     */
    val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }

    adsClickStreamFormatted.transform(userClickRDD => {
      // leftOuterJoin keeps every user click record on the left side and attaches
      // the matching blacklist entry, if there is one
      val joinedBlacklistRDD = userClickRDD.leftOuterJoin(blacklistRDD)

      /**
       * The input element of the filter is a tuple: (name, ("time name", Option[Boolean])).
       * The first element is the name; the second element of the second element is the value
       * attached by leftOuterJoin. If it is present, the click comes from the blacklist and must
       * be filtered out; otherwise it is a valid click.
       */
      val validClicked = joinedBlacklistRDD.filter(joinedItem => {
        if (joinedItem._2._2.getOrElse(false)) {
          false
        } else {
          true
        }
      })

      validClicked.map(validClick => validClick._2._1)
    }).print()

    /**
     * The valid clicks computed here would normally be written to Kafka; the downstream billing
     * system then pulls the valid data from Kafka for billing.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}

Note: change the program's batch interval from 30 seconds to 300 seconds:

val ssc = new StreamingContext(conf, Seconds(300))

Run the previously generated jar package with spark-submit:
/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.test.spark.sparkstreaming.Filter --master spark://Master:7077 /root/Documents/SparkApps/Filter.jar

Analysis
    • 5 jobs in total

Job 0: does not reflect the business logic of our code; it is a computation Spark performs for subsequent load-balancing considerations.

Job 0 contains Stage 0 and Stage 1.
Taking Stage 1 as an example, its Aggregated Metrics by Executor section shows:

The stage is present on all executors.

    • Job 1: long running time, 1.5 minutes

      Stage 2, Aggregated Metrics by Executor section:

      Stage 2 runs on only one executor on one worker (out of 4 workers) and takes 1.5 minutes. From the point of view of the business logic we sent only a little data, so there is no need for a task that runs for 1.5 minutes. So what is this task doing? The DAG Visualization section shows that this job actually starts a receiver to receive data:

      A receiver is started by a job, so there must be an action that triggers that job.
      Tasks section:

      Only one worker runs this job; it is dedicated to receiving data.
      The locality level is PROCESS_LOCAL, meaning the data sits in memory on that node. So by default, received data is not written to disk but is used directly from memory. After a Spark Streaming application starts, it launches some jobs of its own; by default one job is started to receive data and prepare it for subsequent processing. A single Spark application can start many jobs, and these different jobs can cooperate with each other. (A short sketch of the core-count consequence of the receiver occupying an executor follows this job list.)

    • Job 2: looking at the details, we find the main business logic of our program, embodied in Stage 3, Stage 4 and Stage 5.

Stage 3 and Stage 4 details: both stages execute on 4 executors, so all the data processing is carried out on the 4 machines.

Stage 5 runs only on Worker4, because this stage involves a shuffle operation.

    • Job 3: contains Stage 6, Stage 7 and Stage 8, of which Stage 6 and Stage 7 are skipped.

      Stage 8, Aggregated Metrics by Executor section: as you can see, the data processing is performed on the 4 machines.
    • Job 4: also embodies the business logic of our application. It contains Stage 9, Stage 10 and Stage 11, of which Stage 9 and Stage 10 are skipped.

      Details of Stage 11: as you can see, data processing is performed on the 3 machines other than Worker2.
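
As a side note to Job 1 above: the receiver occupies one executor slot for the entire run, so a Spark Streaming application always needs more cores than receivers. Below is a minimal sketch of the same point for local testing; the local master URL is an assumption for illustration, since the example itself runs on a standalone cluster.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The receiver occupies one thread for the whole run, so "local" or "local[1]"
// would leave no thread free for processing the received batches.
val conf = new SparkConf()
  .setAppName("OnlineBlacklistFilter")
  .setMaster("local[2]") // at least 2 threads: one for the receiver, one for batch processing
val ssc = new StreamingContext(conf, Seconds(30))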
Summary

Spark Streaming Essence

Spark Streaming receives real-time input data from a variety of sources such as Kafka, Flume, HDFS, and Kinesis; after processing, the results are stored in various places such as HDFS, databases, and so on.
Spark Streaming receives these live input streams, divides them into batches, and hands the batches to the Spark engine for processing, producing the result stream in batches as well.
Spark Streaming provides a high-level abstraction called DStream (discretized stream) that represents a continuous stream of data. A DStream is essentially a sequence of RDDs, and any operation on a DStream is turned into operations on the underlying RDDs.
Spark Streaming creates a DStream from the data stream produced by a data source, or you can apply operations to an existing DStream to create a new one.
The preceding code receives a batch of data every 300 seconds; an RDD is generated for that batch, which triggers a job and performs the processing. A DStream is an unbounded collection with no size limit. It represents a space-time concept: as time goes by, it keeps producing RDDs. Once a time slice is fixed, the operation becomes one over space, i.e. processing the batch of data that corresponds to that time slice.
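
A minimal sketch of this space-time view, reusing the adsClickStream defined in the code above: for each batch interval the DStream yields exactly one RDD, stamped with its batch time, and whatever is done with that RDD is the "spatial" part of the processing.

// Assumes the ssc and adsClickStream from the example above.
// For every batch interval the DStream produces one RDD for that time slice;
// the body of foreachRDD is ordinary (spatial) RDD processing of that batch.
adsClickStream.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} click records")
}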

When a Spark Streaming program is converted into the jobs that Spark actually executes, the program generally contains several DStream operations; the DStreamGraph captures the dependencies between these operations.
Conversion from the program to the DStreamGraph:

Starting from each output operation (such as foreachRDD or print), backtracking takes place; the DStreamGraph is formed by tracing the dependencies among these operations backwards.
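
A small illustration of that backtracking, reusing adsClickStream from the example (the second output operation and the val name are hypothetical, for the sketch only): each output operation registers an output DStream, the graph is traced backwards from those outputs, and each output operation therefore gives rise to one job per batch.

// Two output operations on the same lineage => two output DStreams in the DStreamGraph,
// and therefore two jobs generated per batch interval.
val formatted = adsClickStream.map(ads => (ads.split(" ")(1), ads))
formatted.print()                                   // output operation 1
formatted.foreachRDD(rdd => println(rdd.count()))   // output operation 2 (hypothetical)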
Performing the conversion from DStream to RDD also forms the RDD graph:

Once the spatial dimension is determined, the RDD graph is instantiated again and again as time progresses, and each instantiation triggers a job that performs the processing.

Going deeper into the official documentation (excerpted from Wang's books):


Each step of a Spark Core computation is based on RDDs, and there are dependencies between the RDDs. The DAG in the figure shows 3 actions, which will trigger 3 jobs; the RDDs depend on each other from the bottom up, and it is the jobs generated from the RDDs that are actually executed. From the DStream graph you can see that the logic of DStreams is basically consistent with that of RDDs: DStreams are built on top of RDDs with a time dependency added. The RDD DAG can be called the spatial dimension, which means the whole of Spark Streaming adds a time dimension on top of it, so it can be described as having both space and time dimensions.
From this perspective, Spark Streaming can be placed in a coordinate system: the y-axis is the operations on RDDs, whose dependencies form the logic of the entire job, and the x-axis is time. As time passes, each fixed interval (the batch interval) generates a job instance that runs in the cluster.

For Spark Streaming, as data flows in from different data sources, each fixed time interval yields a series of immutable data sets or event collections (for example from Flume and Kafka). This coincides with RDDs being based on a fixed set of data; in fact, the RDD graph derived from the DStream at a fixed time interval is based on the data set of one batch.
As can be seen, the spatial dimension (the RDD dependencies) is the same for every batch; what differs is the size and content of the data flowing into each of the five batches, which produce different instances of those RDD dependencies. The RDD graph is thus derived from the DStreamGraph; in other words, a DStream is the template for RDDs, and different time intervals generate different RDD graph instances.
Looking at Spark Streaming itself, it needs:
1. A template for generating the RDD DAG: the DStreamGraph
2. A timeline-based job controller
3. InputStreamings and OutputStreamings, representing the input and output of the data
4. The concrete jobs run on the Spark cluster, so for streaming the system's fault tolerance is critical, as is whether the cluster can digest (keep up with) the incoming data
5. Transaction processing: we want the incoming data to be processed, and processed exactly once; how do we guarantee exactly-once transaction semantics in the event of a crash? (A minimal checkpointing sketch follows this list.)
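
As a sketch for points 4 and 5 (the checkpoint directory is a hypothetical path, and this shows only driver recovery, not a complete exactly-once pipeline): checkpointing together with StreamingContext.getOrCreate is the usual starting point for fault tolerance in Spark Streaming.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint location; in practice this should be a reliable store such as HDFS.
val checkpointDir = "hdfs://Master:9000/checkpoint/onlineBlacklistFilter"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("OnlineBlacklistFilter")
  val ssc = new StreamingContext(conf, Seconds(300))
  ssc.checkpoint(checkpointDir) // checkpoint metadata (and state) for recovery
  // ... define the DStream operations here, exactly as in the example above ...
  ssc
}

// After a driver crash, the context is rebuilt from the checkpoint rather than
// created from scratch, so the streaming computation can resume.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()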

As can be seen from this, the DStream is the core of Spark Streaming, just as the RDD is the core of Spark Core; like the RDD, it has dependencies and a compute method. Even more critical is the following piece of code:
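
The snippet is not reproduced on this page; it is, approximately, the generatedRDDs field of DStream in the Spark source (Spark 1.x, quoted approximately, so treat it as a sketch):

// From org.apache.spark.streaming.dstream.DStream (approximate excerpt):
// the RDDs generated so far, keyed by the batch time that produced them.
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()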

This is a HashMap with Time as the key and an RDD as the value, which again shows that RDDs are generated constantly over time, that jobs are generated from their dependencies, and that these jobs run on the cluster through the JobScheduler. Once again, the DStream is the template for the RDD.
Dstream can be said to be the logical level and the RDD the physical level; what a DStream expresses is ultimately realized through transformations of RDDs. The former is the higher-level abstraction, the latter the underlying implementation. A DStream is in effect the encapsulation, along the time dimension, of a set of RDDs, and operating on a DStream means operating on the RDDs of fixed time slices.
Summary:
In the spatial dimension, the business logic acts on the DStream. As time goes by, each batch interval produces a concrete data set, from which an RDD is generated; the RDD is transformed, and the RDD dependencies form an RDD DAG, which in turn forms a job. The JobScheduler then, based on the time schedule and the RDD dependencies, submits the job to the Spark cluster to run, continuously generating Spark jobs.
