Spark version customization: a thorough understanding of Spark Streaming through a case study


Contents of this issue:

1. A Spark Streaming alternative online experiment

2. Instantly understanding the essence of Spark Streaming

Q: Why approach Spark source-code customization through Spark Streaming?
    1. Spark did not begin with Spark Streaming, Spark SQL, Spark ML, Spark R, Spark GraphX and the other sub-frameworks; at the start there was only the original Spark Core. Spark Streaming itself is a framework on top of Spark Core, and by studying one such framework thoroughly you can become thoroughly proficient in every aspect of Spark;
    2. Spark SQL, the most widely used module apart from Spark Core, is not suitable as the concrete sub-framework through which to study Spark in depth, because it involves too many details of SQL parsing and optimization. Spark R has limited and immature functionality, so it is also ruled out. Recent releases of Spark GraphX have shown little improvement, which suggests its development has essentially come to an end, and graph computation involves many mathematical algorithms. Spark ML builds a large library of algorithms by combining Vector, Matrix, and RDDs, so it also requires a lot of mathematical knowledge. Weighing all of this, Spark Streaming is chosen as the starting point for customizing the Spark version;
    3. Spark Streaming is the most widely used, most outstanding, and most attractive technology among Spark engineers.
Q: What kind of magic does Spark Streaming have?
    1. Spark Streaming is streaming computation. In the stream-processing era, any data that is not processed as a stream, or that is unrelated to stream processing, is essentially invalid data;
    2. Streaming is really the original impression of big data: data flows in and feedback is given immediately, rather than via offline batch processing or data mining. Most powerfully, Spark Streaming can use the results of machine learning, graph computation, Spark SQL, or Spark R online, which is the source of Spark's unified stack;
    3. Spark Streaming is where programs are most prone to problems, because data keeps flowing in and the system must dynamically control data flow, job generation, and data processing. It is the most error-prone part, and also the part where personal value is easiest to demonstrate;
    4. Spark Streaming is very much like an application on top of Spark Core; if you are proficient with Spark Streaming, any other part of Spark is no problem;
    5. Spark Streaming wins at the tipping point: it anticipates what the program is going to do next.
First, the Spark Streaming alternative online experiment

Case description: online blacklist filtering for ad clicks. In the ad-click billing system, blacklisted clicks are filtered out online to protect advertisers' interests, so that only valid ad clicks are billed.

Experimental technique: in practice we would generally set the batch interval small, such as 5s, 10s, or 30s. Here we use the technique of enlarging the batch interval and set it to 300s. Enlarging it has no effect on how the job executes, but it makes it easier to observe the inflow of data, the execution of jobs, and so on.

Experiment code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /**
     * Step 1: create the Spark configuration object SparkConf and set the runtime configuration
     * of the Spark program, e.g. use setMaster to set the URL of the Master of the Spark cluster
     * the program connects to. Setting it to "local" means the Spark program runs locally, which
     * is especially suitable for beginners whose machines are poorly configured (for example,
     * with only 1G of memory).
     */
    val conf = new SparkConf()               // create the SparkConf object
    conf.setAppName("OnlineBlackListFilter") // set the application name, visible in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077")    // the program now runs on the Spark cluster

    // val ssc = new StreamingContext(conf, Seconds(30))
    // changed from 30s to 300s
    val ssc = new StreamingContext(conf, Seconds(300))

    /**
     * Blacklist data preparation. In practice the blacklist is usually dynamic, e.g. kept in Redis
     * or a database, and generating it often involves complex business logic that differs case by
     * case. When Spark Streaming processes each batch, however, it can access the complete blacklist.
     */
    val blackList = Array(("Hadoop", true), ("Mahout", true))
    val blackListRDD = ssc.sparkContext.parallelize(blackList, 8)

    val adsClickStream = ssc.socketTextStream("Master", 9999)

    /**
     * The format of each ad-click record here is: time name
     * The result of the map operation below has the format: (name, (time, name))
     */
    val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }

    adsClickStreamFormatted.transform(userClickRDD => {
      // The leftOuterJoin operation keeps all the user ad-click content on the left side and,
      // for each click, looks up whether the name appears in the blacklist.
      val joinedBlackListRDD = userClickRDD.leftOuterJoin(blackListRDD)

      /**
       * filter: its input element is a tuple (name, ((time, name), Option[Boolean])).
       * The first element is the name checked against the blacklist; the second element of the
       * value is the flag produced by leftOuterJoin. If the flag is present and true, the click
       * comes from the blacklist and must be filtered out; otherwise it is a valid click.
       */
      val validClicked = joinedBlackListRDD.filter(joinedItem => {
        if (joinedItem._2._2.getOrElse(false)) {
          false
        } else {
          true
        }
      })

      validClicked.map(validClick => { validClick._2._1 })
    }).print()

    /**
     * The computed valid data would generally be written to Kafka, and the downstream billing
     * system would pull the valid data from Kafka for billing.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}

  

Experimental steps: 1. Start the spark cluster and spark history server process (view the job's execution trajectory); 2, the master node executes the command NC-LK 9999 starts the data sending service, executes the script sparkstreamingapps.sh starts the job; 3, observe the results of the implementation:Enter some data on the data send port, for example: 2255554 Spark 455554444 Hadoop 55555 Flink 6 6666 Kafka 6666855 rockyspark 666638 Scala 66666 Dt_spark by browsing To view the Spark history Server information: Click the first job to open: Job0 Insider:                 
There are 5 completed jobs, yet we actually submitted only one application. A lot of insider information can be analyzed here: the 5 jobs, from top to bottom, are receiver, print, print, print, and start.

Click into the DAG visualization of the job corresponding to start. From Job 0's DAG diagram and the case code we can see that this job does not correspond to our business logic code, so we can draw the following conclusion: Spark Streaming starts some jobs of its own while it runs. For example, clicking into Stage 1 and looking at the Aggregated Metrics by Executor section, we discover that the job runs on 4 nodes, maximizing the use of cluster resources, which also fully demonstrates that Spark Streaming is itself an application.

Job 1 insider: question: our application does not contain a job that runs for 1.5 min, so why is there a task that has been running for 1.5 min? It is the data receiver (Receiver), which keeps looping to receive data and therefore needs to run continuously. The Receiver is a job! The Receiver is started through a job! The Receiver runs on an Executor as a task, and receiving data is no different from a normal job. The inspiration this gives us is that a Spark application can start many jobs, and different jobs can cooperate with each other (a minimal sketch follows this analysis), which lays a good foundation for writing complex applications; complex programs must be composed of multiple jobs! From the tasks in the figure we can also see that the locality level is PROCESS_LOCAL, i.e. the data is not fetched from another node. Spark Streaming receives data with the MEMORY_AND_DISK_SER_2 storage level by default; because the amount of data here is small, receiving the data does not touch disk but uses in-memory data directly.

Job 2 insider: we can see that Job 2 is mainly responsible for executing the program's business logic code! Its tasks are scattered across the Executors, making full use of the cluster's resources.
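The point that different jobs in one Spark application can cooperate deserves a tiny illustration. The following is a minimal sketch in plain Spark Core, not part of the case above; the application name and the local master are illustrative only. One application triggers two jobs, and the second job reuses what the first job cached.

import org.apache.spark.{SparkConf, SparkContext}

object MultiJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiJobSketch").setMaster("local[2]"))

    val base  = sc.parallelize(1 to 100)
    val evens = base.filter(_ % 2 == 0).cache()   // cached so later jobs can reuse it

    println(evens.count())         // action 1 -> job 1: computes and caches the filtered data
    println(evens.reduce(_ + _))   // action 2 -> job 2: cooperates by reusing the cached RDD

    sc.stop()
  }
}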

2 Instantly understanding the essence of Spark Streaming

The Spark Streaming data flow shown in the figure can be described as follows:

Spark Streaming receives real-time input data from a variety of sources such as Kafka, Flume, HDFS, and Kinesis; after processing, the results are stored in various places such as HDFS and databases.

Spark Streaming receives these live input streams and divides the data into batches, which are then handed to the Spark engine for processing; the result stream is likewise generated in batches.

Spark Streaming provides a high-level abstraction called DStream (discretized stream) that represents a continuous stream of data. A DStream is essentially a sequence of RDDs, and any operation on a DStream is turned into operations on the underlying RDDs.

Spark Streaming creates a DStream from the data stream produced by a data source, or new DStreams can be created by applying transformations to an existing DStream, as the sketch below illustrates.
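A minimal sketch of these two ways of obtaining a DStream, assuming the same socket source and 300s batch interval as the case above; the application name is illustrative, and foreachRDD is used only to show that each batch of a DStream is one underlying RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamSketch").setMaster("spark://Master:7077")
    val ssc = new StreamingContext(conf, Seconds(300))

    // 1) a DStream created directly from a data source
    val lines = ssc.socketTextStream("Master", 9999)

    // 2) new DStreams created by transforming an existing DStream
    val words = lines.flatMap(_.split(" "))

    // each batch of a DStream is one RDD; foreachRDD exposes that underlying RDD
    words.foreachRDD { (rdd, time) =>
      println(s"batch at $time contains ${rdd.count()} words")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}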

DStream features:

A DStream is an unbounded collection; it has no size limit.

A DStream embodies the concept of space and time: as time goes on, it keeps producing RDDs internally.

Once locked to a particular time interval, the operation becomes an operation on space, that is, the processing of the data in the batch corresponding to that interval (see the window sketch below).
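A small hedged illustration of this time/space idea, assuming the words DStream, the Seconds import, and the 300s batch interval from the sketch above; reduceByKeyAndWindow first selects RDDs along the time axis (a 900s window, i.e. 3 batches) and then operates on the data space inside that window.

// assume words: DStream[String] and Seconds as in the earlier sketch
val pairs = words.map(word => (word, 1))

// time: a 900s window sliding every 300s selects which batch RDDs are included;
// space: the reduce then processes the data inside that window like ordinary RDD content
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(900), Seconds(300))
windowedCounts.print()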

The following describes the internal implementation of spark streaming:

Spark Streaming program converted to a DStream graph

Programs written with Spark Streaming are very similar to ordinary Spark programs. In a Spark program, batch processing of data is achieved mainly by manipulating the interfaces provided by the RDD (Resilient Distributed Dataset), such as map, reduce, and filter.

In Spark Streaming, you instead manipulate the DStream (the sequence of RDDs representing the data flow), whose interfaces are similar to those provided by the RDD; the side-by-side sketch below shows the correspondence.
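A minimal side-by-side sketch of that correspondence, assuming the ssc (StreamingContext) and lines (DStream[String]) values from the earlier sketch; the HDFS path is illustrative only.

// Spark Core: word count on an RDD
val batchCounts = ssc.sparkContext
  .textFile("hdfs://Master:9000/input")   // assumed input path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
batchCounts.collect().foreach(println)

// Spark Streaming: the same operators applied to a DStream
val streamCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
streamCounts.print()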

DStream graph converted to Spark jobs

Spark Streaming converts the operations on DStreams into a DStream graph. For each time slice, the DStream graph produces an RDD graph for every output operation (such as print, foreach, etc.). For each output operation Spark Streaming creates a Spark action, and for each Spark action it generates a corresponding Spark job and hands it to the JobManager. The JobManager maintains a job queue in which the Spark jobs are stored, and it submits the Spark jobs to the Spark Scheduler, which is responsible for scheduling the tasks to run on the corresponding Spark Executors. The sketch below illustrates how output operations map to jobs.
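A hedged sketch of that mapping, reusing the words DStream from the sketch above; each output operation below would give rise to one Spark job per batch interval (the output path is an assumption).

// output operation 1: print -> one Spark job per batch
words.print()

// output operation 2: foreachRDD containing an action -> another Spark job per batch
words.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs://Master:9000/out/batch-${time.milliseconds}")  // assumed path
  }
}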

Another big advantage of Spark Streaming is its fault tolerance: each RDD remembers the lineage of operations that created it, and each batch of input data is replicated in memory. If the data on a node is lost because of a node failure, the final result can be recomputed on other nodes from the replicated data. A small sketch of the related settings follows.
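A minimal, hedged sketch of the pieces just mentioned: a checkpoint directory that bounds how much lineage must be replayed, and the receiver's replicated default storage level written out explicitly; the HDFS path is an assumption, and ssc is the StreamingContext from the case above.

import org.apache.spark.storage.StorageLevel

// checkpointing persists DStream metadata and state so recovery need not replay the full lineage
ssc.checkpoint("hdfs://Master:9000/streaming-checkpoint")   // assumed path

// receivers store incoming blocks replicated on two nodes; MEMORY_AND_DISK_SER_2 is the default
val replicatedStream =
  ssc.socketTextStream("Master", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)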

In line with Spark Streaming's original goal, it enables users to combine streaming, batch, and interactive-query applications through a rich API and a fast, memory-based computing engine. Spark Streaming is therefore well suited to applications that need to analyze historical data together with real-time data, and of course it can also fully handle applications whose real-time requirements are not particularly strict. In addition, the RDD's data reuse mechanism allows more efficient fault-tolerant processing. The sketch below shows one way such a combination might look.
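A hedged sketch, in the spirit of the blacklist case above, of combining a static historical dataset with a live stream; the HDFS path and the "name count" layout of the history file are assumptions, and ssc and lines are the values from the earlier sketches.

// historical data loaded once as an ordinary RDD: (userName, historicalClickCount)
val history = ssc.sparkContext
  .textFile("hdfs://Master:9000/history")   // assumed path, one "name count" pair per line
  .map { line => val parts = line.split(" "); (parts(0), parts(1).toInt) }

// live records in "time name" format, keyed by name and joined with the history for every batch
val enriched = lines
  .map(record => (record.split(" ")(1), record))
  .transform(batchRDD => batchRDD.leftOuterJoin(history))

enriched.print()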


This chapter's summary:

1. Spark Streaming itself is like an application on top of Spark Core; internally it processes data by invoking Spark Core's RDD interfaces;

2. A Spark application can start many jobs, and different jobs can cooperate with each other to build complex, large-scale applications;

3. Internally, a DStream is converted into a series of RDDs to run.
