Lesson 1: A Thorough Understanding of Spark Streaming Through Cases, Part 1

Source: Internet
Author: User
Tags: shuffle

As one of the top-level projects under Apache, Spark was red-hot in 2015 and has been even more unstoppable in 2016, as the two charts below show:


For learning Spark, mastering its API is only scratching the surface. We need to dig into the source code and study its essence; only when we can modify and customize Spark at the source-code level have we truly mastered it, and only then can we use it well. Starting today, we embark on that journey. Spark has several sub-frameworks, and we will use Spark Streaming as the entry point for customizing our own Spark version. By studying this framework thoroughly and then generalizing to Spark's other frameworks, we can master the source of Spark's power and the way to solve all of its problems. Why choose Spark Streaming as the entry point? First, because data has timeliness: expired data, like expired food, is far less nourishing than fresh data. In the past we often chose batch processing because technology and resources did not allow stream processing, so we settled for second best; in essence, stream processing is the king of data processing, and this is the era of stream processing. Second, Spark Streaming has attracted more and more attention since its release, and more than 50% of users regard it as the most important part of Spark, as shown below:


Spark Streaming can work seamlessly with Spark SQL, GraphX and MLlib thanks to Spark's integrated, diversified infrastructure design; as the saying goes, when brothers are of one mind, their combined edge can cut through metal. This is exactly where Spark's real power lies, and Spark Streaming will be at the root of that prominence. Because its input data arrives dynamically, Spark Streaming must dynamically control the data intake, the slicing of jobs, and the data processing, so it is the sub-framework most prone to problems and the one that must be studied and mastered most carefully. Spark Streaming also differs from the other sub-frameworks in that it looks more like an application built on top of Spark Core; after learning Spark Core, studying Spark Streaming in depth also gives us the best reference for tackling other complex Spark programs later on.
Let's take a look at the key points about Spark Streaming from the official website:

Let's get to the point today.
One. An Alternative Online Experiment with Spark Streaming
Since the data in Spark Streaming flows in dynamically, and the framework automatically and periodically generates jobs to process it, how can we clearly observe the flow of data and the way it is processed amid all this dynamic variability? Our trick is to reduce the variability by enlarging the batch interval, which makes the details easy to inspect. Note that this is only for the purpose of study; in an actual production environment, as long as the cluster can keep up, a smaller batch interval is generally better. We start from an already-written Spark Streaming application that filters ad clicks against an online blacklist. First, look at the program code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /**
     * Step 1: Create the SparkConf object to hold the runtime configuration of the Spark program.
     * For example, setMaster sets the URL of the master of the Spark cluster the program connects to;
     * if it is set to "local", the program runs locally, which suits beginners whose machines are
     * very limited (e.g. only 1 GB of memory).
     */
    val conf = new SparkConf()                  // Create the SparkConf object
    conf.setAppName("OnlineBlackListFilter")    // Application name, visible in the monitoring UI
    conf.setMaster("spark://Master:7077")       // The program runs on the Spark cluster

    // val ssc = new StreamingContext(conf, Seconds(30))
    val ssc = new StreamingContext(conf, Seconds(300))

    /**
     * Blacklist data preparation. In practice the blacklist is usually dynamic, for example kept in
     * Redis or a database, and generating it often involves complex business logic that differs from
     * case to case, but Spark Streaming can access the complete information when processing.
     */
    val blackList = Array(("Hadoop", true), ("Mahout", true))
    val blackListRDD = ssc.sparkContext.parallelize(blackList, 8)

    val adsClickStream = ssc.socketTextStream("Master", 9999)

    /**
     * Each ad-click record here has the format: time, name.
     * The map operation below turns it into the format (name, (time, name)).
     */
    val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }

    adsClickStreamFormatted.transform(userClickRDD => {
      // leftOuterJoin keeps everything in the user ad-click RDD on the left side and also tells us
      // whether each click is in the blacklist.
      val joinedBlackListRDD = userClickRDD.leftOuterJoin(blackListRDD)

      /**
       * filter: its input element is a tuple (name, ((time, name), boolean)).
       * The first element is the name the blacklist was joined on; the second part of the second
       * element says whether the leftOuterJoin found a match. If it did, the current click came
       * from the blacklist and must be filtered out; otherwise it is a valid click.
       */
      val validClicked = joinedBlackListRDD.filter(joinedItem => {
        if (joinedItem._2._2.getOrElse(false)) {
          false
        } else {
          true
        }
      })

      validClicked.map(validClick => { validClick._2._1 })
    }).print()

    /**
     * The computed valid data would generally be written to Kafka, and the downstream billing
     * system would pull the valid data from Kafka for billing.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
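To make the leftOuterJoin plus filter step easier to experiment with in isolation, here is a minimal sketch (not part of the lesson's code; the object name and local master are chosen for illustration) that applies the same logic to plain RDDs with a local SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the blacklist-filtering logic on plain RDDs (illustration only).
object BlackListJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BlackListJoinSketch").setMaster("local[2]"))

    // The blacklist, keyed by name, exactly as in the streaming program.
    val blackListRDD = sc.parallelize(Array(("Hadoop", true), ("Mahout", true)))

    // Each click record has the format "time name"; key it by name.
    val clicks = sc.parallelize(Array("111111 Spark", "222222 Hadoop", "333333 Flink"))
      .map(ads => (ads.split(" ")(1), ads))

    // leftOuterJoin yields (name, (record, Option[Boolean])):
    // Some(true) means the name is blacklisted, None means it is not.
    val valid = clicks.leftOuterJoin(blackListRDD)
      .filter(item => !item._2._2.getOrElse(false))
      .map(_._2._1)

    valid.collect().foreach(println) // only the Spark and Flink records survive
    sc.stop()
  }
}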

Once the program has been packaged, we can start our test.
1. First, start the HDFS cluster (we start HDFS because our Spark cluster's spark.eventLog.dir and spark.history.fs.logDirectory point to HDFS):
start-dfs.sh:

2. Launch the Spark cluster, start-all.sh:

3. Start Spark's History Server, which allows us to view the details of the program's runs.
start-history-server.sh:

4. Open the data-sending port. We need to run nc first; if we run the packaged program directly, it will fail with a "connection refused" error because port 9999 has not been opened:
nc -lk 9999

5. Run the previously generated jar package with spark-submit:
/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.sparkstreaming.OnlineBlackListFilter --master spark://Master:7077 /usr/local/idea/sparkapps.jar
6. Enter some data on the data-sending port, for example:
111111 Spark
222222 Hadoop
333333 Flink
444444 Kafka
555555 Scala
666666 Flume
7. During the operation of the program, the following print information is visible in the console:

As can be seen, the blacklisted "Hadoop" record is indeed filtered out, and the remaining records are displayed correctly.
8. While the program is running, check its running status through the Web UI:

9. After manually stopping the program with Ctrl+C, view the run details in the Web UI:


At first glance, the program ran in 2 executors on 2 worker nodes and produced a total of 5 jobs.
This is a little different from the Spark Core and Spark SQL programs we know. The program ran for a total of 2.7 minutes, so why are there 5 jobs?
Next, let's look at each job in detail.
10. Job 0:

We can see that this job is divided into 2 stages, running on both executors, and both stages were triggered by line 66 of our program, that is, ssc.start().
Next, look at Stage 0:

This stage was also triggered by ssc.start(); it contains a total of 50 tasks, running on both executors.
Next, look at Stage 1:

This stage was also triggered by ssc.start(); it contains a total of 20 tasks, running on both executors.
11. Details of Job 1:

Details of this job's Stage 2:

This job was also triggered by ssc.start(); it contains only 1 task, running on Executor 0 with locality PROCESS_LOCAL, and it took 1.9 minutes. This job actually starts the Receiver thread that receives the data.
When Spark Streaming starts a Receiver, it launches it through a job. The Receiver runs on exactly one executor and receives our data as a task; from the framework's point of view, a job that receives data is no different from a normal job.
Receiver: it runs on an executor for a long time (possibly 24/7). Each Receiver is responsible for one InputDStream (for example, an input stream reading Kafka messages), and each Receiver, together with its InputDStream, occupies a core/slot.
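As a hedged sketch of the consequence of this (not code from the lesson): because every Receiver permanently holds a core, an application that wants more receiving throughput typically creates several input DStreams, each with its own Receiver, and unions them, while making sure enough cores remain for processing. The hostname, ports, and object name below are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: several socket receivers, each holding one core, unioned into a single DStream.
// The host and ports are placeholders; at least numReceivers + 1 cores are needed overall,
// otherwise the receivers consume every slot and no batch processing can run.
object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiReceiverSketch").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(30))

    val numReceivers = 3
    val streams = (0 until numReceivers).map(i => ssc.socketTextStream("Master", 9999 + i))
    val unioned = ssc.union(streams) // one logical stream backed by three receivers

    unioned.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}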
Important insight: a Spark application can launch many jobs, and these different jobs can cooperate with one another. This understanding lays a good foundation for writing complex Spark programs.
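As a standalone illustration of that insight (a sketch, not the lesson's code): even a plain Spark Core application launches one job per action, and those jobs can cooperate, for example by sharing the same cached RDD.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: one application, several cooperating jobs (one job per action).
object MultiJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiJobSketch").setMaster("local[2]"))

    val data = sc.parallelize(1 to 1000000).map(_ * 2).cache()

    val total = data.count()   // first job: materializes and caches the RDD
    val sum   = data.sum()     // second job: reuses the cached data
    val first = data.take(5)   // another action, another job

    println(s"count=$total, sum=$sum, first=${first.mkString(",")}")
    sc.stop()
  }
}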
12. Details of Job 2:

The stages of this job correspond to lines 40, 32, and 61 of our program (val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }, val blackListRDD = ssc.sparkContext.parallelize(blackList, 8), and print); this job handles the main business logic of our program.
Let's look at the details of Stage 3:



This stage is executed by Executor 1 on Worker 2, and its output is written to Worker 2's disk (shuffle write).
Stage 4:


This stage corresponds to val blackListRDD = ssc.sparkContext.parallelize(blackList, 8). You can see that the executors on both Worker 1 and Worker 2 hold partitions of this RDD, the degree of parallelism is the manually specified 8, and the output is written to Worker 1's disk (shuffle write).
Stage 5:


This stage runs on Executor 1 of Worker 2.
As you can see, although the data is received on a single server, it is processed across multiple servers.
13. Job 3:


Stage 6 and Stage 7 were skipped because their shuffle output had already been written to disk when the previous job executed.
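This skipping behaviour can be reproduced with a small Spark Core sketch (illustration only, not the lesson's code): when a later action reuses shuffle files that an earlier job already wrote, the Web UI marks the corresponding map-side stage as skipped.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the second action reuses the shuffle output of the first, so the map-side
// stage shows up as "skipped" in the Web UI, just like Stage 6 and Stage 7 here.
object SkippedStageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SkippedStageSketch").setMaster("local[2]"))

    val counts = sc.parallelize(Seq("spark", "hadoop", "spark", "flink"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // introduces a shuffle

    counts.collect()       // first job: runs the map stage and writes shuffle files
    counts.collect()       // second job: the map stage is skipped, the shuffle files are reused

    sc.stop()
  }
}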
Details of Stage 8:


This stage runs on the executors of Worker 1 and Worker 2.
As with Job 2, the stages of Job 3 correspond to lines 40, 32, and 61 of our program (val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }, val blackListRDD = ssc.sparkContext.parallelize(blackList, 8), and print); this job handles the main business logic of our program.
14. Job 4:


As with Job 2 and Job 3, the stages of Job 4 correspond to lines 40, 32, and 61 of our program (val adsClickStreamFormatted = adsClickStream.map { ads => (ads.split(" ")(1), ads) }, val blackListRDD = ssc.sparkContext.parallelize(blackList, 8), and print); this job handles the main business logic of our program.
Two. Instantly Understanding the Nature of Spark Streaming
A DStream is an unbounded collection with no size limit. A DStream embodies the concept of space and time: as time goes by, it keeps producing RDDs.
When the time is fixed, we lock in the operations on the space dimension.
The space dimension is the processing logic.
The operations on DStreams constitute the DStreamGraph, as shown in the example:

Each foreach in the example triggers a job, and the job traces back along its dependencies, forming a DAG, as shown below:

Once the space dimension is determined, then as time progresses, RDD Graphs are continuously instantiated from it and jobs are triggered to process the data.
Finally, let's read the official Spark Streaming documentation:
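A hedged sketch of that space/time picture (the object name and the socket source are placeholders, not the lesson's code): the transformations below define the space dimension, the DStreamGraph, exactly once, and at every batch interval the framework instantiates them into an RDD DAG and submits a job; foreachRDD makes the hand-off from the DStream (time) world to the RDD (space) world explicit.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: the transformations define the "space" dimension (the DStreamGraph);
// every batch interval the "time" dimension instantiates them as an RDD DAG and runs a job.
object SpaceTimeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SpaceTimeSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))

    val lines = ssc.socketTextStream("Master", 9999)
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // space dimension: the processing logic, defined once

    wordCounts.foreachRDD { (rdd, time) =>
      // time dimension: this closure runs once per batch, on the RDD generated for that batch
      println(s"Batch at $time produced ${rdd.count()} distinct words")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}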

Summary: in a Spark Streaming application, the framework automatically submits some jobs on our behalf to accomplish certain things, which simplifies our program logic so that we only need to focus on the business-logic code. This is the essence of Spark Streaming and a reflection of the ease of use of the Spark framework. If we want to master Spark Streaming, we need to work backward from these phenomena and trace them to their roots in the source code. In the next few lessons we will peel back the layers to see the essence.

This post is shared from Lesson 1 of teacher Liaoliang's course "Source Code Version Customization and Release Class": "A Thorough Understanding of Spark Streaming Through Cases, Part 1: Decrypting the Spark Streaming Alternative Experiment and Analyzing the Essence of Spark Streaming". Many thanks to teacher Liaoliang!
