Spark Release Notes 1: A Thorough Understanding of Spark Streaming Through a Case Study (Part One)

Source: Internet
Author: User
Tags: shuffle

This lesson explains Spark Streaming in two parts:

First, decrypting Spark Streaming through an alternative online experiment.

Second, instantly understanding the essence of Spark Streaming.

The Spark source customization course is mainly about building your own release and improving the Spark source code yourself. Companies in telecommunications, finance, education, medical, Internet, and other fields all have different businesses; if the official Spark release does not provide the functionality your business needs, you can customize and extend Spark to meet your company's requirements.

There are two reasons for choosing the Spark Streaming framework for source-code research and secondary development:

1. Spark initially consisted only of the Spark Core infrastructure, with no other sub-frameworks (Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, SparkR, etc.); the various sub-frameworks came later to meet different requirements. Analyzing these sub-frameworks shows that Spark Streaming is the smartest one to study: by researching how the Spark Streaming sub-framework sits on top of Spark Core, we move toward mastering the Spark source code and being able to solve any problem.

2. Spark has many sub-frameworks, so why choose Spark Streaming rather than one of the others?

Spark SQL involves a great deal of SQL parsing and optimization detail. Parsing and optimization matter, but they pull the focus away from Spark itself and are not the most important thing here, so Spark SQL is not a good concrete sub-framework to study for this purpose.

Currently, SparkR is immature and its capabilities are limited.

As for graph computation, GraphX has seen almost no improvement from release to release; at this rate its development has basically come to an end. It also involves many mathematics-level algorithms. Mathematics is important to us, but it is not the most important thing for studying Spark itself.

Machine learning builds a large algorithm library on top of encapsulations such as vectors and matrices, and therefore involves too much mathematical knowledge, so choosing MLlib is not a good option either.

After this screening, the Spark Streaming sub-framework remains as the best golden entry point to cut in from.

In the United States, 2015 was Spark's biggest year in the big data field, and above all the year of streaming; Spark Streaming is the most attractive area in big data.

Salary rankings for Spark in the big data industry:

In the first half of 2016, a Stack Overflow survey concluded that Spark was the highest-paying technology in big data, and that within Spark, Spark Streaming was the most attractive part; when companies evaluate Spark, the Spark Streaming sub-framework is a main consideration.

Spark Streaming is a streaming computation framework; data that is not processed as a stream quickly loses its value. Streaming is really our first impression of big data: it gives us feedback immediately, unlike data mining or graph computation. Spark's strength is that stream processing can be perfectly combined with machine learning, graph computation, Spark SQL, or SparkR. This is thanks to Spark's integrated, diversified technology stack: everything in the stack can be called from within one technology stack without any separate setup. This unmatched advantage is also the source of Spark Streaming's eminence. In this era, streaming alone cannot win the fight; Spark Streaming together with its brother frameworks is unmatched. If you are proficient in Spark Streaming, congratulations, because Spark Streaming and the frameworks standing behind it show the infinite charm of Spark and big data.

Among all Spark applications, which programs are most prone to problems? It must be Spark Streaming programs, because data keeps flowing in, and Spark Streaming has to dynamically control the data flow, split jobs, and process the data. That makes it the most error-prone place, but the most error-prone place is also the most interesting place, the place that best shows the magical charm of big data, and, because of this combination of features, the place where skilled engineers are most needed. By focusing on how Spark Streaming develops on top of Spark, you will clearly see that Spark Streaming is different from the other frameworks: Spark Streaming is very much like an application built on Spark Core.

Like everything that develops in this world, any technology has its key point or turning point. To borrow martial-arts metaphors, Spark Streaming is the Dugu Nine Swords and Spark Core is the Yi Jin Jing. Spark Streaming runs on Spark Core, so much of its performance tuning is built on Spark Core; Spark is the dragon vein of big data, and Spark Streaming is the acupoint on that dragon vein.

Next, let us feel for this dragon vein and its acupoints.

When studying Spark Streaming, some things can be confusing: data keeps flowing in, jobs are constantly generated according to the batchInterval time slices, and those jobs are submitted to the cluster for processing. If you can clearly see how the data flows and how it is processed, you will feel much more at ease.

How can we clearly see the data-processing process? There is a little trick: make the batchInterval in Spark Streaming large enough, for example going from a 30-second batch to a 1-minute or even 5-minute batch; then you can clearly watch the whole sequence of the running process, as the example below shows.

Take an online blacklist filter for ad clicks as the example.

To adjust the time dimension, we change the batch interval from 30 seconds to 300 seconds:
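A minimal Scala sketch of this adjustment (the application name and master URL are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Enlarge the batch interval so each phase of the streaming job is easy to observe.
    val conf = new SparkConf()
      .setAppName("OnlineBlackListFilter")      // illustrative application name
      .setMaster("spark://Master:7077")         // assumed standalone master URL
    val ssc = new StreamingContext(conf, Seconds(300))   // batch interval: 30s -> 300s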

Read the data from the socket port that Spark Streaming listens on:
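Continuing from the StreamingContext above, here is a sketch of what the ad-click blacklist filter might look like; the host, port, input format, and blacklist contents are all assumptions:

    // Blacklisted user ids, held as an RDD so they can be joined against each batch.
    val blackList    = Array(("spammer1", true), ("spammer2", true))
    val blackListRDD = ssc.sparkContext.parallelize(blackList)

    // Each input line is assumed to look like "<timestamp> <userId>".
    val adClicks = ssc.socketTextStream("Master", 9999)

    val validClicks = adClicks
      .map(line => (line.split(" ")(1), line))          // key each click by userId
      .transform { clickRDD =>                          // operate on each batch's RDD
        clickRDD
          .leftOuterJoin(blackListRDD)                  // attach a flag for blacklisted users
          .filter { case (_, (_, flag)) => !flag.getOrElse(false) }
          .map { case (_, (click, _)) => click }
      }

    validClicks.print()
    ssc.start()
    ssc.awaitTermination()

With the 300-second batch interval, typing a few lines into the socket client and waiting for the batch boundary makes the receiver job and the output jobs easy to tell apart in the Web UI.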

Package the application and publish it to the cluster deployment environment.

Check the cluster processes.

Check via the Web UI that HDFS has started.

Start the history server monitoring process and its corresponding Web UI monitoring interface.

At this point the entire cluster environment has started successfully.

Start the Spark Streaming application.

Start the external client that sends data to the port the Spark Streaming service listens on.

Enter the data to be processed into the stream.

Observe the results.

Then look at the Web UI console.

Click through the links, and you will see that jobs 0 through 4 have been created.

The interesting thing is that the Spark Streaming application appears to execute a single job, but in fact it executes 5 jobs: a receiver job, three output jobs, and a start job.

Is job 0 code from our own logic? No, it is not the job produced by executing our reduceByKey, as shown:

Spark Streaming automatically starts some jobs of its own during startup, such as this start operation:

Spark Streaming really behaves like an application: even a single computation executes several jobs, just like a Spark application that can start different jobs to accomplish different functions.

Continue with job 1:

This job reveals what is going on inside: by tracking the receiver, you find that it generates a makeRDD, which acts as an independent stage of the whole job, runs on only a single machine, and has executed for 1.5 minutes. We have only just started Spark Streaming and performed no tasks, yet it has already run for 1.5 minutes, as shown:

Think about it: what has been running for those 1.5 minutes, when the whole application has only been running for about 2 minutes?

The answer is that the receiver launched by the ReceiverTracker is running, and it needs to receive incoming data. This job is the receiver, and it executes for 1.5 minutes; the receiver is started as a job.

Conclusion:

Spark Streaming starts the receiver as a job, on an executor of a specific worker in the cluster; that is, the receiver is started through a job. The run times show that the entire Spark Streaming application has been running for 2 minutes, and that one job has been running for 1.5 minutes: this job is the receiver, and its run time is really the time the receiver has been running. The receiver runs inside an executor, which means the Spark Streaming framework starts the receiver through a job, and the receiver (multiple receivers can be started to receive data) runs in an executor and receives our data through a task:

From this angle, what is the difference between the receiver receiving data and an ordinary job? None. This realization is enlightening: in one Spark application, you can start many jobs, and these jobs can cooperate with each other. For example, the Spark Streaming framework starts a job by default to receive data for you, and then prepares for the subsequent processing, laying a good foundation for writing complex applications. This is the golden entry point for writing very complex Spark applications, and complex programs typically consist of multiple jobs.
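As a small side illustration of this point (separate from the streaming example), each action in a Spark application submits its own job, and all of them appear individually in the Web UI:

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiJobDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MultiJobDemo").setMaster("local[2]"))
        val numbers = sc.parallelize(1 to 1000)

        val total  = numbers.sum()                        // action -> job 0
        val evens  = numbers.filter(_ % 2 == 0).count()   // action -> job 1
        val sample = numbers.take(5)                      // action -> job 2

        println(s"total=$total, evens=$evens, sample=${sample.mkString(",")}")
        sc.stop()
      }
    }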

PROCESS_LOCAL means the data is local to the process, i.e. in memory. Spark Streaming receives data with the MEMORY_AND_DISK_SER_2 storage level by default, which means that as long as the received data fits in memory it is not written to disk; here it is used directly from memory.
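If you want this storage level to be explicit rather than relying on the default, socketTextStream accepts it as a parameter; the host and port below are placeholders:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK_SER_2: serialized, replicated to two nodes, spilling to disk only
    // when the received blocks no longer fit in memory (this is also the default level).
    val lines = ssc.socketTextStream("Master", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)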

Take another look at job 0:

Starting 4 executors on 4 workers maximizes the use of compute resources, while data is continuously received through job 1.

The data processed here involves shuffle read and shuffle write. The RDD produced via socketTextStream is called a BlockRDD, and the BlockRDD comes from the socketTextStream method:

In fact, the input DStream helps us generate an RDD for each fixed time interval. The data is received by a task inside a single executor, but it is then processed by transformation operations across all 4 executors. This tells us that the data is received on a single machine but is actually processed on four machines, maximizing the use of cluster resources. The Spark Streaming program executes as batch-level jobs and does a great deal of work. Throughout the whole process, only one job actually executes the user's computation at a time, but many jobs coordinate to complete the complex business processing. This shows that Spark Streaming is not as simple as the Internet, blogs, books, and the official documentation make it look.
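One way to see the BlockRDD and the shuffle for yourself is to print the lineage of each batch's result RDD; this is a sketch assuming the validClicks DStream from the earlier example:

    // Print the lineage of every non-empty batch: a BlockRDD from the socket receiver
    // should appear at the bottom, with a ShuffledRDD above it for the join.
    validClicks.foreachRDD { rdd =>
      if (!rdd.isEmpty()) println(rdd.toDebugString)
    }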

Spark Streaming itself generates jobs on a per-time-interval basis as data flows in, and then triggers those jobs on the cluster; it is a streaming engine that is itself a batch process along the time dimension. In this example, a batch of data is generated every 300 seconds; based on that batch of data an RDD is generated, and based on the RDD a job is triggered. Both the RDD generation and the job triggering are done by the Spark Streaming framework. In Spark Streaming there is the DStream: at every interval an RDD is produced, its RDD dependencies are created, and the concrete execution of a job is triggered. So a DStream represents both time and space. Along the time dimension, RDDs are produced as time advances; the DStream is in fact a collection of RDDs ordered by time. The space dimension is the processing logic applied to the DStream: what we do to a DStream is actually done to each RDD inside the DStream. The whole space-time picture is a very large concept; once the time is fixed, the operations on the space are locked in. These operations are the transformations, and the operations on DStreams build the DStream graph.
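This "space and time" idea can be stated in code: an operation declared on a DStream is applied to every RDD the DStream produces as time advances. The two pipelines below are equivalent sketches, assuming lines is a DStream[String]:

    // Declared directly on the DStream: applied to each batch's RDD as time advances.
    val counts1 = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // The same logic written against the underlying RDD of each batch, via transform.
    val counts2 = lines.transform { batchRDD =>
      batchRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    }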

To summarize:

Along the time dimension there is a DStream graph, and at each point of the time dimension there is also a space dimension, which is the set of operations. With the space dimension fixed, as time keeps advancing, the framework keeps instantiating the DStream graph of the space dimension into an RDD graph, and the concrete job is then triggered to execute.

Thanks to teacher Liaoliang for sharing his knowledge.

Teacher Liaoliang's contact card:

The No. 1 Spark expert in China

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: Dt_spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]

YY classroom: live teaching daily at 20:00, channel 68917580

Liaoliang: founder of DT Big Data Dream Factory, dean and chief expert of the Spark Asia Pacific Research Institute, big data training expert, and big data architect.
