Lesson 1: A Thorough Understanding of Spark Streaming Through Cases: Decrypting the Spark Streaming Alternative Experiment and Analyzing the Essence of Spark Streaming


Background:
Much of the magic of using Spark lies in using Spark Streaming:

    1. Stream processing: we are now in an era of stream processing, and any data unrelated to streams is effectively dead data.
    2. Stream processing is the true face of big data. One very powerful capability of Spark Streaming is that incoming stream data can be processed online together with ML, Spark SQL, and so on; this is an advantage of the unified, diversified technology architecture that Spark provides.
    3. Spark Streaming is itself a program, and it continuously senses data as it processes it, so it is an extremely valuable reference for building complex Spark applications.

In this Spark experiment, if we want to analyze how the data flows in and how it is computed, we can do so with Spark Streaming by setting the batch interval very large, so that many details can be observed through the logs. This is like a photographer's slow-motion trick: slow things down and you can see more clearly.
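As a concrete starting point, here is a minimal sketch of such an application in Scala. The cluster URL spark://master:7077, the host names, the 300-second batch interval, and the word-count logic are illustrative assumptions, not the original course code:

    // A minimal sketch of the slow-motion experiment (hosts, the 300-second
    // batch interval, and the word-count logic are illustrative assumptions).
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SlowMotionWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("SlowMotionWordCount")
          .setMaster("spark://master:7077") // assumed standalone cluster URL

        // A deliberately large batch interval slows the pipeline down so the
        // details of each step show up clearly in the logs.
        val ssc = new StreamingContext(conf, Seconds(300))

        // Read lines from a socket server on port 9999
        // (started beforehand, for example with `nc -lk 9999`).
        val lines = ssc.socketTextStream("master", 9999)

        val wordCounts = lines
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }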

One: Spark Streaming alternative online experiment

  1. Start a socket server on port 9999 (the port the sketch above reads from) and append data to it.
  2. At each configured batch interval, Spark Streaming receives the data, then computes and prints the result.
  3. View the application through the master's port 18080: we essentially ran a single job, yet the web UI shows 5 jobs running. Why is that? Read on and explore!
  4. Look first at the job with Id 0. The operations inside its DAG are not operations we used in our actual code; Spark Streaming automatically starts some jobs for us when computing.

    This first job is started across the 4 Workers for load balancing, so that subsequent computations can make maximal use of the cluster's resources.
  5. The job with Id 1 is the data receiver job, and it has been running for 1.5 min. The receiver is started by a job and runs inside an executor, receiving data in a task; it is no different from a normal job. It follows that within a single Spark application we can start many jobs, and different jobs can cooperate with each other. Spark Streaming by default starts one job whose receiver receives the data. This lays a very good foundation for building complex programs: we can extend the business logic to meet our needs, including launching multiple receivers (see the sketch after this list).
    1. The receiver's locality level is PROCESS_LOCAL (memory node). Spark Streaming receives data with storage level MEMORY_AND_DISK_SER_2 by default; as this shows, for small amounts of data Spark puts the data into memory by default.
    2. The DAG view for the job with Id 2 is as follows:

      At this point the BlockRDD comes from socketTextStream; in essence it is the RDD that the InputDStream generates for the time interval.

      Although the data is received on a single machine, it is computed across 4 executors, so the cluster's resources can be used to the fullest.
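As mentioned in step 5, nothing stops an application from launching several receivers. A hedged sketch of that idea, assuming a second socket source on port 9998 (the extra port and host names are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MultiReceiverSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MultiReceiverSketch")
        val ssc = new StreamingContext(conf, Seconds(300))

        // Each socketTextStream call starts its own receiver, which runs as a
        // long-lived task inside an executor; the storage level passed here,
        // MEMORY_AND_DISK_SER_2, is the receiver default made explicit.
        val stream1 = ssc.socketTextStream("master", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
        val stream2 = ssc.socketTextStream("master", 9998, StorageLevel.MEMORY_AND_DISK_SER_2)

        // Union the two inputs so downstream operators see a single DStream.
        val merged: DStream[String] = stream1.union(stream2)
        merged.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }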

Two: Instantly understand the nature of Spark Streaming

  1. Spark Streaming is a scalable, high-throughput, fault-tolerant (relying primarily on Spark Core's fault-tolerance mechanisms) application built on Spark Core for processing online data streams. The data can come from different sources, and data from different sources can be processed at the same time. Data processed by Spark Streaming can be combined with ML and graph computation.
  2. Spark Streaming itself is a streaming engine that, in essence, adds a time dimension: jobs are generated from the data flowing in over time, then the jobs are triggered and executed on the cluster.
  3. Spark Streaming supports reading data from a variety of sources, such as Kafka, Flume, HDFS, Kinesis, and Twitter, and can apply high-order functions such as map, reduce, join, and window. The processed data can be saved to file systems, databases, dashboards, and so on.
  4. How Spark Streaming works:
     It receives data streams in real time, splits the data into multiple batches along the time dimension, then computes each batch; the final results are likewise produced in batch form.
  5. Spark Streaming provides a higher-level abstraction, the DStream, which represents a continuous stream of data. A DStream can be created from input data sources (Kafka, Flume, and Kinesis) or by applying operators such as map, reduce, join, and window to other DStreams. Internally, a DStream is a series of continuously generated RDDs, each containing the data for one interval of time.
  6. When an operator such as map is applied to a DStream, the bottom layer translates it into an operation on each of the DStream's RDDs: each map on a DStream generates a new DStream, but at bottom the essence is a map on the RDDs that generates new RDDs. This work is done by Spark Core; Spark Streaming is a layer of encapsulation over Spark Core that hides the details and provides a convenient, easy-to-use API for developers.

      Operations on DStreams produce a graph: in the figure, T1 and T2 are the input data; join, map, foreach, and the like generate new DStreams, and together these constitute a DStream graph that is traversed backwards when computation finally happens.

      Spark Streaming job generation: along the time dimension, batches are produced continuously; under operator operations, new DStreams are continuously derived, but internally the essence is that new RDDs are produced at actual computation time, turning the DStream graph into an RDD graph, which the Spark Core engine then computes.
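To make the DStream-to-RDD translation concrete, here is a small sketch, assuming the same socket source as above; foreachRDD is the standard hook that exposes the RDD generated for each batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamToRdd {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("DStreamToRdd"), Seconds(30))
        val lines = ssc.socketTextStream("master", 9999)

        // A map on the DStream is translated into a map on the RDD produced
        // in each batch interval.
        val upper = lines.map(_.toUpperCase)

        // foreachRDD makes the per-batch RDD explicit: each batch interval
        // yields one RDD, and the job that computes it runs on Spark Core.
        upper.foreachRDD { (rdd, time) =>
          println(s"Batch at $time produced an RDD with ${rdd.count()} records")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }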

Summary:
In a short time, this lesson has built an essential understanding of how Spark Streaming handles data; subsequent lessons will deepen that understanding, and the truth will slowly emerge from the many details of the process. As part of the "three axes" series, the best is still to come!

These course notes come from:

Lesson 1: A Thorough Understanding of Spark Streaming Through Cases: Decrypting the Spark Streaming Alternative Experiment and Analyzing the Essence of Spark Streaming
