Among the various big data frameworks, Hadoop is undoubtedly the mainstream choice. But as e-commerce companies have grown, Hadoop has proven suitable only for offline data processing; it cannot handle real-time data processing and analysis, so we need real-time computing frameworks to analyze that data. As a result, many streaming real-time computing frameworks have appeared, such as Storm, Spark Streaming, and Samza. This article mainly explains how Spark Streaming works and how to use it.
I. Streaming computing
1. What is streaming?
Streaming is a data transmission technology that turns the data the client receives into a stable, continuous stream and sends it out continuously, so that the sound or image the user sees stays smooth and the user can start browsing the file on screen before the entire file has finished transmitting.
2. Common streaming computing frameworks
Apache Storm
Spark Streaming
Apache Samza
The three real-time computing systems above are all open-source distributed systems with low latency, scalability, fault tolerance, and many other advantages. Their common feature is that they let you run data-flow code by assigning tasks to a series of fault-tolerant computers that run in parallel. In addition, they all provide simple APIs that hide the complexity of the underlying implementation.
For a comparison of these three streaming computing frameworks, you can refer to the article on the three frameworks for streaming big data processing: Storm, Spark, and Samza.
II. Spark Streaming
1. Introduction to Spark Streaming
Spark Streaming is an important framework in the Spark ecosystem. It is built on Spark Core.
The official explanation for Spark Streaming is as follows:
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.
In other words, Spark Streaming is an extension of Spark Core that brings scalability, high throughput, and fault tolerance to streaming data. It can ingest data from Kafka, Flume, HDFS, Kinesis, Twitter, ZeroMQ, or TCP sockets, analyze the data with complex algorithms and a series of computations, and store the results in the HDFS file system, a database, or a live front-end page.
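To make this concrete, here is a minimal sketch, modeled on the classic network word count from the Spark documentation, that ingests lines from a TCP socket, counts words with high-level operators, and prints the result for each batch. The host localhost and port 9999 are placeholders for whatever source you actually use.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object NetworkWordCount {
      def main(args: Array[String]): Unit = {
        // local[2]: at least two threads, one for the Receiver and one for processing
        val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
        // Batch interval of 1 second: the stream is cut into 1-second micro-batches
        val ssc = new StreamingContext(conf, Seconds(1))

        // Ingest text lines from a TCP socket (placeholder host and port)
        val lines = ssc.socketTextStream("localhost", 9999)

        // Word count expressed with high-level functions: flatMap, map, reduceByKey
        val wordCounts = lines.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Push the processed data out; here we simply print each batch's first elements
        wordCounts.print()

        ssc.start()             // start receiving and computing
        ssc.awaitTermination()  // block until the streaming job is stopped
      }
    }

To try it locally you can feed the socket with a tool such as netcat (nc -lk 9999) and type lines into it.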
Spark Streaming has the following characteristics:
High scalability: it can run on hundreds of machines (scales to hundreds of nodes)
Low latency: data can be processed at second-level granularity (achieves low latency)
High fault tolerance (efficiently recovers from failures)
Ability to integrate with batch and interactive processing programs such as Spark Core (integrates with batch and interactive processing)
2. How Spark Streaming works
For Spark Core, the core abstraction is the RDD; for Spark Streaming, it is the DStream. A DStream is similar to an RDD and is essentially a collection of RDDs. A DStream divides the data stream into batches, for example once per second: after the streaming data is received, it is split into multiple batches, which are then submitted to the Spark cluster for computation, and the results are finally output in batches to HDFS, a database, a front-end page for display, and so on. The following picture can help you understand this:
How should you understand a DStream? It is a continuous series of RDDs, each an immutable, distributed dataset built on Spark, and each RDD in the DStream contains the data for a certain time interval.
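Because a DStream is just this sequence of per-batch RDDs, you can drop down to the RDD level with foreachRDD. The sketch below reuses the wordCounts stream from the earlier example and writes each batch to an HDFS path; the path itself is a placeholder.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    // Each batch interval yields exactly one RDD; foreachRDD hands it to you together
    // with the batch time, so you can apply any ordinary RDD operation to it.
    wordCounts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      if (!rdd.isEmpty()) {
        // Placeholder output path; one directory is written per batch
        rdd.saveAsTextFile(s"hdfs:///tmp/wordcounts-${time.milliseconds}")
      }
    }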
So, how does Spark Streaming work? How does it run on the cluster?
We all know that during initialization Spark Core creates a SparkContext object to handle subsequent processing of the data. Correspondingly, Spark Streaming creates a StreamingContext, whose underlying layer is a SparkContext; in other words, it submits tasks to the SparkContext for execution, which also explains why a DStream is a series of RDDs. When a Spark Streaming application starts, a Receiver is launched on an Executor of one node, and data written from the data source is received by that Receiver. After receiving the data, the Receiver splits it into multiple blocks and replicates them to other nodes (block replication for fault tolerance). The Receiver then reports the block information to the StreamingContext, indicating on which nodes' Executors the data resides. At each batch interval, the StreamingContext turns the data into RDDs and hands them to the SparkContext, which distributes them to the nodes for parallel computation.
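The block replication mentioned above is governed by the storage level the receiver uses when it stores incoming blocks. A minimal sketch, assuming the same socket source as before: the trailing _2 in MEMORY_AND_DISK_SER_2 means each block is serialized, kept in memory (spilling to disk if needed), and replicated to a second Executor, which is also the default for socketTextStream.

    import org.apache.spark.storage.StorageLevel

    // Store received blocks serialized in memory, spill to disk when memory is short,
    // and replicate each block to one other node so a batch survives receiver failure.
    val replicatedLines = ssc.socketTextStream(
      "localhost", 9999,                       // placeholder host and port
      StorageLevel.MEMORY_AND_DISK_SER_2)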