With so many stream processing engines on the market, people often ask us what unique advantages Spark Streaming has. The first thing to say is that Apache Spark provides native support for both batch processing and stream processing. This differs from other systems, whose engines either focus solely on stream processing, or handle only batch processing and merely expose a stream processing API whose implementation is left to external systems. With a single execution engine and a unified programming model, Spark can perform both batch and stream processing; this is the unique advantage of Spark Streaming over traditional stream processing systems. It is reflected in the following four important areas:
Fast recovery of state after failures and stragglers;
Better load balancing and resource usage;
Integration of streaming data with static datasets and interactive queries;
A rich built-in library of advanced algorithms (SQL, machine learning, graph processing).
In this article, we will describe the architecture of Spark Streaming and explain how it delivers the advantages above. We will then discuss some related follow-up work that is currently of broad interest.
Stream processing architecture: past and present
The current distributed stream processing pipeline executes as follows:
Receive streaming data from data sources (such as event logs, system telemetry, IoT device data, etc.) and push it into a data ingestion system, such as Apache Kafka or Amazon Kinesis.
Process the data in parallel on a cluster. This is the key to the design of a stream processing engine, and we will discuss it in more detail below.
Store the output results in downstream systems (such as HBase, Cassandra, Kafka, etc.).
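The three stages above can be sketched in plain Python. This is only an illustrative simulation with hypothetical function names, not real ingestion or storage systems: `ingest` stands in for Kafka/Kinesis, `process` mimics partitioned work on a cluster (here, word counting), and `sink` stands in for a downstream store such as HBase or Cassandra.

```python
from collections import Counter

def ingest(events):
    """Stage 1: receive raw events and buffer them, as an ingestion system would."""
    return list(events)

def process(buffered, partitions=2):
    """Stage 2: split the buffer into partitions and process each one,
    mimicking parallel work on cluster nodes (here: counting words)."""
    chunk = (len(buffered) + partitions - 1) // partitions
    counts = Counter()
    for i in range(0, len(buffered), chunk):
        # Each slice could be handled by a different worker.
        counts.update(buffered[i:i + chunk])
    return counts

def sink(results, store):
    """Stage 3: write results to a downstream store (here: a plain dict)."""
    store.update(results)

store = {}
sink(process(ingest(["a", "b", "a"])), store)
print(store)  # {'a': 2, 'b': 1}
```

In a real deployment each stage is a separate distributed system; the sketch only shows how data flows through the pipeline.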
To process this data, most traditional stream processing systems use a continuous operator model, which works as follows:
A set of worker nodes, each running one or more continuous operators;
Each continuous operator processes streaming data one record at a time and forwards records to the next operators in the pipeline;
Source operators receive data from the ingestion system, and sink operators output results to downstream systems.
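The record-at-a-time model above can be sketched as a chain of Python generators. The operator names (`source`, `parse`, `sink`) are hypothetical; the point is only that each operator handles one record at a time and forwards it to the next operator in the pipeline.

```python
def source(records):
    # Source operator: receives data from the ingestion system, one record at a time.
    for rec in records:
        yield rec

def parse(upstream):
    # Intermediate operator: transforms each record and forwards it downstream.
    for rec in upstream:
        yield rec.upper()

def sink(upstream, out):
    # Sink operator: writes each processed record to the downstream system.
    for rec in upstream:
        out.append(rec)

out = []
sink(parse(source(["spark", "streaming"])), out)
print(out)  # ['SPARK', 'STREAMING']
```

Note that in a real system each operator is pinned to a worker node for the lifetime of the job, which is exactly what makes failure recovery and rebalancing hard, as discussed next.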
The continuous operator model is simple and natural. However, as data volumes grow and real-time analytics become increasingly complex in the big data era, this traditional architecture faces severe challenges. We therefore designed Spark Streaming to address the following requirements:
Rapid recovery from failures: the larger the deployment, the higher the probability of node failures and slow nodes (stragglers). If the system is to deliver results in real time, it must be able to recover from faults automatically. Unfortunately, with continuous operators statically assigned to worker nodes, traditional stream processing systems still find it challenging to do this quickly;
Load balancing: uneven distribution of work across the nodes of a continuous operator system turns some nodes into performance bottlenecks. These problems are more common with large-scale data and dynamically changing workloads. To solve them, the system must be able to dynamically adjust resource allocation between nodes according to the workload;
Unified stream, batch, and interactive processing: in many use cases it is necessary to query streaming data interactively (after all, streaming systems hold it in memory) or to combine it with static datasets (such as pre-computed models). Both are difficult in continuous operator systems, which were not designed for ad hoc queries added as new operators at runtime; this greatly weakens users' ability to interact with the system. We therefore need an engine that integrates batch processing, stream processing, and interactive queries;
Advanced analytics (such as machine learning and SQL queries): some more complex tasks need to continuously learn and update data models, or use the latest features when running SQL queries over streaming data. These analytics tasks need a common, integrated abstraction so that developers can accomplish their work more easily.
To meet these requirements, Spark Streaming uses a new architecture called discretized streams, which can directly leverage the rich libraries of the Spark engine and has an excellent fault tolerance mechanism.
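The core idea of discretized streams is to chop the live stream into short, deterministic micro-batches, each of which is processed like a small batch job and can simply be recomputed on another node after a failure. Here is a minimal self-contained sketch of that idea (not Spark's implementation; the function names and the `(timestamp, value)` event format are assumptions for illustration):

```python
def discretize(stream, batch_interval):
    """Group a stream of (timestamp, record) pairs into micro-batches,
    one batch per time interval."""
    batches = {}
    for ts, rec in stream:
        batches.setdefault(ts // batch_interval, []).append(rec)
    return [batches[k] for k in sorted(batches)]

def process_batch(batch):
    # Each micro-batch is a small, immutable dataset processed deterministically;
    # if a node fails, the same batch can be recomputed elsewhere with the same result.
    return sum(batch)

events = [(0, 1), (1, 2), (2, 3), (3, 4)]  # (timestamp, value) pairs
batches = discretize(events, batch_interval=2)
print([process_batch(b) for b in batches])  # [3, 7]
```

Because every batch is a regular (if small) dataset, the same engine and libraries that serve batch jobs and interactive queries can process the stream, which is what enables the unified model described above.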