With so many stream processing engines on the market, people often ask us what unique advantages Spark Streaming has. The first thing to say is that Apache Spark provides native support for both batch processing and stream processing. This differs from other systems, whose engines either focus solely on stream processing, or handle only batch processing and merely expose a stream processing API whose implementation is left to external systems. With a single execution engine and a unified programming model, Spark can perform both batch and stream processing; this is the unique advantage of Spark Streaming over traditional stream processing systems. It is reflected in the following four important areas:
Fast recovery of state after failures and stragglers;
Better load balancing and resource usage;
Integration of streaming data with static datasets and interactive queries;
A rich built-in library of advanced algorithms (SQL, machine learning, graph processing).
In this article, we will describe the architecture of Spark Streaming and explain how it delivers the advantages above. We will then discuss some related follow-up work that is currently of broad interest.
Stream processing architecture: past and present
The current distributed stream processing pipeline executes as follows:
Receive streaming data from data sources (such as event logs, system telemetry, IoT device data, etc.) and push it into a data ingestion system, such as Apache Kafka or Amazon Kinesis.
Process the data in parallel on a cluster. This is the key to the design of a stream processing engine, and we will discuss it in more detail below.
Store the output results in downstream systems (such as HBase, Cassandra, Kafka, etc.).
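The three stages above can be sketched in plain Python. This is only an illustrative simulation with hypothetical function names, not real ingestion or storage systems: `ingest` stands in for Kafka/Kinesis, `process` mimics partitioned work on a cluster (here, word counting), and `sink` stands in for a downstream store such as HBase or Cassandra.

```python
from collections import Counter

def ingest(events):
    """Stage 1: receive raw events and buffer them, as an ingestion system would."""
    return list(events)

def process(buffered, partitions=2):
    """Stage 2: split the buffer into partitions and process each one,
    mimicking parallel work on cluster nodes (here: counting words)."""
    chunk = (len(buffered) + partitions - 1) // partitions
    counts = Counter()
    for i in range(0, len(buffered), chunk):
        # Each slice could be handled by a different worker.
        counts.update(buffered[i:i + chunk])
    return counts

def sink(results, store):
    """Stage 3: write results to a downstream store (here: a plain dict)."""
    store.update(results)

store = {}
sink(process(ingest(["a", "b", "a"])), store)
print(store)  # {'a': 2, 'b': 1}
```

In a real deployment each stage is a separate distributed system; the sketch only shows how data flows through the pipeline.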
To process this data, most traditional stream processing systems use a continuous operator model, which works as follows:
A set of worker nodes, each running one or more continuous operators;
Each continuous operator processes streaming data one record at a time and forwards records to the next operators in the pipeline;
Source operators receive data from the ingestion system, and sink operators output results to downstream systems.
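The record-at-a-time model above can be sketched as a chain of Python generators. The operator names (`source`, `parse`, `sink`) are hypothetical; the point is only that each operator handles one record at a time and forwards it to the next operator in the pipeline.

```python
def source(records):
    # Source operator: receives data from the ingestion system, one record at a time.
    for rec in records:
        yield rec

def parse(upstream):
    # Intermediate operator: transforms each record and forwards it downstream.
    for rec in upstream:
        yield rec.upper()

def sink(upstream, out):
    # Sink operator: writes each processed record to the downstream system.
    for rec in upstream:
        out.append(rec)

out = []
sink(parse(source(["spark", "streaming"])), out)
print(out)  # ['SPARK', 'STREAMING']
```

Note that in a real system each operator is pinned to a worker node for the lifetime of the job, which is exactly what makes failure recovery and rebalancing hard, as discussed next.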
The continuous operator model is simple and natural. However, as data volumes grow and real-time analytics become increasingly complex in the big data era, this traditional architecture faces severe challenges. We therefore designed Spark Streaming to address the following requirements:
Rapid recovery from failures: the larger the deployment, the higher the probability of node failures and slow nodes (stragglers). If the system is to deliver results in real time, it must be able to recover from faults automatically. Unfortunately, with continuous operators statically assigned to worker nodes, traditional stream processing systems still find it challenging to do this quickly;
Load balancing: uneven distribution of work across the nodes of a continuous operator system turns some nodes into performance bottlenecks. These problems are more common with large-scale data and dynamically changing workloads. To solve them, the system must be able to dynamically adjust resource allocation between nodes according to the workload;
Unified stream, batch, and interactive processing: in many use cases it is necessary to query streaming data interactively (after all, streaming systems hold it in memory) or to combine it with static datasets (such as pre-computed models). Both are difficult in continuous operator systems, which were not designed for ad hoc queries added as new operators at runtime; this greatly weakens users' ability to interact with the system. We therefore need an engine that integrates batch processing, stream processing, and interactive queries;
Advanced analytics (such as machine learning and SQL queries): some more complex tasks need to continuously learn and update data models, or use the latest features when running SQL queries over streaming data. These analytics tasks need a common, integrated abstraction so that developers can accomplish their work more easily.
To meet these requirements, Spark Streaming uses a new architecture called discretized streams, which can directly leverage the rich libraries of the Spark engine and has an excellent fault tolerance mechanism.
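The core idea of discretized streams is to chop the live stream into short, deterministic micro-batches, each of which is processed like a small batch job and can simply be recomputed on another node after a failure. Here is a minimal self-contained sketch of that idea (not Spark's implementation; the function names and the `(timestamp, value)` event format are assumptions for illustration):

```python
def discretize(stream, batch_interval):
    """Group a stream of (timestamp, record) pairs into micro-batches,
    one batch per time interval."""
    batches = {}
    for ts, rec in stream:
        batches.setdefault(ts // batch_interval, []).append(rec)
    return [batches[k] for k in sorted(batches)]

def process_batch(batch):
    # Each micro-batch is a small, immutable dataset processed deterministically;
    # if a node fails, the same batch can be recomputed elsewhere with the same result.
    return sum(batch)

events = [(0, 1), (1, 2), (2, 3), (3, 4)]  # (timestamp, value) pairs
batches = discretize(events, batch_interval=2)
print([process_batch(b) for b in batches])  # [3, 7]
```

Because every batch is a regular (if small) dataset, the same engine and libraries that serve batch jobs and interactive queries can process the stream, which is what enables the unified model described above.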