This article is published by NetEase Cloud.
This article is a continuation of "A Comparative Analysis of the Apache Streaming Frameworks Flink, Spark Streaming and Storm (Part I)".
2. Spark Streaming Architecture and Feature Analysis
2.1 Basic Architecture
Spark Streaming's architecture is built on Spark Core.
Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine is Spark itself: Spark Streaming divides the input data into segments according to the batch size (for example, 1 second), forming a discretized stream (DStream). Each segment is converted into an RDD (Resilient Distributed Dataset) in Spark, and the transformations on the DStream are translated into transformations on the underlying RDDs, whose intermediate results are kept in memory. Depending on the needs of the business, the streaming computation can accumulate these intermediate results or write them out to external storage.
In short, Spark Streaming splits the real-time input stream into blocks along time slices of length ΔT (for example, 1 second). Each block is treated as an RDD and processed with RDD operations; each block generates one Spark job, and these jobs are submitted to the cluster in batches. Running one of these jobs is indistinguishable from running an ordinary Spark task.
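To make the model concrete, here is a minimal Scala sketch (the socket source, host and port are illustrative assumptions, not prescribed by the article). With a batch size of 1 second, each 1-second block of input becomes one RDD, and the output operation triggers one short Spark job per batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Batch size ΔT = 1 second: the input stream is cut into 1-second blocks,
    // and each block becomes one RDD processed by one short Spark job.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // illustrative TCP source
    val counts = lines
      .flatMap(_.split(" "))   // DStream transformations become RDD transformations per batch
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()             // output operation: triggers one job per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```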
JobScheduler
Responsible for job scheduling.
The JobScheduler is the center of all job scheduling in Spark Streaming. Starting the JobScheduler starts the ReceiverTracker and the JobGenerator. Starting the ReceiverTracker causes the receivers running on the executors to start receiving data, and the ReceiverTracker records the metadata of the data each receiver receives. Starting the JobGenerator causes the DStreamGraph to be invoked every batchDuration to generate the RDD graph and the corresponding jobs. A thread pool in the JobScheduler then submits the encapsulated JobSet object (the batch time, the jobs, and the metadata of the source data). The business logic is encapsulated in the job; running it triggers the action on the last RDD, and the job is actually scheduled on the Spark cluster by the DAGScheduler.
JobGenerator
Responsible for job generation
Driven by a timer, it generates a DAG of RDDs at each interval according to the DStream dependencies.
ReceiverTracker
Responsible for receiving, managing and distributing data.
When the ReceiverTracker starts a receiver, it does so through a ReceiverSupervisor (implemented by ReceiverSupervisorImpl), which starts the receiver itself. The receiver continuously receives data and turns it into blocks via a BlockGenerator; a timer keeps storing these blocks through the BlockManager or a write-ahead log (WAL). After the data is stored, ReceiverSupervisorImpl reports the metadata of the stored blocks to the ReceiverTracker, more precisely to the ReceiverTrackerEndpoint, the RPC endpoint inside the ReceiverTracker.
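To show where user data enters this pipeline, below is a minimal sketch of a custom receiver (the class name and the generated records are hypothetical). Every record handed to store() goes through exactly the path described above: the ReceiverSupervisor batches it into blocks, persists them via the BlockManager or the WAL, and reports the block metadata to the ReceiverTracker.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A hypothetical receiver that emits one dummy record every 100 ms.
class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record-" + System.currentTimeMillis()) // enters the block pipeline
          Thread.sleep(100)
        }
      }
    }.start()
  }
  // The receiving thread exits on its own once isStopped() returns true.
  override def onStop(): Unit = ()
}
```

Such a receiver would be attached with ssc.receiverStream(new DummyReceiver), which yields an ordinary DStream.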
2.2 Architecture Analysis at the YARN Level
In Spark on YARN's cluster mode, the driver runs inside the Spark ApplicationMaster (it is started inside the AM and is essentially a StreamingContext object). The driver submits the receiver as a task to a Spark executor; the receiver starts, ingests input data and generates data blocks, then notifies the Spark ApplicationMaster; the ApplicationMaster generates the corresponding jobs from the data blocks and submits the jobs' tasks to idle Spark executors for execution. The bold blue arrows in the figure show the data stream being processed; the input stream can come from disk, the network, HDFS and so on, and the output can go to HDFS, a database and so on. Comparing the cluster modes of Flink and Spark Streaming, in both cases a component inside the AM (the JobManager for Flink, the driver for Spark Streaming) carries out task assignment and scheduling, while other containers execute the tasks (TaskManagers for Flink, executors for Spark Streaming). The difference is that Spark Streaming communicates with the driver on every batch to reschedule, so its latency is much higher than Flink's.
Specific implementation
Figure 2.1 A Spark Streaming program converted into a DStream graph
Figure 2.2 A DStream graph converted into an RDD graph
Every step of Spark Core's processing is based on RDDs, and there are dependencies between RDDs. The DAG of RDDs in the figure shows three actions, which trigger three jobs; the RDD dependencies are resolved bottom-up, and generating an RDD's job executes it concretely. As the DStream graph shows, the logic of DStreams is essentially consistent with that of RDDs: it builds on RDDs and adds a time dependency. The RDD DAG can be called the spatial dimension, and Spark Streaming as a whole adds a temporal dimension on top of it. A program written with Spark Streaming is therefore very similar to an ordinary Spark program: in a Spark program, data is processed mainly through the interfaces provided by the RDD (Resilient Distributed Dataset), such as map, reduce and filter; in Spark Streaming, the interfaces provided by the DStream (a sequence of RDDs representing the data stream) are very similar to those of the RDD.
Spark Streaming converts the DStream operations in a program into a DStream graph. As Figure 2.1 shows, for each time slice the DStream graph produces an RDD graph. For each output operation (such as print or foreach), Spark Streaming creates a Spark action, and for each Spark action Spark Streaming hands a corresponding Spark job to the JobScheduler. The JobScheduler maintains a job queue in which these Spark jobs are stored; it submits them to the Spark scheduler, which schedules the tasks to run on the corresponding Spark executors and finally completes the job.
Figure 2.3 The time dimension generates an RDD DAG
The y-axis is the RDD operations; the dependencies among the RDDs form the logic of the entire job. The x-axis is time: as time passes, at a fixed interval (the batch interval) a job instance is generated and run in the cluster.
Code implementation
This reading is based on the Spark Streaming source code of Spark 1.5; the basic architecture has not changed much since.
2.3 Component Stacks
Spark Streaming supports obtaining data from a variety of sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis and TCP sockets. After acquiring the data, you can use high-level functions such as map, reduce, join and window to express complex algorithms, as the sketch below shows. Finally, the results can be stored in file systems, databases and live dashboards. On the "one stack to rule them all" principle, you can also apply Spark's other sub-frameworks, such as machine learning and graph computation, to the data stream.
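For instance, a windowed count over the DStream API might look like the following sketch (the source, window sizes and output path are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("WindowedCounts"), Seconds(1))

    // Word counts over a sliding 30-second window that advances every 10 seconds.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    counts.saveAsTextFiles("hdfs:///tmp/word-counts") // placeholder output path
    ssc.start()
    ssc.awaitTermination()
  }
}
```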
2.4 Feature Analysis
Throughput and latency
Currently, Spark can already scale linearly to 100 nodes (4 cores per node) on EC2 and process 6 GB/s of data (60M records/s) with a latency of a few seconds; its throughput is 2 to 5 times that of the popular Storm. Figure 4 shows a test done at Berkeley using the WordCount and Grep use cases, in which the per-node throughput of Spark Streaming on Grep is 670k records/s, versus 115k records/s for Storm.
Spark Streaming decomposes the streaming computation into multiple Spark jobs, and the processing of each segment of data goes through Spark's DAG decomposition and task-set scheduling. The smallest practical batch size is between 0.5 and 2 seconds (while Storm's current minimum latency is around 100 ms), so Spark Streaming can satisfy all quasi-real-time streaming scenarios except those with extremely high real-time requirements, such as high-frequency trading.
Exactly-once semantics
Spark Streaming offers more stable support for exactly-once semantics.
Back-pressure support
Spark Streaming introduced a back-pressure mechanism in v1.5 to adapt the data-receiving rate to the cluster's processing capability.
How does Spark Streaming apply back pressure?
Simply put, a back-pressure mechanism has to adjust either the rate at which the system receives data or the rate at which it processes data, and the processing rate is not easy to adjust. Therefore, the system can only estimate its current processing rate and throttle the receiving rate to match it.
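In practice the mechanism is switched on through configuration. A minimal sketch, assuming the documented Spark Streaming configuration keys:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BackPressureDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("BackPressureDemo")
      // Estimate the current processing rate and throttle ingestion to match it.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard upper bound (records/s per receiver) as a safety net.
      .set("spark.streaming.receiver.maxRate", "10000")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define sources, transformations and output operations as usual ...
  }
}
```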
How does Flink handle back pressure?
Strictly speaking, Flink does not need a back-pressure mechanism, because the rate at which the system receives data and the rate at which it processes data are naturally matched: the system only receives data if the receiving task has a free buffer available, and data is only passed downstream if the downstream task also has a free buffer. So the system can never accept more data than it is able to process.
It follows that it is Spark's micro-batch model that makes a separate back-pressure mechanism necessary.
Back pressure and high load
Back pressure usually arises in scenarios where a short-lived load spike causes the system to receive data at a rate far higher than the rate at which it can process it.
However, how high a load the system can withstand is determined by its data-processing capability. The back-pressure mechanism does not improve the system's processing capability; it only adjusts the receiving rate when the load exceeds what the system can handle.
Fault tolerance
The driver and the executors use a write-ahead log (WAL) to preserve state, combined with the fault-tolerance mechanism that RDD lineage itself provides.
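A minimal sketch of how these mechanisms are typically enabled (the checkpoint directory is a hypothetical path): the WAL covers received blocks, and checkpoint-based recovery restores the driver's StreamingContext after a failure.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FaultTolerantStream {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("FaultTolerantStream")
      // Log received blocks to a write-ahead log before acknowledging the source.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    // ... build the DStream pipeline here (at least one output operation) ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the driver from the checkpoint after a failure, or build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```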
APIs and class libraries
Spark 2.0 introduced Structured Streaming, which unifies the SQL and streaming APIs. With the DataFrame as the single entry point, you can program against a stream just like an ordinary batch program, or operate on the stream directly with SQL.
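A minimal sketch of the unified API (the socket source is an illustrative choice): the stream is read as an unbounded DataFrame and queried like a static table.

```scala
import org.apache.spark.sql.SparkSession

object StructuredCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StructuredCount").getOrCreate()
    import spark.implicits._

    // An unbounded DataFrame over a socket source.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Query the stream exactly as if it were a static table.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete") // emit the full updated counts on each trigger
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```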
Wide integration
In addition to reading data sources such as HDFS, Flume, Kafka, Twitter and ZeroMQ, you can define your own data sources. Spark Streaming supports running on YARN, standalone clusters and EC2, can guarantee high availability through ZooKeeper and HDFS, and can write processing results directly to HDFS.
Deployment
It depends only on a Java environment; the application just needs to be able to load the Spark-related JAR packages.
3. Storm Architecture and Feature Analysis
3.1 Basic Architecture
A Storm cluster has a master-slave architecture: the master node is Nimbus, the slave nodes are Supervisors, and scheduling-related information is stored in a ZooKeeper cluster. The architecture is as follows:
Nimbus
The master node of the Storm cluster. It is responsible for distributing user code and assigning it to worker processes on specific Supervisor nodes, which run the components (spouts/bolts) of the corresponding topology's tasks.
Supervisor
The slave node of the Storm cluster. It manages the starting and stopping of each worker process running on that Supervisor node. Through the supervisor.slots.ports item in Storm's configuration file, you can specify the maximum number of slots allowed on a single Supervisor; each slot is uniquely identified by a port number, and one port number corresponds to one worker process (if that worker process is started).
ZooKeeper
Used to coordinate Nimbus and the Supervisors. If a Supervisor cannot run its topology because of a failure, Nimbus detects this first and reassigns the topology to other available Supervisors.
Runtime architecture
Execution process
1) The client submits the topology to Nimbus.
2) Nimbus creates a local directory for the topology, computes and assigns the tasks according to the topology's configuration, and creates an assignments node on ZooKeeper to store the mapping between tasks and Supervisor nodes; it also creates a taskbeats node on ZooKeeper to monitor task heartbeats, and then starts the topology.
3) Each Supervisor fetches its assigned tasks from ZooKeeper and starts multiple workers; each worker spawns its tasks, one thread per task, and the connections between tasks are initialized according to the topology information (task-to-task communication is managed through ZeroMQ). The whole topology is then up and running; a code sketch of the submission follows below.
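A minimal sketch of step 1, the client-side submission (TestWordSpout and TestWordCounter are demo classes shipped in storm-core's testing package; the topology name and parallelism values are illustrative):

```scala
import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.testing.{TestWordCounter, TestWordSpout}
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

object WordTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, 2)    // 2 spout executors
    builder.setBolt("counter", new TestWordCounter, 4) // 4 bolt executors
      .fieldsGrouping("words", new Fields("word"))     // same word -> same task

    val conf = new Config
    conf.setNumWorkers(3) // worker processes, each occupying one supervisor slot

    // Step 1 above: the client hands the topology to Nimbus.
    StormSubmitter.submitTopology("word-topology", conf, builder.createTopology())
  }
}
```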
3.2 Architecture at the YARN Level
Developing an application on YARN typically requires developing only two components: the client and the ApplicationMaster. The client's main job is to submit the application to YARN and to interact with YARN and the ApplicationMaster to carry out the user's commands, while the ApplicationMaster is responsible for requesting resources from YARN and communicating with the NodeManagers to launch tasks.
To run Storm on YARN without modifying any Storm source code, the simplest implementation is to run each of Storm's service components (including Nimbus and the Supervisors) as a separate task on YARN, with ZooKeeper, as a shared service, running on several nodes outside the YARN cluster.
1) The Storm application is submitted to YARN's ResourceManager (RM) via the Storm-on-YARN client;
2) The RM allocates resources for the Storm-on-YARN ApplicationMaster and runs it on one node (the node that will host Nimbus);
3) The Storm-on-YARN ApplicationMaster launches the Nimbus and UI services within itself;
4) The Storm-on-YARN ApplicationMaster requests resources from the RM according to the user's configuration and starts the Supervisor services in the application's containers.
3.3 Component Stacks
3.4 Feature Analysis
A simple programming model.
Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
Service-oriented
A service framework that supports hot deployment and bringing applications online or offline instantly.
Can be used in a variety of programming languages
You can use a variety of programming languages on top of Storm. Clojure, Java, Ruby and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
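As a sketch of how this works, the standard route is a ShellBolt that talks to a subprocess over the JSON-based multilang protocol; here splitsentence.py is a hypothetical Python script that reads tuples on stdin and emits words on stdout:

```scala
import org.apache.storm.task.ShellBolt
import org.apache.storm.topology.{IRichBolt, OutputFieldsDeclarer}
import org.apache.storm.tuple.Fields

// Delegates tuple processing to a subprocess via the multilang protocol.
class SplitSentence extends ShellBolt("python", "splitsentence.py") with IRichBolt {
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
  override def getComponentConfiguration: java.util.Map[String, AnyRef] = null
}
```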
Fault tolerance
Storm manages the failure of worker processes and nodes.
Horizontal scaling
Computations run in parallel across multiple threads, processes and servers.
Reliable Message Processing
Storm guarantees that each message is processed at least once; when a task fails, it retries the message from the message source.
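A minimal sketch of what this looks like inside a bolt, assuming a Storm 1.x-style API: emitting with the input tuple as the anchor links the new tuple into the message tree, ack() marks it processed, and fail() asks the spout to replay it.

```scala
import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class ReliableBolt extends BaseRichBolt {
  private var collector: OutputCollector = _

  override def prepare(conf: java.util.Map[_, _], ctx: TopologyContext,
                       out: OutputCollector): Unit = collector = out

  override def execute(input: Tuple): Unit = {
    try {
      // Anchoring on `input` ties the emitted tuple into the message tree.
      collector.emit(input, new Values(input.getString(0).toUpperCase))
      collector.ack(input)  // processed; the spout will not replay it
    } catch {
      case _: Exception =>
        collector.fail(input) // ask the spout to replay the message
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}
```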
Fast
The system is designed so that messages are processed quickly, with ZeroMQ serving as the underlying message queue.
Local mode
Storm has a "local mode" that fully simulates a Storm cluster in a single process, which makes rapid development and unit testing possible.
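A minimal sketch of local mode, reusing the demo spout and bolt from storm-core's testing package (the topology name and run time are arbitrary):

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.{TestWordCounter, TestWordSpout}
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

object LocalDemo {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout)
    builder.setBolt("count", new TestWordCounter)
      .fieldsGrouping("words", new Fields("word"))

    // An in-process simulated cluster: no Nimbus, Supervisor or ZooKeeper needed.
    val cluster = new LocalCluster
    cluster.submitTopology("local-demo", new Config, builder.createTopology())
    Thread.sleep(10000) // let the topology run for a while
    cluster.killTopology("local-demo")
    cluster.shutdown()
  }
}
```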
Deployment
Storm depends on ZooKeeper for maintaining task state, so ZooKeeper must be deployed first.
4. Comparative Analysis of the Three Frameworks
Comparative analysis
If the latency requirements are not high, Spark Streaming is recommended. It offers rich high-level APIs, is easy to use, docks naturally with the other components of the Spark ecosystem, has high throughput, is simple to deploy, has a fairly capable UI, and its community is active, with quick responses to problems. It is well suited to streaming ETL, and Spark's development momentum is obvious, so its performance and functionality should keep improving.
If the latency requirements are high, it is recommended to try Flink. Flink is currently a very popular streaming system; it adopts a native stream-processing design to guarantee low latency, is fairly complete in its APIs and fault tolerance, is relatively simple to use and easy to deploy, and its momentum keeps growing, so community responses to problems should also be fairly fast.
Personally, I am more optimistic about Flink: thanks to its native stream-processing design, it performs well while guaranteeing low latency, it is becoming easier and easier to use, and the community keeps evolving.