Storm Data Stream model analysis and discussion

Source: Internet
Author: User

This article first introduces the basic concepts and the data stream model of Storm, then uses a typical application scenario to show why Storm should support data stream subscription across topologies, and finally compares the data stream models of Storm and another stream processing system.

Storm Basic Concepts

Storm is an open-source real-time computing system. It provides a set of basic primitives for computation: topology, stream, spout, and bolt.

In Storm, the computing task of a real-time application is packaged and submitted as a topology, which is similar to a MapReduce job in Hadoop. The difference is that a Hadoop MapReduce job ends once it finishes executing, whereas a Storm topology, once submitted, runs forever unless it is explicitly killed.

A computing topology is a graph of spouts and bolts connected by streams. The following is an example topology structure:

Its components are:

Spout: the message source in Storm, responsible for producing messages (data) for the topology. A spout typically reads data continuously from an external data source (such as a message queue, an RDBMS, a NoSQL store, or a real-time log) and emits it into the topology as tuples.

Bolt: a message processor in Storm, responsible for processing messages within the topology. Bolts can perform operations such as filtering, aggregation, and database queries, and can be chained to process messages in multiple stages.

Finally, the topology is submitted to the Storm cluster to run. You can also run a command to stop the topology and return the computing resources it occupies to the cluster.
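The spout/bolt relationship above can be illustrated with a toy simulation. This is not Storm's Java API; it is a minimal Python sketch in which a spout emits sentence tuples and a bolt splits them into words, with all class and method names chosen for illustration only.

```python
from collections import deque

class SentenceSpout:
    """Toy spout: emits tuples from a fixed in-memory source."""
    def __init__(self, sentences):
        self.pending = deque(sentences)

    def next_tuple(self):
        # In real Storm, the framework calls the spout repeatedly for the next tuple.
        return self.pending.popleft() if self.pending else None

class SplitBolt:
    """Toy bolt: splits each sentence tuple into word tuples."""
    def execute(self, tup):
        return tup.split()

# Wire spout -> bolt by hand to mimic one pass through a topology.
spout = SentenceSpout(["storm processes streams", "streams of tuples"])
bolt = SplitBolt()
words = []
while (t := spout.next_tuple()) is not None:
    words.extend(bolt.execute(t))
print(words)
```

In real Storm the framework drives this loop across many threads and machines; the sketch only shows the data-flow roles of the two components.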

Storm Data Stream Model

A stream is Storm's abstraction for data: an unbounded sequence of tuples. Within a topology, a spout is the source of a stream, responsible for pulling data from a specific data source and emitting it into the topology. A bolt can receive any number of streams as input, process the data, and, if necessary, emit new streams for downstream bolts to process.

Below is an example of the data flow relationships between spouts and bolts in a topology:

Each computing component (spout or bolt) in a topology has a degree of parallelism, which can be specified when the topology is created. Storm allocates a corresponding number of threads in the cluster to execute the component concurrently.

This raises a question: since a spout or bolt runs as multiple task threads, how are tuples sent between two components (a spout and a bolt, or two bolts)?

Storm provides several stream grouping policies to solve this problem. When defining a topology, you specify which stream each bolt receives as its input (note: a spout does not receive streams; it only emits them).

Storm currently provides seven stream grouping policies: shuffle grouping, fields grouping, all grouping, global grouping, non grouping, direct grouping, and local-or-shuffle grouping; see the Storm documentation for the details of each policy.
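The two most common policies can be sketched in a few lines. This is a conceptual Python simulation, not Storm's actual routing code (which is implemented in Java inside the framework): shuffle grouping picks a random target task to balance load, while fields grouping hashes the chosen key fields so that tuples with the same key always reach the same task.

```python
import random

def shuffle_grouping(tup, num_tasks, rng=random.Random(42)):
    """Shuffle grouping: route each tuple to a random task (load balancing)."""
    return rng.randrange(num_tasks)

def fields_grouping(tup, key_fields, num_tasks):
    """Fields grouping: tuples with the same key fields go to the same task."""
    key = tuple(tup[f] for f in key_fields)
    return hash(key) % num_tasks

tup = {"user": "alice", "clicks": 3}
task_a = fields_grouping(tup, ["user"], num_tasks=4)
task_b = fields_grouping({"user": "alice", "clicks": 9}, ["user"], num_tasks=4)
# Same "user" key -> same task index, regardless of the other fields.
print(task_a == task_b)
```

Fields grouping is what makes stateful per-key aggregation possible: every tuple for a given key is guaranteed to land on the task that holds that key's state.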

A scenario not supported by storm

The basic concepts above show that a stream in Storm is scoped to a single topology: data can flow between computing components (spouts and bolts) in publish-subscribe fashion only within a topology, and a stream cannot flow between topologies.

This limits the application of storm in some scenarios. Below is a simple example.

Assume topology1 has the following structure: data produced by the spout flows, in sequence, through a filter bolt, a join bolt, and a business1 bolt. The filter bolt filters the data, the join bolt aggregates the data streams, and the business1 bolt carries out the computing logic of an actual business.

Suppose topology1 has already been submitted to the Storm cluster and is running. Now a new requirement arrives: a new piece of business logic must be computed over the same data source as topology1, with exactly the same pre-processing (passing through the filter bolt and the join bolt in sequence). How can Storm meet this requirement? To my knowledge, there are several roundabout implementations:
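To make the shared pre-processing concrete, here is a toy Python sketch of topology1's pipeline. The filter rule, record fields, and aggregation are hypothetical stand-ins; the point is that business2 would need exactly the same filter and join stages before its own logic.

```python
def filter_bolt(records):
    # Drop malformed records (hypothetical filter rule).
    return [r for r in records if r.get("valid")]

def join_bolt(records):
    # Group values by key (a stand-in for stream aggregation/join).
    out = {}
    for r in records:
        out.setdefault(r["key"], []).append(r["value"])
    return out

def business1_bolt(joined):
    # Business logic: sum the values for each key.
    return {k: sum(v) for k, v in joined.items()}

records = [
    {"valid": True, "key": "a", "value": 1},
    {"valid": False, "key": "a", "value": 99},
    {"valid": True, "key": "a", "value": 2},
    {"valid": True, "key": "b", "value": 5},
]
result = business1_bolt(join_bolt(filter_bolt(records)))
print(result)  # {'a': 3, 'b': 5}
```

A hypothetical business2_bolt would consume the same output of join_bolt; the question the rest of this section explores is how to share that intermediate stream without rebuilding or duplicating the front of the pipeline.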

1) Method 1: First kill the running topology1 task in the cluster, then implement the computing logic of a business2 bolt, repackage everything into a new topology jar, and submit it to the Storm cluster to run again. The resulting topology structure in Storm is as follows:

The disadvantage of this method is that restarting the topology loses any state held in spouts or bolts. Because the topology structure changes, the stability and correctness of the program must be re-verified. In addition, changing the topology structure incurs extra operations and maintenance cost.

2) Method 2: Develop and deploy a completely new topology. The spouts and bolts of the shared front end can be reused directly; only the new computing logic, a business2 bolt, needs to be developed to replace the original business1 bolt. Then submit the new topology to run. The resulting topology structure in Storm is as follows:

The disadvantage of this method is that both topologies read the same data from the external data source, which increases the load on that source. Moreover, the same data is transmitted into the Storm cluster twice and processed by two identical sets of bolts, wasting Storm's computing resources and network bandwidth. If there are not two but N such topology tasks, Storm's computing slots are wasted even more severely.

Note: the two methods above also share a common drawback: poor scalability. Whichever method is chosen, every future piece of new business logic of this kind requires either complex manual operations or a linear waste of resources.

3) Method 3: After reading the two methods above, you may propose the following solution: use message middleware such as Kafka so that the spouts of different topologies share the data source. This also provides reliable message delivery and message replay, and Storm already ships a storm-kafka plug-in. The resulting topology structure in Storm is as follows:

By introducing a message middleware layer, this implementation reduces the repeated load on the external data source and hides the details of that source behind the middleware; to add new business logic, you only need to deploy and run a new topology. In the current Storm version this is arguably a good approach. However, introducing message middleware adds complexity to the system and raises the barrier to developing applications on Storm.

It is worth noting that one problem remains with method 3: within the Storm cluster itself, this approach still cannot fundamentally avoid the same data being sent and processed repeatedly across different topologies. This is a restriction of Storm's data stream model; if Storm supported stream sharing between topologies, the problem would disappear.
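The essence of method 3 is that one append-only message log serves many independent consumers. Here is a toy Python sketch of that pattern, not Kafka's actual API: a MiniBroker keeps a per-consumer offset, so each "topology" reads the full stream while the external source is ingested only once.

```python
class MiniBroker:
    """Toy pub-sub broker: one producer, many independent consumers."""
    def __init__(self):
        self.log = []          # append-only message log (Kafka-like)
        self.offsets = {}      # consumer name -> next offset to read

    def publish(self, msg):
        self.log.append(msg)

    def poll(self, consumer):
        off = self.offsets.get(consumer, 0)
        batch = self.log[off:]
        self.offsets[consumer] = len(self.log)
        return batch

broker = MiniBroker()
for msg in ["e1", "e2", "e3"]:
    broker.publish(msg)

# Two topologies subscribe independently; the source is read only once.
topo1 = broker.poll("topology1")
topo2 = broker.poll("topology2")
print(topo1 == topo2 == ["e1", "e2", "e3"])
```

Because offsets are tracked per consumer, this also models message replay: resetting a consumer's offset re-delivers the log. What it does not model is the remaining problem noted above: inside the Storm cluster, both topologies would still process these records separately.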

Data Stream Model of Another Stream Processing System

In my own work I have been fortunate to participate in the development and application of a stream processing framework. Let's take a brief look at the data stream model it uses:

Where:

1) Data stream: a collection of data records unbounded in both time and quantity;

2) Data record: the smallest unit of a data stream. Each data record contains three parts: the name of the data stream it belongs to (stream name), the data used for routing (keys), and the data required by the specific processing logic (value);

3) Task definition: defines the basic attributes of a data processing task. It cannot be executed directly; it must first be converted into a concrete task instance. Its basic attributes include:

    • (Optional) input streams: the data streams the task depends on as input, given as a list of stream names. A task at the source of a data stream does not depend on other streams and may omit this configuration;
    • Processing logic: the specific processing the task performs, for example external processing logic executed in an independent process;
    • (Optional) output stream: the data stream the task produces, given as a stream name. A task at the last stage of a processing chain produces no new stream and may omit this configuration;

4) Task instance: a task definition plus concrete constraints becomes a logical entity that can be pushed to a processing node and run. It adds the following attributes:

    • Task definition: points to the task definition entity that this instance corresponds to;
    • Input filtering conditions: a list of boolean expressions, one per input stream, describing which data records in that stream are relevant; if all records in an input stream are valid, the condition can simply be true;
    • (Optional) output interval: the frequency at which the task instance is forced to emit output stream records, expressed either as a number of input records or as a time interval. If omitted, the emission of output records is determined entirely by the processing logic and is not constrained by the framework;

5) Processing node: a physical machine that can run task instances. Each processing node is identified by a unique IPv4 address.

The distributed stream processing system consists of multiple processing nodes, each of which runs multiple task instances. Each task instance is a task definition augmented with input stream filter conditions and a forced output interval, forming the logical entity that is actually pushed to a processing node. The task definition itself carries the input streams, the processing logic, and the output stream.
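The definition/instance split described above can be sketched as two data structures. This is an illustrative Python model, not the framework's actual code; field names, the doubling logic, and the node address are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TaskDefinition:
    """Reusable definition: which streams a task reads/writes, and its logic."""
    input_streams: list            # may be empty for a source task
    process: Callable              # the processing logic
    output_stream: Optional[str]   # None for the last stage of a chain

@dataclass
class TaskInstance:
    """A definition plus per-instance constraints, deployable to a node."""
    definition: TaskDefinition
    input_filter: Callable = lambda record: True   # default: keep everything
    node: str = "10.0.0.1"                         # hypothetical node address

    def handle(self, record):
        # Apply the instance's filter before running the shared logic.
        if self.input_filter(record):
            return self.definition.process(record)
        return None

defn = TaskDefinition(["clicks"], lambda r: r["value"] * 2, "doubled_clicks")
inst = TaskInstance(defn, input_filter=lambda r: r["value"] > 0)
print(inst.handle({"value": 3}), inst.handle({"value": -1}))
```

Note how one TaskDefinition can back many TaskInstances on different nodes, each with its own filter: the definition is the reusable artifact, the instance is what actually runs.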

In this system, all of the configuration in the data stream model above is stored in a ZooKeeper cluster (a distributed coordination service). Each processing node obtains the data stream configuration from ZooKeeper and then starts or stops task instances and their inbound and outbound streams accordingly.

Moreover, every task instance can accept any existing data stream in the system as input and can produce a new data stream under any name, which task instances running on other nodes can then subscribe to. The ZooKeeper cluster dynamically tracks the subscription relationships between nodes for each data stream and notifies the system when they change.

Differences between the two in the data stream model

We will not compare the implementation details of the two systems here. The following lists only their differences in the data stream model (this is not a full comparison of the two systems, only of the key parts):

1) In Storm, a data stream is defined inside a topology and flows only within that topology. In the stream processing system described above, a stream is globally unique across the entire system and can be subscribed to anywhere in the cluster.

2) In Storm, stream publication and subscription are static: once the publish-subscribe relationships of a topology are submitted to the Storm cluster, they cannot change while the topology is running. In the stream processing system described above, stream publication and subscription are dynamic: a processing task can dynamically publish a new stream or dynamically subscribe to any stream already produced in the system, with the subscriptions maintained and managed through ephemeral nodes in the ZooKeeper cluster.

Given this comparison, it is easy to see that Storm's data stream model cannot readily support the application scenario described in this article, whereas the global data stream model of the stream processing system discussed here satisfies it easily.

Summary

I personally think it is worthwhile for Storm to support stream sharing between topologies. This could be done without losing any of Storm's existing functionality, and it would make Storm better suited to some application scenarios found in real production environments.

There are many possible ways to implement this on top of the existing Storm. A simple one is to use ZooKeeper to centrally store, and dynamically track, the publish-subscribe relationships between streams across topologies, and to handle this case in Storm's message delivery path.
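The idea can be sketched as a central registry that any topology can subscribe through. This is a toy in-memory Python stand-in for a ZooKeeper-backed registry (with real ZooKeeper, subscriptions would be ephemeral nodes and notifications would be watches); all names here are illustrative.

```python
class StreamRegistry:
    """Toy stand-in for a ZooKeeper-backed publish/subscribe registry."""
    def __init__(self):
        self.subscribers = {}   # stream name -> list of subscriber callbacks

    def subscribe(self, stream, callback):
        # With ZooKeeper, this would create an ephemeral node plus a watch.
        self.subscribers.setdefault(stream, []).append(callback)

    def publish(self, stream, record):
        # Deliver to every current subscriber, regardless of its topology.
        for cb in self.subscribers.get(stream, []):
            cb(record)

registry = StreamRegistry()
seen_by_topo1, seen_by_topo2 = [], []
registry.subscribe("joined", seen_by_topo1.append)
registry.subscribe("joined", seen_by_topo2.append)  # added dynamically later
registry.publish("joined", {"key": "a", "value": 3})
print(seen_by_topo1, seen_by_topo2)
```

Because subscriptions are looked up at delivery time, a new topology can attach to an existing stream (such as the join bolt's output in the earlier scenario) without restarting the producer, which is exactly the dynamic behavior Storm's static model lacks.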

If any of the above is incorrect, corrections are welcome.

 
