An extensible real-time data processing architecture based on Storm

Source: Internet
Author: User

Problem Introduction

With Storm, you can easily build a clustered data processing framework and implement business logic by defining topologies.

Topologies have one drawback, however: a topology's processing power is set by the number of workers configured at startup. In many cases we need to adjust the cluster's processing power according to business load, and a single topology cannot solve that problem.

To define processing capacity more flexibly, we can split the original topology by business domain, so that the resulting topologies do not interfere with each other and can be controlled independently. To use processing resources more economically, we can also introduce a worker resource pool, so that idle workers are put to full use.
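The worker resource pool idea can be illustrated with a small model. This is a minimal sketch in plain Python, not Storm code: `WorkerPool`, `resize`, and the topology names are hypothetical, and how a new worker count is applied to a running topology is left outside the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerPool:
    """A fixed budget of worker slots shared by several topologies (hypothetical model)."""
    total: int
    allocated: dict = field(default_factory=dict)  # topology name -> workers held

    @property
    def free(self) -> int:
        """Number of workers not currently assigned to any topology."""
        return self.total - sum(self.allocated.values())

    def resize(self, topology: str, workers: int) -> bool:
        """Set a topology's worker count; refuse if the pool cannot cover the increase."""
        if workers < 0:
            raise ValueError("workers must be non-negative")
        current = self.allocated.get(topology, 0)
        if workers - current > self.free:
            return False  # not enough idle workers; another topology must shrink first
        if workers == 0:
            self.allocated.pop(topology, None)
        else:
            self.allocated[topology] = workers
        return True


pool = WorkerPool(total=10)
pool.resize("orders", 4)
pool.resize("billing", 4)
grew = pool.resize("orders", 8)        # needs 4 more workers, only 2 are idle
pool.resize("billing", 2)              # billing load drops, 2 workers return to the pool
grew_after = pool.resize("orders", 8)  # the freed workers now cover the increase
```

The point of the pool is that capacity released by one business domain becomes available to another, instead of each topology keeping a fixed share.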

However, this multi-topology architecture has a serious problem: topologies in Storm are independent and cannot communicate with each other directly, so they may contend with one another when acquiring certain critical resources. There are two ways to deal with this scenario:

First: Use a distributed lock, for example one built on ZooKeeper, to control access to critical resources. The drawback is that it raises reliability and efficiency concerns, so it suits scenarios where the lock is used infrequently and processing-efficiency requirements are not high.

Second: Have a third party allocate the critical resources, so that the topologies never compete for them directly. This approach introduces a new component, which increases the complexity of the system.
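The second approach can be sketched as follows. This is an illustrative in-process model only: `ResourceAllocator`, its methods, and the resource and topology names are assumptions, and a real allocator would be a separate service that topologies call over the network.

```python
import threading


class ResourceAllocator:
    """Hypothetical third-party component that owns the critical resources."""

    def __init__(self, resources):
        self._idle = list(resources)
        self._owner = {}                # resource -> topology currently holding it
        self._guard = threading.Lock()  # serializes requests inside the allocator

    def acquire(self, topology):
        """Hand one idle resource to the topology, or None if all are in use."""
        with self._guard:
            if not self._idle:
                return None
            resource = self._idle.pop(0)
            self._owner[resource] = topology
            return resource

    def release(self, topology, resource):
        """Return a resource; only the current owner may release it."""
        with self._guard:
            if self._owner.get(resource) != topology:
                raise PermissionError(f"{topology} does not hold {resource}")
            del self._owner[resource]
            self._idle.append(resource)


allocator = ResourceAllocator(["conn-1", "conn-2"])
a = allocator.acquire("topo-orders")
b = allocator.acquire("topo-billing")
c = allocator.acquire("topo-audit")   # nothing idle, so the request is refused
allocator.release("topo-orders", a)
d = allocator.acquire("topo-audit")   # the released resource is handed out again
```

Because every request passes through one component, the topologies never race each other; the cost is that the allocator itself must be deployed, monitored, and kept available.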

Processing scheme

The advantage of clustering is that processing power is scalable, but it brings problems such as data synchronization, development and maintenance complexity, and data consistency.

Many cluster processing frameworks and supporting components now exist to simplify development and maintenance, but in actual project development we still have to deal with difficult problems that mature components do not cover.

A Storm cluster provides convenient, scalable processing capability. Topologies are peers throughout the cluster, and within the Storm runtime environment they are interchangeable.

Returning to the question above: with Storm we get cluster processing power out of the box. We can customize business logic through topologies, distribute them easily across nodes, and adjust a topology's processing capacity by changing its number of workers.

With a big data storage platform such as Hadoop, a Redis cache, and ZooKeeper-based distributed locks, it is basically possible to build a real-time, scalable big data processing platform.

Component diagram

Multi-topology initialization

The following is a class view of the Storm-based multi-topology initialization:
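As a rough illustration of initializing one topology per business domain, the sketch below builds a list of topology descriptions and hands each to a submit function. It is not the class structure from the original diagram: `TopologySpec`, `build_specs`, `initialize`, the `rt` prefix, and the domain names are all hypothetical, and `submit` stands in for whatever call actually submits a topology to the cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TopologySpec:
    """Description of one topology to start (hypothetical structure)."""
    name: str
    domain: str
    workers: int


def build_specs(domain_workers: Dict[str, int], prefix: str = "rt") -> List[TopologySpec]:
    """Create one spec per business domain, in a stable (sorted) order."""
    return [
        TopologySpec(f"{prefix}-{domain}", domain, workers)
        for domain, workers in sorted(domain_workers.items())
    ]


def initialize(specs: List[TopologySpec], submit: Callable[[TopologySpec], None]) -> List[str]:
    """Submit every spec and return the names of the topologies started."""
    started = []
    for spec in specs:
        submit(spec)  # assumption: in a real deployment this wraps the cluster submission
        started.append(spec.name)
    return started


submitted = []  # records what would have been sent to the cluster
names = initialize(build_specs({"orders": 4, "billing": 2}), submitted.append)
```

Keeping the per-domain worker counts in data rather than in code makes it easier to change one domain's capacity without touching the others.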

Key points and considerations

Caching strategy

Because this is a real-time data processing platform, efficiency matters, and database access is often the bottleneck, so a cache has to be designed in. Redis was chosen because it is widely used and stable, and the industry has relatively mature strategies for building caches on it.
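One commonly used caching strategy is cache-aside with a time-to-live; the source does not say which strategy it uses, so this is only an example. In the sketch below a Python dict stands in for Redis, and `CacheAside`, `load_from_db`, and the key names are hypothetical.

```python
import time


class CacheAside:
    """Read-through-on-miss cache with expiry; a dict stands in for Redis here."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        """Return the cached value, reloading from the database on miss or expiry."""
        hit = self._entries.get(key)
        if hit is not None and hit[1] > self._clock():
            return hit[0]
        value = self._load(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key after the underlying row changes."""
        self._entries.pop(key, None)


db_reads = []


def load_from_db(key):
    """Pretend database lookup that records each access."""
    db_reads.append(key)
    return f"row-for-{key}"


now = [0.0]  # controllable clock so the expiry can be demonstrated
cache = CacheAside(load_from_db, ttl_seconds=30.0, clock=lambda: now[0])
first = cache.get("device-7")   # miss: goes to the database
second = cache.get("device-7")  # hit: served from the cache
now[0] = 31.0
third = cache.get("device-7")   # entry expired: goes to the database again
```

Three reads cost two database accesses in this run; the time-to-live bounds how stale a cached value can become.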

Distributed locks

Distributed locks are critical, especially when there are multiple topologies in the Storm cluster, where contention for critical resources is likely.

Building distributed locks on ZooKeeper has become common practice, but such locks still raise robustness and locking-efficiency concerns that need to be considered at design time.
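The robustness concern can be made concrete with a lease-based lock: if the holder crashes, the lock frees itself once the lease runs out. The sketch below is an in-memory stand-in, not a ZooKeeper client; `LeaseLock`, its methods, and the lock and topology names are hypothetical. The lease plays a role similar to a session timeout in a ZooKeeper-based lock.

```python
import time


class LeaseLock:
    """In-memory stand-in for a lock service with expiring leases."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._holders = {}  # lock name -> (owner, expires_at)

    def try_acquire(self, name, owner, lease_seconds):
        """Non-blocking acquire; succeeds if the lock is free, expired, or already ours."""
        now = self._clock()
        holder = self._holders.get(name)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False
        self._holders[name] = (owner, now + lease_seconds)
        return True

    def release(self, name, owner):
        """Release only if the caller is still the recorded holder."""
        holder = self._holders.get(name)
        if holder is not None and holder[0] == owner:
            del self._holders[name]
            return True
        return False


now = [0.0]  # controllable clock so the expiry can be demonstrated
lock = LeaseLock(clock=lambda: now[0])
got_a = lock.try_acquire("ledger", "topo-a", 10)        # topo-a takes the lock
got_b = lock.try_acquire("ledger", "topo-b", 10)        # refused while the lease is live
now[0] = 11.0                                           # topo-a crashed and never released
got_b_later = lock.try_acquire("ledger", "topo-b", 10)  # the lease has expired
stale_release = lock.release("ledger", "topo-a")        # the old holder can no longer release
```

The lease length is the trade-off between the two concerns named above: a short lease recovers quickly from a crashed holder but forces frequent renewals, while a long lease is cheaper to maintain but blocks other topologies longer after a failure.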

Collaboration between Hadoop and Oracle

These two components differ greatly in cost of use and in usage scenario.

In practice, Hadoop can be used to store unstructured data, such as raw results. Given Oracle's structured storage, reliability, and ease of use, you can store the final processing results in Oracle for maintenance and presentation.
