Problem Introduction
Using Storm, you can easily build a clustered data-processing framework and implement business logic by defining topologies.
However, a topology has a drawback: its processing capacity is fixed by the number of workers configured at startup. In many cases we need to adjust the cluster's processing capacity in response to business load, and a single topology cannot do that.
To control processing capacity more flexibly, we can split the original topology along business-domain boundaries, so that the parts do not interfere with each other and can be tuned independently; and to use processing resources more economically, we can introduce the concept of a worker resource pool so that resources are fully utilized.
However, this multi-topology architecture has a fatal problem: topologies in Storm are independent and cannot communicate directly, so they may scramble for certain critical resources. There are two ways to deal with this scenario:
One: use distributed locks, for example built on ZooKeeper, to control access to critical resources. The drawback is that such locks have reliability and efficiency problems, so this approach suits scenarios where processing-efficiency requirements are not high.
Two: have a third party allocate the critical resources, so that topologies never compete for them directly. This scheme introduces a new component and raises the complexity of the system.
Processing architecture
The advantage of clustering is that processing capacity is extensible, but it brings costs in data synchronization, development and maintenance complexity, data consistency, and so on.
Many cluster-processing frameworks and supporting components now exist to simplify development and maintenance, but from the standpoint of actual project work we still have to handle some problems that mature components do not cover yet are very difficult.
A Storm cluster provides convenient, scalable processing capability; topologies are peers throughout the cluster, and within a running Storm environment topologies can be swapped in and out.
Returning to the problem above: Storm gives us cluster processing power out of the box; we can encode business logic in topologies, distribute them easily across the nodes, and adjust their processing capacity by changing the number of workers.
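For example, Storm's built-in rebalance command changes a running topology's worker count without a redeploy. This is only a CLI fragment (it needs a live cluster), and the topology name below is hypothetical:

```shell
# Redistribute the (hypothetical) topology "order-processing-topo" across
# 8 workers, waiting 30 seconds for in-flight tuples to drain first.
storm rebalance order-processing-topo -w 30 -n 8
```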
Combined with a big-data storage platform such as Hadoop, a Redis cache, and ZooKeeper-based distributed locks, we basically have everything needed to build a real-time, scalable big-data processing platform.
Component diagram
Multi-topology initialization
The following is a class view of Storm-based multi-topology initialization:
Key points and considerations
Caching strategy
Because this is a real-time data-processing platform, efficiency is a requirement, and database access is often the bottleneck, so a cache has to be designed in. Redis was chosen because it is widely used and stable, and the industry has relatively mature strategies for building caches on it.
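One common Redis strategy here is the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache with a TTL. The sketch below is a minimal illustration in which a plain dict stands in for Redis (with a real Redis client the same logic maps onto GET/SETEX); the CacheAside name and the backing_store callable are hypothetical:

```python
import time

class CacheAside:
    """Cache-aside pattern: check the cache first, fall back to the
    backing store on a miss, then populate the cache with a TTL.
    A dict stands in for Redis in this sketch."""

    def __init__(self, backing_store, ttl_seconds=60):
        self._store = backing_store      # e.g. a database lookup function
        self._ttl = ttl_seconds
        self._cache = {}                 # key -> (value, expires_at)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]              # cache hit, database untouched
        value = self._store(key)         # cache miss: go to the database
        self._cache[key] = (value, time.time() + self._ttl)
        return value
```

The design choice is that the cache never has to be pre-loaded and stale entries simply expire; the trade-off is one extra database round trip per miss.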
Distributed locks
Distributed locks are critical: with multiple topologies in the Storm cluster, contention for critical resources is very likely.
Building distributed locks on ZooKeeper has become common practice, but ZooKeeper-based locks still have robustness and lock-efficiency problems that must be considered at design time.
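ZooKeeper's standard lock recipe creates an ephemeral sequential znode under a lock path; the client holding the lowest sequence number owns the lock, and each waiter watches its predecessor. The sketch below simulates only that ordering logic in-process — ZkLockSimulator is a stand-in, not a real ZooKeeper client; a production system would use a library such as Apache Curator or kazoo:

```python
import threading

class ZkLockSimulator:
    """In-process simulation of ZooKeeper's ephemeral-sequential lock
    recipe: each acquirer gets a monotonically increasing sequence
    number, and only the lowest outstanding number holds the lock."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = []                  # outstanding sequence numbers
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            # Step 1: "create" an ephemeral sequential node.
            my_seq = self._next_seq
            self._next_seq += 1
            self._nodes.append(my_seq)
            # Step 2: wait until ours is the lowest sequence number
            # (real ZooKeeper clients watch only their predecessor node).
            while min(self._nodes) != my_seq:
                self._cond.wait()
            return my_seq

    def release(self, my_seq):
        # Deleting the node wakes the next waiter in sequence order.
        with self._cond:
            self._nodes.remove(my_seq)
            self._cond.notify_all()
```

In real ZooKeeper the node is ephemeral, so a crashed client's lock is released automatically when its session expires — one of the robustness details the text warns must be considered at design time.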
Division of labor between Hadoop and Oracle
These two components differ greatly in usage cost and applicable scenarios.
In this application, Hadoop can store unstructured data such as raw results, while Oracle's structured storage, reliability, and ease of use make it the choice for storing the final processing results for maintenance and presentation.
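That split can be sketched as a simple routing rule at the end of the processing pipeline. The stage field and the hdfs_sink / oracle_sink callables below are hypothetical stand-ins for real clients (e.g. an HDFS append and a JDBC insert):

```python
def route_record(record, hdfs_sink, oracle_sink):
    """Route a processed record to the appropriate store: bulky raw
    payloads go to HDFS (cheap bulk storage, reprocessable later),
    compact final results go to Oracle for querying and presentation."""
    if record.get("stage") == "raw":
        hdfs_sink(record)
    else:
        oracle_sink(record)
```

Keeping raw results in Hadoop means the final tables in Oracle stay small and queryable, while a reprocessing run can always be replayed from the raw store.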
An Extensible Real-Time Data Processing Architecture Based on Storm