Problem Introduction
Using Storm, you can easily build a clustered data-processing framework and implement business logic by defining topologies.
However, a topology has a drawback: its processing capacity is fixed by the number of workers configured at startup. In many cases we need to adjust the cluster's processing capacity in response to business load, and a single topology cannot do that.
To control processing capacity more flexibly, we can split the original topology along business-domain boundaries, so that the parts do not interfere with each other and can be tuned independently; and to use processing resources more economically, we can introduce the concept of a worker resource pool so that resources are fully utilized.
However, this multi-topology architecture has a fatal problem: topologies in Storm are independent and cannot communicate directly, so they may scramble for certain critical resources. There are two ways to deal with this scenario:
One: use distributed locks, for example built on ZooKeeper, to control access to critical resources. The drawback is that such locks have reliability and efficiency problems, so this approach suits scenarios where processing-efficiency requirements are not high.
Two: have a third party allocate the critical resources, so that topologies never compete for them directly. This scheme introduces a new component and raises the complexity of the system.
Processing architecture
The advantage of clustering is that processing capacity is extensible, but it brings costs in data synchronization, development and maintenance complexity, data consistency, and so on.
Many cluster-processing frameworks and supporting components now exist to simplify development and maintenance, but from the standpoint of actual project work we still have to handle some problems that mature components do not cover yet are very difficult.
A Storm cluster provides convenient, scalable processing capability; topologies are peers throughout the cluster, and within a running Storm environment topologies can be swapped in and out.
Returning to the problem above: Storm gives us cluster processing power out of the box; we can encode business logic in topologies, distribute them easily across the nodes, and adjust their processing capacity by changing the number of workers.
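For example, Storm's built-in rebalance command changes a running topology's worker count without a redeploy. This is only a CLI fragment (it needs a live cluster), and the topology name below is hypothetical:

```shell
# Redistribute the (hypothetical) topology "order-processing-topo" across
# 8 workers, waiting 30 seconds for in-flight tuples to drain first.
storm rebalance order-processing-topo -w 30 -n 8
```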
Combined with a big-data storage platform such as Hadoop, a Redis cache, and ZooKeeper-based distributed locks, we basically have everything needed to build a real-time, scalable big-data processing platform.
Component diagram
Multi-topology initialization
The following is a class view of Storm-based multi-topology initialization:
Key points and considerations
Caching strategy
Because this is a real-time data-processing platform, efficiency is a requirement, and database access is often the bottleneck, so a cache has to be designed in. Redis was chosen because it is widely used and stable, and the industry has relatively mature strategies for building caches on it.
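One common Redis strategy here is the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache with a TTL. The sketch below is a minimal illustration in which a plain dict stands in for Redis (with a real Redis client the same logic maps onto GET/SETEX); the CacheAside name and the backing_store callable are hypothetical:

```python
import time

class CacheAside:
    """Cache-aside pattern: check the cache first, fall back to the
    backing store on a miss, then populate the cache with a TTL.
    A dict stands in for Redis in this sketch."""

    def __init__(self, backing_store, ttl_seconds=60):
        self._store = backing_store      # e.g. a database lookup function
        self._ttl = ttl_seconds
        self._cache = {}                 # key -> (value, expires_at)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]              # cache hit, database untouched
        value = self._store(key)         # cache miss: go to the database
        self._cache[key] = (value, time.time() + self._ttl)
        return value
```

The design choice is that the cache never has to be pre-loaded and stale entries simply expire; the trade-off is one extra database round trip per miss.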
Distributed locks
Distributed locks are critical: with multiple topologies in the Storm cluster, contention for critical resources is very likely.
Building distributed locks on ZooKeeper has become common practice, but ZooKeeper-based locks still have robustness and lock-efficiency problems that must be considered at design time.
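ZooKeeper's standard lock recipe creates an ephemeral sequential znode under a lock path; the client holding the lowest sequence number owns the lock, and each waiter watches its predecessor. The sketch below simulates only that ordering logic in-process — ZkLockSimulator is a stand-in, not a real ZooKeeper client; a production system would use a library such as Apache Curator or kazoo:

```python
import threading

class ZkLockSimulator:
    """In-process simulation of ZooKeeper's ephemeral-sequential lock
    recipe: each acquirer gets a monotonically increasing sequence
    number, and only the lowest outstanding number holds the lock."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = []                  # outstanding sequence numbers
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            # Step 1: "create" an ephemeral sequential node.
            my_seq = self._next_seq
            self._next_seq += 1
            self._nodes.append(my_seq)
            # Step 2: wait until ours is the lowest sequence number
            # (real ZooKeeper clients watch only their predecessor node).
            while min(self._nodes) != my_seq:
                self._cond.wait()
            return my_seq

    def release(self, my_seq):
        # Deleting the node wakes the next waiter in sequence order.
        with self._cond:
            self._nodes.remove(my_seq)
            self._cond.notify_all()
```

In real ZooKeeper the node is ephemeral, so a crashed client's lock is released automatically when its session expires — one of the robustness details the text warns must be considered at design time.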
Division of labor between Hadoop and Oracle
These two components differ greatly in usage cost and applicable scenarios.
In this application, Hadoop can store unstructured data such as raw results, while Oracle's structured storage, reliability, and ease of use make it the choice for storing the final processing results for maintenance and presentation.
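That split can be sketched as a simple routing rule at the end of the processing pipeline. The stage field and the hdfs_sink / oracle_sink callables below are hypothetical stand-ins for real clients (e.g. an HDFS append and a JDBC insert):

```python
def route_record(record, hdfs_sink, oracle_sink):
    """Route a processed record to the appropriate store: bulky raw
    payloads go to HDFS (cheap bulk storage, reprocessable later),
    compact final results go to Oracle for querying and presentation."""
    if record.get("stage") == "raw":
        hdfs_sink(record)
    else:
        oracle_sink(record)
```

Keeping raw results in Hadoop means the final tables in Oracle stay small and queryable, while a reprocessing run can always be replayed from the raw store.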
An Extensible Real-Time Data Processing Architecture Based on Storm