Description: This article is a translation of the "Structure of the codebase" page from Storm's official wiki on GitHub, intended to help readers who are studying Storm at the source-code level.
Storm's source code is divided into three distinct layers.
First, Storm was designed from the start to be compatible with multiple languages. Nimbus is a Thrift service, and topologies are defined as Thrift structures.
Advantage of Thrift: it allows Storm to be used from any development language.
Second, all of Storm's interfaces are defined as Java interfaces. So although much of Storm is implemented in Clojure, all use of it must go through the Java API. This means that every feature of Storm is available from Java.
Third, a large part of Storm's implementation is Clojure code. By line count, the codebase is roughly half Java and half Clojure. But because Clojure is more expressive, the majority of the implementation logic is actually in Clojure.
The following sections explain each of these three layers in detail.
storm.thrift
To understand Storm's code structure, the first thing to look at is the storm.thrift file.
Storm uses a fork of Thrift to produce the generated code. This fork is actually Thrift 7 with all Java packages renamed to "org.apache.thrift7"; otherwise it is identical to Thrift 7. The fork exists because Thrift lacks backward compatibility and some users need other versions of Thrift in their own topologies, so the renaming avoids package name collisions.
Every spout or bolt in a topology is assigned a user-specified identifier called the "component id". The component id is used when describing which output streams of other spouts or bolts a bolt subscribes to. The StormTopology structure holds, for each type of component (spouts and bolts), a map from component id to component.
Spouts and bolts have the same Thrift definition, so we only need to look at the Thrift definition of a bolt. It contains a "ComponentObject" structure and a "ComponentCommon" structure.
"ComponentObject" defines the implementation of the bolt. It can be one of the following three types:
- A serialized Java object (one that implements the IBolt interface)
- A "ShellComponent" object, which indicates that the bolt is implemented in another language. If you define a bolt this way, Storm instantiates a ShellBolt object to handle the communication between the JVM-based worker process and the non-JVM implementation of the component (that is, the bolt).
- A "JavaObject" structure, which tells Storm the class name and constructor arguments needed to instantiate the bolt. This is useful when you want to define a topology in a non-JVM language: you can use JVM-based spouts and bolts without having to create and serialize their Java objects yourself.
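The three alternatives behave like a tagged union. The following is a minimal sketch of that idea in plain Java, not the code generated from storm.thrift: the class `ComponentObjectSketch`, its field names, and the `describe` method are hypothetical and exist only to illustrate the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the three ComponentObject alternatives.
// Exactly one group of fields is populated, depending on `kind`.
final class ComponentObjectSketch {
    enum Kind { SERIALIZED_JAVA, SHELL, JAVA_OBJECT }

    final Kind kind;
    final byte[] javaBytes;        // SERIALIZED_JAVA: the serialized bolt object
    final String shellCommand;     // SHELL: program to launch, e.g. "python"
    final String shellScript;      // SHELL: script implementing the bolt
    final String className;        // JAVA_OBJECT: class to instantiate
    final List<Object> ctorArgs;   // JAVA_OBJECT: constructor arguments

    private ComponentObjectSketch(Kind kind, byte[] javaBytes, String shellCommand,
                                  String shellScript, String className, List<Object> ctorArgs) {
        this.kind = kind;
        this.javaBytes = javaBytes;
        this.shellCommand = shellCommand;
        this.shellScript = shellScript;
        this.className = className;
        this.ctorArgs = ctorArgs;
    }

    static ComponentObjectSketch serializedJava(byte[] bytes) {
        return new ComponentObjectSketch(Kind.SERIALIZED_JAVA, bytes.clone(), null, null, null,
                Collections.<Object>emptyList());
    }

    static ComponentObjectSketch shell(String command, String script) {
        return new ComponentObjectSketch(Kind.SHELL, null, command, script, null,
                Collections.<Object>emptyList());
    }

    static ComponentObjectSketch javaObject(String className, Object... args) {
        return new ComponentObjectSketch(Kind.JAVA_OBJECT, null, null, null, className,
                Arrays.asList(args));
    }

    // What a worker would conceptually do with each alternative.
    String describe() {
        switch (kind) {
            case SERIALIZED_JAVA:
                return "deserialize " + javaBytes.length + " bytes into a bolt";
            case SHELL:
                return "wrap '" + shellCommand + " " + shellScript + "' in a shell bolt";
            default:
                return "instantiate " + className + " with " + ctorArgs.size() + " argument(s)";
        }
    }

    public static void main(String[] args) {
        System.out.println(shell("python", "splitsentence.py").describe());
        System.out.println(javaObject("com.example.CountBolt", 10).describe());
    }
}
```

The script name and the `com.example.CountBolt` class are placeholders; any multi-language bolt or JVM class would fit the same shape.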
"ComponentCommon" defines everything else about this component, including:
- Which streams this component emits and the metadata of each stream (whether it is a direct stream, and the stream's field declaration)
- Which streams this component consumes (specified as a map from component_id:stream_id to the stream grouping to use)
- The parallelism of this component
- The component-specific configuration of this component
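To make the four items concrete, here is a small sketch in plain Java. It is not the class generated from storm.thrift: `ComponentCommonSketch`, `StreamKey`, `StreamInfo`, and all field names are hypothetical, and the grouping is reduced to a descriptive string for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of the data carried per component.
final class ComponentCommonSketch {
    // Identifies one output stream of one component (component_id:stream_id).
    static final class StreamKey {
        final String componentId;
        final String streamId;

        StreamKey(String componentId, String streamId) {
            this.componentId = componentId;
            this.streamId = streamId;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof StreamKey)) return false;
            StreamKey k = (StreamKey) o;
            return componentId.equals(k.componentId) && streamId.equals(k.streamId);
        }

        @Override public int hashCode() {
            return Objects.hash(componentId, streamId);
        }
    }

    // Metadata of one emitted stream.
    static final class StreamInfo {
        final List<String> fields;
        final boolean direct;

        StreamInfo(List<String> fields, boolean direct) {
            this.fields = new ArrayList<>(fields);
            this.direct = direct;
        }
    }

    final Map<String, StreamInfo> emits = new HashMap<>();    // stream id -> metadata
    final Map<StreamKey, String> consumes = new HashMap<>();  // subscribed stream -> grouping
    int parallelism = 1;
    final Map<String, Object> config = new HashMap<>();

    // A word-count bolt that reads the "default" stream of a "split" component.
    static ComponentCommonSketch wordCountExample() {
        ComponentCommonSketch c = new ComponentCommonSketch();
        c.emits.put("default", new StreamInfo(Arrays.asList("word", "count"), false));
        c.consumes.put(new StreamKey("split", "default"), "fields(word)");
        c.parallelism = 4;
        c.config.put("topology.debug", Boolean.FALSE);
        return c;
    }

    public static void main(String[] args) {
        ComponentCommonSketch c = wordCountExample();
        System.out.println(c.consumes.get(new StreamKey("split", "default")));
    }
}
```

Keying the inputs by the pair (component id, stream id) is what lets one bolt subscribe to several streams of the same upstream component with different groupings.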
Note that the spout structure also has a "ComponentCommon" field, so a spout can also be declared to consume other input streams. However, the Storm Java API does not provide a way to specify which streams a spout consumes, and if you put any input declarations on a spout, you get an error when the topology is submitted. The input declarations on spouts are not meant for users; they are used internally by Storm. Storm implicitly adds streams and bolts to the topology to set up the acking framework, and two of these implicit streams run from the acker bolt to every spout in the topology. Whenever a tuple tree is detected to have completed or failed, the acker sends an "ack" or "fail" message along the corresponding stream. The code that converts a user-submitted topology into the runtime topology is in backtype.storm.daemon.common (see the system-topology! function).
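The implicit wiring can be sketched as a function that, given the spout ids of a user topology, returns the extra input subscriptions each spout receives. This is an illustration in plain Java rather than Storm's actual Clojure code; the class name and the three string constants are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the acker wiring added when a user topology
// is turned into the topology that actually runs.
final class AckerWiringSketch {
    // Illustrative identifiers, assumed for this sketch.
    static final String ACKER_ID = "__acker";
    static final String ACK_STREAM = "__ack_ack";
    static final String FAIL_STREAM = "__ack_fail";

    // For every spout, subscribe it to the acker's "ack" and "fail" streams.
    // Each subscription is written as component_id:stream_id.
    static Map<String, List<String>> implicitSpoutInputs(Collection<String> spoutIds) {
        Map<String, List<String>> inputs = new TreeMap<>();
        for (String spout : spoutIds) {
            inputs.put(spout, Arrays.asList(
                    ACKER_ID + ":" + ACK_STREAM,
                    ACKER_ID + ":" + FAIL_STREAM));
        }
        return inputs;
    }

    public static void main(String[] args) {
        System.out.println(implicitSpoutInputs(Arrays.asList("sentences", "clicks")));
    }
}
```

Because these subscriptions are added after the user's topology has been validated, the user-facing API can keep rejecting spout inputs while the runtime topology still contains them.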
Java interfaces
Storm's interfaces are generally specified as Java interfaces. The main interfaces are:
- IRichBolt
- IRichSpout
- TopologyBuilder
The strategy behind most of these interfaces is to:
- Specify the interface as a Java interface
- Provide a base class with default implementations where appropriate
This strategy can be seen at work in the BaseRichSpout class.
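The interface-plus-base-class strategy can be illustrated with a stripped-down example. The types below are hypothetical stand-ins written for this sketch, not Storm's real IRichSpout or BaseRichSpout, and the choice of which methods receive default bodies is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified spout interface: every callback is declared here.
interface SketchSpout {
    void open(Map<String, Object> conf);
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
    void close();
}

// Base class supplying no-op defaults for the callbacks many spouts ignore.
abstract class BaseSketchSpout implements SketchSpout {
    @Override public void ack(Object msgId) { }
    @Override public void fail(Object msgId) { }
    @Override public void close() { }
}

// A user-defined spout only implements what it actually needs.
final class CountingSpout extends BaseSketchSpout {
    boolean opened = false;
    int emitted = 0;

    @Override public void open(Map<String, Object> conf) {
        opened = true;
    }

    @Override public void nextTuple() {
        emitted++; // a real spout would emit a tuple here
    }
}

final class BaseClassDemo {
    public static void main(String[] args) {
        CountingSpout spout = new CountingSpout();
        spout.open(new HashMap<String, Object>());
        spout.nextTuple();
        spout.ack("msg-1"); // inherited no-op
        System.out.println(spout.emitted);
    }
}
```

The interface stays the single source of truth for what a component must support, while the base class keeps user code short.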
Spouts and bolts are serialized into the Thrift definition of the topology in the manner described above.
One detail worth mentioning is the difference between IBolt and ISpout on the one hand and IRichBolt and IRichSpout on the other. The main difference is that the "Rich" versions add the "declareOutputFields" method. The reason for the split is that the output field declaration of every output stream must be part of the Thrift structure (so that it can be specified from any programming language), yet users want to declare their streams' output fields in their own classes. To reconcile the two, "TopologyBuilder", when constructing the Thrift structure, calls "declareOutputFields" to obtain the declaration and converts it into the Thrift structure. This conversion can be seen in the "TopologyBuilder" code.
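The conversion step amounts to handing the component a collector object, letting the component declare its streams into it, and then reading the result back as plain data. Below is a minimal sketch of that pattern; `FieldsDeclarerSketch`, `RichComponentSketch`, and `OutputFieldsCollector` are hypothetical names, and the collected map merely stands in for the Thrift structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback handed to a component so it can declare its streams.
interface FieldsDeclarerSketch {
    void declare(String streamId, List<String> fields);
}

// Hypothetical "Rich" component: it knows how to declare its own output fields.
interface RichComponentSketch {
    void declareOutputFields(FieldsDeclarerSketch declarer);
}

// Collects declarations into plain data that could be copied into a Thrift struct.
final class OutputFieldsCollector implements FieldsDeclarerSketch {
    private final Map<String, List<String>> streams = new LinkedHashMap<>();

    @Override public void declare(String streamId, List<String> fields) {
        if (streams.containsKey(streamId)) {
            throw new IllegalArgumentException("stream declared twice: " + streamId);
        }
        streams.put(streamId, new ArrayList<>(fields));
    }

    // What a topology builder would do for each component it is given.
    static Map<String, List<String>> collect(RichComponentSketch component) {
        OutputFieldsCollector collector = new OutputFieldsCollector();
        component.declareOutputFields(collector);
        return collector.streams;
    }

    public static void main(String[] args) {
        RichComponentSketch wordCount =
                d -> d.declare("default", Arrays.asList("word", "count"));
        System.out.println(collect(wordCount));
    }
}
```

With this split, the user's class owns the declaration while the language-neutral structure only ever sees the resulting stream-to-fields data.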
Interface implementation
Specifying all of Storm's functionality through Java interfaces ensures that every feature of Storm is available from Java. The focus on Java interfaces also gives Java users a better experience when using Storm.
Storm itself, on the other hand, is implemented mainly in Clojure. Although the code is about half Java and half Clojure by line count, most of the implementation logic is in Clojure. There are two notable exceptions: DRPC and transactional topologies, both of which are implemented purely in Java. This was done to illustrate how to build a higher-level abstraction on top of Storm. The DRPC and transactional topology implementations are located in the backtype.storm.coordination, backtype.storm.drpc, and backtype.storm.transactional packages.
Here is a summary of the main Java packages and Clojure namespaces:
Java packages
- backtype.storm.coordination: Implements the batch-processing coordination on top of Storm that both DRPC and transactional topologies use. The most important class in this package is CoordinatedBolt.
- backtype.storm.drpc: The concrete implementation of the higher-level DRPC abstraction.
- backtype.storm.generated: The automatically generated Thrift code (generated using the Thrift fork, which mainly renames the org.apache.thrift packages to org.apache.thrift7 to avoid conflicts with other Thrift versions).
- backtype.storm.grouping: Contains the interface that users implement to write a custom stream grouping.
- backtype.storm.hooks: Defines the hook interfaces for handling various Storm events, such as when a task emits a tuple or when a tuple is acked. See the Hooks documentation for details.
- backtype.storm.serialization: The implementation of tuple serialization and deserialization in Storm. Built on top of Kryo.
- backtype.storm.spout: Definitions of the spout and related interfaces (such as "SpoutOutputCollector"). Also contains "ShellSpout", which implements the protocol for defining spouts in non-JVM languages.
- backtype.storm.task: Definitions of the bolt and related interfaces (such as "OutputCollector"). Also contains "ShellBolt", which implements the protocol for defining bolts in non-JVM languages. Finally, "TopologyContext" is defined here; spouts and bolts use it at runtime to obtain information about the topology and its execution.
- backtype.storm.testing: Contains a variety of test bolts and utilities used in Storm's unit tests.
- backtype.storm.topology: The Java layer on top of the Thrift structures, providing a pure Java API for using Storm (users do not need to know the Thrift details). "TopologyBuilder" and the base classes for the various spouts and bolts are defined here. The slightly higher-level interface "IBasicBolt" is also defined here; it makes certain kinds of bolts more concise to write.
- backtype.storm.transactional: Contains the implementation of transactional topologies.
- backtype.storm.tuple: Contains the implementation of Storm's tuple data model.
- backtype.storm.utils: Contains the data structures and miscellaneous utility classes used throughout the Storm source.
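As an illustration of what a custom stream grouping from the backtype.storm.grouping entry above has to decide, here is a small sketch in plain Java. The `GroupingSketch` interface is a hypothetical simplification written for this example, not Storm's actual interface: a grouping is first told which tasks it may target and then picks target tasks for each tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical, simplified grouping contract.
interface GroupingSketch {
    void prepare(List<Integer> targetTasks);          // called once with the candidate task ids
    List<Integer> chooseTasks(List<Object> values);   // called per tuple
}

// Routes tuples by the hash of their first value, so equal keys reach the same task.
final class FirstValueGrouping implements GroupingSketch {
    private List<Integer> targets = Collections.emptyList();

    @Override public void prepare(List<Integer> targetTasks) {
        // Assumes a non-empty list of target tasks.
        targets = new ArrayList<>(targetTasks);
    }

    @Override public List<Integer> chooseTasks(List<Object> values) {
        int index = Math.floorMod(Objects.hashCode(values.get(0)), targets.size());
        return Collections.singletonList(targets.get(index));
    }

    public static void main(String[] args) {
        FirstValueGrouping grouping = new FirstValueGrouping();
        grouping.prepare(Arrays.asList(4, 5, 6));
        System.out.println(grouping.chooseTasks(Arrays.<Object>asList("storm", 1)));
    }
}
```

`Math.floorMod` is used instead of `%` so that negative hash codes still map to a valid index.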
Clojure namespaces
- backtype.storm.bootstrap: Contains a useful macro that imports all the classes and namespaces used throughout the source code.
- backtype.storm.clojure: Contains the Clojure domain-specific language (DSL) for Storm.
- backtype.storm.cluster: All the ZooKeeper logic used in the Storm daemons is encapsulated in this file. This code provides the API that maps the running state of the whole cluster (for example, which tasks run where and which spout or bolt each task runs) onto the ZooKeeper "filesystem".
- backtype.storm.command.*: These namespaces implement the various client commands of the form "storm xxx". The implementations are very short.
- backtype.storm.config: The Clojure implementation of config reading and parsing. It also includes utility functions that tell Nimbus, the supervisor, and other daemons which local directories to use in various situations. For example, the "master-inbox" function returns the local directory where Nimbus should save jars that are uploaded to it.
- backtype.storm.daemon.acker: The implementation of the "acker" bolt. This is a key part of how Storm guarantees that data is fully processed.
- backtype.storm.daemon.common: Common functions used by the Storm daemons, such as getting a topology's id from its name and mapping a user-defined topology to the topology that actually runs (the user-defined topology plus the implicit ack streams and acker bolt; see the system-topology! function). It also contains the definitions of the various heartbeat and other data structures in Storm.
- backtype.storm.daemon.drpc: Contains the implementation of the DRPC server used with DRPC topologies.
- backtype.storm.daemon.nimbus: Contains the implementation of Nimbus.
- backtype.storm.daemon.supervisor: Contains the implementation of the supervisor.
- backtype.storm.daemon.task: The implementation of a task instance for a spout or bolt. This includes message routing, serialization, statistics collection for the UI, and the spout- and bolt-specific execution logic.
- backtype.storm.daemon.worker: Contains the implementation of the worker process (one worker contains many tasks). Includes message transport and task launching.
- backtype.storm.event: Implements a simple asynchronous function executor. Nimbus and the supervisor use it in many places to run functions one at a time and so avoid race conditions.
- backtype.storm.log: Defines the functions for writing log messages to log4j.
- backtype.storm.messaging.*: Defines a higher-level interface for point-to-point messaging. In local mode, Storm uses in-memory Java queues to simulate message delivery. In cluster mode, messaging uses ZeroMQ. The generic interface is defined in protocol.clj.
- backtype.storm.stats: Implements the routines that roll up the statistics written to ZooKeeper for use by the UI. Aggregation at several granularities is implemented.
- backtype.storm.testing: Implements tools for testing Storm topologies. These include time simulation, "complete-topology" for running a fixed set of tuples through a topology and capturing a snapshot of the output, "tracker topologies" for fine-grained control over detecting when a cluster is "idle", and other utilities.
- backtype.storm.thrift: Contains Clojure wrappers around the generated Thrift API that make the Thrift structures easier to work with.
- backtype.storm.timer: Implements a background timer that runs functions after a delay or on a recurring interval. Storm cannot use Java's Timer class because the timer must integrate with time simulation so that Nimbus and the supervisor can be tested.
- backtype.storm.ui.*: The implementation of the Storm UI. It is completely independent of the rest of the code and obtains the data it needs through the Nimbus Thrift API.
- backtype.storm.util: Contains the general-purpose utility functions used throughout the Storm code.
- backtype.storm.zookeeper: Contains the Clojure wrapper around the ZooKeeper API and also provides some higher-level operations, such as "mkdirs" and "delete-recursive".
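The backtype.storm.timer entry above says the timer has to integrate with time simulation. One way to picture why that matters is a timer driven by a simulated clock that a test advances by hand. The sketch below is written in Java for illustration only and does not reflect how Storm's Clojure timer is implemented; `SimulatedTimerSketch` and its methods are hypothetical, and the background thread a real timer would have is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical timer whose notion of "now" is advanced explicitly by the caller.
final class SimulatedTimerSketch {
    private static final class Entry implements Comparable<Entry> {
        final long dueMs;
        final long seq;       // keeps insertion order for tasks due at the same time
        final Runnable task;

        Entry(long dueMs, long seq, Runnable task) {
            this.dueMs = dueMs;
            this.seq = seq;
            this.task = task;
        }

        @Override public int compareTo(Entry other) {
            int byTime = Long.compare(dueMs, other.dueMs);
            return byTime != 0 ? byTime : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();
    private long nowMs = 0;
    private long nextSeq = 0;

    long now() {
        return nowMs;
    }

    // Schedule a task to run delayMs after the current simulated time.
    void schedule(long delayMs, Runnable task) {
        queue.add(new Entry(nowMs + delayMs, nextSeq++, task));
    }

    // Advance simulated time, running every task that comes due, in due order.
    void advance(long ms) {
        long target = nowMs + ms;
        while (!queue.isEmpty() && queue.peek().dueMs <= target) {
            Entry entry = queue.poll();
            nowMs = entry.dueMs; // the task observes the time at which it was due
            entry.task.run();
        }
        nowMs = target;
    }

    public static void main(String[] args) {
        SimulatedTimerSketch timer = new SimulatedTimerSketch();
        List<String> log = new ArrayList<>();
        timer.schedule(100, () -> log.add("heartbeat"));
        timer.schedule(50, () -> log.add("cleanup"));
        timer.advance(100);
        System.out.println(log);
    }
}
```

A test can schedule work, advance the clock by an exact amount, and assert on what ran, with no sleeping and no dependence on wall-clock time.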