Simple and clear, Storm makes big data analysis easier and more enjoyable.
In today's world, a company's day-to-day operations often generate terabytes of data. The sources include anything that connected devices can capture: web sites, social media, transactional business data, and data created in other business environments. Given the volume of data being generated, real-time processing has become a major challenge for many organizations. A very effective open source real-time computation tool that we often use is Storm, developed by Twitter and often compared to a "real-time Hadoop". Storm, however, is far simpler than Hadoop, because processing big data with it does not require switching back and forth between old and new technologies.
Shruthi Kumar and Siddharth Patankar both work for Infosys, engaged in technical analysis and in research and development respectively. This article details the use of Storm through an example project named "Speeding Alert System". The goal is to analyze data about passing vehicles in real time; once a vehicle's data exceeds a preset threshold, a trigger fires and the relevant data is stored in a database.
Storm
Compared with Hadoop's batch processing, Storm is a real-time, distributed, and highly fault-tolerant computation system. Like Hadoop, Storm can process large volumes of data, but Storm does so in real time and with high reliability: every message is guaranteed to be processed. Storm is also fault tolerant and distributed, which allows it to scale out across many machines to handle large volumes of data. It also has the following features:
Easy to scale: to scale out, you only need to add machines and change the corresponding topology settings. Storm uses Zookeeper (from the Hadoop ecosystem) for cluster coordination, which is sufficient to keep large clusters running well.
The processing of every message can be guaranteed.
Storm cluster management is simple.
Storm is fault tolerant: once a topology is submitted, Storm runs it until the topology is killed or shut down; when an error occurs during execution, Storm reassigns the task.
Although topologies are usually written in Java, Storm can be used with any programming language.
Of course, to better follow this article, you first need to install and set up Storm. There are just a few simple steps to follow:
Download the Storm release from the official Storm site, extract it, add the bin/ directory to your PATH, and make sure the bin/storm script is executable.
Storm components
The Storm cluster consists of a master node and a group of worker nodes, coordinated through Zookeeper.
Master node:
The master node runs a daemon called Nimbus, which is responsible for distributing code to the nodes in the cluster, assigning tasks, and monitoring for failures. It is similar to the JobTracker in Hadoop.
Worker node:
Each worker node also runs a daemon, called the Supervisor, which listens for work assigned to its machine and starts or stops worker processes as required. Each worker process executes a subset of a topology. Coordination between Nimbus and the Supervisors is handled through a Zookeeper system or cluster.
Zookeeper
Zookeeper is the service that coordinates the Supervisors and Nimbus. The real-time application logic is packaged into a Storm "topology". A topology is a graph in which spouts (data sources) and bolts (data operations) are connected by stream groupings. The terms that appear here are analyzed in more depth below.
Spout:
In short, a spout reads data from a source and emits it into the topology. Spouts come in two kinds, reliable and unreliable: a reliable spout re-emits a tuple (a list of data items) when Storm fails to process it, while an unreliable spout does not track success and emits each tuple only once. The most important method on a spout is nextTuple(), which either emits a new tuple into the topology or simply returns if there is nothing new to emit.
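As a rough illustration, here is a minimal spout sketch written against the pre-Apache backtype.storm API that was current when this article was published; the class name LineSpout and the readNextLine() helper are assumptions made for this example, not part of the article's project.

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class LineSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String line = readNextLine();         // hypothetical helper standing in for the real data source
        if (line != null) {
            collector.emit(new Values(line)); // emit one new tuple into the topology
        }
        // if there is nothing new to emit, simply return
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    private String readNextLine() {
        return null; // placeholder: no new data in this sketch
    }
}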
Bolt:
All processing in a topology is done in bolts. A bolt can do anything: filtering, aggregation, reading from or writing to files and databases, and so on. A bolt receives data from a spout and processes it; for complex stream transformations, one bolt may emit tuples to another bolt for further processing. The most important method on a bolt is execute(), which takes a new tuple as its parameter. Both spouts and bolts can emit tuples to more than one stream, and these streams are declared with declareStream().
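By way of illustration, here is a minimal bolt sketch against the same backtype.storm API; the class name FilterBolt and its trivial filtering logic are assumptions made for this example.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class FilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String line = input.getString(0);      // the field emitted by the upstream spout
        if (line != null && !line.isEmpty()) { // trivial filtering: drop empty lines
            collector.emit(new Values(line));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}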
Stream groupings:
A stream grouping defines how a stream is partitioned among a bolt's tasks. Storm provides the following six stream grouping types:
1. Shuffle grouping: tuples are randomly distributed across the bolt's tasks in a way that guarantees each task receives an equal number of tuples.
2. Fields grouping: the stream is partitioned by the specified field. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" always go to the same task, while tuples with different "user-id"s may go to different tasks.
3. All grouping: the tuple is replicated to all of the bolt's tasks. Use this grouping type with care.
4. Global grouping: the entire stream goes to a single one of the bolt's tasks, specifically the task with the lowest ID.
5. None grouping: you do not care how the stream is grouped. Currently, none grouping is equivalent to shuffle grouping; eventually, though, Storm will, where possible, execute bolts with this grouping in the same thread as the bolt or spout they subscribe to.
6. Direct grouping: a special kind of grouping in which the producer of the tuple decides which task of the consuming bolt will receive it.
Of course, you can also implement the CustomStreamGrouping interface to customize the grouping you need; a small sketch of wiring these groupings into a topology follows.
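For orientation only, here is a minimal sketch of how groupings are declared when a topology is built, reusing the hypothetical LineSpout and FilterBolt sketched above; the component IDs, parallelism hints, and local-cluster submission are illustrative assumptions.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("line-spout", new LineSpout(), 1);

        // shuffle grouping: tuples are spread randomly and evenly across the two tasks
        builder.setBolt("filter-bolt", new FilterBolt(), 2)
               .shuffleGrouping("line-spout");

        // fields grouping: tuples with the same "line" value always reach the same task
        builder.setBolt("grouped-bolt", new FilterBolt(), 2)
               .fieldsGrouping("filter-bolt", new Fields("line"));

        // run in a local in-process cluster for testing
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("grouping-example", new Config(), builder.createTopology());
    }
}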
Project implementation
We now need to design a topology of spouts and bolts that can process a large volume of data (a log file) and trigger an alert when a particular value exceeds a preset threshold. The Storm topology reads the log file line by line and monitors the incoming data. Among the Storm components, the spout is responsible for reading the input data. It not only reads data from existing files but also watches for new files; as soon as a file is modified, the spout reads the new content, converts it into tuples (a format the bolt can read), and emits the tuples to the bolt for threshold analysis, which finds every record that may exceed the threshold.
The use cases are described in detail in the next section.
Critical analysis
In this section, we focus on two types of threshold analysis: instant threshold and time series threshold.
Instant threshold monitoring: the value of a field exceeds the preset threshold at a single instant and, if the condition is met, a trigger fires. For example, a trigger fires when a vehicle's speed exceeds 80 km/h.
Time series threshold monitoring: the value of a field exceeds the preset threshold within a given time period and, if the condition is met, a trigger fires. For example, a vehicle whose speed exceeds 80 km/h two or more times within 5 minutes. A small illustration of both checks follows.
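Just to make the two checks concrete, here is a small, Storm-independent sketch in plain Java; the speed limit, window length, and violation count come from the example in the text, while the class and method names are assumptions made for illustration.

import java.util.ArrayDeque;
import java.util.Deque;

public class ThresholdChecks {
    private static final double SPEED_LIMIT_KMH = 80.0;    // preset threshold from the example
    private static final long WINDOW_MS = 5 * 60 * 1000L;  // 5-minute window
    private static final int MIN_VIOLATIONS = 2;           // two or more violations in the window

    private final Deque<Long> violationTimes = new ArrayDeque<Long>();

    // instant threshold: fire as soon as a single reading exceeds the limit
    public boolean instantThresholdExceeded(double speedKmh) {
        return speedKmh > SPEED_LIMIT_KMH;
    }

    // time series threshold: fire when the limit is exceeded twice or more within 5 minutes
    public boolean timeSeriesThresholdExceeded(double speedKmh, long timestampMs) {
        if (speedKmh > SPEED_LIMIT_KMH) {
            violationTimes.addLast(timestampMs);
        }
        // drop violations that have fallen outside the 5-minute window
        while (!violationTimes.isEmpty()
                && timestampMs - violationTimes.peekFirst() > WINDOW_MS) {
            violationTimes.removeFirst();
        }
        return violationTimes.size() >= MIN_VIOLATIONS;
    }
}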
Listing 1 shows the kind of log we will use; each record contains the vehicle data: the license plate number, the speed the vehicle was travelling at, and the location where the data was captured.
AB 123, North City
BC 123, South City
CD 234, South City
DE 123, East City
EF 123, South City
GH 123
A corresponding XML file is created here, containing the schema for the incoming data; this XML file is used when parsing the log file. The design schema of the XML and the corresponding descriptions are shown in the table below.
Both the XML file and the log file are placed in a directory that the spout monitors continuously, so that updates to the files are picked up in real time. The topology for this use case is shown in the figure below.
Figure 1: Storm topology built to process data in real time
As shown in the figure, the FileListenerSpout receives the input log and reads it line by line, then emits the data to the ThresholdCalculatorBolt for threshold processing. Once processing is complete, the data for the processed row is sent to the DBWriterBolt, which writes it to the database. The implementation of this process is explained in detail below.
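To show how these three components might be wired together, here is a minimal, assumed sketch of the topology setup. FileListenerSpout, ThresholdCalculatorBolt, and DBWriterBolt are the components described above (their implementations are covered later in the article); the component IDs, parallelism hints, groupings, and local-cluster submission are assumptions made for this sketch.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class SpeedingAlertTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // reads the log file line by line and emits each line as a tuple
        builder.setSpout("file-listener-spout", new FileListenerSpout(), 1);

        // checks every record against the preset threshold
        builder.setBolt("threshold-calculator-bolt", new ThresholdCalculatorBolt(), 1)
               .shuffleGrouping("file-listener-spout");

        // writes the records that exceed the threshold to the database
        builder.setBolt("db-writer-bolt", new DBWriterBolt(), 1)
               .shuffleGrouping("threshold-calculator-bolt");

        // submit to a local in-process cluster for testing
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("speeding-alert", new Config(), builder.createTopology());
    }
}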
Implementation of the spout
The spout takes the log file and the XML description file as its inputs. The XML file contains the schema that matches the log. Consider an example log file containing vehicle data: the license plate number, the speed of travel, and the location where the data was captured. (See the figure below.)
Figure 2: Flow of data from the log file to the spout