Reprinted from http://www.ibm.com/developerworks/cn/opensource/os-twitterstorm/
Using Twitter Storm to process real-time big data: an introduction to streaming big data
Storm is an open source big data processing system that differs from other systems in that it is designed for distributed real-time processing and is language agnostic. Learn about Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.
Hadoop, the undisputed king of big data analytics, focuses on batch processing. That model is sufficient for many scenarios, such as indexing the web, but other usage models require real-time information from highly dynamic sources. Solving that problem is the goal of Storm, which Nathan Marz created at BackType (a company since acquired by Twitter). Storm processes not static data but continuous streams of data. Given that Twitter users generate 140 million tweets per day, it is easy to see the enormous uses for this technology.
But Storm is more than a traditional big data analytics system: it is an example of a complex event processing (CEP) system. CEP systems are typically categorized as computation-oriented or detection-oriented, and each type can be implemented in Storm through user-defined algorithms. CEP can, for example, be used to identify meaningful events in a torrent of events and then act on those events in real time.
Nathan Marz offers a number of examples of how Storm is used within Twitter. One of the most interesting is generating trend information. Twitter extracts emerging trends from the torrent of tweets and maintains them at the local and national level. This means that as a story begins to emerge, Twitter's trending-topics algorithm identifies the topic in real time. This real-time algorithm is implemented in Storm as a continuous analysis of Twitter data.
What is "Big data"?
Big data refers to data volumes too massive to manage through traditional means. Internet-scale data has driven the creation of new architectures and applications able to process this new class of data. These architectures are highly scalable and can process data efficiently and in parallel across a virtually unlimited number of servers.
Storm vs. traditional Big Data
What makes Storm different from other big data solutions is the way data is processed. Hadoop is fundamentally a batch processing system: data is introduced into the Hadoop Distributed File System (HDFS) and distributed across nodes for processing; when processing completes, the resulting data is returned to HDFS for use by the originator. Storm instead supports the construction of topologies that transform unterminated streams of data. Unlike Hadoop jobs, these transformations never stop; they continue to process data as it arrives.
Big Data implementations
The core of Hadoop is written in the Java™ language but supports data analysis applications written in a variety of languages. More recent entrants have taken more esoteric routes to exploit modern languages and their features. For example, Spark from the University of California (UC), Berkeley, is implemented in the Scala language, while Twitter Storm is implemented in Clojure (pronounced "closure").
Clojure is a modern dialect of the Lisp language. Like Lisp, Clojure supports a functional style of programming, but Clojure also introduces features that simplify multithreaded programming, a capability useful for building Storm. Clojure is a virtual machine (VM)-based language that runs on the Java Virtual Machine. But even though Storm is developed in Clojure, you can write Storm applications in almost any language; all that is needed is an adapter that connects to Storm's architecture. Adapters exist for Scala, JRuby, Perl, and PHP, and there is a structured query language (SQL) adapter that supports streaming into a Storm topology.
Key properties of Storm
Several characteristics of Storm's implementation determine its performance and reliability. Storm uses ZeroMQ for message transport, which eliminates intermediate queueing and allows messages to flow directly between the tasks themselves. Behind the messaging sits an automated and efficient mechanism for serializing and deserializing Storm's primitive types.
One of the most interesting aspects of Storm is its focus on fault tolerance and management. Storm implements guaranteed message processing, so every tuple is fully processed through the topology; if a tuple is found not to have been processed, it is automatically replayed from the spout. Storm also implements task-level fault detection: when a task fails, its messages are automatically reassigned so that processing quickly resumes. Storm includes more intelligent process management than Hadoop, with processes managed by supervisors to ensure that resources are adequately used.
Storm model
Storm implements a data flow model in which data flows continuously through a network of transformation entities (see Figure 1). The abstraction for a data flow is the stream, an unbounded sequence of tuples. A tuple is like a structure that can represent standard data types (such as ints, floats, and byte arrays) or, with some additional serialization code, user-defined types. Each stream is defined by a unique ID that can be used to build topologies of data sources and sinks. Streams originate from spouts, which flow data from external sources into the Storm topology.
Figure 1. Conceptual architecture for a common Storm topology
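To make the spout abstraction concrete, here is a minimal sketch of a spout using Storm's Java API. The class name and the readLineFromSource helper are illustrative stand-ins for whatever external feed you consume, not part of the Storm API:

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// A spout that feeds lines from some external source into the topology.
public class LineSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // Storm supplies the collector used to emit tuples
    }

    @Override
    public void nextTuple() {
        String line = readLineFromSource(); // hypothetical external feed
        if (line != null) {
            collector.emit(new Values(line)); // push one tuple into the stream
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line")); // each tuple carries a single "line" field
    }

    private String readLineFromSource() {
        return null; // placeholder for a queue, socket, or API call
    }
}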
The entities that receive streams and provide transformations are called bolts. A bolt implements a single transformation on a stream and all processing within a Storm topology. Bolts can implement traditional things like MapReduce functionality as well as more complex operations (single-step functions) such as filtering, aggregation, or communication with external entities like a database. A typical Storm topology implements multiple transformations and therefore requires multiple bolts with independent tuple streams. Both spouts and bolts are implemented as one or more tasks within a Linux® system.
You can use Storm to easily implement MapReduce functionality for word frequency. As shown in Figure 2, a spout generates the stream of text, and a bolt implements the map function (tokenizing each line of the stream into individual words). The stream from the "map" bolt then flows into a single bolt that implements the reduce function (aggregating the words into totals).
Figure 2. Simple Storm topology for MapReduce functionality
Note that bolts can stream data to multiple other bolts and accept data from multiple sources. Storm provides the concept of stream groupings, which implement shuffling (randomly but evenly distributing tuples across a bolt's tasks) and field grouping (partitioning the stream by the values of the stream's fields). Other stream groupings exist as well, including the ability of the producer to route tuples using its own internal logic.
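As a sketch of how groupings are declared when wiring a topology, the fragment below reuses the LineSpout sketch from earlier; the bolt classes (CleanBolt, CountBolt, MonitorBolt) are hypothetical placeholders, not part of the article's example:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new LineSpout(), 2);

// Shuffle grouping: tuples are distributed randomly but evenly across tasks.
builder.setBolt("clean", new CleanBolt(), 4).shuffleGrouping("spout");

// Fields grouping: tuples with the same "word" value always reach the same task.
builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("clean", new Fields("word"));

// All grouping: every task of this bolt receives a copy of every tuple.
builder.setBolt("monitor", new MonitorBolt(), 1).allGrouping("clean");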
One of the most interesting features of the Storm architecture, however, is guaranteed message processing. Storm guarantees that every tuple a spout emits will be processed; if a tuple has not been processed within the timeout, Storm replays it from the spout. This capability requires some clever tricks for tracking elements through the topology, and it is one of Storm's important value-adds.
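A minimal sketch of what participating in guaranteed message processing looks like inside a bolt, using Storm's standard Java bolt API (the bolt itself is an illustrative pass-through; the anchored emit and the ack are the essential parts):

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A pass-through bolt that cooperates with Storm's tuple tracking.
public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchored emit: the new tuple is tied to its input, so a downstream
        // failure causes the original spout tuple to be replayed.
        collector.emit(input, new Values(input.getString(0).toUpperCase()));
        collector.ack(input); // report the input tuple as fully processed here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}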
In addition to supporting reliable messaging, Storm uses ZeroMQ to maximize messaging performance (removing intermediate queueing so that messages flow directly between tasks). ZeroMQ incorporates congestion detection and adjusts its communication to optimize the available bandwidth.
Storm Sample Demo
Now let's look at a Storm example through code that implements a simple MapReduce topology (see Listing 1). It uses the nicely crafted word count example from Nathan's Storm starter kit, available from GitHub (see Resources for a link). The example implements the topology shown in Figure 2, with one bolt for the map transformation and one bolt for the reduce transformation.
Listing 1. Building the Figure 2 topology in Storm
1  TopologyBuilder builder = new TopologyBuilder();
2
3  builder.setSpout("spout", new RandomSentenceSpout(), 5);
4
5  builder.setBolt("map", new SplitSentence(), 4)
6           .shuffleGrouping("spout");
7
8  builder.setBolt("reduce", new WordCount(), 8)
9           .fieldsGrouping("map", new Fields("word"));
10
11 Config conf = new Config();
12 conf.setDebug(true);
13
14 LocalCluster cluster = new LocalCluster();
15 cluster.submitTopology("word-count", conf, builder.createTopology());
16
17 Thread.sleep(10000);
18
19 cluster.shutdown();
Listing 1 (with line numbers added for reference) first uses TopologyBuilder to declare a new topology. Line 3 defines a spout (named "spout") consisting of a RandomSentenceSpout. The RandomSentenceSpout class (specifically, its nextTuple method) emits one of five random sentences as its data. The argument 5 at the end of the setSpout call is a parallelism hint, that is, the number of tasks to create for this component.
Lines 5 and 6 define the first bolt (or algorithmic transformation entity), in this case the map (or split) bolt. This bolt uses SplitSentence to tokenize the input stream and emit each word as output. Note the use of shuffleGrouping on line 6, which defines the input subscription for this bolt (in this case "spout") and also sets the stream grouping to shuffled. A shuffle grouping means that input from the spout is shuffled, or randomly distributed, across the tasks within the bolt (which has a hinted parallelism of four tasks).
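The starter kit's actual SplitSentence bolt delegates to a Python script through Storm's multi-language support; a pure-Java sketch with equivalent behavior, assuming the sentence arrives as the tuple's first field, might look like this:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Map step: split each incoming sentence and emit one tuple per word.
public class SplitSentence extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // The sentence is assumed to be the tuple's first (and only) field.
        for (String word : tuple.getString(0).split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word")); // one output field, named "word"
    }
}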
Lines 8 and 9 define the last bolt, which effectively serves as the reduce element and takes its input from the map bolt. The WordCount bolt implements the word-count behavior (grouping identical words together to maintain a running total), and its input is not shuffled: the fields grouping on "word" ensures that tuples carrying the same word always reach the same task, so the counts stay consistent. If the words were instead shuffled across multiple reduce tasks, you would end up with fragmented counts, not totals.
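A sketch of a WordCount bolt along the lines of the starter kit's version, keeping an in-memory map from each word to its running count:

import java.util.HashMap;
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Reduce step: maintain a running count per word and emit the updated total.
public class WordCount extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count)); // emit the word with its new total
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Because of the fields grouping, each word is counted by exactly one task, so each task's map holds consistent totals for its share of the vocabulary.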
Lines 11 and 12 create and define a configuration object and enable debug mode. The Config class contains a large number of configuration options (see Resources for a link to more information on the Storm class tree). Lines 14 and 15 then create the local cluster (here, defining that local mode is used) and submit the topology, passing the name for the topology, the configuration object, and the topology itself (obtained through the builder's createTopology method).
Finally, line 17 sleeps for a while to give the topology time to run, and line 19 shuts down the cluster. Keep in mind that Storm is a continuously running system; a topology normally operates indefinitely, processing new tuples as they arrive on the streams it subscribes to.
You can learn more about this simple implementation, including the details of the spout and bolts, in the Storm starter kit.
Using Storm
Nathan Marz has written an easy-to-follow set of documentation detailing how to install Storm for both cluster-mode and local-mode operation. Local mode lets you use Storm without needing a large cluster of nodes. If you want to run Storm on a cluster but lack the nodes, you can also stand up a Storm cluster on Amazon Elastic Compute Cloud (EC2). See Resources for reference information on each Storm mode (local, cluster, and Amazon EC2).
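In cluster mode, the LocalCluster portion of Listing 1 gives way to a submission through StormSubmitter; a minimal sketch follows (the worker count is an illustrative choice, not a requirement):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class WordCountSubmit {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... wire the same spout and bolts as in Listing 1 ...

        Config conf = new Config();
        conf.setNumWorkers(4); // illustrative: worker processes to request

        // Hand the topology to the cluster rather than running it in-process;
        // it then runs until explicitly killed.
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}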
Other open source big data solutions
Since Google introduced the MapReduce paradigm in 2004, several solutions have emerged that use the original paradigm or exhibit its qualities. Google's original application of MapReduce was building an index of the World Wide Web. Although that application remains popular, the range of problems this simple model is asked to solve keeps growing.
Table 1 lists the available open source big data solutions, covering both traditional batch and streaming applications. Nearly a year before Storm went open source, Yahoo!'s S4 distributed stream computing platform was open sourced to Apache. S4, released in October 2010, provides a high-performance computing (HPC) platform that hides the complexity of parallel processing from the application developer. S4 implements a scalable, decentralized cluster architecture and incorporates partial fault tolerance.
Table 1. Open Source Big Data solutions
Solution | Developer | Type | Description
Storm | Twitter | Stream processing | Twitter's new streaming big data analytics solution
S4 | Yahoo! | Stream processing | Distributed stream computing platform from Yahoo!
Hadoop | Apache | Batch processing | The first open source implementation of the MapReduce paradigm
Spark | UC Berkeley AMPLab | Batch processing | Recent analytics platform that supports in-memory datasets and resiliency
Disco | Nokia | Batch processing | Nokia's distributed MapReduce framework
HPCC | LexisNexis | Batch processing | HPC cluster for big data
More information
Although Hadoop remains the most publicized big data analytics solution, many others exist, each with its own characteristics. I explored Spark in a past article; it incorporates in-memory processing of datasets with the ability to reconstruct lost data. But Hadoop and Spark both focus on batch processing of large datasets. Storm provides a new model for big data analytics, and because it was only recently open sourced, it has attracted widespread attention.
Unlike Hadoop, Storm is a computation system with no concept of storage built in. This lets Storm be used in a wide variety of contexts, whether the data arrives dynamically from a non-traditional source or is stored in a system such as a database (or is consumed by a controller for real-time manipulation of some other device, such as a trading system).
See Resources for links to more information about Storm, getting a cluster up and running, and other big data analytics solutions, both batch and streaming.