How to use Twitter Storm to process big data in real time


Hadoop, the undisputed king of the big data analysis field, concentrates on batch processing. That model is sufficient for many scenarios, such as indexing web pages, but other usage models require real-time information from highly dynamic sources. Solving this problem calls for Nathan Marz's Storm, developed at BackType (since acquired by Twitter). Storm does not process static data; it handles stream data that is expected to be continuous. Given that Twitter users generate 140 million tweets a day, it is easy to see the enormous use for this technology.

But Storm is more than a traditional big data analysis system: it is an example of a complex event processing (CEP) system. CEP systems are typically categorized as computation-oriented or detection-oriented, and either kind can be implemented in Storm through user-defined algorithms. CEP can, for example, be used to identify meaningful events in a torrent of events and then act on them in real time.

Nathan Marz offers a number of examples of how Storm is used at Twitter. One of the most interesting is generating trend information. Twitter extracts emerging trends from the torrent of tweets and maintains them at both the local and national level. This means that as a story begins to emerge, Twitter's trending-topics algorithm identifies the topic in real time. This real-time algorithm is implemented in Storm as a continuous analysis of Twitter data.
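One plausible (and much simplified) shape for such a trending computation is a sliding-window count over recent tweets. The `trending` function, window size, and hashtag heuristic below are illustrative assumptions, not Twitter's actual algorithm:

```python
from collections import Counter, deque

def trending(tweet_stream, window=1000, top_n=3):
    """Maintain hashtag counts over a sliding window of recent tweets
    and report the current top-N topics after each tweet arrives."""
    counts = Counter()
    recent = deque()
    for tweet in tweet_stream:
        tags = [w for w in tweet.split() if w.startswith("#")]
        recent.append(tags)
        counts.update(tags)
        if len(recent) > window:          # expire the oldest tweet's tags
            counts.subtract(recent.popleft())
        yield [tag for tag, _ in counts.most_common(top_n)]

tweets = ["#storm is fast", "#storm rocks", "try #hadoop", "#storm again"]
print(list(trending(tweets))[-1])  # → ['#storm', '#hadoop']
```

Because the computation is incremental, it can run continuously over an unbounded tweet stream, which is exactly the style of processing a Storm topology supports.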

What is big data?

Big data refers to massive amounts of data that cannot be managed in traditional ways. Internet-scale data has driven the creation of new architectures and applications that can handle it. These architectures are highly scalable and can process data efficiently, in parallel, across large numbers of servers.

Storm versus traditional big data solutions

What differentiates Storm from other big data solutions is the way data is processed. Hadoop is essentially a batch processing system: data is introduced into the Hadoop Distributed File System (HDFS) and distributed across nodes for processing, and when processing is complete, the resulting data is returned to HDFS for use by the originator. Storm instead supports the creation of topologies that transform unterminated streams of data. Unlike a Hadoop job, these transformations never stop; they continue to process data as it arrives.
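The contrast can be sketched in a few lines: a stream transformation is written against an unbounded input and never terminates on its own, so consumers take only as much output as they need (the `source` and `transform` names here are illustrative, not Storm API):

```python
import itertools

def source():
    """Stands in for an unbounded external feed (the role a Storm spout
    plays); here it is just an infinite counter."""
    yield from itertools.count()

def transform(stream):
    """A continuous transformation: unlike a Hadoop job it has no natural
    end, emitting a result for every tuple that arrives."""
    for n in stream:
        yield n * n

# A batch job consumes a finite input and terminates; a stream topology
# keeps running, so a consumer pulls only as many results as it needs.
first_five = list(itertools.islice(transform(source()), 5))
print(first_five)  # → [0, 1, 4, 9, 16]
```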

Big data implementations

The core of Hadoop is written in the Java language, but it supports data analysis applications written in a variety of languages. More recent implementations have taken more esoteric routes to take full advantage of modern languages and their features. For example, Spark, from the University of California, Berkeley, is implemented in the Scala language, while Twitter Storm is implemented in Clojure (pronounced like closure).

Clojure is a modern dialect of the Lisp language. Like Lisp, Clojure supports a functional programming style, but it also incorporates features that simplify multithreaded programming (a capability that proved useful in building Storm). Clojure is a virtual machine (VM) based language that runs on the Java Virtual Machine. Although Storm is developed in Clojure, you can write Storm applications in almost any language; all that is required is an adapter that connects to Storm's architecture. Adapters exist for Scala, JRuby, Perl, and PHP, and there is even a Structured Query Language adapter that supports streaming into a Storm topology.

Key attributes of Storm

Some of the characteristics of Storm's implementation determine its performance and reliability. Storm uses ZeroMQ for message passing, which eliminates intermediate queueing and allows messages to flow directly between the tasks themselves. Under the covers of messaging is an automated and efficient mechanism for serializing and deserializing Storm's primitive types.
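As a rough illustration of what serializing primitive tuple fields involves (this is not Storm's actual wire format), a tuple of primitive values can be packed into bytes for transport and recovered on the other side:

```python
import struct

# Illustrative sketch only: pack an integer field and a float field into
# network-byte-order bytes, then deserialize them back into values.
FMT = "!if"  # 32-bit int followed by 32-bit float

def serialize(count, score):
    return struct.pack(FMT, count, score)

def deserialize(data):
    return struct.unpack(FMT, data)

blob = serialize(42, 1.5)
print(deserialize(blob))  # → (42, 1.5)
```

Automating this step for every field type is what lets tuples move between tasks without the application writing any marshalling code.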

One of the most interesting aspects of Storm is its focus on fault tolerance and management. Storm implements guaranteed message processing, so every tuple is fully processed through the topology; if a tuple is found not to have been processed, it is automatically replayed from the spout. Storm also implements task-level fault detection: when a task fails, its messages are automatically reassigned so processing can quickly resume. Storm includes more intelligent process management than Hadoop, with processes managed by supervisors to ensure that resources are adequately used.
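The guaranteed-processing idea can be sketched as a spout that keeps every emitted tuple pending until the topology acknowledges it, replaying any tuple that fails. The `ReplayingSpout` class below is an illustrative sketch, not Storm's implementation:

```python
class ReplayingSpout:
    """Sketch of guaranteed processing: each emitted tuple stays pending
    until acked; a failed tuple is queued again for replay."""
    def __init__(self, tuples):
        self.queue = list(tuples)
        self.pending = {}            # tuple id -> tuple awaiting an ack
        self.next_id = 0

    def emit(self):
        tup = self.queue.pop(0)
        self.pending[self.next_id] = tup
        self.next_id += 1
        return self.next_id - 1, tup

    def ack(self, tup_id):
        del self.pending[tup_id]     # fully processed, forget it

    def fail(self, tup_id):
        self.queue.append(self.pending.pop(tup_id))  # replay later

spout = ReplayingSpout(["a", "b"])
i, t = spout.emit()
spout.fail(i)             # downstream processing failed somewhere
print(spout.queue)        # → ['b', 'a']  (tuple "a" queued for replay)
```

Storm additionally detects tuples that are never acked within a timeout, which this sketch omits.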

Storm model

Storm implements a data flow model in which data flows continuously through a network of transformation entities (see Figure 1). The abstraction for a data flow is called a stream, which is an unbounded sequence of tuples. A tuple is like a structure that can represent standard data types (such as integers, floats, and byte arrays) or user-defined types, given some additional serialization code. Each stream is defined by a unique ID that can be used to build topologies of data sources and sinks. Streams originate from spouts, which flow data from external sources into the Storm topology.
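A minimal sketch of this abstraction models the spout as a generator that tags each tuple with its stream ID; the names and the fixed input are illustrative, not Storm's API:

```python
import itertools

def sentence_spout():
    """Plays the role of a spout: pulls data from an 'external' source
    (here a fixed list, cycled forever) into the topology, tagging each
    tuple with the ID of the stream it belongs to."""
    for text in itertools.cycle(["the quick fox", "the lazy dog"]):
        yield ("sentences", (text,))   # (stream id, tuple of fields)

stream = sentence_spout()
print(next(stream))  # → ('sentences', ('the quick fox',))
```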

Figure 1. Conceptual architecture of a typical Storm topology

The sinks (that is, the entities that provide transformations) are called bolts. Bolts implement a single transformation on a stream, and all processing within a Storm topology. Bolts can implement traditional functionality like MapReduce or more complex actions (single-step functions) such as filtering, aggregation, or communication with external entities such as a database. A typical Storm topology implements multiple transformations and therefore requires multiple bolts with independent tuple streams. Both spouts and bolts are implemented as one or more tasks within a Linux system.
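A word-count topology is the classic illustration: a bolt that splits sentences into words feeding a bolt that aggregates running counts. This sketch chains plain Python generators rather than using Storm's actual API:

```python
def split_bolt(stream):
    """A bolt implementing a single transformation: split each sentence
    tuple into word tuples (the 'map' half of a MapReduce-style flow)."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream, counts=None):
    """An aggregating bolt: emit a running count for each word seen
    (the 'reduce' half), updating state as tuples arrive."""
    counts = {} if counts is None else counts
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Wire the bolts together; in Storm the input would be unbounded.
sentences = [("storm streams data",), ("storm never stops",)]
results = list(count_bolt(split_bolt(iter(sentences))))
print(results[-1])  # → ('stops', 1)
```

Each bolt does one transformation and knows nothing about its neighbors, which is what lets a real topology distribute them as independent tasks.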
