Storm is an open-source distributed real-time computation system that processes large data streams simply and reliably. Storm is easy to deploy and operate, and, more importantly, you can develop applications for it in any programming language. This tutorial is a basic introduction to Storm, intended to help engineers who want to use this real-time stream computing framework.
I. Real-time stream computing
Since the birth of the Internet, its biggest change to the world has been to let information be exchanged in real time, which has greatly improved efficiency at every step. Driven by the need for real-time response and real-time interaction, the database (more precisely, the relational database) has become, apart from the personal operating system, the fastest-growing and most profitable product in the software industry. Ten years ago, many banks could not offer real-time queries, let alone real-time transfers; databases and high-speed networks changed that.
As the Internet developed further, from portal-style information browsing to search-style information retrieval to SNS-style interactive communication, and as e-commerce and online travel and lifestyle products moved everyday transactions online, the demand for efficiency pushed real-time requirements even higher. At the same time, information exchange evolved from point-to-point into information chains and even information networks, which makes it necessary to cross-link data along every dimension, so a data explosion is unavoidable. Stream computing and NoSQL products emerged in response, addressing the problems of a real-time framework and of data storage and computation.
Seven or eight years ago, universities such as UC Berkeley and Stanford began research on stream data processing. Because that work focused on financial-industry or Internet traffic monitoring scenarios, and because Internet data scenarios were limited at the time, most of it handled streams on top of traditional databases, and there was little research on the streaming framework itself. That line of research has since gone quiet, and the industry has turned more of its attention to real-time databases.
Yahoo! open-sourced S4 in 2010 and Twitter open-sourced Storm in 2011, which changed the situation. Previously, developers building a real-time application had to worry not only about the application's own processing logic but also about transferring, exchanging, and distributing data in real time, which was a major headache. With Storm, developers can quickly build a robust, easy-to-use real-time stream processing framework and, together with SQL products, NoSQL products, or a MapReduce computing platform, build at low cost many real-time products that used to be hard to imagine. For example, several products of the Quantum Heng Dao brand, from the data division, are built on a real-time stream processing platform.
II. Features of Storm
Storm is an open-source distributed real-time computation system that processes large data streams simply and reliably. It has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm scales horizontally, is highly fault-tolerant, and guarantees that every message will be processed, and processed quickly (in a small cluster, each node can process millions of messages per second). Storm is easy to deploy and operate, and, more importantly, you can develop applications for it in any programming language.
Storm has the following features:
- Simple programming model
Anyone who works with big data is probably familiar with Hadoop. Based on Google's MapReduce, Hadoop gives developers map and reduce primitives that make parallel batch processing simple and elegant. Similarly, Storm provides a few simple and elegant primitives for real-time computation over big data, which greatly reduces the complexity of writing parallel real-time processing and helps you develop applications quickly and efficiently.
- Scalable
Three main kinds of entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. A task is the entity that actually processes data: the spouts and bolts we develop are executed as one or more tasks.
As a result, computation runs in parallel across multiple threads, processes, and servers, which supports flexible horizontal scaling.
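The worker / thread / task hierarchy can be pictured with a small stand-alone sketch. The function name `assign_tasks` and the round-robin placement are assumptions made for illustration only; this does not reproduce Storm's actual scheduler.

```python
from collections import defaultdict


def assign_tasks(num_tasks, num_workers, threads_per_worker):
    """Spread task ids round-robin over every (worker, thread) slot.

    Hypothetical helper: it only illustrates the hierarchy described in
    the text, not how Storm really places tasks.
    """
    slots = [(w, t) for w in range(num_workers) for t in range(threads_per_worker)]
    layout = defaultdict(list)
    for task_id in range(num_tasks):
        layout[slots[task_id % len(slots)]].append(task_id)
    return dict(layout)


layout = assign_tasks(num_tasks=8, num_workers=2, threads_per_worker=2)
for (worker, thread), tasks in sorted(layout.items()):
    print(f"worker {worker} / thread {thread}: tasks {tasks}")
```

Raising `num_workers` or `threads_per_worker` spreads the same tasks over more slots, which is the idea behind scaling horizontally.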
- High reliability
Storm guarantees that every message emitted by a spout is "fully processed", which directly distinguishes it from other real-time systems such as S4.
Note that a message emitted by a spout may trigger thousands of downstream messages. These can be pictured as a message tree whose root is the spout message. Storm tracks the processing of this tree and considers the spout message "fully processed" only when every message in the tree has been processed. If any message in the tree fails, or if the whole tree is not fully processed within the time limit, the message emitted by the spout is re-sent.
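The re-send behaviour can be sketched from the spout's point of view. The class `ReplayingSpout` and its method names are hypothetical and written in plain Python, not against Storm's API; the `expire` method stands in for the time limit, which in a real cluster is enforced by Storm rather than by the spout.

```python
import time


class ReplayingSpout:
    """Hypothetical spout that remembers each emitted message until it is acked."""

    def __init__(self, messages, timeout_secs=30.0):
        self.queue = list(messages)   # messages waiting to be emitted
        self.pending = {}             # msg_id -> (message, emit time)
        self.timeout_secs = timeout_secs
        self.next_id = 0

    def next_tuple(self, now=None):
        """Emit the next queued message; return (msg_id, message) or None."""
        if not self.queue:
            return None
        now = time.monotonic() if now is None else now
        message = self.queue.pop(0)
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = (message, now)
        return msg_id, message

    def ack(self, msg_id):
        """The whole message tree succeeded: forget the root message."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Something in the tree failed: queue the root message again."""
        entry = self.pending.pop(msg_id, None)
        if entry is not None:
            self.queue.append(entry[0])

    def expire(self, now=None):
        """Treat trees that exceeded the time limit as failed."""
        now = time.monotonic() if now is None else now
        for msg_id, (_, emitted_at) in list(self.pending.items()):
            if now - emitted_at > self.timeout_secs:
                self.fail(msg_id)


spout = ReplayingSpout(["a", "b"], timeout_secs=30.0)
first = spout.next_tuple(now=0.0)
second = spout.next_tuple(now=0.0)
spout.ack(first[0])                     # tree rooted at "a" finished
spout.expire(now=31.0)                  # "b" exceeded the time limit
replayed = spout.next_tuple(now=31.0)   # "b" is emitted again under a new id
```

Because a failed or timed-out message is emitted again, downstream code may see it more than once.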
To minimize memory consumption, Storm does not track every message in the message tree individually. Instead, it tracks the tree as a whole with a special strategy: it XORs the unique IDs of all messages in the tree and decides whether the spout message has been "fully processed" by checking whether the result is zero. This saves a great deal of memory and simplifies the decision logic; the mechanism is described in detail later.
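A minimal sketch of the XOR idea, assuming every message has a random 64-bit id and that each id is reported once when the message is created and once when it is acknowledged. The names `AckTracker` and `new_id` are hypothetical; the point is only that a single integer per tree is enough.

```python
import random


class AckTracker:
    """Tracks one message tree with a single integer instead of a set of ids."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        # XOR the id in when the message is created.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # XOR the same id in again when it is acknowledged; x ^ x == 0.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        # Zero means every id that was XORed in has also been XORed out.
        return self.ack_val == 0


def new_id(rng):
    """Random non-zero 64-bit id (assumed id scheme for this sketch)."""
    return rng.randrange(1, 2**64)


rng = random.Random(7)
tracker = AckTracker()

root = new_id(rng)
tracker.emitted(root)

children = [new_id(rng) for _ in range(3)]
for child in children:
    tracker.emitted(child)

tracker.acked(root)
for child in children:
    tracker.acked(child)

done = tracker.fully_processed()
```

Since XOR is commutative and associative, the order of emits and acks does not matter. The value could in principle reach zero while messages are still outstanding, but with random 64-bit ids that is extremely unlikely.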
In this mode, every message sent is accompanied by an ack/fail message, which consumes some network bandwidth. If your reliability requirements are not high, you can turn the mode off by using a different emit interface.
As described above, Storm guarantees that each message is processed at least once, but some computations strictly require that each message be processed exactly once. Fortunately, Storm 0.7.0 introduced transactional topologies, which solve this problem and are described in detail later.
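One common way to get an exactly-once effect on top of at-least-once delivery is to store a transaction id next to the state and skip replays. The sketch below assumes that each batch carries an id (hypothetical name `txid`), that batches are applied in strict order, and that a replay carries the same `txid` and the same contents. It illustrates the idea only and is not Storm's transactional topology API.

```python
class TransactionalCounter:
    """Counter that ignores a batch it has already applied."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply(self, txid, batch_size):
        """Apply a batch once; return False if it was a replay.

        In a real store, count and last_txid would have to be written
        together atomically (assumption of this sketch).
        """
        if txid == self.last_txid:
            return False
        self.count += batch_size
        self.last_txid = txid
        return True


counter = TransactionalCounter()
results = [
    counter.apply(txid=1, batch_size=10),
    counter.apply(txid=2, batch_size=5),
    counter.apply(txid=2, batch_size=5),   # replay after a failure
]
```

The replayed batch leaves the count unchanged, so the total reflects each message once even though batch 2 was delivered twice.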
- High level of fault tolerance
If an exception occurs during message processing, Storm reschedules the processing unit that had the problem. Storm ensures that a processing unit runs forever (unless you explicitly kill it).
Of course, if a processing unit stores intermediate state, the application itself must restore that state when Storm restarts the processing unit.
- Supports multiple programming languages
In addition to implementing spouts and bolts in Java, you can implement them in any programming language you are familiar with, thanks to Storm's multi-language (multilang) protocol. This is a special protocol inside Storm that lets spouts and bolts exchange messages over standard input and standard output, where each message is either a single line of text or multi-line JSON.
Storm supports multi-language programming mainly through the ShellBolt, ShellSpout, and ShellProcess classes. They implement the IBolt and ISpout interfaces, as well as the protocol for running a script or program through the shell using Java's ProcessBuilder class.
As you can see, with this approach every tuple must be JSON-encoded and decoded during processing, which has a significant impact on throughput.
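The message exchange over standard streams can be sketched as follows. The sketch assumes the framing used by Storm's multilang protocol, in which each JSON message is followed by a line containing only `end`; the helper names `write_message` and `read_message` and the example fields are illustrative, and an in-memory buffer stands in for the real stdin/stdout pipes.

```python
import io
import json


def write_message(stream, message):
    """Serialize one message as JSON followed by an 'end' line (assumed framing)."""
    stream.write(json.dumps(message))
    stream.write("\nend\n")
    stream.flush()


def read_message(stream):
    """Read lines until 'end' and decode them as one JSON document."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


pipe = io.StringIO()  # stands in for the process's standard streams
write_message(pipe, {"id": "t1", "tuple": ["hello world"]})
write_message(pipe, {"command": "ack", "id": "t1"})

pipe.seek(0)
first = read_message(pipe)
second = read_message(pipe)
third = read_message(pipe)
```

Every tuple passes through `json.dumps` on one side and `json.loads` on the other, which is the per-tuple cost that reduces throughput.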
- Supports local mode
Storm has a "local mode" that simulates all the functionality of a Storm cluster within a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing.
- Efficient
Storm uses ZeroMQ as its underlying message queue, which ensures that messages are processed quickly.
Storm Learning (1): About Storm