Storm Introductory Tutorial, Chapter 1: Preface [Repost]

Source: Internet
Author: User

1.1 Real-time stream computing

Since the internet was born, its biggest change to the world has been letting information be exchanged in real time, which has greatly improved the efficiency of every link in the chain. Because of this demand for real-time response and real-time interaction, the database (more precisely, the relational database) has probably been the fastest-growing and most profitable product in the software industry, apart from personal operating systems. Ten years ago many banks could not offer real-time queries, let alone real-time transfers; databases and high-speed networks changed that.

As the internet developed further, from portal-style information browsing to search-style information retrieval to SNS-style interactive sharing, and then on to e-commerce and online travel and lifestyle products, the circulation links of everyday life moved online. The demand for efficiency pushes real-time requirements higher still, while the exchange of information is evolving from point-to-point into information chains and even information networks. Data therefore has to be cross-linked along every dimension, and a data explosion is unavoidable. Stream computing and NoSQL products emerged in response, addressing the problems of a real-time framework and of data storage and computation.

As early as seven or eight years ago, universities such as UC Berkeley and Stanford began research on streaming data processing. Because that work focused mainly on financial-industry or internet traffic monitoring scenarios, and because internet data scenarios were limited at the time, most of the research built stream processing on top of traditional database processing, and little of it studied the streaming framework itself. That line of research has since gradually gone quiet, and the industry has put more of its energy into real-time databases.

Yahoo! open-sourced S4 in 2010 and Twitter open-sourced Storm in 2011, and that changed the situation. Previously, internet developers building a real-time application had to worry not only about the application's own processing logic but also about the real-time transfer, exchange, and distribution of data, which was a major headache. Things are very different now. With Storm, developers can quickly build a robust, easy-to-use real-time stream processing framework and, combined with SQL or NoSQL products or a MapReduce computing platform, produce at low cost many real-time products that were previously hard to imagine. For example, several products of the data division's Quantum Heng Dao brand are built on a real-time streaming platform.

This tutorial is a basic introduction to Storm, but we want it to be more than a Storm manual. Along the way we will add the architecture experience and application lessons we have gained from real-world data production, with the ultimate goal of helping every engineer who wants to use a real-time streaming framework and, quietly, of changing the world a little as well.

1.2 Storm features

Storm is an open-source, distributed real-time computing system that processes large data streams simply and reliably. It has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm scales horizontally, is highly fault tolerant, and guarantees that every message will be processed, and processed quickly (in a small cluster, each node can process millions of messages per second). Storm is easy to deploy and operate and, more importantly, you can develop applications for it in any programming language.

Storm has the following features:

    • Simple programming model

Anyone who works in big data processing is presumably familiar with Hadoop. Based on Google's MapReduce, Hadoop gives developers map and reduce primitives that make parallel batch processing simple and elegant. Similarly, Storm provides a few simple and elegant primitives for real-time computation on big data, which greatly reduces the complexity of developing parallel real-time processing tasks and helps you build applications quickly and efficiently.

    • Scalable

Three main entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. The task is the entity that actually processes data: the spouts and bolts we develop are executed as one or more tasks. Computation therefore runs in parallel across multiple threads, processes, and servers, which supports flexible horizontal scaling.
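
The worker/thread/task hierarchy described above can be pictured with a small model. This is only an illustrative sketch: the class name ParallelismSketch, the method assignTasks, and the round-robin policy are assumptions made for this example, not Storm's actual scheduler or API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: the names and the round-robin policy are
// assumptions for this example, not Storm's real scheduling code.
class ParallelismSketch {

    // Spreads task ids 0..numTasks-1 over numThreads threads, round-robin.
    // Element t of the result is the list of task ids run by thread t.
    static List<List<Integer>> assignTasks(int numTasks, int numThreads) {
        List<List<Integer>> threads = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            threads.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            threads.get(task % numThreads).add(task);
        }
        return threads;
    }

    public static void main(String[] args) {
        int threadsPerWorker = 2;                 // hypothetical cluster shape
        int workers = 2;
        List<List<Integer>> plan = assignTasks(6, workers * threadsPerWorker);
        for (int t = 0; t < plan.size(); t++) {
            System.out.println("worker " + (t / threadsPerWorker)
                    + ", thread " + (t % threadsPerWorker)
                    + " runs tasks " + plan.get(t));
        }
    }
}
```

Adding worker processes or threads in this model simply increases the number of slots the same tasks are spread over, which is the sense in which the paragraph above speaks of horizontal scaling.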

    • High reliability

Storm guarantees that every message emitted by a spout is "fully processed", which directly distinguishes it from other real-time systems such as S4.

Note that a message emitted by a spout may trigger thousands of downstream messages. These can be pictured as a message tree with the spout message at the root. Storm tracks the processing of this tree, and only when every message in the tree has been processed does it consider the spout message "fully processed". If processing of any message in the tree fails, or if the whole tree is not fully processed within the time limit, the spout message is re-sent.

To minimize memory consumption, Storm does not track every message in the tree individually. Instead it uses a special strategy that tracks the tree as a whole: it XORs together the unique IDs of all messages in the tree and uses the result to decide whether the spout message has been "fully processed". This saves a great deal of memory and simplifies the decision logic; it is described in detail later.
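
The XOR idea can be illustrated with a minimal sketch. It assumes each message has a 64-bit ID and that a message's children are registered before the message itself is acked. The class AckTrackerSketch and its methods are hypothetical names for this example and are not Storm's acker implementation. Each ID is XORed into a per-tree value once when the message is created and once when it is acked, so the value returns to zero exactly when every created message has been acked (barring an accidental collision of IDs).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of XOR-based tracking; not Storm's actual acker code.
class AckTrackerSketch {

    // One 64-bit value per spout message, however large its message tree is.
    private final Map<Long, Long> ackValues = new HashMap<>();

    // Record that a message with id messageId was created in the tree rooted at rootId.
    void emitted(long rootId, long messageId) {
        xor(rootId, messageId);
    }

    // Record that messageId was processed. Returns true when the whole tree is done.
    boolean acked(long rootId, long messageId) {
        return xor(rootId, messageId);
    }

    // Number of spout messages whose trees are still incomplete.
    int pending() {
        return ackValues.size();
    }

    private boolean xor(long rootId, long messageId) {
        long value = ackValues.getOrDefault(rootId, 0L) ^ messageId;
        if (value == 0L) {
            ackValues.remove(rootId);   // every id has been XORed in exactly twice
            return true;
        }
        ackValues.put(rootId, value);
        return false;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        long root = 1L;                                   // illustrative ids
        long a = 0x5a17L, b = 0x77c3L, c = 0x1234L;
        tracker.emitted(root, a);                         // spout message
        tracker.emitted(root, b);                         // child of a
        tracker.emitted(root, c);                         // child of a
        System.out.println(tracker.acked(root, a));
        System.out.println(tracker.acked(root, b));
        System.out.println(tracker.acked(root, c));
    }
}
```

The memory cost in this sketch is one map entry per spout message, independent of how many messages the tree contains.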

In this mode, every message sent is matched by an ack or fail message, which consumes some network bandwidth. If your reliability requirements are not high, you can turn this mode off by using a different emit interface.

As mentioned above, Storm guarantees that each message is processed at least once, but some computations strictly require that each message be processed exactly once. Fortunately, Storm 0.7.0 introduced transactional topologies, which solve this problem; they are described in detail later.
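
To see why at-least-once delivery is not enough for something like a counter, and how a transaction ID can help, consider the following sketch. It is not the transactional topology API. It only illustrates one common idea, under the assumption that batches are committed in strict order and each carries a transaction ID: the ID of the last applied batch is stored together with the state, and a replayed batch with the same ID is skipped. The names TransactionalCounterSketch and applyBatch are hypothetical.

```java
// Hypothetical sketch of exactly-once state updates on top of
// at-least-once delivery; it does not use Storm's transactional API.
class TransactionalCounterSketch {

    private long count = 0L;
    private long lastTxId = -1L;   // id of the last batch applied to count

    // Applies a batch unless it is a replay of the batch applied last.
    // Assumes batches are committed in strict transaction-id order.
    boolean applyBatch(long txId, long delta) {
        if (txId == lastTxId) {
            return false;          // replayed batch: already counted, skip it
        }
        count += delta;
        lastTxId = txId;           // stored together with the state it describes
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TransactionalCounterSketch counter = new TransactionalCounterSketch();
        counter.applyBatch(1L, 10L);
        counter.applyBatch(1L, 10L);   // replay after a failure: ignored
        counter.applyBatch(2L, 5L);
        System.out.println(counter.count());
    }
}
```

In a real system the count and the transaction ID would have to be written to the store atomically; the sketch keeps both in memory only to show the comparison.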

    • High fault tolerance

If an exception occurs during message processing, Storm reschedules the processing unit that had the problem. Storm ensures that a processing unit runs forever (unless you explicitly kill it).

Of course, if a processing unit stores intermediate state, it must restore that state itself when Storm restarts it.

    • Supports multiple programming languages

In addition to implementing spouts and bolts in Java, you can implement them in any programming language you are familiar with, thanks to Storm's multi-language (multilang) protocol. This is a special protocol inside Storm that lets a spout or bolt exchange messages over standard input and standard output, with each message encoded as JSON on a single line or across multiple lines.

Storm supports multi-language programming mainly through the ShellBolt, ShellSpout, and ShellProcess classes. These classes implement the IBolt and ISpout interfaces and use Java's ProcessBuilder class to run the script or program that speaks the protocol.

As you can see, with this approach every tuple has to be JSON-encoded and decoded during processing, which has a significant impact on throughput.
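
The following sketch shows how JSON messages of one or several lines can be exchanged over a text stream. It assumes each message is terminated by a line containing only the word end, which is how Storm's multilang protocol delimits messages as far as we know. The class MultilangFramingSketch, its methods, and the sample payloads are illustrative assumptions, and the JSON is treated as an opaque string rather than parsed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing helper; it is not part of Storm's API.
class MultilangFramingSketch {

    static final String END = "end";   // assumed message terminator line

    // Wraps one JSON message (single- or multi-line) for the stream.
    static String frame(String json) {
        return json + "\n" + END + "\n";
    }

    // Reads lines up to the next terminator and returns them joined by '\n'.
    // Returns null when the stream ends before a terminator is seen.
    static String readMessage(BufferedReader in) {
        try {
            StringBuilder message = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(END)) {
                    return message.toString();
                }
                if (message.length() > 0) {
                    message.append('\n');
                }
                message.append(line);
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Splits a whole stream into its messages.
    static List<String> readAll(String stream) {
        List<String> messages = new ArrayList<>();
        BufferedReader in = new BufferedReader(new StringReader(stream));
        String message;
        while ((message = readMessage(in)) != null) {
            messages.add(message);
        }
        return messages;
    }

    public static void main(String[] args) {
        String stream = frame("{\"command\":\"ack\",\"id\":\"1\"}")
                + frame("{\n  \"command\": \"emit\",\n  \"tuple\": [\"hello\"]\n}");
        for (String message : readAll(stream)) {
            System.out.println("message: " + message);
        }
    }
}
```

In a real deployment the reader would wrap the child process's standard output, and each message would then be parsed by a JSON library. That per-tuple encoding and decoding is the throughput cost mentioned above.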

    • Supports local mode

Storm has a "local mode" that simulates all the functionality of a Storm cluster within a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing.

    • Efficient

Storm uses ZeroMQ as its underlying message queue, which ensures that messages are processed quickly.
