HDInsight Storm Overview


What is Storm?

Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real time. Storm solutions can also provide guaranteed processing of data, with the ability to replay data that was not successfully processed the first time.

What is Azure HDInsight Storm?

HDInsight Storm is provided as a managed cluster integrated into the Azure environment, where it can be used as part of a larger Azure solution. For example, Storm might consume data from services such as Service Bus queues or Event Hubs, and use websites or cloud services to provide data visualization. HDInsight Storm clusters can also be configured on an Azure virtual network, which reduces latency when communicating with other resources on the same virtual network and enables secure communication with resources in a private data center.

To get started with Storm, see Getting Started with Storm in HDInsight.

How is data processed in HDInsight Storm?

A Storm cluster processes topologies, rather than the MapReduce jobs you may be familiar with from HDInsight or Hadoop. A Storm cluster consists of two types of nodes: head nodes that run Nimbus and worker nodes that run the Supervisor.

? Nimbus - similar to the JobTracker in Hadoop; responsible for distributing code throughout the cluster, assigning tasks to machines, and monitoring for failures. HDInsight provides two Nimbus nodes, so there is no single point of failure for the Storm cluster.

? Supervisor - runs on each worker node and is responsible for starting and stopping worker processes on that node.

? Worker process - runs a subset of a topology. A running topology is distributed across many worker processes throughout the cluster.

? Topology - defines a graph of computation that processes streams of data. Unlike MapReduce jobs, topologies run until you stop them.

? Stream - an unbounded collection of tuples. Streams are produced by spouts and bolts, and consumed by bolts.

? Tuple - a named list of dynamically typed values.

? Spout - consumes data from a data source and emits one or more streams.


Attention:

In many cases, data is read from a queue such as Kafka, an Azure Service Bus queue, or an Event Hub. The queue ensures that data is persisted in the event of an outage.

? Bolt - consumes one or more streams, performs processing on the tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight HBase, a blob, or another data store.

? Thrift - Apache Thrift is a software framework for scalable cross-language service development. It allows you to build services that work between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and other languages.

? Nimbus is a Thrift service, and a topology is a Thrift definition, so it is possible to develop topologies using a variety of programming languages.


For more information about storm components, see the Storm tutorial at apache.org.
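To make the roles above concrete, the following minimal Java sketch wires one spout and one bolt into a topology and runs it on an in-process test cluster. It assumes the standard storm-core API is on the classpath; SensorSpout and AlertBolt are hypothetical placeholder classes for your own spout and bolt implementations, and the package prefix is org.apache.storm in recent releases (older clusters use backtype.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class SketchTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // SensorSpout and AlertBolt are hypothetical classes standing in
        // for your own IRichSpout/IRichBolt implementations.
        builder.setSpout("sensors", new SensorSpout(), 2);   // 2 spout executors
        builder.setBolt("alerts", new AlertBolt(), 4)        // 4 bolt executors
               .shuffleGrouping("sensors");                  // randomly distribute tuples

        // Run in-process for testing; on a real cluster you would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("sketch", new Config(), builder.createTopology());
        Thread.sleep(10_000);                                // let the topology run briefly
        cluster.shutdown();
    }
}
```

On an HDInsight cluster the topology would instead be packaged as a jar and submitted to Nimbus, which distributes the spout and bolt tasks across the worker nodes.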

What are some use cases for Storm?

Here are some common scenarios where you might use Apache Storm. For information about real-world scenarios, read how companies are using Storm.

Real-time Analytics

Because Storm handles streams of data in real time, it is ideal for data analysis that involves finding and reacting to specific events or patterns in the data as they arrive. For example, a Storm topology might monitor sensor data to determine system health, and generate an SMS alert when a specific pattern occurs.

Extract, Transform, and Load (ETL)

ETL can almost be thought of as a side effect of Storm processing. For example, if you are using real-time analytics for fraud detection, you have already ingested and transformed the data. You may also want a bolt that stores the data in HBase, Hive, or another data store for use in future analysis.

Distributed RPC

Distributed RPC is a pattern that can be built using Storm. A request is passed to Storm, the computation is distributed across multiple nodes, and finally a stream of results is returned to the waiting client.

For more information on distributed RPC and the DRPCClient that Storm provides, see Distributed RPC.

Online machine learning

Storm can be used with a machine learning solution that has previously been trained in batch, such as a Mahout-based solution. But its general-purpose, distributed computing model also opens the door to stream-based machine learning solutions. For example, the Scalable Advanced Massive Online Analysis (SAMOA) project is a machine learning library that uses streams and can work with Storm.

What programming languages can I use?

The HDInsight Storm cluster provides support for .NET, Java, and Python out of the box. While Storm supports other languages, many of them require you to install the additional programming language on the HDInsight cluster, in addition to making other configuration changes.

.NET

SCP is a project that enables .NET developers to design and implement topologies (including spouts and bolts). SCP support is provided by default with HDInsight Storm clusters.

For more information about developing with SCP, see Develop streaming data processing applications for Storm in HDInsight with SCP.NET and C#.

Java

Most of the Java examples you encounter are either plain Java or Trident. Trident is a high-level abstraction that makes it easier to do things like joins, aggregations, grouping, and filtering. However, Trident acts on batches of tuples, whereas a plain Java solution processes the stream one tuple at a time.

For more information about Trident, see the Trident tutorial at apache.org.

For examples of both plain Java and Trident topologies, see the %STORM_HOME%\contrib\storm-starter directory on your HDInsight Storm cluster.

What are some common patterns of development?

Guaranteed message processing

Storm can provide different levels of guaranteed message processing. For example, a basic Storm application guarantees at-least-once processing, while Trident guarantees exactly-once processing.

For more information, see Guaranteeing message processing at apache.org.

IBasicBolt

The pattern of reading an input tuple, emitting zero or more tuples, and then acking the input tuple immediately at the end of the execute method is so common that Storm provides the IBasicBolt interface to automate it.
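As a sketch of that pattern, the hypothetical bolt below extends BaseBasicBolt (Storm's convenience base class for IBasicBolt), so the input tuple is acked automatically when execute returns and failed if it throws. The class and field names are illustrative, and the package names assume a recent storm-core release (older clusters use the backtype.storm prefix).

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt that upper-cases an incoming "word" field.
// Because it extends BaseBasicBolt, the framework acks the input
// tuple when execute() returns, with no explicit ack() call needed.
public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        collector.emit(new Values(word.toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```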

Joins

How two streams of data are joined varies between applications. For example, you could join each tuple from multiple streams into one new stream, or you might join only batches of tuples within a specific window. Either way, joining can be accomplished by using fieldsGrouping, which is a way of defining how tuples are routed to bolts.

In the following Java example, fieldsGrouping is used to route tuples from components "1", "2", and "3" to the MyJoiner bolt:

builder.setBolt("join", new MyJoiner(), parallelism)
       .fieldsGrouping("1", new Fields("joinField1", "joinField2"))
       .fieldsGrouping("2", new Fields("joinField1", "joinField2"))
       .fieldsGrouping("3", new Fields("joinField1", "joinField2"));

Batch Processing

Batch processing can be implemented in several ways. A basic Java Storm topology might use a simple counter to batch X number of tuples before emitting them, or use an internal timing mechanism known as a tick tuple to emit a batch every X seconds.

For an example of using tick tuples, see Analyze sensor data with Storm and HDInsight.

If you are using Trident, it is based on processing tuples in batches.
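The counter-based approach can be illustrated without Storm at all. The sketch below buffers incoming items and releases a batch once it reaches batchSize, which is what a bolt's execute method would do with a counter before emitting; CountBatcher is an illustrative name, not a Storm API.

```java
import java.util.ArrayList;
import java.util.List;

// Storm-free sketch of count-based batching: buffer items and
// release the batch once it reaches batchSize.
public class CountBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    public CountBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Returns the completed batch when full, or null while still filling. */
    public List<String> add(String item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            List<String> batch = new ArrayList<>(buffer);
            buffer.clear();           // start the next batch
            return batch;
        }
        return null;
    }

    public static void main(String[] args) {
        CountBatcher b = new CountBatcher(3);
        System.out.println(b.add("a")); // null - batch not full yet
        System.out.println(b.add("b")); // null
        System.out.println(b.add("c")); // [a, b, c] - batch released
    }
}
```

In a real bolt, the "return the batch" step would instead emit the buffered tuples and ack them, and a tick tuple would trigger a flush of any partial batch on a timer.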

Caching

In-memory caching is often used as a mechanism for speeding up processing because it keeps frequently used assets in memory. Because a topology is distributed across multiple nodes, and multiple processes within each node, you should consider using fieldsGrouping to ensure that tuples containing the fields used for cache lookups are always routed to the same process. This avoids duplication of cache entries across processes.
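The reason fieldsGrouping keeps a cache consistent can be shown with a small Storm-free sketch: routing on the hash of the grouping field means a given key always lands on the same task index, so its cache entry lives in exactly one process. Here, taskFor is a hypothetical stand-in for this routing idea, not Storm's actual implementation.

```java
// Storm-free sketch of hash-based field routing: the same key value
// always maps to the same task index, so a per-task cache never holds
// duplicate entries for that key in other tasks.
public class FieldRouting {
    // Hypothetical routing rule: task = hash(key) mod numTasks.
    public static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same customer id is always routed to the same task index,
        // so its cached lookup lives in exactly one process.
        System.out.println(taskFor("customer-42", tasks));
        System.out.println(taskFor("customer-42", tasks)); // same index as above
        System.out.println(taskFor("customer-7", tasks));  // possibly a different task
    }
}
```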

Streaming top N

When your topology depends on calculating top N values, such as the top 5 trends on Twitter, you should calculate partial top N values in parallel and then merge the output of those calculations into a global top N. This can be done by using fieldsGrouping to route by field value to the parallel bolts (which partition the data), and then finally routing to a single bolt that determines the global top N values.

For an example of this pattern, see the RollingTopWords example.
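The merge step can be sketched in plain Java, assuming each parallel bolt has produced a partial count map over its partition of the data: the final bolt combines the partials and takes the overall top N. TopNMerge and its names are illustrative, not Storm APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Storm-free sketch of the streaming top-N pattern: each "parallel bolt"
// holds a partial count map, and a final "global bolt" merges them and
// takes the overall top N.
public class TopNMerge {
    /** Keys of the top n entries of a count map, highest count first. */
    public static List<String> topN(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two partial count maps, as two parallel bolts might hold them.
        Map<String, Integer> part1 = Map.of("storm", 5, "kafka", 2);
        Map<String, Integer> part2 = Map.of("hbase", 4, "hive", 1);

        // Global bolt: merge the partials, then take the overall top 2.
        Map<String, Integer> merged = new HashMap<>(part1);
        part2.forEach((k, v) -> merged.merge(k, v, Integer::sum));
        System.out.println(topN(merged, 2)); // [storm, hbase]
    }
}
```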

Next steps


? Getting started with Storm on HDInsight

? Analyze sensor data with Storm and HDInsight

? Develop streaming data processing applications for Storm in HDInsight with SCP.NET and C#

This article is translated from the Windows Azure official website: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-overview/

If reprinting, please credit the source: http://blog.csdn.net/yangzhenping. Thank you!

