Spring XD Introduction: The runtime environment for big data applications


Brief introduction

Spring XD (eXtreme Data) is Pivotal's big data product. Together with Spring Boot and Grails, it forms the execution part of the Spring IO platform. Although Spring XD makes use of a large number of existing Spring projects, it is a runtime environment rather than a library or framework: it ships with a bin directory containing servers that you start and interact with from the command line. The runtime can run on a development machine, on your own servers, on AWS EC2, or on Cloud Foundry.

The key components in Spring XD are the Admin and Container servers. Using a DSL, you submit a description of the required processing task to the Admin server over HTTP. The Admin server then maps the processing tasks onto processing modules, where each module is a unit of execution implemented as a Spring application context.

The product has two modes of operation: single-node and multi-node. In the first, a single process is responsible for all processing and administration. This is useful for getting started, and for rapid application development and testing; all the examples in this article run in single-node mode. The second is a distributed mode, in which the Distributed Integration Runtime (DIRT) spreads the processing tasks across multiple nodes. The nodes can be your own VMs or physical servers, and Spring XD can also run on a Hadoop YARN cluster.



The XD Admin server breaks the processing tasks up into individual module definitions and assigns each one to a container instance using Apache ZooKeeper. Each container listens for the module definitions assigned to it and then deploys each module, creating a Spring application context to run it. It is important to note that, at the time of writing, Spring XD does not ship with its own ZooKeeper; the compatible version is 3.4.6, which you can download from here.
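To make the distributed picture a little more concrete, here is a rough sketch of a multi-node start-up. It is not taken from this article: it assumes ZooKeeper 3.4.6 has been unpacked locally and that the Spring XD bin directories are on your PATH, and the paths shown are illustrative only.

$ zookeeper-3.4.6/bin/zkServer.sh start
$ xd-admin
$ xd-container

The xd-admin and xd-container commands would normally run in separate terminal windows (or on separate machines), with additional containers started as needed.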

Modules share data by passing messages over configurable messaging middleware. The transport layer is pluggable and supports two other Pivotal projects, Redis and RabbitMQ, as well as an in-memory transport that works out of the box.

Use cases

The following should give you a general understanding of Spring XD.


The Spring XD team believes there are four main use cases for creating big data solutions: data ingestion, real-time analytics, workflow scheduling, and export.

Data ingestion provides the ability to receive data from a variety of input sources and transfer it to big data stores such as HDFS (the Hadoop Distributed File System), Splunk, or MPP databases. As well as files, data sources might include sensors on mobile devices, protocols such as MQTT (MQ Telemetry Transport), and events from Twitter and other social-media interactions.

Ingestion covers both the processing of event-driven data as streams and batch processing of other kinds of data (MapReduce, Pig, Hive, Cascading, SQL, and so on). The worlds of streams and jobs are quite different, but Spring XD tries to blur the boundary between them using a channel abstraction, which allows a stream to trigger a batch job and a batch job to send events that trigger other streams.
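As a rough illustration of that channel abstraction, the stream DSL (introduced later in this article) can direct a stream's output to a job's launch channel by name. The snippet below is a sketch rather than an example from the article: the job name myjob is hypothetical, and the queue:job: naming convention is an assumption based on the Spring XD reference documentation.

xd:> stream create --name jobtrigger --definition "http > queue:job:myjob" --deploy

Each HTTP POST to the stream would then request a launch of the myjob batch job.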

For streams, some real-time analytics, such as gathering metrics and counter values, are supported through an abstraction called taps. Conceptually, taps let you listen in on a stream, perform real-time analysis, and optionally send data to external systems such as GemFire, Redis, or another in-memory data grid.

Once the data is in your big data store, you need some kind of workflow tool to schedule the processing. Scheduling is necessary because the scripts or map-reduce jobs you write usually run for a long time and take the form of a chain of multiple steps. Ideally, you want to be able to restart from a specific step when something fails, rather than starting over from the beginning.

Finally, you need export steps to move the data into a system better suited to presentation and possibly further analysis, for example from HDFS to an RDBMS (relational database management system), where more traditional business-intelligence tools can be used.

Spring XD aims to provide a unified, distributed, and extensible service for these use cases. It did not start from scratch, but instead builds on a large number of existing Spring technologies. For example, it uses Spring Batch to support the workflow-scheduling and export use cases, and Spring Integration, together with a variety of enterprise-application-integration patterns, to support streaming. Other key Spring products include Spring Data, used for the NoSQL/Hadoop work, and Reactor, which provides a simplified API for writing asynchronous applications, particularly when using the LMAX Disruptor.

Installing Spring XD

In the next section, we'll look at each use case in detail. You might want to experiment with these examples yourself. Getting started is simple.

To get started, make sure your system has Java JDK 6 or newer installed. I recommend using JDK 7.

For OS X users: if you do not have Homebrew, install it, and then run:

brew tap pivotal/tap
brew install springxd

This installs Spring XD into /usr/local/Cellar/springxd/1.0.0.M7/libexec (the exact path depends on the Spring XD release).

Note: if you later want to install a newer version, use brew upgrade springxd.

Users of Red Hat or CentOS can install using yum.

Windows users can download the latest .zip file, unzip it into a folder, and then set the XD_HOME environment variable to the installation folder.

You can start spring XD on a single node by typing the following command:

xd-singlenode

Open another terminal window and start the shell by typing the following command:

xd-shell

The XD shell will start up and present its xd:> prompt.

To check that everything works, let's create a quick stream:

stream create --definition "time | log" --name ticktock --deploy

In the console window where you started the Spring XD single node, you will see the current time being logged every second or so.

You can delete a stream from the shell using the stream destroy command.

stream destroy --name ticktock

Data ingestion streams

In Spring XD, a basic stream defines the ingestion of event-driven data from a source to a sink, passing through any number of processors.

The Spring XD shell supports a DSL for defining streams, with a pipes-and-filters syntax: source | processor | sink.

For example, a command such as stream create --name filetest --definition "file | log" --deploy logs the contents of the files that the file source picks up.
In addition to being able to process files, Spring XD supports many other sources, including:

HTTP

A definition such as "http | file" means "take data posted over HTTP and write it to a file". The HTTP source defaults to port 9000; you can override the default using the --port option, which is the only parameter the HTTP source takes.

For example (from the XD shell):

xd:> stream create --name infoqhttptest9010 --definition "http --port=9010 | file" --deploy

You can submit some data to this new port to test:

xd:> http post --target http://localhost:9010 --data "Hello World"

You will see the following text in the console window:

> POST (text/plain; charset=utf-8) http://localhost:9010 Hello World
> OK

Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f infoqhttptest9010.out

You'll see "Hello World" in the output.

To send binary data, set the Content-Type header to application/octet-stream:

$ curl --data-binary @foo.zip -H 'Content-Type: application/octet-stream' http://localhost:9000

Type stream destroy --name infoqhttptest9010 to clean up.

Mail

Mail is a source module used to receive email. Depending on the protocol used, it can either poll for email or receive it as it becomes available.

For example:

xd:> stream create --name infoqmailstream --definition "mail --host=imap.gmail.com --username=charles@c4media.com --password=secret --delete=false | file" --deploy

Note: the delete option matters here because, by default, Spring XD deletes an email once it has been consumed. Spring XD also has a markAsRead option, whose default value is false. This is explained in detail in the Spring Integration documentation, but the essential problem is that the POP3 protocol only knows what has been read within a single session. While the POP3 mail adapter is running, each poll sends any newly available messages once, so no message is sent more than once. However, when you restart the adapter and a new session begins, any messages retrieved in the previous session may be retrieved again.

If you see this error message in the console log:

WARN task-scheduler-1 org.springframework.integration.mail.ImapIdleChannelAdapter:230 - error occurred in idle task
javax.mail.AuthenticationFailedException: failed to connect, no password specified?

Try replacing the @ symbol in your username with its URL encoding, %40:

stream create --name infoqmailstream --definition "mail --host=imap.gmail.com --username=charles%40c4media.com --password=secret --delete=false | file" --deploy

Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f infoqmailstream.out

Send yourself an email and watch it appear in the log file.

Twitter search

Spring XD can use the Twitter search API (the twittersearch module) or consume data from Twitter's streaming API.

For example:

xd:> stream create --name twittersearchinfoq --definition "twittersearch --outputType=application/json --fixedDelay=1000 --consumerKey=afes2uqo6jaufljdjfhqa --consumerSecret=0top8crpmd1mxgebbgzawvjsaodmcbeabhwhxlnsg --query='infoq' | file" --deploy

This uses twittersearch with JSON output and queries Twitter for the term "infoq" every 1000 milliseconds. To run it yourself you need a consumer key (the application consumer key issued by Twitter) and its associated consumer secret.


The results are piped to a file sink, which writes to /tmp/xd/output/[streamname].out by default.

Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f twittersearchinfoq.out

Wait a while and you will exceed the rate limit of the Twitter search API; you will then see messages like the following in the console window where you started the XD single node:

11:27:01,468 WARN task-scheduler-1 client.RestTemplate:581 - GET request for "https://api.twitter.com/1.1/search/tweets.json?q=infoq&count=20&since_id=478845525597237248" resulted in 429 (Client Error (429)); invoking error handler
11:27:01,471 ERROR task-scheduler-1 handler.LoggingHandler:145 - org.springframework.social.RateLimitExceededException: The rate limit has been exceeded.

Type stream destroy --name twittersearchinfoq to clean up.

Other sources

GemFire: configures a cache and replicated region in the XD container process, together with a Spring Integration GemFire inbound channel adapter backed by a CacheListener, which emits messages triggered by an external entry event on the region. Continuous query is also supported; it allows client applications to create a GemFire query using Object Query Language (OQL) and register a CQ listener, which subscribes to the query and is notified every time the query's result set changes.

Reactor IP: acts as a server that allows remote clients to connect to XD and ingest data over raw TCP or UDP sockets. The difference between the reactor-ip source and the standard TCP source is that it is based on the Reactor project and can be configured to use the LMAX Disruptor RingBuffer library, which allows for very high ingestion rates, of the order of one million messages per second.

Syslog: there are three syslog sources: reactor-syslog, syslog-udp, and syslog-tcp. The reactor-syslog adapter uses TCP and builds on the functionality available in the Reactor project, providing better throughput than the syslog-tcp adapter.

TCP: acts as a server that allows remote clients to connect to XD and submit data over a raw TCP socket (a short example follows the MQTT entry below).

MQTT: connects to an MQTT server and receives telemetry messages.
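As a quick sketch of how one of these sources is used (this is not from the original article; the stream name and port value are made up, and the --port option name is an assumption based on the TCP source's documented defaults), a TCP ingestion stream might look like this:

xd:> stream create --name tcpingest --definition "tcp --port=1234 | log" --deploy

Any TCP client connecting to port 1234 and sending line-terminated text should then see those lines logged by XD.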

Taps

A tap can be inserted anywhere in a stream. The name comes from the "Wire Tap" pattern in the book Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf.

Conceptually, you insert a simple listener into a channel, and it publishes each incoming message to both the main channel and a secondary channel. A stream is unaware of any taps on its pipeline, and deleting a stream does not automatically delete its taps; they need to be deleted separately. However, if the tapped stream is re-created, the existing tap will continue to work.

Taps can be inserted at any point in a stream, and a stream can have more than one.
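Here is a minimal sketch of creating a tap, assuming the tap:stream: named-channel syntax from the Spring XD 1.0 reference documentation and assuming the filetest stream defined earlier is still deployed; a tap is itself created as a stream:

xd:> stream create --name filetesttap --definition "tap:stream:filetest > log" --deploy

The tap receives a copy of every message flowing out of the filetest source without disturbing the original stream.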

Processors

Data in a stream can be processed in several ways:

Filter: used to decide whether a message should be passed on to the output channel. In the simplest case the filter is just a SpEL boolean expression that returns true or false. For example:

xd:> stream create --name filtertest --definition "http | filter --expression=payload=='good' | log" --deploy

Only messages whose payload is "good" are logged. Filters can, however, be considerably more complex: Spring XD supports JsonPath expressions and custom Groovy scripts.

Transform: used to change the content or structure of a message. It supports simple SpEL expressions and, for more complex transformations, Groovy scripts.
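For instance, a minimal sketch (not from the original article; the stream name and port are arbitrary) that upper-cases each HTTP payload using a SpEL expression might look like this:

xd:> stream create --name transformtest --definition "http --port=9020 | transform --expression=payload.toUpperCase() | log" --deploy
xd:> http post --target http://localhost:9020 --data "hello"

The log sink should then show HELLO.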

Splitter: similar to the splitter concept in Spring Integration, it uses a SpEL expression that evaluates to an array or collection to break a single message into several distinct messages. JsonPath expressions can be used, but custom Groovy scripts cannot.
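A quick sketch, again not from the article (the stream name and port are made up): splitting a comma-separated payload into individual messages with a SpEL expression.

xd:> stream create --name splittertest --definition "http --port=9021 | splitter --expression=payload.split(',') | log" --deploy
xd:> http post --target http://localhost:9021 --data "a,b,c"

Each of a, b, and c should be logged as a separate message.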

Aggregator: the opposite of a splitter, it combines multiple messages into one.

Finally, there is a script processor that can be used to invoke a specific Groovy script as a processing step.

Sinks

The simplest sinks are log and file. Other supported sinks include Hadoop (HDFS), JDBC, TCP, mail, RabbitMQ, a GemFire server, a Splunk server, and MQTT. There is also a dynamic router option that routes Spring XD messages to named channels based on the value of a SpEL expression or a Groovy script. One thing that struck me as slightly odd is the lack of a general-purpose JMS sink, although you can build a custom sink module as described at https://github.com/spring-projects/spring-xd/wiki/Extending-XD.
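As a small sketch of configuring a sink (not from the article; the port, directory, and the file sink's --dir option are assumptions based on the sink's documented options), you can point the file sink away from its default output location:

xd:> stream create --name customfileout --definition "http --port=9022 | file --dir=/tmp/xd/demo" --deploy

Posting to port 9022 should then produce output under /tmp/xd/demo rather than the default /tmp/xd/output directory.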

Real-time Analytics

Spring XD supports both the real-time evaluation of various machine-learning scoring algorithms and real-time data analysis using a variety of counters and gauges. The analytics functionality is provided as modules that can be added to a stream; in this way, real-time analytics uses the same model as data ingestion.

Although the primary role of a stream could itself be real-time analytics, it is more common to add a tap that starts a secondary stream, in which the analytics, for example a counter over a field value, is applied to the same data being ingested by the primary stream.

Spring XD ships with some simple analytics, implemented behind an abstract API with in-memory and Redis implementations, as follows (an example is given after the list):

    • Simple counter
    • Field value counter: counts occurrences of specific values for a named field.
    • Aggregate counter: as found in tools such as Mongo and Redis; allows you to bucket counts by time period, such as minute, hour, day, month, and year.
    • Gauge: holds the latest value.
    • Rich gauge: holds the latest value plus a running average, maximum, and minimum.
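Here is a minimal sketch of one of these in use, assuming the twittersearchinfoq stream from earlier is still deployed and using the tap:stream: syntax, counter sink, and counter shell command as I recall them from the Spring XD documentation (the counter name is arbitrary):

xd:> stream create --name tweettap --definition "tap:stream:twittersearchinfoq > counter --name=tweetcount" --deploy
xd:> counter display --name tweetcount

The counter increments for every message the tapped stream produces, and counter display reads its current value.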

For predictive analytics, Spring XD contains an extensible class library on which further implementations can be built. For example, a PMML module available on GitHub integrates with the JPMML-Evaluator library, which supports a wide range of model types and is interoperable with models exported from Rattle, KNIME, and RapidMiner.

The product also contains abstractions for using analytical models as part of a stream-processing application. At the time of writing, only the Predictive Model Markup Language (PMML) is supported, but Pivotal told InfoQ:

We are working on an internal project to provide a wide range of analytical solutions targeting "fraud detection" and "cyber security". We are also doing some design work with OSS libraries such as "stream-lib" and "GraphLab".

Pivotal also stated that they expect to see development in this area over time and provide additional support for predictive modelling.

Batch jobs, workflow scheduling, and export

In addition to streams, Spring XD includes the ability to launch and monitor batch jobs based on Spring Batch, which supports the workflow-scheduling and export use cases.

The concept of a workflow translates into a batch job, which can be thought of as a directed graph where each node is a processing step.

Depending on the configuration, steps can be executed sequentially or in parallel. Steps can copy or process data using files, databases, MapReduce, Pig, Hive, or Cascading jobs, and they persist checkpoints that allow restarts. Like streams, jobs can run on a single node or be distributed, with data partitioning.

Spring XD ships with a small number of predefined jobs that can be used to import data into, or export data from, the Hadoop file system (HDFS). These cover FTP to HDFS, HDFS to JDBC, HDFS to MongoDB, and JDBC to HDFS. There is also a job for loading files into JDBC. You can find them in the libexec/xd/modules/job folder under the installation directory.
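As a hedged sketch of the lifecycle, here is roughly what creating and launching one of those predefined jobs looks like from the XD shell. This is not taken from the article: the job name, table name, file pattern, and the filejdbc options shown (--resources, --names, --tableName, --initializeDatabase) are assumptions based on my recollection of the Spring XD job module documentation, so check them against the version you install.

xd:> job create --name csvtodb --definition "filejdbc --resources=file:///tmp/xd/input/csv/*.csv --names=id,name --tableName=people --initializeDatabase=true" --deploy
xd:> job launch csvtodb

The job then reads the matching CSV files and writes their rows to the people table; its progress shows up under the Executions tab described below.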

Spring XD provides a fairly basic, browser-based graphical interface which, at present, only lets you perform tasks related to batch jobs. When Spring XD is started, the administrator interface is available here.


The administrator interface currently includes four tabs:

    • Modules: lists the available batch job modules, with further details such as the job module options and the module's XML configuration file.
    • Definitions: lists the XD batch job definitions and provides actions to deploy or undeploy those jobs.
    • Deployments: lists all deployed jobs and provides an option to launch a deployed job. Once a job has been deployed, it can be launched through the administrator interface.
    • Executions: lists the batch job executions and provides an option to restart a job execution, provided the batch job is restartable and in a stopped or failed state.

Conclusion

Spring XD is under active development. The first milestone release appeared in June 2013, and the GA release is expected in July of this year (2014). It is licensed under the Apache 2.0 license. Source code and examples are available on GitHub, and you can also find the project's Sonar code metrics online.

The product may be new, but, as we've seen, it is built on mature foundations, including Spring Batch, Spring Integration, Spring Data, the Reactor project, the LMAX Disruptor, and Apache Hadoop, and it provides a lightweight runtime environment that is configured and assembled through a DSL with little or no code required. Spring XD offers developers a convenient way to start building big data applications, and a one-stop shop for building and deploying them.

For those wanting to explore the product further, there are plenty of resources available, including the main wiki and videos covering topics such as real-time analytics.

About the author

Charles Humble has led the InfoQ.com editorial team since March 2014, guiding our content creation, including news, articles, books, videos, and interviews. Before joining InfoQ full-time, Charles led our Java coverage and was CTO of PRPi Consulting, a remuneration research firm acquired by PwC in July 2012. He has worked in enterprise software for nearly 20 years as a developer, architect, and development manager. In his spare time he writes music as a member of the London-based group Twofish.
