Brief Introduction

Spring XD (eXtreme Data) is Pivotal's big data product. Together with Spring Boot and Grails, it forms part of the execution portion of the Spring IO platform. Although Spring XD makes use of a large number of existing Spring projects, it is a runtime environment rather than a library or framework: a distribution containing a bin directory with servers that you start up and interact with from the command line. The runtime can run on a development machine, on your own servers, on AWS EC2, or on Cloud Foundry.

The product has two modes of operation: single-node and multi-node. In the first, a single process is responsible for all processing and administration. This is useful for getting started, and for rapid application development and testing; all the examples in this article run in single-node mode. The second is a distributed mode, in which the Distributed Integration Runtime (DIRT) distributes the processing tasks across multiple nodes. As well as running those nodes on your own VMs or physical servers, Spring XD can also run on a Hadoop YARN cluster.

The key components in Spring XD are the Admin Server and the Container Servers. Using a DSL, you submit a description of the required processing task to the Admin Server over HTTP. The Admin Server then maps the processing task onto processing modules, where a module is a unit of execution implemented as a Spring application context.
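The XD shell used later in this article is itself just an HTTP client of the Admin Server, so the same DSL can be submitted directly over HTTP. As a minimal sketch, assuming the Admin Server is running locally on its default port 9393 and exposes the /streams/definitions endpoint that the shell uses (check the reference guide for your version before relying on this), something like the following should create and deploy a simple stream:

# Assumption: default admin port 9393 and the /streams/definitions REST endpoint;
# the name and definition values are just examples.
curl -X POST "http://localhost:9393/streams/definitions" \
     --data-urlencode "name=ticktock" \
     --data-urlencode "definition=time | log" \
     --data "deploy=true"

If it succeeds, the same stream should then be visible from the shell via stream list.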
The Admin Server breaks the processing task up into individual module definitions and assigns each module to a container instance, using Apache ZooKeeper to do so. Each container listens for the module definitions assigned to it and then deploys the module, creating a Spring application context to run it. It is important to note that, at the time of writing, Spring XD does not ship with its own ZooKeeper; the compatible version is 3.4.6, which you can download from here.

Modules share data by passing messages over the configured messaging middleware. The transport layer is pluggable and supports two other Pivotal projects, Redis and RabbitMQ, as well as an in-memory option that works out of the box.
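If you want to try one of the other transports, the runtime can be told which one to use at startup. A minimal sketch, assuming the --transport option accepted by the single-node startup script at the time of writing, and a Redis instance already running locally on its default port:

# Assumption: xd-singlenode accepts a --transport option (local | redis | rabbit);
# Redis must already be running locally for this to work.
xd-singlenode --transport redis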
Use Cases

To give you a general understanding of Spring XD, let's look at the use cases it targets. The Spring XD team believes there are four main use cases for creating big data solutions: data ingestion, real-time analytics, workflow scheduling, and export.

Data ingestion provides the ability to receive data from a variety of input sources and transfer it to big data repositories such as HDFS (the Hadoop Distributed File System), Splunk, or MPP databases. As well as files, data sources might include sensors on mobile devices using the MQ Telemetry Transport protocol (MQTT), and events from Twitter and other social interactions. Ingestion covers both the processing of event-driven data as it streams in and batch processing of other kinds of data (MapReduce, Pig, Hive, Cascading, SQL, and so on). The worlds of streams and jobs are very different, but Spring XD tries to blur the boundary between them using a channel abstraction, which allows a stream to trigger a batch job, and a batch job to send events that trigger other streams.

For streams, some real-time analytics are supported through an abstraction called "taps", such as gauges and counters. Conceptually, taps let you tap into a stream, perform real-time analytics, and optionally send data to external systems such as GemFire, Redis, or another in-memory data grid.

Once you have data in your big data warehouse, you need some kind of workflow tool to schedule the processing. Scheduling is necessary because the scripts or map-reduce jobs you write usually run for a long time and take the form of a chain of multiple steps. Ideally, when a step fails you want to be able to restart from that step rather than starting over from the beginning. Finally, you need an export step that puts the data into a system better suited to presentation, and perhaps further analysis, for example moving it from HDFS to an RDBMS (relational database management system) where more traditional business intelligence tools can be used.

Spring XD aims to provide a unified, distributed, and extensible service for these use cases. It does not start from scratch; instead it takes advantage of a large number of existing Spring technologies. For example, it uses Spring Batch to support the workflow scheduling and export use cases, and Spring Integration to support streaming along with a variety of enterprise application integration patterns. Other key Spring products include Spring Data for the NoSQL/Hadoop work, and Reactor, which provides a simplified API for writing asynchronous applications, particularly when using the LMAX Disruptor.

Installing Spring XD

In the next sections we'll look at each use case in detail. You might want to try the examples yourself, and getting started is simple. You need to make sure the system has at least Java JDK 6 installed; I recommend Java JDK 7.

For OS X users, install Homebrew if you don't already have it, then run:

brew tap pivotal/tap
brew install springxd

This installs Spring XD and the libraries it depends on into /usr/local/Cellar/springxd/1.0.0.M7/libexec. Note: if you later want to install a newer version, use brew upgrade springxd. Users of Red Hat or CentOS can install using yum. Windows users can download the latest .zip file, unzip it to a folder, and set the XD_HOME environment variable to the installation folder.
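For a manual .zip installation on a Unix-like system, the equivalent is to point XD_HOME at the unpacked distribution. A minimal sketch, assuming a hypothetical install location and the distribution layout of 1.0.0.M7 (an xd directory containing the servers and a shell directory containing the shell); adjust the paths to wherever you unzipped the archive:

# /opt/spring-xd-1.0.0.M7 is a hypothetical install location; the xd/ and shell/
# subdirectories are assumptions based on the distribution layout of the time.
export XD_HOME=/opt/spring-xd-1.0.0.M7/xd
export PATH="$PATH:/opt/spring-xd-1.0.0.M7/xd/bin:/opt/spring-xd-1.0.0.M7/shell/bin"

Windows users can set the same XD_HOME value through the environment variables dialog, as noted above.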
You can start Spring XD on a single node by typing the following command:

xd-singlenode

Open another terminal window and start the shell:

xd-shell

To check that everything is working, let's create a quick stream:

stream create --definition "time | log" --name ticktock --deploy

In the console of the single-node server you will see the current time being logged. You can delete a stream from the shell using the stream destroy command:

stream destroy --name ticktock

Data Ingestion Streams

In Spring XD, a basic stream defines the ingestion of event-driven data from a source to a sink, passing through any number of processors. The Spring XD shell supports a DSL for defining streams, with a pipes-and-filters syntax: source | processor | sink. For example, a command such as

stream create --name filetest --definition "file | log" --deploy

logs the contents of a file. As well as files, Spring XD supports many other sources, including the following.

HTTP

A stream such as stream create --name mystream --definition "http | file" --deploy means "take data posted over HTTP and write it to a file". The http source listens on port 9000 by default; you can override this with the --port option, which is the only parameter the http source takes. For example (from the XD shell):

xd:> stream create --name infoqhttptest9010 --definition "http --port=9010 | file" --deploy

You can post some data to the new port to test it:

xd:> http post --target http://localhost:9010 --data "Hello World"

You will see the following in the shell window:

> POST (text/plain; charset=utf-8) http://localhost:9010 Hello World
> OK

Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f infoqhttptest9010.out

You'll see "Hello World" in the output. To send binary data, set the Content-Type header to application/octet-stream:

$ curl --data-binary @foo.zip -H 'Content-Type: application/octet-stream' http://localhost:9000

Type stream destroy --name infoqhttptest9010 to clean up.

Mail

mail is the source module used to receive email. Depending on the protocol used, it can either poll for email or receive it as it becomes available. For example:

xd:> stream create --name infoqmailstream --definition "mail --host=imap.gmail.com --username=[email protected] --password=secret --delete=false | file" --deploy

Note: the delete option here is important, because by default Spring XD deletes email once it has been consumed. Spring XD also has a markAsRead option, whose default value is false. This is explained in detail in the Spring Integration documentation, but the main problem is that the POP3 protocol only knows what has been read within a single session. As a result, while the POP3 mail adapter is running, messages are delivered successfully as each poll finds them, and no message is delivered more than once. However, when you restart the adapter and a new session starts, any messages retrieved in the previous session may be retrieved again.
If you see an error message like this in the console log:

WARN task-scheduler-1 org.springframework.integration.mail.ImapIdleChannelAdapter:230 - error occurred in idle task
javax.mail.AuthenticationFailedException: failed to connect, no password specified?

try replacing the @ symbol in your username with its URL encoding, %40:

stream create --name infoqmailstream --definition "mail --host=imap.gmail.com --username=charles%40c4media.com --password=secret --delete=false | file" --deploy

Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f infoqmailstream.out

Send yourself an email and watch it appear in the log file.

Twitter Search

Spring XD can use the Twitter Search API (the twittersearch module) or consume data from Twitter's streaming API. For example:

xd:> stream create --name twittersearchinfoq --definition "twittersearch --outputType=application/json --fixedDelay=1000 --consumerKey=afes2uqo6jaufljdjfhqa --consumerSecret=0top8crpmd1mxgebbgzawvjsaodmcbeabhwhxlnsg --query='infoq' | file" --deploy

This uses twittersearch's JSON output format and queries Twitter for the term "infoq" every 1000 milliseconds. To run it you need a consumer key (the application consumer key issued by Twitter) and its associated secret.
The result is piped synchronously to a file, /tmp/xd/output/[streamname].out by default. Open another terminal window and type:

$ cd /tmp/xd/output
$ tail -f twittersearchinfoq.out

After a short while you will exceed the rate limits of the Twitter search API, and you will see messages like these in the console window where you started the single-node server:

11:27:01,468 WARN task-scheduler-1 client.RestTemplate:581 - GET request for "https://api.twitter.com/1.1/search/tweets.json?q=infoq&count=20&since_id=478845525597237248" resulted in 429 (Client Error (429)); invoking error handler
11:27:01,471 ERROR task-scheduler-1 handler.LoggingHandler:145 - org.springframework.social.RateLimitExceededException: The rate limit has been exceeded.

Type stream destroy --name twittersearchinfoq to clean up.

Other Sources

- GemFire: configures a cache and replicated region in the XD container process, along with a Spring Integration GemFire inbound channel adapter backed by a CacheListener, which emits messages triggered by an external entity updating the region. It also supports continuous queries, which let client applications create a GemFire query using the Object Query Language (OQL) and register a CQ listener that subscribes to the query and is notified every time the query's result set changes.
- Reactor IP: acts as a server that allows remote parties to connect to XD and submit data over raw TCP or UDP sockets. The reactor-ip source differs from the standard TCP source in that it is based on the Reactor project and can be configured to use the LMAX Disruptor RingBuffer library, allowing very high ingestion rates of around one million messages per second.
- Syslog: there are three syslog sources: reactor-syslog, syslog-udp, and syslog-tcp. The reactor-syslog adapter uses TCP and builds on the functionality available in the Reactor project, providing better throughput than the syslog-tcp adapter.
- TCP: acts as a server that allows remote parties to connect to XD and submit data over a raw TCP socket.
- MQTT: connects to an MQTT server and receives telemetry messages.

Taps

A tap can be inserted anywhere in a stream; the name comes from the "Wire Tap" pattern in the book Enterprise Integration Patterns by Gregor Hohpe et al. Conceptually, you insert a simple recipient list into a channel, which publishes each incoming message to both the main channel and a secondary channel. A stream is unaware of any taps on its pipeline. Deleting a stream does not automatically delete its taps, which need to be deleted separately; however, if the tapped stream is re-created, the existing tap will continue to work. Taps can be inserted at any point in a stream, and in more than one place.

Processors

Data in a stream can be processed in several ways.

Filter: used to decide whether a message should be passed on to the output channel. In the simplest case the filter is just a SpEL boolean expression that returns true or false. For example:

xd:> stream create --name filtertest --definition "http | filter --expression=payload=='good' | log" --deploy

logs only those messages whose payload is the text "good". Filters can, however, be considerably more complex: Spring XD also supports JsonPath expressions and custom Groovy scripts.

Transform: used to change the content or structure of a message. It supports simple SpEL expressions and, for more complex transformations, Groovy scripts.
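As an illustrative sketch combining the http source with the transform processor just described (the stream name and port are arbitrary choices for this example):

xd:> stream create --name httpupper --definition "http --port=9020 | transform --expression=payload.toUpperCase() | file" --deploy
xd:> http post --target http://localhost:9020 --data "hello spring xd"

Tailing /tmp/xd/output/httpupper.out should then show the upper-cased payload.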
Splitter: similar in concept to the splitter in Spring Integration, it uses a SpEL expression that evaluates to an array or collection to break a single message up into separate messages. You can use a JsonPath expression, but not a custom Groovy script.

Aggregator: the opposite of a splitter, it combines multiple messages into one.

Script: finally, the script processor can be used to invoke a specific Groovy script as a processing step.

Sinks

The simplest sinks are log and file. Other supported sinks include Hadoop (HDFS), JDBC, TCP, mail, RabbitMQ, a GemFire server, a Splunk server, and MQTT. There is also a dynamic router option that routes Spring XD messages to named channels based on the value of a SpEL expression or a Groovy script. It strikes me as slightly odd that a general-purpose JMS sink is missing, although you can build a custom sink module as described on the Extending XD wiki page (https://github.com/spring-projects/spring-xd/wiki/Extending-XD).

Real-time Analytics

Spring XD supports real-time evaluation of various machine-learning scoring algorithms, as well as simple real-time data analytics using various kinds of counters and gauges. The analytics functionality is provided via modules that can be added to a stream; in that sense, real-time analytics are performed using the same module model as data ingestion. While the primary role of a stream can itself be to perform real-time analytics, it is more common to add a tap that initiates a secondary stream in which the analytics (for example, a counter of the values of a particular field) are applied to the same data being ingested by the primary stream; a sketch of this follows the list below. Spring XD comes with some simple analytics, implemented as an abstract API with in-memory and Redis backings, as follows:
- Simple counter
- Field value counter: counts the occurrences of each value of a specified message field.
- Aggregate counter: familiar from tools such as Mongo and Redis, it buckets counts by time period, such as minute, hour, month, and year.
- Gauge: holds the latest value.
- Rich gauge: holds the latest value along with a running average, maximum, and minimum.
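Here is the promised sketch of the tap-plus-counter pattern. The tap named-channel syntax (tap:stream:<streamname>) and the counter sink's --name option are taken from my reading of the Spring XD reference documentation of the time and should be checked against your version; the stream names, query, and credential placeholders are arbitrary.

xd:> stream create --name tweets --definition "twittersearch --query='spring' --consumerKey=<key> --consumerSecret=<secret> | file" --deploy
xd:> stream create --name tweetcount --definition "tap:stream:tweets > counter --name=tweetcounter" --deploy
xd:> counter display --name tweetcounter

Note that destroying the tweets stream would leave the tweetcount tap in place; as described above, taps have to be destroyed separately.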
For predictive analytics, Spring XD contains an extensible class library on top of which other implementations can be built. For example, a PMML module available on GitHub integrates with the JPMML-Evaluator library, provides support for a wide range of model types, and interoperates with models exported from Rattle, KNIME, and RapidMiner. The product also contains abstractions for using analytical models within stream processing applications. At the time of writing only the Predictive Model Markup Language (PMML) is supported, but Pivotal told InfoQ:
We are working on an in-house project to provide a wider range of analytics solutions, with goals such as "fraud detection" and "cyber security". We are also looking at OSS libraries, like "stream-lib" and "GraphLab", as part of the design work.
Pivotal also stated that they expect to see development in this area over time, providing additional support for predictive modelling.

Batch Jobs, Workflow Scheduling, and Export

In addition to streams, Spring XD includes the ability to launch and monitor batch jobs based on Spring Batch, which is used to support the workflow scheduling and export use cases. The concept of a workflow translates into a batch job, which can be thought of as a graph of individual steps, each one a processing step. Depending on how it is configured, the steps can be executed sequentially or in parallel. They can copy or process data from files, databases, MapReduce, Pig, Hive, or Cascading jobs, and they persist checkpoints that allow restarts. Like streams, jobs can run on a single node or be distributed, with data partitioning.

Spring XD ships with a small number of predefined jobs for exporting data to, and importing data from, the Hadoop file system HDFS. These cover FTP to HDFS, HDFS to JDBC, HDFS to MongoDB, and JDBC to HDFS; there is also a job for loading files into JDBC. You can find them in the libexec/xd/modules/job folder of the installation. A sketch of creating and launching a job from the shell follows the list of tabs below.

Spring XD also provides a fairly basic browser-based graphical interface, which currently lets you carry out tasks related to batch jobs. The admin UI is served by the Admin Server once Spring XD has started, and currently includes four tabs:
- Modules: lists the available batch job modules, together with further details such as the job module's options and its XML configuration file.
- Definitions: lists the XD batch job definitions and provides actions to deploy or undeploy those jobs.
- Deployments: lists all deployed jobs; once a job has been deployed, it can be launched from this tab.
- Executions: lists the batch job executions and provides the option to restart a job, provided it is restartable and is in a stopped or failed state.
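As promised above, here is a sketch of driving one of the predefined jobs from the shell rather than the admin UI. The job commands and the filejdbc module exist in Spring XD, but the exact option names shown here (resources, names, tableName, initializeDatabase), the file path, and the table name are assumptions recalled from the reference guide of the time; check them against your version.

xd:> job create --name csvimport --definition "filejdbc --resources=file:/tmp/xd/input/people.csv --names=name,city --tableName=people --initializeDatabase=true" --deploy
xd:> job launch --name csvimport
xd:> job execution list

The last command should show the execution status of the launch, which mirrors what the Executions tab displays in the admin UI.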
Conclusion

Spring XD is still in active development. The first milestone release was made in June 2013, and the GA release is expected in July of this year (2014). It is licensed under version 2.0 of the Apache License. The source code and examples are available on GitHub, and Sonar code metrics are available online.

The product may be new but, as we've seen, it is built on mature foundations, including Spring Batch, Spring Integration and Spring Data, along with the Reactor project, the LMAX Disruptor, and Apache Hadoop, and it provides a lightweight runtime environment that can be configured and assembled via a DSL with little or no code required. Spring XD gives developers a convenient way to start building big data applications, and a one-stop shop for building and deploying such applications. For those who want to explore the product further there are plenty of resources available, including the main wiki and videos covering real-time analytics.

About the Author

Charles Humble took over as head of the editorial team at InfoQ.com in March 2014, leading our content creation including news, articles, books, videos and interviews. Before joining InfoQ full-time, Charles led our Java coverage and was CTO of PRPi Consulting, a remuneration research firm that was acquired by PwC in July 2012. He has worked in enterprise software for roughly 20 years as a developer, architect and development manager. In his spare time he writes music and is a member of the London-based techno group Twofish.