Use Microsoft StreamInsight to Control Large Data Streams

Address: http://msdn.microsoft.com/zh-cn/magazine/hh205648.aspx Download Sample Code

When production-line output drops, a user's media stream skips, or one of your products becomes a must-have, the real trick is to identify these situations as they happen, or to predict them from past trends.

To successfully predict these situations, you need a near-real-time approach. By the time relevant data has been extracted, transformed, and loaded into a traditional BI solution such as SQL Server Analysis Services (SSAS), the situation has long since changed. Similarly, any system that relies on a request-response pattern to request updated data from a transactional data store (such as a SQL Server Reporting Services, or SSRS, report) is always working with stale data near the end of its request-polling interval. The polling interval is usually fixed, so even if a burst of interesting activity occurs, the consuming system will not know until the next interval comes around. Instead, the consuming system should be notified continuously, the moment the interesting conditions are met.

Time intervals are crucial when detecting emerging trends: a particular item being purchased 100 times in the past five minutes is clearly a stronger indicator of an emerging trend than a steady trickle of purchases over the past five months. Traditional systems such as SSAS and SSRS require developers to track the timeliness of data themselves, through a separate dimension in a multidimensional cube or a timestamp column in a transactional store. Ideally, tools for identifying emerging situations would have the concept of time built in and would provide a rich API for working with it.

Finally, an accurate indication of the future comes from analysis of the past. In fact, this is what traditional BI is all about: aggregating and analyzing large volumes of historical data to identify trends. Unfortunately, these systems require different tools and query languages than more transactional systems do. Successfully identifying emerging situations requires seamlessly correlating past data with current data, and that kind of tight integration is possible only when both kinds of data use the same tools and query language.

For specific scenarios such as production line monitoring, highly targeted custom tools exist to perform these functions, but they are generally expensive and not general-purpose.

To keep production-line output from dropping, or to make sure your products are priced appropriately, the key is being responsive enough to identify changes and adjust to them. To identify these situations easily and quickly, historical and real-time queries should use the same developer-friendly toolset and query language, the system should handle large volumes of data (on the order of hundreds of thousands of events per second) in near real time, and the engine should be flexible enough to handle scenarios across a wide range of problem domains.

Fortunately, such a tool exists. It is called Microsoft StreamInsight.

StreamInsight Architecture Overview

StreamInsight is a complex event processing engine that can handle hundreds of thousands of events per second with extremely low latency. It can be hosted by any process (such as a Windows service) or embedded directly into an application. StreamInsight has a simple adapter model for getting data in and out, and queries over both real-time and historical data use the same LINQ syntax that is used to access any other data from the Microsoft .NET Framework. It is licensed as part of SQL Server 2008 R2.

The high-level architecture of StreamInsight is quite simple: events are collected from a variety of sources through input adapters. These events are analyzed and transformed by queries, and the query results are distributed to other systems and people through output adapters. Figure 1 shows this simple structure.

Figure 1 Microsoft StreamInsight high-level architecture

Just as service-oriented architectures are concerned with messages and database systems are concerned with rows, complex event processing systems such as StreamInsight are organized around events. An event is simply a piece of data together with a time associated with that data, like a sensor reading or a stock price at a particular time of day. The data carried by the event is called its payload.

StreamInsight supports three types of events. Point events are instantaneous and have no duration. Interval events are events whose payload is relevant for a specific period of time. Edge events are similar to interval events, except that the duration is unknown when the event arrives. Instead, the start time is set and the event effectively has an infinite duration until another edge event arrives that sets the end time. For example, a speedometer reading might be a point event because it changes constantly, but the price of milk at the supermarket might be an edge event because it stays relevant for a longer period. When the retail price of milk changes (for example, because of a change in distributor pricing), the duration of the new price is unknown, so an edge event is more appropriate than an interval event. Later, when the distributor updates its pricing again, a new edge event caps the duration of the previous price, and another edge event sets the new price going forward.

Input and output adapters in StreamInsight are an abstract example of the Adapter design pattern. The StreamInsight engine operates on its own representation of events, but the actual sources of these events can vary widely, ranging from a proprietary interface to a hardware sensor to status messages generated by enterprise applications. Input adapters transform the source events into a stream of events that the engine understands.

The results of StreamInsight queries represent specific business knowledge and can be highly specialized. It is important to route these results to the most appropriate place. Output adapters can be used to turn the internal representation of an event into text printed to the console, a message sent through Windows Communication Foundation (WCF) to another system for processing, or even a point on a chart in a Windows Presentation Foundation application. Sample adapters for working with text files, WCF, and SQL are available at streaminsight.codeplex.com.

StreamInsight Queries by Example

At first glance, StreamInsight queries look similar to queries that fetch rows from a database, but there is a critical difference. When querying a database, the system constructs and executes the query and returns the results. If the underlying data changes afterward, the output is unaffected because the query has already finished running. Database query results represent a snapshot at a point in time, obtained through the request-response pattern.

StreamInsight queries are standing queries. As new input events arrive, the query continuously reacts and creates new output events as needed.

The query examples in this article come from the sample solution available for download. They start out simple and become more powerful as new query language features are introduced. All queries use the same payload class. Here is the definition of a simple class with Region and Value properties:


  
  
    // Payload carried by every event in the sample queries.
    public class EventPayload {
      public string Region { get; set; }
      public double Value { get; set; }

      // Tab-separated output with the value rounded to four decimal places.
      public override string ToString() {
        return string.Format("{0}\t{1:F4}", Region, Value);
      }
    }

The queries in the sample application use an input adapter that generates random data and an output adapter that simply writes each event to the console. The adapters in the sample application are simplified for clarity.

To run each query, uncomment the line in the Program.cs file of the sample solution that assigns that query to the local variable named "template".

Here is a basic query that filters events by the Value property:


  
  
    var filtered =
      from i in inputStream
      where i.Value > 0.5
      select i;

Any developer with LINQ experience should find this query familiar. Because StreamInsight uses LINQ as its query language, this query looks just like a LINQ to SQL query against a database or an in-memory filter over an IList. As events arrive from the input adapter, their payloads are inspected. If the Value property is greater than 0.5, the event is passed to the output adapter and printed to the console.

When the application runs, you can see events arriving continuously in the output. This is effectively a push model: StreamInsight computes new output events from the inputs as they arrive. That differs from the pull model of a database, where the application must periodically poll the data source to check whether new data has arrived. The push model fits nicely with the IObservable support available in the Microsoft .NET Framework 4, which is covered later in this article.
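The push model can be sketched with nothing more than the IObservable<T> and IObserver<T> interfaces from the .NET Framework 4. The following is a minimal illustration, not StreamInsight code: PushSource and ThresholdObserver are hypothetical names invented here, and the threshold simply mirrors the filter in the query above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical push source: delivers each published value to all subscribers.
public sealed class PushSource : IObservable<double> {
  private readonly List<IObserver<double>> observers = new List<IObserver<double>>();

  public IDisposable Subscribe(IObserver<double> observer) {
    if (observer == null) throw new ArgumentNullException("observer");
    observers.Add(observer);
    return new Subscription(observers, observer);
  }

  // The producer calls this when a value becomes available; nobody polls.
  public void Publish(double value) {
    // Iterate over a copy so observers may unsubscribe while being notified.
    foreach (var observer in observers.ToArray()) observer.OnNext(value);
  }

  private sealed class Subscription : IDisposable {
    private readonly List<IObserver<double>> observers;
    private readonly IObserver<double> observer;

    public Subscription(List<IObserver<double>> observers, IObserver<double> observer) {
      this.observers = observers;
      this.observer = observer;
    }

    // Disposing the subscription stops further notifications.
    public void Dispose() { observers.Remove(observer); }
  }
}

// Hypothetical observer that records values above a threshold,
// mirroring the "where i.Value > 0.5" filter.
public sealed class ThresholdObserver : IObserver<double> {
  private readonly double threshold;
  private readonly List<double> passed = new List<double>();

  public ThresholdObserver(double threshold) { this.threshold = threshold; }

  public IList<double> Passed { get { return passed; } }

  public void OnNext(double value) { if (value > threshold) passed.Add(value); }
  public void OnError(Exception error) { }
  public void OnCompleted() { }
}
```

Publishing 0.2, 0.7, and 0.9 to a ThresholdObserver created with a threshold of 0.5 leaves 0.7 and 0.9 in Passed. Once the subscription is disposed, later values are no longer delivered.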

Using a push model instead of polling for continuous data is helpful, but the real power of StreamInsight becomes apparent when querying over properties related to time. As events arrive through the input adapter, they are given a timestamp. The timestamp may come from the data source itself (for example, when the events represent historical data with an explicit column that stores the time), or it can be set to the time the event arrived. In effect, time is a first-class citizen in the StreamInsight query language.

Queries often look like standard database queries with a time qualifier attached at the end, such as "every five seconds" or "every three seconds". For example, here is a simple query that finds the average of the Value property every five seconds:


  
  
    var aggregated =
      from i in inputStream
        .TumblingWindow(TimeSpan.FromSeconds(5),
        HoppingWindowOutputPolicy.ClipToWindowEnd)
      select new { Avg = i.Avg(p => p.Value)};

Data Windows

Because the concept of time is so fundamental to complex event processing systems, it is important to have a simple way to work with the time component of query logic. StreamInsight uses the concept of windows to represent groupings by time. The previous query uses a tumbling window. When the application runs, the query generates a single output event every five seconds (the window size). The output event represents the average over the previous five seconds. As in LINQ to SQL or LINQ to Objects, aggregation methods such as Sum and Avg can roll up events grouped by time into single values, or select can be used to project the output into a different format.

Tumbling windows are just a special case of another window type: the hopping window. Hopping windows also have a size, but they have a hop size as well, which need not equal the window size. This means that hopping windows can overlap each other.

For example, a hopping window with a window size of five seconds and a hop size of three seconds produces output every three seconds (the hop size), giving the average over the previous five seconds (the window size). It hops forward three seconds at a time and is five seconds long. Figure 2 shows an event stream grouped into tumbling and hopping windows.

Figure 2 Tumbling and hopping windows

Notice that tumbling windows do not overlap, but hopping windows can if the hop size is smaller than the window size. If windows overlap, an event may end up in more than one window, like the third event, which is in both window 1 and window 2. Edge events (which have duration) may also overlap window boundaries and end up in more than one window, like the second-to-last event in the tumbling window.
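The way events fall into tumbling and hopping windows can be sketched in plain LINQ, without the StreamInsight engine. This is an illustration under simplifying assumptions: TimedValue, WindowAverage, and WindowSketch are hypothetical names, timestamps are whole seconds, only point events are handled, windows are aligned to time 0, and windows that would start before time 0 are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical point event: a value observed at a time in whole seconds.
public sealed class TimedValue {
  public int Time { get; private set; }
  public double Value { get; private set; }
  public TimedValue(int time, double value) { Time = time; Value = value; }
}

// Hypothetical result: the average of one window, identified by its start time.
public sealed class WindowAverage {
  public int Start { get; private set; }
  public double Average { get; private set; }
  public WindowAverage(int start, double average) { Start = start; Average = average; }
}

public static class WindowSketch {
  // Windows cover [start, start + size) and begin every 'hop' seconds from time 0.
  // When hop < size the windows overlap, so one event can be counted several times.
  public static List<WindowAverage> HoppingAverages(
      IEnumerable<TimedValue> events, int size, int hop) {
    if (size <= 0 || hop <= 0) throw new ArgumentOutOfRangeException();
    var all = events.ToList();
    var results = new List<WindowAverage>();
    if (all.Count == 0) return results;

    int lastTime = all.Max(e => e.Time);
    for (int start = 0; start <= lastTime; start += hop) {
      int end = start + size;
      var members = all.Where(e => e.Time >= start && e.Time < end).ToList();
      // Empty windows produce no output in this sketch.
      if (members.Count > 0)
        results.Add(new WindowAverage(start, members.Average(e => e.Value)));
    }
    return results;
  }

  // A tumbling window is a hopping window whose hop size equals its window size.
  public static List<WindowAverage> TumblingAverages(
      IEnumerable<TimedValue> events, int size) {
    return HoppingAverages(events, size, size);
  }
}
```

With events at seconds 1, 2, 4, 6, and 9, a five-second tumbling window yields two windows (starting at 0 and 5), while a five-second window with a three-second hop yields four (starting at 0, 3, 6, and 9), and the event at second 4 is counted in both of the first two.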

Another common window type is the count window. Count windows contain a specific number of events rather than events at a particular point or period in time. A query to find the average of the last three events that arrived would use a count window. One current limitation of count windows is that the built-in aggregation methods such as Sum and Avg are not supported. Instead, you must create a user-defined aggregate. This simple process is explained later in the article.
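The "average of the last three events" idea can be sketched in plain C# as a sliding window over arrival order. This is an assumption-laden simplification, not the engine's count window: CountWindowSketch and SlidingAverages are hypothetical names, and events are counted purely in the order they arrive, which may differ from how StreamInsight defines its count windows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountWindowSketch {
  // Emits one average for every position at which 'count' values are available.
  public static List<double> SlidingAverages(IEnumerable<double> values, int count) {
    if (count <= 0) throw new ArgumentOutOfRangeException("count");
    var window = new Queue<double>();
    var results = new List<double>();
    foreach (var value in values) {
      window.Enqueue(value);
      // Keep only the most recent 'count' values.
      if (window.Count > count) window.Dequeue();
      if (window.Count == count) results.Add(window.Average());
    }
    return results;
  }
}
```

For the values 1, 2, 3, 4, 5 and a count of 3, the sketch produces the averages 2, 3, and 4.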

The last window type is the snapshot window. Snapshot windows are easiest to understand in the context of edge events. Every time an event begins or ends, the current window is completed and a new one starts. Figure 3 shows how edge events are grouped into snapshot windows. Notice how every event boundary triggers a window boundary. When E1 begins, so does W1. When E2 begins, W1 is completed and W2 begins. The next edge is E1 ending, which completes W2 and starts W3. The result is three windows: W1 containing E1, W2 containing E1 and E2, and W3 containing E2. Once the events are grouped into windows, they are stretched so that each event appears to begin and end when its window does.

Figure 3 Snapshot windows
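The snapshot grouping described above can be reproduced with a short plain-C# sketch. Lifetime and SnapshotSketch are hypothetical names invented for this illustration; times are whole numbers, each event is valid from its start up to but not including its end, and gaps with no active event produce no window.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event with a duration: valid over [Start, End).
public sealed class Lifetime {
  public string Name { get; private set; }
  public int Start { get; private set; }
  public int End { get; private set; }
  public Lifetime(string name, int start, int end) { Name = name; Start = start; End = end; }
}

public static class SnapshotSketch {
  // Every start or end time closes the current window and opens a new one.
  public static List<string> SnapshotWindows(IEnumerable<Lifetime> events) {
    var all = events.ToList();
    var edges = all.SelectMany(e => new[] { e.Start, e.End })
                   .Distinct()
                   .OrderBy(t => t)
                   .ToList();

    var windows = new List<string>();
    for (int i = 0; i + 1 < edges.Count; i++) {
      int start = edges[i];
      int end = edges[i + 1];
      // An event belongs to the window if its lifetime overlaps [start, end).
      var names = all.Where(e => e.Start < end && e.End > start)
                     .Select(e => e.Name)
                     .ToList();
      if (names.Count > 0)
        windows.Add(string.Format("[{0},{1}): {2}", start, end, string.Join(",", names)));
    }
    return windows;
  }
}
```

With E1 valid over [0,10) and E2 valid over [4,15), the sketch returns three windows: one with E1 alone, one with E1 and E2, and one with E2 alone, matching the W1, W2, W3 walkthrough.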

More Complex Queries

Given the available windows and basic query methods such as where, group by, and order by, a wide variety of queries is possible. Here is a query that groups the input events by region and then uses a hopping window to output the sum of the payload Value for each region over the last minute:


  
  
    var payloadByRegion =
      from i in inputStream
      group i by i.Region into byRegion
      from c in byRegion.HoppingWindow(
        TimeSpan.FromMinutes(1),
        TimeSpan.FromSeconds(2),
        HoppingWindowOutputPolicy.ClipToWindowEnd)
      select new {
        Region = byRegion.Key,
        Sum = c.Sum(p => p.Value) };

These windows use a two-second hop size, so the engine sends output events every two seconds.

Because the query operators are defined on the IQueryable interface, queries can be composed. The following code builds on the previous query, which finds sums by region, to calculate the region with the highest sum. A snapshot window allows the event stream to be sorted by sum, so the Take method can be used to grab the region with the highest sum:


  
  
    var highestRegion =
      // Uses groupBy query
      (from i in payloadByRegion.SnapshotWindow(
        SnapshotWindowOutputPolicy.Clip)
        from sumByRegion in i
        orderby sumByRegion.Sum descending
        select sumByRegion).Take(1);

A common scenario is a query that relates a stream of fast-moving events (such as readings from a sensor) to slower-moving or static reference data (such as the fixed location of the sensor). Queries use joins to accomplish this.

The StreamInsight join syntax is the same as that of any other LINQ join, with one important caveat: events join only when their durations overlap. If sensor 1 reports a value at time t1, but the reference data about sensor 1 is valid only from time t2 to t3, the join will not match. The join criterion for duration is not written explicitly into the query definition; it is a fundamental property of the StreamInsight engine. When working with static data, the input adapter typically treats the data as edge events with infinite duration. That way, all joins against the fast-moving event stream succeed.
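The overlap rule can be made explicit in a plain LINQ to Objects sketch. In StreamInsight the engine applies the duration check implicitly; here it is written out as a where clause so the behavior is visible. Reading, SensorInfo, and TemporalJoinSketch are hypothetical names, times are whole numbers, and a validity range is treated as [ValidFrom, ValidTo).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fast-moving point event.
public sealed class Reading {
  public int SensorId { get; set; }
  public int Time { get; set; }
}

// Hypothetical reference data, valid over [ValidFrom, ValidTo).
public sealed class SensorInfo {
  public int SensorId { get; set; }
  public string Location { get; set; }
  public int ValidFrom { get; set; }
  public int ValidTo { get; set; }
}

public static class TemporalJoinSketch {
  // Returns "location@time" for every reading that has valid reference data.
  public static List<string> Join(
      IEnumerable<Reading> readings, IEnumerable<SensorInfo> reference) {
    var query =
      from r in readings
      join s in reference on r.SensorId equals s.SensorId
      // Explicit version of the overlap condition the engine applies implicitly.
      where r.Time >= s.ValidFrom && r.Time < s.ValidTo
      select string.Format("{0}@{1}", s.Location, r.Time);
    return query.ToList();
  }
}
```

A reading taken at time 1 does not join with reference data valid only over [2,3), while a reading at time 2 does. Reference data given the widest possible range, as with static data treated as having infinite duration, joins with every reading for that sensor.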

Correlating multiple event streams through joins is a powerful concept. Assembly lines, oil production facilities, and high-volume websites do not usually fail because of isolated events. One piece of equipment triggering a temperature alarm does not usually bring down the line; a line goes down because of a combination of circumstances, such as the temperature staying too high over a sustained period while a specific tool is in heavy use and the operators are changing shifts.

Without joins, the isolated events would not have as much business value. Using joins and StreamInsight queries over historical data, you can correlate isolated streams into very specific monitoring criteria that are then monitored in real time. A standing query can look for the situations that lead to failure and automatically generate an output event that is routed to a system that knows how to take the overheating piece of equipment offline, instead of waiting until it brings down the entire line.

In a retail scenario, events relating to sales volume by item over time can feed into pricing systems and customer order histories to ensure optimal pricing for each item, or to decide which items to suggest to the user before checkout. Because queries are easy to create, modify, and compose, you can start with simple scenarios and refine them over time to increase business value.

User-Defined Aggregates

StreamInsight ships with the most common aggregate functions, including count, sum, and average. When these functions are not enough (or you need to aggregate over a count window, as mentioned earlier), StreamInsight supports user-defined aggregate functions.

Creating a user-defined aggregate is a two-step process: write the actual aggregation method, then expose it to LINQ through an extension method.

In the first step, inherit from CepAggregate<TInput, TOutput> if the aggregate is not time-sensitive, or from CepTimeSensitiveAggregate<TInput, TOutput> if it is. These abstract classes have a single method to implement, called GenerateOutput. Figure 4 shows the implementation of the EveryOtherSum aggregate, which adds up every other event.

Figure 4 EveryOtherSum aggregate


  
  
    using System.Collections.Generic;
    using Microsoft.ComplexEventProcessing.Extensibility;

    // Time-insensitive aggregate that sums every other payload value.
    public class EveryOtherSum :
      CepAggregate<double, double> {

      public override double GenerateOutput(
        IEnumerable<double> payloads) {

        var sum = default(double);
        var include = true;  // toggles so that only every other value is added
        foreach (var d in payloads) {
          if (include) sum += d;
          include = !include;
        }
        return sum;
      }
    }

In the second step, create an extension method on CepWindow<TPayload> so that your aggregate can be used in queries. The CepUserDefinedAggregateAttribute is applied to the extension method to tell StreamInsight where to find the implementation of the aggregate (in this case, the class created in the first step). In the downloadable sample application, the code for both steps of this process is in the EveryOtherSum.cs file.

More About Adapters

Queries represent business logic that acts on the data provided by adapters. The sample application uses a simple input adapter that generates random data and an output adapter that writes to the console. Both follow a similar pattern, which the adapters available on the CodePlex site follow as well.

StreamInsight uses the Factory pattern to create adapters. Given a configuration class, the factory creates an instance of the appropriate adapter. In the sample application, the configuration classes for the input and output adapters are quite simple. The output adapter configuration has a single field that holds a format string to use when writing the output. The input adapter configuration has a field for the time to sleep between generated random events, as well as another field called CtiFrequency.

The CTI in CtiFrequency stands for current time increment. StreamInsight uses CTI events to help ensure that events are delivered in the correct order. By default, StreamInsight supports events that arrive out of order; the engine automatically orders them as they pass through queries. However, there is a limit to this reordering.

Suppose events really could arrive in any order. How could the engine ever determine that the earliest event had arrived and could be pushed through the query? It could not, because the next event might have a timestamp earlier than the earliest event received so far. StreamInsight uses CTI events to signal to the engine that no more events earlier than those already received will arrive. A CTI event effectively cues the engine to process the events that have arrived and, from then on, to ignore or adjust any event with a timestamp earlier than the current time.

The sample input adapter generates an ordered event stream, so it automatically inserts a CTI event after every generated event to keep things moving. If you write an input adapter and your program produces no output, make sure that your adapter is inserting CTIs, because without them the engine will wait forever.
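The role of the CTI can be sketched with a small reorder buffer in plain C#. This is only an illustration of the idea, not the engine's implementation: ReorderBuffer is a hypothetical class, timestamps are whole numbers, and late events are simply dropped (the sketch does not model adjusting them).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical buffer: holds events until a CTI promises nothing earlier will arrive.
public sealed class ReorderBuffer {
  private readonly List<int> pending = new List<int>();
  private readonly List<int> released = new List<int>();
  private readonly List<int> dropped = new List<int>();
  private int currentTime = int.MinValue;

  public IList<int> Released { get { return released; } }
  public IList<int> Dropped { get { return dropped; } }

  // Buffers an event by timestamp; events older than the last CTI are dropped.
  public void Enqueue(int timestamp) {
    if (timestamp < currentTime) { dropped.Add(timestamp); return; }
    pending.Add(timestamp);
  }

  // CTI: a promise that no later arrival will carry a timestamp earlier than 'time'.
  // Everything buffered before that time can now be released in order.
  public void AdvanceTime(int time) {
    if (time <= currentTime) return;
    currentTime = time;
    released.AddRange(pending.Where(t => t < time).OrderBy(t => t));
    pending.RemoveAll(t => t < time);
  }
}
```

Nothing is released until AdvanceTime is called, which mirrors the symptom described above: without CTIs the buffer only grows and no output appears.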

StreamInsight ships with a variety of base classes for adapters: typed, untyped, point, interval, and edge. Typed adapters always produce events with a well-known payload type; in the sample, that is the RandomPayload class. Untyped adapters are useful for event sources that may produce multiple types of events, or for sources such as CSV files where the layout and content of a row are not known in advance.

The sample input adapter has a well-known payload type and generates point events, so it inherits from TypedPointInputAdapter<RandomPayload>. The base class has two abstract methods that must be implemented: Start and Resume. In the sample, the Start method enables a timer that fires at the configured interval. The timer's Elapsed event runs the ProduceEvent method, which does the main work of the adapter. The body of this method follows a common pattern.

First, the adapter checks whether the engine has stopped since the last run and whether it is still running. Then a method in the base class is called to create an instance of a point event, its payload is set, and the event is enqueued in the stream. In the sample, the SetRandomEventPayload method stands in for any real adapter logic, such as reading from a file, talking to a sensor, or querying a database.

The input adapter factory is simple as well. It implements the interface ITypedInputAdapterFactory<RandomPayloadConfig> because it is a factory for typed adapters. The only wrinkle in this factory is that it also implements the ITypedDeclareAdvanceTimeProperties<RandomPayloadConfig> interface. This interface allows the factory to handle inserting the CTIs, as explained earlier.

The output adapter in the sample application follows the same pattern as the input adapter: there is a configuration class, a factory, and the output adapter itself. The adapter class looks very much like the input adapter. The main difference is that the adapter removes events from the queue rather than enqueuing them. Because CTI events are events like any other, they arrive at the output adapter too and are simply ignored.

Observables

Although the adapter model is quite simple, there is an even easier way to get events into and out of the engine. If your application uses the embedded deployment model of StreamInsight, you can use both IEnumerable and IObservable as inputs and outputs for the engine. Given an IEnumerable or IObservable, you can create an input stream by calling one of the provided extension methods, such as ToStream, ToPointStream, ToIntervalStream, or ToEdgeStream. This creates an event stream that looks just like one created by an input adapter.

Likewise, given a query, extension methods such as ToObservable/Enumerable, ToPointObservable/Enumerable, ToIntervalObservable/Enumerable, or ToEdgeObservable/Enumerable route the query output to an IObservable or IEnumerable, respectively. These patterns are especially useful for replaying historical data saved in a database.

Using the Entity Framework or LINQ to SQL, create a database query. Use the ToStream extension method to convert the database results into an event stream, and define a StreamInsight query over that stream. Finally, use ToEnumerable to route the StreamInsight results into something that you can easily iterate over with foreach and print.

Deployment Model and Other Tools

To use the Observable and Enumerable support, StreamInsight must be embedded in your application. StreamInsight does support a standalone model as well, though. During installation, you are asked whether you want to create a Windows service to host the default instance. The service can then host StreamInsight, allowing multiple applications to connect to the same instance and share adapters and queries.

Communicating with a shared server instead of an embedded one simply involves a different static method on the Server class. Instead of calling Create with the instance name, call Connect with an EndpointAddress that points to the shared instance. This deployment strategy is better suited to enterprise scenarios, where multiple applications may need to use shared queries or adapters.

In both cases, it is sometimes necessary to figure out why the output generated by StreamInsight is not what was expected. The product ships with a tool called the Event Flow Debugger for just this purpose. Using the tool is beyond the scope of this article, but in short, it allows you to connect to instances and trace input and output events through queries.

A Flexible, Responsive Tool

Flexible deployment options, a familiar programming model, and easily created adapters make StreamInsight a good choice for a wide variety of scenarios. From a centralized instance that queries and correlates thousands of sensor inputs per second to an embedded instance that monitors current and historical events within a single application, StreamInsight uses developer-friendly frameworks such as LINQ to enable highly customized solutions.

Easy-to-create adapters and built-in support for converting between event streams and IEnumerable and IObservable make it possible to get a solution up and running quickly, so that the incremental work of creating and refining queries that encapsulate specific business knowledge can begin. As they are refined, these queries provide more and more value, enabling applications and organizations to identify and respond to interesting situations as they occur, rather than after the window of opportunity has passed.

Rob Pierry is a principal consultant with Captura (capturaonline.com), a consulting company that delivers innovative user experiences backed by scalable technology. He can be reached at rpierry+msdn@gmail.com.

Thanks to the following technical experts for reviewing this article: Ramkumar Krishnan, Douglas Laudenschlager and Roman Schindlauer
