Stream Data Mining (III)


This article introduces the main research topics in stream data mining.

It starts from the background knowledge needed for data stream processing.

1. Research on Data Stream Models

The data stream model is a logical abstraction of data streams. A reasonable data stream model can improve the processing efficiency of data streams and is the basis for designing efficient processing algorithms. The main data stream analysis models are:

The sliding window model, the landmark model, and the snapshot window model.

Snapshot window:

This window fixes a start timestamp (Ts) and an end timestamp (Te) in advance. Only data arriving between Ts and Te is considered.

Landmark window:

This window includes all data from a fixed start timestamp (Ts) up to the current timestamp (Tc); the start timestamp never changes, while the end advances with time.

Sliding window:

Both the start and end timestamps of the window advance: new data continuously enters the window, and old data is continuously expired.

Of the three windows, the landmark and sliding windows can absorb newly arriving data, which makes them closer to real applications, especially the sliding window.
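As a minimal sketch (the stream of (timestamp, value) pairs and all names below are illustrative, not from the original text), the three window models differ only in which timestamps they admit:

    from collections import deque

    def snapshot_window(stream, ts, te):
        """Snapshot window: Ts and Te are both fixed in advance."""
        return [(t, v) for t, v in stream if ts <= t <= te]

    def landmark_window(stream, ts, tc):
        """Landmark window: Ts is fixed, the end grows with the current time Tc."""
        return [(t, v) for t, v in stream if ts <= t <= tc]

    class SlidingWindow:
        """Sliding window of width span: both endpoints advance with the stream."""
        def __init__(self, span):
            self.span = span
            self.items = deque()

        def add(self, t, v):
            self.items.append((t, v))
            # Expire elements that have fallen out of the window.
            while self.items and self.items[0][0] <= t - self.span:
                self.items.popleft()

Only the sliding window must delete data as it goes, which is why it demands the most care in practice.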

2. Research on Data Stream Management Systems and Data Stream Analysis

Current data stream research focuses mainly on data stream management systems and data stream analysis. Data stream management takes a systems perspective: for different application backgrounds, it studies query languages, query models, operation scheduling, resource management, load control, and other issues closely tied to the management system. Compared with these management questions, data stream analysis leans more toward theory and toward the processing algorithms that underlie the analyses.

We briefly introduced research on data stream management systems in the previous article.

Drawing on the design ideas of database management systems, many universities and research institutions in China and abroad have proposed a variety of data stream models for specific industry backgrounds and developed some representative data stream management systems. Typical examples include:

STREAM (Stanford Stream Data Manager): a general-purpose data stream prototype system from Stanford University. Built on the relational model, it defines a Continuous Query Language (CQL) for data streams. STREAM is designed to deliver continuous, approximate query results even when resources are insufficient, focusing on the management and approximate query processing of continuous, time-varying data streams. Its main research topics include the query language, operation scheduling, resource management, and load control. The system adapts to massive, fast, and variable data stream environments and provides strong continuous-query capabilities.
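As a loose sketch of what continuous-query evaluation over a window looks like (the SQL-like line in the comment only gestures at CQL syntax and is not taken from the STREAM documentation):

    def continuous_query(stream, span, predicate):
        """Re-evaluate a filter over a time-based sliding window as tuples arrive.
        Loosely analogous to a CQL-style query such as
            SELECT * FROM S [Range 60 Seconds] WHERE ...
        (this comment's syntax is illustrative only)."""
        window = []
        for t, tup in stream:
            window.append((t, tup))
            # Keep only tuples still inside the time window.
            window = [(t0, x) for t0, x in window if t0 > t - span]
            yield [x for _, x in window if predicate(x)]

A real engine would emit insert/delete deltas rather than re-emitting the full answer set on every arrival.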

TelegraphCQ: a prototype system developed at UC Berkeley on top of the open-source database PostgreSQL. It adopts the design ideas of workflow systems and centers query processing on an adaptive query processing engine, sharing operators across multiple queries through tuple routing and grouped filtering. Its focus is adaptive processing and pipeline-based dynamic operator scheduling.

Aurora & boregion: AuroraThe system is a real-time data stream system jointly developed by the University of brown, the University of brantis and the University of Massachusetts Institute of Technology. The system is mainly applicable to three types of applications: real-time Monitoring applications, data archiving applications, and applications that include historical and current data processing. The system focuses on real-time processing, such as QoS Management, memory-aware operation scheduling, semantic-based load control, and archive storage management.

Gigascope: a high-performance data stream management system developed by AT&T Labs, mainly used to monitor high-speed network data streams [14]. The system uses a two-layer query structure and selects the most appropriate processing strategy based on flow rate and available resources.

Data stream analysis includes:

Frequent itemset mining over data streams (see the sketch after this list)

Data stream clustering

Data stream classification

Data stream outlier detection

Data stream skyline computation

Data stream subsequence matching

Data stream index structures

Data stream synopsis (summary) structures

Data sampling and compression

Data stream granularity representation

Data stream similarity measurement

Data stream trend prediction
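As one concrete example from this list, frequent items over an unbounded stream are often found with the Lossy Counting algorithm of Manku and Motwani (a named substitution here: the original text identifies the problem but no specific algorithm). A minimal single-item sketch:

    import math

    class LossyCounter:
        """Lossy Counting (Manku & Motwani): one-pass frequent items
        in O((1/epsilon) * log(epsilon * N)) space."""
        def __init__(self, epsilon):
            self.epsilon = epsilon
            self.width = math.ceil(1 / epsilon)   # bucket width
            self.n = 0                            # elements seen so far
            self.counts = {}                      # item -> (count, max_error)

        def add(self, item):
            self.n += 1
            bucket = math.ceil(self.n / self.width)
            count, err = self.counts.get(item, (0, bucket - 1))
            self.counts[item] = (count + 1, err)
            if self.n % self.width == 0:          # prune at bucket boundaries
                self.counts = {k: (c, e) for k, (c, e) in self.counts.items()
                               if c + e > bucket}

        def frequent(self, support):
            # Returns every item whose true frequency is >= support * n
            # (and possibly some with frequency >= (support - epsilon) * n).
            threshold = (support - self.epsilon) * self.n
            return [k for k, (c, _) in self.counts.items() if c >= threshold]

Extending this from single items to full itemsets is what the literature calls frequent itemset mining over streams.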

Research in these areas has been fruitful both in China and abroad; data stream management, data stream aggregation analysis, and data stream mining in particular have been studied extensively and deeply.

3. Challenges of Data Stream Mining

Given the characteristics of the data stream model, current work faces the following challenges:

Low time-space complexity. Unbounded size and high speed are the basic features of data streams, so the contradiction between unbounded, fast stream data and limited resources (computation, storage, network bandwidth) is the fundamental tension in data stream research. Since a data stream is in principle infinite, an algorithm can keep up only if its time and space complexity are very low.
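A standard way to meet this requirement is a fixed-size synopsis such as the Count-Min sketch of Cormode and Muthukrishnan (not named in the original text), which answers approximate frequency queries in space independent of the stream length:

    import hashlib

    class CountMinSketch:
        """Count-Min sketch: frequency estimates in O(width * depth) space.
        Estimates never undercount; the expected overcount per row is N / width,
        and taking the minimum over depth rows sharpens the bound."""
        def __init__(self, width=1024, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _hash(self, item, row):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._hash(item, row)] += count

        def estimate(self, item):
            return min(self.table[row][self._hash(item, row)]
                       for row in range(self.depth))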

Incremental, near-real-time processing. Single-pass scanning requires algorithms to support incremental updates: because historical data is usually not stored, algorithms designed to repeatedly scan data persisted in a database no longer apply to data stream applications. Each data stream analysis problem therefore needs its own incrementally updatable data structures and algorithms. The high arrival rate also forces algorithms to handle each element in near real time, generally at linear or even sublinear cost; sublinearity can be obtained through sampling and similar techniques.
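A simple instance of an incrementally updatable statistic is Welford's one-pass algorithm for the running mean and variance: O(1) time and space per element, with no history retained:

    class RunningStats:
        """Welford's one-pass mean/variance: constant work per element."""
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0   # running sum of squared deviations

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def variance(self):
            return self.m2 / self.n if self.n > 1 else 0.0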

Adaptive approximation. The uncertainty and time-varying nature of data streams call for adaptive algorithms: an algorithm should detect dynamic changes in the stream (load, flow rate, data distribution) in a timely way and adjust its parameters accordingly, improving stability and reliability. For example, scheduling optimization, load balancing, and load shedding can handle overload conditions. Moreover, most data stream applications only need approximate results that meet a precision requirement, so algorithms can borrow the design ideas and methods of approximation algorithms.
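The crudest form of load shedding is random dropping, sketched below (capacity and the rate_estimator callback are hypothetical names, not taken from any system described above); semantic load shedding would instead drop the tuples judged least valuable to the query results:

    import random

    def shed_load(stream, capacity, rate_estimator):
        """Keep each tuple with probability capacity / current_rate
        whenever the estimated arrival rate exceeds capacity."""
        for t, tup in stream:
            rate = rate_estimator(t)
            keep_prob = min(1.0, capacity / rate) if rate > 0 else 1.0
            if random.random() < keep_prob:
                yield tup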
