A decentralized approach for mining event correlations in distributed system monitoring
Abstract: There is a growing demand for monitoring, analyzing, and controlling large-scale distributed systems. The events detected during monitoring are often correlated, and these correlations are valuable for resource allocation, job scheduling, and fault prediction. To discover correlations among detected events, most existing methods load the detected events into a database and then apply data mining. We argue that such methods are not suitable for large-scale distributed systems, because the volume of monitored events grows so fast that the correlations can hardly be found with the power of a single computer. In this paper, we propose a decentralized method that detects events, filters irrelevant ones, and then discovers their correlations effectively. We propose a MapReduce-based algorithm, MapReduce-Apriori, for mining event association rules, which utilizes the computing resources of multiple dedicated nodes in the system. Experimental results show that our decentralized event correlation mining algorithm achieves a nearly ideal speedup compared with the centralized mining method.
1. Introduction
Large-scale distributed systems include cluster systems and cloud computing systems.
The key to building an effective and reliable distributed environment is to monitor and control nodes, services, and applications. Monitoring such a large system requires continuously collecting performance attribute values (such as CPU usage and memory consumption) at a fixed frequency. As the scale of the system expands, the monitoring overhead becomes a limiting factor.
Events detected during monitoring can help administrators quickly locate and resolve bottlenecks, failures, and other issues. For example, Fisichella et al. detect health-related events in social media to predict infectious disease outbreaks, presenting a three-stage unsupervised process for public health event detection. At a given time point, a failure can be regarded as a special kind of event, observed when an attribute value exceeds a given threshold. However, as the complexity of systems keeps growing, failure becomes the norm rather than the exception [44]. Traditional methods such as checkpointing are often shown to be counterproductive [16]. Therefore, fault management research has shifted to fault prediction and related proactive management techniques [25,30].
Events are not independent but interrelated. Previous research on failure analysis has revealed important failure distribution patterns; in particular, events show strong correlations over long spans of time. Prior work has also demonstrated the importance of event correlations for resource allocation, job scheduling, and fault prediction [43,16,17].
An event correlation pattern describes the co-occurrence of different events and can be discovered with data mining methods. However, classical data mining algorithms such as Apriori face an exponential search space when mining frequent patterns. Because data mining is both computation intensive and data intensive, improving its performance as the data volume increases is a serious challenge. In addition, Apriori-based approaches do not reveal short-term correlations between detected events. Some previous work [5,36] is dedicated to mining short-term correlations among events in sensor networks; however, in these methods all data is stored in a centralized database, and aggregating events into a centralized database becomes ineffective as the scale of the system expands.
Centralized data mining methods are impractical in some cases. First, when monitoring lasts for many years, the volume of monitored events can become too large to be aggregated. Second, in some critical domains, event correlation mining cannot be completed within an acceptable time using the power of only one computer. In contrast, distributed data mining is especially useful for applications that handle large amounts of data, such as transactional data, scientific simulations, and telecommunications data, which cannot be analyzed by a traditional model within an acceptable time. We also observe that events do not persist through the entire monitoring period, so their transient nature must be addressed.
In this paper, we focus on discovering transient event correlations. To speed up the mining process, we propose aggregating events into a series of databases rather than a centralized dataset, so that event correlation mining can be done in parallel. In addition, to reduce the aggregation overhead, events are filtered locally on each node. Finally, a MapReduce-based event correlation mining algorithm runs over the series of databases to discover event associations in parallel. Event correlations are expressed as event association rules, which highlight transient associations based on a common activity interval. Event association rules facilitate the prediction of system failures and improve the quality of service (QoS) of the system. For example, if we expect to receive an event from a node at some point in time but the event does not occur, we can infer that the node may have failed.
The main contributions of this paper are as follows:
1. We extend the event correlation mining model by taking into account the transient nature of detected events.
2. We propose a method to effectively filter irrelevant events locally, reducing the number of events that need to be aggregated.
3. We propose a MapReduce-based algorithm for mining event association rules in parallel from a series of contributing nodes.
The remainder of this paper is organized as follows:
Section 2 describes the models and concepts.
Section 3 gives an example to illustrate our approach.
Section 4 introduces the specific algorithms.
Section 5 presents the experiments.
Section 6 reviews related work.
Section 7 concludes the paper and discusses future work.
2. Models and Concepts
In this section, we first introduce the monitoring and event detection framework, and then give the model for mining event association rules.
2.1 Monitoring Framework
The structure of the distributed system can be regarded as a heterogeneous monitoring tree. Nodes are divided into the super node (SN), admin nodes (AN), and working nodes (WN). The super node is the root of the whole monitoring tree and controls the entire system. Admin nodes are intermediate nodes, each managing a set of working nodes. Each working node is a leaf node. The task of detecting events is distributed and carried out by the monitoring unit on each working node. A unit (agent) is an application-level monitoring program that runs independently of other applications in the system and communicates with the outside world via message passing [4]. Each working node has local storage, which holds the events detected during monitoring. All nodes share a global distributed file system (GDFS), which is built on the nodes' local storage. Each node can download files from GDFS to its own local storage and upload files to GDFS.
A monitoring request comprises long-running activities used to observe, analyze, and control large-scale distributed systems and the applications they serve. Each monitoring unit periodically collects the values of the specified attributes and compares them with the given thresholds. Formally, we define a monitoring request as follows:
Definition 2.1:
...
The monitoring request is initiated by the super node and then propagated to all working nodes through the admin nodes. Finally, it activates the monitoring units to start monitoring. The goal of monitoring is to check whether a set of system performance attributes (CPU occupancy, memory occupancy, network bandwidth, and any custom application-level attributes) exceed their thresholds. If a threshold is exceeded at a certain point in time, an event is detected. We are only interested in whether an event is detected, not in the corresponding attribute value. We define an event as follows:
...
2.2 Event Correlation Mining Framework
Based on the definition of association rules in a traditional database, we define the item, transaction, and transaction database in our domain.
Each item has its own life cycle, which indicates its duration; accordingly, we define the item's lifetime.
3. Motivating Example
In this section, we use a concrete example to illustrate how frequent patterns are mined in our scenario.
...
Mining result: ((N1, E1) ⇒ (N2, E3), 10 min, 100%).
This means: if event E1 is detected on node N1, the probability of detecting E3 on N2 within 10 minutes is 100%.
Given a transaction database, a minimum support, and a minimum confidence, the mining problem is to produce the association rules whose support and confidence are greater than or equal to the given minimums. It can be divided into two parts:
1. Find frequent patterns (frequent itemsets);
2. Generate the association rules.
The first phase discovers the frequent itemsets iteratively and dominates the mining cost. Therefore, the key to efficient mining is reducing the overhead of this first part.
In the next section we propose a distributed approach that effectively reduces this overhead and speeds up the mining process.
4. The Algorithms
In this section, we first introduce the local event detection algorithm, and then present the MapReduce-Apriori algorithm, a MapReduce-based association rule mining algorithm that filters out certain detected events locally (before they are transmitted) and then mines their associations in parallel.
4.1 Local Event Detection
At each time point, the monitoring unit on each working node checks whether any attribute exceeds its threshold; if so, it stores a record r = (t, (n, e)), which indicates that event e is detected on node n at time point t. See Algorithm 1.
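As a rough illustration, the following is a minimal plain-Java sketch of this local detection loop. It is not the paper's Algorithm 1; the Record class, the attribute/threshold map, and the record format are assumptions made here for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of local event detection on one working node (hypothetical names). */
public class LocalEventDetector {
    // Record r = (t, (n, e)): event e detected on node n at time point t.
    static class Record {
        final long t; final String node; final String event;
        Record(long t, String node, String event) { this.t = t; this.node = node; this.event = event; }
        @Override public String toString() { return t + "," + node + "," + event; }
    }

    private final String nodeId;
    private final Map<String, Double> thresholds;              // attribute -> threshold
    private final List<Record> localStore = new ArrayList<>(); // local storage of the node

    LocalEventDetector(String nodeId, Map<String, Double> thresholds) {
        this.nodeId = nodeId;
        this.thresholds = new HashMap<>(thresholds);
    }

    /** Called once per monitoring time point with the sampled attribute values. */
    void checkAndStore(long timePoint, Map<String, Double> sampledValues) {
        for (Map.Entry<String, Double> s : sampledValues.entrySet()) {
            Double th = thresholds.get(s.getKey());
            // An event is detected when the attribute value exceeds its threshold.
            if (th != null && s.getValue() > th) {
                localStore.add(new Record(timePoint, nodeId, s.getKey()));
            }
        }
    }

    List<Record> records() { return localStore; }
}
```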
4.2 MapReduce-Apriori
MapReduce is a programming model and an associated implementation for processing large data sets. The user specifies a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. When component failures become the norm rather than the exception, MapReduce provides both programming convenience and fault tolerance.
MapReduce-Apriori consists of four stages: local event filtering, merging events into transactions, global frequent itemset mining, and global association rule generation.
The four stages are described as follows:
1. Local event filtering: filters out those tuples (i.e., 1-itemsets (n, e)) whose support count is less than the minimum support, together with the records containing them.
2. Merging events into transactions: a MapReduce pass that merges all events occurring at the same time point into one transaction.
3. Global frequent itemset mining: multiple iterations of MapReduce that discover the frequent itemsets.
4. Global association rule generation: a MapReduce pass that generates the event association rules.
MapReduce-Apriori starts from several input parameters, including the minimum support and the minimum confidence.
At this point, each node holds its detected events in its own local storage, and nothing has yet been shared on GDFS.
4.2.1 Local Filtering
At this stage, our goal is to reduce the communication overhead of uploading records to GDFS. The idea is to filter out all tuples whose support count is less than the minimum support: if a tuple is not frequent locally, it cannot be frequent globally, because all occurrences of a tuple (n, e) reside on node n. Once the monitoring phase is complete, each working node scans its locally stored records to calculate the lifetime and support count of each tuple. If an item's support count is not less than the minimum support, all of its records are kept. When the filtering phase ends, the remaining records are uploaded to GDFS; they remain distributed across different nodes.
Algorithm 2 describes the details:
For example, consider the events in Table 2 with a minimum support of 3: all items whose support count is greater than or equal to 3 are kept. After the local event filtering phase, only 4 records remain, which greatly reduces the number of records that need to be transferred. Nodes with no remaining records transmit nothing.
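The filtering idea can be sketched as follows (plain Java, a simplified stand-in for Algorithm 2, not the paper's code; it reuses the hypothetical Record class from the detection sketch above and takes the support count of an item to be its number of local occurrences):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of local event filtering on one node (simplified Algorithm 2). */
public class LocalFilter {
    /** Keeps only records whose item (n, e) occurs at least minSup times locally. */
    static List<LocalEventDetector.Record> filter(List<LocalEventDetector.Record> records, int minSup) {
        // First scan: count the local support of each item (node, event).
        Map<String, Integer> supportCount = new HashMap<>();
        for (LocalEventDetector.Record r : records) {
            supportCount.merge(r.node + ":" + r.event, 1, Integer::sum);
        }
        // Second scan: keep only records of locally frequent items.
        // A locally infrequent item (n, e) cannot be globally frequent,
        // because all of its occurrences live on node n.
        List<LocalEventDetector.Record> kept = new ArrayList<>();
        for (LocalEventDetector.Record r : records) {
            if (supportCount.get(r.node + ":" + r.event) >= minSup) {
                kept.add(r);
            }
        }
        return kept;   // only these records are uploaded to GDFS
    }
}
```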
4.2.2. Merging Records into Transactions
Events on GDFS need to be grouped by time point before their associations are discovered, and this grouping can be executed in parallel; one MapReduce pass merges records into transactions. Each mapper reads a record as a key-value pair (key = record id, value = record), where record = (t, (n, e)), and maps it to the key-value pair (key = t, value = (n, e)), where t is the time point. After all mappers finish, MapReduce collects, for each unique key t, its related items and passes the key-value pair (key = t, value = items) to a reducer, where items are all items detected at that time point. The reducer receives all key-value pairs for the same time point, turns the items into one transaction T, assigns it a unique TID, and outputs the key-value pair (key = TID, value = T).
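A hedged sketch of this pass on Hadoop is shown below. It is illustrative only, not the paper's code: it assumes the uploaded records are text lines of the form "t,n,e", uses the time point itself as the TID, and omits the job driver.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of the "merge records into transactions" pass on Hadoop. */
public class MergeIntoTransactions {

    public static class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            if (f.length != 3) return;              // skip malformed records
            // key = time point t, value = item (n, e)
            ctx.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
        }
    }

    public static class TransactionReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text timePoint, Iterable<Text> items, Context ctx)
                throws IOException, InterruptedException {
            // All items detected at the same time point form one transaction.
            StringBuilder txn = new StringBuilder();
            for (Text item : items) {
                if (txn.length() > 0) txn.append(' ');
                txn.append(item.toString());
            }
            // The time point itself serves as a unique TID in this sketch.
            ctx.write(new Text("TID-" + timePoint), new Text(txn.toString()));
        }
    }
}
```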
4.2.3. Global Frequent Itemset Mining
Because the transactions are distributed on GDFS, the MapReduce model fits our scenario well for mining the global frequent itemsets in parallel. This stage consists of several iterations of MapReduce: the k-th pass produces the frequent k-itemsets, and the loop continues until no new frequent itemsets are generated.
Frequent itemset mining consists of two steps:
1. Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets Fk-1;
2. Determine the frequent k-itemsets Fk (eliminating the candidates in Ck whose support count is less than the minimum support).
The first step, candidate generation, requires two sub-steps. First, in the join step, two k-itemsets {i1, i2, ..., ik-1, ik} and {i1', i2', ..., ik-1', ik'} can be joined to produce a (k+1)-itemset if and only if ij = ij' for every j ∈ {1, 2, ..., k-1} and ik ≠ ik'; the join result is the (k+1)-itemset {i1, i2, ..., ik-1, ik, ik'}.
For example, {a, c, d} and {a, c, e} can be joined to produce {a, c, d, e}, while {a, c, d} cannot be joined with {a, b, d}.
Second, in the pruning step, remove every candidate that contains a subset (one item smaller) which is not among the frequent itemsets.
Lemma 4.1: Let Ik be a frequent k-itemset and Ij a frequent j-itemset, and let lifetime(Ik ∪ Ij) = [t1, t2]. If (t2 − t1) / detection interval < minimum support, then Ik ∪ Ij is not a frequent itemset.
Proof: omitted.
A dedicated algorithm is used to generate the candidate sets.
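As a rough illustration of the join-and-prune procedure described above (a plain-Java sketch, not the paper's candidate generation algorithm; itemsets are represented as sorted lists of item strings):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of join-and-prune candidate generation from frequent k-itemsets. */
public class CandidateGeneration {
    /** Joins frequent k-itemsets sharing their first k-1 items, then prunes. */
    static Set<List<String>> generate(Set<List<String>> frequentK) {
        Set<List<String>> candidates = new HashSet<>();
        List<List<String>> fk = new ArrayList<>(frequentK);
        int k = fk.isEmpty() ? 0 : fk.get(0).size();
        for (int i = 0; i < fk.size(); i++) {
            for (int j = i + 1; j < fk.size(); j++) {
                List<String> a = fk.get(i), b = fk.get(j);
                // Join step: first k-1 items equal, last items different.
                if (a.subList(0, k - 1).equals(b.subList(0, k - 1))
                        && !a.get(k - 1).equals(b.get(k - 1))) {
                    List<String> cand = new ArrayList<>(a);
                    cand.add(b.get(k - 1));
                    Collections.sort(cand);
                    // Prune step: every k-subset of the candidate must be frequent.
                    if (allSubsetsFrequent(cand, frequentK)) {
                        candidates.add(cand);
                    }
                }
            }
        }
        return candidates;
    }

    private static boolean allSubsetsFrequent(List<String> cand, Set<List<String>> frequentK) {
        for (int drop = 0; drop < cand.size(); drop++) {
            List<String> sub = new ArrayList<>(cand);
            sub.remove(drop);
            if (!frequentK.contains(sub)) return false;
        }
        return true;
    }
}
```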
The second step, determining the frequent k-itemsets, is a MapReduce pass per iteration that derives the frequent k-itemsets from the candidate k-itemsets by counting their support. Each mapper downloads all candidate k-itemsets and then receives key-value pairs (key = TID, value = T). For each candidate k-itemset, the map method determines whether T contains the itemset; if so, it outputs the key-value pair (key = the candidate, value = 1). The reducer aggregates the key-value pairs with the same key (i.e., the same candidate), recomputes the candidate's lifetime, and checks whether the total support count is not less than the minimum support. If so, it outputs the key-value pair (key = the candidate, value = sup(I, lifetime(I))). Algorithm 5 gives the details.
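A sketch of one counting iteration is given below (illustrative only, not Algorithm 5). It assumes KeyValueTextInputFormat so that the transactions from the previous pass arrive as (TID, item list) text pairs, and it deviates slightly from the description above: the mapper emits the occurrence time point instead of a bare 1, so the reducer can rebuild the itemset's lifetime; loading the candidates (normally done via the distributed cache) is left as a stub.

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of one counting iteration of global frequent itemset mining. */
public class CountCandidates {

    public static class CountMapper extends Mapper<Text, Text, Text, Text> {
        private List<Set<String>> candidates = new ArrayList<>();

        @Override
        protected void setup(Context ctx) {
            // In a real job the candidate k-itemsets would be read from the
            // distributed cache; here the loader is a placeholder.
            candidates = loadCandidates();
        }

        @Override
        protected void map(Text tid, Text transaction, Context ctx)
                throws IOException, InterruptedException {
            Set<String> items = new HashSet<>(Arrays.asList(transaction.toString().split(" ")));
            String t = tid.toString().replace("TID-", "");
            for (Set<String> cand : candidates) {
                if (items.containsAll(cand)) {
                    // key = candidate itemset, value = time point of this occurrence
                    ctx.write(new Text(String.join(",", new TreeSet<>(cand))), new Text(t));
                }
            }
        }

        private List<Set<String>> loadCandidates() { return new ArrayList<>(); }
    }

    public static class CountReducer extends Reducer<Text, Text, Text, Text> {
        private final int minSup = 3;   // illustrative value; normally read from the job configuration

        @Override
        protected void reduce(Text candidate, Iterable<Text> timePoints, Context ctx)
                throws IOException, InterruptedException {
            // Support count = number of occurrences; the occurrence time points
            // together form the (enhanced) lifetime of the itemset.
            List<String> lifetime = new ArrayList<>();
            for (Text t : timePoints) lifetime.add(t.toString());
            if (lifetime.size() >= minSup) {
                ctx.write(candidate, new Text("sup=" + lifetime.size() + " lifetime=" + lifetime));
            }
        }
    }
}
```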
4.2.4. Global Association Rule Generation
To generate the association rules, all subsets of each frequent itemset must be found. For a frequent itemset D, we must, for each proper non-empty subset D1, form the rule (D1 ⇒ D − D1, [t1, t2], conf); if conf is not less than the minimum confidence, it is an association rule. Ale and Rossi [3] point out a problem: for a frequent itemset I, finding all of its subsets requires exponential computation.
To avoid this, Ale and Rossi assume that an itemset is evenly distributed throughout its lifetime, so that the probability of any subset appearing is the same. We consider this assumption invalid in our case, because events are not evenly distributed over their lifetimes.
To solve this problem, we propose an enhanced lifetime format for itemsets, which records all the time points at which an itemset occurs, rather than only the earliest and latest time points. Based on this format, we can compute sup(D1, lifetime(D)) directly, without scanning the database.
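A minimal sketch of such an enhanced lifetime record is shown below; the class name and methods are illustrative assumptions, not the paper's data structure.

```java
import java.util.TreeSet;

/** Sketch of the enhanced lifetime format: every occurrence time point is kept. */
public class EnhancedLifetime {
    // All time points at which the itemset occurs, kept sorted.
    private final TreeSet<Long> occurrences = new TreeSet<>();

    void recordOccurrence(long t) { occurrences.add(t); }

    long supportCount() { return occurrences.size(); }

    /** Support of this itemset restricted to the interval [from, to],
        e.g. sup(D1, lifetime(D)), without rescanning the transaction database. */
    long supportWithin(long from, long to) {
        return occurrences.subSet(from, true, to, true).size();
    }

    long firstOccurrence() { return occurrences.first(); }
    long lastOccurrence()  { return occurrences.last(); }
}
```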
To parallelize this stage as a MapReduce pass, each mapper reads a frequent itemset F and its support count sup(F, L(F)) and outputs key-value pairs (key = F, value = F1), where F1 is a non-empty proper subset of F. Each reducer then computes the confidence of (F1 ⇒ F − F1, L(F)). Algorithm 6 gives the details. If the association rules are ordered by confidence, the k rules with the highest confidence can be found directly.
To summarize, the benefits of the algorithm we propose are as follows:
1. It is not necessary to centralize all data into one database; instead, events are kept on different nodes.
2. The mining time is greatly reduced, which matters especially in critical domains.
3. Events are filtered locally, so the number of records that must be uploaded is greatly reduced.
4. MapReduce is well suited to processing the large amounts of data generated by today's distributed systems.
5. We design an extended candidate set generation algorithm that effectively reduces the number of generated candidates.
5. Experimental results
We implemented the proposed algorithm on Apache Hadoop, an open-source implementation of Google's MapReduce framework. Apache Hadoop is a Java framework that supports distributed applications over large data sets; it enables applications to work with thousands of nodes and petabytes of data. In addition, the Hadoop Distributed File System (HDFS) is a suitable implementation of GDFS.
We first build a distributed system and then simulate the monitoring environment to generate synthetic events. We then use the MapReduce-Apriori algorithm to discover event correlations in the synthetic events. To demonstrate the effectiveness of MapReduce, we also implemented the classic Apriori algorithm and applied it to a centralized database. We compared the execution time of the two methods for mining association rules.
5.1. Synthetic Event Generation
The simulator generates events according to several input parameters, which are listed in Table 6.
Given these parameters, the simulator generates the values of the attribute set A on each node and then determines whether each value exceeds its corresponding threshold. Each data point is generated independently, the attribute values follow the standard normal distribution N(0, 1), and the event generation program runs on a PC.
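A minimal sketch of such a generator for one node is shown below (the parameter values are illustrative, and drawing the per-attribute thresholds uniformly from [ξmin, ξmax] is an assumption made here, not a detail stated in the text):

```java
import java.util.Random;

/** Sketch of the synthetic event generator for one node: attribute values ~ N(0, 1). */
public class SyntheticEvents {
    public static void main(String[] args) {
        int timeSlots = 10_000;           // T: number of monitoring time points (illustrative)
        int attributes = 10;              // A: attributes per node
        double xiMin = 0.7, xiMax = 1.0;  // thresholds assumed uniform in [xiMin, xiMax]
        Random rnd = new Random(42);

        double[] thresholds = new double[attributes];
        for (int a = 0; a < attributes; a++) {
            thresholds[a] = xiMin + rnd.nextDouble() * (xiMax - xiMin);
        }
        long detected = 0;
        for (int t = 0; t < timeSlots; t++) {
            for (int a = 0; a < attributes; a++) {
                double value = rnd.nextGaussian();        // standard normal N(0, 1)
                if (value > thresholds[a]) {
                    detected++;                           // would store record r = (t, (node, attribute a))
                }
            }
        }
        System.out.println("Detected events on this node: " + detected);
    }
}
```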
Our simulations include the following scenarios:
• Given N = 50, A = 10, ξmin = 0.7, ξmax = 1.0: with the increase of T, collect the detected events distributed over the different nodes within T.
• Given T = 10,000, A = 10, ξmin = 0.7, ξmax = 1.0: with the increase of N, collect the detected events distributed over the different nodes within T.
• Given T = 10,000, N = 50, ξmin = 0.7, ξmax = 1.0: with the increase of A, collect the detected events distributed over the different nodes within T.
Once event generation has ended, the simulator immediately starts the local filtering phase. As Table 8 shows, a large number of infrequent events are filtered out effectively.
5.2. Relative Performance
In this section, we run the two algorithms to mine the associations. Apriori runs on a single PC, and MapReduce-Apriori runs on a two-node Hadoop cluster (two machines). We measured the execution time of both algorithms, including the time for frequent itemset mining and association rule generation.
Figures 2, 4, 6, and 7 show the performance of the two algorithms; the dashed line represents the performance of Apriori. In our experiments, the ideal MapReduce performance is half of the Apriori time, i.e., a two-fold speedup.
5.2.1. Varying the monitoring time slots (varying T)
Six datasets: n50a10t5k, n50a10t10k, n50a10t20k, n50a10t40k, n50a10t70k, and n50a10t100k.
Fig. 2 shows how the execution time of the two algorithms changes as T varies; increasing T increases the number of transactions. Because each TID is associated with a specific time point, in the following we refer to TIDs rather than monitoring time points directly. When the number of transactions is relatively small, the difference between the two methods is not obvious. As the number of transactions grows, MapReduce-Apriori begins to show its superiority, and its performance is very close to that of the ideal MapReduce-Apriori.
Next, we examined the frequent itemset mining and found that, for each dataset, only 1-itemsets and 2-itemsets were mined. Because the minimum support rate is 0.07, no combination of three events achieves such a high support.
Fig. 3 shows the number of frequent itemsets mined; black denotes 1-itemsets and red denotes 2-itemsets. From the chart we can see that no 2-itemsets were mined at T = 20k, while 6,000 2-itemsets were mined at T = 100k. This may explain why the execution time grows so fast after n50a10t20k in Fig. 2.
5.2.2. Varying the number of nodes in a synthetic system
Four datasets: n50a10t10k, n100a10t10k, n150a10t10k, and n200a10t10k.
When the number of nodes N in the distributed system changes, the growth in nodes increases the number of distinct tuples (items). n50a10t10k has already been tested above; Fig. 4 shows the performance of the two algorithms as the number of nodes grows. We can see that MapReduce-Apriori is far superior to Apriori and close to the ideal. The numbers of frequent itemsets mined are shown in Fig. 5: n100a10t10k has more 2-itemsets than 1-itemsets, but n150a10t10k has fewer. For n200a10t10k, no 2-itemsets were mined at all.
5.2.3. Varying attributes in the system
Four datasets: n50a10t10k, n50a15t10k, n50a20t10k, and n50a25t10k.
As the number of attributes grows, the number of distinct tuples (items) increases. Fig. 6 compares the execution time of the two algorithms; both increase with the number of attribute types. MapReduce-Apriori is superior to Apriori, reducing the execution time by half.
5.2.4. Varying min_sup_rate in a specific dataset
We start with a minimum support rate of 0.06 and increase it by 0.01 each time, counting how many tuples (items) the local filtering algorithm filters out. Next, we apply the two algorithms to the filtered events and record the execution time.
Table 9 shows the filtering results: as the minimum support increases, more items are filtered out.
Fig. 7 compares the execution time of the two algorithms under different minimum supports. As the minimum support rises, the execution time drops; moreover, MapReduce-Apriori consistently outperforms Apriori and gradually approaches the ideal case.
6. Related work
7. Conclusions and Future Work
In this paper, we propose an extended model for mining event association rules in distributed systems that takes into account the transient nature of detected events. In addition, we propose a MapReduce-based algorithm that effectively filters out irrelevant events and discovers their short-term correlations. Our experimental results show that our algorithm outperforms the Apriori-based centralized mining method, with a greatly improved speed.
Our work will be extended in two directions. First, to improve the efficiency of the mining method, we plan to extend our framework so that the frequent itemset mining phase and the association rule mining phase can be pipelined. Second, we plan to deploy our framework in a real distributed environment to further verify the effectiveness of the proposed algorithm.
Translated from the original paper: A Decentralized Approach for Mining Event Correlations in Distributed System Monitoring.