Data Mining Article Translation: Mining Emerging Patterns by Streaming Feature Selection

Source: Internet
Author: User
Tags: SVM, UCI Machine Learning Repository

For learning data mining, useful tools include machine-learning toolkits, SPSS (IBM), Matlab, and Hadoop. I suggest reading articles in your spare time to broaden your horizons; what follows is my translation of a paper for everyone to learn from. I am also interested in machine learning, big data, and target tracking; if you share these interests, we can learn from each other. My QQ mailbox is 657831414; I can send the Word-format translation and my notes by e-mail.
The original title is "Mining Emerging Patterns by Streaming Feature Selection".

Mining Emerging Patterns by Streaming Feature Selection
Kui Yu, Wei Ding, Dan A. Simovici, Xindong Wu
Kui Yu and Xindong Wu, Department of Computer Science, Hefei University of Technology
Wei Ding, Department of Computer Science, University of Massachusetts Boston
Dan A. Simovici, University of Massachusetts Boston; Xindong Wu, University of Vermont, Burlington, USA

Abstract
Building an accurate emerging-pattern (EP) classifier on a high-dimensional dataset is a challenging problem, and the problem becomes even harder when the entire feature space is not available before learning starts. In this paper, we propose a new technique for mining emerging patterns using streaming feature selection. We model a high-dimensional feature space with streaming features: features arrive one at a time, and each is processed once, on arrival. As features stream in, we estimate online whether each arriving feature is useful for mining predictive emerging patterns, using the relationship between feature relevance and EP discriminative power, and we use this relationship to guide the online EP mining process. The new method can mine EPs from a high-dimensional dataset even when the entire feature set is unavailable before learning. Our experiments on a wide range of datasets validate the effectiveness of the proposed approach in prediction accuracy, number of patterns, and running time, compared with other established methods.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology: classifier design and evaluation, feature evaluation and selection.
General Terms: Algorithms, Experimentation.
Keywords: emerging patterns, feature relevance, streaming features.
1. Introduction
An emerging pattern (EP) [7,10] is an itemset whose support changes significantly from one class of data to another. Highly accurate classifiers have been built by aggregating the discriminating power of EPs.
When the feature dimensionality reaches the thousands, mining EPs remains a daunting problem, because the abundance of candidate EPs makes them difficult to store, retrieve, sort, prune, and use effectively for classification. Meanwhile, many datasets with huge numbers of features are emerging, including streaming data, image data, gene expression data, and text data. The pattern search space of such data is enormous, and the full feature space may not even be available in advance. Mining EPs from such a space must therefore face two challenging research questions:
(1) How to efficiently mine a small set of predictive EPs from a high-dimensional dataset.
(2) How to mine predictive EPs when the feature space is so large that exhaustive search is time-consuming or infeasible.
In this article, we propose a new approach to address these two challenging problems. A novel contribution of our approach is to model a high-dimensional feature space with streaming features, and then integrate streaming feature selection into the EP mining process, helping it find a small set of predictive EPs effectively and efficiently in a huge feature space and achieve good classification performance.
The concept of streaming features was proposed to deal with feature selection in a feature space that changes over time. Unlike a data stream, with streaming features the feature dimension is modeled as a feature stream: features arrive one by one, and each feature is processed on arrival. Recent studies have shown that streaming feature selection is effective and efficient not only in a large feature space but also when the full feature space is unknown [19]. However, if we consider streaming feature selection and EP mining together, mining EPs without gathering all features and samples in advance is a complicated research problem, for the following reasons:
(1) Online data processing. Because features flow in one by one, the arriving features must be converted, mapped, and partitioned online. First, converting a real dataset into the required encoded dataset all at once before mining begins is not feasible. Second, the mapping between item numbers and real features needs to be built and updated as features stream in. Third, when a feature arrives, we must partition it by class on the fly rather than partitioning the whole dataset in advance.
(2) Dynamic EP mining. With streaming features, a solution for mining EPs is to use streaming feature selection to dynamically control the EP mining process. The problem is how to integrate streaming feature selection into EP mining to obtain an accurate EP classifier.
In this paper, we propose EPSF (mining Emerging Patterns by Streaming Feature selection). More specifically, EPSF assumes that features stream in and are immediately processed online. EPSF uses a two-phase framework to handle EP mining dynamically: as features stream in, EPSF performs the two phases alternately. EPSF thus provides a natural way to embed streaming feature selection into the EP mining process, solving the pattern-mining problem in a dynamic, high-dimensional feature space.
The remainder of this article is organized as follows: Section 2 reviews related work; Section 3 gives the necessary background; Section 4 presents our approach; Section 5 reports our results; and Section 6 provides our conclusions and future work.
2. Related work
Dong and Li [9] introduced EPs to capture significant contrasts between classes of data. A jumping emerging pattern (JEP) is a special kind of EP whose support grows from zero in one class to non-zero in another class. Like other patterns composed of conjunctions of items, EPs are easy to understand and have been used directly in a wide range of applications, such as failure detection and knowledge discovery in gene expression data [8,18].
One of the biggest challenges in mining EPs is the high computational cost, because the number of candidate patterns is exponential. An interesting line of work is to reduce the number of EPs found without sacrificing discriminative power. Dong and Li, inspired by the Max-Miner algorithm, first proposed a border-based approach, in which borders are used to represent the candidates and subsets of the EPs and border differential operations are used to discover EPs. The ConsEPMiner algorithm follows a level-wise, candidate generation-and-test approach to mine EPs [23]. Bailey et al. [3] proposed a fast algorithm for mining JEPs that is faster than the border-based algorithms. Later, Bailey et al. [4] presented a new algorithm that mines EPs efficiently by computing minimal hypergraph transversals. Inspired by the FP-tree, a miner based on the CP-tree data structure was shown to improve EP mining performance [6]. Although the performance of EP mining has improved, [17] showed that none of the previous techniques can handle more than about 60 dimensions. They proposed the ZBDD EP-miner, which mines EPs in high-dimensional data using zero-suppressed binary decision diagrams.
Many studies on EPs focus on classification. Dong et al. [10] presented the first EP classifier, CAEP (Classification by Aggregating Emerging Patterns). Based on CAEP, Li et al. proposed a JEP classifier, which differs from CAEP in that it uses only JEPs, because JEPs discriminate between classes more sharply than other kinds of EPs. All of these classifiers use the border-based method to mine EPs. Li et al. [13] also presented a lazy EP classifier, known as DeEPs, based on instance-based EP discovery, which improves on the accuracy and efficiency of the CAEP and JEP classifiers. In addition, Fan and Ramamohanarao [7] presented a robust EP classifier, the SJEP classifier, which uses strong jumping emerging patterns: special JEPs whose support is zero in one class and non-zero in the other class while satisfying a minimum support threshold. The SJEP classifier embeds the CP-tree into EP mining and uses far fewer JEPs than the JEP classifier.
Limited by EP mining technology, existing EP classifiers still cannot process datasets with more than about 60 dimensions. Although the ZBDD EP-miner can handle high-dimensional datasets, like previous methods it still suffers from an explosion in the number of EPs, even with fairly high support thresholds. Mining a small set of predictive EPs from the huge number of candidate EPs therefore remains a very challenging research problem. In a recent study, Yu et al. used Bayesian networks to help build an EP classifier, addressing this problem through causality-based associative classification [21]. That study showed that embedding a causal structure into EP mining can efficiently extract a small number of highly predictive EP patterns from high-dimensional data and achieve a highly accurate EP classifier. In contrast to [21], we propose a method of mining EPs under the control of streaming feature selection. This approach can handle a feature space that is unknown or so large that exhaustive search is time-consuming and not necessarily feasible.
3. Preliminaries
Suppose we have a dataset D defined over a set of n features and a class attribute C. We assume each feature Fi, i = 1, ..., n, takes values in a discrete domain denoted dom(Fi). An item is a feature-value pair, and an itemset X is a set of items; its support in D, supp_D(X), is the fraction of instances in D that contain X, where |D| is the number of instances in D. Let {c1, ..., ck} be a finite set of k distinct class labels; the dataset is partitioned into D1, D2, ..., Dk, where Dj consists of the instances with class label cj, j = 1, 2, ..., k. The growth rate of an itemset X from Ds to Dm (s, m = 1, ..., k, s ≠ m) is defined as follows.
Definition 1 (GR: growth rate) [9]: The growth rate of an itemset X from Ds to Dm is
GR(X) = 0, if supp_Ds(X) = 0 and supp_Dm(X) = 0;
GR(X) = ∞, if supp_Ds(X) = 0 and supp_Dm(X) > 0;
GR(X) = supp_Dm(X) / supp_Ds(X), otherwise.
Definition 2 (EP: emerging pattern): Given a growth-rate threshold ρ > 1, an EP from Ds to Dm is defined as an itemset X with GR(X) ≥ ρ.
An EP from Ds to Dm is called an EP of Dm. If GR(X) = ∞, the itemset X is called a JEP (jumping emerging pattern). The goal of EP mining is, given a growth-rate threshold and a minimum support threshold, to mine for each class ci the EP set Ei from D − D_ci to D_ci.
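As a concrete illustration of Definitions 1 and 2, the following Python sketch (not from the paper; helper names and data layout are my own) computes supports and growth rates for feature-value itemsets, with the usual conventions 0/0 = 0 and x/0 = ∞:

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[Tuple[str, str]]  # a set of (feature, value) items

def support(itemset: Itemset, instances: List[Dict[str, str]]) -> float:
    """supp_D(X): fraction of instances in D that contain every item of X."""
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if all(inst.get(f) == v for f, v in itemset))
    return hits / len(instances)

def growth_rate(itemset: Itemset, d_s: List[Dict[str, str]],
                d_m: List[Dict[str, str]]) -> float:
    """Definition 1: GR(X) = supp_Dm(X) / supp_Ds(X), with 0/0 = 0 and x/0 = inf."""
    s_s, s_m = support(itemset, d_s), support(itemset, d_m)
    if s_m == 0.0:
        return 0.0
    if s_s == 0.0:
        return float("inf")  # X is a jumping emerging pattern (JEP) of Dm
    return s_m / s_s

def is_ep(itemset: Itemset, d_s, d_m, rho: float = 1.0,
          min_sup: float = 0.2) -> bool:
    """Definition 2: X is an EP of Dm when its growth rate from Ds to Dm
    exceeds the threshold rho and X meets the minimum support in Dm."""
    return support(itemset, d_m) >= min_sup and growth_rate(itemset, d_s, d_m) > rho
```

For example, with d_m holding the T-class instances and d_s the F-class instances of balloon-style data, `is_ep(frozenset({("act", "stretch")}), d_s, d_m)` tests whether {act=stretch} is an EP of class T.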
Definition 3 (growth-rate improvement) [23]: Given an EP e, the growth-rate improvement rateimp(e) is defined as the minimum difference between its growth rate and the growth rates of all of its proper subsets: rateimp(e) = min over e' ⊂ e of (GR(e) − GR(e')).

Requiring a positive threshold on rateimp(e) ensures that the EP set is concise and representative: an EP is kept only if it adds discriminating power beyond every one of its subsets, which helps improve the predictive power of the model. The growth-rate improvement can therefore remove useless or redundant EPs.
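Definition 3 can be sketched as follows. Growth rates are passed in precomputed (as a mapping from itemsets to GR values), and the itemsets and numbers below are hypothetical, loosely modeled on the balloon example of this section:

```python
from itertools import combinations
from typing import Dict, FrozenSet, Tuple

Itemset = FrozenSet[Tuple[str, str]]

def rate_improvement(e: Itemset, gr: Dict[Itemset, float]) -> float:
    """rateimp(e): the minimum gap between GR(e) and the growth rate of
    any proper non-empty subset of e.  A non-positive value means e is a
    redundant EP, subsumed by one of its subsets."""
    subs = [frozenset(c) for r in range(1, len(e)) for c in combinations(e, r)]
    return min(gr[e] - gr[s] for s in subs) if subs else gr[e]

# Hypothetical growth rates for illustration:
gr = {
    frozenset({("act", "dip"), ("age", "child")}): 5.0,
    frozenset({("act", "dip")}): 2.5,
    frozenset({("age", "child")}): 2.0,
}
e = frozenset({("act", "dip"), ("age", "child")})
# rateimp(e) = min(5.0 - 2.5, 5.0 - 2.0) = 2.5
```

A mining loop would discard any candidate whose rate improvement falls below a chosen positive threshold.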
Tables 1 and 2 give a very clear example: they show the balloon dataset from the UCI Machine Learning Repository [5]. Assume the minimum support threshold is 0.2 and the growth-rate threshold is > 1. Candidate EPs come from two classes, T (inflated) and F (not inflated); there are 20 instances and 4 features: color, size, act, and age.

By Definition 2, in Table 2, {act=stretch} and {age=adult} are both EPs of class T. In Table 1, by Definition 3, {act=dip, age=child} is a redundant EP of class F, because its subsets {act=dip} and {age=child} are contained in it. When applying EPs to classification, the EP set of each class ci is used to decide which class a test instance T belongs to. Specifically, using each class's EPs and a scoring function, we compute k scores for T. The following definition gives the scoring function [10].
Definition 4 (aggregate score): Given a test instance T and the EP set Ei of class ci, the score of T for class ci is defined as:

score(T, ci) = sum over all EPs X in Ei with X ⊆ T of [GR(X) / (GR(X) + 1)] × supp_Di(X)
The problem with the score in Definition 4 is that the numbers of EP patterns in different classes may be unbalanced. If class ci contains more EPs than class cj, then a test instance obtains a higher score for ci than for cj, even if the test instance actually belongs to class cj. Thus the score of Definition 4 cannot be used directly to classify a test instance. To this end, Dong et al. redefined the score of class ci as a normalized score normScore(T, ci), the ratio of score(T, ci) to a base score baseScore(ci) computed from the training instances of ci, defined as follows:
normScore(T, ci) = score(T, ci) / baseScore(ci)
The test instance T is assigned the class with the highest normScore. If the normScores are equal, T is assigned the class with the most training instances.
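The aggregate-scoring rule above can be sketched in Python as follows. This is an illustration, not the paper's implementation: each mined EP is represented as a hypothetical (itemset, growth rate, support) triple, and the base scores are assumed to be given:

```python
from typing import Dict, FrozenSet, List, Tuple

# Each EP is an (itemset, growth_rate, support-in-its-own-class) triple.
EP = Tuple[FrozenSet[Tuple[str, str]], float, float]

def aggregate_score(instance: Dict[str, str], eps: List[EP]) -> float:
    """Definition 4: sum, over every EP X contained in the instance, of
    GR(X)/(GR(X)+1) * supp(X).  The weight tends to 1 as GR grows, so a
    JEP (infinite growth rate) contributes its full support."""
    total = 0.0
    for itemset, gr, sup in eps:
        if all(instance.get(f) == v for f, v in itemset):
            weight = 1.0 if gr == float("inf") else gr / (gr + 1.0)
            total += weight * sup
    return total

def classify(instance: Dict[str, str], eps_by_class: Dict[str, List[EP]],
             base_scores: Dict[str, float]) -> str:
    """Assign the class maximizing normScore(T, ci) = score(T, ci) / baseScore(ci)."""
    return max(eps_by_class,
               key=lambda c: aggregate_score(instance, eps_by_class[c]) / base_scores[c])
```

Normalizing by baseScore(ci) keeps a class with many EPs from dominating the vote simply by having more patterns.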
4. Mining emerging patterns with streaming features
4.1 Feature relevance and EP discriminative power
Exhaustively examining the pattern search space of a large, high-dimensional dataset is not feasible. The question is whether some features can be pruned before EPs are mined. From Tables 1 and 2 we can see that the final EP set contains neither feature size nor color, because their corresponding EPs contribute nothing to constructing an accurate classifier. We propose to embed streaming feature selection so that EPs can be mined in high dimensions to produce a highly accurate EP classifier; otherwise it would be impossible to search the pattern space of a high-dimensional dataset. In this section, we theoretically analyze the relationship between feature relevance and EP discriminative power, and then estimate feature relevance online as features flow in over time.
With respect to its relevance to the class attribute, an input feature belongs to one of three disjoint categories: strongly relevant, weakly relevant, or irrelevant. Weakly relevant features can be further divided into redundant features and non-redundant features; an optimal feature subset consists of the strongly relevant features plus the non-redundant weakly relevant features [22]. In the following definitions, F is the full feature set, Fi is the i-th input feature, C is the class attribute, and P(C | S) is the probability distribution of the classes conditioned on a feature subset S ⊆ F.
Definition 5 (strong relevance): A feature Fi is strongly relevant to C iff P(C | Fi, F − {Fi}) ≠ P(C | F − {Fi}).

Definition 6 (weak relevance): Fi is weakly relevant to C iff it is not strongly relevant and there exists a subset S ⊆ F − {Fi} such that P(C | Fi, S) ≠ P(C | S).

Definition 7 (irrelevance): Fi is irrelevant to C iff for every subset S ⊆ F − {Fi}, P(C | Fi, S) = P(C | S).
Proposition 1: GR(Fi = f) = 1 for every value f of Fi if and only if Fi is irrelevant to C.
Proof: Assume that dataset D has two classes, C = {cp, cn}; let Dp denote the training data of class cp and Dn the training data of class cn, and let supp_Dp(Fi = f) denote the support of the itemset {Fi = f} in Dp. The growth rate GR(Fi = f) from Dn to Dp is computed as follows:

GR(Fi = f) = supp_Dp(Fi = f) / supp_Dn(Fi = f) = P(Fi = f | cp) / P(Fi = f | cn)
If GR(Fi = f) = 1 for every value f of Fi, then P(Fi = f | cp) = P(Fi = f | cn) for every f, so Fi is independent of C. By Definition 7, for any value f we then have

P(C | Fi = f) = P(C)

and thus Fi is irrelevant to C. Conversely, if Fi is irrelevant to C, then P(Fi = f | cp) = P(Fi = f | cn) for every f, and hence GR(Fi = f) = 1. This proves Proposition 1.
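Proposition 1 can be sanity-checked numerically. In the synthetic sketch below (my own construction, not from the paper), the feature f is drawn independently of the class, so by Proposition 1 its growth rate should be close to 1 for every value:

```python
import random

# Two classes with 5000 instances each; f is sampled independently of
# the class label, so f carries no class information.
random.seed(0)
d_p = [{"f": random.choice(["a", "b"])} for _ in range(5000)]  # class c_p
d_n = [{"f": random.choice(["a", "b"])} for _ in range(5000)]  # class c_n

def supp(value, instances):
    """Support of the 1-itemset {f = value} in a class partition."""
    return sum(1 for t in instances if t["f"] == value) / len(instances)

for v in ("a", "b"):
    gr = supp(v, d_p) / supp(v, d_n)   # growth rate from D_n to D_p
    print(v, round(gr, 2))             # both ratios hover near 1
```

Up to sampling noise, both printed ratios are near 1, matching the claim that an irrelevant feature yields no EPs.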
Definition 8 (Markov blanket) [11]: Let Mi ⊆ F be a subset of features with Fi ∉ Mi. Mi is a Markov blanket for Fi if Fi is conditionally independent of F − Mi − {Fi} given Mi.
Definition 9 (redundant feature) [22]: A feature Fi is redundant, and needs to be removed from F, if and only if it is weakly relevant and has a Markov blanket within F.
From Definition 3, for an EP e, if we can find a proper subset e' of e such that rateimp(e) is not positive, then e is a redundant EP, because e can be replaced by its subset e'. Avoiding these redundant EPs in advance therefore improves search efficiency. In the following, we explain the relationship between redundant features and redundant EPs in EP mining.

Proposition 2: If Fi is redundant with respect to C, with a Markov blanket S, then for any itemset X over S and any value f of Fi, GR(X ∪ {Fi = f}) = GR(X); that is, appending Fi to X adds no discriminative power.

Proof sketch: Since Fi is redundant, C is conditionally independent of Fi given the subset S, so P(Fi = f | X, cp) = P(Fi = f | X, cn) for any itemset X over S. The growth rate from Dn to Dp is then computed as follows:

GR(X ∪ {Fi = f}) = supp_Dp(X ∪ {Fi = f}) / supp_Dn(X ∪ {Fi = f}) = [supp_Dp(X) · P(Fi = f | X, cp)] / [supp_Dn(X) · P(Fi = f | X, cn)] = GR(X)

So we have proved Proposition 2.
Proposition 2 shows that if Fi is redundant with respect to C given a subspace S, then an EP that extends an itemset over S with an item of Fi contains the same predictive information as its subset, and can therefore be discarded.
Based on Propositions 1 and 2, we can mine EPs without considering irrelevant and redundant features. We therefore embed streaming feature selection in the EP mining process to avoid generating non-EPs and redundant EPs.
Because irrelevant features are easy to detect, the two challenging problems we face are: (1) how to identify redundant features in the feature stream, and (2) how to mine EPs online from the current feature pool. We handle redundant features with the following two propositions. From Definition 8, because a Markov blanket of C carries all the information the other features have about C, we maintain a current Markov blanket of C, denoted CMB(C), which starts as an empty set and is built up gradually over time as features flow in one by one. By Definition 9, we obtain Proposition 3 to decide whether a newly arrived feature is redundant.
Proposition 3: As features flow in one by one, let CMB(C) be the current Markov blanket at time t, and suppose the feature Fi arriving at time t + 1 is weakly relevant to C. If P(C | CMB(C), Fi) = P(C | CMB(C)), then Fi is redundant and can be discarded.
With Proposition 3, we obtain Proposition 4, which determines which features already in CMB(C) become redundant when Fi joins.
Proposition 4: At time t + 1, after Fi joins CMB(C), if for some feature Y ∈ CMB(C) there exists a subset S ⊆ CMB(C) − {Y} such that P(C | S, Y) = P(C | S), then Y becomes redundant and can be removed from CMB(C).
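The redundancy test behind Propositions 3 and 4 can be sketched with a crude empirical check (my own illustration: it compares plug-in estimates of the class distribution with a tolerance, whereas a real implementation would use a statistical conditional-independence test such as G²):

```python
from collections import Counter, defaultdict
from typing import Dict, List

def cond_dist(data: List[Dict[str, int]], labels: List[int], cols: List[str]):
    """Empirical P(C | chosen columns) as raw counts: maps each observed
    value combination of `cols` to a Counter over class labels."""
    dist = defaultdict(Counter)
    for row, c in zip(data, labels):
        dist[tuple(row[col] for col in cols)][c] += 1
    return dist

def is_redundant(data, labels, mb: List[str], fi: str, tol: float = 1e-6) -> bool:
    """Proposition 3 sketch: fi is redundant given the current Markov
    blanket `mb` if, inside every configuration of mb, additionally
    conditioning on fi leaves the empirical class distribution unchanged."""
    base = cond_dist(data, labels, mb)
    ext = cond_dist(data, labels, mb + [fi])
    for key_ext, counts in ext.items():
        key = key_ext[:-1]            # drop fi's value to index the base table
        total_b = sum(base[key].values())
        total_e = sum(counts.values())
        for c in base[key]:
            if abs(base[key][c] / total_b - counts[c] / total_e) > tol:
                return False
    return True
```

Proposition 4 reuses the same test: after a new feature joins, each feature Y already in the pool is re-checked with `is_redundant(data, labels, pool_without_Y, Y)`.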
4.2 Mining emerging patterns with streaming feature selection
Based on the theoretical analysis in Section 4.1, we present the EPSF algorithm in Figure 1:

EPSF maintains two pools online: a feature pool and an EP pool. The feature pool stores the features used to mine predictive EP patterns and changes dynamically as features flow in and out; the EP pool holds the candidate EP patterns mined from the current feature pool and is updated in real time as the feature pool changes. When a new feature flows in, if it is strongly relevant or non-redundant, EPSF adds it to the current feature pool, converts it online into itemsets, mines EP patterns from those itemsets online, and adds these EP patterns to the EP pool. As features flow in, the current EP pool is updated online with the changes to the feature pool. To respond quickly to these changes, we mine only 1-itemset EPs online while features are flowing in and out, and then mine all EP patterns from the current EP pool once all the features have arrived.
Specifically, EPSF contains the following major phases:
Online mining of 1-itemset EPs (steps 2 to 18). When a new feature X arrives, EPSF first tests whether it is irrelevant and, if so, discards it. Otherwise, EPSF evaluates whether it is redundant using Proposition 3, and discards it if it is. If not, X is added to the current feature pool CMB(C). EPSF then converts feature X into itemsets, preserving the mapping between X and its items in map_form. This mapping guarantees that itemsets converted from the same feature do not appear together in one EP. Based on the item mapping, EPSF partitions feature X by class, mines the 1-itemset EPs of each class, and stores the mined EPs in the candidate EP pool, denoted CEP.
Online updating of CEP and map_form. After the new feature X is added to the current feature pool CMB(C), EPSF updates CMB(C) online by removing features that have become redundant, using Proposition 4. If features are removed from CMB(C), CEP and map_form are updated online accordingly.
The EPSF algorithm scales to high-dimensional datasets. It does not need to keep everything in memory: when a new feature arrives, it only tests whether the new feature is redundant and then updates CMB(C). Because each feature is processed on arrival, EPSF can mine high-dimensional datasets before the full feature set is known. The feature redundancy check (step 6) and the update of CMB(C) (steps 20 to 22) are performed with respect to the current CMB(C), not the entire feature space.
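The two alternating phases described above can be sketched as a single streaming loop. This is a structural sketch only: the relevance and redundancy predicates and the 1-itemset miner are placeholders standing in for Proposition 1, Propositions 3 and 4, and the online EP mining step of the actual algorithm:

```python
from typing import Callable, Dict, List, Tuple

def epsf_stream(feature_stream: List[str],
                is_irrelevant: Callable[[str], bool],
                is_redundant: Callable[[List[str], str], bool],
                mine_1itemset_eps: Callable[[str], list]) -> Tuple[List[str], Dict[str, list]]:
    """Sketch of the EPSF loop (Section 4.2): maintain a feature pool
    CMB(C) and an EP pool online as features arrive one at a time."""
    cmb: List[str] = []
    ep_pool: Dict[str, list] = {}
    for f in feature_stream:            # features arrive one by one
        if is_irrelevant(f):
            continue                    # discard irrelevant features outright
        if is_redundant(cmb, f):
            continue                    # Proposition 3: f adds nothing given CMB(C)
        cmb.append(f)
        ep_pool[f] = mine_1itemset_eps(f)   # mine 1-itemset EPs online
        # Proposition 4: re-check the features already in the pool
        for g in list(cmb[:-1]):
            if is_redundant([h for h in cmb if h != g], g):
                cmb.remove(g)
                ep_pool.pop(g, None)    # keep the EP pool in sync
    return cmb, ep_pool
```

With toy predicates (an "x_copy" feature redundant given "x", a "noise" feature irrelevant), the loop keeps only the informative features and their patterns.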
5. Experimental results
5.1 Experiment Settings
To comprehensively evaluate the EPSF algorithm, Table 3 lists 36 datasets: 24 from the UCI Machine Learning Repository (the first 24), 4 high-dimensional biomedical datasets (hiva, ovarian-cancer, lymphoma, and breast-cancer), 4 datasets from the NIPS 2003 feature selection challenge (madelon, arcene, dorothea, and dexter), and 4 commonly used gene datasets (the last four).
Our comparative study includes three types of comparisons, using 10-fold cross-validation on all datasets:
Two state-of-the-art EP classifiers: CAEP [10] and CE-EP [21], compared with EPSF.
Three well-known associative classifiers: CBA [15], CMAR [14], and CPAR [20], compared with EPSF.
Three well-known non-associative classifiers: the decision tree J48, SVM, and AdaBoost, implemented in WEKA with default parameters, compared with EPSF on prediction accuracy.
We use the method proposed by Aliferis et al. [1] to discretize continuous features. In the experiments, we set the minimum confidence threshold for CBA and CMAR to 0.8, and the growth-rate threshold for EPSF, CAEP, and CE-EP to 20. To test the effect of the minimum support threshold, we use 7 minimum support values for EPSF, CE-EP, CAEP, CBA, and CMAR: 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, and 0.4. The CPAR parameters are the same as those reported in [20]. CBA, CMAR, and CPAR are implemented in the LUCS-KDD software; EPSF, CAEP, and CE-EP are implemented in C++. All experiments were run on a Windows 7 Dell workstation with an Intel Xeon 2.93 GHz processor and 12 GB of RAM.

5.2 Comparison of forecast accuracy
Tables 4 to 6 report the detailed prediction accuracies of the EPSF classifier and the other eight classifiers (two EP classifiers, three associative classifiers, and three non-associative classifiers) on the 36 benchmark datasets. For each classifier we report the best prediction accuracy over the 7 minimum support thresholds. In Table 6, CBA, CMAR, and CPAR give results only on the 24 low-dimensional datasets, because they cannot handle high-dimensional feature spaces. The best result for each dataset is highlighted, and the symbol "/" marks a classifier that failed to run because of the huge number of candidate patterns.
To study the classification results further, we performed significance tests at the 95% level; Table 7 summarizes the win/tie/loss counts of the EPSF algorithm against the other algorithms. (Note: if a classifier cannot run on a dataset on which EPSF works, EPSF wins.)
Table 7 shows that EPSF is generally better than CAEP on all 36 datasets, and better than CBA, CMAR, and CPAR on the 24 low-dimensional datasets. EPSF is also superior to CE-EP, the most advanced EP classifier for high-dimensional data. Meanwhile, compared with the well-known non-associative classifiers, EPSF is significantly better than J48 and AdaBoost and is very competitive with SVM, as shown in Table 7. Our experiments confirm that integrating streaming feature selection into EP mining can avoid generating non-EPs and redundant EPs. This not only enables EPSF to process high-dimensional datasets, such as the last 12 datasets in Table 3, but also yields very promising prediction accuracy.

5.3 Model number comparison
Figures 2 through 4 compare the number of patterns mined by EPSF, CBA, CMAR, CAEP, and CE-EP, since these five classifiers all generate patterns under a support framework. We report the average number of patterns mined over the 7 different minimum support thresholds. Since CAEP cannot run with all support thresholds on german, wdbc, kr-vs-kp, ionosphere, and horse-colic, its pattern counts on those datasets are averaged over the support thresholds at which it can run.
Figure 2 plots only 21 low-dimensional datasets, because CAEP cannot run on the infant, promoters, and spectf datasets. In Figure 3, the x-axis represents the first 24 datasets in Table 3. In Figure 4, the x-axis represents all 36 datasets of Table 3. Clearly, EPSF selects far fewer patterns than CBA and CMAR on the low-dimensional datasets.
In Figure 4, EPSF also selects fewer patterns than CE-EP on 16 datasets. These results show that both EPSF and CE-EP can select a small number of strongly predictive EPs on high-dimensional datasets. In Figure 4, datasets 25 to 36 correspond to the last 12 high-dimensional datasets in Table 3. We can see that, even at very high feature dimensionality, the numbers of patterns selected by CE-EP and EPSF are not significantly different from those on the 24 low-dimensional datasets.
5.4 Run time comparison
The running time reported for EPSF, CAEP, and CE-EP includes everything: loading the dataset and the full 10-fold cross-validation. The respective running times of EPSF, CAEP, and CE-EP are shown in Table 8. As Table 8 shows, EPSF is faster than CAEP on all datasets. On the first 24 low-dimensional datasets, EPSF is faster than CE-EP, but on the last 12 high-dimensional datasets it is not. This is because, in each fold of cross-validation, EPSF considers all features when mining EPs, whereas CE-EP discovers the direct causes and direct effects of the class attribute before EP mining, and then performs cross-validation directly in the reduced feature space instead of over all features. Therefore, in Figures 5 and 6, we plot only the EP mining time per fold (excluding the time to train and test the classifier) with a support threshold of 0.2.
In Figure 5, the x-axis represents the same 21 datasets as in Figure 2. In Figure 6, on the x-axis, 1 to 3 represent the infant, promoters, and spectf datasets, and 4 to 15 represent the last 12 high-dimensional datasets in Table 3. From Figure 5, we can see that EPSF is still faster than CAEP and CE-EP on the 21 low-dimensional datasets.

Figure 2. Number of mined EPs: EPSF vs. CAEP. (The 21 datasets on the x-axis are: 1. australian, 2. breast-w, 3. crx, 4. cleve, 5. diabetes, 6. german, 7. house-votes, 8. hepatitis, 9. horse-colic, 10. hypothyroid, 11. heart, 12. ionosphere, 13. kr-vs-kp, 14. labor, 15. liver, 16. mushroom, 17. pima, 18. spect, 19. tictactoe, 20. vote, 21. wdbc.)

In Figure 6, EPSF is slower than CE-EP on only 3 datasets: hiva, dexter, and breast-cancer. In summary, EPSF is faster than CE-EP on 33 of the 36 datasets.

5.5 Prediction accuracy under different growth-rate thresholds
To further explore the performance of EPSF, CAEP, and CE-EP, Figures 7 to 9 analyze their prediction accuracy at 7 different growth-rate thresholds, where GR denotes the growth-rate threshold and the minimum support threshold is fixed at 0.1. Because CAEP cannot run at all 7 growth-rate thresholds on infant, ionosphere, promoters, and spectf, Figure 7 plots the prediction accuracy on the remaining 20 low-dimensional datasets. In Figures 8 and 9, the x-axis represents the 36 datasets of Table 3. From Figures 7 through 9, we see that CAEP, CE-EP, and EPSF are not sensitive to low growth-rate thresholds, especially CE-EP and EPSF.
5.6 Mining EPs without the entire feature space
Compared with CE-EP, EPSF not only handles large feature spaces but also handles high-dimensional datasets without knowing the full feature space in advance. Sometimes a feature space is so large that exhaustive search is time-consuming and infeasible. EPSF offers a way to solve this problem: it processes features one by one as they arrive, until a user-specified criterion is satisfied. CE-EP cannot handle this situation, because it needs to identify the direct causes and effects of the class attribute over the whole feature space. Owing to page limits, we estimate the performance of EPSF only on the four gene datasets, in Figure 10. For each dataset, we randomly pick 10 samples as test instances (5 positive, 5 negative) and use the rest for training. SVM and AdaBoost, trained on all features, serve as baselines. Without knowing the full feature space, EPSF evaluates its current EPs on the test samples as the features of the training samples flow in one by one.

Figure 7. Effect of different growth-rate thresholds on CAEP. (The 20 datasets on the x-axis are: 1. australian, 2. breast-w, 3. crx, 4. cleve, 5. diabetes, 6. german, 7. house-votes, 8. hepatitis, 9. horse-colic, 10. hypothyroid, 11. heart, 12. kr-vs-kp, 13. labor, 14. liver, 15. mushroom, 16. pima, 17. spect, 18. tictactoe, 19. vote, 20. wdbc.)

On the colon dataset, when the percentage of features that have arrived is between 20% and 50%, EPSF's prediction accuracy already matches SVM's. When all features have arrived, EPSF reaches a prediction accuracy of 100%, better than SVM. On the remaining datasets, EPSF is never worse than AdaBoost, and it achieves the prediction accuracy of SVM without exhaustively searching the entire feature space. This demonstrates that EPSF provides an effective and efficient way to solve the EP mining problem when exhausting the whole feature space is expensive or impossible.

5.7 Summary of experimental results
From the comparative studies in Sections 5.2 to 5.6, we draw the following conclusions:
On all datasets, EPSF produces a small number of patterns. EPSF is more accurate than the four associative classifiers (CAEP, CBA, CMAR, and CPAR) and two well-known non-associative classifiers (J48 and AdaBoost), and is competitive with SVM. Moreover, as associative classifiers, CAEP, CBA, CMAR, and CPAR cannot handle high-dimensional datasets. As for running time, EPSF is faster than CAEP on all datasets.
EPSF vs. CE-EP: Both EPSF and CE-EP can handle high-dimensional datasets with high prediction accuracy. When the full feature set is known, EPSF is better than CE-EP on the three evaluation measures (accuracy, number of patterns, and running time), although the two are close. This empirical study verifies the relationship between feature relevance and EP discriminative power. In addition, compared with CE-EP, streaming feature selection lets EPSF not only deal with high feature dimensionality but also process a feature space that is unknown before the whole space is acquired, and EPSF can avoid exhaustive search over the entire feature space.
6. Conclusions and future work
In this paper, we discussed the relationship between feature relevance and EP discriminative power. Using this relationship, we embedded streaming feature selection into the EP mining process to guide dynamic EP mining. This new approach not only handles high feature dimensionality but also has the advantage of dealing with a full feature space that is unknown before the whole space is acquired. The experimental results verify the efficiency and effectiveness of our method. We plan to apply the new approach to real satellite imagery, which can generate a potentially infinite number of texture-based features.
Acknowledgments
This work is supported by the China National 863 Program (2012AA011005), the National Natural Science Foundation of China (61070131, 61175051, and 61005007), the US National Science Foundation (CCF-0905337), and a NASA Research Award (NNX09AK86G).
References
[1] C. F. Aliferis, I. Tsamardinos, A. Statnikov & L. E. Brown. (2003) Causal Explorer: A causal probabilistic network learning toolkit for biomedical discovery. METMBS'03.
[2] R. J. Bayardo. (1998) Efficiently mining long patterns from databases. SIGMOD'98, 85-93.
[3] J. Bailey, T. Manoukian & K. Ramamohanarao. (2002) Fast algorithms for mining emerging patterns. PKDD'02, 39-50.
[4] J. Bailey, T. Manoukian & K. Ramamohanarao. (2003) A fast algorithm for computing hypergraph transversals and its application in mining emerging patterns. ICDM'03, 485-488.
[5] C. L. Blake & C. J. Merz. (1998) UCI Repository of Machine Learning Databases.
[6] H. Fan & K. Ramamohanarao. (2002) An efficient single-scan algorithm for mining essential jumping emerging patterns for classification. PAKDD'02, 456-462.
[7] H. Fan & K. Ramamohanarao. (2006) Fast discovery and the generalization of strong jumping emerging patterns for building compact and accurate classifiers. IEEE Transactions on Knowledge and Data Engineering, 18(6), 721-737.
[8] G. Fang, G. Pandey, W. Wang, M. Gupta, M. Steinbach & V. Kumar. (2012) Mining low-support discriminative patterns from dense and high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 24(2), 279-294.
[9] G. Dong & J. Li. (1999) Efficient mining of emerging patterns: Discovering trends and differences. KDD'99, 43-52.
[10] G. Dong, X. Zhang, L. Wong & J. Li. (1999) CAEP: Classification by aggregating emerging patterns. DS'99, 30-42.
[11] R. Kohavi & G. H. John. (1997) Wrappers for feature subset selection. Artificial Intelligence, 97, 273-324.
[12] J. Li, G. Dong & K. Ramamohanarao. (2000) Making use of the most expressive jumping emerging patterns for classification. PAKDD'00, 220-232.
[13] J. Li, G. Dong & K. Ramamohanarao. (2000) Instance-based classification by emerging patterns. PKDD'00, 191-200.
[14] W. Li, J. Han & J. Pei. (2001) CMAR: Accurate and efficient classification based on multiple class-association rules. ICDM'01, 369-376.
[15] B. Liu, W. Hsu & Y. Ma. (1998) Integrating classification and association rule mining. KDD'98, 80-86.
[16] D. Lo, H. Cheng, J. Han, S. Khoo & C. Sun. (2009) Classification of software behaviors for failure detection: A discriminative pattern mining approach. KDD'09, 557-566.
[17] E. Loekito & J. Bailey. (2006) Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams. KDD'06, 307-316.
[18] S. Mao & G. Dong. (2005) Discovery of highly differentiative gene groups from microarray gene expression data using the gene club approach. J. Bioinformatics and Computational Biology, 3(6), 1263-1280.
[19] X. Wu, K. Yu, H. Wang & W. Ding. (2010) Online streaming feature selection. ICML'10, 1159-1166.
[20] X. Yin & J. Han. (2003) CPAR: Classification based on predictive association rules. SDM'03, 369-376.
[21] K. Yu, X. Wu, W. Ding, H. Wang & H. Yao. (2011) Causal associative classification. ICDM'11, 914-923.
[22] L. Yu & H. Liu. (2004) Efficient feature selection via analysis of relevance and redundancy. J. of Machine Learning Research, 5, 1205-1224.
[23] X. Zhang, G. Dong & K. Ramamohanarao. (2000) Exploring constraints to efficiently mine emerging patterns from large high-dimensional datasets. KDD'00, 310-314.
[24] J. Zhou, D. Foster, R. A. Stine & L. H. Ungar. (2006) Streamwise feature selection. J. of Machine Learning Research, 7, 1861-1885.
