"reprint" Data mining-examples of medical applications based on Relieff and K-means algorithms (turn from: http://www.cnblogs.com/asxinyu/archive/2013/08/29/3289682.html)
Data mining methods give people the ability to finally realize the real value of data, namely the information and knowledge it contains. Data mining refers to extracting implicit, previously unknown, and potentially useful information and knowledge from large databases or data warehouses, and it is currently one of the most active research frontiers in the fields of databases and information-based decision making. So here I share a small piece of research done quite a while ago; it is also a simple example of data mining in practice.

1. Overview of Data Mining and Cluster Analysis
Data mining generally consists of the following steps:
(1) Analyze the problem: the source database must be evaluated to confirm that it meets the requirements for data mining. The expected results are determined, and the best algorithm for the task is selected.
(2) Extract, clean, and validate the data: the extracted data is placed in a database that is structurally compatible with the data model. Inconsistent and incompatible data is cleaned into a uniform format. Once the data has been extracted and cleaned, browse the created model to ensure that all data is present and complete.
(3) Create and debug the model: applying an algorithm to the data produces a model structure. It is important to explore the data in the resulting structure and confirm that it is an accurate representation of the "facts" in the source data. While this may not be possible for every detail, important features can be found by examining the resulting model.
(4) Query the data mining model: once the model is established, it can be used for decision support.
(5) Maintain the data mining model: after the model is established, the characteristics of the initial data, such as its validity, may change. Some changes can significantly affect precision, because they alter the nature of the data on which the original model was built. Maintaining the data mining model is therefore a very important step.
Cluster analysis is a core technique in data mining and has become a very active research topic in the field. It is based on the simple idea that "birds of a feather flock together": things are grouped or classified according to their characteristics. As an important research direction in data mining, cluster analysis is receiving more and more attention. The input to clustering is a set of records without class labels, and the number of clusters they should form may or may not be known in advance. By analyzing these data according to some clustering criterion, the record set is partitioned so that similar records fall into the same cluster and dissimilar records fall into different clusters.

2. Feature Selection and Cluster Analysis Algorithms
Relief is a family of algorithms that includes the originally proposed Relief and its later extensions ReliefF and RReliefF, of which RReliefF was proposed for regression problems where the target attribute takes continuous values. Only the Relief and ReliefF algorithms for classification problems are described below.

2.1 The Relief Algorithm
The Relief algorithm was first proposed by Kira and was initially limited to two-class classification problems. Relief is a feature weighting algorithm: it assigns each feature a weight according to the correlation between that feature and the class, and features whose weights fall below a given threshold are removed. In Relief, the correlation between a feature and the class is based on the feature's ability to distinguish samples that are close to each other. The algorithm randomly selects a sample R from the training set D, then finds the nearest neighbor H of R among samples of the same class (called the near hit) and the nearest neighbor M among samples of the other class (called the near miss), and updates the weight of each feature by the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature is useful for distinguishing same-class from different-class nearest neighbors, so its weight is increased; conversely, if the distance between R and the near hit on a feature is larger than that between R and the near miss, the feature has a negative effect on that distinction, so its weight is decreased. This process is repeated m times, and the average weight of each feature is finally obtained. The larger a feature's weight, the stronger its discriminative power for classification; the smaller the weight, the weaker. The running time of Relief increases linearly with the number of sampling iterations m and the number of original features n, so it is very efficient. The specific algorithm is as follows:
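The original's algorithm figure does not survive the reprint. Below is a minimal MATLAB sketch of the two-class Relief procedure just described, written for illustration only (it is not the article's own code, and it assumes the features in X have been scaled to a common range):

```matlab
function w = relief_sketch(X, y, m)
% Minimal two-class Relief sketch.
% X: n-by-d feature matrix (features scaled to a common range);
% y: class labels with exactly two distinct values;
% m: number of random sampling iterations.
    [n, d] = size(X);
    w = zeros(1, d);
    for t = 1:m
        i = randi(n);                          % randomly select sample R
        R = X(i, :);
        same = find(y == y(i));
        same(same == i) = [];                  % exclude R itself
        other = find(y ~= y(i));
        [~, hi] = min(sum((X(same, :) - R).^2, 2));   % near hit H
        [~, mi] = min(sum((X(other, :) - R).^2, 2));  % near miss M
        H = X(same(hi), :);
        M = X(other(mi), :);
        % decrease weight by the per-feature distance to the near hit,
        % increase it by the distance to the near miss, averaged over m
        w = w - abs(R - H) / m + abs(R - M) / m;
    end
end
```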
2.2 The ReliefF Algorithm
The Relief algorithm is simple, runs efficiently, and gives satisfactory results, so it is widely used. Its limitation, however, is that it can only handle two-class data, so in 1994 Kononenko extended it, producing the ReliefF algorithm, which can handle multi-class problems (the RReliefF variant mentioned above handles regression problems with a continuous target attribute). When dealing with a multi-class problem, ReliefF randomly takes a sample R from the training set, then finds the k nearest neighbors of R among the samples of the same class (the near hits) and the k nearest neighbors among the samples of each different class (the near misses), and updates the weight of each feature as shown in the following formula:
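The formula image does not survive the reprint; the standard ReliefF weight update it refers to, as given in Kononenko's paper, is:

$$W(A) \leftarrow W(A) - \sum_{j=1}^{k}\frac{\operatorname{diff}(A,R,H_j)}{mk} + \sum_{C \neq \operatorname{class}(R)}\frac{P(C)}{1-P(\operatorname{class}(R))}\sum_{j=1}^{k}\frac{\operatorname{diff}(A,R,M_j(C))}{mk}$$

where diff(A, I1, I2) is the normalized difference between samples I1 and I2 on attribute A, Hj and Mj(C) are the j-th near hit and the j-th near miss from class C, and P(C) is the prior probability of class C.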
The Relief family of algorithms runs efficiently and places no restrictions on data types. As feature weighting algorithms, they assign high weights to all features that are strongly correlated with the class, so their limitation is that they cannot effectively remove redundant features.

2.3 The K-means Clustering Algorithm
Because clustering algorithms group data by natural similarity, the desired clustering makes the data within each cluster as consistent as possible and the differences between clusters as large as possible. It is therefore very important to define a measure of similarity. In general, there are two ways to define similarity. The first is to define a distance between data points, describing how different they are; the second is to define the similarity between data points directly. Here are a few common ways to define distance:
1. Euclidean distance: the traditional concept of distance, suited to 2- and 3-dimensional space.
2. Minkowski distance: an extension of the Euclidean distance that can be understood as a distance in n-dimensional space.
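The formula images do not survive the reprint; the standard definitions of the two distances are:

$$d_{\text{Euclidean}}(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}, \qquad d_{\text{Minkowski}}(x,y)=\Big(\sum_{i=1}^{n}\lvert x_i-y_i\rvert^{p}\Big)^{1/p}$$

The Minkowski distance reduces to the Euclidean distance at p = 2 and to the Manhattan distance at p = 1.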
There are many kinds of clustering algorithms, and an appropriate one can be chosen according to the data types involved, the purpose of the clustering, and the application requirements. The K-means clustering algorithm is described below.
K-means is a common partition-based clustering algorithm. It takes k as a parameter and divides n objects into k clusters so that similarity within each cluster is high while similarity between clusters is low. The K-means process is: first, k objects are randomly selected as the centroids of the initial k clusters; then each remaining object is assigned to the nearest cluster according to its distance from each cluster centroid; finally, the centroid of each cluster is recalculated. This process repeats until the objective function is minimized. The centroid of a cluster is calculated by the following formula:
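The formula image does not survive the reprint; the standard centroid and squared-error criterion that the text refers to are:

$$c_j = \frac{1}{\lvert C_j\rvert}\sum_{x \in C_j} x, \qquad E = \sum_{j=1}^{k}\sum_{x \in C_j}\lVert x - c_j\rVert^{2}$$

where C_j is the set of objects assigned to cluster j, c_j is its centroid, and E is the squared error that K-means attempts to minimize.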
In a concrete implementation, a maximum number of iterations is usually defined to prevent an infinite loop when the condition in step 2 never holds. K-means attempts to find the k-way partition that minimizes the value of the squared error function. When the data are fairly evenly distributed and the clusters are clearly separated, it works well. For large-scale data sets the algorithm is relatively scalable and efficient, with complexity on the order of O(nkt), where n is the number of objects in the data set, k is the expected number of clusters, and t is the number of iterations. The algorithm usually terminates at a locally optimal solution. However, it only applies directly to numeric data; data involving non-numeric attributes is a problem. Second, the algorithm requires the number of clusters k to be given in advance, which obviously places a high demand on the user, and because the initial cluster centers are selected at random, different initial centers have a great influence on the clustering results. In addition, the K-means algorithm is not suited to discovering clusters with non-convex shapes or clusters of very different sizes, and it is sensitive to noise and outlier data.

3. An Example of Medical Data Analysis

3.1 Data Description
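To make the procedure above concrete, here is a minimal hand-written MATLAB sketch of K-means following the description in this section (MATLAB's built-in kmeans function, used later in the article, implements the same idea more robustly):

```matlab
function [idx, C] = kmeans_sketch(X, k, maxIter)
% Minimal K-means sketch. X: n-by-d data matrix; k: number of clusters;
% maxIter: iteration cap to avoid an infinite loop (see the text above).
    n = size(X, 1);
    C = X(randperm(n, k), :);                 % random initial centroids
    idx = zeros(n, 1);
    for iter = 1:maxIter
        D = pdist2(X, C, 'squaredeuclidean'); % distances to each centroid
        [~, newIdx] = min(D, [], 2);          % assign to nearest cluster
        if isequal(newIdx, idx)               % assignments stable: done
            break;
        end
        idx = newIdx;
        for j = 1:k                           % recompute each centroid
            C(j, :) = mean(X(idx == j, :), 1);
        end
    end
end
```

A production implementation would also handle clusters that become empty; the sketch omits this for brevity.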
The experimental data come from the well-known UCI Machine Learning Repository, which holds a large number of data sets for artificial intelligence and data mining; the site is: http://archive.ics.uci.edu/ml/. The repository is constantly updated and also accepts donations of data. The data sets cover life, engineering, and science, with record counts ranging from small to several hundred thousand. As of the end of 2010, the repository held 199 data sets of various types, some time-related, which can be selected according to the actual situation.
The data set used in this article is the Breast Cancer Wisconsin (Original) data set, i.e., the Wisconsin breast cancer data set. The data come from clinical case reports of the University of Wisconsin Hospitals in the United States, with 11 attributes per record. The downloaded data file is in ".data" format; converting it with Excel and MATLAB tools into MATLAB's default dataset format makes it convenient for the program to use.
The following table lists the 11 attribute names and descriptions of the data set:
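The table image does not survive the reprint. For reference, the UCI documentation for this data set lists the 11 columns as follows (columns 2-10 take integer values from 1 to 10, and the article numbers them as features 1-9):

| # | Attribute | Domain |
|---|-----------|--------|
| 1 | Sample code number | ID number |
| 2 | Clump Thickness | 1-10 |
| 3 | Uniformity of Cell Size | 1-10 |
| 4 | Uniformity of Cell Shape | 1-10 |
| 5 | Marginal Adhesion | 1-10 |
| 6 | Single Epithelial Cell Size | 1-10 |
| 7 | Bare Nuclei | 1-10 |
| 8 | Bland Chromatin | 1-10 |
| 9 | Normal Nucleoli | 1-10 |
| 10 | Mitoses | 1-10 |
| 11 | Class | 2 = benign, 4 = malignant |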
After converting the data as described above, 9 indicators can be used for feature extraction; the sample number and the class label are used only to determine the classification. The data processing idea of this article is to use the ReliefF feature extraction algorithm to calculate the weight of each attribute, eliminate the least relevant attributes, and then use the K-means clustering algorithm to analyze the remaining attributes.

3.2 Data Preprocessing and Program
After the conversion, the data were first preprocessed. Because the value range of the data is 1-10, no normalization is needed; however, some data samples are incomplete, which would affect the actual program run, so this part of the data was deleted programmatically. These incomplete entries were not recorded, or were lost for practical reasons, and are represented by "?" in the file.
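A minimal sketch of this loading and cleaning step, assuming the UCI file has been downloaded as 'breast-cancer-wisconsin.data' (comma-separated, 11 columns; the file and variable names are illustrative, not the article's own):

```matlab
% Read the comma-separated UCI file, treating '?' entries as missing.
T = readtable('breast-cancer-wisconsin.data', 'FileType', 'text', ...
    'Delimiter', ',', 'ReadVariableNames', false, 'TreatAsEmpty', '?');
M = table2array(T);
M = M(~any(isnan(M), 2), :);   % delete samples containing missing values
X = M(:, 2:10);                % the 9 feature attributes
y = M(:, 11);                  % class label: 2 = benign, 4 = malignant
```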
This article uses MATLAB for the programming and calculations. Following the ReliefF procedure described in section 2.2, we first write a ReliefF function to calculate the feature weights, then write a main program that calls this function, and finally analyze the results to draw useful conclusions.
The full program listing is given at the end of the original post.

3.3 Feature Extraction on the Breast Cancer Data Set
In this article, the ReliefF algorithm of section 2.2 is used to calculate the weight of each feature; features whose weights fall below a certain threshold will be removed, and in view of the actual situation 2-3 of the lowest-weighted features will be eliminated. Because the algorithm selects the sample R at random as it runs, different random draws lead to somewhat different weights, so this article takes the averaging approach: the main program is run 20 times and the results are aggregated to find the average of each weight. The results are shown below, with attribute numbers as columns and individual runs as rows:
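A sketch of this averaging experiment, substituting MATLAB's built-in relieff function (Statistics Toolbox) for the article's own ReliefF implementation; X and y are assumed to come from the preprocessing step, and the choice of 10 nearest neighbors is an assumption rather than a parameter stated in the article:

```matlab
runs = 20;
W = zeros(runs, size(X, 2));
for r = 1:runs
    [~, W(r, :)] = relieff(X, y, 10);  % per-attribute weights, one run
end
meanW = mean(W, 1);                    % average weight of each attribute
[~, order] = sort(meanW);              % attributes from least to most relevant
```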
Below is the trend graph of the feature weights produced by the feature extraction algorithm; the results of the 20 runs show the same trend:
The raw output of the main program above is not very intuitive, so the weights are plotted in order below, which visually displays the weight distribution of each attribute, as shown in the following figure:
Sorted from smallest to largest, the attribute weights are ordered as follows:
Attribute 9 < Attribute 5 < Attribute 7 < Attribute 4 < Attribute 2 < Attribute 3 < Attribute 8 < Attribute 1 < Attribute 6
We select a weight threshold of 0.02; attribute 9, attribute 4, and attribute 5 are then excluded.
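Continuing the sketch above, the thresholding step could look like this (the 0.02 value is from the article; the variable names follow the earlier hypothetical sketch):

```matlab
threshold = 0.02;
keep = find(meanW > threshold);  % indices of the retained attributes
Xsel = X(:, keep);               % reduced feature matrix for clustering
```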
From the feature weights above, it can be seen that attribute 6, bare nuclei, is the most important factor, indicating that the symptoms of breast cancer patients show first in the bare nuclei and directly cause changes in their size; next come attribute 1 and attribute 8. The weights of the later attributes are close to one another, but the pattern over repeated runs can still illustrate their differing importance. Several important attributes are analyzed below. Here are the weight changes of bare nuclei (attribute 6) over the 20 runs:
As can be seen from the figure above, the weight of this attribute mostly falls around 0.22-0.26; it is the attribute with the largest weight. Next, look at the weight distribution of attribute 1:
The feature weight of the clump thickness attribute varies around 0.19-0.25, also a very high weight, indicating that this attribute is a very important detection indicator for breast cancer patients. Further analysis shows that clustering on attribute 6 and attribute 1 alone can reach a success rate of 91.8%. This is described in detail in the K-means section below.

3.4 Cluster Analysis of the Breast Cancer Data
In the previous section, analyzing the data set with the ReliefF algorithm gave the importance weights of the attributes, which can be used in the analysis of actual cases to avoid misdiagnosis and improve the speed and accuracy of diagnosis. The data are now analyzed with the K-means clustering algorithm. This section proceeds in several steps to compare the clustering results and relate them to the ReliefF results.

1. K-means analysis of the data set on its own
First, the data set is analyzed with the K-means algorithm alone. MATLAB includes many standard data mining algorithms, such as the K-means algorithm used in this article; the function is named kmeans and clusters a data set. First, all the attribute columns of the breast cancer data set (except the identity information and the class column) are classified directly. Because the data set has only 2 classes, clustering into 2 classes is tested first. The results are as follows: the 683 records are divided into 2 classes with an overall accuracy of 94.44%; the accuracy for the first class is 93.56% and for the second class 96.31%. Below is a plot of the attribute values for the different attributes after classification:
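A minimal sketch of this experiment, assuming X (683-by-9 feature matrix) and y (true labels, 2 = benign / 4 = malignant) from the preprocessing step; since cluster numbers are arbitrary, accuracy is taken as the better of the two possible label matchings:

```matlab
idx = kmeans(X, 2);               % MATLAB's built-in K-means, 2 clusters
yTrue = double(y == 4) + 1;       % map class labels to 1 and 2
acc = max(mean(idx == yTrue), mean(idx == 3 - yTrue));
fprintf('Overall clustering accuracy: %.2f%%\n', 100 * acc);
```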