Implementing this clustering algorithm was the third assignment in the data mining course. The first two assignments relied mostly on other people's software, with very little that we implemented ourselves. The first used the business intelligence tools in SQL Server 2008 to build a data warehouse: data processing, the warehouse model, dimension tables, and fact tables. For the exam, however, we are also expected to be able to build the common data warehouse models ourselves. The second ran some provided data through WEKA's classification and association rule algorithms, without really understanding the principles behind those algorithms' parameters.
This clustering assignment, by contrast, means actually implementing a clustering algorithm for the provided data. First, a description of the data to be clustered, which consists of two datasets:
For dataset 1 the intent of the clustering is easy to see, but for dataset 2 it is not clear to me what the clustering is supposed to find. Dataset 1 looks like a natural case for K-means with the Euclidean distance. However, once the program for the first dataset was written, the result was not very good, and K-means does not seem able to cluster the second dataset at all, so a rewrite is probably coming. Still, the process of writing the program for the first dataset is described here.
Data structure design: the program was designed with compatibility in mind. A parent class is defined for datasets 1 and 2, and the point class of each dataset derives from it. Through polymorphism, the data container stores parent-class pointers, so it does not need to know whether a point belongs to dataset 1 or dataset 2. Since K-means cannot handle the second dataset, the algorithm will have to change, but this structure can be kept.
The inheritance hierarchy is as follows:
The parent class is an abstract class. It defines several abstract methods that the subclasses inherit and override, as follows:
The methods compute the distance, compute the criterion function used to test convergence, recompute the means, and draw the point. The attributes are the point's color value and its cluster category.
The two subclasses implement these methods and add their own attributes, mainly as follows:
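As a rough illustration of this design, here is a minimal sketch in portable C++ without MFC. The names `AbstractPoint`, `Point2D`, `distanceTo`, `category`, and `PointList` are assumptions made for the example, not the declarations used in the actual program, and the drawing, criterion, and mean methods are left out.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <vector>

// Hypothetical abstract parent class: every data point can measure its
// distance to another point of the same kind.
class AbstractPoint {
public:
    virtual ~AbstractPoint() = default;
    virtual double distanceTo(const AbstractPoint& other) const = 0;
    int category = 0;  // cluster label; 0 means "not assigned yet"
};

// Hypothetical subclass for dataset 1: plain 2D coordinates.
class Point2D : public AbstractPoint {
public:
    Point2D(double x, double y) : x_(x), y_(y) {}

    double distanceTo(const AbstractPoint& other) const override {
        // This sketch assumes both points are Point2D.
        const Point2D* p = dynamic_cast<const Point2D*>(&other);
        assert(p != nullptr);
        return std::hypot(x_ - p->x_, y_ - p->y_);  // Euclidean distance
    }

private:
    double x_, y_;
};

// The container stores parent-class pointers, so it does not care which
// concrete dataset a point came from.
using PointList = std::vector<std::unique_ptr<AbstractPoint>>;
```

A second subclass for the colored dataset would slot into the same `PointList` without any change to the code that walks the container.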
The program is not dialog-based. It uses the single-document structure provided by MFC, with its view, doc, and frame classes. The file is read by overriding the OnOpenDocument function in the doc class:
BOOL CDataMiningExample1Doc::OnOpenDocument(LPCTSTR lpszPathName)
{
    if (!CDocument::OnOpenDocument(lpszPathName))
        return FALSE;

    // Get the active view, which owns the data structure to fill.
    CMainFrame* pMainFrame = (CMainFrame*)(AfxGetApp()->m_pMainWnd);
    CDataMiningExample1View* pMainView = (CDataMiningExample1View*)(pMainFrame->GetActiveView());

    // Clear any data loaded earlier.
    pMainView->emptycluster();
    pMainView->emptykmeans();

    // Open the input stream.
    ifstream inFile(lpszPathName);
    double nCX, nCY;
    int nCR, nCG, nCB;

    // Decide which dataset this is from the length of the first line,
    // then rewind to the start of the file.
    string str;
    getline(inFile, str);
    int length = str.size();
    inFile.clear();   // reset the stream state in case getline hit end of file
    inFile.seekg(0);

    if (length > 11)
    {
        // Dataset 2: x y r g b. Looping on the extraction itself stops
        // cleanly at end of file instead of adding a stale last point.
        while (inFile >> nCX >> nCY >> nCR >> nCG >> nCB)
        {
            // Skip pure white points.
            if (nCR == 255 && nCG == 255 && nCB == 255)
                continue;
            dataptcolor* ndataptcolor = new dataptcolor(nCX, nCY, nCR, nCG, nCB);
            pMainView->m_ccluster.m_datapoints.push_back(ndataptcolor);
        }
    }
    else
    {
        // Dataset 1: x y. The object to construct depends on the file structure.
        while (inFile >> nCX >> nCY)
        {
            datapt2d* ndatapt2d = new datapt2d(nCX, nCY);
            pMainView->m_ccluster.m_datapoints.push_back(ndatapt2d);
        }
    }
    return TRUE;
}
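The loading logic can also be tried outside MFC. The sketch below is an assumption-laden variant, not the program's own code: `RawPoint` and `loadPoints` are hypothetical names, and instead of checking the length of the first line it decides per line whether color fields are present. It reads line by line, so the loop ends cleanly at end of input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plain record; r, g, b stay at -1 for coordinate-only lines.
struct RawPoint {
    double x, y;
    int r, g, b;
};

// Reads "x y" or "x y r g b" lines. Blank or malformed lines are skipped,
// and pure white points (255, 255, 255) are dropped.
std::vector<RawPoint> loadPoints(std::istream& in) {
    std::vector<RawPoint> points;
    std::string line;
    while (std::getline(in, line)) {            // stops at end of input
        std::istringstream fields(line);
        RawPoint p{0.0, 0.0, -1, -1, -1};
        if (!(fields >> p.x >> p.y)) continue;  // no usable coordinates
        if (fields >> p.r >> p.g >> p.b) {
            if (p.r == 255 && p.g == 255 && p.b == 255) continue;  // white
        } else {
            p.r = p.g = p.b = -1;               // coordinate-only line
        }
        points.push_back(p);
    }
    return points;
}
```

Because the function takes any `std::istream`, it can be fed a `std::ifstream` for real files or a `std::istringstream` for quick experiments.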
Next, the screen should refresh while the clustering runs, so that the points' color changes are displayed dynamically. At first I ran the algorithm on the interface thread and started a second thread that called Invalidate to update the display, but the interface froze every time the algorithm ran. The working arrangement is the other way around: the interface keeps refreshing itself, and a new thread keeps updating the content to be displayed, that is, the background data. So the algorithm should run in its own thread, and the interface is only responsible for drawing from the data.
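The division of labor described above can be sketched with standard C++ threads rather than MFC's own threading facilities. Everything here is an assumption for illustration: `SharedLabels`, `runWorker`, and `startWorker` are hypothetical names, and the "algorithm" is a stand-in that simply relabels every point on each iteration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: the worker writes labels, the UI reads them.
class SharedLabels {
public:
    explicit SharedLabels(std::size_t n) : labels_(n, 0) {}

    // Called by the worker thread after each iteration.
    void publish(const std::vector<int>& labels) {
        std::lock_guard<std::mutex> lock(mutex_);
        labels_ = labels;
    }

    // Called by the UI thread when it repaints; returns a private copy.
    std::vector<int> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return labels_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> labels_;
};

// Stand-in for the clustering loop: iteration `it` labels every point `it`.
void runWorker(SharedLabels& shared, std::size_t n, int iterations,
               std::atomic<bool>& done) {
    assert(iterations > 0);
    for (int it = 1; it <= iterations; ++it) {
        shared.publish(std::vector<int>(n, it));
    }
    done = true;  // tells the UI that no further updates will arrive
}

// Starts the worker; the caller keeps drawing from snapshot() meanwhile.
std::thread startWorker(SharedLabels& shared, std::size_t n, int iterations,
                        std::atomic<bool>& done) {
    return std::thread(runWorker, std::ref(shared), n, iterations,
                       std::ref(done));
}
```

The drawing code only ever calls `snapshot()`, so it never blocks for longer than one copy of the label vector, no matter how long an iteration takes.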
However, redrawing with Invalidate makes the plot flicker. Double buffering would fix this, but I do not yet know how to use that technique, so the flicker stays for now.
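For reference, the idea behind double buffering can be shown without any GUI code: draw the whole frame into an off-screen buffer, then make it visible in a single step, so the viewer never sees a half-drawn frame. The `DoubleBuffer` class below is a hypothetical sketch of that idea only. In MFC the same pattern is commonly built from a memory device context (`CDC::CreateCompatibleDC`), a compatible bitmap (`CBitmap::CreateCompatibleBitmap`), and a final `CDC::BitBlt` to the screen, but that wiring is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical off-screen surface: one int per pixel, 0 = background.
class DoubleBuffer {
public:
    DoubleBuffer(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          front_(width * height, 0), back_(width * height, 0) {}

    // Start a new frame: clear only the back buffer, never the visible one.
    void beginFrame() { back_.assign(width_ * height_, 0); }

    // Draw one point into the back buffer.
    void plot(std::size_t x, std::size_t y, int color) {
        assert(x < width_ && y < height_);
        back_[y * width_ + x] = color;
    }

    // Show the finished frame in one step by swapping the buffers.
    void present() { std::swap(front_, back_); }

    // What the screen currently shows at (x, y).
    int visible(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        return front_[y * width_ + x];
    }

private:
    std::size_t width_, height_;
    std::vector<int> front_, back_;
};
```

Flicker comes from erasing and redrawing the visible surface piece by piece; here the visible buffer changes only inside `present()`.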
K-means is actually a fairly simple algorithm, which is why it was chosen at the time. It is suited to finding circular or spherical clusters, however, and the features of the second dataset do not seem to fit it. The K-means implementation is as follows:
void cluster::kmeans(vector<abstractdatapt*> initkmeans)
{
    // K-means iteration; every point ends up with a category.
    vector<abstractdatapt*> tmpkmeans(initkmeans);
    int i, j, k = tmpkmeans.size(), length = m_datapoints.size();
    if (k == 0 || length == 0)
        return;   // nothing to cluster

    // max is the large sentinel constant used elsewhere in the program.
    double shortest;             // shortest distance found for the current point
    double lastfunction = max;   // previous value of the criterion function
    double newfunction;          // current value of the criterion function
    int num = 0;                 // iteration counter

    // The iteration itself.
    while (true)
    {
        num++;
        for (i = 0; i < length; i++)
        {
            shortest = max;
            for (j = 0; j < k; j++)
            {
                // Distance from this point to the j-th mean.
                double tpdistance = m_datapoints[i]->calculated(tmpkmeans[j]);
                // If this mean is the nearest so far, take its color and category.
                if (tpdistance < shortest)
                {
                    shortest = tpdistance;
                    m_datapoints[i]->m_colorpt = tmpkmeans[j]->m_colorpt;  // update the color
                    m_datapoints[i]->categroy = j + 1;                     // update the category
                }
            }
        }

        // Compute the criterion function and leave the loop once it has converged.
        newfunction = m_datapoints[0]->calculatee(m_datapoints, tmpkmeans);
        if (fabs(lastfunction - newfunction) > 0.00001)
        {
            lastfunction = newfunction;
        }
        else
        {
            break;
        }

        // Recompute the k means.
        tmpkmeans.clear();
        tmpkmeans = m_datapoints[0]->calculatemeans(m_datapoints, k);
        Sleep(2000);   // pause between iterations
        initkmeans = tmpkmeans;
    }
}
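The same loop can be written as a small standalone function, which makes it easier to experiment with outside the MFC program. This is a sketch under stated assumptions: `Pt`, `dist2`, and `kmeans` are hypothetical names, the criterion is taken to be the sum of squared Euclidean distances from each point to its assigned mean, and an iteration cap is added as a safeguard.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pt { double x, y; };

// Squared Euclidean distance between two points.
double dist2(const Pt& a, const Pt& b) {
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

// Runs K-means from the given initial means and returns one label per point
// (the index of its nearest mean). Stops when the criterion E changes by
// less than eps between iterations, or after maxIter iterations.
std::vector<int> kmeans(const std::vector<Pt>& pts, std::vector<Pt> means,
                        double eps = 1e-5, int maxIter = 100) {
    assert(!pts.empty() && !means.empty());
    std::vector<int> label(pts.size(), 0);
    double lastE = std::numeric_limits<double>::max();

    for (int iter = 0; iter < maxIter; ++iter) {
        // Assignment step: attach every point to its nearest mean.
        double E = 0.0;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t j = 0; j < means.size(); ++j) {
                double d = dist2(pts[i], means[j]);
                if (d < best) { best = d; label[i] = static_cast<int>(j); }
            }
            E += best;
        }

        // Convergence test on the criterion function.
        if (std::fabs(lastE - E) < eps) break;
        lastE = E;

        // Update step: move each mean to the centroid of its points.
        std::vector<Pt> sum(means.size(), Pt{0.0, 0.0});
        std::vector<int> count(means.size(), 0);
        for (std::size_t i = 0; i < pts.size(); ++i) {
            sum[label[i]].x += pts[i].x;
            sum[label[i]].y += pts[i].y;
            ++count[label[i]];
        }
        for (std::size_t j = 0; j < means.size(); ++j) {
            if (count[j] > 0) {  // an empty cluster keeps its old mean
                means[j] = Pt{sum[j].x / count[j], sum[j].y / count[j]};
            }
        }
    }
    return label;
}
```

Because every point is assigned to the nearest mean under a Euclidean distance, each cluster is a convex region around its mean, which is why the method favors round, well-separated groups.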
Another lesson concerns applying object-oriented thinking to the program: once a point class has been designed, the operations on a point should belong to that class, so that the object has its own methods. One more note: the view class can be obtained from anywhere in the program with the code below:
CMainFrame* pMainFrame = (CMainFrame*)(AfxGetApp()->m_pMainWnd);
CDataMiningExample1View* pMainView = (CDataMiningExample1View*)(pMainFrame->GetActiveView());
The view class is responsible for display. In fact, the doc class and the view class can each obtain a pointer to the other, which is worth remembering.
The following screenshot shows the clustering result, which is not very good:
The K-means clustering of the second dataset, based on the point coordinates, is shown below. The result is unsatisfactory, so K-means is not very effective for this dataset. If I implement a more effective method, I will update this post and compare the two:
Correction: some data issues were overlooked when preprocessing the second dataset, which is part of why its clustering result was poor. The fix is to remove not only the pure white points but also the very light, near-white ones, by tightening the filter condition as follows:
if (nCR>235&&nCG>235&&nCB>235){continue;}
Another advantage of this filtering is that it reduces the number of points, so the algorithm runs faster and results come quickly. The DBSCAN implementation that follows is relatively slow, because I do not know how to build an index, so computing a neighborhood still requires traversing all points.
This gives a clear image, shown below. The image is enlarged here, so part of it is not displayed:
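One simple way to avoid traversing all points for each neighborhood is a uniform grid. The sketch below is an illustration only, not part of the program described here: `P2` and `GridIndex` are hypothetical names, and it covers just the neighborhood query, not DBSCAN itself. The cell side equals the radius `eps`, so every neighbor of a point lies in the point's own cell or one of the 8 cells around it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct P2 { double x, y; };

// Hypothetical uniform grid over 2D points with cell side eps.
class GridIndex {
public:
    GridIndex(const std::vector<P2>& pts, double eps) : pts_(pts), eps_(eps) {
        assert(eps > 0.0);
        for (std::size_t i = 0; i < pts_.size(); ++i) {
            cells_[cellOf(pts_[i])].push_back(i);
        }
    }

    // Indices of all points within eps of point i (including i itself).
    std::vector<std::size_t> neighbors(std::size_t i) const {
        assert(i < pts_.size());
        std::vector<std::size_t> out;
        const std::pair<long, long> c = cellOf(pts_[i]);
        for (long dx = -1; dx <= 1; ++dx) {
            for (long dy = -1; dy <= 1; ++dy) {
                auto it = cells_.find(std::make_pair(c.first + dx, c.second + dy));
                if (it == cells_.end()) continue;  // no points in that cell
                for (std::size_t j : it->second) {
                    const double ex = pts_[i].x - pts_[j].x;
                    const double ey = pts_[i].y - pts_[j].y;
                    if (ex * ex + ey * ey <= eps_ * eps_) out.push_back(j);
                }
            }
        }
        return out;
    }

private:
    // Grid cell containing p; floor keeps negative coordinates consistent.
    std::pair<long, long> cellOf(const P2& p) const {
        return std::make_pair(static_cast<long>(std::floor(p.x / eps_)),
                              static_cast<long>(std::floor(p.y / eps_)));
    }

    std::vector<P2> pts_;  // private copy of the points
    double eps_;
    std::map<std::pair<long, long>, std::vector<std::size_t>> cells_;
};
```

Each query then inspects only the points in 9 cells instead of the whole dataset, at the cost of building the grid once up front.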