Author: finallyliuyu. If you reprint or reuse this article, please credit the source.
The previous post, "The VSM model of kmeans text clustering", showed how to build the document vector model and gave the code that writes it out as ARFF, the data format WEKA requires. This post shows how to obtain the cluster centers from WEKA and gives the code that completes the clustering.
This series does not cover how to run clustering in WEKA or how to use the software in general; please search for that yourself.
In WEKA, locate and open the ARFF file we wrote earlier:
Click Start. When the result appears, right-click it and choose "Save result buffer" to save the text shown in the output pane on the right. We save it as F:\cluster\infofromweka.dat
The following code extracts the cluster centers from F:\cluster\infofromweka.dat so that the documents can be clustered.
/************************************************************************/
/* Obtain the clustering information provided by WEKA                   */
/************************************************************************/
map<string, vector<double> > preprocess::getclusters()
{
    map<string, vector<double> > clusters;
    ifstream ifile(infofromwekaaddress);
    string temp;
    while (getline(ifile, temp))
    {
        boost::smatch matchcluster;
        // A header line such as "Cluster 0" marks the start of a cluster center.
        boost::regex regcluster("cluster\\s+\\d+", boost::regex::icase);
        if (boost::regex_search(temp, matchcluster, regcluster))
        {
            string clustertmp = matchcluster[0].str();
            string ordinates = "";
            // The coordinates of the center are on the line after the header.
            getline(ifile, ordinates);
            boost::regex regordinates("\\d+(\\.\\d{1,4})?");
            boost::smatch matchordinates;
            std::string::const_iterator it = ordinates.begin();
            std::string::const_iterator end = ordinates.end();
            while (boost::regex_search(it, end, matchordinates, regordinates))
            {
                string digitstemp = matchordinates[0].str();
                double digitval = 0.0;
                std::stringstream ss;
                ss << digitstemp;
                ss >> digitval;
                clusters[clustertmp].push_back(digitval);
                it = matchordinates[0].second;
            }
        }
    }
    return clusters;
}
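For readers without Boost, the same extraction can be sketched with the standard `<regex>` library. This is only an illustration: the function name `parse_clusters` is hypothetical, and the layout it expects (a "Cluster N" header line followed by one line of coordinates) is an assumption taken from how the code above reads the file, not a description of every WEKA output format.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read "Cluster N" header lines, each followed by one
// line of coordinates, and return the coordinates keyed by the header text.
std::map<std::string, std::vector<double>> parse_clusters(std::istream& in)
{
    static const std::regex header("cluster\\s+\\d+", std::regex::icase);
    static const std::regex number("\\d+(\\.\\d+)?");

    std::map<std::string, std::vector<double>> clusters;
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (!std::regex_search(line, m, header))
            continue;  // not a cluster header, skip it
        const std::string name = m[0].str();
        std::string coords;
        if (!std::getline(in, coords))
            break;  // header on the last line, nothing to read
        // Collect every non-negative decimal number on the coordinate line.
        for (std::sregex_iterator it(coords.begin(), coords.end(), number), end;
             it != end; ++it)
            clusters[name].push_back(std::stod(it->str()));
    }
    return clusters;
}
```

Taking a `std::istream&` rather than a file path lets the parser be tried on an in-memory string before pointing it at the saved buffer.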
The code below builds the document vector model. Note that it only builds the vector model for the whole document set and does not convert it to a string; this is the main difference between this function and the vsmformation function in "The VSM model of kmeans text clustering". The vectors built here are used to compute the cosine similarity between each document and each cluster center, and each document is then assigned to the most similar center.
map<int, vector<double> > preprocess::vsmconstruction(map<string, vector<pair<int, int> > >& mymap)
{
    int corpus_n = endindex - beginindex + 1;
    map<int, vector<double> > vsmmatrix;
    vector<string> mykeys = getfinalkeywords();
    vector<pair<int, int> > maxtfanddf = getfinalkeysmaxtfdf(mymap);
    for (int i = beginindex; i <= endindex; i++)
    {
        vector<pair<int, double> > tempvsm;
        for (vector<string>::size_type j = 0; j < mykeys.size(); j++)
        {
            // vector<pair<int, int> >::iterator findit = find_if(mymap[mykeys[j]].begin(), mymap[mykeys[j]].end(), predtfclass(i));
            double tf = (double)count_if(mymap[mykeys[j]].begin(), mymap[mykeys[j]].end(), predtfclass(i));
            tf = 0.5 + (double)tf / (maxtfanddf[j].first);
            tf *= log((double)corpus_n / maxtfanddf[j].second);
            tempvsm.push_back(make_pair(j, tf));
        }
        if (!tempvsm.empty())
        {
            tempvsm = normalizationvsm(tempvsm);
            for (vector<pair<int, double> >::iterator it = tempvsm.begin(); it != tempvsm.end(); it++)
            {
                vsmmatrix[i].push_back(it->second);
            }
        }
        tempvsm.clear();
    }
    return vsmmatrix;
}
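The weight computed in the inner loop can be written as a single formula. The symbols are my own notation, and reading maxtfanddf[j].first as the maximum term frequency and maxtfanddf[j].second as the document frequency is an assumption based on the variable name:

```latex
w_{ij} = \left( 0.5 + \frac{\mathrm{tf}_{ij}}{\mathrm{maxtf}_{j}} \right) \cdot \ln \frac{N}{\mathrm{df}_{j}}
```

Here $\mathrm{tf}_{ij}$ is the value returned by count_if for keyword $j$ in document $i$, $\mathrm{maxtf}_{j}$ is maxtfanddf[j].first, $\mathrm{df}_{j}$ is maxtfanddf[j].second, and $N$ is corpus_n. The logarithm is natural because the code calls log. The resulting vector is then passed through normalizationvsm, whose definition is not shown in this post.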
/** Calculate the inner product of two vectors */
double preprocess::caldotproductofvectors(const vector<double>& vector1, const vector<double>& vector2)
{
    double result = 0.0;
    for (vector<double>::size_type i = 0; i < vector1.size(); i++)
        result += vector1[i] * vector2[i];
    return result;
}
/** Calculate the cosine similarity of two vectors */
double preprocess::calcosineofvectors(const vector<double>& vector1, const vector<double>& vector2)
{
    double numerator = caldotproductofvectors(vector1, vector2);
    // |v1| * |v2| = sqrt((v1 . v1) * (v2 . v2))
    double denominator = caldotproductofvectors(vector1, vector1) * caldotproductofvectors(vector2, vector2);
    denominator = sqrt(denominator);
    return numerator / denominator;
}
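One edge case is worth noting: if either vector is all zeros, the denominator is 0 and the division yields NaN. A minimal standalone sketch that guards against this is given below. The name `cosine_similarity` is hypothetical, and returning 0 for a zero vector is a design assumption of this sketch, not something the original code does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity that returns 0 when either vector has zero length,
// instead of dividing by zero (an assumption made for this sketch).
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());  // both vectors must use the same keyword order
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    const double denom = std::sqrt(norm_a * norm_b);
    return denom > 0.0 ? dot / denom : 0.0;
}
```

Accumulating the dot product and both squared norms in one loop also avoids walking the vectors three times.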
Assign a cluster label to each article:
vector<pair<int, string> > preprocess::generateclusterinfo(map<int, vector<double> >& vsmmatrix, map<string, vector<double> >& clusters)
{
    vector<pair<int, string> > resultinfo;
    for (map<int, vector<double> >::iterator it = vsmmatrix.begin(); it != vsmmatrix.end(); it++)
    {
        vector<pair<string, double> > clusterdistanceaist;
        for (map<string, vector<double> >::iterator clusterit = clusters.begin(); clusterit != clusters.end(); clusterit++)
        {
            double temp = calcosineofvectors(it->second, clusterit->second);
            clusterdistanceaist.push_back(make_pair(clusterit->first, temp));
        }
        // mycmp is defined elsewhere; it is expected to sort by descending
        // similarity so that the first element is the best match.
        sort(clusterdistanceaist.begin(), clusterdistanceaist.end(), mycmp);
        vector<pair<string, double> >::iterator cdait = clusterdistanceaist.begin();
        resultinfo.push_back(make_pair(it->first, cdait->first));
        clusterdistanceaist.clear();
    }
    return resultinfo;
}
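Since only the single most similar cluster is needed, a full sort is not strictly required. The sketch below picks the best label with `std::max_element` instead. The function name `best_cluster` is hypothetical, and it assumes higher scores mean more similar, which holds for cosine similarity.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Return the label with the highest similarity score.
// Assumes `scores` is non-empty; on ties the first maximum wins.
std::string best_cluster(const std::vector<std::pair<std::string, double>>& scores)
{
    assert(!scores.empty());
    auto best = std::max_element(
        scores.begin(), scores.end(),
        [](const std::pair<std::string, double>& lhs,
           const std::pair<std::string, double>& rhs) {
            return lhs.second < rhs.second;
        });
    return best->first;
}
```

This runs in linear time in the number of clusters and does not depend on a separately defined comparator.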
/************************************************************************/
/* Obtain the document IDs in each category                             */
/************************************************************************/
map<string, vector<int> > preprocess::fetcharticlesofclusters(map<string, vector<double> >& clusters, vector<pair<int, string> >& resultinfo)
{
    map<string, vector<int> > articlesinfo;
    for (vector<pair<int, string> >::iterator retit = resultinfo.begin(); retit != resultinfo.end(); retit++)
    {
        for (map<string, vector<double> >::iterator it = clusters.begin(); it != clusters.end(); it++)
        {
            // Compare labels with ==, not assignment.
            if (retit->second == it->first)
            {
                articlesinfo[it->first].push_back(retit->first);
            }
        }
    }
    return articlesinfo;
}
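Because every label in resultinfo already comes from the cluster map, the grouping can also be sketched without the inner loop by indexing the result map directly with the label. The function name `group_by_cluster` is hypothetical, and the sketch assumes each pair holds (document ID, cluster label) as produced by generateclusterinfo.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Group document IDs by cluster label in a single pass.
// Each input pair is assumed to be (document ID, cluster label).
std::map<std::string, std::vector<int>> group_by_cluster(
    const std::vector<std::pair<int, std::string>>& resultinfo)
{
    std::map<std::string, std::vector<int>> articles;
    for (const auto& entry : resultinfo)
        articles[entry.second].push_back(entry.first);
    return articles;
}
```

As in the loop-based version, a cluster that received no documents simply does not appear in the returned map.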