K-means text clustering: obtaining the cluster centers computed by WEKA to complete text clustering

Author: finallyliuyu. If you reprint or use this post, please credit the source.

 

The previous post, "The VSM model of K-means text clustering", gave the code for building the document vector model and writing it out in the ARFF data format required by WEKA. This post shows how to obtain the cluster centers from WEKA and gives the code that completes the clustering.

This series does not cover how to run clustering in WEKA or how to use the software in general; please search for that on your own.

First, locate the ARFF file we wrote earlier and open it in WEKA.

Click Start. After the result is displayed, right-click and choose the Save result buffer option to save the information shown in the output pane on the right. We save this information as F:\cluster\infofromweka.dat

The following code extracts the cluster centers from F:\cluster\infofromweka.dat so that they can be used for text clustering.

/************************************************************************/
/* Obtain the clustering information provided by WEKA                   */
/************************************************************************/
map<string, vector<double> > preprocess::getclusters()
{
    map<string, vector<double> > clusters;
    ifstream ifile(infofromwekaaddress);
    string temp;
    while (getline(ifile, temp))
    {
        // Look for a line that names a cluster, e.g. "Cluster 0"
        boost::smatch matchcluster;
        boost::regex regcluster("cluster\\s+\\d+", boost::regex::icase);
        if (boost::regex_search(temp, matchcluster, regcluster))
        {
            string clustertmp = matchcluster[0].str();
            // The coordinates of the center are read from the next line
            string ordinates = "";
            getline(ifile, ordinates);
            boost::regex regordinates("\\d+(\\.\\d{1,4})?");
            boost::smatch matchordinates;
            std::string::const_iterator it = ordinates.begin();
            std::string::const_iterator end = ordinates.end();
            while (boost::regex_search(it, end, matchordinates, regordinates))
            {
                // Convert each matched number to double and append it
                string digitstemp = matchordinates[0].str();
                double digitval = 0.0;
                std::stringstream ss;
                ss << digitstemp;
                ss >> digitval;
                clusters[clustertmp].push_back(digitval);
                it = matchordinates[0].second;
            }
        }
    }
    return clusters;
}
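
To make the parsing step concrete, here is a minimal sketch of the same idea. It uses std::regex from the standard library instead of Boost, a substitution made only so that the example compiles without extra dependencies. The name parse_cluster_centers is hypothetical, and the layout it expects (a "Cluster N" line followed by one line of coordinates) is the assumption the function above makes about the saved buffer, not a guarantee about WEKA's output format.

```cpp
#include <istream>
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: scans the stream for a line containing "Cluster <n>"
// and treats every non-negative number on the NEXT line as one coordinate
// of that center. The layout is an assumption, see the text above.
std::map<std::string, std::vector<double> > parse_cluster_centers(std::istream& in)
{
    static const std::regex header("cluster\\s+\\d+", std::regex::icase);
    static const std::regex number("\\d+(\\.\\d+)?");

    std::map<std::string, std::vector<double> > centers;
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (!std::regex_search(line, m, header))
            continue;  // not a cluster header line
        const std::string name = m[0].str();

        std::string coords;
        if (!std::getline(in, coords))
            break;  // header was the last line: no coordinates follow
        for (std::sregex_iterator it(coords.begin(), coords.end(), number), end;
             it != end; ++it) {
            centers[name].push_back(std::stod(it->str()));
        }
    }
    return centers;
}
```

Because the function takes a std::istream, it can be tried on a std::istringstream holding a few sample lines before pointing it at the real file. Unlike the pattern in the original code, the number pattern here does not cap the fractional part at four digits, so a value with more decimals is read as a single number.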

The following code builds the document vector model. Note that it only builds the vector model for the whole document set and does not convert it to strings; this is the main difference between this function and the vsmformation function in "The VSM model of K-means text clustering". The vector model built here is used to compute the cosine similarity between each document and each cluster center, so that every document can be assigned to its most similar center.

map<int, vector<double> > preprocess::vsmconstruction(map<string, vector<pair<int, int> > > &mymap)
{
    int corpus_n = endindex - beginindex + 1;
    map<int, vector<double> > vsmmatrix;
    vector<string> mykeys = getfinalkeywords();
    vector<pair<int, int> > maxtfanddf = getfinalkeysmaxtfdf(mymap);
    for (int i = beginindex; i <= endindex; i++)
    {
        vector<pair<int, double> > tempvsm;
        for (vector<string>::size_type j = 0; j < mykeys.size(); j++)
        {
            // vector<pair<int, int> >::iterator findit = find_if(mymap[mykeys[j]].begin(), mymap[mykeys[j]].end(), predtfclass(i));
            double tf = (double)count_if(mymap[mykeys[j]].begin(), mymap[mykeys[j]].end(), predtfclass(i));
            // Scaled term frequency times inverse document frequency
            tf = 0.5 + (double)tf / (maxtfanddf[j].first);
            tf *= log((double)corpus_n / maxtfanddf[j].second);
            tempvsm.push_back(make_pair(j, tf));
        }
        if (!tempvsm.empty())
        {
            tempvsm = normalizationvsm(tempvsm);
            for (vector<pair<int, double> >::iterator it = tempvsm.begin(); it != tempvsm.end(); it++)
            {
                vsmmatrix[i].push_back(it->second);
            }
        }
        tempvsm.clear();
    }
    return vsmmatrix;
}

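
The weighting applied inside vsmconstruction can be isolated into a small sketch. The function term_weight below restates the formula from the code, (0.5 + tf / max_tf) * ln(corpus_n / df). Both function names are hypothetical. The post does not show normalizationvsm, so l2_normalize is only an assumption about what such a step might do (scaling to unit length).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Weight of one term in one document, following the formula in the code above:
// (0.5 + tf / max_tf) * ln(corpus_n / df).
double term_weight(int tf, int max_tf, int df, int corpus_n)
{
    assert(max_tf > 0 && df > 0 && corpus_n > 0);
    const double scaled_tf = 0.5 + static_cast<double>(tf) / max_tf;
    return scaled_tf * std::log(static_cast<double>(corpus_n) / df);
}

// ASSUMPTION: unit-length (L2) normalization. normalizationvsm is not shown
// in the post, so this is only one plausible choice.
std::vector<double> l2_normalize(std::vector<double> v)
{
    double sum_sq = 0.0;
    for (double x : v) sum_sq += x * x;
    if (sum_sq == 0.0) return v;  // leave an all-zero vector unchanged
    const double norm = std::sqrt(sum_sq);
    for (double& x : v) x /= norm;
    return v;
}
```

For example, with tf = 2, max_tf = 4, df = 5 and corpus_n = 50, term_weight returns (0.5 + 0.5) * ln(10), roughly 2.30. One consequence of the 0.5 offset is that a term that does not occur in a document (tf = 0) still receives the weight 0.5 * ln(corpus_n / df), which is zero only when the term appears in every document.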
/** Calculate the inner product of two vectors */
double preprocess::caldotproductofvectors(const vector<double> &vector1, const vector<double> &vector2)
{
    double result = 0.0f;
    for (int i = 0; i < vector1.size(); i++)
        result += vector1[i] * vector2[i];
    return result;
}

 

 

 

 

/** Calculate the cosine similarity of two vectors */
double preprocess::calcosineofvectors(const vector<double> &vector1, const vector<double> &vector2)
{
    double numerator = caldotproductofvectors(vector1, vector2);
    // Product of the squared lengths; its square root is |v1| * |v2|
    double denominator = caldotproductofvectors(vector1, vector1) * caldotproductofvectors(vector2, vector2);
    denominator = sqrt(denominator);
    return numerator / denominator;
}
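
The function above divides by the product of the two vector lengths, which is zero if either vector is all zeros. The following standalone sketch shows the same computation with that case handled. The name cosine_similarity is hypothetical, and returning 0 for a zero vector is a convention chosen for this example, not something the original code does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity: dot(a, b) / (|a| * |b|), computed in a single pass.
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    const double denom = std::sqrt(norm_a * norm_b);
    // An all-zero vector has no direction; returning 0 is a convention
    // of this sketch.
    return denom == 0.0 ? 0.0 : dot / denom;
}
```

With this convention, orthogonal vectors such as (1, 0) and (0, 1) give 0, and parallel vectors such as (1, 2) and (2, 4) give 1.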
 
Assign a cluster label to each article:
vector<pair<int, string> > preprocess::generateclusterinfo(map<int, vector<double> > &vsmmatrix, map<string, vector<double> > &clusters)
{
    vector<pair<int, string> > resultinfo;
    for (map<int, vector<double> >::iterator it = vsmmatrix.begin(); it != vsmmatrix.end(); it++)
    {
        // Similarity of this document to every cluster center
        vector<pair<string, double> > clusterdistanceaist;
        for (map<string, vector<double> >::iterator clusterit = clusters.begin(); clusterit != clusters.end(); clusterit++)
        {
            double temp = calcosineofvectors(it->second, clusterit->second);
            clusterdistanceaist.push_back(make_pair(clusterit->first, temp));
        }
        // mycmp is defined elsewhere; it is expected to put the most similar center first
        sort(clusterdistanceaist.begin(), clusterdistanceaist.end(), mycmp);
        vector<pair<string, double> >::iterator cdait = clusterdistanceaist.begin();
        resultinfo.push_back(make_pair(it->first, cdait->first));
        clusterdistanceaist.clear();
    }
    return resultinfo;
}
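
Since only the single most similar center is needed, the assignment step can also be done in one pass without sorting. The sketch below illustrates this under the assumption that mycmp (not shown in the post) orders centers by descending similarity. The names cosine and nearest_center are hypothetical, and the zero-vector convention in cosine is a choice made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Cosine similarity; returns 0 for an all-zero vector (a convention of this sketch).
static double cosine(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    const double denom = std::sqrt(na * nb);
    return denom == 0.0 ? 0.0 : dot / denom;
}

// Hypothetical helper: returns the name of the center most similar to doc,
// or an empty string when there are no centers.
std::string nearest_center(const std::vector<double>& doc,
                           const std::map<std::string, std::vector<double> >& centers)
{
    std::string best_name;
    double best_sim = -2.0;  // below the smallest possible cosine (-1)
    for (const auto& entry : centers) {
        const double sim = cosine(doc, entry.second);
        if (sim > best_sim) {  // strict '>' keeps the first center on ties
            best_sim = sim;
            best_name = entry.first;
        }
    }
    return best_name;
}
```

Given centers "Cluster 0" = (1, 0) and "Cluster 1" = (0, 1), the document vector (0.9, 0.1) is assigned to "Cluster 0" and (0.2, 0.8) to "Cluster 1".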
 
/************************************************************************/
/* Obtain the document IDs in each category                             */
/************************************************************************/
map<string, vector<int> > preprocess::fetcharticlesofclusters(map<string, vector<double> > &clusters, vector<pair<int, string> > &resultinfo)
{
    map<string, vector<int> > articlesinfo;
    for (vector<pair<int, string> >::iterator retit = resultinfo.begin(); retit != resultinfo.end(); retit++)
    {
        for (map<string, vector<double> >::iterator it = clusters.begin(); it != clusters.end(); it++)
        {
            // Append the document ID to the cluster whose name matches its label
            if (retit->second == it->first)
            {
                articlesinfo[it->first].push_back(retit->first);
            }
        }
    }
    return articlesinfo;
}
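
The grouping step can be sketched more compactly by indexing the result map with the label directly. The name group_by_cluster is hypothetical. Unlike the function above, this version does not check the label against the set of known clusters, so it assumes every label in the input is valid.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: groups article IDs by their assigned cluster label.
// Input pairs are (article ID, cluster label), as produced by the labelling step.
std::map<std::string, std::vector<int> > group_by_cluster(
    const std::vector<std::pair<int, std::string> >& assignments)
{
    std::map<std::string, std::vector<int> > groups;
    for (const auto& a : assignments)
        groups[a.second].push_back(a.first);  // operator[] creates the entry on first use
    return groups;
}
```

Within each group, article IDs keep the order in which they appear in the input.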
       
      
     
    
   
  
