Full-text search, data mining, and recommendation engine series, part 6: KMeans-based automatic text clustering


Automatic clustering of a collection of articles can serve as the basis of a content-based recommendation engine. To cluster text automatically, first segment the articles into words as described in Series 5, then compute each article's term vector, i.e. the TF*IDF weight of each distinct word in the article (the specific calculation method is shown in Figure 5). KMeans is currently the most widely used algorithm for automatic text clustering, and this article shows how to apply it. The KMeans algorithm is of course available in the two open-source machine learning libraries Mahout and WEKA, but those implementations require complicated input files and preprocessing, and in practice we may need to adjust the algorithm ourselves. Writing KMeans by hand also helps us understand the text clustering algorithm.
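As a reminder of how those term weights are produced (the details are in Series 5), here is a minimal TF*IDF sketch over pre-segmented documents. The class and method names are illustrative only, not the ones used in this series:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative TF*IDF computation: tf(t,d) = count of t in d,
// idf(t) = ln(N / df(t)) where df(t) = number of documents containing t.
public class TfIdfSketch {
    public static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<String, Double>();
        for (String term : doc) {
            // term frequency within this document
            int tf = 0;
            for (String t : doc) {
                if (t.equals(term)) tf++;
            }
            // document frequency across the corpus
            int df = 0;
            for (List<String> d : corpus) {
                if (d.contains(term)) df++;
            }
            double idf = Math.log((double) corpus.size() / df);
            weights.put(term, tf * idf);
        }
        return weights;
    }
}
```

A word that appears in every document gets idf = ln(1) = 0 and thus contributes nothing to the term vector, which is why common words drop out of the similarity computation below.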

First, we define a term vector class to represent each article's term vector; it also holds the document ID and cluster ID. The code is as follows:

public class SepaTermVector {
    public SepaTermVector() {
        termVector = new Vector<TermInfo>();
    }

    public Vector<TermInfo> getTermVector() {
        return termVector;
    }

    public void setTermVector(Vector<TermInfo> termVector) {
        this.termVector = termVector;
    }

    public int getDocId() {
        return docId;
    }

    public void setDocId(int docId) {
        this.docId = docId;
    }

    public int getClusterId() {
        return clusterId;
    }

    public void setClusterId(int clusterId) {
        this.clusterId = clusterId;
    }

    /**
     * The clustering algorithm works on a copy so that an article's
     * term vector is not modified during automatic clustering.
     */
    @Override
    public SepaTermVector clone() {
        SepaTermVector obj = new SepaTermVector();
        obj.setDocId(docId);
        obj.setClusterId(clusterId);
        Vector<TermInfo> vt = new Vector<TermInfo>();
        for (TermInfo item : termVector) {
            vt.add(item);
        }
        obj.setTermVector(vt);
        return obj;
    }

    private Vector<TermInfo> termVector = null;
    private int docId = -1;     // document ID
    private int clusterId = -1; // cluster ID
}
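The TermInfo class used throughout is not listed in the original article. A minimal sketch consistent with the calls made on it elsewhere (getTermStr(), getWeight()) might look like this; it is an assumed reconstruction, not the author's actual class:

```java
// Assumed reconstruction of TermInfo: a term string paired with its
// TF*IDF weight. The original article does not show this class.
public class TermInfo {
    public TermInfo(String termStr, double weight) {
        this.termStr = termStr;
        this.weight = weight;
    }

    public String getTermStr() {
        return termStr;
    }

    public double getWeight() {
        return weight;
    }

    public void setWeight(double weight) {
        this.weight = weight;
    }

    private String termStr = null; // the word itself
    private double weight = 0.0;   // its TF*IDF weight in the article
}
```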

Next we define a text cluster class. It stores the cluster ID, the cluster center (itself a union of the term vectors of all articles in the cluster), and the term vectors assigned to the cluster (each term vector represents one article). The code is as follows:

public class TextClusterInfo {
    public TextClusterInfo(int clusterId) {
        this.clusterId = clusterId;
        // Vector is synchronized: trades some performance for thread safety
        items = new Vector<SepaTermVector>();
    }

    public void addItem(SepaTermVector item) {
        items.add(item);
    }

    public void clearItems() {
        items.clear();
    }

    /**
     * Compute the center of this cluster.
     */
    public void computeCenter() {
        if (items.size() <= 0) {
            return;
        }
        for (SepaTermVector item : items) {
            if (null == center) {
                center = item;
            } else {
                center = DocTermVector.calCenterTermVector(item, center);
            }
        }
    }

    public int getClusterId() {
        return clusterId;
    }

    public void setClusterId(int clusterId) {
        this.clusterId = clusterId;
    }

    public SepaTermVector getCenter() {
        return center;
    }

    public void setCenter(SepaTermVector center) {
        this.center = center;
    }

    public List<SepaTermVector> getItems() {
        return items;
    }

    public void setItems(List<SepaTermVector> items) {
        this.items = items;
    }

    private int clusterId = 0;
    private SepaTermVector center = null;
    private List<SepaTermVector> items = null;
}

Next is the utility class for the KMeans clustering algorithm. Note that in standard KMeans you only specify the initial number of clusters; the algorithm then selects the cluster centers at random and iterates to find the clustering. To reduce the amount of computation, we supply not only the number of clusters but also the center term vector of each cluster: from the large set of texts we pick one representative article per cluster and pass it as a parameter to the clustering algorithm. On our experimental data this converges quickly and with high accuracy.

KMeans proceeds in the following steps:

  1. Initialize the clusters from the given cluster centers.
  2. Clear each cluster's list of term vectors.
  3. For each article's term vector, find the closest cluster and add the term vector to it. If any assignment differs from the previous iteration, the algorithm needs to run again.
  4. Recompute each cluster's center from its newly assigned term vectors.
  5. Decide whether another iteration is needed; if so, return to step 2.

The similarity between a term vector and a cluster is measured by the dot product: the sum, over words common to the term vector and the cluster center, of the products of their weights. A larger value means greater similarity. The code is as follows:

public static double getDotProdTvs(SepaTermVector stv1, SepaTermVector stv2) {
    double dotProd = 0.0;
    Hashtable<String, Double> dict = new Hashtable<String, Double>();
    for (TermInfo info : stv2.getTermVector()) {
        dict.put(info.getTermStr(), info.getWeight());
    }
    for (TermInfo item : stv1.getTermVector()) {
        if (dict.get(item.getTermStr()) != null) {
            dotProd += item.getWeight() * dict.get(item.getTermStr()).doubleValue();
        }
    }
    return dotProd;
}
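One caveat with a raw dot product is that it favors documents with longer, heavier term vectors. A common refinement, not used in this series, is to divide by the vector lengths, giving cosine similarity. A minimal sketch of the idea, shown on plain term-weight maps rather than SepaTermVector so it stands alone:

```java
import java.util.Map;

// Cosine similarity over sparse term-weight maps:
// cos(a, b) = dot(a, b) / (|a| * |b|), which is length-independent.
public class CosineSim {
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // no terms at all: treat as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The same normalization could be applied inside getDotProdTvs if clustering results were found to be skewed by document length.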

The KMeans algorithm implementation class is as follows:

public class TextKMeanCluster {
    /**
     * Usually the rough number of categories for the texts is known in advance,
     * i.e. numClusters can be specified. Specifying the categories also avoids
     * clustering-quality problems when the algorithm fails to converge.
     * @param docTermVectors term vectors of the articles to be clustered
     * @param numClusters number of clusters
     */
    public TextKMeanCluster(List<SepaTermVector> docTermVectors, int numClusters) {
        this.docTermVectors = docTermVectors;
        this.numClusters = numClusters;
    }

    /**
     * Cluster the articles.
     * @param initCenters initial cluster centers
     * @return clustering result
     */
    public List<TextClusterInfo> cluster(List<SepaTermVector> initCenters) {
        if (docTermVectors.size() <= 0) {
            return null;
        }
        initClusters(initCenters);
        boolean hadReassign = true;
        int runTimes = 0;
        while ((runTimes <= MAX_KMEAN_RUNTIMES) && (hadReassign)) {
            System.out.println("runTimes=" + runTimes + "!");
            clearClusterItems();
            hadReassign = reassignClusters();
            computeClusterCenters();
            runTimes++;
        }
        return clusters;
    }

    /**
     * This implementation initializes clusters from the given centers; the
     * standard KMeans algorithm picks centers at random, which converges slowly.
     */
    public void initClusters(List<SepaTermVector> initCenters) {
        clusters = new Vector<TextClusterInfo>();
        TextClusterInfo cluster = null;
        int i = 0;
        for (SepaTermVector stv : initCenters) {
            cluster = new TextClusterInfo(i++);
            cluster.setCenter(stv);
            clusters.add(cluster);
        }
    }

    /**
     * Reassign every article's term vector to its closest cluster. If any
     * assignment differs from the previous iteration, clustering must continue.
     * @return true if another iteration of the clustering algorithm is needed
     */
    public boolean reassignClusters() {
        int numChanges = 0;
        TextClusterInfo newCluster = null;
        for (SepaTermVector termVector : docTermVectors) {
            newCluster = getClosestCluster(termVector);
            if ((termVector.getClusterId() < 0) || (termVector.getClusterId() != newCluster.getClusterId())) {
                numChanges++;
                termVector.setClusterId(newCluster.getClusterId());
            }
            newCluster.addItem(termVector);
        }
        return (numChanges > 0);
    }

    /**
     * Recompute each cluster's center after new term vectors have been assigned.
     */
    public void computeClusterCenters() {
        for (TextClusterInfo cluster : clusters) {
            cluster.computeCenter();
        }
    }

    /**
     * Clear each cluster's list of term vectors.
     */
    public void clearClusterItems() {
        for (TextClusterInfo cluster : clusters) {
            cluster.clearItems();
        }
    }

    /**
     * Randomly pick a cluster center, as the standard KMeans algorithm does.
     * Not used by this class at the moment.
     * @param usedIndex indices already chosen as centers
     * @return a cloned term vector to use as a center
     */
    private SepaTermVector getTermVectorAtRandom(Hashtable<Integer, Integer> usedIndex) {
        int index = (int) Math.floor(Math.random() * docTermVectors.size());
        while (usedIndex.get(index) != null) {
            index = (int) Math.floor(Math.random() * docTermVectors.size());
        }
        usedIndex.put(index, index);
        return docTermVectors.get(index).clone(); // clone so the original is not modified
    }

    /**
     * Take the dot product of the term vector with every cluster center;
     * the cluster with the maximum value is the cluster for this document.
     * @param termVector term vector of a document
     * @return the cluster closest to the term vector
     */
    private TextClusterInfo getClosestCluster(SepaTermVector termVector) {
        TextClusterInfo closestCluster = null;
        double dotProd = -1.0;
        double maxDotProd = -2.0;
        for (TextClusterInfo cluster : clusters) {
            dotProd = DocTermVector.getDotProdTvs(cluster.getCenter(), termVector);
            if (dotProd > maxDotProd) {
                maxDotProd = dotProd;
                closestCluster = cluster;
            }
        }
        return closestCluster;
    }

    public final static int MAX_KMEAN_RUNTIMES = 1000;

    private List<SepaTermVector> docTermVectors = null; // term vectors of all articles
    private List<SepaTermVector> centers = null;
    private List<TextClusterInfo> clusters = null;      // all clusters
    private int numClusters = 0;
}

It is called as follows:

DocTermVector.init();
// Technology
int doc1Id = FteEngine.genTermVector(-1, "Java programming technology explained", "", "", "", "");
int doc2Id = FteEngine.genTermVector(-1, "C++ programming guide", "", "", "", "");
int doc4Id = FteEngine.genTermVector(-1, "Python programming tutorial", "", "", "", "");
// Gay websites
int doc3Id = FteEngine.genTermVector(-1, "a gay website becomes an e-commerce website", "", "", "", "");
int doc5Id = FteEngine.genTermVector(-1, "gay website directory", "", "", "", "");
int doc6Id = FteEngine.genTermVector(-1, "male features", "gay", "", "", "");
// Angel investment
int doc7Id = FteEngine.genTermVector(-1, "angel investing in social networks", "", "", "", "");
int doc8Id = FteEngine.genTermVector(-1, "angel investment development overview", "", "", "", "");
int doc9Id = FteEngine.genTermVector(-1, "famous angel investors and angel investments", "", "", "", "");
// Environmental protection
int doc10Id = FteEngine.genTermVector(-1, "environmental protection technology analysis", "", "", "", "");
int doc11Id = FteEngine.genTermVector(-1, "environmental protection and carbon tariff analysis", "", "", "", "");
int doc12Id = FteEngine.genTermVector(-1, "environmental protection and China's economic development trends", "", "", "", "");

FteEngine.genTermVector(-1, "VB programming guide", "", "", "", "");
FteEngine.genTermVector(-1, "angel investment community Angel Street officially launched", "", "", "", "");
FteEngine.genTermVector(-1, "annual programming language selection activity", "", "", "", "");

List<SepaTermVector> centers = new Vector<SepaTermVector>();
centers.add(DocTermVector.getDocTermVector(0));
centers.add(DocTermVector.getDocTermVector(3));
centers.add(DocTermVector.getDocTermVector(6));
centers.add(DocTermVector.getDocTermVector(9));
TextKMeanCluster tkmc = new TextKMeanCluster(DocTermVector.getDocTermVectors(), 4);
List<TextClusterInfo> rst = tkmc.cluster(centers);
String lineStr = null;
for (TextClusterInfo info : rst) {
    lineStr = "" + info.getClusterId() + "(" + info.getItems().size() + "):";
    for (SepaTermVector tvItem : info.getItems()) {
        lineStr += " " + tvItem.getDocId();
    }
    lineStr += " ^_^";
    System.out.println(lineStr);
}

The running result is:

0 (5): 0 1 2 12 14 ^_^
1 (3): 3 4 5 ^_^
2 (4): 6 7 8 13 ^_^
3 (3): 9 10 11 ^_^

The results above show that the clustering is basically correct.
