Text Clustering Tutorials

Source: Internet
Author: User
Tags: glob, pprint, idf

I have been working in the machine-learning direction because my internship requires text clustering and classification. The two tasks are roughly similar, but I am still a novice, so the process and results below are offered only for the experts to comment on. This post condenses two weeks of focused debugging and thousands of lines of test code into the roughly 300 lines of essential code at the end. If you need to reprint it, please cite the source.


What is text clustering?

Text clustering turns documents from raw natural-language text into mathematical information and represents each document as a point in a high-dimensional space. Points that lie closest to one another are gathered into a cluster, and the center of a cluster is called the cluster centroid. A good clustering keeps the points within each cluster as close together as possible while keeping the clusters themselves as far apart as possible.
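A minimal sketch of the idea, using a few made-up sentences and scikit-learn's TfidfVectorizer (nothing here comes from the code at the end of the post; it only illustrates the vectors-plus-distances view):

# Sketch: turn documents into vectors, then compare their distances.
# The toy sentences below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "the cat sat on the mat",
    "a cat lay on a mat",
    "stock prices fell sharply today",
]

vectors = TfidfVectorizer().fit_transform(docs)   # one row per document, one column per word
print(cosine_distances(vectors))                  # the two cat sentences are much closer to each other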


What is the difficulty of text clustering?

Clustering is unsupervised learning: how many categories there should be, and how the documents should be grouped into them, is not known in advance; you can only feel it out a little at a time. Sometimes the machine decides that two piles of points are two clusters while a human would read them as a single cluster. That is where text clustering gets hard: the machine's notion of similarity is not the same as a person's. Anyone reading this has presumably learned the basic clustering algorithms. Take k-means as an example: the selection of the initial centroids is largely random, so with the same K the clustering result differs from run to run and cannot easily be averaged, which makes the quality of the clustering hard to evaluate.
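A quick way to see this instability is to run k-means several times with different random initializations on the same data (a sketch on arbitrary random points, not on the corpus used in this post):

# Sketch: with the same K, different random initializations give different partitions.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 50)   # arbitrary points standing in for document vectors

for seed in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print(km.inertia_)   # the within-cluster distance varies from run to run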


How to evaluate the quality of clustering?

In my previous post, http://blog.csdn.net/chixujohnny/article/details/51852633, I listed the S_Dbw evaluation index, among others, at the end. I have not tried it myself yet; interested readers can give it a go. In short, you will need it sooner or later.
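As far as I know S_Dbw is not built into scikit-learn, so as a readily available stand-in here is the silhouette coefficient, which the full code at the end of this post also uses (a sketch; X and y are assumed to be your reduced document vectors and cluster labels):

# Sketch: one readily available quality measure, the silhouette coefficient.
# X is assumed to be the (reduced) document vectors, y the cluster labels.
from sklearn.metrics import silhouette_score, silhouette_samples

avg = silhouette_score(X, y)          # near 1: tight, well-separated clusters; near 0 or negative: poor
per_point = silhouette_samples(X, y)  # per-document score, useful for spotting badly placed documents
print(avg)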


Text Clustering Process

For assigning the term weights there is also TextRank. I did not end up using it, but jieba ships with an implementation, the keywords it picks look reasonable to the naked eye, and it is worth a try.
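For reference, jieba exposes TextRank directly; a sketch (the sample sentence is made up):

# coding: utf-8
# Sketch: keyword weights from jieba's built-in TextRank; the sample sentence is made up.
import jieba.analyse

text = u"文本聚类是把自然语言文本转换成高维空间中的数学信息"
keywords = jieba.analyse.textrank(text, topK=10, withWeight=True)
print(keywords)   # list of (word, weight) pairs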

A few notes on the stages of the process above. The generated document-vector matrix has one row per document and one column per dimension, where each column holds the weight of one word; a word that does not appear in a document gets 0. With a few thousand files the dimension easily exceeds 100,000 (it depends on document size), and as you can imagine a matrix this wide is extremely sparse: in such a high-dimensional space the few thousand points end up almost on top of each other. There are distances between them, but the distances are tiny, so the clustering is obviously poor; when I measured it, the accuracy was about that of a coin toss. To make the matrix denser, the first thing that comes to mind is PCA. PCA is short for Principal Component Analysis; roughly speaking, it applies a mathematical transformation that keeps the most important directions of the high-dimensional vectors and throws the useless part away. The method is also good for finding the most telling features for classification; the details are in the book "Machine Learning in Action" (which a friend hauled back for me in his backpack). I have not gone deep, but I understand it in broad strokes. Why not use SVD to reduce the dimension? SVD suits dense matrices, such as pixel matrices or recommendation systems, keeping roughly 80% of the useful information, and fits image-compression-style problems (my understanding here is shallow, corrections welcome).
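Roughly what that looks like in scikit-learn (a sketch; corpus is assumed to be your list of segmented, space-separated documents, and 800 components is simply the value used in the full code later in this post):

# Sketch: tf-idf gives a huge sparse matrix; PCA compresses it into a dense one.
# corpus is assumed to be a list of segmented documents, one string per document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

tfidf = TfidfVectorizer().fit_transform(corpus)                # shape: (n_docs, vocabulary size), mostly zeros
dense = PCA(n_components=800).fit_transform(tfidf.toarray())   # sklearn's PCA needs a dense array
print(dense.shape)                                             # (n_docs, 800)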

As for the silhouette coefficient, I only really understood the concept from this BUPT student's blog: buptguo.com/2016/05/31/learn-ml-from-scikit-learn-silhouette-analysis/ . Go read it directly; he explains it better than I can.


For the clustering step I read a fair amount of material. The Baidu results on text clustering all use k-means, and I want to ask: are you doing this just to hand in school homework? Clustering thousand-dimensional vectors with k-means? Is that a joke? I tested it, and the results were very poor; the accuracy was like rolling dice. Later I started reading the literature and found a hierarchical clustering algorithm called BIRCH. It goes a long way toward fixing the problem that each k-means run deviates wildly from the last, and compared with DBSCAN it lets you set the number of clusters (a threshold can be set as well). Most importantly, sklearn has a ready-made implementation, and it is fast. I tried it and it does better than k-means, but I am still not particularly satisfied; from what I have checked it is also not that well suited to high-dimensional spaces, so I plan to keep looking into it.
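A minimal call looks like this (a sketch; X is assumed to be the PCA-reduced matrix and k the desired number of clusters, as in the full code below):

# Sketch: BIRCH from sklearn, with the number of clusters fixed.
# X is assumed to be the reduced document vectors, k the number of clusters.
from sklearn.cluster import Birch

labels = Birch(n_clusters=k).fit_predict(X)
print(labels)   # one cluster label per document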


There are a few more points that you might want to try:

The clustering algorithm's threshold can be tuned, and other clustering algorithms are worth trying, especially ones designed for high-dimensional data (see the sketch after this list).

Use TextRank to assign the weights and see how it performs.
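On the threshold point, BIRCH accepts a threshold parameter, and with n_clusters=None it simply keeps whatever subclusters that threshold produces (a sketch; X is again assumed to be the reduced document vectors, and 0.5 is just an arbitrary starting value to tune):

# Sketch: letting BIRCH's own threshold decide the granularity instead of fixing k.
# X is assumed to be the reduced document vectors; 0.5 is an arbitrary value to tune.
from sklearn.cluster import Birch

labels = Birch(threshold=0.5, n_clusters=None).fit_predict(X)
print(len(set(labels)))   # number of subclusters found at this threshold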


Finally, here is the code:

I will not explain each function one by one; the comments in the code below are fairly detailed.

# coding:utf-8
# 2.0 Use jieba for word segmentation, completely discard the inefficient NLPIR, assign weights with the TextRank algorithm (in practice the TextRank effect is better)
# 2.1 gensim tf-idf
# 2.2 sklearn does tf-idf and k-means
# 2.3 change k-means to BIRCH, using traditional tf-idf

import logging
import time
import os
import jieba
import glob
import random
import copy
import chardet
import gensim
from gensim import corpora, similarities, models
from pprint import pprint
import jieba.analyse
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

start = time.clock()

print '#----------------------------------------#'
print '#                                        #'
print '#              Load corpus               #'
print '#                                        #'
print '#----------------------------------------#\n'

def PreprocessDoc(root):
    allDirPath = []   # holds the folder paths under the corpus dataset folder, string, [1:] is the part we need
    fileNumList = []

    def processDirectory(args, dirname, filenames, fileNum=0):
        allDirPath.append(dirname)
        for filename in filenames:
            fileNum += 1
            fileNumList.append(fileNum)

    os.path.walk(root, processDirectory, None)
    totalFileNum = sum(fileNumList)
    print 'Total files: ' + str(totalFileNum)

    return allDirPath

print '#----------------------------------------#'
print '#                                        #'
print '#       Build the corpus document        #'
print '#                                        #'
print '#----------------------------------------#\n'

# One line per document; the first token is the document's category
def SaveDoc(allDirPath, docPath, stopWords):
    print 'Start building the corpus:'
    category = 1   # document category
    f = open(docPath, 'w')   # put all the text into this one document
    for dirPath in allDirPath[1:]:
        for filePath in glob.glob(dirPath + '/*.txt'):
            data = open(filePath, 'r').read()
            texts = DeleteStopWords(data, stopWords)
            line = ''   # squeeze the words into one line; the first position is the document category, separated by spaces
            for word in texts:
                if word.encode('utf-8') == '\n' or word.encode('utf-8') == 'nbsp' or word.encode('utf-8') == '\r\n':
                    continue
                line += word.encode('utf-8')
                line += ' '
            f.write(line + '\n')   # write this line into the file
        category += 1   # finished one folder, category + 1
    return 0   # only generates the document, no return value

print '#----------------------------------------#'
print '#                                        #'
print '#    Segmentation + stop-word removal    #'
print '#                                        #'
print '#----------------------------------------#\n'

def DeleteStopWords(data, stopWords):
    wordList = []
    # segment first
    cutWords = jieba.cut(data)
    for item in cutWords:
        if item.encode('utf-8') not in stopWords:   # the segment encoding must match the stop-word encoding
            wordList.append(item)
    return wordList

print '#----------------------------------------#'
print '#                                        #'
print '#                tf-idf                  #'
print '#                                        #'
print '#----------------------------------------#\n'

def TFIDF(docPath):
    print 'Start tf-idf:'
    corpus = []   # document corpus
    # read the corpus; one line is one document
    lines = open(docPath, 'r').readlines()
    for line in lines:
        corpus.append(line.strip())   # strip() drops leading/trailing whitespace but keeps the spaces in the middle
    # convert the words in the text into a term-frequency matrix; element a[i][j] is the frequency of word j in document i
    vectorizer = CountVectorizer()
    # this class computes the tf-idf weight of every word
    transformer = TfidfTransformer()
    # the first fit_transform computes tf-idf, the second fit_transform converts the text into the term-frequency matrix
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    # get all the words in the bag-of-words model
    word = vectorizer.get_feature_names()
    # extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in document i
    weight = tfidf.toarray()
    print weight

    # # output all words
    # result = open(docPath, 'w')
    # for j in range(len(word)):
    #     result.write(word[j].encode('utf-8') + ' ')
    # result.write('\r\n\r\n')
    # # output all weights
    # for i in range(len(weight)):
    #     for j in range(len(word)):
    #         result.write(str(weight[i][j]) + ' ')
    #     result.write('\r\n\r\n')
    # result.close()

    return weight

print '#----------------------------------------#'
print '#                                        #'
print '#                  PCA                   #'
print '#                                        #'
print '#----------------------------------------#\n'

def PCA(weight, dimension):
    from sklearn.decomposition import PCA
    print 'Original dimension: ', len(weight[0])
    print 'Start dimensionality reduction:'
    pca = PCA(n_components=dimension)   # initialize PCA
    X = pca.fit_transform(weight)       # return the data after dimensionality reduction
    print 'Dimension after reduction: ', len(X[0])
    print X
    return X

print '#----------------------------------------#'
print '#                                        #'
print '#               k-means                  #'
print '#                                        #'
print '#----------------------------------------#\n'

def kmeans(X, k):   # X = weight
    from sklearn.cluster import KMeans
    print 'Start clustering:'
    clusterer = KMeans(n_clusters=k, init='k-means++')   # set up the clustering model
    # X = clusterer.fit(weight)   # fit on the document vectors
    # print X
    # print clf.cluster_centers_
    # the cluster that each sample belongs to
    y = clusterer.fit_predict(X)   # throw the weight matrix into fit, output the labels
    print y
    # i = 1
    # while i <= len(y):
    #     i += 1
    # used to evaluate whether the number of clusters is appropriate: the smaller the distance, the better the clustering; pick the number of clusters at the critical point
    # print clf.inertia_
    return y

print '#----------------------------------------#'
print '#                                        #'
print '#                BIRCH                   #'
print '#                                        #'
print '#----------------------------------------#\n'

def birch(X, k):   # matrix to cluster, number of clusters
    from sklearn.cluster import Birch
    print 'Start clustering:'
    clusterer = Birch(n_clusters=k)
    y = clusterer.fit_predict(X)
    print 'Output clustering result:'
    print y
    return y

print '#----------------------------------------#'
print '#                                        #'
print '#         silhouette coefficient         #'
print '#                                        #'
print '#----------------------------------------#\n'

def Silhouette(X, y):
    from sklearn.metrics import silhouette_samples, silhouette_score
    print 'Compute silhouette coefficient:'
    silhouette_avg = silhouette_score(X, y)                  # average silhouette coefficient
    sample_silhouette_values = silhouette_samples(X, y)      # silhouette coefficient of each point
    pprint(silhouette_avg)
    return silhouette_avg, sample_silhouette_values

print '#----------------------------------------#'
print '#                                        #'
print '#                 Draw                   #'
print '#                                        #'
print '#----------------------------------------#\n'

def Draw(silhouette_avg, sample_silhouette_values, y, k):
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    import numpy as np

    # create a subplot with 1 row and 2 columns
    fig, ax1 = plt.subplots(1)
    fig.set_size_inches(18, 7)

    # the first subplot is the silhouette plot
    # the silhouette coefficient range is [-1, 1]
    ax1.set_xlim([-0.2, 0.5])
    # the (k + 1) * 10 is there to display the points more clearly
    ax1.set_ylim([0, len(X) + (k + 1) * 10])

    y_lower = 10
    for i in range(k):   # iterate over the clusters one by one
        ith_cluster_silhouette_values = sample_silhouette_values[y == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / k)   # pick a color
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0,
                          ith_cluster_silhouette_values,
                          facecolor=color,
                          edgecolor=color,
                          alpha=0.7)   # not sure what this coefficient does

        # label the silhouette plot with the cluster number
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # compute the y_lower for the next cluster
        y_lower = y_upper + 10

    # draw a vertical dashed line at the average silhouette coefficient
    ax1.axvline(x=silhouette_avg, color='red', linestyle="--")

    plt.show()

if __name__ == "__main__":

    root = '/users/john/desktop/test'
    stopWords = open('/users/john/Documents/nlpstudy/stopwords-utf8', 'r').read()
    docPath = '/users/john/desktop/test/doc.txt'
    k = 3

    allDirPath = PreprocessDoc(root)
    SaveDoc(allDirPath, docPath, stopWords)

    weight = TFIDF(docPath)
    X = PCA(weight, dimension=800)   # reduce the dimension of the original weight data
    # y = kmeans(X, k)   # y = class labels after clustering
    y = birch(X, k)
    silhouette_avg, sample_silhouette_values = Silhouette(X, y)   # silhouette coefficient
    Draw(silhouette_avg, sample_silhouette_values, y, k)

    end = time.clock()
    print 'Run time: ' + str(end - start)

