[Repost] Python for Chinese text clustering (word segmentation and K-means clustering)


Brief introduction

Searching Baidu for 中文文本聚类 (Chinese text clustering), I was disappointed to find that there is no complete Python implementation of Chinese text clustering online (even searching for "python 中文文本聚类" gives the same result). What the Internet offers is mostly the principles of K-means text clustering, Java implementations, R implementations, and even a C++ implementation.

I have written a number of articles that I never classified very well. I would like to cluster similar articles together, then look at the approximate topic of each cluster and give each cluster a label; that would complete the classification.

Chinese text clustering mainly involves the following steps, which are described in detail below:

    • Word segmentation
    • Stop-word removal
    • Building the bag-of-words vector space model (VSM)
    • Computing TF-IDF word weights
    • Clustering with the K-means algorithm
I. Word segmentation

For Chinese word segmentation we use 结巴分词 (jieba); see the GitHub project homepage and the author's Weibo.

Detailed installation instructions and usage examples for 结巴分词 are available on the GitHub project homepage, so they are not repeated here; normally it can be installed in one of the following ways.

# pip install jieba

Or

# easy_install jieba

You can also refer to the article:
1. Python Chinese word segmentation component jieba
2. Learning the Python jieba ("stuttering") word segmenter
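
Before moving on, here is a minimal sketch of what jieba segmentation looks like; the example sentence is just the toy sentence used later in this article, and the printed output is only indicative:

import jieba

# Precise mode (cut_all=False) is the usual choice for text analysis.
words = jieba.cut("我爱上海,我爱中国", cut_all=False)
print("/".join(words))   # roughly: 我/爱/上海/,/我/爱/中国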

II. Removing stop words

Although 结巴分词 (jieba) does have a stop-word facility, it seems to apply only to jieba.analyse and not to jieba.cut, so here we still have to build a stop-word file ourselves and remove the stop words.
Commonly used Chinese stop-word lists include:
1. A Chinese stop-word list (fairly comprehensive, with 1,208 stop words)
2. The most complete Chinese stop-word thesaurus (1,893 words)

The implementation code is as follows (the code is rather rough):

import jieba


def read_from_file(file_name):
    with open(file_name, "r") as fp:
        words = fp.read()
    return words


def stop_words(stop_word_file):
    words = read_from_file(stop_word_file)
    result = jieba.cut(words)
    new_words = []
    for r in result:
        new_words.append(r)
    return set(new_words)


def del_stop_words(words, stop_words_set):
    # words is the raw text of one document; it is segmented here.
    # Returns the list of tokens with the stop words removed.
    result = jieba.cut(words)
    new_words = []
    for r in result:
        if r not in stop_words_set:
            new_words.append(r)
    return new_words
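
A minimal usage sketch follows; the file names stop_words.txt and my_article.txt are just placeholders for whichever stop-word list and document you actually use:

stop_words_set = stop_words("stop_words.txt")        # build the stop-word set once
doc_text = read_from_file("my_article.txt")          # raw document text
tokens = del_stop_words(doc_text, stop_words_set)    # segmented, stop words removed
print(len(tokens))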

III. Building the bag-of-words vector space model (VSM)

The next step is to build the bag-of-words space. The steps are as follows:
1. Read all documents into the program, then segment each document into words.
2. Remove the stop words from each document.
3. Collect the set of words across all documents (scikit-learn has related functions, but I am not sure whether they work for Chinese).
4. For each document, build a vector whose values are the number of times each word occurs in that document (a small sketch of this counting follows the table below).
For example, suppose there are two texts: 1. 我爱上海,我爱中国 2. 中国伟大,上海漂亮
Then the words are: 我, 爱, 上海, 中国, 伟大, 漂亮 (the comma may also come out as a token).
Assuming the stop words are 我 and the comma, the words remaining after stop-word removal are:
爱, 上海, 中国, 伟大, 漂亮
Then we build a vector for document 1 and document 2, and the vectors are as follows:

text          爱 (love)   上海 (Shanghai)   中国 (China)   伟大 (great)   漂亮 (beautiful)
Document 1    2           1                 1              0              0
Document 2    0           1                 1              1              1
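
To make step 4 concrete, here is a minimal sketch that reproduces the two rows of the table above; the token lists are written out by hand rather than produced by jieba:

word_set = ["爱", "上海", "中国", "伟大", "漂亮"]
doc1 = ["我", "爱", "上海", ",", "我", "爱", "中国"]   # document 1, segmented
doc2 = ["中国", "伟大", ",", "上海", "漂亮"]           # document 2, segmented
stop_words_set = {"我", ","}

doc1 = [w for w in doc1 if w not in stop_words_set]
doc2 = [w for w in doc2 if w not in stop_words_set]
vec1 = [doc1.count(w) for w in word_set]   # [2, 1, 1, 0, 0]
vec2 = [doc2.count(w) for w in word_set]   # [0, 1, 1, 1, 1]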

The full code for building these vectors over a directory of documents is as follows:

import os

import numpy as np


def get_all_vector(file_path, stop_words_set):
    names = [os.path.join(file_path, f) for f in os.listdir(file_path)]
    posts = [open(name).read() for name in names]
    docs = []
    word_set = set()
    for post in posts:
        doc = del_stop_words(post, stop_words_set)
        docs.append(doc)
        word_set |= set(doc)
        # print len(doc), len(word_set)
    word_set = list(word_set)
    docs_vsm = []
    # for word in word_set[:30]:
    #     print word.encode("utf-8"),
    for doc in docs:
        temp_vector = []
        for word in word_set:
            temp_vector.append(doc.count(word) * 1.0)
        # print temp_vector[-30:-1]
        docs_vsm.append(temp_vector)
    docs_matrix = np.array(docs_vsm)
    # (continued in section IV below, where the counts are converted into TF-IDF weights)

    1. In Python it is convenient to put these vectors, e.g. [[2,1,1,0,0],[0,1,1,1,1]], into a NumPy array or matrix, which simplifies the TF-IDF calculation below.
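
For instance, the two toy document vectors above can be stored directly as a NumPy array (a small sketch of the idea):

import numpy as np

# rows = documents, columns = words (爱, 上海, 中国, 伟大, 漂亮)
docs_matrix = np.array([[2, 1, 1, 0, 0],
                        [0, 1, 1, 1, 1]], dtype=float)
print(docs_matrix.shape)   # (2, 5)
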
IV. Converting word counts into weights (TF-IDF)

In other words, our VSM already stores each document as a vector, so why do we need the TF-IDF form? I think it is to convert raw word counts into weights.
The following articles introduce TF-IDF:
1. Basic text clustering methods
2. TF-IDF on Baidu Encyclopedia
3. TF-IDF on English Wikipedia (may require a proxy to access)

It is worth paying attention to how the TF (term frequency) is calculated; as for the IDF (inverse document frequency), I think the formula is basically agreed upon everywhere:
The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm of the quotient:

    IDF_i = log( |D| / |{ j : t_i ∈ d_j }| )

(While typesetting this formula, let me recommend an amazing website: detexify.)
where:
|D|: the total number of documents in the corpus
|{ j : t_i ∈ d_j }|: the number of documents containing the term t_i. If the term does not appear in the corpus this denominator becomes zero, so 1 + |{ j : t_i ∈ d_j }| is generally used as the denominator instead.

However, the TF given by Baidu Encyclopedia and most online introductions is actually problematic. The TF-IDF entry on Baidu Encyclopedia says that term frequency (TF) refers to how frequently a given term appears in a document, so the obvious formula would be:

    TF_{i,j} = n_{i,j} / Σ_k n_{k,j}

that is, the number of occurrences n_{i,j} of term t_i in document d_j divided by the total number of terms in that document. Computed this way, however, TF often comes out very small. In fact there is not just one way to compute TF-IDF but many variants; this is where Wikipedia shows its strength, so for the details of TF-IDF please refer to Wikipedia.
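
As a rough worked example on the toy corpus from section III (assuming natural logarithms and the simple TF above): 上海 appears in both documents, so IDF(上海) = log(2/2) = 0, while 爱 appears in only one, so IDF(爱) = log(2/1) ≈ 0.69; words that occur in every document therefore get zero weight. The same calculation as a small NumPy sketch:

import numpy as np

counts = np.array([[2, 1, 1, 0, 0],    # document 1 (words: 爱 上海 中国 伟大 漂亮)
                   [0, 1, 1, 1, 1]],   # document 2
                  dtype=float)
df = (counts > 0).sum(axis=0)                       # documents containing each word
idf = np.log(counts.shape[0] / df)                  # IDF, here without the +1 smoothing
tf = counts / counts.sum(axis=1, keepdims=True)     # simple TF: count / document length
print(tf * idf)   # only 爱, 伟大 and 漂亮 end up with non-zero weights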

If you are unfamiliar with numpy, you can refer to NumPy official documentation

    # Continuation of get_all_vector: turn the raw counts into TF-IDF weights.
    column_sum = [float(len(np.nonzero(docs_matrix[:, i])[0]))
                  for i in range(docs_matrix.shape[1])]
    column_sum = np.array(column_sum)
    column_sum = docs_matrix.shape[0] / column_sum
    idf = np.log(column_sum)
    idf = np.diag(idf)
    # Think carefully: the definition of IDF does not depend on any single document,
    # so we compute it once in advance.
    # Note that this is a matrix operation, not an operation on a single variable.
    for doc_v in docs_matrix:
        if doc_v.sum() != 0:
            doc_v /= doc_v.sum()   # normalize counts to term frequencies, in place
    tfidf = np.dot(docs_matrix, idf)
    return names, tfidf
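
Putting the pieces together, a rough usage sketch (the directory ./articles and the stop-word file name are placeholders, and the functions defined above are assumed to be in scope):

stop_words_set = stop_words("stop_words.txt")
names, tfidf = get_all_vector("./articles", stop_words_set)
print(len(names), tfidf.shape)   # one row of TF-IDF weights per document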

The resulting matrix now has the following properties:

    • The columns correspond to the combined set of words across all documents.
    • Each row represents one document.
    • Each row is a vector, and each value in the vector is the weight of the corresponding word.
V. Clustering with the K-means algorithm

At this point we can cluster with the K-means algorithm. For K-means the input is not text but just a matrix, so an ordinary K-means implementation will do.
An introduction to Kmeans can be found in the following articles:
1. Introduction and implementation of basic Kmeans algorithm
2. K-means Baidu Encyclopedia
3. Discussion on Kmeans clustering
The difference is that in most text clustering work people usually use the cosine distance (there is a good introductory article on it) rather than the Euclidean distance; this is said to be because the matrix is sparse, although I do not fully understand the reason.

The following code is adapted from the code in Chapter 10 of Machine Learning in Action:

import numpy as np


def gen_sim(A, B):
    # Cosine similarity between two row vectors, rescaled from [-1, 1] to [0, 1].
    num = float(np.dot(A, B.T))
    denum = np.linalg.norm(A) * np.linalg.norm(B)
    if denum == 0:
        denum = 1
    cosn = num / denum
    sim = 0.5 + 0.5 * cosn
    return sim


def randCent(dataSet, k):
    # Create k random cluster centers, within the bounds of each dimension.
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centroids


def kMeans(dataSet, k, distMeas=gen_sim, createCent=randCent):
    m = np.shape(dataSet)[0]
    # One row per data point: assigned centroid index and squared distance to it.
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    counter = 0
    while counter <= 50:   # run at most 50 iterations
        counter += 1
        clusterChanged = False
        for i in range(m):   # assign each data point to the closest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                # gen_sim returns a similarity in [0, 1]; turn it into a distance
                # so that "closest" really means "most similar".
                distJI = 1.0 - distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        # print centroids
        for cent in range(k):   # recalculate the centroids
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]   # all points in this cluster
            centroids[cent, :] = np.mean(ptsInClust, axis=0)   # move the centroid to the mean
    return centroids, clusterAssment
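
An end-to-end run might look like the following sketch (k=10 mirrors the experiment below; the file and directory names are placeholders, and the functions defined above are assumed to be in scope):

stop_words_set = stop_words("stop_words.txt")
names, tfidf = get_all_vector("./articles", stop_words_set)
centroids, cluster_assment = kMeans(np.mat(tfidf), 10)

# Print which cluster each document was assigned to.
for name, cluster_id in zip(names, cluster_assment[:, 0].A.ravel()):
    print(name, int(cluster_id))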

VI. Summary

Basically, at this point a usable Chinese text clustering tool is complete; see the GitHub project for the code.
What about the effect?

I have 182 of my own unclassified articles belonging to the 人生感悟 (life reflections) category (shy face). After segmentation and stop-word removal there were 13,202 words in total. I set k=10; well, the results were not great, possibly for the following reasons:

    • The documents themselves all belong to one fairly uniform category, and clustering based on word frequency does not reveal the subtle differences between these articles.
    • The algorithm needs tuning; there are probably settings and parameters that could be adjusted.

In short, after several days of studying machine learning, this first hands-on exercise is finished.

This article was reproduced from: http://blog.csdn.net/likeyiyy/article/details/48982909
