Brief introduction
I was disappointed to find that there is no complete Python implementation of Chinese text clustering (中文文本聚类) online; even searching Baidu for the keywords "python 中文文本聚类" turns up little. Most of what is out there covers the theory behind K-means text clustering, or implementations in Java and R, and there is even one in C++.
I have written a number of articles that I never classified very well. I would like to cluster similar articles together, look at the approximate topic of each cluster, and give each cluster a label; that would complete the classification.
Chinese text clustering involves a few main steps, each described in detail below:
- Word segmentation
- Stop word removal
- Building the bag-of-words vector space model (VSM)
- Computing TF-IDF word weights
- Clustering with the K-means algorithm
I. Word segmentation
For Chinese word segmentation I use jieba (结巴分词); see its GitHub project homepage and the author's Weibo.
Detailed installation instructions and usage examples are available on the GitHub project homepage, so I will not repeat them here. It can normally be installed in one of the following ways.
# pip install jieba
Or
# easy_install jieba
You can also refer to these articles:
1. Python Chinese word segmentation component: jieba
2. Learning the jieba segmenter in Python
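As a minimal illustration of jieba itself, here is a sketch of segmenting the example sentence used later in this post (the exact segmentation may vary with jieba's dictionary and version):

```python
# -*- coding: utf-8 -*-
import jieba

text = u"我爱上海,我爱中国"
# jieba.cut returns a generator of tokens; jieba.lcut returns a list directly.
words = list(jieba.cut(text))
print("/".join(words))   # e.g. 我/爱/上海/,/我/爱/中国
```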
II. Stop word removal
Although jieba (结巴分词) has some stop word support, it seems to apply only to jieba.analyse, not to jieba.cut, so we still have to build a stop word file and remove the stop words ourselves (a small jieba.analyse sketch is shown below for comparison).
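For comparison, when doing keyword extraction jieba does accept a stop word list via jieba.analyse.set_stop_words; a minimal sketch, with the file path as a placeholder:

```python
# -*- coding: utf-8 -*-
import jieba.analyse

# Placeholder path: point this at your own stop word list, one word per line.
jieba.analyse.set_stop_words("stop_words.txt")

text = u"我爱上海,我爱中国"
# extract_tags ranks keywords by TF-IDF and skips the configured stop words.
print(jieba.analyse.extract_tags(text, topK=5))
```

This only helps for keyword extraction, though; for plain segmentation with jieba.cut we still filter stop words ourselves, as below.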
Commonly used Chinese stop word lists include:
1. A Chinese stop word list (fairly comprehensive, 1,208 stop words)
2. The most complete Chinese stop word list (1,893 stop words)
The implementation code is as follows (the code is rather rough):
```python
# -*- coding: utf-8 -*-
import jieba

def read_from_file(file_name):
    with open(file_name, "r") as fp:
        words = fp.read()
    return words

def stop_words(stop_word_file):
    # Read the stop word file and return its words as a set.
    words = read_from_file(stop_word_file)
    result = jieba.cut(words)
    new_words = []
    for r in result:
        new_words.append(r)
    return set(new_words)

def del_stop_words(words, stop_words_set):
    # words is the raw document text.
    # Returns the segmented document with stop words removed.
    result = jieba.cut(words)
    new_words = []
    for r in result:
        if r not in stop_words_set:
            new_words.append(r)
    return new_words
```
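A minimal usage sketch of the helpers above (the file name stop_words.txt is a placeholder for your own stop word list):

```python
# -*- coding: utf-8 -*-
# Assumes stop_words() and del_stop_words() from the block above are in scope.
stop_words_set = stop_words("stop_words.txt")   # placeholder path

doc = u"我爱上海,我爱中国"
tokens = del_stop_words(doc, stop_words_set)
print("/".join(tokens))   # segmented tokens with the stop words filtered out
```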
III. Building the bag-of-words vector space model (VSM)
Next we build the bag-of-words vector space. The steps are as follows:
1. Read all documents into the program and segment each one into words.
2. Remove the stop words from each document.
3. Collect the set of all words across the documents (scikit-learn has related functions, but I am not sure whether they handle Chinese).
4. For each document, build a vector whose values are the number of times each word occurs in that document.
For example, suppose there are two texts:
1. 我爱上海,我爱中国
2. 中国伟大,上海漂亮
The words are then: 我, 爱, 上海, 中国, 伟大, 漂亮, and , (the comma may also come out as a token).
Assuming the stop words are 我 and the comma, the words remaining after stop word removal are: 爱, 上海, 中国, 伟大, 漂亮.
We then build vectors for document 1 and document 2:

| text | 爱 (love) | 上海 (Shanghai) | 中国 (China) | 伟大 (great) | 漂亮 (beautiful) |
| --- | --- | --- | --- | --- | --- |
| Document 1 | 2 | 1 | 1 | 0 | 0 |
| Document 2 | 0 | 1 | 1 | 1 | 1 |
The code is as follows:
```python
import os
import numpy as np

def get_all_vector(file_path, stop_words_set):
    names = [os.path.join(file_path, f) for f in os.listdir(file_path)]
    posts = [open(name).read() for name in names]
    docs = []
    word_set = set()
    for post in posts:
        # Segment each document, drop its stop words, and grow the vocabulary.
        doc = del_stop_words(post, stop_words_set)
        docs.append(doc)
        word_set |= set(doc)
    word_set = list(word_set)
    docs_vsm = []
    for doc in docs:
        # One count vector per document, in the fixed word_set order.
        temp_vector = []
        for word in word_set:
            temp_vector.append(doc.count(word) * 1.0)
        docs_vsm.append(temp_vector)
    docs_matrix = np.array(docs_vsm)
    # (The function continues in section IV, where the TF-IDF weights are computed.)
```
In Python, these vectors, [[2, 1, 1, 0, 0], [0, 1, 1, 1, 1]], can be stored in a numpy array or matrix to make the TF-IDF calculation below easier.
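For the toy example above, the resulting array would look like this (a sketch assuming the column order 爱, 上海, 中国, 伟大, 漂亮):

```python
import numpy as np

# Rows are documents; columns follow the word order 爱, 上海, 中国, 伟大, 漂亮.
docs_matrix = np.array([[2, 1, 1, 0, 0],
                        [0, 1, 1, 1, 1]], dtype=float)
print(docs_matrix.shape)   # (2, 5)
```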
IV. Converting word counts into weights (TF-IDF)
Our VSM already stores each document as a vector, so why do we need the TF-IDF form? As I understand it, the point is to convert a word's raw occurrence count into a weight that reflects its importance.
For an introduction to TF-IDF, see these articles:
1. Basic text clustering methods
2. TF-IDF on Baidu Encyclopedia
3. TF-IDF on the English Wikipedia (requires getting past the GFW)
What deserves attention is how TF (term frequency) is calculated; as for IDF (inverse document frequency), I think everyone basically agrees on the formula:
The inverse document frequency (IDF) measures how much general importance a word carries. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm of the quotient:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

(while typesetting this formula I came across a wonderful website worth recommending: detexify)

where:
- $|D|$: the total number of documents in the corpus
- $|\{j : t_i \in d_j\}|$: the number of documents containing the term $t_i$. If the term does not appear in the corpus at all this denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is commonly used instead.
However, the TF described by Baidu Encyclopedia and most online introductions is actually problematic. The Baidu Encyclopedia entry defines term frequency (TF) as the frequency with which a given term appears in a document, which clearly corresponds to the formula

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times term $t_i$ occurs in document $d_j$ and the denominator is the total number of terms in that document. This calculation often makes TF very small. In fact, TF-IDF is not a single formula but a family of weighting schemes, and this is where Wikipedia shows its strength; for the details, refer to the Wikipedia article on TF-IDF.
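As a quick sanity check on the toy corpus above (using the normalized TF just given and the natural logarithm, as np.log does): document 1 contains 4 tokens after stop word removal, 爱 accounts for 2 of them, and 爱 appears in 1 of the 2 documents, so

$$\mathrm{tf}_{爱,1} = \frac{2}{4} = 0.5,\qquad \mathrm{idf}_{爱} = \ln\frac{2}{1} \approx 0.693,\qquad \mathrm{tfidf}_{爱,1} \approx 0.5 \times 0.693 \approx 0.35.$$

By contrast, 上海 appears in both documents, so $\mathrm{idf}_{上海} = \ln(2/2) = 0$ and its weight is zero everywhere.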
If you are unfamiliar with numpy, refer to the official NumPy documentation.
```python
    # Continuation of get_all_vector(): convert raw counts into TF-IDF weights.
    # IDF does not depend on any single document, so it is computed once, up front.
    column_sum = [float(len(np.nonzero(docs_matrix[:, i])[0]))
                  for i in range(docs_matrix.shape[1])]
    column_sum = np.array(column_sum)
    column_sum = docs_matrix.shape[0] / column_sum   # |D| / document frequency
    idf = np.log(column_sum)
    idf = np.diag(idf)
    # Normalize each row in place so that counts become term frequencies.
    # (The in-place operator is required; "doc_v = doc_v / doc_v.sum()" would
    # rebind the loop variable and leave docs_matrix unchanged.)
    for doc_v in docs_matrix:
        if doc_v.sum() > 0:
            doc_v /= doc_v.sum()
    # These are matrix operations, not single-variable operations.
    tfidf = np.dot(docs_matrix, idf)
    return names, tfidf
```
The resulting matrix has the following properties:
- The columns correspond to the set of all words across all documents.
- Each row represents one document.
- Each row is a vector, and each value in the vector is the weight of the corresponding word.
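As an aside, and following up on the note in step 3 above: scikit-learn can produce a comparable TF-IDF matrix directly if you hand it jieba as the tokenizer. A sketch, assuming scikit-learn is installed (its default TF-IDF uses a smoothed IDF and L2 normalization, so the exact weights will differ from the hand-rolled version above):

```python
# -*- coding: utf-8 -*-
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [u"我爱上海,我爱中国", u"中国伟大,上海漂亮"]

# Use jieba for tokenization; scikit-learn's default regex tokenizer
# is not suitable for Chinese text.
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.vocabulary_)   # word -> column index
print(tfidf.toarray())          # dense TF-IDF matrix, one row per document
```

A stop word list could also be passed in through the vectorizer's stop_words parameter.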
V. Clustering with the K-means algorithm
At this point we can run the K-means algorithm. K-means does not care that the data came from text; it only sees a matrix, so an ordinary K-means implementation will do.
For an introduction to K-means, see the following articles:
1. Introduction to and implementation of the basic K-means algorithm
2. K-means on Baidu Encyclopedia
3. A discussion of K-means clustering
One difference is that for text clustering people usually compute cosine distance (there is a good introductory article on it) rather than Euclidean distance, apparently because the matrix is sparse; I do not fully understand the reason.
The following code is adapted from chapter 10 of Machine Learning in Action:
```python
import numpy as np

def gen_sim(A, B):
    # Cosine-based distance between two row vectors (0 means identical direction).
    num = float(np.dot(A, B.T))
    denum = np.linalg.norm(A) * np.linalg.norm(B)
    if denum == 0:
        denum = 1
    cosn = num / denum
    sim = 0.5 + 0.5 * cosn       # rescale cosine similarity from [-1, 1] to [0, 1]
    return 1.0 - sim             # return a distance so that kMeans can minimize it

def randCent(dataSet, k):
    # Create k random centroids within the bounds of each dimension.
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = np.mat(minJ + rangeJ * np.random.rand(k, 1))
    return centroids

def kMeans(dataSet, k, distMeas=gen_sim, createCent=randCent):
    m = np.shape(dataSet)[0]
    # Column 0: assigned centroid index; column 1: squared distance to it.
    clusterAssment = np.mat(np.zeros((m, 2)))
    centroids = createCent(dataSet, k)
    counter = 0
    while counter <= 50:
        counter += 1
        clusterChanged = False
        for i in range(m):                      # assign each point to the closest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):                   # recalculate the centroids
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = np.mean(ptsInClust, axis=0)
        if not clusterChanged:                  # stop early once assignments are stable
            break
    return centroids, clusterAssment
```
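Putting the pieces together, a minimal end-to-end sketch (the corpus directory and stop word file are placeholders, and k=10 mirrors the experiment described below):

```python
# -*- coding: utf-8 -*-
# Assumes stop_words(), get_all_vector() and kMeans() from above are in scope.
import numpy as np

stop_words_set = stop_words("stop_words.txt")                    # placeholder path
names, tfidf = get_all_vector("./articles/", stop_words_set)     # placeholder directory

# Cluster the TF-IDF matrix into 10 groups, as in the experiment below.
centroids, assignments = kMeans(np.mat(tfidf), 10)
for name, label in zip(names, assignments[:, 0].A.ravel()):
    print(name, int(label))
```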
VI. Summary
Up to this point, a usable Chinese text clustering tool is basically complete; the code is in the GitHub project.
So how well does it work?
I have 182 of my own unclassified articles in the 人生感悟 ("reflections on life") category (shy face). After segmentation and stop word removal there were 13,202 words in total. I set k=10 and, well, the results were not great. Possible reasons include:
- The documents already form a fairly narrow category, and clustering on word frequencies may not reveal the subtle differences between these articles.
- The algorithm needs tuning; there are probably settings that could be adjusted.
In short, after several days of studying machine learning, this first hands-on exercise is done.
This article was reproduced from: http://blog.csdn.net/likeyiyy/article/details/48982909
[Repost] Python Chinese text clustering (word segmentation and K-means clustering)