K-means algorithm and text clustering practices

K-means is a common clustering algorithm. Compared with other clustering algorithms, K-means has low time complexity and produces good clusters. This article briefly introduces the K-means algorithm; the figure below shows the result of clustering a handwritten-digit dataset.

Basic Ideas

The K-means algorithm requires the number of clusters K to be specified in advance. The algorithm starts by randomly selecting K records as center points, then iterates over the entire dataset: each record is assigned to the cluster whose center point is closest, and each center point is then replaced by the mean of the records in its cluster. These two steps repeat until convergence.

The convergence mentioned above can be judged in two ways: either the cluster assignment of every record stops changing, or the optimization objective stops changing appreciably. The time complexity of the algorithm is O(K * n * t), where K is the number of centers, n is the size of the dataset, and t is the number of iterations.
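The two-step loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in the article; the function name and the simple convergence check are my own choices, and it assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the (n, d) array X into k clusters; returns (centers, labels)."""
    rng = np.random.RandomState(seed)
    # Step 1: randomly select k records as the initial center points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each record to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: replace each center with the mean of its cluster's records
        # (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: the centers (and hence the assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated point groups this converges in a couple of iterations and assigns each group to its own cluster.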

Optimization objectives

The loss function of K-means is the squared error:

$RSS_k = \sum_{x \in \omega_k} |x - u(\omega_k)|^2$

$RSS = \sum_{k=1}^{K} RSS_k$

$\omega_k$ denotes the k-th cluster, $u(\omega_k)$ denotes the center of the k-th cluster, $RSS_k$ is the loss function of the k-th cluster, and $RSS$ is the overall loss function. The goal of optimization is to find the assignment of records to clusters that minimizes the overall loss function.
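The overall loss is just the sum of squared distances of each record to its own cluster center; a small sketch of the formula above (function name `rss` is mine):

```python
import numpy as np

def rss(X, labels, centers):
    """Overall loss: sum over clusters of squared distances to the cluster center."""
    return sum(
        np.sum((X[labels == k] - centers[k]) ** 2)  # RSS_k for cluster k
        for k in range(len(centers))
    )
```

For example, two records at distance 1 on either side of their center contribute 1 + 1 = 2 to the loss.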

Central Point Selection

The K-means algorithm is guaranteed to converge, but not to the global optimum. When the initial centers are poorly chosen, it reaches only a local optimum, and the overall clustering quality is poor. The following methods can be used to choose the center points:

1. Select points that are as far away from each other as possible as the center points;

2. Run a preliminary hierarchical clustering to produce K clusters, and use their centers as the initial centers of K-means;

3. Run K-means several times with different random centers and keep the clustering result with the best score.

K value selection

The error function of K-means has a major defect: as the number of clusters increases, the error function approaches 0. In the most extreme case, every record is its own cluster, and the error is 0, but that clustering result is not what we want. We can introduce a structural-risk term to penalize the complexity of the model:

$K = \arg\min_K [RSS_{min}(K) + \lambda K]$

$\lambda$ is a parameter that balances the training error against the number of clusters. However, the question now becomes how to select $\lambda$. As shown in [Reference 1], when the dataset follows a Gaussian distribution, $\lambda = 2m$, where m is the vector dimension.
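The penalized selection rule can be illustrated with hypothetical numbers: the `rss_min` values and the penalty weight below are invented for the example, not measured from any dataset:

```python
# Hypothetical RSS_min(K) values from five training runs (illustrative only)
rss_min = {1: 100.0, 2: 40.0, 3: 15.0, 4: 12.0, 5: 11.0}
lam = 8.0  # assumed penalty weight; in practice lambda must be chosen or tuned

# Pick the K minimizing the penalized objective RSS_min(K) + lambda * K
best_k = min(rss_min, key=lambda k: rss_min[k] + lam * k)
print(best_k)  # -> 3
```

Beyond K = 3 the error barely drops, so the penalty term makes larger K values lose.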

Another method is to try increasing values of K, plot the corresponding error values, and pick a good K by looking for the inflection point ("elbow") of the curve; for details, see the text clustering example below.
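The elbow procedure can be sketched on synthetic data with a known cluster count (the `make_blobs` data here is a stand-in for real feature vectors, not the 36kr dataset used below):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (stand-in for real feature vectors)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Try ascending K values and record the error (inertia) for each
errors = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
          for k in range(2, 9)}
for k in sorted(errors):
    print(k, round(errors[k], 1))
# The error drops sharply until K reaches the true cluster count,
# then flattens; the inflection point suggests the K to choose
```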

K-means text clustering

I crawled some articles from 36kr, 1,456 articles in total. After word segmentation, I used sklearn to run K-means clustering. The data records after word segmentation look as follows:

Using TF-IDF to select the feature words, the curve below plots the error against center counts from 3 to 80:

A clear inflection point appears at K = 10, so K = 10 is selected as the number of centers. Below is the number of records in each of the 10 clusters:

{0: 152, 1: 239, 2: 142, 3: 61, 4: 119, 5: 44, 6: 71, 7: 394, 8: 141, 9: 93}

Cluster tag generation

After clustering, we need some labels to describe each cluster. You can use TF-IDF, mutual information, Chi-square, and other feature selection methods to pick feature words as labels; for Chi-square and mutual-information feature selection, see the text feature selection in my previous article. The following are the TF-IDF label results of the 10 clusters:

Cluster 0: merchant goods logistics brand payment shopping-guide website shopping platform orders
Cluster 1: investment financing USD company capital market China last year
Cluster 2: smart phone hardware devices TV sports data functions health use
Cluster 3: data platform market student app mobile information company doctor education
Cluster 4: enterprise recruitment talent platform company IT mobile website security information
Cluster 5: social friend pet activity friend sharing game
Cluster 6: accounting finance loan bank financial P2P investment Internet fund companies
Cluster 7: task collaboration enterprise sales communication project management tool member
Cluster 8: travel tourism hotel booking information city investment open app demand
Cluster 9: video content game music pictures photos advertisements reading sharing functions

Implementation Code
# -*- coding: utf-8 -*-
from __future__ import print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans


def loadDataset():
    '''Import the text dataset'''
    f = open('36krout.txt', 'r')
    dataset = []
    lastPage = None
    for line in f.readlines():
        if '<title>' in line and '</title>' in line:
            if lastPage:
                dataset.append(lastPage)
            lastPage = line
        else:
            lastPage += line
    if lastPage:
        dataset.append(lastPage)
    f.close()
    return dataset


def transform(dataset, n_features=1000):
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                                 min_df=2, use_idf=True)
    X = vectorizer.fit_transform(dataset)
    return X, vectorizer


def train(X, vectorizer, true_k=10, minibatch=False, showLabel=False):
    # Train K-means, either on mini-batches or on the full data
    if minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=False)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=300,
                    n_init=1, verbose=False)
    km.fit(X)
    if showLabel:
        print("Top terms per cluster:")
        # Sort each center's term weights in descending order
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names()
        print(vectorizer.get_stop_words())
        for i in range(true_k):
            print("Cluster %d:" % i, end='')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end='')
            print()
    result = list(km.predict(X))
    print('Cluster distribution:')
    print(dict([(i, result.count(i)) for i in result]))
    return -km.score(X)


def test():
    '''Search for the optimal number of clusters'''
    dataset = loadDataset()
    print("%d documents" % len(dataset))
    X, vectorizer = transform(dataset, n_features=500)
    true_ks = []
    scores = []
    for i in xrange(3, 80, 1):
        score = train(X, vectorizer, true_k=i) / len(dataset)
        print(i, score)
        true_ks.append(i)
        scores.append(score)
    plt.figure(figsize=(8, 4))
    plt.plot(true_ks, scores, label="error", color="red", linewidth=1)
    plt.xlabel("n_clusters")
    plt.ylabel("error")
    plt.legend()
    plt.show()


def out():
    '''Output the clustering result with the optimal parameters'''
    dataset = loadDataset()
    X, vectorizer = transform(dataset, n_features=500)
    score = train(X, vectorizer, true_k=10, showLabel=True) / len(dataset)
    print(score)

# test()
out()

Comments on this article are welcome.

References

[1]. Wang Bin. Introduction to Information Retrieval


When reprinting, please indicate the source: http://www.cnblogs.com/fengfenggirl/
