Data Analysis of Football Game Forums: Simple and Crude K-means Clustering
After trying to tag posts in "Data Analysis of Football Game Forums: Simple and Crude Bayes", I kept feeling that the results were unacceptable. I think the algorithm I chose was the wrong one, because:
- Forum posts cannot be classified as simply as PC/PS/XBOX.
- Even the labels that posters put on their own posts may be misleading: the tag in the title does not always match the content.
Since classifying the posts does not work well, I tried a clustering algorithm to see whether it would reveal anything:
    # All word-segmented texts have been saved into a single file, with no prior classification
    import codecs

    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    f = codecs.open('forum_all.txt', 'r', 'utf-8')
    words_full = f.readlines()
    f.close()

    true_k = 5

    # TfidfVectorizer already yields TF-IDF weights, so no separate TfidfTransformer pass is needed
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000, min_df=2)
    td = vectorizer.fit_transform(words_full)

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=1)
    km.fit(td)
    print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(td, km.labels_, sample_size=5000))

    # Sort the features of each centroid by descending weight
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
    for i in range(true_k):
        # Output the top 10 feature words of each cluster
        print(' '.join(terms[ind] for ind in order_centroids[i, :10]))
Result of the run:
    Silhouette Coefficient: 0.137
    Cluster 0: 1634 posts
    graphics card recognition independent Installation Method tutorial final cracked version reloaded
    Cluster 1: 4388 posts
    2014 evolution soccer recommended pro Forum debut dlc3 download cracked version
    Cluster 2: 1677 posts
    Summary resources dlc6 22 10 Update pes2014 share thank you for supporting
    Cluster 3: 7872 posts
    wecn officially released pes2016 patch v2 Simplified Chinese v1 patch
    Cluster 4: 11287 posts
    pes2014 troubleshooting patch update Player 10 stadium sharing pes2016 Thank you
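The output lists a post count for each cluster, but the snippet as listed only prints the feature words. One way such counts might be obtained is to tally the cluster labels. This is a minimal sketch, not necessarily what was actually run: `cluster_sizes` and `toy_labels` are hypothetical names, and the toy labels stand in for `km.labels_`.

```python
from collections import Counter


def cluster_sizes(labels, n_clusters):
    """Return the number of posts assigned to each cluster index."""
    counts = Counter(int(label) for label in labels)
    # Clusters that received no posts are reported as 0
    return [counts.get(i, 0) for i in range(n_clusters)]


# Toy labels standing in for km.labels_ (hypothetical data, not the forum results)
toy_labels = [0, 2, 1, 2, 2, 0, 4]
for i, size in enumerate(cluster_sizes(toy_labels, 5)):
    print('Cluster %d: %d posts' % (i, size))
```

With a fitted model, the call would be `cluster_sizes(km.labels_, true_k)`.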
Judging from these clusters, the forum sections I crawled mainly discuss:
- Computer configuration for running the game, i.e. the PC version.
- There are still plenty of "you know what" (cracked version) players, and there is a long way to go before everyone plays legitimate copies.
- More than half of the discussion (clusters 3 and 4, roughly 71% of the posts) is about patches for the game. As veterans know, FIFA is unbeatable when it comes to licensing.
- Unexpectedly, the 2014 and 2016 editions are the popular ones, while 2015 is barely present.
- I somewhat suspect that my crawler spent most of its time fetching posts from the PC section -_-b
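The "more than half" claim can be checked directly from the post counts in the run output. The short sketch below only does that arithmetic; the variable names are illustrative.

```python
# Post counts per cluster, copied from the run output above
cluster_posts = {0: 1634, 1: 4388, 2: 1677, 3: 7872, 4: 11287}

total = sum(cluster_posts.values())
# Clusters 3 and 4 are the patch-related ones
patch_share = (cluster_posts[3] + cluster_posts[4]) / total

for cluster, posts in cluster_posts.items():
    print('Cluster %d: %5d posts (%.1f%%)' % (cluster, posts, 100.0 * posts / total))
print('Clusters 3 and 4 together: %.1f%% of %d posts' % (100.0 * patch_share, total))
```

Clusters 3 and 4 hold 19159 of the 26858 posts, which is about 71%.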
P.S. The choice of 5 clusters is just a value I set. After trying cluster counts from 3 to 12, I found that the silhouette coefficient levels off at around 5, although the improvement is not significant.
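A scan over cluster counts like the one described could look roughly like the sketch below. It is an illustration under assumptions, not the code that was actually run: `scan_k` is a hypothetical helper, the data is a synthetic stand-in with three obvious groups rather than the forum TF-IDF matrix, and `n_init=10` with a fixed `random_state` is my choice for reproducibility.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans


def scan_k(features, k_values, seed=0):
    """Fit K-means for each k and return {k: silhouette coefficient}."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init='k-means++', max_iter=200,
                    n_init=10, random_state=seed)
        labels = km.fit_predict(features)
        scores[k] = metrics.silhouette_score(features, labels)
    return scores


# Synthetic stand-in data: three tight, well-separated groups of 40 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
toy = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])

scores = scan_k(toy, range(2, 7))
for k, score in sorted(scores.items()):
    print('k=%d silhouette=%.3f' % (k, score))
```

For the forum data, the call would be something like `scan_k(td, range(3, 13))`. On a matrix that large, passing `sample_size` to `silhouette_score`, as the main snippet does, keeps the scoring affordable.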
Finally, here is a scatter plot of the clusters, made by reducing the dimensionality of the feature vectors.
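The reduction method behind the chart is not stated. As one possibility, the sketch below assumes `TruncatedSVD`, which works directly on sparse TF-IDF matrices, to project the vectors down to two dimensions. The toy posts are hypothetical stand-ins for the segmented forum text.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in posts, not taken from the crawled forum data
toy_posts = [
    'pes2016 patch v2 released',
    'pes2016 patch update stadium',
    'pes2014 patch update',
    'graphics card not recognized',
    'graphics card installation tutorial',
    'graphics card configuration',
]

td = TfidfVectorizer().fit_transform(toy_posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(td)

# Assumption: TruncatedSVD as the reduction step; each post becomes an (x, y) point
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(td)

for (x, y), label in zip(coords, labels):
    print('cluster %d at (%.3f, %.3f)' % (label, x, y))
```

The resulting coordinates can be passed to any plotting library, for example matplotlib's `plt.scatter(coords[:, 0], coords[:, 1], c=labels)`, to colour each point by its cluster.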