Data Analysis of Football Game Forums: Simple and Crude K-means Clustering

Source: Internet
Author: User


After trying to label posts in "Data Analysis of Football Game Forums: Simple and Crude Bayes", I was never satisfied with the results. I think the algorithm itself was the wrong choice, because:

  • Forum posts do not fall into categories as simple as PC/PS/XBOX.
  • Even the labels chosen by the posts' own authors can be misleading: the tag on a post does not always match its content.

Since the posts cannot be classified reliably, I tried a clustering algorithm instead to see what information it would turn up:

    import codecs

    import numpy as np
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # All posts, already word-segmented, are stored in one file
    # (one post per line) with no prior class labels.
    with codecs.open('forum_all.txt', 'r', 'utf-8') as f:
        words_full = f.readlines()

    true_k = 5  # number of clusters

    # TfidfVectorizer already produces TF-IDF weights, so a separate
    # TfidfTransformer pass is not needed.
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000, min_df=2)
    td = vectorizer.fit_transform(words_full)

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=1)
    km.fit(td)
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(td, km.labels_, sample_size=5000))

    # Sort each centroid's features by descending weight.
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # On scikit-learn older than 1.0, use get_feature_names() instead.
    terms = vectorizer.get_feature_names_out()
    sizes = np.bincount(km.labels_, minlength=true_k)
    for i in range(true_k):
        print("Cluster %d: %d posts" % (i, sizes[i]))
        # Output the top 10 feature words for each cluster.
        print(' '.join(terms[ind] for ind in order_centroids[i, :10]))

Running result

    Silhouette Coefficient: 0.137
    Cluster 0: 1634 posts
    graphics card recognition independent Installation Method tutorial final cracked version reloaded
    Cluster 1: 4388 posts
    2014 evolution soccer recommended pro Forum debut dlc3 download cracked version
    Cluster 2: 1677 posts
    Summary resources dlc6 22 10 Update pes2014 share thank you for supporting
    Cluster 3: 7872 posts
    wecn officially released pes2016 patch v2 Simplified Chinese v1 patch
    Cluster 4: 11287 posts
    pes2014 troubleshooting patch update Player 10 stadium sharing pes2016 Thank you
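To put the 0.137 in context: the silhouette coefficient compares, for each sample, its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b), and averages the result. It ranges from -1 to 1, and values near 0 suggest overlapping clusters. The following is a minimal pure-Python sketch of that definition on made-up 1-D points; the function name and the data are illustrative assumptions, not part of the forum analysis.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(abs(p - q))
        own = by_cluster.pop(lab, None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster (or only one cluster): score 0
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups, labelled well and then badly.
tight = silhouette([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
mixed = silhouette([0.0, 1.0, 10.0, 11.0], [0, 1, 0, 1])
print("well separated: %.3f" % tight)
print("badly assigned: %.3f" % mixed)
```

The well-separated labelling scores close to 0.9 and the scrambled one is negative, so a score of 0.137 points to clusters that are only weakly separated.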

Judging from these clusters, the forum sections I crawled mainly discuss:

  • Computer configuration for running the game, i.e. the PC version (cluster 0).
  • Cracked ("you know what I mean") copies still have plenty of users, so there is a long way to go before players move to licensed copies.
  • More than half of the discussion (clusters 3 and 4) is about the various patches for the game. As veteran players know, FIFA is unbeatable when it comes to licensing.
  • Unexpectedly, the 2014 and 2016 editions are the popular ones, while the 2015 edition barely registers.
  • I suspect my crawler spent most of its time on posts in the PC section -_-b

P.S. The choice of 5 clusters is just a preset value. After testing cluster counts from 3 to 12, I found that the silhouette coefficient levels off at 5, and that the improvement is not significant.
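A sweep over cluster counts like the one described can be sketched as follows. This is an illustrative reconstruction, not the original script: the helper name `sweep_k`, the fixed `random_state`, and the toy posts are all assumptions, and the toy data is far too small to reproduce the scores reported here.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def sweep_k(docs, k_values, seed=0):
    """Return {k: silhouette score} for K-means on TF-IDF vectors of docs."""
    X = TfidfVectorizer().fit_transform(docs)
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init="k-means++", max_iter=200,
                    n_init=1, random_state=seed)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Hypothetical stand-in for the segmented posts (one post per string).
docs = [
    "pes2014 patch update", "pes2016 patch release", "patch v2 update",
    "graphics card install tutorial", "graphics card not recognised",
    "install tutorial cracked version", "dlc3 download share",
    "dlc6 download resources", "resources share thanks",
]
scores = sweep_k(docs, range(2, 6))
for k, s in sorted(scores.items()):
    print("k=%d  silhouette=%.3f" % (k, s))
```

On the real data the call would be something like `sweep_k(words_full, range(3, 13))`; plotting the scores against k makes it easier to see where the curve flattens.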

Finally, a scatter plot of the clusters, produced by reducing the dimensionality of the feature vectors, is attached.
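The article does not say which dimensionality reduction method was used for the chart. As one possible approach, the sketch below projects TF-IDF vectors to two dimensions with `TruncatedSVD` (an assumption on my part, chosen because it accepts sparse matrices directly) and colours the points by K-means label. The toy posts are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the segmented posts (one post per string).
docs = [
    "pes2014 patch update", "pes2016 patch release", "patch v2 update",
    "graphics card install tutorial", "graphics card not recognised",
    "install tutorial cracked version", "dlc3 download share",
    "dlc6 download resources", "resources share thanks",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project the sparse TF-IDF matrix onto its first two SVD components.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_title("K-means clusters in 2-D")
# fig.savefig("clusters.png")  # uncomment to write the chart to disk
```

Two components usually capture only a small share of the variance in text data, so clusters that overlap in such a plot may still be separated in the full feature space.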

 
