Football Game Forum Data Analysis--Simple and Rough K-means Clustering

Source: Internet
Author: User

In << Football Game Forum Data Analysis--Simple and Rough Bayesian >> I tried to label the posts automatically, but I was never satisfied with the results. Thinking it over, the choice of algorithm itself was wrong, for two reasons:

    • Forum posts do not fall neatly into PC/PS/Xbox categories
    • Even the labels chosen by the posters themselves may be unreliable

Since there is no easy way to classify the posts, I tried a clustering algorithm to see whether it would turn up anything:

    import codecs

    import numpy as np
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # All post text, already word-segmented, is stored in one file with no prior labels
    f = codecs.open('Forum_all.txt', 'r', 'utf-8')
    words_full = f.readlines()
    f.close()

    true_k = 5  # number of clusters, fixed in advance

    # Build the TF-IDF matrix: one row per post, at most 1000 terms
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000, min_df=2)
    td = vectorizer.fit_transform(words_full)

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=1)
    km.fit(td)
    print(u"Silhouette coefficient: %0.3f"
          % metrics.silhouette_score(td, km.labels_, sample_size=5000))

    # For each cluster, print its size and the 10 highest-weighted terms of its centroid
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names_out()
    sizes = np.bincount(km.labels_, minlength=true_k)
    for i in range(true_k):
        print('Cluster %d: %d posts' % (i, sizes[i]), end=' ')
        for ind in order_centroids[i, :10]:
            print('%s' % terms[ind], end=' ')
        print()

Run results

    Silhouette coefficient: 0.137
    Cluster 0: 1634 posts  Graphics  identify  How to install standalone installation  method  Tutorial  Last  cracked version
    Cluster 1: 4388 posts  Evolution  Soccer  recommended  Pro  Forum  starter  dlc3  download
    Cluster 2: 1677 posts  Summary  resource  dlc6  22  update  pes2014  share  Thank you
    Cluster 3: 7872 posts  WECN  released  formally  pes2016  Patch  v2  Simplified Chinese  v1
    Cluster 4: 11287 posts  pes2014  Troubleshooting  Patch  Update  players  at the  Stadium  share  pes2016
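For readers who have not met the silhouette coefficient reported in these results: for each sample it compares a, the mean distance to the other members of its own cluster, with b, the mean distance to the members of the nearest other cluster, as (b - a) / max(a, b), and then averages over all samples. Values near 1 mean tight, well-separated clusters; values near 0 mean overlapping ones. The plain-Python sketch below follows that standard definition with Euclidean distance. It is only an illustration on made-up 2-D points, not the scikit-learn implementation, and the function name `silhouette` is hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative only)."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by n - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up points: two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels match the two groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print('good labels: %0.3f, bad labels: %0.3f' % (good, bad))
```

On this toy data the matching labels score close to 1 and the mismatched labels score below 0, which gives a sense of scale for a value such as 0.137.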

Judging from these clusters, the forum section I crawled is mainly about:

    • Whether a given computer configuration can run the game, i.e. the PC version
    • Users of the "you know what" version are still numerous; genuine copies still have a long way to go
    • Most of the discussion (clusters 3 and 4) is about the game's various patches. Veterans know this comes down to licensing, and on that point FIFA is unbeatable
    • Somewhat unexpectedly, the 2014 and 2016 editions are the popular ones, while 2015 is barely present
    • I suspect my crawler spent most of its time in the PC section -_-b

P.S. The number 5 started out as an arbitrary value. I settled on it only after testing every cluster count from 3 to 12 and finding that the silhouette coefficient levels off at 5, rising only slightly beyond it.
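A sweep of this kind could be sketched as follows. This is a minimal illustration and not the original test script: the helper name `silhouette_by_k`, the fixed random seed, and the tiny `sample_titles` list (a hypothetical stand-in for the segmented forum text) are all assumptions, and the range of k is shortened to fit the toy data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, k_values, seed=0):
    """Fit K-means once per k and return {k: silhouette coefficient}."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init='k-means++', max_iter=200,
                    n_init=1, random_state=seed)
        labels = km.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    return scores


# Hypothetical sample data; the real input would be the lines of the forum file.
sample_titles = [
    "pes2016 patch v2 released",
    "pes2016 patch v1 simplified chinese",
    "pes2014 patch update players",
    "pes2014 stadium patch share",
    "graphics card not identified how to install",
    "standalone installation tutorial",
    "cracked version installation method",
    "dlc3 download recommended",
    "dlc6 resource summary share",
    "pro evolution soccer forum starter",
    "pes2014 update troubleshooting",
    "resource summary update thank you",
]

td = TfidfVectorizer().fit_transform(sample_titles)
scores = silhouette_by_k(td, range(2, 6))
for k, score in sorted(scores.items()):
    print('k=%d  silhouette=%0.3f' % (k, score))
```

On the full data set the same loop would run over `range(3, 13)`, and the k at which the printed scores stop rising noticeably is the one to keep.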

Finally, here is a scatter plot of the clusters, drawn from the feature vectors after dimensionality reduction.
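The post does not say which reduction method was used for the plot. One common option for sparse TF-IDF matrices is truncated SVD, and the sketch below assumes that choice. The helper name `project_2d`, the cluster count of 3, and the `sample_titles` list are hypothetical and only serve to make the example self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def project_2d(matrix, seed=0):
    """Reduce a (sparse) TF-IDF matrix to two columns with truncated SVD."""
    svd = TruncatedSVD(n_components=2, random_state=seed)
    return svd.fit_transform(matrix)


# Hypothetical sample data standing in for the segmented forum titles.
sample_titles = [
    "pes2016 patch v2 released",
    "pes2016 patch v1 simplified chinese",
    "pes2014 patch update players",
    "pes2014 stadium patch share",
    "graphics card not identified how to install",
    "standalone installation tutorial",
    "cracked version installation method",
    "dlc6 resource summary share",
]

td = TfidfVectorizer().fit_transform(sample_titles)
labels = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(td)
coords = project_2d(td)

# One (x, y) pair per post, tagged with its cluster label.
for (x, y), label in zip(coords, labels):
    print('cluster %d  x=%+.3f  y=%+.3f' % (label, x, y))
```

Passing `coords[:, 0]`, `coords[:, 1]` and `c=labels` to `matplotlib.pyplot.scatter` would then draw one point per post, colored by cluster.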
