In << Football Game Forum Data Analysis: Simple and Crude Bayes >> I tried to label the posts, but the results never felt acceptable. Thinking it over, the choice of algorithm was wrong, for two reasons:
- Classifying forum posts is not as simple as pc/ps/xbox
- Even the labels chosen by the posts' own authors may be unreliable
Since there is no easy way to categorize the posts, I tried a clustering algorithm to see whether it would turn up anything:
    import codecs

    import numpy as np
    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # All posts, already word-segmented, are stored in one file with no prior categorization
    with codecs.open('Forum_all.txt', 'r', 'utf-8') as f:
        words_full = f.readlines()

    true_k = 5  # number of clusters, fixed in advance

    # TfidfVectorizer already produces TF-IDF weights, so no separate TfidfTransformer pass is needed
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000, min_df=2)
    td = vectorizer.fit_transform(words_full)

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=1)
    km.fit(td)
    print("Silhouette coefficient: %0.3f"
          % metrics.silhouette_score(td, km.labels_, sample_size=5000))

    # Sort each centroid's term weights in descending order
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names_out()
    sizes = np.bincount(km.labels_, minlength=true_k)
    for i in range(true_k):
        # Output the cluster header, then its 10 highest-weighted feature words
        print("Cluster %d: %d posts" % (i, sizes[i]), end=' ')
        for ind in order_centroids[i, :10]:
            print(terms[ind], end=' ')
        print()
Run results
    Silhouette coefficient: 0.137
    Cluster 0: 1634 posts Graphics identify How to install standalone installation method Tutorial Last cracked version
    Cluster 1: 4388 posts Evolution Soccer recommended Pro Forum starter dlc3 download
    Cluster 2: 1677 posts Summary resource dlc6 22 update pes2014 share Thank you
    Cluster 3: 7872 posts WECN released formally pes2016 Patch v2 Simplified Chinese v1
    Cluster 4: 11287 posts pes2014 Troubleshooting Patch Update players at the Stadium share pes2016
Judging from these clusters, the forum section I crawled is mainly about:
- Whether a given computer configuration can run the game, that is, the PC version
- The "you know what" (cracked) versions still have plenty of users; getting everyone onto the genuine game has a long way to go
- The majority of the discussion (clusters 3 and 4) is about the game's various patches. Veterans know this comes down to licensing, and on that point FIFA is unbeatable
- Quite unexpectedly, 2014 and 2016 are the popular versions, while 2015 has almost no presence
- I'm a little suspicious that my crawler spent most of its time in the PC section -_-b
PS: there is nothing special about the number 5. I settled on it only after testing cluster counts from 3 to 12 and finding that the silhouette coefficient starts to level off at 5, rising only slightly after that.
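The sweep over cluster counts could look roughly like the sketch below. It is only an illustration: the helper `silhouette_by_k` is a hypothetical name, and the synthetic `make_blobs` data stands in for the TF-IDF matrix `td`, so the scores it prints say nothing about the forum data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=0):
    """Fit K-means once per k and return {k: silhouette coefficient}."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, init='k-means++', max_iter=200,
                    n_init=1, random_state=random_state)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Stand-in data (assumption): 300 synthetic points around 5 centres.
# With the forum data, X would be the TF-IDF matrix `td`.
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

scores = silhouette_by_k(X, range(3, 13))
for k, score in scores.items():
    print("k=%2d  silhouette=%.3f" % (k, score))
```

On a matrix as large as the forum data, passing `sample_size=5000` to `silhouette_score`, as the main script does, keeps each evaluation affordable.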
Finally, here is a scatter plot of the clusters, drawn from the feature vectors after dimensionality reduction.
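The post does not say which dimensionality reduction was used for the plot. As one possible approach, the sketch below assumes `TruncatedSVD`, which works directly on sparse TF-IDF matrices, and projects the vectors to two components. The tiny corpus and the output file name `clusters.png` are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (assumption); in the post this would be words_full.
docs = [
    "patch update download", "patch release download",
    "graphics card install tutorial", "install tutorial graphics",
    "stadium players update", "players stadium share",
]
td = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(td)

# Project the sparse TF-IDF vectors onto 2 components for plotting.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(td)

# One point per post, coloured by its cluster label.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=20)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("K-means clusters after dimensionality reduction")
plt.savefig("clusters.png", dpi=150)
```

Two components usually keep only a small share of the variance in TF-IDF data, so clusters that overlap in such a plot may still be well separated in the full feature space.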
Football Game Forum Data Analysis: Simple and Crude K-means Clustering