A few days ago I picked up the PS4 version of a certain famous football game, the 2017 edition, and started hunting for wonderkids and grinding ML. I have to say that at first Brother Kun's commentary had a certain charm. One month later ... I turned the volume down; the commentary is just too poor.
While hunting for wonderkids, I wondered on a whim whether the data on a certain well-known forum held anything interesting, so off went Scrapy ...
After being banned by the server several times, I pulled down more than 20,000 main posts and more than 300,000 replies into a SQLite database.
[Data Cleansing]
Use XPath to strip the HTML and extract the section, post content, author, time, and so on.
Delete the posts the crawler pulled down from other sections.
This first cleaning step is easy to describe, but it takes a lot of time. What remains after cleaning is as follows
Sqladmin does not support Chinese, and there is nothing I can do about it -_-!
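The XPath extraction step described in this section can be sketched roughly as below. The page markup, the class names and the `data-section` attribute are all made up for illustration, since the post does not show the forum's real HTML. The sketch uses the standard library's `xml.etree.ElementTree`, which only handles a small XPath subset on well-formed markup; real forum pages would more likely go through Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one scraped post page.
PAGE = """
<html><body>
  <div class="post" data-section="football">
    <h1 class="title">[ps4] pes2017 first impressions</h1>
    <span class="author">someuser</span>
    <span class="time">2016-10-01 12:00</span>
    <div class="content">Master League is <b>great</b> this year.</div>
  </div>
</body></html>
"""


def extract_post(page):
    """Pull section, title, author, time and plain-text content from one page."""
    root = ET.fromstring(page.strip())
    post = root.find(".//div[@class='post']")
    content = post.find("./div[@class='content']")
    return {
        "section": post.get("data-section"),
        "title": post.findtext("./h1[@class='title']"),
        "author": post.findtext("./span[@class='author']"),
        "time": post.findtext("./span[@class='time']"),
        # itertext() walks nested tags, so inline markup such as <b> is dropped
        "content": "".join(content.itertext()).strip(),
    }


record = extract_post(PAGE)
print(record)
```

Posts whose `section` is not the one of interest can then be dropped before anything is written to the database.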
[Analysis]
To be honest, when I first got this data I was completely at a loss. When I started with Scrapy I had not thought about the analysis at all, and I had not grabbed that much data either. Never mind, let's see what can be done with it; I can't be bothered to run the crawler again, and it saves electricity.
First, some authors tag the subject of their post with a platform, such as [ps4] or [xbox360], so let's count the posts per platform:
SELECT category, COUNT(*) FROM articles GROUP BY category;
Results
What on earth are the other categories? Deleted them :-(
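For readers who want to try the per-category count without the real database, here is a minimal sketch against an in-memory SQLite table. The two-column `articles` schema and the sample rows are assumptions; the post does not show the actual table layout.

```python
import sqlite3

# In-memory stand-in for expdata2.db; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("post a", "ps4"),
        ("post b", "ps4"),
        ("post c", "xbox360"),
        ("post d", ""),  # author did not pick a category
        ("post e", ""),
    ],
)

# Same query as in the post: one row per category with its post count.
rows = conn.execute(
    "SELECT category, COUNT(*) FROM articles GROUP BY category"
).fetchall()
conn.close()

counts = dict(rows)
print(counts)
```

The empty string shows up as its own group, which is how the large pile of uncategorized posts becomes visible.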
Seeing so many empty categories, I went to the forum to take a look. It turns out most authors are too lazy to pick a topic tag and simply post titles like "pes2017 really good ..."
So let's try to classify the posts whose category is empty.
For document classification there is no getting around word vectors, so jieba word segmentation it is.
import codecs
import sqlite3

import jieba.posseg

conn = sqlite3.connect('expdata2.db')
conn.text_factory = str
rows = conn.execute('SELECT * from articles').fetchall()
conn.close()

rmvlist = ['x', 'y', 'uj']  # remove some useless words (by part-of-speech flag)
f1 = codecs.open('Forum_all.txt', 'w', 'utf-8')
j = 1
for row in rows:
    wlist = ''
    words = row[2] + row[5]
    keys = jieba.posseg.cut(words)
    for k in keys:
        if k.flag not in rmvlist:
            wlist += ' ' + k.word
    # one post per line, tokens separated by spaces
    f1.write(wlist)
    f1.write('\n')
    j += 1
    if j % 1000 == 0:
        print('Write %dK records ...' % (j // 1000))
f1.close()
print('completed.')
Using the category field, I picked 200 PS posts and 180 Xbox-series posts, and hand-picked 200 posts discussing the PC version. After jieba segmentation they were saved as TXT files to serve as the training set. The results are as follows:
The train/test run shows an accuracy of 90+%, which is a bit high. Never mind that for now; let's first run all the data through the classifier and see.
import codecs

import numpy as np
from sklearn.naive_bayes import GaussianNB


# loadDataSet and setOfWords2Vec are helper functions defined elsewhere
# (not shown in this post).
def classnb_txt():
    C, X, Y = loadDataSet()
    print('Building trainning matrix ....')
    trainmat2d = []
    for postinDoc in X:
        ary = np.array(setOfWords2Vec(C, postinDoc))
        trainmat2d.append(ary)
    print('Building trainning Matrix completed')
    clf = GaussianNB().fit(trainmat2d, Y)

    f = codecs.open('Forum_all.txt', 'r', 'utf-8')
    # fw = codecs.open('Forum_all_result.txt', 'w')
    lines = f.readlines()
    f.close()

    thisdoc = []
    for line in lines:
        # strip the trailing newline so the last token matches the vocabulary
        ar = line.strip().split(' ')
        thisdoc.append(setOfWords2Vec(C, ar))
    arydoc = np.array(thisdoc)
    r = clf.predict(arydoc)
    print('completed.')

    L = list(r)
    # ary = [[x] for x in L]
    # print(ary)
    print('XB:%d' % (L.count(1)))
    print('PS:%d' % (L.count(2)))
    print('PC:%d' % (L.count(3)))
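The classifier code relies on a helper that turns a list of tokens into a vector, but the post does not show it. A minimal sketch of what such a set-of-words encoding usually looks like is given below; the function names, the vocabulary construction and the sample tokens are my own assumptions, not the author's code.

```python
def create_vocab_list(docs):
    """Collect every distinct token seen in the training documents."""
    vocab = set()
    for doc in docs:
        vocab.update(doc)
    return sorted(vocab)  # sorted so positions are reproducible


def set_of_words_to_vec(vocab, words):
    """Return a 0/1 vector: 1 where the vocabulary word occurs in `words`."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for word in words:
        if word in index:  # tokens outside the vocabulary are ignored
            vec[index[word]] = 1
    return vec


# Hypothetical, already segmented training posts.
docs = [["ps4", "controller"], ["xbox", "controller"], ["pc", "patch"]]
vocab = create_vocab_list(docs)
vec = set_of_words_to_vec(vocab, ["pc", "patch", "unknown"])
print(vocab)
print(vec)
```

Every post becomes a vector of the same length, which is what allows the training posts and the unlabeled posts to be stacked into matrices for `fit` and `predict`.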
Final results
Building trainning Matrix ....
Building trainning Matrix completed
completed.
XB:7223
PS:1943
PC:17692
Process finished with exit code 0
The result is quite unexpected: the ratio of PS / Xbox / PC topic posts comes out close to 1:4:9.
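The 1:4:9 figure can be checked directly from the three counts printed above. This is just the arithmetic, normalized to the PS count, plus each platform's share of all classified posts.

```python
# Predicted post counts taken from the classifier output above.
counts = {"PS": 1943, "XB": 7223, "PC": 17692}

base = counts["PS"]
total = sum(counts.values())

# Ratio relative to PS, and percentage share of all classified posts.
ratios = {name: n / base for name, n in counts.items()}
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name in ("PS", "XB", "PC"):
    print("%s: ratio %.2f, share %.1f%%" % (name, ratios[name], shares[name]))
```

Rounded to whole numbers, the ratios are 1, 4 and 9, so the PC class takes roughly two thirds of all posts.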
If the result is reasonable, there are two possible explanations:
- In the previous console generation's war, the Xbox 360 was the winner. And the key point is that it could be cracked
- The PC version is not as good as the console versions, but it is cheap and has many users. And the key point is that it can be cracked ⊙▂⊙
It may also be unreasonable:
- The training data was not prepared accurately, and I neither filtered keywords nor removed stopwords
- A large share of the forum's users only reply to threads, and I only considered the main posts, not the replies
- The data is too fragmented
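The first point in the list above, the missing stopword filtering, could be addressed with a small filter applied to each post's tokens before they are written out. The sketch below assumes a tiny hand-made stopword set purely for illustration; a real run would load a proper Chinese stopword list from a file.

```python
# Hypothetical stopword set; far too small for real use.
STOPWORDS = {"的", "了", "是", "我", "也"}


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


tokens = ["这", "游戏", "的", "手感", " ", "是", "真", "好"]
kept = filter_tokens(tokens)
print(kept)
```

This would complement, not replace, the part-of-speech filter already used during segmentation.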
In summary, these numbers can only be read as statistics for this one forum section.
Football Game Forum Data Analysis: Simple, Crude Naive Bayes