Football Game Forum Data Analysis: Simple, Crude Naive Bayes

Source: Internet
Author: User

A few days ago I picked up the 2017 edition of a well-known football game for PS4 and have been grinding ML mode, hunting for promising young players ("little demons"). I have to say that at first Brother Kun's commentary felt quite good. One month later ... I turned the volume down; the commentary is just too poor.

While hunting for those young players, I wondered on a whim whether the data from a well-known forum held anything special, so off went Scrapy ...

After getting banned by the server several times, I pulled down more than 20,000 main posts and more than 300,000 replies into a SQLite database.

[Data Cleansing]

Use XPath to clean up the HTML and extract the section, post content, author, time, and so on.
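The post does not show the XPath expressions that were used, so the following is only a minimal sketch of the idea. It relies on the limited XPath support in Python's standard library and on a well-formed fragment; the markup, class names, and the `extract_post` helper are all hypothetical stand-ins for the real forum pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one saved thread page.
SAMPLE_PAGE = """
<html><body>
  <div class="thread">
    <span class="section">PES</span>
    <span class="author">user123</span>
    <span class="time">2016-10-01 12:30</span>
    <div class="content">[ps4] Any tips for ML mode?</div>
  </div>
</body></html>
"""

def extract_post(page):
    """Pull the fields of interest out of one page with XPath expressions."""
    root = ET.fromstring(page)
    thread = root.find(".//div[@class='thread']")

    def text_of(path):
        # Return the stripped text of the first match, or "" if missing.
        node = thread.find(path)
        return node.text.strip() if node is not None and node.text else ""

    return {
        "section": text_of("./span[@class='section']"),
        "author": text_of("./span[@class='author']"),
        "time": text_of("./span[@class='time']"),
        "content": text_of("./div[@class='content']"),
    }

post = extract_post(SAMPLE_PAGE)
print(post)
```

Real forum HTML is rarely well-formed XML, so in practice a tolerant parser (for example the selectors that ship with Scrapy) would probably be the better fit.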

Delete the posts from other sections that the crawler also pulled down.

This first cleaning step is easy to describe but takes a lot of time. What remains afterwards is as follows:

Sqladmin does not support Chinese, and there is nothing I can do about that -_-!

[Analysis]

To be honest, when I first got this data I was completely lost. When I started with Scrapy I had not thought about what to analyze at all, and I did not capture much data either. Never mind, let's see what can be done with it; I can't be bothered to run the crawler again, and it saves electricity.

First, I noticed that some authors tag their posts with a platform, such as [ps4] or [xbox360], so let's count the posts per platform:

    SELECT category, COUNT(*) FROM articles GROUP BY category;

Results

What on earth are the others? Deleted them :-(
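The per-category count can be tried out on a toy table with Python's built-in sqlite3 module. Only the `articles` table and its `category` column come from the post; the other columns and the rows are made up for illustration.

```python
import sqlite3

# In-memory database with a simplified, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO articles (title, category) VALUES (?, ?)",
    [("a", "ps4"), ("b", "ps4"), ("c", "xbox360"), ("d", ""), ("e", ""), ("f", "")],
)

# Same aggregation as in the post, sorted so the largest group comes first.
counts = conn.execute(
    "SELECT category, COUNT(*) FROM articles "
    "GROUP BY category ORDER BY COUNT(*) DESC"
).fetchall()
conn.close()

for category, n in counts:
    print(repr(category), n)
```

In this toy data the empty category is the largest group, which mirrors the situation the post runs into.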

Seeing the large number of empty categories, I went to the forum to take a look. It turns out that most authors are too lazy to choose a topic and just post things like "pes2017 really good ***".

So let's try to classify the posts whose category is empty.

For document classification there is no getting around word vectors, so first segment the text with jieba.

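    import codecs
    import sqlite3

    import jieba.posseg

    # Load every post from the SQLite database.
    conn = sqlite3.connect('expdata2.db')
    conn.text_factory = str
    rows = conn.execute('SELECT * from articles').fetchall()
    conn.close()

    rmvlist = ['x', 'y', 'uj']  # part-of-speech flags of useless words to drop
    f1 = codecs.open('Forum_all.txt', 'w', 'utf-8')
    j = 1
    for row in rows:
        wlist = ''
        words = row[2] + row[5]
        # Segment the text and keep only tokens whose flag is not filtered out.
        keys = jieba.posseg.cut(words)
        for k in keys:
            if k.flag not in rmvlist:
                wlist += ' ' + k.word
        # One post per line, tokens separated by spaces.
        f1.write(wlist)
        f1.write('\n')
        j += 1
        if j % 1000 == 0:
            print('Write %dK records ...' % (j // 1000))
    f1.close()
    print('completed.')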

Based on category, I picked 200 PS posts and 180 Xbox-series posts, and hand-picked 200 posts discussing the PC version. After jieba segmentation they were saved as TXT files to serve as the training set. The results are as follows:

The training/test run reports 90+% accuracy, which seems a bit high. Never mind that for now; let's first run all the data through the classifier and see.
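The post does not say how the accuracy was measured. One common way is a hold-out check, sketched below with a tiny hand-rolled multinomial naive Bayes (add-one smoothing) in place of scikit-learn's GaussianNB, so that the sketch has no dependencies. The toy posts and all function names are hypothetical; only the label coding (1 = XB, 2 = PS, 3 = PC) follows the post.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    return vocab, class_docs, word_counts

def predict_nb(model, doc):
    """Return the class with the highest log posterior for one token list."""
    vocab, class_docs, word_counts = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in doc:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def holdout_accuracy(docs, labels, test_fraction=0.25, seed=0):
    """Train on a random part of the data and score on the held-out rest."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    model = train_nb([docs[i] for i in train], [labels[i] for i in train])
    hits = sum(predict_nb(model, docs[i]) == labels[i] for i in test)
    return hits / n_test

# Toy, made-up tokenized posts.
docs = [["xbox", "gamepad"], ["xbox", "360"], ["xbox", "live"], ["xbox", "one"],
        ["ps4", "gamepad"], ["ps4", "pro"], ["ps4", "disc"], ["ps4", "store"],
        ["pc", "patch"], ["pc", "gpu"], ["pc", "steam"], ["pc", "keyboard"]]
labels = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

accuracy = holdout_accuracy(docs, labels)
print("hold-out accuracy: %.2f" % accuracy)
```

If the accuracy was instead measured on the same posts used for training, a very high figure would be expected and would say little about how well the classifier handles untagged posts.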

    import codecs

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # loaddataset and setofwords2vec are the author's helpers (not shown in this post).

    def classnb_txt():
        c, x, y = loaddataset()
        print('Building trainning matrix ....')
        trainmat2d = []
        for postindoc in x:
            ary = np.array(setofwords2vec(c, postindoc))
            trainmat2d.append(ary)
        print('Building trainning Matrix completed')
        clf = GaussianNB().fit(trainmat2d, y)

        # Vectorize every segmented post (one post per line) and classify it.
        f = codecs.open('Forum_all.txt', 'r', 'utf-8')
        lines = f.readlines()
        f.close()
        thisdoc = []
        for line in lines:
            ar = line.split(' ')
            thisdoc.append(setofwords2vec(c, ar))
        arydoc = np.array(thisdoc)
        r = clf.predict(arydoc)
        print('completed.')

        # Count the predictions per class label.
        l = list(r)
        print('XB:%d' % (l.count(1)))
        print('PS:%d' % (l.count(2)))
        print('PC:%d' % (l.count(3)))
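The helper that turns a token list into a vector is not shown in the post. A set-of-words vectorizer is usually written along the following lines; this is a sketch of the general technique with hypothetical names (`build_vocab`, `set_of_words_vec`), not the author's code.

```python
def build_vocab(token_lists):
    """Collect every distinct token into a fixed-order vocabulary list."""
    vocab = set()
    for tokens in token_lists:
        vocab.update(tokens)
    return sorted(vocab)

def set_of_words_vec(vocab, tokens):
    """Return a 0/1 vector: 1 where the vocabulary word occurs in tokens."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0] * len(vocab)
    for token in tokens:
        if token in index:  # tokens outside the vocabulary are ignored
            vec[index[token]] = 1
    return vec

# Made-up training posts, already segmented into tokens.
training_posts = [["ps4", "pad"], ["xbox", "pad"], ["pc", "patch"]]
vocab = build_vocab(training_posts)
vector = set_of_words_vec(vocab, ["pc", "pad", "pad", "unknown"])
print(vocab)
print(vector)
```

Because each entry only records presence or absence, repeated words count once. GaussianNB models features as continuous values, so for 0/1 vectors like these a Bernoulli or multinomial naive Bayes variant may be a more natural choice.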

Final results

    Building trainning Matrix ....
    Building trainning Matrix completed
    completed.
    XB:7223
    PS:1943
    PC:17692
    Process finished with exit code 0

The result is rather unexpected: the ratio of posts across the three main platforms, PS/Xbox/PC, is actually close to 1:4:9.
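The ratio can be checked directly from the three counts printed by the classifier. The short calculation below uses only those numbers.

```python
# Predicted post counts taken from the classifier output.
counts = {"PS": 1943, "XB": 7223, "PC": 17692}

total = sum(counts.values())
base = counts["PS"]  # express everything relative to the PS count

for name in ("PS", "XB", "PC"):
    n = counts[name]
    print("%s: %5d posts, %.1f%% of total, %.1f x PS"
          % (name, n, 100.0 * n / total, n / base))

ratios = [round(counts[k] / base, 1) for k in ("PS", "XB", "PC")]
print(ratios)
```

The rounded ratios come out at 1.0 : 3.7 : 9.1, which is where the approximate 1:4:9 comes from.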

If the result is reasonable, there are two likely reasons:

    • In the previous generation's console war, the Xbox 360 was the winner. And the key point is that it was cracked
    • Although the PC version is not as good as the console version, it is cheap and has many users. And the key point is that it is cracked too ⊙▂⊙

It may just as well be unreasonable:

    • The training data was not prepared accurately, and I neither filtered keywords nor removed stopwords
    • A large share of the forum's users only reply to threads, and I only considered the main posts, not the replies
    • The data is too fragmented
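The missing stopword filtering mentioned in the list above could be added as a small step between segmentation and vectorization. The sketch below assumes one space-separated post per line, as in the segmented file; the stopword set and the `filter_tokens` name are hypothetical, and a real list would be much longer.

```python
# Hypothetical stopword list; a real one would contain far more entries.
STOPWORDS = {"的", "了", "是", "我", "a", "the"}

def filter_tokens(line, stopwords=STOPWORDS):
    """Drop stopwords from one space-separated line of segmented text."""
    return [token for token in line.split() if token not in stopwords]

tokens = filter_tokens("我 的 ps4 手柄 坏 了")
print(tokens)
```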

In summary, these numbers can only be taken as statistics for this one forum section.
