Using a Bayesian Classifier for Text Mining: Notes


1. Run the tokenizer to perform word segmentation.

~/data $ hadoop jar mrtokenize.jar tokenize.tokenizedriver /home/GRID/data/lesson8 /home/GRID/output/sportwords

14/08/31 21:59:33 INFO input.FileInputFormat: Total input paths to process : 10205

.....

14/08/31 22:05:25 INFO mapred.JobClient: Map output records=10205

In total, 10205 files were processed in about 6 minutes.



Result after word segmentation (each line is a category followed by the document's tokens):

Badminton Uber Cup China Thailand advance finals Wang yihan smart Back To The Ball month, Chinese team player Wang yihan back to the ball in the game she finally beat the Thai team player hand delivery noo day Wuhan Sports Center center Center china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports

Badminton Uber Cup China Thailand advance finals Wang yihan game serving Month day Chinese team player Wang yihan game serving her final victory Thai team player hand delivery noo day Wuhan Sports Center body china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports


2. Load the segmented data set

grunt> processed = load '/home/GRID/output/sportwords' as (category:chararray, doc:chararray);


3. Randomly sample 20% of the data as the test set

grunt> test = sample processed 0.2;


4. Generate the training set (all rows that are not in the test set)

grunt> jnt = join processed by (category, doc) left outer, test by (category, doc);

grunt> filt_test = filter jnt by test::category is null;

grunt> train = foreach filt_test generate processed::category as category, processed::doc as doc;
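
The sampling in step 3 and the anti-join in step 4 can be sketched outside Pig as follows. This is a minimal Python illustration, not part of the original workflow: the function name `split_train_test` and the fixed seed are assumptions made for the example. Like Pig's `sample`, it keeps each row with the given probability, so the test set is only approximately 20% of the input.

```python
import random


def split_train_test(rows, fraction=0.2, seed=0):
    """Split (category, doc) rows into a training list and a test list.

    Each row goes to the test set with probability `fraction`.
    The training set is every row whose (category, doc) pair does not
    appear in the test set, which is what the left outer join plus the
    `is null` filter achieves in Pig.
    """
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    test = [row for row in rows if rng.random() < fraction]
    test_keys = set(test)
    train = [row for row in rows if row not in test_keys]
    return train, test


if __name__ == "__main__":
    # Hypothetical toy rows standing in for the segmented documents.
    rows = [("cat%d" % (i % 3), "doc %d" % i) for i in range(1000)]
    train, test = split_train_test(rows)
    print(len(train), len(test))
```

Because the match is on the full (category, doc) pair, a document that occurs more than once and is sampled into the test set is removed from the training set entirely, in this sketch as in the Pig join.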


5. Export the test set and the training set

grunt> store test into '/home/GRID/data/lesson8.2/test';

grunt> store train into '/home/GRID/data/lesson8.2/train';


6. Count the test-set documents per category

grunt> test_ct = foreach (group test by category) generate group, COUNT(test.category);

grunt> dump test_ct;


Result:

(F1, 196)
(golf, 206)
(swim, 170)
(tennis, 187)
(football, 209)
(pingpong, 220)
(badminton, 212)
(billiards, 236)
(basketball, 201)
(volleyball, 204)
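
As a quick sanity check on these counts, the sketch below (Python, an illustration outside the original Pig session) sums the last number of each tuple and compares the total with the 10205 input files reported in step 1. The helper name `count_by_category` is an assumption; it mirrors what grouping by category and counting does.

```python
from collections import Counter


def count_by_category(rows):
    """Count (category, doc) rows per category, like group-by plus a count."""
    return Counter(category for category, _doc in rows)


# Per-category test-set counts as listed in the result above.
test_counts = {
    "F1": 196, "golf": 206, "swim": 170, "tennis": 187, "football": 209,
    "pingpong": 220, "badminton": 212, "billiards": 236,
    "basketball": 201, "volleyball": 204,
}

total_test = sum(test_counts.values())
test_share = total_test / 10205  # 10205 input files, from step 1
print(total_test, test_share)
```

The total is 2041 documents, which is 20% of 10205 and so matches the sampling rate used in step 3.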



7. Count the training-set documents per category

grunt> train_ct = foreach (group train by category) generate group, COUNT(train.category);

grunt> dump train_ct;


8. Train the model with the Naive Bayes classifier


~/data $ mahout trainclassifier \
> -i /home/GRID/data/lesson8.2/train \
> -o /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs


Result:

14/08/31 23:31:15 INFO mapred.JobClient: Map output records=228211
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-docCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-termDocCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-featureCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-wordFreq
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-tfIdf/trainer-vocabCount
14/08/31 23:31:15 INFO driver.MahoutDriver: Program took 1265118 ms (Minutes: 21.0853)
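
To make the training step less of a black box, here is a minimal multinomial Naive Bayes sketch in Python. It is not Mahout's implementation (the TF-IDF intermediate directory in the log suggests Mahout applies TF-IDF weighting, which this sketch omits). The function names `train_nb` and `classify_nb`, the Laplace smoothing parameter `alpha`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows):
    """Collect counts from (category, doc) rows; doc is space-separated tokens."""
    doc_counts = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # token frequencies per category
    vocab = set()
    for category, doc in rows:
        tokens = doc.split()
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify_nb(model, doc, alpha=1.0):
    """Return the category with the highest log posterior for doc."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for token in doc.split():
            if token not in vocab:
                continue  # ignore tokens never seen during training
            # Laplace-smoothed log likelihood of the token given the category
            score += math.log(
                (word_counts[category][token] + alpha)
                / (total_words + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = category, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy training rows, not data from the article.
    toy_rows = [
        ("tennis", "serve ace net"),
        ("tennis", "racket serve"),
        ("golf", "putt green hole"),
        ("golf", "club green"),
    ]
    model = train_nb(toy_rows)
    print(classify_nb(model, "serve net"))
    print(classify_nb(model, "green putt"))
```

The `-ng 1` option in the command above corresponds to using single tokens (unigrams) as features, which is also what this sketch does.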



9. Test the model


~/data $ mahout testclassifier \
> -d /home/GRID/data/lesson8.2/test \
> -m /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs \
> -method mapreduce


14/08/31 23:45:52 INFO bayes.BayesClassifierDriver: =======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j <--Classified as
190 0 1 1 0 3 0 0 0 0 | 195 a = basketball
0 249 0 0 0 1 0 0 0 | 250 b = billiards
0 0 198 0 0 1 0 0 0 0 | 199 c = badminton
0 0 0 224 0 0 0 0 0 0 | 224 d = football
0 0 0 0 190 0 0 0 0 0 | 190 e = volleyball
0 0 0 0 0 181 0 0 0 0 | 181 f = swim
0 1 0 0 0 204 0 0 0 | 205 g = pingpong
1 0 0 0 0 0 193 0 0 | 194 h = golf
0 0 0 0 0 0 0 0 196 0 | 196 i = F1
0 0 0 0 0 1 0 206 | 207 j = tennis
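
Overall accuracy can be read off the matrix by summing the diagonal and dividing by the sum of the row totals. The Python sketch below does this arithmetic with the diagonal values and row totals shown above; the variable names are illustrative.

```python
# (category, correctly classified, row total) taken from the confusion matrix.
matrix_rows = [
    ("basketball", 190, 195),
    ("billiards", 249, 250),
    ("badminton", 198, 199),
    ("football", 224, 224),
    ("volleyball", 190, 190),
    ("swim", 181, 181),
    ("pingpong", 204, 205),
    ("golf", 193, 194),
    ("F1", 196, 196),
    ("tennis", 206, 207),
]

correct = sum(hit for _name, hit, _total in matrix_rows)
total = sum(row_total for _name, _hit, row_total in matrix_rows)
accuracy = correct / total

print(correct, total, round(accuracy * 100, 2))
```

That is 2031 of 2041 test documents classified correctly, about 99.51%.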



10. Segment the user browsing records, as in step 1

~/data $ hadoop jar mrtokenize.jar tokenize.tokenizedriver /home/GRID/data/les8-usersport /home/GRID/lesson8/output/userwords



11. Use the model trained on the sports data to classify the user browsing content

~/data $ hadoop jar mrclassify.jar classifier.classifierdriver \
> /home/GRID/lesson8/output/userwords \
> /home/GRID/lesson8/output/classify \
> /home/GRID/output/model-bayes8.2 \
> bayes



Result:

~/data $ hadoop fs -cat /home/GRID/lesson8/output/classify/part-r-00000 | head -20

Warning: $HADOOP_HOME is deprecated.


10511838 | badminton | 7

10511838 | basketball | 5

10511838 | billiards | 8

10511838 | F1 | 7

10511838 | football | 11

10511838 | golf | 5

10511838 | pingpong | 5

10511838 | tennis | 2

10511838 | volleyball | 12

10564290 | badminton | 2

10564290 | basketball | 12

10564290 | billiards | 11

10564290 | F1 | 12

10564290 | football | 16

10564290 | golf | 1

10564290 | pingpong | 18

10564290 | swim | 6

10564290 | tennis | 3

10564290 | volleyball | 7

107879 | basketball | 7
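
Each output line has the form `user ID | category | count`. Assuming the third field is the number of that user's browsing records assigned to the category (the source does not say so explicitly), the dominant interest per user can be extracted with a small script. This Python sketch and its function name `top_categories` are illustrative, not part of the original job.

```python
from collections import defaultdict


def top_categories(lines):
    """Map each user ID to the category with the highest count."""
    per_user = defaultdict(dict)
    for line in lines:
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # skip blank lines and warnings
        user, category, count = parts
        per_user[user][category] = int(count)
    return {user: max(counts, key=counts.get) for user, counts in per_user.items()}


if __name__ == "__main__":
    # A few lines copied from the output above.
    sample_lines = [
        "10511838 | football | 11",
        "10511838 | tennis | 2",
        "10511838 | volleyball | 12",
        "10564290 | football | 16",
        "10564290 | pingpong | 18",
        "10564290 | swim | 6",
    ]
    print(top_categories(sample_lines))
```

For the two complete users in the listing, this picks volleyball for 10511838 and pingpong for 10564290.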



This article is from the "Wandering footsteps" blog; please keep this source link: http://now51jq.blog.51cto.com/3474143/1547434
