Using a Bayesian Classifier for Text Mining: Notes

Last Update:2014-09-01 Source: Internet

Author: User

Tags hadoop fs


1. Run the tokenizer to perform word segmentation.

$ hadoop jar mrtokenize.jar tokenize.TokenizeDriver /home/GRID/data/lesson8 /home/GRID/output/sportwords

14/08/31 21:59:33 INFO input.FileInputFormat: Total input paths to process : 10205

.....

14/08/31 22:05:25 INFO mapred.JobClient: Map output records=10205

In total, 10205 files were processed in about 6 minutes.

Result after word segmentation:

Badminton Uber Cup China Thailand advance finals Wang yihan smart Back To The Ball month, Chinese team player Wang yihan back to the ball in the game she finally beat the Thai team player hand delivery noo day Wuhan Sports Center center Center china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports

Badminton Uber Cup China Thailand advance finals Wang yihan game serving Month day Chinese team player Wang yihan game serving her final victory Thai team player hand delivery noo day Wuhan Sports Center body china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports

2. Load the segmented documents into Pig

grunt> processed = load '/home/GRID/output/sportwords' as (category:chararray, doc:chararray);
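
Pig's default loader (PigStorage) splits each line on tab characters, so the load statement assumes every line of the tokenizer output is a category, a tab, and then the space-separated tokens. A minimal Python sketch of that parsing follows; the sample line is made up, and the exact tokenizer output format is an assumption rather than something the source documents.

```python
def parse_line(line):
    """Split one tokenizer output line into (category, tokens).

    Assumes the format 'category<TAB>token token ...', which mirrors what a
    tab-delimited loader would see; it is not the tokenizer's documented format.
    """
    category, _, doc = line.rstrip("\n").partition("\t")
    return category, doc.split()


# Hypothetical sample line, not taken from the real data set.
sample = "badminton\tUber Cup China Thailand final\n"
category, tokens = parse_line(sample)
print(category, tokens)
```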

3. Randomly sample 20% of the documents as the test set

grunt> test = sample processed 0.2;

4. Generate the training set (the documents not in the test set)

grunt> jnt = join processed by (category, doc) left outer, test by (category, doc);

grunt> filt_test = filter jnt by test::category is null;

grunt> train = foreach filt_test generate processed::category as category, processed::doc as doc;
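
The left outer join followed by the null filter is an anti-join: it keeps only the rows of processed that have no match in test. A minimal Python sketch of the same split on in-memory (category, doc) pairs is shown here; the function name, the seed, and the toy data are illustrative assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Randomly split (category, doc) rows into train and test lists.

    Each row goes to the test set with probability test_fraction, so the
    test size is only approximately 20%. The train set is every row whose
    (category, doc) key is absent from the test set, which is what the
    join plus 'is null' filter computes.
    """
    rng = random.Random(seed)
    test = [row for row in rows if rng.random() < test_fraction]
    test_keys = set(test)
    train = [row for row in rows if row not in test_keys]
    return train, test


# Toy data standing in for the segmented documents.
rows = [("cat%d" % (i % 3), "doc %d" % i) for i in range(1000)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

As with the join, a row that is duplicated in the input is dropped from the training set whenever any copy of it lands in the test set.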

5. Export the test set and the training set

grunt> store test into '/home/GRID/data/lesson8.2/test';

grunt> store train into '/home/GRID/data/lesson8.2/train';

6. Count the test set documents per category

grunt> test_ct = foreach (group test by category) generate group, COUNT(test.category);

grunt> dump test_ct;

Result:

(F1,196)

(Golf,206)

(Swim,170)

(Tennis,187)

(Football,209)

(Pingpong,220)

(Badminton,212)

(Billiards,236)

(Basketball,201)

(Volleyball,204)
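
As a quick sanity check, the ten counts should add up to roughly 20% of the 10205 input documents. A small Python sketch using the counts listed above:

```python
# Per-category test set counts from the dump in step 6.
test_counts = {
    "F1": 196, "Golf": 206, "Swim": 170, "Tennis": 187, "Football": 209,
    "Pingpong": 220, "Badminton": 212, "Billiards": 236,
    "Basketball": 201, "Volleyball": 204,
}

total_docs = 10205  # files processed in step 1
test_total = sum(test_counts.values())
print(test_total, round(test_total / total_docs, 3))  # 2041 0.2
```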

7. Count the training set documents per category

grunt> train_ct = foreach (group train by category) generate group, COUNT(train.category);

grunt> dump train_ct;

8. Train the model with the Naive Bayes classifier

$ mahout trainclassifier \
> -i /home/GRID/data/lesson8.2/train \
> -o /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs

Result:

14/08/31 23:31:15 INFO mapred.JobClient: Map output records=228211

14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-docCount

14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-termDocCount

14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-featureCount

14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-wordFreq

14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-tfIdf/trainer-vocabCount

14/08/31 23:31:15 INFO driver.MahoutDriver: Program took 1265118 ms (Minutes: 21.0853)
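
To make the training step less of a black box, here is a minimal sketch of a multinomial Naive Bayes classifier in Python. It is not Mahout's implementation (the TF-IDF trainer directory in the log suggests Mahout also applies TF-IDF weighting); it is plain word counting with add-one smoothing on single tokens, which corresponds to a gram size of 1. All names and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (category, tokens). Returns the model counts."""
    doc_counts = Counter()               # documents per category
    word_counts = defaultdict(Counter)   # token counts per category
    vocab = set()
    for category, tokens in docs:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(model, tokens):
    """Return the category with the highest log posterior."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Toy training data, not the real corpus.
docs = [
    ("basketball", ["dunk", "rebound", "court"]),
    ("basketball", ["court", "playoff", "dunk"]),
    ("tennis", ["serve", "racket", "court"]),
    ("tennis", ["serve", "ace", "set"]),
]
model = train(docs)
print(classify(model, ["dunk", "court"]))
print(classify(model, ["serve", "ace"]))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents contain hundreds of tokens.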

9. Test the model

$ mahout testclassifier \
> -d /home/GRID/data/lesson8.2/test \
> -m /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs \
> -method mapreduce

14/08/31 23:45:52 INFO bayes.BayesClassifierDriver: =======================================================

Confusion Matrix

-------------------------------------------------------

a b c d e f g h i j <--Classified as
190 0 1 1 0 3 0 0 0 0 | 195 a = basketball
0 249 0 0 0 1 0 0 0 | 250 b = billiards
0 0 198 0 0 1 0 0 0 0 | 199 c = badminton
0 0 0 224 0 0 0 0 0 0 | 224 d = football
0 0 0 0 190 0 0 0 0 0 | 190 e = volleyball
0 0 0 0 181 0 0 0 0 | 181 f = swim
0 1 0 0 0 204 0 0 0 | 205 g = pingpong
1 0 0 0 0 0 193 0 0 | 194 h = golf
0 0 0 0 0 0 0 196 0 | 196 i = F1
0 0 0 0 0 1 0 206 | 207 j = tennis
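
The overall accuracy follows from the matrix: the large value in each row is the number of correctly classified documents for that category, and the number after each | is the row total. A short Python sketch using those two values per row:

```python
# (correctly classified, row total) per category, read from the matrix.
results = {
    "basketball": (190, 195), "billiards": (249, 250),
    "badminton": (198, 199), "football": (224, 224),
    "volleyball": (190, 190), "swim": (181, 181),
    "pingpong": (204, 205), "golf": (193, 194),
    "F1": (196, 196), "tennis": (206, 207),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())
print(correct, total, round(100 * correct / total, 2))  # 2031 2041 99.51
```

That is 10 misclassified documents out of 2041, or about 99.5% accuracy on this test set.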

10. Segment the users' browsing records, as in step 1

$ hadoop jar mrtokenize.jar tokenize.TokenizeDriver /home/GRID/data/les8-usersport /home/GRID/lesson8/output/userwords

11. Use the model trained on the sports corpus to classify the users' browsing content

$ hadoop jar mrclassify.jar classifier.ClassifierDriver \
> /home/GRID/lesson8/output/userwords \
> /home/GRID/lesson8/output/classify \
> /home/GRID/output/model-bayes8.2 \
> bayes

Result:

$ hadoop fs -cat /home/GRID/lesson8/output/classify/part-r-00000 | head -20

Warning: $HADOOP_HOME is deprecated.

10511838 | badminton | 7

10511838 | basketball | 5

10511838 | billiards | 8

10511838 | F1 | 7

10511838 | football | 11

10511838 | golf | 5

10511838 | pingpong | 5

10511838 | tennis | 2

10511838 | volleyball | 12

10564290 | badminton | 2

10564290 | basketball | 12

10564290 | billiards | 11

10564290 | F1 | 12

10564290 | football | 16

10564290 | golf | 1

10564290 | pingpong | 18

10564290 | swim | 6

10564290 | tennis | 3

10564290 | volleyball | 7

107879 | basketball | 7
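
Each output line appears to hold a user ID, a category, and the number of that user's browsed documents assigned to the category. The meaning of the third column is an assumption, since the source does not document it. Under that reading, a short Python sketch can pick each user's dominant category from lines in this format; the function name is hypothetical.

```python
def top_category(lines):
    """Map each user ID to the (category, count) pair with the largest count."""
    best = {}
    for line in lines:
        user, category, count = (part.strip() for part in line.split("|"))
        count = int(count)
        if user not in best or count > best[user][1]:
            best[user] = (category, count)
    return best


# A few of the lines shown above.
lines = [
    "10511838 | badminton | 7",
    "10511838 | football | 11",
    "10511838 | volleyball | 12",
    "10564290 | football | 16",
    "10564290 | pingpong | 18",
]
print(top_category(lines))
```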

This article is from the "Wandering footsteps" blog; please retain this source link: http://now51jq.blog.51cto.com/3474143/1547434
