1. Run the tokenizer to perform word segmentation.
~/data$ hadoop jar mrtokenize.jar tokenize.TokenizeDriver /home/GRID/data/lesson8 /home/GRID/output/sportwords
14/08/31 21:59:33 INFO input.FileInputFormat: Total input paths to process : 10205
.....
14/08/31 22:05:25 INFO mapred.JobClient: Map output records=10205
In total, 10205 files were processed, which took about 6 minutes.
Result after word segmentation (one line per document: the category followed by its tokens):
Badminton Uber Cup China Thailand advance finals Wang yihan smart Back To The Ball month, Chinese team player Wang yihan back to the ball in the game she finally beat the Thai team player hand delivery noo day Wuhan Sports Center center Center china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports
Badminton Uber Cup China Thailand advance finals Wang yihan game serving Month day Chinese team player Wang yihan game serving her final victory Thai team player hand delivery noo day Wuhan Sports Center body china's Chinese team beat the Thai team to the finals in the final semi-finals of the feather badminton tournament held in the sports gymnasium. Xinhua News Agency, reporter Meng Yongmin, takes a picture of sports to learn more about sports
2. Load the segmented documents
grunt> processed = load '/home/GRID/output/sportwords' as (category:chararray, doc:chararray);
3. Randomly sample 20% of the documents as the test set
grunt> test = sample processed 0.2;
4. Generate the training set from the documents that are not in the test set
grunt> jnt = join processed by (category, doc) left outer, test by (category, doc);
grunt> filt_test = filter jnt by test::category is null;
grunt> train = foreach filt_test generate processed::category as category, processed::doc as doc;
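Steps 3 and 4 amount to a random hold-out split: `sample` keeps each row with probability 0.2, and the left outer join followed by the null filter acts as an anti-join that keeps only the rows not drawn into the test set. A minimal Python sketch of the same idea follows. The function name `split_holdout` and the rows are made up for illustration and are not part of the original job.

```python
import random

def split_holdout(rows, fraction=0.2, seed=42):
    """Split (category, doc) rows into train and test lists.

    Each row lands in the test set with probability `fraction`, as with
    Pig's SAMPLE. The training set is every row whose (category, doc) key
    is absent from the test set, which is what the left outer join plus
    the `is null` filter computes.
    """
    rng = random.Random(seed)
    test = [row for row in rows if rng.random() < fraction]
    test_keys = set(test)
    train = [row for row in rows if row not in test_keys]
    return train, test

# Hypothetical rows standing in for the segmented documents.
rows = [("sport%d" % (i % 3), "doc text %d" % i) for i in range(1000)]
train, test = split_holdout(rows)
print(len(train), len(test))
```

Because the anti-join matches on the whole (category, doc) key, a document that occurs more than once drops out of the training set as soon as any copy of it is sampled into the test set.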
5. Export the test set and the training set
grunt> store test into '/home/GRID/data/lesson8.2/test';
grunt> store train into '/home/GRID/data/lesson8.2/train';
6. Count the test set documents per category
grunt> test_ct = foreach (group test by category) generate group, COUNT(test.category);
grunt> dump test_ct;
Result:
(F1, 196)
(Golf, 206)
(Swim, 170)
(Tennis, 187)
(Football, 209)
(Pingpong, 220)
(Badminton, 212)
(Billiards, 236)
(Basketball, 201)
(Volleyball, 204)
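As a quick sanity check, the ten counts should add up to roughly 20% of the 10205 input documents. The sketch below takes the counts for F1, Golf and Badminton to be 196, 206 and 212; the variable names are illustrative.

```python
# Test-set counts per category as dumped in step 6. The values for F1,
# Golf and Badminton are read as 196, 206 and 212 (an assumption).
test_counts = {
    "F1": 196, "Golf": 206, "Swim": 170, "Tennis": 187, "Football": 209,
    "Pingpong": 220, "Badminton": 212, "Billiards": 236,
    "Basketball": 201, "Volleyball": 204,
}
total_docs = 10205  # number of input paths reported in step 1

test_total = sum(test_counts.values())
test_fraction = test_total / total_docs
print(test_total, round(test_fraction, 3))
```

The counts sum to 2041, which matches the 0.2 sampling rate and also equals the total number of documents in the confusion matrix of step 9.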
7. Count the training set documents per category
grunt> train_ct = foreach (group train by category) generate group, COUNT(train.category);
grunt> dump train_ct;
8. Train a model with the Naive Bayes classifier
~/data$ mahout trainclassifier \
> -i /home/GRID/data/lesson8.2/train \
> -o /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs
Result:
14/08/31 23:31:15 INFO mapred.JobClient: Map output records=228211
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-docCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-termDocCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-featureCount
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-wordFreq
14/08/31 23:31:15 INFO common.HadoopUtil: Deleting /home/GRID/output/model-bayes8.2/trainer-tfIdf/trainer-vocabCount
14/08/31 23:31:15 INFO driver.MahoutDriver: Program took 1265118 ms (Minutes: 21.0853)
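To make the training step less of a black box, here is a minimal sketch of a plain multinomial Naive Bayes classifier over single tokens (the unigram case, n-gram size 1). It is not Mahout's implementation: the log above lists a TF-IDF stage, which this sketch leaves out, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train on (category, tokens) pairs; return doc counts, word counts, vocab."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, tokens in docs:
        class_docs[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify_nb(model, tokens):
    """Return the category with the highest log posterior."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for category, n in class_docs.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n / n_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[category][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical toy corpus in the same shape as the training file.
docs = [
    ("badminton", "badminton uber cup final shuttle".split()),
    ("badminton", "badminton player serve shuttle".split()),
    ("football", "football goal striker penalty".split()),
    ("football", "football league goal keeper".split()),
]
model = train_nb(docs)
print(classify_nb(model, "uber cup shuttle serve".split()))
print(classify_nb(model, "penalty goal keeper".split()))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once documents contain hundreds of tokens.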
9. Test the model
~/data$ mahout testclassifier \
> -d /home/GRID/data/lesson8.2/test \
> -m /home/GRID/output/model-bayes8.2 \
> -type bayes \
> -ng 1 \
> -source hdfs \
> -method mapreduce
14/08/31 23:45:52 INFO bayes.BayesClassifierDriver: ==============================================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j <--Classified as
190 0 1 1 0 3 0 0 0 0 | 195 a = basketball
0 249 0 0 0 1 0 0 0 | 250 b = billiards
0 0 198 0 0 1 0 0 0 0 | 199 c = badminton
0 0 0 224 0 0 0 0 0 0 | 224 d = football
0 0 0 0 190 0 0 0 0 0 | 190 e = volleyball
0 0 0 0 0 181 0 0 0 0 | 181 f = swim
0 1 0 0 0 204 0 0 0 | 205 g = pingpong
1 0 0 0 0 0 193 0 0 | 194 h = golf
0 0 0 0 0 0 0 0 196 0 | 196 i = F1
0 0 0 0 0 1 0 206 | 207 j = tennis
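The matrix can be condensed into an overall accuracy and a per-class recall. The sketch below needs only the diagonal cell (correctly classified documents) and the row total of each class, both copied from the output above; the variable names are illustrative.

```python
# (label, correctly classified, row total) read from the confusion matrix.
rows = [
    ("basketball", 190, 195), ("billiards", 249, 250),
    ("badminton", 198, 199), ("football", 224, 224),
    ("volleyball", 190, 190), ("swim", 181, 181),
    ("pingpong", 204, 205), ("golf", 193, 194),
    ("F1", 196, 196), ("tennis", 206, 207),
]

correct = sum(hit for _, hit, _ in rows)
total = sum(n for _, _, n in rows)
accuracy = correct / total

# Per-class recall: share of each true class that was labeled correctly.
recall = {label: hit / n for label, hit, n in rows}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label, value in sorted(recall.items(), key=lambda kv: kv[1]):
    print(f"{label:<10} {value:.4f}")
```

This gives 2031 correct out of 2041 test documents, about 99.5%, with basketball the weakest class at 190 of 195.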
10. Segment the user browsing records, in the same way as step 1
~/data$ hadoop jar mrtokenize.jar tokenize.TokenizeDriver /home/GRID/data/les8-usersport /home/GRID/lesson8/output/userwords
11. Use the model trained on the sports corpus to classify the users' browsing content
~/data$ hadoop jar mrclassify.jar classifier.ClassifierDriver \
> /home/GRID/lesson8/output/userwords \
> /home/GRID/lesson8/output/classify \
> /home/GRID/output/model-bayes8.2 \
> bayes
Result:
~/data$ hadoop fs -cat /home/GRID/lesson8/output/classify/part-r-00000 | head -20
Warning: $HADOOP_HOME is deprecated.
10511838 | badminton | 7
10511838 | basketball | 5
10511838 | billiards | 8
10511838 | F1 | 7
10511838 | football | 11
10511838 | golf | 5
10511838 | pingpong | 5
10511838 | tennis | 2
10511838 | volleyball | 12
10564290 | badminton | 2
10564290 | basketball | 12
10564290 | billiards | 11
10564290 | F1 | 12
10564290 | football | 16
10564290 | golf | 1
10564290 | pingpong | 18
10564290 | swim | 6
10564290 | tennis | 3
10564290 | volleyball | 7
107879 | basketball | 7
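A natural next step is to pick each user's dominant interest from this output. The sketch below assumes the three columns are user ID, category, and the number of that user's browsing records assigned to the category (the source does not label them); `top_category` is a hypothetical helper, and the sample lines are copied from the output above.

```python
# A few lines copied from the classification output.
sample_output = """\
10511838 | badminton | 7
10511838 | football | 11
10511838 | volleyball | 12
10564290 | basketball | 12
10564290 | football | 16
10564290 | pingpong | 18
"""

def top_category(lines):
    """Map each user ID to its (category, count) pair with the highest count."""
    best = {}
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, category, count = parts[0], parts[1], int(parts[2])
        if user not in best or count > best[user][1]:
            best[user] = (category, count)
    return best

result = top_category(sample_output.splitlines())
print(result)
```

On a tie the first category encountered wins; a real recommender would probably keep the top few categories per user rather than a single one.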
This article is from the "Wandering footsteps" blog. Please keep this source link when reposting: http://now51jq.blog.51cto.com/3474143/1547434
Using Bayesian classifier for Text Mining --- Note