Introduction
This article follows on from my previous article, "[scavenger] mahout0.9 patch to make it support hadoop2.2.0".
You are welcome to repost this article; please cite the source:
http://blog.csdn.net/u010967382/article/details/39088285
Step 1: Upload all 20news files to HDFS
~/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news
Found 2 items
drwxr-xr-x   - yarn supergroup 0 2014-09-04 /workspace/mahout/week4/data/20news/20news-bydate-test
drwxr-xr-x   - yarn supergroup 0 2014-09-04 /workspace/mahout/week4/data/20news/20news-bydate-train

Step 2: Create a sequence file from the data
~/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq
~/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq
Found 1 items
-rw-r--r--   1 yarn supergroup 37064977 /workspace/mahout/week4/data/20news_seq/chunk-0

Step 3: Convert the sequence file into vectors
~/mahout-distribution-0.7/bin$ ./mahout seq2sparse \
  -i /workspace/mahout/week4/data/20news_seq/ \
  -o /workspace/mahout/week4/data/20news_vectors \
  -lnorm -nv -wt tfidf
~/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors
Found 7 items
drwxr-xr-x   - yarn supergroup 0 2014-09-04 /workspace/mahout/week4/data/20news_vectors/df-count
-rw-r--r--   1 yarn supergroup 1937084 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0
-rw-r--r--   1 yarn supergroup 1890053 /workspace/mahout/week4/data/20news_vectors/frequency.file-0
drwxr-xr-x   - yarn supergroup 0 2014-09-04 /workspace/mahout/week4/data/20news_vectors/tf-vectors
drwxr-xr-x   - yarn supergroup 0 2014-09-04 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors
drwxr-xr-x   - yarn supergroup 0 /workspace/mahout/week4/data/20news_vectors/tokenized-documents
drwxr-xr-x   - yarn supergroup 0 /workspace/mahout/week4/data/20news_vectors/wordcount

Step 4: Split the vector set into a training set and a test set
Parameters:
- -tr: output path for the training set
- -te: output path for the test set
- -rp: the percentage of the whole dataset to use as the test set. The command below sets it to 20%.
~/mahout-distribution-0.7/bin$ ./mahout split \
  -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors \
  -tr /workspace/mahout/week4/data/train-vectors \
  -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential
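To make the idea of a percentage-based random split concrete, here is a minimal Python sketch. It is only an illustration of the concept, not Mahout's implementation: the function name `random_split`, the seed, and the sample document IDs are all hypothetical.

```python
import random


def random_split(items, test_pct, seed=0):
    """Send roughly test_pct percent of items to the test set, the rest to training.

    Illustration only: this is not how Mahout's split command is implemented.
    """
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        # Each item is assigned independently, so the test share
        # is only approximately test_pct percent.
        if rng.random() * 100 < test_pct:
            test.append(item)
        else:
            train.append(item)
    return train, test


# Hypothetical document IDs standing in for the TF-IDF vectors.
docs = [f"doc-{i}" for i in range(1000)]
train, test = random_split(docs, test_pct=20)
print(len(train), len(test))
```

Because every item is drawn independently in this sketch, the test set ends up close to, but not exactly, 20% of the data.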
Step 5: Train the model
~/mahout-distribution-0.9/bin$ ./mahout trainnb \
  -i /workspace/mahout/week4/data/train-vectors \
  -el -o /workspace/mahout/week4/nbmodel \
  -li /workspace/mahout/week4/labindex \
  -ow -c
View the generated label index:
~$ hadoop fs -text /workspace/mahout/week4/labindex
20news-bydate-test 0
20news-bydate-train 1
View the trained model:
~$ hadoop fs -ls /workspace/mahout/week4/nbmodel
Found 1 items
-rw-r--r--   1 yarn supergroup 2437874 2014-09-05 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin
Step 6: Test
~/mahout-distribution-0.9/bin$ ./mahout testnb \
  -i /workspace/mahout/week4/data/test-vectors \
  -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex \
  -ow -o /workspace/mahout/week4/20news-test-result \
  -c
Note: the input path passed with -i during testing is the test set produced by the split in Step 4.
Test results:
14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       2887       74.9675%
Incorrectly Classified Instances        :        964       25.0325%
Total Classified Instances              :       3851

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
1131    413      |  1544      a     = 20news-bydate-test
551     1756     |  2307      b     = 20news-bydate-train

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.486
Accuracy                                   74.9675%
Reliability                                49.7892%
Reliability (standard deviation)            0.4314

14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
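As a sanity check on the numbers above, the accuracy and Kappa can be recomputed from the confusion matrix. The sketch below assumes the standard Cohen's kappa formula; the output itself does not say how Mahout computes Kappa, so treat that part as an assumption. The variable names are illustrative.

```python
# Confusion matrix copied from the test output. Each row is taken to be one
# actual label, as the row totals 1544 and 2307 in the output suggest.
matrix = [
    [1131, 413],   # a = 20news-bydate-test
    [551, 1756],   # b = 20news-bydate-train
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Cohen's kappa (assumed formula): observed agreement corrected for the
# agreement expected by chance from the row and column totals.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"correct={correct} total={total} accuracy={accuracy:.4%} kappa={kappa:.3f}")
```

Under these assumptions the recomputed values agree with the reported summary: 2887 of 3851 instances correct, an accuracy of 74.9675%, and a Kappa of 0.486.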
[Mondorf] Testing mahout0.9 with 1329-3.patch applied on hadoop2.2.0, using Bayesian text classification.