First, Introduction
For an introduction to Mahout, please see here: http://mahout.apache.org/
For information on Naive Bayes, please see here:
Mahout implements the Naive Bayes classification algorithm; here I use it to classify Chinese news texts.
The official site has a classification example that uses the 20 Newsgroups data (http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz), about 85 MB in total.
Compared with English text, Chinese text needs only one extra step: word segmentation. I use the Sogou Lab corpus, about 300 MB in total. Please see here: http://www.sogou.com/labs/resources.html?v=1
Second, detailed steps
1. Write a small word-segmentation program using the IK toolkit. Separate the tokens with spaces and write all the news into one text file, one news item per line.
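Before uploading, it can help to sanity-check the segmenter output against the format described in step 1. The following is only a minimal sketch: the file name news_segmented.txt and its two sample lines are hypothetical stand-ins, not data from the Sogou corpus.

```shell
#!/bin/sh
# Hypothetical sample standing in for the segmenter output:
# one news item per line, tokens separated by single spaces.
printf '%s\n' '体育 新闻 比赛 结果' '财经 新闻 股市 行情 分析' > news_segmented.txt

# Count the news items (lines) and the tokens (space-separated fields).
summary=$(awk '{ tokens += NF } END { printf "%d items, %d tokens", NR, tokens }' news_segmented.txt)
echo "$summary"

# Count empty lines; a non-zero value suggests an item was lost or split.
empty=$(awk 'NF == 0 { n++ } END { print n + 0 }' news_segmented.txt)
echo "empty lines: $empty"
```

If the item count does not match the number of news articles fed to the segmenter, the one-item-per-line assumption has probably been broken by a stray newline.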
2. Upload the data to HDFS. With data of this size, the upload took hours in my own test.
~/workspace$ hadoop dfs -cp /share/data/mahout_examples_data_set/20news-all .
3. Create sequence files from the 20newsgroups data
~/workspace$ mahout seqdirectory -i 20news-all -o 20news-seq
4. Convert the sequence files to vectors
~/workspace$ mahout seq2sparse -i ./20news-seq -o ./20news-vectors -lnorm -nv -wt tfidf
5. Split the vector dataset into training data and test data with a random 40-60 split
~/workspace$ mahout split -i ./20news-vectors/tfidf-vectors --trainingOutput ./20news-train-vectors --testOutput ./20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
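To make the idea of a random 40-60 split concrete, here is a small local sketch with awk. It is not how Mahout performs the split internally; it only illustrates the idea, assuming that each record goes to the test set with probability 40% and to the training set otherwise. The file names records.txt, train.txt and test.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical local file standing in for the vector data set: 100 numbered records.
awk 'BEGIN { for (i = 1; i <= 100; i++) print i }' > records.txt

# Start from empty output files so both exist even if one side receives nothing.
: > train.txt
: > test.txt

# Send each record to the test set with probability pct/100, otherwise to the
# training set. The fixed seed only makes the run repeatable.
awk -v pct=40 -v seed=42 '
  BEGIN { srand(seed) }
  { if (rand() * 100 < pct) print > "test.txt"; else print > "train.txt" }
' records.txt

train_n=$(wc -l < train.txt)
test_n=$(wc -l < test.txt)
echo "train: $train_n  test: $test_n"
```

Because the assignment is random per record, the two sets are only approximately 60% and 40% of the data, but every record lands in exactly one of them.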
6. Train the Naive Bayes model
~/workspace$ mahout trainnb -i ./20news-train-vectors -el -o ./model -li ./labelindex -ow -c
7. Test the Naive Bayes model on the training data
~/workspace$ mahout testnb -i ./20news-train-vectors -m ./model -l ./labelindex -ow -o 20news-testing -c
8. Test the model's classification performance on the test data
~/workspace$ mahout testnb -i ./20news-test-vectors -m ./model -l ./labelindex -ow -o ./20news-testing -c
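The Mahout steps 3 to 8 can be collected into one script. The following is a sketch under some assumptions: WORK_DIR, DRY_RUN, run and pipeline are names introduced here for illustration, the value 40 for --randomSelectionPct is taken from the 40-60 split described in step 5, and by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical settings introduced for this sketch.
WORK_DIR=${WORK_DIR:-.}   # directory that holds 20news-all
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = execute them

# Print a command and, unless in dry-run mode, execute it and stop on failure.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@" || exit 1
  fi
}

pipeline() {
  # Step 3: text files -> sequence files
  run mahout seqdirectory -i "$WORK_DIR/20news-all" -o "$WORK_DIR/20news-seq"
  # Step 4: sequence files -> TF-IDF vectors
  run mahout seq2sparse -i "$WORK_DIR/20news-seq" -o "$WORK_DIR/20news-vectors" \
    -lnorm -nv -wt tfidf
  # Step 5: random split into training and test vectors
  run mahout split -i "$WORK_DIR/20news-vectors/tfidf-vectors" \
    --trainingOutput "$WORK_DIR/20news-train-vectors" \
    --testOutput "$WORK_DIR/20news-test-vectors" \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  # Step 6: train the model
  run mahout trainnb -i "$WORK_DIR/20news-train-vectors" -el \
    -o "$WORK_DIR/model" -li "$WORK_DIR/labelindex" -ow -c
  # Step 7: test on the training data
  run mahout testnb -i "$WORK_DIR/20news-train-vectors" -m "$WORK_DIR/model" \
    -l "$WORK_DIR/labelindex" -ow -o "$WORK_DIR/20news-testing" -c
  # Step 8: test on the held-out test data
  run mahout testnb -i "$WORK_DIR/20news-test-vectors" -m "$WORK_DIR/model" \
    -l "$WORK_DIR/labelindex" -ow -o "$WORK_DIR/20news-testing" -c
}

pipeline
```

Running it with DRY_RUN=0 executes the six commands in order and stops at the first one that fails, so a broken intermediate step does not feed bad data into training.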
Reference: http://openresearch.baidu.com/activitybulletin/448.jhtml;jsessionid=28BD4187550DCA6F8AD6FEA4DCCA2480
Mahout Naive Bayes Chinese News Classification example