Mahout Naive Bayes Chinese News Classification example

Source: Internet
Author: User

I. Introduction

For an introduction to Mahout, please see here: http://mahout.apache.org/

For information on Naive Bayes, please see here:

Mahout implements the Naive Bayes classification algorithm; here I use it to classify Chinese news texts.

The official Mahout distribution includes a classification example that uses the 20 Newsgroups data set (http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz), about 85 MB in total.

Compared with English text, Chinese text needs only one extra step: word segmentation. I use the Sogou Labs corpus, about 300 MB in total. See: http://www.sogou.com/labs/resources.html?v=1

II. Detailed steps

1. Write a small word-segmentation program using the IK toolkit. Separate the words with spaces and write all the news into one text file, one news item per line.
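The output format of step 1 can be sketched as follows. This is only an illustration: the original program uses the IK toolkit (a Java library), whereas the `segment` function below is a toy forward maximum matching segmenter with a made-up three-word dictionary, swapped in so the example is self-contained. The names `segment`, `to_corpus_lines`, and `vocab` are assumptions, not part of IK or Mahout.

```python
def segment(text, vocab, max_len=4):
    """Toy forward maximum matching (a stand-in for IK, not IK's algorithm).

    At each position take the longest dictionary word of up to max_len
    characters; if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_corpus_lines(news_items, vocab):
    """Return one line per news item, words separated by single spaces."""
    return [" ".join(segment(item, vocab)) for item in news_items]


# Hypothetical dictionary and news items, for illustration only.
vocab = {"中文", "新闻", "分类"}
lines = to_corpus_lines(["中文新闻分类", "新闻分类"], vocab)
print("\n".join(lines))
```

Each printed line corresponds to one news item, which is the layout the later steps expect as input.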

2. Upload the data to HDFS. Given the size of the data, this took hours in my own test.

~/workspace$ hadoop dfs -cp /share/data/mahout_examples_data_set/20news-all .

3. Create sequence files from the 20newsgroups data

~/workspace$ mahout seqdirectory -i 20news-all -o 20news-seq

4. Convert the sequence files to vectors

~/workspace$ mahout seq2sparse -i ./20news-seq -o ./20news-vectors -lnorm -nv -wt tfidf
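To make the TF-IDF weighting requested in step 4 concrete, here is a minimal sketch of the textbook formula tf * log(N / df) applied to space-separated documents. It is an assumption-laden illustration: Mahout's own weighting and its log normalization option differ in details, and the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Textbook TF-IDF (not Mahout's exact formula).

    docs: list of strings with words separated by spaces.
    Returns one {word: weight} dict per document, where
    weight = term count * log(number of docs / docs containing the word).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))

    vectors = []
    for doc in docs:
        term_counts = Counter(doc.split())
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


# Hypothetical segmented news items, one per string.
docs = ["股票 市场 上涨", "足球 比赛 市场", "股票 股票 下跌"]
vecs = tfidf_vectors(docs)
for vec in vecs:
    print(vec)
```

Words that appear in fewer documents get a larger weight, so they contribute more to the classifier than words common to every category.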

5. Divide the vector dataset into training data and test data, with random 40-60 split

~/workspace$ mahout split -i ./20news-vectors/tfidf-vectors --trainingOutput ./20news-train-vectors --testOutput ./20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
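The idea behind the random 40-60 split can be sketched in a few lines. This is an illustration under stated assumptions, not Mahout's implementation: the function `random_split`, its fixed `seed`, and the sample items are all hypothetical, and the sketch simply holds out a randomly chosen 40% of the items as test data.

```python
import random


def random_split(items, test_pct, seed=42):
    """Hold out test_pct percent of items, chosen at random, as test data."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = round(len(items) * test_pct / 100)
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test


# Ten hypothetical document ids; 40% go to the test set.
items = [f"doc{i}" for i in range(10)]
train, test = random_split(items, test_pct=40)
print(len(train), len(test))
```

Keeping the test items out of training is what makes the accuracy measured on them a fair estimate of how the model handles unseen news.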

6. Train the naive Bayes model

~/workspace$ mahout trainnb -i ./20news-train-vectors -el -o ./model -li ./labelindex -ow -c

7. Test the naive Bayes model on the training set

~/workspace$ mahout testnb -i ./20news-train-vectors -m ./model -l ./labelindex -ow -o 20news-testing -c

8. Test the model's classification performance on the test set

~/workspace$ mahout testnb -i ./20news-test-vectors -m ./model -l ./labelindex -ow -o ./20news-testing -c
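For readers who want to see what training and testing amount to, here is a minimal sketch of a standard multinomial naive Bayes classifier with add-one smoothing on raw word counts. It is not Mahout's code and not the complementary variant; the function names `train_nb` and `classify`, the category labels, and the tiny data set are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (label, 'space separated words') pairs."""
    class_docs = Counter()               # number of documents per label
    word_counts = defaultdict(Counter)   # word counts per label
    vocab = set()
    for label, text in labeled_docs:
        words = text.split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothed log likelihood of the word.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical segmented training and test news items.
train_set = [
    ("finance", "股票 市场 上涨"),
    ("finance", "股票 下跌"),
    ("sports", "足球 比赛 进球"),
    ("sports", "比赛 冠军"),
]
test_set = [("finance", "股票 市场"), ("sports", "足球 冠军")]

model = train_nb(train_set)
accuracy = sum(classify(model, text) == label
               for label, text in test_set) / len(test_set)
print(accuracy)
```

Measuring accuracy once on the training items and once on the held-out items, as steps 7 and 8 do, shows how much of the model's performance carries over to news it has not seen.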

Reference: http://openresearch.baidu.com/activitybulletin/448.jhtml;jsessionid=28BD4187550DCA6F8AD6FEA4DCCA2480
