Experiment Introduction
This experiment introduces the Bayes classification algorithm in Mahout.
First, the experimental environment
1. Environment Login
Login is automatic and requires no password. The system user name is Shiyanlou.
2. Introduction to the Environment
This experiment uses an Ubuntu Linux environment with a desktop. The following programs on the desktop are used in the experiment:
- XfceTerminal: the Linux command-line terminal. Opening it starts a Bash session in which you can run Linux commands;
- Firefox: the browser. It is used in courses that need a front-end interface; you only need to open the HTML/JS pages written in the environment;
- GVim: a very handy editor. For basic usage, see the course on the Vim editor;
- Eclipse: a well-known cross-platform, free integrated development environment (IDE). It is used primarily for Java development, but plugins also make it usable as a development tool for languages such as C++ and Python.
3. Use of the environment
Use the GVim editor to enter the code required for the experiment, then compile and run it in the XfceTerminal command-line environment and check the results. You can also share your results: Shiyanlou keeps a backend record that cannot be faked, which proves that you really completed the experiment.
The experiment records page can be viewed under "My Course". It lists each experiment and its notes, together with the effective learning time for each experiment (the time spent actually operating the experiment desktop; periods with no activity are recorded as idle time). These records are proof of the authenticity of your study.
Second, the classification algorithm
To help everyone understand what a classification algorithm is, here is an example.
Suppose you put some fruit in front of a child who is a few years old and tell him that the red round one is an apple and the orange round one is an orange. Then you take the fruit away, bring out another red round apple, and ask him whether it is an apple. The child answers yes. This is a simple classification process.
The process involves two stages. The first is the model-building stage, in which the child is told which features an apple has. The second is the model-use stage, in which the child is asked whether a new fruit is an apple and answers yes.
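The two stages can be sketched in a few lines of Python. This is only an illustration of the fruit example, not how Mahout works internally; the function names build_model and classify and the feature tuples are assumptions made up for this sketch.

```python
from collections import Counter, defaultdict

def build_model(examples):
    """Model-building stage: remember which label goes with each feature combination."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    # For every feature combination keep the label that was seen most often.
    return {features: counts.most_common(1)[0][0] for features, counts in votes.items()}

def classify(model, features):
    """Model-use stage: look up the label for a new, unlabeled item."""
    return model.get(features, "unknown")

# What the child is told (hypothetical encoding of colour and shape).
training = [
    (("red", "round"), "apple"),
    (("orange", "round"), "orange"),
]
model = build_model(training)
print(classify(model, ("red", "round")))
```

A real classifier such as naive Bayes replaces the exact lookup with probabilities, so it can also label items whose features were never seen together during training.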
Third, the Bayes classification algorithm
The Bayes (Bayesian) classification algorithm is one of the classification algorithms in Mahout. It is a statistical classification algorithm based on Bayes' theorem, and it predicts the probability that a sample belongs to a given class.
The Bayes classification algorithm has many variants. This experiment mainly introduces the naive Bayes classification algorithm. It is called naive because, for simplicity, it assumes that the individual attributes are independent of each other. Its basic idea is as follows: for a given item to be classified, compute the probability of each category given that item, and assign the item to the category with the largest probability.
When applied to big data, the Bayes classification algorithm is simple, accurate, and fast. It also has a disadvantage: naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes, and this assumption rarely holds in practice, so classification accuracy may suffer.
Naive Bayes is a supervised learning algorithm. When it is used to classify text, there are two main models: the multinomial model and the Bernoulli model. The naive Bayes classification algorithm in Mahout uses the multinomial model. If you are interested in studying it in depth, refer to the original papers (in English).
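The difference between the two text models lies in how a document is turned into features. The sketch below is a rough illustration, not Mahout code: the function names are assumptions, and the sample document and vocabulary are borrowed from the worked example in this section.

```python
from collections import Counter

def multinomial_features(words):
    """Multinomial model: each word counts as many times as it occurs."""
    return dict(Counter(words))

def bernoulli_features(words, vocabulary):
    """Bernoulli model: each vocabulary word is only present (1) or absent (0)."""
    present = set(words)
    return {word: int(word in present) for word in vocabulary}

document = ["China", "China", "China", "Tokyo", "Japan"]
vocabulary = ["China", "Beijing", "Shanghai", "Macau", "Tokyo", "Japan"]

print(multinomial_features(document))
print(bernoulli_features(document, vocabulary))
```

In the multinomial view the three occurrences of "China" all contribute to the result, while in the Bernoulli view the word only counts once, and absent words such as "Beijing" are recorded explicitly.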
Here is an example to illustrate the Bayes classification algorithm. Take a set of training texts, each labeled with a category.
Given a new sample document "China, China, China, Tokyo, Japan", we want to classify it. Its text attribute vector can be written as d = (China, China, China, Tokyo, Japan), and the category set is Y = {yes, no}. There are 8 words in total under the category "yes" and 3 words under the category "no", so the training sample contains 8 + 3 = 11 words. Therefore P(yes) = 8/11 and P(no) = 3/11.
The class-conditional probabilities are calculated as follows:
- P(China | yes) = (5+1)/(8+6) = 6/14 = 3/7
- P(Japan | yes) = P(Tokyo | yes) = (0+1)/(8+6) = 1/14
- P(China | no) = (1+1)/(3+6) = 2/9
- P(Japan | no) = P(Tokyo | no) = (1+1)/(3+6) = 2/9
where:
- 8 in the denominator is the total number of words in the training sample under the category "yes";
- 6 is the number of distinct words in the training sample: "China, Beijing, Shanghai, Macau, Tokyo, Japan", 6 words in total;
- 3 is the total number of words under the category "no".
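Adding 1 to the numerator and the vocabulary size to the denominator is the usual add-one (Laplace) smoothing, which keeps a word that never occurred in a class from getting probability zero. A minimal sketch with exact fractions follows; the function name smoothed_probability is an assumption made for this illustration, and the word counts are those used in the formulas above.

```python
from fractions import Fraction

def smoothed_probability(word_count, class_total, vocabulary_size):
    """Add-one estimate of P(word | class)."""
    return Fraction(word_count + 1, class_total + vocabulary_size)

VOCABULARY_SIZE = 6  # China, Beijing, Shanghai, Macau, Tokyo, Japan

print(smoothed_probability(5, 8, VOCABULARY_SIZE))  # P(China | yes)
print(smoothed_probability(0, 8, VOCABULARY_SIZE))  # P(Tokyo | yes) and P(Japan | yes)
print(smoothed_probability(1, 3, VOCABULARY_SIZE))  # P(China | no), P(Tokyo | no), P(Japan | no)
```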
With the class-conditional probabilities above, we can calculate the posterior probabilities (up to a common normalizing factor):
- P(yes | d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.00029209
- P(no | d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.00014780
Since P(yes | d) is larger than P(no | d), we conclude that this document belongs to the category "yes", that is, the China category. This is the main idea behind the Bayes classification algorithm implemented in Mahout.
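The comparison can be checked with a short script. This is a sketch of the hand calculation only, not of Mahout's implementation: it multiplies the class prior by one conditional probability per word of d, using the priors and class-conditional probabilities listed above. The function name class_score is an assumption.

```python
from fractions import Fraction

def class_score(prior, conditional, words):
    """Multiply the class prior by P(word | class) for every word in the document."""
    score = prior
    for word in words:
        score *= conditional[word]
    return score

d = ["China", "China", "China", "Tokyo", "Japan"]
priors = {"yes": Fraction(8, 11), "no": Fraction(3, 11)}
conditionals = {
    "yes": {"China": Fraction(3, 7), "Tokyo": Fraction(1, 14), "Japan": Fraction(1, 14)},
    "no": {"China": Fraction(2, 9), "Tokyo": Fraction(2, 9), "Japan": Fraction(2, 9)},
}

scores = {label: class_score(priors[label], conditionals[label], d) for label in priors}
best = max(scores, key=scores.get)

for label, score in scores.items():
    print(label, score, float(score))
print("predicted class:", best)
```

For long documents, implementations usually add logarithms of the probabilities instead of multiplying them, because the product of many small numbers underflows in floating point.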
Fourth, an application example of the Bayes classification algorithm
As before, this experiment walks you through a concrete example.
(1) Under /usr/local/hadoop-1.2.1, create a new test directory, then download the test data 20news-bydate.tar.gz and extract it (the test data contains many newsgroup documents, divided into several newsgroups):
$ sudo mkdir bayes
$ cd bayes
$ sudo wget http://labfile.oss.aliyuncs.com/20news-bydate.tar.gz
$ sudo tar zxvf 20news-bydate.tar.gz
After extraction, the data is already divided into a training set (20news-bydate-train) and a test set (20news-bydate-test). We only need to convert the data format.
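Before converting the data, it can be useful to check how many documents each newsgroup contains. The sketch below assumes that each extracted directory holds one subdirectory per newsgroup with one file per document; the function name count_documents is made up for this illustration.

```python
from pathlib import Path

def count_documents(root):
    """Return {category: number of files} for a directory of category subfolders."""
    counts = {}
    for category_dir in sorted(Path(root).iterdir()):
        if category_dir.is_dir():
            counts[category_dir.name] = sum(1 for f in category_dir.iterdir() if f.is_file())
    return counts

if __name__ == "__main__":
    # Directory names as used by the commands in this experiment.
    for name in ("20news-bydate-train", "20news-bydate-test"):
        if Path(name).is_dir():
            for category, n in count_documents(name).items():
                print(name, category, n)
```

Run it from the bayes directory after extracting the archive. If the directories are missing, the script prints nothing.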
(2) Use the mahout seqdirectory command to convert all the sample files in each directory into a SequenceFile in <Text, Text> format. Mahout automatically runs Hadoop map-reduce jobs for this, and the output is placed in the 20news-bydate-train-seq and 20news-bydate-test-seq directories:
$ source /etc/profile
$ mahout seqdirectory -i 20news-bydate-train -o 20news-bydate-train-seq
$ mahout seqdirectory -i 20news-bydate-test -o 20news-bydate-test-seq
Diagram: Executing map-reduce job
(3) Use the mahout seq2sparse command to convert the generated SequenceFile into a vector file in <Text, VectorWritable> format. This is again processed by Hadoop map-reduce, and the output is placed in the 20news-bydate-train-vectors and 20news-bydate-test-vectors directories (make sure the command is typed correctly before running it):
$ mahout seq2sparse -i 20news-bydate-train-seq -o 20news-bydate-train-vectors -lnorm -nv -wt tfidf
$ mahout seq2sparse -i 20news-bydate-test-seq -o 20news-bydate-test-vectors -lnorm -nv -wt tfidf
Figure: Partial output after vectorization is completed
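The -wt tfidf option asks for TF-IDF weighted vectors. The sketch below shows the general idea with the common textbook weight tf × log(N / df); this is an assumption for illustration, and the exact formula and normalization Mahout applies may differ. The three tiny documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses tf * log(N / df), a common textbook form; Mahout's exact formula may differ.
    """
    n_docs = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # Count each term once per document it appears in.
        document_frequency.update(set(tokens))
    vectors = []
    for tokens in documents:
        term_frequency = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return vectors

# Hypothetical tokenized documents.
docs = [["space", "shuttle", "launch"], ["space", "hockey"], ["hockey", "game", "game"]]
for vector in tfidf_vectors(docs):
    print(vector)
```

Words that appear in few documents, such as "game", get a larger weight than words spread across many documents, such as "space".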
(4) Once the data format conversion is complete, we can train the model on the training set:
$ mahout trainnb -i 20news-bydate-train-vectors/tfidf-vectors -el -o model -li labelindex -ow
Figure: Partial output after training is completed
(5) After the training is complete, start testing:
$ mahout testnb -i 20news-bydate-test-vectors/tfidf-vectors -m model -l labelindex -ow -o test-relust -c
Figure: The two screenshots above show the test results
(6) Here is what the confusion matrix output above means:
The letters a through t above stand for the 20 categories; you can see these 20 categories under the train or test folder. The numbers in each column are the number of documents assigned to each category, and Classified gives the total number of documents that should be assigned to the category. For example:
It says that alt.atheism belongs to class a. 475 documents were assigned to class a, which is the correct result; the other entries give, in order, the numbers of documents assigned to classes b through t. The accuracy is therefore 475/480 = 0.9895833, which is very high.
Similarly, the Summary and Statistics sections show that the accuracy and reliability of the results are very high.
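The same calculation can be applied to every row of a confusion matrix. In the sketch below the 3-class matrix is hypothetical: only the 475 correct documents out of 480 in the first row come from the example above, and the function names are assumptions.

```python
def per_class_accuracy(confusion):
    """confusion[i][j] = documents of true class i that were assigned to class j."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Share of all documents that landed on the diagonal."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical matrix; the first row mirrors 475 correct out of 480 for class a.
confusion = [
    [475, 3, 2],
    [4, 90, 6],
    [1, 9, 190],
]

print(per_class_accuracy(confusion))
print(overall_accuracy(confusion))
```

Each diagonal entry is the number of documents classified correctly for that class, so dividing it by the row total gives the per-class rate discussed above.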
Homework
Try running a Bayes classification example with Hadoop and Mahout in pseudo-distributed mode.