Learning notes of machine learning practice: Classification Method Based on Naive Bayes


Probability is the basis of many machine learning algorithms. A small amount of probability was already used when growing a decision tree: count how many times a feature takes a particular value in the dataset, then divide by the total number of instances to obtain the probability of that feature value.
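As a minimal sketch of this counting idea (the toy dataset and the feature name "outlook" below are made up for the example, not from the article):

# Estimate P(feature == value) by counting occurrences in a toy dataset.
dataset = [
    {'outlook': 'sunny'}, {'outlook': 'rainy'}, {'outlook': 'sunny'},
    {'outlook': 'overcast'}, {'outlook': 'sunny'},
]

count = sum(1 for instance in dataset if instance['outlook'] == 'sunny')
probability = count / float(len(dataset))
print("P(outlook = sunny) =", probability)   # 3 / 5 = 0.6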

Directory:

  • I. Classification method based on Bayesian theory

  • II. Application scenarios of Naive Bayes

  • III. Text classification based on Python and Naive Bayes

    1. Prepare the data

    2. Train the algorithm

    3. Test the algorithm

  • IV. Summary


I. Classification method based on Bayesian theory

Assume that a dataset consists of two classes of data points.

Assume that the parameters of the two probability distributions are known. Let p1(x,y) denote the probability that the data point (x,y) belongs to class 1, and p2(x,y) the probability that it belongs to class 2.

The core idea of Bayesian decision theory is to choose the decision with the highest probability, i.e., assign the point to the class with the larger probability.

Concretely, for a data point (x,y), you can use the following rule to determine its category:

If p1(x,y) > p2(x,y), the point (x,y) is assigned to class 1.
If p1(x,y) < p2(x,y), the point (x,y) is assigned to class 2.

Of course, real problems can rarely be solved by the rule above alone, because it is not the whole of Bayesian decision theory; p1(x,y) and p2(x,y) are only used to simplify the description. More commonly, we use p(ci|x,y) to denote the probability that a data point at coordinates (x,y) belongs to class ci. Specifically, the Bayesian criterion computes this unknown probability from three known probability values:

p(ci|x,y) = p(x,y|ci) * p(ci) / p(x,y)

The determination rule above can then be rewritten as:

If p(c1|x,y) > p(c2|x,y), the point (x,y) is assigned to class 1.
If p(c1|x,y) < p(c2|x,y), the point (x,y) is assigned to class 2.
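As a minimal sketch of this rule (all the probability values below are invented for illustration, not taken from the article), the posterior of each class can be computed with Bayes' rule and the larger one chosen:

# Hypothetical known quantities: class priors and class-conditional likelihoods
# at a particular point (x, y). The numbers are made up for the example.
p_c = [0.6, 0.4]                  # p(c1), p(c2)
p_xy_given_c = [0.02, 0.05]       # p(x,y | c1), p(x,y | c2)

# p(x,y) by the law of total probability
p_xy = sum(prior * lik for prior, lik in zip(p_c, p_xy_given_c))

# Posteriors p(ci | x,y) = p(x,y | ci) * p(ci) / p(x,y)
posteriors = [lik * prior / p_xy for prior, lik in zip(p_c, p_xy_given_c)]

# Bayesian decision: pick the class with the larger posterior
predicted_class = 1 if posteriors[0] > posteriors[1] else 2
print("posteriors:", posteriors, "-> class", predicted_class)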

II. Application scenarios of Naive Bayes

An important application of machine learning is automatic document classification, and Naive Bayes is a common algorithm for it. The basic step is to scan the document and record which words appear, treating the presence or absence of each word as a feature; there are then as many features as there are words in the vocabulary. When the number of features is large, plotting them individually is not very informative, and histograms are a better way to inspect the data. The general process of Naive Bayes follows the usual workflow: collect and prepare the data, analyze it, train the algorithm, test it, and then use it.

At this point you may wonder why Bayes gets the qualifier "naive". It refers to the basic assumption behind Naive Bayes: the features are statistically independent of one another. For example, the probability of one word appearing is assumed to be unrelated to the appearance of any other word. This is of course not strictly true in practice, but the assumption greatly simplifies the computation, and countless experiments have shown that Naive Bayes still works very well despite it.

Another assumption of Naive Bayes is that every feature is equally important. This assumption is also problematic (otherwise it would not be called an assumption), but it is still useful in practice.

III. Text classification based on Python and Naive Bayes

To extract features from text, the text must first be split into tokens and converted into a word vector: a word that is present is represented by 1 and a word that is absent by 0, so a long string becomes a simple vector of 0s and 1s. This representation only records whether a word appears; alternatively, the vector can record the number of occurrences of each word, or the frequency of each word.

1. Prepare the data: build word vectors from the text. Here we consider all words that appear in all documents and convert each document into a vector over that vocabulary. The following code implements this:

The function loadDataSet() creates some experimental samples postingList and the corresponding labels listClass; some samples are labeled as containing insulting words.
The function createNonRepeatedList() collects and returns a list vocList containing all the distinct words in the documents, using Python's set type.
The function detectInput(vocList, inputStream) takes the vocabulary vocList and a list of words inputStream to check, and outputs the document vector: each element of the vector is 1 or 0, indicating whether the corresponding vocabulary word appears in the input document.

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 08 16:12:55 2015
@author: Administrator
"""
from numpy import *

# Create experimental samples; real samples may need extra processing,
# such as removing punctuation
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    listClass = [0, 1, 0, 1, 0, 1]  # 1 means insulting text, 0 means not
    return postingList, listClass

# Save all the words in the documents to a list; set() removes duplicate words
def createNonRepeatedList(data):
    vocList = set([])
    for doc in data:
        vocList = vocList | set(doc)  # union of the two sets
    return list(vocList)

def detectInput(vocList, inputStream):
    returnVec = [0] * len(vocList)  # all-zero list of the same length as vocList
    for word in inputStream:
        if word in vocList:
            returnVec[vocList.index(word)] = 1  # mark the vocabulary word as present
        else:
            print("The word: %s is not in the vocabulary!" % word)
    return returnVec
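As mentioned above, the vector can also count occurrences instead of only recording presence or absence. A minimal bag-of-words variant of detectInput() might look like this (the function name bagOfWordsVec is my own, not from the original code):

# Bag-of-words sketch: count how many times each vocabulary word occurs
# in the input document instead of storing a 0/1 flag.
def bagOfWordsVec(vocList, inputStream):
    returnVec = [0] * len(vocList)            # all-zero list, same length as vocList
    for word in inputStream:
        if word in vocList:
            returnVec[vocList.index(word)] += 1   # increment the count, not just set to 1
    return returnVec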

2. Train the algorithm: compute probabilities from the word vectors. The point (x,y) in the earlier rule is replaced by the vector w, whose length equals the length of the word vector, giving the following formula:

p(ci|w) = p(w|ci) * p(ci) / p(w)

First compute the probability p(ci) of each class ci: divide the number of documents belonging to class i by the total number of documents, i.e., count(label == i) / len(label).

Next, given a class ci, compute the probability p(w|ci) of the vector w in that class. Because Naive Bayes assumes that all features are independent of one another, we have:

p(w|ci) = p(w0,w1,...,wn|ci) = p(w0|ci)*p(w1|ci)*...*p(wn|ci), i.e., compute the probability of each word wj in class ci and multiply the results together.
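As a small numeric sketch of this product (the per-word conditional probabilities below are invented for illustration, not taken from the training data):

# Hypothetical per-word conditional probabilities p(wj | c1) for a 3-word vector
p_w_given_c1 = [0.05, 0.20, 0.10]

# Under the independence assumption, p(w | c1) is simply their product
p_w = 1.0
for p in p_w_given_c1:
    p_w *= p
print("p(w | c1) =", p_w)   # 0.05 * 0.20 * 0.10 = 0.001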

The pseudocode is as follows:

Count the number of documents in each class
For each training document:
    For each class:
        If a word appears in the document -> increment the count for that word
        Increment the total word count for that class
For each class:
    Divide each word count by the total word count to obtain the conditional probability

The code of the Bayesian Classifier Training function is as follows:

def trainNaiveBayes(trainMatrix, classLabel):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pBase = sum(classLabel) / float(numTrainDocs)
    # The following settings aim at avoiding a probability of 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if classLabel[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p0 = log(p0Num / p0Denom)
    p1 = log(p1Num / p1Denom)
    return p0, p1, pBase

3. Test the algorithm: check the effect of the classifier.

In p(w0|ci)*p(w1|ci)*...*p(wn|ci), if any single factor is 0 the whole product becomes 0. Therefore every word count is initialized to 1 and every denominator to 2, which avoids zero probabilities without materially changing the result. In addition, taking logarithms turns the product p(w0|ci)*p(w1|ci)*...*p(wn|ci) into the sum ln(p(w0|ci)) + ln(p(w1|ci)) + ... + ln(p(wn|ci)). Because ln(x) is a monotonically increasing function, taking the logarithm does not change which class has the larger probability, and it also avoids numerical underflow when many small probabilities are multiplied. After these modifications to the code, run the test:
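A minimal sketch of why the logarithm helps (the probability value 0.01 and the count of 200 words are arbitrary numbers chosen only for the demonstration):

from math import log

# Multiplying many small probabilities underflows to 0.0 in floating point...
prob = 1.0
for _ in range(200):
    prob *= 0.01
print("direct product:", prob)    # 0.0 (underflow)

# ...while summing their logarithms stays perfectly representable
log_prob = sum(log(0.01) for _ in range(200))
print("sum of logs:", log_prob)   # about -921.03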

# Classify by comparing the two log-probability sums
def naiveBayesClassify(vec2Classify, p0, p1, pBase):
    p0res = sum(vec2Classify * p0) + log(1.0 - pBase)
    p1res = sum(vec2Classify * p1) + log(pBase)
    if p1res > p0res:
        return 1
    else:
        return 0

# Test the algorithm
def testNaiveBayes():
    loadData, classLabel = loadDataSet()
    vocList = createNonRepeatedList(loadData)
    trainMat = []
    for doc in loadData:
        trainMat.append(detectInput(vocList, doc))
    p0, p1, pBase = trainNaiveBayes(array(trainMat), array(classLabel))
    testInput = ['love', 'my', 'dalmation']
    thisDoc = array(detectInput(vocList, testInput))
    print(testInput, 'classified as:', naiveBayesClassify(thisDoc, p0, p1, pBase))
    testInput = ['stupid', 'garbage']
    thisDoc = array(detectInput(vocList, testInput))
    print(testInput, 'classified as:', naiveBayesClassify(thisDoc, p0, p1, pBase))

testNaiveBayes()

Finally, two word lists are tested: the first is classified as non-insulting and the second as insulting, both of which are correct.

IV. Summary

The experiments above implement a basic Naive Bayes classifier and perform the text classification correctly. The next step is to study further and apply Naive Bayes to practical problems such as spam filtering and inferring regional preferences from personal ads.
