Machine Learning in Action learning notes: a classification method based on Naive Bayes
Probability is the foundation of many machine learning algorithms. A small amount of probability already appeared when generating decision trees: counting how many times a feature takes a particular value in a dataset and dividing by the total number of instances gives the probability of the feature taking that value.
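As a tiny sketch of that count-divided-by-total estimate (the feature name and values below are made up purely for illustration):

# a hypothetical dataset: each record stores the value of one feature, e.g. "outlook"
samples = ['sunny', 'rainy', 'sunny', 'overcast', 'sunny']

# count how often the feature takes the value 'sunny', divide by the total number of instances
count = sum(1 for s in samples if s == 'sunny')
prob = float(count) / len(samples)      # 3 / 5 = 0.6
print "P(outlook = sunny) =", prob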
Directory:
I. Classification method based on Bayesian theory
II. Application scenarios of Naive Bayes
III. Text classification based on Python and Naive Bayes
    1. Prepare data
    2. Train the algorithm
    3. Test the algorithm
IV. Summary
The main text follows.
I. Classification method based on Bayesian theory
Assume that a dataset consists of two classes of data, and that the parameters of the two probability distributions are known: p1(x,y) denotes the probability that the current data point (x,y) belongs to class 1, and p2(x,y) denotes the probability that it belongs to class 2.
The core idea of Bayesian decision theory is to choose the decision with the highest probability, that is, to select the class with the higher probability. It is sometimes summarized as the principle that the more probable class wins.
Specifically, for a data point (x,y), the following rules determine its category:
If p1(x,y) > p2(x,y), the point (x,y) is assigned to class 1.
If p1(x,y) < p2(x,y), the point (x,y) is assigned to class 2.
Of course, in practice not all problems can be solved simply by the rule above, because that criterion is not the whole of Bayesian decision theory; p1(x,y) and p2(x,y) are used only to simplify the description. More often, we use p(ci|x,y) to denote the probability that a data point with coordinates (x,y) belongs to class ci. Specifically, Bayes' rule computes this unknown probability from three known probability values:

p(ci|x,y) = p(x,y|ci) * p(ci) / p(x,y)
The determination rules above can then be rewritten as:
If p(c1|x,y) > p(c2|x,y), the point (x,y) is assigned to class 1.
If p(c1|x,y) < p(c2|x,y), the point (x,y) is assigned to class 2.
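As a minimal sketch of this decision rule, the snippet below compares the two posterior probabilities computed from Bayes' rule; all numbers (priors and likelihoods) are made up purely for illustration:

# hypothetical known quantities for one data point (x, y)
p_xy_given_c1 = 0.20   # p(x,y | c1), likelihood under class 1
p_xy_given_c2 = 0.05   # p(x,y | c2), likelihood under class 2
p_c1 = 0.4             # p(c1), prior of class 1
p_c2 = 0.6             # p(c2), prior of class 2
p_xy = p_xy_given_c1 * p_c1 + p_xy_given_c2 * p_c2   # p(x,y), total probability

# Bayes' rule: p(ci | x,y) = p(x,y | ci) * p(ci) / p(x,y)
p_c1_given_xy = p_xy_given_c1 * p_c1 / p_xy
p_c2_given_xy = p_xy_given_c2 * p_c2 / p_xy

label = 1 if p_c1_given_xy > p_c2_given_xy else 2
print "p(c1|x,y) = %.3f, p(c2|x,y) = %.3f, assigned class: %d" % (p_c1_given_xy, p_c2_given_xy, label)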
II. Application scenarios of Naive Bayes
An important application of machine learning is automatic document classification, and Naive Bayes is a common algorithm for this task. The basic step is to traverse each document and record the words that appear in it, treating the presence or absence of each word as a feature; this yields as many features as there are words in the vocabulary. If there are a large number of features, a histogram works better for examining them. The general process of Naive Bayes follows the steps worked through in the next section: prepare the data, train the algorithm, and test it.
At this point you may wonder why the word "naive" is attached to Bayes. It refers to the basic assumption behind Naive Bayes: features are mutually (statistically) independent. For example, the probability that one word appears is assumed to be unrelated to the probability that any other word appears. This is rarely strictly true in practice, but the assumption greatly simplifies the computation, and countless experiments show that Naive Bayes still performs very well.
Another assumption of Naive Bayes is that every feature is equally important. This assumption is also questionable (otherwise it would not be called an assumption), yet in practice both assumptions work well enough to be useful.
III. Text classification based on Python and Naive Bayes
To extract features from text, first split the text and convert it into a word vector: a word that is present is recorded as 1, and a word that is absent as 0, so that a long string becomes a simple vector of 0s and 1s. This only considers whether a word appears; alternatively, the vector can record the number of occurrences of each word, or the frequency of each word.
1. Prepare data: build word vectors from the text. Here we collect all words that appear in all documents, and then convert each document into a vector over that vocabulary. The following code implements this:
The function loadDataSet() creates some experimental samples postingList and the corresponding labels listClass; samples labeled 1 contain insulting words.
The function createNonRepeatedList() builds and returns a list vocList containing all of the words that appear in the documents, with duplicates removed, using Python's set.
The function detectInput(vocList, inputStream) uses the vocabulary vocList; inputStream is the list of words to be checked, and the output is a document vector whose elements are 1 or 0, indicating whether each word of the vocabulary appears in the input document.
# -*- coding: UTF-8 -*-
"""
Created on Tue Sep 08 16:12:55 2015
@author: Administrator
"""
from numpy import *

# create the experiment samples; real samples may need extra processing,
# such as removing punctuation
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'lick', 'ate', 'my', 'steak', 'who', 'to', 'stop', 'him'],
                   ['quit', 'bucket', 'worthless', 'dog', 'food', 'stupid']]
    listClass = [0, 1, 0, 1, 0, 1]  # 1 means the post contains insulting words, 0 means it does not
    return postingList, listClass

# save all words of the documents into one list, using set() to remove duplicates
def createNonRepeatedList(data):
    vocList = set([])
    for doc in data:
        vocList = vocList | set(doc)  # union of the vocabulary and the current document's words
    return list(vocList)

# convert an input word list into a 0/1 vector over the vocabulary
def detectInput(vocList, inputStream):
    returnVec = [0] * len(vocList)  # create an all-0 list with the same length as vocList
    for word in inputStream:
        if word in vocList:
            returnVec[vocList.index(word)] = 1  # mark the word as present
        else:
            print "The word: %s is not in the vocabulary!" % word
    return returnVec
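A quick way to exercise these helpers, plus a bag-of-words variant for the count-based representation mentioned above (detectInputCount is a hypothetical name added here for illustration):

loadData, classLabel = loadDataSet()
vocList = createNonRepeatedList(loadData)
print detectInput(vocList, ['my', 'dog', 'is', 'cute'])   # a 0/1 vector over the vocabulary

# bag-of-words variant: store how many times each word occurs instead of 0/1
def detectInputCount(vocList, inputStream):
    returnVec = [0] * len(vocList)
    for word in inputStream:
        if word in vocList:
            returnVec[vocList.index(word)] += 1   # count occurrences rather than flag presence
    return returnVec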
2. Train the algorithm: calculate probabilities from the word vectors. The point (x,y) is now replaced by the vector w, whose length equals the length of the word vector, and Bayes' rule becomes:

p(ci|w) = p(w|ci) * p(ci) / p(w)

First compute p(ci), the probability of class ci, by dividing the number of documents in class i by the total number of documents, i.e. label_i / sum(label).
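For the six sample posts above, this prior works out as follows (a trivial check, using the same labels as listClass in loadDataSet):

labels = [0, 1, 0, 1, 0, 1]                 # labels of the six sample posts
pC1 = sum(labels) / float(len(labels))      # 3 insulting posts out of 6 -> 0.5
pC0 = 1.0 - pC1                             # 0.5
print "p(c1) =", pC1, " p(c0) =", pC0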
Next, for a known class ci, compute the probability p(w|ci) of observing w within that class. Because Naive Bayes assumes that all features are mutually independent:

p(w|ci) = p(w0,w1,...,wn|ci) = p(w0|ci) * p(w1|ci) * ... * p(wn|ci)

so it is enough to compute the probability of each word wj given class ci and multiply the results.
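A tiny sketch of that factorization, with made-up per-word probabilities for a three-word document under one class:

# hypothetical conditional probabilities p(wj | c1) for the words of one document
p_words_given_c1 = [0.10, 0.25, 0.05]

p_w_given_c1 = 1.0
for p in p_words_given_c1:
    p_w_given_c1 *= p            # p(w|c1) = p(w0|c1) * p(w1|c1) * p(w2|c1)
print "p(w|c1) =", p_w_given_c1  # 0.00125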
The pseudocode is as follows:
Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document -> increment the count for that token
        Increment the count of all tokens
For each class:
    Divide the count of each token by the total token count to obtain the conditional probability
The code of the Bayesian Classifier Training function is as follows:
def trainNaiveBayes(trainMatrix, classLabel):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pBase = sum(classLabel) / float(numTrainDocs)   # prior probability of the insulting class
    # the following settings aim at avoiding probabilities of 0 (explained below)
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if classLabel[i] == 1:
            p1Num += trainMatrix[i]            # accumulate word counts of insulting documents
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]            # accumulate word counts of normal documents
            p0Denom += sum(trainMatrix[i])
    p0 = log(p0Num / p0Denom)                  # log conditional probabilities for class 0
    p1 = log(p1Num / p1Denom)                  # log conditional probabilities for class 1
    return p0, p1, pBase
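With the data-preparation functions from earlier, the training function can be exercised like this (a small driver sketch; the same steps also appear inside testNaiveBayes below):

loadData, classLabel = loadDataSet()
vocList = createNonRepeatedList(loadData)
trainMat = []
for doc in loadData:
    trainMat.append(detectInput(vocList, doc))     # one 0/1 vector per document
p0, p1, pBase = trainNaiveBayes(array(trainMat), array(classLabel))
print "pBase :", pBase        # prior of the insulting class, 0.5 for the sample data
# print "trainMat : ", trainMat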
3. Test the algorithm: verify the classifier's effect.
In the product p(w0|ci) * p(w1|ci) * ... * p(wn|ci), if any single factor is 0 the whole product becomes 0. Therefore the count of every word is initialized to 1 and the denominators are initialized to 2, which does not change the relative results. In addition, the product p(w0|ci) * p(w1|ci) * ... * p(wn|ci) is replaced by its logarithm, ln(p(w0|ci)) + ln(p(w1|ci)) + ... + ln(p(wn|ci)). Because ln(x) is a monotonically increasing function (a property commonly used in advanced mathematics), taking the logarithm of the product does not change which class ends up with the larger probability, and it also avoids underflow when many small probabilities are multiplied. The trainNaiveBayes function above already includes these modifications; a small numeric check of the underflow issue is sketched below, followed by the classification and test code:
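A minimal illustration (made-up numbers) of why logarithms help: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays well-behaved and still allows the comparison between classes:

from math import log as ln

probs = [1e-5] * 100              # 100 hypothetical small word probabilities
product = 1.0
for p in probs:
    product *= p
print product                     # 0.0 -- underflows, the true value 1e-500 is not representable
print sum(ln(p) for p in probs)   # about -1151.29, still perfectly usable for comparison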
# classify an input vector using the trained log probabilities
def naiveBayesClassify(vec2Classify, p0, p1, pBase):
    p0res = sum(vec2Classify * p0) + log(1 - pBase)   # log p(w|c0) + log p(c0)
    p1res = sum(vec2Classify * p1) + log(pBase)       # log p(w|c1) + log p(c1)
    if p1res > p0res:
        return 1
    else:
        return 0

# test the algorithm
def testNaiveBayes():
    loadData, classLabel = loadDataSet()
    vocList = createNonRepeatedList(loadData)
    trainMat = []
    for doc in loadData:
        trainMat.append(detectInput(vocList, doc))
    p0, p1, pBase = trainNaiveBayes(array(trainMat), array(classLabel))
    testInput = ['love', 'my', 'dalmation']
    thisDoc = array(detectInput(vocList, testInput))
    print testInput, 'classified as: ', naiveBayesClassify(thisDoc, p0, p1, pBase)
    testInput = ['stupid', 'garbage']
    thisDoc = array(detectInput(vocList, testInput))
    print testInput, 'classified as: ', naiveBayesClassify(thisDoc, p0, p1, pBase)

testNaiveBayes()
Finally, two word lists are tested: the first is classified as non-insulting and the second as insulting, so the classification is correct.
IV. Summary
The experiments above implement a basic Naive Bayes classifier and perform text classification correctly. Further study is needed to apply Naive Bayes to practical problems such as spam filtering and inferring regional preferences from personal ads.