Machine Learning 4: Naive Bayes

Source: Internet
Author: User


Probability-based classification method: Naive Bayes

Bayesian decision theory

Naive Bayes is built on Bayesian decision theory, so before explaining Naive Bayes, let's take a quick look at the basics of Bayesian decision theory.

The core idea of Bayesian decision theory is to select the decision with the highest probability. For example, suppose we are predicting a graduate's job direction: the probability of choosing C++ is 0.3, the probability of choosing Java is 0.2, and the probability of choosing machine learning is 0.5. We would then classify this graduate's job direction as machine learning.
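The "pick the most probable class" rule can be sketched in a few lines of Python, using the made-up career probabilities from the example above:

```python
# Hypothetical class probabilities from the example above
probs = {'C++': 0.3, 'Java': 0.2, 'Machine Learning': 0.5}

# Bayesian decision rule: choose the class with the highest probability
best = max(probs, key=probs.get)
print(best)  # Machine Learning
```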

Conditional Probability

What is conditional probability? The probability that event A occurs given that event B is known to have occurred is written P(A|B) and read as "the probability of A given B". It is defined as P(A|B) = P(AB) / P(B).

Example 1: Two dice are thrown. Given that the first die shows six, find the probability that the sum of the two dice is greater than or equal to 10.
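Example 1 can be checked by direct enumeration: with the first die fixed at six, the sum is at least 10 exactly when the second die shows 4, 5, or 6, so the conditional probability is 3/6 = 1/2. A quick sketch:

```python
from fractions import Fraction

# Condition: the first die already shows 6; enumerate the second die.
favorable = [d for d in range(1, 7) if 6 + d >= 10]   # 4, 5, 6
p = Fraction(len(favorable), 6)
print(p)  # 1/2
```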

Example 2: There are three boxes numbered 1, 2, and 3. Box 1 holds one red ball and four white balls, box 2 holds two red balls and three white balls, and box 3 holds three red balls. A box is chosen at random and a ball is drawn from it; find the probability that the ball is red.
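Example 2 is a law-of-total-probability calculation: P(red) = Σ P(box i) · P(red | box i). Assuming each box is chosen with probability 1/3 and box 3 contains only its three red balls:

```python
from fractions import Fraction

third = Fraction(1, 3)                      # each box equally likely
p_red_given_box = [Fraction(1, 5),          # box 1: 1 red of 5 balls
                   Fraction(2, 5),          # box 2: 2 red of 5 balls
                   Fraction(1, 1)]          # box 3: 3 red of 3 balls
p_red = sum(third * p for p in p_red_given_box)
print(p_red)  # 8/15
```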

Another effective tool for computing conditional probabilities is Bayes' rule, which tells us how to swap the condition and the result in a conditional probability. If P(x|c) is known and P(c|x) is required, we can use: P(c|x) = P(x|c) P(c) / P(x).
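Continuing the box example, Bayes' rule lets us reverse the conditioning: given that a red ball was drawn, how likely is it that it came from box 1? (The total probability 8/15 is the value computed in Example 2.)

```python
from fractions import Fraction

p_box1 = Fraction(1, 3)            # prior: box 1 chosen
p_red_given_box1 = Fraction(1, 5)  # box 1: 1 red of 5 balls
p_red = Fraction(8, 15)            # total probability of red (Example 2)

# Bayes' rule: P(box1 | red) = P(red | box1) * P(box1) / P(red)
p_box1_given_red = p_red_given_box1 * p_box1 / p_red
print(p_box1_given_red)  # 1/8
```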

So far, we have a basic grasp of Bayesian decision theory and conditional probability. Now let's write some Python code to implement a Naive Bayes classifier.

Naive Bayes Classifier

Take a message board as an example. To keep the community healthy, we need to block insulting comments, so we build a quick filter: if a message uses negative or insulting words, it is flagged as inappropriate content. Filtering such content is a basic requirement for many websites. Here we classify comments as insulting or non-insulting, represented by 1 and 0 respectively.

```python
# coding: utf-8
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # 1 = insulting, 0 = not insulting
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])
    for doc in dataSet:
        vocabSet = vocabSet | set(doc)   # union of the two sets
    return list(vocabSet)

def setOfWord2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
```

The first function, loadDataSet(), creates some experimental samples. The first variable it returns is a set of documents already split into words; these documents come from a message board about dogs. The second variable is the set of class labels, with two categories: insulting and non-insulting.

The next function, createVocabList(), builds a list of the unique words appearing across all documents, using the set data type. The "|" operator computes the union of two sets.

The third function, setOfWord2Vec(), takes the vocabulary list and a document as input and outputs a document vector. Each element of the vector is 1 or 0, indicating whether the corresponding vocabulary word appears in the input document.

Next, let's run these functions and look at the output.

```python
# coding: utf-8
import bayes
from numpy import *

listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
print(myVocabList, len(myVocabList))
print(bayes.setOfWord2Vec(myVocabList, listOPosts[0]))
print(bayes.setOfWord2Vec(myVocabList, listOPosts[3]))
```

Calculate Probability

```python
def trainNBO(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing),
    # so one unseen word cannot zero out the whole product.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs so the classifier can add log probabilities instead of
    # multiplying many small numbers, which would underflow.
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
```
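A quick sketch of why trainNBO returns log probabilities: multiplying many small conditional probabilities underflows to 0.0 in floating point, while summing their logs stays well-behaved. The per-word probability 0.01 and the 1000-word document here are made-up numbers for illustration:

```python
from math import log

p_word = 0.01          # assumed probability of each word
n_words = 1000         # assumed document length

product = 1.0
for _ in range(n_words):
    product *= p_word
print(product)         # 0.0 -- the product has underflowed

log_sum = n_words * log(p_word)
print(log_sum)         # about -4605.2, still usable for comparisons
```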

 

Everything is ready. Next comes the key piece: the Naive Bayes classification function.

 

```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log P(c) plus the sum of log P(w|c) over the words in the document
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNBO(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```

 

Running testingNB() shows that, as expected, both test posts are classified correctly.

Summary:

Overall, the Naive Bayes algorithm is more complex than decision trees or kNN, and it takes somewhat more code. Looking back at my freshman year, I didn't take probability theory seriously, and now I'm paying the price as a rookie. Ah, we were young! The conditional probability part involves many formulas that are still hazy to me, so I'll spend some more time on probability next. Once the probability is solid, I believe reading the code above will be no problem. The classifyNB() function is where the final probabilities are compared.

Come on, BaiYiShaoNian!
