Python Implementation of the Naive Bayes Algorithm

Source: Internet
Author: User

Advantages and disadvantages of Naive Bayes Algorithms

  • Advantage: remains effective when only a small amount of training data is available, and can handle multi-class problems
  • Disadvantage: sensitive to how the input data is prepared
  • Applicable data type: nominal data
Algorithm idea:

Naive Bayes
For example, to decide whether an email is spam, we look at the words that appear in it. If we also know how often each word appears in spam and in non-spam emails, and how common spam is overall, Bayes' theorem gives the probability that the email is spam.
The Naive Bayes classifier assumes that the features are conditionally independent given the class, which is what makes it "naive". A second assumption is that every feature is equally important.
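For reference, the rule being applied can be written out as below. The notation is introduced here for illustration only: c is the class (spam or not spam) and w_1, ..., w_n are the word features of one document. The first line is Bayes' theorem; the second is the independence approximation that the classifier relies on.

```latex
\begin{align}
P(c \mid w_1, \dots, w_n) &= \frac{P(w_1, \dots, w_n \mid c)\, P(c)}{P(w_1, \dots, w_n)} \\
P(w_1, \dots, w_n \mid c) &\approx \prod_{i=1}^{n} P(w_i \mid c)
\end{align}
```

The denominator is the same for every class, so comparing classes only requires the numerator.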

Functions

loadDataSet()

Creates a small example dataset. Each entry is a forum comment that has already been split into words, and each comment has a label: 1 means the comment is abusive, 0 means it is not.

createVocabList(dataSet)

Collects every distinct word that appears in the sentences. The length of this vocabulary list determines the size of the word vectors.
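A minimal sketch of the vocabulary step, assuming a hypothetical two-sentence corpus. The name `build_vocab` is illustrative and is not the function from the listing. Sorting is added here only to make the word order reproducible, since iterating over a Python set gives no guaranteed order.

```python
def build_vocab(sentences):
    """Return the sorted list of unique words across all tokenized sentences."""
    vocab = set()
    for sentence in sentences:
        vocab |= set(sentence)  # union in the words of this sentence
    return sorted(vocab)


# Hypothetical toy corpus, already split into words
corpus = [["my", "dog", "has", "fleas"], ["my", "dog", "is", "cute"]]
vocab = build_vocab(corpus)
print(vocab)       # ['cute', 'dog', 'fleas', 'has', 'is', 'my']
print(len(vocab))  # 6
```

Each word's position in this list becomes its index in every word vector built afterwards, so the same vocabulary list has to be used for training and for classification.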

setOfWords2Vec(vocabList, inputSet)

Converts a sentence into a vector over the vocabulary. This is the Bernoulli model: each entry only records whether the word occurs in the sentence (1) or not (0).

bagOfWords2VecMN(vocabList, inputSet)

An alternative way to convert a sentence into a vector. This is the multinomial model: each entry records how many times the word occurs in the sentence.
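The difference between the two vectorization models is easiest to see on a sentence that repeats a word. The sketch below uses a hypothetical three-word vocabulary and illustrative function names (`set_of_words`, `bag_of_words`); it is a simplified restatement of the idea, not the listing itself.

```python
def set_of_words(vocab, words):
    """Bernoulli-style vector: 1 if the word occurs at all, else 0."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] = 1
    return vec


def bag_of_words(vocab, words):
    """Multinomial-style vector: number of times each word occurs."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] += 1
    return vec


vocab = ["dog", "garbage", "stupid"]           # hypothetical vocabulary
sentence = ["stupid", "dog", "stupid", "cat"]  # 'cat' is out of vocabulary

presence = set_of_words(vocab, sentence)
counts = bag_of_words(vocab, sentence)
print(presence)  # [1, 0, 1]
print(counts)    # [1, 0, 2]
```

Both versions silently skip the out-of-vocabulary word; they differ only in whether the second "stupid" is counted.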

trainNB0(trainMatrix,trainCatergory)

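Computes P(c1), the prior probability of the abusive class, together with the conditional probabilities P(w_i | c1) and P(w_i | c0) for every word w_i. Two tricks are used. First, the word counts start at 1 and the denominators at 2 instead of 0, so that a word never seen in one class does not get probability 0 and turn the whole product into 0. Second, the function returns the logarithms of the probabilities, so that multiplying many small numbers does not underflow to 0.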
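Both tricks can be checked on made-up numbers. The sketch below assumes hypothetical word counts for a single class and uses the same starting values as trainNB0 (counts at 1, denominator at 2), which amounts to a form of add-one smoothing.

```python
import numpy as np

# Hypothetical counts of three vocabulary words in class-1 documents
word_counts = np.array([3.0, 0.0, 1.0])

# Unsmoothed estimate: the unseen word gets probability exactly 0
raw = word_counts / word_counts.sum()              # 0.75, 0.0, 0.25

# Smoothed estimate: counts start at 1, denominator starts at 2
smoothed = (word_counts + 1.0) / (word_counts.sum() + 2.0)  # 4/6, 1/6, 2/6

# Multiplying many small probabilities underflows to 0.0 in float64 ...
probs = np.full(400, 0.01)
product = np.prod(probs)

# ... while the sum of logs stays finite and can still be compared
log_sum = np.sum(np.log(probs))

print(raw, smoothed)
print(product)  # 0.0
print(log_sum)  # roughly -1842.07
```

With these starting values the smoothed estimates no longer sum to exactly 1. That does not matter for classification here, because the scores are only compared between classes.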

classifyNB(vec2Classify, p0Vec, p1Vec, pClass1)

Uses Bayes' formula to compute a score for each of the two classes and returns the class with the higher score.
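Written out, the comparison made by classifyNB looks like the following. The symbols are notation introduced here for illustration: x_i is the i-th entry of vec2Classify, and the log-probabilities log P(w_i | c) are the entries of p0Vec and p1Vec.

```latex
\hat{c} = \arg\max_{c \in \{0, 1\}} \left[ \log P(c) + \sum_{i=1}^{n} x_i \log P(w_i \mid c) \right]
```

The shared denominator from Bayes' theorem is left out because it does not change which class scores higher.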
    # coding=utf-8
    import numpy as np


    def loadDataSet():
        """Return tokenized forum posts and their labels."""
        postingList = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'who', 'to', 'stop', 'him'],
            ['quit', 'bucket', 'worthless', 'dog', 'food', 'stupid'],
        ]
        classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
        return postingList, classVec


    # Create a list with all unique words
    def createVocabList(dataSet):
        vocabSet = set()
        for document in dataSet:
            vocabSet = vocabSet | set(document)
        return list(vocabSet)


    # Bernoulli model: 1 if the word is present, 0 otherwise
    def setOfWords2Vec(vocabList, inputSet):
        retVocabList = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                retVocabList[vocabList.index(word)] = 1
            else:
                print('word', word, 'not in dict')
        return retVocabList


    # Another model (multinomial): count how often each word occurs
    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec


    def trainNB0(trainMatrix, trainCatergory):
        numTrainDoc = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = np.sum(trainCatergory) / float(numTrainDoc)
        # Start counts at 1 and denominators at 2 so that no single
        # probability is 0, which would make the whole product 0
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDoc):
            if trainCatergory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += np.sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += np.sum(trainMatrix[i])
        # Take logs so that products of small probabilities do not underflow to 0
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive


    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)  # element-wise mult
        p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0


    def testingNB():
        listOPosts, listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)
        trainMat = []
        for postinDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
        p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
        testEntry = ['stupid', 'garbage']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))


    def main():
        testingNB()


    if __name__ == '__main__':
        main()

     
