Python Implementation of the Naive Bayes Algorithm
This article describes a Python implementation of the Naive Bayes algorithm, shared here for reference. The details are as follows:
Advantages and disadvantages of the Naive Bayes algorithm
Advantages: it remains effective when little data is available, and it can handle multi-class problems
Disadvantage: it is sensitive to how the input data is prepared
Applicable data type: nominal data
Algorithm idea:
For example, suppose we want to decide whether an email is spam. We know which words the email contains. If we also know how often each word occurs in spam and in non-spam email, Bayes' theorem gives the probability that the email is spam.
The Naive Bayes classifier makes two assumptions: the features are independent of one another given the class (this is what makes it "naive"), and each feature is equally important.
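The idea above can be written out as a formula. This is the standard textbook form of Bayes' theorem combined with the usual naive independence assumption. The symbols c (a class, such as spam or not spam), w (the word vector of a document) and w_i (its i-th entry) are notation chosen here for illustration.

```latex
p(c \mid w) = \frac{p(w \mid c)\, p(c)}{p(w)},
\qquad
p(w \mid c) = \prod_{i=1}^{n} p(w_i \mid c)
```

Because p(w) is the same for every class, it is enough to compare p(w | c) p(c) across the classes and pick the largest.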
Functions:
loadDataSet ()
Creates the dataset: a list of forum posts that have already been split into words, together with one label per post. A label of 1 means the post is abusive.
createVocabList (dataSet)
Builds the list of all unique words that appear in the posts. Its length determines the size of the word vectors.
setOfWords2Vec (vocabList, inputSet)
Converts a post into a vector over the vocabulary. This is the Bernoulli (set-of-words) model: each entry records only whether the word is present.
bagOfWords2VecMN (vocabList, inputSet)
Another way to convert a post into a vector. This is the multinomial (bag-of-words) model: each entry counts how many times the word occurs.
trainNB0 (trainMatrix, trainCatergory)
Computes P (C [1]), P (w [i] | C [1]) and P (w [i] | C [0]). Two tricks are used here. First, the word counts are initialized to 1 and the denominators to 2 instead of 0, so that a single zero probability cannot make the whole product 0. Second, logarithms are taken so that multiplying many small probabilities does not underflow to 0.
classifyNB (vec2Classify, p0Vec, p1Vec, pClass1)
Uses Bayes' formula to decide which of the two classes has the higher probability.
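The difference between the two vectorization models described above is easiest to see on a post that repeats a word. The following is only a compact sketch, not the article's own functions. The names set_of_words and bag_of_words, the vocabulary and the sample post are all hypothetical and chosen for illustration.

```python
def set_of_words(vocab, words):
    """Bernoulli model: 1 if the vocabulary word occurs at all, else 0."""
    return [1 if v in words else 0 for v in vocab]


def bag_of_words(vocab, words):
    """Multinomial model: how many times each vocabulary word occurs."""
    return [words.count(v) for v in vocab]


# Hypothetical vocabulary and post, chosen only for illustration.
vocab = ['dog', 'stupid', 'food']
post = ['stupid', 'dog', 'stupid']

set_vec = set_of_words(vocab, post)
bag_vec = bag_of_words(vocab, post)

print(set_vec)  # presence only
print(bag_vec)  # occurrence counts
```

The two vectors differ only in the entry for the repeated word: the first model records that it is present, the second records how often it occurs.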
The code is as follows:
# coding=utf-8
from numpy import *
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'who', 'to', 'stop', 'him'],
                   ['quit', 'bucket', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # one label per post: 1 is abusive, 0 is not
    return postingList, classVec
# Create a list of all unique words in the dataset
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)
# Bernoulli (set-of-words) model: mark each vocabulary word as present (1) or absent (0)
def setOfWords2Vec(vocabList, inputSet):
    retVocabList = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            retVocabList[vocabList.index(word)] = 1
        else:
            print('word', word, 'not in dict')
    return retVocabList
# Another model: multinomial (bag of words), which counts occurrences
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
def trainNB0(trainMatrix, trainCatergory):
    numTrainDoc = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCatergory) / float(numTrainDoc)
    # Start the counts at 1 and the denominators at 2 so that a single
    # zero probability cannot make the whole product 0
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDoc):
        if trainCatergory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs so that multiplying many small probabilities does not underflow to 0
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Element-wise multiplication, then sum: the log-likelihood plus the log-prior
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
def main():
    testingNB()

if __name__ == '__main__':
    main()
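The two numerical safeguards used in trainNB0 can also be looked at in isolation. The sketch below is not part of the original program. The function name smoothed_prob and all the numbers are hypothetical, chosen only to make the effect visible.

```python
import math


def smoothed_prob(count, total):
    """Estimate with the count starting at 1 and the denominator at 2, as in trainNB0."""
    return (count + 1.0) / (total + 2.0)


# A word never seen in a class: the raw estimate is exactly 0,
# which would zero out the whole product of probabilities.
raw = 0 / 10.0
smoothed = smoothed_prob(0, 10)

# Hypothetical per-word probabilities, small enough to trigger underflow.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 in double precision

# Summing logarithms keeps the score as a finite negative number.
log_sum = sum(math.log(p) for p in probs)

print(raw, smoothed)
print(product, log_sum)
```

Since the logarithm is monotonic, comparing sums of logs picks the same class as comparing the products would, which is why classifyNB can add log-probabilities instead of multiplying probabilities.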
I hope this article helps you with your Python programming.