Advantages and disadvantages of the algorithm
Pros: still effective with little data; can handle multi-class problems
Cons: sensitive to how the input data is prepared
Applicable data type: nominal data
Algorithm idea:
Naive Bayes
Bayesian classification is a general term for a family of classification algorithms; they are all based on Bayes' theorem, hence the collective name.
For example, suppose we want to determine whether an e-mail message is spam. We know the distribution of words within this message, and we also know how often those words appear in spam; Bayes' theorem then lets us compute the probability that the message is spam.
The naive Bayes classifier makes the simplifying assumptions that the features are independent of one another (this is what "naive" refers to) and that each feature is equally important.
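As a rough numeric sketch of that Bayes'-theorem step (all the probabilities below are made-up illustrative values, not estimates from real mail):

```python
# Hypothetical values for illustration only:
p_spam = 0.4                 # prior: 40% of mail is spam
p_word_given_spam = 0.6      # the word appears in 60% of spam
p_word_given_ham = 0.05      # and in 5% of normal mail

# Bayes' theorem: P(spam|word) = P(word|spam) * P(spam) / P(word),
# where P(word) is expanded by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # → 0.889
```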
Functions
loadDataSet()
Creates the dataset: a list of tokenized sentences representing user comments on a forum, where label 1 marks an abusive comment.
createVocabList(dataSet)
Collects all the unique words across the sentences, which determines the size of our word vectors.
setOfWords2Vec(vocabList, inputSet)
Converts a sentence into a vector over the vocabulary. This is the Bernoulli model: it records only whether each word is present.
bagOfWords2VecMN(vocabList, inputSet)
Another way of turning a sentence into a vector: the multinomial model, which takes the number of occurrences of each word into account.
trainNB0(trainMatrix, trainCategory)
Computes P(c) along with P(w[i]|c=1) and P(w[i]|c=0). There are two tricks here. First, the numerator and denominator counts are not initialized to 0: they start at 1 and 2 respectively, so a single zero probability cannot force the whole product to zero. Second, logarithms are taken and log-probabilities are summed instead of multiplied, preventing floating-point underflow from driving the result to 0.
classifyNB(vec2Classify, p0Vec, p1Vec, pClass1)
Uses Bayes' formula to compute the probability that the vector belongs to each of the two classes and returns the more likely class.
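To make the two vector models concrete before the full listing, here is a minimal standalone sketch (the vocabulary and sentence below are illustrative, not the article's data):

```python
# Illustrative vocabulary and input sentence:
vocab = ['my', 'dog', 'stupid', 'garbage']
sentence = ['stupid', 'dog', 'stupid']

# Bernoulli (set-of-words) model: presence only
set_vec = [1 if w in sentence else 0 for w in vocab]
# Multinomial (bag-of-words) model: occurrence counts
bag_vec = [sentence.count(w) for w in vocab]

print(set_vec)   # → [0, 1, 1, 0]
print(bag_vec)   # → [0, 1, 2, 0]
```

The only difference shows up for repeated words: "stupid" occurs twice, which the bag-of-words vector records but the set-of-words vector does not.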
# coding=utf-8
from numpy import array, log, ones


def loadDataSet():
    """Toy dataset of tokenized forum comments; label 1 marks an abusive post."""
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec


def createVocabList(dataSet):
    """Create a list of all unique words in the dataset."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Bernoulli model: record only whether each vocabulary word is present."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('word %s not in dict' % word)
    return returnVec


def bagOfWords2VecMN(vocabList, inputSet):
    """Multinomial model: count how many times each vocabulary word occurs."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Estimate P(c=1) and the per-word conditionals P(w[i]|c) for both classes."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to 1 (denominators to 2) so that a single zero
    # probability cannot zero out the whole product
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs for precision; otherwise the product can underflow to zero
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Compare the log posterior of each class and return the more likely one."""
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # element-wise multiply
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    return 1 if p1 > p0 else 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))


def main():
    testingNB()


if __name__ == '__main__':
    main()
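The logarithm trick used during training can be seen in isolation: multiplying many small per-word probabilities underflows a 64-bit float to zero, while summing their logarithms stays finite and still lets the two classes be compared. A minimal sketch (the probabilities are made up for illustration):

```python
import math

# 80 per-word probabilities of 1e-5 each (illustrative values)
probs = [1e-5] * 80

direct = 1.0
for p in probs:
    direct *= p              # 1e-400 is below float64 range: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # finite, comparable across classes

print(direct)    # → 0.0
print(log_sum)
```

Because log is monotonic, comparing log-probability sums gives the same winner as comparing the raw products would, without ever hitting zero.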