Machine Learning: Implementing the Naive Bayes Algorithm in Python

Source: Internet
Author: User
Tags: natural logarithm

['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], 0

['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], 1

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], 0

['stop', 'posting', 'stupid', 'worthless', 'garbage'], 1

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], 0

['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'], 1


The above shows six sentences, each a list of tokens followed by its label: sentences labeled 0 are normal, and sentences labeled 1 contain abusive language. By estimating, for each word, the probability that it appears in abusive sentences versus normal sentences, we can find out which words signal abuse. For example, 'stupid' occurs only in the three sentences labeled 1, so a post containing it is far more likely to be abusive.
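A quick sanity check of that idea, as a minimal sketch using only the six posts above (the variable names are illustrative, not from the code later in the article):

```python
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

# Fraction of abusive / normal posts that contain the word 'stupid'
abusive = [p for p, y in zip(posts, labels) if y == 1]
normal = [p for p, y in zip(posts, labels) if y == 0]
print(sum('stupid' in p for p in abusive) / len(abusive))   # 1.0 (3 of 3)
print(sum('stupid' in p for p in normal) / len(normal))     # 0.0 (0 of 3)
```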

Note: the basic classifier is modified on the two points below; a third point clarifies the terminology used in the code. A short sketch illustrating all three follows the list.

<1> Bayesian classification multiplies many per-word probabilities to obtain the probability that a document belongs to a class, i.e. it computes P(w0|c1)·P(w1|c1)·P(w2|c1)···. If any single probability in that product is 0, the final product is also 0. The fix used in the code is Laplace smoothing: initialize every word count to 1 and every denominator to 2.

<2> The second problem is underflow, caused by multiplying too many small numbers. Because most of the factors are very small, the product rounds to 0 and the program underflows or returns a wrong answer. The solution is to take the natural logarithm, which turns the product into a sum of logs and avoids errors caused by underflow or floating-point rounding.

<3> Treating the mere presence or absence of each word as a feature is called the set-of-words model; in the bag-of-words model, each word can appear multiple times and its count is recorded.
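Here is a minimal sketch of the three points in isolation (a toy three-word vocabulary; the names are illustrative, not from the original code):

```python
from math import log

vocab = ['stupid', 'dog', 'love']       # illustrative three-word vocabulary
doc = ['stupid', 'stupid', 'dog']

# <3> set-of-words records presence; bag-of-words records counts
set_vec = [1 if w in doc else 0 for w in vocab]   # [1, 1, 0]
bag_vec = [doc.count(w) for w in vocab]           # [2, 1, 0]

# <1> Laplace smoothing: start every count at 1 and the denominator at 2,
#     so a word unseen in a class never zeroes out the whole product
counts = [1 + c for c in bag_vec]                 # [3, 2, 1]
total = 2.0 + sum(bag_vec)                        # 5.0

# <2> use log-probabilities: a sum of logs cannot underflow the way a
#     product of many tiny probabilities can
log_probs = [log(c / total) for c in counts]
print(log_probs)                                  # [log(0.6), log(0.4), log(0.2)]
```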

Bayes' theorem computes the probability we actually want (the posterior) from probabilities that can be measured directly. The naive Bayes classifier used here rests on two assumptions: features are independent of one another, and all features are equally important.
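Written out, this is standard Bayes' rule, with the likelihood factorized by the independence assumption:

```latex
P(c_i \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid c_i)\,P(c_i)}{P(\mathbf{w})},
\qquad
P(\mathbf{w} \mid c_i) = \prod_j P(w_j \mid c_i)
```

Since P(w) is the same for both classes, the classifier simply compares sum_j log P(wj|c1) + log P(c1) against the same quantity for c0, which is exactly what classifyNB below computes.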
The complete implementation:

```python
from numpy import *

# Filter abusive posts on a message board.
# Create an experimental data set.
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

# Create a list of the unique words that appear in all documents.
def createVocabList(dataSet):
    vocabSet = set([])                        # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

# Convert a document into a word vector.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)          # vector whose elements are all 0
    for word in inputSet:
        if word in vocabList:
            # returnVec[vocabList.index(word)] = 1   # set-of-words model
            returnVec[vocabList.index(word)] += 1    # bag-of-words model: each word may appear several times
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

# Naive Bayes training function: compute the probabilities from the word vectors.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)           # number of documents
    numWords = len(trainMatrix[0])            # number of words, i.e. the size of the vocabulary
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # P(c1)
    # p0Num = zeros(numWords); p1Num = zeros(numWords)
    # p0Denom = 0.0; p1Denom = 0.0
    p0Num = ones(numWords); p1Num = ones(numWords)   # start counts at 1 so no probability is 0
    p0Denom = 2.0; p1Denom = 2.0                     # and the product never collapses to 0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]           # add row-wise: each word accumulates its own count
            p1Denom += sum(trainMatrix[i])    # total number of words seen in class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # p1Vect = p1Num / p1Denom
    # p0Vect = p0Num / p0Denom                # these give P(wi|c1) and P(wi|c0) directly
    p1Vect = log(p1Num / p1Denom)             # take logs to avoid underflow (too many
    p0Vect = log(p0Num / p0Denom)             # tiny factors) and floating-point rounding
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classifier. vec2Classify is the vector to be classified.
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def test():
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    print(pAb)
    print(p1V)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

# Spam-filtering example:
def textParse(bigString):                     # split the text with a regular expression
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):                    # load and parse the text files
        wordList = textParse(open('E:/python project/bayes/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('E:/python project/bayes/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    for i in range(10):                       # randomly pick 10 documents as the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                  # classify the test set
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
```
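A minimal way to run both experiments. The spam test assumes the email/spam and email/ham folders from the original example exist at the paths hard-coded in spamTest:

```python
if __name__ == '__main__':
    test()        # expected: ['love', 'my', 'dalmation'] -> 0, ['stupid', 'garbage'] -> 1
    # spamTest()  # needs email/spam/1..25.txt and email/ham/1..25.txt on disk;
                  # the 10 test documents are drawn at random, so the
                  # reported error rate varies from run to run
```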


