Machine Learning: Implementing the Naive Bayes Algorithm in Python

Source: Internet
Author: User
Tags: natural logarithm

['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], 0

['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], 1

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], 0

['stop', 'posting', 'stupid', 'worthless', 'garbage'], 1

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], 0

['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'], 1


The above shows six sentences, each a list of tokens followed by its label: sentences labeled 0 are normal, and sentences labeled 1 contain abusive language. By estimating, for each word, the probability that it appears in abusive sentences versus normal sentences, we can find out which words signal abuse. For example, 'stupid' occurs only in the three sentences labeled 1, so a post containing it is far more likely to be abusive.
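A quick sanity check of that idea, as a minimal sketch using only the six posts above (the variable names are illustrative, not from the code later in the article):

```python
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

# Fraction of abusive / normal posts that contain the word 'stupid'
abusive = [p for p, y in zip(posts, labels) if y == 1]
normal = [p for p, y in zip(posts, labels) if y == 0]
print(sum('stupid' in p for p in abusive) / len(abusive))   # 1.0 (3 of 3)
print(sum('stupid' in p for p in normal) / len(normal))     # 0.0 (0 of 3)
```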

Note: the basic classifier is modified on the two points below; a third point clarifies the terminology used in the code. A short sketch illustrating all three follows the list.

<1> Bayesian classification multiplies many per-word probabilities to obtain the probability that a document belongs to a class, i.e. it computes P(w0|c1)·P(w1|c1)·P(w2|c1)···. If any single probability in that product is 0, the final product is also 0. The fix used in the code is Laplace smoothing: initialize every word count to 1 and every denominator to 2.

<2> The second problem is underflow, caused by multiplying too many small numbers. Because most of the factors are very small, the product rounds to 0 and the program underflows or returns a wrong answer. The solution is to take the natural logarithm, which turns the product into a sum of logs and avoids errors caused by underflow or floating-point rounding.

<3> Treating the mere presence or absence of each word as a feature is called the set-of-words model; in the bag-of-words model, each word can appear multiple times and its count is recorded.
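Here is a minimal sketch of the three points in isolation (a toy three-word vocabulary; the names are illustrative, not from the original code):

```python
from math import log

vocab = ['stupid', 'dog', 'love']       # illustrative three-word vocabulary
doc = ['stupid', 'stupid', 'dog']

# <3> set-of-words records presence; bag-of-words records counts
set_vec = [1 if w in doc else 0 for w in vocab]   # [1, 1, 0]
bag_vec = [doc.count(w) for w in vocab]           # [2, 1, 0]

# <1> Laplace smoothing: start every count at 1 and the denominator at 2,
#     so a word unseen in a class never zeroes out the whole product
counts = [1 + c for c in bag_vec]                 # [3, 2, 1]
total = 2.0 + sum(bag_vec)                        # 5.0

# <2> use log-probabilities: a sum of logs cannot underflow the way a
#     product of many tiny probabilities can
log_probs = [log(c / total) for c in counts]
print(log_probs)                                  # [log(0.6), log(0.4), log(0.2)]
```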

Bayes' theorem computes the probability we actually want (the posterior) from probabilities that can be measured directly. The naive Bayes classifier used here rests on two assumptions: features are independent of one another, and all features are equally important.
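Written out, this is standard Bayes' rule, with the likelihood factorized by the independence assumption:

```latex
P(c_i \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid c_i)\,P(c_i)}{P(\mathbf{w})},
\qquad
P(\mathbf{w} \mid c_i) = \prod_j P(w_j \mid c_i)
```

Since P(w) is the same for both classes, the classifier simply compares sum_j log P(wj|c1) + log P(c1) against the same quantity for c0, which is exactly what classifyNB below computes.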
The complete implementation:

```python
from numpy import *

# Filter abusive posts on a message board.
# Create an experimental data set.
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

# Create a list of the unique words that appear in all documents.
def createVocabList(dataSet):
    vocabSet = set([])                        # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

# Convert a document into a word vector.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)          # vector whose elements are all 0
    for word in inputSet:
        if word in vocabList:
            # returnVec[vocabList.index(word)] = 1   # set-of-words model
            returnVec[vocabList.index(word)] += 1    # bag-of-words model: each word may appear several times
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

# Naive Bayes training function: compute the probabilities from the word vectors.
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)           # number of documents
    numWords = len(trainMatrix[0])            # number of words, i.e. the size of the vocabulary
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # P(c1)
    # p0Num = zeros(numWords); p1Num = zeros(numWords)
    # p0Denom = 0.0; p1Denom = 0.0
    p0Num = ones(numWords); p1Num = ones(numWords)   # start counts at 1 so no probability is 0
    p0Denom = 2.0; p1Denom = 2.0                     # and the product never collapses to 0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]           # add row-wise: each word accumulates its own count
            p1Denom += sum(trainMatrix[i])    # total number of words seen in class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # p1Vect = p1Num / p1Denom
    # p0Vect = p0Num / p0Denom                # these give P(wi|c1) and P(wi|c0) directly
    p1Vect = log(p1Num / p1Denom)             # take logs to avoid underflow (too many
    p0Vect = log(p0Num / p0Denom)             # tiny factors) and floating-point rounding
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classifier. vec2Classify is the vector to be classified.
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def test():
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    print(pAb)
    print(p1V)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

# Spam-filtering example:
def textParse(bigString):                     # split the text with a regular expression
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):                    # load and parse the text files
        wordList = textParse(open('E:/python project/bayes/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('E:/python project/bayes/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    for i in range(10):                       # randomly pick 10 documents as the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                  # classify the test set
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
```
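A minimal way to run both experiments. The spam test assumes the email/spam and email/ham folders from the original example exist at the paths hard-coded in spamTest:

```python
if __name__ == '__main__':
    test()        # expected: ['love', 'my', 'dalmation'] -> 0, ['stupid', 'garbage'] -> 1
    # spamTest()  # needs email/spam/1..25.txt and email/ham/1..25.txt on disk;
                  # the 10 test documents are drawn at random, so the
                  # reported error rate varies from run to run
```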


