Put plainly, the naive Bayes model assumes that features are independent, computes for a given feature the probability of each category, and assigns the feature to the category with the larger probability.
Formula: P(c | wi) = P(wi | c) * P(c) / P(wi)
If P(c1 | wi) > P(c2 | wi), then wi belongs to class c1.
If P(c1 | wi) < P(c2 | wi), then wi belongs to class c2.
# coding=utf-8
import random
import re

import numpy as np


def loadDataSet():
    """Return the original training documents and the category of each document."""
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec


def createVocabList(dataSet):
    """Create the word set.

    dataSet: original documents
    Return value: the word set as a list, for example
        ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is',
         'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him',
         'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe',
         'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
    (the order of a set is not fixed, so it can differ between runs)
    """
    vocabSet = set()
    for document in dataSet:
        # set(document) of a list contributes the list's elements; a plain
        # string would contribute its individual characters instead.
        vocabSet = vocabSet | set(document)  # union
    return list(vocabSet)  # convert the set to a list


def setOfWords2Vec(vocabList, inputSet):
    """Create the 0/1 word-set vector of one document.

    If a word of the word set appears in the document, the matching position
    of the document vector is 1.
    vocabList: word set
    inputSet: input document
    returnVec: 0/1 vector of the input document, for example (with the word
        set order shown in createVocabList)
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'] ->
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    """
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            # For a training document every word is in the word set;
            # for a test document this case can occur.
            print("The word %s doesn't exist in vocabList" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    """Training function.

    trainMatrix: document matrix, one 0/1 word-set vector per document
    trainCategory: label vector; a document labeled 1 is abusive
    p0Vect: non-abusive weight of each word, log P(wi|c0)
    p1Vect: abusive weight of each word, log P(wi|c1)
    pAbusive: probability of abusive documents
    For the data above the vector entries are log probabilities such as
    -3.04452244, -2.35137526, -1.94591015 and -1.65822808, and pAbusive is 0.5.
    """
    numTrainDocs = len(trainMatrix)     # total number of documents
    numWords = len(trainMatrix[0])      # number of words in the word set
    # sum(trainCategory) adds up the 1s, i.e. the number of abusive documents
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Per-word counts, as long as the word set. Originally zeros;
    # changed to ones for Laplace smoothing.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    # Total number of words per class. Originally 0.0;
    # changed to 2.0 for Laplace smoothing.
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):  # range is the number of documents, not the word set length
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # p1Vect = p1Num / p1Denom; take the logarithm to avoid underflow
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vect, p1Vect, pClass1):
    """Classification function.

    vec2Classify: document vector to classify
    p0Vect: weights of the word set for non-abusive documents (vector of log P(wi|c0))
    p1Vect: weights of the word set for abusive documents
    pClass1: probability of abusive documents
    """
    # Elementwise product: the result is log P(w1|c) + log P(w2|c) + ... + log P(wn|c).
    # A feature of 0 drops out; a feature of 1 contributes its log probability.
    p1 = np.sum(vec2Classify * p1Vect) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vect) + np.log(1.0 - pClass1)
    print(np.sum(vec2Classify * p1Vect), np.sum(vec2Classify * p0Vect))
    print('2:', np.log(pClass1), np.log(1.0 - pClass1))
    print(p1, p0)
    if p1 > p0:
        return 1
    else:
        return 0


def testParse(bigString):
    """Parse text.

    bigString: one message, i.e. one original document
    Returns the list of tokens of the document.
    """
    # Capital \W matches non-word characters, so they act as delimiters
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest(filesPerClass, testSize):
    """Train and test the spam classifier with the classification model above.

    filesPerClass and testSize are parameters because the file count and the
    test set size were lost in the source listing.
    """
    docList = []    # list of documents, each a list of words
    classList = []  # list of categories
    for i in range(1, filesPerClass + 1):
        with open('h:/email/spam/%d.txt' % i) as f:
            docList.append(testParse(f.read()))
        classList.append(1)
        with open('h:/email/ham/%d.txt' % i) as f:
            docList.append(testParse(f.read()))
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(len(docList)))  # indices of training documents
    testSet = []                             # indices of test documents
    for i in range(testSize):
        randIndex = random.randrange(len(trainingSet))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMatrix = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMatrix.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0Vect, p1Vect, pSpam = trainNB0(np.array(trainMatrix), np.array(trainClasses))  # training
    # Error rate, measured on the randomly selected test documents
    errorCount = 0
    for docIndex in testSet:
        wordVec = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVec), p0Vect, p1Vect, pSpam) != classList[docIndex]:
            errorCount += 1
            print(docList[docIndex])
    print('The error rate is:', float(errorCount) / len(testSet))


def abusiveTest():
    """Train and test the abusive-language classifier with the model above."""
    postingList, classVec = loadDataSet()      # load the original data
    vocabList = createVocabList(postingList)   # create the word set
    trainMatrix = []                           # 0/1 word-set matrix of the documents
    for document in postingList:
        trainMatrix.append(setOfWords2Vec(vocabList, document))
    p0Vect, p1Vect, pAbusive = trainNB0(trainMatrix, classVec)  # training
    # Testing
    testEntry = ['love', 'my', 'haha']
    thisDoc = np.array(setOfWords2Vec(vocabList, testEntry))
    print(thisDoc)
    print(testEntry, 'is:', classifyNB(thisDoc, p0Vect, p1Vect, pAbusive))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(vocabList, testEntry))
    print(thisDoc)
    print(testEntry, 'is:', classifyNB(thisDoc, p0Vect, p1Vect, pAbusive))


def main():
    abusiveTest()
    # spamTest(...)


if __name__ == '__main__':
    main()
The following items come from:
In the previous two chapters, the kNN and decision tree classification algorithms each gave a definite classification result for an instance. However, a classifier sometimes produces the wrong result. The naive Bayes classification algorithm in this chapter instead gives a best guess for the class, together with an estimate of the probability of that guess.
1 Background: the conditional probability formula
Readers who have studied probability theory will be familiar with conditional probability. If it feels unfamiliar at the moment, consult a probability reference; the main purpose here is to state the conditional probability formula:
P(A | B) = P(A, B) / P(B) = P(B | A) * P(A) / P(B)
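As a quick check of the formula, here is a minimal sketch that evaluates both forms on a small table of counts. The stones, buckets, and counts are made-up illustrations, not data from the original.

```python
from fractions import Fraction

# Hypothetical counts: 7 stones in two buckets, each stone grey or black.
# counts[(color, bucket)] = number of stones
counts = {
    ("grey", "A"): 2, ("black", "A"): 2,
    ("grey", "B"): 1, ("black", "B"): 2,
}
total = sum(counts.values())


def p_joint(color, bucket):
    """P(color, bucket)."""
    return Fraction(counts[(color, bucket)], total)


def p_color(color):
    """P(color), summed over buckets."""
    return sum(p_joint(color, b) for b in ("A", "B"))


def p_bucket(bucket):
    """P(bucket), summed over colors."""
    return sum(p_joint(c, bucket) for c in ("grey", "black"))


# Definition: P(grey | B) = P(grey, B) / P(B)
direct = p_joint("grey", "B") / p_bucket("B")

# Bayes form: P(grey | B) = P(B | grey) * P(grey) / P(B)
p_b_given_grey = p_joint("grey", "B") / p_color("grey")
bayes = p_b_given_grey * p_color("grey") / p_bucket("B")

print(direct, bayes)
```

Both expressions give the same fraction, which is the point of the second equality in the formula.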
2 How to use conditional probability to classify
Suppose there are two classes, c1 and c2. We need to compute the probabilities P(c1 | x, y) and P(c2 | x, y) and compare them:
If P(c1 | x, y) > P(c2 | x, y), then (x, y) belongs to class c1.
If P(c1 | x, y) < P(c2 | x, y), then (x, y) belongs to class c2.
P(x, y | c1) is the probability of observing the point (x, y) given class c1. What, then, does P(c1 | x, y) mean? Read the same way, it is the probability that the point belongs to class c1 given the point (x, y). How do we compute this probability? We can use Bayes' rule to convert it:
P(ci | x, y) = P(x, y | ci) * P(ci) / P(x, y)
Using this formula, we can compute the probability of each class for a given instance point, compare the values, and select the class with the largest probability as the predicted class of the point (x, y).
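The decision rule can be sketched in a few lines. This is only an illustration: the likelihoods and priors are invented numbers, and the function names are hypothetical. Because P(x, y) is the same for every class, comparing P(x, y | ci) * P(ci) is enough to pick the class.

```python
def posterior_scores(likelihoods, priors):
    """Return P(x,y|ci) * P(ci) for each class (the numerator of Bayes' rule)."""
    return {c: likelihoods[c] * priors[c] for c in priors}


def classify(likelihoods, priors):
    """Pick the class with the largest score."""
    scores = posterior_scores(likelihoods, priors)
    return max(scores, key=scores.get)


# Hypothetical values for one point (x, y)
likelihoods = {"c1": 0.20, "c2": 0.05}   # P(x, y | ci)
priors = {"c1": 0.30, "c2": 0.70}        # P(ci)

scores = posterior_scores(likelihoods, priors)
evidence = sum(scores.values())          # P(x, y), by the law of total probability
posteriors = {c: s / evidence for c, s in scores.items()}

print(posteriors)
print(classify(likelihoods, priors))
```

Dividing by the evidence only rescales the scores so that they sum to 1; it does not change which class wins.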
So Bayes' rule gives us the probability of each class. Concretely, this means computing the three probabilities in Bayes' formula: P(x, y | ci), P(ci), and P(x, y). Once these three values are known, we can predict the class. This is the core of the naive Bayes algorithm.
3 The meaning of "naive" in naive Bayes
The meaning of "naive": the algorithm in this chapter is called the naive Bayes algorithm, and the word "naive" matters as much as the Bayesian groundwork above. It refers to the conditional independence assumption: features are assumed to be mutually independent in the statistical sense, that is, the likelihood of a feature or word appearing does not depend on the words next to it. For example, the assumption says that the word bacon is as likely to appear after unhealthy as after delicious. We know this is not true, but this is what "naive" means. Naive Bayes also makes a second assumption, that all features are equally important. Although both assumptions are flawed, naive Bayes works very well in practice.
Second, document classification with naive Bayes
An important application of naive Bayes is document classification. Here an entire document, such as an e-mail message, is an instance, and the words in it are the features. There are two ways to define document features: the set-of-words model and the bag-of-words model. In the set-of-words model, each word is a feature that records only whether the word appears in the document, regardless of how many times. Given a vocabulary list of the words that appear across all documents, a document can be turned into a vector as long as the vocabulary list. The bag-of-words model builds on the set-of-words model and also records how many times each word appears in the document.
With the document features defined, we can start coding a concrete text classifier.
1 Splitting text and preparing the data
To get features from text, we first need to split the text into entries (tokens), where an entry is any combination of characters. An entry can be a word, but it can also be a URL, an IP address, or any other string. After splitting, each document is converted into an entry vector according to whether each entry in the vocabulary list appears in it: each value is 1 or 0, where 1 means the entry appears and 0 means it does not.
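As a small illustration of this step, the sketch below splits one sentence into entries and builds its 0/1 vector. The `tokenize` and `to_vector` helpers and the five-word vocabulary are assumptions made for this example.

```python
import re


def tokenize(text):
    """Split on runs of non-word characters and lowercase each entry."""
    return [tok.lower() for tok in re.split(r"\W+", text) if tok]


def to_vector(vocab, tokens):
    """1 where the vocabulary entry appears in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


sentence = "My dog has flea problems, help please!"
vocab = ["dog", "stupid", "help", "garbage", "my"]   # hypothetical vocabulary

tokens = tokenize(sentence)
vector = to_vector(vocab, tokens)
print(tokens)
print(vector)
```

Note that this simple splitter would break a URL or an IP address into several pieces; keeping such strings as single entries would need a different tokenization rule.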
The example below uses messages from an online community. Each message is classified into one of two categories, insulting or not insulting. Based on the prediction, insulting messages can be blocked so that they do not harm the community.
Functions for converting a vocabulary list to a vector
#--------------------------- Build entry vectors from text -------------------------#
# 1 To get features from text, split the text into entries; each entry is any
#   combination of characters. An entry can be a word or a non-word term such as
#   a URL, an IP address, or any other string.
# After splitting, each text fragment is represented as an entry vector: a value
# of 1 means the entry appears in the document, 0 means it does not.

# import numpy
from numpy import *


def loadDataSet():
    # Tokenized documents; each row of the list is one document
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Class label of each document, annotated by hand
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec


# Collect the list of entries that appear in all documents
def createVocabList(dataSet):
    # New set for storing entries
    vocabSet = set()
    # Go through every document in the collection
    for document in dataSet:
        # Turn the document into a set to ensure uniqueness, then take the
        # union with vocabSet so that only new entries are added
        vocabSet = vocabSet | set(document)
    # Convert the set to a list for later processing
    return list(vocabSet)


# Convert a document to an entry vector according to whether each entry of the
# vocabulary list appears in the document (1 if it does, 0 if not)
def setOfWords2Vec(vocabList, inputSet):
    # New list as long as vocabList, every element initialized to 0
    returnVec = [0] * len(vocabList)
    # Go through every entry of the document
    for word in inputSet:
        # If the entry appears in the vocabulary list
        if word in vocabList:
            # Look up the index of the word and change that item from 0 to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my vocabulary!' % word)
    # Return the entry vector of inputSet
    return returnVec
Note that createVocabList returns the list of words that appear across all documents. The list contains no duplicates, so each word is unique.
2 Computing naive Bayes probabilities from entry vectors
Here we replace the earlier point (x, y) with the entry vector w, whose components are 0/1 features; the entry vector has the same length as the vocabulary list.
P(ci | w) = P(w | ci) * P(ci) / P(w)
We use this formula to compute the probability that a document's entry vector belongs to each class, and then compare the probabilities to predict the class.
Specifically, P(ci) is computed by dividing the number of documents in class ci by the total number of documents. Then, under the conditional independence assumption, w is expanded into independent features, so that P(w | ci) = P(w0 | ci) * P(w1 | ci) * ... * P(wN | ci), which greatly simplifies the calculation.
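The two quantities can be worked through on a tiny example. The following is a minimal sketch with an invented 4-document, 3-word training set; the variable names are assumptions for illustration, and the per-word estimate (word count in the class divided by the total word count in the class) is one simple choice.

```python
import numpy as np

# Hypothetical training data: 4 documents over a 3-word vocabulary (0/1 vectors)
train_matrix = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
])
labels = np.array([0, 1, 0, 1])

# P(ci): documents in the class divided by all documents
prior_1 = labels.sum() / len(labels)
prior_0 = 1.0 - prior_1


def word_probs(cls):
    """P(wj | ci): count of each word in the class / total words in the class."""
    rows = train_matrix[labels == cls]
    return rows.sum(axis=0) / rows.sum()


p_w_given_0 = word_probs(0)
p_w_given_1 = word_probs(1)

# P(w | ci) for one document: product of P(wj | ci) over the words it contains
doc = np.array([1, 0, 1])
like_0 = np.prod(p_w_given_0[doc == 1])
like_1 = np.prod(p_w_given_1[doc == 1])

print(prior_0, prior_1)
print(like_0 * prior_0, like_1 * prior_1)
```

In this toy data the first word never occurs in class 1, so the whole product for class 1 is 0 no matter what the other factors are.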
The pseudocode for the function is:
Count the number of documents in each category
Compute each category's share of the total number of documents
For each document:
    For each category:
        If an entry appears in the document -> increase the count for that entry  # number of times each entry appears in the category
        Increase the count of all entries  # total number of entries appearing in documents of the category
For each category:
    Obtain the conditional probabilities by dividing the count of each entry by the total number of entries in the category
Return the conditional probability of each entry in each category and the share of each category
The code is as follows:
# Training algorithm: compute the probabilities P(wi|ci) and P(ci) from the entry vectors
# @trainMatrix: document matrix made up of the entry vectors of each document
# @trainCategory: vector of the class label of each document
def trainNB0(trainMatrix, trainCategory):
    # Number of documents in the document matrix
    numTrainDocs = len(trainMatrix)
    # Length of an entry vector
    numWords = len(trainMatrix[0])
    # Proportion of all documents that belong to class 1, p(c=1)
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Count arrays as long as an entry vector
    p0Num = zeros(numWords); p1Num = zeros(numWords)
    p0Denom = 0.0; p1Denom = 0.0
    # Go through the entry vector of every document
    for i in range(numTrainDocs):
        # If the label of this entry vector is 1
        if trainCategory[i] == 1:
            # Count how often each entry appears in entry vectors of class 1
            p1Num += trainMatrix[i]
            # Total number of entries appearing in entry vectors of class 1,
            # i.e. the number of words in all documents of class 1
            p1Denom += sum(trainMatrix[i])
        else:
            # Count how often each entry appears in entry vectors of class 0
            p0Num += trainMatrix[i]
            # Total number of entries appearing in entry vectors of class 0,
            # i.e. the number of words in all documents of class 0
            p0Denom += sum(trainMatrix[i])
    # Compute p(wi|c1) with NumPy arrays
    p1Vect = p1Num / p1Denom  # later changed to log() to avoid underflow
    # Compute p(wi|c0) with NumPy arrays
    p0Vect = p0Num / p0Denom  # later changed to log() to avoid underflow
    return p0Vect, p1Vect, pAbusive
3 Some improvements to the algorithm
1) When computing the probability that a document belongs to a category, we multiply many probabilities, P(w0 | ci) * P(w1 | ci) * ... * P(wN | ci). If any one of them is 0, the whole product is 0. To reduce this effect, Laplace smoothing adds a (typically 1) to the numerator and k*a to the denominator (k being the total number of categories). Here the occurrence count of every word is initialized to 1 and the denominator to 2*1 = 2:
# p0Num = ones(numWords); p1Num = ones(numWords)
# p0Denom = 2.0; p1Denom = 2.0
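The effect of this change can be seen on a single class. The sketch below uses invented word counts in which one word never occurred during training, and compares the product of probabilities with and without the initial values of 1 and 2; the names and numbers are assumptions for illustration.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary;
# the third word never occurred in this class during training.
counts = np.array([3.0, 1.0, 0.0, 2.0])

# Without smoothing: counts start at 0, denominator starts at 0
unsmoothed = counts / counts.sum()

# With the smoothing used here: counts start at 1, denominator starts at 2
smoothed = (counts + 1.0) / (counts.sum() + 2.0)

doc = np.array([1, 0, 1, 1])   # a document containing words 0, 2 and 3
prod_unsmoothed = np.prod(unsmoothed[doc == 1])
prod_smoothed = np.prod(smoothed[doc == 1])

print(prod_unsmoothed)
print(prod_smoothed)
```

Without smoothing, the single unseen word forces the product to 0; with smoothing, every factor is positive. With these initial values the smoothed entries need not sum to 1 over the vocabulary; adding the vocabulary size to the denominator instead is another common choice.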
2) Solving the underflow problem
As noted above, the other problem is multiplying many very small numbers. Because most factors are very small, the final product may be rounded to 0, causing underflow or an inaccurate result. We can therefore take the natural logarithm of the product, that is, work with the log-likelihood. This avoids errors caused by underflow or floating-point rounding. Nothing is lost for classification, because ln(a * b) = ln(a) + ln(b) and the logarithm is monotonically increasing, so comparisons between classes are unchanged.
# p0Vect = log(p0Num / p0Denom); p1Vect = log(p1Num / p1Denom)
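A short sketch makes the underflow concrete. The probabilities below are invented: two hypothetical classes each assign a small probability to 100 words. Both direct products underflow to 0.0 in double precision, while the sums of logarithms stay finite and still show which class is more likely.

```python
import math

# Hypothetical per-word probabilities for 100 words under two classes
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100


def product(values):
    """Multiply the values directly."""
    result = 1.0
    for v in values:
        result *= v
    return result


def log_sum(values):
    """Sum of natural logarithms, equal to the log of the product."""
    return sum(math.log(v) for v in values)


product_a, product_b = product(probs_a), product(probs_b)
log_a, log_b = log_sum(probs_a), log_sum(probs_b)

print(product_a, product_b)   # both underflow
print(log_a, log_b)           # finite, and log_b > log_a
```

The direct products cannot be compared because both are 0.0, whereas the log sums preserve the ordering of the two classes.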
Here is the code for the naive Bayesian classification function:
# Naive Bayes classification function
# @vec2Classify: entry vector of the document to classify
# @p0Vec: frequency P(wi|c0) of each entry over all documents of category 0
# @p1Vec: frequency P(wi|c1) of each entry over all documents of category 1
# @pClass1: proportion of documents of category 1 among all documents
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Compute the probability that the document belongs to class 1 and to class 0
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


# Overall classification test function
def testingNB():
    # Get the document matrix and the class label vector from the data set
    listOPosts, listClasses = loadDataSet()
    # Collect the entries that appear in all documents into the vocabulary list
    myVocabList = createVocabList(listOPosts)
    # New list for the document matrix
    trainMat = []
    for postinDoc in listOPosts:
        # Convert each document to an entry vector and add it to the document matrix
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # Convert the document matrix and the class label vector to NumPy arrays
    # for the probability calculation, then call the training function
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    # Test document
    testEntry = ['love', 'my', 'dalmation']
    # Convert the test document to an entry vector, as a NumPy array
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    # Classify the test document with the classification function and print the result
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    # Second test document
    testEntry1 = ['stupid', 'garbage']
    # Convert it to an entry vector as well, as a NumPy array
    thisDoc1 = array(setOfWords2Vec(myVocabList, testEntry1))
    print(testEntry1, 'classified as:', classifyNB(thisDoc1, p0V, p1V, pAb))
A short note on choosing document features. The code above uses the set-of-words model: the only feature is whether an entry appears in the document, so each feature is either 0 (absent) or 1 (present). However, the number of times an entry appears may also carry important information. In that case we can use the bag-of-words model, in which a word can be counted more than once: when converting a document to a vector, each occurrence of a word increases the corresponding value in the vector.
The code that converts a document to a word bag vector is:
def bagOfWords2VecMN(vocabList, inputSet):
    # Bag-of-words vector
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # Each time a word appears, add 1 to its count
            returnVec[vocabList.index(word)] += 1
    return returnVec
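To see the difference between the two models, the sketch below converts the same document both ways. It is self-contained, so it redefines both conversions under the hypothetical names `set_of_words` and `bag_of_words`; the vocabulary and document are made up for the example.

```python
def set_of_words(vocab, doc):
    """0/1 vector: does each vocabulary word appear in the document?"""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec


def bag_of_words(vocab, doc):
    """Count vector: how many times does each vocabulary word appear?"""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] += 1
    return vec


vocab = ["dog", "stupid", "my"]              # hypothetical vocabulary
doc = ["stupid", "dog", "stupid", "haha"]    # "stupid" appears twice

print(set_of_words(vocab, doc))   # [1, 1, 0]
print(bag_of_words(vocab, doc))   # [1, 2, 0]
```

The repeated word is counted twice only in the bag-of-words vector, and the out-of-vocabulary word is ignored by both.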