Basic principles and implementation of the naive Bayes algorithm


I. Bayesian formula derivation

Naive Bayes is a very simple classification algorithm. It is called naive because of a simplifying assumption: in text categorization, it assumes that the words in the bag of words are pairwise independent of one another, that is, that the dimensions of an object's feature vector are mutually independent. For example, yellow is a common attribute of both apples and pears, but apples and pears are treated as independent of each other. This is the conceptual foundation of naive Bayes. Let us now extend it to the multidimensional case:

The formal definition of naive Bayes classification is as follows:

1. Let x = {a1, a2, ..., am} be an item to be classified, where each ai is a feature attribute of x.
2. Let C = {y1, y2, ..., yn} be the set of categories.
3. Calculate P(y1|x), P(y2|x), ..., P(yn|x).
4. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x ∈ yk.
The key is therefore how to calculate the conditional probabilities in step 3. We can proceed as follows:
(1) Find a set of items whose classifications are already known, that is, a training set.
(2) From the training set, estimate the conditional probability of each feature attribute given each category, that is:

P(a1|y1), P(a2|y1), ..., P(am|y1);
P(a1|y2), P(a2|y2), ..., P(am|y2);
...
P(a1|yn), P(a2|yn), ..., P(am|yn).
(3) If each feature attribute is conditionally independent given the category (or if we assume it is), then Bayes' theorem gives:

P(yi|x) = P(x|yi) P(yi) / P(x)

Because the denominator P(x) is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent:

P(x|yi) P(yi) = P(a1|yi) P(a2|yi) ... P(am|yi) P(yi) = P(yi) ∏ P(aj|yi), j = 1, ..., m
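To make the decision rule concrete, here is a minimal numeric sketch; the priors and conditional probabilities are invented for illustration and do not come from any data set in this article:

# A minimal numeric illustration of the naive Bayes decision rule (invented numbers).
priors = {'y1': 0.6, 'y2': 0.4}            # P(yi)
cond = {                                    # P(aj|yi), as if estimated from a training set
    'y1': {'a1': 0.2, 'a2': 0.7},
    'y2': {'a1': 0.8, 'a2': 0.3},
}

scores = {}
for cat in priors:
    score = priors[cat]                     # start from the prior P(yi)
    for attr in ('a1', 'a2'):
        score *= cond[cat][attr]            # multiply in each P(aj|yi)
    scores[cat] = score

print(scores)                               # approximately {'y1': 0.084, 'y2': 0.096}
print(max(scores, key=scores.get))          # 'y2', the predicted category

Category y2 wins even though its prior is smaller, because the observed attributes are much more likely under it.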

According to the above analysis, the naive Bayes classification process can be expressed in five stages:

Stage 1: Generate the training sample set from the training data (here, word-frequency or TF-IDF vectors).

Stage 2: Calculate P(yi) for each category.

Stage 3: Calculate the conditional probabilities of all feature attributes for each category.

Stage 4: Calculate P(x|yi) P(yi) for each category.

Stage 5: Take the category with the largest P(x|yi) P(yi) as the category of x.

II. Naive Bayes algorithm implementation

Use a simple English corpus as a data set:

  

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him', 'my'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 is not
    return postingList, classVec

postingList is the training set of texts, and classVec is the category corresponding to each text.
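A quick sanity check of the loader; the printed output below assumes the function is defined exactly as above:

postingList, classVec = loadDataSet()
print(len(postingList))    # 6 texts
print(classVec)            # [0, 1, 0, 1, 0, 1]
print(postingList[3])      # ['stop', 'posting', 'stupid', 'worthless', 'garbage'] -- an abusive text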

Following the steps in the previous section, we implement the whole Bayes algorithm step by step:

1. Write a naive Bayes algorithm class and create the default constructor:

class NBayes(object):
    def __init__(self):
        self.vocabulary = []   # dictionary (list of all words)
        self.idf = 0           # IDF weight vector of the dictionary
        self.tf = 0            # TF weight matrix of the training set
        self.tdm = 0           # P(x|yi)
        self.Pcates = {}       # P(yi): a dictionary of category priors
        self.labels = []       # externally imported list of the category of each text
        self.doclength = 0     # number of texts in the training set
        self.vocablen = 0      # number of words in the dictionary
        self.testset = 0       # test set

2. Import and train the data set, generating the parameters and data structures the algorithm requires:

def train_set(self, trainset, classVec):
    self.cate_prob(classVec)                 # calculate P(yi) for each category in the data set
    self.doclength = len(trainset)
    tempset = set()
    [tempset.add(word) for doc in trainset for word in doc]   # build the dictionary
    self.vocabulary = list(tempset)
    self.vocablen = len(self.vocabulary)
    self.calc_wordfreq(trainset)             # calculate the word-frequency data of the training set
    self.build_tdm()                         # accumulate the vector space per category: P(x|yi)

3. cate_prob function: calculate the probability of each category in the data set, P(yi):

def cate_prob(self, classVec):
    self.labels = classVec
    labeltemps = set(self.labels)            # get all categories
    for labeltemp in labeltemps:
        # count the occurrences of each category in the list: self.labels.count(labeltemp)
        self.Pcates[labeltemp] = float(self.labels.count(labeltemp)) / float(len(self.labels))
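With the sample labels [0, 1, 0, 1, 0, 1] this produces equal priors. The same counting logic as a standalone sketch:

labels = [0, 1, 0, 1, 0, 1]
Pcates = {}
for label in set(labels):                    # for each distinct category
    Pcates[label] = float(labels.count(label)) / float(len(labels))
print(Pcates)                                # {0: 0.5, 1: 0.5}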

4. calc_wordfreq function: generate ordinary word-frequency vectors:

# generate ordinary word-frequency vectors
def calc_wordfreq(self, trainset):
    self.idf = np.zeros([1, self.vocablen])              # 1 x dictionary length
    self.tf = np.zeros([self.doclength, self.vocablen])  # number of training texts x dictionary length
    for indx in xrange(self.doclength):                  # iterate over all texts
        for word in trainset[indx]:                      # iterate over each word in the text
            self.tf[indx, self.vocabulary.index(word)] += 1   # +1 at the word's position in the dictionary
        for signleword in set(trainset[indx]):
            self.idf[0, self.vocabulary.index(signleword)] += 1
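At this stage self.tf holds raw word counts and self.idf holds document frequencies (how many texts each word appears in); the IDF transform itself is applied only in the TF-IDF variant of section III. A standalone sketch of the counting with a toy dictionary:

import numpy as np

vocabulary = ['my', 'dog', 'stupid']
texts = [['my', 'dog', 'my'], ['stupid', 'dog']]
tf = np.zeros([len(texts), len(vocabulary)])
df = np.zeros([1, len(vocabulary)])          # document frequency, stored in self.idf above

for indx in range(len(texts)):
    for word in texts[indx]:                 # raw counts per text
        tf[indx, vocabulary.index(word)] += 1
    for word in set(texts[indx]):            # each text counted at most once per word
        df[0, vocabulary.index(word)] += 1

print(tf)    # [[2. 1. 0.] [0. 1. 1.]]
print(df)    # [[1. 2. 1.]]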

5. build_tdm function: calculate the per-dimension values of the vector space by category, P(x|yi):

def build_tdm(self):
    self.tdm = np.zeros([len(self.Pcates), self.vocablen])   # category rows x dictionary columns
    sumlist = np.zeros([len(self.Pcates), 1])                # total word count of each category
    for indx in xrange(self.doclength):
        self.tdm[self.labels[indx]] += self.tf[indx]         # accumulate word frequencies into the category row
        # total value of each category: a scalar
        sumlist[self.labels[indx]] = np.sum(self.tdm[self.labels[indx]])
    self.tdm = self.tdm / sumlist                            # normalize into P(x|yi)
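After normalization, each row of self.tdm is the word distribution of one category, so every row sums to 1. The normalization step in isolation, with toy counts:

import numpy as np

tdm = np.array([[4., 2., 2.],                # accumulated word counts of category 0
                [1., 1., 2.]])               # accumulated word counts of category 1
sumlist = np.sum(tdm, axis=1).reshape([2, 1])
tdm = tdm / sumlist                          # rows become P(word|yi)
print(tdm)                                   # [[0.5  0.25 0.25] [0.25 0.25 0.5 ]]
print(np.sum(tdm, axis=1))                   # [1. 1.]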


6. map2vocab function: map a test text onto the current dictionary:

def map2vocab(self, testdata):
    self.testset = np.zeros([1, self.vocablen])
    for word in testdata:
        self.testset[0, self.vocabulary.index(word)] += 1
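For example, mapping the first training text onto the dictionary; this usage sketch assumes the class and loader above are available in Nbayes_lib, as in section IV:

import numpy as np
from Nbayes_lib import *

dataset, listclasses = loadDataSet()
nb = NBayes()
nb.train_set(dataset, listclasses)
nb.map2vocab(dataset[0])                     # map the first text onto the dictionary
print(np.shape(nb.testset))                  # (1, 32): one row, one column per dictionary word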

7. predict function: predict the classification result and output the predicted category:

def predict(self, testset):
    if np.shape(testset)[1] != self.vocablen:    # if the test vector length differs from the dictionary, exit
        print "Input Error"
        exit(0)
    predvalue = 0                                # initialize the class probability
    predclass = ""                               # initialize the category name
    for tdm_vect, keyclass in zip(self.tdm, self.Pcates):
        # P(x|yi) P(yi): traverse tdm and keep the category with the maximum value
        temp = np.sum(testset * tdm_vect * self.Pcates[keyclass])
        if temp > predvalue:
            predvalue = temp
            predclass = keyclass
    return predclass
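Note that zip(self.tdm, self.Pcates) pairs the rows of tdm with the keys of the Pcates dictionary in whatever order the dictionary yields them; it works here because the categories are the integers 0 and 1 and the rows were built at those indices. A small sketch of the pairing, assuming that key order:

import numpy as np

tdm = np.array([[0.5, 0.5], [0.1, 0.9]])
Pcates = {0: 0.5, 1: 0.5}
for tdm_vect, keyclass in zip(tdm, Pcates):
    print(keyclass)                          # 0, then 1: each key is paired with the matching row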

III. Algorithm improvement

Apply the TF-IDF strategy to the ordinary word-frequency vectors, making the weights better at correcting for various biases (for example, differences in sentence length).
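Concretely, the weighting used below is standard TF-IDF: tf(t, d) = count(t, d) / |d| and idf(t) = log(N / df(t)), where N is the number of texts and df(t) is the number of texts containing word t; the final weight is their product. A one-step sketch of why IDF helps:

import numpy as np

N = 6                        # number of texts in the corpus
df = np.array([3., 6.])      # word A appears in 3 texts, word B in all 6
idf = np.log(N / df)
print(idf)                   # [0.6931... 0.]: a word found in every text carries no weight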

8. calc_tfidf function: build the vector space with TF-IDF weights:

# build TF-IDF weights
def calc_tfidf(self, trainset):
    self.idf = np.zeros([1, self.vocablen])
    self.tf = np.zeros([self.doclength, self.vocablen])
    for indx in xrange(self.doclength):
        for word in trainset[indx]:
            self.tf[indx, self.vocabulary.index(word)] += 1
        # eliminate the bias caused by different sentence lengths
        self.tf[indx] = self.tf[indx] / float(len(trainset[indx]))
        for signleword in set(trainset[indx]):
            self.idf[0, self.vocabulary.index(signleword)] += 1
    self.idf = np.log(float(self.doclength) / self.idf)
    self.tf = np.multiply(self.tf, self.idf)     # element-wise multiplication: TF x IDF

Since every dictionary word occurs in at least one text, self.idf contains no zeros, so the division inside the logarithm is safe.

IV. Evaluate the classification results

# -*- coding: utf-8 -*-
import sys
import os
from numpy import *
import numpy as np
from Nbayes_lib import *

dataset, listclasses = loadDataSet()   # import the external data set
# dataset: the word vectors of the sentences
# listclasses: the categories of the sentences, [0, 1, 0, 1, 0, 1]
nb = NBayes()                          # instantiate the class
nb.train_set(dataset, listclasses)     # train on the data set
nb.map2vocab(dataset[0])               # select a test sentence
print nb.predict(nb.testset)           # output the classification result

Executing the naive Bayes class we created prints the classification result:

1
