Basic principles and implementation of the naive Bayes algorithm


I. Bayesian formula derivation

Naive Bayes is a very simple classification algorithm. It is called naive because of a simplifying assumption: in text categorization, it assumes that the words in the bag of words are pairwise independent of one another, that is, that the dimensions of an object's feature vector are mutually independent. For example, yellow is a common attribute of both apples and pears, but apples and pears are treated as independent of each other. This is the conceptual foundation of naive Bayes. Let us now extend it to the multidimensional case:

The formal definition of naive Bayes classification is as follows:

1. Let x = {a1, a2, ..., am} be an item to be classified, where each ai is a feature attribute of x.
2. Let C = {y1, y2, ..., yn} be the set of categories.
3. Calculate P(y1|x), P(y2|x), ..., P(yn|x).
4. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x ∈ yk.
The key is therefore how to calculate the conditional probabilities in step 3. We can proceed as follows:
(1) Find a set of items whose classifications are already known, that is, a training set.
(2) From the training set, estimate the conditional probability of each feature attribute given each category, that is:

P(a1|y1), P(a2|y1), ..., P(am|y1);
P(a1|y2), P(a2|y2), ..., P(am|y2);
...
P(a1|yn), P(a2|yn), ..., P(am|yn).
(3) If each feature attribute is conditionally independent given the category (or if we assume it is), then Bayes' theorem gives:

P(yi|x) = P(x|yi) P(yi) / P(x)

Because the denominator P(x) is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent:

P(x|yi) P(yi) = P(a1|yi) P(a2|yi) ... P(am|yi) P(yi) = P(yi) ∏ P(aj|yi), j = 1, ..., m
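To make the decision rule concrete, here is a minimal numeric sketch; the priors and conditional probabilities are invented for illustration and do not come from any data set in this article:

# A minimal numeric illustration of the naive Bayes decision rule (invented numbers).
priors = {'y1': 0.6, 'y2': 0.4}            # P(yi)
cond = {                                    # P(aj|yi), as if estimated from a training set
    'y1': {'a1': 0.2, 'a2': 0.7},
    'y2': {'a1': 0.8, 'a2': 0.3},
}

scores = {}
for cat in priors:
    score = priors[cat]                     # start from the prior P(yi)
    for attr in ('a1', 'a2'):
        score *= cond[cat][attr]            # multiply in each P(aj|yi)
    scores[cat] = score

print(scores)                               # approximately {'y1': 0.084, 'y2': 0.096}
print(max(scores, key=scores.get))          # 'y2', the predicted category

Category y2 wins even though its prior is smaller, because the observed attributes are much more likely under it.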

According to the above analysis, the naive Bayes classification process can be expressed in five stages:

Stage 1: Generate the training sample set from the training data (here, word-frequency or TF-IDF vectors).

Stage 2: Calculate P(yi) for each category.

Stage 3: Calculate the conditional probabilities of all feature attributes for each category.

Stage 4: Calculate P(x|yi) P(yi) for each category.

Stage 5: Take the category with the largest P(x|yi) P(yi) as the category of x.

II. Naive Bayes algorithm implementation

Use a simple English corpus as a data set:

  

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him', 'my'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 is not
    return postingList, classVec

postingList is the training set of texts, and classVec is the category corresponding to each text.
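A quick sanity check of the loader; the printed output below assumes the function is defined exactly as above:

postingList, classVec = loadDataSet()
print(len(postingList))    # 6 texts
print(classVec)            # [0, 1, 0, 1, 0, 1]
print(postingList[3])      # ['stop', 'posting', 'stupid', 'worthless', 'garbage'] -- an abusive text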

Following the steps in the previous section, we implement the whole Bayes algorithm step by step:

1. Write a naive Bayes algorithm class and create the default constructor:

class NBayes(object):
    def __init__(self):
        self.vocabulary = []   # dictionary (list of all words)
        self.idf = 0           # IDF weight vector of the dictionary
        self.tf = 0            # TF weight matrix of the training set
        self.tdm = 0           # P(x|yi)
        self.Pcates = {}       # P(yi): a dictionary of category priors
        self.labels = []       # externally imported list of the category of each text
        self.doclength = 0     # number of texts in the training set
        self.vocablen = 0      # number of words in the dictionary
        self.testset = 0       # test set

2. Import and train the data set, generating the parameters and data structures the algorithm requires:

def train_set(self, trainset, classVec):
    self.cate_prob(classVec)                 # calculate P(yi) for each category in the data set
    self.doclength = len(trainset)
    tempset = set()
    [tempset.add(word) for doc in trainset for word in doc]   # build the dictionary
    self.vocabulary = list(tempset)
    self.vocablen = len(self.vocabulary)
    self.calc_wordfreq(trainset)             # calculate the word-frequency data of the training set
    self.build_tdm()                         # accumulate the vector space per category: P(x|yi)

3. cate_prob function: calculate the probability of each category in the data set, P(yi):

def cate_prob(self, classVec):
    self.labels = classVec
    labeltemps = set(self.labels)            # get all categories
    for labeltemp in labeltemps:
        # count the occurrences of each category in the list: self.labels.count(labeltemp)
        self.Pcates[labeltemp] = float(self.labels.count(labeltemp)) / float(len(self.labels))
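With the sample labels [0, 1, 0, 1, 0, 1] this produces equal priors. The same counting logic as a standalone sketch:

labels = [0, 1, 0, 1, 0, 1]
Pcates = {}
for label in set(labels):                    # for each distinct category
    Pcates[label] = float(labels.count(label)) / float(len(labels))
print(Pcates)                                # {0: 0.5, 1: 0.5}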

4. calc_wordfreq function: generate ordinary word-frequency vectors:

# generate ordinary word-frequency vectors
def calc_wordfreq(self, trainset):
    self.idf = np.zeros([1, self.vocablen])              # 1 x dictionary length
    self.tf = np.zeros([self.doclength, self.vocablen])  # number of training texts x dictionary length
    for indx in xrange(self.doclength):                  # iterate over all texts
        for word in trainset[indx]:                      # iterate over each word in the text
            self.tf[indx, self.vocabulary.index(word)] += 1   # +1 at the word's position in the dictionary
        for signleword in set(trainset[indx]):
            self.idf[0, self.vocabulary.index(signleword)] += 1
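At this stage self.tf holds raw word counts and self.idf holds document frequencies (how many texts each word appears in); the IDF transform itself is applied only in the TF-IDF variant of section III. A standalone sketch of the counting with a toy dictionary:

import numpy as np

vocabulary = ['my', 'dog', 'stupid']
texts = [['my', 'dog', 'my'], ['stupid', 'dog']]
tf = np.zeros([len(texts), len(vocabulary)])
df = np.zeros([1, len(vocabulary)])          # document frequency, stored in self.idf above

for indx in range(len(texts)):
    for word in texts[indx]:                 # raw counts per text
        tf[indx, vocabulary.index(word)] += 1
    for word in set(texts[indx]):            # each text counted at most once per word
        df[0, vocabulary.index(word)] += 1

print(tf)    # [[2. 1. 0.] [0. 1. 1.]]
print(df)    # [[1. 2. 1.]]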

5. build_tdm function: calculate the per-dimension values of the vector space by category, P(x|yi):

def build_tdm(self):
    self.tdm = np.zeros([len(self.Pcates), self.vocablen])   # category rows x dictionary columns
    sumlist = np.zeros([len(self.Pcates), 1])                # total word count of each category
    for indx in xrange(self.doclength):
        self.tdm[self.labels[indx]] += self.tf[indx]         # accumulate word frequencies into the category row
        # total value of each category: a scalar
        sumlist[self.labels[indx]] = np.sum(self.tdm[self.labels[indx]])
    self.tdm = self.tdm / sumlist                            # normalize into P(x|yi)
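After normalization, each row of self.tdm is the word distribution of one category, so every row sums to 1. The normalization step in isolation, with toy counts:

import numpy as np

tdm = np.array([[4., 2., 2.],                # accumulated word counts of category 0
                [1., 1., 2.]])               # accumulated word counts of category 1
sumlist = np.sum(tdm, axis=1).reshape([2, 1])
tdm = tdm / sumlist                          # rows become P(word|yi)
print(tdm)                                   # [[0.5  0.25 0.25] [0.25 0.25 0.5 ]]
print(np.sum(tdm, axis=1))                   # [1. 1.]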


6. map2vocab function: map a test text onto the current dictionary:

def map2vocab(self, testdata):
    self.testset = np.zeros([1, self.vocablen])
    for word in testdata:
        self.testset[0, self.vocabulary.index(word)] += 1
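For example, mapping the first training text onto the dictionary; this usage sketch assumes the class and loader above are available in Nbayes_lib, as in section IV:

import numpy as np
from Nbayes_lib import *

dataset, listclasses = loadDataSet()
nb = NBayes()
nb.train_set(dataset, listclasses)
nb.map2vocab(dataset[0])                     # map the first text onto the dictionary
print(np.shape(nb.testset))                  # (1, 32): one row, one column per dictionary word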

7. predict function: predict the classification result and output the predicted category:

def predict(self, testset):
    if np.shape(testset)[1] != self.vocablen:    # if the test vector length differs from the dictionary, exit
        print "Input Error"
        exit(0)
    predvalue = 0                                # initialize the class probability
    predclass = ""                               # initialize the category name
    for tdm_vect, keyclass in zip(self.tdm, self.Pcates):
        # P(x|yi) P(yi): traverse tdm and keep the category with the maximum value
        temp = np.sum(testset * tdm_vect * self.Pcates[keyclass])
        if temp > predvalue:
            predvalue = temp
            predclass = keyclass
    return predclass
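Note that zip(self.tdm, self.Pcates) pairs the rows of tdm with the keys of the Pcates dictionary in whatever order the dictionary yields them; it works here because the categories are the integers 0 and 1 and the rows were built at those indices. A small sketch of the pairing, assuming that key order:

import numpy as np

tdm = np.array([[0.5, 0.5], [0.1, 0.9]])
Pcates = {0: 0.5, 1: 0.5}
for tdm_vect, keyclass in zip(tdm, Pcates):
    print(keyclass)                          # 0, then 1: each key is paired with the matching row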

III. Algorithm improvement

Apply the TF-IDF strategy to the ordinary word-frequency vectors, making the weights better at correcting for various biases (for example, differences in sentence length).
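Concretely, the weighting used below is standard TF-IDF: tf(t, d) = count(t, d) / |d| and idf(t) = log(N / df(t)), where N is the number of texts and df(t) is the number of texts containing word t; the final weight is their product. A one-step sketch of why IDF helps:

import numpy as np

N = 6                        # number of texts in the corpus
df = np.array([3., 6.])      # word A appears in 3 texts, word B in all 6
idf = np.log(N / df)
print(idf)                   # [0.6931... 0.]: a word found in every text carries no weight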

8. calc_tfidf function: build the vector space with TF-IDF weights:

# build TF-IDF weights
def calc_tfidf(self, trainset):
    self.idf = np.zeros([1, self.vocablen])
    self.tf = np.zeros([self.doclength, self.vocablen])
    for indx in xrange(self.doclength):
        for word in trainset[indx]:
            self.tf[indx, self.vocabulary.index(word)] += 1
        # eliminate the bias caused by different sentence lengths
        self.tf[indx] = self.tf[indx] / float(len(trainset[indx]))
        for signleword in set(trainset[indx]):
            self.idf[0, self.vocabulary.index(signleword)] += 1
    self.idf = np.log(float(self.doclength) / self.idf)
    self.tf = np.multiply(self.tf, self.idf)     # element-wise multiplication: TF x IDF

Since every dictionary word occurs in at least one text, self.idf contains no zeros, so the division inside the logarithm is safe.

IV. Evaluate the classification results

# -*- coding: utf-8 -*-
import sys
import os
from numpy import *
import numpy as np
from Nbayes_lib import *

dataset, listclasses = loadDataSet()   # import the external data set
# dataset: the word vectors of the sentences
# listclasses: the categories of the sentences, [0, 1, 0, 1, 0, 1]
nb = NBayes()                          # instantiate the class
nb.train_set(dataset, listclasses)     # train on the data set
nb.map2vocab(dataset[0])               # select a test sentence
print nb.predict(nb.testset)           # output the classification result

Executing the naive Bayes class we created prints the classification result:

1
