Machine Learning: the Naive Bayes Text Classification Algorithm

Source: Internet
Author: User
Principle

In a classification problem, we need to assign a thing to a category. A thing has many attributes; we collect these attributes into a vector x = (x1, x2, x3, ..., xn) and use x to represent the thing. There are also many possible categories, expressed as a set Y = {y1, y2, ..., ym}. If x belongs to category y1, we give x the label y1, meaning that x belongs to category y1. This is called classification.

The set of all x is written X and is called the attribute set. The relationship between X and Y is uncertain: we can only say that, to some extent, x has a certain possibility of belonging to class y1; for example, x may have an 80% chance of belonging to class y1. We can therefore treat X and Y as random variables. P(Y|X) is called the posterior probability of Y; in contrast, P(Y) is called the prior probability of Y.

In the training phase, we learn the probability P(Y|X) for each combination of X and Y from the training data. At classification time, given an instance of X, we compute the posterior probability P(y|X) for every y; the y with the largest value is the category that X belongs to. According to Bayes' theorem, the posterior probability is

P(Y|X) = P(X|Y) P(Y) / P(X)

When comparing the posterior probabilities for different values of Y, the denominator P(X) is always the same constant, so it can be ignored. The prior probability P(Y) is easy to estimate: it is simply the proportion of training samples that belong to each class. Computing P(Y|X) directly is troublesome, so naive Bayes makes an independence assumption: x1, x2, ..., xn are conditionally independent of each other given Y, which is one reason it is called "naive". Then P(X|Y) = P(x1|Y) P(x2|Y) ... P(xn|Y), and each P(xi|Y) is easy to estimate.
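To make the factorization concrete, here is a minimal sketch of the decision rule in Python (all probability values below are made-up numbers for illustration, not estimates from real data): each class is scored by multiplying its prior with the per-feature conditionals, and the class with the largest score wins; P(X) is dropped because it is the same for every class.

# Minimal naive Bayes decision rule; the probability tables are hypothetical,
# only meant to illustrate P(X|Y) = P(x1|Y) * P(x2|Y) * ... * P(xn|Y).
prior = {1: 0.6, 0: 0.4}                      # P(Y)
cond = {                                      # cond[y][i][v] = P(xi = v | Y = y)
    1: [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],  # class 1, features x1 and x2
    0: [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],  # class 0, features x1 and x2
}

def classify(x):
    scores = {}
    for y in prior:
        score = prior[y]
        for i, value in enumerate(x):
            score *= cond[y][i][value]        # independence assumption
        scores[y] = score                     # proportional to P(Y|X); P(X) is ignored
    return max(scores, key=scores.get)

print(classify((1, 0)))                       # prints the most probable class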

For a binary (two-class) problem, to compare P(y=1|X) and P(y=0|X) we only need to compare the numerator P(X|Y) P(Y). So we only need to compute n conditional probabilities and the prior probabilities.

Text categorization

In text categorization, suppose we have a document d ∈ X, where X is the document vector space, and a fixed set of classes C = {c1, c2, ..., cj}; a class is also called a label. Obviously, the document vector space is a high-dimensional space. We have a set of labeled documents <d, c> as training samples, with <d, c> ∈ X × C. For example:

<d, c> = {Beijing joins the World Trade Organization, China}

For this document, which contains only one sentence, we classify it as China, i.e., we give it the label "China".

We hope to use some training algorithm to learn a function γ that maps documents to categories:

γ: X → C

This type of learning is called supervised learning, because there is a supervisor (the set of labeled documents we provide in advance) overseeing the whole learning process like a teacher.

The naive Bayes classifier is a kind of supervised learning. There are two common models: the multinomial model and the Bernoulli model.
Multinomial model

In the multinomial model, a document is d = (t1, t2, ..., tk), where the tk are the words that appear in the document; repetitions are allowed.

Prior probability P(c) = total number of words in documents of class c / total number of words in the whole training sample

Class-conditional probability P(tk|c) = (number of times word tk appears in documents of class c + 1) / (total number of words in class c + |V|)

V is the vocabulary of the training sample (the set of distinct words; a word that appears multiple times is counted only once), and |V| is the number of distinct words the training sample contains. Here, m = |V| and p = 1/|V| (the parameters of the m-estimate used for smoothing).

P(tk|c) can be seen as how much evidence the word tk provides that d belongs to class c, while P(c) can be regarded as the overall proportion of class c (how likely class c is a priori).
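As a small illustration, here is a minimal sketch of the multinomial estimates defined above, computed on a tiny made-up corpus (the documents and labels are hypothetical and only serve to show the counting and the add-one smoothing):

from collections import Counter, defaultdict

# Hypothetical toy corpus: (document words, class label)
train = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["tokyo", "japan", "chinese"], "japan"),
]

word_count = defaultdict(Counter)   # word_count[c][t] = occurrences of word t in class c
total_words = Counter()             # total_words[c]   = total number of words in class c
for words, c in train:
    word_count[c].update(words)
    total_words[c] += len(words)

vocab = set(t for words, _ in train for t in words)
all_words = sum(total_words.values())

def prior(c):
    # P(c) = total words in class c / total words in the whole training sample
    return float(total_words[c]) / all_words

def cond(t, c):
    # P(t|c) = (occurrences of t in class c + 1) / (total words in class c + |V|)
    return float(word_count[c][t] + 1) / (total_words[c] + len(vocab))

print(prior("china"), cond("chinese", "china"), cond("tokyo", "china"))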

Bernoulli model

Prior probability P(c) = number of documents of class c / total number of documents in the training sample

Class-conditional probability P(tk|c) = (number of documents of class c containing word tk + 1) / (number of documents of class c + 2). Here, m = 2 and p = 1/2.
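For comparison, a minimal sketch of the Bernoulli estimates on the same kind of made-up corpus; each document is reduced to the set of words it contains, because only presence or absence matters, and every denominator counts documents:

from collections import Counter

# Hypothetical toy corpus: (document words, class label)
train = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["tokyo", "japan", "chinese"], "japan"),
]

doc_count = Counter(c for _, c in train)       # number of documents per class
doc_freq = Counter()                           # doc_freq[(t, c)] = documents of class c containing t
for words, c in train:
    for t in set(words):                       # presence/absence only
        doc_freq[(t, c)] += 1

total_docs = sum(doc_count.values())

def prior(c):
    # P(c) = documents of class c / total number of documents
    return float(doc_count[c]) / total_docs

def cond(t, c):
    # P(t|c) = (documents of class c containing t + 1) / (documents of class c + 2)
    return float(doc_freq[(t, c)] + 1) / (doc_count[c] + 2)

print(prior("china"), cond("chinese", "china"), cond("tokyo", "china"))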

Be sure to note: the Bernoulli model works at document granularity, so the denominator is the number of documents of class c, not the total number of words in class c.

Here are a couple of links that get this wrong:

http://blog.csdn.net/kongying168/article/details/7026389

http://cn.soulmachine.me/blog/20100528/

There are many others that are not listed here.


In both models above, the +1 in the numerator and the corresponding term in the denominator are added to prevent any conditional probability from being 0, which would make the whole product P(X|Y) = P(x1|Y) P(x2|Y) ... P(xn|Y) equal to 0.
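As a tiny worked example with made-up counts: suppose class c contains 8 documents and the word "tokyo" appears in none of them. Without smoothing, P(tokyo|c) = 0/8 = 0, so any document containing "tokyo" gets P(X|c) = 0 no matter how strongly its other words point to class c. With the Bernoulli smoothing above, P(tokyo|c) = (0 + 1)/(8 + 2) = 0.1, and the other words can still decide the outcome.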


Demo

As an example, we classify messages as spam or not spam. The data set is the Spambase collection from the University of California, Irvine: http://archive.ics.uci.edu/ml/datasets/spambase. The data format is described there; it is in English, so I will not explain it (mainly because it would be hard to explain, -_-||).

To judge how good the classification model is, we can compute the AUC value. This experiment uses the metrics module of the sklearn package to compute the ROC curve and the AUC value (for ROC and AUC see http://blog.csdn.net/chjjunking/article/details/5933105). This requires the posterior probability of each sample, so the denominator P(X) is needed as well, and the whole formula is transformed into:

P(y=1|X) = P(X|y=1) P(y=1) / P(X)
         = P(x1|y=1) ... P(xn|y=1) P(y=1) / (P(x1|y=1) ... P(xn|y=1) P(y=1) + P(x1|y=0) ... P(xn|y=0) P(y=0))

Dividing the numerator and the denominator by P(x1|y=0) ... P(xn|y=0) P(y=0) turns everything into the prior ratio times a product of per-feature likelihood ratios. (The formula is awkward to type; I had planned to handwrite it and upload a picture, -_-||.)
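Before the full program, here is a minimal sketch of that transformation with made-up likelihood ratios: summing the logs of the per-feature ratios and then converting the posterior odds into a probability avoids multiplying many tiny numbers, and the naivebayes.py code below follows the same pattern.

import math

# Hypothetical per-feature likelihood ratios P(xi|y=1)/P(xi|y=0) for one sample
likelihood_ratios = [2.5, 0.8, 1.7, 0.4]
prior_ratio = 0.6 / 0.4                 # P(y=1)/P(y=0), made-up priors

# Sum of logs instead of a product of probabilities (numerically safer)
log_odds = math.log(prior_ratio) + sum(math.log(r) for r in likelihood_ratios)
posterior_odds = math.exp(log_odds)     # P(y=1|X) / P(y=0|X)

# Posterior probability, obtained by dividing numerator and denominator
# of Bayes' rule by P(X|y=0) P(y=0)
posterior_prob = posterior_odds / (1.0 + posterior_odds)
print(posterior_prob)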
OK, enough talk. Here is the code, naivebayes.py:

#!/usr/bin/python
# NaiveBayes classification
# lming_08 2014.07.06
import sys
from sys import argv   # note: added so the command-line entry below runs; the listing above omitted it
import math
import numpy as np
from sklearn import metrics


class NaiveBayes:
    def __init__(self, trainfile):
        self.trainingFile = trainfile
        self.trainingData = self.read_data(trainfile)

    # Read training or testing data
    def read_data(self, file):
        data = []
        fd = open(file)
        for line in fd:
            arr = line.rstrip().split(',')
            # Turn an instance's last column (the y column) into an integer
            arr[-1] = int(arr[-1])
            # Append the instance to the data set
            data.append(tuple(arr))
        fd.close()
        return data

    def train_model_with_bernoulli(self):
        self.sumPosInstance = 0.
        self.sumNegInstance = 0.
        self.termFreq = {}
        for instance in self.trainingData:
            if int(instance[-1]) == 1:
                self.sumPosInstance += 1
            else:
                self.sumNegInstance += 1
            for i in range(len(instance) - 1):
                if i < 55:
                    # Columns before index 55 are treated as frequency features and binarized (present / absent)
                    if float(instance[i]) > 0:
                        key = "freq" + "|" + str(i + 1) + "|" + "1"
                    else:
                        key = "freq" + "|" + str(i + 1) + "|" + "0"
                else:
                    # Remaining columns are keyed by their raw value
                    key = "length" + "|" + str(i + 1) + "|" + instance[i]

                if key not in self.termFreq:
                    self.termFreq[key] = [0, 0]
                if int(instance[-1]) == 1:
                    self.termFreq[key][1] += 1
                else:
                    self.termFreq[key][0] += 1
        # prior_prob = P(y=1)
        self.prior_prob = self.sumPosInstance / (self.sumPosInstance + self.sumNegInstance)
        # prior_ratio = P(y=1) / P(y=0)
        self.prior_ratio = self.sumPosInstance / self.sumNegInstance

    # This function should be called before predict()
    def set_testfile(self, testfile):
        self.testing_data = self.read_data(testfile)

    def predict(self):
        self.testingY = []
        self.predict_result = []
        for instance in self.testing_data:
            self.predict_result.append(self.predict_instance_with_bernoulli(instance))
            self.testingY.append(instance[-1])

    def get_statistics(self):
        true_classify_count = 0.
        false_classify_count = 0.
        for instance in self.testing_data:
            post_prob = self.predict_instance_with_bernoulli(instance)
            if post_prob >= 0.5 and instance[-1] == 1:
                true_classify_count += 1
            elif post_prob >= 0.5 and instance[-1] == 0:
                false_classify_count += 1
            elif post_prob < 0.5 and instance[-1] == 1:
                false_classify_count += 1
            elif post_prob < 0.5 and instance[-1] == 0:
                true_classify_count += 1
        return true_classify_count, false_classify_count

    def predict_instance_with_bernoulli(self, instance):
        f = 0.
        for i in range(len(instance) - 1):
            if i < 55:
                if float(instance[i]) > 0:
                    key = "freq" + "|" + str(i + 1) + "|" + "1"
                else:
                    key = "freq" + "|" + str(i + 1) + "|" + "0"
            else:
                key = "length" + "|" + str(i + 1) + "|" + instance[i]
            if key in self.termFreq:
                # Accumulate the log of the smoothed likelihood ratio for this feature
                f += math.log((self.termFreq[key][1] + 1) / ((self.termFreq[key][0] + 2) * self.prior_ratio))
        posterior_ratio = self.prior_ratio * math.exp(f)
        # Posterior probability P(y=1|X)
        prob = posterior_ratio / (1. + posterior_ratio)
        return prob

    def getAUC(self):
        y = np.array(self.testingY)
        pred = np.array(self.predict_result)
        fpr, tpr, thresholds = metrics.roc_curve(y, pred)  # metrics.roc_curve(y, pred, pos_label=1)
        auc = metrics.auc(fpr, tpr)
        return auc


def main(trainfile, testfile):
    nb = NaiveBayes(trainfile)
    nb.train_model_with_bernoulli()
    nb.set_testfile(testfile)
    nb.predict()
    print("The value of AUC: %f" % nb.getAUC())
    print("The value of prior probability: %f" % nb.prior_prob)


if __name__ == "__main__":
    if len(argv) != 3:
        print("Usage: python %s trainfile(in) testfile(out)" % __file__)
        sys.exit(0)
    main(argv[1], argv[2])

partitionfile.py is used to split the data file into a training file (80%) and a test file (20%):
#!/usr/bin/python
# Partition a file into a training file (80%) and a testing file (20%)
# lming_08 2014.07.06
from random import randint


def partition_file(file, train_file, test_file):
    ltest = []
    ltrain = []
    fd = open(file, "r")
    train_fd = open(train_file, "w")
    test_fd = open(test_file, "w")
    test_index = 0
    train_index = 0
    for line in fd:
        # Draw a number in 1..10; two of the ten values (5 and 6) send the line
        # to the test set, giving roughly a 20% test split
        rnum = randint(1, 10)
        if rnum == 5 or rnum == 6:
            test_index += 1
            ltest.extend(line)
            if test_index == 100:
                test_fd.writelines(ltest)
                ltest = []
                test_index = 0
        else:
            train_index += 1
            ltrain.extend(line)
            if train_index == 100:
                train_fd.writelines(ltrain)
                ltrain = []
                train_index = 0
    # Flush whatever is left in the buffers
    if len(ltest) > 0:
        test_fd.writelines(ltest)
    if len(ltrain) > 0:
        train_fd.writelines(ltrain)
    train_fd.close()
    test_fd.close()
    fd.close()
 

classify.py is the main execution file:
#!/usr/bin/python
# main execute file
# lming_08 2014.07.06
import sys
from sys import argv
from naivebayes import NaiveBayes
from partitionfile import partition_file

def main(srcfile, trainfile, testfile):
    partition_file(srcfile, trainfile, testfile)
    nb = NaiveBayes(trainfile)
    nb.train_model_with_bernoulli()
    nb.set_testfile(testfile)
    nb.predict()
    print("The value of AUC: %f" % nb.getAUC())
    print("The value of prior probability: %f" % nb.prior_prob)
    true_classify_count, false_classify_count = nb.get_statistics()
    error_rate = false_classify_count / (true_classify_count + false_classify_count)
    print("The error rate: %f" % error_rate)
	
if __name__ = = "__main__":
    If Len (argv)!= 4:
        print "Usage:python%s srcfile (in) trainfile (out) testfile (out)" % __file__
        sys.exit (0)

    main (argv[1], argv[2], argv[3])

The results of the execution are:

You can see that the AUC value is fairly high and the classification error rate is 9.87%, which is acceptable. Finally, one complaint: the algorithmic error mentioned above even appears in the "Machine Learning" book, and many people online have learned from that book's code, yet no one has questioned it or pointed it out.
Reference documents:

http://blog.csdn.net/tbkken/article/details/8062358
http://blog.csdn.net/kongying168/article/details/7026389

http://cn.soulmachine.me/blog/20100528/

http://www.chepoo.com/naive-bayesian-text-classification-algorithm-to-learn.html

http://blog.csdn.net/chjjunking/article/details/5933105
