A classical machine learning algorithm and its Python implementation: naive Bayesian classification and its application in text categorization and spam detection

Source: Internet
Author: User
Tags: natural log, natural logarithm, repetition

Summary:

Naive Bayesian classification is a kind of Bayesian classifier. Bayesian classification algorithms are statistical classification methods that classify using probability theory: the classification principle is to use the Bayesian formula to compute the posterior probability of an object (the probability that the object belongs to a given class) from its prior probability, and then to assign the object to the class with the largest posterior probability. In general, when the number of sample features is large or the correlation between features is strong, naive Bayesian classification is less effective than a decision tree; it performs best when the correlation between features is small. In addition, intermediate computations of naive Bayes, such as the conditional probabilities, are independent of one another, so the algorithm is especially suitable for distributed computing. This article describes the statistical principles of naive Bayesian classification and implements the algorithm for text classification. When used for text classification, a naive Bayesian classifier comes in two flavors, the multinomial model and the Bernoulli model; the implementation here covers both and applies each to spam detection, with remarkable performance.

Note: Personally, I believe the text-classification algorithm in the naive Bayes chapter of Machine Learning in Action is wrong, both for its Bernoulli model (the book's "set-of-words" model) and for its multinomial model (the book's "bag-of-words" model), because its calculation formulas match neither the multinomial nor the Bernoulli model. See "Two models of text categorization" in this article.

(i) Understanding naive Bayesian classification

The naive Bayesian classification model is, alongside the decision tree model, one of the most widely used classification models. Naive Bayesian classification is a kind of Bayesian classifier, and Bayesian classification algorithms are statistical classification methods that classify using probability theory. The classification principle is to use the Bayesian formula to compute the posterior probability of an object (the probability that the object belongs to a given class) from its prior probability, and then to select the class with the largest posterior probability as the class of the object. At present there are four kinds of Bayesian classifiers: naive Bayesian classification, TAN (Tree Augmented Bayes Network), BAN (BN Augmented Naive Bayes) and GBN (General Bayesian Network). This article focuses on the principle of naive Bayesian classification, implements the algorithm in Python, and introduces an application of naive Bayesian classification: spam detection.

The basic idea of naive Bayes is this: using the Bayesian formula, compute the probability P(yi|x) of each category yi (the categories are known in advance) given that the item x to be classified appears, and assign x to the category with the largest probability value. The purpose of training on data is to obtain, for each category, the prior probability of each feature of the sample. It is called "naive" because Bayesian classification makes only the most primitive and simplest assumptions: 1) all features are statistically independent of one another; 2) all features have equal status. So if a sample x has the attributes a1, ..., am, then:

P(x) = P(a1, ..., am) = P(a1) * ... * P(am)

Naive Bayesian classification originates in classical mathematical theory, has a solid mathematical foundation and stable classification efficiency; its advantages are that the algorithm is simple, few parameters need to be estimated, and it is not very sensitive to missing data. In theory, naive Bayesian classification has the smallest error rate compared with other classification methods, but this is not always the case in practice, because naive Bayesian classification assumes that the features of a sample are independent of one another, an assumption that often does not hold in real applications, which affects the correctness of the classification. Therefore, when the number of sample features is large or the correlation between features is strong, naive Bayesian classification is less effective than the decision tree model; it performs best when the correlation between features is small. Overall, the naive Bayesian classification algorithm is simple and effective and is well suited to two-class problems. In addition, intermediate computations of naive Bayes, such as the conditional probabilities, are independent of one another, so it is especially suitable for distributed computing.

(ii) The mathematical principles of naive Bayesian classification

1. Bayes' theorem

Bayes' theorem is a theorem about the conditional probability (and marginal probability) of random events A and B:

P(A|B) = P(B|A) * P(A) / P(B)

where P(A|B) is the probability of A occurring given that B has occurred.

In Bayes' theorem, each quantity has a conventional name:

P(A) is the prior probability (or marginal probability) of A. It is called "prior" because it does not take any information about B into account.

P(A|B) is the conditional probability of A given B; because it depends on the value of B it is also called the posterior probability of A.

P(B|A) is the conditional probability of B given A, and is likewise called the posterior probability of B.

P(B) is the prior (marginal) probability of B and also acts as the normalizing constant (normalizing constant).
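As a quick numeric illustration of the theorem (the numbers below are made up for demonstration and are not from the article), the posterior can be computed directly from the prior, the likelihood and the normalizing constant:

    # Toy illustration of Bayes' theorem with made-up numbers:
    # A = "message is spam", B = "message contains the word 'offer'".
    p_a = 0.3               # prior P(A): 30% of messages are spam
    p_b_given_a = 0.6       # likelihood P(B|A): 60% of spam contains 'offer'
    p_b_given_not_a = 0.05  # P(B|not A): 5% of ham contains 'offer'

    # Normalizing constant P(B) by the law of total probability.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

    # Posterior P(A|B) = P(B|A) * P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))  # ~0.837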

2. The probability-theory principle of naive Bayesian classification

2.1 Probability-theory description of Bayesian classification

Described in terms of probability theory, the process of naive Bayesian classification is:

1. x = (a1, a2, ..., am) is an item to be classified, where each ai is a feature attribute of x (all features have equal status and are mutually independent), m features in total.

2. There is a set of categories Y = {y1, y2, ..., yn}.

3. Compute P(y1|x), P(y2|x), ..., P(yn|x).

4. If P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, then x belongs to category yk.

Obviously the crux of the problem lies in step 3: computing the posterior probability of each category for the item to be classified, using the Bayesian formula.

According to Bayes' theorem, the following derivation holds:

P(yi|x) = P(x|yi) * P(yi) / P(x)

The denominator P(x) is the same constant for every category, so only the numerators need to be computed and compared. Because the features all have equal status and occur independently of one another:

P(x|yi) * P(yi) = P(a1|yi) * P(a2|yi) * ... * P(am|yi) * P(yi)

In the formula above, P(aj|yi), the conditional probability of each feature of the sample under each category, is the key quantity; computing these conditional probabilities for every partition is the central step of naive Bayesian classification and is exactly the task of the training stage. The probability distribution of each feature of a sample may be discrete or continuous.
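A minimal sketch of steps 3 and 4 above; the priors and per-feature conditional probabilities are hypothetical hand-set values, since estimating them is precisely the training task described in section 2.2:

    # Score each class and take the argmax (steps 3-4 of the procedure above).
    priors = {'spam': 0.4, 'ham': 0.6}                      # P(yi), hypothetical
    cond = {                                                # P(aj|yi), hypothetical
        'spam': {'offer': 0.30, 'meeting': 0.05},
        'ham':  {'offer': 0.02, 'meeting': 0.20},
    }

    def classify(features):
        scores = {}
        for y, prior in priors.items():
            p = prior
            for a in features:
                p *= cond[y][a]      # independence assumption: multiply the P(aj|yi)
            scores[y] = p            # proportional to P(yi|x); P(x) cancels out
        return max(scores, key=scores.get)

    print(classify(['offer', 'meeting']))   # -> 'spam' with these toy numbers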

2.2 Calculating the conditional probabilities of each feature

As seen above, calculating the conditional probability P(a|y) of each feature value under each class is the key step of naive Bayes classification. The following describes how it is calculated and handled when the feature attribute follows a discrete or a continuous distribution.

1. Discrete distribution: when a feature attribute is discrete, P(a|y) can be estimated simply by counting, in the training samples, the frequency with which each feature value appears in each category.

Because Bayesian classification multiplies the probabilities of several feature values together to obtain the probability that an item belongs to a class, if the probability of any single feature value is 0 the whole product becomes 0 (the data-sparseness problem), which breaks the assumption that all feature values have equal status and sharply reduces classification accuracy. To limit this damage, the discrete estimates must therefore be calibrated, for example with Laplace calibration: add 1 to the count of every feature value in every category. When the number of training samples is large enough this does not noticeably affect the results, but it resolves the embarrassing zero-probability situation described above (in addition, in terms of computational efficiency, an addition is about an order of magnitude cheaper than a multiplication).
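A small sketch of Laplace calibration for one discrete feature; the counts and category names below are illustrative, not taken from the article:

    # Laplace-calibrated estimate of P(a|y) for a discrete feature.
    # counts[y][a] = how often value a appeared with class y in training (illustrative).
    counts = {'spam': {'free': 30, 'urgent': 10, 'invoice': 0}}
    n_values = 3   # number of distinct values this feature can take

    def smoothed_prob(y, a, counts, n_values, lap=1.0):
        total = sum(counts[y].values())
        # Adding lap to every count ensures an unseen value never yields probability 0.
        return (counts[y].get(a, 0) + lap) / (total + lap * n_values)

    print(smoothed_prob('spam', 'invoice', counts, n_values))   # 1/43 ≈ 0.023 instead of 0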

Data sparseness is a big problem in machine learning. A better solution than Laplace smoothing is: for a word that never appears, find related words in the system by clustering and take the average of the probabilities of those related words as its estimate. For example, if the system contains "apple", "grape" and other fruit words but not "durian", we can use the average probability of "apple", "grape" and the rest as the probability of "durian", which is more reasonable.

2. Continuous distribution: when a feature attribute takes continuous values, it is usually assumed to follow a Gaussian (normal) distribution, that is:

g(x; μ, σ) = (1 / (√(2π) · σ)) * exp(-(x - μ)² / (2σ²))

and

P(ak|yi) = g(ak; μ_yi, σ_yi)

Training on a continuously distributed sample feature therefore amounts to computing its mean and variance for each class.
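A minimal sketch of the continuous case: "training" one class is just estimating the mean and standard deviation of the feature, which are then plugged into the Gaussian density (the sample values are illustrative):

    import math

    def gaussian_pdf(x, mu, sigma):
        """Gaussian density g(x; mu, sigma), used as P(a|y) for a continuous feature."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # "Training" for one class: estimate the mean and standard deviation of the feature.
    samples = [4.9, 5.1, 5.0, 5.3, 4.8]          # illustrative feature values for one class
    mu = sum(samples) / len(samples)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in samples) / len(samples))

    print(gaussian_pdf(5.0, mu, sigma))          # P(a = 5.0 | this class)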

2.3 Algorithm Improvements

In real-world projects the probability values p are often very small decimals, and multiplying many tiny decimals together easily underflows to 0, giving a wrong answer. One solution is to take the natural logarithm of the product, turning the chain of multiplications into a sum, since ln(a*b) = ln a + ln b. Working with natural logarithms loses no information and avoids errors caused by underflow or floating-point rounding. Plotting f(x) and ln(f(x)) shows that the two curves increase and decrease over the same intervals and attain their extrema at the same points, so taking the natural logarithm does not affect the result of the final comparison.


Therefore, the formula below can be evaluated through its logarithm, and the comparison of probabilities becomes a comparison of log-probabilities:

ln(P(a1|yi) * ... * P(am|yi) * P(yi)) = ln P(a1|yi) + ... + ln P(am|yi) + ln P(yi)
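A small demonstration of why the logarithm matters; the probability values are purely illustrative:

    import math

    # 2000 tiny conditional probabilities: the raw product underflows to 0.0,
    # while the sum of logarithms remains perfectly usable for comparison.
    probs = [1e-5] * 2000

    product = 1.0
    for p in probs:
        product *= p
    print(product)                       # 0.0 (floating-point underflow)

    log_score = sum(math.log(p) for p in probs)
    print(log_score)                     # about -23025.85, still comparable across classes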
(iii) Python implementation of the naive Bayesian classification algorithm

When constructing a Bayesian classifier, a sample set of size n is usually divided into a larger training set and a smaller test set: the training set is used to build the classifier, and the test set is used to measure its accuracy. This procedure is called hold-out cross-validation. The m samples of the test set are drawn at random from the sample set, and the remainder form the training set.
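A minimal sketch of such a hold-out split (the sizes and variable names are placeholders):

    import random

    # Hold-out split: randomly reserve m samples for testing, train on the rest.
    n, m = 50, 10                                   # e.g. 50 labelled e-mails, 10 held out
    test_idx = set(random.sample(range(n), m))      # indices of the test set
    train_idx = [i for i in range(n) if i not in test_idx]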

Section (iv) implements the naive Bayesian classification algorithm with both the multinomial and the Bernoulli model for text classification and applies them to spam detection.

(iv) Naive Bayesian classification for text categorization

1. Two models of text categorization

What is text categorization? Text categorization means assigning one or more predefined category labels to a given text, accurately and efficiently. It is an important part of many data-management tasks: classifying news into categories (finance, sports, education, etc.) and spam detection are both text-classification problems. For a more detailed description, refer to "Text categorization Overview". Text classification can use the Rocchio algorithm, the naive Bayesian classification algorithm, the k-nearest-neighbor algorithm, decision trees, neural networks and support vector machines; this section uses naive Bayes to complete the classification task.

Text classification starts with word segmentation, which is an important research direction in itself. Segmenting English is comparatively natural and easy (splitting into words is almost enough), while Chinese segmentation is more troublesome and needs special treatment in the preprocessing stage; Python has good Chinese word-segmentation libraries (jieba, a Chinese word-segmentation tool; SnowNLP, a Chinese text-processing library; Loso, another Chinese word segmenter). The focus of this article is classifying text with naive Bayes, so the documents are assumed to be already segmented (in practice the code simply uses Python's string split method).
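For English text, the segmentation assumed here is nothing more elaborate than splitting on non-letter characters and discarding very short tokens; a sketch of such preprocessing (the regular expression and length threshold are assumptions, not quoted from the article's code):

    import re

    def text_to_tokens(text):
        """Very simple English tokenizer: split on non-letter characters, lower-case,
        and drop very short tokens. Chinese text would instead need a segmentation
        library such as jieba."""
        tokens = re.split(r'[^a-zA-Z]+', text)
        return [t.lower() for t in tokens if len(t) > 2]

    print(text_to_tokens("Win a FREE iPhone now!!!"))   # ['win', 'free', 'iphone', 'now']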

In text classification, two naive Bayes probability models can be used. Taking binary classification as the illustrative setting, the two models differ in how they compute the class prior probability P(c) and the class-conditional probability P(tk|c) of each word: P(tk|c) can be read as how much evidence the word tk provides that document d belongs to class c, and P(c) can be read as the overall share of class c (how likely that class is a priori).

When the naive Bayesian algorithm is used for text classification, the choice of feature words is a very important performance factor. Using every word as a feature would consume a huge amount of memory and degrade computational performance. The feature-selection strategy should therefore be tuned to the task, for example by manually screening the training data first, or by keeping only words that occur more than n (n >= 1) times.

(i) The multinomial model is frequency-based: probabilities are computed from the number of times a word occurs in a class.

The multinomial model computes probabilities with the "word" as the basic unit. Let a text be d = (t1, t2, ..., tk), where the tk are the words appearing in the document, repetitions allowed. The multinomial model computes the class prior probability and the per-word class-conditional probability as:

Prior probability P(c) = (total number of words in documents of class c) / (total number of words in the entire training sample)

Class-conditional probability P(tk|c) = (total number of times the word tk occurs across the documents of class c + 1) / (total number of words of class c + |V|)

V is the vocabulary of the training sample (each distinct word is counted once, no matter how often it appears), and |V| is the number of distinct words the training sample contains.

The conditional probability of each word in the multinomial model takes the number of occurrences into account, which effectively weights the words and better matches the characteristics of natural language. In addition, according to the course notes of Stanford's "Introduction to Information Retrieval", the multinomial model achieves higher accuracy. In the posterior-probability calculation, only the words that actually occur in the document take part in deciding the category, that is:

P(t1|c) * P(t2|c) * ... * P(tk|c) * P(c)
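A compact sketch of the multinomial estimates above, including the +1 / +|V| Laplace smoothing; the input structure docs_by_class (a mapping from class label to a list of tokenized documents) and the toy data are assumptions for illustration:

    from collections import Counter

    def train_multinomial(docs_by_class):
        """docs_by_class: {class_label: [token lists]} (hypothetical input format).
        Returns class priors P(c) and word conditionals P(t|c) per the formulas above."""
        vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
        total_words = sum(len(d) for docs in docs_by_class.values() for d in docs)
        priors, cond = {}, {}
        for c, docs in docs_by_class.items():
            counts = Counter(t for d in docs for t in d)
            n_c = sum(counts.values())                   # total number of words under class c
            priors[c] = n_c / total_words                # P(c)
            cond[c] = {t: (counts[t] + 1) / (n_c + len(vocab)) for t in vocab}   # P(t|c)
        return priors, cond, vocab

    priors, cond, vocab = train_multinomial({
        'spam': [['free', 'offer', 'free'], ['cheap', 'offer']],
        'ham':  [['meeting', 'tomorrow'], ['project', 'meeting', 'notes']],
    })
    print(priors['spam'], cond['spam']['free'])          # 0.5, 3/12 = 0.25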

(ii) The Bernoulli model is set-of-words based: probabilities are computed from whether or not a word appears in a class. The Bernoulli model does not consider how many times a word occurs in a document, only whether it occurs at all, so in this sense it effectively assumes that all words carry equal weight.

First recall the definition of a Bernoulli trial: a random experiment in which an event either occurs or does not, so there are only two possible outcomes. In the Bernoulli model, regardless of class, each word either appears or does not appear, so the Bayesian Bernoulli model for binary classification can be understood as follows: the appearance of certain words in a text determines its class, and this actually has two parts: some words appear, and other words do not appear. Therefore, in the Bernoulli model's conditional-probability calculation, the words that appear in the text participate as "supporting" evidence, while the words that do not appear in the text but do appear in the global vocabulary participate as "opposing" evidence.

The Bernoulli model computes probabilities with the "text" (document) as the basic unit. Let a text be d = (t1, t2, ..., tk), where the tk are the distinct words appearing in the document (repetitions are not counted). The Bernoulli model computes the class prior probability and the per-word class-conditional probability as:

P(c) = (number of documents of class c) / (total number of documents in the entire training sample)

P(tk|c) = (number of documents of class c containing the word tk + 1) / (total number of documents of class c + 2)

In the calculation of P(tk|c), the +1 in the numerator is Laplace calibration, and the +2 in the denominator is there because a word either occurs or does not occur (two possible outcomes), likewise a form of data calibration (TBD). When evaluating the posterior probability, the "supporting" probability is P(tk|c) and the "opposing" probability is (1 - P(w|c)), where w is a word that is not in d but appears in the global vocabulary of the class (because of the Laplace calibration, effectively every vocabulary word participates in the calculation), that is:

P(t1|c) * P(t2|c) * ... * P(tk|c) * (1 - P(word0|c)) * ... * (1 - P(wordm|c)) * P(c)
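A compact sketch of the Bernoulli estimates (document counts, +1 / +2 smoothing) and of the posterior score in which absent vocabulary words contribute the "opposing" factor (1 - P(w|c)); docs_by_class is the same hypothetical structure as in the multinomial sketch:

    import math

    def train_bernoulli(docs_by_class):
        """Bernoulli estimates: P(c) from document counts, P(t|c) from the number of
        class-c documents containing t, with +1 / +2 smoothing."""
        vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
        n_docs = sum(len(docs) for docs in docs_by_class.values())
        priors, cond = {}, {}
        for c, docs in docs_by_class.items():
            priors[c] = len(docs) / n_docs
            cond[c] = {t: (sum(1 for d in docs if t in d) + 1) / (len(docs) + 2)
                       for t in vocab}
        return priors, cond, vocab

    def bernoulli_log_score(tokens, c, priors, cond, vocab):
        present = set(tokens)
        score = math.log(priors[c])
        for t in vocab:
            # words present in the document count "for", absent vocabulary words count "against"
            score += math.log(cond[c][t]) if t in present else math.log(1 - cond[c][t])
        return score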

For these two models you can also refer to the blog post "Naive Bayesian text classification algorithm learning"; this article implements both models.

2. Python implementation

2.1 The NavieBayes object

A NavieBayes object is defined in the NavieBayes module; its attributes are shown in the __init__ function:

Source code:
    import inspect
    from numpy import array

    class NavieBayes(object):
        def __init__(self, vocabSet=None, classPriorP=None, conditionP=None,
                     classPriorP_ber=None, conditionP_ber=None, negConditionP_ber=None,
                     lapFactor=1, classLabelList=None, **args):
            '''modelType is the NB model type, 'multinomial' or 'Bernoulli'; the default is 'multinomial'.
            Parameters:
                classPriorP is a (1, m) list of logs of class prior probabilities, m is the class count (multinomial)
                conditionP is an (m, n) list of logs of conditional probabilities, n is the vocabulary size (multinomial)
                classPriorP_ber, conditionP_ber: the corresponding quantities for the Bernoulli model
                negConditionP_ber is the negative conditional probability for the Bernoulli model, actually log(1 - conditional probability)
                vocabSet is the vocabulary set
                lapFactor is the Laplace adjustment factor
                classLabelList is the list of class labels
            '''
            # Recover the name of the variable this instance is assigned to
            # (a fragile introspection trick kept from the original code).
            obj_list = inspect.stack()[1][-2]
            self.__name__ = obj_list[0].split('=')[0].strip()
            # self.modelType = modelType
            self.classPriorP = array(classPriorP)
            self.conditionP = array(conditionP)
            self.classPriorP_ber = array(classPriorP_ber)
            self.conditionP_ber = array(conditionP_ber)
            self.negConditionP_ber = array(negConditionP_ber)
            self.vocabSet = vocabSet
            if vocabSet:
                self.vocabSetLen = len(self.vocabSet)
            self.lapFactor = lapFactor
            self.classLabelList = classLabelList

The NavieBayes object's lapFactor attribute is used for Laplace calibration, to prevent any term's conditional probability from being 0; the value of the Laplace factor can be adjusted so that the statistical distribution over the vocabulary is closer to reality, thereby improving classification performance. With a very large vocabulary, if feature words are selected on the condition that they occur more than once, then lapFactor = 1 is clearly unreasonable; to describe the word statistics accurately it is advisable to set lapFactor = 0.0001 or an even smaller value.

In addition, the NavieBayes object supports the two classification models; the default is modelType = 'multinomial', the multinomial model. Because both the prior probabilities and the class-conditional probabilities are stored as log values, implementing the Bernoulli model additionally requires an attribute for the "negative" conditional probability log(1 - P), which takes part in the calculation as the "opposing" evidence when the Bernoulli model decides the category.

NavieBayes provides methods such as trainNB and classifyNB, which, according to modelType, select the Bernoulli or multinomial variant of training and classification to produce the corresponding naive Bayesian classifier. The input data format for the trainNB method is as follows:

Source code:
    postingList = [['my', 'dog', 'have', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]   # or ['A', 'B', 'A', 'B', 'A', 'B'], the class label list
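A hypothetical usage sketch of the object with the data above; the exact signatures of trainNB and classifyNB are not shown in this excerpt, so the calls below are assumptions about how the author's API fits together rather than quoted code:

    # Hypothetical usage; the trainNB / classifyNB signatures are assumed, not quoted.
    nb = NavieBayes(lapFactor=1)
    nb.trainNB(postingList, classVec, modelType='multinomial')              # assumed call
    print(nb.classifyNB(['stupid', 'garbage'], modelType='multinomial'))    # expected: class 1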
2.2 Using NavieBayes for spam detection

When using naive Bayes to solve real-life problems, you first need to obtain a list of strings from the text content and then build word vectors from it. In this example we look at one of the most famous applications of naive Bayes: spam filtering.

There are 25 spam e-mails (spam) and 25 non-spam e-mails (ham). After word segmentation, the messages are stored in one list in the format expected by the NavieBayes trainNB method, and their classes are stored in another list. In addition, during classifier training, to verify the effectiveness of the resulting classifier, 10 messages are drawn at random from the sample list to serve as test data.
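A sketch of how such a corpus might be assembled and split; the file paths, the 0/1 labels and the whitespace tokenization are assumptions about how the 25 + 25 e-mails are stored, not quoted from the article's code:

    import random

    # Assemble the corpus (paths and labels are assumptions about how the e-mails are stored).
    doc_list, class_list = [], []
    for i in range(1, 26):
        doc_list.append(open('email/spam/%d.txt' % i, errors='ignore').read().split())
        class_list.append(1)     # 1 = spam
        doc_list.append(open('email/ham/%d.txt' % i, errors='ignore').read().split())
        class_list.append(0)     # 0 = ham

    # Randomly hold out 10 messages for testing, train on the remaining 40.
    test_idx = set(random.sample(range(len(doc_list)), 10))
    train_docs = [d for i, d in enumerate(doc_list) if i not in test_idx]
    train_cls = [c for i, c in enumerate(class_list) if i not in test_idx]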

With either the multinomial model or the Bernoulli model, the naive Bayesian classifier's spam-detection error rate is 0 for Laplace factor values of [1, 0.5, 0.1, 0.01, 0.001, 0.00001]. That the performance is this good is presumably because the feature words extracted from the training and test data are quite distinctive.

The naive Bayesian text-classification learning package implementing the multinomial and Bernoulli models is:

In addition, "machine learning Combat"--the fourth chapter: naive Bayesian called its word set type Bernoulli model is not accurate, because in the class conditional probability calculation, the program is calculated according to the number of words, categorical decision does not calculate the "negative" probability. If it is defined as a polynomial model is not appropriate, because it only statistics the occurrence of words or not, for the definition of the word bag type can be called the polynomial model, but its class conditional probability calculation formula is not accurate.


References:
Algorithm Grocer: Naive Bayesian classification among classification algorithms (Naive Bayesian classification)
Study of the naive Bayesian text classification algorithm

This article is by Adan. Original source: A classical machine learning algorithm and its Python implementation: naive Bayesian classification and its application in text categorization and spam detection. Please indicate the source when reprinting.

