Classification method based on probability theory in Python programming: Naive Bayes


Most of us have largely forgotten the probability theory we once learned, so this article reviews the basics needed here.

Probability theory-based classification method: Naive Bayes

1. Overview

Bayesian classification is the general term for the family of classification algorithms that are all based on Bayes' theorem. This chapter first introduces the foundation of these algorithms, Bayes' theorem, and then uses examples to discuss the simplest Bayesian classifier: Naive Bayes.

2. Bayesian theory & conditional probability

2.1 Bayesian theory

Suppose we have a dataset composed of two classes of data, shown in a figure as dots and triangles.

We use p1(x, y) to denote the probability that the data point (x, y) belongs to Category 1 (the class drawn as dots in the figure), and p2(x, y) to denote the probability that it belongs to Category 2 (the class drawn as triangles). For a new data point (x, y), the following rule determines its category:

If p1(x, y) > p2(x, y), the point belongs to Category 1. If p2(x, y) > p1(x, y), the point belongs to Category 2.

That is, we select the category with the higher probability. This is the core idea of Bayesian decision theory: choose the decision with the highest probability.
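As a minimal sketch of this rule in Python (the probability values below are made up purely for illustration, not taken from the figure):

# Made-up probabilities for a new point (x, y)
p1 = 0.3   # assumed probability that (x, y) belongs to Category 1
p2 = 0.7   # assumed probability that (x, y) belongs to Category 2

if p1 > p2:
    print('Category 1')
else:
    print('Category 2')   # prints: Category 2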

2.1.2 Conditional probability

If you are already familiar with notation such as p(x, y | c1), you can skip this section.

Suppose there is a jar containing 7 stones, 3 of which are white and 4 black. If a stone is drawn at random, what is the probability that it is white? Since there are 7 stones and 3 of them are white, the probability of drawing a white stone is 3/7. What is the probability of drawing a black stone? Obviously, it is 4/7. We use P(white) to denote the probability of drawing a white stone; its value is obtained by dividing the number of white stones by the total number of stones.

If the seven stones are put in two buckets, how should we calculate the above probability?

When calculating P(white) or P(black), knowing which bucket the stone comes from changes the result. This is the so-called conditional probability. Suppose we want the probability of drawing a white stone from bucket B; this probability is written P(white | bucketB) and read as "the probability of drawing a white stone, given that the stone comes from bucket B". With bucket A holding 2 white and 2 black stones and bucket B holding 1 white and 2 black stones, it is easy to see that P(white | bucketA) = 2/4 and P(white | bucketB) = 1/3.

The formula for calculating the conditional probability is as follows:

P(white | bucketB) = P(white and bucketB) / P(bucketB)

First, divide the number of white stones in bucket B by the total number of stones in the two buckets to get P(white and bucketB) = 1/7. Second, since there are 3 stones in bucket B and 7 stones in total, P(bucketB) = 3/7. Therefore P(white | bucketB) = P(white and bucketB) / P(bucketB) = (1/7) / (3/7) = 1/3.
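The same counting argument can be checked with a few lines of Python; this is only a small sketch of the stone example above, with the bucket contents written out directly:

# bucket A holds 2 white and 2 black stones, bucket B holds 1 white and 2 black stones
bucketA = ['white', 'white', 'black', 'black']
bucketB = ['white', 'black', 'black']
total = len(bucketA) + len(bucketB)                     # 7 stones in all

p_white_and_B = bucketB.count('white') / float(total)   # 1/7
p_B = len(bucketB) / float(total)                       # 3/7
p_white_given_B = p_white_and_B / p_B                   # (1/7) / (3/7) = 1/3
print(p_white_given_B)                                  # 0.3333...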

Another effective way of calculating a conditional probability is known as Bayes' rule. Bayes' rule tells us how to swap the condition and the outcome in a conditional probability: if P(x | c) is known and P(c | x) is required, it can be computed as:

P(c | x) = P(x | c) P(c) / P(x)
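As a quick check with the stone example above (an illustrative calculation, not part of the original text), Bayes' rule lets us recover P(bucketB | white) from P(white | bucketB):

p_white_given_B = 1.0 / 3   # from the bucket example above
p_B = 3.0 / 7               # 3 of the 7 stones sit in bucket B
p_white = 3.0 / 7           # 3 of the 7 stones are white

p_B_given_white = p_white_given_B * p_B / p_white
print(p_B_given_white)      # 0.3333..., i.e. 1 of the 3 white stones is in bucket B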

Use conditional probability for classification

The Bayesian decision theory mentioned above requires calculating two probabilities, p1(x, y) and p2(x, y):

If p1(x, y) > p2(x, y), the point belongs to Category 1; if p2(x, y) > p1(x, y), it belongs to Category 2.

This is not the whole story of Bayesian decision theory, however. Using p1() and p2() above was only a simplification of the notation; what really needs to be computed and compared is p(c1 | x, y) and p(c2 | x, y). These symbols mean: given a data point described by x and y, what is the probability that it comes from class c1? What is the probability that it comes from class c2? Note that these are not the same as p(x, y | c1), but Bayes' rule lets us swap the condition and the outcome. Specifically, applying Bayes' rule gives:

p(ci | x, y) = p(x, y | ci) p(ci) / p(x, y)

Using the preceding definitions, we can define the Bayesian classification criterion as follows:

If P(c1 | x, y) > P(c2 | x, y), the point belongs to class c1. If P(c2 | x, y) > P(c1 | x, y), it belongs to class c2.
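Since p(x, y) appears in both posteriors, it can be dropped when all we need is the comparison. A small sketch of the criterion, with made-up likelihoods and priors for illustration:

# Made-up class-conditional likelihoods and priors
p_xy_given_c1 = 0.20
p_xy_given_c2 = 0.05
p_c1 = 0.5
p_c2 = 0.5

# p(ci | x, y) is proportional to p(x, y | ci) * p(ci); the shared p(x, y) cancels out
score_c1 = p_xy_given_c1 * p_c1
score_c2 = p_xy_given_c2 * p_c2
print('class c1' if score_c1 > score_c2 else 'class c2')   # class c1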

In document classification, an entire document (such as an email) is an instance, and certain elements of the email constitute its features. We can look at the words that appear in a document and treat each word as a feature, with the presence or absence of the word as the value of that feature. The number of features is then as large as the vocabulary.

We assume the features are independent of one another. Independence here means statistical independence: the probability of one feature (word) appearing is unrelated to which other words appear next to it. This assumption is exactly what the word "naive" in the Naive Bayes classifier refers to. The other assumption in the Naive Bayes classifier is that every feature is equally important.

Note: there are generally two implementations of the Naive Bayes classifier: one based on the Bernoulli model and one based on the multinomial model. The former is used here. This implementation only records whether a word appears in a document, not how many times it appears, so it effectively treats every word as having equal weight.
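To make the distinction concrete, here is a small illustrative sketch (the function names are ours, not from the project code below): the set-of-words variant used in this article only records presence, while a bag-of-words variant would record counts.

def set_of_words_vec(vocabList, inputSet):
    # Bernoulli-style: record only whether each vocabulary word appears (0 or 1)
    vec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            vec[vocabList.index(word)] = 1
    return vec

def bag_of_words_vec(vocabList, inputSet):
    # Multinomial-style: record how many times each vocabulary word appears
    vec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            vec[vocabList.index(word)] += 1
    return vec

vocab = ['stupid', 'dog', 'my']
doc = ['stupid', 'stupid', 'dog']
print(set_of_words_vec(vocab, doc))   # [1, 1, 0]
print(bag_of_words_vec(vocab, doc))   # [2, 1, 0]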

2.2 Naive Bayes scenario

An important application of machine learning is automatic classification of documents.

Naive Bayes is an extension of the Bayesian classifier described above. It is a common algorithm used for document classification. Below we will conduct some practical projects on Naive Bayes classification.

2.3 Naive Bayes Principle

How naive Bayes works

Extract the entries (words) from all documents and remove duplicates
Get every category of document
Count the number of documents in each category
For each training document:
    For each category:
        If an entry appears in the document --> increase the count for that entry (for loop or matrix addition)
        Increase the count of all entries (the total number of entries in this category)
For each category:
    For each entry:
        Divide the entry count by the total entry count to obtain the conditional probability P(entry | category)
Return the conditional probability that the document belongs to each category, P(category | all entries of the document)

2.4 Naive Bayes Development Process

Collect data: any method can be used.

Prepare data: numeric or Boolean values are required.

Analyze data: with a large number of features, plotting individual features is not very informative; histograms work better.

Train the algorithm: compute the conditional probabilities of the different independent features.

Test the algorithm: compute the error rate.

Use the algorithm: a common application of Naive Bayes is document classification, but a Naive Bayes classifier can be used in any classification setting, not only text.

2.5 Features of the Naive Bayes Algorithm

Advantages: still effective with small amounts of data; can handle multi-class problems.
Disadvantages: sensitive to how the input data is prepared.
Applicable data types: nominal data.

2.6 Case Study of Naive Bayes Project

2.6.1 Project Case 1

Block insulting comments on the Community message board

2.6.1.1 Project Overview

Build a quick filter to block insulting comments on an online community message board. If a message uses negative or insulting language, it is flagged as inappropriate content. There are two classes for this problem: insulting and non-insulting, represented by 1 and 0 respectively.

2.6.1.2 Development Process

Collect data: any method can be used

Prepare data: build word vectors from text

Analyze data: check the entries to make sure parsing was done correctly

Train the algorithm: compute probabilities from the word vectors

Test the algorithm: adjust the classifier based on real-world conditions

Use the algorithm: classify comments on the community message board

Collect data: any method can be used; in this example, the postings are constructed by hand in the code below.

2.6.1.3 Construct the word lists

def loadDataSet():
    """Create the example dataset.
    :return: postingList, a list of tokenized postings; classVec, the class label of each posting"""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec

2.6.1.4 Prepare data: build word vectors from text

def createVocabList(dataSet):
    """Get the set of all words in the dataset.
    :param dataSet: list of tokenized documents
    :return: a list of all words with no duplicates"""
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # the | operator takes the union of two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Check whether each vocabulary word appears in the input document.
    :param vocabList: list of all words in the vocabulary
    :param inputSet: the input document (a list of words)
    :return: a vector such as [0, 1, 0, 1, ...] where 1/0 indicates whether the
             corresponding vocabulary word appears in the input document"""
    # create a vector of the same length as the vocabulary, with all elements set to 0
    returnVec = [0] * len(vocabList)  # [0, 0, ...]
    # for every word in the document, if it is in the vocabulary,
    # set the corresponding element of the output vector to 1
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

2.6.1.5 Analyze data: check the entries to make sure parsing was correct

Check that the functions run correctly and inspect the vocabulary. If you want to examine the word list more closely, you can sort it first.

>>> listOPosts, listClasses = bayes.loadDataSet()
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Next, check that setOfWords2Vec works. For example, which word sits at index 2 of myVocabList? It should be 'help'. That word appears in the first document; now check whether it appears in the fourth document.

>>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
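Before training, every posting has to be converted into a word vector. A short sketch of how the full training matrix might be assembled from the functions above (the variable name trainMat is ours, chosen for illustration):

>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
...
>>> len(trainMat)
6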

2.6.1.6 Train the algorithm: compute probabilities from the word vectors

Now we know whether a word appears in a document, and we know the category of each document. Next, we rewrite the Bayes criterion, replacing x and y with w. Here w is written in bold to indicate a vector, that is, it consists of multiple values; in this example the number of values equals the number of words in the vocabulary:

p(ci | w) = p(w | ci) p(ci) / p(w)

For each class, we compute this value using the formula above and then compare the two probabilities.

First, the probability p(ci) can be computed by dividing the number of documents in class i (insulting or non-insulting) by the total number of documents. Next we compute p(w | ci), and this is where the Naive Bayes assumption comes in. If w is expanded into its individual features, this probability becomes p(w0, w1, w2, ..., wn | ci). Assuming all words are independent of one another, a premise also called the conditional independence assumption (for example, when two people A and B each throw a die, the outcomes are independent of each other, so the probability that A throws a 2 and B throws a 3 is 1/6 * 1/6), we can use p(w0 | ci) p(w1 | ci) p(w2 | ci) ... p(wn | ci) to compute the probability above, which greatly simplifies the calculation.
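A tiny numeric sketch of what the assumption buys us (the per-word probabilities below are made up for illustration):

# Assumed per-word conditional probabilities for class c1 (illustrative values only)
p_w0_given_c1 = 0.05
p_w1_given_c1 = 0.10
p_w2_given_c1 = 0.02

# under conditional independence, p(w0, w1, w2 | c1) is simply the product of the three terms
p_w_given_c1 = p_w0_given_c1 * p_w1_given_c1 * p_w2_given_c1
print(p_w_given_c1)   # 0.0001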

2.6.1.7 Naive Bayes Classifier Training Function

from numpy import zeros

def _trainNB0(trainMatrix, trainCategory):
    """Original (unsmoothed) training function.
    :param trainMatrix: document word matrix, e.g. [[0, 1, 1, ...], [...], ...]
    :param trainCategory: class label of each document, e.g. [0, 1, 0, ...];
                          the list is as long as the word matrix, where 1 means the
                          document is abusive and 0 means it is not
    :return: p0Vect, p1Vect, pAbusive"""
    numTrainDocs = len(trainMatrix)   # number of documents
    numWords = len(trainMatrix[0])    # number of words in the vocabulary
    # probability that a document is abusive: the number of 1s in trainCategory
    # divided by the total number of documents
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # word-count vectors for each class
    p0Num = zeros(numWords)  # [0, 0, 0, ...]
    p1Num = zeros(numWords)  # [0, 0, 0, ...]
    # total number of words seen in each class
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        # is this an abusive document?
        if trainCategory[i] == 1:
            # add its word vector, e.g. [0, 1, 1, ...] + [0, 1, 0, ...] -> [0, 2, 1, ...]
            p1Num += trainMatrix[i]
            # sum the elements of the vector, i.e. count the words in this abusive document
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # class 1: the list [P(F1|C1), P(F2|C1), P(F3|C1), ...],
    # i.e. the probability of each word appearing given class 1
    p1Vect = p1Num / p1Denom
    # class 0: the list [P(F1|C0), P(F2|C0), P(F3|C0), ...],
    # i.e. the probability of each word appearing given class 0
    p0Vect = p0Num / p0Denom
    return p0Vect, p1Vect, pAbusive
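The training function returns the three quantities needed by the classification step in the pseudocode of section 2.3. Below is a minimal sketch of how they might be used to classify a new posting; the function name classifyNB and the direct use of the raw probability vectors are our own choices for illustration (the usual refinement initializes the counts to 1 and works with log probabilities so that zero counts and numerical underflow do not wipe out the product).

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """vec2Classify: the 0/1 word vector of the document to classify.
    Multiply P(word | class) for every word that appears in the document,
    then multiply by the class prior, and return the class with the larger score."""
    p1 = pClass1          # prior of the abusive class
    p0 = 1.0 - pClass1    # prior of the non-abusive class
    for i in range(len(vec2Classify)):
        if vec2Classify[i] == 1:
            p1 *= p1Vec[i]
            p0 *= p0Vec[i]
    return 1 if p1 > p0 else 0

# Possible usage with the functions defined earlier in this article:
# listOPosts, listClasses = loadDataSet()
# myVocabList = createVocabList(listOPosts)
# trainMat = [setOfWords2Vec(myVocabList, doc) for doc in listOPosts]
# p0V, p1V, pAb = _trainNB0(trainMat, listClasses)
# print(classifyNB(setOfWords2Vec(myVocabList, ['stupid', 'garbage']), p0V, p1V, pAb))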

Summary

The above covers the probability-theory-based classification method for Python programming: Naive Bayes. I hope it helps you. If you have any questions, feel free to leave a message.
