Naive Bayes: a probability-based classification method

Source: Internet
Author: User
Tags: natural logarithm

Decision trees and KNN are classification algorithms that give a definite answer: each data instance is assigned unambiguously to one class.

Bayes: cannot always say with complete certainty which class a data instance belongs to; it can only give the probability that the instance belongs to a given class.

* Bayesian probability introduces prior probabilities and logical reasoning to handle uncertain propositions

* The alternative, called frequency probability, draws conclusions from the data alone, without logical reasoning or prior knowledge

Naive Bayes: makes the simplest, most naive assumptions throughout the entire formalization process

Python text-processing capabilities: splitting text into word vectors, then using the word vectors to classify text

Naive Bayes: select the category with the highest corresponding probability

Pros: still effective with little data; can handle multi-class problems

Cons: Sensitive to the way the input data is prepared

Suitable data type: nominal values

How KNN, decision trees, and Bayes each handle a dataset with two classes of data:

KNN: requires a large amount of computation

Decision tree: splits along the x and y axes, but the success rate is not high

Bayes: compares probabilities, i.e. chooses the class with the highest probability

Conditional probabilities:

Sample counts: suppose 1 feature needs n samples; then 10 features need n**10 samples. But if the features are completely independent (statistically, the likelihood of one word appearing is unrelated to its proximity to other words), the number of samples needed drops to n*10.
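The "choose the highest probability" rule above can be sketched with Bayes' rule. The numbers here are made up for illustration; only the comparison logic matters:

```python
# Toy sketch of Bayes' rule for classification (all numbers are invented):
# p(c|x) = p(x|c) * p(c) / p(x); since p(x) is the same for every class,
# it is enough to compare p(x|c) * p(c) and pick the largest.

priors = {"spam": 0.4, "ham": 0.6}          # p(c), assumed class priors
likelihoods = {"spam": 0.05, "ham": 0.001}  # p(x|c) for one observed feature, assumed

scores = {c: likelihoods[c] * priors[c] for c in priors}
best = max(scores, key=scores.get)
print(best)  # "spam": 0.05 * 0.4 = 0.02 beats 0.001 * 0.6 = 0.0006
```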

Naive Bayes:

Assumption one: the features are statistically independent of one another; then with 10 features the required sample count scales as n*10 rather than n**10

Assumption two: every feature is equally important; with this assumption, 10 features can be handled even with fewer than n*10 samples

Naive Bayes classifiers come in two kinds:

Bernoulli model: the implementation does not count how many times a word occurs in a document, only whether it occurs; this is equivalent to assuming all words carry equal weight

Multinomial model: takes into account how many times a word occurs in the text

Example: text categorization with Python:

Note: each text fragment is represented as a term vector, where a value of 1 means the term appears in the document and 0 means it does not

Bayesian implementation process:

1. Collect data: You can use any method, such as RSS feeds

2. Prepare the data: numeric or Boolean values are required

* Whether each word appears or not -- the word-set model

* Each word can appear multiple times -- the bag-of-words model; when the same word appears again, add 1 to its previous count
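The two vector models above can be sketched side by side. The vocabulary and document below are invented for illustration:

```python
# Word-set model vs. bag-of-words model (vocabulary and document are toy data)
vocab = ["my", "dog", "has", "flea", "problems", "help"]

def set_of_words_vec(vocab, words):
    """Word-set model: 1 if the word appears at all, else 0."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] = 1
    return vec

def bag_of_words_vec(vocab, words):
    """Bag-of-words model: add 1 each time the word appears again."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab.index(w)] += 1
    return vec

doc = ["my", "dog", "my", "dog", "help"]
print(set_of_words_vec(vocab, doc))  # [1, 1, 0, 0, 0, 1]
print(bag_of_words_vec(vocab, doc))  # [2, 2, 0, 0, 0, 1]
```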

3. Analyze the data: with many features, plotting individual features is not very useful; a histogram works better

Convert a sentence to a vector by viewing the text as a vector of words or terms:

* Create an empty set, then add the word set returned by each document to it

* Take the union of two sets with the bitwise-or operator

* Create a vector of all zeros, traverse the words in the document, and if a word appears in the vocabulary, set the corresponding element of the output vector to 1

* Return a matrix of document vectors
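The vocabulary-building steps above could look roughly like this (function names and document contents are invented for illustration):

```python
# Sketch of the steps above: build a vocabulary, then turn each document
# into a 0/1 vector; the resulting list of vectors is the "matrix".

def create_vocab_list(docs):
    """Union of all words across documents, using the set | operator."""
    vocab = set()
    for doc in docs:
        vocab |= set(doc)      # bitwise-or on sets is set union
    return sorted(vocab)       # sorted for a stable column order

def words_to_vec(vocab, doc):
    """0/1 vector: 1 where the vocabulary word appears in the document."""
    return [1 if w in doc else 0 for w in vocab]

docs = [["my", "dog", "has", "fleas"], ["stop", "posting", "stupid", "ads"]]
vocab = create_vocab_list(docs)
matrix = [words_to_vec(vocab, d) for d in docs]  # one row per document
```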

4. Train the algorithm: compute the conditional probabilities of the independent features

* Compute probabilities from the word vectors -- pseudocode:

Count the number of documents in each category
For each training document:
    For each category:
        If a term appears in the document, increment that term's count
        Increment the total term count
For each category:
    For each term:
        Divide the term's count by the total term count to get its conditional probability
Return the conditional probabilities for each category

Specific steps:

* Initialize the probability counts

* Add the word vectors together

* Divide each element by the total count
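The training pseudocode above could be sketched in Python roughly as follows. This is a plain re-implementation for two classes (0 and 1), with assumed names loosely following the book's trainNB0:

```python
import numpy as np

# Hedged sketch of the training pseudocode: count words per class,
# then divide by the total word count in that class.
def train_nb0(train_matrix, labels):
    num_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_class1 = sum(labels) / float(num_docs)   # fraction of documents in class 1
    p0_num = np.zeros(num_words)
    p1_num = np.zeros(num_words)
    p0_denom = 0.0
    p1_denom = 0.0
    for vec, label in zip(train_matrix, labels):
        if label == 1:
            p1_num += vec                      # vector addition of word counts
            p1_denom += sum(vec)
        else:
            p0_num += vec
            p0_denom += sum(vec)
    # divide each element by the class's total count
    return p0_num / p0_denom, p1_num / p1_denom, p_class1
```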

5. Test algorithm: Calculate error rate

Modify the classifier according to the actual situation

* To decide a document's category we multiply several probabilities together; if any one of them is 0, the whole product is 0. To reduce this effect, initialize all word counts to 1 and the denominators to 2.

* Underflow: multiplying many very small factors gives a product so small it rounds to 0, and the answer comes out wrong. Solution: take the natural logarithm, replacing the product of probabilities with a sum of log-probabilities.
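Both fixes can be sketched together: counts start at 1 with denominators at 2, and classification sums logarithms instead of multiplying probabilities. Names here are assumptions, loosely following the book's trainNB0/classifyNB:

```python
import numpy as np

# Sketch of the two fixes above: Laplace-style initialization (counts start
# at 1, denominators at 2) and natural logs to avoid underflow.
def train_nb(train_matrix, labels):
    num_words = len(train_matrix[0])
    p_class1 = sum(labels) / float(len(train_matrix))
    p0_num = np.ones(num_words)      # initialize counts to 1, not 0
    p1_num = np.ones(num_words)
    p0_denom = 2.0                   # denominators start at 2
    p1_denom = 2.0
    for vec, label in zip(train_matrix, labels):
        if label == 1:
            p1_num += vec
            p1_denom += sum(vec)
        else:
            p0_num += vec
            p0_denom += sum(vec)
    return np.log(p0_num / p0_denom), np.log(p1_num / p1_denom), p_class1

def classify_nb(vec, log_p0, log_p1, p_class1):
    # Sums of logs replace products of probabilities, so tiny values
    # never underflow to 0.
    p1 = np.dot(vec, log_p1) + np.log(p_class1)
    p0 = np.dot(vec, log_p0) + np.log(1.0 - p_class1)
    return 1 if p1 > p0 else 0
```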

6. Use the algorithm: a common application of Bayes is text classification, but a Bayesian classifier can be used in any classification setting, not necessarily text.

Example: Applying a Bayesian classifier to filter junk e-mail

Note: a list of strings is taken as input and turned into word vectors

Process:

1. Collect data: Provide text file

2. Prepare the data: Parse the file into a vector of terms

* Split the text into tokens (strip email headers, URLs)

3. Analyze the data: inspect the tokens to ensure correct parsing

4. Train the algorithm: use the trainNB0() function built earlier

5. Test the algorithm: use classifyNB(), and build a new test function that computes the document error rate

* File parsing and complete spam test function

* Import and parse text files

* Randomly build training set

* Classification of test sets
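The parsing and random-split steps above could be sketched like this. The classifier training itself is omitted, and the tokenizing rules (split on non-word characters, drop tokens of two letters or fewer, lowercase everything) are the usual convention for this example, not a definitive implementation:

```python
import random
import re

# Sketch of the spam-test skeleton: parse a file into tokens,
# then randomly hold out part of the documents as a test set.
def text_parse(big_string):
    """Split on non-alphanumeric characters, drop short tokens, lowercase."""
    tokens = re.split(r"\W+", big_string)
    return [t.lower() for t in tokens if len(t) > 2]

def random_split(docs, test_fraction=0.1):
    """Randomly hold out a fraction of the documents for testing."""
    indices = list(range(len(docs)))
    random.shuffle(indices)
    cut = max(1, int(len(docs) * test_fraction))
    test, train = indices[:cut], indices[cut:]
    return [docs[i] for i in train], [docs[i] for i in test]

words = text_parse("This book is the best book on M.L. I have ever laid eyes upon!!")
```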

Example: Using naive Bayesian classifier to derive regional tendencies from personal ads

Example: Using Bayesian to discover geographical-related terms

Objective: the goal is not classification for its own sake, but to discover content related to a particular city by examining the words and their conditional probability values.

1. Collect the data: gather content from RSS feeds; this requires building an interface to an RSS source

* Count word-occurrence frequencies

* Access one RSS feed at a time

* Remove the words with the highest occurrence counts
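The last step above (dropping the most frequent words, which act like stop words) could be sketched as follows. Real code would first pull entries from an RSS feed, e.g. with the feedparser library; here the token list is hard-coded as an assumption:

```python
from collections import Counter

# Sketch of "remove the words with the highest number of occurrences":
# the n most frequent tokens are dropped from the vocabulary, since
# very common words carry little regional information.
def remove_top_words(tokens, vocab, n=30):
    most_common = {w for w, _ in Counter(tokens).most_common(n)}
    return [w for w in vocab if w not in most_common]

tokens = ["the", "the", "the", "city", "city", "park"]  # toy token stream
vocab = ["the", "city", "park"]
print(remove_top_words(tokens, vocab, n=1))  # ['city', 'park'] -- "the" is dropped
```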

2. Prepare the data: parse the text into term vectors

3. Analyze the data: inspect the terms to ensure correct parsing

* Display the top-ranked terms

4. Train the algorithm: use the trainNB0() function built earlier

5. Test the algorithm: observe the error rate; to make the classifier usable, you can modify the tokenizer to reduce the error rate and improve the classification results

6. Use the algorithm: build a complete program encapsulating everything; given two RSS feeds, the program displays the most common words for each, and misclassified documents are printed to the screen.

  
