Naive Bayes algorithm for Data Mining

Source: Internet
Author: User

Recently, I have been reading data mining material to understand and study the classification techniques used in the data mining process.

1. Data Mining Overview

Data mining means extracting knowledge from data: a large body of collected data is organized and analyzed in depth, and the results both describe what has happened in the past and help predict future trends. Typical research areas in data mining currently include association rules, classification, clustering, prediction, and web mining. Classification mining extracts relevant features from the data, builds a corresponding model or function, and assigns each object in the data to a specific category. For example, it can decide whether an email is spam, whether a piece of data is attack data, or whether a sample is a malicious program. Classification mining draws on decision trees, statistics, Bayesian networks, neural networks, and other classification techniques.

2. Naive Bayes Algorithm

Bayesian classification is a statistical classification method; it relies on the probability and statistics taught in university courses. Naive Bayes is a simple probabilistic classifier based on Bayes' theorem with a strong (naive) independence assumption between features. The technique discussed here is therefore only a small part of data mining. The basic idea is as follows:

Requirement analysis -> Feature extraction -> Sample training -> Feature detection -> Posterior probability calculation -> Decision
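
For reference, the posterior probability step in this pipeline rests on Bayes' theorem combined with the independence assumption. The symbols below are illustrative and not from the original text: C stands for a class (for example spam) and F_1, ..., F_n for the observed features.

```latex
P(C \mid F_1, \dots, F_n)
  = \frac{P(C)\, P(F_1, \dots, F_n \mid C)}{P(F_1, \dots, F_n)},
\qquad
P(F_1, \dots, F_n \mid C) = \prod_{i=1}^{n} P(F_i \mid C)
```

The denominator is the same for every class, so comparing classes only requires the numerator P(C) * P(F_1 | C) * ... * P(F_n | C).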

The first step is requirement analysis. We need to be clear about our purpose: what can be learned from analyzing this data, and what result do we need? The answer is a classification model. For example, suppose we need to analyze and process a large number of emails. In the end we want a model that automatically decides whether an email is spam or a normal email. There are therefore only two categories, spam and normal email, and that is the result we need.

The second step is feature extraction, which requires a detailed analysis of the data to find what distinguishes the categories. For example, we need to study the differences between normal email and spam: which characteristics spam has, and which characteristics only normal email has. Spam typically contains distinctive features such as images, links, unusual mail headers, multiple recipients, and HTML tags, while normal email generally does not.

The third step is training on samples. This step usually means collecting a large number of samples and counting, for each sample, the features chosen in the previous step, so as to build a detailed table of feature statistics. For example, 1000 emails are drawn at random from the email server, and the content of each email is then analyzed for the features mentioned above.
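
The counting described above can be sketched as follows. This is a minimal illustration, assuming each training email has already been labeled and reduced to a mapping of feature name to True/False; the function name `count_features` and the four-email training set are made up for the example.

```python
from collections import Counter


def count_features(samples):
    """Build the feature statistics table from labeled samples.

    samples: iterable of (label, features), where features maps a
    feature name to True/False. Returns (class_counts, feature_counts),
    with feature_counts[label][name] = number of emails of that class
    in which the feature was present.
    """
    class_counts = Counter()
    feature_counts = {}
    for label, features in samples:
        class_counts[label] += 1
        per_class = feature_counts.setdefault(label, Counter())
        for name, present in features.items():
            if present:
                per_class[name] += 1
    return class_counts, feature_counts


# Tiny made-up training set; a real one would contain many more emails.
training = [
    ("spam", {"image": True, "link": True}),
    ("spam", {"image": False, "link": True}),
    ("normal", {"image": False, "link": True}),
    ("normal", {"image": False, "link": False}),
]
class_counts, feature_counts = count_features(training)
```

Keeping raw counts rather than probabilities at this stage makes it easy to add more samples later and recompute the probabilities from the updated table.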

The fourth step is feature detection. The previous steps give us the basis of a naive Bayes model, and we can now write code that detects the features automatically. For example, text feature matching can be implemented in Python or C++, and other text-matching algorithms can also be used here.
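
A minimal sketch of such feature detection in Python, using regular expressions for the matching. The patterns, the feature names, and the function `detect_features` are hypothetical choices for illustration; a real filter would tune them on its own data.

```python
import re

# Hypothetical patterns, one per feature; these are assumptions, not a standard.
FEATURE_PATTERNS = {
    "image": re.compile(r"<img\b", re.IGNORECASE),
    "link": re.compile(r"https?://", re.IGNORECASE),
    "html": re.compile(r"<(html|body|div|table)\b", re.IGNORECASE),
}


def detect_features(raw_email: str) -> dict:
    """Return a mapping of feature name -> True/False for one raw email."""
    features = {
        name: bool(pattern.search(raw_email))
        for name, pattern in FEATURE_PATTERNS.items()
    }
    # "Multiple recipients": more than one address on the To: header line.
    to_line = re.search(r"^To:(.*)$", raw_email, re.IGNORECASE | re.MULTILINE)
    recipients = to_line.group(1).split(",") if to_line else []
    features["multiple_recipients"] = len([r for r in recipients if r.strip()]) > 1
    return features
```

For example, an email whose body contains an `<img>` tag and an `http://` address, and whose To: line lists two addresses, yields True for `image`, `link`, and `multiple_recipients`.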

The next step is to calculate the posterior probability. With naive Bayes, we first compute the probability of each feature given a known class, that is, the class-conditional probability (the likelihood). The prior probabilities P(spam) and P(normal email) are simply the fraction of each class among the training samples. For example, taking images, links, and multiple recipients as the text features, we compute P(image | spam), P(link | spam), and so on under the assumption that the email is spam. We then do the same under the assumption that the email is normal: P(image | normal email), P(link | normal email), and so on.
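
As a rough sketch, these probabilities can be estimated from per-class counts. The function name `estimate_probabilities` and the counts in the example are made up. The `alpha` parameter adds Laplace (add-one) smoothing, which is not part of the original description; it is included here as an assumption so that a feature never seen in a class does not get probability zero.

```python
def estimate_probabilities(class_counts, feature_counts, feature_names, alpha=1.0):
    """Estimate priors P(class) and likelihoods P(feature | class).

    class_counts[c]      = number of training emails of class c
    feature_counts[c][f] = number of class-c emails containing feature f
    alpha                = smoothing constant (an assumption; 0 disables it)
    """
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {}
    for c, n in class_counts.items():
        counts = feature_counts.get(c, {})
        # Each feature is binary (present/absent), hence the 2 * alpha term.
        likelihoods[c] = {
            f: (counts.get(f, 0) + alpha) / (n + 2 * alpha)
            for f in feature_names
        }
    return priors, likelihoods


# Made-up counts for 1000 training emails.
class_counts = {"spam": 400, "normal": 600}
feature_counts = {
    "spam": {"image": 300, "link": 350},
    "normal": {"image": 60, "link": 120},
}
priors, likelihoods = estimate_probabilities(
    class_counts, feature_counts, ["image", "link"]
)
```

With these counts the prior P(spam) is 400/1000, and the smoothed P(image | spam) is 301/402, close to the raw ratio 300/400.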

Finally, we decide the class of a sample by comparing, for each class, the product of the prior probability and the conditional probabilities. For example, compute P(spam) * P(image | spam) * P(link | spam) * ... and P(normal email) * P(image | normal email) * P(link | normal email) * ...; the sample is assigned to whichever class gives the larger value.
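
The comparison can be sketched in Python as below. Two choices here are assumptions rather than part of the description above: the products are computed as sums of logarithms to avoid numeric underflow when there are many features, and an absent feature contributes 1 - P(feature | class) (the Bernoulli form of naive Bayes). The probabilities in the example are made up.

```python
import math


def classify(features, priors, likelihoods):
    """Return (best_class, scores) for one sample.

    features: feature name -> True/False for the sample
    priors: class -> P(class)
    likelihoods: class -> {feature name -> P(feature | class)}
    All probabilities must lie strictly between 0 and 1.
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for name, present in features.items():
            p = likelihoods[c][name]
            # Present features contribute p, absent ones contribute 1 - p.
            score += math.log(p if present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Made-up probabilities for illustration.
priors = {"spam": 0.4, "normal": 0.6}
likelihoods = {
    "spam": {"image": 0.75, "link": 0.9},
    "normal": {"image": 0.1, "link": 0.2},
}
label, scores = classify({"image": True, "link": True}, priors, likelihoods)
```

For an email with both an image and a link, the spam product is 0.4 * 0.75 * 0.9 = 0.27 and the normal product is 0.6 * 0.1 * 0.2 = 0.012, so the sample is assigned to spam.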

System performance is generally evaluated by accuracy, precision, and recall.
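
These three indicators can be computed from the true and predicted labels of a test set. The sketch below treats spam as the positive class; the function name `evaluate` and the five-email test set are made up for illustration.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # Fraction of all samples classified correctly.
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Of the emails flagged as spam, the fraction that really are spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam emails, the fraction that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = ["spam", "spam", "spam", "normal", "normal"]
y_pred = ["spam", "spam", "normal", "spam", "normal"]
metrics = evaluate(y_true, y_pred)
```

Here three of five predictions are correct (accuracy 0.6), two of the three emails flagged as spam really are spam (precision 2/3), and two of the three real spam emails were caught (recall 2/3).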

3. Summary

In general, the whole process is fairly involved. Feature selection in particular requires careful, comprehensive consideration before the classifier becomes effective, and the choice of training samples also affects the final result. There is also a simple example on the Internet: the Python implementation of naive Bayes in reference (3) may be useful. A few other good articles are listed in the references as well.

4. References

(1) Fan Ming, Fan Hongjian, Introduction to Data Mining

(2) JIAO Li Cheng, Intelligent Data Mining and Knowledge Discovery

(3) Python Implementation of Naive Bayes

(4) Text Classification Algorithm Based on Naive Bayes classifier (I)

(5) Bayesian inference and its Internet applications (I): Theorem Introduction

(6) Bayesian inference and its Internet applications (II): filtering spam
