Research on Content-Based Spam Filtering



Author: Met3or
Source: http://www.istroop.org

Email has become one of the most important means of communication in daily life, but the spam problem is growing steadily worse: on average, users now receive more spam than legitimate mail each day. Commonly used filtering techniques include whitelists and blacklists, rule-based filtering, and keyword-based content scanning. Another approach starts from the text of the message and applies text classification and information filtering algorithms to learn a spam classifier from a training set of mail. Text classification methods commonly used for spam filtering include naive Bayes, k-NN, decision trees, and boosting. Naive Bayes is cheap to compute, but it is difficult to push its recall and accuracy higher, and it is not well suited to incremental feedback learning. Some of the other methods give better results than naive Bayes, but they are computationally expensive.

Starting from an analysis of the naive Bayes method, this paper looks for a spam filter that is fast, simple to compute, performs well, and supports feedback learning conveniently. Winnow is an error-driven linear classification algorithm for online learning. Because it learns online, it is well suited to incremental feedback in which the classifier learns from one example at a time. The author applies the Winnow algorithm to spam filtering. Experiments on common mail corpora show that Winnow performs better than naive Bayes and comes close to the boosting method.
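To make the "error-driven" and "one example at a time" properties concrete, here is a minimal sketch of the basic Winnow update rule over binary token features. It is an illustration under stated assumptions, not the thesis implementation: the abstract does not say which Winnow variant, promotion factor, or threshold the author used, so the class name `Winnow` and the parameters `alpha` and `threshold` are hypothetical choices.

```python
from collections import defaultdict


class Winnow:
    """Basic Winnow over binary (present/absent) token features.

    Assumed parameters (not taken from the source):
      alpha     -- multiplicative promotion/demotion factor, > 1
      threshold -- decision threshold on the weighted sum
    """

    def __init__(self, alpha=2.0, threshold=4.0):
        self.alpha = alpha
        self.threshold = threshold
        # Every feature starts with weight 1.0 the first time it is seen.
        self.weights = defaultdict(lambda: 1.0)

    def score(self, tokens):
        # Linear score: sum of the weights of the distinct tokens present.
        return sum(self.weights[t] for t in set(tokens))

    def predict(self, tokens):
        # True means "spam".
        return self.score(tokens) > self.threshold

    def learn(self, tokens, is_spam):
        """Process one labeled message; return True if the weights changed."""
        if self.predict(tokens) == is_spam:
            return False  # error-driven: a correct prediction changes nothing
        # Missed spam: promote active features. False alarm: demote them.
        factor = self.alpha if is_spam else 1.0 / self.alpha
        for t in set(tokens):
            self.weights[t] *= factor
        return True
```

Because `learn` only touches the weights of tokens in the misclassified message, a user correction can be folded in immediately without retraining on the whole corpus.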

Specifically, the main work of this paper is as follows:
1) The research status of spam filtering is summarized, including the definition of spam, the harm it causes, and common filtering technologies.
2) The application of text classification algorithms to mail filtering is introduced, and common feature selection methods, classification algorithms, and public mail corpora are summarized.
3) The naive Bayes algorithm for email filtering is analyzed in detail. Its performance is tested on the PU1 mail corpus, and the effects of the number of features, the classification threshold, and the level of corpus preprocessing on the results are compared.
4) The Winnow linear classification algorithm is introduced into mail filtering. The Winnow classifier is tested on the PU1 and Ling-Spam corpora and obtains better results.
5) Feedback learning for spam filtering is analyzed for both the naive Bayes algorithm and the Winnow classifier.
6) A basic framework for a client-side mail filtering system is designed.

Key words: spam filtering; text classification; naive Bayes; Winnow; feedback learning; information filtering
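Item 3 above refers to a classification threshold for naive Bayes. As a rough sketch of what such a threshold can look like, the code below implements a small multinomial naive Bayes filter with add-one smoothing and flags a message as spam only when the posterior odds of spam exceed a factor `lam`. The abstract does not define the threshold, so the odds-ratio form, the class name `NaiveBayesFilter`, and the parameter `lam` are all assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self):
        # Token and message counts, keyed by label (True = spam).
        self.token_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        self.doc_counts[is_spam] += 1
        self.token_counts[is_spam].update(tokens)

    def log_odds(self, tokens):
        """Return log P(spam | tokens) - log P(legitimate | tokens)."""
        vocab = set(self.token_counts[True]) | set(self.token_counts[False])
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        result = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            total_tokens = sum(self.token_counts[label].values())
            # Smoothed class prior.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            # The extra +1 in the denominator reserves mass for unseen tokens.
            denom = total_tokens + len(vocab) + 1
            for t in tokens:
                log_p += math.log((self.token_counts[label][t] + 1) / denom)
            result += sign * log_p
        return result

    def is_spam(self, tokens, lam=1.0):
        # Assumed threshold rule: spam only if the odds of spam exceed lam.
        return self.log_odds(tokens) > math.log(lam)
```

Under this assumed rule, raising `lam` makes the filter more conservative: fewer legitimate messages are blocked, at the cost of letting more spam through.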

Http://www.nosounds.com/meteor/01.pdf
