E-mail filtering system based on the naive Bayesian classification algorithm

Last Update:2017-09-20 Source: Internet

Author: User

Reposted from Mu Chen

Contents

Preface
Prepare data: Slice text
Training and testing
Summary

Preface

The most widespread and classic application of the naive Bayes algorithm is undoubtedly document classification, and more specifically e-mail filtering.

This article explains in detail how to implement an e-mail filtering system based on the naive Bayes classification algorithm.

It focuses on the implementation of the project. For the details of the algorithm itself, please refer to the previous article: Naive Bayes classification algorithm principle analysis and code implementation.

Prepare data: Slice text

Once the text file has been read, two things need to be done first:

1. Convert the text file into a list of words.

2. Convert the result of the previous step into a word vector.

For step 1, the text is split on every character that is not a letter or a digit.

Doing this with the string split function alone is cumbersome. Regular expressions are Python's real tool for working with text, and they make the task easy.

The following function implements step 1:

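    import re

    #=============================================
    #    Input:
    #        bigString: document string to be converted
    #    Output:
    #        the document converted to list format
    #=============================================
    def textParse(bigString):
        # Split on runs of non-word characters. r'\W+' is used rather than
        # r'\W*' because a pattern that can match the empty string makes
        # re.split() break the text into single characters on Python 3.7+.
        listOfTokens = re.split(r'\W+', bigString)
        # Keep only tokens longer than two characters, in lowercase.
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]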

Note that because the result of the split may contain empty strings, a filter is added to the return statement. It keeps only tokens longer than two characters, so very short words are dropped as well.
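To make the effect of the filter concrete, here is a small self-contained sketch. The helper name `text_parse` and the sample string are invented for illustration and are not part of the original code. It splits on runs of non-word characters (`\W+`, which treats the underscore as a word character) and then applies the same length filter.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 chars."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

# Hypothetical sample message, invented for this illustration.
sample = "Hi Peter, this is a TEST e-mail: call 555-1234 now!"
tokens = text_parse(sample)
print(tokens)
# ['peter', 'this', 'test', 'mail', 'call', '555', '1234', 'now']
```

The short tokens "Hi", "is", "a" and "e", as well as the empty string produced after the final "!", are all removed by the length check.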

Regular expression syntax is beyond the scope of this article; interested readers should consult the relevant documentation.

For step 2, the previous article (Naive Bayes classification algorithm principle analysis and code implementation) already contains an example implementation, so it is not repeated here.
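For readers without the previous article at hand, step 2 might look roughly like the sketch below. It is not the earlier article's code: the names `create_vocab_list` and `bag_of_words` are hypothetical, and sorting the vocabulary is an assumption made only so that the output is deterministic.

```python
def create_vocab_list(documents):
    """Return a sorted list of every distinct token across all documents."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

def bag_of_words(vocab_list, tokens):
    """Count how often each vocabulary word occurs in tokens (bag-of-words model)."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for tok in tokens:
        if tok in index:          # words outside the vocabulary are ignored
            vec[index[tok]] += 1
    return vec

# Hypothetical toy documents, already tokenized.
docs = [["cheap", "pills", "cheap"], ["meeting", "notes"]]
vocab = create_vocab_list(docs)
vec = bag_of_words(vocab, ["cheap", "cheap", "notes", "unknown"])
print(vocab)  # ['cheap', 'meeting', 'notes', 'pills']
print(vec)    # [2, 0, 1, 0]
```

The dictionary lookup avoids calling `list.index` for every token, which matters once the vocabulary grows to thousands of words.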

Training and testing

1. Read the messages from the two category directories (spam and ham) under the path specified in the code, and convert every message to word-vector format.

2. Divide this data into a training set and a test set.

3. Call the naive Bayes training function on the training set to obtain each of the probability terms in Bayes' formula.

4. Compute the word vector of each document to be classified and use Bayes' formula to calculate the probability that the word vector belongs to each class. The class with the highest probability is the classification result.

5. Compare the classification results from the previous step with the actual classes and print the test results.
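Steps 3 and 4 can be summarized by the decision rule below. This is a sketch under assumptions rather than the earlier article's exact formulation: $\mathbf{w} = (w_1, \dots, w_n)$ is the bag-of-words count vector, $t_i$ is the $i$-th vocabulary word, $c = 1$ denotes spam and $c = 0$ denotes ham (matching the labels used in the code), and logarithms are taken to avoid numerical underflow.

```latex
\begin{align}
P(c \mid \mathbf{w}) &= \frac{P(\mathbf{w} \mid c)\, P(c)}{P(\mathbf{w})},
\qquad
P(\mathbf{w} \mid c) \propto \prod_{i=1}^{n} P(t_i \mid c)^{\,w_i} \\
\hat{c} &= \arg\max_{c \in \{0,1\}} \Big[ \log P(c) + \sum_{i=1}^{n} w_i \log P(t_i \mid c) \Big]
\end{align}
```

The denominator $P(\mathbf{w})$ is the same for both classes, so it can be dropped when only the larger of the two probabilities is needed.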

The following code is used for training and testing:

    import random

    import numpy

    # createVocabList, trainNB0 and classifyNB are not defined in this
    # listing (see the previous article).

    #=============================================
    #    Input:
    #        vocabList: vocabulary list
    #        inputSet:  document (list format) to be converted
    #    Output:
    #        returnVec: converted word vector (bag-of-words model)
    #=============================================
    def bagOfWords2VecMN(vocabList, inputSet):
        'document (list format) -> word vector (bag-of-words model)'

        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1

        return returnVec

    #=============================================
    #    Input:
    #        the two categories of messages under the path given in the code
    #    Output:
    #        None (prints the results of the Bayesian classification test)
    #=============================================
    def spamTest():
        'Test Bayesian classification and print results'

        docList = []      # tokenized documents
        classList = []    # document categories (1 = spam, 0 = ham)
        fullText = []     # all tokens from all documents

        # Take 25 messages from each category and collect the documents,
        # their categories and all of their tokens.
        for i in range(1, 26):
            with open('/home/fangmeng/email/spam/%d.txt' % i) as f:
                wordList = textParse(f.read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            with open('/home/fangmeng/email/ham/%d.txt' % i) as f:
                wordList = textParse(f.read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)

        vocabList = createVocabList(docList)
        # Indices of the training documents; a list, so items can be deleted.
        trainingSet = list(range(50))
        testSet = []

        # Of the 50 messages, 10 are picked at random for testing and
        # removed from the training set.
        for i in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del trainingSet[randIndex]

        # Build the training word-vector matrix and its category list.
        trainMat = []
        trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])

        # Train, then classify the messages in the test set and print the results.
        p0V, p1V, pSpam = trainNB0(numpy.array(trainMat), numpy.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(numpy.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("Misclassified document:\n", docList[docIndex])
        print("Error rate:\n", float(errorCount) / len(testSet))
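The training and classification functions called above come from the previous article and are not reproduced here. As a rough idea of what such functions do, here is a pure-Python sketch. The names `train_nb` and `classify_nb`, the add-one (Laplace) smoothing and the toy data are all assumptions made for illustration, not the original implementation.

```python
import math

def train_nb(train_matrix, train_classes):
    """Estimate log P(word | class) for both classes and the prior P(spam).

    Add-one smoothing keeps unseen words from producing log(0).
    Assumes both classes occur in train_classes.
    """
    num_words = len(train_matrix[0])
    counts = {0: [1.0] * num_words, 1: [1.0] * num_words}  # smoothed word counts
    totals = {0: float(num_words), 1: float(num_words)}     # smoothed denominators
    for vec, label in zip(train_matrix, train_classes):
        for i, n in enumerate(vec):
            counts[label][i] += n
        totals[label] += sum(vec)
    p0_vec = [math.log(c / totals[0]) for c in counts[0]]
    p1_vec = [math.log(c / totals[1]) for c in counts[1]]
    p_spam = sum(train_classes) / float(len(train_classes))
    return p0_vec, p1_vec, p_spam

def classify_nb(word_vec, p0_vec, p1_vec, p_spam):
    """Return 1 (spam) or 0 (ham), whichever has the larger log-posterior score."""
    score1 = sum(n * p for n, p in zip(word_vec, p1_vec)) + math.log(p_spam)
    score0 = sum(n * p for n, p in zip(word_vec, p0_vec)) + math.log(1.0 - p_spam)
    return 1 if score1 > score0 else 0

# Toy data over the hypothetical vocabulary ['cheap', 'meeting', 'notes', 'pills'].
train_matrix = [[2, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 2], [0, 2, 1, 0]]
train_classes = [1, 0, 1, 0]
p0_vec, p1_vec, p_spam = train_nb(train_matrix, train_classes)
spam_guess = classify_nb([1, 0, 0, 1], p0_vec, p1_vec, p_spam)
ham_guess = classify_nb([0, 1, 1, 0], p0_vec, p1_vec, p_spam)
print(spam_guess, ham_guess)  # 1 0
```

Summing log-probabilities instead of multiplying raw probabilities avoids underflow when a message contains many words.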

Because the test set is randomly selected, the printed result may differ on each execution.
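The run-to-run variation comes from the random hold-out split. The sketch below shows one way to make a run repeatable by fixing a seed. The helper `split_holdout` and its `seed` parameter are hypothetical, and `random.sample` is swapped in for the pick-an-index-and-delete loop used in the listing.

```python
import random

def split_holdout(num_docs, num_test, seed=None):
    """Return (training_indices, test_indices), chosen at random without overlap."""
    rng = random.Random(seed)  # a private generator, so the global state is untouched
    test_idx = sorted(rng.sample(range(num_docs), num_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(num_docs) if i not in test_set]
    return train_idx, test_idx

# 50 messages in total, 10 held out for testing, as in the article.
train_idx, test_idx = split_holdout(50, 10, seed=42)
print(len(train_idx), len(test_idx))  # 40 10
```

Calling the helper again with the same seed returns the same split, while omitting the seed gives a different split on each run.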

Summary

1. Document parsing accounts for a large share of the work in a concrete document classification project. Regular expressions make this parsing straightforward.

2. When implementing such a project, Python's built-in tools (such as regular expressions) and third-party libraries (such as numpy) can greatly improve development efficiency.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
