E-mail filtering system based on the naive Bayes classification algorithm

Source: Internet
Author: User

Reposted from Mu Chen

Contents

    • Preface
    • Prepare data: Slice text
    • Training and testing
    • Summary
Preface

The most widespread and classic application of the naive Bayes algorithm is undoubtedly document classification, and in particular e-mail filtering.

This article explains in detail how to implement a mail filtering system based on the naive Bayes classification algorithm.

It focuses on the implementation of the project. For details of the algorithm itself, please refer to a previous article: "Naive Bayesian classification algorithm principle analysis and code implementation".

Prepare data: Slice text

Once the text file has been obtained, two things need to be done first:

1. Convert the text file to a list of words

2. Convert the result of the previous step into a word vector

For 1, the text is split on any character that is not a letter or a digit.

Doing this with only the string split function is cumbersome. Regular expressions are the natural tool for text processing in Python, and they make this split easy.

The following function implements 1:

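    #=============================================
    #    Input:
    #        bigString:    document string to be converted
    #    Output:
    #        the document converted to list format
    #=============================================
    import re

    def textParse(bigString):
        # Split on runs of non-alphanumeric characters. \W+ is used rather than
        # \W*: on Python 3.7+ a pattern that can match the empty string makes
        # re.split cut between every character.
        listOfTokens = re.split(r'\W+', bigString)
        # Drop empty strings and tokens of one or two characters
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]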

Note that the split can produce empty strings, so the return statement adds a filter: only tokens longer than two characters are kept.
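To make the effect of the split and the filter concrete, here is a small sketch on a made-up sentence. The name tokenize and the sample string are hypothetical and only serve the illustration; the logic mirrors the function above, assuming a split on runs of non-word characters.

```python
import re

def tokenize(big_string):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 characters."""
    raw_tokens = re.split(r'\W+', big_string)
    # raw_tokens can contain '' (e.g. when the text ends with punctuation)
    # and very short fragments such as 'M' or 'at'; the filter drops both.
    return [tok.lower() for tok in raw_tokens if len(tok) > 2]

sample = "Hi Peter, the M.L. book is available at http://example.com today!"
print(tokenize(sample))
# ['peter', 'the', 'book', 'available', 'http', 'example', 'com', 'today']
```

Short words such as "Hi", "is" and "at" disappear along with the empty string, which also removes many uninformative tokens from URLs and abbreviations.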

Regular expression usage is beyond the scope of this article; interested readers should consult the relevant documentation.

For 2, the previous article, "Naive Bayesian classification algorithm principle analysis and code implementation", already includes an example implementation, so it is not repeated here.
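For readers without the previous article at hand, the conversion in 2 can be sketched roughly as follows. This is not the previous article's code: the function names, the sorted vocabulary order and the toy documents are all assumptions made for illustration.

```python
def create_vocab_list(documents):
    """Return the unique words across all documents (sorted here for a stable order)."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

def bag_of_words_vector(vocab_list, tokens):
    """Count how often each vocabulary word occurs in tokens (bag-of-words model)."""
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for tok in tokens:
        if tok in index:          # words outside the vocabulary are ignored
            vec[index[tok]] += 1
    return vec

# Toy documents, already tokenized
docs = [["buy", "cheap", "pills", "cheap"], ["meeting", "notes", "attached"]]
vocab = create_vocab_list(docs)
print(vocab)                                # ['attached', 'buy', 'cheap', 'meeting', 'notes', 'pills']
print(bag_of_words_vector(vocab, docs[0]))  # [0, 1, 2, 0, 0, 1]
```

The dictionary lookup avoids calling list.index for every word, which matters once the vocabulary grows to thousands of entries.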

Training and testing

1. Read the messages from the path specified in the code (two directories, one per category), collect their contents, and convert them to word vectors.

2. Divide this data into a training set and a test set.

3. Call the naive Bayes training function on the training set to obtain each probability term in the Bayes formula.

4. Compute the word vector of each document to be classified and complete the Bayes formula to calculate the probability that the word vector belongs to each class. The class with the highest probability is the classification result.

5. Compare the classification results from the previous step with the actual classes and print the final test information.
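Steps 3 and 4 depend on the training and classification functions from the previous article, which are not reproduced here. As a rough illustration only, the sketch below shows one common way to write them in plain Python, assuming add-one (Laplace) smoothing and log probabilities. The names train_nb and classify_nb and the toy data are hypothetical, and the previous article's implementation may differ in its details.

```python
import math

def train_nb(train_matrix, train_labels):
    """Estimate log P(word | class) for classes 0 and 1, plus the prior P(class = 1).

    Assumes both classes occur in train_labels; otherwise a prior would be 0 or 1
    and its logarithm in classify_nb would be undefined.
    """
    num_words = len(train_matrix[0])
    p_class1 = sum(train_labels) / len(train_labels)
    # Add-one smoothing: every word count starts at 1, every total at num_words
    counts = {0: [1] * num_words, 1: [1] * num_words}
    totals = {0: num_words, 1: num_words}
    for vec, label in zip(train_matrix, train_labels):
        for i, count in enumerate(vec):
            counts[label][i] += count
        totals[label] += sum(vec)
    log_p0 = [math.log(c / totals[0]) for c in counts[0]]
    log_p1 = [math.log(c / totals[1]) for c in counts[1]]
    return log_p0, log_p1, p_class1

def classify_nb(vec, log_p0, log_p1, p_class1):
    """Return the class (0 or 1) with the larger log posterior score."""
    score1 = sum(c * lp for c, lp in zip(vec, log_p1)) + math.log(p_class1)
    score0 = sum(c * lp for c, lp in zip(vec, log_p0)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0

# Toy word vectors over a 4-word vocabulary: ['buy', 'cheap', 'meeting', 'notes']
train_matrix = [[1, 2, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham
log_p0, log_p1, p_spam = train_nb(train_matrix, train_labels)
print(classify_nb([1, 1, 0, 0], log_p0, log_p1, p_spam))  # 1
print(classify_nb([0, 0, 1, 2], log_p0, log_p1, p_spam))  # 0
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow when a message contains many words.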

The following code is used for training and testing:

    import random
    import numpy

    # createVocabList, trainNB0 and classifyNB come from the previous article.

    #=============================================
    #    Input:
    #        vocabList:    vocabulary list
    #        inputSet:     document (list format) to be converted
    #    Output:
    #        returnVec:    converted word vector (bag-of-words model)
    #=============================================
    def bagOfWords2VecMN(vocabList, inputSet):
        'document (list format) -> word vector (bag-of-words model)'
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec

    #=============================================
    #    Input:
    #        the two categories of messages under the path specified in the code
    #    Output:
    #        None (prints the results of the Bayesian classification test)
    #=============================================
    def spamTest():
        'Test Bayesian classification and print results'

        # Collection of documents (each a list of words)
        docList = []
        # Collection of document classes
        classList = []
        # All words from all documents
        fullText = []

        # Take 25 messages from each of the two categories and fill the
        # document collection, the class collection and the full word list.
        for i in range(1, 26):
            wordList = textParse(open('/home/fangmeng/email/spam/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(open('/home/fangmeng/email/ham/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)

        # Vocabulary list
        vocabList = createVocabList(docList)
        # Training set indices (a list, so that entries can be deleted)
        trainingSet = list(range(50))
        # Test set indices
        testSet = []

        # Of the 50 e-mails, 10 are used for classification testing.
        # These 10 are removed from the training set at the same time.
        for i in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del trainingSet[randIndex]

        # Training set word vector matrix
        trainMat = []
        # Training set class list
        trainClasses = []

        # Build the training set word vector matrix and class list
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])

        # Classify the messages in the test set and print the test results
        p0V, p1V, pSpam = trainNB0(numpy.array(trainMat), numpy.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(numpy.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("Misclassified document:\n", docList[docIndex])
        print("Error rate:\n", float(errorCount) / len(testSet))

Because the test set is randomly selected, the printed result may differ from one execution to the next.
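Since a single random split gives a noisy error rate, one option is to repeat the hold-out split several times and average the results. The sketch below illustrates the idea under stated assumptions: hold_out_split and mean_error_rate are hypothetical helpers, and stand_in_evaluate is a placeholder for a function that would train on the training indices and return the error rate on the test indices.

```python
import random

def hold_out_split(n_docs, n_test, rng):
    """Randomly pick n_test document indices for testing; the rest are for training."""
    test_idx = sorted(rng.sample(range(n_docs), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in test_set]
    return train_idx, test_idx

def mean_error_rate(evaluate, n_docs, n_test, n_trials, seed=0):
    """Average evaluate(train_idx, test_idx) over n_trials random splits."""
    rng = random.Random(seed)   # seeded, so the sequence of splits is reproducible
    rates = []
    for _ in range(n_trials):
        train_idx, test_idx = hold_out_split(n_docs, n_test, rng)
        rates.append(evaluate(train_idx, test_idx))
    return sum(rates) / n_trials

def stand_in_evaluate(train_idx, test_idx):
    """Placeholder only: pretends documents 0-4 are always misclassified."""
    return sum(1 for i in test_idx if i < 5) / len(test_idx)

train_idx, test_idx = hold_out_split(50, 10, random.Random(42))
print(len(train_idx), len(test_idx))   # 40 10
print(mean_error_rate(stand_in_evaluate, n_docs=50, n_test=10, n_trials=20))
```

In a real run, the placeholder would be replaced by a function that trains the classifier on train_idx and counts misclassifications on test_idx.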

Summary

1. Document parsing accounts for a large share of the work in a real document classification project. Regular expressions handle this parsing well.

2. When implementing a project, Python's built-in tools (such as regular expressions) and third-party libraries (such as numpy) can greatly improve development efficiency.
