Part 6: A mail filtering system based on the naive Bayes classification algorithm

Source: Internet
Author: User

Preface

The most widespread and classic application of the naive Bayes algorithm is undoubtedly document classification, and more specifically mail filtering.

This article explains in detail how to implement a mail filtering system based on the naive Bayes classification algorithm.

The focus here is on implementing the project. For the details of the algorithm itself, please refer to the previous article: Naive Bayes classification algorithm principle analysis and code implementation.

Prepare data: splitting the text

Once a text file has been read, two things need to be done:

1. Convert the text file to a list of words

2. Convert the result of the previous step into a word vector

For step 1, the text is split on characters other than letters or digits.

Doing this with only the string split method is cumbersome. The proper tool for text processing in Python is the regular expression, which handles the task easily.

The following function implements step 1:

    import re

    #=============================================
    # Input:
    #   bigString: document string to convert
    # Output:
    #   the document as a list of lowercase tokens
    #=============================================
    def textParse(bigString):
        # split on runs of non-word characters
        # (anything other than letters, digits or underscore)
        listOfTokens = re.split(r'\W+', bigString)
        # drop empty strings and tokens of one or two characters
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]

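Note that the split can leave empty strings in the result (for example, when the text begins or ends with punctuation), so the return statement adds a filter. The len(tok) > 2 condition removes these empty strings along with tokens of one or two characters.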
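To see why the filter is needed, here is a minimal sketch of the split applied to a made-up sentence. The sample string is an assumption for illustration and does not come from the email data set.

```python
import re

# Hypothetical sample text, not taken from the email data set.
sample = "Hi Bob, this is a TEST message: visit http://example.com now!"

# Split on runs of characters that are not letters, digits or underscores.
raw_tokens = re.split(r'\W+', sample)

# The trailing "!" leaves an empty string at the end of the list, and
# very short tokens such as "Hi", "is" or "a" carry little information.
tokens = [tok.lower() for tok in raw_tokens if len(tok) > 2]

print(raw_tokens)
print(tokens)
```

The first list ends with an empty string; the second contains only lowercase tokens of three or more characters.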

How to use regular expressions is beyond the scope of this article; interested readers should consult the relevant documentation.

For step 2, the previous article (Naive Bayes classification algorithm principle analysis and code implementation) already includes an example implementation, so it is not described again here.
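Since the previous article is not reproduced here, the following is only a rough sketch of what step 2 might look like. The name createVocabList is the one called by the training code in the next section, but its body here is an assumption, as is the hypothetical helper toWordVector. Sorting the vocabulary is a choice made only to keep the output reproducible.

```python
def createVocabList(dataSet):
    """Return a sorted list of the unique words across all documents."""
    vocabSet = set()
    for document in dataSet:
        vocabSet |= set(document)
    return sorted(vocabSet)


def toWordVector(vocabList, inputSet):
    """Hypothetical helper: count how often each vocabulary word
    occurs in one document (bag-of-words model)."""
    index = {word: i for i, word in enumerate(vocabList)}
    vec = [0] * len(vocabList)
    for word in inputSet:
        if word in index:
            vec[index[word]] += 1
    return vec


# Two made-up tokenized documents.
docs = [['cheap', 'pills', 'cheap'], ['meeting', 'tomorrow']]
vocab = createVocabList(docs)
print(vocab)                         # ['cheap', 'meeting', 'pills', 'tomorrow']
print(toWordVector(vocab, docs[0]))  # [2, 0, 1, 0]
```

Each position in the vector corresponds to one vocabulary word, so every document maps to a vector of the same length regardless of how long the message is.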

Training and testing

1. Read the messages from the two directories (one per category) under the path specified in the code, collect all message contents, and convert them to word vectors.

2. Divide the data into a training set and a test set.

3. Call the naive Bayes training function on the training set to obtain each probability term in Bayes' formula.

4. Compute the word vector of the document to be classified and complete Bayes' formula to calculate the probability that the word vector belongs to each category. The category with the highest probability is the classification result.

5. Compare the classification results from the previous step with the actual categories and print the final test information.
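Steps 3 and 4 depend on the training and classification functions from the previous article, which are not reproduced here. The sketch below shows one way such functions might work, written in plain Python. The names trainSketch and classifySketch are hypothetical, and the add-one smoothing and the use of log-probabilities are assumptions (common choices, not necessarily the ones the previous article makes).

```python
import math

def trainSketch(trainMatrix, trainClasses):
    """Estimate log P(word | class) for both classes and the prior P(class 1)."""
    numWords = len(trainMatrix[0])
    pClass1 = sum(trainClasses) / float(len(trainClasses))
    # Assumed smoothing: every word count starts at 1 and every total at 2,
    # so a word never seen in one class does not get probability zero.
    counts = {0: [1.0] * numWords, 1: [1.0] * numWords}
    totals = {0: 2.0, 1: 2.0}
    for vec, label in zip(trainMatrix, trainClasses):
        for j, n in enumerate(vec):
            counts[label][j] += n
        totals[label] += sum(vec)
    p0V = [math.log(c / totals[0]) for c in counts[0]]
    p1V = [math.log(c / totals[1]) for c in counts[1]]
    return p0V, p1V, pClass1

def classifySketch(wordVector, p0V, p1V, pClass1):
    """Return the class (0 or 1) with the larger log posterior score."""
    score1 = sum(n * p for n, p in zip(wordVector, p1V)) + math.log(pClass1)
    score0 = sum(n * p for n, p in zip(wordVector, p0V)) + math.log(1.0 - pClass1)
    return 1 if score1 > score0 else 0

# Toy word-count vectors over a 4-word vocabulary (made up for illustration).
trainMatrix = [[2, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0]]
trainClasses = [1, 0, 1, 0]   # 1 = spam, 0 = ham

p0V, p1V, pSpam = trainSketch(trainMatrix, trainClasses)
print(classifySketch([1, 0, 1, 0], p0V, p1V, pSpam))  # 1
print(classifySketch([0, 1, 0, 1], p0V, p1V, pSpam))  # 0
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when a message contains many words. Note that this sketch would fail on math.log(0) if the training data contained only one class.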

The following code is used for training and testing:

    import random
    import numpy

    # textParse is defined in the previous section; createVocabList, trainNB0
    # and classifyNB come from the previous article and must be available.

    #=============================================
    # Input:
    #   vocabList: list of words (the vocabulary)
    #   inputSet:  document in list format to be converted
    # Output:
    #   returnVec: converted word vector (bag-of-words model)
    #=============================================
    def bagOfWords2VecMN(vocabList, inputSet):
        'document (list format) -> word vector (bag-of-words model)'

        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1

        return returnVec

    #=============================================
    # Input:
    #   two categories of messages at paths specified in the code
    # Output:
    #   none (prints the results of the Bayesian classification test)
    #=============================================
    def spamTest():
        'test Bayesian classification and print the results'

        # collection of document token lists
        docList = []
        # collection of document classifications
        classList = []
        # all tokens from all documents
        fullText = []

        # read 25 messages from each of the two categories to build the
        # document collection, the classification collection and the full token list
        for i in range(1, 26):
            wordList = textParse(open('/home/fangmeng/email/spam/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(open('/home/fangmeng/email/ham/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)

        # vocabulary list
        vocabList = createVocabList(docList)
        # indices of the training set (a list, so that entries can be deleted)
        trainingSet = list(range(50))
        # indices of the test set
        testSet = []

        # of the 50 messages in total, 10 are used for the classification test;
        # these 10 are also removed from the training set
        for i in range(10):
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])

        # training set word vector matrix
        trainMat = []
        # training set category list
        trainClasses = []

        # build the training set word vector matrix and category list
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])

        # train on the training set, then classify the messages in the
        # test set and print the test results
        p0V, p1V, pSpam = trainNB0(numpy.array(trainMat), numpy.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(numpy.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("Classification error:\n", docList[docIndex])
        print('Error rate:\n', float(errorCount) / len(testSet))

The test set is randomly selected, so the printed result may differ from one run to the next.
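Because the test messages are drawn at random, a single error rate is a noisy estimate. One common remedy is to repeat the experiment several times and average the results. The sketch below assumes a variant of spamTest that returns its error rate instead of only printing it (the function in this article returns nothing). Both averageErrorRate and the stand-in fakeSpamTest are hypothetical names, and fakeSpamTest produces placeholder values only so that the example runs without the email files.

```python
import random

def averageErrorRate(runOnce, numRuns=10):
    """Call runOnce() numRuns times and return the mean of its results."""
    rates = [runOnce() for _ in range(numRuns)]
    return sum(rates) / len(rates)

def fakeSpamTest():
    # Stand-in for a spamTest variant that returns its error rate:
    # pretends that 0, 1 or 2 of the 10 test messages were misclassified.
    return random.choice([0, 1, 2]) / 10.0

random.seed(0)  # fixed seed so repeated runs of this sketch agree
meanRate = averageErrorRate(fakeSpamTest, numRuns=20)
print('Mean error rate over 20 runs:', meanRate)
```

In practice, runOnce would be the real test function, and the averaged figure gives a steadier picture of classifier quality than any single run.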

Summary

1. Document parsing makes up a large part of a real document classification project. Regular expressions handle this parsing well.

2. When implementing a project, Python's built-in tools (such as regular expressions) and third-party libraries (such as NumPy) can greatly improve development efficiency.
