E-mail spam filtering
1. Building a word list from a text document. A regular expression splits the text on non-word characters, every token is converted to lowercase, and tokens shorter than three characters are dropped.
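import re

####################################
# Function: split text into tokens
# Input: big_string, a large string
# Output: list of lowercase token strings
####################################
def text_parse(big_string):
    # Split on runs of non-word characters (anything other than letters, digits, underscore)
    list_of_tokens = re.split(r'\W+', big_string)
    # Keep only tokens longer than two characters, in lowercase
    return [tok.lower() for tok in list_of_tokens if len(tok) > 2]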
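To make the tokenizing step concrete, here is a minimal sketch of how text_parse behaves on a short message. The sample sentence is made up for illustration, and the pattern is assumed to be `\W+` (one or more non-word characters).

```python
import re

def text_parse(big_string):
    # Split on runs of non-word characters, then lowercase and drop short tokens
    list_of_tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in list_of_tokens if len(tok) > 2]

# Hypothetical sample text, not taken from the e-mail data set
sample = "Hi Peter, the M.L. meetup is on Friday at 7pm. See you there!"
tokens = text_parse(sample)
print(tokens)
# ['peter', 'the', 'meetup', 'friday', '7pm', 'see', 'you', 'there']
```

The pattern `\W+` is used rather than `\W*` because, from Python 3.7 on, re.split also splits on empty matches, so `\W*` would break the text into single characters and the length filter would then discard almost everything.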
2. Automated testing of the naive Bayes spam classifier. In this example there are 50 e-mails (25 spam and 25 ham). Ten of them are selected at random as the test set and the remaining 40 form the training set. Because the test set is chosen at random, the error rate differs from run to run; averaging it over several runs gives a more reliable measure of the classifier's performance.
import random
from numpy import array

####################################
# Function: spam test
# Input: none
# Output: error rate on the randomly chosen test set
####################################
# create_vocab_list, set_of_words2vec, train_nb0 and classify_nb are
# assumed to be defined elsewhere; they are not shown in this section.
def spam_test():
    doc_list = []
    class_list = []
    # Load 25 spam messages (class 1) and 25 ham messages (class 0)
    for i in range(1, 26):
        word_list = text_parse(open('email/spam/%d.txt' % i).read())
        doc_list.append(word_list)
        class_list.append(1)
        word_list = text_parse(open('email/ham/%d.txt' % i).read())
        doc_list.append(word_list)
        class_list.append(0)
    vocab_list = create_vocab_list(doc_list)
    training_set = list(range(50))
    test_set = []
    # Randomly pick 10 of the 50 messages as the test set and remove them from the training set
    for i in range(10):
        rand_index = int(random.uniform(0, len(training_set)))
        test_set.append(training_set[rand_index])
        del training_set[rand_index]
    train_mat = []
    train_classes = []
    for doc_index in training_set:
        train_mat.append(set_of_words2vec(vocab_list, doc_list[doc_index]))
        train_classes.append(class_list[doc_index])
    p0v, p1v, p_spam = train_nb0(array(train_mat), array(train_classes))
    error_count = 0
    # Classify each message in the test set and count the mistakes
    for doc_index in test_set:
        word_vector = set_of_words2vec(vocab_list, doc_list[doc_index])
        if classify_nb(array(word_vector), p0v, p1v, p_spam) != class_list[doc_index]:
            error_count += 1
            print('classification error', doc_list[doc_index])
    error_rate = float(error_count) / len(test_set)
    print('The error rate is:', error_rate)
    return error_rate
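Averaging the error rate over repeated runs can be sketched as follows. This is an illustration only: average_error_rate and fake_trial are hypothetical names, and fake_trial is a stand-in that assumes spam_test (or a wrapper around it) returns its error rate as a float.

```python
import random

def average_error_rate(run_trial, n_runs=10):
    """Call run_trial() n_runs times and return the mean of the returned error rates."""
    rates = [run_trial() for _ in range(n_runs)]
    return sum(rates) / len(rates)

# Hypothetical stand-in for spam_test(): pretends that 0, 1 or 2 of the
# 10 test e-mails were misclassified in a given run.
_rng = random.Random(0)

def fake_trial():
    return _rng.choice([0, 1, 2]) / 10.0

if __name__ == "__main__":
    print("average error rate: %.3f" % average_error_rate(fake_trial, n_runs=10))
```

With the real classifier, one would pass spam_test in place of fake_trial, provided it returns the error rate rather than only printing it.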
3. Testing the code
def main():
    spam_test()

if __name__ == '__main__':
    main()
Machine learning: naive Bayes algorithm example