Python and scikit-learn for spam filtering


Text mining (deriving useful information from text) is a broad field that has become increasingly important in today's era of massive text data. With the help of machine learning models, many text mining applications have been automated, including sentiment analysis, document classification, topic classification, text summarization, and machine translation.

Among these applications, spam filtering is a good starting point for beginners to practice document classification; the "Spam" folder in a Gmail account is a real-world example of it. Below we will build a spam filter based on the public Ling-spam corpus, available here:

http://t.cn/RKQBl9c

Here we have extracted an equal number of spam and non-spam messages from Ling-spam; this subset is available here:

http://t.cn/RKQBkRu

We will build a working spam filter by following these steps:

  1. Preparing the text data;

  2. Creating a word dictionary;

  3. Feature extraction;

  4. Training the classifiers.

Finally, we validate the filter with a test data set.

  

1. Preparing the text data

Here we divide the data set into two parts: a training set (702 messages) and a test set (260 messages), each of which is 50% spam and 50% non-spam. The spam messages are easy to tell apart because their file names start with "spmsg".
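For example, a quick way to sanity-check the split (assuming the train-mails and test-mails directory names used later in this article) is to count the files whose names start with "spmsg":

import os

# Count spam vs. ham files per directory; spam file names start with "spmsg".
for folder in ('train-mails', 'test-mails'):
    names = os.listdir(folder)
    spam = sum(1 for n in names if n.startswith('spmsg'))
    print(folder, 'spam:', spam, 'ham:', len(names) - spam)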

In most text mining problems, text cleaning is the first step: we remove words and characters that carry no information about our target. A message usually contains many useless characters such as punctuation, stop words, and numbers, which do not help detect spam, so we need to clean them out. The messages in the Ling-spam dataset have already been preprocessed in the following ways:

  a) Stop-word removal: words like "and", "the", and "of" are very common in English sentences but say nothing about whether a message is spam, so they have been removed from the messages.

  b) Lemmatization: this is the process of grouping the different inflected forms of a word so they can be analyzed as a single item. For example, "include", "includes", and "included" can all be represented as "include". Lemmatization preserves the contextual meaning of a sentence, which distinguishes it from stemming (another text mining technique that does not take the meaning of the sentence into account).

In addition, we still need to remove non-word tokens such as punctuation and special characters. There are several ways to do this; here we will remove them after first creating the dictionary, which is convenient because once the dictionary is on hand, each non-word token only needs to be removed once.
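The Ling-spam messages used here are already cleaned, but if you want to apply similar preprocessing to raw text yourself, here is a minimal sketch using NLTK. This is only an illustration of the idea, not the tool used to prepare the corpus, and it assumes NLTK with its stopword and WordNet data is installed.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# First run only: import nltk; nltk.download('stopwords'); nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Keep alphabetic tokens, drop stop words, and lemmatize the rest.
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

print(clean_text("These studies include the included files and other sources"))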

  

2. Creating a word dictionary

A sample message in the dataset typically looks like this:

Subject:posting

Hi, 'm work phonetics project Modern Irish 'm hard source. Anyone recommend book article english? ', specifically interest palatal (slender) consonant, work helpful too. Thank! Laurel Sutton (sutton@garnet.berkeley.edu

You will notice that the first line of the message is the subject and the body starts on the third line. Here we analyze only the message body to decide whether the message is spam. As a first step, we need to create a dictionary of words and their frequencies. To build this "dictionary" we use the 702 messages in the training set. See the following Python function for the implementation:

def make_dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:  # body of the email is only the 3rd line of the text file
                    words = line.split()
                    all_words += words
    dictionary = Counter(all_words)
    # Paste the code for non-word removal here (the snippet is given below)
    return dictionary

Once the dictionary is created, we can remove the non-word tokens mentioned earlier by adding a few lines of code to the function above. Here I have also deleted single characters, which are irrelevant to spam detection. See the following code; note that it should be appended to the end of the make_dictionary(train_dir) function.

    list_to_remove = list(dictionary.keys())
    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)

You can inspect the dictionary with print(dictionary). Do not worry if you see some insignificant words in the output; we can always adjust the dictionary in later steps. If you are using the dataset mentioned above, your dictionary should contain the following high-frequency words (here we keep the 3,000 most frequent words):

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('we', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762)]

  

3. Feature extraction

With the dictionary ready, we can represent each message in the training set as a word-count vector of dimension 3000 (this vector is our feature); each entry holds the frequency of one of the 3,000 high-frequency words selected earlier. As you may have guessed, most entries will be 0. For example: suppose our dictionary contains 500 words, so each word-count vector has 500 entries giving the frequencies of those words in a message. Suppose a training document contains the text "Get the work done, work done". Its word-count vector would look like [0,0,0,0,0,...0,0,2,0,0,0,...,0,0,1,0,0,...0,0,1,0,0,...2,0,0,0,0,0]: each word's frequency appears at that word's position in the 500-entry vector (here positions 296, 359, 415, and 495), and every other position is 0.
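To make this concrete, here is a tiny sketch with a hypothetical five-word dictionary (the real dictionary has 3,000 entries):

from collections import Counter
import numpy as np

# Hypothetical 5-word dictionary; the real one has 3,000 entries.
toy_dictionary = ['done', 'get', 'spam', 'the', 'work']
counts = Counter("Get the work done work done".lower().split())
vector = np.array([counts[w] for w in toy_dictionary])
print(vector)  # [2 1 0 1 2]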

The following Python function generates a feature matrix with 702 rows and 3,000 columns. Each row represents one of the 702 messages in the training set, and each column represents one of the 3,000 dictionary words. The value at position (i, j) is the number of times the j-th dictionary word appears in message i.

def extract_features(mail_dir):
    # Uses the global 'dictionary' built by make_dictionary()
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:
                    words = line.split()
                    for word in words:
                        wordID = 0
                        for i, d in enumerate(dictionary):
                            if d[0] == word:
                                wordID = i
                                features_matrix[docID, wordID] = words.count(word)
        docID = docID + 1
    return features_matrix
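As a side note, scikit-learn can build this kind of bag-of-words matrix for you. The sketch below is not the approach used in this tutorial, just an alternative worth knowing; the body-extraction helper and variable names are assumptions, and CountVectorizer's tokenization and lowercasing differ slightly from the hand-written loop, so the counts may not match exactly.

import os
from sklearn.feature_extraction.text import CountVectorizer

def read_bodies(mail_dir):
    # Return the 3rd line (the body) of every mail in the directory.
    bodies = []
    for name in sorted(os.listdir(mail_dir)):
        with open(os.path.join(mail_dir, name)) as f:
            lines = f.readlines()
            bodies.append(lines[2] if len(lines) > 2 else '')
    return bodies

# 'dictionary' is the output of make_dictionary(): a list of (word, count) pairs.
vocab = [word for word, count in dictionary]
vectorizer = CountVectorizer(vocabulary=vocab)
train_matrix_alt = vectorizer.fit_transform(read_bodies('train-mails')).toarray()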

  

4. Training the classifiers

Here we will use the scikit-learn machine learning library to train the classifiers. The scikit-learn project is linked below:

http://t.cn/SMzAoZ

This is an open-source machine learning library that is bundled with the third-party Python distribution Anaconda. It can be installed together with Anaconda, or installed independently by following the instructions at the link below:

http://t.cn/8kkrVlQ

Once installed, we only need to import it into our program to be able to use it.

Here we train two models: a naive Bayes classifier and an SVM (support vector machine). Naive Bayes is a classic supervised probabilistic classifier that is very common in text categorization; it is based on Bayes' theorem and assumes that the features are independent of each other. The SVM is a supervised binary classifier that is very effective when there are many features; its goal is to find a subset of the training data, called the support vectors, that defines the separating hyperplane. The SVM decision function that assigns the final class of a test sample is based on these support vectors and on the kernel trick.

After training, we can evaluate the models on the test set: we extract the word-count vector for each test message and use the trained naive Bayes and SVM models to predict its category (normal mail or spam). Below is the complete Python code for the spam classifier, together with the two functions defined in steps 2 and 3.

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import confusion_matrix

# Create a dictionary of words with their frequencies
train_dir = 'train-mails'
dictionary = make_dictionary(train_dir)

# Prepare feature vectors per training mail and their labels
train_labels = np.zeros(702)
train_labels[351:701] = 1
train_matrix = extract_features(train_dir)

# Training SVM and Naive Bayes classifiers
model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix, train_labels)
model2.fit(train_matrix, train_labels)

# Test the unseen mails for spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
print(confusion_matrix(test_labels, result1))
print(confusion_matrix(test_labels, result2))

  

Performance testing

Our test set contains 130 spam and 130 non-spam messages. If you have completed all the previous steps successfully, the code prints one confusion matrix for each model. In each confusion matrix, the diagonal elements count the correctly classified messages and the off-diagonal elements count the misclassifications.

The two models perform similarly on the test set, though the SVM leans slightly more toward predicting spam. Note that the test data is used neither to create the dictionary nor to train the models.
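If you also want a single summary number for each model, accuracy_score from scikit-learn can be applied to the predictions produced above (result1 and result2); this is an optional addition, not part of the original script:

from sklearn.metrics import accuracy_score

print("Naive Bayes accuracy:", accuracy_score(test_labels, result1))
print("Linear SVM accuracy:", accuracy_score(test_labels, result2))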

  

Extensions

Interested readers can extend the steps described above to a larger corpus; the corpus and the results are described below.

The extension uses the pre-processed Enron-spam corpus, which contains 6 directories and 33,716 messages; each directory has separate non-spam and spam subdirectories, with 16,545 non-spam and 17,171 spam messages in total. The download link for the Enron-spam corpus is below:

http://t.cn/RK84mv6

Note that because the Enron-spam corpus is organized differently from the Ling-spam corpus above, some of the functions above need minor modifications before they can be applied to it.

Here we divide the Enron-spam corpus into a training set and a test set with a 3:2 ratio and repeat the steps above, evaluating on the 13,478 test mails.
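A simple way to produce such a 3:2 split is scikit-learn's train_test_split. The sketch below assumes you have already collected the Enron-spam message paths in all_files with matching 0/1 labels in all_labels (both hypothetical names):

from sklearn.model_selection import train_test_split

# all_files: list of message paths; all_labels: matching list of 0 (ham) / 1 (spam)
train_files, test_files, train_lbls, test_lbls = train_test_split(
    all_files, all_labels, test_size=0.4, stratify=all_labels, random_state=42)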

Here the SVM again performs slightly better than naive Bayes.

  

Summary

In this article we have tried to keep the narrative simple and understandable, omitting many technical explanations and terms. We hope this is an easy-to-follow tutorial that is useful for beginners interested in text analysis.

Some readers may be curious about the mathematics behind the naive Bayes and SVM models. Mathematically, the SVM is the more complex of the two, while naive Bayes is relatively easy to understand; readers interested in the theory will find very detailed tutorials and examples online. Trying different ways to achieve the same goal is also a good way to learn. For example, you can vary the following and see how each affects the filter's performance:

a) the size of the training data;

b) the size of the dictionary;

c) different machine learning models, including GaussianNB, BernoulliNB, and SVC;

d) different SVM model parameters;

e) improving the dictionary by deleting insignificant words (for example, manually);

f) using a different feature representation (look up TF-IDF); a sketch of c) and f) follows after this list.
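As a starting point for c) and f), here is a minimal sketch that re-weights the word-count matrices with TF-IDF and tries a few alternative classifiers. It reuses train_matrix, train_labels, test_matrix, and test_labels from the code above; the specific model choices are only examples, not recommendations.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Re-weight the raw word counts with TF-IDF.
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_matrix).toarray()
test_tfidf = tfidf.transform(test_matrix).toarray()

# Try a few alternative models on the same data.
for model in (GaussianNB(), BernoulliNB(), SVC(kernel='rbf')):
    model.fit(train_tfidf, train_labels)
    predictions = model.predict(test_tfidf)
    print(type(model).__name__)
    print(confusion_matrix(test_labels, predictions))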

Finally, the full Python code mentioned in this post is available at the link below:

http://t.cn/R6ZeuiN

If you have any questions, please leave a comment at the end of this article.

