Coursera Data Analysis Example: How to Classify Spam Messages in R

Structure of a Data Analysis

    1. Steps for data analysis

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code

    2. A worked example

1) Define the question

Can I automatically detect whether an email is spam?

2) Make the question concrete

Can I use quantitative characteristics of the emails to classify them as spam or ham?

3) Get the data


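4) Subsampling

# If the kernlab package isn't installed, install it first:
# install.packages("kernlab")
library(kernlab)
data(spam)

# Perform the subsampling: flip a fair coin for each of the 4601 rows
set.seed(3435)
trainIndicator = rbinom(4601, size = 1, prob = 0.5)
table(trainIndicator)

# Rows with indicator 1 go to the training set, the rest to the test set
trainSpam = spam[trainIndicator == 1, ]
testSpam = spam[trainIndicator == 0, ]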
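The way the 0/1 indicator partitions the rows can be seen on a much smaller case. The following is only a sketch on a hypothetical ten-row data frame (toy, trainToy and testToy are illustrative names, not part of the original analysis). It shows that every row lands in exactly one of the two sets, and that the split sizes are random rather than exactly half and half.

```r
set.seed(3435)

# Hypothetical stand-in for the spam data frame (assumption: 10 rows only)
n <- 10
toy <- data.frame(id = 1:n)

# One Bernoulli(0.5) draw per row: 1 = training set, 0 = test set
indicator <- rbinom(n, size = 1, prob = 0.5)

trainToy <- toy[indicator == 1, , drop = FALSE]
testToy  <- toy[indicator == 0, , drop = FALSE]

# The two sets are disjoint and together cover all rows
nrow(trainToy) + nrow(testToy) == n
length(intersect(trainToy$id, testToy$id)) == 0
```

Because each row is assigned independently, the training set is only approximately half of the data. If an exact split size is needed, drawing row indices with sample() is a common alternative.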

5) Exploratory data analysis

a) names: view the column names

names(trainSpam)

b) head: view the first six rows

head(trainSpam)

c) Summaries: count the spam and nonspam messages in the training set

table(trainSpam$type)

d) Plots: compare the distribution of a predictor for spam and nonspam messages

plot(trainSpam$capitalAve ~ trainSpam$type)

The two distributions are hard to tell apart on the raw scale, so take the logarithm (adding 1 so that any zeros stay finite) and plot again:

plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
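To see why the transform is written as log10(x + 1) rather than log10(x), here is a small sketch on made-up values (not taken from the spam data). It illustrates two things: zeros remain finite, and values spanning several orders of magnitude are compressed onto a comparable scale.

```r
# Toy, non-negative, strongly skewed values (assumption: illustrative only)
x <- c(0, 9, 99, 999)

log10(0)                 # -Inf: the plain log breaks on zeros

logged <- log10(x + 1)   # shift by 1 first, so 0 maps to 0
logged                   # 0 1 2 3
```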

e) Relationships between predictors

plot(log10(trainSpam[, 1:4] + 1))

f) Hierarchical clustering of the predictors

hCluster = hclust(dist(t(trainSpam[, 1:57])))
plot(hCluster)

The dendrogram is too cluttered to show anything useful. As before, apply the log transform and look again:

hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))
plot(hClusterUpdated)
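The t() call is what makes this a clustering of predictors rather than of emails: dist() measures distances between rows, so the data is transposed first. A minimal sketch on simulated data (the data frame toy and its columns a to d are hypothetical, not the spam predictors) shows the same two calls, with and without the log transform.

```r
set.seed(42)

# Simulated toy data: 50 rows, 4 skewed columns, two of them on a
# much larger scale (assumption: stands in for the spam predictors)
toy <- data.frame(a = rexp(50), b = rexp(50),
                  c = rexp(50) * 100, d = rexp(50) * 100)

# Transpose so that each column (variable) becomes a row for dist()
hcRaw <- hclust(dist(t(toy)))
hcLog <- hclust(dist(t(log10(toy + 1))))

hcLog$labels        # the clustered objects are the variables: a, b, c, d
# plot(hcRaw); plot(hcLog)   # draw the two dendrograms for comparison
```

On the raw scale, distances are dominated by the columns with the largest values; the log transform reduces that effect, which is the motivation for repeating the clustering on log10(x + 1).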

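6) Statistical prediction and modeling

# Recode the outcome as 0 (nonspam) / 1 (spam)
trainSpam$numType = as.numeric(trainSpam$type) - 1

# Cost: number of observations whose label differs from the thresholded prediction
costFunction = function(x, y) sum(x != (y > 0.5))
cvError = rep(NA, 55)
library(boot)

# Fit a one-predictor logistic regression for each of the first 55 variables
# and record its 2-fold cross-validated error
for (i in 1:55) {
  lmFormula = reformulate(names(trainSpam)[i], response = "numType")
  glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
  cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}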


## Which predictor has the minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
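Two small pieces of the loop above are easy to check in isolation: the cost function and the formula built by reformulate(). The sketch below uses made-up labels and probabilities (actual and probs are hypothetical values, not model output) to show that the cost is a count of misclassified observations at the 0.5 threshold.

```r
costFunction <- function(x, y) sum(x != (y > 0.5))

# Toy inputs (assumption: illustrative values only)
actual <- c(1, 0, 1, 0)           # true 0/1 labels
probs  <- c(0.9, 0.2, 0.4, 0.6)   # predicted probabilities

# Thresholded predictions are 1, 0, 0, 1, so the last two rows are wrong
costFunction(actual, probs)       # 2

# reformulate() builds a formula from character strings
f <- reformulate("charDollar", response = "numType")
f                                 # numType ~ charDollar
```

Since the cost sums rather than averages, it returns a count of errors, not an error rate.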

7) Prediction on the test set

## Use the best model from the group
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)

## Get predictions on the test set
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])

## Classify as 'spam' for those with prob > 0.5
## (as written, this thresholds predictionModel$fitted, the fitted probabilities
## for the training rows, rather than predictionTest)
predictedSpam[predictionModel$fitted > 0.5] = "spam"

## Classification table: view the classification results
table(predictedSpam, testSpam$type)

Classification Error Rate: 0.2243 = (61 + 458)/(1346 + 458 + 61 + 449)
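The error rate can be reproduced from the four counts quoted above. In this sketch the counts come from the text, but the layout of the table is an assumption: 1346 and 449 are placed on the diagonal as correct classifications, and the placement of 458 and 61 in the two off-diagonal cells is a guess that does not affect the result.

```r
# Confusion matrix built from the counts in the text
# (assumption: which off-diagonal cell holds 458 and which holds 61)
confusion <- matrix(c(1346, 61, 458, 449), nrow = 2,
                    dimnames = list(predicted = c("nonspam", "spam"),
                                    actual    = c("nonspam", "spam")))

# Error rate = misclassified / total = 1 - correct / total
errorRate <- 1 - sum(diag(confusion)) / sum(confusion)
round(errorRate, 4)   # 0.2243
```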

8) Interpret results

The fraction of characters that are dollar signs can be used to predict whether an email is spam.

Anything with more than 6.6% dollar signs is classified as spam.

More dollar signs means a higher predicted probability of spam under our model.

Our test set error rate is 22.4%.
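A cutoff of this kind follows from the fitted coefficients: in a one-predictor logistic regression the predicted probability equals 0.5 where b0 + b1 * x = 0, that is at x = -b0 / b1. The sketch below illustrates this on simulated data, not on the spam data; the variables x and y, the data frame toy, and the true coefficients -1 and 6 are all assumptions made for the example. It also uses predict() with type = "response", which returns probabilities for new rows (the default returns log-odds).

```r
set.seed(1)

# Simulated toy data (assumption: x plays the role of charDollar)
n <- 400
x <- rexp(n, rate = 5)
y <- rbinom(n, size = 1, prob = plogis(-1 + 6 * x))
toy <- data.frame(x = x, y = y)
train <- toy[1:200, ]
test  <- toy[201:400, ]

fit <- glm(y ~ x, family = "binomial", data = train)
b <- coef(fit)

# Value of x at which the fitted probability is exactly 0.5
cutoff <- unname(-b[1] / b[2])

# Probabilities for the held-out rows, then threshold at 0.5
pTest <- predict(fit, newdata = test, type = "response")
predicted <- ifelse(pTest > 0.5, "spam", "nonspam")

# Error rate on the held-out rows
mean(predicted != ifelse(test$y == 1, "spam", "nonspam"))
```

With a positive slope, thresholding the probability at 0.5 is the same as classifying every row with x above the cutoff as spam, which is how a statement like "more than 6.6% dollar signs" can be read off a fitted model.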

9) Challenge results

10) Synthesize/write up results

11) Create reproducible code

