Coursera Data Analysis Example (R language): How to classify spam messages
Structure of a data analysis
- Steps in a data analysis
  - Define the question
  - Define the ideal data set
  - Determine what data you can access
  - Obtain the data
  - Clean the data
  - Exploratory data analysis
  - Statistical prediction/modeling
  - Interpret results
  - Challenge results
  - Synthesize/write up results
  - Create reproducible code
- An example
1) Problem
Can I automatically detect whether an email is spam or not?
2) Making the question concrete
Can I use quantitative characteristics of the emails to classify them as spam/ham?
3) Get the data
http://search.r-project.org/library/kernlab/html/spam.html
4) Sampling
# If the kernlab package isn't installed, install it first: install.packages("kernlab")
library(kernlab)
data(spam)
# Perform the subsampling: one coin flip per message (4601 messages in total)
set.seed(3435)
trainIndicator = rbinom(4601, size = 1, prob = 0.5)
table(trainIndicator)
trainSpam = spam[trainIndicator == 1, ]
testSpam = spam[trainIndicator == 0, ]
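The same coin-flip split can be wrapped in a small function. This is only a sketch: the helper name split_by_coin_flip is hypothetical (it is not part of kernlab or the original analysis), and the built-in mtcars data frame stands in for the spam data so the example runs without extra packages.

```r
# Hypothetical helper (not from the original analysis): split a data frame
# into training and test sets with one Bernoulli draw per row.
split_by_coin_flip <- function(df, prob = 0.5, seed = 3435) {
  set.seed(seed)
  indicator <- rbinom(nrow(df), size = 1, prob = prob)
  list(train = df[indicator == 1, , drop = FALSE],
       test  = df[indicator == 0, , drop = FALSE])
}

# mtcars ships with base R and is used here only as a stand-in data set.
parts <- split_by_coin_flip(mtcars)

# Every row lands in exactly one of the two sets.
nrow(parts$train) + nrow(parts$test) == nrow(mtcars)
```

Because each row is assigned independently, the two sets are only roughly equal in size, which is why the original code inspects the split with table().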
5) Preliminary analysis
a) names: view the column names
names(trainSpam)
b) head: view the first six rows
head(trainSpam)
c) Summaries: count the messages of each type
table(trainSpam$type)
d) Plots: compare the distributions for spam and nonspam messages
plot(trainSpam$capitalAve ~ trainSpam$type)
The difference is hard to see on this scale, so take the logarithm and look again
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
e) Look for relationships between the predictors
plot(log10(trainSpam[, 1:4] + 1))
f) Try hierarchical clustering
hCluster = hclust(dist(t(trainSpam[, 1:57])))
plot(hCluster)
The dendrogram is too messy to show anything. As before, take the log and look again
hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))
plot(hClusterUpdated)
6) Statistical prediction and modeling
# Recode the type factor as 0 (nonspam) / 1 (spam)
trainSpam$numType = as.numeric(trainSpam$type) - 1
# Cost: number of observations whose 0/1 label disagrees with prob > 0.5
costFunction = function(x, y) sum(x != (y > 0.5))
cvError = rep(NA, 55)
library(boot)
# Fit one single-predictor logistic regression per variable and cross-validate it
for (i in 1:55) {
  lmFormula = reformulate(names(trainSpam)[i], response = "numType")
  glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
  cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
## Which predictor has minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
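To see what the two building blocks of the loop do, here is a small sketch on made-up inputs. The vectors observed and predicted are illustrative values, not taken from the spam data; costFunction and the column names charDollar and numType are the ones used above.

```r
# Same cost function as above: count the observations whose 0/1 label
# disagrees with the rule "predicted probability > 0.5".
costFunction <- function(x, y) sum(x != (y > 0.5))

# Illustrative values only (not from the spam data).
observed  <- c(1, 0, 1)
predicted <- c(0.9, 0.2, 0.3)

# The third observation is labelled 1 but predicted below 0.5,
# so exactly one observation is misclassified.
misclassified <- costFunction(observed, predicted)

# reformulate() builds a formula from character strings, which is what
# lets the loop swap in a different predictor on each iteration.
f <- reformulate("charDollar", response = "numType")
f
```

Building the formula from strings avoids writing 55 separate glm() calls by hand.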
7) Prediction on the test set
## Use the best model from the group
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
## Get predictions on the test set
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])
## Classify as "spam" for those with prob > 0.5
## Note: predictionModel$fitted holds the training-set fitted probabilities;
## predict(predictionModel, testSpam, type = "response") gives the test-set ones
predictedSpam[predictionModel$fitted > 0.5] = "spam"
## Classification table: view the classification results
table(predictedSpam, testSpam$type)
Classification Error Rate: 0.2243 = (61 + 458)/(1346 + 458 + 61 + 449)
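The arithmetic behind this error rate can be checked with a short sketch. The four counts come from the line above; which of the two off-diagonal cells holds 61 and which holds 458 is an assumption made here for illustration, and it does not change the error rate.

```r
# Counts taken from the error-rate line above. The placement of the two
# off-diagonal counts (61 and 458) is an assumption for illustration.
confusion <- matrix(
  c(1346, 61,    # actual nonspam: predicted nonspam, predicted spam
    458, 449),   # actual spam:    predicted nonspam, predicted spam
  nrow = 2,
  dimnames = list(predicted = c("nonspam", "spam"),
                  actual    = c("nonspam", "spam"))
)

# Error rate = share of messages off the diagonal.
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)
round(error_rate, 4)
```

The diagonal holds the correctly classified messages, so the remaining 519 of the 2314 test messages are the errors.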
8) Interpret results
The fraction of characters that are dollar signs can be used to predict whether an email is spam
Anything with more than 6.6% dollar signs is classified as spam
More dollar signs always means more spam under our prediction
Our test set error rate is 22.4%
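A cutoff like this follows from the "prob > 0.5" rule: in a logistic regression with one predictor, the fitted probability crosses 0.5 where the linear predictor b0 + b1 * x equals zero, that is at x = -b0 / b1. The sketch below illustrates this on simulated data, which is an assumption made so the example is self-contained; it does not reproduce the spam model or its coefficients.

```r
# Simulated data for illustration only (not the spam data).
set.seed(1)
x <- runif(200)
y <- rbinom(200, size = 1, prob = plogis(-2 + 5 * x))

# Single-predictor logistic regression, as in the analysis above.
fit <- glm(y ~ x, family = "binomial")
b <- unname(coef(fit))

# The predicted probability equals 0.5 where b0 + b1 * x = 0.
threshold <- -b[1] / b[2]

# Check: the fitted probability at the threshold is 0.5.
p_at_threshold <- unname(
  predict(fit, data.frame(x = threshold), type = "response")
)
p_at_threshold
```

With a positive slope, every value of the predictor above the threshold is classified as the positive class, which matches the reading that more dollar signs means more spam.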
9) Challenge Results
10) Synthesize/write up results
11) Create reproducible code