[Reading notes] Machine Learning: Practical Case Analysis (5)

Last Update:2016-06-11 Source: Internet

Author: User

Chapter 5. Regression models: predicting web page traffic

Regression model: uses one known dataset to predict another. The known data are called inputs (also predictors or features), and the data you want to predict are called outputs. A regression model differs from a classification model in that its output is a meaningful numeric value.

Baseline model: using the mean as the prediction

# Machine Learning for Hackers
# Chapter 5

library(ggplot2)
ages <- read.csv('ML_for_Hackers/05-Regression/data/longevity.csv')
# Density plot of age at death, one panel per smoking status
ggplot(ages, aes(x = AgeAtDeath, fill = factor(Smokes))) +
  geom_density() +
  facet_grid(Smokes ~ .)

# Compare the mean squared error when the mean is used as the estimate with the mean squared error when other values are used

guess <- 73
with(ages, mean((AgeAtDeath - guess) ^ 2))
guess.accuracy <- data.frame()
# Try every integer guess from 63 to 83 and record its mean squared error
for (guess in seq(63, 83, by = 1)) {
  prediction.error <- with(ages, mean((AgeAtDeath - guess) ^ 2))
  guess.accuracy <- rbind(guess.accuracy, data.frame(guess = guess, error = prediction.error))
}
ggplot(guess.accuracy, aes(x = guess, y = error)) + geom_point() + geom_line()

# After grouping by smoking status, use each group's mean as the estimate and compute the root mean squared error

# RMSE when one constant (the overall mean) is used for everyone
constant.guess <- with(ages, mean(AgeAtDeath))
with(ages, sqrt(mean((AgeAtDeath - constant.guess) ^ 2)))
# RMSE when each group is predicted by its own mean
smokers.guess <- with(subset(ages, Smokes == 1), mean(AgeAtDeath))
non.smokers.guess <- with(subset(ages, Smokes == 0), mean(AgeAtDeath))
ages <- transform(ages, NewPrediction = ifelse(Smokes == 0, non.smokers.guess, smokers.guess))
with(ages, sqrt(mean((AgeAtDeath - NewPrediction) ^ 2)))

Introduction to linear regression:

Assumptions: additivity and linearity

A general systematic problem: regression is good at interpolation but poor at extrapolation. In other words, when the input is far from the observed data, the prediction can be inaccurate.
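The interpolation versus extrapolation point can be illustrated with a small sketch. The data below are synthetic and not from the book: the true relationship is assumed to be quadratic, and a straight line is fitted using only x values between 0 and 10.

```r
# Synthetic illustration (assumed data, not from the book): the true
# relationship is curved, but the line is fitted only on x in [0, 10].
set.seed(1)
train <- data.frame(x = seq(0, 10, by = 0.5))
train$y <- train$x ^ 2 + rnorm(nrow(train), sd = 2)

fit <- lm(y ~ x, data = train)

# One point inside the observed range, one far outside it.
new.points <- data.frame(x = c(5, 30))
predicted <- predict(fit, newdata = new.points)
truth <- new.points$x ^ 2

# Absolute prediction error at each point.
abs.error <- abs(predicted - truth)
print(data.frame(x = new.points$x, predicted, truth, abs.error))
```

Under these assumptions the error at x = 30 should be far larger than the error at x = 5, because the line was never checked against data in that region.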

How do you tell whether a model works? A model should separate the real-world signal (the predicted values) from the noise (the residuals). If the residuals still contain signal in addition to genuine noise, the model is not strong enough to extract all of the signal.
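One way to check for leftover signal is to see whether the residuals still correlate with something the model left out. The sketch below uses synthetic data with an assumed quadratic term; the variable names are illustrative only.

```r
# Synthetic data (assumed): y depends on both x and x^2.
set.seed(2)
d <- data.frame(x = seq(-3, 3, by = 0.1))
d$y <- 1 + 2 * d$x + d$x ^ 2 + rnorm(nrow(d), sd = 0.5)

# A straight line misses the curvature; adding x^2 captures it.
linear.fit <- lm(y ~ x, data = d)
quadratic.fit <- lm(y ~ x + I(x ^ 2), data = d)

# Correlation between each model's residuals and the x^2 term.
leftover.linear <- cor(residuals(linear.fit), d$x ^ 2)
leftover.quadratic <- cor(residuals(quadratic.fit), d$x ^ 2)
print(c(linear = leftover.linear, quadratic = leftover.quadratic))
```

A strong correlation for the linear fit suggests its residuals still carry signal; for the quadratic fit the correlation is zero up to rounding, since least-squares residuals are uncorrelated with every term in the model.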

Evaluation methods:

Mean squared error (MSE): measures the average deviation of the predictions, but it is expressed in squared units of the output.

Root mean squared error (RMSE): the square root of the MSE, expressed in the same units as the output. On its own it does not show whether a model is reasonable: it can tell you which of two models is better, but it cannot rate the performance of a single model.

R2: evaluates the quality of a single model, using the mean prediction as the yardstick. Its value lies between 0 and 1. To calculate it, compute RMSE1 for the model's predictions and RMSE2 for the mean prediction, then R2 = 1 - (RMSE1 / RMSE2).
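The three measures can be sketched in a few lines of R. The data are synthetic and the helper names `mse` and `rmse` are made up for this illustration. Note that the R2 printed by `summary()` for an `lm` fit is based on squared errors (1 - MSE1 / MSE2), so it is generally not identical to the RMSE-ratio form given in these notes; the sketch computes both.

```r
# Synthetic data (assumed) with a roughly linear relationship.
set.seed(3)
d <- data.frame(x = runif(200, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(200, sd = 1.5)

fit <- lm(y ~ x, data = d)

# Hypothetical helper functions for this sketch.
mse <- function(actual, predicted) mean((actual - predicted) ^ 2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

model.rmse <- rmse(d$y, fitted(fit))   # RMSE1: model predictions
mean.rmse <- rmse(d$y, mean(d$y))      # RMSE2: mean as the prediction

# R2 as given in the notes (ratio of RMSEs).
r2.notes <- 1 - model.rmse / mean.rmse
# R2 based on squared errors, which matches summary(fit)$r.squared.
r2.squared <- 1 - mse(d$y, fitted(fit)) / mse(d$y, mean(d$y))

print(c(r2.notes = r2.notes, r2.squared = r2.squared,
        r2.summary = summary(fit)$r.squared))
```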

###################################
# Predicting web page traffic
###################################

To examine the relationship between page views and unique visitors, first draw a scatter plot and a density plot:

top.1000.sites <- read.csv('ML_for_Hackers/05-Regression/data/top_1000_sites.tsv',
                           sep = '\t',
                           stringsAsFactors = FALSE)
ggplot(top.1000.sites, aes(x = PageViews, y = UniqueVisitors)) + geom_point()
ggplot(top.1000.sites, aes(x = PageViews)) + geom_density()

In the first scatter plot nearly all the points are squeezed together, so the next step is to look at the density plot. That plot is just as uninformative. At this point, apply a logarithmic transformation to the data and draw the density and scatter plots again.

ggplot(top.1000.sites, aes(x = log(PageViews))) + geom_density()
ggplot(top.1000.sites, aes(x = log(PageViews), y = log(UniqueVisitors))) + geom_point()
# ggplot2's built-in scale_x_log10() and scale_y_log10() can also convert the scales directly, with the same effect

Fit a linear regression and interpret the results:

lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)
summary(lm.fit)

Call: the function call that produced the model

Residuals: the quantiles of the residuals

Coefficients: the estimated coefficients of the regression model

Signif. codes: indicate how large the t value is or how small the p-value is. The t value is the number of standard errors by which the coefficient estimate lies away from 0; as a rule of thumb, a value above 3 counts as significant.

Residual standard error: essentially the RMSE, computed with the residual degrees of freedom in the denominator. Degrees of freedom: the number of values in the sample that remain free to vary. Here it is 1000 - 2 = 998, because two coefficients have been estimated, and estimating two coefficients requires at least 2 data points. Degrees of freedom matter when judging the RMSE: a small RMSE is more convincing when many degrees of freedom remain, since a model with almost as many coefficients as data points can fit the sample closely without generalizing.

Multiple R-squared: the standard R2 value

Adjusted R-squared: the R2 value adjusted for the number of coefficients used; the more coefficients, the larger the penalty applied to R2

F-statistic: measures how much the model improves on predicting with the mean alone. It is an alternative to R2 from which a p-value can be computed.

(Note: the book points out that the p-value and the F-statistic can be misleading when judging a model's predictive power; these two measures are better suited to assessing how well the model fits the data.)
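Each part of the printed summary can also be pulled out programmatically. The sketch below fits a model to synthetic data (1000 rows, so the residual degrees of freedom come out as 998, as in the notes); the data and coefficients are assumptions made for illustration.

```r
# Synthetic data (assumed): 1000 observations, one log-transformed input.
set.seed(4)
d <- data.frame(x = runif(1000, 1, 100))
d$y <- 1.5 + 0.8 * log(d$x) + rnorm(1000, sd = 0.3)

fit <- lm(y ~ log(x), data = d)
s <- summary(fit)

residual.quantiles <- quantile(residuals(fit))  # the "Residuals" block
coef.table <- s$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
sigma.hat <- s$sigma           # residual standard error
df.resid <- fit$df.residual    # residual degrees of freedom
r2 <- s$r.squared              # multiple R-squared
adj.r2 <- s$adj.r.squared      # adjusted R-squared
f.stat <- s$fstatistic         # F value with its two degrees of freedom

print(coef.table)
print(c(sigma = sigma.hat, df = df.resid, r2 = r2, adj.r2 = adj.r2))
```

The t value column is simply the estimate divided by its standard error, which can be verified from `coef.table`.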

#########################################

# Bring in more information and run the regression again

lm.fit <- lm(log(PageViews) ~ HasAdvertising + log(UniqueVisitors) + InEnglish, data = top.1000.sites)
summary(lm.fit)

Analysis:

For the factor HasAdvertising there are two levels, 'Yes' and 'No'. 'Yes' gets its own coefficient, separate from the intercept, while 'No' is absorbed into the intercept.

For the factor InEnglish there are three levels: 'NA', 'Yes' and 'No'. 'NA' is absorbed into the intercept, and 'Yes' and 'No' each get a fitted coefficient.
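How a factor is split between the intercept and separate coefficients can be inspected with `model.matrix()`, which shows the columns that `lm()` would fit. The tiny data frame below is made up for illustration, and the label "Missing" is a hypothetical stand-in for the category the notes call 'NA'.

```r
# Made-up data: a two-level and a three-level factor.
# "Missing" is a hypothetical label standing in for the 'NA' category.
sites <- data.frame(
  HasAdvertising = factor(c("No", "Yes", "Yes", "No", "Yes", "No"),
                          levels = c("No", "Yes")),
  InEnglish = factor(c("Missing", "No", "Yes", "Yes", "No", "Missing"),
                     levels = c("Missing", "No", "Yes"))
)

# The first level of each factor is the baseline and is absorbed into
# the intercept; every other level gets its own indicator column.
design <- model.matrix(~ HasAdvertising + InEnglish, data = sites)
print(colnames(design))
```

Changing the order of `levels` (or calling `relevel()`) changes which level becomes the baseline, and therefore which levels receive their own coefficients.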

To find out which single input has the stronger predictive power, fit a model on each input alone and extract the R2 from each model's summary.
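The comparison described above can be sketched as follows. The data frame is synthetic: its column names mirror the real dataset, but the values and effect sizes are assumptions, so the printed R2 values are not the book's figures.

```r
# Synthetic stand-in for the top-1000-sites data (assumed values).
set.seed(5)
n <- 500
sites <- data.frame(
  UniqueVisitors = exp(rnorm(n, mean = 15, sd = 1)),
  HasAdvertising = sample(c("Yes", "No"), n, replace = TRUE)
)
sites$PageViews <- exp(2 + 1.3 * log(sites$UniqueVisitors) +
                       0.1 * (sites$HasAdvertising == "Yes") +
                       rnorm(n, sd = 1))

# Fit one model per input and pull out R2 from each summary.
r2.visitors <- summary(lm(log(PageViews) ~ log(UniqueVisitors),
                          data = sites))$r.squared
r2.ads <- summary(lm(log(PageViews) ~ HasAdvertising,
                     data = sites))$r.squared

print(c(UniqueVisitors = r2.visitors, HasAdvertising = r2.ads))
```

The input with the larger R2 explains more of the variance in the output when used on its own.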

Note: InEnglish should explain 30%; the figure in the book appears to be wrong. That would also explain why the book says that HasAdvertising, at 1%, can be dropped but does not say the same about InEnglish at 3%.

Analysis: HasAdvertising explains only 1% of the variance. In practice, if an input is easy to obtain, it is worth including in the predictive model along with all the others; if it is hard to obtain, it can be removed from the model.

#################################

A brief look at correlation:

Correlation measures how well a straight line describes the relationship between two variables. A value of 0 means that no line links the two variables; a value of 1 means that a perfect positively sloped line links them; a value of -1 means a perfect negatively sloped line.

In the R language, the correlation can be calculated using the function cor ().

Another way to calculate it is to fit the two variables with lm() after rescaling them; the resulting slope coefficient is the correlation. Rescaling means subtracting each variable's mean and dividing by its standard deviation, which the scale() function in R does directly.
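The equivalence between the two calculations can be checked with a short sketch. The data are synthetic and the variable names are illustrative.

```r
# Synthetic data (assumed) with a moderate linear relationship.
set.seed(6)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)

# Direct calculation.
direct <- cor(x, y)

# Rescale both variables (subtract the mean, divide by the standard
# deviation), then regress one on the other and take the slope.
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))
slope <- unname(coef(lm(ys ~ xs))[2])

print(c(cor = direct, slope = slope))
```

The two numbers agree up to floating-point rounding, because after rescaling both variables have unit variance and the least-squares slope reduces to their covariance.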

Note that correlation only measures how strong the linear relationship between two variables is; it does not say whether one causes the other. Even without a logical causal relationship, knowing whether two variables are correlated is still valuable for prediction problems.

This article is an English version of an article originally written in Chinese on aliyun.com.