Regression Model Performance Evaluation Series 1: QQ Plot

Source: Internet
Author: User

(Erbqi) A QQ plot is a quantile-quantile plot. Put simply, the values of two distributions at the same quantiles are plotted as points (x, y). If the two distributions are very close, the points (x, y) fall near the straight line y = x; otherwise they do not. The predictions of a regression model can therefore be evaluated from a QQ plot.
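As a rough illustration of the idea, the sketch below pairs the quantiles of two samples at the same probability levels. The helper name `qq_points`, the choice of 99 probability levels, and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def qq_points(a, b, n_quantiles=99):
    """Return the quantiles of a and b at the same probability levels."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.0, scale=1.0, size=800)  # same distribution, different size

# Each pair (qa[i], qb[i]) is one point of the QQ plot.
qa, qb = qq_points(a, b)

# Largest vertical distance of any point from the line y = x.
max_gap = np.max(np.abs(qa - qb))
print(max_gap)
```

Because both samples come from the same distribution, the pairs should lie close to y = x. Working with quantiles rather than raw values also lets the two samples have different sizes.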

There are two types of QQ plot: the normal QQ plot and the general QQ plot. The difference is that in a normal QQ plot one of the two distributions is the normal distribution. Both types are described below.

Normal QQ plot

Filliben's estimate is used to determine the n points on the theoretical axis (the original figure came from an external source).
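For reference, Filliben's estimate of the uniform order statistic medians, in the form used by the code in this article, can be written as follows. The symbols m_i, x_i and Φ are notation introduced here, where Φ^{-1} is the inverse of the standard normal CDF.

```latex
m_i =
\begin{cases}
1 - 0.5^{1/n}, & i = 1 \\[4pt]
\dfrac{i - 0.3175}{n + 0.365}, & i = 2, \dots, n - 1 \\[4pt]
0.5^{1/n}, & i = n
\end{cases}
\qquad
x_i = \Phi^{-1}(m_i)
```

The theoretical quantiles x_i go on the horizontal axis and the sorted sample values on the vertical axis.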

Below we try to draw a normal QQ plot.

Using the built-in function of an open-source library is simple, but it hides some of the details.

    import numpy as np
    from matplotlib import pyplot as plt
    import matplotlib
    from scipy.stats import probplot

    matplotlib.style.use('ggplot')

    # Randomly generate 100 data points from a normal distribution
    x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)

    f = plt.figure(figsize=(8, 6))
    ax = f.add_subplot(111)
    probplot(x, plot=ax)
    plt.show()

 

Below we work through the details by hand, which paves the way for the general QQ plot.

    import numpy as np
    from scipy.stats import norm, linregress
    from matplotlib import pyplot as plt

    # Return the order statistic medians for a sample of length len(x)
    def calc_uniform_order_statistic_medians(x):
        N = len(x)
        osm_uniform = np.zeros(N, dtype=np.float64)
        osm_uniform[-1] = 0.5 ** (1.0 / N)
        osm_uniform[0] = 1 - osm_uniform[-1]
        i = np.arange(2, N)
        osm_uniform[1:-1] = (i - 0.3175) / (N + 0.365)
        return osm_uniform

    # Randomly generate 100 data points from a normal distribution
    x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)
    osm_uniform = calc_uniform_order_statistic_medians(x)

    # ppf (percent point function) is the inverse of the cdf (cumulative
    # distribution function): it returns the value at a given quantile
    osm = norm.ppf(osm_uniform)
    osr = np.sort(x)

    # Fit a least-squares line through the (osm, osr) points
    slope, intercept, rvalue, pvalue, stderr = linregress(osm, osr)

    plt.figure(figsize=(10, 8))
    plt.plot(osm, osr, 'bo', label='sample quantiles')
    plt.plot(osm, slope * osm + intercept, 'r-', label='fitted line')
    plt.legend()
    plt.show()
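One way to check the hand-written calculation above is to compare it with what `scipy.stats.probplot` returns. The sketch below does that on a synthetic sample. The helper name `filliben_medians` and the random seed are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm, probplot

def filliben_medians(n):
    """Uniform order statistic medians (Filliben's estimate) for n points."""
    m = np.zeros(n, dtype=np.float64)
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1 - m[-1]
    i = np.arange(2, n)
    m[1:-1] = (i - 0.3175) / (n + 0.365)
    return m

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100)

# Theoretical quantiles computed by hand
osm_manual = norm.ppf(filliben_medians(len(x)))

# probplot returns ((theoretical quantiles, ordered sample), (slope, intercept, r))
(osm_scipy, osr_scipy), (slope, intercept, r) = probplot(x)

same_x = np.allclose(osm_manual, osm_scipy)
same_y = np.allclose(np.sort(x), osr_scipy)
print(same_x, same_y)
```

If both comparisons hold, the manual version and the library function place the points at the same coordinates, so the manual code can be adapted safely for the general QQ plot.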

 

The figure on the left shows 100 sample points and the figure on the right shows 1000. The 1000 sample points lie closer to the straight line y = x, that is, they fit the normal distribution better.
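The same comparison can be made numerically instead of by eye. The following sketch fits the probability plot for 100 and for 1000 normal samples and prints the slope, intercept and correlation coefficient of each fit. The seed and the dictionary name `fit_by_size` are assumptions for illustration, and the exact numbers depend on the random draw.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(1)
fit_by_size = {}

for n in (100, 1000):
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    # The second element of the result holds the least-squares fit
    _, (slope, intercept, r) = probplot(x)
    fit_by_size[n] = (slope, intercept, r)
    print(f"n={n}: slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```

For standard normal data the slope should be near 1 and the intercept near 0. Larger samples usually sit closer to those values, although any single draw can deviate.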

A general QQ plot differs from a normal QQ plot in that the reference is not the normal distribution but can be a data set with an arbitrary distribution, which is exactly what we need.

Consider the following scenario (figure from an external source): the dotted line is the actual network variation and the solid line is the result of a simple smoothing prediction. We want to use a general QQ plot to see how well the simple smoothing prediction fits.

First, look at the CDF plot of the two curves, where F_X(x) = P(X ≤ x).

The evaluation points for the cumulative distribution in this plot were computed with np.linspace(min(X), max(X), len(X)), and the result looks a bit strange.

After redrawing the CDF plot with the raw data values as the evaluation points, something interesting appears.

When the two curves have the same number of points and each group of data is sorted in ascending order, the CDF values at the same position are the same.

Therefore, when the two curves have the same number of points, drawing the QQ plot only requires sorting both in ascending order.
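A tiny example makes this observation concrete. The helper name `ecdf_at_sorted` and the two five-element arrays below are made up for illustration.

```python
import numpy as np

def ecdf_at_sorted(values):
    """Return the sorted values and the empirical CDF evaluated at each of them."""
    n = len(values)
    return np.sort(values), np.arange(1, n + 1) / n

a = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
b = np.array([2.5, 0.5, 4.5, 1.5, 3.5])

xa, pa = ecdf_at_sorted(a)
xb, pb = ecdf_at_sorted(b)

# Both samples have 5 points, so the probability levels are identical
# and the sorted values at the same rank are matching quantiles.
qq = np.column_stack((xa, xb))
print(pa)
print(qq)
```

Each row of `qq` is one point of the QQ plot. When the sample sizes differ this shortcut no longer applies, and the quantiles have to be interpolated at common probability levels.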

The line fitted through the QQ plot of the actual network curve against the smoothed prediction curve has a slope of only 0.79, which indicates that the distribution of the smoothed prediction differs considerably from that of the source data.
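To see why a slope below 1 points to a difference in spread, consider a synthetic case that does not use the article's data. Here the second series is an exact linear rescaling of the first, so the slope and intercept of the QQ fit recover the scale and shift. The seed, the scale 0.5 and the shift 3.0 are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.0, size=200)
y = 0.5 * x + 3.0  # same shape as x, but half the spread and shifted

# Equal sample sizes, so sorting both series gives the QQ points
fit = linregress(np.sort(x), np.sort(y))
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.4f}")
```

A moving average tends to narrow the range of a series, which is one plausible reading of a slope such as 0.79. The QQ plot alone does not identify the cause.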

Code

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import probplot, linregress
    import statsmodels.api as sm

    httpspeedavg = np.array([
        1821000, 2264000, 2209000, 2203000, 2306000, 2005000, 2428000,
        2246000, 1642000,  721000, 1125000, 1335000, 1367000, 1760000,
        1807000, 1761000, 1767000, 1723000, 1883000, 1645000, 1548000,
        1608000, 1372000, 1532000, 1485000, 1527000, 1618000, 1640000,
        1199000, 1627000, 1620000, 1770000, 1741000, 1744000, 1986000,
        1931000, 2410000, 2293000, 2199000, 1982000, 2036000, 2462000,
        2246000, 2071000, 2220000, 2062000, 1741000, 1624000, 1872000,
        1621000, 1426000, 1723000, 1735000, 1443000, 1735000, 2053000,
        1811000, 1958000, 1828000, 1763000, 2185000, 2267000, 2134000,
        2253000, 1719000, 1669000, 1973000, 1615000, 1839000, 1957000,
        1809000, 1799000, 1706000, 1549000, 1546000, 1692000, 2335000,
        2611000, 1855000, 2092000, 2029000, 1695000, 1379000, 2400000,
        2522000, 2140000, 2614000, 2399000, 2376000])

    def smooth_(sequences, period=5):
        # Moving average over the window [i - gap, i + gap), clipped at both ends
        res = []
        gap = period // 2  # integer division so the slice bounds are integers
        right = len(sequences)
        for i in range(right):
            lo = i - gap if i - gap > 0 else 0
            hi = i + gap if i + gap < right else right
            res.append(np.mean(sequences[lo:hi]))
        return res

    httpavg = np.round((1.0 * httpspeedavg / 1024 / 1024).tolist(), 2)
    smooth = np.round(smooth_((1.0 * httpspeedavg / 1024 / 1024).tolist(), 5), 2)

    # Normal QQ plot of the smoothed prediction
    f = plt.figure(figsize=(8, 6))
    ax = f.add_subplot(111)
    probplot(smooth, plot=ax)
    # plt.show()

    # Normal QQ plot of the raw data
    f = plt.figure(figsize=(8, 6))
    ax = f.add_subplot(111)
    probplot(httpavg, plot=ax)
    # plt.show()

    # CDF evaluated on an evenly spaced grid
    plt.figure(figsize=(15, 8))
    ecdf = sm.distributions.ECDF(httpavg)
    x = np.linspace(min(httpavg), max(httpavg), len(httpavg))
    y = ecdf(x)
    plt.plot(x, y, label='httpavg', color='blue', marker='.')
    ecdf1 = sm.distributions.ECDF(smooth)
    x1 = np.linspace(min(smooth), max(smooth), len(smooth))
    y1 = ecdf1(x1)
    plt.plot(x1, y1, label='smooth', color='red', marker='.')
    plt.legend(loc='best')
    # plt.show()

    def cdf(l):
        # Empirical CDF value at each position of an already sorted sequence
        res = []
        length = len(l)
        for i in range(length):
            res.append(1.0 * (i + 1) / length)
        return res

    # CDF evaluated at the raw data values
    plt.figure(figsize=(15, 8))
    x = np.sort(httpavg)
    y = cdf(x)
    plt.plot(x, y, label='httpavg', color='blue', marker='.')
    x1 = np.sort(smooth)
    y1 = cdf(x1)
    plt.plot(x1, y1, label='smooth', color='red', marker='.')
    plt.legend(loc='best')
    # plt.show()

    # General QQ plot: sorted raw data against sorted smoothed prediction
    plt.figure(figsize=(10, 8))
    httpavg = np.sort(httpavg)
    smooth = np.sort(smooth)
    slope, intercept, rvalue, pvalue, stderr = linregress(httpavg, smooth)
    plt.plot(httpavg, smooth, 'bo', httpavg, slope * httpavg + intercept, 'r-')
    xmin = np.amin(httpavg)
    xmax = np.amax(httpavg)
    ymin = np.amin(smooth)
    ymax = np.amax(smooth)
    posx = xmin + 0.50 * (xmax - xmin)
    posy = ymin + 0.01 * (ymax - ymin)
    # rvalue is the correlation coefficient, so square it for the R^2 label
    plt.text(posx, posy, "$R^2=%1.4f$ y = %.2f *x + %.2f" % (rvalue ** 2, slope, intercept))
    plt.plot(httpavg, httpavg, color='green', label='y=x')
    plt.legend(loc='best')
    # plt.show()

 

 
