Regression Model Performance Evaluation Series 1: QQ Plot
(Erbqi) A QQ plot is a quantile-quantile plot. Put simply, the values of two distributions at the same quantiles are plotted as points (x, y). If the two distributions are very close, the points (x, y) fall near the straight line y = x; otherwise they do not. The prediction results of a regression model can therefore be evaluated from a QQ plot.
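To make the idea concrete, here is a minimal sketch using only the Python standard library. The helper name qq_points and the sample data are hypothetical and only for illustration. It pairs the quantiles of two samples at the same probability levels: identical samples give points on y = x, while a sample shifted by a constant gives points on a line parallel to y = x.

```python
import statistics

def qq_points(sample_a, sample_b, n_quantiles=10):
    """Pair the quantiles of two samples taken at the same probability levels."""
    qa = statistics.quantiles(sample_a, n=n_quantiles)
    qb = statistics.quantiles(sample_b, n=n_quantiles)
    return list(zip(qa, qb))

# Hypothetical data: b is a copy of a shifted upward by 5
a = list(range(1, 101))
b = [v + 5 for v in a]

same = qq_points(a, a, n_quantiles=4)     # identical distributions: points on y = x
shifted = qq_points(a, b, n_quantiles=4)  # shifted distribution: points on y = x + 5

for x, y in shifted:
    print(x, y)
```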
There are two types of QQ plots: the normal QQ plot and the general QQ plot. The difference is that in a normal QQ plot one of the two distributions is the normal distribution. Both types are described below.
Normal QQ plot
The n plotting positions are determined with Filliben's estimate.
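For reference, Filliben's estimate of the uniform order statistic medians can be written as follows. The notation is an assumption introduced here: m_i is the i-th plotting position, n the sample size, and \Phi^{-1} the inverse CDF of the standard normal distribution, which maps each position to a theoretical quantile x_i.

```latex
m_i =
\begin{cases}
1 - 0.5^{1/n}, & i = 1 \\[4pt]
\dfrac{i - 0.3175}{n + 0.365}, & i = 2, \dots, n - 1 \\[8pt]
0.5^{1/n}, & i = n
\end{cases}
\qquad
x_i = \Phi^{-1}(m_i)
```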
Below we try to draw a normal QQ plot.
Using a built-in function from an open-source library is simple, but it hides some of the details.
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from scipy.stats import probplot

matplotlib.style.use('ggplot')

# Randomly generate 100 values from a standard normal distribution
x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)

f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
probplot(x, plot=ax)
plt.show()
The version below spells out those details and paves the way for the general QQ plot.
import numpy as np
from scipy.stats import norm, linregress
from matplotlib import pyplot as plt

def calc_uniform_order_statistic_medians(x):
    # Return the uniform order statistic medians for a sample of size len(x)
    # (Filliben's estimate)
    N = len(x)
    osm_uniform = np.zeros(N, dtype=np.float64)
    osm_uniform[-1] = 0.5 ** (1.0 / N)
    osm_uniform[0] = 1 - osm_uniform[-1]
    i = np.arange(2, N)
    osm_uniform[1:-1] = (i - 0.3175) / (N + 0.365)
    return osm_uniform

# Randomly generate 100 values from a standard normal distribution
x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)
osm_uniform = calc_uniform_order_statistic_medians(x)

# ppf (percent point function) is the inverse of the cdf (cumulative
# distribution function): it returns the value at a given quantile
osm = norm.ppf(osm_uniform)
osr = np.sort(x)

# Slope and intercept of the least-squares line through the (osm, osr) pairs
slope, intercept, rvalue, pvalue, stderr = linregress(osm, osr)

plt.figure(figsize=(10, 8))
plt.plot(osm, osr, 'bo', label='ordered sample values')
plt.plot(osm, slope * osm + intercept, 'r-', label='least-squares fit')
plt.legend()
plt.show()
The figure on the left shows 100 sample points and the figure on the right shows 1000. With 1000 sample points the distribution lies closer to the line y = x, that is, it fits the normal distribution better.
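One way to put a number on this visual impression is the correlation between the theoretical normal quantiles and the sorted sample. The sketch below is an illustration under assumptions, not part of the original analysis: the helper names, the fixed seed, and the use of Pearson correlation as a summary are all choices made here. Values close to 1 mean the points lie close to a straight line, and larger samples typically give values nearer to 1, although that is not guaranteed for any single draw.

```python
import math
import random
from statistics import NormalDist

def filliben_medians(n):
    """Uniform order statistic medians (Filliben's estimate), for n >= 2."""
    m = [0.0] * n
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    for i in range(2, n):               # 1-based positions 2 .. n-1
        m[i - 1] = (i - 0.3175) / (n + 0.365)
    return m

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def normal_qq_correlation(sample):
    """Correlation between theoretical normal quantiles and the sorted sample."""
    theoretical = [NormalDist().inv_cdf(p) for p in filliben_medians(len(sample))]
    return pearson_r(theoretical, sorted(sample))

rng = random.Random(0)  # fixed seed so the run is repeatable
r_100 = normal_qq_correlation([rng.gauss(0.0, 1.0) for _ in range(100)])
r_1000 = normal_qq_correlation([rng.gauss(0.0, 1.0) for _ in range(1000)])
print(r_100, r_1000)
```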
The difference between a general QQ plot and a normal QQ plot is that the reference is not a normal distribution but may be a data set with an arbitrary distribution, which is exactly what we need.
Consider a scenario in which the dotted line is the real network variation and the solid line is the result of a simple smoothing prediction. We want to judge how well the simple smoothing prediction fits by using a general QQ plot.
First, look at the CDF plot of the two curves, where F_X(x) = P(X ≤ x):
The points at which the cumulative distribution is evaluated in this plot are computed with np.linspace(min(X), max(X), len(X)). The result looks a bit strange.
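A small sketch of why an evenly spaced grid can look odd, using hypothetical data with one outlier and only the standard library (the helper ecdf_at is an illustrative name, not from the original code). On the grid, the empirical CDF jumps and then stays flat across the empty region; evaluated at the sorted data values themselves, it rises by exactly 1/n per point.

```python
from bisect import bisect_right

def ecdf_at(sorted_data, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return bisect_right(sorted_data, x) / len(sorted_data)

# Hypothetical data with one large value
data = sorted([1.0, 1.1, 1.2, 1.3, 5.0])
n = len(data)

# Evenly spaced grid, mimicking np.linspace(min(X), max(X), len(X))
lo, hi = data[0], data[-1]
grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]

ecdf_on_grid = [ecdf_at(data, g) for g in grid]   # jumps, then a long plateau
ecdf_on_data = [ecdf_at(data, v) for v in data]   # rises by 1/n at every point

print(ecdf_on_grid)
print(ecdf_on_data)
```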
After redrawing the CDF plot with the raw data values as the evaluation points, do you notice something interesting?
When the two curves contain the same number of points, the CDF values at the same position are identical once both data sets are sorted in ascending order.
Therefore, when the two curves contain the same number of points, the QQ plot only requires sorting both data sets in ascending order and pairing the values by position.
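The equal-size shortcut can be sketched in a few lines. The function name and the two short series are hypothetical; the point is only that the i-th smallest value of each series sits at the same empirical CDF level (i + 1) / n, so sorting and zipping yields the QQ points directly.

```python
def qq_pairs_equal_size(a, b):
    """QQ points for two samples of the same length: sort both and pair by rank."""
    if len(a) != len(b):
        raise ValueError("this shortcut only applies to samples of equal size")
    return list(zip(sorted(a), sorted(b)))

# Hypothetical observed and predicted series of equal length
observed = [3.0, 1.0, 2.0, 5.0, 4.0]
predicted = [2.5, 1.5, 2.0, 4.0, 3.5]

pairs = qq_pairs_equal_size(observed, predicted)

# Both sorted series share the same empirical CDF levels
levels = [(i + 1) / len(observed) for i in range(len(observed))]
for level, (x, y) in zip(levels, pairs):
    print(level, x, y)
```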
We can see that the slope of the QQ plot of the real network curve against the smoothed prediction curve is only 0.79, indicating that the distribution of the smoothed prediction differs considerably from the distribution of the source data.
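A slope below 1 is what one would expect when the prediction has a narrower spread than the data, and smoothing narrows the spread. The sketch below illustrates this on a small hypothetical series with a centered moving average. Both the data and the smoothing window are assumptions and differ from the smooth_ function used in the code that follows.

```python
def moving_average(values, window=3):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        chunk = values[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical oscillating series
raw = [1, 5, 2, 8, 3, 9, 4, 7]
smoothed = moving_average(raw, window=3)

# QQ points for equal-size samples: sort both and pair by rank
slope = least_squares_slope(sorted(raw), sorted(smoothed))
print(slope)  # below 1: the smoothed values cover a narrower range than the raw ones
```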
Code
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import probplot, linregress
import statsmodels.api as sm

httpspeedavg = np.array([
    1821000, 2264000, 2209000, 2203000, 2306000, 2005000, 2428000, 2246000,
    1642000, 721000, 1125000, 1335000, 1367000, 1760000, 1807000, 1761000,
    1767000, 1723000, 1883000, 1645000, 1548000, 1608000, 1372000, 1532000,
    1485000, 1527000, 1618000, 1640000, 1199000, 1627000, 1620000, 1770000,
    1741000, 1744000, 1986000, 1931000, 2410000, 2293000, 2199000, 1982000,
    2036000, 2462000, 2246000, 2071000, 2220000, 2062000, 1741000, 1624000,
    1872000, 1621000, 1426000, 1723000, 1735000, 1443000, 1735000, 2053000,
    1811000, 1958000, 1828000, 1763000, 2185000, 2267000, 2134000, 2253000,
    1719000, 1669000, 1973000, 1615000, 1839000, 1957000, 1809000, 1799000,
    1706000, 1549000, 1546000, 1692000, 2335000, 2611000, 1855000, 2092000,
    2029000, 1695000, 1379000, 2400000, 2522000, 2140000, 2614000, 2399000,
    2376000])

def smooth_(squences, period=5):
    # Simple moving-average smoothing around each position
    res = []
    gap = period // 2  # integer division so the slice bounds stay integers
    right = len(squences)
    for i in range(right):
        start = i - gap if i - gap > 0 else 0
        end = i + gap if i + gap < right else right
        res.append(np.mean(squences[start:end]))
    return res

# Convert to MB and round to two decimals
httpavg = np.round((1.0 * httpspeedavg / 1024 / 1024).tolist(), 2)
smooth = np.round(smooth_((1.0 * httpspeedavg / 1024 / 1024).tolist(), 5), 2)

# Normal QQ plot of the smoothed prediction
f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
probplot(smooth, plot=ax)
# plt.show()

# Normal QQ plot of the source data
f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
probplot(httpavg, plot=ax)
# plt.show()

# CDF plot evaluated on an evenly spaced grid
plt.figure(figsize=(15, 8))
ecdf = sm.distributions.ECDF(httpavg)
x = np.linspace(min(httpavg), max(httpavg), len(httpavg))
y = ecdf(x)
plt.plot(x, y, label='httpavg', color='blue', marker='.')
ecdf1 = sm.distributions.ECDF(smooth)
x1 = np.linspace(min(smooth), max(smooth), len(smooth))
y1 = ecdf1(x1)
plt.plot(x1, y1, label='smooth', color='red', marker='.')
plt.legend(loc='best')
# plt.show()

def cdf(l):
    # Cumulative probability (i + 1) / n for the i-th sorted value
    res = []
    length = len(l)
    for i in range(length):
        res.append(1.0 * (i + 1) / length)
    return res

# CDF plot evaluated at the raw (sorted) data values
plt.figure(figsize=(15, 8))
x = np.sort(httpavg)
y = cdf(x)
plt.plot(x, y, label='httpavg', color='blue', marker='.')
x1 = np.sort(smooth)
y1 = cdf(x1)
plt.plot(x1, y1, label='smooth', color='red', marker='.')
plt.legend(loc='best')
# plt.show()

# General QQ plot: both series have the same length, so sort and pair them
plt.figure(figsize=(10, 8))
httpavg = np.sort(httpavg)
smooth = np.sort(smooth)
slope, intercept, rvalue, pvalue, stderr = linregress(httpavg, smooth)
plt.plot(httpavg, smooth, 'bo', httpavg, slope * httpavg + intercept, 'r-')
xmin = np.amin(httpavg)
xmax = np.amax(httpavg)
ymin = np.amin(smooth)
ymax = np.amax(smooth)
posx = xmin + 0.50 * (xmax - xmin)
posy = ymin + 0.01 * (ymax - ymin)
# The label reads R^2, so square rvalue before printing it
plt.text(posx, posy, "$R^2=%1.4f$ y = %.2f *x + %.2f" % (rvalue ** 2, slope, intercept))
plt.plot(httpavg, httpavg, color='green', label='y=x')
plt.legend(loc='best')
# plt.show()