Skewness
In probability theory and statistics, skewness measures the asymmetry of the probability distribution of a real-valued random variable. Skewness can be positive, negative, or even undefined. Negative skew means the tail on the left side of the probability density function is longer than the right, and the bulk of the values (including the median) lie to the right of the mean. Positive skew means the tail on the right side is longer than the left, and the bulk of the values (though not necessarily the median) lie to the left of the mean. Zero skew means the values are distributed evenly on both sides of the mean, though the distribution is not necessarily symmetric.
```python
import matplotlib.pyplot as plt
from scipy.stats import skew

plt.hist(test_scores_negative)
plt.show()
plt.hist(test_scores_positive)
plt.show()
plt.hist(test_scores_normal)
plt.show()

negative_skew = skew(test_scores_negative)
positive_skew = skew(test_scores_positive)
no_skew = skew(test_scores_normal)
```
```
-0.6093247474592194
0.5376950498203763
0.0223645171350847
```
- The figures above are histograms of the three datasets, showing how each is distributed. In the first, most of the data sits to the right of the mean (negative skew); in the second, most of the data sits to the left of the mean (positive skew); and in the last, the data is centered around the mean.
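The snippet above relies on pre-loaded test score arrays. As a self-contained sketch (the arrays, distributions, and seed below are made up for illustration), scipy's `skew` can be checked on synthetic data whose skew direction is known in advance:

```python
import numpy as np
from scipy.stats import skew

# Synthetic stand-ins for the test score arrays (assumed, not the original data)
rng = np.random.default_rng(0)
symmetric = rng.normal(70, 10, 10000)           # symmetric -> skew near 0
right_tailed = rng.exponential(10, 10000)       # long right tail -> positive skew
left_tailed = 100 - rng.exponential(10, 10000)  # long left tail -> negative skew

print(skew(symmetric))     # close to 0
print(skew(right_tailed))  # clearly positive
print(skew(left_tailed))   # clearly negative
```

Flipping a right-tailed sample around a constant (here `100 - x`) mirrors its tail, which is why the last skew comes out negative.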
Kurtosis
In statistics, kurtosis measures the "peakedness" of the probability distribution of a real-valued random variable. Higher kurtosis means that more of the variance comes from infrequent extreme deviations above or below the mean, rather than from frequent, modestly sized deviations.
Kurtosis and skewness are two indicators of how closely data follows a normal distribution. Kurtosis measures the flatness of the data distribution: a distribution with heavy tails has a large kurtosis value, and a normal distribution has a kurtosis of 3. Skewness measures asymmetry: a skewness of 0 describes perfect symmetry, and the normal distribution has a skewness of 0.
- The formula for kurtosis is: Kurt[X] = E[(X − μ)⁴] / σ⁴ (a normal distribution gives 3).
- The formula for skewness is: Skew[X] = E[(X − μ)³] / σ³.
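As a sketch, the moment formulas above can be implemented directly and compared against scipy's `skew` and `kurtosis` (whose defaults, `bias=True` plus `fisher=False` for kurtosis, match these population formulas); the data below is synthetic:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_skew(x):
    # Skew[X] = E[(X - mu)^3] / sigma^3, with the population std (ddof=0)
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

def moment_kurtosis(x):
    # Pearson ("non-excess") kurtosis: E[(X - mu)^4] / sigma^4; a normal gives ~3
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 4).mean() / s ** 4

data = np.random.default_rng(1).normal(0, 1, 10000)
print(moment_skew(data), skew(data))                        # both near 0
print(moment_kurtosis(data), kurtosis(data, fisher=False))  # both near 3
```

Note that `kurtosis` returns *excess* kurtosis by default (`fisher=True`, normal → 0), so `fisher=False` is needed to match the formula above.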
```python
from scipy.stats import kurtosis

# Note: scipy's kurtosis returns excess kurtosis by default (normal -> 0);
# pass fisher=False to get the definition above (normal -> 3).
kurt_platy = kurtosis(test_scores_platy)
```
Modality
Modality refers to the number of modes, or peaks, in a distribution. Real-world data is often unimodal (it has only one mode).
```python
import matplotlib.pyplot as plt

# This plot has one mode, making it unimodal
plt.hist(test_scores_uni)
plt.show()

# This plot has two peaks, and is bimodal
# This could happen if one group of students learned the material, and one learned something else, for example.
plt.hist(test_scores_bi)
plt.show()

# More than one peak means that the plot is multimodal
# We can't easily measure the modality of a plot, like we can with kurtosis or skew.
# Often, the best way to detect multimodality is to observe the plot.
plt.hist(test_scores_multi)
plt.show()
```
Mean
```python
import matplotlib.pyplot as plt

plt.hist(test_scores_normal)
# The axvline function will plot a vertical line over an existing plot
plt.axvline(test_scores_normal.mean())
plt.show()

plt.hist(test_scores_negative)
plt.axvline(test_scores_negative.mean())
plt.show()

plt.hist(test_scores_positive)
plt.axvline(test_scores_positive.mean())
plt.show()
```
Median
- Displaying the median and the mean at the same time
```python
import numpy
import matplotlib.pyplot as plt

# Plot the histogram
plt.hist(test_scores_negative)
# Compute the median
median = numpy.median(test_scores_negative)
# Plot the median in blue (the color argument of "b" means blue)
plt.axvline(median, color="b")
# Plot the mean in red
plt.axvline(test_scores_negative.mean(), color="r")
# Notice how the median is further to the right than the mean?
# It's less sensitive to outliers, and isn't pulled to the left.
plt.show()

plt.hist(test_scores_positive)
plt.axvline(numpy.median(test_scores_positive), color="b")
plt.axvline(test_scores_positive.mean(), color="r")
plt.show()
```
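To see the outlier sensitivity mentioned in the comments above, here is a small self-contained example (the numbers are invented for illustration):

```python
import numpy as np

scores = np.array([60, 65, 70, 75, 80], dtype=float)
print(np.mean(scores), np.median(scores))  # 70.0 70.0

# Add one extreme low outlier: the mean is pulled left, the median barely moves
with_outlier = np.append(scores, 0.0)
print(np.mean(with_outlier))    # drops to about 58.3
print(np.median(with_outlier))  # 67.5
```

A single zero drags the mean down by more than ten points, while the median shifts only from 70 to 67.5.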
- The following statistical analysis uses an NBA data set; its approximate format is shown below:
```
player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,x3p,x3pa,x3p.,x2p,x2pa,x2p.,efg.,ft,fta,ft.,orb,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
quincy acy,sf,23,tot,63,0,847,66,141,0.468,4,15,0.266666666666667,62,126,0.492063492063492,0.482,35,53,0.66,72,144,216,28,23,26,30,122,171,2013-2014,2013
steven adams,c,20,okc,81,20,1197,93,185,0.503,0,0,na,93,185,0.502702702702703,0.503,79,136,0.581,142,190,332,43,40,57,71,203,265,2013-2014,2013
```
- player – name of the player.
- pts – total number of points the player scored in the season.
- ast – total number of assists the player had in the season.
- fg. – the player's field goal percentage for the season.
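A sketch of loading data in this format with pandas. The original file name is not given here, so the two sample rows shown above are embedded as a string instead:

```python
import io
import pandas as pd

# Two rows in the format shown above (embedded only for illustration)
csv_data = """player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,x3p,x3pa,x3p.,x2p,x2pa,x2p.,efg.,ft,fta,ft.,orb,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
quincy acy,sf,23,tot,63,0,847,66,141,0.468,4,15,0.266666666666667,62,126,0.492063492063492,0.482,35,53,0.66,72,144,216,28,23,26,30,122,171,2013-2014,2013
steven adams,c,20,okc,81,20,1197,93,185,0.503,0,0,na,93,185,0.502702702702703,0.503,79,136,0.581,142,190,332,43,40,57,71,203,265,2013-2014,2013"""

# "na" (lowercase) is not in pandas' default missing-value markers, so declare it
nba_stats = pd.read_csv(io.StringIO(csv_data), na_values="na")
print(nba_stats[["player", "pts", "ast", "fg."]])
```

With a real file, `pd.read_csv("path/to/file.csv", na_values="na")` would replace the `StringIO` wrapper.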
Calculating Standard Deviation
- In fact, the std() function can calculate the standard deviation directly, but here we compute it by hand:
```python
# The nba stats are loaded into the nba_stats variable.
def calc_column_deviation(column):
    mean = column.mean()
    variance = 0
    for p in column:
        difference = p - mean
        square_difference = difference ** 2
        variance += square_difference
    variance = variance / len(column)
    return variance ** (1/2)

mp_dev = calc_column_deviation(nba_stats["mp"])
ast_dev = calc_column_deviation(nba_stats["ast"])
```
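As a check (with made-up numbers), the manual computation should agree with numpy's `std`, which also defaults to the population formula (`ddof=0`):

```python
import numpy as np

def calc_column_deviation(column):
    # Condensed version of the loop above: population standard deviation
    mean = sum(column) / len(column)
    variance = sum((p - mean) ** 2 for p in column) / len(column)
    return variance ** (1 / 2)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
manual = calc_column_deviation(values)
builtin = np.std(values)  # np.std defaults to ddof=0, the population formula
print(manual, builtin)    # both 2.0
```

Beware that pandas' `Series.std` defaults to the *sample* formula (`ddof=1`), so it will differ slightly from the function above.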
Normal distribution
- norm.pdf evaluates the normal probability density function: given a mean and standard deviation, it returns the probability density at each point in a vector.
```python
import numpy as np
import matplotlib.pyplot as plt
# The norm module has a pdf function (pdf stands for probability density function)
from scipy.stats import norm

# The arange function generates a numpy vector
# The vector below will start at -1, and go up to, but not including, 1
# It will proceed in "steps" of .01, so the first element will be -1,
# the second -.99, the third -.98, all the way up to .99
points = np.arange(-1, 1, 0.01)

# The norm.pdf function will take the points vector and turn it into a probability vector
# Each element in the vector will correspond to the normal distribution
# (earlier and later elements smaller, peak in the center)
# The distribution will be centered on 0, and will have a standard deviation of .3
probabilities = norm.pdf(points, 0, .3)

# Plot the points values on the x axis and the corresponding probabilities on the y axis
# See the bell curve?
plt.plot(points, probabilities)
plt.show()

points = np.arange(-10, 10, 0.1)
probabilities = norm.pdf(points, 0, 2)
plt.plot(points, probabilities)
plt.show()
```
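One way to sanity-check `norm.pdf` is against the normal density formula itself, f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)); this sketch recomputes it by hand:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 0.3
x = np.arange(-1, 1, 0.01)

# Normal density computed directly from the formula
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# scipy's third argument is the standard deviation (scale), not the variance
print(np.allclose(manual, norm.pdf(x, mu, sigma)))  # True
```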
Covariance
```python
# The nba_stats variable has been loaded.
def covariance(x, y):
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    x_diffs = [i - x_mean for i in x]
    y_diffs = [i - y_mean for i in y]
    codeviates = [x_diffs[i] * y_diffs[i] for i in range(len(x))]
    return sum(codeviates) / len(codeviates)

cov_stl_pf = covariance(nba_stats["stl"], nba_stats["pf"])
cov_fta_pts = covariance(nba_stats["fta"], nba_stats["pts"])
```
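A quick check with invented numbers: the population covariance above should match numpy's `cov` with `bias=True` (`np.cov` defaults to the sample estimator, `ddof=1`):

```python
import numpy as np

def covariance(x, y):
    # Population covariance, as in the function above
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    return sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
manual = covariance(x, y)
builtin = np.cov(x, y, bias=True)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(manual, builtin)  # both 2.5
```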
Correlations
- The most common measure of correlation is Pearson's r, also called the r-value.
```python
from scipy.stats import pearsonr

r, p_value = pearsonr(nba_stats["fga"], nba_stats["pts"])
# As we can see, this is a very high positive r value -- close to 1
print(r)

r_fta_pts, p_value = pearsonr(nba_stats["fta"], nba_stats["pts"])
r_stl_pf, p_value = pearsonr(nba_stats["stl"], nba_stats["pf"])
```
```
0.369861731248
```
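Since `nba_stats` isn't defined here, a self-contained sketch with synthetic data (the seed and noise levels are arbitrary) shows how r behaves for strongly and weakly related variables:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
y_strong = 2 * x + rng.normal(0, 0.1, 1000)  # nearly a linear function of x -> r close to 1
y_weak = rng.normal(0, 1, 1000)              # independent of x -> r close to 0

r_strong, _ = pearsonr(x, y_strong)
r_weak, _ = pearsonr(x, y_weak)
print(r_strong)  # close to 1
print(r_weak)    # close to 0
```

Pearson's r only captures *linear* association, so an r near 0 does not rule out a nonlinear relationship.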
```python
from numpy import cov

# The nba_stats variable has been loaded in.
r_fta_blk = cov(nba_stats["fta"], nba_stats["blk"])[0, 1] / ((nba_stats["fta"].var() * nba_stats["blk"].var()) ** (1/2))
r_ast_stl = cov(nba_stats["ast"], nba_stats["stl"])[0, 1] / ((nba_stats["ast"].var() * nba_stats["stl"].var()) ** (1/2))
```
Probability and Statistics -- Correlations & Covariance