Compiled by: Li Lei, Zhang Xinyue, Wang Mengze, Xiaoyu
In addition to the code blocks embedded in the text, you can find the entire program in the Jupyter notebook linked at the end.
Among the many topics in data science and statistics, one I find interesting but difficult to understand is Bayesian analysis. I had the opportunity to learn Bayesian statistical analysis in a course, but I needed to review and reinforce it.
From a personal point of view, I simply want to understand Bayesian theory better and learn how to apply it to real life.
This article is mainly inspired by Rasmus Bååth's "Introduction to Bayesian Data Analysis" series on YouTube. Rasmus Bååth is very good at helping you understand Bayesian analysis intuitively: instead of throwing all kinds of complex formulas at you, he guides you to think step by step.
Rasmus Bååth's video link:
https://www.youtube.com/user/rasmusab/feed
This article analyzes baseball players' batting averages using Bayesian theory, and shows you how to carry out such an analysis yourself. To tell you the truth, I am not a sports fan, and I seldom watch sports games.
So why choose baseball?
"Whether you know it or not, the magic of baseball is precision. No other sport like baseball is completely dependent on the continuity, statistical and orderliness of the motion data. Baseball fans pay more attention to numbers than CPAs. ”
--sports journalist Jim Murray
Some say baseball may be the most thoroughly documented sport in the world; historically, baseball statistics have been accumulating for nearly a century.
However, the accumulated statistics alone are not what makes baseball interesting statistically. Perhaps more important is the nature of the sport itself.
For example, over the course of an at-bat (At Bats, AB, a baseball term for the number of times a batter faces the pitcher), the fielders have very little influence on whether the batter hits a home run.
In other sports, especially football and basketball, the meaning of a player's statistics may be diluted by important events elsewhere on the field. In baseball, statistics play a central role in comparing players' performance.
Baseball statistics comprise many metrics, some defined intuitively and others more complex. The metric I chose to observe is the batting average (AVG). In baseball, the batting average is defined as the number of hits (Hits, H) divided by the number of at-bats, usually reported to three digits after the decimal point.
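To make the definition concrete, here is a minimal sketch of the AVG formula, using made-up numbers rather than any player's actual data:

```python
def batting_average(hits, at_bats):
    """AVG = H / AB, conventionally reported to three decimal places."""
    return round(hits / at_bats, 3)

# Hypothetical players (numbers invented for illustration):
print(batting_average(27, 90))  # 27 hits in 90 at-bats -> 0.3
print(batting_average(1, 1))    # 1 hit in 1 at-bat -> 1.0
```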
Some question the value of the batting average, but as C. Trent Rosecrans said, "Nonetheless, compared with other statistics, the batting average has history and context. We all know what a 0.300 AVG means, we know how bad a 0.200 hitter is, and how great a 0.400 hitter is."
In Major League Baseball (MLB), spring training is a series of exercises and exhibition games before the start of the regular season.
I will try to solve the following two questions:
How to interpret batting averages from 2018 spring training
How to compare the batting averages of two players
Before getting into the code, I will briefly introduce the content of Rasmus Bååth's videos.
First, we need three things to complete the Bayesian analysis.
1. Data
2. A generative model
3. A prior
In my case, the data is the 2018 spring training batting average we observed.
A generative model is a model that produces data when given input parameters. These input parameters define a probability distribution. For example, if you know the mean and standard deviation, you can easily generate normally distributed data by running the code below. Later we will see other types of distributions used in Bayesian analysis.
import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)  # sample size was lost in the original; 1000 assumed
plt.hist(s)
In Bayesian analysis, we invert the generative model and try to infer its parameters from the observed data.
Introduction to Bayesian Data Analysis, Part 1
Finally, the prior refers to the information the model has before it sees the data. For example, are all outcomes equally probable? Is there earlier data we can use? Can we make an educated guess?
First I'll define a function to grab player data from Fox Sports, which can fetch a player's spring training or regular season batting statistics.
Fox Sports link:
https://www.foxsports.com/mlb/stats
import pandas as pd
import seaborn as sns
import requests
from bs4 import BeautifulSoup

plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def batting_stats(url, season):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find_all("table", {"class": "wisbb_standardTable tablesorter"})[0]
    table_head = soup.find_all("thead", {"class": "wisbb_tableHeader"})[0]
    if season == 'spring':
        row_height = len(table.find_all('tr')[:-1])
    else:
        row_height = len(table.find_all('tr')[:-2])
    result_df = pd.DataFrame(columns=[row.text.strip() for row in table_head.find_all('th')],
                             index=range(0, row_height))
    row_marker = 0
    for row in table.find_all('tr')[:-1]:
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            result_df.iat[row_marker, column_marker] = column.text.strip()
            column_marker += 1
        row_marker += 1
    return result_df
Next, we select a player of interest and analyze his statistics.
The Mets' spring training statistics page
If you sort by batting average (AVG), you can see that Dominic Smith (DS) is first while Gavin Cecchini (GC) is second. So are they good players? I don't know. But looking only at AVG, DS tops the list with a value of 1.000.
Searching on Google, I found that "in recent years, the league-wide batting average is usually around 0.260". If so, the AVGs of DS and GC seem far too high. Looking further at the two players' at-bats (AB) and hits (H), it becomes clear that DS had only 1 AB while GC had 7. Checking other players' ABs, the highest AB in 2018 was 13, while the highest on the 2017 Mets was 60.
Scenario One
Suppose I know nothing about the players' past performance and 2018 spring training is my only data source, so I have no idea what range of AVG values to expect. How, then, should I interpret the 2018 spring training statistics?
First, we crawl DS's spring training data.
ds_url_st = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=3"
dominic_smith_spring = batting_stats(ds_url_st, 'spring')
dominic_smith_spring.iloc[-1]
n_draw = 20000
prior_ni = pd.Series(np.random.uniform(0, 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_ni)
plt.title('Uniform distribution (0, 1)')
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')
The prior represents our general belief about something before we see the concrete data. In the distribution above, all values are roughly equally probable (since they are randomly generated, there are slight differences).
This means I know nothing about the players and cannot even make a reasonable guess about AVG. I assume an AVG of 0.000 is just as probable as an AVG of 1.000, or as any value in between.
The data we observed show that with 1 AB and 1 H, the AVG is 1.000, which can be modeled with a binomial distribution. A random variable X following a binomial distribution is the number of successes in n independent yes/no trials, where each trial succeeds with probability p.
In our case, AVG is the probability of success, AB is the number of trials, and H is the number of successes.
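Under these definitions, one run of the generative model is a single binomial draw. A minimal sketch, using a made-up AVG and AB (not taken from the article's data):

```python
import numpy as np

np.random.seed(42)   # fixed seed so the sketch is reproducible
avg, ab = 0.260, 60  # assumed AVG (success probability) and at-bats (trials)
simulated_hits = np.random.binomial(n=ab, p=avg)  # one simulated hit total H
print(simulated_hits)
```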
Keep these terms in mind; next we define the inverse generative model.
We randomly select a probability value from the uniform prior defined above and use it as the parameter of the generative model. Suppose we pick 0.230; that means the binomial probability of success is 23%.
The number of trials is 1 (DS has 1 AB). If the result produced by the generative model matches the result we observed (DS has 1 H), then the probability value 0.230 is kept. Repeating this process and filtering, we eventually obtain a distribution of probability values that reproduce the observed result.
This is the posterior distribution.
def posterior(n_try, k_success, prior):
    hit = list()
    for p in prior:
        hit.append(np.random.binomial(n_try, p))
    posterior = prior[list(map(lambda x: x == k_success, hit))]
    plt.figure(figsize=(8, 5))
    plt.hist(posterior)
    plt.title('Posterior distribution')
    plt.xlabel('Posterior on AVG')
    plt.ylabel('Frequency')
    print('Number of draws left: %d, Posterior mean: %.3f, Posterior median: %.3f, Posterior 95%% quantile interval: %.3f-%.3f' %
          (len(posterior), posterior.mean(), posterior.median(), posterior.quantile(.025), posterior.quantile(.975)))

ds_n_trials = int(dominic_smith_spring[['AB', 'H']].iloc[-1][0])
ds_k_success = int(dominic_smith_spring[['AB', 'H']].iloc[-1][1])
posterior(ds_n_trials, ds_k_success, prior_ni)
The 95% interval of the posterior distribution is called the credible interval, which differs subtly from the confidence interval of frequentist statistics. There is another kind of credible interval we can use, which I will mention when discussing PyMC3 later.
The main difference between the Bayesian credible interval and the frequentist confidence interval lies in their interpretation. Bayesian probability reflects a person's subjective belief; under this view, the probability that the true parameter lies in the credible interval is a meaningful, measurable quantity. This is appealing because it lets us describe the parameter directly in terms of probability.
Many people find this a more natural way to understand a probability interval, and it is easy to explain. A confidence interval, in contrast, only lets you ask whether a procedure's interval contains the true parameter.
If we collected new samples, computed confidence intervals, and repeated the process many times, then 95% of the computed confidence intervals would contain the true AVG value.
Credible interval: given the observed data, the probability that the true AVG falls within the credible interval is 95%.
Confidence interval: when we repeatedly compute confidence intervals from data of this kind, 95% of those intervals will contain the true AVG.
Note the difference: the credible interval is a probability statement about the parameter value given fixed bounds, while the confidence interval is a probability statement about the bounds given a fixed parameter value.
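The frequentist interpretation can be checked by simulation. The sketch below uses a hypothetical true AVG of 0.260 and a normal-approximation (Wald) 95% confidence interval, which is an assumption of this sketch rather than anything from the article; repeating the experiment many times, the fraction of intervals that catch the true value should land near 95%:

```python
import numpy as np

np.random.seed(0)
true_avg, ab, n_repeats = 0.260, 200, 2000  # assumed true AVG, at-bats, repetitions
covered = 0
for _ in range(n_repeats):
    hits = np.random.binomial(ab, true_avg)
    p_hat = hits / ab
    se = np.sqrt(p_hat * (1 - p_hat) / ab)        # standard error of the proportion
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    if lo <= true_avg <= hi:                      # did this interval catch the truth?
        covered += 1
print('coverage: %.3f' % (covered / n_repeats))   # should land near 0.95
```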
In real life, we want to know the true parameter rather than the bounds, so the Bayesian credible interval is the more appropriate choice here: we are only interested in the player's true AVG.
Based on the posterior distribution above, I am 95% certain that DS's true AVG lies between 0.155 and 0.987. But this range is far too wide. In other words, with no prior knowledge and only one trial, I am not at all sure what DS's true AVG is.
Scenario Two
For the second scenario, we assume that we know the statistics for the spring training of the previous year.
dominic_smith_spring.iloc[-2:]
Now that we have statistics from 2017 spring training, our prior should reflect this information. Note that DS's AVG in spring 2017 was 0.167, so a uniform prior no longer reflects what we know.
The beta distribution is a continuous probability distribution with two parameters, alpha and beta. One of its most common uses is to model the uncertainty about the success probability of an experiment.
In particular, given k successes observed in n trials (and starting from a uniform prior), the conditional distribution of the success probability is a beta distribution with alpha = k+1 and beta = n−k+1.
More on the beta distribution:
https://www.statlect.com/probability-distributions/beta-distribution
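One quick way to see this conjugate result is to compare the rejection-sampling posterior with the closed-form Beta(k+1, n−k+1). The sketch below uses hypothetical numbers (3 hits in 10 at-bats, not the article's data) and checks that the mean of the filtered uniform draws matches the analytic posterior mean (k+1)/(n+2):

```python
import numpy as np

np.random.seed(1)
n, k = 10, 3  # hypothetical: 3 hits in 10 at-bats
prior = np.random.uniform(0, 1, size=200000)  # flat prior on AVG
sims = np.random.binomial(n, prior)           # one simulated stat line per draw
post = prior[sims == k]                       # keep only draws that reproduce the data
print('simulated mean: %.3f, analytic mean: %.3f' % (post.mean(), (k + 1) / (n + 2)))
```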
n_draw = 20000
prior_trials = int(dominic_smith_spring.iloc[3].AB)
prior_success = int(dominic_smith_spring.iloc[3].H)
prior_i = pd.Series(np.random.beta(prior_success + 1, prior_trials - prior_success + 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_i)
plt.title('Beta distribution (a=%d, b=%d)' % (prior_success + 1, prior_trials - prior_success + 1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')
posterior(ds_n_trials, ds_k_success, prior_i)
Compared with the posterior obtained in Scenario One under a uniform prior, the 95% credible interval has narrowed. Now I am 95% certain that DS's AVG lies between 0.095 and 0.340.
However, a hitter with an AVG above 0.300 is already considered good, and the estimate here implies the player could be anywhere from the worst to among the best. We need more data to narrow the credible interval.
Scenario Three
In this scenario, suppose I have not only the 2017 spring training statistics but also the 2017 regular season statistics. How does this affect the posterior and the conclusions?
ds_url = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=1"
dominic_smith_reg = batting_stats(ds_url, 'regular')
dominic_smith = dominic_smith_reg.append(dominic_smith_spring.iloc[3], ignore_index=True)
dominic_smith
ds_prior_trials = pd.to_numeric(dominic_smith.AB).sum()
ds_prior_success = pd.to_numeric(dominic_smith.H).sum()
n_draw = 20000
prior_i_02 = pd.Series(np.random.beta(ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_i_02)
plt.title('Beta distribution (a=%d, b=%d)' % (ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')
posterior(ds_n_trials, ds_k_success, prior_i_02)
Now I am 95% certain that DS's true AVG lies between 0.146 and 0.258. Although this range is still not very precise, the credible interval in Scenario Three is much narrower than in Scenarios One and Two.
Scenario Four
I want to compare two players and see who does better on AVG. The observed data is from 2018 spring training, and the prior knowledge comes from 2017 spring training and the regular season. Now I want to compare the batting averages of DS and GC.
In Scenario Three, I discarded all generated results inconsistent with the observed data and then sampled from the remaining simulated draws. But this kind of random generation and filtering is computationally expensive and slow.
Therefore, we can use tools that let the sampler spend more time in high-probability regions, improving efficiency. Probabilistic programming tools like PyMC3 handle the sampling process efficiently with clever algorithms such as HMC/NUTS.
PyMC3 link:
https://github.com/pymc-devs/pymc3
HMC/NUTS link:
http://blog.fastforwardlabs.com/2017/01/30/the-algorithms-behind-probabilistic-programming.html
We'll start by grabbing Gavin Cecchini's statistics from Fox Sports.
gc_url_st = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=3"
gc_url_reg = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=1"
gavin_cecchini_spring = batting_stats(gc_url_st, 'spring')
gavin_cecchini_reg = batting_stats(gc_url_reg, 'regular')
gc_n_trials = int(gavin_cecchini_spring.iloc[1].AB)
gc_k_success = int(gavin_cecchini_spring.iloc[1].H)
gc_prior = pd.DataFrame(gavin_cecchini_reg.iloc[1]).transpose().append(gavin_cecchini_spring.iloc[0])
gc_prior
gc_prior_trials = pd.to_numeric(gc_prior.AB).sum()
gc_prior_success = pd.to_numeric(gc_prior.H).sum()

def observed_data_generator(n_try, observed_data):
    result = np.ones(observed_data)
    fails = n_try - observed_data
    result = np.append(result, np.zeros(fails))
    return result

ds_observed = observed_data_generator(ds_n_trials, ds_k_success)
gc_observed = observed_data_generator(gc_n_trials, gc_k_success)
Next, we fit a PyMC3 model.
import pymc3 as pm

with pm.Model() as model_a:
    d_p = pm.Beta('DS_avg', ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1)
    g_p = pm.Beta('GC_avg', gc_prior_success + 1, gc_prior_trials - gc_prior_success + 1)
    DS = pm.Bernoulli('DS', p=d_p, observed=ds_observed)
    GC = pm.Bernoulli('GC', p=g_p, observed=gc_observed)
    DvG = pm.Deterministic('DvG', d_p - g_p)
    start = pm.find_MAP()
    trace = pm.sample(10000, start=start)

pm.plot_posterior(trace, varnames=['DS_avg', 'GC_avg', 'DvG'], ref_val=0)
If we use PyMC3's plot_posterior function to plot the posterior distributions of DS_avg, GC_avg, and DvG (DS_avg − GC_avg), we see that the intervals shown in the figure are labeled HPD rather than quantiles.
The highest posterior density (HPD) interval is another kind of credible interval we can compute from the posterior. The HPD interval is the narrowest interval covering the required probability mass, consisting of the highest-density values, and it always includes the mode.
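For a unimodal posterior, the HPD interval can be approximated from samples by scanning every window containing 95% of the sorted draws and keeping the narrowest one. A minimal sketch (the `hpd_interval` helper and the Beta(15, 45) "posterior" are illustrative assumptions, not the article's fitted trace):

```python
import numpy as np

def hpd_interval(samples, cred_mass=0.95):
    """Narrowest interval containing cred_mass of the samples (assumes a unimodal posterior)."""
    sorted_s = np.sort(samples)
    n_kept = int(np.ceil(cred_mass * len(sorted_s)))
    # width of every contiguous window holding n_kept sorted points
    widths = sorted_s[n_kept - 1:] - sorted_s[:len(sorted_s) - n_kept + 1]
    best = np.argmin(widths)
    return sorted_s[best], sorted_s[best + n_kept - 1]

np.random.seed(2)
draws = np.random.beta(15, 45, size=50000)  # a hypothetical posterior for AVG, mean 0.25
lo, hi = hpd_interval(draws)
print('95%% HPD: %.3f-%.3f' % (lo, hi))
```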
In another of Rasmus Bååth's articles, the quantile interval and the highest density interval are compared, with a set of simple and clear diagrams. Below are the modes and the 95% highest density intervals of six different posterior distributions.
Article link:
http://www.sumsar.net/blog/2014/10/probable-points-and-credible-intervals-part-one/
Probable Points and Credible Intervals, Part 1: Graphical Summaries
The quantile interval contains the median: the probability of falling to the left of the median is 50%, and the probability of falling to the right is 50%. For a 95% quantile interval, the probability of falling outside the interval on either side is 2.5%.
Probable Points and Credible Intervals, Part 1: Graphical Summaries
In the case of DS's and GC's AVG, their modes and medians do not look very different, and if that is so, the two players' HPD intervals and quantile intervals should be roughly the same. Let's see what they look like.
pm.summary(trace)
We can see that for both DS and GC, the HPD interval and the quantile interval are either exactly the same or differ only slightly after the decimal point.
The problem is, I want to judge who is the better player by AVG, and for now I cannot tell. At best, I cannot say with 95% certainty that the two players' AVGs differ.
The calculations and the resulting plots show that the difference between the two players' AVGs (denoted DvG = DS − GC, so a positive DvG means DS is better and a negative one means GC is better) lies between −0.162 and 0.033.
This interval includes 0.000, which means there may be no difference between the two players' AVGs. So even though there is some evidence that GC is better than DS (the posterior of DvG has more area in the negative region than in the positive region), I cannot conclude with 95% certainty that the two players' AVGs differ.
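The claim that "the posterior of DvG has more area in the negative region" can be quantified directly from posterior samples as P(DvG < 0). Since the fitted PyMC3 trace is not reproduced here, the sketch below uses hypothetical Beta posteriors as stand-ins for the two players:

```python
import numpy as np

np.random.seed(3)
n_draw = 50000
# Hypothetical posterior draws for each player's AVG (stand-ins for the real trace):
ds_avg = np.random.beta(20, 60, size=n_draw)  # posterior mean 0.25
gc_avg = np.random.beta(30, 70, size=n_draw)  # posterior mean 0.30
dvg = ds_avg - gc_avg
print('P(GC better than DS) = %.3f' % (dvg < 0).mean())
```

With real samples, the same one-liner `(trace['DvG'] < 0).mean()` would give the posterior probability that GC is the better hitter.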
Maybe with more data I could determine the difference between them. After all, this is the essence of Bayesian theory: not that truth doesn't exist, but that learning it is a slow process, and as evidence accumulates, we keep revising our beliefs.
Python Code and Bayesian Theory Tell You Who the Best Baseball Player Is