Distribution fitting and testing
For more MATLAB data analysis videos, follow the link, or search for "MATLAB data analysis and statistics" on NetEase Cloud Classroom: http://study.163.com/course/courseMain.htm?courseId=1003615016
Many statistical inference procedures start by assuming that the population follows a particular distribution (for example, a normal distribution), then construct a statistic under that assumption and draw conclusions from the distribution of that statistic. Because the distribution of the statistic usually depends on the assumed population distribution, the choice of population distribution matters a great deal and affects the reliability of the results. It is therefore necessary to infer the population distribution from the sample observations. This section fits a population distribution to sample data and then tests the fit.
Below, the exam score data are used to infer the distribution of the total scores.
Inferring a distribution from sample data usually starts with descriptive statistics and statistical plots, which give an intuitive impression of the form of the distribution; a formal test then follows. From the earlier analysis we already know that the total scores are approximately normally distributed. Below, the MATLAB functions chi2gof, jbtest, kstest, and lillietest are called to test this.
(1) Chi-square goodness-of-fit test
The chi2gof function performs a chi-square goodness-of-fit test to check whether a sample follows a specified distribution. It groups the sample observations into a number of small intervals (10 by default). In theory, each group should contain at least 5 observations, that is, the expected frequency of each group should be greater than or equal to 5. Where this does not hold, adjacent groups are merged until it does.
Calling syntax of the chi2gof function:
<1> h = chi2gof(x)
Performs a chi-square goodness-of-fit test of whether the sample x follows a normal distribution. The null hypothesis is that x is normally distributed, with the distribution parameters estimated from x. The output h is 0 or 1: h = 0 means the null hypothesis is accepted at the 0.05 significance level, so x is taken to be normally distributed; h = 1 means the null hypothesis is rejected at the 0.05 significance level.
<2> [h, p] = chi2gof(...)
Also returns the p-value of the test. The null hypothesis is rejected when p is less than or equal to the significance level and accepted otherwise.
<3> [h, p, stats] = chi2gof(...)
Also returns a structure stats with the following fields:
chi2stat: the chi-square test statistic
df: the degrees of freedom
edges: the vector of interval boundaries after merging
O: the number of observations falling in each interval (the observed frequencies)
E: the expected (theoretical) frequency of each interval
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% Run the chi-square goodness-of-fit test
[h, p, stats] = chi2gof(score)
h =
     1
p =
    0.0244
stats =
    chi2stat: 9.4038
          df: 3
       edges: [49.0000 68.6000 73.5000 78.4000 83.3000 88.2000 98.0000]
           O: [4 10 6 15 4 10]
           E: [7.4844 6.9183 8.9423 9.1961 7.5245 8.9344]
Since h = 1 and p = 0.0244 < 0.05, the total scores are judged not to follow a normal distribution at the 0.05 significance level.
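As a cross-check on the output above, the statistic and p-value can be recomputed from the printed O and E vectors. This is a minimal sketch, not a description of what chi2gof does internally. It assumes the degrees of freedom are (number of merged intervals) - 1 - (number of estimated parameters), with two parameters (mean and standard deviation) estimated for the normal distribution. It uses the base function gammainc for the chi-square tail probability, so no toolbox is needed.

```matlab
% Observed and expected counts copied from the chi2gof output above
O = [4 10 6 15 4 10];
E = [7.4844 6.9183 8.9423 9.1961 7.5245 8.9344];

% Pearson chi-square statistic: sum over intervals of (O - E)^2 / E
chi2stat = sum((O - E).^2 ./ E);

% Degrees of freedom: intervals - 1 - estimated parameters
% (assumption: two parameters, the mean and the standard deviation)
nparams = 2;
df = numel(O) - 1 - nparams;

% Upper-tail chi-square probability through the regularized upper
% incomplete gamma function; equivalent to 1 - chi2cdf(chi2stat, df)
p = gammainc(chi2stat / 2, df / 2, 'upper');

fprintf('chi2stat = %.4f, df = %d, p = %.4f\n', chi2stat, df, p);
```

Up to rounding of the printed E values, this gives a statistic of about 9.40 with 3 degrees of freedom and a p-value of about 0.024, in line with the chi2gof output.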
<4> [...] = chi2gof(x, name1, val1, name2, val2, ...)
Optional parameter name/value pairs control the initial grouping, the distribution in the null hypothesis, and the significance level. The parameters that control the initial grouping are:
Parameter name (value): description
'nbins' (positive integer): number of groups (intervals); default 10
'ctrs' (vector): midpoint of each interval
'edges' (vector): boundaries of each interval
Note: only one of these three parameters may be specified in a call, because the last two already determine the number of groups implicitly.
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% Specify the midpoints of the initial intervals
ctrs = [50, 60, 70, 78, 85, 94];
% Run the chi-square goodness-of-fit test with the 'ctrs' parameter
[h, p, stats] = chi2gof(score, 'ctrs', ctrs)
h =
     0
p =
    0.3747
stats =
    chi2stat: 0.7879
          df: 1
       edges: [45.0000 74.0000 81.5000 89.5000 98.5000]
           O: [15 16 10 8]
           E: [15.2451 14.0220 12.3619 7.3710]
The 'ctrs' parameter sets the initial number of groups to 6, with the vector ctrs giving the midpoints of the 6 initial intervals. The output shows that merging adjacent intervals reduced these 6 intervals to 4. Since h = 0 and p = 0.3747 > 0.05, the total scores are taken to follow a normal distribution, whose mean can be computed with mean(score) and whose standard deviation can be computed with std(score).
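The edges in this output are consistent with placing interval boundaries halfway between neighbouring midpoints and extending each outer interval by half the adjacent spacing. The sketch below applies that rule. The rule is an assumption inferred from the printed edges, not taken from the chi2gof source, and the names inner, lo, and hi are illustrative.

```matlab
% Midpoints passed to chi2gof through 'ctrs'
ctrs = [50 60 70 78 85 94];

% Boundaries halfway between neighbouring midpoints
inner = (ctrs(1:end-1) + ctrs(2:end)) / 2;

% Outer boundaries: extend by half the spacing to the nearest midpoint
lo = ctrs(1) - (ctrs(2) - ctrs(1)) / 2;
hi = ctrs(end) + (ctrs(end) - ctrs(end-1)) / 2;

edges = [lo, inner, hi];
disp(edges)
```

This yields the seven initial boundaries 45, 55, 65, 74, 81.5, 89.5, and 98.5. Dropping 55 and 65, which corresponds to merging the first three intervals, leaves the edges printed by chi2gof.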
Next, the 'nbins' parameter is used to set the initial number of groups to 6:
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% Run the chi-square goodness-of-fit test with the 'nbins' parameter
[h, p, stats] = chi2gof(score, 'nbins', 6)
h =
     0
p =
    0.3580
stats =
    chi2stat: 0.8449
          df: 1
       edges: [49.0000 73.5000 81.6667 89.8333 98.0000]
           O: [14 17 10 8]
           E: [14.4027 15.1752 12.4207 7.0014]
The last two calls return the same h but different p-values and stats, and both contradict the result of the first chi2gof call. This shows that the chi-square goodness-of-fit test is sensitive to how the data are grouped. When using chi2gof, make sure every group (interval) contains at least 5 observations.
The following parameters of chi2gof control the distribution in the null hypothesis:
Parameter name (value): description
'cdf' (a function name string, a function handle, or a cell array consisting of a function name string or function handle followed by the function's parameter values): specifies the distribution in the null hypothesis; cannot be used together with 'expected'. If the value is a function name string or a function handle, x is the function's only input argument. If it is a cell array, x is the function's first input argument and the parameter values are passed as the subsequent inputs.
'expected' (vector): specifies the expected frequency of each interval; cannot be used together with 'cdf'
'nparams' (nonnegative integer): specifies the number of estimated parameters in the distribution, which determines the degrees of freedom of the chi-square distribution
The following parameters of chi2gof control other aspects of the test:
Parameter name (value): description
'emin' (nonnegative integer, default 5): specifies the minimum expected frequency of an interval. Any initial interval whose expected frequency is below this value is merged with an adjacent interval. If set to 0, no intervals are merged.
'frequency' (vector of the same length as x): specifies the frequency of each element of x
'alpha' (number between 0 and 1, default 0.05): specifies the significance level of the test
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% Compute the mean ms and the standard deviation ss
ms = mean(score);
ss = std(score);
% 'cdf' given as a cell array of a function name string and the function's parameter values
[h1, p1, stats1] = chi2gof(score, 'nbins', 6, 'cdf', {'normcdf', ms, ss})
% 'cdf' given as a cell array of a function handle and the function's parameter values
[h2, p2, stats2] = chi2gof(score, 'nbins', 6, 'cdf', {@normcdf, ms, ss})
% With 6 initial groups, test whether the total scores follow a Poisson distribution with parameter ms
[h3, p3, stats3] = chi2gof(score, 'nbins', 6, 'cdf', {@poisscdf, ms})
h1 =
     0
p1 =
    0.3580
stats1 =
    chi2stat: 0.8449
          df: 1
       edges: [49.0000 73.5000 81.6667 89.8333 98.0000]
           O: [14 17 10 8]
           E: [14.4027 15.1752 12.4207 7.0014]
h2 =
     0
p2 =
    0.3580
stats2 =
    chi2stat: 0.8449
          df: 1
       edges: [49.0000 73.5000 81.6667 89.8333 98.0000]
           O: [14 17 10 8]
           E: [14.4027 15.1752 12.4207 7.0014]
h3 =
     0
p3 =
    0.4871
stats3 =
    chi2stat: 1.4385
          df: 2
       edges: [49.0000 73.5000 81.6667 89.8333 98.0000]
           O: [14 17 10 8]
           E: [13.3213 16.9281 12.8698 5.8808]
The first two outputs show that, at the 0.05 significance level, the score data are consistent with a normal distribution with mean ms and standard deviation ss. The last output shows that the hypothesis of a Poisson distribution with parameter ms is not rejected either.
Taking the test results together, at the 0.05 significance level the score data are taken to follow a normal distribution with mean ms and standard deviation ss.
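The df values above differ between the normal and the Poisson fit. One reading, stated here as an assumption, is that the degrees of freedom are (merged intervals) - 1 - (number of parameter values in the 'cdf' cell array), that is, 2 parameters for normcdf and 1 for poisscdf. Under that assumption the printed p-values can be recovered from chi2stat with the base function gammainc:

```matlab
% Values copied from the outputs above
nbins    = 4;                  % intervals left after merging (length of O)
chi2stat = [0.8449, 1.4385];   % normal fit, Poisson fit

% Assumed parameter counts: normcdf takes ms and ss, poisscdf takes ms
nparams = [2, 1];
df = nbins - 1 - nparams;

% Upper-tail chi-square probabilities, equivalent to 1 - chi2cdf(chi2stat, df)
p = gammainc(chi2stat / 2, df / 2, 'upper');

% One line per fit: degrees of freedom and p-value
fprintf('df = %d, p = %.4f\n', [df; p]);
```

This gives df = 1 and df = 2, with p-values of about 0.358 and 0.487, matching p1 (and p2) and p3.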
(2) Jarque-Bera test
The jbtest function performs the Jarque-Bera test of whether a sample follows a normal distribution; the mean and variance of the distribution do not need to be specified when calling it. A normal distribution has skewness 0 and kurtosis 3, so if a sample is normally distributed, its sample skewness should be close to 0 and its sample kurtosis close to 3. On this basis, the Jarque-Bera test builds its test statistic from the sample skewness and kurtosis.
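To make the construction concrete, the sketch below computes the usual Jarque-Bera statistic JB = n/6 * (S^2 + (K - 3)^2 / 4) from the sample skewness S and kurtosis K, using base functions only. The two small samples are made up for illustration, and the uncorrected (biased) moment estimators are assumed; whether jbtest uses exactly this convention is not confirmed by the text.

```matlab
% Two made-up samples; the second differs only by one extreme value
samples = {[1 2 3 4 5], [1 2 3 4 50]};
jb = zeros(1, numel(samples));

for k = 1:numel(samples)
    x = samples{k};
    n = numel(x);
    d = x - mean(x);              % deviations from the sample mean
    m2 = mean(d.^2);              % second central moment (biased)
    S = mean(d.^3) / m2^1.5;      % sample skewness
    K = mean(d.^4) / m2^2;        % sample kurtosis
    jb(k) = n / 6 * (S^2 + (K - 3)^2 / 4);
end

disp(jb)
```

For the symmetric sample the skewness is 0 and the statistic is small; the single extreme value in the second sample raises it several-fold.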
Calling syntax of the jbtest function:
<1> h = jbtest(x)
Tests whether sample x follows a normal distribution with unknown mean and variance. The null hypothesis is that x is normally distributed. h = 1 means the null hypothesis is rejected at the 0.05 significance level; h = 0 means it is accepted at the 0.05 significance level. jbtest treats NaN (not-a-number) values in x as missing data and ignores them.
<2> h = jbtest(x, alpha)
Performs the test at the significance level alpha. The null and alternative hypotheses are the same as above.
<3> [h, p] = jbtest(...)
Also returns the p-value of the test. The null hypothesis is rejected when p is less than or equal to the given significance level alpha and accepted when p is greater.
<4> [h, p, jbstat] = jbtest(...)
Also returns the observed value of the test statistic, jbstat.
<5> [h, p, jbstat, critval] = jbtest(...)
Also returns the critical value of the test, critval. When jbstat >= critval, the null hypothesis is rejected at the significance level alpha.
<6> [h, p, ...] = jbtest(x, alpha, mctol)
Specifies a termination tolerance mctol and computes an approximate p-value directly by Monte Carlo simulation.
Note: jbtest tests normality using only the sample skewness and kurtosis, so its result is strongly affected by outliers and may be seriously off.
Example:
randn('seed', 0);      % set the initial seed of the random number generator to 0
x = randn(10000, 1);   % generate 10,000 standard normal random numbers
h = jbtest(x)          % run the normality test with jbtest
x(end) = 5;            % replace the last element with 5
h1 = jbtest(x)
h =
     0
h1 =
     1
These results show that, for a vector of 10,000 standard normal random numbers, changing only the last element reverses the conclusion of the test. This illustrates the limitation of jbtest: it is strongly affected by outliers.
Next, jbtest is called to test the score data for normality:
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
[h, p] = jbtest(score)
h =
     1
p =
    0.0193
Since h = 1 and p < 0.05, the null hypothesis is rejected at the 0.05 significance level, and the total scores are judged not to follow a normal distribution. Given the limitation of jbtest, however, this conclusion is only a reference and should be weighed together with the results of the other tests.
(3) One-sample Kolmogorov-Smirnov (K-S) test
The kstest function performs the one-sample K-S test. It can run a two-sided test of whether the sample follows a specified distribution, or a one-sided test of whether the distribution function of the sample lies above or below the specified distribution function. The specified distribution must be fully determined, with no unknown parameters. kstest builds its test statistic from the empirical distribution function of the sample, Fn(x), and the specified distribution function G(x):
KS = max |Fn(x) - G(x)|
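The statistic above can be computed by hand in a few lines. The sketch below does so for a small made-up sample against the standard normal distribution, writing the normal CDF with the base function erfc so that no toolbox is required. It only illustrates the statistic and does not compute a p-value or critical value.

```matlab
% Made-up sample, sorted so that the k-th value has empirical CDF k/n
x = sort([-1.2; -0.4; 0.1; 0.3; 0.9; 1.5]);
n = numel(x);

% Standard normal CDF G(x), written with the base function erfc
G = 0.5 * erfc(-x / sqrt(2));

% Fn is a step function, so the largest gap occurs at a data point:
% compare G with Fn just after and just before each jump
k = (1:n)';
dplus  = max(k / n - G);          % Fn above G
dminus = max(G - (k - 1) / n);    % Fn below G
ks = max(dplus, dminus);

fprintf('KS statistic = %.4f\n', ks);
```

For this sample the largest gap is about 0.21 and occurs just before the jump at x = 0.1.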
Calling syntax of the kstest function:
<1> h = kstest(x)
Tests whether sample x follows the standard normal distribution. The null hypothesis is that x follows the standard normal distribution; the alternative hypothesis is that it does not. h = 1 means the null hypothesis is rejected at the 0.05 significance level; h = 0 means it is accepted at the 0.05 significance level.
<2> h = kstest(x, cdf)
Tests whether sample x follows the continuous distribution defined by cdf, where cdf is either a two-column matrix or a probability distribution object. When cdf is a two-column matrix, its first column holds possible values of the random variable, which may or may not be values in the sample x, and its second column holds the corresponding values of the specified distribution function. If cdf is empty, the test is against the standard normal distribution.
<3> h = kstest(x, cdf, alpha)
Specifies the significance level alpha of the test; the default is 0.05.
<4> h = kstest(x, cdf, alpha, type)
Specifies the type of test (two-sided or one-sided) through type, which can take the following values:
'unequal': two-sided test; the alternative hypothesis is that the population distribution function is not equal to the specified distribution function
'larger': one-sided test; the alternative hypothesis is that the population distribution function is greater than the specified distribution function
'smaller': one-sided test; the alternative hypothesis is that the population distribution function is less than the specified distribution function
<5> [h, p, ksstat, cv] = kstest(...)
Also returns the p-value p, the observed value of the test statistic ksstat, and the critical value cv.
Example: call kstest to test whether the total scores follow a normal distribution
% Read cells G2:G52 of the first worksheet of results.xls (the total scores)
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% First test whether the scores follow the standard normal distribution
h = kstest(score)
% Then test whether they follow a normal distribution with mean ms and standard deviation ss
ms = mean(score);
ss = std(score);
% Build the cdf matrix that specifies the distribution: normal with mean ms and standard deviation ss
% The second column of cdf holds the distribution function values, computed with normcdf
cdf = [score, normcdf(score, ms, ss)];
% Call kstest to test whether the total scores follow the distribution specified by cdf
[h1, p, ksstat, cv] = kstest(score, cdf)
h =
     1
h1 =
     0
p =
    0.5486
ksstat =
    0.1107
cv =
    0.1903
Since h = 1, the null hypothesis that the scores follow the standard normal distribution is rejected at the 0.05 significance level. Since h1 = 0, the second null hypothesis is accepted at the 0.05 significance level: the total scores are taken to follow a normal distribution with mean ms and standard deviation ss.
(4) Two-sample K-S test
The kstest2 function performs the two-sample K-S test. It can run a two-sided test of whether two samples follow the same distribution, or a one-sided test of whether the distribution function of one sample lies above or below that of the other. The distribution functions here are fully determined and contain no unknown parameters. kstest2 compares the empirical distribution functions of the two samples and builds the test statistic
KS = max |F1(x) - F2(x)|
where F1(x) and F2(x) are the empirical distribution functions of the two samples.
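As an illustration of this statistic, the sketch below evaluates both empirical distribution functions at every pooled data point and takes the largest absolute gap. The two samples are made up, and only base functions are used; it computes the statistic alone, not the p-value.

```matlab
% Two made-up samples of different lengths
x1 = [1.2; 2.5; 3.1; 4.8; 5.0];
x2 = [2.0; 2.9; 3.5; 6.1; 7.3; 8.4];

% Both empirical CDFs are step functions that only change at data
% points, so it is enough to evaluate them on the pooled sample
pts = sort([x1; x2]);
F1 = arrayfun(@(t) mean(x1 <= t), pts);   % fraction of x1 at or below t
F2 = arrayfun(@(t) mean(x2 <= t), pts);   % fraction of x2 at or below t

ks2 = max(abs(F1 - F2));
fprintf('two-sample KS statistic = %.4f\n', ks2);
```

Here the largest gap is 0.5, reached at t = 5.0, where all of x1 but only half of x2 lies at or below t.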
Calling syntax of the kstest2 function:
<1> h = kstest2(x1, x2)
Tests whether samples x1 and x2 have the same distribution. The null hypothesis is that x1 and x2 come from the same continuous distribution; the alternative hypothesis is that they come from different continuous distributions. h = 1 means the null hypothesis is rejected at the 0.05 significance level; h = 0 means it is accepted at the 0.05 significance level. x1 and x2 need not have the same length.
<2> h = kstest2(x1, x2, alpha)
Specifies the significance level alpha of the test; the default is 0.05.
<3> h = kstest2(x1, x2, alpha, type)
Specifies the type of test (two-sided or one-sided) through type, which can take the following values:
'unequal': two-sided test; the alternative hypothesis is that the two population distribution functions are not equal
'larger': one-sided test; the alternative hypothesis is that the distribution function of the first population is greater than that of the second
'smaller': one-sided test; the alternative hypothesis is that the distribution function of the first population is less than that of the second
<4> [h, p] = kstest2(...)
Also returns the asymptotic p-value of the test. The null hypothesis is rejected when p is less than or equal to the significance level alpha. The larger the samples, the more accurate the p-value; it is generally required that
(n1*n2)/(n1+n2) >= 4
where n1 and n2 are the sizes of samples x1 and x2 respectively.
<5> [h, p, ks2stat] = kstest2(...)
Also returns the observed value of the test statistic, ks2stat.
Example:
For the data in results.xls, call kstest2 to test whether the total scores of classes 60101 and 60102 follow the same distribution.
% Read the class data from cells B2:B52 of the first worksheet of results.xls
banji = xlsread('results.xls', 'B2:B52');
% Read the total scores from cells G2:G52 of the first worksheet
score = xlsread('results.xls', 'G2:G52');
% Remove the records of students who missed the exam
banji = banji(score > 0);
score = score(score > 0);
% Extract the total scores of classes 60101 and 60102
score1 = score(banji == 60101);
score2 = score(banji == 60102);
% Call kstest2 to test whether the two classes' total scores follow the same distribution
[h, p, ks2stat] = kstest2(score1, score2)
h =
     0
p =
    0.7597
ks2stat =
    0.1839
Since h = 0 and p > 0.05, the null hypothesis is accepted at the 0.05 significance level, and the total scores of the two classes are taken to follow the same distribution.
(5) Lilliefors test
When the population mean and variance are unknown, Lilliefors (1967) proposed replacing the population mean and standard deviation with the sample mean and standard deviation and then applying the K-S test. This is known as the Lilliefors test.
The lillietest function performs the Lilliefors test of whether a sample follows a specified distribution whose parameters are unknown and are estimated from the sample data. The available distributions are the normal, exponential, and extreme value distributions.
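The idea can be sketched in base MATLAB for the normal case: standardize the sample with its own mean and standard deviation, then compute the K-S statistic against the standard normal distribution. The sample below is made up. The sketch stops at the statistic; because the parameters are estimated from the data, the critical values differ from those of the plain K-S test, which is the part lillietest takes care of.

```matlab
% Made-up sample
x = sort([1; 2; 3; 4; 5]);
n = numel(x);

% Replace the unknown population mean and standard deviation
% with the sample mean and sample standard deviation
z = (x - mean(x)) / std(x);

% Standard normal CDF of the standardized values, via the base function erfc
G = 0.5 * erfc(-z / sqrt(2));

% K-S statistic: largest gap between the empirical CDF and G,
% checked just after and just before each jump
k = (1:n)';
kstat = max(max(k / n - G), max(G - (k - 1) / n));

fprintf('Lilliefors statistic = %.4f\n', kstat);
```

For this sample the statistic is about 0.14.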
Calling syntax of the lillietest function:
<1> h = lillietest(x)
Tests whether sample x follows a normal distribution with unknown mean and variance. The null hypothesis is that x is normally distributed. h = 1 means the null hypothesis is rejected at the 0.05 significance level; h = 0 means it is accepted at the 0.05 significance level. lillietest treats NaN (not-a-number) values in x as missing data and ignores them.
<2> h = lillietest(x, alpha)
Performs the test at the significance level alpha. The null and alternative hypotheses are the same as above.
<3> h = lillietest(x, alpha, distr)
Tests whether sample x follows the distribution specified by distr, a string that can be 'norm' (normal distribution, the default), 'exp' (exponential distribution), or 'ev' (extreme value distribution).
<4> [h, p] = lillietest(...)
Also returns the p-value of the test. The null hypothesis is rejected when p is less than or equal to the given significance level alpha and accepted when p is greater.
<5> [h, p, kstat] = lillietest(...)
Also returns the observed value of the test statistic, kstat.
<6> [h, p, kstat, critval] = lillietest(...)
Also returns the critical value of the test, critval. When kstat >= critval, the null hypothesis is rejected at the significance level alpha.
<7> [h, p, ...] = lillietest(x, alpha, distr, mctol)
Specifies a termination tolerance mctol and computes the p-value directly by Monte Carlo simulation.
Example: test the total scores for normality with lillietest
% Read the total scores from cells G2:G52 of the first worksheet of results.xls
score = xlsread('results.xls', 'G2:G52');
% Remove zero scores, which mark students who missed the exam
score = score(score > 0);
% Run the Lilliefors test of whether the total scores follow a normal distribution
[h, p] = lillietest(score)
% Run the Lilliefors test of whether the total scores follow an exponential distribution
[h1, p1] = lillietest(score, 0.05, 'exp')
h =
     0
p =
    0.1346
h1 =
     1
p1 =
   1.0000e-03
Since h = 0, the null hypothesis is accepted at the 0.05 significance level, and the total scores are taken to follow a normal distribution. Because the Lilliefors test substitutes the sample mean and sample standard deviation for the population mean and standard deviation, the mean of this normal distribution is mean(score) and its standard deviation is std(score). Since h1 = 1, the null hypothesis is rejected at the 0.05 significance level, and the total scores are judged not to follow an exponential distribution.
(6) Final conclusion
The chi2gof, jbtest, kstest, kstest2, and lillietest functions were each used for testing. Only the jbtest result is inconsistent with a normal distribution; the other results point to a normal distribution. Given the limitation of jbtest (its result is strongly affected by outliers), the total scores can be taken to follow a normal distribution with mean mean(score) and standard deviation std(score).