6.3 Interval estimation of two normal populations
(1) Both population variances known
The function twosample.ci() below computes the confidence interval in R. Its input parameters are the two samples x and y, the significance level alpha, and the known population standard deviations sigma1 and sigma2.
> twosample.ci=function(x,y,alpha,sigma1,sigma2){
+ n1=length(x); n2=length(y)
+ xbar=mean(x)-mean(y)
+ z=qnorm(1-alpha/2)*sqrt(sigma1^2/n1+sigma2^2/n2)
+ c(xbar-z,xbar+z)
+ }
The z-test function z.test() described above can also be used to compute the confidence interval for the difference of two population means when both population variances are known; the known standard deviations are supplied through the parameters sigma.x and sigma.y respectively.
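To make the meaning of the confidence level concrete, here is a minimal simulation sketch. All numbers in it (the means, standard deviations, sample sizes and the helper name known_var_ci) are hypothetical values chosen for illustration and do not come from the sales data. It rebuilds the known-variance interval many times and counts how often the interval covers the true mean difference.

```r
# Hypothetical settings, chosen only for illustration.
set.seed(1)
mu1 <- 10; mu2 <- 7          # true population means
sigma1 <- 2; sigma2 <- 3     # known population standard deviations
n1 <- 30; n2 <- 40; alpha <- 0.05

# Known-variance interval for mu1 - mu2 (same formula as twosample.ci)
known_var_ci <- function(x, y, alpha, sigma1, sigma2) {
  d <- mean(x) - mean(y)
  half <- qnorm(1 - alpha / 2) *
    sqrt(sigma1^2 / length(x) + sigma2^2 / length(y))
  c(d - half, d + half)
}

# Repeat the experiment and record whether each interval covers mu1 - mu2
covers <- replicate(2000, {
  ci <- known_var_ci(rnorm(n1, mu1, sigma1), rnorm(n2, mu2, sigma2),
                     alpha, sigma1, sigma2)
  ci[1] <= mu1 - mu2 && mu1 - mu2 <= ci[2]
})
coverage <- mean(covers)
print(coverage)   # expected to be close to 1 - alpha
```

The observed proportion should sit near 0.95; it is the long-run coverage, not the probability that any single computed interval contains the true difference.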
Example:
Bamberger's is a store that supplies the community with popular goods. In an effort to maintain the store's good reputation, the company implemented a plan to extend its business hours into the evening. Bamberger's weekly sales data for the 27 typical weeks before and after the extension of business hours (in million units) are used as an example to compute the interval estimate of the difference between the two sample means, so as to show the effect of the plan. First look at the basic form of the data and draw histograms for comparison.
> sales=read.table("D:/program files/rstudio/sales.txt", header=T)
> head(sales)
  prior   post
1 67.90  86.10
2 76.12  71.13
3 68.64 116.25
4 74.94 102.60
5 63.32  97.51
6 50.43  65.39
> attach(sales)
> par(mfrow=c(1,2))   # two panels side by side (layout values assumed)
> hist(prior)   # histograms of sales before and after the plan
> hist(post)
As the histograms show, the sales samples are approximately normally distributed. Assuming that the population standard deviations before and after the plan are 8 and 12, call the function written above and compute the confidence interval of the difference of the sample means at confidence level 1-α = 0.95:
> twosample.ci(post,prior,alpha=0.05,8,12)
[1] 19.10298 29.98295
> z.test(post,prior,sigma.x=8,sigma.y=12)$conf.int
[1] 19.10298 29.98295
attr(,"conf.level")
[1] 0.95
The interval estimate shows that weekly turnover increased significantly after Bamberger's extended its business hours; the interval for the increase is [19.10, 29.98].
(2) Both population variances unknown but equal
Just as for the confidence interval of a single normal population mean, the function t.test() in R can also be used to find the confidence interval for the difference of two population means. Because the population variances are assumed equal, its parameter var.equal must be set to TRUE.
> t.test(post,prior,var.equal=TRUE)$conf.int
[1] 18.66541 30.42051
attr(,"conf.level")
[1] 0.95
The same conclusion follows from this result: weekly turnover increased significantly after Bamberger's extended its business hours, and at the 0.95 confidence level the confidence interval of the increase in turnover is [18.67, 30.42].
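As a rough sketch of what t.test() does when var.equal is TRUE, the pooled-variance interval can be rebuilt by hand. The two samples below are simulated, hypothetical data (the sales.txt file is not reproduced here), so the numbers will differ from the interval reported above.

```r
# Simulated, hypothetical samples standing in for post and prior.
set.seed(2)
x <- rnorm(27, mean = 95, sd = 10)
y <- rnorm(27, mean = 70, sd = 10)
alpha <- 0.05

n1 <- length(x); n2 <- length(y)
# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
# Half-width uses a t quantile with n1 + n2 - 2 degrees of freedom
half <- qt(1 - alpha / 2, df = n1 + n2 - 2) * sqrt(sp2 * (1 / n1 + 1 / n2))
manual_ci <- mean(x) - mean(y) + c(-1, 1) * half

# Built-in result for comparison
builtin_ci <- as.numeric(t.test(x, y, var.equal = TRUE)$conf.int)
print(rbind(manual_ci, builtin_ci))
```

The two rows should agree up to floating-point error.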
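(3) Both population variances unknown and unequal
This case can be handled by writing a function by hand, twosample.ci2(), based on the Welch-Satterthwaite approximation to the degrees of freedom (the same approximation that t.test() applies when var.equal is left at its default value FALSE):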
> twosample.ci2=function(x,y,alpha){
+ n1=length(x); n2=length(y)
+ xbar=mean(x)-mean(y)
+ s1=var(x); s2=var(y)
+ nu=(s1/n1+s2/n2)^2/(s1^2/n1^2/(n1-1)+s2^2/n2^2/(n2-1))
+ z=qt(1-alpha/2,nu)*sqrt(s1/n1+s2/n2)
+ c(xbar-z,xbar+z)
+ }
In practice, unknown and unequal population variances are the most common case. In the Bamberger's example, if the variances before and after the extension of business hours are unknown and unequal, the confidence interval of the difference of the sample means is computed with the function written above:
> twosample.ci2(post,prior,0.05)
[1] 18.63821 30.44771
As before, the sample mean increased after the extension of business hours, and the confidence interval of the difference between the two sample means is [18.64, 30.45]. Because the variances are unknown, less information is available for the interval estimate, so at the same confidence level the estimated confidence interval is somewhat wider.
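The widening can be checked on simulated data. The sketch below uses hypothetical samples (not the sales data) with equal sizes, rebuilds the unequal-variance interval by hand, and compares it with t.test() under its default var.equal = FALSE and with the pooled interval. With equal sample sizes the standard error is the same in both cases, and the approximate degrees of freedom never exceed n1 + n2 - 2, so the unequal-variance interval is at least as wide.

```r
# Simulated, hypothetical samples with clearly different spreads.
set.seed(3)
x <- rnorm(27, mean = 95, sd = 15)
y <- rnorm(27, mean = 70, sd = 8)
alpha <- 0.05

v1 <- var(x) / length(x)   # estimated variance of mean(x)
v2 <- var(y) / length(y)   # estimated variance of mean(y)
# Welch-Satterthwaite approximate degrees of freedom
nu <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
half <- qt(1 - alpha / 2, nu) * sqrt(v1 + v2)
welch_manual <- mean(x) - mean(y) + c(-1, 1) * half

welch_builtin <- as.numeric(t.test(x, y)$conf.int)   # var.equal = FALSE by default
pooled_ci <- as.numeric(t.test(x, y, var.equal = TRUE)$conf.int)

print(rbind(welch_manual, welch_builtin, pooled_ci))
print(nu)   # no larger than length(x) + length(y) - 2
```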
6.3.2 Interval estimation of the ratio of two variances
Estimating a variance ratio is closely related to the hypothesis test on variances, so the function var.test() in R can be used directly to compute the confidence interval for the ratio of two normal population variances. Its calling format is as follows:
var.test(x, y, ratio = 1, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, ...)
> var.test(prior,post)$conf.int
[1] 0.1772458 0.8534348
attr(,"conf.level")
[1] 0.95
The result of var.test() shows that, at the 95% confidence level, the interval estimate of the ratio of the two variances (before relative to after) is [0.1772, 0.8534]. The whole interval lies below 1, which shows that the fluctuation of weekly turnover became larger as turnover increased.
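For reference, the interval reported by var.test() can be reproduced from F quantiles. The following is a minimal sketch on simulated, hypothetical samples (standard deviations 8 and 12 are picked only to echo the example), so its numbers are not those of the sales data.

```r
# Simulated, hypothetical samples; not the sales data.
set.seed(4)
x <- rnorm(27, sd = 8)
y <- rnorm(27, sd = 12)
alpha <- 0.05

ratio <- var(x) / var(y)                  # point estimate of the variance ratio
df1 <- length(x) - 1; df2 <- length(y) - 1
# Divide the estimate by the upper and lower F quantiles
manual_ci <- c(ratio / qf(1 - alpha / 2, df1, df2),
               ratio / qf(alpha / 2, df1, df2))

builtin_ci <- as.numeric(var.test(x, y)$conf.int)
print(rbind(manual_ci, builtin_ci))
```

If the whole interval lies below 1, the data suggest that the second population has the larger variance.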
6.4 Interval estimation of a proportion
Estimating a proportion is simple to do in R: the function prop.test() performs both the estimation and the test of p directly. Its calling format is
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
Here the parameter x is the number of sample units that have the characteristic of interest; n is the sample size; p sets the hypothesized proportion used in the hypothesis test; correct is a logical value that sets whether to apply Yates' continuity correction, and is TRUE by default.
Example:
To understand its housing situation, a city sampled n=2000 households, of which x=214 were hardship households with less than 5 square meters of living space per capita. Use the sample information to compute the confidence interval of the city's hardship-household proportion p (confidence level 0.95).
> prop.test(214,2000)

        1-sample proportions test with continuity correction

data:  214 out of 2000, null probability 0.5
X-squared = 1234, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.09396256 0.12157198
sample estimates:
    p
0.107
The output of the proportion test shows that, at the 95% confidence level, the interval estimate of the hardship-household proportion is [0.0940, 0.1216]. The value p in the last row (under "sample estimates") gives the point estimate: the city's hardship-household proportion is 0.107.
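The interval printed by prop.test() is not the simple "estimate plus or minus z times standard error" interval. As a sketch, and assuming the one-sample case with the default correct = TRUE and null proportion 0.5, it can be rebuilt as a Wilson score interval with continuity correction; the variable names below are illustrative only.

```r
x <- 214; n <- 2000; conf <- 0.95

phat <- x / n
z <- qnorm((1 + conf) / 2)
yates <- 0.5          # continuity correction (assumes |x - n * 0.5| >= 0.5)
z22n <- z^2 / (2 * n)

# Upper limit: shift the estimate up by the correction, then solve the score equation
pc_u <- phat + yates / n
upper <- (pc_u + z22n + z * sqrt(pc_u * (1 - pc_u) / n + z22n / (2 * n))) /
  (1 + 2 * z22n)

# Lower limit: shift the estimate down by the correction
pc_l <- phat - yates / n
lower <- (pc_l + z22n - z * sqrt(pc_l * (1 - pc_l) / n + z22n / (2 * n))) /
  (1 + 2 * z22n)

wilson_ci <- c(lower, upper)
builtin_ci <- as.numeric(prop.test(x, n)$conf.int)
print(rbind(wilson_ci, builtin_ci))
```

Because of the score form, the interval is not exactly symmetric around the point estimate 0.107.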
Strictly speaking, X follows a hypergeometric distribution, and when the sample size is large enough we approximate it with the normal distribution. When the sampling fraction is very small, however, the binomial distribution can be used as the approximation. The function for this is the binomial test binom.test(), which is called in the same way and takes arguments similar to those of prop.test(). In the example above, if the total number of households in the city is large, the sampling fraction is very small and the binomial approximation should be used:
> binom.test(214,2000)

        Exact binomial test

data:  214 and 2000
number of successes = 214, number of trials = 2000, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.09378632 0.12137786
sample estimates:
probability of success
                 0.107
The result of the binomial test shows that, at the 95% confidence level, the interval estimate of the hardship-household proportion is [0.0938, 0.1214], which is very close to the continuity-corrected normal approximation; the point estimate is still 0.107.
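The exact interval from binom.test() is the Clopper-Pearson interval, which can be written with quantiles of the beta distribution. A minimal sketch for the same counts (it assumes 0 < x < n, as is the case here):

```r
x <- 214; n <- 2000; alpha <- 0.05

# Clopper-Pearson limits expressed through beta quantiles
cp_ci <- c(qbeta(alpha / 2, x, n - x + 1),
           qbeta(1 - alpha / 2, x + 1, n - x))

builtin_ci <- as.numeric(binom.test(x, n)$conf.int)
print(rbind(cp_ci, builtin_ci))
```

Both rows should agree with the interval shown in the binom.test() output above.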
"Data Analysis R Language Practice" study notes the sixth chapter parameter estimation and r implementation (bottom)