The Stat2.3x Inference (statistical inference) course was taught on the edX platform by the University of California, Berkeley in 2014.
Summary
Dependent Variables (paired samples)
- SD of the difference is $$\sqrt{\sigma_x^2+\sigma_y^2-2\cdot r\cdot\sigma_x\cdot\sigma_y}$$ where $r$ is the correlation between the variables $X$ and $Y$.
- Correlation: $$r=\frac{1}{n}\cdot\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{\sigma_x}\cdot\frac{y_i-\bar{y}}{\sigma_y}\right)$$ where $$\sigma_x=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$ $$\sigma_y=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$ R function:
cor(x, y)
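The two formulas above can be checked numerically. The sketch below uses made-up paired values (`x`, `y`) and a hypothetical helper `pop_sd` for the SD with $n$ in the denominator; none of these names come from the course. It compares the formula for the SD of the difference with the SD computed directly from the differences.

```r
# Hypothetical paired measurements (illustrative values only)
x <- c(31, 35, 29, 33, 36, 30)
y <- c(29, 34, 28, 30, 33, 29)

# SD with n in the denominator, as in the definitions above (assumed helper name)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Correlation as the average product of standard units
r_manual <- mean(((x - mean(x)) / pop_sd(x)) * ((y - mean(y)) / pop_sd(y)))
r_builtin <- cor(x, y)  # the n vs. n-1 choice cancels out in the correlation

# SD of the difference: formula vs. direct computation
sd_formula <- sqrt(pop_sd(x)^2 + pop_sd(y)^2 - 2 * r_manual * pop_sd(x) * pop_sd(y))
sd_direct <- pop_sd(x - y)

print(c(r_manual = r_manual, r_builtin = r_builtin))
print(c(sd_formula = sd_formula, sd_direct = sd_direct))
```

Both pairs of numbers should agree up to floating-point error, since the formula is an algebraic identity for the variance of a difference.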
ADDITIONAL PRACTICE PROBLEMS FOR EXERCISE SET 4
Problem 1
To see whether caffeine affects the speed at which mice can run to a reward, a simple random sample of 50 mice is taken from a large population of mice. Each mouse ran twice: once before the caffeine, and once after. The "before" run times had a mean of 32 seconds and an SD of 3 seconds. The "after" run times had a mean of 30 seconds and an SD of 3.5 seconds. The correlation between the "before" and "after" run times was 0.7. For 32 of the mice, the "after" run time was shorter than the "before". Which of the following is a correct $z$ statistic to test whether mice in this population run faster after caffeine? More than one answer might be correct.
a) $(2-0)/\sqrt{0.424^2 + 0.495^2}$
b) $(2-0)/0.362$
c) $(31.5-25)/3.535$
Solution
The samples are dependent (paired), so (a) is incorrect: it treats the two sets of run times as independent. Based on the sample means, we have $$H_0: \mu_1 = \mu_2$$ $$H_a: \mu_1 > \mu_2$$ and $$n=50, \mu_1=32, \mu_2=30, \sigma_1=3, \sigma_2=3.5, r=0.7$$ Therefore $$\sigma=\sqrt{\sigma_1^2+\sigma_2^2-2\cdot r\cdot\sigma_1\cdot\sigma_2}=\sqrt{3^2+3.5^2-2\times0.7\times3\times3.5}$$ $$SE=\frac{\sigma}{\sqrt{n}}\rightarrow z=\frac{\mu_1-\mu_2}{SE}=\frac{2-0}{0.362}$$ Thus, (b) is correct. The p-value is very small, so we reject $H_0$ in favor of $\mu_1 > \mu_2$. R code:
# SD of the paired differences, then SE of the mean difference
sd_diff = sqrt(3^2 + 3.5^2 - 2 * 0.7 * 3 * 3.5)
se = sd_diff / sqrt(50); z = 2 / se
se
[1] 0.3619392
1 - pnorm(z)
[1] 1.640035e-08
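To see why option (a) fails, it may help to put the paired SE next to the SE that option (a) implies. This is a minimal sketch using the summary statistics from the solution; `paired_se` is a hypothetical helper name, not something from the course.

```r
# Summary statistics from Problem 1 (as stated in the solution)
n <- 50
mean_before <- 32; mean_after <- 30
sd_before <- 3; sd_after <- 3.5
r <- 0.7

# Hypothetical helper: SE of the mean difference for paired samples
paired_se <- function(sd1, sd2, r, n) {
  sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2) / sqrt(n)
}

se_paired <- paired_se(sd_before, sd_after, r, n)

# Option (a) combines the two SEs as if the runs were independent (r = 0)
se_indep <- sqrt((sd_before / sqrt(n))^2 + (sd_after / sqrt(n))^2)

z_paired <- (mean_before - mean_after) / se_paired
z_indep <- (mean_before - mean_after) / se_indep

print(c(se_paired = se_paired, se_indep = se_indep))
print(c(z_paired = z_paired, z_indep = z_indep))
```

Because the correlation is positive, ignoring it overstates the SE and understates $z$, which is why the independent-samples formula is not appropriate here.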
(c) is correct, too. This is a coin toss: $$H_0: p=0.5$$ $$H_a: p > 0.5$$ where $p$ is the proportion of "faster" mice in the population. The observed number of heads is 32. If the null were true, we would expect 25 heads, give or take $$SE=\sqrt{\frac{p\cdot (1-p)}{n}}\cdot n=3.535$$ Thus $$z=\frac{31.5-25}{3.535}$$ where 31.5 is the observed count of 32 with a continuity correction. The p-value is small, so we reject $H_0$ in favor of $p > 0.5$. R code (without the continuity correction):
p = 0.5; n = 50
se = sqrt(p * (1 - p) / n)
z = (32 / 50 - p) / se
1 - pnorm(z)
[1] 0.02385744
# SE on the count scale
se * 50
[1] 3.535534
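The $z$ statistic in option (c) uses 31.5 rather than 32, while the R code above works with the uncorrected proportion. The following sketch, with variable names chosen here for illustration, computes the continuity-corrected normal approximation and compares it with the exact binomial tail probability.

```r
n <- 50      # number of mice
p0 <- 0.5    # null chance that a mouse is faster after caffeine
heads <- 32  # mice that were faster after caffeine

expected <- n * p0                    # expected count under the null
se_count <- sqrt(n * p0 * (1 - p0))   # SE of the count under the null

# Normal approximation with continuity correction, as in option (c)
z_cc <- (heads - 0.5 - expected) / se_count
p_cc <- 1 - pnorm(z_cc)

# Exact binomial chance of 32 or more heads in 50 tosses
p_exact <- sum(dbinom(heads:n, n, p0))

print(c(z_cc = z_cc, p_cc = p_cc, p_exact = p_exact))
```

Either way the p-value falls below the 5% cutoff, so the conclusion in the notes does not depend on which version is used.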
Problem 2
In a study of weight loss, a simple random sample of 500 of the 750 participants was placed in the "Diet 1" group and the remaining 250 in the "Diet 2" group. After the treatment, the average weight loss in the "Diet 1" group is 4.3 pounds with an SD of 1.2 pounds; the average weight loss in the "Diet 2" group is 3.9 pounds with an SD of 1.7 pounds. In the "Diet 1" group, 57% of the participants lost weight, compared to 54% in the "Diet 2" group.
a) To test whether the diet affects the mean amount of weight lost, the $z$ statistic is (fill in the blank): $(0.4-0)/(\ )$
b) To test whether the diet affects the percent of people who lose weight, the $z$ statistic is (fill in the blank): $(3-0)/(\ )$
Solution
Independent samples.
a) $$H_0: \mu_1=\mu_2$$ $$H_a: \mu_1\neq\mu_2$$ and $$n_1=500, n_2=250, \mu_1=4.3, \mu_2=3.9, \sigma_1=1.2, \sigma_2=1.7$$ $$\rightarrow SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\left(\frac{\sigma_1}{\sqrt{n_1}}\right)^2+\left(\frac{\sigma_2}{\sqrt{n_2}}\right)^2}=0.1201666$$ so the blank is about 0.12. The p-value is 0.0008724816, which is smaller than 0.05. We reject $H_0$ and conclude $\mu_1\neq\mu_2$. R code:
mu1 = 4.3; mu2 = 3.9; sd1 = 1.2; sd2 = 1.7; n1 = 500; n2 = 250
se1 = sd1 / sqrt(n1); se2 = sd2 / sqrt(n2)
se = sqrt(se1^2 + se2^2)
se
[1] 0.1201666
z = (mu1 - mu2) / se
# two-sided p-value
(1 - pnorm(z)) * 2
[1] 0.0008724816
b) $$H_0: p_1=p_2$$ $$H_a: p_1\neq p_2$$ and $$p_1=0.57, p_2=0.54, n_1=500, n_2=250, \hat{p}=\frac{n_1\cdot p_1+n_2\cdot p_2}{n_1+n_2}$$ $$\rightarrow SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\frac{\hat{p}\cdot (1-\hat{p})}{n_1}+\frac{\hat{p}\cdot (1-\hat{p})}{n_2}}=0.03844997$$ so the blank is about 3.84 when the difference is written in percentage points. R code:
p1 = 0.57; p2 = 0.54; n1 = 500; n2 = 250
# pooled estimate of the common proportion under the null
p = (n1 * p1 + n2 * p2) / (n1 + n2)
se1 = sqrt(p * (1 - p) / n1); se2 = sqrt(p * (1 - p) / n2)
se = sqrt(se1^2 + se2^2)
se
[1] 0.03844997
z = (p1 - p2) / se
(1 - pnorm(z)) * 2
[1] 0.4352527
Because the p-value is larger than 0.05, we do not reject $H_0$; the data are consistent with $p_1=p_2$.
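Both parts of Problem 2 follow the same pattern, so they can be wrapped in two small functions. This is a sketch only; the function names `z_two_means` and `z_two_props_pooled` are made up for this note and do not come from the course or from any R package.

```r
# Hypothetical helper: two-sample z test for means, independent samples
z_two_means <- function(m1, m2, sd1, sd2, n1, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  z <- (m1 - m2) / se
  list(se = se, z = z, p_two_sided = 2 * (1 - pnorm(abs(z))))
}

# Hypothetical helper: two-sample z test for proportions with a pooled SE
z_two_props_pooled <- function(p1, p2, n1, n2) {
  p_hat <- (n1 * p1 + n2 * p2) / (n1 + n2)
  se <- sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  list(se = se, z = z, p_two_sided = 2 * (1 - pnorm(abs(z))))
}

means <- z_two_means(4.3, 3.9, 1.2, 1.7, 500, 250)
props <- z_two_props_pooled(0.57, 0.54, 500, 250)

print(unlist(means))
print(unlist(props))
```

The `se` entries are the values that fill the blanks in parts (a) and (b); for part (b), multiply by 100 to match the percent scale used in the problem.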
EXERCISE SET 4
If a problem asks for an approximation, please use the methods described in the video lecture segments. Unless the problem says otherwise, please give answers correct to one decimal place according to those methods. Some of the problems below are about simple random samples. If the population size isn't given, you can assume that the correction factor for standard errors is close enough to 1 that it does not need to be computed. Please use the 5% cutoff for p-values unless otherwise instructed in the problem.
Problem 1
In a study of the effect of a medical treatment, a simple random sample of 300 of the 500 participating patients was assigned to the treatment group; the remaining patients formed the control group. When the patients were assessed at the end of the study, favorable outcomes were observed in 162 patients in the treatment group and 97 patients in the control group. Did the treatment have an effect, or was this just chance variation? Perform a statistical test, following the steps in Problems 1A-1D.
1A The null hypothesis is (pick the best among the options):
A. The treatment has an effect, which could be good or bad.
B. The treatment has a good effect.
C. The treatment has no effect.
D. The treatment has a bad effect.
1B Under the null hypothesis, the SE of the difference between the percents of favorable outcomes in the two groups is about ( )%.
1C The $z$ statistic is closest to:
1D The conclusion of the test is (pick the better of the two options): The observed difference is due to chance. The treatment has an effect.
Solution
1A) (C) is correct: the null hypothesis is that the treatment has no effect. $$H_0: p_1=p_2$$ $$H_a: p_1 > p_2$$ where $p_1=\frac{162}{300}, p_2=\frac{97}{200}$.
1B) The two groups come from the same population, so we do not use a pooled estimate. $$SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\frac{p_1\cdot (1-p_1)}{n_1}+\frac{p_2\cdot (1-p_2)}{n_2}}=0.04557274$$ that is, about 4.6%.
1C) $$z=\frac{p_1-p_2}{SE}=1.206862$$
1D) The p-value is $0.1137427 > 0.05$, so we do not reject $H_0$. The conclusion is that the observed difference is due to chance ($p_1=p_2$). R code:
p1 = 162 / 300; p2 = 97 / 200; n1 = 300; n2 = 200
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se; z
[1] 1.206862
1 - pnorm(z)
[1] 0.1137427
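The solution above uses the unpooled SE. A pooled SE, which estimates one common proportion under the null, is a common alternative, and it is worth checking that the answer does not hinge on the choice. The sketch below computes both; the variable names are illustrative.

```r
x1 <- 162; n1 <- 300   # favorable outcomes and size, treatment group
x2 <- 97;  n2 <- 200   # favorable outcomes and size, control group
p1 <- x1 / n1
p2 <- x2 / n2

# Unpooled SE, as in the solution above
se_unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Pooled SE: one common proportion estimated from both groups
p_pool <- (x1 + x2) / (n1 + n2)
se_pooled <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z_unpooled <- (p1 - p2) / se_unpooled
z_pooled <- (p1 - p2) / se_pooled

print(c(se_unpooled = se_unpooled, se_pooled = se_pooled))
print(c(z_unpooled = z_unpooled, z_pooled = z_pooled))
print(c(p_unpooled = 1 - pnorm(z_unpooled), p_pooled = 1 - pnorm(z_pooled)))
```

Both versions give an SE of about 4.6% and a $z$ of about 1.2, so the one-decimal answers and the conclusion are the same.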
Problem 2
In a simple random sample of 250 father-son pairs taken from a large population of such pairs, the mean height of the fathers is 68.5 inches and the SD is 2.5 inches; the mean height of the sons is 69 inches and the SD is 3 inches; the correlation between the heights of the fathers and sons is 0.5. In the population, are the sons taller than their fathers, on average? Or is this just chance variation? Follow the steps in Problems 2A-2B.
2A The SE of the mean difference between the heights of fathers and sons in the sample is closest to:
2B Which of the following most closely represents the result of the test?
A. The result is not statistically significant, so we conclude that it's due to chance variation.
B. The result is not statistically significant, so we conclude that the sons are taller than their fathers, on average.
C. The result is highly statistically significant, so we conclude that the sons are taller than their fathers, on average.
D. The result is highly statistically significant, so we conclude that it's due to chance variation.
Solution
2A) Dependent (paired) variables. $$H_0: \mu_1=\mu_2$$ $$H_a: \mu_1 < \mu_2$$ where $\mu_1, \mu_2$ represent the average heights of fathers and sons, respectively. We have $$n=250, \sigma_1=2.5, \sigma_2=3, \mu_1=68.5, \mu_2=69, r=0.5$$ and $$SE_1=\frac{\sigma_1}{\sqrt{n}}, SE_2=\frac{\sigma_2}{\sqrt{n}}$$ Thus $$SE=\sqrt{SE_1^2+SE_2^2-2\cdot r\cdot SE_1\cdot SE_2}=0.1760682$$
2B) $$z=\frac{\mu_1-\mu_2}{SE}=-2.839809$$ and the p-value is $0.002257026 < 0.05$, which is highly statistically significant. Therefore, we reject $H_0$ and conclude $\mu_1 < \mu_2$, so (C) is correct. R code:
n = 250; mu1 = 68.5; mu2 = 69; sigma1 = 2.5; sigma2 = 3; r = 0.5
se1 = sigma1 / sqrt(n); se2 = sigma2 / sqrt(n)
se = sqrt(se1^2 + se2^2 - 2 * r * se1 * se2)
se
[1] 0.1760682
z = (mu1 - mu2) / se; z
[1] -2.839809
# one-sided (lower tail) p-value
pnorm(z)
[1] 0.002257026
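The SE in 2A is written in terms of the two individual SEs, while the Summary gives the SD of the difference first. The sketch below, with illustrative variable names, confirms that dividing the SD of the difference by $\sqrt{n}$ gives the same SE.

```r
n <- 250
mean_father <- 68.5; mean_son <- 69
sd_father <- 2.5; sd_son <- 3
r <- 0.5

# Route 1: SD of the difference first, then divide by sqrt(n)
sd_diff <- sqrt(sd_father^2 + sd_son^2 - 2 * r * sd_father * sd_son)
se_from_sd <- sd_diff / sqrt(n)

# Route 2: combine the two individual SEs, as in the solution above
se1 <- sd_father / sqrt(n)
se2 <- sd_son / sqrt(n)
se_from_ses <- sqrt(se1^2 + se2^2 - 2 * r * se1 * se2)

z <- (mean_father - mean_son) / se_from_sd
p_value <- pnorm(z)  # one-sided: sons taller means a negative difference

print(c(se_from_sd = se_from_sd, se_from_ses = se_from_ses))
print(c(z = z, p_value = p_value))
```

The two routes agree because the factor $1/\sqrt{n}$ can be pulled out of every term under the square root.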
Problem 3
A group of scientists is studying whether a new medical treatment has an adverse (bad) effect on lung function. Here are data on a simple random sample of ten patients taken from a large population of patients in the study. Both variables are measurements, in liters, of the amount of air the patient can blow out (this is a very rough description of a well-defined measure). The bigger a measurement is, the better the lung function. The "baseline" measurement was taken before the treatment, and the "final" measurement was taken after the treatment.
Baseline Final
4.19 4.17
4.52 4.20
4.50 4.53
3.90 3.95
4.33 4.15
4.30 4.19
3.94 3.96
4.35 4.26
4.21 4.07
4.17 3.93
If you need summary statistics, here are some that are commonly used; the SDs have $n-1 = 9$ in the denominator. Baseline: mean 4.241, SD 0.2065. Final: mean 4.141, SD 0.1798. Correlation between baseline and final: 0.8055. Perform a one-sided test at the 5% level, following the steps in Problems 3A-3C.
3A Based on the information given, which test should you perform?
A. Binomial test for the fairness of a coin
B. One-sample $z$ test for a population mean (quantitative variable; not proportions of zeros and ones)
C. One-sample $t$ test for a population mean
D. Two-sample $z$ test for the difference between population means, based on independent samples
E. Two-sample $z$ test for the effect of a treatment, applied to the results of a randomized controlled experiment
3B The p-value of the test is:
Less than 1%
Between 1% and 5%
Between 5% and 10%
Between 10% and 15%
Between 15% and 20%
3C The conclusion of the test is: The treatment had a bad effect. The results are due to chance variation.
Solution
3A) (A) is correct. The data are paired, so this would be a one-sample test; this rules out (D) and (E). There are only ten observations, so the probabilities for sample means need not be normal; this rules out (B). It cannot be a $t$ test since there is no assumption about the underlying normality of the variables; this rules out (C). ($t$ test: population roughly normal, unknown mean and SD.) The only thing left is to compare the results to tosses of a coin. Define a "head" to be a patient whose score goes down after treatment. Then we test whether the number of heads is like the result of tossing a coin 10 times, or whether there were too many heads for "coin tossing" to be a reasonable conclusion. $$H_0: p=0.5$$ $$H_a: p>0.5$$ where $p$ is the chance of a head; the observed proportion of heads in this sample is 0.7. The mean, SD, and $r$ given in the problem can be computed in R:
base = c(4.19, 4.52, 4.5, 3.9, 4.33, 4.3, 3.94, 4.35, 4.21, 4.17)
final = c(4.17, 4.2, 4.53, 3.95, 4.15, 4.19, 3.96, 4.26, 4.07, 3.93)
mean(base); sd(base); mean(final); sd(final); cor(base, final)
[1] 4.241
[1] 0.2064757
[1] 4.141
[1] 0.1798425
[1] 0.8054805
3B) In 7 of the 10 pairs, the patient's score went down. So we want the chance of 7 or more heads in 10 tosses of a coin. Under the null, the number of heads is binomial with $n=10, p=0.5$, and we sum over $k=7,\dots,10$: $$\sum_{k=7}^{10}\binom{10}{k}\cdot0.5^k\cdot0.5^{10-k}=0.171875$$ so the p-value is between 15% and 20%. R code:
sum(dbinom(7:10, 10, 0.5))
[1] 0.171875
3C) The p-value is 0.171875, which is larger than 0.05, so we do not reject $H_0$. That is, the conclusion is that the results are due to chance variation ($p=0.5$).
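The count of 7 "heads" and the binomial tail can both be obtained from the raw data. This sketch counts the patients whose score went down and compares the hand-summed tail with base R's `binom.test`; the variable names are chosen here for illustration.

```r
baseline <- c(4.19, 4.52, 4.50, 3.90, 4.33, 4.30, 3.94, 4.35, 4.21, 4.17)
final    <- c(4.17, 4.20, 4.53, 3.95, 4.15, 4.19, 3.96, 4.26, 4.07, 3.93)

# A "head" is a patient whose score went down after treatment
went_down <- sum(final < baseline)
n_pairs <- length(baseline)

# Chance of this many or more heads in n_pairs tosses of a fair coin
p_manual <- sum(dbinom(went_down:n_pairs, n_pairs, 0.5))

# Same one-sided test via base R's exact binomial test
p_binom <- binom.test(went_down, n_pairs, p = 0.5, alternative = "greater")$p.value

print(c(went_down = went_down, p_manual = p_manual, p_binom = p_binom))
```

This is the sign test on the paired differences: it uses only the direction of each change, not its size, which is why it needs no normality assumption.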
University of California, Berkeley, Stat2.3x Inference (statistical inference) study note: Section 4, Dependent Samples