The sample mean is used to estimate the population mean \(\mu\):
\[\overline{X}_n=\frac{\sum_{i=1}^{n} X_i}{n}\]
\[bias(\overline{X}_n) = E[\overline{X}_n] - \mu = 0\]
Demonstration
\(E[\overline{X}_n] = \frac{1}{n} \sum_{i} E[X_i] = \frac{1}{n} n E[X_i]= E[X_i]= \mu\)
given that \(E[\sum_{i=1}^n g(X_i)]= n E[g(X_1)]\) if \(X_1,\dotsc,X_n\) are identically distributed.
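The unbiasedness shown above can be checked numerically. The following is a minimal simulation sketch; the values of `mu`, `sigma`, `n` and the number of replicates are arbitrary choices for illustration and do not come from the text.

```r
set.seed(1)
mu <- 5      # assumed population mean (illustrative)
sigma <- 2   # assumed population standard deviation (illustrative)
n <- 10      # sample size

# Draw 10,000 independent samples of size n and keep each sample mean
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Monte Carlo estimate of the bias E[xbar] - mu; it should be close to 0
bias_hat <- mean(xbar) - mu
bias_hat
```

Increasing the number of replicates should bring `bias_hat` closer to zero, since only simulation noise separates it from the theoretical bias.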
\[se = \frac{\sigma}{\sqrt{n}}\] where \(\sigma^2\) is the population variance.
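\[V[\overline{X}_n]=V\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2} \sum_i V[X_i] = \frac{1}{n^2} n \sigma^2 = \frac{\sigma^2}{n}\]
The second equality holds when \(X_1,\dotsc,X_n\) are independent, and the standard error is the square root of this variance.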
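As a rough check of the formula \(se = \sigma/\sqrt{n}\), the standard deviation of simulated sample means can be compared with the theoretical value for several sample sizes. This is only a sketch; `sigma`, the sample sizes in `ns` and the number of replicates are assumptions made for the example.

```r
set.seed(2)
sigma <- 2                 # assumed population standard deviation
ns <- c(10, 40, 160)       # sample sizes to compare (illustrative)

# Empirical standard deviation of 20,000 simulated sample means for each n
se_sim <- sapply(ns, function(n) {
  sd(replicate(20000, mean(rnorm(n, mean = 0, sd = sigma))))
})

# Theoretical standard error sigma / sqrt(n)
se_theory <- sigma / sqrt(ns)

round(cbind(n = ns, se_sim = se_sim, se_theory = se_theory), 3)
```

Each fourfold increase in \(n\) halves the standard error, which is what the \(\sqrt{n}\) in the denominator predicts.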
If \(X_1,\dotsc,X_n\) are normally distributed, then \(\overline{X}_n\) is distributed \(N(\mu,\frac{\sigma^2}{n})\).
If \(X_1,\dotsc,X_n\) are not normally distributed, then for large \(n\), by the central limit theorem, \(\overline{X}_n\) is approximately distributed \(N(\mu,\frac{\sigma^2}{n})\).
We say that \(\overline{X}_n\) is asymptotically normal. That means that probability statements about the sample mean can be approximated using a normal distribution.
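To illustrate asymptotic normality, the sketch below draws samples from a skewed distribution and checks a probability statement against the normal approximation. The exponential distribution, `n = 50` and the number of replicates are assumptions chosen for the example.

```r
set.seed(3)
n <- 50

# Exponential(rate = 1) data: mean 1, standard deviation 1, clearly non-normal
xbar <- replicate(20000, mean(rexp(n, rate = 1)))

# Standardize each sample mean with mu = 1 and se = 1 / sqrt(n)
z <- (xbar - 1) / (1 / sqrt(n))

# If the normal approximation is good, about 95% of the z values
# should fall between -1.96 and 1.96
p_inside <- mean(abs(z) < qnorm(0.975))
p_inside
```

A histogram of `z` (for instance with `hist(z)`) should look close to the standard normal bell shape even though the individual observations are strongly skewed.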
\(\overline{X}_n\) is consistent because it converges in probability to \(\mu\).
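One way to see the convergence in probability, assuming the \(X_i\) are independent with finite variance \(\sigma^2\), is Chebyshev's inequality applied to \(\overline{X}_n\), whose mean is \(\mu\) and whose variance is \(\sigma^2/n\). For any \(\epsilon > 0\),
\[P\left(|\overline{X}_n - \mu| \geq \epsilon\right) \leq \frac{V[\overline{X}_n]}{\epsilon^2} = \frac{\sigma^2}{n\,\epsilon^2} \longrightarrow 0 \quad \text{as } n \to \infty\]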
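When \(\sigma\) is known, a \(1-\alpha\) confidence interval for \(\mu\) follows from the normality of the sample mean. That is,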
\[\overline{X}_n \sim N(\mu, \,{se}^2) = N(\mu, \, \sigma^2 / n)\]
\[C_n = (\overline{X}_n - z_{1-\alpha/2} \, se, \,\overline{X}_n + z_{1-\alpha/2}\, se)\]
\[P \left( \overline{X}_n - z_{1-\alpha/2} \,se < \mu < \overline{X}_n + z_{1-\alpha/2} \,se \right) = P \left( -z_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{se} < z_{1-\alpha/2} \right)=P\left(- z_{1-\alpha/2} < Z < z_{1-\alpha/2} \right)=1-\alpha\]
For \(\alpha = 0.05\), \(z_{1-\alpha/2}\) is
qnorm(0.975)
## [1] 1.959964
which is approximately 2. So the 95% confidence interval is approximately the sample mean plus or minus two standard errors.
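The coverage statement \(1-\alpha\) can be checked by simulation: build the known-variance interval many times and count how often it contains \(\mu\). This is a sketch under assumed values of `mu`, `sigma` and `n`, which are not taken from the text.

```r
set.seed(4)
mu <- 10; sigma <- 3; n <- 25   # assumed population values and sample size
se <- sigma / sqrt(n)           # standard error with sigma known
z <- qnorm(0.975)               # z_{1 - alpha/2} for alpha = 0.05

# For each simulated sample, record whether the interval contains mu
covered <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - z * se < mu) && (mu < xbar + z * se)
})

# Proportion of intervals that contain mu; it should be close to 0.95
coverage <- mean(covered)
coverage
```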
\[C_n = (\overline{X}_n - t_{1-\alpha/2} \, \widehat{se},\, \overline{X}_n + t_{1-\alpha/2}\, \widehat{se})\]
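where \(\widehat{se}^2=S^2_n / n\), \(S^2_n\) is the sample variance, and \(t_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the \(t\) distribution with \(n - 1\) degrees of freedom.
REMINDER: For large \(n\), the \(t\) distribution is approximately standard normal, so \(t_{1-\alpha/2} \approx z_{1-\alpha/2}\).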
\[P \left( \overline{X}_n - t_{1-\alpha/2} \,\widehat{se} < \mu < \overline{X}_n + t_{1-\alpha/2} \, \widehat{se} \right) = P \left( -t_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{\widehat{se}} < t_{1-\alpha/2} \right)=P\left( -t_{1-\alpha/2} < T < t_{1-\alpha/2} \right)=1-\alpha\]
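This uses the fact that \(\frac{\overline{X}_n - \mu}{\widehat{se}}\) follows a \(t\) distribution with \(n - 1\) degrees of freedom when \(X_1,\dotsc,X_n\) are independent and normally distributed.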
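The relation between the \(t\) and \(z\) quantiles can be inspected directly with `qt` and `qnorm`. The degrees of freedom listed in `df` are arbitrary values picked for illustration.

```r
df <- c(5, 10, 30, 100, 1000)   # illustrative degrees of freedom

t_q <- qt(0.975, df = df)       # t quantiles for alpha = 0.05
z_q <- qnorm(0.975)             # corresponding standard normal quantile

# The t quantile is larger than z for small df and approaches it as df grows
round(cbind(df = df, t_q = t_q, excess_over_z = t_q - z_q), 4)
```

For small samples the \(t\) interval is therefore noticeably wider than the \(z\) interval, which reflects the extra uncertainty from estimating \(\sigma\).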
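library(dplyr)
# `sleep` holds the extra hours of sleep of the same 10 patients under two drugs
# (group 1 and group 2); the observations are paired by patient
before <- sleep %>% filter(group == 1)
after <- sleep %>% filter(group == 2)
# Per-patient difference between the two groups
difference <- after$extra - before$extra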
Manual calculation
se <- sd(difference)/sqrt(length(difference))
mean(difference) + c(-1, 1) * qt(.975, length(difference)-1) * se
## [1] 0.7001142 2.4598858
Using t.test
t.test(difference)$conf.int
## [1] 0.7001142 2.4598858
## attr(,"conf.level")
## [1] 0.95
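Because the two groups in `sleep` are measurements on the same patients, the interval for the mean difference should also be obtainable by passing both vectors to `t.test` with `paired = TRUE`. The sketch below uses base R only; the names `g1` and `g2` are introduced here for illustration.

```r
# Extra sleep for each group; rows are ordered by patient ID within each group
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# Paired t test on the two vectors
ci_paired <- t.test(g2, g1, paired = TRUE)$conf.int

# One-sample t test on the per-patient differences
ci_diff <- t.test(g2 - g1)$conf.int

# The two intervals should coincide
all.equal(as.numeric(ci_paired), as.numeric(ci_diff))
```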
library(tidyverse)
library(broom)
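# Confidence interval for the mean of column `x` within each group given in `...`
calculate_ci <- function(d, x, ...) {
  d %>%
    group_by(...) %>%
    nest() %>%
    # One-sample t test per group; t.test() always returns a confidence
    # interval (95% by default), which tidy() exposes as conf.low and conf.high
    mutate(temp = map(data,
                      ~ t.test(.x %>% pull(!!enquo(x))) %>%
                        tidy() %>%
                        mutate(sd = sd(.x %>% pull(!!enquo(x)))))) %>%
    unnest(temp)
}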
calculate_ci(mtcars, mpg, cyl)
## # A tibble: 3 × 11
## # Groups: cyl [3]
## cyl data estimate statistic p.value parameter conf.low conf.high method
## <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 6 <tibbl… 19.7 35.9 3.10e- 8 6 18.4 21.1 One Sa…
## 2 4 <tibbl… 26.7 19.6 2.60e- 9 10 23.6 29.7 One Sa…
## 3 8 <tibbl… 15.1 22.1 1.09e-11 13 13.6 16.6 One Sa…
## # … with 2 more variables: alternative <chr>, sd <dbl>
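As a cross-check of the grouped intervals, the same quantities can be computed from the formula \(\overline{X}_n \pm t_{1-\alpha/2}\,\widehat{se}\) in base R. This is a sketch, not a replacement for the function above; the helper name `ci_by_cyl` is introduced here for illustration.

```r
# Split mpg by number of cylinders and apply the t interval formula per group
ci_by_cyl <- t(sapply(split(mtcars$mpg, mtcars$cyl), function(x) {
  n <- length(x)
  se_hat <- sd(x) / sqrt(n)                    # estimated standard error
  half_width <- qt(0.975, df = n - 1) * se_hat # t quantile times se
  c(estimate = mean(x),
    conf.low = mean(x) - half_width,
    conf.high = mean(x) + half_width)
}))

# One row per value of cyl (row names "4", "6", "8")
round(ci_by_cyl, 1)
```

The rows should agree with the `estimate`, `conf.low` and `conf.high` columns reported by `t.test` for each group.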