## Definition

The sample mean of $$X_1,\dotsc,X_n$$ is used to estimate the population mean $$\mu$$:

$\overline{X}_n=\frac{\sum_{i=1}^{n} X_i}{n}$
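As a quick check of the formula, the sketch below computes the sample mean by hand and compares it with R's built-in `mean()`. The vector `x` holds made-up illustrative values; it is an assumption, not data from these notes.

```r
# Hypothetical sample (illustrative values only)
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)
n <- length(x)

# Sample mean from the definition: sum of the observations divided by n
x_bar <- sum(x) / n

# TRUE if the manual calculation agrees with the built-in function
all.equal(x_bar, mean(x))
```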

## Bias

$bias(\overline{X}_n) = E[\overline{X}_n] - \mu = 0$

### Demonstration

$$E[\overline{X}_n] = \frac{1}{n} \sum_{i} E[X_i] = \frac{1}{n} n E[X_i]= E[X_i]= \mu$$
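The derivation above can also be checked numerically. The following is a minimal simulation sketch, assuming a normal parent distribution with arbitrarily chosen `mu`, `sigma` and `n` (all illustrative values): the average of many simulated sample means should sit very close to `mu`.

```r
set.seed(1)

# Assumed population parameters and sample size (illustrative)
mu <- 5
sigma <- 2
n <- 10

# Each replicate draws one sample of size n and records its mean
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Estimated bias: average of the sample means minus the true mean
mean(sample_means) - mu
```

The estimated bias is not exactly zero because of simulation noise, but it shrinks as the number of replicates grows.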

## Standard error

$se = \frac{\sigma}{\sqrt{n}}$ where $$\sigma^2$$ is the population variance.

### Demonstration

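Assuming the $$X_i$$ are independent with common variance $$\sigma^2$$,

$V[\overline{X}_n]=V\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2} \sum_i V[X_i] = \frac{1}{n^2} n \sigma^2 = \frac{\sigma^2}{n}$

and the standard error is the square root of this variance.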
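To see the result in practice, the sketch below compares the standard deviation of simulated sample means with $$\sigma/\sqrt{n}$$. The normal parent distribution and the values of `sigma` and `n` are assumptions chosen only for illustration.

```r
set.seed(2)

# Assumed population standard deviation and sample size (illustrative)
sigma <- 2
n <- 25

# Simulate 10,000 sample means from independent samples of size n
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))

# The two values should be close to each other
c(simulated = sd(sample_means), theoretical = sigma / sqrt(n))
```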

## Distribution

### For a normal distributed random variable (normal parent distribution)

If $$X_1,\dotsc,X_n$$ are independent and normally distributed with mean $$\mu$$ and variance $$\sigma^2$$, then $$\overline{X}_n$$ is distributed $$N(\mu,\frac{\sigma^2}{n})$$ for any $$n$$.

### For a large n (asymptotic normality, central limit theorem)

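If $$X_1,\dotsc,X_n$$ are independent and identically distributed with mean $$\mu$$ and finite variance $$\sigma^2$$, then for large $$n$$, $$\overline{X}_n$$ is approximately distributed $$N(\mu,\frac{\sigma^2}{n})$$, whatever the parent distribution.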

#### Demonstration (not shown)

We say that $$\overline{X}_n$$ is asymptotically normal. That means that probability statements about the sample mean can be approximated using a normal distribution.
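A small simulation can illustrate this approximation. The sketch below assumes an exponential parent distribution (clearly non-normal) with `rate = 1` and `n = 100`, both illustrative choices, and checks how often the standardized sample mean falls within the central 95% range of a standard normal.

```r
set.seed(3)

# Exponential parent with rate 1: mean and standard deviation are both 1 / rate
n <- 100
rate <- 1
mu <- 1 / rate
sigma <- 1 / rate

# Simulate sample means and standardize them with the true mean and standard error
sample_means <- replicate(10000, mean(rexp(n, rate = rate)))
z <- (sample_means - mu) / (sigma / sqrt(n))

# Proportion inside the normal 95% range; close to 0.95 if the approximation is good
mean(abs(z) < qnorm(0.975))
```

Repeating this with a much smaller `n` should show a visibly worse match, since the approximation is asymptotic.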

## Consistency (law of large numbers)

$$\overline{X}_n$$ is consistent because it converges in probability to $$\mu$$.
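Consistency can be illustrated by tracking the running mean of a single growing sample. This is only a sketch under assumed values (`mu = 3`, standard deviation 2, normal parent); the absolute error should tend to shrink as more observations are included.

```r
set.seed(4)

# Assumed true mean (illustrative) and one long sequence of draws
mu <- 3
x <- rnorm(100000, mean = mu, sd = 2)

# Running sample mean after 1, 2, ..., 100000 observations
running_mean <- cumsum(x) / seq_along(x)

# Absolute error of the sample mean at increasing sample sizes
n_values <- c(10, 100, 1000, 10000, 100000)
abs(running_mean[n_values] - mu)
```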

## Confidence intervals when the sample mean is distributed normally

That is,

$\overline{X}_n \sim N(\mu, \,{se}^2) = N(\mu, \, \sigma^2 / n)$

### When $$\sigma^2$$ is known

$C_n = (\overline{X}_n - z_{1-\alpha/2} \, se, \,\overline{X}_n + z_{1-\alpha/2}\, se)$

#### Demonstration

$P \left( \overline{X}_n - z_{1-\alpha/2} \,se < \mu < \overline{X}_n + z_{1-\alpha/2} \,se \right) = P \left( -z_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{se} < z_{1-\alpha/2} \right)=P\left(- z_{1-\alpha/2} < Z < z_{1-\alpha/2} \right)=1-\alpha$
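The probability statement above says that, over repeated samples, about $$1-\alpha$$ of the intervals contain $$\mu$$. The following sketch checks this by simulation, assuming a normal parent distribution with known `sigma`; all parameter values are illustrative.

```r
set.seed(5)

# Assumed population parameters, sample size and level (illustrative)
mu <- 10
sigma <- 3
n <- 20
alpha <- 0.05

# Known standard error and normal quantile
se <- sigma / sqrt(n)
z <- qnorm(1 - alpha / 2)

# For each simulated sample, record whether the interval contains mu
covered <- replicate(10000, {
  x_bar <- mean(rnorm(n, mean = mu, sd = sigma))
  (x_bar - z * se < mu) && (mu < x_bar + z * se)
})

# Empirical coverage; should be close to 1 - alpha
mean(covered)
```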

#### Example

For $$\alpha = 0.05$$, $$z_{1-\alpha/2}$$ is

qnorm(0.975)
## [1] 1.959964

which is approximately 2. So the 95% confidence interval is approximately the sample mean plus or minus two standard errors.

### When $$\sigma^2$$ is unknown

$C_n = (\overline{X}_n - t_{1-\alpha/2} \, \widehat{se},\, \overline{X}_n + t_{1-\alpha/2}\, \widehat{se})$

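with $$\widehat{se}^2=S^2_n / n$$, where $$S^2_n$$ is the sample variance and $$t_{1-\alpha/2}$$ is the $$1-\alpha/2$$ quantile of the $$t$$ distribution with $$n - 1$$ degrees of freedom.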

REMINDER: For large $$n$$, the $$t$$ distribution approaches the standard normal distribution, so $$t_{1-\alpha/2} \approx z_{1-\alpha/2}$$.
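The convergence of the $$t$$ quantile to the normal quantile can be seen directly with `qt()` and `qnorm()`. The sample sizes below are arbitrary illustrative choices.

```r
# Illustrative sample sizes
n <- c(5, 10, 30, 100, 1000)

# 97.5% quantiles: t with n - 1 degrees of freedom versus standard normal
t_quantile <- qt(0.975, df = n - 1)
z_quantile <- qnorm(0.975)

# The difference shrinks towards zero as n grows
data.frame(n = n, t_quantile = t_quantile, difference = t_quantile - z_quantile)
```

For small samples the $$t$$ quantile is noticeably larger, so the interval is wider than the one based on $$z$$.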

#### Demonstration

$P \left( \overline{X}_n - t_{1-\alpha/2} \,\widehat{se} < \mu < \overline{X}_n + t_{1-\alpha/2} \, \widehat{se} \right) = P \left( -t_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{\widehat{se}} < t_{1-\alpha/2} \right)=P\left( -t_{1-\alpha/2} < T < t_{1-\alpha/2} \right)=1-\alpha$

This uses the fact that, when the $$X_i$$ are independent and normally distributed, $$\frac{\overline{X}_n - \mu}{\widehat{se}}$$ follows a $$t$$ distribution with $$n - 1$$ degrees of freedom.
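The practical consequence is that, with an estimated standard error and a small sample, the $$t$$ quantile gives the nominal coverage while the normal quantile gives intervals that are too narrow. The sketch below illustrates this by simulation; the helper `covers()` and all parameter values are hypothetical names and choices made for this example only.

```r
set.seed(6)

# Assumed population parameters and a deliberately small sample size
mu <- 0
sigma <- 1
n <- 5

# Hypothetical helper: proportion of 10,000 simulated intervals
# x_bar +/- q * se_hat that contain mu, for a given quantile q
covers <- function(q) {
  mean(replicate(10000, {
    x <- rnorm(n, mean = mu, sd = sigma)
    se_hat <- sd(x) / sqrt(n)
    abs(mean(x) - mu) < q * se_hat
  }))
}

coverage_t <- covers(qt(0.975, df = n - 1))  # t quantile with n - 1 df
coverage_z <- covers(qnorm(0.975))           # normal quantile

# Coverage with t should be near 0.95; coverage with z should fall short
c(t = coverage_t, z = coverage_z)
```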

#### Example

library(dplyr)
before <- sleep %>% filter(group == 1)
after <- sleep %>% filter(group == 2)
difference <- after$extra - before$extra

Manual calculation

se <- sd(difference)/sqrt(length(difference))

mean(difference) + c(-1, 1) * qt(.975, length(difference)-1) * se
## [1] 0.7001142 2.4598858

Using t.test

t.test(difference)$conf.int
## [1] 0.7001142 2.4598858
## attr(,"conf.level")
## [1] 0.95

#### Helper function to calculate confidence intervals

library(tidyverse)
library(broom)

calculate_ci <- function(d, x, ...) {
  d %>%
    group_by(...) %>%
    nest() %>%
    mutate(temp = map(data,
                      ~t.test(.x %>% pull(!!enquo(x)), conf.int = TRUE) %>%
                        tidy() %>%
                        mutate(sd = sd(.x %>% pull(!!enquo(x)))))) %>%
    unnest(temp)
}

calculate_ci(mtcars, mpg, cyl)
## # A tibble: 3 × 11
## # Groups:   cyl [3]
##     cyl data    estimate statistic  p.value parameter conf.low conf.high method
##   <dbl> <list>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>
## 1     6 <tibbl…     19.7      35.9 3.10e- 8         6     18.4      21.1 One Sa…
## 2     4 <tibbl…     26.7      19.6 2.60e- 9        10     23.6      29.7 One Sa…
## 3     8 <tibbl…     15.1      22.1 1.09e-11        13     13.6      16.6 One Sa…
## # … with 2 more variables: alternative <chr>, sd <dbl>