Sample mean

Definition

The sample mean is used to estimate the population mean \(\mu\). For a sample \(X_1,\dotsc,X_n\), it is defined as

\[\overline{X}_n=\frac{\sum_{i=1}^{n} X_i}{n}\]

Bias

\[bias(\overline{X}_n) = E[\overline{X}_n] - \mu = 0\]

Demonstration

\(E[\overline{X}_n] = \frac{1}{n} \sum_{i} E[X_i] = \frac{1}{n} n E[X_i]= E[X_i]= \mu\)

The first equality follows from the linearity of expectation. The second uses \(E[\sum_{i=1}^n g(X_i)]= n E[g(X_1)]\), which holds when \(X_1,\dotsc,X_n\) are identically distributed.
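The unbiasedness result can also be checked numerically. The following is a minimal simulation sketch, not part of the original notes; the exponential parent distribution, the values of `mu`, `n` and `reps`, and the seed are all illustrative assumptions.

```r
set.seed(1)        # fixed seed so the run is reproducible (arbitrary choice)
mu <- 3            # assumed population mean (illustrative)
n <- 10            # sample size
reps <- 20000      # number of simulated samples

# Each column holds one sample of size n from an exponential with mean mu.
samples <- matrix(rexp(n * reps, rate = 1 / mu), nrow = n)
sample_means <- colMeans(samples)

# The average of the sample means estimates E[X-bar_n]; its difference
# from mu estimates the bias and should be close to zero.
estimated_bias <- mean(sample_means) - mu
estimated_bias
```

The estimated bias is not exactly zero because of simulation noise, but it shrinks as `reps` grows.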

Standard error

The standard error is the standard deviation of the sampling distribution of \(\overline{X}_n\):

\[se = \frac{\sigma}{\sqrt{n}}\] where \(\sigma^2\) is the population variance.

Demonstration

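\[V[\overline{X}_n]=V\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2} \sum_i V[X_i] = \frac{1}{n^2} n \sigma^2 = \frac{\sigma^2}{n}\]

The second equality uses \(V[aX] = a^2 V[X]\) and the fact that the variance of a sum of independent random variables is the sum of their variances, so it requires \(X_1,\dotsc,X_n\) to be independent. Taking the square root gives \(se = \sigma / \sqrt{n}\).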
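As a numerical check of the formula \(\sigma/\sqrt{n}\), the sketch below compares the standard deviation of many simulated sample means with the theoretical value. The normal parent distribution and the values of `sigma`, `n` and `reps` are assumptions chosen only for illustration.

```r
set.seed(2)        # arbitrary seed for reproducibility
sigma <- 2         # assumed population standard deviation
n <- 25            # sample size
reps <- 20000      # number of simulated samples

# One independent sample of size n per column.
samples <- matrix(rnorm(n * reps, mean = 0, sd = sigma), nrow = n)
sample_means <- colMeans(samples)

empirical_se <- sd(sample_means)      # spread of the simulated sample means
theoretical_se <- sigma / sqrt(n)     # sigma / sqrt(n)

c(empirical = empirical_se, theoretical = theoretical_se)
```

The two values should agree up to simulation noise.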

Distribution

For a normally distributed random variable (normal parent distribution)

If \(X_1,\dotsc,X_n\) are independent and normally distributed with mean \(\mu\) and variance \(\sigma^2\), then \(\overline{X}_n\) is distributed \(N(\mu,\frac{\sigma^2}{n})\)

Demonstration

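The sum of independent normally distributed random variables is itself normally distributed, with mean equal to the sum of the means and variance equal to the sum of the variances. Hence \(\sum_{i} X_i \sim N(n\mu, n\sigma^2)\), and dividing by \(n\) gives \(\overline{X}_n \sim N(\mu, \frac{\sigma^2}{n})\).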

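For large \(n\) (asymptotic normality, central limit theorem)

If \(X_1,\dotsc,X_n\) are independent and identically distributed with mean \(\mu\) and finite variance \(\sigma^2\), then for large \(n\), \(\overline{X}_n\) is approximately distributed \(N(\mu,\frac{\sigma^2}{n})\), whatever the parent distribution.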

Demonstration (not shown)

We say that \(\overline{X}_n\) is asymptotically normal. That means that probability statements about the sample mean can be approximated using a normal distribution.
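To illustrate the normal approximation, the sketch below draws samples from a clearly non-normal parent (an exponential with rate 1, an assumption made for this example) and compares empirical probabilities for the standardized sample mean with the standard normal probabilities. The sample size, the evaluation points and the seed are also illustrative choices.

```r
set.seed(3)        # arbitrary seed for reproducibility
n <- 50            # sample size
reps <- 20000      # number of simulated samples
mu <- 1            # mean of an exponential with rate 1
sigma <- 1         # standard deviation of an exponential with rate 1

samples <- matrix(rexp(n * reps, rate = 1), nrow = n)
sample_means <- colMeans(samples)

# Standardize the sample means with the true mean and standard error.
z <- (sample_means - mu) / (sigma / sqrt(n))

# Compare P(Z_n <= q) from the simulation with the normal probability.
points <- c(-1, 0, 1, 2)
empirical <- sapply(points, function(q) mean(z <= q))
normal <- pnorm(points)

round(rbind(empirical, normal), 3)
```

The rows should be close but not identical: the approximation improves as `n` increases, and the skewness of the exponential leaves a small discrepancy at moderate `n`.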

Consistency (law of large numbers)

\(\overline{X}_n\) is consistent because it converges in probability to \(\mu\).

Demonstration (not shown)
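Although the proof is omitted, consistency can be visualized with a short simulation. The sketch below tracks the running sample mean of Bernoulli draws; the success probability `mu`, the total number of draws and the seed are assumptions for illustration only.

```r
set.seed(4)                  # arbitrary seed for reproducibility
mu <- 0.3                    # assumed population mean (success probability)
x <- rbinom(100000, size = 1, prob = mu)

# Sample mean computed from the first n observations, for every n.
running_mean <- cumsum(x) / seq_along(x)

# Absolute estimation error at increasing sample sizes.
n_values <- c(10, 100, 1000, 10000, 100000)
abs_error <- abs(running_mean[n_values] - mu)

data.frame(n = n_values, abs_error = abs_error)
```

The error does not have to decrease at every step, but it tends toward zero as \(n\) grows, which is what convergence in probability predicts.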

Confidence intervals when the sample mean is distributed normally

That is,

\[\overline{X}_n \sim N(\mu, \,{se}^2) = N(\mu, \, \sigma^2 / n)\]

When \(\sigma^2\) is known

\[C_n = (\overline{X}_n - z_{1-\alpha/2} \, se, \,\overline{X}_n + z_{1-\alpha/2}\, se)\] where \(z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution.

Demonstration

\[P \left( \overline{X}_n - z_{1-\alpha/2} \,se < \mu < \overline{X}_n + z_{1-\alpha/2} \,se \right) = P \left( -z_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{se} < z_{1-\alpha/2} \right)=P\left(- z_{1-\alpha/2} < Z < z_{1-\alpha/2} \right)=1-\alpha\]
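The coverage statement can be checked by simulation. The sketch below builds the known-variance interval for many simulated samples and counts how often it contains \(\mu\). The values of `mu`, `sigma`, `n`, `alpha` and `reps` are assumptions picked for illustration.

```r
set.seed(5)        # arbitrary seed for reproducibility
mu <- 10           # assumed population mean
sigma <- 2         # assumed (known) population standard deviation
n <- 20            # sample size
alpha <- 0.05      # 1 - alpha is the confidence level
reps <- 20000      # number of simulated samples

se <- sigma / sqrt(n)
z <- qnorm(1 - alpha / 2)

samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = n)
sample_means <- colMeans(samples)

# Interval endpoints for every simulated sample.
lower <- sample_means - z * se
upper <- sample_means + z * se

# Proportion of intervals that contain the true mean.
coverage <- mean(lower < mu & mu < upper)
coverage
```

The proportion should be close to \(1-\alpha = 0.95\), up to simulation noise.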

Example

For \(\alpha = 0.05\), \(z_{1-\alpha/2}\) is

qnorm(0.975)

## [1] 1.959964

which is approximately 2. So, the 95% confidence interval is approximately the sample mean plus or minus two standard errors.

When \(\sigma^2\) is unknown

\[C_n = (\overline{X}_n - t_{1-\alpha/2} \, \widehat{se},\, \overline{X}_n + t_{1-\alpha/2}\, \widehat{se})\]

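where \(\widehat{se}^2=S^2_n / n\), \(S^2_n\) is the sample variance, and \(t_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the \(t\) distribution with \(n - 1\) degrees of freedom.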

REMINDER: For large \(n\), the \(t\) distribution is approximately standard normal, so \(t_{1-\alpha/2} \approx z_{1-\alpha/2}\).
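A quick way to see this is to compare the two quantiles directly. The sketch below uses `qt` and `qnorm` from base R; the particular degrees of freedom listed are an arbitrary illustrative choice.

```r
alpha <- 0.05
df <- c(2, 5, 10, 30, 100, 1000)       # illustrative degrees of freedom

t_quantile <- qt(1 - alpha / 2, df)    # t critical value for each df
z_quantile <- qnorm(1 - alpha / 2)     # standard normal critical value

# The gap between the t and z critical values shrinks as df grows.
data.frame(df = df, t_quantile = t_quantile, gap = t_quantile - z_quantile)
```

The \(t\) critical value is always larger than the normal one, so \(t\)-based intervals are wider, noticeably so for small samples.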

Demonstration

\[P \left( \overline{X}_n - t_{1-\alpha/2} \,\widehat{se} < \mu < \overline{X}_n + t_{1-\alpha/2} \, \widehat{se} \right) = P \left( -t_{1-\alpha/2} < \frac{\overline{X}_n - \mu}{\widehat{se}} < t_{1-\alpha/2} \right)=P\left( -t_{1-\alpha/2} < T < t_{1-\alpha/2} \right)=1-\alpha\]

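This uses the fact that \(\frac{\overline{X}_n - \mu}{\widehat{se}}\) follows a \(t\) distribution with \(n - 1\) degrees of freedom when \(X_1,\dotsc,X_n\) are independent and normally distributed.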
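The role of the \(t\) distribution is easiest to see for small samples. The sketch below simulates the studentized mean for normal samples and compares the coverage obtained with the \(t\) critical value against the coverage obtained if the normal critical value were used instead. The sample size `n = 5` and the other constants are assumptions for illustration.

```r
set.seed(6)        # arbitrary seed for reproducibility
mu <- 0            # assumed population mean
sigma <- 1         # assumed population standard deviation (unknown to the analyst)
n <- 5             # deliberately small sample size
reps <- 20000      # number of simulated samples

samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = n)
sample_means <- colMeans(samples)

# Estimated standard error from each sample's own standard deviation.
se_hat <- apply(samples, 2, sd) / sqrt(n)
t_stat <- (sample_means - mu) / se_hat

# Coverage with the t critical value versus the normal critical value.
cover_t <- mean(abs(t_stat) < qt(0.975, df = n - 1))
cover_z <- mean(abs(t_stat) < qnorm(0.975))

c(t_critical = cover_t, z_critical = cover_z)
```

With the \(t\) critical value the coverage should be close to 0.95, while the normal critical value gives intervals that are too narrow and cover \(\mu\) less often.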

Example

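library(dplyr)

# sleep (built-in dataset): extra hours of sleep for 10 patients under two drugs
before <- sleep %>% filter(group == 1)
after <- sleep %>% filter(group == 2)

# Paired differences (rows of both groups are in the same patient order)
difference <- after$extra - before$extra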

Manual calculation

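# Estimated standard error of the mean difference
se <- sd(difference) / sqrt(length(difference))

# 95% confidence interval using the t quantile with n - 1 degrees of freedom
mean(difference) + c(-1, 1) * qt(.975, length(difference) - 1) * se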

## [1] 0.7001142 2.4598858

Using t.test

t.test(difference)$conf.int

## [1] 0.7001142 2.4598858
## attr(,"conf.level")
## [1] 0.95

Helper function to calculate confidence intervals

library(tidyverse)

library(broom)

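# Compute a t-based 95% confidence interval (plus the sample sd) for column x
# of data frame d, separately for each group defined by the columns in ...
calculate_ci <- function(d, x, ...) {
  d %>%
    group_by(...) %>%
    nest() %>%
    mutate(temp = map(data,
                      # One-sample t-test per group; tidy() returns estimate,
                      # conf.low, conf.high, etc. as a one-row tibble
                      ~t.test(.x %>% pull(!!enquo(x))) %>%
                        tidy() %>%
                        mutate(sd = sd(.x %>% pull(!!enquo(x)))))) %>%
    unnest(temp)
}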

calculate_ci(mtcars, mpg, cyl)

## # A tibble: 3 × 11
## # Groups:   cyl [3]
##     cyl data    estimate statistic  p.value parameter conf.low conf.high method 
##   <dbl> <list>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>  
## 1     6 <tibbl…     19.7      35.9 3.10e- 8         6     18.4      21.1 One Sa…
## 2     4 <tibbl…     26.7      19.6 2.60e- 9        10     23.6      29.7 One Sa…
## 3     8 <tibbl…     15.1      22.1 1.09e-11        13     13.6      16.6 One Sa…
## # … with 2 more variables: alternative <chr>, sd <dbl>