Kullback–Leibler (KL) information \(I\)

The KL information, or distance, between a model of full reality \(f\), which is considered fixed, and an approximating model \(g\) is

\[I(f,g)=\int f(x) \log \left( \frac{f(x)}{g(x | \theta)} \right) \, dx\]

which, expanding the logarithm of the ratio, can be written as

\[I(f,g)=E_f \left[ \log f(x) \right] - E_f \left[ \log g(x | \theta) \right]\]

The first expected value is a constant \(C\) that we can never know, as we do not know the full reality model, but we can compute relative distances:

\[relative \, distance = I(f,g) - C = - E_f \left[ \log g(x | \theta) \right]\]

For two approximating models:

\[I(f,g_1) - I(f,g_2) = - E_f \left[ \log g_1(x | \theta) \right] + E_f \left[ \log g_2(x | \theta) \right]\]

Notice that while \(I(f,g)\) has a true zero, the \(relative \, distance\) does not.
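To make these quantities concrete, here is a minimal numerical sketch in R (the densities are assumptions chosen for illustration: full reality \(f\) is a standard normal and \(g_1\), \(g_2\) are normals with shifted means), approximating \(I(f,g)\) by numerical integration:

f  <- function(x) dnorm(x, mean = 0)    # full reality (assumed)
g1 <- function(x) dnorm(x, mean = 0.5)  # approximating model 1 (assumed)
g2 <- function(x) dnorm(x, mean = 1)    # approximating model 2 (assumed)

# I(f, g) by numerical integration over a range that holds
# essentially all the mass of f
kl <- function(f, g)
  integrate(function(x) f(x) * log(f(x) / g(x)), -10, 10)$value

kl(f, g1)             # I(f, g1) = 0.125; it would be 0 if g1 were f
kl(f, g2)             # I(f, g2) = 0.5
kl(f, g1) - kl(f, g2) # the unknown constant C cancels in the difference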

Often, we will only have an estimate of \(I\), as we do not know the parameters of the model and need to estimate them: \(\widehat{\theta}\).

Akaike information criterion \(AIC\)

\[ AIC = -2\log{L(\widehat{\theta})} + 2 k\]

where \(L(\widehat{\theta})\) is the likelihood evaluated at the parameter values that maximize it and \(k\) is the number of parameters of the model.
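As a quick check of the definition, the \(AIC\) can be computed by hand from the maximized log-likelihood of any fitted model. Here is a sketch using an illustrative linear model on the built-in cars data (lm estimates 3 parameters: intercept, slope and the residual standard deviation):

m <- lm(dist ~ speed, data = cars)        # illustrative model (assumed)
ll <- logLik(m)                           # maximized log-likelihood
-2 * as.numeric(ll) + 2 * attr(ll, "df") # AIC by hand
AIC(m)                                    # same value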

Example

n <- 100                    # number of trials at each level of x
x <- c(.2, .4, .6, .8, 1)   # levels of the explanatory variable
k <- c(10, 26, 73, 94, 97)  # number of successes (e.g., correct responses)
r <- n - k                  # number of failures
y <- k/n                    # proportion of successes
dat <- data.frame(x, y, k, r, n)

Using quickpsy

library(quickpsy)
fit <- quickpsy(dat, x, k, n)
fit$aic
##        aic
## 1 30.89906

Using glm

model <- glm(cbind(k, r) ~ x, data = dat, family = binomial(probit))
model
## 
## Call:  glm(formula = cbind(k, r) ~ x, family = binomial(probit), data = dat)
## 
## Coefficients:
## (Intercept)            x  
##      -2.279        4.572  
## 
## Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
## Null Deviance:       304.4 
## Residual Deviance: 6.662     AIC: 30.9

Sometimes, other software provides different values for the \(AIC\) because it uses a log-likelihood that drops the binomial coefficients (which do not depend on the parameters). In general, this is not a problem, as we are interested in differences in \(AIC\), but it could be a problem when comparing models with different error distributions.
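To see the size of this effect, here is a sketch that recomputes the \(AIC\) of the glm fit above from the full binomial log-likelihood (with the binomial coefficients) and from the log-likelihood that drops them:

p <- predict(model, type = "response")  # fitted probabilities at each x

ll_full   <- sum(dbinom(dat$k, dat$n, p, log = TRUE))  # includes the coefficients
ll_kernel <- sum(dat$k * log(p) + dat$r * log(1 - p))  # drops the coefficients

-2 * ll_full + 2 * 2    # 30.9: the AIC reported by glm (k = 2 parameters)
-2 * ll_kernel + 2 * 2  # differs by a constant that is the same for any
                        # model fitted to this binomial data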