We have some iid (independent and identically distributed) observations (or data) \(x=(x_1,x_2,...,x_n)\) from a random variable \(X\) with a probability density function \(f\). Suppose that we don’t know the exact \(f\), but we know its general form and describe this form using some parameters \(\theta=(\theta_1,\theta_2,...,\theta_k)\). So, \(\theta\) indexes a family of probability density functions \(f(x|\theta)\). If we fix \(\theta\), then we get an exact form for \(f\). An exact form for \(f\) gives us all the information that we need, as we can calculate the probability (or the probability density, when \(X\) is continuous) of each possible observation.

Let’s suppose that we know the parametric family of probability density functions, but we don’t know \(\theta_0\), the true value of the parameters. Maximum likelihood estimation (MLE) is a method to estimate \(\theta_0\), that is, a method to estimate the exact form of \(f\) given \(x=(x_1,x_2,...,x_n)\).

Before defining the likelihood function, notice that a sample from \(X\), the observations \(x=(x_1,x_2,...,x_n)\), is itself random: it is a realization of a random vector. Then, because the observations are independent, the probability density function for the sample \(f_n\) (the joint density function when \(X\) is continuous) is \[ f_n(x_1,x_2,...,x_n|\theta)=f(x_1|\theta)f(x_2|\theta)...f(x_n|\theta).\]

We define a function of \(\theta\) called the likelihood \(L\) as

\[L(\theta|\mathrm{data})=L(\theta|x_1,x_2,...,x_n) \equiv f_n(x_1,x_2,...,x_n|\theta)=f(x_1|\theta)f(x_2|\theta)...f(x_n|\theta).\]
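As a minimal sketch of this definition, assume (purely for illustration) that the family is exponential with a single rate parameter, \(f(x|\theta)=\theta e^{-\theta x}\) for \(x \ge 0\). The sample values and the function names below are hypothetical. The code evaluates \(L\) for the same fixed data at several values of \(\theta\), which emphasizes that the data are held fixed and \(\theta\) is the argument of the function.

```python
import math

def exponential_pdf(x, theta):
    """Density f(x | theta) of an exponential distribution with rate theta (assumed family)."""
    return theta * math.exp(-theta * x) if x >= 0 else 0.0

def likelihood(theta, data):
    """L(theta | data): product of the individual densities, valid because the data are iid."""
    return math.prod(exponential_pdf(x, theta) for x in data)

# Hypothetical sample, made up for illustration only.
data = [0.5, 1.2, 0.3, 2.0, 0.8]

# Same data, different candidate values of theta.
for theta in (0.5, 1.0, 2.0):
    print(f"theta = {theta}: L = {likelihood(theta, data):.6f}")
```

Of the three candidate values, the one with the largest \(L\) is the one this sample supports best under the assumed family.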

It seems intuitive from this definition that, given \((x_1,x_2,...,x_n)\), \(L\) would tend to be large when \(\theta\) is close to the unknown \(\theta_0\) and small when \(\theta\) is far from \(\theta_0\). Indeed, the MLE of \(\theta\) is defined as the value that maximizes \(L\). As \(\log(L)\) increases monotonically with \(L\), the \(\theta\) that maximizes \(L\) also maximizes \(\log(L)\). The logarithm turns the product of densities into a sum, \(\log(L)=\log f(x_1|\theta)+\log f(x_2|\theta)+...+\log f(x_n|\theta)\), which is often easier to maximize, so in practice \(\log(L)\) is frequently maximized instead of \(L\).
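To illustrate why the logarithm helps, here is a minimal sketch that assumes, for illustration only, an exponential family \(f(x|\theta)=\theta e^{-\theta x}\). The log-likelihood is then \(n\log\theta-\theta\sum_i x_i\), and setting its derivative to zero gives the closed-form maximizer \(\hat\theta=n/\sum_i x_i\). The code maximizes the log-likelihood with a simple grid search (a stand-in for a proper numerical optimizer) and compares the result with that formula. The data and function names are hypothetical.

```python
import math

def log_likelihood(theta, data):
    """log L(theta | data) for the assumed exponential model: a sum of log densities."""
    return sum(math.log(theta) - theta * x for x in data)

# Hypothetical sample, made up for illustration only.
data = [0.5, 1.2, 0.3, 2.0, 0.8]

# Crude grid search over theta in (0, 5]; a real analysis would use a numerical optimizer.
grid = [i / 1000 for i in range(1, 5001)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

# Closed-form MLE for the exponential rate: n / sum(x).
theta_closed = len(data) / sum(data)

print(f"grid search: {theta_grid:.3f}")
print(f"closed form: {theta_closed:.3f}")
```

The two estimates should agree to within the grid spacing, which is 0.001 here.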


Non-iid observations

Sometimes the concept of likelihood is generalized to include observations that are independent but not identically distributed. The observations, for example, could be conditioned on another variable. In this case, the usual notation is \(y_1,y_2,...,y_n\) for the observations and \(x_1,x_2,...,x_n\) for the values of the variable they are conditioned on. Then, because the observations are still independent, \[L(\theta|x,y)=f_n(y_1,y_2,...,y_n|x_1,x_2,...,x_n,\theta)=f(y_1|x_1,\theta)f(y_2|x_2,\theta)...f(y_n|x_n,\theta).\]
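As a minimal sketch of this case, assume for illustration that each \(y_i\) is normally distributed with mean \(\theta x_i\) and standard deviation 1, so the observations are independent but each has its own distribution, determined by \(x_i\). The log-likelihood is the sum of the terms \(\log f(y_i|x_i,\theta)\), and for this particular model the maximizer has the closed form \(\hat\theta=\sum_i x_i y_i/\sum_i x_i^2\). The model, the data, and the function names are hypothetical.

```python
import math

def normal_logpdf(y, mean, sd=1.0):
    """Log density of a normal distribution with the given mean and sd, evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (y - mean) ** 2 / (2 * sd**2)

def log_likelihood(theta, xs, ys):
    """Sum of log f(y_i | x_i, theta); each y_i has its own mean theta * x_i."""
    return sum(normal_logpdf(y, theta * x) for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Closed-form maximizer for this assumed model.
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"theta_hat = {theta_hat:.4f}")
print(f"log L at theta_hat:       {log_likelihood(theta_hat, xs, ys):.4f}")
print(f"log L at theta_hat + 0.5: {log_likelihood(theta_hat + 0.5, xs, ys):.4f}")
```

Moving \(\theta\) away from \(\hat\theta\) in either direction lowers the log-likelihood, which is what being the maximizer means.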

