Loss functions

In supervised learning, the aim is to find the transformation $f[\v x, \v \phi]$ that maps one random variable, $\v x$, into another, $\v y$. The loss function $L[\v \phi]$ measures the quality of this transformation and guides the training process to find the parameters $\v \phi$ that minimize the loss.

The problem is that we don’t have direct access to the distributions of $\v x$ and $\v y$; we only have their observations collected in a training dataset $\lbrace \v x_i , \v y_i \rbrace$. To handle this, we model their relationship indirectly: given $\v x$, we estimate the parameters $\v\theta$ of the conditional distribution of $\v y$. However, we still cannot train the model because the true distribution parameters $\v\theta$ are unknown.

Maximum likelihood

Maximum likelihood estimation (MLE) adds an assumption to resolve the problem of the unknown conditional distribution parameters. We assume that these parameters are the ones that maximize the likelihood of observing the training data. When the observations $\lbrace \v x_i, \v y_i \rbrace$ are independent and identically distributed, this results in:

\[\begin{align*} \hat {\v \phi} = \argmax{\phi}\left[ \prod_i p(\v y_i \mid f[\v x_i; \v \phi]) \right] && \gray p(\v y_1, \ldots | \v x_1, \ldots)= \prod_i p(\v y_i |\v x_i) \end{align*}\]

A neural network predicts the conditional distribution parameters $\hat {\v \theta}$, which we then plug into a predefined and fixed transformation: the assumed formula of the conditional distribution of $\v y$ given $\v x$. While these conditional distributions can take various forms, we commonly assume that they belong to the same distribution family as the marginal distribution of $\v y$, simplifying the modeling process.

\[\begin{align*} \gray\underbrace{\space f[\v x; \v \phi] \approx\black \hat{\v\theta}\space}_{\text{training}} \space \black\overset{\text{MLE}}{\approx} \gray \underbrace{\black \space \v\theta \gray \to \v y\space}_{\text{fixed formula}} \end{align*}\]

In this framework, we find the model parameters $\v\phi$ such that the model, given $\v x_i$, returns the estimated distribution parameters $\hat{\v\theta}_i$ under which $\v y_i$ is the most likely. During inference, the model estimates the distribution parameters, and we may return either the full distribution or its most likely value:

\[\gray \hat y = \argmax{y}\bigg[ p(\v y | \v\theta) \bigg]\gray\]

We approximate $\hat{\v\theta} \approx \v \theta$ by assuming that the observed data points are the most likely ones, which might not actually be true. For an alternative approach, check the Bayesian linear regression post.

Moreover, maximizing the likelihood and maximizing the log-likelihood are equivalent since the log function is monotonically increasing. Because we prefer minimizing, we multiply by $-1$. This gives us the numerically stable negative log-likelihood (NLL) loss function:

\[\hat {\v \phi} = \argmax{\phi}\left[ \prod_i p(\v y_i \mid f[\v x_i; \v \phi]) \right] = \argmin{\phi}\left[ -\sum_i \log p(\v y_i \mid f[\v x_i; \v \phi]) \right] \gray = \argmin{\phi}\bigg[L[\v\phi]\bigg]\]
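The log transform is not only convenient but numerically necessary: a product of many probabilities quickly underflows in floating point, while a sum of log-probabilities stays finite. A minimal sketch (the probability values are made up for illustration):

```python
import math

# Likelihood of 200 observations, each with probability 0.01 under the model.
probs = [0.01] * 200

# The naive product underflows to exactly 0.0: 10^-400 is below the
# smallest representable positive float (~5e-324).
likelihood = 1.0
for p in probs:
    likelihood *= p

# The sum of logs is perfectly stable.
log_likelihood = sum(math.log(p) for p in probs)
nll = -log_likelihood

print(likelihood)  # 0.0 due to underflow
print(nll)         # ~921.03, finite and usable as a loss
```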


When $y$ is a random variable that marginally follows a Beta distribution, the model estimates the two parameters $a_i$ and $b_i$ of the conditional Beta distribution given $\v x_i$. For training, we can derive the negative log-likelihood loss function from the Beta density $p(y) = y^{a-1}(1-y)^{b-1}/B(a, b)$; since the normalizer $B(a, b)$ depends on the estimated parameters, it cannot be dropped as a constant:

\[\gray L[\v\phi] = -\sum_i \log p(y_i \mid a_i, b_i) = -\sum_i \bigg[ (a_i-1)\log y_i + (b_i-1)\log(1-y_i) - \log B(a_i, b_i)\bigg]\]

The figure shows a possible prediction of $a_i$ and $b_i$ for a given $\v x_i$. The prediction is inaccurate since the likelihood of $y_i$ is low, resulting in a high loss. The predicted parameters $a_i=2$ and $b_i=5$ would be perfect if $y_i$ were $0.2$, but not $0.7$. In other words, the parameters of the conditional distribution that we assume to be true are those for which $y_i$ is the most likely. For $y_i = 0.7$, the parameters $a_i=8$ and $b_i=4$ might be chosen, but any other configuration that satisfies $(a-1)/(a+b-2) = 0.7$ is also possible. The MLE assumption allows for a set of valid parameter configurations rather than a single specific one.

During inference, we may either return the most likely value or the full distribution to quantify uncertainty. For many distributions, the most likely value has a closed form. For the Beta distribution (with $a, b > 1$), the mode is given by $(a-1)/(a+b-2)$.
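As a sketch of the idea (using the parameter values from the example above), the full Beta negative log-likelihood can be computed with the standard library via `math.lgamma`, since $\log B(a,b) = \log\Gamma(a) + \log\Gamma(b) - \log\Gamma(a+b)$:

```python
import math

def beta_nll(y, a, b):
    """Negative log-likelihood of a single observation y under Beta(a, b)."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_p = (a - 1) * math.log(y) + (b - 1) * math.log(1 - y) - log_B
    return -log_p

y = 0.7
# Mode of Beta(2, 5) is 0.2: a poor fit for y = 0.7, hence a high loss.
# Mode of Beta(8, 4) is 0.7: a much better fit, hence a lower loss.
print(beta_nll(y, 2, 5))  # ≈ 1.77
print(beta_nll(y, 8, 4))  # ≈ -1.08
```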


The negative log-likelihood loss functions can be derived for various distributions. For regression, we may assume a Normal distribution with constant variance, so the model estimates only the mean. Ignoring all constants and using $p(y) \propto \exp[-(y - f[\v x, \v \phi])^2]$, we aim to minimize:

\[\gray \hat {\v \phi} = \argmin{\phi} \left[ \sum_i (y_i-f[\v x_i; \v\phi])^2 \right]\]

The NLL loss function with a Normal conditional distribution coincides with the least squares method, creating a bridge between probabilistic approaches and traditional regression analysis. When a model predicts both the mean and the variance (a heteroscedastic Normal distribution), the variance estimate serves as a measure of uncertainty, providing insight into the model’s confidence.
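A quick numerical sketch of this equivalence (toy data, fixed unit variance): the mean that minimizes the sum of squared errors also minimizes the Gaussian negative log-likelihood.

```python
import math

ys = [1.2, 0.8, 1.5, 1.1, 0.9]  # toy observations

def sse(mu):
    """Sum of squared errors for a constant prediction mu."""
    return sum((y - mu) ** 2 for y in ys)

def gaussian_nll(mu, sigma=1.0):
    """Full Gaussian NLL, including the constants the derivation drops."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)
        for y in ys
    )

candidates = [i / 100 for i in range(0, 201)]  # mu in [0, 2]
best_sse = min(candidates, key=sse)
best_nll = min(candidates, key=gaussian_nll)
print(best_sse, best_nll)  # both equal the sample mean, 1.1
```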


In binary classification, we may use the Bernoulli distribution $p(y) = \lambda^y(1-\lambda)^{1-y}$. Its parameter $\lambda$ must stay within the range between $0$ and $1$. To achieve this, we apply a sigmoid activation at the end of the network, such as the logistic function $\sigma(z) = 1/(1 + e^{-z})$. The loss function becomes:

\[\begin{align*} \gray L[\v \phi] = -\sum_i\bigg[ y_i \log\lambda_i + (1-y_i) \log(1-\lambda_i) \bigg] && \gray\lambda_i = \sigma(f[\v x_i; \v\phi]) \end{align*}\]
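Computing this loss through $\lambda_i$ loses precision once the sigmoid saturates. A common remedy is to evaluate the loss directly from the logit $z = f[\v x_i; \v\phi]$; the sketch below shows the stable reformulation used by typical deep learning frameworks' logit-based BCE losses:

```python
import math

def bce_naive(y, z):
    """BCE computed through the sigmoid; unstable for large |z|."""
    lam = 1 / (1 + math.exp(-z))
    return -(y * math.log(lam) + (1 - y) * math.log(1 - lam))

def bce_stable(y, z):
    """Algebraically equivalent loss computed directly from the logit z."""
    return max(z, 0) - y * z + math.log1p(math.exp(-abs(z)))

# Both agree for moderate logits...
print(bce_naive(1.0, 2.0), bce_stable(1.0, 2.0))  # ~0.1269 each
# ...but the naive version breaks down for large ones:
# math.log(1 - lam) raises a domain error once lam rounds to exactly 1.0.
print(bce_stable(0.0, 40.0))  # still exact
```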

For multiclass classification, we use the Categorical distribution with the softmax activation function instead. The NLL loss function encourages the model to increase the logit of the correct class relative to others.

\[\gray L[\v\phi] = -\sum_i \bigg[ f_{y_i}[\v x_i; \v\phi] - \log\left(\sum_d \exp\big(f_d[\v x_i; \v\phi]\big)\right)\bigg]\]
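The inner term is a log-sum-exp, which is typically evaluated with the max-subtraction trick to avoid overflow. A minimal sketch (the logits are made up):

```python
import math

def cross_entropy(logits, target):
    """NLL of the target class under a softmax over the logits."""
    # Stable log-sum-exp: subtract the max before exponentiating.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[target] - lse)

logits = [2.0, 1.0, -1.0]
print(cross_entropy(logits, 0))  # ≈ 0.349: correct class already dominant
print(cross_entropy(logits, 2))  # ≈ 3.349: wrong class, high loss
```

Without the max subtraction, `math.exp` would overflow for logits in the hundreds; with it, even `[1000.0, 0.0]` is handled exactly.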

Difference between distributions

Instead of maximizing the likelihood of observing data, we can alternatively minimize the distance between the true and estimated distributions, $p$ and $q$. The true empirical distribution is represented as a set of Dirac delta functions, forming infinite peaks at the positions of the observations $y_i$.

We can measure the difference between distributions using the KL-divergence. As before, the model predicts the parameters $\v \theta = f[\v x_i; \v\phi]$ that we use in the predicted conditional distribution $q$ of $\v y_i$ given $\v x_i$. Removing constants and using the sifting property of the Dirac delta, $\int \delta[y - y_i]\, g(y)\, dy = g(y_i)$, we obtain:

\[\begin{align*} \hat{\v\phi} &= \argmin{\phi} \left[\int p(y)\log \frac{p(y)}{q(y|\v \theta)} dy \right] \\ &\gray = \argmin{\phi} \left[- \int p(y)\log [{q(y|\v \theta)}] dy \right] \quad\quad\text{cross-entropy} \\ &\gray = \argmin{\phi} \left[- \int \left(\frac 1 n \sum_{i=1}^n \delta[y - y_i] \right)\log[{q(y|\v \theta)}] dy \right]\\ &= \argmin{\phi} \left[- \sum_{i=1}^n \log[q(y_i|\v \theta)] \right] \end{align*}\]

This shows that minimizing the KL-divergence between the true and estimated distributions is equivalent to minimizing the negative log-likelihood loss function.


The Kullback-Leibler divergence measures the dissimilarity between two probability distributions $p(x)$ and $q(x)$. Both functions are defined over the same space $x$ but may represent different random variables. They share a common support, allowing for comparison.

\[D_{KL}\bigg[p(x)\ \|\ q(x) \bigg] = \int p(x) \log \frac{p(x)}{q(x)} dx\]

The divergence is always greater than or equal to zero. This can be shown using the inequality $-\log z \ge 1-z$:

\[\gray \int p(x)\bigg(-\log\frac{q(x)}{p(x)}\bigg) dx \ge \int p(x) \bigg(1- \frac{q(x)}{p(x)}\bigg) dx = 0\]

The KL-divergence is not symmetric. To illustrate this, consider $p(x)$ and $q(x)$ as the true and predicted normal distributions, with $p(x)$ having two modes and $q(x)$ having one mode.

Using the forward divergence $D_{KL}\big[p\ \|\ q\big]$ as the loss function, the trained $q(x)$ would cover both modes by centering on their mean, thereby maximizing recall. If $q(x)$ is close to zero in regions where $p(x)$ is high, the ratio $q(x)/p(x)$ approaches zero, leading to a high cost $-\log(q(x)/p(x))$. Consequently, the model avoids such situations by ensuring $q(x) > 0$ wherever $p(x) > 0$.

In the reversed divergence $D_{KL}\big[q\ \|\ p\big]$, the model can focus on a subregion of the support of $p$, as the expectation is taken with respect to $q(x)$. As a result, the trained $q(x)$ would closely match one of the two modes (maximizing precision).
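This mode-covering versus mode-seeking behavior is easy to check numerically. The sketch below discretizes a bimodal $p$ and two unimodal candidates for $q$ on a grid (all distributions and grid bounds are made up for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]

def kl(ps, qs, eps=1e-300):
    # Discrete KL-divergence D[p || q] over a shared grid.
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(ps, qs) if p > 0)

grid = [i / 10 for i in range(-50, 151)]  # x in [-5, 15]

# Bimodal p with modes at 0 and 10, and two unimodal candidates for q:
p      = normalize([0.5 * normal_pdf(x, 0, 1) + 0.5 * normal_pdf(x, 10, 1) for x in grid])
q_wide = normalize([normal_pdf(x, 5, 4) for x in grid])  # covers both modes
q_mode = normalize([normal_pdf(x, 0, 1) for x in grid])  # matches one mode

# Forward KL prefers the mode-covering q; reverse KL prefers the mode-seeking q.
print(kl(p, q_wide), kl(p, q_mode))  # first is smaller
print(kl(q_wide, p), kl(q_mode, p))  # second is smaller
```

Note that $D_{KL}\big[q_{\text{mode}}\ \|\ p\big] \approx \log 2$: wherever $q_{\text{mode}}$ has mass, $p \approx \tfrac 1 2 q_{\text{mode}}$.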