Bayesian linear regression

In supervised learning, the aim is to learn a mapping between two random variables, $\v x$ and $\v y$. The problem is that we have no direct access to either of them; we only have a finite set of observations. To handle this, we model the interaction between them indirectly: given $\v x$, we estimate parameters $\v\theta$ that we then use to define the conditional distribution of $\v y$.

In maximum likelihood estimation, the first part is the task of a neural network: predicting the conditional distribution parameters $\v\theta$. The second part is a predefined, fixed transformation: the assumed form of the conditional distribution of $\v y$ given $\v\theta$. Because the true distribution parameters $\v\theta$ are unknown, we train the neural network to predict parameters $\hat{\v\theta}$ that maximize the likelihood of observing the training data:

\[\begin{align*} \gray\underbrace{\black f(\v x; \v \phi)\approx\ \black \hat{\v\theta}\ }_{\text{training}} \approx \v\theta \to \v y &&&&\gray \hat {\v \phi} = \argmax{\phi}\left[ \prod_i p(\v y_i \mid f[\v x_i; \v \phi]) \right]\\ \end{align*}\]

In this framework, we find one specific set of model parameters $\v\phi$ that is the most likely in the sense that a neural network with these parameters returns the distribution parameters $\hat{\v\theta}$ under which the observed ${\v y_i}$ are the most likely. At inference time, we use that specific model to estimate the distribution parameters and return the most likely value of $\v y$:

\[\gray \hat{\v y} = \argmax{\v y}\bigg[ p(\v y \mid \hat{\v\theta}) \bigg]\gray\]
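To make the recipe concrete, here is a minimal NumPy sketch, assuming a Gaussian likelihood with fixed variance and a plain linear map standing in for the network (all values are toy assumptions, not from the derivation above); in that case maximizing the likelihood reduces to least squares:

```python
import numpy as np

# Toy data: y = x^T phi_true + Gaussian noise (assumed setup for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs, one row per observation
phi_true = np.array([1.5, -2.0, 0.5])
y = X @ phi_true + rng.normal(scale=0.3, size=100)

# For a Gaussian likelihood with fixed variance, maximizing
# prod_i N(y_i | x_i^T phi, sigma^2) is equivalent to minimizing
# the sum of squared residuals, i.e. ordinary least squares.
phi_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# The point prediction for a new x is the most likely y: x^T phi_mle.
x_new = np.array([0.2, -1.0, 0.7])
y_hat = x_new @ phi_mle
```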

Bayesian approach

In the Bayesian approach, we do not rely on a single set of the most likely model parameters. Instead, we define the posterior distribution over all possible model parameters. Note that less likely model parameters are not necessarily worse: the approximation $\hat{\v\theta} \approx \v \theta$ rests on the assumption that the observed data points are the most likely ones, which may well not be true. By Bayes’ theorem,

\[p(\v\phi|\{\v x_i, \v y_i\}) = \frac {\prod_i p(\v y_i| \v x_i, \v\phi)\,p(\v\phi)} {\gray\int\prod_i p(\v y_i| \v x_i, \v\phi)\,p(\v\phi)\,d\v\phi}\]

The prediction $\v y$ for a new input $\v x$ is an infinite weighted sum (an integral) of the predictions for each parameter set, where the weights are the associated posterior probabilities:

\[p(\v y | \v x, \{\v x_i, \v y_i\}) = \gray \int\black p(\v y| f[\v x; \v\phi]) p(\v\phi|\{\v x_i, \v y_i\})\gray d\v\phi\]

This is effectively an infinite weighted ensemble. The Bayesian approach is more elegant and can provide more robust predictions. However, for complex neural networks, there is no practical way to represent the full probability distribution over the parameters or to integrate over it during the inference phase.
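To see the ensemble interpretation in code, here is a minimal sketch that approximates the integral by averaging over parameter samples. The posterior samples below are hypothetical stand-ins drawn from an arbitrary Gaussian, since obtaining them is exactly the intractable part for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in: pretend we somehow obtained S samples of the last-layer
# weights phi from the posterior p(phi | {x_i, y_i}).
S, D = 1000, 3
phi_samples = rng.normal(loc=0.5, scale=0.1, size=(S, D))  # hypothetical samples
sigma = 0.3                                                # assumed noise std

x_new = np.array([0.2, -1.0, 0.7])

# p(y | x) ~ (1/S) * sum_s p(y | f[x; phi_s]): a weighted ensemble where
# the posterior supplies the weights implicitly through the samples.
predictive_means = phi_samples @ x_new          # one mean per ensemble member
y_mean = predictive_means.mean()
y_var = predictive_means.var() + sigma**2       # spread of the ensemble + noise
```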

Linear regression

For linear regression, it is possible to derive the full distribution over parameters and bypass the integration at inference time. Here, instead of working with the raw input vectors, we perform the linear regression on the last hidden states of a neural network, so the model parameters $\v\phi$ are those of the last layer only. Let’s collect these hidden states in the columns of $\v H$.

To define the posterior over model parameters, we begin with two reasonable assumptions: (i) the prior distribution of the model parameters is a multivariate normal with constant variance $\sigma^2_p$, and (ii) the distribution of $\v y$ is also a multivariate normal with constant variance $\sigma^2$. Note that $\v y$ here is a high-dimensional random vector, where each dimension corresponds to one element of the training dataset.

Since a linear transformation of a normal random variable is still normal, we turn the distribution over $\v y$ into a distribution over $\v \phi$. Then, as the product of two normal densities is proportional to another normal density, we can combine it with the prior. These two lemmas are derived in the appendix. Completing the square, the posterior over model parameters is:

\[\begin{align*} p(\v\phi|\{\v x_i, y_i\}) &\propto \prod_i p(y_i \mid \v x_i, \v\phi)\, p(\v \phi) \\[-3px] &\propto \MVN{\v y}{\v H^T\v\phi}{\sigma^2\v I} \MVN{\v\phi}{\v 0}{\sigma^2_p\v I} \\[2px] &\propto \MVN{\v \phi} {(\v H\v H^T)^{-1}\v H\v y}{\sigma^2 (\v H\v H^T)^{-1}} \MVN{\v\phi}{\v 0}{\sigma^2_p\v I} \\[-2px] &\propto \MVN {\v \phi} {\v \mu} {\v C} \quad\quad\quad \gray \v \mu = \frac{1}{\sigma^{2}} \v C \v H \v y \quad\quad \gray \v C = \left( \frac{1}{\sigma^2}\v{H}\v{H}^T+\frac{1}{\sigma_p^2}\v{I} \right)^{-1} \end{align*}\]
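A minimal NumPy sketch of the posterior formulas above, with the variances and data chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 3, 50                      # hidden size, number of training points
H = rng.normal(size=(D, N))       # hidden states in columns, as in the text
phi_true = np.array([1.0, -0.5, 2.0])
sigma, sigma_p = 0.3, 1.0         # noise and prior std (assumed values)
y = H.T @ phi_true + rng.normal(scale=sigma, size=N)

# Posterior over the last-layer weights:
#   C  = (H H^T / sigma^2 + I / sigma_p^2)^{-1}
#   mu = C H y / sigma^2
C = np.linalg.inv(H @ H.T / sigma**2 + np.eye(D) / sigma_p**2)
mu = C @ H @ y / sigma**2
```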

Once we have the posterior distribution, we can infer a new prediction $y$ given $\v x$. We restructure the integral and collect the terms involving $\v \phi$. Then, completing the square and pulling the terms that do not depend on $\v\phi$ outside the integral, we recognize a normal distribution in $\v\phi$ inside the integral:

\[\begin{align*} p(y |\v x ) &= \int \MVN y {\v x^T\v \phi} {c^{-1}} \MVN {\v \phi} {\v \mu} {\v C} d\v \phi \\ &\propto\int \text{exp}\bigg[-\frac12 \bigg( \v \phi^T(c\v x \v x^T + \v C^{-1})\v \phi - 2\v\phi^T(cy\v x + \v C^{-1}\v \mu) + cy^2 \bigg)\bigg] d\v\phi \\ &\propto \text{exp}\bigg[ -\frac 1 2(cy^2 -\v m^T\v M \v m) \bigg] \gray \int \text{exp}\bigg[ -\frac 1 2(\v\phi - \v m)^T\v M(\v\phi-\v m) \bigg] d\v\phi \\[10px] &\; \gray \text{where} \quad \quad\v M= c\v x\v x^T + \v C^{-1} \quad\text{and}\quad \v m = \v M^{-1}(cy\v x + \v C^{-1}\v \mu) \end{align*}\]

Here is the crux: the integral is that of an unnormalized Gaussian density in $\v\phi$, so it evaluates to a constant that does not depend on $y$ and can be dropped under proportionality. The remaining part outside the integral is a normal distribution in $y$, but $y$ is still entangled within $\v m$. Expanding it, we obtain:

\[p(y | \v x) \propto \text{exp}\bigg[-\frac 1 2\bigg( (c-c^2\v x^T\v M^{-1}\v x)y^2 - 2(c\v x^T\v M^{-1}\v C^{-1}\v\mu)y\bigg)\bigg] \propto \text{exp}\bigg[-\frac k 2(y - u)^2\bigg] \\[10px] \begin{align*} \gray k = c - c^2\v x^T \v M^{-1}\v x && \gray u = \frac 1 k(c\v x^T\v M^{-1} \v C^{-1} \v\mu) && \gray \to\quad \text{Norm}_{y}[u, 1/k] \end{align*}\]

Simplifying the above (I could not push the algebra through by hand, but I verified the result numerically; a sketch of one route follows the formula), we reach the final form:

\[p(y | \v x) = \MVN {y} {\v\mu^T\v x} {c^{-1} + \v x^T\v C\v x}\]
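For the record, the algebra can be closed with the Sherman–Morrison identity applied to $\v M = c\,\v x\v x^T + \v C^{-1}$ (a sketch):

\[\begin{align*}
\v M^{-1} &= \v C - \frac{c\,\v C\v x\v x^T\v C}{1 + c\,\v x^T\v C\v x}
\;\Rightarrow\;
\v x^T\v M^{-1}\v x = \frac{\v x^T\v C\v x}{1 + c\,\v x^T\v C\v x}
\;\Rightarrow\;
\frac 1 k = \frac{1}{c - c^2\v x^T\v M^{-1}\v x} = c^{-1} + \v x^T\v C\v x \\[4px]
\v x^T\v M^{-1}\v C^{-1}\v\mu &= \v x^T\v\mu - \frac{c\,(\v x^T\v C\v x)(\v x^T\v\mu)}{1 + c\,\v x^T\v C\v x} = \frac{\v x^T\v\mu}{1 + c\,\v x^T\v C\v x}
\;\Rightarrow\;
u = \frac 1 k\,c\,\v x^T\v M^{-1}\v C^{-1}\v\mu = \v\mu^T\v x
\end{align*}\]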

This agrees with vanilla linear regression: the mean is the MAP estimate of the weights applied to the input vector (the mean is also the mode of a normal distribution). Moreover, the prediction’s variance combines the inherent noise $c^{-1} = \sigma^2$ with the uncertainty in the regression function itself.
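A quick numerical sanity check of the closed form against a Monte Carlo estimate of the predictive integral (a self-contained NumPy sketch with arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of toy setup as before (assumed values, not from the text).
D, N = 3, 50
H = rng.normal(size=(D, N))
phi_true = np.array([1.0, -0.5, 2.0])
sigma, sigma_p = 0.3, 1.0
y = H.T @ phi_true + rng.normal(scale=sigma, size=N)

C = np.linalg.inv(H @ H.T / sigma**2 + np.eye(D) / sigma_p**2)
mu = C @ H @ y / sigma**2
c = 1.0 / sigma**2

# Closed-form predictive: Norm_y[ mu^T x, c^{-1} + x^T C x ]
x = rng.normal(size=D)
mean_closed = mu @ x
var_closed = 1.0 / c + x @ C @ x

# Monte Carlo: sample phi ~ N(mu, C), then y ~ N(x^T phi, c^{-1}).
phi_s = rng.multivariate_normal(mu, C, size=200_000)
y_s = phi_s @ x + rng.normal(scale=np.sqrt(1.0 / c), size=len(phi_s))
print(mean_closed, y_s.mean())    # should agree closely
print(var_closed, y_s.var())      # should agree closely
```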

Appendix

If $\v z$ is a multivariate normal random variable with mean $\v\mu$ and covariance $\v\Sigma$, its linear transformation $\v x = \v A\v z$ is also multivariate normal. Let’s start by calculating the expected value and the covariance of $\v x$. Using the linearity of expectation, $E(\v x) = E(\v {A z}) = \v A E(\v z) = \v A \v\mu$. Since $\v x - E(\v x) = \v A( \v z - \v \mu)$, the covariance matrix is:

\[\text{Cov}(\v x) = E\Big[(\v x - E(\v x))(\v x - E(\v x))^T\Big] = \v A\,E\Big[(\v z - \v\mu)(\v z - \v\mu)^T\Big]\v A^T = \v {A\Sigma A}^T\]

This does not yet prove that $\v x$ is multivariate normal. Defining $\v B = \v{A\Sigma A}^T$, we change variables and work backwards, assuming that $\v x$ is indeed multivariate normal. Treating all terms that do not involve $\v z$ as constants and completing the square, we obtain:

\[\begin{align*} \MVN{\v x}{\v A\v z}{\v B} &\propto \exp\bigg[-\frac 1 2 (\v x - \v A\v z)^T\v B^{-1}(\v x - \v A\v z)\bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2 (\v z^T\v A^T\v B^{-1}\v A \v z - 2\v z^T\v A^T \v B^{-1}\v x) \bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2 (\v z^T\v M\v z - 2\v z^T\v m)\bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2 (\v z - \v M^{-1}\v m)^T\v M(\v z - \v M^{-1}\v m)\bigg] \\ &\gray\propto \MVN{\v z} {(\v A^T\v B^{-1}\v A)^{-1}\v A^T\v B^{-1}\v x} {(\v A^T\v B^{-1}\v A)^{-1}} \\ &\black\propto \MVN{\v z}{\v A^{-1} \v x}{\v A^{-1} \v B \v A^{-T}} = \MVN{\v z}{\v \mu}{\v \Sigma} \end{align*}\]
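A quick numerical illustration of this lemma (a sketch with an arbitrary invertible $\v A$, mean, and covariance): the sample mean and covariance of $\v x = \v A\v z$ match $\v A\v\mu$ and $\v A\v\Sigma\v A^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary invertible A, mean and covariance (assumed for illustration).
A = np.array([[2.0, 0.5], [-1.0, 1.5]])
mu = np.array([1.0, -2.0])
L = np.array([[1.0, 0.0], [0.3, 0.8]])
Sigma = L @ L.T                       # a valid covariance matrix

z = rng.multivariate_normal(mu, Sigma, size=500_000)
x = z @ A.T                           # x = A z, applied row-wise

print(x.mean(axis=0), A @ mu)                    # sample mean  ~ A mu
print(np.cov(x, rowvar=False), A @ Sigma @ A.T)  # sample cov   ~ A Sigma A^T
```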

The product of two multivariate normal densities is proportional to another multivariate normal:

\[\begin{align*} \MVN{\v u}{\v a}{\v A} \MVN{\v u}{\v b}{\v B} &\propto \exp\bigg[-\frac 1 2\bigg( (\v u- \v a)^T \v A^{-1} (\v u- \v a) + (\v u- \v b)^T \v B^{-1} (\v u- \v b) \bigg)\bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2\bigg( \v u^T(\v A^{-1} + \v B^{-1})\v u - 2(\v A^{-1}\v a +\v B^{-1}\v b)^T\v u \bigg)\bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2 (\v u^T\v M\v u - 2\v m^T\v u)\bigg] \\ &\gray \propto \exp\bigg[-\frac 1 2 (\v u - \v M^{-1}\v m)^T\v M(\v u - \v M^{-1}\v m)\bigg] \\ &\black\propto \MVN{\v u} {(\v A^{-1} + \v B^{-1})^{-1}(\v A^{-1}\v a +\v B^{-1}\v b)}{(\v A^{-1} + \v B^{-1})^{-1}} \end{align*}\]
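Numerically, the pointwise product of the two densities differs from the combined density only by a constant factor (a SciPy sketch with arbitrary 2-D parameters):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two arbitrary 2-D normals (assumed values for illustration).
a, A = np.array([1.0, 0.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
b, B = np.array([-1.0, 2.0]), np.array([[0.8, -0.1], [-0.1, 1.5]])

# Combined normal from the lemma.
Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
cov = np.linalg.inv(Ai + Bi)
mean = cov @ (Ai @ a + Bi @ b)

# The pointwise product of the two densities should be a constant
# multiple of the combined density, whatever u we evaluate it at.
rng = np.random.default_rng(0)
u = rng.normal(size=(5, 2))
ratio = (multivariate_normal.pdf(u, a, A) * multivariate_normal.pdf(u, b, B)
         / multivariate_normal.pdf(u, mean, cov))
print(ratio)   # all entries (approximately) equal
```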