Bayesian linear regression
In supervised learning, the aim is to relate two random variables, $\v x$ and $\v y$. The problem is that we never have direct access to them, only a finite set of observations. To handle this, we model the interaction between them indirectly: given $\v x$, we estimate parameters $\v\phi$ that define the conditional distribution of $\v y$.
From the frequentist perspective, the true parameter $\v\phi$ of the probability distribution $p(\v y \mid \v x)$ is unknown but fixed. The model’s estimate of this parameter is treated as a random variable because it depends on the dataset, which is itself considered random: frequentists see data sampling as the source of randomness. In this framework, models rely on predefined rules to summarize all the information into a single point estimate. For example, maximum likelihood estimation (see Loss functions) assumes that the observed data points are the most likely ones - a premise that need not hold. Parameter values that are less likely given a specific dataset may generalize better across other datasets.
Bayesian approach
The Bayesian perspective assumes that the dataset is directly observed and therefore not treated as random, whereas the true parameter $\v\phi$ is unknown and uncertain and so becomes a random variable. Instead of relying on a single point estimate of the most likely distribution parameters, the Bayesian approach defines a posterior distribution over all possible model parameters. This posterior combines prior beliefs about the parameters - shifting probability mass towards preferred regions of the parameter space - with the evidence provided by the observed data. Using Bayes’ theorem,
\[p(\v\phi|\v x, \v y) \propto {p(\v y| \v x, \v \phi)p(\v \phi)}\]We use the proportionality notation and omit the denominator - an integral over $\v\phi$ - since it serves only as a normalizing constant. When the observations $\{ \v x_i, \v y_i \}$ are independent and identically distributed, the posterior over model parameters becomes:
\[p(\v\phi|\{ \v x_i, \v y_i\}) \propto \bigg(\prod_i p(\v y_i| \v x_i, \v\phi)\bigg) p(\v\phi)\]With this distribution over model parameters, the prediction $\v y$ for a new input $\v x$ is obtained by integrating the predictions over all parameter settings, each weighted by its posterior probability:
\[p(\v y | \v x, \{\v x_i, \v y_i\}) = \int\black p(\v y| f[\v x; \v\phi]) p(\v\phi|\{ \v x_i, \v y_i\}) d\v\phi\]This is effectively an infinite weighted ensemble. The Bayesian approach is more elegant and often provides more robust predictions. Unfortunately, for complex models, representing the full probability distribution over the parameters or performing exact integration during inference is computationally infeasible. Fortunately, for simpler models such as linear regression, these calculations can be carried out analytically.
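When the integral has no closed form, it is often approximated by averaging over samples drawn from the posterior. Below is a minimal sketch of this Monte Carlo view of the ensemble; the one-dimensional model, the noise level, and the stand-in posterior samples are assumptions made up for illustration, not part of the derivation.

```python
import numpy as np

# Minimal sketch of the predictive integral as a Monte Carlo average over
# posterior samples. Everything here is an assumption for illustration:
# a 1-D linear model f(x, phi) = phi * x, Gaussian noise with known sigma,
# and stand-in "posterior samples" drawn from a Normal distribution.
rng = np.random.default_rng(0)
sigma = 0.5
phi_samples = rng.normal(loc=1.8, scale=0.2, size=10_000)

def f(x, phi):
    return phi * x

def predictive_density(y, x):
    # p(y | x, data) ~ (1/S) * sum_s N(y; f(x, phi_s), sigma^2)
    means = f(x, phi_samples)
    densities = np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return densities.mean()

print(predictive_density(y=3.5, x=2.0))
```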
Linear regression
In linear regression, the goal is to learn a linear mapping from an input vector $\v x \in \mathbb R^d$ to predict the value of a scalar $y$. The prediction is parametrized by the vector $\v w \in \mathbb R^d$:
\[\hat {y} = \v w^T\v x\]Given a set of $n$ training examples $(\v X, \v y)$, the prediction of $y$ over the entire training set is:
\[\hat {\v y} = \v X\v w\]We assume the distribution of $\v y$ is a multivariate normal with a constant variance $\sigma^2$. This is reasonable because, for a continuous distribution with a specific mean and variance, the normal distribution maximizes entropy, representing the greatest uncertainty. Because the observations are i.i.d., the covariance matrix is diagonal, with the constant variance indicating uniform uncertainty across all predictions:
\[p(\v y \mid \v X, \v w) = \MVN{\v y}{\v X\v w}{\sigma^2\v I}\]To compute the posterior distribution over model parameters $\v w$, it is necessary to specify a prior distribution that encodes our initial beliefs about the parameters. A common choice is a multivariate normal distribution, which represents a broad and flexible prior, allowing for a high degree of initial uncertainty:
\[p(\v w) = \MVN{\v w}{\v a_0}{\v A_0}\]Given this prior, the posterior distribution of $\v w$ can be derived by applying Bayes’ theorem. First, we map the distribution of $\v y$ into the parameter space of $\v w$ (see Proposition 1). Then, we combine the likelihood and the prior - both of which are normal distributions - into a single posterior distribution (see Proposition 2):
\[\begin{align*} p(\v w| \v X, \v y) &\propto p(\v y \mid \v X, \v w) p(\v w)\\[5px] &\propto \MVN{\v y}{\v X\v w}{\sigma^2\v I} \MVN{\v w}{\v a_0}{\v A_0} \\[2px] &\propto \MVN{\v w} {(\v X^T\v X)^{-1}\v X^T\v y}{\sigma^2 (\v X^T\v X)^{-1}} \MVN{\v w}{\v a_0}{\v A_0} \\[-2px] &\propto \MVN {\v w} {\v A_n \bigg(\frac{1}{\sigma^{2}} \v X^T \v y + \v A_0^{-1}\v a_0\bigg)} {\v A_n} \quad\quad\quad \v A_n = \left( \frac{1}{\sigma^2}\v X^T \v X + \v A_0 ^ {-1} \right)^{-1} \end{align*}\]To estimate the parameter vector $\v w$ using least squares, we can derive the normal equations in two ways. The first method involves taking the derivative of the MSE loss function with respect to $\v w$ and setting the derivative to zero. The second method is geometric: the prediction vector $\hat{\v y} = \v X \v w$ lies in the column space of $\v X$, and the best $\hat{\v y}$ is obtained by projecting $\v y$ onto this space, minimizing the error $\v e = \v y - \hat{\v y}$. This is achieved by ensuring that the error vector, which defines the direction of the projection, is orthogonal to the column space of $\v X$.
\[\gray \begin{align*} & \v X^T (\v y - \v X \v w) = 0 \\ &\v w = (\v X^T \v X)^{-1}\v X^T \v y \end{align*}\]When the prior covariance matrix $\v A_0 \to \v \infty$, its precision matrix $\v A_0^{-1} \to \v 0$, causing the prior-related terms to vanish. In this case, the posterior mean aligns with the least squares estimate of $\v w$. However, an infinitely wide prior is purely theoretical, as it implies complete uncertainty about the initial parameter values.
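As a sanity check, the posterior above is easy to compute numerically. The sketch below uses made-up toy data and prior values (none of the numbers come from the text): it builds $\v A_n$ and the posterior mean, and confirms that with a broad prior the mean is close to NumPy's least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = X w_true + Gaussian noise (assumed setup, not from the text).
n, d, sigma = 200, 3, 0.3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

# Prior N(a0, A0) and posterior N(mu, A_n) from the formulas above.
a0 = np.zeros(d)
A0 = 10.0 * np.eye(d)                      # broad but finite prior covariance
A_n = np.linalg.inv(X.T @ X / sigma**2 + np.linalg.inv(A0))
mu = A_n @ (X.T @ y / sigma**2 + np.linalg.inv(A0) @ a0)

# Least-squares estimate (the A0 -> infinity limit).
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("posterior mean:", mu)
print("least squares :", w_ls)
```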
Once we have derived the posterior distribution, we can predict $y$ for a new input $\v x$. In general, this requires an integration that is intractable for complex models. However, for linear regression, the integral can be evaluated analytically and the result written in closed form. By definition, we must combine the predictions for all possible $\v w$:
\[p(y \mid\v x ) = \gray \int \black p(y\mid\v w, \v x)\ \gray p(\v w \vert \v X, \v y) d\v w\]For the likelihood, we assume the same Normal distribution as we used for calculating the posterior:
\[p(y \mid \v w, \v x) = \Norm{\v x^T\v w}{\sigma^2}\]Combining the likelihood and the posterior under the integral, we first group the common terms together and then “complete the square”, pulling out of the integral every factor that depends on $y$ but not on $\v w$. Here is the crux: the remaining integral is of a Normal density in $\v w$ (up to a constant), so it evaluates to a value independent of $y$ and drops out under proportionality.
\[\begin{align*} p(y |\v x ) &\propto\int \text{exp}\bigg[-\frac12 \frac{(y - \v x^T \v w)^2}{\sigma^2} \bigg] \text{exp}\bigg[-\frac12 (\v w - \v \mu)^T \v A_n^{-1} (\v w - \v\mu) \bigg] d\v w \\[10px] &\propto\int \text{exp}\bigg[-\frac12 \bigg\{ \frac {y^2} {\sigma^2} + \v w^T \underbrace{\bigg(\frac{1}{\sigma^2}\v x \v x^T + \v A_n^{-1}\bigg)}_{\v M}\v w - 2\v w^T \underbrace{\bigg( \frac{y}{\sigma^2} \v x + \v A_n^{-1}\v \mu\bigg)}_{\v m} \bigg\}\bigg] d\v w \\[0px] &\propto \text{exp}\bigg[ -\frac 1 2 \bigg(\frac {y^2} {\sigma^2} -\v m^T\v M^{-1} \v m \bigg) \bigg]\gray \int \text{exp}\bigg[ -\frac 1 2(\v w - \v M^{-1}\v m)^T\v M(\v w -\v M^{-1}\v m) \bigg] d\v w \\[10px] \end{align*}\]Expanding $\v m$ and discarding the terms that do not depend on $y$, we obtain:
\[p(y | \v x) \propto \exp \bigg[ - \frac 1 2 \bigg\{ \frac {y^2}{\sigma^2}\bigg(1 - \frac 1 {\sigma^2} \v x^T \v M^{-1} \v x\bigg) - y \frac 2 {\sigma^2} \v x^T \v M^{-1} \v A_n^{-1}\v \mu \bigg\}\bigg]\]The formula looks complex, but these are only the coefficients of $y^2$ and $y$. This is a Normal distribution of the form $p(y\mid\v x) \propto \exp[-\frac 1 2 ( \alpha y^2 + \beta y )]$. It’s possible to simplify the above expression and read off the distribution parameters, the mean and the variance, but this requires knowing the Sherman–Morrison formula (see Proposition 3). Another approach is to use the properties of the expectation and the variance. For the expectation, we have:
\[\begin{align*} &E_{\v w}(y \mid \v x) = E_{\v w}(\v x ^T \v w) = \v x^TE_{\v w}(\v w) = \v x^T \v \mu \end{align*}\]For the variance, we use the expectation over $y$ and $\v w$:
\[\begin{align*} \text{Var}(y \mid \v x) &= E_{y, \v w}\Big[(y - \v x^T\v\mu)^2 \Big] \\ &= E_{y, \v w}\Big[\big((y - \v x^T\v w) + (\v x^T\v w - \v x^T\v \mu)\big)^2 \Big] \\ &= \gray\underbrace{\black E_{y, \v w}\Big[(y - \v x^T\v w)^2\Big]}_{\black=\ \normalsize{\sigma^2}}\black + \gray\underbrace{\gray E_{y, \v w}\Big[2(y - \v x^T\v w)(\v x^T\v w - \v x^T\v \mu)\Big]}_{\black=\ \normalsize{0}}+ \gray\underbrace{\black E_{\v w}\Big[\big(\v x^T(\v w - \v\mu)\big)^2 \Big]}_{\black=\ \normalsize{ \v x^T\v A_n\v x}} \end{align*}\]Alternatively, we could use $\text{Var}(y \mid \v x) = \text{Var}(\v x^T\v w + \epsilon \mid \v x)$ and leverage the fact that the noise is independent of the prediction. Either way, we reach the final form:
\[p(y | \v x) = \MVN {y} {\v x^T \v \mu} {\sigma^2 + \v x^T\v A_n\v x}\]This agrees with linear regression from the frequentist perspective: the predictive mean is the product of the input vector with the MAP estimate of the weights (for a Normal distribution, the mean coincides with the mode). Moreover, the prediction’s variance combines the inherent noise $\sigma^2$ with the uncertainty in the regression function itself.
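To make the result concrete, here is a small sketch, again on invented toy data, that evaluates the closed-form predictive mean and variance for a new input and checks them against a Monte Carlo estimate obtained by sampling $\v w$ from the posterior and adding observation noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup: y = X w_true + Gaussian noise.
n, d, sigma = 200, 3, 0.3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

A0_inv = np.eye(d) / 10.0                  # prior precision
A_n = np.linalg.inv(X.T @ X / sigma**2 + A0_inv)
mu = A_n @ (X.T @ y / sigma**2)            # prior mean a0 = 0

x_new = np.array([0.5, -1.0, 2.0])

# Closed-form predictive moments.
pred_mean = x_new @ mu
pred_var = sigma**2 + x_new @ A_n @ x_new

# Monte Carlo check: sample w ~ N(mu, A_n), then y ~ N(x^T w, sigma^2).
w_samples = rng.multivariate_normal(mu, A_n, size=100_000)
y_samples = w_samples @ x_new + sigma * rng.normal(size=100_000)

print(pred_mean, y_samples.mean())
print(pred_var, y_samples.var())
```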
Conclusions
Bayesian linear regression offers an alternative to traditional frequentist methods, emphasizing uncertainty in the parameter estimates rather than relying on a single point estimate. By incorporating prior knowledge and updating it with observed data, the Bayesian approach provides more robust predictions. Unfortunately, for complex models, calculating the posterior and performing inference can be computationally challenging. However, Bayesian linear regression can also be paired with neural networks by replacing the input $\v X$ with the activations $\v H$ of the last hidden layer. This allows it to take advantage of high-level learned features, resulting in uncertainty estimates that are more precise and reliable.
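As an illustration of that last idea, here is a minimal sketch in which a fixed random feature map stands in for the last hidden layer of a trained network; the feature map, the data, and all numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3

# Toy 1-D regression data (assumed, not from the text).
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + sigma * rng.normal(size=100)

# Stand-in for a trained network's last hidden layer: a fixed random
# projection with a tanh nonlinearity. In practice H would be the
# activations produced by the trained network itself.
W1 = rng.normal(size=(1, 32))
b1 = rng.normal(size=32)
H = np.tanh(X @ W1 + b1)

# Bayesian linear regression on the features H instead of the raw inputs.
A0_inv = np.eye(32)                        # unit prior precision, zero prior mean
A_n = np.linalg.inv(H.T @ H / sigma**2 + A0_inv)
mu = A_n @ (H.T @ y / sigma**2)

# Predictive mean and variance at a new input x = 0.5.
h_new = np.tanh(np.array([[0.5]]) @ W1 + b1)[0]
print(h_new @ mu, sigma**2 + h_new @ A_n @ h_new)
```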
Appendix
Proposition 1. Change of variables:
\[\Normx{\v x}{\v A\v z}{\v B} \propto \Normx{\v z}{(\v A^T \v B^{-1}\v A)^{-1}\v A^T \v B^{-1}\v x} {(\v A^T \v B^{-1} \v A)^{-1}}\]Proof. For a symmetric matrix $\v M$, the method of “completing the square” is:
\[\gray (\v z - \v M^{-1}\v m)^T \v M(\v z - \v M^{-1}\v m) = \v z^T \v M\v z - 2\v z^T\v m + f(\v m)\]Applying this identity to the definition of the multivariate normal distribution, we obtain:
\[\gray \begin{align*} \MVN{\v x}{\v A\v z}{\v B} &\propto \exp\bigg[-\frac 1 2 (\v x - \v A\v z)^T\v B^{-1}(\v x - \v A\v z)\bigg] \\ &\propto \exp\bigg[-\frac 1 2 (\v z^T\v A^T\v B^{-1}\v A \v z - 2\v z^T\v A^T \v B^{-1}\v x) \bigg] \\ &\propto \exp\bigg[-\frac 1 2 (\v z^T\v M\v z - 2\v z^T\v m)\bigg] \\ &\propto \exp\bigg[-\frac 1 2 (\v z - \v M^{-1}\v m)^T \v M(\v z - \v M^{-1}\v m)\bigg] \\ &\propto \MVN{\v z} {(\v A^T\v B^{-1}\v A)^{-1}\v A^T\v B^{-1}\v x} {(\v A^T\v B^{-1}\v A)^{-1}} \\ \end{align*}\]If $\v x = \v A \v z$ is a multivariate normal random variable and $\v A$ is invertible, then $\v z$ is also multivariate normal. The formula, while appearing complex, makes intuitive sense. Let the covariance matrix of $\v z$ be $\v \Sigma$. Then the covariance of $\v x$ is:
\[\gray \begin{align*} \text{Cov}(\v x) &= E\bigg[(\v x - E(\v x))(\v x - E(\v x))^T \bigg]\\[2px] &= E\bigg[(\v A\v z - \v A E(\v z))(\v A\v z - \v A E(\v z))^T \bigg]\\[2px] &= \v A E\bigg[(\v z - E(\v z))(\v z - E(\v z))^T\bigg]\v A^T = \v {A\Sigma A}^T \end{align*}\]When both $\v A$ and $\v \Sigma$ are square and invertible matrices,
\[\gray \begin{align*} \MVN{\v x}{\v A\v z}{\v A\v\Sigma\v A^T} &\propto \MVN{\v z}{\v A^{-1} \v x}{\v \Sigma} \end{align*}\]
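The identity in Proposition 1 is also easy to spot-check numerically: viewed as functions of $\v z$, the two densities should differ only by a constant factor. A small sketch with arbitrary matrices (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_gauss(v, mean, cov):
    # Log-density of a multivariate Normal distribution.
    d = len(v)
    diff = v - mean
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1]
                   + diff @ np.linalg.inv(cov) @ diff)

# Arbitrary A (3x2), symmetric positive-definite B (3x3), and a fixed x.
A = rng.normal(size=(3, 2))
B = np.eye(3) + 0.1 * np.ones((3, 3))
x = rng.normal(size=3)

Binv = np.linalg.inv(B)
cov_z = np.linalg.inv(A.T @ Binv @ A)
mean_z = cov_z @ A.T @ Binv @ x

# As functions of z, the two log-densities differ by the same constant.
for z in rng.normal(size=(3, 2)):
    print(log_gauss(x, A @ z, B) - log_gauss(z, mean_z, cov_z))
```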
Proposition 2. Product of two normal distributions:
\[\Normx{\v u}{\v a}{A} \Normx{\v u}{\v b}{B} \propto \Normx{\v u} {(A^{-1} + B^{-1})^{-1}(A^{-1}\v a +B^{-1}\v b)}{(A^{-1} + B^{-1})^{-1}}\]Proof. Using the method of “completing the square”, we have:
\[\gray \begin{align*} \Normx{\v u}{\v a}{A} \Normx{\v u}{\v b}{B} &\propto \exp\bigg[-\frac 1 2\bigg( (\v u- \v a)^T A^{-1} (\v u- \v a) + (\v u- \v b)^T B^{-1} (\v u- \v b) \bigg)\bigg] \\ &\propto \exp\bigg[-\frac 1 2\bigg( \v u^T(A^{-1} + B^{-1})\v u - 2(A^{-1}\v a +B^{-1}\v b)^T\v u \bigg)\bigg] \\ &\propto \exp\bigg[-\frac 1 2 (\v u^T\v M\v u - 2\v m^T\v u)\bigg] \\ &\propto \exp\bigg[-\frac 1 2 (\v u - \v M^{-1}\v m)^T\v M(\v u - \v M^{-1}\v m)\bigg] \\ &\propto \Normx{\v u}{(A^{-1} + B^{-1})^{-1}(A^{-1}\v a +B^{-1}\v b)}{(A^{-1} + B^{-1})^{-1}} \end{align*}\]
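A quick one-dimensional spot check of Proposition 2 (with arbitrary values): the pointwise product of the two densities, once renormalized, matches the stated Normal.

```python
import numpy as np

# 1-D case: N(u; a, A) * N(u; b, B), renormalized, should equal
# N(u; (a/A + b/B) / (1/A + 1/B), 1 / (1/A + 1/B)).
a, A = 1.0, 2.0      # mean and variance of the first factor (arbitrary)
b, B = -0.5, 0.5     # mean and variance of the second factor (arbitrary)

def norm_pdf(u, mean, var):
    return np.exp(-0.5 * (u - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

u = np.linspace(-6, 6, 2001)
du = u[1] - u[0]

product = norm_pdf(u, a, A) * norm_pdf(u, b, B)
product /= product.sum() * du              # renormalize the pointwise product

var = 1.0 / (1.0 / A + 1.0 / B)
mean = var * (a / A + b / B)
print(np.max(np.abs(product - norm_pdf(u, mean, var))))   # close to zero
```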
Proposition 3. Show that the distribution $p(y \mid \v x)$ has variance $\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf x$ using the Sherman–Morrison identity (here $\mathbf A$ stands for the posterior covariance, written $\v A_n$ in the main text):
\[p(y \mid \v x) \propto \text{exp}\bigg[ -\frac 1 2 \bigg\{\frac{y^2}{\sigma^2} \left( 1 - \frac{1}{\sigma^2} \mathbf{x}^T \mathbf{M}^{-1} \v x \right) \bigg\} \bigg]\]where $\mathbf M = \frac{1}{\sigma^2} \mathbf{x} \mathbf{x}^T + \mathbf{A}^{-1}$.
Proof. Using the Sherman–Morrison formula for the inverse of a “rank-1 update” [1], we compute:
\[\gray \mathbf{M}^{-1} = \mathbf{A} - \frac{\mathbf{A} \left( \frac{1}{\sigma^2} \mathbf{x}\mathbf{x}^T \right) \mathbf{A}}{1 + \frac{1}{\sigma^2} \mathbf{x}^T \mathbf{A} \mathbf{x}} = \mathbf{A} - \frac{\mathbf{A} \mathbf{x} \mathbf{x}^T \mathbf{A}}{\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}}.\]Multiplying $\mathbf{M}^{-1}$ on the left and right by $\mathbf{x}^T$ and $\mathbf{x}$ gives:
\[\gray \mathbf{x}^T \mathbf{M}^{-1} \mathbf{x} = \mathbf{x}^T \mathbf{A} \mathbf{x} - \frac{\mathbf{x}^T \mathbf{A} \mathbf{x} \mathbf{x}^T \mathbf{A} \mathbf{x}}{\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}}.\]Let $c = \mathbf{x}^T \mathbf{A} \mathbf{x}$. Then:
\[\gray \mathbf{x}^T \mathbf{M}^{-1} \mathbf{x} = c - \frac{c^2}{\sigma^2 + c} = \frac{c\sigma^2}{\sigma^2 + c} = \frac{\sigma^2 \mathbf{x}^T \mathbf{A} \mathbf{x}}{\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}}.\]Substituting this result into the exponent, the coefficient of $y^2$ becomes:
\[\gray \frac{1}{\sigma^2} \left(1 - \frac{1}{\sigma^2} \mathbf{x}^T \mathbf{M}^{-1} \mathbf{x} \right) = \frac{1}{\sigma^2} \left( 1 - \frac{\mathbf{x}^T \mathbf{A} \mathbf{x}}{\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}} \right) = \frac{1}{\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}}.\]This confirms that $p(y \mid \v x)$ follows the Normal distribution with variance $\sigma^2 + \mathbf{x}^T \mathbf{A} \mathbf{x}$.
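The same conclusion can be spot-checked numerically by forming $\mathbf M$ for arbitrary values and inverting it directly:

```python
import numpy as np

rng = np.random.default_rng(5)

# Arbitrary positive-definite A, input x, and noise level (illustrative values).
d, sigma = 4, 0.7
L = rng.normal(size=(d, d))
A = L @ L.T + np.eye(d)
x = rng.normal(size=d)

M = np.outer(x, x) / sigma**2 + np.linalg.inv(A)

# Coefficient of y^2 in the exponent vs. 1 / (sigma^2 + x^T A x).
coef = (1 - x @ np.linalg.inv(M) @ x / sigma**2) / sigma**2
print(coef, 1.0 / (sigma**2 + x @ A @ x))   # the two numbers should match
```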