Backpropagation is the cornerstone for computing gradients within neural networks, essential for training. Consider a feedforward neural network $f[\v x_i; \v \phi]$ containing $K$ hidden layers with the $\text{ReLU}$ activation function. For $k \in {0, 1, \ldots, K}$ where $\v h_0 = \v x$,

\[\begin{align*} &\f_k = \v \beta_k + \v W_k\v h_k \\[2px] &\v h_{k+1} = a[\f_k]\\[3px] &l_i =l[\f_k, y_i] \end{align*}\]

Neural network learning reduces to calculating gradients of model parameters ${\part l_i}/{\part \v \beta_k}$ and ${\part l_i}/{\part W_k}$ as they provide signal for optimization methods. To calculate them, we use the chain rule. By default, we use a reverse-mode differentiation and start with the gradient of the loss function $\part l_i / \part \f_k$ and then propagate back through the network via a series of Jacobian matrices, one for each transformation. It’s like an alternative neural network, operating in reverse (from outputs to inputs).

\[\frac{\part l_i}{\part \v x} = \gray \bigg(\bigg(\gray \frac{\part l_i}{\part \f_k}^T \black \frac{\part \f_k}{\part \v h_k} \gray\bigg)\black \frac{\part \v h_k}{\part \f_{k-1}} \gray\bigg)\black\cdots \frac{\part \f_0}{\part \v x}\]

Passing gradients from one transformation to another, the gradient flows from the outputs back to the inputs. Having access to this flow, so for all $\part l_i / \part \f_k$, we can calculate the desired gradients of model parameters.

Jacobians & Gradients

To understand the above flow, we must understand Jacobians, as they represent how changes in inputs affect transformation outputs. For a function $\f: \mathbb R^n \to \mathbb R^m$, the Jacobian matrix contains all partial derivatives $\part f_i / \part x_j$. We define the Jacobian such that columns are $\part \f / \part x_i$ and rows are $\nabla^T f_i$:

\[\gray \v J = \begin{bmatrix}{ \dfrac {\partial f_{1}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{1}}{\partial x_{n}}}\\\vdots &\ &\vdots \\{\dfrac {\partial f_{m}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{m}}{\partial x_{n}}} \end{bmatrix}\]

For a simple feedforward network, we encounter two types of Jacobians. First, consider the Jacobian $\part \f_k / \part \v h_k$. When $\v w$ is the first row of $\v W_k$, the first component of the function $\f_k$ in the forward pass is:

\[\gray \begin{align*} z = \v w^T \v h_k + b &&&&\text{then}&&&& \part z /\part \v h_k = \v w \end{align*}\]

The weights $\v w$ define how each element of $\v h_k$ impacts $z$. Since this is exactly the first row of the Jacobian, we can derive other rows similarly, what leads us to:

\[\part \f_k / \part \v h_k = \v W\]

The second matrix $\part \v h_{k+1} / \part \f_k$ is diagonal because the activation function is a point-wise operation - each dimension is processed independently. For ReLU, on its diagonal, there are indicators of whether activation was positive or not. Stacking Jacobians together, the backward pass becomes:

\[\frac{\part l_i}{\part \f_{k-1}} = \gray \frac{\part l_i}{\part \f_k}^T\space \black \v W_k \odot \mathbb I(\f_{k-1} > 0)\]

where $\odot$ defines point-wise multiplication, serving as an alternative for a diagonal matrix. PyTorch and JAX frameworks use this convention for defining the Jacobian. However, to calculate from right to left, as in matrix operations, the Jacobian must be transposed [1].

With access to all derivatives $\part l_i / \part \f_k$, we can compute the gradients of model parameters. For any layer $k$,

\[\begin{align*} \frac{\part l_i}{\part \v \beta_k} = \gray \frac{\part l_i}{\part \f_k} &&&&\text{and}&&&& \frac{\part l_i}{\part W_k} = \gray\frac{\part l_i}{\part \f_k} \black\v h_k^T \end{align*}\]

Since each row of $\v W$ works independently in the forward pass, we can treat them separately and analyze again the first component $z$ of the function $\f_k$. We connect the derivative $\part l_i /\part z$, which represents the upstream effect of $z$ within the network, with the weights $\v h_k$ that control how each component of $\v w$ impacts $z$. Since each row uses the same $\v h_k$, instead of using duplicated rows, we write a single row vector $\v h_k^T$.

Automatic differentiation

Deriving backpropagation equations by hand and implementing them is time-consuming and prone to errors. Calculating derivatives numerically using finite differences is imprecise and scales poorly in high dimensions. Automatic differentiation (AD) streamlines this process by augmenting the forward pass code with additional variables that accumulate the required derivatives.

Frameworks such as PyTorch or JAX build computational graphs where nodes represent functional components (linear layer, ReLU activation, loss function) and edges represent tensors. We implement the forward pass using these functional blocks, which come with built-in Jacobian calculations necessary for backpropagation.

There are two main AD types: forward-mode and reverse-mode. Consider a feedforward network $\f: \mathbb R^n \to \mathbb R ^m$ with two hidden layers, for which we write the Jacobian using the chain rule. Regardless of the order in which we multiply these matrices, the product remains the same because matrix multiplication is associative. However, the computational complexity depends on the order of multiplication, where the shape of the matrices dictates the optimal grouping and minimal cost [1].

\[\begin{align*} \frac{\part \f_2}{\part\v x} = \gray\bigg(\black \frac{\part \f_2}{\part\f_1} \gray\bigg(\black \frac{\part \f_1}{\part\f_0} \frac{\part \f_0}{\part\v x} \gray\bigg)\black \gray\bigg)\black && \gray\text{forward-mode} \\[10px] \frac{\part \f_2}{\part\v x} = \gray\bigg(\black \gray\bigg(\black \frac{\part \f_2}{\part\f_1} \frac{\part \f_1}{\part\f_0} \gray\bigg)\black \frac{\part \f_0}{\part\v x} \gray\bigg)\black && \gray\text{reverse-mode} \\ \end{align*}\]

In forward-mode, we start with defining a tuple $(\v x, \dot {\v x})$ of the primal and tangent variable (a dual number). Instead of performing forward propagation to compute intermediate representations $\v z_i$, we propagate tuples $(\v z_i,\dot {\v z}_i)$ where variables and derivatives are evaluated in parallel. To find $\part \f_2 / \part x_1$, we begin with $\dot{\v x} = \v e_1$, and all intermediate tangent variables are $\dot{\v z}_i = \part \v z_i / \part x_1$. One forward pass gives one column of the Jacobian matrix. However, for a specific direction $\v x = \dot {\v r}$​, it also costs a single forward pass. More about dual numbers in [1, 2].

Here is a code example demonstrating forward-mode differentiation in JAX.
import jax
import numpy as np
import jax.numpy as jnp
from jax import jvp

W = jnp.array(np.random.randn(3, 5))
x = jnp.array(np.random.randn(5))
f = lambda x: jax.nn.relu(W @ x)

# The directional derivative
r = jnp.array(np.random.randn(5))
y, u = jvp(f, primals=(x,), tangents=(r,))

# Manually compute the Jacobian-vector product
dh_dz = jnp.diag(jnp.where(W @ x > 0, 1., 0.))
dz_dx = W
jvp = dh_dz @ dz_dx @ r
assert jnp.allclose(jvp, u)

Forward-mode is useful if $m \gg n$ because it requires a separate forward pass for each input variable. This situation occurs in specific cases, such as calculating the Hessian [1, 2]. However, in many cases, the loss function is at the end of the model, and we have the large numbers of model parameters to differentiate with respect to. With $m=1$​, reverse-mode performs vector-matrix products, requiring only a single backward pass to compute all derivatives. This makes it the default mode for most ML frameworks.

Here is a code example demonstrating reverse-mode differentiation in PyTorch.
import torch

x = torch.randn(5, requires_grad=True)
W = torch.randn(3, 5)

# Forward pass
z = W @ x
h = torch.relu(z)

# Reverse-mode AD
v = torch.randn(3)

# Manually compute vector-Jacobian product
dh_dz = torch.diag(torch.where(z > 0, 1., 0.))
dz_dx = W
vjp = v @ dh_dz @ dz_dx
assert torch.allclose(vjp, x.grad)

Reverse mode is often more memory intensive than forward mode because all intermediate variables must be stored, as they are needed when computing the vector-Jacobian products during the backward pass.


Backpropagation works if we cautiously initialize neural network weights. Otherwise, we would suffer from either too small or too large gradients, causing vanishing or exploding gradient problems. Here, we derive He initialization, which is widely used to keep the variance on a stable level for a dense linear transformation.

Consider a neuron within a hidden layer:

\[\v h = a[\v x] \\ Y= \beta + \v w^T\v h\]

The weights are initialized such that $W_i \sim \mathcal{N}(0, \sigma_w^2)$ and the bias term is zero. Using the linearity of expectation, the independence between $W_i$ and $H_i$, and $E(W_i) = 0$, we obtain:

\[\gray EY = E(\beta + \sum_i W_iH_i) = \sum_iE(W_iH_i) = \sum_i EW_i EH_i = 0\]

Since $EY=0$ and variables $W_i$ are independent of each other, the variance of $Y$ is equal to:

\[E(Y^2) = E(\sum_i W_i H_i)^2 = \sum_i E(W_i^2)E(H_i^2) + \gray \sum_{i\ne j}E(W_iW_j) E(Y_iY_j) \black = \sigma_w^2\sum_iE(H_i^2)\]

Using the law of total expectation, and assuming the distribution of $X$ is symmetric about zero,

\[E(H_i^2) = E(X_i^2| X_i > 0)P(X_i >0) = \frac {\sigma_x^2} 2\]

We can now derive the expression for the variance of $Y$ in terms of $X$:

\[\sigma_y^2 = \sigma_w^2\sum_iE(H_i^2) = \sigma_w^2\sum_i \frac {\sigma_x^2} 2 = \black \frac {d_{\text{in}}} 2 \sigma_w^2 \sigma_x^2\]

To keep the variance of $Y$ on a stable level, $\sigma_y^2 = \sigma_x^2$ thus $\sigma_w^2 = 2/d_{\text{in}}$. This is true for the forward pass but for the backward pass, we use the $W^T$ transformation. Following the same reasoning, we would get $\sigma_w^2 = 2/d_{\text{out}}$​. For a rectangular matrix, we cannot satisfy both conditions simultaneously, so one approach is to average them.

Residual connections

Deeper networks perform better, likely because they can define functions containing a large number of complex polygonal regions. As model complexity grows, we expect better performance (due to overparameterization) but it’s only true as long as the training is effective. For sequential deep networks, we cannot fit a model to the training set, which implies that the problem lays in training them, so back-propagating the gradient.

The gradients define the direction of progress for infinitesimal parameter changes, but optimization algorithms use a finite step size. As a result, the pre-update gradients are applied at a new position dictated by updates from preceding layers. When these gradients at different positions become uncorrelated, there is no progress. This issue, known as the shattered gradients problem [1], particularly affects deep sequential networks, where the correlation of gradients is close to zero, regardless of the step size.

Residual connections help to mitigate the shattered gradients problem by branching the computational path,

\[\begin{align*} \v h_1 &= \v x + \f_1[\v x, \v \phi_1] \\ \v h_2 &= \v h_1 + \f_2[\v h_1, \v \phi_2] \\ \v h_3 &= \v h_2 + \f_3[\v h_2, \v \phi_3] \\ \v y &= \v h_3 + \f_4[\v h_3, \v \phi_4] \end{align*}\]

Each addictive combination of the input and the processed output is a residual block. By substituting,

\[\begin{align*} \v y = \v x &+ \f_1[\v x] \\ &+ \f_2[\v x + \f_1[\v x]] \\ &+ \f_3[\v x + \f_1[\v x] + \f_2[\v x + \f_1[\v x]]] \\ &+ \f_4[\v x + \f_1[\v x] + \f_2[\v x + \f_1[\v x]] + \f_3[\v x + \f_1[\v x] + \f_2[\v x + \f_1[\v x]]]] \end{align*}\]

A network with residual connections becomes an ensemble of smaller networks [1] where the gradient can flow through shorter paths. For example, the Jacobian of $\v y$ with respect to $\f_1$ is:

\[\frac{\partial \v y}{\partial \f_1} = \v I + \frac{\partial \f_2}{\partial \f_1} + \left( \frac{\partial \f_3}{\partial \f_1} + \frac{\partial \f_3}{\partial \f_2}\frac{\partial \f_2}{\partial \f_1} \right) + \left( \frac{\partial \f_4}{\partial \f_1} + \frac{\partial \f_4}{\partial \f_2} \frac{\partial \f_2}{\partial \f_1} + \frac{\partial \f_4}{\partial \f_3} \frac{\partial \f_3}{\partial \f_1} + \frac{\partial \f_4}{\partial \f_3} \frac{\partial \f_3}{\partial \f_2} \frac{\partial \f_2}{\partial \f_1} \right)\]

The residual block typically contains multiple layers and often begins and ends with a linear transformation. Starting with a $\text{ReLU}$ activation, the block processes only nonnegative part of the input vector, which could potentially limit access to information. Ending with a $\text{ReLU}$ activation, all block outputs remain non-negative, narrowing the range of inputs. This restricts the amount of information that can be passed through a network. Classic residual architectures are ResNet, DenseNet and U-Net [1, 2, 3].