Normalizing flows

Published

23 May 2024

Author

Rafał Rolczyński

Normalizing flows are probabilistic generative models where we assume the one-to-one mapping, a bijection, between $p(\v z)$ and $p(\v x)$, the distribution of latent variables and data. If we choose $p(\v z)$ to be a simple base distribution from which we can easily draw samples, due to the connection between them, we can easily calculate the likelihood of data $p(\v x)$ applying the formula for a change of variables:

\[p(\v x) = \left| \frac{\part \v x}{\part \v z} \right|^{-1} p(\v z)\]

The absolute determinant of the Jacobian is crucial in connecting these two distributions. A transformation maps an event from one region in the latent space $\v{z}$ to another region in the data space $\v x$. Due to a bijection, the same number of infinitely many small fragments exists in both regions. However, two regions often differ in size, meaning the corresponding fragments $d\v z$ and $d \v x$ cannot be of the same size, even though they are infinitesimally small. The determinant of the Jacobian measures the change in size of these fragments.

The absolute value in the determinant is needed because a transformation can either increase or decrease in any direction. For example, if it’s strictly decreasing, $F_X(x) = 1 - F_Z(f^{-1}(x))$. Differentiating this results in $-p_Z(z)dz/dx$ where $dz/d x$ is negative. The general form uses the absolute value to cover both cases.

Loss function

In machine learning, a transformation function is an invertible neural network $\v x = f(\v z; \v \phi)$. It’s trained with a dataset $\lbrace\v x_i \rbrace$ using the negative log-likelihood loss:

\[\begin{align*} \hat{ \v \phi} &= \argmax \phi \prod_ip(\v x_i| \v \phi) \\ &= \argmax \phi \prod_i\left| \frac{\part \v x_i}{\part \v z_i} \right|^{-1} p(\v z_i) \\ &= \argmin \phi -\sum_i \bigg[ \log p(\v z_i) - \log \left| \frac{\part f(\v z_i;\v \phi)}{\part \v z_i} \right| \bigg] \end{align*}\]

where $\v z_i = f^{-1}(\v x_i)$ is the inverse mapping. Each of subsequent inverse layers gradually “flow” the complex data density toward the predefined normal distribution $p(\v z)$, thus the name normalizing flows.

We pass through the model only once in the inverse direction to calculate the likelihood. The above absolute Jacobian determinant is calculated as the reciprocal of the product of the absolute Jacobian determinants of all layers during the inverse pass. Each layer calculates activations and its transformation determinant. Each must be invertible; otherwise, its Jacobian would be singular, causing the overall determinant to be zero.

To successfully use normalizing flows, the neural network must be invertible and sufficiently expressive to map a multivariate standard normal distribution to an arbitrary density. Additionally, it must be possible to efficiently compute both its inverse and the determinant of the Jacobian.

Linear flows

A linear flow is a basic building block. Using a vanilla dense matrix is inefficient since inverting the transformation and calculating its determinant requires $\mathcal O(n^3)$ operations. Instead, we leverage triangular matrices $\v L$ and $\v U$ where solving each can be done in $\mathcal O( n^2)$ using forward and backward substitutions. Given $\v{LUx} = \v b$, we first solve $\v {L c} = \v b$ and then $\v {U x} = \v c$. Moreover, we keep pivots separately then the determinant is simply the product of the diagonal entries of $\v D$. This leads us to:

\[\v A = \v L(\v U + \v D)\]

If the base distribution is $\text{Norm}({\v \mu}, {\v \Sigma})$, the linear flow produces $\text{Norm}(\v A \v \mu + \v \beta, \v A \v \Sigma \v A^T)$. To transform into more complex distributions, we must add nonlinear functions. It could be pointwise and (almost everywhere) invertible activation function like the leaky $\text{ReLU}$ or more flexible piecewise linear functions.

Coupling flows

An invertible transformation can become complex if a simple invertible function $\g$ is controlled by complex parametrization. Parametrization changes the behavior of the function given inputs. For example, the self-attention transformation maps values using attentions, which are the transformation parameters derived from the inputs.

Since a transformation needs to be invertible, the input vector $\v x$ is divided into two parts $\v x_1$ and $\v x_2$. During the forward pass, $\v x_1$ is copied so it can be reused to recalculate parameters in the backward pass. As a result, there is no need to invert function $\v \phi(\cdot)$ which is often a neural network and cannot be inverted.

\[\begin{align*} \v y_1 \gets \v x_1 && \v y_2 &= \g[\v x_2, \v \phi(\v x_1)] && \gray \text{forward} \\ \v x_1 \gets \v y_1 && \v h_2 &= \g^{-1}[\v y_2, \v \phi(\v x_1)] && \gray \text{backward} \end{align*}\]

Here, we effectively transform only the vector $\v x_2$ because $\v x_1$ controls the conditional behavior of function $\g$. To ensure all dimensions are transformed across layers, we add a predefined and fixed permutation matrix $\v P$ to linear flows. For images, these linear flows are applied within $1\times1$ convolutions [1, 2].

Autoregressive flows

Autoregressive flows leverage parametrization similar to coupling flows but, instead of dividing the vector $\v x$ into two parts, they split it into individual elements $\v x = [x_1, \ldots, x_d]$. The invertible single-variable function $\g$ is parametrized by a neural network. To calculate the parameters of $\g$ for transforming $x_d$, the network uses all previous states $\v x_{1:d-1}$, hence the name autoregressive.

\[\begin{align*} y_d &= \g[x_d, \v \phi(\v x_{1:d-1})] && \gray\text{forward} \\ x_d &= \g^{-1}[y_d, \v \phi(\v x_{1:d-1})] && \gray\text{backward} \end{align*}\]

Computing parameters in forward pass can be done in parallel. A neural network $\v \phi(\cdot)$ processes all steps at once using masked elements of $\v x$, similar to masked attentions in autoregressive tasks. The forward processing is used in the normalizing direction $\v x \to \v z$ which is fast and allows efficient training. However, sampling (the backward pass) is slow because we must decode sequentially, as $\g^{-1}$ for $x_d$ depends on previous states $\v x_{1:d-1}$. Inverse autoregressive flows reverse this process, making sampling fast but training slower.

Residual flows

Residual flows use residual connections specifically designed to make the transformation easily invertible. Similar to coupling flows, the vector $\v x$ is divided into two parts, $\v x_1$ and $\v x_2$. For the forward pass,

\[\v y_1 = \v x_1 + \f_1(\v x_2\gray ; \v \phi_1\black) \\ \v y_2 = \v x_2 + \f_2(\v y_1\gray ; \v \phi_2\black) \\\]

For the backward pass,

\[\v x_2 = \v y_2 - \f_2(\v y_1\gray ; \v \phi_2\black) \\ \v x_1 = \v y_1 - \f_1(\v x_2\gray ; \v \phi_1\black) \\\]

Backward processing is sequential since calculating $\v x_1$ requires $\v x_2$. The Jacobian determinant $\part \v y / \part \v x$ is equal to:

\[\begin{vmatrix} \v I & {\frac{\part \f_1(\v x_2)}{\part \v x_2}} \\ {\frac{\part \f_2(\v y_1)}{\part \v y_1}} & \v I + {\frac{\part \f_1(\v x_2)}{\part \v x_2}}{\frac{\part \f_2(\v y_1)}{\part \v y_1}} \end{vmatrix}\]

Using the block elimination, it can be simplified to:

\[\gray \begin{align*} \begin{vmatrix} \v I & \v A \\ \v B & \v I + \v{AB} \end{vmatrix} = \begin{vmatrix} \v I & \v B \\ & \v I + \v{AB} - \v {BA} \end{vmatrix} \end{align*}\]

Functions $\f_1$ and $\f_2$ do not need to be invertible for the flow transformation to be invertible. Even if their gradients were singular, having zero determinants, the overall structure of the flow guarantees that the combined Jacobian determinant remains non-zero (as we still could invert the transformation). However, computing the Jacobian for general functions $\f_1$ and $\f_2$ remains a challenge [1].

This mechanism can be used to save memory during training. If a layer is invertible, storing activations for backpropagation is unnecessary since we can recalculate them in the backward pass [1].

Residual flows can be defined differently [1, 2] using the contraction mappings having the property:

\[\gray \begin{align*} \text{dist}[\f(x), \f(y)] \le \beta \cdot\text{dist}[x, y] && \forall x, y \end{align*}\]

where $0 < \beta < 1$ is the Lipschitz constant of $f$. The Banach theorem says that each contraction mapping has one unique fixed point. We can start from an arbitrary point $z_0$ and iterate the function $z_{k+1} = f(z_k)$, repeatedly pass outputs back in as inputs, and this process converges to a fixed point $z = f(z)$.

If $f(z)$ is a contraction mapping then $c- f(z)$ as well, where $c$ is a constant, because their absolute derivatives match. In the context of residual flows, it means that the residual connection $y = z + f(z)$ for given $y$ can be inverted following the same iterative process in which, starting with arbitrary $z_0$, the two imbalanced sides of the equation converge and find the equilibrium at a fixed point, $z = y - f(z)$. This iterative process is used only for sampling, not calculating the likelihood required for training.

\[z_{k+1} = y - f(z_k)\]

With a residual connection $\v x \gets \v x + \f(\v x)$, a multivariable transformation is invertible if the Lipschitz constant is less than one. Assuming that the slope of the activation functions is not greater than one, it means that the largest eigenvalue of each weight matrix $\v W$ must be less than one. Similar to Wasserstein distance in GANs, one crude approximation is to clip matrix weights to small values.

Additionally during the training, we must efficiently calculate the likelihood, thus the logarithm of the determinant of the Jacobian $\part \f(\v x, \v \phi) / \part \v x) = \v I + \v A$. In the appendix, we show 1) how the logarithm of the determinant can be represented using the trace, and 2) how to change the logarithm of a matrix into a power series.

\[\log \det(\v I + \v A) = \tr\log (\v I + \v A) = \sum_{k=1}^\infty\frac{(-1)^{k+1}}{k} \tr (\v A^k)\]

Even when we truncate this series, it’s still computationally expensive to compute the trace of individual terms. We approximate them using Hutchinson’s trace estimator (also in the appendix).

Multi-scale flows

Multi-scale flows introduce a hierarchical structure of the latent variables, dividing the latent space into several components $\v z = [\v z_1, \v z_2, \dots, \v z_n]$. Each is concatenated with hidden states at different levels. In the normalizing direction, the $\v z_i$ component skips the remaining processing and is passed directly to the loss function where they are assessed against the base density. This not only allow latent variables to control the generating process at multiple levels but also accelerates both density estimation and sampling, optimizing the flow [1].

\[\begin{align*} \v h_1 &\gets \f_1[\v z_1] \\ \v h_2 &\gets \f_2[\v z_2, \v h_1] \\ \v h_3 &\gets \f_3[\v z_3, \v h_2] \\ \end{align*}\]

Approximates

Normalizing flows can learn to generate samples by approximating a given density that is easy to evaluate but difficult to sample from. We define the normalizing flow $p(\v x \mid \v\theta)$ as the student, and the known target density $q(\v x)$ as the teacher.

During the training, we sample latent variables $\v z_i$ and generate samples $\v x_i = \f[\v z_, \v \phi]$ from the student’s model. We then use the teacher’s model to provide likelihoods necessary for the loss function based on the reverse KL divergence, which aims to align* the student and teacher distributions:

\[\hat {\v \phi} =\argmin \phi \bigg[D_\text{KL} \bigg[ \frac 1 n \sum_{i=1}^n \delta[\v x - \f[\v z_i, \v \phi]] \bigg \| q(\v x) \bigg] \bigg] \gray =\argmin \phi \bigg[-\sum_{i=1}^n \log q(\f[\v z_i; \v \phi])\bigg]\]

Note that the latent variables are known because we generate them ourselves. Since we do not need to inverse the transformation, we can use a model like a masked-autoregressive flow where inversion is slow.

This contrasts with the typical use of normalizing flows to build a probability model $p(\v x_i \mid \v \theta)$ of data that came from an unknown distribution with samples $\v x_i$ minimizing the forward KL divergence:

\[\gray \hat {\v \phi} =\argmin \phi \bigg[D_\text{KL} \bigg[ \frac 1 n \sum_{i=1}^n \delta[\v x - \v x_i] \bigg \| p(\v x | \v \theta) \bigg] \bigg] =\argmin \phi -\sum_i \bigg[ \log p(\v z_i) - \log \left| \frac{\part f(\v z_i;\v \phi)}{\part \v z_i} \right| \bigg]\]

Appendix

The trace can be used to represent the determinant. Consider matrices $\v A$ and $\v B$ with sizes $m \times n $ and $n \times m$.

\[\tr[\v {AB}] \gray = \sum_{i=1}^m \sum_{k=1}^n a_{ik}b_{ki} = \sum_{k=1}^n \sum_{i=1}^m b_{ki}a_{ik} \black = \tr[\v {BA}]\]

This symmetry extends to the product of more matrices. Due to the associative nature of matrix multiplication, the trace is invariant under cyclic permutations:

\[\tr[\v {ABC}] = \tr[\v {BCA}] = \tr[\v {CAB}]\]

Leveraging this property, for any diagonalizable matrix $\v A$ whose eigenvalues are nonzero,

\[\begin{align*} \tr \log \v A \gray = \tr (\v P\log (\v D) \v P^{-1}) =\tr(\log\v D) =\log\lambda_1 + \log \lambda_2 + \ldots =\black \log|\v A| \end{align*}\]

The logarithm of matrix $\v I + \v A$ can be expanded into a power series. Consider the function $\log(1-x)$. Its derivative can be expressed as a geometric series,

\[\gray \frac{d}{dx} \log(1+x) = \frac 1 {1+x} = 1 -x + x^2 - \ldots\]

This expression holds for $-1 < x < 1$ when the series converges since $x^{n+1} = 0$ with $n \to \infty$.

\[\gray (1 + x + x^2 + \ldots)(1-x) = 1 - x^{n+1}\]

By integrating and determining the constant (which is zero), the power series for the logarithm becomes:

\[\gray \log(1 + x) = \sum_{k=1}^\infty \frac{(-1)^{k+1}}{k} x^k\]

Following the same reasoning applied to matrices,

\[\log(1 + \v A) = \sum_{k=1}^\infty \frac{(-1)^{k+1}}{k} \v A^k\]

with the same underlying assumption that the geometric series converges. This requires:

\[\lim_{n\to\infty} \v A^{n+1} = \v 0\]

Convergence is guaranteed if all eigenvalues of $\v A$ have magnitudes less than one.

Hutchinson’s estimator can substantially speed up the computation of the trace of a matrix function $f(\v A)$. Consider a normal random variable vector $\v z$ with mean zero and variance $\v I$. The trace of a matrix $\v A$ can be estimated as:

\[\begin{align*} \tr[\v A] &= \tr[\v A E[\v z \v z^T]] \\ &=\tr[E[\v A\v z \v z^T]] \\ &=E[\tr[\v A\v z \v z^T]] \\ &=E[\tr[\v z^T\v A\v z ]] \\ &=E[\v z^T\v A\v z ] \end{align*}\]

The first line holds because $E[\v z \v z^T ] = \v I$. The next step uses the linearity of expectation. Then, since the expected value of the sum of random variables is the sum of their expected values, trace and expectation operators can be interchanged. In the fourth step, the trace’s cyclic permutation invariance results in a scalar, allowing us to omit the trace operator. The trace is then estimated by sampling $n$ observations $\v z_i$:

\[\tr[A] = E[\v z^T\v A \v z] \approx \frac{1}{n} \sum_{i=1}^n \v z_i^T \v A \v z_i\]

Consider a function $f(\v A)= \v A^k$ where $\v A \in \mathbb R^{d\times d}$. Directly calculating the trace of $\v A^k$ requires $\mathcal O((k-1)d^3)$ multiplications to form the matrix from which diagonal elements are then summed. Using Hutchinson’s estimator, we pass vectors $\v z_i$ through matrix $\v A$ iteratively ($k$ times), bypassing building explicitly matrix $\v A^k$. The estimation cost $\mathcal O(nkd^2 + nd)$ is significantly reduced for large $d$ and $k$.