Generative adversarial networks

The aim is to generate new samples $\lbrace \v x_j \rbrace$ that belong to the same distribution as a set of real training data $\lbrace \v x_i \rbrace$. We draw a latent variable $\v z_j$ from a simple predefined distribution, and a generator produces a new sample $\v x^\star_j = g[\v z_j, \v \theta]$. To find the model parameters, we need a signal $\part l / \part{ \v {x^\star}}$ that indicates how to change a generated sample so that it moves closer to the real data, where $l$ measures that distance.

Loss function

GANs use a discriminator to provide this signal. This is a binary classifier that predicts whether a given sample is real or not. Using the binary cross-entropy and grouping real ($i$) and generated ($j$) samples together,

\[\begin{align*} \hat{ \v \phi} = \argmin \phi \bigg[ -\sum_j \log(1-\lambda(\v x^\star_j)) - \sum_i \log \lambda(\v x_i) \bigg] &&\gray \lambda(x) = \text{sig}[f(x, \v \phi)] \end{align*}\]

For the generated half of each batch (the batch is mixed so that $P(\text{real}) = 1/2$), we can propagate gradients further back through the discriminator to train the generator:

\[\begin{align*} \hat{\v \theta} = \argmax \theta\bigg[ -\sum_j \log(1-\lambda(\v z_j)) \bigg] && \gray \lambda(\v z_j) = \text{sig}[f(g(\v z_j, \v \theta), \v \phi)] \end{align*}\]

GAN training is a minimax game. The generator tries to find new ways to fool the discriminator, which in turn searches for new ways to distinguish generated samples from real examples. Let $D$ and $G$ be the discriminator and the generator respectively. Multiplying by $-1$ (so the discriminator maximizes and the generator minimizes) and replacing the sums with expectations, we obtain:

\[\min_{G} \max_D V(D,G) = E_{\v x \sim p}[\log D(\v x)] + E_{\v z \sim \text{noise}}[\log (1-D(G(\v z)))]\]
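
To make this concrete, here is a minimal PyTorch sketch of one training step. The toy networks, shapes, and stand-in data are illustrative assumptions, and the generator update uses the common non-saturating variant (pushing $\lambda(\v x^\star)$ towards one) rather than the saturating objective above.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))  # f(x, phi), returns logits
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()                           # applies sig[.] internally

x_real = torch.randn(batch, data_dim) + 3.0            # stand-in for real training data
z = torch.randn(batch, latent_dim)                     # latent samples

# Discriminator step: real -> 1, generated -> 0
x_fake = G(z).detach()
loss_d = bce(D(x_real), torch.ones(batch, 1)) + bce(D(x_fake), torch.zeros(batch, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: make the discriminator output lambda(x*) -> 1
loss_g = bce(D(G(z)), torch.ones(batch, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```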

Hard to train

GANs are hard to train, and a common problem is mode dropping, where the network generates plausible samples that cover only a subset of the data. To see why, inspect the (negated) discriminator loss, written as integrals over the generated distribution $q(\v x)$ and the true distribution $p(\v x)$:

\[\gray \begin{align*} L(\v \phi) = \int q(\v x)\log(1- \lambda(\v x)) d\v x + \int p(\v x)\log \lambda(\v x) d\v x \\ \end{align*}\]

When we mix generated and true samples in even proportions, $P(\text{real}) = 1/2$, the optimal discriminator is:

\[\gray \begin{align*} \lambda(x) = P(\text{real}\mid \v x) = \frac{P(\v x\mid \text{real})} {P(\v x\mid \text{real}) + P(\v x\mid \text{gen})} = \frac{p(\v x)}{p(\v x) + q(\v x)} \end{align*}\]

We substitute this and obtain, up to an additive constant, twice the Jensen-Shannon divergence between the true and generated distributions:

\[\begin{align*} L(\v \phi) = \gray -\log 4 + \underbrace{\black \int q(\v x)\log \frac{q(\v x)}{m(\v x)}d\v x}_{\gray \text{quality}} + \underbrace{\black \int p(\v x)\log \frac{p(\v x)}{m(\v x)}d\v x}_{\gray \text{coverage}} && \gray m(x) = \frac{p(x) + q(x)} 2 \end{align*}\]

The generator is trained only through the terms that involve generated samples, so the coverage term is not included in its objective. That is why it may fail to cover all modes of $p(\v x)$.

The loss measures the distance between the generated and true distributions. In theory, if the two distributions are completely disjoint, this distance saturates at its maximum value regardless of how far apart they are, and any small change to the generator will not decrease the loss. In other words, if the discriminator can perfectly separate the generated and real samples, the gradient is flat: no small change to the generated samples changes the classification score.
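
As a quick check, suppose $p(\v x)$ and $q(\v x)$ have disjoint supports. Then $m(\v x) = q(\v x)/2$ wherever $q > 0$ and $m(\v x) = p(\v x)/2$ wherever $p > 0$, so

\[\gray L(\v \phi) = -\log 4 + \int q(\v x)\log 2\, d\v x + \int p(\v x)\log 2\, d\v x = -\log 4 + 2\log 2 = 0\]

no matter how far apart the two distributions sit.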

The generated samples lie in a low-dimensional subspace whose dimensionality is at most that of the latent variable $\v z$. The real examples $\v x$ also lie in a low-dimensional subspace due to the physical processes that generate them. As a result, there may be little or no overlap between $p(\v x)$ and $q(\v x)$. This results in very small and unstable gradients, which destabilizes training.

Wasserstein distance

A different distance metric with better properties is the Wasserstein or earth mover’s distance [1]. It measures the work required to transport the probability mass of one distribution to form the other. It is well defined even when the distributions are disjoint and changes smoothly as they move closer to each other [1, 2, 3].

Let $\gamma$ be a joint probability distribution that belongs to the set $\Pi(p, q)$ of all distributions with marginals $p$ and $q$. We search for the transport plan $\gamma$ that minimizes the total cost, using the Euclidean distance as the cost. The earth mover’s distance (in the discrete case) is:

\[\text{EMD}(p, q) = \inf_{\gamma \in \Pi} \sum_{x, y} \| x - y\| \gamma(x, y) = \inf_{\gamma \in \Pi} E_{(x, y) \sim \gamma} \| x - y \| = \gray \inf_{\gamma \in \Pi} \langle \v \Gamma, \v D \rangle _{\text{F}}\]

The Frobenius inner product sums all the element-wise products of $\v \Gamma = \gamma(x, y)$ and $\v D = \|x - y\|$. Flattening them into vectors $\v x = \text{vector}(\v \Gamma)$ and $\v c = \text{vector}(\v D)$, we can leverage linear programming: find a vector $\v x$ that minimizes the cost $z = \v c^T \v x$ subject to $\v A \v x = \v b$ and $\v x \ge 0$. Note that $\v A$ is given; it is a sparse matrix of zeros and ones with a specific structure that marginalizes a joint distribution of two random variables. For example, if the random variables are discrete with three possible states, $\v A \v x = \v b$ is:

\[\gray \begin{bmatrix} 1&1&1&&&&&& \\ &&&1&1&1&&& \\ &&&&&&1&1&1 \\ 1&&&1&&&1&& \\ &1&&&1&&&1& \\ &&1&&&1&&&1 \\ \end{bmatrix} \begin{bmatrix} \gamma(x_1, y_1)\\ \gamma(x_1, y_2)\\ \gamma(x_1, y_3)\\ \gamma(x_2, y_1)\\ \gamma(x_2, y_2)\\ \gamma(x_2, y_3)\\ \gamma(x_3, y_1)\\ \gamma(x_3, y_2)\\ \gamma(x_3, y_3)\\ \end{bmatrix} = \begin{bmatrix} p(x_1)\\ p(x_2)\\ p(x_3)\\ q(y_1)\\ q(y_2)\\ q(y_3)\\ \end{bmatrix}\]
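
For intuition, here is a small illustrative Python sketch (not from the text) that builds this constraint matrix for two 3-state distributions on an assumed one-dimensional support and solves the primal problem with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.3, 0.2])            # marginal p(x)
q = np.array([0.1, 0.4, 0.5])            # marginal q(y)
points = np.array([0.0, 1.0, 2.0])       # assumed 1-D support shared by x and y
n = len(p)

c = np.abs(points[:, None] - points[None, :]).ravel()   # cost vector, D_ij = |x_i - y_j|

A = np.zeros((2 * n, n * n))
for i in range(n):
    A[i, i * n:(i + 1) * n] = 1.0        # sum_j gamma(x_i, y_j) = p(x_i)
    A[n + i, i::n] = 1.0                 # sum_i gamma(x_i, y_j) = q(y_j)
b = np.concatenate([p, q])

res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))      # minimize c^T x s.t. Ax = b, x >= 0
print("EMD:", res.fun)
```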

In practice, the problem is intractable for high-dimensional inputs. However, to train a model we only need the value of the $\text{EMD}$ and its gradient with respect to $q$ (which appears here only through the constraints), not the entire transport plan $\gamma$. We therefore use the dual form of the linear program, which expresses the same problem in terms of new variables: maximize $\tilde z = \v b^T \v y$ subject to $\v A^T \v y \le \v c$. The strong duality theorem says that the two forms are equivalent and $z = \tilde z$.

Splitting $\v b$ into its two marginal parts $\v p$ and $\v q$, and the dual variable $\v y$ into corresponding vectors $\f$ and $\g$, the dual form becomes:

\[\gray \text{EMD}(p, q) = \f^T\v p +\g^T\v q\]

under the constraints $\v A^T \v y \le \v c$. Because each row of $\v A^T$ contains only two non-zero elements, these simplify to $f(x_i) + g(x_j) \le \v D_{i,j}$. For $i = j$, where $\v D_{i,j} = 0$, this requires $f(x_i) + g(x_i) \le 0$. Since we maximize $\f^T \v p + \g^T \v q$ with $\v p \ge 0$ and $\v q \ge 0$, it is optimal to push $g$ up to this bound and set $g = -f$; one can check that this choice is consistent with all the other constraints.

Using $g = -f$, the constraints become $f(x_i) - f(x_j) \le \v D_{i,j}$ and $f(x_i) - f(x_j) \ge -\v D_{i,j}$, i.e. $|f(x_i) - f(x_j)| \le \v D_{i,j}$: the function cannot change faster than the distance between the points. For continuous probability distributions, $f$ must be a Lipschitz continuous function with Lipschitz constant $1$.

\[\text{EMD}(p, q) = \sup_{\|\f\|_L \le 1} E_{x \sim p}f(x) - E_{x \sim q}f(x)\]

The discriminator no longer classifies samples as real or generated. Instead, it learns a Lipschitz continuous function $f$ that is used to compute the Wasserstein distance. The challenge is enforcing the Lipschitz constraint $\|f\|_L \le 1$ in the loss function:

\[\bigg\|\frac{\part f(\v x, \v \phi)}{\part \v x} \bigg\| \le 1\]

To approximately achieve this, we can either clip the discriminator weights or add a gradient penalty: a regularization term that grows as the gradient norm deviates from one [1].
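
As an illustration, here is a minimal PyTorch sketch of the gradient-penalty term; the critic `D`, the input shapes, and the weight are assumptions, not from the text.

```python
import torch

def gradient_penalty(D, x_real, x_fake, weight=10.0):
    eps = torch.rand(x_real.size(0), 1)                  # per-sample mixing coefficient
    x_hat = (eps * x_real + (1 - eps) * x_fake).detach().requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return weight * ((grad.norm(2, dim=1) - 1) ** 2).mean()
```

The critic loss then adds this penalty to the difference between the critic’s mean outputs on generated and real samples.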

Other tricks

Other tricks are required to help GANs produce high-quality images. In progressive growing, the generator starts by learning very small 4x4 images. When training at this low resolution finishes, subsequent layers are added to the generator (and to the discriminator) to process 8x8 images, and so on. The higher-resolution layers gradually “fade in” over time; initially, the higher-resolution image is an upsampled version of the previous result, passed via a residual connection, and the new layers gradually take over.
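
A hypothetical sketch of the fade-in blending at the 4x4 to 8x8 transition (channel counts, layer names, and the fixed `alpha` are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

to_rgb_old = nn.Conv2d(64, 3, 1)                     # maps old 4x4 features to an image
new_block = nn.Sequential(nn.Upsample(scale_factor=2),
                          nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
to_rgb_new = nn.Conv2d(64, 3, 1)                     # maps new 8x8 features to an image

def fade_in(features_4x4, alpha):
    low = F.interpolate(to_rgb_old(features_4x4), scale_factor=2)   # residual path: upsampled old output
    high = to_rgb_new(new_block(features_4x4))                      # new higher-resolution layers
    return alpha * high + (1 - alpha) * low                         # alpha grows from 0 to 1

image_8x8 = fade_in(torch.randn(1, 64, 4, 4), alpha=0.3)            # -> (1, 3, 8, 8)
```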

Mini-batch discrimination increases coverage. When mode collapse occurs, the generated samples are close to each other, so the discriminator could easily tell whether a whole batch contains real or generated samples. To correlate samples within a batch, we transform batch statistics (the authors use the L1 norm across samples) into feature maps that are appended to the discriminator inputs. As a result, the discriminator effectively classifies a batch, not independent samples.
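
As a simplified illustration (the later “minibatch standard deviation” variant rather than the original L1-based formulation), one can append a single batch-diversity statistic to the discriminator features; shapes and names are assumptions:

```python
import torch

def append_batch_stat(features):                     # features: (batch, num_features)
    std = features.std(dim=0).mean()                 # one scalar summarizing batch diversity
    stat = std * torch.ones(features.size(0), 1)     # broadcast it to every sample
    return torch.cat([features, stat], dim=1)        # (batch, num_features + 1)
```

A collapsed batch yields a small statistic, giving the discriminator an easy cue that the batch is generated.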

Another trick is truncation, where we only use latent variables $\v z$ that have high probability under the latent distribution. This reduces the diversity of samples but improves their quality.
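
A minimal sketch of one way to implement truncation, resampling out-of-range latent entries (the threshold value and names are illustrative assumptions):

```python
import torch

def truncated_latents(batch, dim, threshold=1.0):
    z = torch.randn(batch, dim)
    bad = z.abs() > threshold
    while bad.any():
        z[bad] = torch.randn(int(bad.sum()))         # resample entries outside the threshold
        bad = z.abs() > threshold
    return z
```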

Conditional generation

Conditional generation provides more control over the outputs we generate. In vanilla conditional GANs, the generator transforms latent variables $\v z$ into $\v x^\star$ given attributes $\v c$. The discriminator receives both the sample and $\v c$, and determines whether it sees a generated sample with the target attributes or a real example with its true attributes (which we must provide).

For GANs with an auxiliary classifier, the generator takes $\v z$ together with an additional attribute, such as an image class. With $k$ classes, the discriminator processes $\v x^\star$ and returns $k+1$ outputs. The first output is dedicated to the discrimination task, while the remaining outputs are used for class prediction with a cross-entropy loss.
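
A minimal sketch of such a discriminator head (layer sizes and names are illustrative assumptions): one real/generated logit plus $k$ class logits trained with cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, data_dim, batch = 10, 2, 64
disc = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1 + k))

x = torch.randn(batch, data_dim)                     # real or generated samples
labels = torch.randint(k, (batch,))                  # their (target) classes
out = disc(x)
real_fake_logit = out[:, :1]                         # first output: discrimination task
class_logits = out[:, 1:]                            # remaining k outputs: class prediction
loss_cls = F.cross_entropy(class_logits, labels)     # auxiliary classification loss
```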

The InfoGAN learns attributes without supervision [1]. It splits the latent variables into noise $\v z$ and undefined random attributes $\v c$, the latent codes. The idea is to maximize the mutual information between $\v c$ and $\v x = G(\v z, \v c)$: when the mutual information is high, the latent codes describe a generated sample well. The loss function is: \(\gray \min_{G} \max_{D} V_{\text{New}} (D, G) = V (D, G) \black - \lambda I(\v c; \v x)\)

The mutual information can be defined as the difference of two entropy terms; we minimize the conditional entropy of $\v c$ given $\v x$ and omit the marginal entropy of $\v c$ for simplicity (but for common base distributions, a simple analytical form exists and could be added to the loss). Defining $\v x = G(\v z, \v c)$,

\[\gray \begin{align*} I(\v c;\v x) &= -H(\v c | \v x) + H(\v c) = \black E_{\v z,\ \v c\ }\bigg[\log P(\v c | \v x) \bigg] \gray + H(c) \end{align*}\]

It is hard to maximize this directly, as it requires access to the posterior $P(\v c \mid \v x)$. However, we can obtain a lower bound by introducing an auxiliary distribution $Q(\v c \mid \v x)$; this is known as variational information maximization [1]:

\[\gray \begin{align*} E_{\v z,\ \v c\ }\bigg[\log \black P(\v c | \v x)\gray \bigg] &\ge E_{\v z,\ \v c\ }\bigg[\log \black Q(\v c | \v x)\gray \bigg] \end{align*}\]

The mutual information loss part becomes:

\[\begin{align*} L_I(G, D) &= E_{\v z, \v c}\bigg[ \log Q(\v c | \v x) \bigg] \gray \le I(\v c, \v x) \end{align*}\]

The auxiliary distribution $Q$ is a neural network; $D$ and $Q$ share all layers except the last one. To sum up, the discriminator processes $\v x^\star$ and returns a single output for the discrimination task plus the predicted attributes $\tilde{ \v {c}}$. For discrete latent codes a softmax output is used; for continuous codes, a normal distribution.
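
A minimal sketch of the mutual-information term for a discrete latent code, with $D$ and $Q$ sharing a body (all module names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, code_classes, data_dim, batch = 16, 10, 2, 64
G = nn.Sequential(nn.Linear(latent_dim + code_classes, 32), nn.ReLU(), nn.Linear(32, data_dim))
body = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU())   # shared layers of D and Q
d_head = nn.Linear(32, 1)                                  # real / generated logit
q_head = nn.Linear(32, code_classes)                       # logits of Q(c | x)

z = torch.randn(batch, latent_dim)
c = torch.randint(code_classes, (batch,))                  # sampled discrete latent code
x_star = G(torch.cat([z, F.one_hot(c, code_classes).float()], dim=1))

h = body(x_star)
real_fake_logit = d_head(h)                                # used in the usual GAN loss terms
loss_info = F.cross_entropy(q_head(h), c)                  # -E[log Q(c | x)], weighted by lambda
```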

Appendix

The information content of a discrete random variable $X$ is defined as $I(X) = -\log P(X)$; it is itself a random variable, and events with lower probabilities carry more information. Entropy, denoted $H(X)$, measures the expected information and is defined as $H(X) = E(-\log P(X))$. The conditional entropy of $Y$ given $X$ is defined as $H(Y \mid X) = E_{X}E_{Y \mid X}(-\log P( Y \mid X))$.


The mutual information between $X$ and $Y$ measures the “amount of information” learned from knowledge of random variable $Y$ about the other random variable $X$. The mutual information can be expressed as the difference of two entropy terms:

\[I(X; Y) = H(X) - H(X|Y)\]

If $X$ and $Y$ are independent, then $I(X; Y) = 0$, because knowing one variable reveals nothing about the other. We reach maximal mutual information if $X$ and $Y$ are related by a deterministic invertible function.

It can also be defined in terms of PDFs for continuous distributions (or PMFs for discrete distributions): \(I(X; Y) = \iint_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\ dx dy\)
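
A small numeric check of this definition for a toy discrete joint distribution (the values are chosen arbitrarily for illustration):

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])                  # joint P(X, Y)
px = pxy.sum(axis=1, keepdims=True)           # marginal P(X)
py = pxy.sum(axis=0, keepdims=True)           # marginal P(Y)
mi = np.sum(pxy * np.log(pxy / (px * py)))    # zero iff X and Y are independent
print(mi)
```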


The Jensen-Shannon divergence is symmetric by design: it is the average of the KL divergences from $p(x)$ and $q(x)$ to the mixture $m(x)$ of the two distributions.

\[\begin{align*} D_{JS}\bigg[p(x)\ \|\ q(x) \bigg] = \frac 1 2 D_{KL}\bigg[p(x)\ \|\ m(x) \bigg] + \frac 1 2 D_{KL}\bigg[q(x)\ \|\ m(x) \bigg] &&\gray m(x) = \frac {p(x) + q(x)}{2} \end{align*}\]
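
A small numeric check of the symmetry, computing the two KL terms directly (illustrative values):

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])
m = (p + q) / 2
print(0.5 * kl(p, m) + 0.5 * kl(q, m))        # same value if p and q are swapped
```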

When deriving the InfoGAN loss function, the authors introduce a partitioning of $f(\v x, \v y)$ along $\v x$ using an auxiliary random variable $\v x'$. It’s an interesting approach, though somewhat convoluted:

\[\gray \begin{align*} E_{x,\ y\ }[f(x, y)] &= \iint_{x, y} p(x, y) f(x, y) \bigg(\int_{x'} p(x'|y) dx' \bigg) dx dy\\ &= \iint_{x, y} p(x, y) \bigg(\int_{x'} p(x'|y) f(x', y) dx' \bigg) dx dy\\ &= E_{x,\ y,\ x' \sim X | y\ }\big[f(x', y)\big] \end{align*}\]