Advanced

GAN Architecture

Explore Generative Adversarial Networks — the adversarial game between generator and discriminator, the evolution from vanilla GANs to StyleGAN, and why diffusion models eventually overtook them for image generation.

The Adversarial Game

Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, consist of two neural networks locked in competition. The Generator creates fake data from random noise, while the Discriminator tries to distinguish real data from generated data. Through this adversarial process, the generator learns to produce increasingly realistic outputs.

Think of it as a counterfeiter (generator) trying to create convincing fake currency while a detective (discriminator) tries to spot the forgeries. Over time, the counterfeiter becomes so skilled that the detective can no longer tell real from fake.

Generator vs Discriminator

The two networks have opposing objectives:

  • Generator G(z): Takes a random noise vector z from a latent space (typically Gaussian or uniform distribution) and maps it to a data sample. The generator wants to maximize the probability that the discriminator classifies its output as real.
  • Discriminator D(x): Takes a data sample x (either real or generated) and outputs a probability that it is real. The discriminator wants to correctly classify real samples as real and fake samples as fake.

Training alternates between updating the discriminator on a batch of real and fake samples, then updating the generator based on how well it fooled the discriminator.

The Minimax Objective

The GAN training process is formalized as a minimax game with the following objective function:

Mathematics
# GAN Minimax Objective
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

# Where:
# E_{x~p_data}[log D(x)]      - Expected log probability that D correctly
#                               identifies real data as real
# E_{z~p_z}[log(1 - D(G(z)))] - Expected log probability that D correctly
#                               identifies fake data as fake
# D wants to MAXIMIZE this (be a good detective)
# G wants to MINIMIZE this (fool the detective)

# At Nash equilibrium (ideal convergence):
# D(x) = 0.5 for all x (can't distinguish real from fake)
# G produces samples from the true data distribution
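The equilibrium value is easy to check numerically: when D outputs 0.5 everywhere, the objective evaluates to log 0.5 + log 0.5 = -2 log 2. A quick sanity check:

```python
import math

# Value of V(D, G) when the discriminator is maximally confused:
# D outputs 0.5 for every input, real or fake.
d_real = 0.5   # D(x) for real samples
d_fake = 0.5   # D(G(z)) for generated samples

v = math.log(d_real) + math.log(1 - d_fake)
print(v)                 # -1.3862943611198906
print(-2 * math.log(2))  # same value: -2 log 2
```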

Training Instability

GANs are notoriously difficult to train. The adversarial training process is inherently unstable because you are simultaneously optimizing two competing objectives. Two major failure modes plague GAN training:

Mode Collapse

Mode collapse occurs when the generator learns to produce only a small subset of possible outputs. Instead of generating diverse faces, for example, the generator might produce variations of just one or two faces that consistently fool the discriminator. The generator finds a "safe" output and exploits it rather than learning the full data distribution.
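One lightweight way to watch for mode collapse during training is to track the diversity of each generated batch, for example the mean pairwise distance between samples; a value trending toward zero is a warning sign. A minimal sketch (the batch_diversity helper is illustrative, not a standard API):

```python
import torch

def batch_diversity(samples: torch.Tensor) -> float:
    """Mean pairwise L2 distance across a batch (rows = flattened samples).
    A value near zero suggests the generator is collapsing to one output."""
    flat = samples.view(samples.size(0), -1)
    return torch.cdist(flat, flat).mean().item()

diverse = torch.randn(64, 784)           # healthy batch: varied samples
collapsed = torch.zeros(64, 784) + 0.5   # collapsed batch: identical samples
print(batch_diversity(diverse) > batch_diversity(collapsed))  # True
```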

Vanishing Gradients

When the discriminator becomes too good too quickly, it classifies all generated samples as fake with high confidence. The loss function flattens out, providing near-zero gradients to the generator. Without useful gradient signal, the generator cannot learn and training stalls completely.
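The common remedy, used in the original paper and in the PyTorch example later in this section, is the non-saturating generator loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))). A quick derivative check shows why it helps when the discriminator confidently rejects fakes:

```python
# Compare gradient magnitudes of the two generator losses w.r.t. d = D(G(z))
# when the discriminator confidently rejects the fake (d near 0).
#
#   saturating loss:      log(1 - d)  -> derivative: -1 / (1 - d)
#   non-saturating loss:  -log(d)     -> derivative: -1 / d

d = 0.01  # discriminator is 99% sure the sample is fake

grad_saturating = abs(-1 / (1 - d))  # ~1.01: almost no learning signal
grad_non_saturating = abs(-1 / d)    # 100.0: strong gradient to learn from

print(grad_saturating, grad_non_saturating)
```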

Training GANs is an art: Successful GAN training requires careful hyperparameter tuning, learning rate scheduling, architectural choices, and often domain-specific tricks. Many researchers spent years developing techniques to stabilize GAN training before diffusion models offered a simpler alternative.

DCGAN: Deep Convolutional GAN

DCGAN (Radford et al., 2015) established architectural guidelines that made GAN training more stable and became the foundation for subsequent GAN architectures:

  • Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator)
  • Use batch normalization in both generator and discriminator
  • Remove fully connected hidden layers for deeper architectures
  • Use ReLU activation in the generator (except output layer which uses Tanh)
  • Use LeakyReLU activation in the discriminator
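These guidelines can be sketched as a small DCGAN-style generator (the layer widths and the 32x32 output size are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# DCGAN-style generator: transposed convolutions upsample a noise vector
# to a 32x32 image; BatchNorm + ReLU everywhere except the Tanh output.
generator = nn.Sequential(
    # z: (N, 100, 1, 1) -> (N, 256, 4, 4)
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(),
    # -> (N, 128, 8, 8)
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    # -> (N, 64, 16, 16)
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    # -> (N, 3, 32, 32), values in [-1, 1]
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1, bias=False),
    nn.Tanh(),
)

z = torch.randn(8, 100, 1, 1)
print(generator(z).shape)  # torch.Size([8, 3, 32, 32])
```

Note there are no pooling or fully connected layers: all up/downsampling is done by the strided (transposed) convolutions themselves.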

Wasserstein GAN (WGAN)

WGAN (Arjovsky et al., 2017) addressed the vanishing gradient problem by replacing the original GAN loss with the Wasserstein distance (Earth Mover's Distance). Instead of the discriminator outputting a probability, the critic (renamed from discriminator) outputs an unbounded score.

  • Wasserstein distance provides smooth, meaningful gradients even when the distributions are far apart, unlike the Jensen-Shannon divergence used in vanilla GANs.
  • Weight clipping enforces the Lipschitz constraint required by the Wasserstein distance, though WGAN-GP (gradient penalty) later replaced this with a more principled approach.
  • The critic can be trained to optimality without vanishing gradients, making training more stable and providing a meaningful loss metric that correlates with sample quality.
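A single critic update under the original weight-clipping scheme might look like this sketch (the toy critic and the stand-in batches are illustrative; RMSprop and a clip value of 0.01 follow the paper):

```python
import torch
import torch.nn as nn

# One WGAN critic update (original weight-clipping variant).
# The critic outputs an unbounded score; its loss is the negated
# Wasserstein estimate: E[critic(fake)] - E[critic(real)].
critic = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
clip_value = 0.01

real = torch.randn(64, 784)  # stand-in for a real batch
fake = torch.randn(64, 784)  # stand-in for G(z).detach()

loss = critic(fake).mean() - critic(real).mean()
opt.zero_grad()
loss.backward()
opt.step()

# Enforce the Lipschitz constraint by clipping weights into [-c, c]
for p in critic.parameters():
    p.data.clamp_(-clip_value, clip_value)

print(max(p.abs().max().item() for p in critic.parameters()))  # <= 0.01
```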

Progressive GAN

Progressive GAN (Karras et al., 2017) introduced the idea of growing both the generator and discriminator progressively, starting from low-resolution images (4x4) and gradually adding layers to increase resolution (up to 1024x1024). This approach:

  • Stabilizes training by learning large-scale structure first, then refining details
  • Reduces total training time significantly
  • Enabled the first photorealistic face generation at high resolution
  • Used smooth fade-in of new layers to prevent sudden shocks to the training process
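The smooth fade-in amounts to a linear blend: while a new higher-resolution block is introduced, its output is mixed with an upsampled copy of the previous resolution, and the blend weight alpha ramps from 0 to 1 over training. A sketch (not the paper's code):

```python
import torch
import torch.nn.functional as F

def fade_in(low_res: torch.Tensor, new_layer_out: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend the old resolution path with the newly added layer.
    At alpha=0 the new layer is ignored; at alpha=1 it fully replaces
    the upsampled low-resolution path."""
    upsampled = F.interpolate(low_res, scale_factor=2, mode="nearest")
    return (1 - alpha) * upsampled + alpha * new_layer_out

low = torch.randn(4, 3, 8, 8)    # output at the previous resolution
new = torch.randn(4, 3, 16, 16)  # output of the newly added 16x16 block
print(fade_in(low, new, alpha=0.5).shape)  # torch.Size([4, 3, 16, 16])
```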

StyleGAN (1, 2, and 3)

The StyleGAN family (Karras et al., NVIDIA) represents the pinnacle of GAN-based image generation:

  • StyleGAN (2018): Introduced the mapping network (8-layer MLP transforming z to w space), adaptive instance normalization (AdaIN) for style injection at each resolution, and stochastic noise inputs for fine details. Enabled unprecedented control over generated face attributes like age, pose, and hair style.
  • StyleGAN2 (2020): Fixed characteristic "water droplet" artifacts by replacing AdaIN with weight demodulation. Improved perceptual path length and removed progressive growing in favor of skip connections and residual architecture.
  • StyleGAN3 (2021): Addressed the "texture sticking" problem where fine details stuck to pixel coordinates rather than moving naturally with objects. Introduced continuous signal interpretation and equivariance, enabling smooth animations and video generation.
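The AdaIN operation at the heart of the original StyleGAN is compact: normalize each feature map to zero mean and unit variance, then rescale and shift it with per-channel parameters derived from the style vector w. A minimal sketch (in the real model, the scale and bias come from a learned affine transform of w):

```python
import torch

def adain(content: torch.Tensor, style_scale: torch.Tensor,
          style_bias: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: per-channel normalize the content
    feature maps, then apply style-derived scale and bias.
    content: (N, C, H, W); style_scale / style_bias: (N, C, 1, 1)."""
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True)
    normalized = (content - mean) / (std + eps)
    return style_scale * normalized + style_bias

x = torch.randn(2, 64, 16, 16)
scale = torch.ones(2, 64, 1, 1) * 2.0  # would come from w in StyleGAN
bias = torch.zeros(2, 64, 1, 1)
out = adain(x, scale, bias)
print(out.shape)  # torch.Size([2, 64, 16, 16])
```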

Conditional GANs

Standard GANs generate random samples with no control over the output. Conditional GANs (cGANs) add conditioning information to guide generation:

  • Class-conditional GAN: Feed class labels to both generator and discriminator. Generate specific categories like "cat" or "dog" on demand.
  • pix2pix (Isola et al., 2017): Paired image-to-image translation. Maps input images to output images using a U-Net generator and PatchGAN discriminator. Applications include edges-to-photos, segmentation maps to street scenes, and day-to-night conversion.
  • CycleGAN (Zhu et al., 2017): Unpaired image-to-image translation using cycle consistency loss. Two generators translate between domains A and B, with the constraint that translating A to B and back to A should recover the original image. Famous for horse-to-zebra and photo-to-painting transformations.
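The class-conditional mechanism is essentially concatenation: embed the label and feed it alongside the noise vector (the discriminator receives the label alongside the image in the same way). A sketch of the generator side, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Class-conditional generator: the label embedding is concatenated
    with the noise vector, so each class steers generation."""
    def __init__(self, latent_dim=64, n_classes=10, embed_dim=16, image_dim=784):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, image_dim),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = torch.cat([z, self.label_embed(labels)], dim=1)
        return self.net(cond)

G = ConditionalGenerator()
z = torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))  # e.g. digit classes for MNIST
print(G(z, labels).shape)  # torch.Size([8, 784])
```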

Comparison of GAN Variants

GAN Variant      Year  Key Innovation                         Training Stability  Best For
Vanilla GAN      2014  Adversarial training concept           Poor                Proof of concept
DCGAN            2015  Convolutional architecture guidelines  Moderate            Image generation baseline
WGAN             2017  Wasserstein distance loss              Good                Stable training, meaningful loss
Progressive GAN  2017  Progressive resolution growing         Good                High-resolution images
pix2pix          2017  Paired image-to-image translation      Good                Paired domain transfer
CycleGAN         2017  Unpaired translation, cycle loss       Good                Unpaired domain transfer
StyleGAN         2018  Style-based generation, W space        Good                Controllable face generation
StyleGAN2        2020  Weight demodulation, no artifacts      Very Good           State-of-the-art faces
StyleGAN3        2021  Equivariance, alias-free               Very Good           Animation-ready generation

GANs vs Diffusion Models: Why Diffusion Won

By 2022, diffusion models (DALL-E 2, Stable Diffusion, Imagen) largely replaced GANs as the dominant generative model for images. Here is why:

  • Training stability: Diffusion models use a simple denoising objective with a well-defined loss function. No adversarial training means no mode collapse, no vanishing gradients, and no careful balancing of two networks.
  • Mode coverage: Diffusion models naturally cover the full data distribution, while GANs tend to focus on high-density regions, missing rare modes.
  • Diversity: Diffusion models generate more diverse outputs for the same prompt, while GANs often produce less variety.
  • Text conditioning: Diffusion models integrate naturally with text encoders (CLIP), enabling powerful text-to-image generation. GANs struggled with free-form text conditioning at scale.
  • Scalability: Diffusion models scale better to large datasets and model sizes, following predictable scaling laws.

However, GANs have one significant advantage: speed. A GAN generates an image in a single forward pass, while diffusion models require many iterative denoising steps (typically 20-50). This makes GANs preferable for real-time applications.

Where GANs Are Still Used

Despite the rise of diffusion models, GANs remain valuable in several domains:

  • Super-resolution: ESRGAN and Real-ESRGAN use GAN-based training to upscale images with sharp, realistic details. The single-pass inference is ideal for real-time video upscaling.
  • Data augmentation: GANs generate synthetic training data for domains with limited data, such as medical imaging (synthetic X-rays, MRIs) and rare defect detection in manufacturing.
  • Video generation: GAN-based approaches remain competitive for real-time video synthesis, face reenactment, and talking head generation due to their speed.
  • Image editing: GAN inversion techniques (projecting real images into the latent space) enable precise semantic editing of real photographs.
  • Game asset generation: Real-time texture synthesis and style transfer for gaming applications where inference speed matters.
  • Adversarial training for other models: The discriminator concept lives on in techniques like RLHF reward models and adversarial data augmentation for robustness.

When to choose GANs: If you need real-time generation (single forward pass), super-resolution, or are working in a domain where GAN architectures are well-established and diffusion models are overkill, GANs remain an excellent choice. For general-purpose image generation from text, diffusion models are the better default.

Code Example: Simple GAN in PyTorch

Python (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Hyperparameters
latent_dim = 64
hidden_dim = 256
image_dim = 28 * 28  # MNIST flattened
batch_size = 128
lr = 0.0002
epochs = 50

# Generator: maps noise z to image space
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(hidden_dim * 2),
            nn.Linear(hidden_dim * 2, image_dim),
            nn.Tanh()  # Output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# Discriminator: classifies real vs fake
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
criterion = nn.BCELoss()
opt_G = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_D = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))

# Load MNIST
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
dataset = datasets.MNIST("./data", download=True, transform=transform)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training loop
for epoch in range(epochs):
    for real_imgs, _ in loader:
        real_imgs = real_imgs.view(-1, image_dim).to(device)
        bs = real_imgs.size(0)
        real_labels = torch.ones(bs, 1).to(device)
        fake_labels = torch.zeros(bs, 1).to(device)

        # Train Discriminator
        z = torch.randn(bs, latent_dim).to(device)
        fake_imgs = G(z).detach()
        d_loss = criterion(D(real_imgs), real_labels) + \
                 criterion(D(fake_imgs), fake_labels)
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # Train Generator
        z = torch.randn(bs, latent_dim).to(device)
        fake_imgs = G(z)
        g_loss = criterion(D(fake_imgs), real_labels)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    print(f"Epoch {epoch+1}/{epochs} | D Loss: {d_loss:.4f} | G Loss: {g_loss:.4f}")
💡
Key training tips: Use label smoothing (0.9 instead of 1.0 for real labels), add small noise to discriminator inputs, use separate batches for real and fake samples, and monitor both losses. If D loss drops to zero, the discriminator is too strong and the generator cannot learn. Try reducing the discriminator learning rate or training the generator more frequently.
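Two of these tips, one-sided label smoothing and instance noise on the discriminator inputs, drop into the training loop above with only a few lines. A sketch using the same batch size and image dimensions:

```python
import torch

bs, image_dim, device = 128, 784, "cpu"  # matching the loop above

# One-sided label smoothing: real targets of 0.9 instead of 1.0
# keep the discriminator from becoming overconfident.
real_labels = torch.full((bs, 1), 0.9, device=device)
fake_labels = torch.zeros(bs, 1, device=device)

# Instance noise: small Gaussian noise added to the discriminator's
# inputs, typically annealed toward zero over training.
noise_std = 0.1
real_imgs = torch.randn(bs, image_dim)  # stand-in for a real batch
noisy_real = real_imgs + noise_std * torch.randn_like(real_imgs)

print(real_labels[0].item(), noisy_real.shape)
```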