GAN Architecture
Explore Generative Adversarial Networks — the adversarial game between generator and discriminator, the evolution from vanilla GANs to StyleGAN, and why diffusion models eventually overtook them for image generation.
The Adversarial Game
Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, consist of two neural networks locked in competition. The Generator creates fake data from random noise, while the Discriminator tries to distinguish real data from generated data. Through this adversarial process, the generator learns to produce increasingly realistic outputs.
Think of it as a counterfeiter (generator) trying to create convincing fake currency while a detective (discriminator) tries to spot the forgeries. Over time, the counterfeiter becomes so skilled that the detective can no longer tell real from fake.
Generator vs Discriminator
The two networks have opposing objectives:
- Generator G(z): Takes a random noise vector z from a latent space (typically a Gaussian or uniform distribution) and maps it to a data sample. The generator wants to maximize the probability that the discriminator classifies its output as real.
- Discriminator D(x): Takes a data sample x (either real or generated) and outputs the probability that it is real. The discriminator wants to correctly classify real samples as real and fake samples as fake.
Training alternates between updating the discriminator on a batch of real and fake samples, then updating the generator based on how well it fooled the discriminator.
The Minimax Objective
The GAN training process is formalized as a minimax game with the following objective function:
```
# GAN Minimax Objective
min_G max_D V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]

# Where:
#   E[log D(x)]         - Expected log probability that D correctly
#                         identifies real data as real
#   E[log(1 - D(G(z)))] - Expected log probability that D correctly
#                         identifies fake data as fake
#
# D wants to MAXIMIZE this (be a good detective)
# G wants to MINIMIZE this (fool the detective)
#
# At Nash equilibrium (ideal convergence):
#   D(x) = 0.5 for all x (can't distinguish real from fake)
#   G produces samples from the true data distribution
```
Training Instability
GANs are notoriously difficult to train. The adversarial training process is inherently unstable because you are simultaneously optimizing two competing objectives. Two major failure modes plague GAN training:
Mode Collapse
Mode collapse occurs when the generator learns to produce only a small subset of possible outputs. Instead of generating diverse faces, for example, the generator might produce variations of just one or two faces that consistently fool the discriminator. The generator finds a "safe" output and exploits it rather than learning the full data distribution.
Vanishing Gradients
When the discriminator becomes too good too quickly, it classifies all generated samples as fake with high confidence. The loss function flattens out, providing near-zero gradients to the generator. Without useful gradient signal, the generator cannot learn and training stalls completely.
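One standard mitigation, proposed in the original GAN paper, is the non-saturating generator loss: instead of minimizing log(1 - D(G(z))), the generator maximizes log D(G(z)). The difference is easy to see with autograd; in this minimal sketch, the logit value of -8 is an illustrative stand-in for a discriminator that confidently rejects a fake:

```python
import torch

# When D confidently rejects a fake, its pre-sigmoid logit is very
# negative, so D(G(z)) = sigmoid(logit) is near 0.
logit = torch.tensor([-8.0], requires_grad=True)
loss_sat = torch.log(1 - torch.sigmoid(logit))  # original minimax G loss
loss_sat.backward()
grad_sat = logit.grad.item()  # ~ -sigmoid(-8): nearly zero (vanishing)

logit2 = torch.tensor([-8.0], requires_grad=True)
loss_ns = -torch.log(torch.sigmoid(logit2))     # non-saturating G loss
loss_ns.backward()
grad_ns = logit2.grad.item()  # ~ -(1 - sigmoid(-8)): close to -1
```

The saturating loss yields a gradient of roughly -0.0003 here, while the non-saturating form keeps it near -1, which is why the trick restores a usable learning signal exactly when the discriminator is winning.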
DCGAN: Deep Convolutional GAN
DCGAN (Radford et al., 2015) established architectural guidelines that made GAN training more stable and became the foundation for subsequent GAN architectures:
- Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator)
- Use batch normalization in both generator and discriminator
- Remove fully connected hidden layers for deeper architectures
- Use ReLU activation in the generator (except output layer which uses Tanh)
- Use LeakyReLU activation in the discriminator
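These guidelines translate directly into code. Below is a minimal sketch of a DCGAN-style generator for 64x64 RGB images; the layer widths and latent size are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# DCGAN-style generator: transposed convolutions for upsampling,
# batch norm and ReLU in hidden layers, Tanh output in [-1, 1].
class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            # z (latent_dim x 1 x 1) -> 4x4
            nn.ConvTranspose2d(latent_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),
            # 32x32 -> 64x64; Tanh keeps pixels in [-1, 1]
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

z = torch.randn(2, 100, 1, 1)
img = DCGANGenerator()(z)
print(img.shape)  # torch.Size([2, 3, 64, 64])
```

Note how there are no pooling or fully connected hidden layers: each doubling of resolution is done by a strided transposed convolution, exactly as the guidelines prescribe.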
Wasserstein GAN (WGAN)
WGAN (Arjovsky et al., 2017) addressed the vanishing gradient problem by replacing the original GAN loss with the Wasserstein distance (Earth Mover's Distance). Instead of the discriminator outputting a probability, the critic (renamed from discriminator) outputs an unbounded score.
- Wasserstein distance provides smooth, meaningful gradients even when the distributions are far apart, unlike the Jensen-Shannon divergence used in vanilla GANs.
- Weight clipping enforces the Lipschitz constraint required by the Wasserstein distance, though WGAN-GP (gradient penalty) later replaced this with a more principled approach.
- The critic can be trained to optimality without vanishing gradients, making training more stable and providing a meaningful loss metric that correlates with sample quality.
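A single critic update can be sketched in a few lines. The network size, data, and clip value below are illustrative stand-ins; the optimizer is RMSprop, as in the paper:

```python
import torch
import torch.nn as nn

# WGAN critic: no sigmoid on the output, so scores are unbounded.
critic = nn.Sequential(nn.Linear(2, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
clip_value = 0.01

real = torch.randn(32, 2) + 2.0  # stand-in for a batch of real samples
fake = torch.randn(32, 2)        # stand-in for (detached) generator output

# The critic maximizes E[f(real)] - E[f(fake)], so minimize the negation
loss = -(critic(real).mean() - critic(fake).mean())
opt.zero_grad()
loss.backward()
opt.step()

# Weight clipping: force every parameter into [-clip_value, clip_value]
# to (roughly) enforce the Lipschitz constraint
for p in critic.parameters():
    p.data.clamp_(-clip_value, clip_value)
```

WGAN-GP replaces the clipping loop with a gradient penalty term added to the loss, which avoids the capacity problems that hard clipping introduces.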
Progressive GAN
Progressive GAN (Karras et al., 2017) introduced the idea of growing both the generator and discriminator progressively, starting from low-resolution images (4x4) and gradually adding layers to increase resolution (up to 1024x1024). This approach:
- Stabilizes training by learning large-scale structure first, then refining details
- Reduces total training time significantly
- Enabled the first photorealistic face generation at high resolution
- Used smooth fade-in of new layers to prevent sudden shocks to the training process
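The fade-in is just a linear blend between the new high-resolution block and an upsampled copy of the previous stage's output, with a coefficient alpha that ramps from 0 to 1 over training. A minimal sketch (tensor sizes and the nearest-neighbor upsampling mode are illustrative):

```python
import torch
import torch.nn.functional as F

def fade_in(low_res, high_res, alpha):
    """Blend the old low-res path with the newly added high-res block."""
    upsampled = F.interpolate(low_res, scale_factor=2, mode="nearest")
    return alpha * high_res + (1 - alpha) * upsampled

low = torch.randn(1, 3, 8, 8)    # output of the existing 8x8 stage
high = torch.randn(1, 3, 16, 16) # output of the newly added 16x16 block

out = fade_in(low, high, alpha=0.3)
print(out.shape)  # torch.Size([1, 3, 16, 16])
```

At alpha = 0 the network behaves exactly as before the new layers were added; at alpha = 1 the old path is gone, so the transition never shocks the training process.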
StyleGAN (1, 2, and 3)
The StyleGAN family (Karras et al., NVIDIA) represents the pinnacle of GAN-based image generation:
- StyleGAN (2018): Introduced the mapping network (8-layer MLP transforming z to w space), adaptive instance normalization (AdaIN) for style injection at each resolution, and stochastic noise inputs for fine details. Enabled unprecedented control over generated face attributes like age, pose, and hair style.
- StyleGAN2 (2020): Fixed characteristic "water droplet" artifacts by replacing AdaIN with weight demodulation. Improved perceptual path length and removed progressive growing in favor of skip connections and residual architecture.
- StyleGAN3 (2021): Addressed the "texture sticking" problem where fine details stuck to pixel coordinates rather than moving naturally with objects. Introduced continuous signal interpretation and equivariance, enabling smooth animations and video generation.
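The AdaIN operation at the heart of the original StyleGAN is compact: normalize each feature map per sample, then rescale and shift it with style parameters. In StyleGAN the scale and bias come from a learned affine transform of the w vector; the random tensors here are stand-ins for that:

```python
import torch

def adain(x, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalization for style injection.

    x: (batch, channels, H, W); style tensors: (batch, channels, 1, 1).
    """
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True)
    normalized = (x - mean) / (std + eps)
    return style_scale * normalized + style_bias

x = torch.randn(4, 8, 16, 16)
scale = torch.randn(4, 8, 1, 1)  # stand-in for the learned style scale
bias = torch.randn(4, 8, 1, 1)   # stand-in for the learned style bias
out = adain(x, scale, bias)
print(out.shape)  # torch.Size([4, 8, 16, 16])
```

Because normalization wipes out the incoming statistics at every resolution before the style is applied, each injection controls only that scale of detail, which is what makes coarse/fine style mixing possible. StyleGAN2's weight demodulation folds an equivalent rescaling into the convolution weights instead, removing the per-feature-map normalization that caused the droplet artifacts.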
Conditional GANs
Standard GANs generate random samples with no control over the output. Conditional GANs (cGANs) add conditioning information to guide generation:
- Class-conditional GAN: Feed class labels to both generator and discriminator. Generate specific categories like "cat" or "dog" on demand.
- pix2pix (Isola et al., 2017): Paired image-to-image translation. Maps input images to output images using a U-Net generator and PatchGAN discriminator. Applications include edges-to-photos, segmentation maps to street scenes, and day-to-night conversion.
- CycleGAN (Zhu et al., 2017): Unpaired image-to-image translation using cycle consistency loss. Two generators translate between domains A and B, with the constraint that translating A to B and back to A should recover the original image. Famous for horse-to-zebra and photo-to-painting transformations.
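The cycle consistency constraint is simply an L1 reconstruction penalty applied in both directions. In this sketch the two "generators" are toy linear layers standing in for CycleGAN's real convolutional networks:

```python
import torch
import torch.nn as nn

G_ab = nn.Linear(16, 16)  # domain A -> domain B (stand-in generator)
G_ba = nn.Linear(16, 16)  # domain B -> domain A (stand-in generator)
l1 = nn.L1Loss()

real_a = torch.randn(8, 16)  # stand-in batch from domain A
real_b = torch.randn(8, 16)  # stand-in batch from domain B

# Translating A -> B -> A (and B -> A -> B) should recover the input
cycle_loss = l1(G_ba(G_ab(real_a)), real_a) + l1(G_ab(G_ba(real_b)), real_b)
```

In the full model this term is added, with a weighting coefficient, to the adversarial losses of both generators; it is what lets training succeed without any paired examples.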
Comparison of GAN Variants
| GAN Variant | Year | Key Innovation | Training Stability | Best For |
|---|---|---|---|---|
| Vanilla GAN | 2014 | Adversarial training concept | Poor | Proof of concept |
| DCGAN | 2015 | Convolutional architecture guidelines | Moderate | Image generation baseline |
| WGAN | 2017 | Wasserstein distance loss | Good | Stable training, meaningful loss |
| Progressive GAN | 2017 | Progressive resolution growing | Good | High-resolution images |
| pix2pix | 2017 | Paired image-to-image translation | Good | Paired domain transfer |
| CycleGAN | 2017 | Unpaired translation, cycle loss | Good | Unpaired domain transfer |
| StyleGAN | 2018 | Style-based generation, W space | Good | Controllable face generation |
| StyleGAN2 | 2020 | Weight demodulation, no artifacts | Very Good | State-of-the-art faces |
| StyleGAN3 | 2021 | Equivariance, alias-free | Very Good | Animation-ready generation |
GANs vs Diffusion Models: Why Diffusion Won
By 2022, diffusion models (DALL-E 2, Stable Diffusion, Imagen) largely replaced GANs as the dominant generative model for images. Here is why:
- Training stability: Diffusion models use a simple denoising objective with a well-defined loss function. No adversarial training means no mode collapse, no vanishing gradients, and no careful balancing of two networks.
- Mode coverage: Diffusion models naturally cover the full data distribution, while GANs tend to focus on high-density regions, missing rare modes.
- Diversity: Diffusion models generate more diverse outputs for the same prompt, while GANs often produce less variety.
- Text conditioning: Diffusion models integrate naturally with text encoders (CLIP), enabling powerful text-to-image generation. GANs struggled with free-form text conditioning at scale.
- Scalability: Diffusion models scale better to large datasets and model sizes, following predictable scaling laws.
However, GANs have one significant advantage: speed. A GAN generates an image in a single forward pass, while diffusion models require many iterative denoising steps (typically 20-50). This makes GANs preferable for real-time applications.
Where GANs Are Still Used
Despite the rise of diffusion models, GANs remain valuable in several domains:
- Super-resolution: ESRGAN and Real-ESRGAN use GAN-based training to upscale images with sharp, realistic details. The single-pass inference is ideal for real-time video upscaling.
- Data augmentation: GANs generate synthetic training data for domains with limited data, such as medical imaging (synthetic X-rays, MRIs) and rare defect detection in manufacturing.
- Video generation: GAN-based approaches remain competitive for real-time video synthesis, face reenactment, and talking head generation due to their speed.
- Image editing: GAN inversion techniques (projecting real images into the latent space) enable precise semantic editing of real photographs.
- Game asset generation: Real-time texture synthesis and style transfer for gaming applications where inference speed matters.
- Adversarial training for other models: The discriminator concept lives on in techniques like RLHF reward models and adversarial data augmentation for robustness.
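The GAN inversion idea mentioned above can be sketched as latent optimization: freeze the generator and optimize a latent code until its output matches a target image. Real pipelines invert a trained StyleGAN and add perceptual losses; the tiny generator and dimensions here are stand-ins:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy frozen "generator" standing in for a pretrained network
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))
for p in G.parameters():
    p.requires_grad_(False)  # the generator stays fixed

target = G(torch.randn(1, 8))  # pretend this is a real image to invert
z = torch.zeros(1, 8, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

initial_loss = ((G(z) - target) ** 2).mean().item()
for step in range(200):
    loss = ((G(z) - target) ** 2).mean()  # pixel reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

final_loss = ((G(z) - target) ** 2).mean().item()
# Reconstruction error drops sharply; the recovered z can then be
# edited semantically and re-decoded
```

Once a real photo has been projected into the latent space this way, moving the code along known directions (age, pose, lighting) and re-running the generator edits the photo itself.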
Code Example: Simple GAN in PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Hyperparameters
latent_dim = 64
hidden_dim = 256
image_dim = 28 * 28  # MNIST flattened
batch_size = 128
lr = 0.0002
epochs = 50

# Generator: maps noise z to image space
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(hidden_dim * 2),
            nn.Linear(hidden_dim * 2, image_dim),
            nn.Tanh()  # Output in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# Discriminator: classifies real vs fake
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim, hidden_dim * 2),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
G = Generator().to(device)
D = Discriminator().to(device)
criterion = nn.BCELoss()
opt_G = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
opt_D = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))

# Load MNIST, scaled to [-1, 1] to match the generator's Tanh output
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
dataset = datasets.MNIST("./data", download=True, transform=transform)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training loop: alternate discriminator and generator updates
for epoch in range(epochs):
    for real_imgs, _ in loader:
        real_imgs = real_imgs.view(-1, image_dim).to(device)
        bs = real_imgs.size(0)
        real_labels = torch.ones(bs, 1).to(device)
        fake_labels = torch.zeros(bs, 1).to(device)

        # Train Discriminator: push real -> 1, fake -> 0
        z = torch.randn(bs, latent_dim).to(device)
        fake_imgs = G(z).detach()  # detach: don't backprop into G here
        d_loss = criterion(D(real_imgs), real_labels) + \
                 criterion(D(fake_imgs), fake_labels)
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # Train Generator: try to make D output 1 for fresh fakes
        z = torch.randn(bs, latent_dim).to(device)
        fake_imgs = G(z)
        g_loss = criterion(D(fake_imgs), real_labels)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    print(f"Epoch {epoch+1}/{epochs} | "
          f"D Loss: {d_loss.item():.4f} | G Loss: {g_loss.item():.4f}")
```