Cheatsheet: Normalizing flows, change of variables for distributions

1D: for Z with density p_Z, and X = f(Z) with f differentiable and strictly monotonic:

p_X(x) = p_Z(f^{-1}(x)) / |df/dz| with z = f^{-1}(x)

Multidimensional: for f : R^d → R^d invertible and differentiable:

p_X(x) = p_Z(f^{-1}(x)) / |det(J_f(z))| with z = f^{-1}(x)

J_f is the Jacobian matrix (d × d partial derivatives); det(J_f) is the volume scaling factor from Track 4. The absolute value keeps the density non-negative.

Intuition: the transformation warps z-space into x-space; the determinant tells you how much it stretches volume; dividing by it conserves total probability.

1D (uniform → uniform). Z ~ Uniform(0, 1), f(z) = 3z + 1. Then f^{-1}(x) = (x-1)/3, df/dz = 3, and:

p_X(x) = 1 / 3 on x in [1, 4] (integral = 3 · 1/3 = 1)
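The 1D example can be checked numerically. The sketch below is illustrative only: the helper names (`uniform01_pdf`, `p_x`, `midpoint_integral`) are made up for this check and are not part of the lesson. It evaluates the 1D formula for f(z) = 3z + 1 and confirms that the resulting density integrates to roughly 1.

```python
def uniform01_pdf(z):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 <= z <= 1.0 else 0.0

def f_inv(x):
    """Inverse of f(z) = 3z + 1."""
    return (x - 1.0) / 3.0

def df_dz(z):
    """Derivative of f(z) = 3z + 1 (constant in z)."""
    return 3.0

def p_x(x):
    """Change of variables: p_X(x) = p_Z(z) / |df/dz| with z = f^{-1}(x)."""
    z = f_inv(x)
    return uniform01_pdf(z) / abs(df_dz(z))

def midpoint_integral(g, a, b, n=3000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

density_at_2_5 = p_x(2.5)                     # inside the support [1, 4]
density_outside = p_x(0.5)                    # outside the support
total_mass = midpoint_integral(p_x, 0.0, 5.0)

print(density_at_2_5, density_outside, total_mass)
```

The integration range [0, 5] is deliberately wider than the support [1, 4], so the check also covers the region where the density is zero.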

2D (uniform square → rectangle). Z ~ Uniform([0,1]^2), f(z) = A z with A = [[2,0],[0,1]], det A = 2:

p_X(x) = 1 / 2 on x in [0, 2] × [0, 1] (integral = 2 · 1/2 = 1)

The unit square doubled in area, so density halved.
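A small sketch of the 2D example, assuming a linear map x = A z so that the Jacobian is A everywhere. The helpers `det2`, `solve2`, and `pushforward_pdf` are hypothetical names for this illustration. The second matrix flips orientation (negative determinant) to show why the absolute value is needed.

```python
def det2(A):
    """Determinant of a 2x2 matrix given as nested lists."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def solve2(A, x):
    """Return z with A z = x for an invertible 2x2 matrix A."""
    d = det2(A)
    return ((A[1][1] * x[0] - A[0][1] * x[1]) / d,
            (-A[1][0] * x[0] + A[0][0] * x[1]) / d)

def unit_square_pdf(z):
    """Density of Uniform([0, 1]^2)."""
    return 1.0 if all(0.0 <= c <= 1.0 for c in z) else 0.0

def pushforward_pdf(A, x):
    """p_X(x) = p_Z(A^{-1} x) / |det A| for the linear flow x = A z."""
    z = solve2(A, x)
    return unit_square_pdf(z) / abs(det2(A))

A = [[2.0, 0.0], [0.0, 1.0]]           # stretches the first axis, det = 2
A_flipped = [[-2.0, 0.0], [0.0, 1.0]]  # same stretch plus a reflection, det = -2

inside = pushforward_pdf(A, (1.0, 0.5))            # point inside [0, 2] x [0, 1]
outside = pushforward_pdf(A, (3.0, 0.5))           # point outside the rectangle
flipped = pushforward_pdf(A_flipped, (-1.0, 0.5))  # inside [-2, 0] x [0, 1]

print(inside, outside, flipped)
```

Without `abs`, the flipped map would report a density of -1/2, which is not a valid density.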

Training objective (NLL = forward-KL minimization, same as L3)

Take log of the change-of-variables formula:

log p_X(x) = log p_Z(f^{-1}(x)) - log |det(J_f(z))|
NLL(x) = -log p_Z(f^{-1}(x)) + log |det(J_f(z))|

For each training example: invert it to get z = f^{-1}(x), evaluate log p_Z(z) under the simple base, compute log |det J_f(z)| at that point, and combine them as NLL(x) = -log p_Z(z) + log |det J_f(z)|. Average over the batch and backpropagate.
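As a minimal sketch of the per-example computation, assume the simplest possible flow: a 1D affine map x = scale · z + shift with a standard normal base. The function names and the toy `data` list are invented for illustration, and no optimizer is shown; the sketch only evaluates the loss that training would minimize.

```python
import math

def standard_normal_logpdf(z):
    """log p_Z(z) for a standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def affine_flow_nll(x, scale, shift):
    """NLL(x) = -log p_Z(f^{-1}(x)) + log |df/dz| for f(z) = scale * z + shift."""
    z = (x - shift) / scale              # invert the flow
    log_det = math.log(abs(scale))       # log |df/dz|, constant for an affine map
    return -standard_normal_logpdf(z) + log_det

def mean_nll(data, scale, shift):
    """Average NLL over a dataset: the quantity gradient descent would minimize."""
    return sum(affine_flow_nll(x, scale, shift) for x in data) / len(data)

data = [1.8, 2.1, 2.4, 1.9, 2.3]         # toy data, made up for this example

poor_fit = mean_nll(data, scale=1.0, shift=0.0)
good_fit = mean_nll(data, scale=0.25, shift=2.1)

print(poor_fit, good_fit)
```

For this affine flow the model density is a Gaussian with mean `shift` and standard deviation `|scale|`, so parameters close to the data mean and spread give the lower loss.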

z ~ p_Z (one RNG call, e.g. multivariate standard Gaussian)
x = f(z) (one forward pass through the flow)

Sampling therefore takes a single forward pass, versus the n sequential forward passes an autoregressive model needs for an n-piece output. This parallel sampling is the flow paradigm’s main advantage.
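The two sampling steps can be sketched as follows, assuming a toy 1D affine flow. The name `flow_forward` and its `scale` and `shift` values are placeholders, not a trained model. Each sample costs one base draw and one forward evaluation.

```python
import math
import random

def flow_forward(z, scale=2.0, shift=1.0):
    """Forward pass x = f(z) of a toy affine flow (parameters are placeholders)."""
    return scale * z + shift

rng = random.Random(0)                   # fixed seed so the run is repeatable

# Step 1: z ~ p_Z (standard normal). Step 2: x = f(z). No sequential loop per sample.
samples = [flow_forward(rng.gauss(0.0, 1.0)) for _ in range(20000)]

sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in samples) / len(samples))

print(sample_mean, sample_std)           # should land near shift and |scale|
```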

f = f_K ∘ f_{K-1} ∘ ... ∘ f_1
log |det J_f(z)| = sum_{k=1..K} log |det J_{f_k}(z_{k-1})|, with z_0 = z and z_k = f_k(z_{k-1}) (chain rule for Jacobians, plus det(AB) = det(A) · det(B))

Each new layer adds one tractable log-determinant term; flows get flexible by stacking many simple invertible blocks.
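A sketch of the composition rule, assuming 1D layers that each return their output together with their own log |df_k/dz|. The layer choices (`affine`, `cubic_residual`) are illustrative, not a standard architecture. The accumulated log-determinant is compared against a finite-difference derivative of the whole composed map.

```python
import math

def affine(scale, shift):
    """Build a layer z -> scale * z + shift that also reports log |scale|."""
    def layer(z):
        return scale * z + shift, math.log(abs(scale))
    return layer

def cubic_residual(z):
    """z + z^3 is strictly increasing; its derivative is 1 + 3 z^2."""
    return z + z ** 3, math.log(1.0 + 3.0 * z * z)

def flow_forward(layers, z):
    """Apply f_1, ..., f_K in order and sum the per-layer log-determinants."""
    total_log_det = 0.0
    for layer in layers:
        z, log_det = layer(z)            # log_det is evaluated at this layer's input
        total_log_det += log_det
    return z, total_log_det

layers = [affine(2.0, 1.0), cubic_residual, affine(-0.5, 0.0)]
z0 = 0.3
x, log_det_sum = flow_forward(layers, z0)

# Independent check: central finite difference of the composed function.
eps = 1e-6
x_plus, _ = flow_forward(layers, z0 + eps)
x_minus, _ = flow_forward(layers, z0 - eps)
log_det_numeric = math.log(abs((x_plus - x_minus) / (2.0 * eps)))

print(log_det_sum, log_det_numeric)
```

The last layer has a negative slope, so the composed derivative is negative; the absolute value inside each log term is what keeps the sum well defined.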

| Requirement | What it means | Standard solution |
| --- | --- | --- |
| Invertibility | f must be a bijection (one-to-one, onto) | Coupling layers (RealNVP, NICE), autoregressive flows (MAF, IAF) |
| Tractable Jacobian | det(J_f) must be cheap to compute | Triangular Jacobian → det = product of diagonals (O(d) instead of O(d^3)) |

Without these constraints, the math is right but the training is infeasible. The architectural cleverness of flows is keeping the Jacobian triangular while making the transformation expressive.
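To make the triangular-Jacobian idea concrete, here is a hedged 2D sketch in the spirit of an affine coupling layer. The functions `s` and `t` stand in for small neural networks and are arbitrary choices for this example; real RealNVP or NICE layers differ in detail. The first coordinate passes through unchanged, so the Jacobian is lower triangular and its log-determinant is simply s(z1).

```python
import math

def s(u):
    """Log-scale as a function of the untouched coordinate (stand-in for a network)."""
    return math.tanh(u)

def t(u):
    """Shift as a function of the untouched coordinate (stand-in for a network)."""
    return math.sin(u)

def coupling_forward(z):
    """x1 = z1, x2 = z2 * exp(s(z1)) + t(z1); returns x and log |det J|."""
    z1, z2 = z
    x = (z1, z2 * math.exp(s(z1)) + t(z1))
    return x, s(z1)                      # diagonal of J is (1, exp(s(z1)))

def coupling_inverse(x):
    """Exact inverse: s and t are only ever evaluated, never inverted."""
    x1, x2 = x
    return (x1, (x2 - t(x1)) * math.exp(-s(x1)))

def numerical_log_abs_det(z, eps=1e-6):
    """Finite-difference log |det J| of coupling_forward, for checking only."""
    cols = []
    for i in range(2):
        zp, zm = list(z), list(z)
        zp[i] += eps
        zm[i] -= eps
        xp, _ = coupling_forward(tuple(zp))
        xm, _ = coupling_forward(tuple(zm))
        cols.append([(xp[k] - xm[k]) / (2.0 * eps) for k in range(2)])
    # cols[i][k] holds d x_k / d z_i
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z = (0.7, -1.2)
x, log_det = coupling_forward(z)
z_back = coupling_inverse(x)
log_det_check = numerical_log_abs_det(z)

print(x, log_det, z_back, log_det_check)
```

Because `s` and `t` are never inverted, they can be arbitrarily complicated functions; invertibility comes from the layer's structure rather than from any constraint on them.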

| Property | Autoregressive | Flow | VAE | GAN | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Exact log p(x) | Yes | Yes | Lower bound (ELBO) | No | Indirect |
| Parallel sampling | No (sequential) | Yes (one pass) | Yes | Yes | No (multi-step) |
| Architectural constraints | Causality (mask) | Invertibility + tractable Jacobian | Encoder + decoder | Two-net game | Forward + reverse process |
| Where used in practice | LLMs, audio | Density estimation, scientific applications | Older image gen, latent encoders | Some image gen, faces | Modern image / video / audio |

The Flow column is what this lesson adds.

  • Density estimation (anomaly detection, scientific likelihood, statistical inverse problems): flows are often the right paradigm precisely because the others cannot give exact p_model(x) cheaply.
  • The change-of-variables formula recurs. It appears in VAEs (reparameterization), diffusion (ODE view, lesson 14), and beyond. This formula is more foundational than the paradigm built on it.
  • Drop the | | on the determinant. No. The Jacobian determinant can be negative (orientation flip); without the absolute value, you can get a negative “probability.”
  • Confuse the two formula forms. p_X(x) = p_Z(z) / |det J_f(z)| and p_X(x) = p_Z(z) · |det J_{f^{-1}}(x)| are the same identity, since det J_{f^{-1}}(x) = 1 / det J_f(z) at z = f^{-1}(x). Pick whichever is computationally cheaper.
  • Choose a fancy base distribution. No need; a standard Gaussian usually works. Expressive power lives in the transformation.
  • Soft-enforce invertibility. No. Like causality in autoregressive models, invertibility must be architectural; otherwise the formula does not apply.
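The equivalence of the two formula forms mentioned in the list above can be checked on a simple case. The sketch assumes f(z) = exp(z) with a standard normal base, so X is log-normal; the function names are invented for this illustration. One version divides by the forward derivative at z, the other multiplies by the inverse derivative at x.

```python
import math

def standard_normal_pdf(z):
    """Base density p_Z(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def density_via_forward_jacobian(x):
    """p_X(x) = p_Z(z) / |df/dz (z)| with f(z) = exp(z), z = log(x)."""
    z = math.log(x)
    df_dz = math.exp(z)
    return standard_normal_pdf(z) / abs(df_dz)

def density_via_inverse_jacobian(x):
    """p_X(x) = p_Z(z) * |d f^{-1}/dx (x)| with f^{-1}(x) = log(x)."""
    z = math.log(x)
    dfinv_dx = 1.0 / x
    return standard_normal_pdf(z) * abs(dfinv_dx)

points = [0.5, 1.0, 3.0]
forward_form = [density_via_forward_jacobian(x) for x in points]
inverse_form = [density_via_inverse_jacobian(x) for x in points]

print(forward_form)
print(inverse_form)
```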

A normalizing flow parameterizes p_model(x) exactly through an invertible transformation from a simple base, with the Jacobian determinant rescaling density to conserve probability; training is the same forward-KL = NLL minimization, sampling is one forward pass, and the price is invertibility plus a tractable Jacobian.