Summary: Normalizing flows, change of variables for distributions
This lesson introduced the second paradigm of likelihood-based generative modeling, normalizing flows (autoregressive was the first; VAEs bring the third in Phase 2). It is the only one of the four paradigms in this track that gives all three of exact log p_model(x), one-pass sampling, and a flexible model. The price is an architectural constraint. The whole thing reduces to one line: a normalizing flow parameterizes the model distribution exactly through an invertible transformation from a simple base, where the Jacobian determinant rescales density to conserve probability; training is the same forward-KL = NLL minimization from L3, and the constraint is that every layer must be invertible with a tractable Jacobian. This is the scan-it-in-five-minutes version.
Core ideas
- A normalizing flow starts with a simple base distribution p_Z(z) (typically a multivariate standard Gaussian), pushes z through an invertible neural network f, and defines x = f(z) as the sample.
- The multidimensional change-of-variables formula gives p_model(x) exactly: p_X(x) = p_Z(f^{-1}(x)) / |det(J_f(z))|, where z = f^{-1}(x). The |det J_f| is the density rescaling factor, the same Track-4 determinant that scales volumes in space, now scaling density to keep total probability equal to 1. Where the transformation expands volume, density dilutes; where it contracts, density concentrates.
- Worked anchors. 1D: Z ~ Uniform(0,1), f(z) = 3z + 1 → p_X(x) = 1/3 on [1, 4], integral = 1. 2D: Z ~ Uniform([0,1]^2), f(z) = A z with A = [[2,0],[0,1]] and det A = 2 → p_X(x) = 1/2 on the image rectangle, integral = 1.
- Training is one line of NLL. log p_X(x) = log p_Z(f^{-1}(x)) - log |det J_f(z)|, so NLL(x) = -log p_X(x) = -log p_Z(z) + log |det J_f|. Same forward-KL = empirical NLL minimization from L3, this time computed by inverting x to z, evaluating the base density, and adding the log-Jacobian-determinant.
- Sampling is one forward pass. Draw z ~ p_Z, compute x = f(z). This is the parallel sampling the flow paradigm advertises, in contrast with autoregressive’s n forward passes for an n-piece output.
- Two architectural constraints. (1) Invertibility: f must be a bijection. (2) Tractable Jacobian: det(J_f) must be cheap, typically O(d) via a triangular Jacobian (whose determinant is the product of the diagonal entries). Standard solutions are coupling layers (NICE, RealNVP) and autoregressive layers (MAF, IAF). The Glow paper added invertible 1x1 convolutions as a standard channel-mixing primitive.
- Composition is cheap. For f = f_K ∘ ... ∘ f_1, the chain rule of determinants gives log |det J_f| = sum_k log |det J_{f_k}|, with each Jacobian evaluated at the input to its own layer. Stacking many simple invertible blocks gives a flexible flow at low extra cost; this is how flows get their expressive power despite the per-layer constraints.
- Cross-paradigm position. Flows uniquely combine exact log p_model(x), one-pass sampling, and a flexible model. The price (invertibility + tractable Jacobian) restricts the architectures usable as flow layers, which is why flows are not the dominant paradigm for the largest image generation problems (diffusion is) or for language (autoregressive is). They are the right paradigm for density estimation tasks where exact likelihood is the requirement: anomaly detection, scientific likelihood, particle physics, astrophysics, chemistry.
- The change-of-variables formula is foundational beyond flows. It appears in the VAE reparameterization trick (lesson 5-6), in the ODE-based view of diffusion (lesson 14), and beyond. The formula here is more foundational than the paradigm built on it.
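The sampling, exact-likelihood, and composition bullets above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable flow: `AffineFlow`, `sample`, `log_prob`, and `nll` are hypothetical names introduced here, each layer is a fixed invertible affine map x = A z + b (so its Jacobian is just A), and the base is assumed to be a standard Gaussian.

```python
import numpy as np

def std_normal_logpdf(z):
    """Log density of a standard Gaussian base, summed over the last axis."""
    d = z.shape[-1]
    return -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

class AffineFlow:
    """Hypothetical toy layer x = A z + b with A invertible (assumption)."""
    def __init__(self, A, b):
        self.A = np.asarray(A, dtype=float)
        self.b = np.asarray(b, dtype=float)
        # slogdet returns (sign, log|det A|); the Jacobian of this layer is A.
        _, self.log_abs_det = np.linalg.slogdet(self.A)

    def forward(self, z):
        return z @ self.A.T + self.b

    def inverse(self, x):
        # Solve A z = x - b for every row of x.
        return np.linalg.solve(self.A, (x - self.b).T).T

def sample(layers, n, d, rng):
    """Sampling is one forward pass: draw z ~ p_Z, then apply f_1, ..., f_K."""
    x = rng.standard_normal((n, d))
    for layer in layers:
        x = layer.forward(x)
    return x

def log_prob(layers, x):
    """log p_X(x) = log p_Z(f^{-1}(x)) - sum_k log|det J_{f_k}|."""
    z, log_det_total = x, 0.0
    for layer in reversed(layers):
        z = layer.inverse(z)
        log_det_total += layer.log_abs_det
    return std_normal_logpdf(z) - log_det_total

def nll(layers, x):
    """Empirical negative log-likelihood over a batch."""
    return -np.mean(log_prob(layers, x))

# Density rescaling factors from the worked anchors: 1/|det J_f|.
scale_1d = np.exp(-AffineFlow([[3.0]], [1.0]).log_abs_det)                  # about 1/3
scale_2d = np.exp(-AffineFlow([[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0]).log_abs_det)  # about 1/2

# A two-layer composition: sample in one pass, then score the samples exactly.
rng = np.random.default_rng(0)
layers = [AffineFlow([[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
          AffineFlow([[1.0, 0.5], [0.0, 1.0]], [1.0, -1.0])]
x = sample(layers, n=5, d=2, rng=rng)
print(nll(layers, x))
```

Because every layer here is affine, the composed model is just a Gaussian, so this sketch only shows the bookkeeping (invert, evaluate the base density, subtract the summed log-determinants). A real flow would replace `AffineFlow` with learned nonlinear invertible layers.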
What changes for you
Before this lesson, the link between Track 4’s determinant (volume scaling) and probabilistic modeling was probably abstract. Now it is concrete: the Jacobian determinant in the change-of-variables formula is the same Track-4 determinant, multiplying densities instead of areas. When you next see a normalizing flow architecture, you can read it on sight as “an invertible network whose Jacobian determinant we can compute cheaply, with training one NLL and sampling one forward pass.” When you see the change-of-variables formula in a VAE reparameterization or a diffusion ODE, you know where it comes from. Phase 1 of the track is now done; you have the first two of three likelihood-based paradigms (autoregressive and flow), trained on the same forward-KL = NLL objective from L3. Phase 2 opens with VAEs as the third likelihood-based paradigm: the encoder-decoder structure looks superficially like a flow but relaxes the invertibility constraint, paying for that flexibility with a likelihood that becomes a bound (the ELBO) instead of an exact computation.
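To make the "invertible network whose Jacobian determinant we can compute cheaply" reading concrete, here is a rough sketch of one affine coupling layer in the RealNVP style. The class name `AffineCoupling`, the split index `k`, and the small fixed random conditioner network are all assumptions for illustration; nothing here is trained. The point is only that the layer inverts in closed form and that its log-determinant is a sum of log-scales, which the helper `numerical_jacobian` (also hypothetical) can check against a finite-difference Jacobian.

```python
import numpy as np

class AffineCoupling:
    """Sketch of an affine coupling layer: the first k coordinates pass
    through unchanged and parameterize a scale and shift for the rest."""
    def __init__(self, d, k, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Fixed random conditioner weights (assumption: stand-in for a trained net).
        self.W1 = rng.standard_normal((k, hidden)) * 0.5
        self.W_s = rng.standard_normal((hidden, d - k)) * 0.5
        self.W_t = rng.standard_normal((hidden, d - k)) * 0.5

    def _scale_shift(self, a):
        h = np.tanh(a @ self.W1)
        log_s = np.tanh(h @ self.W_s)  # bounded log-scale keeps exp() tame
        t = h @ self.W_t
        return log_s, t

    def forward(self, z):
        za, zb = z[..., :self.k], z[..., self.k:]
        log_s, t = self._scale_shift(za)
        xb = zb * np.exp(log_s) + t
        x = np.concatenate([za, xb], axis=-1)
        # Triangular Jacobian: log|det J| is the sum of the diagonal log-scales.
        return x, np.sum(log_s, axis=-1)

    def inverse(self, x):
        xa, xb = x[..., :self.k], x[..., self.k:]
        log_s, t = self._scale_shift(xa)  # xa == za, so scale and shift are recoverable
        zb = (xb - t) * np.exp(-log_s)
        return np.concatenate([xa, zb], axis=-1)

def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference Jacobian of f at a single point z (for checking only)."""
    d = z.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        dz = np.zeros(d)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2.0 * eps)
    return J

layer = AffineCoupling(d=4, k=2)
z = np.random.default_rng(1).standard_normal(4)
x, log_det = layer.forward(z)
J = numerical_jacobian(lambda v: layer.forward(v)[0], z)
_, log_det_numeric = np.linalg.slogdet(J)
print(log_det, log_det_numeric)  # the two values should agree closely
```

The O(d) log-determinant comes for free from the forward pass, while the finite-difference check needs a full d x d Jacobian. That gap is what the tractable-Jacobian constraint buys. Stacking several such layers with the roles of the two halves alternated is one common way to let every coordinate be transformed.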