Flow Matching Models

 — #Diffusion#ODE

Today, we will discuss an alternative to denoising diffusion probabilistic models (DDPM), and model the denoising of an image via flow matching.

Flow matching has shown promising results across various generative tasks:

  • Image Generation: Flow matching models generate high-quality images by learning smooth trajectories from noise to realistic photos, offering results comparable to state-of-the-art diffusion models while requiring fewer sampling steps.

  • 3D World Generation: In virtual world synthesis, flow matching enables continuous generation of 3D scenes by learning vector fields that transform random noise into coherent geometry and textures.

  • Audio Generation: For sound synthesis, flow matching can model the complex temporal dynamics of audio waveforms by following learned trajectories in the audio feature space.

Introduction

The key idea is to define a vector field that describes how points should move at each position to transform the noise distribution into the data distribution. This is similar to watching particles flow in a fluid - each particle follows a path determined by the surrounding flow.

The model learns the optimal vector field by matching it to "ground truth" flows between paired noise and data samples. Once trained, we can generate new samples by:

  1. Sampling from the noise distribution
  2. Following the learned vector field (solving the ODE)
  3. Arriving at a sample from the data distribution
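The three sampling steps above can be sketched in a few lines. This is only an illustration: `toy_velocity` is a hypothetical stand-in for a trained network (it simply pulls points toward 2.0), and a fixed-step Euler solver is assumed rather than whatever solver a real implementation would use.

```python
import math
import random

def toy_velocity(t, x):
    # Hypothetical stand-in for a trained v_theta(t, x): pulls x toward 2.0.
    return 2.0 - x

def euler_sample(velocity, x0, num_steps=100):
    """Integrate dx/dt = velocity(t, x) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(k * dt, x)
    return x

random.seed(0)
x0 = random.gauss(0.0, 1.0)                                # step 1: sample noise
x1_coarse = euler_sample(toy_velocity, x0, num_steps=10)   # step 2: solve the ODE
x1_fine = euler_sample(toy_velocity, x0, num_steps=1000)   # same ODE, more steps
exact = 2.0 + (x0 - 2.0) * math.exp(-1.0)                  # closed form for this toy field
print(x1_coarse, x1_fine, exact)                           # step 3: the generated sample
```

Changing `num_steps` is the speed versus precision knob: more steps track the exact ODE solution more closely at a higher cost.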

The main advantages of flow matching compared to diffusion models are:

  • No fixed number of steps - we can adjust sampling precision vs speed
  • Direct optimization of the vector field without auxiliary losses
  • Exact likelihood computation through the change of variables formula

However, flow matching models can be more challenging to train, since we need to learn a continuous vector field rather than discrete denoising steps. To see how this is done, we will work through the math and the proofs behind the flow matching objective.

This blog will be highly proof-based and use lots of probability terminology and notation.

Notation

  • $x_0 \sim \mathcal{N}(0,I)$: random noise sampled from a standard normal distribution
  • $x_1 \sim p(x_1)$: real image/data-point sampled from the true data distribution
  • $\phi_t(x_0)$: continuous-time transformation that moves $x_0$ to $x_1$
  • $\frac{d}{dt}\phi_t(x_0)$: true velocity field at time $t$ for point $x_0$
  • $v_\theta(t,x)$: neural network prediction of the velocity field at time $t$ and position $x$
  • $p_t(x)$: how likely $x$ is at each time $t \in [0,1]$

Continuous Normalizing Flows

Our goal in Continuous Normalizing Flows (CNF) is to train a neural network $v_\theta(t,x)$ that learns the optimal velocity field. This means we want,

$$\frac{d}{dt}\phi_t(x) = v_\theta(t,\phi_t(x))$$

We also assume that $\phi_0(x) = x$, meaning that at $t=0$, we start at $x$. Then we follow something called the push-forward equation, or the Forward Density Push:

$$p_t = [\phi_t]_* p_0$$

This means that at time $t$, the probability density $p_t$ is just the original probability density $p_0$ after being moved by $\phi_t$.

The push-forward operator $*$ is defined by

$$[\phi_t]_* p_0(x) = p_0(\phi_t^{-1}(x)) \left|\det \frac{\partial \phi_t^{-1}}{\partial x}(x)\right|$$

This formula tells us that to find the density at some point $x$ at time $t$, we can look backwards to find where this $x$ came from at time $0$ via $\phi_t^{-1}$ (the inverse flow), and multiply the original density at that point by the absolute value of the determinant of the Jacobian of the inverse transformation.

Example: If we want to know the probability density of marbles in a box at some point $x$ at time $t$, we can find the original point $\phi_t^{-1}(x)$ at $t=0$. $p_0(\phi_t^{-1}(x))$ tells us how dense the marbles were there initially. Then the determinant accounts for the dilation of the box over time.
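To make the push-forward formula concrete, here is a small 1D check. It assumes a fixed affine map $\phi(z) = \text{scale} \cdot z + \text{shift}$ (chosen only for illustration) applied to standard normal noise, where the pushed density is known in closed form to be $\mathcal{N}(\text{shift}, \text{scale}^2)$.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def pushforward_density(x, scale, shift):
    """Density at x of phi(z) = scale * z + shift when z ~ N(0, 1)."""
    z = (x - shift) / scale      # phi^{-1}(x): where x came from
    jac = abs(1.0 / scale)       # |d phi^{-1} / dx|: the 1D Jacobian determinant
    return normal_pdf(z) * jac

x = 1.3
via_formula = pushforward_density(x, scale=2.0, shift=0.5)
closed_form = normal_pdf(x, mean=0.5, std=2.0)
print(via_formula, closed_form)
```

Both numbers agree, which is the change of variables formula at work: stretching the space by `scale` thins the density by `1 / scale`.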

Flow Matching

Now we will talk about flow matching. This is essentially an attempt to model the flow of a probability distribution from pure noise all the way to a clear image.

Let $q(x_1)$ be our distribution of data (images in this case). Let's assume that we only have access to data samples and no access to the density function of $q$.

As previously mentioned, $p_t$ is our probability path such that $p_0 = p$, where $p(x) = \mathcal{N}(x|0, \mathbf{I})$. At the other end, we want $p_1 = q$.

With this, we attempt to match $v_\theta(t,x)$ to $u_t(x)$ via an MSE loss,

$$L_{FM}(\theta) = \mathbb{E}_{t,p_t(x)} \|v_\theta(t,x) - u_t(x)\|^2$$

From a perfectly learned $v_\theta(t,x)$, we would be able to generate the probability path $p_t(x)$. However, we don't have a closed form for the true $u_t(x)$ that generates $p_t(x)$. The novelty of the paper is to show that $p_t(x)$ and $u_t(x)$ can be constructed from conditional probability paths and conditional vector fields.

Conditional Probability Paths

Using the notation from above, we can write the marginal probability path $p_t(x)$ in terms of conditional probability paths as,

$$p_t(x) = \int p_t(x|x_1)q(x_1)dx_1$$

This integral tells us that the probability density $p_t(x)$ at any point $x$ and time $t$ is found by considering all possible final points $x_1$ (weighted by their probability $q(x_1)$) and integrating over the conditional probabilities $p_t(x|x_1)$ of being at $x$ at time $t$ given that we end at $x_1$.

For $t=1$, the conditional $p_1(x|x_1)$ is concentrated around $x_1$, so the marginal recovers (up to the small remaining variance) the data distribution:

$$p_1(x) = \int p_1(x|x_1)q(x_1)dx_1 \approx q(x)$$

More importantly and similarly, we can marginalize all the conditional vector fields via the following equation,

$$u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1)q(x_1)}{p_t(x)} dx_1$$

This integral represents how we combine vector fields from different flows. At each point $x$, we take all possible endpoint states $x_1$ and their corresponding vector fields $u_t(x|x_1)$. Each vector field is weighted by the ratio $\frac{p_t(x|x_1)q(x_1)}{p_t(x)}$, which represents the relative contribution of that endpoint to the total probability at $x$. The weighted vectors are then summed up to give us the overall vector field $u_t(x)$ at that point.
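For a tiny dataset the marginal vector field can be computed exactly, which makes the weighting visible. The sketch below assumes a hypothetical dataset of two 1D points with $q$ uniform over them, and a Gaussian conditional path with mean $t \cdot x_1$ and standard deviation $\sqrt{1-t^2}$; the integral then becomes a two-term weighted sum.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def cond_velocity(t, x, x1):
    # u_t(x|x1) = sigma'_t / sigma_t * (x - mu_t) + mu'_t for the assumed schedule
    return (-t / (1.0 - t * t)) * (x - t * x1) + x1

def marginal_velocity(t, x, data):
    """u_t(x) for q uniform over `data`: posterior-weighted average of u_t(x|x1)."""
    std = math.sqrt(1.0 - t * t)
    weights = [normal_pdf(x, t * x1, std) for x1 in data]   # p_t(x|x1), equal q(x1)
    total = sum(weights)                                     # proportional to p_t(x)
    return sum(w / total * cond_velocity(t, x, x1) for w, x1 in zip(weights, data))

data = [-1.0, 1.0]                            # hypothetical two-point dataset
u_mid = marginal_velocity(0.5, 0.0, data)     # halfway between the two endpoints
u_right = marginal_velocity(0.5, 0.4, data)   # closer to the path that ends at +1
print(u_mid, u_right)
```

At the symmetric point the two conditional fields cancel, while to the right of it the endpoint $x_1 = 1$ gets more weight and the combined field points right. With a real dataset this sum runs over every training example for every query point.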

However, integrating over all data points is intractable in practice, so $u_t$ cannot be computed directly. Thus, the authors propose a simpler objective that has the same optimal solution:

$$L_{CFM}(\theta) = \mathbb{E}_{t,q(x_1),p_t(x|x_1)} \|v_\theta(t,x) - u_t(x|x_1)\|^2$$

It is much easier to calculate $u_t(x|x_1)$ because it is defined on a per-sample basis. Additionally, Theorem 2 of the paper shows that the gradients of $L_{CFM}$ and $L_{FM}$ with respect to $\theta$ are equal, so the two objectives share the same minimizers.

Conditional Probability Path

Here, we will define an explicit version of $\phi_t$. But before that, we write $p_t(x|x_1)$ for the probability density of being at $x$ at time $t$ given that we reach $x_1$ at $t=1$. We formulate this as,

$$p_t(x|x_1) = \mathcal{N}(x|\mu_t(x_1),\sigma_t^2(x_1)\mathbf{I})$$

where the mean and standard deviation depend on time. Namely,

$\mu_t(x_1) = t \cdot x_1$ and $\sigma_t = \sqrt{1-t^2}$. Here $\mu_t(x_1)$ represents the mean of the conditional probability distribution at time $t$ given the endpoint $x_1$.

Additionally, we have the boundary conditions $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$, so that $p_0(x) = \mathcal{N}(x|0,\mathbf{I})$ (a pure-noise image), and $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{min}$ (the ground truth image with some slight variance), so that $p_1(x|x_1)$ is concentrated around $x_1$. The schedule $\sigma_t = \sqrt{1-t^2}$ above corresponds to the limit $\sigma_{min} = 0$.
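As a quick sanity check of these boundary conditions, we can sample from the conditional path at a few times and look at the empirical mean and standard deviation. The sketch assumes the schedule $\mu_t(x_1) = t \cdot x_1$, $\sigma_t = \sqrt{1-t^2}$ and a hypothetical 1D data point $x_1 = 3$.

```python
import math
import random
import statistics

def mu(t, x1):
    return t * x1

def sigma(t):
    return math.sqrt(1.0 - t * t)

def sample_conditional(t, x1, rng):
    """Draw x ~ p_t(x|x1) = N(mu_t(x1), sigma_t^2)."""
    return mu(t, x1) + sigma(t) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
x1 = 3.0          # hypothetical 1D "image"
stats = {}
for t in (0.0, 0.5, 1.0):
    draws = [sample_conditional(t, x1, rng) for _ in range(20000)]
    stats[t] = (statistics.fmean(draws), statistics.pstdev(draws))
    print(t, stats[t])
```

At $t=0$ the samples look like standard normal noise, at $t=0.5$ the mean has moved halfway to $x_1$, and at $t=1$ every sample sits on $x_1$.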

Multiple Gaussian Probability Paths

Another problem that we have is that there are many vector fields that could possibly generate a particular probability path. Let's look at a simple example:

Imagine we have a 1D Gaussian distribution that we want to transform into another Gaussian distribution that's shifted to the right. We could:

  1. Move all points directly to the right at a constant speed
  2. Have points on the left move faster than points on the right
  3. Have points temporarily move up/down before reaching their final position

All of these vector fields can end at the same final distribution, even though the paths taken by individual points are different. The same ambiguity exists for a whole probability path: more than one vector field can generate it, so we have to pick one.

Simple Affine Transformation

That's why we rely on a simple affine transformation for Gaussian distributions to give us the simplest possible vector field.

$$\phi_t(x) = \sigma_t(x_1) \cdot x + \mu_t(x_1)$$

Essentially, $\phi_t$ pushes the noise distribution, which starts off as $p_0(x|x_1) = p(x)$, to the intermediate distribution $p_t(x|x_1)$, and at $t=1$ to a distribution concentrated at the ground truth image. This can be formally written as,

$$[\phi_t]_* p(x) = p_t(x|x_1)$$

$\phi_t$ gives us where a sample should be along the flow from noise to data. If we have a noisy image $x_0$ and apply $\phi_t(x_0)$, then we should get a sample from the intermediate distribution $p_t(x|x_1)$.

Vector Field Substitution

Now that we have a position estimate at every time $t$, we can take the derivative of $\phi_t(x)$ with respect to $t$ to give us the vector field,

$$\frac{d}{dt} \phi_t(x) = u_t(\phi_t(x)|x_1)$$

Substituting $x = \phi_t(x_0)$ into the CFM objective gives,

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p(x_0)} \left\| v_\theta(t, \phi_t(x_0)) - \frac{d}{dt} \phi_t(x_0) \right\|^2$$

We can use $x_0$ in this case because we sample $x_0 \sim p_0$ and compute $x_t$ deterministically via $\phi_t$, which is the movement function.
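The objective above translates almost directly into a Monte Carlo estimate. The following is a minimal sketch, not a training loop: it assumes the schedule $\mu_t = t \cdot x_1$, $\sigma_t = \sqrt{1-t^2}$, a hypothetical one-point dataset, and two hand-written stand-ins for the network. Capping $t$ at `t_max` is our own assumption, made because $\sigma'_t$ blows up at $t = 1$ for this schedule.

```python
import math
import random

def phi(t, x0, x1):
    return math.sqrt(1.0 - t * t) * x0 + t * x1

def dphi_dt(t, x0, x1):
    return -t / math.sqrt(1.0 - t * t) * x0 + x1

def cfm_loss(model, data, num_samples=2000, t_max=0.99, seed=0):
    """Monte Carlo estimate of E || model(t, phi_t(x0)) - d/dt phi_t(x0) ||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.0, t_max)   # assumption: stay below t = 1
        x1 = rng.choice(data)         # x1 ~ q
        x0 = rng.gauss(0.0, 1.0)      # x0 ~ p
        xt = phi(t, x0, x1)           # deterministic given (t, x0, x1)
        total += (model(t, xt) - dphi_dt(t, x0, x1)) ** 2
    return total / num_samples

def zero_model(t, x):
    return 0.0                        # hypothetical untrained baseline

def oracle_model(t, x, x1=1.0):
    # Exact conditional field u_t(x|x1); optimal only because the dataset is {x1}.
    return (-t / (1.0 - t * t)) * (x - t * x1) + x1

data = [1.0]
loss_zero = cfm_loss(zero_model, data)
loss_oracle = cfm_loss(oracle_model, data)
print(loss_zero, loss_oracle)
```

In a real setup `model` would be a neural network and the same sampled triples $(t, x_0, x_1)$ would feed a gradient step instead of a running average.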

Proof of Vector Field Uniqueness

1) First, recall that $\phi_t(x) = \sigma_t(x_1) \cdot x + \mu_t(x_1)$ is our flow map.

2) To find the vector field $u_t$, we need to take the time derivative of $\phi_t(x)$:

$$\begin{aligned} \frac{d}{dt} \phi_t(x) &= \frac{d}{dt}[\sigma_t(x_1) \cdot x + \mu_t(x_1)] \\ &= \sigma'_t(x_1) \cdot x + \mu'_t(x_1) \end{aligned}$$

3) Now, we need to express this in terms of $\phi_t(x)$ instead of $x$. From the flow map equation:

$$x = \frac{\phi_t(x) - \mu_t(x_1)}{\sigma_t(x_1)}$$

4) Substituting this back into our derivative:

$$\begin{aligned} \frac{d}{dt} \phi_t(x) &= \sigma'_t(x_1) \cdot \left(\frac{\phi_t(x) - \mu_t(x_1)}{\sigma_t(x_1)}\right) + \mu'_t(x_1) \\ &= \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(\phi_t(x) - \mu_t(x_1)) + \mu'_t(x_1) \end{aligned}$$

5) By definition of the vector field:

$$u_t(\phi_t(x)|x_1) = \frac{d}{dt} \phi_t(x)$$

6) Therefore:

$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$

This proves that the vector field has the claimed form. The uniqueness follows from the fact that this is the only vector field that can generate the given flow map $\phi_t$.

In the CFM loss, we already calculate $\phi_t(x_0)$ for the forward pass of the neural network, so we simply compare the prediction against the true derivative $\frac{d}{dt} \phi_t(x_0)$ rather than evaluating the $u_t(x|x_1)$ formula.
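The closed form can be checked numerically against a finite difference of the flow map. This sketch assumes the schedule $\mu_t = t \cdot x_1$, $\sigma_t = \sqrt{1-t^2}$ and arbitrary illustrative values for $t$, $x_0$ and $x_1$.

```python
import math

def conditional_velocity(x, mu, sigma, dmu, dsigma):
    """u_t(x|x1) = sigma'_t / sigma_t * (x - mu_t) + mu'_t."""
    return dsigma / sigma * (x - mu) + dmu

t, x0, x1 = 0.3, -0.7, 2.0            # illustrative values

def phi(s):
    # Flow map phi_s(x0) for the assumed schedule.
    return math.sqrt(1.0 - s * s) * x0 + s * x1

mu_t, sigma_t = t * x1, math.sqrt(1.0 - t * t)
dmu_t, dsigma_t = x1, -t / math.sqrt(1.0 - t * t)

h = 1e-6
finite_diff = (phi(t + h) - phi(t - h)) / (2.0 * h)   # numerical d/dt phi_t(x0)
from_formula = conditional_velocity(phi(t), mu_t, sigma_t, dmu_t, dsigma_t)
print(finite_diff, from_formula)
```

The two values agree up to finite-difference error, which is why the loss can use whichever form is more convenient.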

Variance Exploding Diffusion

Variance Exploding (VE) diffusion adds more and more Gaussian noise to the data as time goes on, so the overall variance keeps growing. In contrast, Variance Preserving (VP) diffusion keeps the total variance fixed by scaling the signal as it adds noise. VE is useful when you want the noise level itself to grow significantly during the process.

The Variance Exploding path has the form,

$$p_t(x|x_1) = \mathcal{N}(x|x_1, \sigma_{1-t}^2 I)$$

where $\sigma_t$ is an increasing function, $\sigma_0 = 0$, and $\sigma_1 \gg 1$. For VE diffusion, we have:

1) From the given form of $p_t(x|x_1)$, we can identify that:

  • $\mu_t(x_1) = x_1$ (the mean)
  • $\sigma_t(x_1) = \sigma_{1-t}$ (the standard deviation)

2) Substituting these into the general vector field formula derived earlier:

$$u_t(x|x_1) = \frac{\sigma'_t(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu'_t(x_1)$$

3) We have:

  • $\mu'_t(x_1) = 0$ since $\mu_t(x_1)$ is constant with respect to $t$
  • $\sigma_t(x_1) = \sigma_{1-t}$, so $\sigma'_t(x_1) = -\sigma'_{1-t}$ by the chain rule

4) Therefore:

$$\begin{aligned} u_t(x|x_1) &= \frac{-\sigma'_{1-t}}{\sigma_{1-t}}(x - x_1) + 0 \\ &= -\frac{\sigma'_{1-t}}{\sigma_{1-t}}(x-x_1) \end{aligned}$$

Note that $\sigma'_{1-t}$ is positive since $\sigma_t$ is increasing; the minus sign in the final expression comes from the chain rule applied to $1-t$. This gives us the vector field for Variance Exploding diffusion.
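Here is a small numerical check of the VE vector field. The text leaves $\sigma_t$ unspecified, so this sketch assumes a hypothetical linear schedule $\sigma_s = \sigma_{max} \cdot s$ with $\sigma_{max} = 10$, which satisfies $\sigma_0 = 0$ and is increasing.

```python
SIGMA_MAX = 10.0    # assumed value, for illustration only

def sigma(s):
    return SIGMA_MAX * s       # hypothetical increasing schedule with sigma(0) = 0

def dsigma(s):
    return SIGMA_MAX           # derivative of the schedule with respect to s

def ve_velocity(t, x, x1):
    """u_t(x|x1) = -sigma'(1-t) / sigma(1-t) * (x - x1)."""
    s = 1.0 - t
    return -dsigma(s) / sigma(s) * (x - x1)

def phi(t, x0, x1):
    # Flow map: mean x1, standard deviation sigma(1 - t).
    return sigma(1.0 - t) * x0 + x1

t, x0, x1 = 0.25, 0.8, -1.0
h = 1e-6
finite_diff = (phi(t + h, x0, x1) - phi(t - h, x0, x1)) / (2.0 * h)
from_formula = ve_velocity(t, phi(t, x0, x1), x1)
print(finite_diff, from_formula)
```

Both values come out at about $-8$ here: the sample contracts toward $x_1$ as the noise scale shrinks with $t$.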

Variance Preserving Diffusion

Note: Unlike DDPM where $x_0$ is the ground truth image, in Flow Matching $x_0$ is the noise and $x_1$ is the ground truth image. This reversal makes the math more intuitive for describing the flow from noise to data.

Variance Preserving diffusion is the traditional continuous diffusion process. In DDPM (my previous blog), the forward path is,

$$x_t = \sqrt{\alpha_t}x_0 + \sqrt{1-\alpha_t}\epsilon$$

where $\alpha_t = \prod_{s=1}^t (1-\beta_s)$ and $\epsilon \sim \mathcal{N}(0,I)$. This is the forward process that gradually adds noise to the image $x_0$ until we reach pure noise.

For Flow Matching, we need to transition from the discrete DDPM formulation above to a continuous version. To make this continuous, we:

  1. Replace the discrete time steps with a continuous time parameter $t \in [0,1]$
  2. Express $\alpha_t$ as a continuous function: $\alpha_t = e^{-\frac{1}{2}T(t)}$
  3. Replace the discrete noise schedule $\beta_t$ with a continuous function $\beta(t)$
  4. Define $T(t)$ as the cumulative integral: $T(t) = \int_0^t \beta(s)ds$

This gives us a smooth, continuous path from noise to data, rather than discrete steps. For Variance Preserving (VP) diffusion, the resulting path has the form:

$$p_t(x|x_1) = \mathcal{N}\big(x \,\big|\, \alpha_{1-t}x_1, \, (1-\alpha^2_{1-t})I\big),$$

where

$$\alpha_t = e^{-\frac{1}{2}T(t)}, \quad T(t) = \int_0^t \beta(s)ds,$$

and $\beta(t)$ is the noise schedule.

This gives us a continuous-time VP diffusion path where:

  • The mean is $\alpha_{1-t}x_1$, i.e. the data scaled by $\alpha_{1-t}$
  • The variance is $(1-\alpha^2_{1-t})I$
  • The total variance remains constant (variance preserving) for all $t$: for unit-variance data, $\alpha^2_{1-t} + (1-\alpha^2_{1-t}) = 1$

Note that this formulation goes from noise ($t=0$) to data ($t=1$), which is the reverse of the traditional DDPM formulation. Also note that the continuous $\alpha_t$ multiplies the data directly, so it plays the role of $\sqrt{\alpha_t}$ in the discrete DDPM formula. This is why its exponent carries the factor $\frac{1}{2}$, giving $\alpha_t^2 = e^{-T(t)}$.

We have:

  • $\mu_t(x_1) = \alpha_{1-t}x_1$
  • $\sigma_t(x_1) = \sqrt{1-\alpha^2_{1-t}}$

Then, substituting them into the original formula for $u_t(x|x_1)$,

$$\begin{aligned} u_t(x|x_1) &= \frac{\alpha'_{1-t}}{1-\alpha^2_{1-t}}(\alpha_{1-t}x - x_1) \\ &= -\frac{T'(1-t)}{2}\left[\frac{e^{-T(1-t)}x - e^{-\frac{1}{2}T(1-t)}x_1}{1-e^{-T(1-t)}}\right] \end{aligned}$$

where $\alpha'_{1-t}$ denotes the derivative of $\alpha$ evaluated at $1-t$, and we used:

  • $\alpha'_{1-t} = -\frac{1}{2}T'(1-t)e^{-\frac{1}{2}T(1-t)}$
  • $\alpha_{1-t} = e^{-\frac{1}{2}T(1-t)}$

This gives us the vector field for Variance Preserving diffusion, which describes how points should move at each time $t$ to transform the noise distribution into the data distribution while preserving total variance.
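The VP vector field can be checked the same way as the general formula, by comparing it with a finite difference of the flow map $\phi_t(x_0) = \sigma_t(x_1) x_0 + \mu_t(x_1)$. The noise schedule is left open in the text, so this sketch assumes a linear $\beta(s)$ between `BETA_MIN` and `BETA_MAX`; both the shape and the values are illustrative assumptions.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0     # assumed endpoints of a linear schedule

def beta(s):
    return BETA_MIN + s * (BETA_MAX - BETA_MIN)

def T(s):
    # Integral of beta from 0 to s.
    return BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s * s

def alpha(s):
    return math.exp(-0.5 * T(s))

def dalpha(s):
    # Derivative of alpha with respect to its argument, using T'(s) = beta(s).
    return -0.5 * beta(s) * alpha(s)

def vp_velocity(t, x, x1):
    """u_t(x|x1) = alpha'_{1-t} / (1 - alpha_{1-t}^2) * (alpha_{1-t} x - x1)."""
    s = 1.0 - t
    a = alpha(s)
    return dalpha(s) / (1.0 - a * a) * (a * x - x1)

def phi(t, x0, x1):
    a = alpha(1.0 - t)
    return math.sqrt(1.0 - a * a) * x0 + a * x1

t, x0, x1 = 0.4, 0.5, 1.0
h = 1e-6
finite_diff = (phi(t + h, x0, x1) - phi(t - h, x0, x1)) / (2.0 * h)
from_formula = vp_velocity(t, phi(t, x0, x1), x1)
print(finite_diff, from_formula)
```

The agreement between the two numbers is a useful guard against sign and factor mistakes in the $1-t$ bookkeeping.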

Example

Let's say that we have noise starting at $x_0 = 0.5$ and our ground truth image is $x_1=1$.

We use $\mu_t(x_1) = t \cdot x_1$ and $\sigma_t(x_1) = \sqrt{1-t^2}$.

This gives the flow map $\phi_t(x_0) = \sqrt{1-t^2} \cdot 0.5 + t$.

The partially denoised image at $t=0.6$ is $\phi_{0.6}(0.5) = \sqrt{1-0.6^2} \cdot 0.5 + 0.6 = 0.8 \cdot 0.5 + 0.6 = 1.0$.

To get the ground-truth velocity at that point, we have the formula

$$u_t(\phi_t(x_0)|x_1) = \frac{d}{dt}\phi_t(x_0) = -\frac{t}{\sqrt{1-t^2}} \cdot 0.5 + 1$$

At $t=0.6$, the ground truth velocity would be:

$$u_{0.6}(\phi_{0.6}(x_0)|x_1) = -\frac{0.6}{\sqrt{1-0.6^2}} \cdot 0.5 + 1 = 0.625$$

This means that at time $t=0.6$, at the point $x = 1.0$, the neural network should predict a velocity of 0.625 in the positive direction.

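We can check how this velocity behaves over time:

  1. At $t=0$, the velocity is 1
  2. As $t$ increases, the velocity decreases (0.625 at $t=0.6$) and reaches 0 at $t \approx 0.894$, where $\phi_t(x_0)$ peaks at about 1.118, slightly past the target
  3. As $t \to 1$, the velocity does not settle at 0: the term $-\frac{t}{\sqrt{1-t^2}} \cdot 0.5$ grows without bound, so the velocity becomes large and negative while the sample is pulled back to $x_1 = 1$. This is a side effect of $\sigma_t = \sqrt{1-t^2}$ having an unbounded derivative at $t=1$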
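The numbers in this example can be reproduced with a few lines, using the same schedule $\mu_t = t \cdot x_1$, $\sigma_t = \sqrt{1-t^2}$ and the same $x_0 = 0.5$, $x_1 = 1$. The extra evaluation at $t = 0.99$ is our own addition, to show how the $\sigma'_t$ term behaves close to the end of the path.

```python
import math

def phi(t, x0, x1):
    """Position along the conditional flow at time t."""
    return math.sqrt(1.0 - t * t) * x0 + t * x1

def velocity(t, x0, x1):
    """Ground-truth velocity d/dt phi_t(x0)."""
    return -t / math.sqrt(1.0 - t * t) * x0 + x1

x0, x1 = 0.5, 1.0
position = phi(0.6, x0, x1)
speed_start = velocity(0.0, x0, x1)
speed_mid = velocity(0.6, x0, x1)
speed_late = velocity(0.99, x0, x1)   # close to t = 1, where sigma'_t is large
print(position, speed_start, speed_mid, speed_late)
```

The position and the velocity at $t = 0.6$ match the hand calculation (1.0 and 0.625), and the value at $t = 0.99$ is negative because the $\sigma'_t$ term dominates there.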

References

  1. Flow Matching for Generative Modeling - The original paper introducing Flow Matching as a framework for generative modeling