Flow Matching Models
— #Diffusion #ODE
Today, we will discuss an alternative to denoising diffusion probabilistic models (DDPM), and model the denoising of an image via flow matching.
Flow matching has shown promising results across various generative tasks:
Image Generation: Flow matching models generate high-quality images by learning smooth trajectories from noise to realistic photos, offering quality comparable to state-of-the-art diffusion models while requiring fewer sampling steps.
3D World Generation: In virtual world synthesis, flow matching enables continuous generation of 3D scenes by learning vector fields that transform random noise into coherent geometry and textures.
Audio Generation: For sound synthesis, flow matching can model the complex temporal dynamics of audio waveforms by following learned trajectories in the audio feature space.
Introduction
The key idea is to define a vector field that describes how points should move at each position to transform the noise distribution into the data distribution. This is similar to watching particles flow in a fluid - each particle follows a path determined by the surrounding flow.
The model learns the optimal vector field by matching it to "ground truth" flows between paired noise and data samples. Once trained, we can generate new samples by:
- Sampling from the noise distribution
- Following the learned vector field (solving the ODE)
- Arriving at a sample from the data distribution (see the sketch below)
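To make the sampling procedure concrete, here is a minimal sketch of Euler integration of a learned velocity field. It assumes a trained network `v_theta(x, t)`; the name and interface are placeholders, not something fixed by this post:

```python
import torch

@torch.no_grad()
def sample(v_theta, shape, n_steps=50, device="cpu"):
    """Generate samples by Euler-integrating the learned velocity field.

    v_theta: callable (x, t) -> velocity with the same shape as x (assumed trained).
    shape:   shape of the batch to generate, e.g. (batch, channels, height, width).
    """
    x = torch.randn(shape, device=device)          # start from noise: x_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + v_theta(x, t) * dt                 # one Euler step along the ODE dx/dt = v_theta(x, t)
    return x                                       # approximate sample from the data distribution
```

Because the model defines an ODE rather than a fixed denoising chain, `n_steps` can be chosen freely to trade accuracy for speed, or a higher-order ODE solver can be used instead of Euler.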
The main advantages of flow matching compared to diffusion models are:
- No fixed number of steps, so we can trade off sampling precision against speed
- Direct optimization of the vector field without auxiliary losses
- Exact likelihood computation through the change of variables formula
However, flow matching models can be more challenging to train since we need to learn a continuous vector field rather than discrete denoising steps. To see why this works, we will go through the proofs and the math behind the flow matching objective.
This blog will be highly proof-based and use lots of probability terminology and notation.
Notation
- $x_0$: random noise sampled from the standard normal distribution $\mathcal{N}(0, I)$
- $x_1$: real image/data point sampled from the true data distribution $q(x_1)$
- $\phi_t(x)$: continuous-time transformation (flow) that moves the noise $x_0$ toward the data $x_1$
- $u_t(x)$: true velocity field at time $t$ for point $x$
- $v_t(x; \theta)$: neural network prediction of the velocity field at time $t$ and position $x$
- $p_t(x)$: probability density of $x$ at time $t$, i.e., how likely $x$ is at each time
Continuous Normalizing Flows
Our goal in Continuous Normalizing Flows (CNF) is to train a neural network $v_t(x; \theta)$ that learns the optimal velocity field. This means we want the flow $\phi_t$ generated by $v_t$ to satisfy,
$$\frac{d}{dt}\phi_t(x) = v_t\left(\phi_t(x)\right)$$
We also assume that $\phi_0(x) = x$, meaning that at $t = 0$, we start at $x$. Then we follow something called the push-forward equation, or the Forward Density Push:
$$p_t = [\phi_t]_* p_0$$
Once again, this means that at time $t$, the probability density $p_t$ is just the original probability density $p_0$ after being moved by the flow $\phi_t$.
This operator is defined by
$$[\phi_t]_* p_0(x) = p_0\left(\phi_t^{-1}(x)\right)\,\det\left[\frac{\partial \phi_t^{-1}}{\partial x}(x)\right]$$
This formula tells us that to find the density at some point $x$ at time $t$, we can look backwards to find where this point came from at time $0$ via $\phi_t^{-1}$ (the inverse flow), and multiply the original density at that point by the determinant of the Jacobian of the inverse transformation.
Example: If we want to know the probability density of marbles in a box at some point $x$ at time $t$, we can find the original point $\phi_t^{-1}(x)$ at $t = 0$. $p_0\left(\phi_t^{-1}(x)\right)$ tells us how dense the marbles were there initially. Then the determinant accounts for the dilation of the box over time.
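As a small worked example (my own, using a simple 1D scaling flow as the assumption): take $\phi_t(x) = (1 + t)\, x$ and $p_0 = \mathcal{N}(0, 1)$. Then $\phi_t^{-1}(x) = \frac{x}{1 + t}$, and the push-forward formula gives
$$[\phi_t]_* p_0(x) = p_0\!\left(\frac{x}{1 + t}\right) \cdot \frac{1}{1 + t} = \frac{1}{\sqrt{2\pi}\,(1 + t)} \exp\!\left(-\frac{x^2}{2(1 + t)^2}\right) = \mathcal{N}\!\left(x \mid 0,\, (1 + t)^2\right),$$
i.e., the standard normal is simply stretched to standard deviation $1 + t$, exactly the "dilation of the box" from the marble analogy.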
Flow Matching
Now we will talk about flow matching. This is essentially attempting to model the flow of a probability distribution from noisy data all the way to a clear image.
Let $q(x_1)$ be our distribution of data (images in this case). Let's assume that we only have access to data samples and no access to the density function of $q(x_1)$.
As previously mentioned, $p_t(x)$ is our probability path such that $p_0(x) = p(x)$, where $p = \mathcal{N}(0, I)$ is the simple prior. And then trivially, our $p_1(x) \approx q(x)$.
With this, we attempt to match $v_t(x; \theta)$ to $u_t(x)$ via an MSE loss,
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, p_t(x)}\left\|v_t(x; \theta) - u_t(x)\right\|^2$$
From a perfectly learned $v_t(x; \theta)$, we would be able to generate the probability path $p_t(x)$. However, we don't have the closed-form ground truth $u_t(x)$ that would generate $p_t(x)$ for us; the novelty in this paper is showing that $p_t$ and $u_t$ can be constructed via conditional probability paths and conditional vector fields.
Conditional Probability Paths
Using the notation from above, we can write the marginal probability $p_t(x)$ in terms of conditional probability paths as,
$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1$$
This integral tells us that the probability density at any point $x$ and time $t$ is found by considering all possible final points $x_1$ (weighted by their probability $q(x_1)$) and integrating over the conditional probabilities $p_t(x \mid x_1)$ of being at $x$ at time $t$ given that we end at $x_1$.
For $t = 1$, the conditional density $p_1(x \mid x_1)$ concentrates around $x_1$, so the marginal $p_1(x) \approx q(x)$: the probability that we end up at $x$ is just the probability that $x$ is in the data.
More importantly and similarly, we can marginalize all the conditional vector fields via the following equation,
$$u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1$$
This integral represents how we combine vector fields from different flows. At each point $x$, we take all possible endpoint states $x_1$ and their corresponding conditional vector fields $u_t(x \mid x_1)$. Each vector field is weighted by the ratio $\frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}$, which represents the relative contribution of that endpoint to the total probability at $x$. The weighted vectors are then integrated to give us the overall vector field at that point.
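One way to read this weight (an added observation; it follows directly from Bayes' rule) is as the posterior probability of the endpoint given the current position:
$$\frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)} = p_t(x_1 \mid x), \qquad \text{so} \qquad u_t(x) = \mathbb{E}_{x_1 \sim p_t(x_1 \mid x)}\!\left[u_t(x \mid x_1)\right].$$
In words, the marginal vector field is the average of the conditional vector fields over the data points that could plausibly have produced $x$ at time $t$.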
However, integrating over all data points is intractable in practice and results in a poor estimate of $u_t(x)$. Thus, they propose a simpler objective that results in the same optimal solution:
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x \mid x_1)}\left\|v_t(x; \theta) - u_t(x \mid x_1)\right\|^2$$
It is easier to compute because it works on a per-sample basis: we only need the conditional vector field for a single data point $x_1$. Additionally, they show in Theorem 2 that the gradients of $\mathcal{L}_{\mathrm{FM}}$ and $\mathcal{L}_{\mathrm{CFM}}$ with respect to $\theta$ are equal, so the two objectives share the same optimum.
Conditional Probability Path
Here, we will be defining an explicit form of $p_t(x \mid x_1)$. We read this as the probability of being at $x$ at time $t$, given that we arrive at $x_1$ at $t = 1$. We formulate it as a Gaussian,
$$p_t(x \mid x_1) = \mathcal{N}\left(x \mid \mu_t(x_1),\, \sigma_t(x_1)^2 I\right)$$
where we have a time-dependent mean and standard deviation. Namely, $\mu : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma : [0,1] \times \mathbb{R}^d \to \mathbb{R}_{>0}$. Here $\mu_t(x_1)$ represents the mean of the conditional probability distribution at time $t$ given the endpoint $x_1$, and $\sigma_t(x_1)$ its standard deviation.
Additionally, we have the boundary conditions $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$, so that $p_0(x \mid x_1) = \mathcal{N}(x \mid 0, I)$ (a randomly-noised image), and then $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\min}$ (the ground truth image with some slight variance), which gives $p_1(x \mid x_1) = \mathcal{N}\left(x \mid x_1,\, \sigma_{\min}^2 I\right)$.
Multiple Gaussian Probability Paths
Another problem that we have is that there are many vector fields that could possibly generate a particular probability path. Let's look at a simple example:
Imagine we have a 1D Gaussian distribution that we want to transform into another Gaussian distribution that's shifted to the right. We could:
- Move all points directly to the right at a constant speed
- Have points on the left move faster than points on the right
- Have points temporarily move up/down before reaching their final position
All of these different vector fields would result in the same final distribution, even though the paths taken by individual points are different. This illustrates how multiple vector fields can generate the same probability path.
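To make the non-uniqueness concrete, here is a small added example (not from the paper): suppose the probability path is static, $p_t = \mathcal{N}(0, I)$ in $\mathbb{R}^2$ for all $t$. The zero field $u_t(x) = 0$ generates it, but so does any rotation field $u_t(x) = A x$ with $A$ skew-symmetric, since the continuity equation still holds:
$$\frac{\partial p_t}{\partial t} + \nabla \cdot \left(p_t\, u_t\right) = 0 + \nabla p_t \cdot (A x) + p_t\, \mathrm{tr}(A) = -p_t\, x^\top A x + 0 = 0,$$
because $x^\top A x = 0$ for skew-symmetric $A$. Two very different vector fields, one and the same probability path.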
Simple Affine Transformation
That's why we rely on a simple affine transformation for Gaussian distributions to give us the simplest possible vector field.
Essentially, $\psi_t$ pushes the noise distribution, which starts off as $\mathcal{N}(0, I)$, all the way to the ground truth image $x_1$. This can be formally written as,
$$\psi_t(x) = \sigma_t(x_1)\, x + \mu_t(x_1)$$
$\psi_t(x)$ gives us where a sample should be along the flow from noise to data. If we have a noisy image $x_0 \sim \mathcal{N}(0, I)$ and apply $\psi_t$, then $\psi_t(x_0)$ is a sample from the intermediate distribution $p_t(x \mid x_1)$.
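To see why, note the standard Gaussian reparameterization (a one-line check): if $x_0 \sim \mathcal{N}(0, I)$, then
$$\psi_t(x_0) = \sigma_t(x_1)\, x_0 + \mu_t(x_1) \sim \mathcal{N}\!\left(\mu_t(x_1),\, \sigma_t(x_1)^2 I\right) = p_t(\,\cdot \mid x_1).$$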
Vector Field Substitution
Now that we have a position estimate at every time $t$, we can take the derivative of $\psi_t(x)$ with respect to $t$ to give us the estimate of the vector field,
$$\frac{d}{dt}\psi_t(x) = u_t\left(\psi_t(x) \mid x_1\right)$$
We can use $x_0$ in this case because we will be sampling $x_0 \sim \mathcal{N}(0, I)$ and computing $x_t = \psi_t(x_0)$ deterministically via $\psi_t$, which is the movement function.
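To make this concrete, here is a minimal sketch of one CFM training step. It assumes the simple linear choice $\mu_t(x_1) = t\, x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t$ (an assumption for illustration; other mean/std schedules work the same way), and a placeholder network `model(x, t)` standing in for $v_t(x; \theta)$:

```python
import torch

SIGMA_MIN = 1e-4  # sigma_min: small terminal standard deviation (assumed hyperparameter)

def cfm_loss(model, x1):
    """One conditional flow matching loss evaluation.

    model: network v_theta(x, t) -> velocity, same shape as x
    x1:    batch of data samples, shape (B, ...)
    Assumes the linear path mu_t(x1) = t * x1, sigma_t(x1) = 1 - (1 - sigma_min) * t.
    """
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device)                     # t ~ U[0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))                 # reshape for broadcasting over data dims

    x0 = torch.randn_like(x1)                               # x_0 ~ N(0, I)
    xt = (1 - (1 - SIGMA_MIN) * t_) * x0 + t_ * x1          # psi_t(x_0) = sigma_t * x_0 + mu_t
    target = x1 - (1 - SIGMA_MIN) * x0                      # d/dt psi_t(x_0) for this path

    pred = model(xt, t)                                     # v_theta(psi_t(x_0), t)
    return ((pred - target) ** 2).mean()                    # MSE over batch and dimensions
```

Note that for this particular linear path the target velocity $x_1 - (1 - \sigma_{\min})\, x_0$ does not depend on $t$, which is part of what makes it attractive in practice.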
Proof of Vector Field Uniqueness
1) First, recall that $\psi_t(x) = \sigma_t(x_1)\, x + \mu_t(x_1)$ is our flow map.
2) To find the vector field $u_t$, we need to take the time derivative of $\psi_t(x)$:
$$\frac{d}{dt}\psi_t(x) = \sigma_t'(x_1)\, x + \mu_t'(x_1)$$
3) Now, we need to express this in terms of $\psi_t(x)$ instead of $x$. From the flow map equation:
$$x = \frac{\psi_t(x) - \mu_t(x_1)}{\sigma_t(x_1)}$$
4) Substituting this back into our derivative:
$$\frac{d}{dt}\psi_t(x) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\left(\psi_t(x) - \mu_t(x_1)\right) + \mu_t'(x_1)$$
5) By definition of the vector field:
$$\frac{d}{dt}\psi_t(x) = u_t\left(\psi_t(x) \mid x_1\right)$$
6) Therefore:
$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\left(x - \mu_t(x_1)\right) + \mu_t'(x_1)$$
This proves that the conditional vector field has the claimed form. The uniqueness follows from the fact that this is the only vector field that can generate the given flow map $\psi_t$.
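As a sanity check (this instantiation is my addition, using the same linear mean and standard deviation assumed in the training sketch above): with $\mu_t(x_1) = t\, x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t$, the formula gives
$$u_t(x \mid x_1) = \frac{-(1 - \sigma_{\min})}{1 - (1 - \sigma_{\min})\, t}\left(x - t\, x_1\right) + x_1 = \frac{x_1 - (1 - \sigma_{\min})\, x}{1 - (1 - \sigma_{\min})\, t},$$
which for small $\sigma_{\min}$ points roughly from the current position $x$ toward the data point $x_1$.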
In the CFM loss, we already compute $\psi_t(x_0)$ for the forward pass of the neural network, and we simply compare the network's output to the true derivative $\frac{d}{dt}\psi_t(x_0)$ rather than going through the derived formula for $u_t(x \mid x_1)$.
Variance Exploding Diffusion
Variance Exploding (VE) diffusion adds more and more Gaussian noise to the data as time goes on, so the overall variance keeps growing. In contrast, Variance Preserving (VP) diffusion keeps the total variance fixed by scaling the signal as it adds noise. VE is useful when you want the noise level itself to grow significantly during the process.
The Variance Exploding path has the form,
$$p_t(x \mid x_1) = \mathcal{N}\left(x \mid x_1,\, \sigma_{1-t}^2 I\right)$$
where $\sigma_t$ is an increasing function, $\sigma_0 = 0$, and $\sigma_1 \gg 1$. For VE diffusion, we have:
1) From the given form of $p_t(x \mid x_1)$, we can identify that:
- $\mu_t(x_1) = x_1$ (the mean)
- $\sigma_t(x_1) = \sigma_{1-t}$ (the standard deviation)
2) Substituting these into the general vector field formula derived earlier:
$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\left(x - \mu_t(x_1)\right) + \mu_t'(x_1)$$
3) We have:
- $\mu_t'(x_1) = 0$ since $x_1$ is constant with respect to $t$
- $\sigma_t(x_1) = \sigma_{1-t}$, so by the chain rule $\frac{d}{dt}\sigma_{1-t} = -\sigma_{1-t}'$
4) Therefore:
$$u_t(x \mid x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}\left(x - x_1\right)$$
Note that $\sigma_{1-t}'$ (the derivative of $\sigma$ with respect to its own argument) is positive since $\sigma$ is increasing, but it appears with a negative sign in our final expression due to the chain rule when we changed variables from $t$ to $1 - t$. This gives us the vector field for Variance Exploding diffusion.
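For intuition, consider the geometric schedule $\sigma_s = \sigma_{\min}\left(\sigma_{\max} / \sigma_{\min}\right)^{s}$ commonly used in score-based VE models, with $\sigma_{\min}$ small so that $\sigma_0 \approx 0$ (this concrete schedule is an assumption on my part, not something fixed by the section). Then $\sigma_s' = \sigma_s \ln\frac{\sigma_{\max}}{\sigma_{\min}}$, and the field simplifies to
$$u_t(x \mid x_1) = -\ln\frac{\sigma_{\max}}{\sigma_{\min}}\,\left(x - x_1\right),$$
which always points from the current position $x$ toward the data point $x_1$ at a time-independent rate.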
Variance Preserving Diffusion
Note: Unlike DDPM where is the ground truth image, in Flow Matching is the noise and is the ground truth image. This reversal makes the math more intuitive for describing the flow from noise to data.
Variance Preserving Diffusion is the traditional continuous diffusion formulation. We have the following path for DDPM (my previous blog) as,
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t \mid \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\right)$$
where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_s = 1 - \beta_s$. This is the forward process that gradually adds noise to the image until we reach pure noise.
For Flow Matching, we need to transition from the discrete DDPM formulation to a continuous version. The key is to recognize that in the discrete case, the mean scaling is a product over steps, which is approximately the exponential of a sum:
$$\sqrt{\bar{\alpha}_t} = \prod_{s=1}^{t} \sqrt{1 - \beta_s} \approx \exp\left(-\frac{1}{2}\sum_{s=1}^{t} \beta_s\right)$$
To make this continuous, we:
- Replace the discrete time steps $t \in \{1, \dots, T\}$ with a continuous time parameter $t \in [0, 1]$
- Express the mean scaling $\sqrt{\bar{\alpha}_t}$ as a continuous function: $\alpha_t = e^{-\frac{1}{2}T(t)}$
- Replace the discrete noise schedule $\beta_t$ with a continuous function $\beta(t)$
- Define $T(t)$ as the cumulative integral: $T(t) = \int_0^t \beta(s)\, ds$
This gives us a smooth, continuous path from noise to data, rather than discrete steps. The resulting continuous Variance Preserving (VP) path from noise to data has the form:
$$p_t(x \mid x_1) = \mathcal{N}\left(x \mid \alpha_{1-t}\, x_1,\, \left(1 - \alpha_{1-t}^2\right) I\right)$$
where
$$\alpha_t = e^{-\frac{1}{2}T(t)}, \qquad T(t) = \int_0^t \beta(s)\, ds,$$
and $\beta$ is the noise schedule.
This gives us a continuous-time VP diffusion path where:
- The mean is scaled by $\alpha_{1-t}$
- The variance is $1 - \alpha_{1-t}^2$
- The total variance remains constant (variance preserving) for all $t$
Note that this formulation goes from noise ($t = 0$) to data ($t = 1$), which is the reverse of the traditional DDPM formulation. The amplitude $\alpha_{1-t} = e^{-\frac{1}{2}T(1-t)}$ is defined using half the time integral compared to the forward process to maintain consistency with the variance preserving property.
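To make the schedule concrete, suppose (as an assumption for illustration; the section does not fix a schedule) a linear noise schedule $\beta(s) = \beta_{\min} + s\left(\beta_{\max} - \beta_{\min}\right)$. Then
$$T(t) = \int_0^t \beta(s)\, ds = \beta_{\min}\, t + \tfrac{1}{2}\left(\beta_{\max} - \beta_{\min}\right) t^2, \qquad \alpha_t = e^{-\frac{1}{2} T(t)},$$
so $\alpha_0 = 1$ and, for typical values like $\beta_{\min} = 0.1$ and $\beta_{\max} = 20$, $\alpha_1 = e^{-\frac{1}{4}(\beta_{\min} + \beta_{\max})} \approx 0.007$. In the $\alpha_{1-t}$ parameterization this means the path starts near pure noise at $t = 0$ and ends at (almost) the clean data at $t = 1$.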
We have:
$$\mu_t(x_1) = \alpha_{1-t}\, x_1, \qquad \sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$$
Then, substituting them into the original formula for $u_t(x \mid x_1)$,
$$u_t(x \mid x_1) = \frac{\alpha_{1-t}'}{1 - \alpha_{1-t}^2}\left(\alpha_{1-t}\, x - x_1\right)$$
where we used:
$$\sigma_t'(x_1) = \frac{d}{dt}\sqrt{1 - \alpha_{1-t}^2} = \frac{\alpha_{1-t}\, \alpha_{1-t}'}{\sqrt{1 - \alpha_{1-t}^2}}, \qquad \mu_t'(x_1) = -\alpha_{1-t}'\, x_1,$$
with $\alpha_{1-t}'$ denoting the derivative of $\alpha$ with respect to its own argument evaluated at $1 - t$, so that $\frac{d}{dt}\alpha_{1-t} = -\alpha_{1-t}'$.
This gives us the vector field for Variance Preserving diffusion, which describes how points should move at each time to transform the noise distribution into the data distribution while preserving total variance.
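As a check on the algebra (my own verification, not from the paper's code), we can confirm symbolically that plugging the VP mean and standard deviation into the general conditional vector field reduces to the compact form above, assuming a linear $\beta$ schedule for concreteness:

```python
import sympy as sp

t, s, x, x1 = sp.symbols("t s x x1")
b_min, b_max = sp.symbols("beta_min beta_max", positive=True)

T = b_min * s + (b_max - b_min) * s**2 / 2        # T(s) = integral of the linear beta schedule
alpha = sp.exp(-T / 2)                            # alpha(s) = exp(-T(s) / 2)

a = alpha.subs(s, 1 - t)                          # alpha evaluated at 1 - t
a_prime = sp.diff(alpha, s).subs(s, 1 - t)        # alpha'(1 - t), derivative w.r.t. its own argument

mu = a * x1                                       # mu_t(x1) = alpha_{1-t} * x1
sigma = sp.sqrt(1 - a**2)                         # sigma_t(x1) = sqrt(1 - alpha_{1-t}^2)

general = sp.diff(sigma, t) / sigma * (x - mu) + sp.diff(mu, t)   # general u_t(x | x1)
target = a_prime / (1 - a**2) * (a * x - x1)                      # claimed VP vector field

print(sp.simplify(general - target))              # prints 0: the two expressions agree
```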
Example
Let's say that we have a noise starting at and our ground truth image as
and .
Now we have the formula for
To get the denoised image at , or
To get the ground-truth velocity at that point, we have the formula
At , the ground truth velocity would be:
This means that at time , the neural network should predict that we are moving with velocity 0.625 in the positive direction.
We can verify this makes sense because:
- At , velocity is 1 (maximum speed toward target)
- As increases, velocity decreases as we get closer to target
- At , velocity approaches 0 (we've reached the target)
References
- Flow Matching for Generative Modeling (Lipman et al., 2023), the original paper introducing Flow Matching as a framework for generative modeling