MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation

1University of British Columbia, 2Vector Institute for AI, 3Simon Fraser University, 4Canada CIFAR AI Chair
Computer Vision and Pattern Recognition Conference (CVPR) 2025

Abstract

Human trajectory forecasting is the challenging task of predicting the inherently multi-modal future movements of individuals from their past trajectories and relevant contextual cues.

We tackle this challenge by proposing a novel Motion prediction conditional Flow matching model, termed MoFlow, to generate \(K\)-shot future trajectories for all agents in a given scene. Additionally, we design a novel flow matching loss function that not only ensures that at least one of the \(K\) sets of future trajectories is accurate, but also encourages all \(K\) sets to be diverse and plausible.

Furthermore, by leveraging the Implicit Maximum Likelihood Estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model.

Extensive experiments on real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is 100 times faster than the teacher flow model during sampling. 🚀🚀🚀

Main Framework

Flow Matching in MoFlow

In terms of the forward process in flow matching, we adopt a simple linear interpolation between the clean trajectories \(Y^1\) and pure noise \(Y^0\sim \mathcal{N}(\mathbf{0},\mathbf{I})\),

\[ Y^t = (1-t)Y^0 + tY^1 \qquad t \in [0, 1]. \]
The reverse process, which allows us to generate new samples, is described by the ordinary differential equation
\[ d Y^t = v_\theta(Y^t, C, t)dt. \]
Here, \(v_\theta\) represents the parametrized vector field that approximates the straight flow \(U^t=Y^1-Y^0\). Note that \(C\) denotes all the contextual information of the agents in a scene.
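The two processes above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the learned field \(v_\theta\) is replaced by the "oracle" straight flow \(U^t = Y^1 - Y^0\), for which Euler integration of the reverse ODE transports noise exactly to data.

```python
import numpy as np

def interpolate(y0, y1, t):
    """Forward process: linear interpolation Y^t = (1-t) Y^0 + t Y^1."""
    return (1.0 - t) * y0 + t * y1

def euler_sample(v_theta, y0, context, n_steps=100):
    """Reverse process: integrate dY^t = v_theta(Y^t, C, t) dt
    from t = 0 to t = 1 with the explicit Euler method."""
    y = y0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        y = y + dt * v_theta(y, context, t)
    return y

# Toy stand-in for the learned vector field (an assumption for this demo):
# the oracle straight flow U^t = Y^1 - Y^0 is constant along the linear path.
rng = np.random.default_rng(0)
y0 = rng.standard_normal((2, 10))       # noise, e.g. 2 agents x 10 coordinates
y1 = np.ones((2, 10))                   # pretend clean future trajectories
oracle_v = lambda y, c, t: y1 - y0      # ignores y, c, t for the linear path
y_hat = euler_sample(oracle_v, y0, context=None, n_steps=50)  # recovers y1
```

With the oracle field, `y_hat` equals `y1` up to floating-point error; with a learned \(v_\theta\), the same loop yields the multi-step sampling that the IMLE student later distills into one step.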

We can visualize the forward and reverse processes of flow matching by animating the transition of a single trajectory between noise and data. Use the slider here to view the trajectory at different flow matching time steps between the noise distribution \(Y^0\) and the data distribution \(Y^1\). Check out our paper to see how we train and sample from the conditional flow-matching model.

[Interactive slider: interpolation between the clean trajectory \(Y^1\) and the noisy trajectory \(Y^0\).]


A novel MoFlow objective

By rearranging the original linear flow objective, we introduce a neural network \(D_\theta(Y^t, C, t):=Y^t+(1-t)\,v_\theta(Y^t, C, t) \) that matches the future trajectory \(Y^1\) directly in data space, and the objective becomes

\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{Y^t, Y^1, t} \left[ \frac{\| D_{\theta}(Y^t, C, t) - Y^1 \|_2^2}{(1 - t)^2} \right]. \]
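The time-dependent weight follows directly from the definitions above:
\[ D_\theta(Y^t, C, t) - Y^1 = Y^t + (1-t)\,v_\theta(Y^t, C, t) - Y^1 = (1-t)\bigl(v_\theta(Y^t, C, t) - (Y^1 - Y^0)\bigr), \]
using \(Y^t - Y^1 = (1-t)(Y^0 - Y^1)\). Hence regressing \(D_\theta\) onto \(Y^1\) with the \((1-t)^{-2}\) weight is equivalent to regressing \(v_\theta\) onto the straight flow \(U^t = Y^1 - Y^0\).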
To promote the multi-modality of the future trajectory, we design a novel loss function. Specifically, our \(D_\theta\) generates \(K\) scene-level correlated waypoint predictions, denoted by \(\{S_i\}^K_{i=1}\), alongside corresponding classification logits \(\{\zeta_i\}_{i=1}^K\). For simplicity, we omit the time-dependent coefficient and obtain a new loss \(\bar{\mathcal{L}}_{\text{FM}}\)
\[ \bar{\mathcal{L}}_{\text{FM}} = \mathbb{E}_{Y^t, Y^1, t} \left[ \| S_{j^*} - Y^1 \|_2^2 + \text{CE}(\zeta_{1:K}, j^*) \right] \]
where \(j^* = \arg\min_{j} \| S_j - Y^1 \|_2^2\) and \(\text{CE}(\cdot,\cdot)\) is the cross-entropy loss.
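A minimal NumPy sketch of this winner-take-all loss for a single training example. The shapes (K predictions of size A agents by D coordinates) and the demo values are illustrative assumptions, not the paper's batched implementation.

```python
import numpy as np

def moflow_loss(S, logits, y1):
    """Winner-take-all flow-matching loss (time weight omitted, as in the text).

    S:      (K, A, D) array of K scene-level trajectory predictions S_{1:K}
    logits: (K,) classification logits zeta_{1:K}
    y1:     (A, D) ground-truth future trajectory
    Returns (loss, j_star).
    """
    # Squared L2 error of each of the K predictions against the ground truth.
    errs = np.array([np.sum((s - y1) ** 2) for s in S])
    j_star = int(np.argmin(errs))  # index of the best mode
    # Numerically stable log-softmax for the cross-entropy term CE(zeta, j*).
    m = logits.max()
    log_probs = logits - (m + np.log(np.sum(np.exp(logits - m))))
    ce = -log_probs[j_star]
    return errs[j_star] + ce, j_star

# Toy example: K = 3 modes, 2 agents, 4 coordinates; mode 1 is exact.
y1 = np.arange(8.0).reshape(2, 4)
S = np.stack([y1 + 1.0, y1, y1 - 2.0])
logits = np.zeros(3)
loss, j = moflow_loss(S, logits, y1)  # j == 1; loss == log(3), the CE alone
```

Only the best of the \(K\) predictions receives the regression gradient, while the cross-entropy term trains the logits to identify that winning mode, which is what keeps the remaining modes free to stay diverse.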

IMLE Module

IMLE Algorithm Diagram

The IMLE distillation process is outlined in Algorithm 1. Specifically, lines 4–6 describe the standard ODE-based sampling of the teacher model, MoFlow. This produces \( K \) correlated multi-modal trajectory predictions \( \hat{Y}^1_{1:K} \) conditioned on the context \( C \). A conditional IMLE generator \( G_\phi \) then uses a noise vector \( Z \) and context \( C \) to generate \( K \)-component trajectories \( \Gamma \), matching the shape of \( \hat{Y}^1_{1:K} \).

For each context \( C \), the conditional IMLE objective draws more student samples than there are teacher samples in the distillation dataset. Specifically, \( m \) i.i.d. samples are drawn via \( G_\phi \), and the one closest to the teacher prediction \( \hat{Y}^1_{1:K} \) is selected for loss computation. Minimizing the distance to the nearest student sample ensures that the teacher model's modes are well-approximated.

To preserve trajectory prediction multi-modality, we employ the Chamfer distance as the loss function

\[ \mathcal{L}_{\text{IMLE}}(\hat{Y}^1_{1:K}, \Gamma) = \dfrac{1}{K} \left( \sum_{i=1}^K \min_j \|\hat{Y}^1_i - \Gamma^{(j)}\| + \sum_{j=1}^K \min_i \|\hat{Y}^1_i - \Gamma^{(j)}\| \right), \]

where \( \Gamma^{(i)} \in \mathbb{R}^{A \times 2T_f} \) is the \( i \)-th component of the IMLE-generated correlated trajectory.
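The Chamfer loss and the IMLE nearest-sample selection can be sketched as follows. This is a simplified NumPy illustration under assumed shapes (K components, each A agents by D coordinates); the flattened `(A, 2T_f)` layout and the generator itself are abstracted away.

```python
import numpy as np

def chamfer(teacher, student):
    """Symmetric Chamfer distance between two K-component trajectory sets,
    each of shape (K, A, D), using Euclidean norms per component."""
    K = teacher.shape[0]
    # Pairwise distances d[i, j] = || Yhat^1_i - Gamma^(j) ||.
    d = np.linalg.norm(
        (teacher[:, None] - student[None, :]).reshape(K, K, -1), axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / K

def imle_select(teacher, candidates):
    """Core IMLE step: among m generator samples, keep the one nearest
    to the teacher predictions and return it with its Chamfer loss."""
    losses = [chamfer(teacher, g) for g in candidates]
    j = int(np.argmin(losses))
    return candidates[j], losses[j]

# Toy example: K = 2 components, 1 agent, 2 coordinates; m = 2 candidates,
# one of which matches the teacher exactly and is therefore selected.
teacher = np.zeros((2, 1, 2))
teacher[1] += 1.0
candidates = [teacher.copy(), teacher + 5.0]
best, loss = imle_select(teacher, candidates)  # picks the exact match, loss 0
```

Only the selected sample contributes to the gradient of \( G_\phi \); the two-sided Chamfer sum penalizes both teacher modes left uncovered and student components that drift away from every teacher mode.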

Animation on NBA SportVU dataset

From left to right, the results displayed correspond to Leapfrog Diffusion, MoFlow, and MoFlow-IMLE, respectively.

Animation on ETH-UCY dataset

MoFlow

MoFlow-IMLE

BibTeX

@inproceedings{fu2025moflowonestepflowmatching,
  author    = {Fu, Yuxiang and Yan, Qi and Wang, Lele and Li, Ke and Liao, Renjie},
  title     = {MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}