Real-time streaming and arbitrary motion stylization with low latency and long-term consistency.
(Watch: Real-time stylized motion generation demo)
(Long-sequence streaming motion stylization results with trajectory copy)
Generating long, stylized human motions in real time is critical for applications that demand continuous and responsive character control. Existing streaming approaches, however, often operate directly in the raw motion space, which incurs substantial computational overhead and makes temporal stability difficult to maintain. Latent-space VAE–Diffusion frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE–Diffusion with Causal Decoding) builds on a recent high-performing offline framework for arbitrary motion stylization (Song et al., CVPR 2024) and extends it to an online setting through a latent-space streaming architecture that combines a sliding-window causal design with the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence, real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture. Experiments on benchmark datasets show that it achieves a favorable balance between stylization quality and responsiveness.
The proposed framework makes three main contributions:
Figure: Proposed streaming stylization pipeline. Motion segments are processed by an encoder–decoder with a style-conditioned denoiser. The latent output of the diffusion model is re-decoded into motion features, which are then concatenated with previous outputs, corrected through trajectory copy, re-encoded, and finally passed into a causal decoder for temporally consistent motion generation.
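The data flow in the figure can be sketched as a per-frame loop. This is only an illustrative sketch, not the released implementation: the function names (`stream_stylize`, `trajectory_copy`), the frame layout (channel 0 as the root trajectory), the window size, and all `toy_*` stand-ins for the encoder, denoiser, and decoders are hypothetical assumptions made for the example.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

Frame = List[float]   # assumed layout: frame[0] is the root-trajectory channel
Clip = List[Frame]


def trajectory_copy(stylized: Clip, source: Clip) -> Clip:
    """Overwrite the trajectory channel of each stylized frame with the source value."""
    return [[src[0]] + sty[1:] for sty, src in zip(stylized, source)]


def stream_stylize(
    frames: Iterable[Frame],
    style: float,
    encode: Callable[[Clip], Clip],
    denoise: Callable[[Clip, float], Clip],
    decode: Callable[[Clip], Clip],
    causal_decode: Callable[[Clip], Frame],
    window: int = 4,
) -> Iterator[Frame]:
    """Emit one stylized frame per input frame, using past and present frames only."""
    content = deque(maxlen=window)                 # sliding window of input frames
    history = deque(maxlen=max(window - 1, 0))     # previously emitted output frames
    for frame in frames:
        content.append(frame)
        source = list(content)
        latent = denoise(encode(source), style)    # style-conditioned denoising in latent space
        features = decode(latent)                  # re-decode the latent into motion features
        merged = list(history) + [features[-1]]    # concatenate with previous outputs
        merged = trajectory_copy(merged, source)   # restore the source trajectory
        out = causal_decode(encode(merged))        # re-encode, then decode the newest frame
        history.append(out)
        yield out


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_encode(clip: Clip) -> Clip:
    return [list(f) for f in clip]


def toy_decode(latent: Clip) -> Clip:
    return [list(z) for z in latent]


def toy_denoise(latent: Clip, style: float) -> Clip:
    # "Stylize" by shifting every non-trajectory channel by the style value.
    return [[z[0]] + [v + style for v in z[1:]] for z in latent]


def toy_causal_decode(latent: Clip) -> Frame:
    # Average non-trajectory channels over the window; keep the newest trajectory value.
    n = len(latent)
    last = latent[-1]
    return [last[0]] + [sum(z[i] for z in latent) / n for i in range(1, len(last))]


frames = [[float(t), 0.0, 1.0] for t in range(6)]
outputs = list(stream_stylize(frames, 0.5, toy_encode, toy_denoise,
                              toy_decode, toy_causal_decode, window=3))
```

Because each output is computed only from the current window and earlier outputs, changing later input frames leaves earlier outputs untouched, which is the property a streaming setting needs.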
We present both quantitative and qualitative results demonstrating that the proposed streaming framework maintains high stylization quality and temporal consistency while achieving real-time performance.
Methods | FMD ↓ | CRA ↑ | SRA ↑ | Total Jitter ↓ |
---|---|---|---|---|
Real Motions | -- | 0.99 | 1.00 | -- |
1DConv + AdaIN | 42.68 | 0.31 | 0.57 | -- |
STGCN + AdaIN | 129.44 | 0.60 | 0.18 | -- |
Motion Puzzle | 113.31 | 0.26 | 0.46 | -- |
Offline (Original Paper) | 27.69 | 0.36 | 0.58 | 0.0089 |
Online (Proposed) | 31.69 | 0.30 | 0.58 | 0.0139 |
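Table: Quantitative comparison between the offline baseline and the proposed streaming setting, alongside prior methods. Relative to the offline framework, LILAC matches SRA (0.58) at the cost of higher FMD (27.69 → 31.69), lower CRA (0.36 → 0.30), and higher total jitter (0.0089 → 0.0139), in exchange for real-time operation.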
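To make the offline-versus-online gap easier to read, the relative change of each metric can be computed directly from the two table rows. The snippet below is a small illustrative calculation: the values are copied from the table, while the helper name `relative_change` is introduced here only for the example.

```python
# Metric values copied from the "Offline (Original Paper)" and "Online (Proposed)" rows.
offline = {"FMD": 27.69, "CRA": 0.36, "SRA": 0.58, "Total Jitter": 0.0089}
online = {"FMD": 31.69, "CRA": 0.30, "SRA": 0.58, "Total Jitter": 0.0139}


def relative_change(new: float, old: float) -> float:
    """Signed change of `new` relative to `old`, as a fraction."""
    return (new - old) / old


changes = {name: relative_change(online[name], offline[name]) for name in offline}

for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

For FMD and Total Jitter a positive change is worse (lower is better), while for CRA and SRA a negative change is worse (higher is better).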
Representative long-sequence stylization results demonstrate temporal coherence, smooth transitions, and responsiveness under different motion styles and trajectories.
Each sequence shows real-time generated motions under different style embeddings or trajectories.