🪻 LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE–Diffusion with Causal Decoding

Real-time streaming and arbitrary motion stylization with low latency and long-term consistency.

📄 View on arXiv    💻 View on GitHub

(Watch: Real-time stylized motion generation demo)

Teaser figure showing long-sequence stylized motion generation

(Long-sequence streaming motion stylization results with trajectory copy)

Abstract

Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, which incurs substantial computational overhead and makes it difficult to maintain temporal stability. In contrast, latent-space VAE–Diffusion frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE–Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization (Song et al., CVPR 2024) and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features for smooth motion transitions. This design enables long-sequence, real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture. Experiments on benchmark datasets show that it achieves a favorable balance between stylization quality and responsiveness.

Method Overview

The proposed framework makes three main contributions:

  1. Latent-space streaming architecture. A latent-space streaming architecture for long-sequence arbitrary motion stylization, featuring a sliding-window causal design and a re-decoding/encoding mechanism that injects previously generated motion features into the latent representation. This enables real-time generation without future frames or modifications to the diffusion model architecture (a minimal sketch of the sliding-window buffering follows this list).
  2. Style-conditioned generation. Enables smooth and instantaneous transitions between arbitrary motion styles in a streaming setting, bringing offline style-conditioning methods into real-time operation.
  3. Experimental validation. Qualitative and quantitative evaluation on benchmark datasets, demonstrating the effectiveness of the proposed streaming framework compared to existing offline methods.
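
To make the sliding-window causal design concrete, below is a minimal PyTorch-style sketch of a latent buffer that keeps only the most recent segments and exposes a causal attention mask. The class name, window size, and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
from collections import deque

class LatentWindow:
    """Sliding window over latent segments with a causal (lower-triangular) mask."""

    def __init__(self, window_size: int):
        self.segments = deque(maxlen=window_size)  # older segments are dropped automatically

    def push(self, latent_segment: torch.Tensor) -> None:
        """Append the newest latent segment of shape [frames, latent_dim]."""
        self.segments.append(latent_segment)

    def context(self) -> torch.Tensor:
        """Concatenate retained segments along time to form the conditioning context."""
        return torch.cat(list(self.segments), dim=0)

    def causal_mask(self) -> torch.Tensor:
        """Each frame may attend only to itself and earlier frames in the window."""
        total = sum(seg.shape[0] for seg in self.segments)
        return torch.tril(torch.ones(total, total, dtype=torch.bool))


# Example: 4-segment window, 8-frame segments, 128-dimensional latents (assumed sizes).
window = LatentWindow(window_size=4)
for _ in range(6):
    window.push(torch.randn(8, 128))
print(window.context().shape, window.causal_mask().shape)  # [32, 128] and [32, 32]
```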

Figure: Proposed streaming stylization pipeline. Motion segments are processed by an encoder–decoder with a style-conditioned denoiser. The latent output of the diffusion model is re-decoded into motion features, which are then concatenated with previous outputs, corrected through trajectory copy, re-encoded, and finally passed into a causal decoder for temporally consistent motion generation.
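
The figure caption above can be read as a single streaming step. The sketch below mirrors that flow in PyTorch-style pseudocode; `denoiser`, `vae`, `causal_decoder`, the root-trajectory layout (assumed to occupy the first three feature channels), and all shapes are assumptions for illustration rather than the released implementation.

```python
import torch

@torch.no_grad()
def stream_step(noisy_latent, style_embedding, prev_motion, source_trajectory,
                denoiser, vae, causal_decoder):
    """One streaming step: denoise -> re-decode -> concatenate -> trajectory copy
    -> re-encode -> causal decode (interfaces and shapes are assumed)."""
    # 1. Style-conditioned denoising of the current latent segment.
    clean_latent = denoiser(noisy_latent, style=style_embedding)

    # 2. Re-decode the latent into motion features for this segment.
    cur_motion = vae.decode(clean_latent)                     # [frames, feat_dim]

    # 3. Concatenate with previously generated motion features along time.
    motion = torch.cat([prev_motion, cur_motion], dim=0)

    # 4. Trajectory copy: overwrite the root-trajectory channels (assumed to be the
    #    first three feature dimensions) with the source trajectory.
    motion[:, :3] = source_trajectory[: motion.shape[0]]

    # 5. Re-encode the corrected motion and pass it through the causal decoder
    #    to produce temporally consistent output frames.
    refined_latent = vae.encode(motion)
    output_frames = causal_decoder(refined_latent)
    return output_frames, cur_motion
```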

Results

We present both quantitative and qualitative results demonstrating that the proposed streaming framework maintains high stylization quality and temporal consistency while achieving real-time performance.

Quantitative Comparison

| Methods | FMD ↓ | CRA ↑ | SRA ↑ | Total Jitter ↓ |
| --- | --- | --- | --- | --- |
| Real Motions | -- | 0.99 | 1.00 | -- |
| 1DConv + AdaIN | 42.68 | 0.31 | 0.57 | -- |
| STGCN + AdaIN | 129.44 | 0.60 | 0.18 | -- |
| Motion Puzzle | 113.31 | 0.26 | 0.46 | -- |
| Offline (Original Paper) | 27.69 | 0.36 | 0.58 | 0.0089 |
| Online (Proposed) | 31.69 | 0.30 | 0.58 | 0.0139 |

Table: Quantitative comparison between offline and proposed streaming settings. LILAC achieves a favorable balance between stylization quality and real-time performance.

Qualitative Results

Representative long-sequence stylization results demonstrate temporal coherence, smooth transitions, and responsiveness under different motion styles and trajectories.

Each sequence shows real-time generated motions under different style embeddings or trajectories.
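
As a rough illustration of how such a stream could be driven, the loop below swaps the style embedding halfway through. The stand-in modules, embedding size, and segment length are all hypothetical and only serve to show the control flow around the `stream_step` sketch above.

```python
import torch
from types import SimpleNamespace

# Trivial stand-ins so the loop executes; trained LILAC components would replace them.
vae = SimpleNamespace(encode=lambda x: x, decode=lambda z: z)
denoiser = lambda z, style: z            # placeholder: returns the latent unchanged
causal_decoder = lambda z: z             # placeholder: passes features through

style_a = torch.randn(256)               # assumed 256-d style embeddings
style_b = torch.randn(256)

prev_motion = torch.zeros(0, 128)        # assumed 128-d motion features, empty history
source_trajectory = torch.zeros(10_000, 3)

for step in range(100):
    style = style_a if step < 50 else style_b          # instantaneous style switch mid-stream
    noisy_latent = torch.randn(8, 128)                 # assumed 8-frame latent segment
    frames, cur_motion = stream_step(noisy_latent, style, prev_motion, source_trajectory,
                                     denoiser, vae, causal_decoder)
    prev_motion = cur_motion                           # keep only the latest segment as history
```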