TY - GEN
T1 - Gated Probabilistic Diffusion for Temporal Action Segmentation
AU - Li, Yun
AU - Li, Hanmin
AU - Lam, Kin Man
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025/10
Y1 - 2025/10
AB - Temporal action segmentation is a fundamental task in video understanding, involving the identification and classification of human actions in long, untrimmed videos. Existing methods often suffer from over-segmentation errors and struggle to model complex temporal dependencies. Inspired by denoising diffusion models, we propose Gated Probabilistic Diffusion Action Segmentation (GPDAS), a novel framework that formulates action segmentation as a conditional sequence generation task. GPDAS iteratively refines frame-wise action labels through a denoising process conditioned on video features, implicitly modeling action priors and domain-specific behavioral knowledge. Our approach includes (1) a gated probabilistic decoder with adaptive temporal convolutions to enhance boundary accuracy and action continuity, (2) dual boundary-aware and action-dependent loss functions to capture chronological dependencies and improve temporal localization, and (3) masked conditioning strategies to improve robustness. Evaluated on the GTEA, 50Salads, and Breakfast benchmarks, GPDAS achieves state-of-the-art performance, outperforming existing methods in edit score and segmental F1 scores while effectively mitigating over-segmentation. The gated decoder demonstrates strong performance in modeling long-range, complex action dynamics.
UR - https://www.scopus.com/pages/publications/105030496842
U2 - 10.1109/APSIPAASC65261.2025.11249297
DO - 10.1109/APSIPAASC65261.2025.11249297
M3 - Conference article published in proceeding or book
AN - SCOPUS:105030496842
T3 - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
SP - 1868
EP - 1873
BT - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
Y2 - 22 October 2025 through 24 October 2025
ER -