Disentangled Concepts Speak Louder Than Words:
Explainable Video Action Recognition

Kyung Hee University, Korea University
*Corresponding authors
NeurIPS 2025, 🏆 Spotlight Paper

Abstract

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods—based on saliency—produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature—intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets—KTH, Penn Action, HAA500, and UCF-101—demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.

🎯 Disentangled concepts speak louder than words.

  • 💡 Existing spatio-temporal attribution methods give unstructured explanations that are hard to interpret.
  • 🎥 In a Baseball Swing video, visual methods like 3D-saliency or VTCD can’t tell whether the model relies on motion, objects, or background.
  • 🧠 Language-based approaches help, but motion remains tacit knowledge—understood intuitively yet hard to describe.

🚀 DANCE disentangles motion and context for clear, structured explanations:

  • 🕺 Pose sequences → capture motion dynamics.
  • 🏞️ Object & scene concepts → explained through text.

DANCE provides structured explanations.

  • Step 1. DANCE extracts spatio-temporal features from an input video using a pretrained backbone encoder.
  • Step 2. Three disentangled concept layers project the features into independent spaces — motion, object, and scene.
  • Step 3. Each layer produces interpretable activations representing distinct aspects of the action.
  • Step 4. The classification layer linearly combines these activations to predict the final action class.
  • Step 5. By separating motion dynamics from spatial context, DANCE delivers structured explanations aligned with human perception; a minimal sketch of this concept bottleneck follows below.
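
In code, the ante-hoc bottleneck in Steps 2–4 amounts to three linear concept layers feeding a single linear classifier. Below is a minimal PyTorch sketch under assumed names and dimensions (DanceHead, feat_dim, the concept counts); it illustrates the design, not the authors' released implementation.

import torch
import torch.nn as nn

class DanceHead(nn.Module):
    """Minimal sketch of a DANCE-style concept bottleneck head (assumed API)."""
    def __init__(self, feat_dim, n_motion, n_object, n_scene, n_classes):
        super().__init__()
        # Three disentangled concept layers: each projects the backbone feature
        # into its own interpretable concept-activation space.
        self.motion_layer = nn.Linear(feat_dim, n_motion)
        self.object_layer = nn.Linear(feat_dim, n_object)
        self.scene_layer = nn.Linear(feat_dim, n_scene)
        # The classifier sees only concept activations (ante-hoc bottleneck),
        # so every prediction decomposes into per-concept contributions.
        self.classifier = nn.Linear(n_motion + n_object + n_scene, n_classes)

    def forward(self, video_feat):
        motion = self.motion_layer(video_feat)   # motion dynamics concepts
        obj = self.object_layer(video_feat)      # object concepts
        scene = self.scene_layer(video_feat)     # scene concepts
        concepts = torch.cat([motion, obj, scene], dim=-1)
        return self.classifier(concepts), (motion, obj, scene)

# Toy usage: video_feat would come from a frozen pretrained video backbone.
feat = torch.randn(2, 768)                       # batch of pooled video features
head = DanceHead(768, n_motion=50, n_object=100, n_scene=40, n_classes=101)
logits, (m, o, s) = head(feat)

Because the final layer is linear over concept activations, each weight–activation product gives a per-concept contribution to a class score, which is what makes the explanations structured.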

Concept discovery and labeling process of DANCE.

  • (a) Motion dynamics concept discovery: From each training video, we extract S key clips of length L centered at keyframes detected by a keyframe detection algorithm. A 2D pose estimator is applied to obtain human pose sequences, and all pose sequences are clustered; each cluster defines a motion dynamics concept.
  • (b) Motion dynamics concept labeling: For each video, we aggregate the cluster assignment tensor across its key clips to generate binary motion dynamics concept labels.
  • (c) Object concept discovery: We query GPT-4o with prompts containing action class names to automatically generate a set of object concepts for the dataset.
  • (d) Object concept pseudo labeling: Given a video and the object concept set, we compute pseudo labels using a vision-language dual encoder by multiplying the object concept embedding matrix with the video embedding vector.

The same procedure applies to scene concepts as well; a rough sketch of the clustering and pseudo-labeling steps follows below.
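
Below is a hedged sketch of steps (a) and (d), assuming a CLIP-style dual encoder for the text/video embeddings and k-means for pose clustering. The helper names (object_pseudo_labels, motion_concepts), the top-k positive selection, and the cluster count are illustrative assumptions, not the paper's exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def object_pseudo_labels(video_emb, concept_embs, top_k=5):
    # Step (d): score every text concept against the video embedding via a
    # dot product (object concept embedding matrix x video embedding vector).
    video_emb = video_emb / np.linalg.norm(video_emb)
    concept_embs = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    scores = concept_embs @ video_emb
    labels = np.zeros(len(scores))
    labels[np.argsort(scores)[-top_k:]] = 1      # assumed: keep the top-k concepts as positives
    return labels, scores

def motion_concepts(pose_sequences, n_concepts=50):
    # Step (a): cluster flattened pose sequences; each cluster is treated as
    # one motion dynamics concept (the cluster count here is an assumption).
    flat = pose_sequences.reshape(len(pose_sequences), -1)
    return KMeans(n_clusters=n_concepts, n_init=10).fit(flat)

# Toy usage with random stand-ins for dual-encoder embeddings and 2D poses.
rng = np.random.default_rng(0)
labels, scores = object_pseudo_labels(rng.normal(size=512), rng.normal(size=(100, 512)))
km = motion_concepts(rng.normal(size=(200, 16, 17, 2)))  # 200 key clips, L=16 frames, 17 joints
# Aggregating km.predict(...) over a video's key clips gives the binary
# motion dynamics concept labels of step (b).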

Demo Video

View Supplementary PDF

BibTeX

@inproceedings{Lee2025,
  title={Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition},
  author={Lee, Jongseo and Lee, Wooil and Park, Gyeon-Moon and Kim, Seong Tae and Choi, Jinwoo},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
}