Official implementation of the paper "Hierarchical Vector Quantization for Unsupervised Action Segmentation"
Full implementation coming soon!
If you find this code or our model useful, please cite our paper:
@inproceedings{hvq2025spurio,
author = {Federico Spurio and Emad Bahrami and Gianpiero Francesca and Juergen Gall},
title = {Hierarchical Vector Quantization for Unsupervised Action Segmentation},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
year = {2025}
}

Hierarchical Vector Quantization (HVQ) is an unsupervised action segmentation approach that exploits the natural compositionality of actions. By employing a hierarchical stack of vector quantization modules, it achieves accurate segmentation.
Here is an overview of our proposed model:
In more detail, the pre-extracted features are processed by an encoder, implemented as a two-stage MS-TCN. The resulting encodings are then progressively quantized by a sequence of vector quantization modules. Each module operates with a decreasing codebook size, gradually refining the representation until the desired number of action classes is reached.
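Below is a minimal, hedged sketch of the hierarchical quantization idea in PyTorch. It is not the official implementation: the two-stage MS-TCN encoder is replaced by a single 1x1 convolution, and the feature dimension and codebook sizes are illustrative assumptions.

```python
# Illustrative sketch only: an encoder produces frame-wise embeddings that are
# quantized by a stack of VQ modules with decreasing codebook sizes.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Plain VQ layer: snaps each frame embedding to its nearest codeword."""

    def __init__(self, codebook_size, dim):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                               # z: (T, dim)
        dists = torch.cdist(z, self.codebook.weight)    # (T, codebook_size)
        idx = dists.argmin(dim=-1)                      # nearest codeword per frame
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx


class HierarchicalVQ(nn.Module):
    """Encoder followed by a stack of VQ modules with decreasing codebook sizes;
    the last codebook has as many entries as the desired number of action clusters."""

    def __init__(self, in_dim=2048, dim=64, codebook_sizes=(128, 32, 10)):
        super().__init__()
        # Stand-in for the two-stage MS-TCN encoder used in the paper.
        self.encoder = nn.Conv1d(in_dim, dim, kernel_size=1)
        self.vq_layers = nn.ModuleList(
            [VectorQuantizer(k, dim) for k in codebook_sizes]
        )

    def forward(self, feats):                           # feats: (T, in_dim) pre-extracted features
        z = self.encoder(feats.t().unsqueeze(0)).squeeze(0).t()  # (T, dim)
        idx = None
        for vq in self.vq_layers:                       # progressively coarser quantization
            z, idx = vq(z)
        return idx                                      # frame-wise cluster indices


# Example: frame-wise pseudo-labels for a 1000-frame video with 2048-dim features.
model = HierarchicalVQ()
print(model(torch.randn(1000, 2048)).shape)             # torch.Size([1000])
```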
The Jensen-Shannon Distance (JSD) is a metric for evaluating the bias in the predicted segment lengths. For each video within the same activity, we compute the histogram of the predicted segment lengths, using a bin width of 20 frames. We then compare this histogram with the corresponding ground-truth histogram using the Jensen-Shannon Distance. These JSD scores are averaged across all videos of each activity. Finally, we calculate a weighted average across all activities, where the weights are the number of frames in each activity. In particular:

$$\mathrm{JSD} = \frac{\sum_{a} N_a \,\mathrm{JSD}_a}{\sum_{a} N_a}, \qquad \mathrm{JSD}_a = \frac{1}{|V_a|} \sum_{v \in V_a} D_{\mathrm{JS}}\!\left(h_v^{\mathrm{pred}},\, h_v^{\mathrm{gt}}\right),$$

where $V_a$ is the set of videos of activity $a$, $N_a$ is the number of frames in activity $a$, $h_v^{\mathrm{pred}}$ and $h_v^{\mathrm{gt}}$ are the histograms of the predicted and ground-truth segment lengths of video $v$ (bin width of 20 frames), and $D_{\mathrm{JS}}(\cdot,\cdot)$ is the Jensen-Shannon distance between two histograms.
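For concreteness, here is a small sketch of how this metric could be computed, assuming per-video lists of predicted and ground-truth segment lengths grouped by activity. Function and variable names are illustrative, not taken from the official code.

```python
# Sketch of the JSD metric described above (illustrative, not the official code).
import numpy as np
from scipy.spatial.distance import jensenshannon


def segment_length_jsd(pred_lengths, gt_lengths, bin_width=20):
    """Jensen-Shannon distance between segment-length histograms of one video."""
    max_len = max(max(pred_lengths), max(gt_lengths))
    bins = np.arange(0, max_len + bin_width, bin_width)
    p, _ = np.histogram(pred_lengths, bins=bins, density=True)
    q, _ = np.histogram(gt_lengths, bins=bins, density=True)
    return jensenshannon(p, q)


def dataset_jsd(per_activity):
    """Weighted average over activities; weights are the number of frames.

    per_activity: dict mapping activity -> list of (pred_lengths, gt_lengths), one pair per video.
    """
    scores, weights = [], []
    for videos in per_activity.values():
        video_scores = [segment_length_jsd(p, g) for p, g in videos]
        n_frames = sum(sum(g) for _, g in videos)   # frame count = sum of GT segment lengths
        scores.append(np.mean(video_scores))
        weights.append(n_frames)
    return float(np.average(scores, weights=weights))
```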
- Breakfast [1]
The features of the Breakfast dataset, together with the ground truth and the mapping, can be downloaded from the respective links, as in [5].
- YouTube INRIA Instructional (YTI) [2]
- IKEA ASM [3]
Link for the features coming soon.
The data folder should be arranged in the following way:
data
|--breakfast
|  |--features
|  |  |--cereals
|  |  |  |--P03_cam01_P03_cereals.txt
|  |  |  `...
|  |  |--coffee
|  |  |--friedegg
|  |  `...
|  |--groundTruth
|  |  |--P03_cam01_P03_cereals
|  |  `...
|  `--mapping
|     `--mapping.txt
|
|--YTI
|  `...
|
`--IKEA
   `...
To create the conda environment run the following command:
conda env create --name hvq --file environment.yml
source activate hvq
Then run pip install -e . to avoid a "No module named 'hvq'" error.
After activating the conda environment, just run the training file for the chosen dataset, e.g. for breakfast:
python ./BF_utils/bf_train.py
To run and evaluate at every epoch with FIFA [4] decoding and Hungarian matching, set opt.epochs=20 and opt.vqt_epochs=1. With the default combination of parameters, opt.epochs=1 and opt.vqt_epochs=20, the model is trained and evaluated once at the end.
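As a quick illustration of the two regimes, the hypothetical snippet below uses a plain namespace standing in for the repository's actual option object (which lives in the dataset-specific utils); only the attribute names follow the README.

```python
# Hypothetical illustration only: `opt` is not the repo's option object.
from types import SimpleNamespace

# Evaluate after every epoch with FIFA decoding and Hungarian matching:
opt = SimpleNamespace(epochs=20, vqt_epochs=1)

# Default: train once and evaluate a single time at the end:
opt = SimpleNamespace(epochs=1, vqt_epochs=20)
```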
Warning: unsupervised action segmentation models can be highly unstable. The system architecture, randomness, and PyTorch version may yield results different from those reported in the paper.
You can find the weights for the DoubleVQ models for every activity of the Breakfast dataset at this Kaggle link. Note: FIFA decoder still needs to be trained. The checkpoints were produced after the submission using a slightly different architecture; consequently, the results may differ slightly from those reported in the publication.
To obtain more stable predictions between epochs, it is suggested to set opt.use_cls=True. With this option, a classifier is trained on the embeddings produced by the HVQ model, using the pseudo-labels as ground truth. The paper's results are produced WITHOUT this option.
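A minimal sketch of this idea (illustrative names, not the repository's actual implementation): fit a linear classifier on the frame embeddings, with the pseudo-labels as targets, assuming embeddings of shape (T, dim) and frame-wise pseudo-labels of shape (T,).

```python
# Illustrative sketch of the opt.use_cls idea, not the official code.
import torch
import torch.nn as nn


def train_pseudo_label_classifier(embeddings, pseudo_labels, num_classes, epochs=50):
    embeddings = embeddings.detach()                 # do not backprop into the HVQ model here
    clf = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(clf(embeddings), pseudo_labels)   # pseudo-labels act as ground truth
        loss.backward()
        optimizer.step()
    return clf


# Example with random stand-ins for embeddings (T=1000, dim=64) and 10 pseudo-classes:
clf = train_pseudo_label_classifier(torch.randn(1000, 64),
                                    torch.randint(0, 10, (1000,)),
                                    num_classes=10)
```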
In our code we made use of the following repositories: MS-TCN, CTE and VQ. We sincerely thank the authors for their codebases!
Segmentation results for a sample from the Breakfast dataset (P22 friedegg). HVQ delivers highly consistent results across multiple videos (V1, V2, V3, V4) recorded from different cameras, but with the same ground truth.
[1] Kuehne, H.; Arslan, A.; and Serre, T. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In CVPR, 2014.
[2] Alayrac, J.-B.; Bojanowski, P.; Agrawal, N.; Sivic, J.; Laptev, I.; and Lacoste-Julien, S. Unsupervised Learning From Narrated Instruction Videos. In CVPR, 2016.
[3] Ben-Shabat, Y.; Yu, X.; Saleh, F.; Campbell, D.; Rodriguez-Opazo, C.; Li, H.; and Gould, S. The IKEA ASM Dataset: Understanding People Assembling Furniture Through Actions, Objects and Pose. In WACV, 2021.
[4] Souri, Y.; Farha, Y. A.; Despinoy, F.; Francesca, G.; and Gall, J. FIFA: Fast Inference Approximation for Action Segmentation. In GCPR, 2021.
[5] Kukleva, A.; Kuehne, H.; Sener, F.; and Gall, J. Unsupervised Learning of Action Classes with Continuous Temporal Embedding. In CVPR, 2019.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



