Official implementation of the paper "Hierarchical Vector Quantization for Unsupervised Action Segmentation"
Full implementation coming soon!
If you find this code or our model useful, please cite our paper:
@inproceedings{hvq2025spurio,
author = {Federico Spurio and Emad Bahrami and Gianpiero Francesca and Juergen Gall},
title = {Hierarchical Vector Quantization for Unsupervised Action Segmentation},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
year = {2025}
}

Hierarchical Vector Quantization (HVQ) is an unsupervised action segmentation approach that exploits the natural compositionality of actions. By employing a hierarchical stack of vector quantization modules, it achieves accurate segmentation.
Here is an overview of our proposed model:
In more detail, the pre-extracted features are processed by an encoder, implemented as a two-stage MS-TCN. The resulting encodings are then progressively quantized by a sequence of vector quantization modules. Each module operates with a decreasing codebook size, gradually refining the representation until the desired number of action classes is reached.
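Below is a minimal, hedged sketch of the hierarchical quantization idea in PyTorch. It is not the official implementation: the two-stage MS-TCN encoder is replaced by a single 1x1 convolution, and the feature dimension and codebook sizes are illustrative assumptions.

```python
# Illustrative sketch only: an encoder produces frame-wise embeddings that are
# quantized by a stack of VQ modules with decreasing codebook sizes.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Plain VQ layer: snaps each frame embedding to its nearest codeword."""

    def __init__(self, codebook_size, dim):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                               # z: (T, dim)
        dists = torch.cdist(z, self.codebook.weight)    # (T, codebook_size)
        idx = dists.argmin(dim=-1)                      # nearest codeword per frame
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx


class HierarchicalVQ(nn.Module):
    """Encoder followed by a stack of VQ modules with decreasing codebook sizes;
    the last codebook has as many entries as the desired number of action clusters."""

    def __init__(self, in_dim=2048, dim=64, codebook_sizes=(128, 32, 10)):
        super().__init__()
        # Stand-in for the two-stage MS-TCN encoder used in the paper.
        self.encoder = nn.Conv1d(in_dim, dim, kernel_size=1)
        self.vq_layers = nn.ModuleList(
            [VectorQuantizer(k, dim) for k in codebook_sizes]
        )

    def forward(self, feats):                           # feats: (T, in_dim) pre-extracted features
        z = self.encoder(feats.t().unsqueeze(0)).squeeze(0).t()  # (T, dim)
        idx = None
        for vq in self.vq_layers:                       # progressively coarser quantization
            z, idx = vq(z)
        return idx                                      # frame-wise cluster indices


# Example: frame-wise pseudo-labels for a 1000-frame video with 2048-dim features.
model = HierarchicalVQ()
print(model(torch.randn(1000, 2048)).shape)             # torch.Size([1000])
```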
The Jensen-Shannon Distance (JSD) is a metric for evaluating the bias in the predicted segment lengths. For each video within the same activity, we compute the histogram of the predicted segment lengths, using a bin width of 20 frames. We then compare this histogram with the corresponding ground-truth histogram using the Jensen-Shannon Distance. These JSD scores are averaged across all videos of each activity. Finally, we calculate a weighted average across all activities, where the weights are the number of frames in each activity. In particular:

$$\mathrm{JSD} = \frac{\sum_{a} N_a \,\mathrm{JSD}_a}{\sum_{a} N_a}, \qquad \mathrm{JSD}_a = \frac{1}{|V_a|} \sum_{v \in V_a} D_{\mathrm{JS}}\!\left(h_v^{\mathrm{pred}},\, h_v^{\mathrm{gt}}\right),$$

where $V_a$ is the set of videos of activity $a$, $N_a$ is the number of frames in activity $a$, $h_v^{\mathrm{pred}}$ and $h_v^{\mathrm{gt}}$ are the histograms of the predicted and ground-truth segment lengths of video $v$ (bin width of 20 frames), and $D_{\mathrm{JS}}(\cdot,\cdot)$ is the Jensen-Shannon distance between two histograms.
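For concreteness, here is a small sketch of how this metric could be computed, assuming per-video lists of predicted and ground-truth segment lengths grouped by activity. Function and variable names are illustrative, not taken from the official code.

```python
# Sketch of the JSD metric described above (illustrative, not the official code).
import numpy as np
from scipy.spatial.distance import jensenshannon


def segment_length_jsd(pred_lengths, gt_lengths, bin_width=20):
    """Jensen-Shannon distance between segment-length histograms of one video."""
    max_len = max(max(pred_lengths), max(gt_lengths))
    bins = np.arange(0, max_len + bin_width, bin_width)
    p, _ = np.histogram(pred_lengths, bins=bins, density=True)
    q, _ = np.histogram(gt_lengths, bins=bins, density=True)
    return jensenshannon(p, q)


def dataset_jsd(per_activity):
    """Weighted average over activities; weights are the number of frames.

    per_activity: dict mapping activity -> list of (pred_lengths, gt_lengths), one pair per video.
    """
    scores, weights = [], []
    for videos in per_activity.values():
        video_scores = [segment_length_jsd(p, g) for p, g in videos]
        n_frames = sum(sum(g) for _, g in videos)   # frame count = sum of GT segment lengths
        scores.append(np.mean(video_scores))
        weights.append(n_frames)
    return float(np.average(scores, weights=weights))
```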
- Breakfast [1]
The features of the Breakfast dataset, together with the ground truth and the mapping, can be downloaded from the respective links, as in [5].
- YouTube INRIA Instructional (YTI) [2]
- IKEA ASM [3]
Link for the features coming soon.
The data folder should be arranged in the following way:
data
|--breakfast
|  |--features
|  |  |--cereals
|  |  |  |--P03_cam01_P03_cereals.txt
|  |  |  `...
|  |  |--coffee
|  |  |--friedegg
|  |  `...
|  |--groundTruth
|  |  |--P03_cam01_P03_cereals
|  |  `...
|  `--mapping
|     `--mapping.txt
|
|--YTI
|  `...
|
`--IKEA
   `...
To create the conda environment run the following command:
conda env create --name hvq --file environment.yml
source activate hvq
Then run pip install -e . to avoid a "No module named 'hvq'" error.
After activating the conda environment, just run the training file for the chosen dataset, e.g. for breakfast:
python ./BF_utils/bf_train.py
To run and evaluate at every epoch with FIFA [4] decoding and Hungarian matching, set opt.epochs=20 and opt.vqt_epochs=1. With the default combination of parameters, opt.epochs=1 and opt.vqt_epochs=20, the model is trained and evaluated once at the end.
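As a quick illustration of the two regimes, the hypothetical snippet below uses a plain namespace standing in for the repository's actual option object (which lives in the dataset-specific utils); only the attribute names follow the README.

```python
# Hypothetical illustration only: `opt` is not the repo's option object.
from types import SimpleNamespace

# Evaluate after every epoch with FIFA decoding and Hungarian matching:
opt = SimpleNamespace(epochs=20, vqt_epochs=1)

# Default: train once and evaluate a single time at the end:
opt = SimpleNamespace(epochs=1, vqt_epochs=20)
```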
Warning: unsupervised action segmentation models can be highly unstable. The system architecture, randomness, and PyTorch version may yield results different from those reported in the paper.
You can find the weights for the DoubleVQ models for every activity of the Breakfast dataset at this Kaggle link. Note: FIFA decoder still needs to be trained. The checkpoints were produced after the submission using a slightly different architecture; consequently, the results may differ slightly from those reported in the publication.
To obtain more stable predictions between epochs, it is suggested to set opt.use_cls=True. With this option, a classifier is trained on the embeddings produced by the HVQ model, using the pseudo-labels as ground truth. The paper's results are produced WITHOUT this option.
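A minimal sketch of this idea (illustrative names, not the repository's actual implementation): fit a linear classifier on the frame embeddings, with the pseudo-labels as targets, assuming embeddings of shape (T, dim) and frame-wise pseudo-labels of shape (T,).

```python
# Illustrative sketch of the opt.use_cls idea, not the official code.
import torch
import torch.nn as nn


def train_pseudo_label_classifier(embeddings, pseudo_labels, num_classes, epochs=50):
    embeddings = embeddings.detach()                 # do not backprop into the HVQ model here
    clf = nn.Linear(embeddings.shape[1], num_classes)
    optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(clf(embeddings), pseudo_labels)   # pseudo-labels act as ground truth
        loss.backward()
        optimizer.step()
    return clf


# Example with random stand-ins for embeddings (T=1000, dim=64) and 10 pseudo-classes:
clf = train_pseudo_label_classifier(torch.randn(1000, 64),
                                    torch.randint(0, 10, (1000,)),
                                    num_classes=10)
```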
In our code we made use of the following repositories: MS-TCN, CTE and VQ. We sincerely thank the authors for their codebases!
Segmentation results for a sample from the Breakfast dataset (P22 friedegg). HVQ delivers highly consistent results across multiple videos (V1, V2, V3, V4) recorded from different cameras, but with the same ground truth.
[1] Kuehne, H.; Arslan, A.; and Serre, T. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In CVPR, 2014.
[2] Alayrac, J.-B.; Bojanowski, P.; Agrawal, N.; Sivic, J.; Laptev, I.; and Lacoste-Julien, S. Unsupervised Learning From Narrated Instruction Videos. In CVPR, 2016.
[3] Ben-Shabat, Y.; Yu, X.; Saleh, F.; Campbell, D.; Rodriguez-Opazo, C.; Li, H.; and Gould, S. The IKEA ASM Dataset: Understanding People Assembling Furniture Through Actions, Objects and Pose. In WACV, 2021.
[4] Souri, Y.; Farha, Y. A.; Despinoy, F.; Francesca, G.; and Gall, J. FIFA: Fast Inference Approximation for Action Segmentation. In GCPR, 2021.
[5] Kukleva, A.; Kuehne, H.; Sener, F.; and Gall, J. Unsupervised Learning of Action Classes with Continuous Temporal Embedding. In CVPR, 2019.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



