We propose the Masked Temporal Interpolation Diffusion (MTID) model for procedure planning in instructional videos. The core idea is to use intermediate latent visual features, generated by a latent-space temporal interpolation module, to provide comprehensive visual information for mid-state supervision. These generated visual features are fed directly into the action reasoning model, ensuring that the intermediate supervision is effectively applied to the action reasoning task through end-to-end training.
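The idea above can be sketched as follows. This is a minimal, illustrative NumPy version, not the paper's actual module: the real MTID interpolation weights are learned and the intermediate latents are further refined by network layers, whereas here the weights are fixed to a uniform schedule purely to show the interpolation step.

```python
import numpy as np

def interpolate_latents(start_feat, goal_feat, horizon):
    """Sketch of latent-space temporal interpolation: produce `horizon`
    intermediate visual features between the observed start and goal
    features. Weights are a fixed uniform schedule here; in MTID they
    are learned and the outputs are refined end-to-end.

    start_feat, goal_feat: (batch, feat_dim) arrays.
    Returns: (batch, horizon, feat_dim).
    """
    # one interpolation coefficient in (0, 1) per intermediate step
    w = np.arange(1, horizon + 1) / (horizon + 1)      # (horizon,)
    w = w.reshape(1, horizon, 1)
    # convex combination of start and goal features at each step
    return (1 - w) * start_feat[:, None, :] + w * goal_feat[:, None, :]
```

With `horizon=3`, the middle step lands exactly halfway between the start and goal features; the surrounding steps lean toward the start and goal respectively.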
In a conda env with cuda available, run:
conda create --name MTID python==3.10
conda activate MTID
pip install -r requirements.txt
- Download datasets & features
cd ./dataset/{dataset_name}
bash download.sh
where {dataset_name} is one of: crosstask, coin, NIV
Alternatively, you can find the datasets on Hugging Face.
- Train the transformer for task category prediction with a single GPU.
python train_mlp.py --name=train_mlp_test --dataset=crosstask_how --gpu=0 --horizon=3
The trained transformer will be saved in ./save_max_mlp, and JSON files for the training and testing data will be generated. Then run temp.py to generate JSON files with the predicted task class for testing:
python temp.py --num_thread_reader=1 --resume --batch_size=32 --gpu=0 --batch_size_val=32 --ckpt_path=/path
- Train MTID: Move the file generated by temp.py to the location specified in dataset/environments_config.json and run:
python main_distributed.py --dataset=crosstask_how --name=main_test --gpu=0 --base_model=predictor --horizon=3
To train the baseline variants, modify temporalPredictor.py to remove the 'time_mlp' modules, and modify diffusion.py to change the initial noise, the 'training' functions, and the p_sample_loop process.
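For orientation when editing diffusion.py, the p_sample_loop referenced above follows the standard DDPM reverse-sampling pattern. The sketch below is a generic NumPy version of that loop, not the repo's actual implementation (which differs in its initial noise, conditioning, and masking); `model(x, t)` is assumed to predict the noise added at step t.

```python
import numpy as np

def p_sample_loop(model, shape, betas, seed=0):
    """Generic DDPM reverse sampling loop (illustrative sketch).
    model(x, t) -> predicted noise eps at timestep t.
    betas: per-step noise schedule, length T.
    Returns a denoised sample of the given shape.
    """
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # initial noise x_T
    for t in reversed(range(len(betas))):
        eps = model(x, t)
        # posterior mean: (x - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:
            # add noise at intermediate steps; the final step is deterministic
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x
```

Changing the "initial noise" corresponds to replacing the `x = rng.standard_normal(shape)` line; changing the sampling process corresponds to modifying the loop body.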
Note: numbers may vary from run to run for PDPP and MTID.
All results are written to the log files in the out folder. If you want to run inference separately, use the following command:
python inference.py --resume --base_model=predictor --gpu=0 --ckpt_path=/path
To evaluate the baseline variants, modify temporalPredictor.py to remove the 'time_mlp' modules and modify diffusion.py to change the initial noise and the p_sample_loop process. num_sampling (L26) in uncertain.py should be set to 1.
Set the checkpoint path (L348) in uncertain.py to the model being evaluated and run:
nohup python uncertain.py --gpu=1 --num_thread_reader=1 --cudnn_benchmark=1 --pin_memory --base_model=predictor --resume --batch_size=32 --batch_size_val=32 --evaluate > out/result.log 2>&1 &
@inproceedings{zhou2025masked,
title={Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos},
author={Yufan Zhou and Zhaobo Qi and Lingshuai Lin and Junqi Jing and Tingting Chai and Beichen Zhang and Shuhui Wang and Weigang Zhang},
booktitle={ICLR},
year={2025},
}
We thank the authors of PDPP and diffusers for sharing their code.
