This dataset is an extension of the iMiGUE dataset, providing a spontaneous affective corpus in English for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. The dataset contains annotations and automatically generated metadata that are described in the next sections.
The iMiGUE-speech dataset is a collection of interview recordings organized by video/interview ID. It includes:
- a labels file (`labels.csv`) describing each interview recording, and
- per-interview folders containing the full audio recording in WAV format and multiple transcript files.
Each interview is also split into speaker-specific segments for the interviewee (athlete) and the interviewer(s) (reporters), with corresponding ASR transcripts provided for each speaker.
The dataset root directory contains a file named labels.csv, with one row per interview folder. The columns are:
- `video_id`: Unique identifier for the interview recording (matches the interview folder name).
- `subject_gender`: Gender of the interviewee (e.g., `M`, `F`).
- `subject_nationality`: Nationality of the interviewee (country name as text).
- `win_or_lose`: Outcome label associated with the interviewee (e.g., `Win`, `Lose`).
Example:
```
video_id,subject_gender,subject_nationality,win_or_lose
0001,M,Switzerland,Win
0002,M,Switzerland,Win
```
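A sketch of how the labels file could be read with the Python standard library (the `load_labels` helper is hypothetical, not part of the dataset):

```python
import csv
from pathlib import Path

def load_labels(root):
    """Read labels.csv from the dataset root into a dict keyed by video_id.

    Hypothetical helper: assumes the four columns listed above.
    """
    labels = {}
    with open(Path(root) / "labels.csv", newline="") as f:
        for row in csv.DictReader(f):
            labels[row["video_id"]] = row
    return labels
```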
The dataset contains 359 interview folders (one folder per recording). Each folder is named with its corresponding video_id and includes the full recording audio, transcript files, and speaker-segment subfolders.
Example:
```
./0440
├── 0440.asr.txt
├── 0440.raw.txt
├── 0440.TextGrid
├── 0440.txt
├── 0440.wav
├── interviewee
└── interviewer
```
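Given this layout, the expected paths for one interview can be mapped mechanically; the following `interview_files` helper is a hypothetical illustration based on the example tree above:

```python
from pathlib import Path

def interview_files(root, video_id):
    """Return the expected per-interview file paths for one recording.

    Hypothetical helper; names mirror the example directory tree.
    """
    folder = Path(root) / video_id
    return {
        "wav": folder / f"{video_id}.wav",
        "raw": folder / f"{video_id}.raw.txt",
        "asr": folder / f"{video_id}.asr.txt",
        "txt": folder / f"{video_id}.txt",
        "textgrid": folder / f"{video_id}.TextGrid",
        "interviewee_dir": folder / "interviewee",
        "interviewer_dir": folder / "interviewer",
    }
```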
Each interview folder contains the full interview audio as:
- `<video_id>.wav`: Full audio recording in WAV format (16-bit signed PCM, 44.1 kHz, mono).
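The stated audio format can be verified with Python's built-in `wave` module; this small check (a sketch, not part of the dataset tooling) should report `(1, 16, 44100)` for a conforming recording:

```python
import wave

def check_format(path):
    """Return (channels, bit depth, sample rate) of a WAV file.

    Per the stated format, full recordings should yield (1, 16, 44100).
    """
    with wave.open(str(path), "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()
```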
Each interview folder contains transcript files generated by automatic speech recognition (ASR):
- `<video_id>.raw.txt`: Non-normalized ASR output (with punctuation).
- `<video_id>.asr.txt`: Normalized ASR output.
In addition, the folder includes:
- `<video_id>.txt`: A text transcript file associated with the interview.
- `<video_id>.TextGrid`: A Praat TextGrid file associated with the recording (commonly used for time-aligned segmentation/annotation).
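For orientation, interval entries in a long-format TextGrid can be pulled out with a few lines of standard-library Python. This is a minimal sketch for illustration only; it ignores tier boundaries and point tiers, so a dedicated TextGrid library is preferable for real work:

```python
import re

# Matches the xmin / xmax / text triple of one interval in the
# long text TextGrid format.
INTERVAL_RE = re.compile(
    r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
)

def read_intervals(textgrid_text):
    """Extract (xmin, xmax, text) triples from a long-format TextGrid string."""
    return [(float(a), float(b), t) for a, b, t in INTERVAL_RE.findall(textgrid_text)]
```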
Each interview folder contains two subfolders:
- `interviewee/`: Speech segments belonging to the athlete (the main interview subject).
- `interviewer/`: Speech segments belonging to the reporters/interviewers.
These subfolders contain the corresponding segmented speech data and transcript files. Similar to the full-recording transcripts, both normalized ASR output (*.asr.txt) and non-normalized ASR output with punctuation (*.raw.txt) are provided for the speaker-specific segments.
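Collecting the per-speaker transcripts then reduces to a glob over the role subfolder; the `speaker_segments` helper below is a hypothetical sketch based on the layout described above:

```python
from pathlib import Path

def speaker_segments(interview_folder, role):
    """List the normalized ASR transcript files for one speaker role.

    role is "interviewee" or "interviewer"; hypothetical helper that
    assumes segment transcripts are named *.asr.txt as described above.
    """
    return sorted((Path(interview_folder) / role).glob("*.asr.txt"))
```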
| Type | Tool | Added metadata / output |
|---|---|---|
| Audio standardization | ffmpeg | Extracted audio; normalized format (1-channel PCM, fixed sampling rate). |
| Speaker diarization | pyannote.audio | Speaker-labeled time segments (e.g., `SPEAKER_00`). |
| Overlap detection | pyannote.audio | Intervals of simultaneous speakers. |
| VAD | pyannote.audio | Speech regions for removing silence/background noise. |
| Segment-level ASR | Whisper Large | English transcripts aligned to speech segments. |
| Segment-level TextGrid | Praat format | Unified tiers: diarization, overlap, VAD, transcripts. |
| Word-level alignment | MFA | Word boundaries from audio + Whisper transcripts. |
| Word-level TextGrid | MFA output | Separate tiered TextGrid with word and phone alignments. |
| Role identification | Heuristic | Longest cumulative speaking time mapped to athlete. |
| Speaker-specific clips | Custom | Disjoint athlete vs. journalist audio segment sets. |
| Segment indexing | Custom | Sequential IDs (e.g., segment_001). |
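The role-identification heuristic in the table (longest cumulative speaking time mapped to the athlete) can be sketched as follows; the `assign_roles` function and its input format are illustrative assumptions, not the released pipeline code:

```python
from collections import defaultdict

def assign_roles(diarized_segments):
    """Sketch of the role-identification heuristic: the diarized speaker
    with the longest cumulative speaking time is assumed to be the
    interviewee (athlete); all other speakers are interviewers.

    diarized_segments: iterable of (speaker_label, start_sec, end_sec).
    """
    totals = defaultdict(float)
    for speaker, start, end in diarized_segments:
        totals[speaker] += end - start
    athlete = max(totals, key=totals.get)
    return {s: ("interviewee" if s == athlete else "interviewer") for s in totals}
```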
To request access to the dataset, please contact Haoyu Chen (University of Oulu, Finland) to sign the license agreement. Once the agreement has been signed, you will be granted access to the full dataset.
This repository contains the data described in the paper:
Kakouros, S., Kang, F., & Chen, H. (2026). iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis. Accepted for presentation at Speech Prosody 2026.
Abstract: This work presents an extension of the iMiGUE dataset, providing a spontaneous affective corpus for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. To demonstrate the utility of the dataset and establish initial performance benchmarks for the iMiGUE-Speech extensions, we introduce two affective state evaluation tasks to facilitate comparative evaluation: Speech Emotion Recognition (SER) and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset’s capacity to capture spontaneous affective states from both acoustic and linguistic modalities. The extended dataset is made publicly available to support future research in the study of affect and related fields.
If you use the dataset, please cite:
@article{kakouros2026imiguespeechspontaneousspeechdataset,
title={iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis},
author={Sofoklis Kakouros and Fang Kang and Haoyu Chen},
year={2026},
eprint={2602.21464},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2602.21464},
}
The complete iMiGUE dataset (video and audio), along with the data collection protocol and microgesture annotations, is described in the following papers:
H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao. Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

H. Chen, H. Shi, X. Liu, X. Li, and G. Zhao. SMG: A Micro-gesture Dataset Towards Spontaneous Body Gestures for Emotional Stress State Analysis. International Journal of Computer Vision (2023).