This dataset is an extension of the iMiGUE dataset, providing a spontaneous affective corpus in English for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. The dataset contains annotations and automatically generated metadata that are described in the next sections.
The iMiGUE-speech dataset is a collection of interview recordings organized by video/interview ID. It includes:
- a labels file (`labels.csv`) describing each interview recording, and
- per-interview folders containing the full audio recording in WAV format and multiple transcript files.
Each interview is also split into speaker-specific segments for the interviewee (athlete) and the interviewer(s) (reporters), with corresponding ASR transcripts provided for each speaker.
The dataset root directory contains a file named labels.csv, with one row per interview folder. The columns are:
- `video_id`: Unique identifier for the interview recording (matches the interview folder name).
- `subject_gender`: Gender of the interviewee (e.g., `M`, `F`).
- `subject_nationality`: Nationality of the interviewee (country name as text).
- `win_or_lose`: Outcome label associated with the interviewee (e.g., `Win`, `Lose`).
Example:
```
video_id,subject_gender,subject_nationality,win_or_lose
0001,M,Switzerland,Win
0002,M,Switzerland,Win
```
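A sketch of how the labels file could be read with the Python standard library (the `load_labels` helper is hypothetical, not part of the dataset):

```python
import csv
from pathlib import Path

def load_labels(root):
    """Read labels.csv from the dataset root into a dict keyed by video_id.

    Hypothetical helper: assumes the four columns listed above.
    """
    labels = {}
    with open(Path(root) / "labels.csv", newline="") as f:
        for row in csv.DictReader(f):
            labels[row["video_id"]] = row
    return labels
```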
The dataset contains 359 interview folders (one folder per recording). Each folder is named with its corresponding video_id and includes the full recording audio, transcript files, and speaker-segment subfolders.
Example:
```
./0440
├── 0440.asr.txt
├── 0440.raw.txt
├── 0440.TextGrid
├── 0440.txt
├── 0440.wav
├── interviewee
└── interviewer
```
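Given this layout, the expected paths for one interview can be mapped mechanically; the following `interview_files` helper is a hypothetical illustration based on the example tree above:

```python
from pathlib import Path

def interview_files(root, video_id):
    """Return the expected per-interview file paths for one recording.

    Hypothetical helper; names mirror the example directory tree.
    """
    folder = Path(root) / video_id
    return {
        "wav": folder / f"{video_id}.wav",
        "raw": folder / f"{video_id}.raw.txt",
        "asr": folder / f"{video_id}.asr.txt",
        "txt": folder / f"{video_id}.txt",
        "textgrid": folder / f"{video_id}.TextGrid",
        "interviewee_dir": folder / "interviewee",
        "interviewer_dir": folder / "interviewer",
    }
```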
Each interview folder contains the full interview audio as:
- `<video_id>.wav`: Full audio recording in WAV format (16-bit signed PCM, 44.1 kHz, mono).
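The stated audio format can be verified with Python's built-in `wave` module; this small check (a sketch, not part of the dataset tooling) should report `(1, 16, 44100)` for a conforming recording:

```python
import wave

def check_format(path):
    """Return (channels, bit depth, sample rate) of a WAV file.

    Per the stated format, full recordings should yield (1, 16, 44100).
    """
    with wave.open(str(path), "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()
```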
Each interview folder contains transcript files generated by automatic speech recognition (ASR):
- `<video_id>.raw.txt`: Non-normalized ASR output (with punctuation).
- `<video_id>.asr.txt`: Normalized ASR output.
In addition, the folder includes:
- `<video_id>.txt`: A text transcript file associated with the interview.
- `<video_id>.TextGrid`: A Praat TextGrid file associated with the recording (commonly used for time-aligned segmentation/annotation).
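For orientation, interval entries in a long-format TextGrid can be pulled out with a few lines of standard-library Python. This is a minimal sketch for illustration only; it ignores tier boundaries and point tiers, so a dedicated TextGrid library is preferable for real work:

```python
import re

# Matches the xmin / xmax / text triple of one interval in the
# long text TextGrid format.
INTERVAL_RE = re.compile(
    r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
)

def read_intervals(textgrid_text):
    """Extract (xmin, xmax, text) triples from a long-format TextGrid string."""
    return [(float(a), float(b), t) for a, b, t in INTERVAL_RE.findall(textgrid_text)]
```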
Each interview folder contains two subfolders:
- `interviewee/`: Speech segments belonging to the athlete (the main interview subject).
- `interviewer/`: Speech segments belonging to the reporters/interviewers.
These subfolders contain the corresponding segmented speech data and transcript files. Similar to the full-recording transcripts, both normalized ASR output (*.asr.txt) and non-normalized ASR output with punctuation (*.raw.txt) are provided for the speaker-specific segments.
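Collecting the per-speaker transcripts then reduces to a glob over the role subfolder; the `speaker_segments` helper below is a hypothetical sketch based on the layout described above:

```python
from pathlib import Path

def speaker_segments(interview_folder, role):
    """List the normalized ASR transcript files for one speaker role.

    role is "interviewee" or "interviewer"; hypothetical helper that
    assumes segment transcripts are named *.asr.txt as described above.
    """
    return sorted((Path(interview_folder) / role).glob("*.asr.txt"))
```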
| Type | Tool | Added metadata / output |
|---|---|---|
| Audio standardization | ffmpeg | Extracted audio; normalized format (1-channel PCM, fixed sampling rate). |
| Speaker diarization | pyannote.audio | Speaker-labeled time segments (e.g., `SPEAKER_00`). |
| Overlap detection | pyannote.audio | Intervals of simultaneous speakers. |
| VAD | pyannote.audio | Speech regions for removing silence/background noise. |
| Segment-level ASR | Whisper Large | English transcripts aligned to speech segments. |
| Segment-level TextGrid | Praat format | Unified tiers: diarization, overlap, VAD, transcripts. |
| Word-level alignment | MFA | Word boundaries from audio + Whisper transcripts. |
| Word-level TextGrid | MFA output | Separate tiered TextGrid with word and phone alignments. |
| Role identification | Heuristic | Longest cumulative speaking time mapped to athlete. |
| Speaker-specific clips | Custom | Disjoint athlete vs. journalist audio segment sets. |
| Segment indexing | Custom | Sequential IDs (e.g., segment_001). |
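The role-identification heuristic in the table (longest cumulative speaking time mapped to the athlete) can be sketched as follows; the `assign_roles` function and its input format are illustrative assumptions, not the released pipeline code:

```python
from collections import defaultdict

def assign_roles(diarized_segments):
    """Sketch of the role-identification heuristic: the diarized speaker
    with the longest cumulative speaking time is assumed to be the
    interviewee (athlete); all other speakers are interviewers.

    diarized_segments: iterable of (speaker_label, start_sec, end_sec).
    """
    totals = defaultdict(float)
    for speaker, start, end in diarized_segments:
        totals[speaker] += end - start
    athlete = max(totals, key=totals.get)
    return {s: ("interviewee" if s == athlete else "interviewer") for s in totals}
```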
To request access to the dataset, please contact Haoyu Chen (University of Oulu, Finland) to sign the license agreement. Once the agreement has been signed, you will be granted access to the full dataset.
This repository contains the data described in the paper:
Kakouros, S., Kang, F., & Chen, H. (2026). iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis. Accepted for presentation at Speech Prosody 2026.
Abstract: This work presents an extension of the iMiGUE dataset, providing a spontaneous affective corpus for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. To demonstrate the utility of the dataset and establish initial performance benchmarks for the iMiGUE-Speech extensions, we introduce two affective state evaluation tasks to facilitate comparative evaluation: Speech Emotion Recognition (SER) and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset’s capacity to capture spontaneous affective states from both acoustic and linguistic modalities. The extended dataset is made publicly available to support future research in the study of affect and related fields.
If you use the dataset, please cite:
@article{kakouros2026imiguespeechspontaneousspeechdataset,
title={iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis},
author={Sofoklis Kakouros and Fang Kang and Haoyu Chen},
year={2026},
eprint={2602.21464},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2602.21464},
}
The complete iMiGUE dataset (video and audio), along with the data collection protocol and microgesture annotations, is described in the following papers:
H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao. Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

H. Chen, H. Shi, X. Liu, X. Li, and G. Zhao. SMG: A Micro-gesture Dataset Towards Spontaneous Body Gestures for Emotional Stress State Analysis. International Journal of Computer Vision (2023).