
iMiGUE-speech: A Spontaneous Speech Dataset for Affective Analysis

This dataset is an extension of the iMiGUE dataset, providing a spontaneous affective corpus in English for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. The dataset contains annotations and automatically generated metadata that are described in the next sections.

Description

The iMiGUE-speech dataset is a collection of interview recordings organized by video/interview ID. It includes:

  • a file with labels (labels.csv) describing each interview recording, and
  • per-interview folders containing the full audio recording in WAV format and multiple transcript files.

Each interview is also split into speaker-specific segments for the interviewee (athlete) and the interviewer(s) (reporters), with corresponding ASR transcripts provided for each speaker.

Labels: labels.csv

The dataset root directory contains a file named labels.csv, with one row per interview folder. The columns are:

  • video_id: Unique identifier for the interview recording (matches the interview folder name).
  • subject_gender: Gender of the interviewee (e.g., M, F).
  • subject_nationality: Nationality of the interviewee (country name as text).
  • win_or_lose: Outcome label associated with the interviewee (e.g., Win, Lose).

Example:

video_id,subject_gender,subject_nationality,win_or_lose
0001,M,Switzerland,Win
0002,M,Switzerland,Win
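
Given the schema above, labels.csv can be read with Python's standard csv module; a minimal sketch (the extraction path in the usage comment is hypothetical):

```python
import csv
from pathlib import Path

def load_labels(root: str) -> dict[str, dict[str, str]]:
    """Map each video_id to its label row from labels.csv in the dataset root."""
    with open(Path(root) / "labels.csv", newline="", encoding="utf-8") as f:
        return {row["video_id"]: row for row in csv.DictReader(f)}

# Hypothetical usage, assuming the dataset is extracted under ./imigue-speech:
# labels = load_labels("./imigue-speech")
# labels["0001"]["win_or_lose"]  # e.g. "Win"
```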

Folder Structure

The dataset contains 359 interview folders (one folder per recording). Each folder is named with its corresponding video_id and includes the full recording audio, transcript files, and speaker-segment subfolders.

Example:

./0440
├── 0440.asr.txt
├── 0440.raw.txt
├── 0440.TextGrid
├── 0440.txt
├── 0440.wav
├── interviewee
└── interviewer

Audio

Each interview folder contains the full interview audio as:

  • <video_id>.wav: Full audio recording in WAV format (16-bit signed PCM, 44.1 kHz, mono).
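
The stated format can be sanity-checked with the standard-library wave module; a small sketch (the path in the commented assertion is hypothetical):

```python
import wave

def wav_params(path_or_file):
    """Return (channels, sample_width_bytes, sample_rate) for a WAV file."""
    with wave.open(path_or_file, "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()

# Expected for this dataset: mono, 16-bit (2 bytes), 44.1 kHz.
# assert wav_params("0440/0440.wav") == (1, 2, 44100)  # hypothetical path
```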

Transcripts and Annotations

Each interview folder contains transcript files generated by automatic speech recognition (ASR):

  • <video_id>.raw.txt: Non-normalized ASR output (with punctuation).
  • <video_id>.asr.txt: Normalized ASR output.

In addition, the folder includes:

  • <video_id>.txt: A text transcript file associated with the interview.
  • <video_id>.TextGrid: A Praat TextGrid file associated with the recording (commonly used for time-aligned segmentation/annotation).
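
For the TextGrid files, established parsers such as praatio or the textgrid package are the usual choice; purely as an illustration of the format, here is a minimal regex-based reader for interval tiers, assuming the files use Praat's long text serialization:

```python
import re

# Matches one interval entry in a long-format Praat TextGrid:
#   intervals [k]:  xmin = ...  xmax = ...  text = "..."
_INTERVAL = re.compile(
    r'intervals\s*\[\d+\]:\s*'
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"([^"]*)"'
)

def read_intervals(textgrid_text: str) -> list[tuple[float, float, str]]:
    """Extract (xmin, xmax, text) triples from all interval tiers."""
    return [(float(a), float(b), t) for a, b, t in _INTERVAL.findall(textgrid_text)]
```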

Speaker-Specific Segments

Each interview folder contains two subfolders:

  • interviewee/: Speech segments belonging to the athlete (main interview subject).
  • interviewer/: Speech segments belonging to the reporters/interviewers.

These subfolders contain the corresponding segmented speech data and transcript files. As with the full-recording transcripts, both normalized ASR output (*.asr.txt) and non-normalized ASR output with punctuation (*.raw.txt) are provided for the speaker-specific segments.
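
A small sketch of pairing each segment clip in a speaker folder with its transcripts by shared file stem (naming details beyond the *.asr.txt / *.raw.txt suffixes are assumptions):

```python
from pathlib import Path

def segment_files(speaker_dir: str) -> list[dict[str, Path]]:
    """For each .wav clip in a speaker folder, collect same-stem transcript paths."""
    out = []
    for wav in sorted(Path(speaker_dir).glob("*.wav")):
        out.append({
            "wav": wav,
            "asr": wav.parent / (wav.stem + ".asr.txt"),
            "raw": wav.parent / (wav.stem + ".raw.txt"),
        })
    return out
```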

Generated audio metadata/annotations for iMiGUE-Speech

| Type | Tool | Added metadata / output |
| --- | --- | --- |
| Audio standardization | ffmpeg | Extracted audio; normalized format (1-channel PCM, fixed sampling rate). |
| Speaker diarization | pyannote.audio | Speaker-labeled time segments (e.g., SPEAKER_00). |
| Overlap detection | pyannote.audio | Intervals of simultaneous speakers. |
| VAD | pyannote.audio | Speech regions for removing silence/background noise. |
| Segment-level ASR | Whisper Large | English transcripts aligned to speech segments. |
| Segment-level TextGrid | Praat format | Unified tiers: diarization, overlap, VAD, transcripts. |
| Word-level alignment | MFA | Word boundaries from audio + Whisper transcripts. |
| Word-level TextGrid | MFA output | Separate tiered TextGrid with word and phone alignments. |
| Role identification | Heuristic | Longest cumulative speaking time mapped to athlete. |
| Speaker-specific clips | Custom | Disjoint athlete vs. journalist audio segment sets. |
| Segment indexing | Custom | Sequential IDs (e.g., segment_001). |
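
The role-identification heuristic above (longest cumulative speaking time mapped to the athlete) can be sketched as follows, with diarization output represented as (speaker_label, start, end) tuples:

```python
from collections import defaultdict

def identify_interviewee(segments: list[tuple[str, float, float]]) -> str:
    """Return the diarization label with the longest cumulative speaking time."""
    total = defaultdict(float)
    for speaker, start, end in segments:
        total[speaker] += end - start
    return max(total, key=total.get)

# Example: if SPEAKER_00 speaks for 8 s in total and SPEAKER_01 for 3 s,
# SPEAKER_00 is mapped to the interviewee (athlete).
```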

Using the dataset and licensing

To request access to the dataset, please contact Haoyu Chen (University of Oulu, Finland) to sign the license agreement. Once the agreement has been signed, you will be granted access to the full dataset.

Citing and repository information

This repository contains the data described in the paper:

Kakouros, S., Kang, F., & Chen, H. (2026). iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis. Accepted for presentation at Speech Prosody 2026.

Abstract: This work presents an extension of the iMiGUE dataset, providing a spontaneous affective corpus for the study of emotional and affective states. The new release focuses on speech and enriches the original dataset with a variety of metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. To demonstrate the utility of the dataset and establish initial performance benchmarks for the iMiGUE-Speech extensions, we introduce two affective state evaluation tasks to facilitate comparative evaluation: Speech Emotion Recognition (SER) and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset’s capacity to capture spontaneous affective states from both acoustic and linguistic modalities. The extended dataset is made publicly available to support future research in the study of affect and related fields.

If you use the dataset, please cite:

@article{kakouros2026imiguespeechspontaneousspeechdataset,
      title={iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis}, 
      author={Sofoklis Kakouros and Fang Kang and Haoyu Chen},
      year={2026},
      eprint={2602.21464},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2602.21464}, 
}

Audio-video dataset

The complete iMiGUE dataset (video and audio), along with the data collection protocol and microgesture annotations, is described in the following papers:

H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao. Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning. 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).

H. Chen, H. Shi, X. Liu, X. Li, and G. Zhao. SMG: A Micro-gesture Dataset Towards Spontaneous Body Gestures for Emotional Stress State Analysis. International Journal of Computer Vision (2023).
