A class-conditional video generation model built on a Diffusion Transformer (DiT) architecture that operates in VAE latent space. The model generates 16-frame videos at 256x256 resolution through iterative denoising.
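The relationship between the video size and the latent size used by the model can be checked with a little arithmetic. The sketch below is illustrative only: `latent_shape` is a hypothetical helper, and it assumes a per-frame VAE with 4 latent channels and 8x spatial downsampling (inferred from 256 → 32 in the latent shape `(4, 16, 32, 32)`), with no temporal compression and a `(C, T, H, W)` axis order.

```python
def latent_shape(frames, height, width, spatial_factor=8, latent_channels=4):
    """Return the (C, T, H, W) latent shape for a video of the given size.

    Assumption: the VAE downsamples only spatially (per frame). The factor of 8
    and the 4 latent channels are inferred from the shapes in this document,
    not taken from the model's actual configuration.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    return (
        latent_channels,
        frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# 16 frames at 256x256 map to the latent shape the denoiser works on.
shape = latent_shape(frames=16, height=256, width=256)
print(shape)  # (4, 16, 32, 32)
```

Under these assumptions, each denoising step operates on 4 × 16 × 32 × 32 = 65,536 latent values rather than the 3 × 16 × 256 × 256 = 3,145,728 values of an RGB clip.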
Input Noise (4, 16, 32, 32)
       │
       ▼
┌─────────────┐
│  Patchify   │  3D patches (2x2x2)
└─────────────┘
       │
       ▼
┌─────────────┐
│ Patch Embed │  Linear projection to dim=1024
└─────────────┘
       │
       ▼
┌─────────────┐
│ + PosEmbed  │  Learned positional embeddings
└─────────────┘
       │
       ▼
┌──────────────────┐
│  SpatioTemporal  │  ×16 blocks
│    DiT Blocks    │
│   (with adaLN)   │
└──────────────────┘
       │
       ▼
┌─────────────┐
│ Final Layer │  Project back to patch dim
└─────────────┘
       │
       ▼
┌─────────────┐
│ Unpatchify  │  Reconstruct latent tensor
└─────────────┘
       │
       ▼
Predicted Noise (4, 16, 32, 32)
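
To make the tensor sizes at each stage concrete, the following sketch traces shapes through the pipeline for a single sample (batch dimension omitted). It is a bookkeeping aid, not the model code: `trace_shapes` is a hypothetical helper, and it assumes the 2x2x2 patches are non-overlapping and ordered (time, height, width), and that the latent axes are `(C, T, H, W)`.

```python
from math import prod


def trace_shapes(latent=(4, 16, 32, 32), patch=(2, 2, 2), dim=1024):
    """Return (stage, shape) pairs for one sample passing through the network.

    Assumptions: latent is (C, T, H, W); patches are non-overlapping and given
    as (time, height, width); token sequences are (num_tokens, features).
    """
    c, t, h, w = latent
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")

    grid = (t // pt, h // ph, w // pw)   # patches along time, height, width
    tokens = prod(grid)                  # sequence length seen by the transformer
    patch_dim = c * prod(patch)          # values flattened into each patch

    return [
        ("input noise", latent),
        ("patchify", (tokens, patch_dim)),
        ("patch embed", (tokens, dim)),
        ("+ pos embed", (tokens, dim)),
        ("DiT blocks", (tokens, dim)),
        ("final layer", (tokens, patch_dim)),
        ("unpatchify", latent),
    ]


for stage, shape in trace_shapes():
    print(f"{stage:<12} {shape}")
```

With the values from the diagram, the patch grid is 8 × 16 × 16, so the transformer processes 2,048 tokens. Each token starts as a 32-value patch (4 channels × 2 × 2 × 2), is projected to 1,024 dimensions, and is projected back to 32 values before unpatchifying.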