Video Diffusion Transformer (VideoDiT)

Minimal implementation of a DiT-based video generation model.

Sample output: epoch_250_class_2.mp4

A class-conditional video generation model using a Diffusion Transformer architecture operating in VAE latent space. The model generates 16-frame videos at 256x256 resolution through iterative denoising.
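For intuition, here is a minimal sketch of what the iterative denoising loop looks like at sampling time, assuming a plain DDPM scheduler with a linear beta schedule. The `sample` function, the `vae.decode` call, and the `model(x, t, y)` signature are illustrative assumptions, not this repository's actual API:

```python
# Minimal DDPM-style sampling sketch (illustrative only; the repo's
# scheduler and model signature may differ).
import torch

@torch.no_grad()
def sample(model, vae, class_label, num_steps=1000, device="cuda"):
    # Linear beta schedule, as in the original DDPM paper.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise in VAE latent space:
    # (batch, channels=4, frames=16, height=32, width=32).
    x = torch.randn(1, 4, 16, 32, 32, device=device)
    y = torch.tensor([class_label], device=device)

    for t in reversed(range(num_steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = model(x, t_batch, y)  # model predicts the noise in x
        alpha, alpha_bar = alphas[t], alpha_bars[t]
        # DDPM posterior mean: remove a small slice of predicted noise.
        x = (x - (1 - alpha) / (1 - alpha_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)

    # Decode each latent frame to 256x256 pixels (assumes an image VAE
    # with an 8x downsampling factor, applied per frame).
    frames = vae.decode(x.squeeze(0).permute(1, 0, 2, 3))  # (16, 3, 256, 256)
    return frames
```

Only the final latent is decoded to pixel space; every denoising step runs on the small 4×16×32×32 latent, which is what makes latent diffusion cheap relative to pixel-space diffusion.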

Architecture Overview

Input Noise (4, 16, 32, 32)
         │
         ▼
   ┌─────────────┐
   │  Patchify   │  3D patches (2x2x2)
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │ Patch Embed │  Linear projection to dim=1024
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │  + PosEmbed │  Learned positional embeddings
   └─────────────┘
         │
         ▼
  ┌──────────────────┐
  │  SpatioTemporal  │ ×16 blocks
  │    DiT Blocks    │
  │  (with adaLN)    │
  └──────────────────┘
         │
         ▼
   ┌─────────────┐
   │ Final Layer │  Project back to patch dim
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │ Unpatchify  │  Reconstruct latent tensor
   └─────────────┘
         │
         ▼
  Predicted Noise (4, 16, 32, 32)
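The diagram maps onto a fairly small amount of code. Below is a hedged sketch of the patchify → DiT blocks → unpatchify pipeline; the class and module names (`VideoDiT`, `DiTBlock`), the Conv3d patch embedding, and the simple linear timestep embedding are assumptions for illustration, not this repo's exact implementation (the DiT paper uses sinusoidal timestep embeddings and zero-initializes the adaLN modulation layers):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaLN: a conditioning vector c (timestep +
    class embedding) regresses per-block shift/scale/gate parameters."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # 6 modulation tensors: shift/scale/gate for attention and MLP.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, c):
        sh1, sc1, g1, sh2, sc2, g2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + sh1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + sh2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

class VideoDiT(nn.Module):
    def __init__(self, dim=1024, depth=16, num_classes=10):
        super().__init__()
        # Patchify + linear embed in one op: a Conv3d with kernel=stride=2
        # cuts the (4, 16, 32, 32) latent into 8*16*16 = 2048 patches.
        self.patch_embed = nn.Conv3d(4, dim, kernel_size=2, stride=2)
        self.pos_embed = nn.Parameter(torch.zeros(1, 2048, dim))
        self.t_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.y_embed = nn.Embedding(num_classes, dim)
        self.blocks = nn.ModuleList(DiTBlock(dim) for _ in range(depth))
        self.final = nn.Linear(dim, 4 * 2 * 2 * 2)  # back to patch dim

    def forward(self, x, t, y):
        B = x.shape[0]
        x = self.patch_embed(x)            # (B, dim, 8, 16, 16)
        x = x.flatten(2).transpose(1, 2)   # (B, 2048, dim)
        x = x + self.pos_embed             # learned positional embeddings
        c = self.t_embed(t.float()[:, None]) + self.y_embed(y)
        for blk in self.blocks:
            x = blk(x, c)
        x = self.final(x)                  # (B, 2048, 32)
        # Unpatchify: reassemble (B, 4, 16, 32, 32) from 2x2x2 patches.
        x = x.view(B, 8, 16, 16, 2, 2, 2, 4)
        x = x.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, 4, 16, 32, 32)
        return x
```

The adaLN conditioning is the distinctive DiT design choice: instead of appending the class/timestep embedding to the token sequence, each block regresses shift, scale, and gate parameters from it, which modulate otherwise parameter-free LayerNorms.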
More sample outputs: horse2.MP4, generated_2_class_5.MP4
