Video Diffusion Transformer (VideoDiT)

epoch_250_class_2.mp4

VideoDiT is a class-conditional video generation model built on a Diffusion Transformer (DiT) that operates in VAE latent space. It generates 16-frame videos at 256x256 resolution through iterative denoising of a (4, 16, 32, 32) latent tensor.
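
The iterative denoising can be pictured with a minimal sampling loop. This README does not state which sampler or noise schedule the repository uses, so the sketch below assumes standard DDPM ancestral sampling with a linear beta schedule. `predict_noise` is a hypothetical stand-in for the trained model, and `sample`, `num_steps`, and the schedule endpoints are illustrative names and values, not taken from the repository.

```python
import numpy as np

def predict_noise(x, t, class_id):
    """Hypothetical stand-in for the trained VideoDiT.

    A real model would return its noise estimate for latent x at timestep t,
    conditioned on class_id. Here it returns zeros so the loop is runnable.
    """
    return np.zeros_like(x)

def sample(class_id, shape=(4, 16, 32, 32), num_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, class_id)
        # Mean of x_{t-1} given the predicted noise (DDPM update).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

latent = sample(class_id=2)
print(latent.shape)  # (4, 16, 32, 32)
```

The result is still a latent tensor; turning it into 256x256 frames would require passing it through the VAE decoder, which is not shown here.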

Architecture Overview

Input Noise (4, 16, 32, 32)
         │
         ▼
   ┌─────────────┐
   │  Patchify   │  3D patches (2x2x2)
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │ Patch Embed │  Linear projection to dim=1024
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │  + PosEmbed │  Learned positional embeddings
   └─────────────┘
         │
         ▼
  ┌──────────────────┐
  │  SpatioTemporal  │ ×16 blocks
  │    DiT Blocks    │
  │  (with adaLN)    │
  └──────────────────┘
         │
         ▼
   ┌─────────────┐
   │ Final Layer │  Project back to patch dim
   └─────────────┘
         │
         ▼
   ┌─────────────┐
   │ Unpatchify  │  Reconstruct latent tensor
   └─────────────┘
         │
         ▼
  Predicted Noise (4, 16, 32, 32)
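
The shapes in the diagram can be checked with a small NumPy sketch of the patchify, embed, and unpatchify steps. This is an illustration under assumptions, not the repository's code: the function names, the axis ordering inside each patch, and the random weights are all hypothetical. Only the latent shape (4, 16, 32, 32), the 2x2x2 patch size, and dim=1024 come from the diagram.

```python
import numpy as np

def patchify(x, p=(2, 2, 2)):
    """(C, T, H, W) -> (num_tokens, C * pt * ph * pw)."""
    c, t, h, w = x.shape
    pt, ph, pw = p
    x = x.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    # Put the patch-grid axes first, then channel and within-patch axes.
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)
    return x.reshape((t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

def unpatchify(tokens, shape, p=(2, 2, 2)):
    """Inverse of patchify: (num_tokens, patch_dim) -> (C, T, H, W)."""
    c, t, h, w = shape
    pt, ph, pw = p
    x = tokens.reshape(t // pt, h // ph, w // pw, c, pt, ph, pw)
    x = x.transpose(3, 0, 4, 1, 5, 2, 6)
    return x.reshape(c, t, h, w)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 32, 32))

tokens = patchify(latent)
print(tokens.shape)  # (2048, 32): 8 * 16 * 16 patches, 4 * 2 * 2 * 2 values each

# Linear projection to dim=1024 plus positional embeddings.
# Random placeholders stand in for the learned parameters.
w_embed = rng.standard_normal((32, 1024)) * 0.02
pos_embed = rng.standard_normal((2048, 1024)) * 0.02
embedded = tokens @ w_embed + pos_embed
print(embedded.shape)  # (2048, 1024)

# Unpatchify restores the original latent layout exactly.
restored = unpatchify(tokens, latent.shape)
```

With these settings each DiT block attends over a sequence of 2048 tokens, and the final layer has to project each 1024-dimensional token back to 32 values before unpatchifying.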

horse2.MP4

generated_2_class_5.MP4


123-code/video-generation-model
