
Fp8 opt bugfix #68

Open
ngc92 wants to merge 3 commits into dev from fp8-opt-bugfix

Conversation

@ngc92
Collaborator

@ngc92 ngc92 commented Mar 6, 2026

seems to work again for 1.5b

Copilot AI review requested due to automatic review settings March 6, 2026 21:25
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a bug in FP8 optimizer state allocation that caused failures when model weight tensors had a first dimension not divisible by 128 × world_size. The fix introduces a flat_view / flattened_view utility to correctly shape FP8 momentum scaling tensors by flattening each weight tensor before sharding, rather than directly dividing the first dimension of multi-dimensional tensors.

Changes:

  • New flat_view(Tensor) and flattened_view(GenericTensorContainer) utility functions to produce 1D views of tensors and containers
  • Rewrote FP8 scale tensor shaping in AdamWStateManager::allocate_state to flatten first, shard by mWorld, then shard by 128
  • Added early-exit and null-pointer guards to fill_imp to safely handle zero-element tensors

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/utilities/tensor.h — Declares the new flat_view(const Tensor&) function
  • src/utilities/tensor.cpp — Implements flat_view and flattened_view(GenericTensorContainer)
  • src/utilities/tensor_container.h — Declares flattened_view(GenericTensorContainer)
  • src/training/adamw_optimizer.cpp — Replaces the direct shard_empty_container(_, 128*mWorld) call with a correct flatten-then-shard approach for FP8 scale tensors
  • src/kernels/fill.cu — Adds a count == 0 early return and a null-pointer guard to fill_imp


Comment on lines +185 to +188
// flatten the local shard
auto flattened = flattened_view(sharded);
// and group into scaling groups
auto grouped = shard_empty_container(std::move(flattened), 128);

Copilot AI Mar 6, 2026


The second call to flattened_view(sharded) on this line is redundant. After shard_empty_container(flattened_view(c), mWorld), all tensors in the container are already 1D (rank 1), so applying flattened_view again produces the same shapes. The sharded container can be passed directly to shard_empty_container on the next line. Removing this call would simplify the logic and avoid allocating an unnecessary intermediate container.

Suggested change
-    // flatten the local shard
-    auto flattened = flattened_view(sharded);
-    // and group into scaling groups
-    auto grouped = shard_empty_container(std::move(flattened), 128);
+    // tensors in 'sharded' are already 1D; directly group into scaling groups
+    auto grouped = shard_empty_container(std::move(sharded), 128);

auto grouped = shard_empty_container(std::move(flattened), 128);
return grouped;
};
// we "shard" for 128 as many GPUs, so that we get 1 scale per 128 weights.

Copilot AI Mar 6, 2026


The comment on this line ("we 'shard' for 128 as many GPUs, so that we get 1 scale per 128 weights") is outdated. The old code directly divided by 128 * mWorld; the new code first shards by mWorld (matching main weight sharding), then flattens the local shard, and shards by 128 to get one scale per 128 weights. The comment should be updated to reflect the two-step sharding in prepare_shape_for_scales.

Suggested change
-// we "shard" for 128 as many GPUs, so that we get 1 scale per 128 weights.
+// we first shard by mWorld (matching main weights), then shard the local
+// flattened view by 128 to get 1 scale per 128 weights.

//! are `nullptr`, but sizes have been set up.
GenericTensorContainer shard_empty_container(GenericTensorContainer&& c, int world);

//! Flattens all tensors is the container.

Copilot AI Mar 6, 2026


The doc comment says "Flattens all tensors is the container" — "is" should be "in".

Suggested change
-//! Flattens all tensors is the container.
+//! Flattens all tensors in the container.
