Conversation
Pull request overview
This PR fixes a bug in FP8 optimizer state allocation that caused failures when model weight tensors had a first dimension not divisible by 128 × world_size. The fix introduces a flat_view / flattened_view utility to correctly shape FP8 momentum scaling tensors by flattening each weight tensor before sharding, rather than directly dividing the first dimension of multi-dimensional tensors.
Changes:
- New `flat_view(Tensor)` and `flattened_view(GenericTensorContainer)` utility functions to produce 1D views of tensors and containers
- Rewrote FP8 scale tensor shaping in `AdamWStateManager::allocate_state` to flatten first, shard by `mWorld`, then shard by 128
- Added early-exit and null-pointer guards to `fill_imp` to safely handle zero-element tensors
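For intuition, the shape arithmetic behind the fix can be sketched with a small hypothetical helper (names and the ceil-division grouping are assumptions, not code from this PR): flattening first makes the element count shardable regardless of whether the first dimension is divisible by 128 × world.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the flatten-then-shard arithmetic for FP8 scales;
// not the actual PR code.
int64_t scales_per_rank(int64_t rows, int64_t cols, int64_t world) {
    int64_t total = rows * cols;     // flatten: total element count
    int64_t local = total / world;   // shard by world (assumes even division)
    return (local + 127) / 128;      // group local shard: 1 scale per 128 weights
}
```

For example, rows = 100 is not divisible by 128 × world = 1024 (the previously failing case), but flattening gives 100 × 256 = 25600 elements, 3200 per rank across 8 ranks, and 25 scales per rank.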
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/utilities/tensor.h | Declares new `flat_view(const Tensor&)` function |
| src/utilities/tensor.cpp | Implements `flat_view` and `flattened_view(GenericTensorContainer)` |
| src/utilities/tensor_container.h | Declares `flattened_view(GenericTensorContainer)` |
| src/training/adamw_optimizer.cpp | Replaces direct `shard_empty_container(_, 128*mWorld)` with a correct flatten-then-shard approach for FP8 scale tensors |
| src/kernels/fill.cu | Adds `count == 0` early return and null-pointer guard to `fill_imp` |
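The guard added to `fill_imp` can be pictured with a host-side stand-in; the real function launches a CUDA kernel and its actual signature may differ (this one is an assumption):

```cpp
#include <cassert>
#include <cstddef>

// Host-side stand-in for the guarded fill; the actual fill_imp launches a
// CUDA kernel. Signature and return convention are assumptions.
bool fill_imp(float* data, std::size_t count, float value) {
    if (count == 0) return true;        // early return: zero-element tensor, nothing to fill
    if (data == nullptr) return false;  // null pointer with nonzero count is invalid
    for (std::size_t i = 0; i < count; ++i) data[i] = value;  // stands in for the kernel
    return true;
}
```

The order of the two checks matters: a zero-element tensor may legitimately carry a null data pointer, so the size check must come first.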
```cpp
// flatten the local shard
auto flattened = flattened_view(sharded);
// and group into scaling groups
auto grouped = shard_empty_container(std::move(flattened), 128);
```
The second call to flattened_view(sharded) on this line is redundant. After shard_empty_container(flattened_view(c), mWorld), all tensors in the container are already 1D (rank 1), so applying flattened_view again produces the same shapes. The sharded container can be passed directly to shard_empty_container on the next line. Removing this call would simplify the logic and avoid allocating an unnecessary intermediate container.
Suggested change:
```diff
-// flatten the local shard
-auto flattened = flattened_view(sharded);
-// and group into scaling groups
-auto grouped = shard_empty_container(std::move(flattened), 128);
+// tensors in 'sharded' are already 1D; directly group into scaling groups
+auto grouped = shard_empty_container(std::move(sharded), 128);
```
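The "already 1D" claim can be checked with a tiny shape sketch (the shape representation here is an assumption for illustration): a flat view of `{d0, d1, ...}` is `{d0*d1*...}`, so applying it to a rank-1 shape is the identity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed shape representation: a flat view collapses all dims into one.
// On a rank-1 shape this returns the same shape, which is why a second
// flatten after the per-rank shard is a no-op.
std::vector<int64_t> flat_shape(const std::vector<int64_t>& dims) {
    int64_t total = 1;
    for (int64_t d : dims) total *= d;  // product of all dimensions
    return {total};
}
```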
```cpp
auto grouped = shard_empty_container(std::move(flattened), 128);
return grouped;
};
// we "shard" for 128 as many GPUs, so that we get 1 scale per 128 weights.
```
The comment on this line ("we 'shard' for 128 as many GPUs, so that we get 1 scale per 128 weights") is outdated. The old code directly divided by 128 * mWorld; the new code first shards by mWorld (matching main weight sharding), then flattens the local shard, and shards by 128 to get one scale per 128 weights. The comment should be updated to reflect the two-step sharding in prepare_shape_for_scales.
Suggested change:
```diff
-// we "shard" for 128 as many GPUs, so that we get 1 scale per 128 weights.
+// we first shard by mWorld (matching main weights), then shard the local
+// flattened view by 128 to get 1 scale per 128 weights.
```
```cpp
//! are `nullptr`, but sizes have been set up.
GenericTensorContainer shard_empty_container(GenericTensorContainer&& c, int world);

//! Flattens all tensors is the container.
```
The doc comment says "Flattens all tensors is the container" — "is" should be "in".
Suggested change:
```diff
-//! Flattens all tensors is the container.
+//! Flattens all tensors in the container.
```
seems to work again for 1.5b