
Conversation


@weilr weilr commented Feb 5, 2026

PhysicsNeMo Pull Request

Description

FIGConvNet: fixed split_by_node_equal so that it supports multi-GPU execution.

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an
indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.


greptile-apps bot commented Feb 5, 2026

Greptile Summary

Refactored split_by_node_equal to properly support multi-GPU execution by incorporating DataLoader worker processes alongside distributed training processes.

Key changes:

  • Now properly unpacks all 4 values from pytorch_worker_info (rank, world_size, worker, num_workers) instead of ignoring the worker-related values
  • Calculates global worker ID across both distributed nodes and DataLoader workers: g_worker = rank * num_workers + worker
  • Adjusts chunk size from world_size to g_world = world_size * num_workers to account for all workers
  • Simplifies tail handling logic with clearer control flow

The implementation correctly handles the case where PyTorch DataLoaders spawn multiple workers per GPU, ensuring data is properly sharded across all workers in a multi-GPU setup.
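The sharding scheme described above can be sketched as follows. This is a simplified illustration, not the PR's actual code: the real split_by_node_equal lives in webdataset_utils.py, and the drop_last parameter and tail-reuse behavior here are assumptions based on the review excerpt below.

```python
import itertools

def split_by_node_equal(src, rank, world_size, worker, num_workers,
                        drop_last=False):
    """Illustrative sketch: shard an iterable across all global workers."""
    # Global worker ID across distributed ranks and DataLoader workers.
    g_worker = rank * num_workers + worker
    g_world = world_size * num_workers
    it = iter(src)
    while True:
        chunk = list(itertools.islice(it, g_world))
        if len(chunk) == g_world:
            # Full chunk: every global worker takes exactly one item.
            yield chunk[g_worker]
        else:
            tail_size = len(chunk)
            if not drop_last and tail_size > 0:
                # Partial tail: reuse items so each worker still yields one.
                yield chunk[g_worker % tail_size]
            return
```

For example, with world_size=2 and num_workers=2 there are four global workers, so each worker takes every fourth item; the tail of a source that is not evenly divisible is either reused across workers or dropped, depending on drop_last.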

Important Files Changed

Filename: examples/cfd/external_aerodynamics/figconvnet/src/data/components/webdataset_utils.py
Overview: Refactored split_by_node_equal to support multi-GPU execution by handling DataLoader worker processes alongside distributed processes

@greptile-apps greptile-apps bot left a comment


1 file reviewed, no comments


@coreyjadams coreyjadams self-requested a review February 10, 2026 14:56

@coreyjadams coreyjadams left a comment


Hi @weilr - Thanks for submitting this. I have one small question but otherwise looks reasonable to me.

```python
if not drop_last and tail_size > 0:
    yield next_items[rank % tail_size]
it = iter(src)
while True:
```

I'm not super familiar with this data tool. Is there an alternative to an infinite loop here?
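For reference, when a `while True` loop only pulls fixed-size chunks from an iterator, a common alternative is to drive the loop with a plain `for` over the iterator, which terminates cleanly when the source is exhausted. A minimal sketch (not the PR's actual code; the name `iter_chunks` is illustrative):

```python
import itertools

def iter_chunks(src, size):
    """Yield successive chunks of up to `size` items from `src`."""
    it = iter(src)
    # The for-loop ends when `it` is exhausted, with no explicit
    # `while True` / StopIteration handling needed.
    for first in it:
        yield [first, *itertools.islice(it, size - 1)]
```

Whether this fits here depends on whether the tail chunk needs the special handling shown in the diff above.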
