Profiling GPU performance for CliMA.
To run this project, first install Calkit on the clima machine:

    curl -LsSf https://github.com/calkit/calkit/raw/refs/heads/main/scripts/install.sh | sh

Next, configure a token for interacting with calkit.io (where we store version-controlled Nsight reports).
If you don't already have an SSH key added to GitHub, either follow their documentation or run:

    calkit config github-ssh

Then clone the project:

    calkit clone --ssh petebachant/clima-gpu-profiling

Lastly, call:

    calkit run

This will run all pipeline stages in the order they're defined in
`calkit.yaml`.
If you'd like to run a single stage (in a reproducible way),
you can use its name as the first positional argument to calkit run.
For example:

    calkit run amip-clima-nsys

Note that, by default, only stages whose Nsight reports are out of date (because their inputs have changed since the last run) will run.
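
The "only rerun when inputs changed" behavior described above can be illustrated with a small content-hashing sketch. This is not Calkit's implementation; the function names (`hash_inputs`, `stage_is_stale`, `record_run`) and the JSON lock file layout are hypothetical and only meant to show the general idea.

```python
import hashlib
import json
from pathlib import Path


def hash_inputs(paths):
    """Return one SHA-256 digest covering the names and contents of all inputs."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def _load(lock_file):
    """Read the recorded digests, or an empty dict if nothing has run yet."""
    lock_file = Path(lock_file)
    return json.loads(lock_file.read_text()) if lock_file.exists() else {}


def stage_is_stale(stage_name, input_paths, lock_file):
    """True if the stage has never run or its inputs changed since the last run."""
    return _load(lock_file).get(stage_name) != hash_inputs(input_paths)


def record_run(stage_name, input_paths, lock_file):
    """Store the current input digest after a successful run of the stage."""
    recorded = _load(lock_file)
    recorded[stage_name] = hash_inputs(input_paths)
    Path(lock_file).write_text(json.dumps(recorded, indent=2))
```

In this sketch, a runner would call `stage_is_stale` before each stage, skip the stage when it returns `False`, and call `record_run` once the stage finishes successfully.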

To work in the notebook on a GPU node, start an interactive Slurm session and launch JupyterLab from it:

    srun --gpus=1 --mpi=none --pty bash
    calkit jupyter lab --ip=0.0.0.0 --no-browser

Then copy the server URL, which starts with http://127.0.0.1,
and use it in VS Code when selecting a kernel for the notebook.

- `ClimaCore.jl` --> `pb/rm-nvtx`
- `ClimaCore.jl-mod` --> `pb/perf`
- `ClimaCoupler.jl` --> `pb/rm-nvtx`
- `ClimaCoupler.jl-mod` --> `pb/perf`
- `ClimaAtmos.jl` --> `pb/rm-nvtx`
- `ClimaAtmos.jl-mod` --> `pb/perf`

| Commit (super-repo) | Change summary | Result |
|---|---|---|
| 130baab | Increased occupancy for `run_field_matrix_solver` and reduced registers per thread, but slowed down overall. | |
| e5845b7 | Similar to the above, but not quite as slow. | |
| ff26f4b1 | Use PCR for tridiagonal matrix solve. | Seems to be 3% faster, but with higher error. May not have isolated the change properly, though. |
| e6099c2 | Try capping all threads to 256. | 1% slower on flagship. |
| 23c9104 | Attempt to coalesce memory access in solvers. | 5% slowdown. |
| 7614ca6 | Thread block restructuring and LocalGeometry caching. | No significant change. |
| f9eb67a | Tr/mem access patterns | 9% speedup. |
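
For readers unfamiliar with the PCR row above: assuming PCR there means parallel cyclic reduction, the following NumPy sketch shows the basic algorithm for a single tridiagonal system. It is an illustration only, not the solver used in commit ff26f4b1; the function name `pcr_solve` and its argument names are hypothetical. Each row update within a step is independent of the others, which is what makes the method attractive on GPUs.

```python
import numpy as np


def pcr_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with parallel cyclic reduction.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored.
    """
    a, b, c, d = (np.array(v, dtype=float) for v in (lower, diag, upper, rhs))
    n = b.size
    a[0] = 0.0
    c[-1] = 0.0

    def from_row(v, offset, fill):
        """Return w with w[i] = v[i + offset], or `fill` where that row does not exist."""
        w = np.full(n, fill)
        if offset > 0:
            w[:-offset] = v[offset:]
        else:
            w[-offset:] = v[:offset]
        return w

    stride = 1
    while stride < n:
        # Coefficients of rows i - stride and i + stride. Rows outside the
        # system behave like identity rows (diag = 1, everything else = 0).
        a_lo, c_lo, d_lo = (from_row(v, -stride, 0.0) for v in (a, c, d))
        a_hi, c_hi, d_hi = (from_row(v, stride, 0.0) for v in (a, c, d))
        b_lo = from_row(b, -stride, 1.0)
        b_hi = from_row(b, stride, 1.0)

        # Eliminate x[i - stride] and x[i + stride] from every row at once.
        alpha = -a / b_lo
        gamma = -c / b_hi
        a, b, c, d = (
            alpha * a_lo,
            b + alpha * c_lo + gamma * a_hi,
            gamma * c_hi,
            d + alpha * d_lo + gamma * d_hi,
        )
        stride *= 2

    # All rows are now decoupled: b[i] * x[i] = d[i].
    return d / b
```

The loop takes about log2(n) steps, compared with the n sequential steps of the Thomas algorithm, at the cost of more total arithmetic. The two methods perform different floating-point operations, so small differences in the result are expected.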