Conversation
📝 Walkthrough

This PR strengthens GPU development infrastructure by adding robust retry logic to CI workflows, introducing NVCC build configuration, adding GPU warp size utilities, enabling Python JIT bindings for GPU operations, and updating CUDA compute capabilities.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 2 passed, 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In @.sublime-project:
- Around line 16-18: The JSON for the "python-black" object contains a trailing
comma after the "format_on_save" value which makes the file invalid; remove the
trailing comma after "format_on_save": "on" (i.e., ensure the "python-black"
object ends with a closing brace and no trailing comma) so the file becomes
strict JSON-parsable while keeping the "python-black" and "format_on_save" keys
intact.
In `@inc/mkn/gpu/cuda.hpp`:
- Around line 88-90: The call to cudaGetDeviceProperties in the function (using
local cudaDeviceProp devProp and returning devProp.warpSize) ignores its return
value, risking uninitialized warpSize on failure; check the returned cudaError_t
and handle errors by invoking MKN_GPU_ASSERT (or similar error macro) or
returning a safe default (e.g., 32) when cudaGetDeviceProperties fails. Update
the code around cudaGetDeviceProperties to capture the return value, assert on
non-success via MKN_GPU_ASSERT(cudaGetDeviceProperties(...) == cudaSuccess,
"msg") or branch to return a sensible default, ensuring devProp.warpSize is only
used when the call succeeded.
In `@test/any/async_streaming.cpp`:
- Around lines 8-9: The test uses std::this_thread::sleep_for (at lines 38, 70, 103, 138, and 173) but doesn't include <thread>, relying on transitive includes; add an explicit `#include <thread>` at the top of async_streaming.cpp so std::this_thread::sleep_for is available portably and no longer depends on indirect headers.
In `@test/poc/jit/eg.py`:
- Around line 17-19: The vadd function is currently decorated as a kernel
(`@cuda.jit`) but is called from inside vector_add_gpu, so change its decorator to
a device function by using `@cuda.jit`(device=True) (i.e., make vadd a device
function) so it can be invoked from the kernel; keep the function name vadd and
its parameter list unchanged and ensure vector_add_gpu continues to call vadd
from within the kernel.
In `@test/poc/jit/pybind.cpp`:
- Around line 57-66: The PYBIND11_MODULE binding for class FunctionSupport
contains a duplicate def_readwrite("a", &FunctionSupport::a) where the third
member should be exposed; update the duplicate binding to def_readwrite("c",
&FunctionSupport::c) so the class exposes members a, b, and c correctly and keep
the other method bindings (print, A, B, C) unchanged.
- Around line 44-46: The wrapper creates a py::array_t<float> in make(auto p)
with a no-op capsule, risking use-after-free if the returned array (backed by
CUDA-managed memory) outlives its parent; change the capsule to hold a reference to the
parent Python object (e.g., the FunctionSupport py::object) or otherwise set the
array base to that parent so NumPy keeps the parent alive. Locate make(auto p)
and modify the capsule construction to capture and incref (hold) the parent
py::object (or pass the parent as the array base) instead of a no-op destructor,
ensuring the parent’s lifetime is extended until the array is freed. Ensure the
symbol FunctionSupport (or the accessor that calls make) provides the py::object
parent to make so the returned py::array_t<float> keeps the parent alive.
- Line 58: The binding uses py::smart_holder in the py::class_<FunctionSupport,
py::smart_holder> (and the other class_ instantiation) which requires pybind11
v2.13+, so either pin/require that version in your build config (add an explicit
version constraint for lang.pybind11 in mkn.yaml such as >=2.13) or add a
build-time verification (check PYBIND11_VERSION_MAJOR/MINOR or equivalent) and
fail with a clear message if the installed pybind11 is older; update mkn.yaml to
explicitly request pybind11 >=2.13 and/or add the version check near the module
initialization that defines FunctionSupport to guarantee compatibility.
🧹 Nitpick comments (3)
test/poc/jit/requirements.txt (1)
1-2: Consider pinning dependency versions. Unpinned dependencies can lead to non-reproducible environments if APIs change. Even for a PoC, minimum version constraints (e.g., numpy>=1.24) would help avoid surprises.

inc/mkn/gpu/cuda.hpp (1)
94-94: `static inline` gives each TU its own copy; consider `inline` only.

`static` gives internal linkage, so every translation unit that includes this header calls `getWarpSize()` independently and gets its own `warp_size` instance. If the intent is a single program-wide constant (typical for header-only libs in C++17+), use just `inline` without `static`:

```diff
-static std::uint32_t inline const warp_size = getWarpSize();
+inline std::uint32_t const warp_size = getWarpSize();
```

This ensures a single instance across TUs via the C++17 inline variable rule.
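For illustration, a minimal multi-file sketch of that rule (the file names and the stubbed getWarpSize are hypothetical stand-ins for the real header):

```cpp
// warp.hpp -- hypothetical stand-in for inc/mkn/gpu/cuda.hpp
#pragma once
#include <cstdint>
inline std::uint32_t getWarpSize() { return 32; }      // stub for the CUDA query
inline std::uint32_t const warp_size = getWarpSize();  // C++17 inline variable

// a.cpp
#include "warp.hpp"
std::uint32_t const* addr_a() { return &warp_size; }

// b.cpp
#include "warp.hpp"
std::uint32_t const* addr_b() { return &warp_size; }

// main.cpp -- link all three TUs together
#include <cassert>
#include <cstdint>
std::uint32_t const* addr_a();
std::uint32_t const* addr_b();
int main() {
  // With `inline`, both TUs refer to the same object; with `static` each TU
  // would hold (and initialize) its own private copy, so this could fail.
  assert(addr_a() == addr_b());
}
```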
test/poc/jit/eg.py (1)
27-28: Redundant `s.print()` call; constructor already prints twice.

The `FunctionSupport` constructor (in pybind.cpp lines 28-30) already calls `print()` twice (before and after `init`). Line 28 here is a third call that prints the same post-init state. This is harmless in a PoC but adds noise to the output.
| "python-black": { | ||
| "format_on_save": "on", | ||
| }, |
Trailing comma produces invalid JSON.
Line 18 has a trailing comma after "on" which is not valid in strict JSON. Some editors tolerate this, but it may cause parse errors in others.
Proposed fix

```diff
 "python-black": {
-  "format_on_save": "on",
-},
+  "format_on_save": "on"
+}
```
🤖 Prompt for AI Agents
In @.sublime-project around lines 16 - 18, The JSON for the "python-black"
object contains a trailing comma after the "format_on_save" value which makes
the file invalid; remove the trailing comma after "format_on_save": "on" (i.e.,
ensure the "python-black" object ends with a closing brace and no trailing
comma) so the file becomes strict JSON-parsable while keeping the "python-black"
and "format_on_save" keys intact.
```cpp
cudaDeviceProp devProp;
[[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
return devProp.warpSize;
```
Unchecked error from cudaGetDeviceProperties.
If cudaGetDeviceProperties fails (e.g., invalid device index, no GPU present), ret is silently discarded via [[maybe_unused]] and devProp.warpSize will contain uninitialized data. Consider using MKN_GPU_ASSERT here or returning a sensible default on failure.
Proposed fix

```diff
-  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
-  return devProp.warpSize;
+  MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, dev));
+  return devProp.warpSize;
```
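If asserting is too aggressive for library code, the fallback the AI prompt mentions could look like this sketch (the signature mirrors the reviewed function as far as shown; the default of 32 is an assumption matching current NVIDIA hardware, not a queried value):

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Sketch: return a conventional default instead of asserting on failure.
std::uint32_t getWarpSize(int const dev = 0) {
  cudaDeviceProp devProp;
  if (cudaGetDeviceProperties(&devProp, dev) != cudaSuccess) return 32;
  return static_cast<std::uint32_t>(devProp.warpSize);
}
```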
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda.hpp` around lines 88 - 90, The call to
cudaGetDeviceProperties in the function (using local cudaDeviceProp devProp and
returning devProp.warpSize) ignores its return value, risking uninitialized
warpSize on failure; check the returned cudaError_t and handle errors by
invoking MKN_GPU_ASSERT (or similar error macro) or returning a safe default
(e.g., 32) when cudaGetDeviceProperties fails. Update the code around
cudaGetDeviceProperties to capture the return value, assert on non-success via
MKN_GPU_ASSERT(cudaGetDeviceProperties(...) == cudaSuccess, "msg") or branch to
return a sensible default, ensuring devProp.warpSize is only used when the call
succeeded.
```cpp
#include <algorithm>
```
🧩 Analysis chain
🏁 Script executed:
```bash
cat -n test/any/async_streaming.cpp | head -30
```

Repository: mkn/mkn.gpu
Length of output: 989
🏁 Script executed:
```bash
rg -n 'std::this_thread' test/any/async_streaming.cpp
```

Repository: mkn/mkn.gpu
Length of output: 292
🏁 Script executed:
```bash
rg -n '#include' test/any/async_streaming.cpp | head -20
```

Repository: mkn/mkn.gpu
Length of output: 199
Add explicit #include <thread> header.
The file uses std::this_thread::sleep_for at lines 38, 70, 103, 138, and 173 but does not directly include <thread>. This relies on transitive inclusion from one of the project headers, which is fragile and non-portable across different compilers and standard library implementations.
Proposed fix

```diff
 #include <algorithm>
+#include <thread>
```
🤖 Prompt for AI Agents
In `@test/any/async_streaming.cpp` around lines 8 - 9, The test uses
std::this_thread::sleep_for (seen at usages around lines for the async streaming
tests) but doesn't include <thread>, relying on transitive includes; add an
explicit `#include` <thread> at the top of async_streaming.cpp so
std::this_thread::sleep_for is available portably and no longer depends on
indirect headers.
```python
@cuda.jit
def vadd(i, a, b, c):
    c[i] = a[i] + b[i]
```
🧩 Analysis chain
🌐 Web query:
Numba CUDA @cuda.jit device=True device function kernel calling conventions
💡 Result:
In Numba CUDA, a @cuda.jit(device=True) function is a CUDA device function (roughly analogous to __device__ in CUDA C/C++): it can only be called from GPU code (a kernel or another device function), and it is invoked with a normal Python call expression inside the kernel (not with the kernel[grid, block](...) launch syntax). Device functions may return a value; kernels may not. [1][2]
Defining + calling a device function (the “calling convention”)
```python
from numba import cuda

@cuda.jit(device=True)
def bar(a, b):
    return a + b

@cuda.jit
def kern(x, y, out):
    i = cuda.grid(1)
    out[i] = bar(x[i], y[i])  # normal call (device-side)
```

This is the core convention: call like a regular function from within GPU code. [1]
Signatures / specialization
- You can supply an explicit signature for a device function (and kernel), e.g. 'int32(int32, int32)', but you don't have to; Numba can specialize based on argument types (and caches compiled variants). An example of a device signature is shown in Numba's CUDA JIT docs. [3]
- Kernel arguments support device arrays (CUDA Array Interface), NumPy arrays (with host↔device copying caveats), scalars, and records; device functions use the same underlying type system when called from kernels. [2]
Inlining / call overhead controls
- `@cuda.jit` exposes inlining controls (`inline="always"` at Numba IR level, `forceinline=True` at NVVM IR level). [2]
- Practically, many small device helpers get inlined by compilation/optimization anyway; Numba community guidance often treats device-function calls as effectively "inlined into the caller" from a performance perspective. [4]
What you cannot do (common confusion)
- No dynamic parallelism: you cannot launch a kernel from inside a kernel in Numba (i.e., no device-side kernel launches). [5][6]
Also: calling non-cuda.jit(device=True) helpers
Numba can also compile a regular @numba.jit/@njit helper for the GPU when it’s called from a CUDA kernel (subject to CUDA-Python feature restrictions). [7]
Sources: Numba CUDA “Writing Device Functions” and kernel API docs, plus related Numba docs/discussions. [1][2][4][5][6][7]
[1] (nvidia.github.io)
[2] (nvidia.github.io)
[3] (numba.pydata.org)
[4] (numba.discourse.group)
[5] (nvidia.github.io)
[6] (numba.discourse.group)
[7] (developer.nvidia.com)
Citations:
- 1: https://nvidia.github.io/numba-cuda/user/device-functions.html
- 2: https://nvidia.github.io/numba-cuda/reference/kernel.html
- 3: https://numba.pydata.org/numba-doc/0.13/CUDAJit.html?utm_source=openai
- 4: https://numba.discourse.group/t/usage-of-cuda-python-linear-algebra-on-gpu-and-computational-code/1064?utm_source=openai
- 5: https://nvidia.github.io/numba-cuda/user/kernels.html?utm_source=openai
- 6: https://numba.discourse.group/t/kernel-within-a-kernel/1582?utm_source=openai
- 7: https://developer.nvidia.com/blog/seven-things-numba/?utm_source=openai
🏁 Script executed:
```bash
cat test/poc/jit/eg.py
```

Repository: mkn/mkn.gpu
Length of output: 606
vadd must be a device function to be callable from vector_add_gpu.
@cuda.jit without arguments declares a kernel, which cannot be called from another kernel in Numba CUDA. Since vadd is invoked on line 26 from within vector_add_gpu, it must be marked as a device function.
Fix

```diff
-@cuda.jit
+@cuda.jit(device=True)
 def vadd(i, a, b, c):
     c[i] = a[i] + b[i]
```

🤖 Prompt for AI Agents
In `@test/poc/jit/eg.py` around lines 17 - 19, The vadd function is currently
decorated as a kernel (`@cuda.jit`) but is called from inside vector_add_gpu, so
change its decorator to a device function by using `@cuda.jit`(device=True) (i.e.,
make vadd a device function) so it can be invoked from the kernel; keep the
function name vadd and its parameter list unchanged and ensure vector_add_gpu
continues to call vadd from within the kernel.
```cpp
py::array_t<float> make(auto p) {
  return {{32}, {sizeof(float)}, p, py::capsule(p, [](void* f) { /* noop */ })};
}
```
Dangling pointer risk: noop capsule means Python arrays outlive the backing memory.
The make method wraps a raw CUDA-managed pointer into a py::array_t with a no-op destructor capsule. If Python code holds a reference to the returned array after the FunctionSupport object is destroyed, accessing the array will hit freed memory (use-after-free). In a PoC this may be fine, but for anything beyond that, consider tying the array's lifetime to the parent object so the backing memory cannot be freed while the array is still reachable.
One approach: pass a reference to the parent py::object as the array's base so NumPy keeps FunctionSupport alive.
💡 Sketch: tie array lifetime to parent

```diff
-  py::array_t<float> make(auto p) {
-    return {{32}, {sizeof(float)}, p, py::capsule(p, [](void* f) { /* noop */ })};
+  py::array_t<float> make(auto p, py::object self) {
+    // Use `self` as the base object so NumPy keeps FunctionSupport alive
+    return py::array_t<float>({32}, {sizeof(float)}, p, self);
   }
```

Then pass py::object from each accessor:

```diff
-  py::array_t<float> A() { return make(a); }
+  py::array_t<float> A(py::object self) { return make(a, self); }
```

🤖 Prompt for AI Agents
In `@test/poc/jit/pybind.cpp` around lines 44 - 46, The wrapper creates a
py::array_t<float> in make(auto p) with a no-op capsule, risking use-after-free
if the returned array (backed by CUDA-managed memory) outlives its parent; change the capsule
to hold a reference to the parent Python object (e.g., the FunctionSupport
py::object) or otherwise set the array base to that parent so NumPy keeps the
parent alive. Locate make(auto p) and modify the capsule construction to capture
and incref (hold) the parent py::object (or pass the parent as the array base)
instead of a no-op destructor, ensuring the parent’s lifetime is extended until
the array is freed. Ensure the symbol FunctionSupport (or the accessor that
calls make) provides the py::object parent to make so the returned
py::array_t<float> keeps the parent alive.
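For the capsule-based variant the prompt mentions, a rough sketch (parameter types and names are assumptions, not the project's actual code) could hold a heap-allocated reference to the parent:

```cpp
#include <pybind11/numpy.h>
namespace py = pybind11;

// Sketch: the capsule owns a py::object referencing the parent, keeping its
// refcount above zero until NumPy frees the array and the capsule destructor
// deletes (and thereby decrefs) the held reference.
py::array_t<float> make(float* p, py::object parent) {
  auto* keep_alive = new py::object(std::move(parent));
  py::capsule base(keep_alive,
                   [](void* o) { delete static_cast<py::object*>(o); });
  return py::array_t<float>({32}, {sizeof(float)}, p, base);
}
```

Either way, the key point is the same: the array's base must own something that keeps the backing allocation alive.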
```cpp
PYBIND11_MODULE(poc_pyb, m) {
  py::class_<FunctionSupport, py::smart_holder>(m, "FunctionSupport")
      .def(py::init<>())
      .def("print", &FunctionSupport::print)
      .def_readwrite("a", &FunctionSupport::a)
      .def("A", &FunctionSupport::A)
      .def_readwrite("b", &FunctionSupport::b)
      .def("B", &FunctionSupport::B)
      .def_readwrite("a", &FunctionSupport::a)
      .def("C", &FunctionSupport::C);
```
Duplicate binding: Line 65 binds "a" again instead of "c".
Line 61 already binds def_readwrite("a", &FunctionSupport::a). Line 65 repeats the exact same binding. This almost certainly should be "c" to expose the c pointer, matching the pattern for a and b.
🐛 Proposed fix

```diff
       .def_readwrite("b", &FunctionSupport::b)
       .def("B", &FunctionSupport::B)
-      .def_readwrite("a", &FunctionSupport::a)
+      .def_readwrite("c", &FunctionSupport::c)
       .def("C", &FunctionSupport::C);
```
🤖 Prompt for AI Agents
In `@test/poc/jit/pybind.cpp` around lines 57 - 66, The PYBIND11_MODULE binding
for class FunctionSupport contains a duplicate def_readwrite("a",
&FunctionSupport::a) where the third member should be exposed; update the
duplicate binding to def_readwrite("c", &FunctionSupport::c) so the class
exposes members a, b, and c correctly and keep the other method bindings (print,
A, B, C) unchanged.
```cpp
};

PYBIND11_MODULE(poc_pyb, m) {
  py::class_<FunctionSupport, py::smart_holder>(m, "FunctionSupport")
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check pybind11 version references in configuration files
echo "=== Pybind11 version references ==="
rg -rn 'pybind11' --glob='*.yaml' --glob='*.yml' --glob='*.txt' --glob='*.cmake' --glob='CMakeLists.txt' -C2
```

Repository: mkn/mkn.gpu
Length of output: 296
🏁 Script executed:
```bash
#!/bin/bash
# Check CMakeLists.txt for pybind11 version specifications
echo "=== CMakeLists.txt content (if present) ==="
find . -name 'CMakeLists.txt' -type f -exec head -50 {} + 2>/dev/null

# Check setup.py or pyproject.toml for dependencies
echo -e "\n=== Setup/config files ==="
find . -maxdepth 3 -name 'setup.py' -o -name 'pyproject.toml' -o -name 'requirements.txt' | head -10
```

Repository: mkn/mkn.gpu
Length of output: 126
🏁 Script executed:
```bash
#!/bin/bash
# Check smart_holder usage in the codebase
echo "=== Smart_holder usage ==="
rg -rn 'smart_holder' -C3
```

Repository: mkn/mkn.gpu
Length of output: 813
🏁 Script executed:
```bash
#!/bin/bash
# Check the specific file context around lines 58 and 69
echo "=== Context around pybind.cpp lines 58 and 69 ==="
cat -n test/poc/jit/pybind.cpp | sed -n '50,80p'
```

Repository: mkn/mkn.gpu
Length of output: 977
🏁 Script executed:
```bash
#!/bin/bash
# Read the full mkn.yaml file to see pybind11/dependency specifications
echo "=== Full mkn.yaml content ==="
find . -name '*.yaml' -o -name '*.yml' | xargs cat 2>/dev/null | head -100
```

Repository: mkn/mkn.gpu
Length of output: 2280
🏁 Script executed:
```bash
#!/bin/bash
# Search for any pybind11 version specifications across all files
echo "=== Pybind11 version specifications ==="
rg -rn 'pybind11.*2\.'
rg -rn 'pybind11.*version'
```

Repository: mkn/mkn.gpu
Length of output: 94
🏁 Script executed:
```bash
#!/bin/bash
# Look for dependency resolution in mkn related files
echo "=== MKN related config files ==="
find . -maxdepth 3 -type f \( -name 'mkn*' -o -name '*.mkn' \) | head -20
```

Repository: mkn/mkn.gpu
Length of output: 98
🏁 Script executed:
```bash
#!/bin/bash
# Read the mkn.yaml in test/poc/jit directory
echo "=== test/poc/jit/mkn.yaml ==="
cat test/poc/jit/mkn.yaml
```

Repository: mkn/mkn.gpu
Length of output: 348
🏁 Script executed:
```bash
#!/bin/bash
# Search for any pybind11 include or dependency statements
echo "=== Pybind11 includes in codebase ==="
rg -rn '#include.*pybind11'

# Check if there are any header files or submodule references
echo -e "\n=== Git submodules ==="
cat .gitmodules 2>/dev/null || echo "No submodules file found"
```

Repository: mkn/mkn.gpu
Length of output: 278
🏁 Script executed:
```bash
#!/bin/bash
# Look for any CI/build documentation that might specify pybind11 version
echo "=== README or documentation files ==="
find . -maxdepth 2 -type f \( -name 'README*' -o -name 'CONTRIBUTING*' \) -exec head -50 {} + 2>/dev/null

# Check for any version pins in GitHub workflows
echo -e "\n=== Full workflow file ==="
find . -path './.github/workflows/*.yml' -o -path './.github/workflows/*.yaml' 2>/dev/null | xargs cat 2>/dev/null
```

Repository: mkn/mkn.gpu
Length of output: 2389
Ensure pybind11 version >= 2.13 is explicitly specified or verified.
The code uses py::smart_holder at lines 58 and 69, which requires pybind11 v2.13 or later. The mkn.yaml dependency specifies lang.pybind11 without explicit version pinning. Verify that mkn's lang.pybind11 module provides v2.13+, or add an explicit version constraint to the mkn.yaml configuration to guarantee compatibility.
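A minimal build-time guard using pybind11's standard version macros might look like this sketch (the error wording and placement are illustrative):

```cpp
#include <pybind11/pybind11.h>

// Fail compilation with a clear message if pybind11 predates py::smart_holder
// support (merged into mainline pybind11 in v2.13).
#if PYBIND11_VERSION_MAJOR < 2 || \
    (PYBIND11_VERSION_MAJOR == 2 && PYBIND11_VERSION_MINOR < 13)
#error "This module uses py::smart_holder and requires pybind11 >= 2.13"
#endif
```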
🤖 Prompt for AI Agents
In `@test/poc/jit/pybind.cpp` at line 58, The binding uses py::smart_holder in the
py::class_<FunctionSupport, py::smart_holder> (and the other class_
instantiation) which requires pybind11 v2.13+, so either pin/require that
version in your build config (add an explicit version constraint for
lang.pybind11 in mkn.yaml such as >=2.13) or add a build-time verification
(check PYBIND11_VERSION_MAJOR/MINOR or equivalent) and fail with a clear message
if the installed pybind11 is older; update mkn.yaml to explicitly request
pybind11 >=2.13 and/or add the version check near the module initialization that
defines FunctionSupport to guarantee compatibility.
Summary by CodeRabbit: Release Notes covering New Features, Bug Fixes, and Chores.