Conversation
Note: Reviews paused — it looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can change this behavior in the CodeRabbit settings, or use the review commands to resume or manage reviews.
📝 Walkthrough

Modularized GPU backends: split monolithic CUDA/ROCm headers into def/api/cls and .ipp pieces, moved device helpers to shared any/ files, added an nvcc build config, enabled the CPU test flag, updated the CI curl invocation and compiler target, and adjusted tests/benchmarks and .gitignore.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Host as Host (code/tests)
  participant API as MKN_GPU_NS (def/api/cls)
  participant Runtime as "CUDA/HIP Runtime / Device"
  Host->>API: request alloc / send / launch / stream ops
  API->>Runtime: call alloc / memcpy / launch (cuda/hip APIs)
  Runtime-->>API: async completion (stream/event)
  API-->>Host: return handles / status / results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
@coderabbitai are you set up for this repo?
Tip: For best results, initiate chat on the files or code changes.
Would you like me to:
Just let me know how I can assist! 🐰
Tip: For best results, initiate chat on the files or code changes.
How can I help you with this PR? Would you like me to:
Let me know what you need! 🐰
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/cuda.hpp`:
- Around line 84-95: Make getWarpSize inline to avoid ODR violations when
included from headers, and add proper CUDA error handling: in the inline
std::uint32_t getWarpSize(size_t dev = 0) function initialize a cudaDeviceProp
variable before use, call cudaGetDeviceProperties and check its return value
(use the project's MKN_GPU_ASSERT macro or return a safe default like 32 /
throw/abort on failure), and only access devProp.warpSize on success; also
ensure the static std::uint32_t const warp_size = getWarpSize(); usage remains
valid after these safeguards.
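A minimal sketch of the shape this fix could take — the 32-lane fallback and the exact error-handling choice are assumptions, not the project's settled approach:

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// inline avoids ODR violations when this ends up in a header included by many TUs
inline std::uint32_t getWarpSize(size_t dev = 0) {
  cudaDeviceProp devProp{};  // value-initialize before use
  if (cudaGetDeviceProperties(&devProp, static_cast<int>(dev)) != cudaSuccess)
    return 32;  // hypothetical safe default; the project may prefer MKN_GPU_ASSERT here
  return devProp.warpSize;
}

static std::uint32_t const warp_size = getWarpSize();
```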
In `@res/mkn/clang_cuda.yaml`:
- Around line 18-31: The compiler and linker entries currently invoke the
generic clang++ which can pick older toolchains; update the "compiler" and
"linker" fields so they explicitly call clang++-17 (i.e. replace uses of
"clang++" in the compiler: and linker: settings with "clang++-17") to ensure the
recommended clang 17 toolchain is used for the cxx_cuda build and CUDA SM_80
flags (references: compiler, linker, cxx_cuda).
In `@test/poc/jit/eg.py`:
- Around line 17-24: vadd is declared as a CUDA kernel but is invoked from
inside another kernel (vector_add_gpu), which Numba disallows; change vadd to a
device function by using `@cuda.jit`(device=True) instead of a kernel decorator so
it can be called from vector_add_gpu, keep the same signature (index i and
arrays a, b, c) and ensure vector_add_gpu calls vadd(...) as a regular device
function rather than launching it.
In `@test/poc/jit/pybind.cpp`:
- Around line 61-66: The pybind registration mistakenly repeats
.def_readwrite("a", &FunctionSupport::a) instead of exposing member c; update
the duplicate binding in the chain that registers FunctionSupport so it uses
.def_readwrite("c", &FunctionSupport::c) (keep the existing
.def("A",&FunctionSupport::A), .def_readwrite("b",&FunctionSupport::b) and
.def("C",&FunctionSupport::C) intact) so the struct's c member is correctly
exposed to Python.
- Around line 44-46: The make function currently takes an auto parameter and
returns a py::array_t<float> with a no-op capsule, which causes two problems:
'auto' in the parameter requires C++20 and the no-op capsule does not extend the
owning FunctionSupport lifetime (risking use-after-free). Change the signature
of make to accept an explicit float* (replace auto p with float* p) and when
constructing the py::array_t<float> pass py::cast(this) as the base object so
the Python wrapper for this FunctionSupport instance is held alive for the
array's lifetime, removing the no-op capsule.
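For that last item, a hedged sketch of the suggested signature — the `data`/`n` members are hypothetical stand-ins, and only the explicit `float*` parameter plus the `py::cast(this)` base argument are the points being illustrated:

```cpp
#include <pybind11/numpy.h>
namespace py = pybind11;

struct FunctionSupport {
  float* data = nullptr;  // hypothetical backing buffer
  std::size_t n = 0;      // hypothetical element count

  py::array_t<float> make(float* p) {  // explicit float*, no C++20 'auto' parameter
    // passing py::cast(this) as the base keeps this FunctionSupport's Python
    // wrapper alive for as long as the returned array exists
    return py::array_t<float>({n}, {sizeof(float)}, p, py::cast(this));
  }
};
```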
🧹 Nitpick comments (1)
res/mkn/nvcc.yaml (1)
5-19: Consider making the CUDA arch explicit/configurable.
Without an arch flag, nvcc defaults can vary across environments and hurt reproducibility/perf. A `cuda_arch` property (or env override) keeps builds consistent.

Example (adjust to mkn interpolation rules)

```diff
 property:
-  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
-  cxx_cuda: -x cu --extended-lambda #-x cuda --cuda-gpu-arch="sm_80" -Xclang -fcuda-allow-variadic-functions
+  cuda_arch: sm_80
+  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
+  cxx_cuda: -x cu --extended-lambda --gpu-architecture="${cuda_arch}"
```
@copilot can you review?
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
inc/mkn/gpu/cuda.hpp (2)
65-76: ⚠️ Potential issue | 🟠 Major — Inconsistent error handling: `cudaGetDeviceProperties` not checked with `MKN_GPU_ASSERT`.

In `rocm.hpp` (line 63), `hipGetDeviceProperties` is wrapped with `MKN_GPU_ASSERT`, but here the return value is silently discarded via `[[maybe_unused]]`. If the call fails, subsequent reads from `devProp` are undefined behavior on uninitialized memory.

Suggested fix

```diff
 void inline prinfo(size_t dev = 0) {
-  cudaDeviceProp devProp;
-  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
+  cudaDeviceProp devProp{};
+  MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, dev));
```
78-87: ⚠️ Potential issue | 🟡 Minor — Missing error check on `cudaMemGetInfo`.

Same inconsistency: `rocm.hpp` (line 87) wraps `hipMemGetInfo` with `MKN_GPU_ASSERT`, but here the return value is silently discarded. A failed call would leave `free_t`/`total_t` at zero, producing misleading output rather than a clear error.

Suggested fix

```diff
-  cudaMemGetInfo(&free_t, &total_t);
+  MKN_GPU_ASSERT(cudaMemGetInfo(&free_t, &total_t));
```
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/any/inc/devfunc.ipp`:
- Around line 2-9: The zero function currently partitions by warpSize but
indexes with threadIdx.x which overflows when blockDim.x > warpSize; change the
loop and tail calculation to use the actual active thread width (e.g., use auto
const width = blockDim.x; compute each = size / width; loop writes t[chunk *
width + threadIdx.x] with a bounds check if threadIdx.x < size - (width *
each)), and keep the final conditional guard (threadIdx.x < remaining) and the
optional __syncthreads() when sync==true; alternatively, if zero() is only valid
at warp granularity, add a runtime assertion (e.g., assert(blockDim.x ==
warpSize)) or documentation and enforce it before using threadIdx.x to prevent
out-of-bounds. Ensure references: function zero, threadIdx.x, warpSize,
blockDim.x, and the sync template parameter.
In `@inc/mkn/gpu/cuda/cls.hpp`:
- Around line 20-30: The Stream struct currently allows implicit copies which
can lead to double-destroy of the cudaStream_t; make Stream non-copyable and
either non-movable or properly movable (delete copy ctor and copy assignment;
implement or default a move ctor/move assignment that transfers the stream
handle and clears the source) so ownership is unique, and update the destructor
to only destroy a valid handle. Also change sync() to assign to result and
assert/check it the same way the constructor/destructor do (use MKN_GPU_ASSERT
on the result returned by cudaStreamSynchronize) to surface runtime errors;
reference the Stream ctor, ~Stream, and sync methods and the result member when
making these changes.
- Around line 48-61: The field StreamEvent::fin is a plain bool causing a data
race between Callback (called on a CUDA callback thread) and reads via
StreamEvent::finished(); change fin to std::atomic<bool> and update finished()
to return fin.load(std::memory_order_acquire) and update Callback to set
fin.store(true, std::memory_order_release); also include the <atomic> header and
ensure any write/read of fin uses the atomic operations instead of direct
assignment or plain reads.
In `@inc/mkn/gpu/cuda/def.hpp`:
- Around line 12-16: Replace the bare preprocessor check for MKN_GPU_FN_PER_NS
in the CUDA def.hpp (the block that defines MKN_GPU_NS) with the same safe
pattern used in rocm/def.hpp: use `#if` defined(MKN_GPU_FN_PER_NS) &&
MKN_GPU_FN_PER_NS so the undefined-macro case doesn't trigger -Wundef; keep the
existing `#else` branch that defines MKN_GPU_NS as mkn::gpu and the `#endif` comment
intact.
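For reference, the guarded pattern could look roughly like this — the per-namespace value (`mkn::gpu::cuda`) is an assumption; only the `#else` branch is confirmed by the prompt:

```cpp
#if defined(MKN_GPU_FN_PER_NS) && MKN_GPU_FN_PER_NS
#define MKN_GPU_NS mkn::gpu::cuda  // assumed per-namespace value
#else
#define MKN_GPU_NS mkn::gpu
#endif  // MKN_GPU_FN_PER_NS
```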
In `@inc/mkn/gpu/def.hpp`:
- Around line 28-30: The MKN_GPU_CPU macro isn't always defined which triggers
-Wundef warnings; define it to 0 by default and then set it to 1 only when CUDA
and ROCm are both absent. Specifically, add a default definition like "#ifndef
MKN_GPU_CPU / `#define` MKN_GPU_CPU 0 / `#endif`" (mirroring how MKN_GPU_CUDA and
MKN_GPU_ROCM are handled) and keep the existing conditional that sets
MKN_GPU_CPU to 1 when MKN_GPU_CUDA == 0 && MKN_GPU_ROCM == 0 so all translation
units can safely use "#if MKN_GPU_CPU".
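A sketch of the defaulting order described above, assuming `MKN_GPU_CUDA` and `MKN_GPU_ROCM` are already defaulted earlier in the header:

```cpp
#if MKN_GPU_CUDA == 0 && MKN_GPU_ROCM == 0
#define MKN_GPU_CPU 1
#endif

#ifndef MKN_GPU_CPU
#define MKN_GPU_CPU 0  // default so "#if MKN_GPU_CPU" never trips -Wundef
#endif
```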
In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 16-26: Stream::sync currently assigns hipStreamSynchronize(stream)
to result but doesn't check it; make it consistent with the
constructor/destructor by asserting the result using MKN_GPU_ASSERT. Replace the
current body of Stream::sync so it invokes MKN_GPU_ASSERT(result =
hipStreamSynchronize(stream)); ensuring synchronization failures are caught via
the existing assertion mechanism (referencing Stream::sync, result,
MKN_GPU_ASSERT, and hipStreamSynchronize).
- Around line 37-49: The callback currently invokes an empty std::function and
can throw std::bad_function_call; in StreamEvent::operator() (where _fn is set
and hipStreamAddCallback is called) ensure _fn is never left empty by assigning
a no-op lambda when fn is empty (or alternatively change StreamEvent::Callback
to check if self._fn is callable before invoking it), then preserve the existing
fin and reset behavior (i.e., set _fn = []{}; fin = 1) — update either
operator() to do _fn = fn ? fn : []{} or add a guard in Callback before calling
_fn() to avoid calling an empty target.
- Around line 90-98: send_async currently applies the offset to both device and
host pointers (p + start, t + start) whereas the synchronous send in
rocm/def.hpp offsets only the device pointer; if start is intended as a
device-side offset, change the hipMemcpyAsync call in send_async so the
destination is p + start but the source remains t (not t + start), keeping the
Size start parameter semantics consistent with send and avoiding reading from
the wrong host memory; update any related comments or docstring for send_async
to state that start is a device offset.
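Sketched form of that call, assuming `start` is a device-side offset — `size`, `T`, and the stream argument are carried over from the surrounding signature and are assumptions here:

```cpp
// offset applies only to the device destination; the host source stays at t
MKN_GPU_ASSERT(hipMemcpyAsync(p + start, t, size * sizeof(T),
                              hipMemcpyHostToDevice, stream));
```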
In `@inc/mkn/gpu/rocm/def.hpp`:
- Around line 85-95: The ROCm overloads destroy and destroy_host currently take
T* by value which leaves the caller's pointer dangling (unlike the CUDA
counterparts destroy(T*& ptr) / destroy_host(T*& ptr)); change the signatures of
destroy and destroy_host to take T*& ptr, call KLOG(TRC), call
MKN_GPU_ASSERT(hipFree(ptr)) and MKN_GPU_ASSERT(hipHostFree(ptr)) respectively,
and then set ptr = nullptr so the caller's pointer is cleared—this matches the
behavior in cuda/def.hpp and prevents dangling pointers.
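A minimal sketch of the reference-taking overloads — KLOG/MKN_GPU_ASSERT usage mirrors the description above; treat it as illustrative rather than the final code:

```cpp
template <typename T>
void destroy(T*& ptr) {
  KLOG(TRC);
  MKN_GPU_ASSERT(hipFree(ptr));
  ptr = nullptr;  // clear the caller's pointer, matching cuda/def.hpp
}

template <typename T>
void destroy_host(T*& ptr) {
  KLOG(TRC);
  MKN_GPU_ASSERT(hipHostFree(ptr));
  ptr = nullptr;
}
```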
In `@res/mkn/clang_cuda.yaml`:
- Line 18: The commented example that currently references sm_61 is stale given
the active cxx_cuda setting uses sm_120; update the example comment to show
sm_120 (Blackwell) and explicitly note that sm_120 requires CUDA toolkit headers
12.8+ and that Clang on PATH must provide compatible CUDA support (e.g., Clang
versions earlier than full CUDA‑12.8 support like 20.1.0 may warn or be
incompatible). Edit the comment near the cxx_cuda example to reflect the new
default GPU arch (sm_120) and add a short compatibility note about requiring
CUDA 12.8+ headers and verifying Clang/toolchain compatibility.
In `@test/hip/async.cpp`:
- Line 94: The three hipMemset calls (e.g., hipMemset(a, 0, bytes)) are missing
error checks and are operating on host-pinned memory allocated via
hipHostMalloc; either replace each hipMemset with a plain memset(a, 0, bytes) to
avoid an unnecessary device dispatch, or if the hip call is intentional wrap
each call with checkHip(hipMemset(...)) to handle the returned hipError_t;
update all three occurrences (the hipMemset calls around where variable a and
bytes are used) consistently.
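Either option, sketched with the test's own names (`a`, `bytes`, `checkHip`):

```cpp
// Option 1: plain host memset — the buffer is pinned host memory, no device dispatch needed
memset(a, 0, bytes);

// Option 2: keep the HIP call but check the returned hipError_t
checkHip(hipMemset(a, 0, bytes));
```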
🧹 Nitpick comments (11)
test/any/managed.cpp (1)
88-90: Test coverage gap for the newzerodevice function.
test_zerowas removed here, and thezeroutility was moved toinc/mkn/gpu/any/inc/devfunc.ipp. Consider adding a replacement test that exercises the new modularizedzerofunction to maintain coverage.inc/mkn/gpu/any/cls.hpp (1)
13-17: EntireCli<Device>struct is duplicated ininc/mkn/gpu/env.hpp.Both
inc/mkn/gpu/any/cls.hppandinc/mkn/gpu/env.hppdefine identical template structures. Consider consolidating to a single definition to prevent them from drifting apart during future maintenance.test/hip/async.ppc (2)
36-44:checkHipsilently discards errors in release builds.In a test/benchmark program, silently ignoring HIP errors in release mode can mask failures and produce misleading timing results. Consider always checking, or at minimum logging, the error—even in non-debug builds.
Also, the commented-out
// return result;on line 43 is dead code from the original NVIDIA sample.
86-91:dummyEventis created and destroyed but never used.
dummyEvent(lines 86, 90, 149) is allocated and cleaned up but never recorded or synchronized against. It's vestigial from the original NVIDIA sample. Consider removing it to reduce confusion.Proposed fix
hipEvent_t startEvent, stopEvent, dummyEvent; + // dummyEvent is unused — consider removing it hipStream_t stream[nStreams]; checkHip(hipEventCreate(&startEvent)); checkHip(hipEventCreate(&stopEvent)); - checkHip(hipEventCreate(&dummyEvent));And in cleanup:
checkHip(hipEventDestroy(startEvent)); checkHip(hipEventDestroy(stopEvent)); - checkHip(hipEventDestroy(dummyEvent));inc/mkn/gpu/cuda/cls.hpp (1)
65-78:Pointeruses magic numbers forcudaMemoryType— prefer named constants.Lines 69-74 compare
attributes.typeagainst raw integers0,1,2,3. These correspond tocudaMemoryTypeUnregistered,cudaMemoryTypeHost,cudaMemoryTypeDevice, andcudaMemoryTypeManaged. Using the enum names improves readability and guards against enum value changes across CUDA versions.Proposed fix
- bool is_unregistered_ptr() const { return attributes.type == 0; } - bool is_host_ptr() const { - return attributes.type == 1 || (is_unregistered_ptr() && t != nullptr); - } - bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; } - bool is_managed_ptr() const { return attributes.type == 3; } + bool is_unregistered_ptr() const { return attributes.type == cudaMemoryTypeUnregistered; } + bool is_host_ptr() const { + return attributes.type == cudaMemoryTypeHost || (is_unregistered_ptr() && t != nullptr); + } + bool is_device_ptr() const { return is_managed_ptr() || attributes.type == cudaMemoryTypeDevice; } + bool is_managed_ptr() const { return attributes.type == cudaMemoryTypeManaged; }inc/mkn/gpu/cuda/api.hpp (1)
12-20: `idx()` may overflow `uint32_t` for large 3D grids.

The intermediate products `width * height` and `z * width * height` can overflow `uint32_t` when grid dimensions are large in multiple axes. The template parameter allows callers to specify `uint64_t`, but the default `uint32_t` could silently wrap. This is a known trade-off (matches standard CUDA patterns), just worth documenting for users of multi-dimensional grids.

inc/mkn/gpu/def.hpp (1)
34-39:CompileFlags::withCPUdoesn't use theMKN_GPU_CPUmacro.
withCUDAandwithROCMreference their respective macros, butwithCPUuses a hardcoded expression!MKN_GPU_ROCM and !MKN_GPU_CUDA. If a user explicitly definesMKN_GPU_CPU=1alongside a GPU backend (e.g., for testing),CompileFlags::withCPUwould reportfalse. For consistency and to respect user overrides:Proposed fix
struct CompileFlags { bool constexpr static withCUDA = MKN_GPU_CUDA; bool constexpr static withROCM = MKN_GPU_ROCM; - bool constexpr static withCPU = !MKN_GPU_ROCM and !MKN_GPU_CUDA; + bool constexpr static withCPU = MKN_GPU_CPU; bool constexpr static perNamespace = MKN_GPU_FN_PER_NS; };(This depends on defining
MKN_GPU_CPUto0when unused, as suggested above.)inc/mkn/gpu/cuda.hpp (1)
89-91: Commented-out include left in code.
device.ippis commented out. If this is intentional for the POC, a brief comment explaining why would help. If it's no longer needed, remove the line.inc/mkn/gpu/rocm.hpp (1)
45-58: Remove commented-out code blocks.Three full function templates (
_vector_fill, twofilloverloads) are commented out. These are still active incuda.hpp. If they're intended to be restored, track them via an issue; otherwise, remove the dead code to reduce noise.inc/mkn/gpu/cuda/def.hpp (1)
45-45:static inlineon namespace-scope variable:staticgives internal linkage, contradictinginline's ODR intent.
staticmeans each TU gets its own copy;inlineon a variable (C++17) is meant to ensure a single definition across TUs. Combined,staticwins and each TU initializes its ownwarp_sizeindependently, which is likely fine for a constant but semantically confusing. If you intend a single shared instance, dropstatic. If per-TU is intentional, dropinline.inc/mkn/gpu/rocm/cls.hpp (1)
56-56: Nit: preferfalseover0forboolinitialization.
bool fin = 0;works butbool fin = false;is more idiomatic C++.
```cpp
struct Stream {
  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }

  auto& operator()() { return stream; };

  void sync() { result = cudaStreamSynchronize(stream); }

  cudaError_t result;
  cudaStream_t stream;
};
```
Stream lacks copy/move protection — risk of double-destroying cudaStream_t.
Stream manages a CUDA stream resource via RAII (create in ctor, destroy in dtor), but does not delete or define copy/move operations. If a Stream is inadvertently copied, the destructor will call cudaStreamDestroy twice on the same handle, causing undefined behavior.
Also, sync() (line 26) sets result but never checks it, unlike the constructor and destructor which assert.
Proposed fix
struct Stream {
Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
+ Stream(Stream const&) = delete;
+ Stream& operator=(Stream const&) = delete;
+ Stream(Stream&&) = delete;
+ Stream& operator=(Stream&&) = delete;
auto& operator()() { return stream; };
- void sync() { result = cudaStreamSynchronize(stream); }
+ void sync() { MKN_GPU_ASSERT(result = cudaStreamSynchronize(stream)); }
cudaError_t result;
cudaStream_t stream;
};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

```cpp
struct Stream {
  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
  auto& operator()() { return stream; };
  void sync() { result = cudaStreamSynchronize(stream); }
  cudaError_t result;
  cudaStream_t stream;
};
```

Suggested:

```cpp
struct Stream {
  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
  Stream(Stream const&) = delete;
  Stream& operator=(Stream const&) = delete;
  Stream(Stream&&) = delete;
  Stream& operator=(Stream&&) = delete;
  auto& operator()() { return stream; };
  void sync() { MKN_GPU_ASSERT(result = cudaStreamSynchronize(stream)); }
  cudaError_t result;
  cudaStream_t stream;
};
```
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda/cls.hpp` around lines 20 - 30, The Stream struct currently
allows implicit copies which can lead to double-destroy of the cudaStream_t;
make Stream non-copyable and either non-movable or properly movable (delete copy
ctor and copy assignment; implement or default a move ctor/move assignment that
transfers the stream handle and clears the source) so ownership is unique, and
update the destructor to only destroy a valid handle. Also change sync() to
assign to result and assert/check it the same way the constructor/destructor do
(use MKN_GPU_ASSERT on the result returned by cudaStreamSynchronize) to surface
runtime errors; reference the Stream ctor, ~Stream, and sync methods and the
result member when making these changes.
```cpp
static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
  auto& self = *reinterpret_cast<StreamEvent*>(ptr);
  self._fn();
  self._fn = [] {};
  self.fin = 1;
}

bool finished() const { return fin; }

Stream& stream;
cudaError_t result;
std::function<void()> _fn;
bool fin = 0;
};
```
StreamEvent::fin is not atomic — data race with CUDA callback thread.
cudaStreamAddCallback fires Callback on a different thread. The callback writes fin = 1 (line 52) while the main thread may read finished() (line 55) concurrently. This is a data race (undefined behavior per the C++ memory model). Use std::atomic<bool> for fin.
Proposed fix
+#include <atomic>
// ...
- bool finished() const { return fin; }
+ bool finished() const { return fin.load(std::memory_order_acquire); }
// ...
- bool fin = 0;
+ std::atomic<bool> fin{false};And in Callback:
- self.fin = 1;
+ self.fin.store(true, std::memory_order_release);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

```cpp
static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
  auto& self = *reinterpret_cast<StreamEvent*>(ptr);
  self._fn();
  self._fn = [] {};
  self.fin = 1;
}
bool finished() const { return fin; }
Stream& stream;
cudaError_t result;
std::function<void()> _fn;
bool fin = 0;
};
```

Suggested:

```cpp
static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
  auto& self = *reinterpret_cast<StreamEvent*>(ptr);
  self._fn();
  self._fn = [] {};
  self.fin.store(true, std::memory_order_release);
}
bool finished() const { return fin.load(std::memory_order_acquire); }
Stream& stream;
cudaError_t result;
std::function<void()> _fn;
std::atomic<bool> fin{false};
};
```
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda/cls.hpp` around lines 48 - 61, The field StreamEvent::fin is
a plain bool causing a data race between Callback (called on a CUDA callback
thread) and reads via StreamEvent::finished(); change fin to std::atomic<bool>
and update finished() to return fin.load(std::memory_order_acquire) and update
Callback to set fin.store(true, std::memory_order_release); also include the
<atomic> header and ensure any write/read of fin uses the atomic operations
instead of direct assignment or plain reads.
```diff
 property:
   cxx_flags: --std=c++20 -fPIC -fsized-deallocation -Wno-unknown-cuda-version
-  cxx_cuda: -x cuda --cuda-gpu-arch="sm_61" -Xclang -fcuda-allow-variadic-functions
+  cxx_cuda: -x cuda --cuda-gpu-arch="sm_120" -Xclang -fcuda-allow-variadic-functions
```
🧩 Analysis chain
🏁 Script executed:
`cat -n res/mkn/clang_cuda.yaml`

Repository: mkn/mkn.gpu
Length of output: 1158
🌐 Web query:
CUDA sm_120 compute capability Blackwell Nvidia
💡 Result:
- `sm_120` is NVIDIA's SASS (binary) code target for CUDA Compute Capability 12.0, which corresponds to Blackwell "RTX" client/workstation GPUs (e.g., GeForce RTX 50xx). [1][2]
- CUDA Toolkit support: CUDA 12.8 added compiler support for SM_120 (along with SM_100/SM_101 for other Blackwell variants). [1]
- Blackwell is split across multiple compute capabilities: for example, Blackwell B200/B300 are Compute Capability 10.0 (SM_100), while Blackwell RTX parts are 12.0 (SM_120). [2]
- PTX targeting note: newer PTX ISA documentation includes family-specific targets like
`sm_120f` (family-forward within the Blackwell family) and also mentions related targets (e.g., `sm_121`). [3]
Sources:
[1] CUDA 12.8 Features Archive (NVIDIA) (docs.nvidia.com)
[2] NVIDIA CUTLASS architecture/compute-capability table (docs.nvidia.com)
[3] PTX ISA 8.8 documentation (NVIDIA) (docs.nvidia.com)
Citations:
- 1: https://docs.nvidia.com/cuda/archive/12.8.0/cuda-features-archive/index.html?utm_source=openai
- 2: https://docs.nvidia.com/cutlass/4.2.1/overview.html?utm_source=openai
- 3: https://docs.nvidia.com/cuda/archive/13.0.0/hopper-tuning-guide/parallel-thread-execution/index.html?utm_source=openai
🌐 Web query:
Clang CUDA 12.8 support version requirements
💡 Result:
-
If you mean “Clang compiling CUDA code using the CUDA Toolkit headers/libraries”: as of Clang/LLVM 20.1.0, Clang’s officially supported CUDA SDK version is up to CUDA 12.6 (anything newer is not guaranteed and may be treated as “newer-than-supported” / partially supported). [1]
- LLVM’s “Compiling CUDA with clang” doc also notes that when Clang sees a newer CUDA Toolkit than it supports, it will warn and try to proceed as the newest supported version (the exact max version stated there may lag behind current Clang releases). [2]
-
If you mean “CUDA 12.8’s nvcc host-compiler requirements on Linux”: NVIDIA lists supported host Clang versions 7.x through 19.x for CUDA 12.8 on x86_64 and aarch64 (major-version checked). [3]
-
Practical datapoint (hipify-clang/ROCm tooling): for CUDA 12.8.1, AMD’s hipify-clang documentation maps support to LLVM/Clang 20.1.x (20.1.0 through 20.1.8). [4]
Sources: [1] Clang 20.1.0 release notes, CUDA support section; [2] LLVM “Compiling CUDA with clang” doc; [3] NVIDIA CUDA 12.8 Linux installation guide (supported compilers table); [4] ROCm hipify-clang dependencies table.
Update stale comment and clarify CUDA toolchain requirements for sm_120 Blackwell support.
Line 6 contains an outdated example referencing sm_61 when the active config (line 18) targets sm_120 (Blackwell RTX). Additionally, sm_120 requires CUDA 12.8+ headers; verify that the Clang version on PATH provides this support (Clang 20.1.0 officially supports only CUDA 12.6, though it will attempt to proceed with a warning when given newer toolkits). Update the commented example on line 6 to reflect the current sm_120 default or document the version constraints more explicitly.
🤖 Prompt for AI Agents
In `@res/mkn/clang_cuda.yaml` at line 18, The commented example that currently
references sm_61 is stale given the active cxx_cuda setting uses sm_120; update
the example comment to show sm_120 (Blackwell) and explicitly note that sm_120
requires CUDA toolkit headers 12.8+ and that Clang on PATH must provide
compatible CUDA support (e.g., Clang versions earlier than full CUDA‑12.8
support like 20.1.0 may warn or be incompatible). Edit the comment near the
cxx_cuda example to reflect the new default GPU arch (sm_120) and add a short
compatibility note about requiring CUDA 12.8+ headers and verifying
Clang/toolchain compatibility.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
inc/mkn/gpu/cuda.hpp (1)
65-76:⚠️ Potential issue | 🟡 Minor
prinfosilently ignorescudaGetDevicePropertieserrors, unlike the ROCm counterpart.In
inc/mkn/gpu/rocm.hpp:63, the same function usesMKN_GPU_ASSERT(hipGetDeviceProperties(...)). Here, the result is[[maybe_unused]], which means a failed query silently reads uninitializeddevProp. Consider aligning with the ROCm version.Suggested fix
void inline prinfo(size_t dev = 0) { cudaDeviceProp devProp; - [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev); + MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, dev));inc/mkn/gpu.hpp (1)
60-62:⚠️ Potential issue | 🟡 Minor
#errormessage doesn't mentionMKN_GPU_CPUas an option.Suggested fix
-#error "UNKNOWN GPU / define MKN_GPU_ROCM or MKN_GPU_CUDA" +#error "UNKNOWN GPU / define MKN_GPU_ROCM, MKN_GPU_CUDA, or MKN_GPU_CPU"
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 1-4: The three headers share the same include guard macro
`_MKN_GPU_CPU_DEF_HPP_`, causing later headers to be skipped; change the guards
so each header has a unique macro: leave `def.hpp` as `_MKN_GPU_CPU_DEF_HPP_`,
update `api.hpp` to use `_MKN_GPU_CPU_API_HPP_` and `cls.hpp` to use
`_MKN_GPU_CPU_CLS_HPP_`, and ensure the corresponding `#ifndef/#define` and
matching `#endif /*...*/` comment names are updated to match each file's new
macro.
In `@inc/mkn/gpu/cuda/def.hpp`:
- Line 33: The free function getErrorString(auto const code) is missing the
inline specifier which causes ODR/linker errors when this header is included in
multiple TUs; fix it by marking the function inline (match the other free
functions) so change the declaration of getErrorString to be inline
getErrorString(auto const code) and keep the body returning
cudaGetErrorString(code) so it becomes an inline header-only function.
🧹 Nitpick comments (9)
test/hip/async.ppc (2)
86-91:dummyEventis created and destroyed but never used.Minor dead code —
dummyEventis allocated on line 90 and freed on line 149 but never recorded or synchronized against. Consider removing it to keep the benchmark clean.
36-44:checkHipsilently swallows errors in release builds.The function is a no-op unless
DEBUG/_DEBUGis defined. For a benchmark that relies on correct GPU results, a silent failure in release mode could produce misleading timing or error numbers. Consider at minimum logging or returning the error code in release builds, or documenting this is intentional.inc/mkn/gpu/cuda.hpp (1)
89-91: Commented-out include left in.Line 89 has a commented-out
#include "mkn/gpu/any/inc/device.ipp". If it's intentionally excluded, consider removing the line to avoid confusion. If it's temporary, a brief// TODOcomment explaining why would help.inc/mkn/gpu/rocm/def.hpp (1)
34-42:getWarpSizesilently ignoreshipGetDevicePropertieserrors.Line 39 discards the return value with
[[maybe_unused]]. If the call fails,devProp.warpSizeis read from uninitialized memory. At minimum, value-initializedevProp(hipDeviceProp_t devProp{};) so the fallback is deterministic rather than UB. Based on learnings, this function must support cross-platform use with CUDA/ROCm, so a graceful fallback (e.g., return 64 for ROCm) on failure would be appropriate.Suggested fix
std::uint32_t inline getWarpSize(size_t dev = 0) { `#ifdef` _MKN_GPU_WARP_SIZE_ return _MKN_GPU_WARP_SIZE_; `#else` - hipDeviceProp_t devProp; - [[maybe_unused]] auto ret = hipGetDeviceProperties(&devProp, dev); - return devProp.warpSize; + hipDeviceProp_t devProp{}; + auto ret = hipGetDeviceProperties(&devProp, dev); + if (ret != hipSuccess) return 64; // safe ROCm default + return devProp.warpSize; `#endif` /*_MKN_GPU_WARP_SIZE_ */ }inc/mkn/gpu/rocm.hpp (1)
45-58: Remove commented-out code.Three entire function definitions are commented out. If they've been moved or deprecated, remove them. If they're pending work, add a tracking TODO or issue reference.
inc/mkn/gpu/cuda/cls.hpp (1)
69-74:Pointeruses magic numbers forcudaMemoryTypeenum values.Using
0,1,2,3is fragile if CUDA SDK changes enum values. Consider using the named constants (cudaMemoryTypeUnregistered,cudaMemoryTypeHost,cudaMemoryTypeDevice,cudaMemoryTypeManaged).Suggested fix
- bool is_unregistered_ptr() const { return attributes.type == 0; } - bool is_host_ptr() const { - return attributes.type == 1 || (is_unregistered_ptr() && t != nullptr); - } - bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; } - bool is_managed_ptr() const { return attributes.type == 3; } + bool is_unregistered_ptr() const { return attributes.type == cudaMemoryTypeUnregistered; } + bool is_host_ptr() const { + return attributes.type == cudaMemoryTypeHost || (is_unregistered_ptr() && t != nullptr); + } + bool is_device_ptr() const { return is_managed_ptr() || attributes.type == cudaMemoryTypeDevice; } + bool is_managed_ptr() const { return attributes.type == cudaMemoryTypeManaged; }inc/mkn/gpu/cuda/api.hpp (1)
31-41:global_d_kernelintentionally omits bounds checking;global_gd_kernelprovides the guarded variant.This matches the documented design intent. One minor observation:
global_gd_kernelcomputesidx()with the defaultuint32_treturn type whilesisstd::size_t(64-bit). For grids exceeding ~4 billion threads the truncated index would silently pass the guard.This is unlikely to be a practical problem today, but if you ever foresee very large 1D dispatches, consider
idx<std::size_t>()here.Optional: match index width to the size parameter
template <typename F, typename... Args> __global__ static void global_gd_kernel(F f, std::size_t s, Args... args) { - if (auto i = mkn::gpu::cuda::idx(); i < s) f(args...); + if (auto i = mkn::gpu::cuda::idx<std::size_t>(); i < s) f(args...); }inc/mkn/gpu/cuda/def.hpp (2)
45-45:staticonwarp_sizegives each TU its own copy (and its owncudaGetDevicePropertiescall).If
_MKN_GPU_WARP_SIZE_is not defined, every translation unit that includes this header will independently query the device at static-init time. Changingstatictoinlinewould deduplicate to a single initialization across TUs. If per-TU copies are intentional (e.g. to avoid cross-TU init-order issues with the macro namespace), this is fine as-is.Optional: single-init variant
-static std::uint32_t inline const warp_size = getWarpSize(); +inline std::uint32_t const warp_size = getWarpSize();
26-31:gpuAssertcallsstd::abort()but<cstdlib>is not included.This works transitively via other headers today, but could break if include chains change.
Add explicit include
`#include` <string> `#include` <cstdint> +#include <cstdlib>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 149-161: Add the missing resize API and dev member to align ROCm
GLauncher with CUDA: introduce a std::size_t dev field and implement a
resize(std::size_t s, size_t dev = 0) method (move the current grid/block
calculation and hipGetDeviceProperties call out of the constructor into resize)
so it sets count, dev, initializes devProp via hipGetDeviceProperties(&devProp,
dev), updates b.x and g.x like the current logic, and keep cli{devProp} intact;
make the constructor delegate to Launcher and call resize(count, dev) so
existing code that calls GLauncher::resize(...) or accesses GLauncher::dev works
cross-backend.
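A rough sketch of the resize body only — the block/grid math shown is a placeholder and the Launcher base interface is not reproduced here, so treat it as a shape rather than a drop-in:

```cpp
void resize(std::size_t s, size_t dev_ = 0) {
  count = s;
  dev = dev_;  // newly added member, mirroring the CUDA GLauncher
  MKN_GPU_ASSERT(hipGetDeviceProperties(&devProp, dev));
  b.x = devProp.maxThreadsPerBlock;   // placeholder block sizing
  g.x = (count + b.x - 1) / b.x;      // placeholder grid sizing
}
```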
In `@test/any/managed.cpp`:
- Around line 86-91: The test suite lost zero-fill coverage because test_zero()
was removed from main; restore coverage by reintroducing a call to test_zero()
in main (alongside test(), test_guess(), test_lambda_copy_capture_views(), and
test_lambda_ref_copy_capture_views()) or, if you prefer to centralize tests, add
an equivalent invocation of test_zero() in the test registration/runner so it
executes; ensure the zero device helper (zero / functions in devfunc.ipp) and
any device-side zero-fill implementation referenced by test_zero() are present
and included so the test compiles and runs (look for symbols test_zero, zero,
and devfunc.ipp to locate the missing pieces).
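Sketch of the restored runner — the sibling tests listed are those named above; ordering and return handling are assumptions:

```cpp
int main() {
  test();
  test_zero();  // reinstated zero-fill coverage (exercises the zero helper in devfunc.ipp)
  test_guess();
  test_lambda_copy_capture_views();
  test_lambda_ref_copy_capture_views();
  return 0;
}
```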
🧹 Nitpick comments (6)
inc/mkn/gpu/rocm.hpp (1)
45-58: Consider removing commented-out code instead of leaving it.These three template functions (
_vector_fill, twofilloverloads) are fully commented out. If they've been relocated or are no longer needed, remove them to reduce noise. Version control preserves the history.inc/mkn/gpu/cuda/def.hpp (1)
35-45:getWarpSize:cudaGetDevicePropertieserror is silently discarded.If the CUDA runtime call fails (line 40),
devProp.warpSizeis read from an uninitialized struct — undefined behavior. The[[maybe_unused]]suppresses the warning but doesn't handle the failure. This was discussed in a previous review iteration; noting for completeness that it's still unguarded. The same pattern exists inrocm/def.hpp:38-40.Minimal defensive fix (value-initialize devProp)
- cudaDeviceProp devProp; + cudaDeviceProp devProp{}; [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);This at least zero-initializes the struct so the return value is deterministic (0) on failure rather than UB.
inc/mkn/gpu/cuda.hpp (1)
89-91: Commented-outdevice.ippinclude is a leftover.Line 89 is a commented-out include. Since
device.ippis included incuda/cls.hpp(line 103), this line serves no purpose and can be removed.inc/mkn/gpu/rocm/def.hpp (3)
34-42: SamedevPropinitialization concern ascuda/def.hpp.
hipDeviceProp_t devProp(line 38) is uninitialized; ifhipGetDevicePropertiesfails, readingdevProp.warpSizeis UB. Consider value-initializing:hipDeviceProp_t devProp{};.
55-59: Minor inconsistency: ROCmalloc(void*&)logsKLOG(TRC)but CUDA counterpart does not.In
cuda/def.hpplines 57–59, thealloc(void*&, Size)overload has noKLOG(TRC)call, while this ROCm version does (line 57). If the intent is symmetric tracing across backends, consider adding it to the CUDA side too.
73-78:alloc_managed:bytescomputed but log usessize * sizeof(T)again.Line 75 computes
auto const bytes = size * sizeof(T), but line 76 recomputessize * sizeof(T)in the log. Usebytesfor consistency:Suggested fix
auto const bytes = size * sizeof(T); - KLOG(TRC) << "GPU alloced: " << size * sizeof(T); + KLOG(TRC) << "GPU alloced: " << bytes; MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/any/cls.hpp`:
- Around line 1-24: There's a duplicate template struct Cli<Device> defined
twice causing an ODR violation; remove the redundant definition from the second
header (the one that currently duplicates lines 41-51 in env.hpp) and either (a)
include the canonical header that defines template struct Cli<Device> (the one
matching inc/mkn/gpu/any/cls.hpp) and use that definition, or (b) replace the
duplicate with a thin delegation/alias to the canonical Cli<Device> (e.g., a
using/forwarding declaration) so all translation units share the single
definition of Cli<Device>.
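Option (a) is the simpler shape — keep one definition and include it where the duplicate used to live (paths follow the prompt; the namespace shown is an assumption):

```cpp
// in inc/mkn/gpu/env.hpp: drop the duplicated template and reuse the canonical one
#include "mkn/gpu/any/cls.hpp"  // sole definition of template <typename Device> struct Cli

// call sites keep working unchanged, e.g.
// mkn::gpu::Cli<cudaDeviceProp> cli{devProp};
```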
🧹 Nitpick comments (7)
res/test_all.sh (1)
13-13:exit 0is redundant inside the subshell.With
set -e, the subshell already exits non-zero on failure, and exits 0 naturally if all commands succeed. The explicitexit 0is a no-op.inc/mkn/gpu/any/cls.hpp (1)
13-16:autoreturn type requires both return paths to deduce the same type.
kul::String::INT32(...)likely returnsstd::int32_twhiledev.maxThreadsPerBlockmay beint(e.g., incudaDeviceProp). On common platforms these are the same type, but if they ever differ, this will fail to compile. This appears to match existing code ininc/mkn/gpu/cli.hpp, so it's likely fine in practice.inc/mkn/gpu/rocm/def.hpp (1)
73-78: Minor:alloc_managedcomputesbytesbut the log still usessize * sizeof(T)instead ofbytes.The local
bytesvariable is computed on line 75 but the KLOG on line 76 recalculates the same expression. This is consistent with the CUDA counterpart, so likely copy-paste.Optional cleanup
template <typename T, typename Size> void alloc_managed(T*& p, Size size) { auto const bytes = size * sizeof(T); - KLOG(TRC) << "GPU alloced: " << size * sizeof(T); + KLOG(TRC) << "GPU alloced: " << bytes; MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes)); }inc/mkn/gpu/cuda.hpp (1)
89-91: Commented-out include is a stale breadcrumb.
device.ippis already included viacuda/cls.hpp(line 103 of that file). The commented-out line here could be removed to avoid confusion.inc/mkn/gpu/cuda/def.hpp (1)
73-77: Minor:alloc_managedcomputesbytesbut the log recalculatessize * sizeof(T).Same as the ROCm counterpart — the local
byteson line 75 could be reused in the log on line 75.Optional cleanup
template <typename T, typename Size> void alloc_managed(T*& p, Size size) { - KLOG(TRC) << "GPU alloced: " << size * sizeof(T); - MKN_GPU_ASSERT(cudaMallocManaged((void**)&p, size * sizeof(T))); + auto const bytes = size * sizeof(T); + KLOG(TRC) << "GPU alloced: " << bytes; + MKN_GPU_ASSERT(cudaMallocManaged((void**)&p, bytes)); }inc/mkn/gpu/rocm.hpp (1)
50-58: Nice DRY refactor onfill— consider aligning the CUDA counterpart.The second
filloverload now delegates to the first (fill(c, c.size(), val)), eliminating duplication. The CUDA equivalent incuda.hpp(lines 59–62) still duplicates theGLaunchercall inline.inc/mkn/gpu/rocm/cls.hpp (1)
62-86: ROCmPointerhasassert(t)and extratypemember — minor divergence from CUDA.The CUDA
Pointer(incuda/cls.hpp) doesn'tasserton the input pointer or store a separatetypefield. This isn't a bug — ROCm'shipPointerAttribute_thas different semantics — but it's worth noting for anyone porting code between backends.
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests