updates #30

Merged
PhilipDeegan merged 1 commit into master from next
Feb 15, 2026

Conversation

@PhilipDeegan
Member

@PhilipDeegan PhilipDeegan commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Modularized GPU backend (separate CUDA/ROCm/CPU layers) and added NVCC build configuration.
  • Bug Fixes

    • CI download now fails fast, follows redirects and retries for more reliable builds.
  • Chores

    • Updated CUDA target and compiler/linker invocations; enabled CPU test flag; broadened .gitignore.
  • Tests

    • Expanded GPU async/streaming tests and added a top-level test runner script.

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Modularized GPU backends: split monolithic CUDA/ROCm headers into def/api/cls and .ipp pieces, moved device helpers to shared any/ files, added nvcc build config, enabled CPU test flag, updated CI curl invocation and compiler target, and adjusted tests/benchmarks and .gitignore.

Changes

Cohort / File(s) Summary
CUDA modularization
inc/mkn/gpu/cuda.hpp, inc/mkn/gpu/cuda/def.hpp, inc/mkn/gpu/cuda/api.hpp, inc/mkn/gpu/cuda/cls.hpp
Removed large inline CUDA implementations from cuda.hpp and replaced with modular def/api/cls headers (runtime helpers, alloc/sync/copy, device idx/api, Stream/StreamEvent/Pointer, async copy, launchers).
ROCm modularization
inc/mkn/gpu/rocm.hpp, inc/mkn/gpu/rocm/def.hpp, inc/mkn/gpu/rocm/api.hpp, inc/mkn/gpu/rocm/cls.hpp
Replaced monolithic ROCm header with def/api/cls split: HIP runtime helpers, device kernels, Stream/StreamEvent/Pointer, async copy, launch utilities; former inline implementations removed.
Shared any-device utilities
inc/mkn/gpu/def.hpp, inc/mkn/gpu/any/def.hpp, inc/mkn/gpu/any/cls.hpp, inc/mkn/gpu/any/inc/devfunc.ipp
Added CompileFlags and common device helpers; introduced Cli helper and fill_warp_size device function in any/inc; centralized shared functionality into any/ includes.
CPU backend & integration
inc/mkn/gpu.hpp, inc/mkn/gpu/cpu.hpp, inc/mkn/gpu/cpu/def.hpp, inc/mkn/gpu/cpu/api.hpp, inc/mkn/gpu/cpu/cls.hpp
Switched top-level include to def.hpp and added conditional backend includes; updated CPU include guards and switched CPU implementation to use any/inc/* utilities and fill_warp_size.
Removed legacy files
inc/mkn/gpu/defines.hpp, inc/mkn/gpu/devfunc.hpp
Deleted legacy centralized defines and old device zero implementation; functionality moved into backend-specific def.hpp/api.hpp/cls.hpp and shared any/ files.
Build & CI changes
.github/workflows/build.yml, res/mkn/nvcc.yaml, res/mkn/clang_cuda.yaml, mkn.yaml, .gitignore
Hardened CI curl call (-f + --retry 3 --retry-delay 2), added NVCC build config, changed compiler target (clang++-17→clang++, sm_61→sm_120), removed -fuse-ld=gold, enabled MKN_GPU_CPU in mkn.yaml, broadened .clangd ignore.
Tests & benchmarks
test/any/async_streaming.cpp, test/any/managed.cpp, test/hip/async.cpp, test/hip/async.ppc, res/test_all.sh
Updated includes/macros, replaced zero → fill_warp_size, switched memset → hipMemset, added a HIP async benchmark program and a test orchestration script.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Host as Host (code/tests)
    participant API as MKN_GPU_NS (def/api/cls)
    participant Runtime as "CUDA/HIP Runtime / Device"

    Host->>API: request alloc / send / launch / stream ops
    API->>Runtime: call alloc / memcpy / launch (cuda/hip APIs)
    Runtime-->>API: async completion (stream/event)
    API-->>Host: return handles / status / results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰
I nibbled at headers, split stacks with care,
Moved kernels and streams to tidy new lairs,
CUDA, ROCm, CPU — neat files in a row,
Streams hop, kernels bound, off to the show,
A rabbit’s small tidy patch — hop, hop, hooray! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 3.15%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check: ❓ Inconclusive. The PR title 'updates' is a generic, non-descriptive term that fails to convey meaningful information about the changeset. Resolution: use a more specific title that highlights the primary change, such as 'Refactor GPU backend headers into modular structure' or 'Modularize CUDA, ROCm, and CPU GPU implementations'.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into master.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch next

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@PhilipDeegan
Member Author

PhilipDeegan commented Jan 27, 2026

@coderabbitai are you setup for this repo?

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Tip

For best results, initiate chat on the files or code changes.

@PhilipDeegan Yes, I'm set up and ready to help with this repository! I have your preferences stored (like your include ordering style) and I'm operational for code reviews.

Would you like me to:

  • Review this PR (python interop poc)?
  • Show you the current CodeRabbit configuration for this repository?
  • Help with something specific?

Just let me know how I can assist! 🐰


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Tip

For best results, initiate chat on the files or code changes.

@PhilipDeegan Yes, I'm all set up for this repository! I have your preferences stored, including your preferred include ordering style (by line length first, then alphabetically).

How can I help you with this PR? Would you like me to:

  • Review the python interop poc changes?
  • Help with specific files or implementations?
  • Answer questions about the code?

Let me know what you need! 🐰


🧠 Learnings used
Learnt from: PhilipDeegan
Repo: mkn/mkn.gpu PR: 19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-06-10T19:38:12.663Z
Learning: PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/cuda.hpp`:
- Around line 84-95: Make getWarpSize inline to avoid ODR violations when
included from headers, and add proper CUDA error handling: in the inline
std::uint32_t getWarpSize(size_t dev = 0) function initialize a cudaDeviceProp
variable before use, call cudaGetDeviceProperties and check its return value
(use the project's MKN_GPU_ASSERT macro or return a safe default like 32 /
throw/abort on failure), and only access devProp.warpSize on success; also
ensure the static std::uint32_t const warp_size = getWarpSize(); usage remains
valid after these safeguards.
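
A minimal sketch of the shape that prompt describes, using the "safe default" branch; the _MKN_GPU_WARP_SIZE_ override and the trailing warp_size definition are assumed to match the existing header rather than copied from it:

inline std::uint32_t getWarpSize(size_t dev = 0) {
#ifdef _MKN_GPU_WARP_SIZE_
  return _MKN_GPU_WARP_SIZE_;
#else
  cudaDeviceProp devProp{};  // value-initialize before querying
  if (cudaGetDeviceProperties(&devProp, dev) != cudaSuccess) return 32;  // review's suggested fallback
  return devProp.warpSize;
#endif
}

static std::uint32_t const warp_size = getWarpSize();  // existing usage stays valid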

In `@res/mkn/clang_cuda.yaml`:
- Around line 18-31: The compiler and linker entries currently invoke the
generic clang++ which can pick older toolchains; update the "compiler" and
"linker" fields so they explicitly call clang++-17 (i.e. replace uses of
"clang++" in the compiler: and linker: settings with "clang++-17") to ensure the
recommended clang 17 toolchain is used for the cxx_cuda build and CUDA SM_80
flags (references: compiler, linker, cxx_cuda).

In `@test/poc/jit/eg.py`:
- Around line 17-24: vadd is declared as a CUDA kernel but is invoked from
inside another kernel (vector_add_gpu), which Numba disallows; change vadd to a
device function by using `@cuda.jit`(device=True) instead of a kernel decorator so
it can be called from vector_add_gpu, keep the same signature (index i and
arrays a, b, c) and ensure vector_add_gpu calls vadd(...) as a regular device
function rather than launching it.

In `@test/poc/jit/pybind.cpp`:
- Around line 61-66: The pybind registration mistakenly repeats
.def_readwrite("a", &FunctionSupport::a) instead of exposing member c; update
the duplicate binding in the chain that registers FunctionSupport so it uses
.def_readwrite("c", &FunctionSupport::c) (keep the existing
.def("A",&FunctionSupport::A), .def_readwrite("b",&FunctionSupport::b) and
.def("C",&FunctionSupport::C) intact) so the struct's c member is correctly
exposed to Python.
- Around line 44-46: The make function currently takes an auto parameter and
returns a py::array_t<float> with a no-op capsule, which causes two problems:
'auto' in the parameter requires C++20 and the no-op capsule does not extend the
owning FunctionSupport lifetime (risking use-after-free). Change the signature
of make to accept an explicit float* (replace auto p with float* p) and when
constructing the py::array_t<float> pass py::cast(this) as the base object so
the Python wrapper for this FunctionSupport instance is held alive for the
array's lifetime, removing the no-op capsule.
🧹 Nitpick comments (1)
res/mkn/nvcc.yaml (1)

5-19: Consider making the CUDA arch explicit/configurable.
Without an arch flag, nvcc defaults can vary across environments and hurt reproducibility/perf. A cuda_arch property (or env override) keeps builds consistent.

Example (adjust to mkn interpolation rules)
 property:
-  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
-  cxx_cuda: -x cu --extended-lambda #-x cuda --cuda-gpu-arch="sm_80" -Xclang -fcuda-allow-variadic-functions
+  cuda_arch: sm_80
+  cxx_flags: --std=c++20 -Xcompiler -fPIC # -fPIC -fsized-deallocation -Wno-unknown-cuda-version
+  cxx_cuda: -x cu --extended-lambda --gpu-architecture="${cuda_arch}"

@mkn mkn deleted a comment from coderabbitai bot Jan 27, 2026
@PhilipDeegan PhilipDeegan changed the title from python interop poc to updates Feb 14, 2026
@PhilipDeegan
Member Author

PhilipDeegan commented Feb 14, 2026

@copilot can you review?


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
inc/mkn/gpu/cuda.hpp (2)

65-76: ⚠️ Potential issue | 🟠 Major

Inconsistent error handling: cudaGetDeviceProperties not checked with MKN_GPU_ASSERT.

In rocm.hpp (line 63), hipGetDeviceProperties is wrapped with MKN_GPU_ASSERT, but here the return value is silently discarded via [[maybe_unused]]. If the call fails, subsequent reads from devProp are undefined behavior on uninitialized memory.

Suggested fix
 void inline prinfo(size_t dev = 0) {
-  cudaDeviceProp devProp;
-  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
+  cudaDeviceProp devProp{};
+  MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, dev));

78-87: ⚠️ Potential issue | 🟡 Minor

Missing error check on cudaMemGetInfo.

Same inconsistency: rocm.hpp (line 87) wraps hipMemGetInfo with MKN_GPU_ASSERT, but here the return value is silently discarded. A failed call would leave free_t/total_t at zero, producing misleading output rather than a clear error.

Suggested fix
-  cudaMemGetInfo(&free_t, &total_t);
+  MKN_GPU_ASSERT(cudaMemGetInfo(&free_t, &total_t));
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/any/inc/devfunc.ipp`:
- Around line 2-9: The zero function currently partitions by warpSize but
indexes with threadIdx.x which overflows when blockDim.x > warpSize; change the
loop and tail calculation to use the actual active thread width (e.g., use auto
const width = blockDim.x; compute each = size / width; loop writes t[chunk *
width + threadIdx.x] with a bounds check if threadIdx.x < size - (width *
each)), and keep the final conditional guard (threadIdx.x < remaining) and the
optional __syncthreads() when sync==true; alternatively, if zero() is only valid
at warp granularity, add a runtime assertion (e.g., assert(blockDim.x ==
warpSize)) or documentation and enforce it before using threadIdx.x to prevent
out-of-bounds. Ensure references: function zero, threadIdx.x, warpSize,
blockDim.x, and the sync template parameter.
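
A sketch of the width-based partitioning the prompt asks for; the signature (t, size, the sync template flag) follows the review's description of zero() and may differ from the actual .ipp file:

template <bool sync = true, typename T, typename Size>
__device__ void zero(T* t, Size size) {
  auto const width = blockDim.x;   // actual active thread width, not warpSize
  Size const each = size / width;  // full chunks written by every thread
  for (Size chunk = 0; chunk < each; ++chunk) t[chunk * width + threadIdx.x] = 0;
  if (threadIdx.x < size - width * each)  // guard the tail remainder
    t[each * width + threadIdx.x] = 0;
  if constexpr (sync) __syncthreads();
}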

In `@inc/mkn/gpu/cuda/cls.hpp`:
- Around line 20-30: The Stream struct currently allows implicit copies which
can lead to double-destroy of the cudaStream_t; make Stream non-copyable and
either non-movable or properly movable (delete copy ctor and copy assignment;
implement or default a move ctor/move assignment that transfers the stream
handle and clears the source) so ownership is unique, and update the destructor
to only destroy a valid handle. Also change sync() to assign to result and
assert/check it the same way the constructor/destructor do (use MKN_GPU_ASSERT
on the result returned by cudaStreamSynchronize) to surface runtime errors;
reference the Stream ctor, ~Stream, and sync methods and the result member when
making these changes.
- Around line 48-61: The field StreamEvent::fin is a plain bool causing a data
race between Callback (called on a CUDA callback thread) and reads via
StreamEvent::finished(); change fin to std::atomic<bool> and update finished()
to return fin.load(std::memory_order_acquire) and update Callback to set
fin.store(true, std::memory_order_release); also include the <atomic> header and
ensure any write/read of fin uses the atomic operations instead of direct
assignment or plain reads.

In `@inc/mkn/gpu/cuda/def.hpp`:
- Around line 12-16: Replace the bare preprocessor check for MKN_GPU_FN_PER_NS
in the CUDA def.hpp (the block that defines MKN_GPU_NS) with the same safe
pattern used in rocm/def.hpp: use `#if` defined(MKN_GPU_FN_PER_NS) &&
MKN_GPU_FN_PER_NS so the undefined-macro case doesn't trigger -Wundef; keep the
existing `#else` branch that defines MKN_GPU_NS as mkn::gpu and the `#endif` comment
intact.
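
The safe guard pattern referenced above, sketched; only the #else branch value is stated in the prompt, and the per-namespace value mkn::gpu::cuda is assumed from the file's naming:

#if defined(MKN_GPU_FN_PER_NS) && MKN_GPU_FN_PER_NS
#define MKN_GPU_NS mkn::gpu::cuda  // assumed per-namespace value
#else
#define MKN_GPU_NS mkn::gpu
#endif  // MKN_GPU_FN_PER_NS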

In `@inc/mkn/gpu/def.hpp`:
- Around line 28-30: The MKN_GPU_CPU macro isn't always defined which triggers
-Wundef warnings; define it to 0 by default and then set it to 1 only when CUDA
and ROCm are both absent. Specifically, add a default definition like "#ifndef
MKN_GPU_CPU / `#define` MKN_GPU_CPU 0 / `#endif`" (mirroring how MKN_GPU_CUDA and
MKN_GPU_ROCM are handled) and keep the existing conditional that sets
MKN_GPU_CPU to 1 when MKN_GPU_CUDA == 0 && MKN_GPU_ROCM == 0 so all translation
units can safely use "#if MKN_GPU_CPU".
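
A sketch of the defaulting scheme described above; the exact placement inside def.hpp is assumed:

#ifndef MKN_GPU_CPU
#define MKN_GPU_CPU 0  // default, mirroring MKN_GPU_CUDA / MKN_GPU_ROCM handling
#endif

#if MKN_GPU_CUDA == 0 && MKN_GPU_ROCM == 0
#undef MKN_GPU_CPU
#define MKN_GPU_CPU 1  // CPU backend when no GPU backend is selected
#endif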

In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 16-26: Stream::sync currently assigns hipStreamSynchronize(stream)
to result but doesn't check it; make it consistent with the
constructor/destructor by asserting the result using MKN_GPU_ASSERT. Replace the
current body of Stream::sync so it invokes MKN_GPU_ASSERT(result =
hipStreamSynchronize(stream)); ensuring synchronization failures are caught via
the existing assertion mechanism (referencing Stream::sync, result,
MKN_GPU_ASSERT, and hipStreamSynchronize).
- Around line 37-49: The callback currently invokes an empty std::function and
can throw std::bad_function_call; in StreamEvent::operator() (where _fn is set
and hipStreamAddCallback is called) ensure _fn is never left empty by assigning
a no-op lambda when fn is empty (or alternatively change StreamEvent::Callback
to check if self._fn is callable before invoking it), then preserve the existing
fin and reset behavior (i.e., set _fn = []{}; fin = 1) — update either
operator() to do _fn = fn ? fn : []{} or add a guard in Callback before calling
_fn() to avoid calling an empty target.
- Around line 90-98: send_async currently applies the offset to both device and
host pointers (p + start, t + start) whereas the synchronous send in
rocm/def.hpp offsets only the device pointer; if start is intended as a
device-side offset, change the hipMemcpyAsync call in send_async so the
destination is p + start but the source remains t (not t + start), keeping the
Size start parameter semantics consistent with send and avoiding reading from
the wrong host memory; update any related comments or docstring for send_async
to state that start is a device offset.
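
A sketch of the device-offset semantics from the last item above; the parameter list (p, t, size, start) and the Stream call operator are taken from the review text and may not match the header exactly:

template <typename T, typename Size>
void send_async(T* p, T const* t, Stream& stream, Size size = 1, Size start = 0) {
  // start offsets only the device destination; the host source is read from its beginning
  MKN_GPU_ASSERT(hipMemcpyAsync(p + start, t, size * sizeof(T),
                                hipMemcpyHostToDevice, stream()));
}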

In `@inc/mkn/gpu/rocm/def.hpp`:
- Around line 85-95: The ROCm overloads destroy and destroy_host currently take
T* by value which leaves the caller's pointer dangling (unlike the CUDA
counterparts destroy(T*& ptr) / destroy_host(T*& ptr)); change the signatures of
destroy and destroy_host to take T*& ptr, call KLOG(TRC), call
MKN_GPU_ASSERT(hipFree(ptr)) and MKN_GPU_ASSERT(hipHostFree(ptr)) respectively,
and then set ptr = nullptr so the caller's pointer is cleared—this matches the
behavior in cuda/def.hpp and prevents dangling pointers.
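
Sketched to match the cuda/def.hpp behaviour the prompt cites (reference parameter, assert, then null out); KLOG and MKN_GPU_ASSERT are the project macros named in the prompt:

template <typename T>
void destroy(T*& ptr) {
  KLOG(TRC);
  MKN_GPU_ASSERT(hipFree(ptr));
  ptr = nullptr;  // the caller's pointer no longer dangles
}

template <typename T>
void destroy_host(T*& ptr) {
  KLOG(TRC);
  MKN_GPU_ASSERT(hipHostFree(ptr));
  ptr = nullptr;
}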

In `@res/mkn/clang_cuda.yaml`:
- Line 18: The commented example that currently references sm_61 is stale given
the active cxx_cuda setting uses sm_120; update the example comment to show
sm_120 (Blackwell) and explicitly note that sm_120 requires CUDA toolkit headers
12.8+ and that Clang on PATH must provide compatible CUDA support (e.g., Clang
versions earlier than full CUDA‑12.8 support like 20.1.0 may warn or be
incompatible). Edit the comment near the cxx_cuda example to reflect the new
default GPU arch (sm_120) and add a short compatibility note about requiring
CUDA 12.8+ headers and verifying Clang/toolchain compatibility.

In `@test/hip/async.cpp`:
- Line 94: The three hipMemset calls (e.g., hipMemset(a, 0, bytes)) are missing
error checks and are operating on host-pinned memory allocated via
hipHostMalloc; either replace each hipMemset with a plain memset(a, 0, bytes) to
avoid an unnecessary device dispatch, or if the hip call is intentional wrap
each call with checkHip(hipMemset(...)) to handle the returned hipError_t;
update all three occurrences (the hipMemset calls around where variable a and
bytes are used) consistently.
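
The two alternatives above, sketched (a, bytes and checkHip are the names used in the review; which one is preferable depends on whether the device-side dispatch is intentional):

// option 1: plain host memset (needs <cstring>), since a is host-pinned memory
memset(a, 0, bytes);
// option 2: keep the device-side call but check the returned hipError_t
checkHip(hipMemset(a, 0, bytes));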
🧹 Nitpick comments (11)
test/any/managed.cpp (1)

88-90: Test coverage gap for the new zero device function.

test_zero was removed here, and the zero utility was moved to inc/mkn/gpu/any/inc/devfunc.ipp. Consider adding a replacement test that exercises the new modularized zero function to maintain coverage.

inc/mkn/gpu/any/cls.hpp (1)

13-17: Entire Cli<Device> struct is duplicated in inc/mkn/gpu/env.hpp.

Both inc/mkn/gpu/any/cls.hpp and inc/mkn/gpu/env.hpp define identical template structures. Consider consolidating to a single definition to prevent them from drifting apart during future maintenance.

test/hip/async.ppc (2)

36-44: checkHip silently discards errors in release builds.

In a test/benchmark program, silently ignoring HIP errors in release mode can mask failures and produce misleading timing results. Consider always checking, or at minimum logging, the error—even in non-debug builds.

Also, the commented-out // return result; on line 43 is dead code from the original NVIDIA sample.


86-91: dummyEvent is created and destroyed but never used.

dummyEvent (lines 86, 90, 149) is allocated and cleaned up but never recorded or synchronized against. It's vestigial from the original NVIDIA sample. Consider removing it to reduce confusion.

Proposed fix
   hipEvent_t startEvent, stopEvent, dummyEvent;
+  // dummyEvent is unused — consider removing it
   hipStream_t stream[nStreams];
   checkHip(hipEventCreate(&startEvent));
   checkHip(hipEventCreate(&stopEvent));
-  checkHip(hipEventCreate(&dummyEvent));

And in cleanup:

   checkHip(hipEventDestroy(startEvent));
   checkHip(hipEventDestroy(stopEvent));
-  checkHip(hipEventDestroy(dummyEvent));
inc/mkn/gpu/cuda/cls.hpp (1)

65-78: Pointer uses magic numbers for cudaMemoryType — prefer named constants.

Lines 69-74 compare attributes.type against raw integers 0, 1, 2, 3. These correspond to cudaMemoryTypeUnregistered, cudaMemoryTypeHost, cudaMemoryTypeDevice, and cudaMemoryTypeManaged. Using the enum names improves readability and guards against enum value changes across CUDA versions.

Proposed fix
-  bool is_unregistered_ptr() const { return attributes.type == 0; }
-  bool is_host_ptr() const {
-    return attributes.type == 1 || (is_unregistered_ptr() && t != nullptr);
-  }
-  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; }
-  bool is_managed_ptr() const { return attributes.type == 3; }
+  bool is_unregistered_ptr() const { return attributes.type == cudaMemoryTypeUnregistered; }
+  bool is_host_ptr() const {
+    return attributes.type == cudaMemoryTypeHost || (is_unregistered_ptr() && t != nullptr);
+  }
+  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == cudaMemoryTypeDevice; }
+  bool is_managed_ptr() const { return attributes.type == cudaMemoryTypeManaged; }
inc/mkn/gpu/cuda/api.hpp (1)

12-20: idx() may overflow uint32_t for large 3D grids.

The intermediate products width * height and z * width * height can overflow uint32_t when grid dimensions are large in multiple axes. The template parameter allows callers to specify uint64_t, but the default uint32_t could silently wrap. This is a known trade-off (matches standard CUDA patterns), just worth documenting for users of multi-dimensional grids.

inc/mkn/gpu/def.hpp (1)

34-39: CompileFlags::withCPU doesn't use the MKN_GPU_CPU macro.

withCUDA and withROCM reference their respective macros, but withCPU uses a hardcoded expression !MKN_GPU_ROCM and !MKN_GPU_CUDA. If a user explicitly defines MKN_GPU_CPU=1 alongside a GPU backend (e.g., for testing), CompileFlags::withCPU would report false. For consistency and to respect user overrides:

Proposed fix
 struct CompileFlags {
   bool constexpr static withCUDA = MKN_GPU_CUDA;
   bool constexpr static withROCM = MKN_GPU_ROCM;
-  bool constexpr static withCPU = !MKN_GPU_ROCM and !MKN_GPU_CUDA;
+  bool constexpr static withCPU = MKN_GPU_CPU;
   bool constexpr static perNamespace = MKN_GPU_FN_PER_NS;
 };

(This depends on defining MKN_GPU_CPU to 0 when unused, as suggested above.)

inc/mkn/gpu/cuda.hpp (1)

89-91: Commented-out include left in code.

device.ipp is commented out. If this is intentional for the POC, a brief comment explaining why would help. If it's no longer needed, remove the line.

inc/mkn/gpu/rocm.hpp (1)

45-58: Remove commented-out code blocks.

Three full function templates (_vector_fill, two fill overloads) are commented out. These are still active in cuda.hpp. If they're intended to be restored, track them via an issue; otherwise, remove the dead code to reduce noise.

inc/mkn/gpu/cuda/def.hpp (1)

45-45: static inline on namespace-scope variable: static gives internal linkage, contradicting inline's ODR intent.

static means each TU gets its own copy; inline on a variable (C++17) is meant to ensure a single definition across TUs. Combined, static wins and each TU initializes its own warp_size independently, which is likely fine for a constant but semantically confusing. If you intend a single shared instance, drop static. If per-TU is intentional, drop inline.

inc/mkn/gpu/rocm/cls.hpp (1)

56-56: Nit: prefer false over 0 for bool initialization.

bool fin = 0; works but bool fin = false; is more idiomatic C++.

Comment on lines 20 to 30
struct Stream {
  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }

  auto& operator()() { return stream; };

  void sync() { result = cudaStreamSynchronize(stream); }

  cudaError_t result;
  cudaStream_t stream;
};

⚠️ Potential issue | 🟠 Major

Stream lacks copy/move protection — risk of double-destroying cudaStream_t.

Stream manages a CUDA stream resource via RAII (create in ctor, destroy in dtor), but does not delete or define copy/move operations. If a Stream is inadvertently copied, the destructor will call cudaStreamDestroy twice on the same handle, causing undefined behavior.

Also, sync() (line 26) sets result but never checks it, unlike the constructor and destructor which assert.

Proposed fix
 struct Stream {
   Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
   ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
+  Stream(Stream const&) = delete;
+  Stream& operator=(Stream const&) = delete;
+  Stream(Stream&&) = delete;
+  Stream& operator=(Stream&&) = delete;

   auto& operator()() { return stream; };

-  void sync() { result = cudaStreamSynchronize(stream); }
+  void sync() { MKN_GPU_ASSERT(result = cudaStreamSynchronize(stream)); }

   cudaError_t result;
   cudaStream_t stream;
 };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-struct Stream {
-  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
-  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
-  auto& operator()() { return stream; };
-  void sync() { result = cudaStreamSynchronize(stream); }
-  cudaError_t result;
-  cudaStream_t stream;
-};
+struct Stream {
+  Stream() { MKN_GPU_ASSERT(result = cudaStreamCreate(&stream)); }
+  ~Stream() { MKN_GPU_ASSERT(result = cudaStreamDestroy(stream)); }
+  Stream(Stream const&) = delete;
+  Stream& operator=(Stream const&) = delete;
+  Stream(Stream&&) = delete;
+  Stream& operator=(Stream&&) = delete;
+  auto& operator()() { return stream; };
+  void sync() { MKN_GPU_ASSERT(result = cudaStreamSynchronize(stream)); }
+  cudaError_t result;
+  cudaStream_t stream;
+};
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda/cls.hpp` around lines 20 - 30, The Stream struct currently
allows implicit copies which can lead to double-destroy of the cudaStream_t;
make Stream non-copyable and either non-movable or properly movable (delete copy
ctor and copy assignment; implement or default a move ctor/move assignment that
transfers the stream handle and clears the source) so ownership is unique, and
update the destructor to only destroy a valid handle. Also change sync() to
assign to result and assert/check it the same way the constructor/destructor do
(use MKN_GPU_ASSERT on the result returned by cudaStreamSynchronize) to surface
runtime errors; reference the Stream ctor, ~Stream, and sync methods and the
result member when making these changes.

Comment on lines +48 to +61
  static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
    auto& self = *reinterpret_cast<StreamEvent*>(ptr);
    self._fn();
    self._fn = [] {};
    self.fin = 1;
  }

  bool finished() const { return fin; }

  Stream& stream;
  cudaError_t result;
  std::function<void()> _fn;
  bool fin = 0;
};

⚠️ Potential issue | 🟠 Major

StreamEvent::fin is not atomic — data race with CUDA callback thread.

cudaStreamAddCallback fires Callback on a different thread. The callback writes fin = 1 (line 52) while the main thread may read finished() (line 55) concurrently. This is a data race (undefined behavior per the C++ memory model). Use std::atomic<bool> for fin.

Proposed fix
+#include <atomic>
 // ...
-  bool finished() const { return fin; }
+  bool finished() const { return fin.load(std::memory_order_acquire); }
   // ...
-  bool fin = 0;
+  std::atomic<bool> fin{false};

And in Callback:

-    self.fin = 1;
+    self.fin.store(true, std::memory_order_release);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
   static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
     auto& self = *reinterpret_cast<StreamEvent*>(ptr);
     self._fn();
     self._fn = [] {};
-    self.fin = 1;
+    self.fin.store(true, std::memory_order_release);
   }
-  bool finished() const { return fin; }
+  bool finished() const { return fin.load(std::memory_order_acquire); }
   Stream& stream;
   cudaError_t result;
   std::function<void()> _fn;
-  bool fin = 0;
+  std::atomic<bool> fin{false};
 };
🤖 Prompt for AI Agents
In `@inc/mkn/gpu/cuda/cls.hpp` around lines 48 - 61, The field StreamEvent::fin is
a plain bool causing a data race between Callback (called on a CUDA callback
thread) and reads via StreamEvent::finished(); change fin to std::atomic<bool>
and update finished() to return fin.load(std::memory_order_acquire) and update
Callback to set fin.store(true, std::memory_order_release); also include the
<atomic> header and ensure any write/read of fin uses the atomic operations
instead of direct assignment or plain reads.

 property:
   cxx_flags: --std=c++20 -fPIC -fsized-deallocation -Wno-unknown-cuda-version
-  cxx_cuda: -x cuda --cuda-gpu-arch="sm_61" -Xclang -fcuda-allow-variadic-functions
+  cxx_cuda: -x cuda --cuda-gpu-arch="sm_120" -Xclang -fcuda-allow-variadic-functions

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n res/mkn/clang_cuda.yaml

Repository: mkn/mkn.gpu

Length of output: 1158


🌐 Web query:

CUDA sm_120 compute capability Blackwell Nvidia

💡 Result:

  • sm_120 is NVIDIA’s SASS (binary) code target for CUDA Compute Capability 12.0, which corresponds to Blackwell “RTX” client/workstation GPUs (e.g., GeForce RTX 50xx). [1][2]
  • CUDA Toolkit support: CUDA 12.8 added compiler support for SM_120 (along with SM_100/SM_101 for other Blackwell variants). [1]
  • Blackwell is split across multiple compute capabilities: for example, Blackwell B200/B300 are Compute Capability 10.0 (SM_100), while Blackwell RTX parts are 12.0 (SM_120). [2]
  • PTX targeting note: newer PTX ISA documentation includes family-specific targets like sm_120f (family-forward within the Blackwell family) and also mentions related targets (e.g., sm_121). [3]

Sources:
[1] CUDA 12.8 Features Archive (NVIDIA) (docs.nvidia.com)
[2] NVIDIA CUTLASS architecture/compute-capability table (docs.nvidia.com)
[3] PTX ISA 8.8 documentation (NVIDIA) (docs.nvidia.com)


🌐 Web query:

Clang CUDA 12.8 support version requirements

💡 Result:

  • If you mean “Clang compiling CUDA code using the CUDA Toolkit headers/libraries”: as of Clang/LLVM 20.1.0, Clang’s officially supported CUDA SDK version is up to CUDA 12.6 (anything newer is not guaranteed and may be treated as “newer-than-supported” / partially supported). [1]

    • LLVM’s “Compiling CUDA with clang” doc also notes that when Clang sees a newer CUDA Toolkit than it supports, it will warn and try to proceed as the newest supported version (the exact max version stated there may lag behind current Clang releases). [2]
  • If you mean “CUDA 12.8’s nvcc host-compiler requirements on Linux”: NVIDIA lists supported host Clang versions 7.x through 19.x for CUDA 12.8 on x86_64 and aarch64 (major-version checked). [3]

  • Practical datapoint (hipify-clang/ROCm tooling): for CUDA 12.8.1, AMD’s hipify-clang documentation maps support to LLVM/Clang 20.1.x (20.1.0 through 20.1.8). [4]

Sources: [1] Clang 20.1.0 release notes, CUDA support section; [2] LLVM “Compiling CUDA with clang” doc; [3] NVIDIA CUDA 12.8 Linux installation guide (supported compilers table); [4] ROCm hipify-clang dependencies table.


Update stale comment and clarify CUDA toolchain requirements for sm_120 Blackwell support.

Line 6 contains an outdated example referencing sm_61 when the active config (line 18) targets sm_120 (Blackwell RTX). Additionally, sm_120 requires CUDA 12.8+ headers; verify that the Clang version on PATH provides this support (Clang 20.1.0 officially supports only CUDA 12.6, though it will attempt to proceed with a warning when given newer toolkits). Update the commented example on line 6 to reflect the current sm_120 default or document the version constraints more explicitly.

🤖 Prompt for AI Agents
In `@res/mkn/clang_cuda.yaml` at line 18, The commented example that currently
references sm_61 is stale given the active cxx_cuda setting uses sm_120; update
the example comment to show sm_120 (Blackwell) and explicitly note that sm_120
requires CUDA toolkit headers 12.8+ and that Clang on PATH must provide
compatible CUDA support (e.g., Clang versions earlier than full CUDA‑12.8
support like 20.1.0 may warn or be incompatible). Edit the comment near the
cxx_cuda example to reflect the new default GPU arch (sm_120) and add a short
compatibility note about requiring CUDA 12.8+ headers and verifying
Clang/toolchain compatibility.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
inc/mkn/gpu/cuda.hpp (1)

65-76: ⚠️ Potential issue | 🟡 Minor

prinfo silently ignores cudaGetDeviceProperties errors, unlike the ROCm counterpart.

In inc/mkn/gpu/rocm.hpp:63, the same function uses MKN_GPU_ASSERT(hipGetDeviceProperties(...)). Here, the result is [[maybe_unused]], which means a failed query silently reads uninitialized devProp. Consider aligning with the ROCm version.

Suggested fix
 void inline prinfo(size_t dev = 0) {
   cudaDeviceProp devProp;
-  [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);
+  MKN_GPU_ASSERT(cudaGetDeviceProperties(&devProp, dev));
inc/mkn/gpu.hpp (1)

60-62: ⚠️ Potential issue | 🟡 Minor

#error message doesn't mention MKN_GPU_CPU as an option.

Suggested fix
-#error "UNKNOWN GPU / define MKN_GPU_ROCM or MKN_GPU_CUDA"
+#error "UNKNOWN GPU / define MKN_GPU_ROCM, MKN_GPU_CUDA, or MKN_GPU_CPU"
🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/cpu/def.hpp`:
- Around line 1-4: The three headers share the same include guard macro
`_MKN_GPU_CPU_DEF_HPP_`, causing later headers to be skipped; change the guards
so each header has a unique macro: leave `def.hpp` as `_MKN_GPU_CPU_DEF_HPP_`,
update `api.hpp` to use `_MKN_GPU_CPU_API_HPP_` and `cls.hpp` to use
`_MKN_GPU_CPU_CLS_HPP_`, and ensure the corresponding `#ifndef/#define` and
matching `#endif /*...*/` comment names are updated to match each file's new
macro.
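
For illustration, the unique-guard shape the prompt describes for api.hpp (cls.hpp follows the same pattern with _MKN_GPU_CPU_CLS_HPP_):

// inc/mkn/gpu/cpu/api.hpp
#ifndef _MKN_GPU_CPU_API_HPP_
#define _MKN_GPU_CPU_API_HPP_

// ... header contents ...

#endif /* _MKN_GPU_CPU_API_HPP_ */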

In `@inc/mkn/gpu/cuda/def.hpp`:
- Line 33: The free function getErrorString(auto const code) is missing the
inline specifier which causes ODR/linker errors when this header is included in
multiple TUs; fix it by marking the function inline (match the other free
functions) so change the declaration of getErrorString to be inline
getErrorString(auto const code) and keep the body returning
cudaGetErrorString(code) so it becomes an inline header-only function.
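
The one-line fix, sketched (the auto parameter implies C++20, which matches the --std=c++20 flags used elsewhere in this PR):

char const* inline getErrorString(auto const code) { return cudaGetErrorString(code); }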
🧹 Nitpick comments (9)
test/hip/async.ppc (2)

86-91: dummyEvent is created and destroyed but never used.

Minor dead code — dummyEvent is allocated on line 90 and freed on line 149 but never recorded or synchronized against. Consider removing it to keep the benchmark clean.


36-44: checkHip silently swallows errors in release builds.

The function is a no-op unless DEBUG/_DEBUG is defined. For a benchmark that relies on correct GPU results, a silent failure in release mode could produce misleading timing or error numbers. Consider at minimum logging or returning the error code in release builds, or documenting this is intentional.

inc/mkn/gpu/cuda.hpp (1)

89-91: Commented-out include left in.

Line 89 has a commented-out #include "mkn/gpu/any/inc/device.ipp". If it's intentionally excluded, consider removing the line to avoid confusion. If it's temporary, a brief // TODO comment explaining why would help.

inc/mkn/gpu/rocm/def.hpp (1)

34-42: getWarpSize silently ignores hipGetDeviceProperties errors.

Line 39 discards the return value with [[maybe_unused]]. If the call fails, devProp.warpSize is read from uninitialized memory. At minimum, value-initialize devProp (hipDeviceProp_t devProp{};) so the fallback is deterministic rather than UB. Based on learnings, this function must support cross-platform use with CUDA/ROCm, so a graceful fallback (e.g., return 64 for ROCm) on failure would be appropriate.

Suggested fix
 std::uint32_t inline getWarpSize(size_t dev = 0) {
 #ifdef _MKN_GPU_WARP_SIZE_
   return _MKN_GPU_WARP_SIZE_;
 #else
-  hipDeviceProp_t devProp;
-  [[maybe_unused]] auto ret = hipGetDeviceProperties(&devProp, dev);
-  return devProp.warpSize;
+  hipDeviceProp_t devProp{};
+  auto ret = hipGetDeviceProperties(&devProp, dev);
+  if (ret != hipSuccess) return 64; // safe ROCm default
+  return devProp.warpSize;
 #endif /*_MKN_GPU_WARP_SIZE_    */
 }
inc/mkn/gpu/rocm.hpp (1)

45-58: Remove commented-out code.

Three entire function definitions are commented out. If they've been moved or deprecated, remove them. If they're pending work, add a tracking TODO or issue reference.

inc/mkn/gpu/cuda/cls.hpp (1)

69-74: Pointer uses magic numbers for cudaMemoryType enum values.

Using 0, 1, 2, 3 is fragile if CUDA SDK changes enum values. Consider using the named constants (cudaMemoryTypeUnregistered, cudaMemoryTypeHost, cudaMemoryTypeDevice, cudaMemoryTypeManaged).

Suggested fix
-  bool is_unregistered_ptr() const { return attributes.type == 0; }
-  bool is_host_ptr() const {
-    return attributes.type == 1 || (is_unregistered_ptr() && t != nullptr);
-  }
-  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == 2; }
-  bool is_managed_ptr() const { return attributes.type == 3; }
+  bool is_unregistered_ptr() const { return attributes.type == cudaMemoryTypeUnregistered; }
+  bool is_host_ptr() const {
+    return attributes.type == cudaMemoryTypeHost || (is_unregistered_ptr() && t != nullptr);
+  }
+  bool is_device_ptr() const { return is_managed_ptr() || attributes.type == cudaMemoryTypeDevice; }
+  bool is_managed_ptr() const { return attributes.type == cudaMemoryTypeManaged; }
inc/mkn/gpu/cuda/api.hpp (1)

31-41: global_d_kernel intentionally omits bounds checking; global_gd_kernel provides the guarded variant.

This matches the documented design intent. One minor observation: global_gd_kernel computes idx() with the default uint32_t return type while s is std::size_t (64-bit). For grids exceeding ~4 billion threads the truncated index would silently pass the guard.

This is unlikely to be a practical problem today, but if you ever foresee very large 1D dispatches, consider idx<std::size_t>() here.

Optional: match index width to the size parameter
 template <typename F, typename... Args>
 __global__ static void global_gd_kernel(F f, std::size_t s, Args... args) {
-  if (auto i = mkn::gpu::cuda::idx(); i < s) f(args...);
+  if (auto i = mkn::gpu::cuda::idx<std::size_t>(); i < s) f(args...);
 }
inc/mkn/gpu/cuda/def.hpp (2)

45-45: static on warp_size gives each TU its own copy (and its own cudaGetDeviceProperties call).

If _MKN_GPU_WARP_SIZE_ is not defined, every translation unit that includes this header will independently query the device at static-init time. Changing static to inline would deduplicate to a single initialization across TUs. If per-TU copies are intentional (e.g. to avoid cross-TU init-order issues with the macro namespace), this is fine as-is.

Optional: single-init variant
-static std::uint32_t inline const warp_size = getWarpSize();
+inline std::uint32_t const warp_size = getWarpSize();

26-31: gpuAssert calls std::abort() but <cstdlib> is not included.

This works transitively via other headers today, but could break if include chains change.

Add explicit include
 #include <string>
 #include <cstdint>
+#include <cstdlib>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/rocm/cls.hpp`:
- Around line 149-161: Add the missing resize API and dev member to align ROCm
GLauncher with CUDA: introduce a std::size_t dev field and implement a
resize(std::size_t s, size_t dev = 0) method (move the current grid/block
calculation and hipGetDeviceProperties call out of the constructor into resize)
so it sets count, dev, initializes devProp via hipGetDeviceProperties(&devProp,
dev), updates b.x and g.x like the current logic, and keep cli{devProp} intact;
make the constructor delegate to Launcher and call resize(count, dev) so
existing code that calls GLauncher::resize(...) or accesses GLauncher::dev works
cross-backend.

In `@test/any/managed.cpp`:
- Around line 86-91: The test suite lost zero-fill coverage because test_zero()
was removed from main; restore coverage by reintroducing a call to test_zero()
in main (alongside test(), test_guess(), test_lambda_copy_capture_views(), and
test_lambda_ref_copy_capture_views()) or, if you prefer to centralize tests, add
an equivalent invocation of test_zero() in the test registration/runner so it
executes; ensure the zero device helper (zero / functions in devfunc.ipp) and
any device-side zero-fill implementation referenced by test_zero() are present
and included so the test compiles and runs (look for symbols test_zero, zero,
and devfunc.ipp to locate the missing pieces).
🧹 Nitpick comments (6)
inc/mkn/gpu/rocm.hpp (1)

45-58: Consider removing commented-out code instead of leaving it.

These three template functions (_vector_fill, two fill overloads) are fully commented out. If they've been relocated or are no longer needed, remove them to reduce noise. Version control preserves the history.

inc/mkn/gpu/cuda/def.hpp (1)

35-45: getWarpSize: cudaGetDeviceProperties error is silently discarded.

If the CUDA runtime call fails (line 40), devProp.warpSize is read from an uninitialized struct — undefined behavior. The [[maybe_unused]] suppresses the warning but doesn't handle the failure. This was discussed in a previous review iteration; noting for completeness that it's still unguarded. The same pattern exists in rocm/def.hpp:38-40.

Minimal defensive fix (value-initialize devProp)
-  cudaDeviceProp devProp;
+  cudaDeviceProp devProp{};
   [[maybe_unused]] auto ret = cudaGetDeviceProperties(&devProp, dev);

This at least zero-initializes the struct so the return value is deterministic (0) on failure rather than UB.

inc/mkn/gpu/cuda.hpp (1)

89-91: Commented-out device.ipp include is a leftover.

Line 89 is a commented-out include. Since device.ipp is included in cuda/cls.hpp (line 103), this line serves no purpose and can be removed.

inc/mkn/gpu/rocm/def.hpp (3)

34-42: Same devProp initialization concern as cuda/def.hpp.

hipDeviceProp_t devProp (line 38) is uninitialized; if hipGetDeviceProperties fails, reading devProp.warpSize is UB. Consider value-initializing: hipDeviceProp_t devProp{};.


55-59: Minor inconsistency: ROCm alloc(void*&) logs KLOG(TRC) but CUDA counterpart does not.

In cuda/def.hpp lines 57–59, the alloc(void*&, Size) overload has no KLOG(TRC) call, while this ROCm version does (line 57). If the intent is symmetric tracing across backends, consider adding it to the CUDA side too.


73-78: alloc_managed: bytes computed but log uses size * sizeof(T) again.

Line 75 computes auto const bytes = size * sizeof(T), but line 76 recomputes size * sizeof(T) in the log. Use bytes for consistency:

Suggested fix
   auto const bytes = size * sizeof(T);
-  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
+  KLOG(TRC) << "GPU alloced: " << bytes;
   MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@inc/mkn/gpu/any/cls.hpp`:
- Around line 1-24: There's a duplicate template struct Cli<Device> defined
twice causing an ODR violation; remove the redundant definition from the second
header (the one that currently duplicates lines 41-51 in env.hpp) and either (a)
include the canonical header that defines template struct Cli<Device> (the one
matching inc/mkn/gpu/any/cls.hpp) and use that definition, or (b) replace the
duplicate with a thin delegation/alias to the canonical Cli<Device> (e.g., a
using/forwarding declaration) so all translation units share the single
definition of Cli<Device>.
🧹 Nitpick comments (7)
res/test_all.sh (1)

13-13: exit 0 is redundant inside the subshell.

With set -e, the subshell already exits non-zero on failure, and exits 0 naturally if all commands succeed. The explicit exit 0 is a no-op.

inc/mkn/gpu/any/cls.hpp (1)

13-16: auto return type requires both return paths to deduce the same type.

kul::String::INT32(...) likely returns std::int32_t while dev.maxThreadsPerBlock may be int (e.g., in cudaDeviceProp). On common platforms these are the same type, but if they ever differ, this will fail to compile. This appears to match existing code in inc/mkn/gpu/cli.hpp, so it's likely fine in practice.

inc/mkn/gpu/rocm/def.hpp (1)

73-78: Minor: alloc_managed computes bytes but the log still uses size * sizeof(T) instead of bytes.

The local bytes variable is computed on line 75 but the KLOG on line 76 recalculates the same expression. This is consistent with the CUDA counterpart, so likely copy-paste.

Optional cleanup
 template <typename T, typename Size>
 void alloc_managed(T*& p, Size size) {
   auto const bytes = size * sizeof(T);
-  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
+  KLOG(TRC) << "GPU alloced: " << bytes;
   MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));
 }
inc/mkn/gpu/cuda.hpp (1)

89-91: Commented-out include is a stale breadcrumb.

device.ipp is already included via cuda/cls.hpp (line 103 of that file). The commented-out line here could be removed to avoid confusion.

inc/mkn/gpu/cuda/def.hpp (1)

73-77: Minor: alloc_managed computes bytes but the log recalculates size * sizeof(T).

Same as the ROCm counterpart: compute bytes once and reuse it in both the log and the allocation call.

Optional cleanup
 template <typename T, typename Size>
 void alloc_managed(T*& p, Size size) {
-  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
-  MKN_GPU_ASSERT(cudaMallocManaged((void**)&p, size * sizeof(T)));
+  auto const bytes = size * sizeof(T);
+  KLOG(TRC) << "GPU alloced: " << bytes;
+  MKN_GPU_ASSERT(cudaMallocManaged((void**)&p, bytes));
 }
inc/mkn/gpu/rocm.hpp (1)

50-58: Nice DRY refactor on fill — consider aligning the CUDA counterpart.

The second fill overload now delegates to the first (fill(c, c.size(), val)), eliminating duplication. The CUDA equivalent in cuda.hpp (lines 59–62) still duplicates the GLauncher call inline.

inc/mkn/gpu/rocm/cls.hpp (1)

62-86: ROCm Pointer has assert(t) and extra type member — minor divergence from CUDA.

The CUDA Pointer (in cuda/cls.hpp) doesn't assert on the input pointer or store a separate type field. This isn't a bug — ROCm's hipPointerAttribute_t has different semantics — but it's worth noting for anyone porting code between backends.

@PhilipDeegan PhilipDeegan merged commit e5ee749 into master Feb 15, 2026
1 check passed
@PhilipDeegan PhilipDeegan deleted the next branch February 15, 2026 18:33