@PanZezhong1725
Collaborator

No description provided.

gongchensu and others added 30 commits December 29, 2025 17:04
Signed-off-by: Ceng23333 <441651826@qq.com>
…graph recording

- Ensure embedding tensors are on the same device. Change format.
- Optimize embedding kernel with vectorized memory access and __ldg
- Add vectorized memory access using float4/float2, half2, and bfloat162
- Use __ldg instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add __restrict__ keywords for better compiler optimization
- Implement dynamic block size selection based on embedding_dim
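The alignment checks and block size selection listed above could be sketched on the host side roughly as follows. This is only an illustration under stated assumptions: `pick_vector_width` and `pick_block_size` are hypothetical names, and the warp-multiple rounding and 1024-thread cap are assumed heuristics, not the logic actually used in this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): choose how many elements one
// vectorized load/store may cover. elem_size is bytes per element
// (4 for float, 2 for half / bfloat16). Mirrors the widths named in the
// commit message: float4/float2 for 32-bit, half2/bfloat162 for 16-bit.
inline int pick_vector_width(const void* weight, const void* out,
                             std::size_t embedding_dim, std::size_t elem_size) {
    const auto w = reinterpret_cast<std::uintptr_t>(weight);
    const auto o = reinterpret_cast<std::uintptr_t>(out);
    const std::size_t max_width = (elem_size == 4) ? 4 : 2;
    for (std::size_t n = max_width; n > 1; n /= 2) {
        const std::size_t bytes = n * elem_size;
        // Base pointers must be aligned to the vector size, and every row
        // must hold a whole number of vectors so each row start stays aligned.
        if (embedding_dim % n == 0 && w % bytes == 0 && o % bytes == 0) {
            return static_cast<int>(n);
        }
    }
    return 1;  // fall back to the scalar path
}

// Hypothetical heuristic (assumption): one thread per vector, rounded up to
// a multiple of the warp size (32) and clamped to [32, 1024].
inline int pick_block_size(std::size_t embedding_dim, int vector_width) {
    const std::size_t width = static_cast<std::size_t>(vector_width);
    const std::size_t lanes = (embedding_dim + width - 1) / width;
    std::size_t threads = ((lanes + 31) / 32) * 32;
    if (threads < 32) threads = 32;
    if (threads > 1024) threads = 1024;
    return static_cast<int>(threads);
}
```

A launcher would call `pick_vector_width` once per invocation and dispatch to the matching kernel variant, falling back to scalar loads when either pointer or `embedding_dim` fails the check.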
Wrap `NineToothedTensor` at the C++ layer

Add a way to create a `ninetoothed::Tensor` using arrays for `shape` and `strides`

Integrate the NineToothed ReLU operator using `ninetoothed::Tensor`

Add an include guard to `ninetoothed/utils.h`
spike-zhu and others added 28 commits February 11, 2026 14:41
issue/949 - feat: add silu_and_mul for moore gpu with test pass
issue/899 - fix: fix causal_softmax and rearrange bug
issue/838 - Cambricon Batched RoPE
issue/1012 - feat: add paged caching for moore gpu referencing nvidia
issue/1001 - feat: add paged attention prefill and decode for moore gpu referencing nvidia
issue/837 - support int32 and int64 in cambricon add
issue/523 - switched to cambricon mlu 1.22 interface
Issue/862 - Fix compilation errors (missing headers, cub namespace) t…
Issue/972: muDNN-based w8a8 quantization implementation for the Moore platform, and improve the scaled_mm_int8 python test script
issue/961: fix metax init with preload
@wooway777 merged commit 784139b into main Feb 13, 2026
10 checks passed
8 participants