DeepGEMM
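DeepSeek is open-sourcing something new every day, and I can't keep up.
It is an FP8 GEMM library. Kernels are compiled with JIT at runtime, which is convenient.
It exclusively supports NVIDIA Hopper tensor cores, so my RTX 40-series card can't run it.
It borrows from CUTLASS but is decoupled from CUTLASS's heavy utils. It is only ~300 lines of code.
They even compared the CUTLASS SASS output and found a black-magic trick: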
We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess, yielding the current warp and let other warps work).
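Data movement is accelerated with TMA and overlapped with computation.
// Set up the barriers: full means the data is ready to compute on; empty means the computation is done with the buffer, so it can be swapped for the next one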
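The full/empty handshake is easier to see outside a kernel. Below is a minimal host-side analogy in C++, not DeepGEMM code: a producer thread stands in for the TMA loads, a consumer thread stands in for the math warpgroups, and a mutex plus condition variable stand in for the hardware barriers. The names `StagePipeline` and `run_pipeline` are made up for this sketch.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Host-side analogy only: each stage has one slot plus a full/empty flag.
// The flag plays the role of the full/empty barrier pair for that stage.
struct StagePipeline {
    explicit StagePipeline(std::size_t num_stages)
        : slots(num_stages, 0), is_full(num_stages, false) {
        assert(num_stages > 0);
    }

    // Producer side: wait until the slot is empty, fill it, mark it full.
    void produce(std::size_t stage, int value) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [&] { return !is_full[stage]; });
        slots[stage] = value;
        is_full[stage] = true;
        cv.notify_all();
    }

    // Consumer side: wait until the slot is full, read it, mark it empty.
    int consume(std::size_t stage) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [&] { return is_full[stage]; });
        int value = slots[stage];
        is_full[stage] = false;
        cv.notify_all();
        return value;
    }

    std::mutex mu;
    std::condition_variable cv;
    std::vector<int> slots;
    std::vector<bool> is_full;
};

// Streams the values 1..num_items through the ring of stages and
// returns the sum seen by the consumer.
long long run_pipeline(std::size_t num_stages, int num_items) {
    StagePipeline pipe(num_stages);
    long long sum = 0;
    std::thread producer([&] {
        for (int i = 0; i < num_items; ++i)
            pipe.produce(static_cast<std::size_t>(i) % num_stages, i + 1);
    });
    std::thread consumer([&] {
        for (int i = 0; i < num_items; ++i)
            sum += pipe.consume(static_cast<std::size_t>(i) % num_stages);
    });
    producer.join();
    consumer.join();
    return sum;
}
```

With more than one stage the producer can run ahead of the consumer by up to `num_stages` slots, which is the overlap of loading and computing that the barriers make safe.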
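Matrix computation uses WGMMA, the warpgroup-level matrix multiply-accumulate instruction on the tensor cores (matrix A·B + C). The tensor descriptors here involve the CuTe layout, a hierarchical tensor layout designed to speed up matrix operations. On layouts: this paper covers the history of layouts.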
This kind of 2D layout makes it easy to describe cutting tiles out of a large matrix.
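To make the tiling idea concrete, here is a minimal sketch of a 2D layout as a (shape, stride) pair, written in plain C++ rather than with CuTe itself. `Layout2D`, `TileView`, `row_major`, and `make_tile` are hypothetical names for illustration; real CuTe layouts are hierarchical and much more general.

```cpp
#include <cassert>
#include <cstddef>

// A 2D layout maps a (row, col) coordinate to a linear offset
// using a shape and a stride per dimension (all in elements).
struct Layout2D {
    std::size_t rows, cols;              // shape
    std::size_t row_stride, col_stride;  // stride

    std::size_t offset(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return r * row_stride + c * col_stride;
    }
};

// A tile is the parent's strides with a smaller shape and a base offset.
struct TileView {
    std::size_t base;
    Layout2D layout;

    std::size_t offset(std::size_t r, std::size_t c) const {
        return base + layout.offset(r, c);
    }
};

// Convenience constructor for a dense row-major matrix.
Layout2D row_major(std::size_t rows, std::size_t cols) {
    return Layout2D{rows, cols, cols, 1};
}

// Cuts tile (tile_r, tile_c) of size tile_rows x tile_cols out of m.
TileView make_tile(const Layout2D& m, std::size_t tile_rows,
                   std::size_t tile_cols, std::size_t tile_r,
                   std::size_t tile_c) {
    assert((tile_r + 1) * tile_rows <= m.rows);
    assert((tile_c + 1) * tile_cols <= m.cols);
    std::size_t base = m.offset(tile_r * tile_rows, tile_c * tile_cols);
    return TileView{base,
                    Layout2D{tile_rows, tile_cols, m.row_stride, m.col_stride}};
}
```

Cutting a tile never copies or reorders data: only the shape and base offset change, which is why the same description works for row-major and column-major parents.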
A single loop body computes at most 4 tiles:
for (int k = 0; k < BLOCK_K / WGMMA::K; ++ k) {
The WGMMA implementation is instructions lifted straight from CUTLASS, with variants for the different sizes. See mma_utils.cuh.
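The k-loop above splits one block's K dimension into instruction-sized slices and accumulates A·B + C one slice at a time. A minimal CPU sketch of that structure follows, assuming made-up sizes (`BLOCK_K = 8`, `INST_K = 2`, so 4 steps) and a scalar `mma_step` standing in for one WGMMA instruction. None of these names or sizes are taken from DeepGEMM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sizes for illustration only.
constexpr std::size_t BLOCK_M = 4, BLOCK_N = 4, BLOCK_K = 8, INST_K = 2;

// Stand-in for one MMA instruction:
// acc += A[:, k0 : k0 + INST_K] * B[k0 : k0 + INST_K, :]
// A is BLOCK_M x BLOCK_K and B is BLOCK_K x BLOCK_N, both row-major.
void mma_step(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& acc, std::size_t k0) {
    for (std::size_t m = 0; m < BLOCK_M; ++m)
        for (std::size_t n = 0; n < BLOCK_N; ++n)
            for (std::size_t k = k0; k < k0 + INST_K; ++k)
                acc[m * BLOCK_N + n] += a[m * BLOCK_K + k] * b[k * BLOCK_N + n];
}

// Computes D = A * B + C for one block by walking K in INST_K-wide slices.
std::vector<float> block_gemm(const std::vector<float>& a,
                              const std::vector<float>& b,
                              const std::vector<float>& c) {
    assert(a.size() == BLOCK_M * BLOCK_K);
    assert(b.size() == BLOCK_K * BLOCK_N);
    assert(c.size() == BLOCK_M * BLOCK_N);
    std::vector<float> acc = c;  // the accumulator starts from C
    for (std::size_t k = 0; k < BLOCK_K / INST_K; ++k)
        mma_step(a, b, acc, k * INST_K);
    return acc;
}
```

On the GPU each `mma_step` would be a single asynchronous tensor core instruction operating on shared-memory tiles; the scalar triple loop here only mirrors the arithmetic.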