Summary of LLM Communication and Compute Costs

Batch Size $B$
Sequence length $S$ (written $s$ in the FLOP formulas, where $S$ instead denotes the attention score matrix)
Head num $H$
Head dim $d$
Hidden size $h = H\times d$
Parallel Size $p$

¶Tensor Parallel in one machine

¶Attention

Compute cost

[Figure: Attention]
Usually $H \times d = h$ (hidden_size).
Each term of a matrix multiplication is a multiply-add, which counts as 2 operations.
$XW_{qkv}$ :

$2 \times B \times 3 \times s \times h \times h = 6Bsh^2$

$QK^T$ :

$2 \times B \times s \times h \times s = 2Bs^2h$

$P = SV$ :

$2 \times B \times s \times s \times h = 2Bs^2h$

$output = PW_o$ :

$2 \times B \times s \times h \times h = 2Bsh^2$

total:

$\text{FLOP} = 8Bsh^2 + 4Bs^2h$
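The four terms above can be checked with a few lines of Python. This is a minimal sketch: the function name `attention_prefill_flops` is made up for illustration, and it only counts the four matrix multiplications listed above (softmax, masking and normalization are ignored).

```python
def attention_prefill_flops(B: int, s: int, h: int) -> dict:
    """Per-layer attention FLOPs for prefill; one multiply-add counts as 2 ops."""
    qkv = 2 * B * 3 * s * h * h   # X @ W_qkv
    scores = 2 * B * s * h * s    # Q @ K^T
    pv = 2 * B * s * s * h        # scores @ V
    out = 2 * B * s * h * h       # P @ W_o
    return {
        "qkv": qkv,
        "scores": scores,
        "pv": pv,
        "out": out,
        "total": qkv + scores + pv + out,
    }


# Toy sizes, chosen only so the numbers are easy to verify by hand.
flops = attention_prefill_flops(B=1, s=2, h=4)
# The sum of the parts matches the closed form 8Bsh^2 + 4Bs^2h.
assert flops["total"] == 8 * 1 * 2 * 4**2 + 4 * 1 * 2**2 * 4
print(flops)
```

The quadratic term $4Bs^2h$ overtakes the projection term $8Bsh^2$ once $s > 2h$, which is why long prompts become attention-bound.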

Communication volume:
All-Reduce after the attention layer:

$2 \times (p-1) \times \frac{1}{p} BSHd = \frac{2(p-1)BSHd}{p}$

In the prefill stage, Sequence Parallel uses Reduce-Scatter + All-Gather instead; the communication volume is the same as for All-Reduce.

[Figure: Sequence Parallel]
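The claim that Reduce-Scatter + All-Gather costs the same as one All-Reduce can be checked numerically. The sketch below assumes ring-style collectives, which is where the $\frac{2(p-1)}{p}$ factor comes from; the function names are illustrative, and volumes are counted in elements per rank.

```python
def all_reduce_volume(n: int, p: int) -> float:
    """Elements sent per rank by a ring All-Reduce over n elements."""
    return 2 * (p - 1) * n / p


def reduce_scatter_volume(n: int, p: int) -> float:
    """Elements sent per rank by a ring Reduce-Scatter over n elements."""
    return (p - 1) * n / p


def all_gather_volume(n: int, p: int) -> float:
    """Elements sent per rank by a ring All-Gather producing n elements."""
    return (p - 1) * n / p


# Toy sizes (assumptions for illustration only).
B, S, H, d, p = 1, 8, 2, 4, 4
n = B * S * H * d  # activation size in elements

ar = all_reduce_volume(n, p)
rs_ag = reduce_scatter_volume(n, p) + all_gather_volume(n, p)
assert ar == rs_ag
print(ar, rs_ag)
```

Multiply the result by sizeof(dtype) to convert elements to bytes.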

¶MLP Block

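The communication pattern before and after the block is the same as for Attention, so the communication volume is also the same.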

[Figure: MLP Tensor Parallel]
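Compute cost:
The MLP projects up and then back down. Let the up-projected dimension be $d$ (in this section $d$ is the MLP intermediate size, not the head dim). One projection costs: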

$2 \times B \times shd = 2Bshd$

Total for the up-projection plus the down-projection:

$\text{FLOP}=4Bshd$
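As a quick check of the MLP count, the sketch below adds the two projections. The name `mlp_flops` and the parameter `d_ff` (standing in for the intermediate dimension called $d$ above) are illustrative, and the model is assumed to be a plain two-matrix MLP with no gating branch.

```python
def mlp_flops(B: int, s: int, h: int, d_ff: int) -> int:
    """FLOPs of a two-matrix MLP: h -> d_ff -> h, multiply-add = 2 ops."""
    up = 2 * B * s * h * d_ff     # X @ W_up
    down = 2 * B * s * d_ff * h   # hidden @ W_down
    return up + down


# Toy sizes; d_ff = 4 * h is a common choice but is an assumption here.
total = mlp_flops(B=2, s=3, h=4, d_ff=16)
assert total == 4 * 2 * 3 * 4 * 16
print(total)
```

A gated MLP with a third weight matrix would add one more $2Bshd$ term.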

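¶FFN in MoE

One All-To-All before the FFN and one after it.
The first gathers the attention results so that the FFN receives the complete sequence as input; the second distributes the FFN output back to the ranks, which continue the remaining layers under TP.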

$2 \times (p-1) \times \frac{1}{p} BSHd = \frac{2(p-1)BSHd}{p}$

[Figure: Sequence Parallel]

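Compute cost:
The FFN inside MoE projects down and then back up. Let the reduced dimension be $d$ (here $d$ is the FFN inner dimension, not the head dim):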

$\text{FLOP}=4Bshd$

¶Pipeline Parallel

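Layers are partitioned across ranks, so each rank is responsible for one part of the network.
A single rank-to-rank transfer is $BSHd$, so the total communication is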

$(p-1) \times BSHd$

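Communication volume drops, which maximizes server-side utilization.
Each request runs on the ranks one after another, passing its output to the next rank, so latency is higher than with Tensor Parallel.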
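To make the comparison concrete, the sketch below puts the Pipeline Parallel total next to the Tensor Parallel total for one forward pass. It assumes one ring All-Reduce after attention and one after the MLP in every layer, as described in the Tensor Parallel sections; the function names and the toy layer count are illustrative.

```python
def pp_total_volume(n: int, p: int) -> int:
    """Pipeline Parallel: (p - 1) stage boundaries, n elements each."""
    return (p - 1) * n


def tp_total_volume(n: int, p: int, num_layers: int) -> float:
    """Tensor Parallel: two ring All-Reduces per layer (attention + MLP)."""
    per_all_reduce = 2 * (p - 1) * n / p
    return 2 * num_layers * per_all_reduce


# Toy sizes (assumptions for illustration only).
B, S, H, d, p, num_layers = 1, 8, 2, 4, 4, 2
n = B * S * H * d

pp = pp_total_volume(n, p)
tp = tp_total_volume(n, p, num_layers)
print(pp, tp)
```

The Tensor Parallel total grows with the number of layers, while the Pipeline Parallel total depends only on the number of stages, so the gap widens for deep models.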

TP + PP, e.g. $\text{TP size} = 4$, $\text{PP size} = 2$: after the attention All-Reduce, the result can be passed from $P_0$ to $P_1$, and one such transfer moves $p \times \frac{BSHd}{p} = BSHd$.

¶KV Cache

Layer num $L$

The decoder needs the K and V of the tokens from every previous step: $s' = s_{prefill}+s_{decode}$

KV cache size:

$2 \times L \times Hd \times s'$

This counts elements for one sequence. Multiply by sizeof(dtype) to get bytes, and by $B$ for a batch.
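The formula translates directly into code. In this sketch the function name and the example configuration (32 layers, 32 heads of dim 128, 2-byte elements) are hypothetical and not taken from any specific model; the batch factor `B` is included so the result is the total over all sequences.

```python
def kv_cache_bytes(L: int, H: int, d: int, s_prefill: int, s_decode: int,
                   dtype_bytes: int, B: int = 1) -> int:
    """KV cache size in bytes: 2 (K and V) * layers * H*d * tokens * dtype."""
    s_total = s_prefill + s_decode
    return 2 * L * H * d * s_total * dtype_bytes * B


# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(L=32, H=32, d=128, s_prefill=3072, s_decode=1024,
                      dtype_bytes=2)
print(size / 2**30, "GiB")
```

With these numbers a single 4096-token sequence takes 2 GiB, which shows how quickly the cache grows with batch size and context length.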

¶KV cache size of common models

DeepSeek V3:
head size: compressed KV 512, RoPE k size 64
Layers: 61

Cached elements per token, over all layers:

$(512 + 64) \times 61 = 35136$
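The DeepSeek V3 figure can be turned into bytes per token as follows. The three sizes come from the text above; the 2-byte element size is an assumption (a 16-bit dtype), and the variable names are illustrative. Note that there is no factor of 2 here, because the compressed vector is cached once per token rather than as separate K and V.

```python
# Sizes quoted in the text above.
compressed_kv_dim = 512
rope_k_dim = 64
num_layers = 61

# Assumption: 2 bytes per element (16-bit dtype).
dtype_bytes = 2

elements_per_token = (compressed_kv_dim + rope_k_dim) * num_layers
bytes_per_token = elements_per_token * dtype_bytes

print(elements_per_token, bytes_per_token)
```

Under that dtype assumption, each cached token costs roughly 69 KiB across all layers.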

¶Compute cost

Each step only computes Q, K and V for the current token, i.e. $s = 1$ in the projections:
$XW_{qkv}$

$\text{FLOP} = 2 \times B \times 3 \times 1 \times h \times h = 6Bh^2$

$Q_iK^T$ multiplies the current token's $Q_i$ with all cached $K$, so $s$ here is the current context length:

$\text{FLOP} = 2 \times B \times 1 \times h \times s = 2Bsh$

$P = SV$ :

$\text{FLOP} = 2 \times B \times 1 \times s \times h = 2Bsh$

$output = PW_o$ :

$\text{FLOP} =2 \times B \times 1 \times h \times h = 2Bh^2$

Total compute:

$\text{FLOP} = 8Bh^2 + 4Bsh$
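The per-step decode cost can be sketched the same way as the prefill cost. The function name `attention_decode_flops` is illustrative; `s` is the number of tokens whose K and V are attended to at this step, and only the four matrix multiplications above are counted.

```python
def attention_decode_flops(B: int, s: int, h: int) -> dict:
    """Per-layer attention FLOPs for one decode step with a KV cache."""
    qkv = 2 * B * 3 * 1 * h * h   # project only the current token
    scores = 2 * B * 1 * h * s    # Q_i against all cached K
    pv = 2 * B * 1 * s * h        # scores against all cached V
    out = 2 * B * 1 * h * h       # output projection
    return {
        "qkv": qkv,
        "scores": scores,
        "pv": pv,
        "out": out,
        "total": qkv + scores + pv + out,
    }


# Toy sizes, chosen only so the numbers are easy to verify by hand.
step = attention_decode_flops(B=1, s=2, h=4)
# The sum of the parts matches the closed form 8Bh^2 + 4Bsh.
assert step["total"] == 8 * 1 * 4**2 + 4 * 1 * 2 * 4
print(step)
```

Compared with prefill, every term loses one factor of $s$, so a decode step is linear in context length rather than quadratic.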