Expert Parallelism Load Balancing (EPLB)

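MoE models overlap two batches so that neither communication nor compute sits idle, with communication kept slightly longer than compute, by about 10 us. That squeezes the most out of both compute and communication. But is that enough?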

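Even with Two-Batch-Overlap it was still slow. Profiling showed the communication wait getting longer and longer as the run went on; compute took almost no time, and nearly everything was waiting on communication. Why?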

You can only see it once you merge the traces from all GPUs into one view:

[Figure: load imbalance]
In the figure above, light pink is Dispatch wait, dark purple is Combine wait, and the blank areas are compute kernels.

Some GPUs spend the whole time waiting on communication; others never wait at all and compute the entire time.

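That means expert load is unbalanced: the hot experts are all concentrated on one GPU, which cannot keep up, while the other GPUs sit idle waiting for it.
[Figure: expert parallelism in practice]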
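To put a number on "one GPU cannot keep up", here is a minimal sketch that adds up the tokens routed to each GPU under a naive placement. The token counts, the 4-GPU layout, and the helper names `gpu_loads` and `imbalance` are all made up for illustration; they are not taken from the trace.

```python
# Hypothetical numbers: tokens routed to each of 8 experts in one step.
expert_tokens = [900, 850, 100, 120, 80, 90, 110, 70]

# Naive placement: experts laid out in index order, 2 per GPU.
placement = [[0, 1], [2, 3], [4, 5], [6, 7]]


def gpu_loads(expert_tokens, placement):
    """Sum the routed tokens of the experts hosted on each GPU."""
    return [sum(expert_tokens[e] for e in experts) for experts in placement]


def imbalance(loads):
    """Busiest GPU divided by the average GPU (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


loads = gpu_loads(expert_tokens, placement)
print(loads, round(imbalance(loads), 2))
```

Every GPU has to wait for the busiest one before the combine can finish, so a ratio well above 1.0 shows up in the trace as wait time on all the other GPUs.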

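So what now? We can't change which experts a token picks, so we change where the experts live instead. As long as the hot experts don't all end up on the same GPU, we're fine.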

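Hook the dispatch call and count how many times each expert gets selected. Then reorder the experts by load, mixing hot and cold experts on each GPU so they even out. On top of that, you can add redundant experts: if a hot expert has only one copy, whichever GPU holds it will be overwhelmed, so make another copy.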
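The count-then-reorder idea can be sketched roughly as follows. `ExpertCounter` is a hypothetical stand-in for whatever hook wraps dispatch, and `pack_balanced` is a plain greedy heuristic (heaviest expert first, onto the lightest GPU that still has a free slot). It is an assumption for illustration, not the actual EPLB code.

```python
class ExpertCounter:
    """Hypothetical hook state: how often each expert was selected."""

    def __init__(self, num_experts):
        self.counts = [0] * num_experts

    def record(self, topk_ids):
        # topk_ids holds one list of selected expert ids per token.
        for token_experts in topk_ids:
            for e in token_experts:
                self.counts[e] += 1


def pack_balanced(counts, num_gpus):
    """Greedy packing with the same number of experts on every GPU."""
    assert len(counts) % num_gpus == 0
    slots = len(counts) // num_gpus
    placement = [[] for _ in range(num_gpus)]
    loads = [0] * num_gpus
    # Visit experts from hottest to coldest.
    for e in sorted(range(len(counts)), key=lambda i: -counts[i]):
        open_gpus = [g for g in range(num_gpus) if len(placement[g]) < slots]
        g = min(open_gpus, key=lambda i: loads[i])
        placement[g].append(e)
        loads[g] += counts[e]
    return placement, loads


counter = ExpertCounter(num_experts=4)
counter.record([[0, 1], [0, 2], [0, 3]])  # 3 tokens, top-2 routing
print(counter.counts)

# Made-up counts for 8 experts on 4 GPUs.
placement, loads = pack_balanced([400, 380, 50, 60, 350, 40, 370, 30], 4)
print(placement, loads)
```

With these made-up counts, laying the experts out in index order would put experts 0 and 1 (780 tokens) on one GPU; the greedy packing pairs each hot expert with a cold one instead.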

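A not-so-rigorous example:
[Figure: unbalanced]
In the example, red means high load and green means low load. In the figure above, GPU1 has the highest load and GPU3 the lowest.
[Figure: balanced]
After balancing, the hot experts 1, 3, 5 and 6 are each duplicated, and all four GPUs carry roughly the same load.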
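Replication can be sketched the same way. The heuristic below hands each spare slot to whichever expert currently has the highest load per replica, then packs the replicas greedily. The counts, the 3-slots-per-GPU budget, and the names `replicate` and `pack_replicas` are assumptions for illustration, not a description of the real implementation.

```python
def replicate(counts, num_slots):
    """Start with one replica each; give every spare slot to the expert
    whose per-replica load is currently the highest."""
    replicas = [1] * len(counts)
    for _ in range(num_slots - len(counts)):
        hottest = max(range(len(counts)), key=lambda e: counts[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


def pack_replicas(counts, replicas, num_gpus):
    """Greedy packing of replicas, assuming each replica of an expert
    receives an equal share of that expert's tokens."""
    items = [(e, counts[e] / replicas[e])
             for e in range(len(counts)) for _ in range(replicas[e])]
    slots = len(items) // num_gpus
    placement = [[] for _ in range(num_gpus)]
    loads = [0.0] * num_gpus
    for e, load in sorted(items, key=lambda item: -item[1]):
        open_gpus = [g for g in range(num_gpus) if len(placement[g]) < slots]
        g = min(open_gpus, key=lambda i: loads[i])
        placement[g].append(e)
        loads[g] += load
    return placement, loads


counts = [900, 850, 100, 120, 80, 90, 110, 70]     # made-up token counts
replicas = replicate(counts, num_slots=12)          # 4 GPUs x 3 slots
placement, loads = pack_replicas(counts, replicas, num_gpus=4)
print(replicas)
print(placement, [round(x) for x in loads])
```

With two experts this hot, reordering alone cannot help, because whichever GPU holds expert 0 carries at least 900 tokens. Splitting each hot expert into three replicas brings the busiest GPU down to roughly 650. The sketch does not stop two replicas of the same expert from landing on one GPU, which wastes a slot; a more careful version would add that constraint.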

EPLB also has two balancing policies:

Hierarchical Load Balancing
When the number of server nodes divides the number of expert groups, we use the hierarchical load balancing policy to harness the group-limited expert routing. We first pack the expert groups to nodes evenly, ensuring the loads of different nodes are balanced. Then, we replicate the experts within each node. Finally, we pack the replicated experts to individual GPUs to ensure different GPUs are load-balanced. The hierarchical load balancing policy can be used in prefilling stage with a smaller expert-parallel size.

Global Load Balancing
In other cases, we use the global load balancing policy that replicates the experts globally regardless of expert groups, and pack the replicated experts to individual GPUs. This policy can be adopted in decoding stage with a larger expert-parallel size.
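The choice between the two policies, and the first step of the hierarchical one, can be sketched like this. The divisibility check follows the quoted description; `choose_policy`, `pack_groups_to_nodes`, and the group loads are hypothetical and only illustrate the idea.

```python
def choose_policy(num_groups, num_nodes):
    """Hierarchical when the node count divides the group count, else global."""
    return "hierarchical" if num_groups % num_nodes == 0 else "global"


def pack_groups_to_nodes(group_loads, num_nodes):
    """First hierarchical step: spread whole expert groups over nodes,
    same number of groups per node, node loads as even as greedy allows."""
    per_node = len(group_loads) // num_nodes
    nodes = [[] for _ in range(num_nodes)]
    node_loads = [0] * num_nodes
    for g in sorted(range(len(group_loads)), key=lambda i: -group_loads[i]):
        open_nodes = [n for n in range(num_nodes) if len(nodes[n]) < per_node]
        n = min(open_nodes, key=lambda i: node_loads[i])
        nodes[n].append(g)
        node_loads[n] += group_loads[g]
    return nodes, node_loads


print(choose_policy(num_groups=8, num_nodes=2))
print(choose_policy(num_groups=8, num_nodes=3))

# Made-up total load per expert group.
nodes, node_loads = pack_groups_to_nodes(
    [500, 100, 300, 300, 450, 150, 200, 400], num_nodes=2)
print(nodes, node_loads)
```

After this step, the hierarchical policy replicates experts inside each node and packs them onto that node's GPUs; the global policy skips the grouping and replicates and packs across all GPUs directly.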

EPLB Github

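Decode nodes use global load balancing and simply rearrange all weights across all GPUs (large EP size).
Prefill nodes balance weights by group (small EP size): first pack the expert groups onto nodes so that heavy groups are paired with light ones, then replicate and place experts within each node. As for why group balancing differs from global balancing, the claim is that keeping a whole group on one machine is better. The quoted description ties this to group-limited expert routing, so it presumably keeps more of a token's traffic inside one node. Or maybe training also includes a group-related loss?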

[Figure: balanced trace]

Look at the trace again after balancing. Ah, much better.