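Mooncake in SGLang PD Disaggregation
¶Architecture Overview
SGLang's Prefill-Decode (PD) disaggregation is an important optimization for modern large language model inference. It splits the traditional monolithic inference process into two independent stages:
- Prefill stage: processes the input prompt and produces the initial key-value (KV) cache
- Decode stage: generates tokens autoregressively on top of the KV cache
Mooncake, as the core coordination component, connects these two physically separate instances and lets them work together efficiently.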
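To make the two-stage split concrete, here is a minimal toy sketch. It is not SGLang code: `KVCache`, `prefill`, and `decode` are hypothetical names, the "model" is a placeholder, and a real cache holds per-layer tensors rather than strings.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # One (key, value) pair per processed token (toy stand-in for tensors).
    entries: list = field(default_factory=list)


def prefill(prompt_tokens):
    """Process the whole prompt in one pass and build the initial KV cache."""
    cache = KVCache()
    for tok in prompt_tokens:
        cache.entries.append((f"k{tok}", f"v{tok}"))
    return cache


def decode(cache, steps):
    """Generate tokens one at a time, extending the cache on every step."""
    generated = []
    for _ in range(steps):
        next_tok = len(cache.entries)  # placeholder model: token = cache length
        cache.entries.append((f"k{next_tok}", f"v{next_tok}"))
        generated.append(next_tok)
    return generated


cache = prefill([10, 11, 12])     # would run on the Prefill instance
tokens = decode(cache, steps=2)   # would run on the Decode instance
```

In a disaggregated deployment the `cache` object cannot simply be passed between the two calls; moving it between machines is the job the rest of this post attributes to Mooncake.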
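¶Workflow in Detail
¶Initialization and Resource Preparation
When requests arrive, the Decode node first uses its prealloc component to reserve memory for the KV cache. Preallocating means the later transfer already has a destination to write into, and it avoids memory allocation overhead at runtime.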
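The preallocation step can be pictured as reserving slots from a fixed pool before any data arrives. The sketch below is an assumption-laden illustration: `KVSlotPool` and its methods are hypothetical and do not mirror SGLang's actual allocator.

```python
class KVSlotPool:
    """Toy fixed-size pool of KV cache slots on the Decode side."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))

    def prealloc(self, num_tokens):
        """Reserve one slot per prompt token, or return None if the pool is too full."""
        if num_tokens > len(self.free):
            return None  # caller keeps the request queued and retries later
        slots, self.free = self.free[:num_tokens], self.free[num_tokens:]
        return slots

    def release(self, slots):
        """Return slots to the pool once the request has finished."""
        self.free.extend(slots)


pool = KVSlotPool(num_slots=8)
first = pool.prealloc(5)    # succeeds: slots 0..4
second = pool.prealloc(5)   # fails: only 3 slots remain
pool.release(first)
retry = pool.prealloc(5)    # succeeds after the release
```

Returning `None` instead of raising keeps the retry loop simple, which matches the looping `alloc for KV cache` step in Figure 1.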
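¶Bootstrap Coordination
The receiver on the Decode node sends a bootstrap signal to the Prefill node through Mooncake, which triggers prefill processing. Because the Decode node initiates this step, it can actively coordinate Prefill work and keep tight control over the pipeline.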
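As a rough in-process illustration of this handshake, the sketch below uses a standard-library queue in place of the network. `BootstrapMsg`, `receiver_bootstrap`, and `prefill_poll` are hypothetical names, and the message fields are assumptions rather than the real wire format.

```python
import queue
from dataclasses import dataclass


@dataclass
class BootstrapMsg:
    request_id: int
    dst_slots: list  # slots the Decode side has already preallocated


# Stand-in for the Prefill node's waiting queue.
prefill_waiting_queue = queue.Queue()


def receiver_bootstrap(request_id, dst_slots):
    """Decode side: announce that a destination is ready for this request."""
    prefill_waiting_queue.put(BootstrapMsg(request_id, dst_slots))


def prefill_poll():
    """Prefill side: collect every bootstrapped request into the next batch."""
    ready = []
    while True:
        try:
            ready.append(prefill_waiting_queue.get_nowait())
        except queue.Empty:
            return ready


receiver_bootstrap(0, [0, 1, 2])
receiver_bootstrap(1, [3, 4])
batch = prefill_poll()
```

The point of the ordering is that Prefill only starts work on requests whose destination memory already exists, so a finished KV cache never waits for somewhere to go.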
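¶Architecture Design
¶Prefill Node Components
- Bootstrap Server (Mooncake): the core coordination component; manages node registration and connections
- KV Manager: manages the lifecycle of the KV cache
- KV Sender: handles data transfer, with support for chunked and streaming transmission
- Scheduler: task scheduling and resource allocation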
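The registration and connection role of the Bootstrap Server can be sketched as a small registry. This is a toy in-memory version under stated assumptions: the class, its method names, the addresses, and the `receiver_info` fields are all hypothetical.

```python
class BootstrapServer:
    """Toy in-memory registry; a real server would sit behind a network endpoint."""

    def __init__(self):
        self.prefill_endpoints = {}  # rank -> address of that rank's KV Manager
        self.pending = {}            # rank -> receiver infos waiting for KV data

    def register(self, rank, address):
        """Called by a Prefill-side KV Manager at startup."""
        self.prefill_endpoints[rank] = address

    def bootstrap(self, rank, receiver_info):
        """Called by a Decode-side KV Receiver; returns the matching Prefill address."""
        if rank not in self.prefill_endpoints:
            raise LookupError(f"no Prefill rank {rank} registered")
        self.pending.setdefault(rank, []).append(receiver_info)
        return self.prefill_endpoints[rank]


server = BootstrapServer()
server.register(0, "prefill-host:9000")  # hypothetical addresses
server.register(1, "prefill-host:9001")
peer = server.bootstrap(0, {"request_id": 7, "dst_slots": [0, 1, 2]})
```

Once `bootstrap` has paired the two sides, the KV data itself flows directly from sender to receiver and does not pass through the server.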
¶Decode Node Components
- KV Receiver: receives and processes the incoming KV cache
- KV Manager: manages the received cache data
- Scheduler: decode scheduling and token generation control
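The chunked hand-off between a KV Sender and a KV Receiver might look like the following sketch. It assumes, as Figure 1 suggests, that auxiliary data travels with the last chunk; `send_chunks`, `Receiver`, and the `first_token` field are hypothetical, and plain lists stand in for GPU memory.

```python
def send_chunks(kv, chunk_size, aux):
    """Yield (offset, chunk, aux_or_None); aux rides along with the last chunk only."""
    for start in range(0, len(kv), chunk_size):
        chunk = kv[start:start + chunk_size]
        is_last = start + chunk_size >= len(kv)
        yield start, chunk, (aux if is_last else None)


class Receiver:
    """Writes incoming chunks into a preallocated buffer."""

    def __init__(self, size):
        self.buffer = [None] * size
        self.aux = None

    def on_chunk(self, offset, chunk, aux):
        self.buffer[offset:offset + len(chunk)] = chunk
        if aux is not None:
            self.aux = aux  # arrival of aux marks the transfer as complete

    @property
    def done(self):
        return self.aux is not None


kv = list(range(10))
rx = Receiver(len(kv))
for offset, chunk, aux in send_chunks(kv, chunk_size=4, aux={"first_token": 42}):
    rx.on_chunk(offset, chunk, aux)
```

Sending in chunks lets the transfer start before the whole cache is staged, and tying completion to the final message gives the Decode scheduler one clear signal for when decoding can begin.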
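¶Technical Advantages and Value
¶Performance
- Lower TTFT: parallel processing and pipelining reduce time to first token, since the next batch's forward pass overlaps with the KV transfer of the previous batch
- Higher throughput: separating Prefill and Decode lets each be scaled and tuned independently
- Resource utilization: specialized components make fuller use of the hardware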
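The effect of overlapping compute and transfer can be worked through with a toy timeline model. Everything here is an assumption for illustration: the function is hypothetical, and the 100 ms forward pass and 40 ms transfer are made-up numbers, not measurements.

```python
def first_token_times(n_batches, forward_ms, transfer_ms, overlap):
    """Return the time at which each batch's KV cache has fully arrived on Decode."""
    times = []
    gpu_free = 0.0   # when the GPU can start the next forward pass
    link_free = 0.0  # when the sender can start the next transfer
    for _ in range(n_batches):
        fwd_done = gpu_free + forward_ms
        xfer_done = max(fwd_done, link_free) + transfer_ms
        link_free = xfer_done
        # With overlap the GPU moves on right after the forward pass;
        # without it, the GPU idles until the transfer has finished.
        gpu_free = fwd_done if overlap else xfer_done
        times.append(xfer_done)
    return times


pipelined = first_token_times(3, forward_ms=100, transfer_ms=40, overlap=True)
serial = first_token_times(3, forward_ms=100, transfer_ms=40, overlap=False)
# pipelined == [140.0, 240.0, 340.0]
# serial    == [140.0, 280.0, 420.0]
```

Under these assumed numbers the first batch is unaffected, and every later batch saves the transfer time of all batches ahead of it.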
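¶Architectural Flexibility
- Independent scaling: Prefill and Decode can each be scaled out according to their own load
- Fault isolation: a failure in a single node does not bring down the whole system
- Mixed deployment: nodes with different hardware configurations can be deployed together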
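¶Operational Advantages
- Service discovery: automated node management and connection setup
- Load balancing: intelligent request distribution and resource scheduling
- Monitoring and diagnostics: comprehensive metrics and diagnostic capabilities
This Mooncake-based PD disaggregation architecture gives large language model serving a scalable, high-performance, highly available infrastructure, and it represents an important direction for modern AI inference architecture.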
sequenceDiagram
autonumber
box Prefill
participant forward_A
participant forward_B
participant CPULoop
participant waiting_queue_p as waiting_queue
participant sender
end
box Decode
participant reciever as receiver
participant transfer_queue
participant prealloc
participant waiting_queue_d as waiting_queue
participant scheduler
end
Note over forward_A,scheduler: requests arrive, TTFT starts
activate scheduler
activate scheduler
activate scheduler
loop
prealloc->>prealloc: alloc for KV cache
end
prealloc->>transfer_queue: pop_prealloc
activate transfer_queue
activate transfer_queue
activate transfer_queue
Note over transfer_queue,reciever: init receiver
activate reciever
activate reciever
activate reciever
reciever->>waiting_queue_p: bootstrap prealloc
waiting_queue_p->>CPULoop: new-seqs
CPULoop->>+forward_A: batch 0 start
waiting_queue_p->>CPULoop: new-seqs
CPULoop->>+forward_B: batch 1 start
forward_A->>-CPULoop: batch 0 sync
CPULoop->>+sender: trans batch 0
loop Every chunk
sender->>reciever: send chunk
end
sender->>reciever: last chunk send aux
sender->>-CPULoop: trans batch 0 finish
reciever->>-transfer_queue: start decode
transfer_queue->>-waiting_queue_d: pop_transfer
Note right of scheduler: TTFT for req 0
waiting_queue_d->>scheduler: first token
deactivate scheduler
waiting_queue_p->>CPULoop: new-seqs
CPULoop->>+forward_A: batch 2 start
forward_B->>-CPULoop: batch 1 sync
CPULoop->>+sender: trans batch 1
loop Every chunk
sender->>reciever: send chunk
end
sender->>reciever: last chunk send aux
sender->>-CPULoop: trans batch 1 finish
reciever->>-transfer_queue: start decode
transfer_queue->>-waiting_queue_d: pop_transfer
Note right of scheduler: TTFT for req 1
waiting_queue_d->>scheduler: first token
deactivate scheduler
forward_A->>-CPULoop: batch 2 sync
CPULoop->>+sender: trans batch 2
loop Every chunk
sender->>reciever: send chunk
end
sender->>reciever: last chunk send aux
sender->>-CPULoop: trans batch 2 finish
reciever->>-transfer_queue: start decode
transfer_queue->>-waiting_queue_d: pop_transfer
Note right of scheduler: TTFT for req 2
waiting_queue_d->>scheduler: first token
deactivate scheduler
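Figure 1: PD disaggregation interaction sequence diagram. It shows the full workflow of the Prefill and Decode nodes under Mooncake coordination, including the key stages of resource preallocation, bootstrap coordination, parallel processing, chunked transfer, and decode start.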
graph
subgraph Prefill_Node [Prefill Node]
BS[Bootstrap Server]
subgraph GPU_P1 [GPU 1]
subgraph SP1 [Scheduler]
KVM_P1[KV Manager]
KVS_P1[KV Sender]
end
end
subgraph GPU_P2 [GPU 2]
subgraph SP2 [Scheduler]
KVM_P2[KV Manager]
KVS_P2[KV Sender]
end
end
end
subgraph Decode_Node [Decode Node]
subgraph GPU_D1 [GPU 1]
subgraph SD1 [Scheduler]
KVM_D1[KV Manager]
KVR_D1[KV Receiver]
end
end
subgraph GPU_D2 [GPU 2]
subgraph SD2 [Scheduler]
KVM_D2[KV Manager]
KVR_D2[KV Receiver]
end
end
end
KVM_P1 -->|register| BS
KVM_P2 -->|register| BS
KVR_D1 --->|bootstrap, send receiver info| BS
KVR_D2 --->|bootstrap, send receiver info| BS
KVS_P1 -->|send KV| KVR_D1
KVS_P2 -->|send KV| KVR_D2
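Figure 2: Mooncake connection architecture diagram. It shows a star topology with Mooncake as the central hub, coordinating service registration, bootstrap connections, and direct data transfer between Prefill and Decode nodes.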
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit JMY Space when reposting!





