NVIDIA's VLLM PD-disaggregation implementation, open-sourced as source code (no binary-only release)

Main problems it addresses

  • KV cache offloading: when GPU memory runs out, spill to host memory (extending the KV cache with host memory/SSD is a long-standing technique; see the sketch after this list)
  • Accelerated data transfer: transfer optimization via [NIXL](https://github.com/ai-dynamo/nixl)
  • Prefill/decode (PD) disaggregation
  • Dynamic request scheduling
  • Dynamic GPU scheduling
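
To make the offloading bullet concrete, here is a minimal sketch of the idea, assuming a torch-style pool of KV blocks; it is purely illustrative and is not Dynamo's actual offloading code (KVBlockPool and its methods are invented names):

# Illustrative only: spill KV-cache blocks from GPU memory to pinned host
# memory when GPU memory runs low, and bring them back on reuse.
# Not Dynamo's implementation; all names here are hypothetical.
import torch

class KVBlockPool:
    def __init__(self):
        self.on_gpu = {}   # block_id -> KV tensor resident on the GPU
        self.on_host = {}  # block_id -> pinned host copy of an evicted block

    def offload(self, block_id):
        # Copy the block to pinned host memory and drop the GPU tensor.
        gpu_block = self.on_gpu.pop(block_id)
        host_block = torch.empty(gpu_block.shape, dtype=gpu_block.dtype,
                                 device="cpu", pin_memory=True)
        host_block.copy_(gpu_block, non_blocking=True)
        self.on_host[block_id] = host_block

    def fetch(self, block_id):
        # Bring an offloaded block back to the GPU when a request reuses it.
        if block_id in self.on_gpu:
            return self.on_gpu[block_id]
        gpu_block = self.on_host.pop(block_id).to("cuda", non_blocking=True)
        self.on_gpu[block_id] = gpu_block
        return gpu_block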

Runtime architecture

You first need the etcd / Prometheus / Grafana trio; there is a ready-made compose file under the deploy directory.

Once those three are up, you can run the server.

See examples/llm for reference.

There are several subdirectories under it:

components
configs
graphs
utils

graphs contains the serving graphs, similar to those in Triton Server: each one defines the steps a request goes through. For example, disagg_router.py is the disaggregated (PD-separated) deployment plus KV routing:

from components.frontend import Frontend
from components.kv_router import Router
from components.prefill_worker import PrefillWorker
from components.processor import Processor
from components.worker import VllmWorker

Frontend.link(Processor).link(Router).link(VllmWorker).link(PrefillWorker)

Processor, Router, VllmWorker, and PrefillWorker are all defined in components, each with its initialization parameters, and each implements the generate(Request) interface.

These classes are initialized from the YAML files in configs, the graph wires them together, and with that a server is ready.
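
As a rough sketch of that contract, assuming nothing beyond what is described above (the real components in examples/llm/components are built on the Dynamo SDK, so the actual wiring differs; Request and EchoWorker here are invented):

# Illustrative component shape only, not the actual Dynamo SDK code.
from dataclasses import dataclass
from typing import AsyncIterator

@dataclass
class Request:                      # hypothetical request type
    prompt: str
    max_tokens: int = 128

class EchoWorker:
    """Stand-in for a component such as VllmWorker: constructed from its
    section of the YAML config, exposing generate(Request) as a stream."""

    def __init__(self, model: str, max_batch_size: int = 8):
        # In the real example these kwargs come from configs/*.yaml.
        self.model = model
        self.max_batch_size = max_batch_size

    async def generate(self, request: Request) -> AsyncIterator[str]:
        # A real worker would call the vLLM engine here; we just echo tokens.
        for token in request.prompt.split()[: request.max_tokens]:
            yield token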

cd /workspace/examples/llm
dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml

What each component does:

                                          +----------------+
                                   +------| prefill worker |-----------+
                            notify |      |                |           |
                          finished |      +----------------+           | pull
                                   v                                   v
+------+      +-----------+      +------------------+    push     +---------------+
| HTTP |----->| processor |----->| decode/monolith  |------------>| prefill queue |
|      |<-----|           |<-----|      worker      |             |               |
+------+      +-----------+      +------------------+             +---------------+
                 |    ^                   |
      query best |    | return            | publish kv events
          worker |    | worker_id         v
                 |    |          +------------------+
                 |    +----------|     kv-router    |
                 +-------------->|                  |
                                 +------------------+
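
To make the kv-router's "query best worker" step concrete, here is a toy version of prefix-affinity routing, assuming workers publish the hashes of the prompt blocks they have cached (the "publish kv events" arrow above). This is illustrative only, not the actual kv-router; block_hashes and pick_worker are invented names:

# Toy prefix-affinity routing sketch, not Dynamo's kv-router implementation.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def block_hashes(token_ids):
    # Hash each full block of the prompt, chaining the hashes so a block hash
    # also identifies the whole prefix in front of it.
    hashes, running = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        running.update(str(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.hexdigest())
    return hashes

def pick_worker(token_ids, worker_cache, worker_load):
    # worker_cache: worker_id -> set of cached block hashes (from kv events)
    # worker_load:  worker_id -> number of in-flight requests
    prompt_blocks = block_hashes(token_ids)

    def score(worker_id):
        hits = 0
        for h in prompt_blocks:          # count only the matched prefix
            if h not in worker_cache[worker_id]:
                break
            hits += 1
        return (hits, -worker_load[worker_id])  # prefer cache hits, then low load

    return max(worker_cache, key=score)

In terms of the diagram, worker_cache would be kept up to date by the "publish kv events" arrow, and the value returned by pick_worker corresponds to the "return worker_id" arrow back to the processor.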