LLM Inference Engine Performance Comparison: Throughput Tests of VLLM, SGLang, and LMDeploy

Nov 5, 2024 #LLM #vllm #sglang #lmdeploy

AI-generated summary

The document compares the throughput of three large model inference engines, VLLM, SGLang, and LMDeploy, measured in output tokens per second in a short-input, long-output scenario. The results are presented in a table covering concurrency levels of 1, 2, 4, 8, 16, 32, 64, and 128. LMDeploy outperforms the other two engines at every concurrency level, with the highest throughput of 1123.07 tokens/s recorded by LMDeploy 0.6.2 at 64 concurrent requests. The tests used the Qwen2.5-14B-Instruct-AWQ model on an E5 2680v4 CPU with a single 2080ti 22G GPU.

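A quick comparison of the throughput of three LLM inference engines, measured in output tokens/s in a short-input, long-output scenario. The remaining parameters are listed after the table.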

VLLM | SGLang | LMDeploy

Concurrency	VLLM 0.6.1.post2	VLLM 0.6.3.post1	LMDeploy 0.6.0a0	LMDeploy 0.6.2	SGLang 0.3.4.post2	SGLang 0.3.4.post2 (--disable-cuda-graph)
1	28.73	28.76	56.19	57.24	37.23	29.96
2	71.53	73.26	113.12	113.48	73.59	58.28
4	133.38	136.05	205.51	199.01	136.73	111.24
8	246.14	251.59	398.73	393.48	258.21	215.53
16	394.25	401.67	704.69	709.27	461.89	444.48
32	480.26	481.75	967.34	973.24	562.36	557.93
64	520.11	526.01	1119.22	1123.07	594.03	602.36
128	479.02	481.63	989.14	890.44	534.69	582.97
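
To make the table easier to read, the following minimal sketch computes each engine's peak throughput and the per-concurrency ratio between two engines. The numbers are copied from the table above (latest version of each engine only). The helper names `peak` and `speedup` are illustrative and not part of the original test setup.

```python
# Throughput in output tokens/s, copied from the table above
# (latest tested version of each engine, CUDA graph enabled for SGLang).
CONCURRENCY = [1, 2, 4, 8, 16, 32, 64, 128]
THROUGHPUT = {
    "VLLM 0.6.3.post1": [28.76, 73.26, 136.05, 251.59, 401.67, 481.75, 526.01, 481.63],
    "LMDeploy 0.6.2": [57.24, 113.48, 199.01, 393.48, 709.27, 973.24, 1123.07, 890.44],
    "SGLang 0.3.4.post2": [37.23, 73.59, 136.73, 258.21, 461.89, 562.36, 594.03, 534.69],
}


def peak(engine):
    """Return (concurrency, tokens/s) at which the engine's throughput is highest."""
    values = THROUGHPUT[engine]
    best = max(range(len(values)), key=values.__getitem__)
    return CONCURRENCY[best], values[best]


def speedup(engine, baseline):
    """Return the per-concurrency throughput ratio of `engine` over `baseline`."""
    return [round(a / b, 2) for a, b in zip(THROUGHPUT[engine], THROUGHPUT[baseline])]


for name in THROUGHPUT:
    print(name, "peaks at", peak(name))
print("LMDeploy / VLLM:", speedup("LMDeploy 0.6.2", "VLLM 0.6.3.post1"))
```

In this data, all three engines peak at a concurrency of 64 and lose throughput at 128, and LMDeploy stays roughly 1.5x to 2.1x ahead of VLLM across the whole range.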

Test model: Qwen2.5-14B-Instruct-AWQ
Hardware: E5 2680v4 + 2080ti 22G * 1
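
The post does not include the benchmark script, so the following is only a sketch of how output tokens/s at a given concurrency could be measured. It assumes a pluggable `generate` coroutine that returns the number of output tokens; `fake_generate` is a hypothetical stand-in that sleeps instead of calling a real engine, and `measure_throughput` is an illustrative name, not the author's tool.

```python
import asyncio
import time


async def fake_generate(prompt, max_tokens):
    """Hypothetical stand-in for an engine call; returns the output token count."""
    await asyncio.sleep(0.01)  # pretend each request takes 10 ms
    return max_tokens


async def measure_throughput(generate, concurrency, prompt="Hello", max_tokens=64):
    """Send `concurrency` requests at once and return output tokens per second."""
    start = time.perf_counter()
    counts = await asyncio.gather(
        *(generate(prompt, max_tokens) for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


def run(concurrency_levels=(1, 2, 4, 8)):
    """Measure throughput once per concurrency level and collect the results."""
    return {
        c: asyncio.run(measure_throughput(fake_generate, c))
        for c in concurrency_levels
    }


results = run()
for c, tps in results.items():
    print(f"concurrency={c}: {tps:.1f} tokens/s")
```

To benchmark a real engine, `fake_generate` would be replaced by a coroutine that sends the request to the running server and counts the tokens in the response. A short prompt with a large `max_tokens` corresponds to the short-input, long-output scenario.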


Do not repost without authorization.