Latest in AI

Heaps do lie: debugging a memory leak in vLLM
Mistral AI News · 50 days ago · Tutorial
Mistral AI published an engineering deep dive on a memory leak found during vLLM disaggregated serving tests. The leak appeared only with a specific stack involving Mistral Medium 3.1, NIXL, UCX, graph compilation, and P/D disaggregation, with RSS growing steadily despite heap profilers looking normal. The team used pmap, BPFtrace, and targeted GDB automation to trace the issue to UCX mmap hooks and applied configuration fixes plus a vLLM patch.
Unlocking Asynchronous Mechanisms in Continuous Batching ★ 75
Hugging Face Blog · 75 days ago · Release
As the demand for deploying large language models (LLMs) in production environments surges, how to improve inference efficiency and reduce costs has become a…
The Evolution of vLLM from V0 to V1: Putting "Correctness Before Corrections" into Practice in Reinforcement Learning (RL) ★ 75
Hugging Face Blog · 82 days ago · Opinion
This blog post published by the ServiceNow AI team delves into the major transition of the open-source large language model inference engine vLLM from V0 to…
Understanding Continuous Batching from First Principles ★ 80
Hugging Face Blog · 245 days ago · Tutorial
This technical blog post from Hugging Face takes a "First Principles" approach to provide a deep analysis of one of the most critical optimization techniques…
SGLang Integrates the Hugging Face Transformers Backend: Greatly Improved Model Compatibility and Development Flexibility ★ 75
Hugging Face Blog · 400 days ago · Release
SGLang (Structured Generation Language) is a high-performance LLM inference and serving framework developed by the LMSYS team, renowned for its efficient…
How Do Long Prompts Block Other Requests? Key Strategies for Optimizing LLM Inference Performance and Resolving Head-of-Line Blocking ★ 80
Hugging Face Blog · 411 days ago · Tutorial
As the context windows of large language models (LLMs) continue to expand — from the early 4k and 8k, to the now-common 32k and even 128k or more — users have…
Prefill and Decode Under Concurrent Requests: Key Techniques for Optimizing LLM Inference Performance ★ 82
Hugging Face Blog · 468 days ago · Tutorial
When deploying large language models (LLMs), maintaining low latency and high throughput under high concurrency (concurrent requests) is one of the greatest…
Accelerating Large Language Model (LLM) Inference with TGI on Intel Gaudi ★ 75
Hugging Face Blog · 487 days ago · Release
Hugging Face's official blog has announced that its widely adopted open-source large model inference framework, Text Generation Inference (TGI), now officially…