Apply This Environment Variable Fix to Stop vLLM Memory Leaks
Mistral · Bug Fix · notable
Briefing for: Engineering
What happened
Mistral engineers identified a memory leak in vLLM when running disaggregated serving (prefill/decode splitting) with NIXL and UCX. The leak sits outside the traditional heap, in anonymous memory mappings. UCX's default mmap hooking mechanism intercepts all memory allocations to manage its Registration Cache but, in certain edge cases, fails to release them properly.
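To make "anonymous memory mappings" concrete, here is a minimal Linux-only sketch that totals the mappings with no backing file in a process's /proc/<pid>/maps. The function names are illustrative and not taken from the original investigation, and the total is virtual size, not resident memory.

```python
from pathlib import Path


def anonymous_mapping_bytes(maps_text: str) -> int:
    """Sum the sizes of mappings that have no file or pseudo-path name."""
    total = 0
    for line in maps_text.splitlines():
        # Fields: address-range, perms, offset, dev, inode, [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # skip empty or malformed lines
        has_name = len(fields) == 6 and fields[5].strip() != ""
        if has_name:
            continue  # file-backed, or named regions such as [heap] and [stack]
        start, end = (int(part, 16) for part in fields[0].split("-"))
        total += end - start
    return total


def anonymous_mapping_bytes_for_pid(pid: int) -> int:
    """Read /proc/<pid>/maps (Linux only) and total its anonymous mappings."""
    return anonymous_mapping_bytes(Path(f"/proc/{pid}/maps").read_text())
```

A total that keeps climbing while the Python heap stays flat would be consistent with the kind of leak described here, though it does not by itself identify UCX as the source.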
Why it matters
This bug causes RSS to grow linearly at ~400 MB per minute, which inevitably ends in OOM crashes in production. Standard Python memory profilers such as Memray or Guppy do not see the leak because it happens at the system-call level, through raw syscall wrappers that bypass glibc.
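Because Python-level profilers miss this growth, one way to confirm it is to sample the process's RSS from the operating system and fit a slope. The sketch below assumes Linux and reads VmRSS from /proc/<pid>/status. The helper names are hypothetical, and the rate is reported in MiB per minute.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def growth_mib_per_minute(samples):
    """Least-squares slope of (seconds, rss_kib) samples, in MiB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return (num / den) * 60 / 1024  # KiB/s -> MiB/min


def sample_rss(pid: int, count: int, interval_s: float):
    """Collect `count` RSS samples for `pid`, `interval_s` seconds apart."""
    samples = []
    start = time.monotonic()
    for _ in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.monotonic() - start, parse_vmrss_kib(text)))
        time.sleep(interval_s)
    return samples
```

For example, `growth_mib_per_minute(sample_rss(pid, 10, 30.0))` gives a rough rate over five minutes. A steady positive slope under constant load is the signature to look for.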
What this enables
- If you run disaggregated serving on InfiniBand-enabled clusters, setting UCX_MEM_MMAP_HOOK_MODE=none will stop the leak without degrading inference performance.
- If you need to trace similar 'invisible' leaks, use the provided bpftrace and GDB scripts to monitor raw mmap/munmap system calls.
- If you are using RDMA, setting UCX_RCACHE_MAX_UNRELEASED=1024 serves as a safety threshold for memory invalidation queues.
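The two settings above can be applied by putting them in the environment the serving process starts with. The following is a minimal sketch, assuming the server is launched from Python. The `launch` helper and whatever command you pass to it are placeholders, not part of vLLM or UCX.

```python
import os
import subprocess

# Values taken from the briefing above; tune the threshold for your cluster.
UCX_LEAK_WORKAROUND = {
    "UCX_MEM_MMAP_HOOK_MODE": "none",
    "UCX_RCACHE_MAX_UNRELEASED": "1024",
}


def build_env(base_env=None):
    """Return a copy of the environment with the UCX workaround applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(UCX_LEAK_WORKAROUND)
    return env


def launch(command):
    """Start `command` (a list of arguments) with the workaround in place.

    `command` is whatever you already use to start your serving process.
    """
    return subprocess.Popen(command, env=build_env())
```

Setting the variables in the child's environment before launch, instead of from inside an already running process, avoids depending on when UCX reads its configuration.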