iohann.s.titov@gmail.com
Ivan S. Titov

vLLM is a fast and easy-to-use library for LLM inference and serving. It provides a Python API and an OpenAI-compatible HTTP server.

The cpu, cuda, and rocm USE flags are mutually exclusive; each selects a single VLLM_TARGET_DEVICE for the build. With none of the three enabled, the build uses VLLM_TARGET_DEVICE=empty: the Python entry points import cleanly, but backend kernels fail at first model load. This is useful if you only want the API surface for development.

cpu: Build for CPU inference (VLLM_TARGET_DEVICE=cpu); pulls in torchaudio and numba.
rocm: Build for AMD ROCm inference (VLLM_TARGET_DEVICE=rocm); pulls in HIP libraries and torch{audio,vision}.
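
The device selection described above can be sketched as a small Python helper. This is an illustration only, not the ebuild's actual logic: the function name `target_device` is hypothetical, and mapping USE=cuda to VLLM_TARGET_DEVICE=cuda is an assumption made by analogy with the cpu and rocm flags.

```python
# Hypothetical helper: maps enabled USE flags to a VLLM_TARGET_DEVICE value.
# Not taken from the ebuild; it only mirrors the rules stated in the description.

DEVICE_FLAGS = ("cpu", "cuda", "rocm")


def target_device(use_flags):
    """Return the VLLM_TARGET_DEVICE value for a collection of enabled USE flags."""
    enabled = [flag for flag in DEVICE_FLAGS if flag in use_flags]
    if len(enabled) > 1:
        # The three device flags are mutually exclusive.
        raise ValueError("conflicting device flags: " + ", ".join(enabled))
    # With no device flag enabled, the build falls back to the "empty" target.
    # Assumption: each flag name equals its VLLM_TARGET_DEVICE value.
    return enabled[0] if enabled else "empty"


if __name__ == "__main__":
    print(target_device({"rocm"}))
    print(target_device(set()))
```

Flags unrelated to device selection are ignored by the helper, so it can be passed a package's full set of enabled flags.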

vllm

vllm-project/vllm

https://github.com/vllm-project/vllm/issues