LocalAI/backend/python/vllm
Ludovic Leroux 0135e1e3b9
fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749)
* fix: use vLLM's AsyncLLMEngine to enable true streaming

The current vLLM implementation uses LLMEngine, which was designed for offline batch inference. As a result, streaming mode outputs all blobs at once at the end of inference.

This PR reworks the gRPC server to use asyncio and grpc.aio, in combination with vLLM's AsyncLLMEngine, to enable true streaming mode.

This PR also passes more parameters to vLLM during inference (presence_penalty, frequency_penalty, stop, ignore_eos, seed, ...).

* Remove unused import
2024-02-24 11:48:45 +01:00
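
The streaming behaviour described in the commit message can be illustrated with a minimal asyncio sketch. It is not the code in backend_vllm.py: `FakeAsyncEngine`, `FakeSamplingParams`, and `predict_stream` are hypothetical stand-ins so the example runs without vLLM or gRPC. It also assumes the engine reports the cumulative text generated so far on each update, so the handler forwards only the new part.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, List, Optional


@dataclass
class FakeSamplingParams:
    """Hypothetical container for the options named in the commit message."""
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
    stop: List[str] = field(default_factory=list)
    ignore_eos: bool = False
    seed: Optional[int] = None


class FakeAsyncEngine:
    """Hypothetical stand-in for an async inference engine (not vLLM's API)."""

    def __init__(self, tokens: List[str]) -> None:
        self._tokens = tokens

    async def generate(self, prompt: str, params: FakeSamplingParams) -> AsyncIterator[str]:
        text = ""
        for token in self._tokens:
            await asyncio.sleep(0)  # hand control back to the event loop between steps
            if token in params.stop:
                break  # stop sequence reached: end generation
            text += token
            yield text  # assumption: each update carries the cumulative text


async def predict_stream(engine, prompt: str, params: FakeSamplingParams) -> AsyncIterator[str]:
    """Forward each update as soon as it arrives, sending only the new text."""
    sent = 0
    async for text in engine.generate(prompt, params):
        delta = text[sent:]
        sent = len(text)
        yield delta


async def collect(engine, prompt: str, params: FakeSamplingParams) -> List[str]:
    # A real streaming handler would write each chunk to the client here
    # instead of gathering them into a list.
    return [chunk async for chunk in predict_stream(engine, prompt, params)]


chunks = asyncio.run(
    collect(
        FakeAsyncEngine(["Hello", ",", " world", "<END>", "ignored"]),
        "hi",
        FakeSamplingParams(stop=["<END>"], seed=42),
    )
)
print(chunks)
```

Because `predict_stream` is an async generator, each chunk is available to the caller as soon as the engine produces it, rather than after the whole completion has finished. A batch-oriented engine can only hand back the complete result at the end, which is the behaviour the commit message describes for the previous implementation.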
backend_pb2_grpc.py refactor: move backends into the backends directory (#1279) 2023-11-13 22:40:16 +01:00
backend_pb2.py transformers: correctly load automodels (#1643) 2024-01-26 00:13:21 +01:00
backend_vllm.py fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749) 2024-02-24 11:48:45 +01:00
Makefile deps(conda): use transformers-env with vllm,exllama(2) (#1554) 2024-01-06 13:32:28 +01:00
README.md refactor: move backends into the backends directory (#1279) 2023-11-13 22:40:16 +01:00
run.sh deps(conda): use transformers-env with vllm,exllama(2) (#1554) 2024-01-06 13:32:28 +01:00
test_backend_vllm.py feat(conda): share envs with transformer-based backends (#1465) 2023-12-21 08:35:15 +01:00
test.sh deps(conda): use transformers-env with vllm,exllama(2) (#1554) 2024-01-06 13:32:28 +01:00

Creating a separate environment for the vllm project

make vllm