LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2024-06-07 19:40:48 +00:00

Author	SHA1	Message	Date
Ludovic Leroux	12c0d9443e	feat: use tokenizer.apply_chat_template() in vLLM (#1990) Use tokenizer.apply_chat_template() in vLLM Signed-off-by: Ludovic LEROUX <ludovic@inpher.io>	2024-04-11 19:20:22 +02:00
Ludovic Leroux	939411300a	Bump vLLM version + more options when loading models in vLLM (#1782) * Bump vLLM version to 0.3.2 * Add vLLM model loading options * Remove transformers-exllama * Fix install exllama	2024-03-01 22:48:53 +01:00
Ludovic Leroux	0135e1e3b9	fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749) * fix: use vllm AsyncLLMEngine to bring true stream The current vLLM implementation uses the LLMEngine, which was designed for offline batch inference; as a result, streaming mode outputs all blobs at once at the end of the inference. This PR reworks the gRPC server to use asyncio and gRPC.aio, in combination with vLLM's AsyncLLMEngine, to bring true stream mode. This PR also passes more parameters to vLLM during inference (presence_penalty, frequency_penalty, stop, ignore_eos, seed, ...). * Remove unused import	2024-02-24 11:48:45 +01:00
Ettore Di Giacinto	06cd9ef98d	feat(extra-backends): Improvements, adding mamba example (#1618) * feat(extra-backends): Improvements; vllm: add max_tokens, wire up stream event; mamba: fixups, adding examples for mamba-chat * examples(mamba-chat): add * docs: update	2024-01-20 17:56:08 +01:00
Ettore Di Giacinto	ad0e30bca5	refactor: move backends into the backends directory (#1279) * refactor: move backends into the backends directory Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor: move main close to implementation for every backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2023-11-13 22:40:16 +01:00

5 Commits
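
Commit 0135e1e3b9 describes the difference between a batch-oriented engine, where streaming mode emits everything at once at the end, and an asyncio-based one that yields output as it is produced. The sketch below illustrates that difference with plain asyncio only. It does not use vLLM or gRPC, and `fake_generate`, `batch_reply`, `stream_reply`, and `run_demo` are hypothetical names invented for illustration, not part of LocalAI or vLLM.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_generate(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async engine: yields one token at a time."""
    for token in prompt.split():
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def batch_reply(prompt: str) -> List[str]:
    """Batch style: wait for the full result, then emit a single blob."""
    tokens = [t async for t in fake_generate(prompt)]
    return [" ".join(tokens)]


async def stream_reply(prompt: str) -> List[str]:
    """Streaming style: forward each chunk as soon as it is available."""
    chunks: List[str] = []
    async for token in fake_generate(prompt):
        # An async gRPC handler would `yield` a reply message here (assumption).
        chunks.append(token)
    return chunks


def run_demo(prompt: str = "true streaming mode"):
    """Run both styles and return (batch_chunks, stream_chunks)."""
    return asyncio.run(batch_reply(prompt)), asyncio.run(stream_reply(prompt))
```

In the batch variant the caller receives one chunk only after generation finishes, whereas the streaming variant produces one chunk per token, which is the behaviour the commit message calls "true stream mode".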