Winograd convolutions were always disabled giving error when inference device was CPU.
This commit implement logic to disable Winograd convolutions only if CPU or NPU are declared.
* Bump oneapi-basekit, optimum and openvino
* Changed PERFORMANCE HINT to CUMULATIVE_THROUGHPUT
Minor latency change for first token but about 10-15% speedup on token generation.
* fix regression #1971
fixes regression #1971 introduced by intel_extension_for_transformers==1.4
* UseTokenizerTemplate and StopPrompt
Implementation of use_tokenizer_template and stopwords options
* Streaming working
* Small fix for regression on CUDA and XPU
* use pip version of optimum[openvino]
* Update backend/python/transformers/transformers_server.py
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Token streaming support
fix optimum[openvino] package in install.sh
* Token Streaming support
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* fixes#1775 and #1774
Add BitsAndBytes Quantization and fixes embedding on CUDA devices
* Manage 4bit and 8 bit quantization
Manage different BitsAndBytes options with the quantization: parameter in yaml
* fix compilation errors on non CUDA environment
* OpenVINO draft
First draft of OpenVINO integration in transformer backend
* first working implementation
* Streaming working
* Small fix for regression on CUDA and XPU
* use pip version of optimum[openvino]
* Update backend/python/transformers/transformers_server.py
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* fixes#1775 and #1774
Add BitsAndBytes Quantization and fixes embedding on CUDA devices
* Manage 4bit and 8 bit quantization
Manage different BitsAndBytes options with the quantization: parameter in yaml
* fix compilation errors on non CUDA environment
* feat(intel): add diffusers support
* try to consume upstream container image
* Debug
* Manually install deps
* Map transformers/hf cache dir to modelpath if not specified
* fix(compel): update initialization, pass by all gRPC options
* fix: add dependencies, implement transformers for xpu
* base it from the oneapi image
* Add pillow
* set threads if specified when launching the API
* Skip conda install if intel
* defaults to non-intel
* ci: add to pipelines
* prepare compel only if enabled
* Skip conda install if intel
* fix cleanup
* Disable compel by default
* Install torch 2.1.0 with Intel
* Skip conda on some setups
* Detect python
* Quiet output
* Do not override system python with conda
* Prefer python3
* Fixups
* exllama2: do not install without conda (overrides pytorch version)
* exllama/exllama2: do not install if not using cuda
* Add missing dataset dependency
* Small fixups, symlink to python, add requirements
* Add neural_speed to the deps
* correctly handle model offloading
* fix: device_map == xpu
* go back at calling python, fixed at dockerfile level
* Exllama2 restricted to only nvidia gpus
* Tokenizer to xpu
* feat(transformers): support also text generation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* embedded: set seed -1
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use cuda in transformers if available
tensorflow probably needs a different check.
Signed-off-by: Erich Schubert <kno10@users.noreply.github.com>
* feat: expose CUDA at top level
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* tests: add to tests and create workflow for py extra backends
* doc: update note on how to use core images
---------
Signed-off-by: Erich Schubert <kno10@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Erich Schubert <kno10@users.noreply.github.com>