* feat: group make output by target when running parallelized builds in CI
* fix: quote GO_TAGS in makefile to fix handling of whitespace in value
* fix: set CPATH to find opencv2 in it's commonly installed location
* fix: add missing go mod dropreplace for go-llama.cpp
* chore: remove opencv symlink from github workflows
* Streaming working
* Small fix for regression on CUDA and XPU
* use pip version of optimum[openvino]
* Update backend/python/transformers/transformers_server.py
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Token streaming support
fix optimum[openvino] package in install.sh
* Token Streaming support
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* fixes#1775 and #1774
Add BitsAndBytes Quantization and fixes embedding on CUDA devices
* Manage 4bit and 8 bit quantization
Manage different BitsAndBytes options with the quantization: parameter in yaml
* fix compilation errors on non CUDA environment
* OpenVINO draft
First draft of OpenVINO integration in transformer backend
* first working implementation
* Streaming working
* Small fix for regression on CUDA and XPU
* use pip version of optimum[openvino]
* Update backend/python/transformers/transformers_server.py
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Initial implementation of assistants api
* Move load/save configs to utils
* Save assistant and assistantfiles config to disk.
* Add tsets for assistant api
* Fix models path spelling mistake.
* Remove personal go.mod information
---------
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Enhance autogptq backend to support VL models
* update dependencies for autogptq
* remove redundant auto-gptq dependency
* Convert base64 to image_url for Qwen-VL model
* implemented model inference for qwen-vl
* remove user prompt from generated answer
* fixed write image error
---------
Co-authored-by: Binghua Wu <bingwu@estee.com>
* readme: update quickstart
* aio(gpu): fix dreamshaper
* tests(aio): allow to run tests also against an endpoint
* docs: split content
* tests: less verbosity
---------
Co-authored-by: Dave <dave@gray101.com>
This change makes the assumption that "Microsoft Corporation Device 008e"
is an NVIDIA CUDA device. If this is not the case, please update the
hardware detection script here.
Signed-off-by: Enrico Ros <enrico.ros@gmail.com>
Co-authored-by: Dave <dave@gray101.com>
* docs(aio): Add AIO images docs
* add image generation link to quickstart
* while reviewing I noticed this one link was missing, so quickly adding it.
Signed-off-by: Dave <dave@gray101.com>
Co-authored-by: Dave <dave@gray101.com>
* docs(mac): Improve documentation for mac build
- added documentation to build from current master
- added troubleshooting information
Signed-off-by: Sebastian <tauven@gmail.com>
* docs(max): fix typo
Signed-off-by: Sebastian <tauven@gmail.com>
---------
Signed-off-by: Sebastian <tauven@gmail.com>
* test with gguf instead of ggml. Updates testPrompt to match? Adds debugging line to Dockerfile that I've found helpful recently.
* fix testPrompt slightly
* Sad Experiment: Test GH runner without metal?
* break apart CGO_LDFLAGS
* switch runner
* upstream llama.cpp disables Metal on Github CI!
* missed a dir from clean-tests
* CGO_LDFLAGS
* tmate failure + NO_ACCELERATE
* whisper.cpp has a metal fix
* do the exact opposite of the name of this branch, but keep it around for unrelated fixes?
* add back newlines
* add tmate to linux for testing
* update fixtures
* timeout for tmate