* feat(llama.cpp): Enable decentralized, distributed inference
As https://github.com/mudler/LocalAI/pull/2324 introduced distributed inferencing thanks to
@rgerganov implementation in https://github.com/ggerganov/llama.cpp/pull/6829 in upstream llama.cpp, now
it is possible to distribute the workload to remote llama.cpp gRPC server.
This changeset now uses mudler/edgevpn to establish a secure, distributed network between the nodes using a shared token.
The token is generated automatically when starting the server with the `--p2p` flag, and can be used by starting the workers
with `local-ai worker p2p-llama-cpp-rpc` by passing the token via environment variable (TOKEN) or with args (--token).
As per how mudler/edgevpn works, a network is established between the server and the workers with dht and mdns discovery protocols,
the llama.cpp rpc server is automatically started and exposed to the underlying p2p network so the API server can connect on.
When the HTTP server is started, it will discover the workers in the network and automatically create the port-forwards to the service locally.
Then llama.cpp is configured to use the services.
This feature is behind the "p2p" GO_FLAGS
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* go mod tidy
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: add p2p tag
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* better message
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* auto select cpu variant
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* remove cuda target for now
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* fix metal
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* fix path
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* cuda
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* auto select cuda
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* update test
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* select CUDA backend only if present
Signed-off-by: mudler <mudler@localai.io>
* ci: keep cuda bin in path
Signed-off-by: mudler <mudler@localai.io>
* Makefile: make dist now builds also cuda
Signed-off-by: mudler <mudler@localai.io>
* Keep pushing fallback in case auto-flagset/nvidia fails
There could be other reasons for which the default binary may fail. For example we might have detected an Nvidia GPU,
however the user might not have the drivers/cuda libraries installed in the system, and so it would fail to start.
We keep the fallback of llama.cpp at the end of the llama.cpp backends to try to fallback loading in case things go wrong
Signed-off-by: mudler <mudler@localai.io>
* Do not build cuda on MacOS
Signed-off-by: mudler <mudler@localai.io>
* cleanup
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
* Apply suggestions from code review
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: mudler <mudler@localai.io>
* feat(single-build): generate single binaries for releases
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* drop old targets
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix: more places where we are installing grpc that need a version specified
fix: attempt to fix metal tests
fix: metal/brew is forcing an update, they don't have 1.58 available anymore
Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>
* ci: try to build on macos14
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: fixup artifact name
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(build): adjust number of parallel make jobs
* fix: update make on MacOS from brew to support --output-sync argument
* fix: cache grpc with version as part of key to improve validity of cache hits
* fix: use gmake for tests-apple to use the updated GNU make version
* fix: actually use the new make version for tests-apple
* feat: parallelize tests-extra
* feat: attempt to cache grpc build for docker images
* fix: don't quote GRPC version
* fix: don't cache go modules, we have limited cache space, better used elsewhere
* fix: release with the same version of go that we test with
* fix: don't fail on exporting cache layers
* fix: remove deprecated BUILD_GRPC docker arg from Makefile
* feat: group make output by target when running parallelized builds in CI
* fix: quote GO_TAGS in makefile to fix handling of whitespace in value
* fix: set CPATH to find opencv2 in it's commonly installed location
* fix: add missing go mod dropreplace for go-llama.cpp
* chore: remove opencv symlink from github workflows
* move downloader out
* separate startup functions for preloading configuration files
* docs: add popular model examples
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* shorteners
* Add llava
* Add mistral-openorca
* Better link to build section
* docs: update
* fixup
* Drop code dups
* Minor fixups
* Apply suggestions from code review
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* ci: try to cache gRPC build during tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: do not build all images for tests, just necessary
* ci: cache gRPC also in release pipeline
* fixes
* Update model_preload_test.go
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* ci(apple): install grpc from brew
* ci(apple): use brew deps also on release
* ci(linux): install grpc from package manager
* ci: set concurrency
* Revert "ci(linux): install grpc from package manager"
This reverts commit 004e3e308e.
* wip: llama.cpp c++ gRPC server
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* make it work, attach it to the build process
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* update deps
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: add protobuf dep
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* try fix protobuf on cmake
* cmake: workarounds
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* add packages
* cmake: use fixed version of grpc
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* cmake(grpc): install locally
* install grpc
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* install required deps for grpc on debian bullseye
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* debug
* debug
* Fixups
* no need to install cmake manually
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: fixup macOS
* use brew whenever possible
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* macOS fixups
* debug
* fix container build
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* workaround
* try mac
https://stackoverflow.com/questions/23905661/on-mac-g-clang-fails-to-search-usr-local-include-and-usr-local-lib-by-def
* Disable temp. arm64 docker image builds
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
**Description**
This PR syncs up the `llama` backend to use `gguf`
(https://github.com/go-skynet/go-llama.cpp/pull/180). It also adds
`llama-stable` to the targets so we can still load ggml. It adapts the
current tests to use the `llama-backend` for ggml and uses a `gguf`
model to run tests on the new backend.
In order to consume the new version of go-llama.cpp, it also bump go to
1.21 (images, pipelines, etc)
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>