+++
disableToc = false
title = "Getting started"
weight = 1
url = '/basics/getting_started/'
+++
LocalAI is available as a container image and as a binary. It can be used with Docker, Podman, Kubernetes and any other container engine. You can check out all the available images with corresponding tags here.
See also our [How to]({{%relref "howtos" %}}) section for end-to-end guided examples curated by the community.
How to get started
The easiest way to run LocalAI is by using docker compose or with Docker (to build locally, see the [build section]({{%relref "build" %}})).
{{% notice note %}} To run with GPU acceleration, see [GPU acceleration]({{%relref "features/gpu-acceleration" %}}). {{% /notice %}}
{{< tabs >}} {{% tab name="Docker" %}}
# Prepare the `models` directory
mkdir models
# copy your models to it
cp your-model.gguf models/
# run the LocalAI container
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# You should see:
#
# ┌───────────────────────────────────────────────────┐
# │ Fiber v2.42.0 │
# │ http://127.0.0.1:8080 │
# │ (bound on host 0.0.0.0 and port 8080) │
# │ │
# │ Handlers ............. 1 Processes ........... 1 │
# │ Prefork ....... Disabled PID ................. 1 │
# └───────────────────────────────────────────────────┘
# Try the endpoint with curl
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
{{% notice note %}}
- If running on Apple Silicon (ARM), running in Docker is not recommended because of emulation. Follow the [build instructions]({{%relref "build" %}}) to use Metal acceleration for full GPU support.
- If you are running on Apple x86_64, you can use `docker`; there is no additional gain in building it from source. {{% /notice %}}
{{% /tab %}} {{% tab name="Docker compose" %}}
# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>
# copy your models to models/
cp your-model.gguf models/
# (optional) Edit the .env file to set things like context size and threads
# vim .env
# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Note: If you are on Windows, please run `docker-compose`, not `docker compose`, and make sure the project is in the Linux filesystem, otherwise loading models might be slow. For more info, see the Microsoft Docs.
{{% /tab %}}
{{% tab name="Kubernetes" %}}
For installing LocalAI in Kubernetes, you can use the following helm chart:
# Install the helm repository
helm repo add go-skynet https://go-skynet.github.io/helm-charts/
# Update the repositories
helm repo update
# Get the values
helm show values go-skynet/local-ai > values.yaml
# Edit the values if needed
# vim values.yaml ...
# Install the helm chart
helm install local-ai go-skynet/local-ai -f values.yaml
{{% /tab %}}
{{< /tabs >}}
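The curl calls in the tabs above can also be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names `build_completion_request` and `complete` are hypothetical, and the sketch assumes LocalAI is listening on `localhost:8080` and serving a model file named `your-model.gguf`, as in the examples above.

```python
import json
import urllib.request

# Assumption: the default address used in the examples above.
BASE_URL = "http://localhost:8080/v1"


def build_completion_request(model, prompt, temperature=0.7, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the decoded JSON response.

    This needs a running LocalAI instance, so it is not called here.
    """
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (hypothetical usage, requires the container to be up):
# complete("your-model.gguf", "A long time ago in a galaxy far, far away")
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without a server running.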
Example: Use luna-ai-llama2 model with docker
mkdir models
# Download luna-ai-llama2 to models/
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2
# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "luna-ai-llama2",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
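The JSON responses shown in the comments above can be consumed with a few lines of code. The sketch below parses the sample `/v1/models` and `/v1/chat/completions` payloads from this example. The function names are hypothetical, and real responses may carry additional fields.

```python
import json


def model_ids(models_response):
    """Return the model ids from a /v1/models response body."""
    return [entry["id"] for entry in models_response["data"]]


def first_reply(chat_response):
    """Return the assistant text of the first choice in a chat completion response."""
    return chat_response["choices"][0]["message"]["content"]


# Sample payloads copied from the example output above.
models_body = json.loads(
    '{"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}'
)
chat_body = json.loads(
    '{"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant",'
    '"content":"I\'m doing well, thanks. How about you?"}}]}'
)

print(model_ids(models_body))  # ['luna-ai-llama2']
print(first_reply(chat_body))  # I'm doing well, thanks. How about you?
```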
To see other model configurations, see also the example section here.
From binaries
LocalAI binary releases are available on GitHub.
You can control LocalAI with command-line arguments, for example to specify a bind address or the number of threads.
CLI parameters
Parameter | Environment Variable | Default Value | Description |
---|---|---|---|
--f16 | $F16 | false | Enable f16 mode |
--debug | $DEBUG | false | Enable debug mode |
--cors | $CORS | false | Enable CORS support |
--cors-allow-origins value | $CORS_ALLOW_ORIGINS | | Specify origins allowed for CORS |
--threads value | $THREADS | 4 | Number of threads to use for parallel computation |
--models-path value | $MODELS_PATH | ./models | Path to the directory containing models used for inferencing |
--preload-models value | $PRELOAD_MODELS | | List of models to preload in JSON format at startup |
--preload-models-config value | $PRELOAD_MODELS_CONFIG | | A config with a list of models to apply at startup. Specify the path to a YAML config file |
--config-file value | $CONFIG_FILE | | Path to the config file |
--address value | $ADDRESS | :8080 | Specify the bind address for the API server |
--image-path value | $IMAGE_PATH | | Path to the directory used to store generated images |
--context-size value | $CONTEXT_SIZE | 512 | Default context size of the model |
--upload-limit value | $UPLOAD_LIMIT | 15 | Default upload limit in megabytes (audio file upload) |
--galleries | $GALLERIES | | Allows setting galleries from the command line |
--parallel-requests | $PARALLEL_REQUESTS | false | Enable backends to handle multiple requests in parallel. This is for backends that support multiple requests in parallel, like llama.cpp or vllm |
--single-active-backend | $SINGLE_ACTIVE_BACKEND | false | Allow only one backend to be running |
--api-keys value | $API_KEY | empty | List of API keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys. |
--enable-watchdog-idle | $WATCHDOG_IDLE | false | Enable watchdog for stopping idle backends. This will stop backends that have been idle for too long. |
--enable-watchdog-busy | $WATCHDOG_BUSY | false | Enable watchdog for stopping busy backends that exceed a defined threshold. |
--watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Watchdog timeout. This will restart the backend if it crashes. |
--watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Watchdog idle timeout. This will restart the backend if it crashes. |
--preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the API is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. |
--external-grpc-backends | $EXTERNAL_GRPC_BACKENDS | none | Comma-separated list of external gRPC backends to use. Format: name:host:port or name:/path/to/file |
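
Each flag in the table has an environment-variable counterpart, so the same settings can be supplied either way. As an illustration, the sketch below renders a small settings dictionary both as command-line arguments and as environment variables. The helper names are hypothetical, only a handful of flags from the table are included, and the handling of boolean switches (a bare flag on the command line, `true`/`false` in the environment) is an assumption.

```python
# Subset of the table above: flag name -> environment variable name.
FLAG_TO_ENV = {
    "threads": "THREADS",
    "models-path": "MODELS_PATH",
    "context-size": "CONTEXT_SIZE",
    "address": "ADDRESS",
    "debug": "DEBUG",
}


def to_cli_args(settings):
    """Render settings as CLI arguments, e.g. {'threads': 4} -> ['--threads', '4']."""
    args = []
    for flag, value in settings.items():
        if flag not in FLAG_TO_ENV:
            raise KeyError(f"unknown flag: {flag}")
        if isinstance(value, bool):
            # Assumption: boolean switches are passed as a bare flag when true.
            if value:
                args.append(f"--{flag}")
        else:
            args.extend([f"--{flag}", str(value)])
    return args


def to_env(settings):
    """Render the same settings as environment variables."""
    env = {}
    for flag, value in settings.items():
        if isinstance(value, bool):
            # Assumption: booleans are spelled true/false in the environment.
            env[FLAG_TO_ENV[flag]] = "true" if value else "false"
        else:
            env[FLAG_TO_ENV[flag]] = str(value)
    return env


settings = {"models-path": "/models", "context-size": 700, "threads": 4, "debug": True}
print(to_cli_args(settings))
print(to_env(settings))
```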
Container images
LocalAI has a set of images to support CUDA, ffmpeg and 'vanilla' (CPU-only). The image list is on quay:
{{< tabs >}} {{% tab name="Vanilla / CPU Images" %}}
master
latest
v2.0.0
v2.0.0-ffmpeg
v2.0.0-ffmpeg-core
Core Images - Smaller images without pre-downloaded Python dependencies {{% /tab %}}
{{% tab name="GPU Images CUDA 11" %}}
master-cublas-cuda11
master-cublas-cuda11-core
v2.0.0-cublas-cuda11
v2.0.0-cublas-cuda11-core
v2.0.0-cublas-cuda11-ffmpeg
v2.0.0-cublas-cuda11-ffmpeg-core
Core Images - Smaller images without pre-downloaded Python dependencies {{% /tab %}}
{{% tab name="GPU Images CUDA 12" %}}
master-cublas-cuda12
master-cublas-cuda12-core
v2.0.0-cublas-cuda12
v2.0.0-cublas-cuda12-core
v2.0.0-cublas-cuda12-ffmpeg
v2.0.0-cublas-cuda12-ffmpeg-core
Core Images - Smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{< /tabs >}}
Example:
- Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:latest`
- FFmpeg: `quay.io/go-skynet/local-ai:v2.0.0-ffmpeg`
- CUDA 11+FFmpeg: `quay.io/go-skynet/local-ai:v2.0.0-cublas-cuda11-ffmpeg`
- CUDA 12+FFmpeg: `quay.io/go-skynet/local-ai:v2.0.0-cublas-cuda12-ffmpeg`
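
The tags listed in this section appear to follow a regular pattern: a version, then optional `-cublas-cuda11` or `-cublas-cuda12`, `-ffmpeg`, and `-core` suffixes, in that order. The sketch below composes an image reference under that assumption. The function name `image_ref` is hypothetical, the pattern is inferred from the tag lists here and not from a published specification, and not every combination is necessarily available on quay.

```python
REGISTRY_IMAGE = "quay.io/go-skynet/local-ai"


def image_ref(version="latest", cuda=None, ffmpeg=False, core=False):
    """Compose an image reference from a version and optional feature suffixes."""
    if cuda not in (None, 11, 12):
        raise ValueError("cuda must be None, 11 or 12")
    tag = version
    if cuda is not None:
        tag += f"-cublas-cuda{cuda}"
    if ffmpeg:
        tag += "-ffmpeg"
    if core:
        tag += "-core"
    return f"{REGISTRY_IMAGE}:{tag}"


print(image_ref())
print(image_ref("v2.0.0", ffmpeg=True))
print(image_ref("v2.0.0", cuda=12, ffmpeg=True))
print(image_ref("master", cuda=11, core=True))
```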
{{% notice note %}}
Note: the binary inside the image is pre-compiled, and might not suit all CPUs.
To enable CPU optimizations for the execution environment,
the default behavior is to rebuild when starting the container.
To disable this auto-rebuild behavior,
set the environment variable `REBUILD` to `false`.
See [docs on all environment variables]({{%relref "advanced#environment-variables" %}}) for more info. {{% /notice %}}
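
As an illustration of the note above, the sketch below assembles a `docker run` command line that passes `REBUILD=false` through Docker's `-e` option. The helper name `docker_run_command` and the `/srv/models` host path are hypothetical. The remaining arguments mirror the quickstart command earlier on this page.

```python
import shlex


def docker_run_command(image, models_dir, rebuild=False, port=8080):
    """Return the argv for a `docker run` invocation similar to the quickstart one."""
    return [
        "docker", "run",
        "-p", f"{port}:8080",
        "-v", f"{models_dir}:/models",
        # REBUILD=false skips the rebuild at container start described in the note.
        "-e", f"REBUILD={'true' if rebuild else 'false'}",
        "-ti", "--rm",
        image,
        "--models-path", "/models",
    ]


cmd = docker_run_command("quay.io/go-skynet/local-ai:latest", "/srv/models")
# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(cmd))
```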
Run LocalAI in Kubernetes
LocalAI can be installed inside Kubernetes with helm.
Requirements:
- SSD storage class, or disable `mmap` to load the whole model in memory
- Add the helm repo
helm repo add go-skynet https://go-skynet.github.io/helm-charts/
- Install the helm chart:
helm repo update
helm install local-ai go-skynet/local-ai -f values.yaml
Note: For further configuration options, see the helm chart repository on GitHub.
Example values
Deploy a single LocalAI pod with 6GB of persistent storage serving up a `ggml-gpt4all-j` model with a custom prompt.
### values.yaml
replicaCount: 1

deployment:
  # This image is CPU only. To use a GPU, change it to an image that supports GPU, e.g. "v2.0.0-cublas-cuda12-core"
  image: quay.io/go-skynet/local-ai:latest
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
    - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}
Build from source
See the [build section]({{%relref "build" %}}).
Other examples
To see other examples of how to integrate with other projects, for instance for question answering or for use with chatbot-ui, see: examples.
Clients
OpenAI clients are already compatible with LocalAI: override the basePath, or the target URL, so that it points at your LocalAI instance.
Javascript
https://github.com/openai/openai-node/
import { Configuration, OpenAIApi } from 'openai';
const configuration = new Configuration({
  basePath: `http://localhost:8080/v1`
});
const openai = new OpenAIApi(configuration);
Python
https://github.com/openai/openai-python
Set the `OPENAI_API_BASE` environment variable, or set the base URL in code:
import openai
openai.api_base = "http://localhost:8080/v1"
# create a chat completion
chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])
# print the completion
print(chat_completion.choices[0].message.content)