LocalAI/docs/content/model-compatibility/llama-cpp.md


+++
disableToc = false
title = "🦙 llama.cpp"
weight = 1
+++

[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.

{{% notice note %}}

The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, specify, use the `llama-ggml` backend instead. If you are relying in automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend. The go backend is deprecated as well but still available as `go-llama`. The go backend supports still features not available in the mainline: speculative sampling and embeddings.

{{% /notice %}}

## Features

The `llama.cpp` model supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})

## Setup

LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model. 

### Manual setup

It is sufficient to copy the `ggml` or `guf` model files in the `models` folder. You can refer to the model in the `model` parameter in the API calls.

[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt. 

### Automatic setup

LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for `ggml` or `gguf` models.

For instance, if you have the galleries enabled, you can just start chatting with models in huggingface by running:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'
```

LocalAI will automatically download and configure the model in the `model` directory.

Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "models" %}}).

### YAML configuration

To use the `llama.cpp` backend, specify `llama` as the backend in the YAML file:

```yaml
name: llama
backend: llama
parameters:
  # Relative to the models path
  model: file.gguf.bin
```

In the example above we specify `llama` as the backend to restrict loading `gguf` models only. 

For instance, to use the `llama-ggml` backend for `ggml` models:

```yaml
name: llama
backend: llama-ggml
parameters:
  # Relative to the models path
  model: file.ggml.bin
```

### Reference

- [llama](https://github.com/ggerganov/llama.cpp)
- [binding](https://github.com/go-skynet/go-llama.cpp)
docs: Initial import from localai-website (#1312) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2023-11-22 17:13:50 +00:00
			`+++`
			`disableToc = false`
			`title = "🦙 llama.cpp"`
			`weight = 1`
			`+++`

			`[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.`

			`{{% notice note %}}`

feat(alias): alias llama to llama-cpp, update docs (#1448) Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> 2023-12-16 17:22:45 +00:00			The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, specify, use the `llama-ggml` backend instead. If you are relying in automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend. The go backend is deprecated as well but still available as `go-llama`. The go backend supports still features not available in the mainline: speculative sampling and embeddings.
docs: Initial import from localai-website (#1312) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2023-11-22 17:13:50 +00:00
			`{{% /notice %}}`

			`## Features`

			The `llama.cpp` model supports the following features:
			`- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})`
			`- [🧠 Embeddings]({{%relref "features/embeddings" %}})`
			`- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})`
			`- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})`

			`## Setup`

			LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model.

			`### Manual setup`

			It is sufficient to copy the `ggml` or `guf` model files in the `models` folder. You can refer to the model in the `model` parameter in the API calls.

			`[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.`

			`Prompt templates are useful for models that are fine-tuned towards a specific prompt.`

			`### Automatic setup`

			LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for `ggml` or `gguf` models.

			`For instance, if you have the galleries enabled, you can just start chatting with models in huggingface by running:`

			```bash
			`curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{`
			`"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",`
			`"messages": [{"role": "user", "content": "Say this is a test!"}],`
			`"temperature": 0.1`
			`}'`
			```

			LocalAI will automatically download and configure the model in the `model` directory.

			`Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "models" %}}).`

			`### YAML configuration`

			To use the `llama.cpp` backend, specify `llama` as the backend in the YAML file:

			```yaml
			`name: llama`
			`backend: llama`
			`parameters:`
			`# Relative to the models path`
			`model: file.gguf.bin`
			```

			In the example above we specify `llama` as the backend to restrict loading `gguf` models only.

feat(alias): alias llama to llama-cpp, update docs (#1448) Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> 2023-12-16 17:22:45 +00:00			For instance, to use the `llama-ggml` backend for `ggml` models:
docs: Initial import from localai-website (#1312) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2023-11-22 17:13:50 +00:00
			```yaml
			`name: llama`
feat(alias): alias llama to llama-cpp, update docs (#1448) Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> 2023-12-16 17:22:45 +00:00			`backend: llama-ggml`
docs: Initial import from localai-website (#1312) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2023-11-22 17:13:50 +00:00			`parameters:`
			`# Relative to the models path`
			`model: file.ggml.bin`
			```

			`### Reference`

			`- [llama](https://github.com/ggerganov/llama.cpp)`
			`- [binding](https://github.com/go-skynet/go-llama.cpp)`