LocalAI/docs/content/model-compatibility/exllama.md


+++
disableToc = false
title = "🦙 Exllama"
weight = 2
+++


[Exllama](https://github.com/turboderp/exllama) is "A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights".

## Prerequisites

This is an extra backend. It is already available in the container images, so no setup is required there.

If you are building LocalAI locally, you need to install [exllama manually](https://github.com/jllllll/exllama#this-is-a-python-module-version-of-exllama) first.

## Model setup

Download the model as a folder inside the `models` directory and create a YAML file specifying the `exllama` backend. For instance, with the `TheBloke/WizardLM-7B-uncensored-GPTQ` model:

```
$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ && cd ..
$ ls models/
.keep                        WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# ...
```
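
If you prefer to script this step, the following is a minimal sketch that writes the same configuration file and checks that the model folder is in place. The helper name `write_exllama_config` is hypothetical and not part of LocalAI. It only emits the three keys shown above, so any further settings (the `# ...` part) still need to be added by hand.

```python
from pathlib import Path


def write_exllama_config(models_dir: str, name: str, model_folder: str) -> Path:
    """Write a minimal model config that selects the exllama backend.

    Hypothetical helper: it mirrors the YAML shown above and nothing more.
    """
    models = Path(models_dir)

    # The `model` parameter refers to a folder inside the models directory,
    # so fail early if the clone step has not been run yet.
    if not (models / model_folder).is_dir():
        raise FileNotFoundError(f"model folder not found: {models / model_folder}")

    config_path = models / f"{name}.yaml"
    config_path.write_text(
        f"name: {name}\n"
        "parameters:\n"
        f"  model: {model_folder}\n"
        "backend: exllama\n",
        encoding="utf-8",
    )
    return config_path
```

For the example above, this would be called as `write_exllama_config("models", "exllama", "WizardLM-7B-uncensored-GPTQ")`.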

Test with:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "exllama",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.1
}'
```
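
The same request can be sent from Python. The sketch below uses only the standard library and assumes LocalAI is listening on `localhost:8080` as in the curl example. The function names `build_chat_request` and `ask` are illustrative, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt, model="exllama",
                       base_url="http://localhost:8080", temperature=0.1):
    """Build the same POST request as the curl example (illustrative helper)."""
    payload = {
        "model": model,  # must match the `name` field in the YAML config
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text.

    Assumption: the response follows the OpenAI chat completion shape.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("How are you?"))` should print the model's reply.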
docs: Initial import from localai-website (#1312) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> 2023-11-22 17:13:50 +00:00