tabby/MODEL_SPEC.md

# Tabby Model Specification (Unstable)

> 💁 **INFO**
> Tabby currently operates with two inference backends: [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [llama.cpp](https://github.com/ggerganov/llama.cpp).
> The CUDA/CPU device utilizes ctranslate2 when the `--device cuda` or `--device cpu` options are specified, while the Metal (M1/M2) device employs llama.cpp with the `--device metal` option.
> 
> It is possible to create a model that is only usable for a specific inference backend. However, in general, the Tabby team will maintain models that are usable on all devices.

Tabby organizes the model within a directory. This document provides an explanation of the necessary contents for supporting model serving. An example model directory can be found at https://huggingface.co/TabbyML/StarCoder-1B

The minimal Tabby model directory should include the following contents:

```
ctranslate2/
ggml/
tabby.json
tokenizer.json
```

### tabby.json

This file provides meta information about the model. An example file appears as follows:

```json
{
    "auto_model": "AutoModelForCausalLM",
    "prompt_template": "<PRE>{prefix}<SUF>{suffix}<MID>",
    "chat_template":  "<s>{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + '</s> ' }}{% endif %}{% endfor %}",
}
```

The **auto_model** field can have one of the following values:
- `AutoModelForCausalLM`: This represents a decoder-only style language model, such as GPT or Llama.
- `AutoModelForSeq2SeqLM`: This represents an encoder-decoder style language model, like T5.

The **prompt_template** field is optional. When present, it is assumed that the model supports [FIM inference](https://arxiv.org/abs/2207.14255).

One example for the **prompt_template** is `<PRE>{prefix}<SUF>{suffix}<MID>`. In this format, `{prefix}` and `{suffix}` will be replaced with their corresponding values, and the entire prompt will be fed into the LLM.

The **chat_template** field is optional. When it is present, it is assumed that the model supports an instruct/chat-style interaction, and can be passed to `--chat-model`.

### tokenizer.json
This is the standard fast tokenizer file created using [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers). Most Hugging Face models already come with it in repository.

### ctranslate2/
This directory contains binary files used by the [ctranslate2](https://github.com/OpenNMT/CTranslate2) inference engine. Tabby utilizes ctranslate2 for inference on both `cpu` and `cuda` devices.

With the [python package](https://pypi.org/project/ctranslate2) installed, you can acquire this directory by executing the following command in the HF model directory:

```bash
ct2-transformers-converter --model ./ --output_dir ctranslate2 --quantization=float16
```

*Note that the model itself must be compatible with ctranslate2.*

### ggml/
This directory contains binary files used by the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. Tabby utilizes ggml for inference on the `metal` device.

Currently, only `q8_0.gguf` in this directory is in use. You can refer to the instructions in llama.cpp to learn how to acquire it.
docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00			`# Tabby Model Specification (Unstable)`

docs: add faq on how to convert own model for tabby (#592) 2023-10-18 23:18:24 +00:00			`> 💁 INFO`
			`> Tabby currently operates with two inference backends: [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [llama.cpp](https://github.com/ggerganov/llama.cpp).`
			> The CUDA/CPU device utilizes ctranslate2 when the `--device cuda` or `--device cpu` options are specified, while the Metal (M1/M2) device employs llama.cpp with the `--device metal` option.
			`>`
			`> It is possible to create a model that is only usable for a specific inference backend. However, in general, the Tabby team will maintain models that are usable on all devices.`

docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00			`Tabby organizes the model within a directory. This document provides an explanation of the necessary contents for supporting model serving. An example model directory can be found at https://huggingface.co/TabbyML/StarCoder-1B`

			`The minimal Tabby model directory should include the following contents:`

			```
			`ctranslate2/`
			`ggml/`
			`tabby.json`
			`tokenizer.json`
			```

			`### tabby.json`

			`This file provides meta information about the model. An example file appears as follows:`

docs: add chat_template to model spec 2023-10-04 22:04:09 +00:00			```json
docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00			`{`
			`"auto_model": "AutoModelForCausalLM",`
docs: add chat_template to model spec 2023-10-04 22:04:09 +00:00			`"prompt_template": "<PRE>{prefix}<SUF>{suffix}<MID>",`
			`"chat_template": "<s>{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + '</s> ' }}{% endif %}{% endfor %}",`
docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00			`}`
			```

			`The auto_model field can have one of the following values:`
			- `AutoModelForCausalLM`: This represents a decoder-only style language model, such as GPT or Llama.
			- `AutoModelForSeq2SeqLM`: This represents an encoder-decoder style language model, like T5.

			`The prompt_template field is optional. When present, it is assumed that the model supports [FIM inference](https://arxiv.org/abs/2207.14255).`

			One example for the prompt_template is `<PRE>{prefix}<SUF>{suffix}<MID>`. In this format, `{prefix}` and `{suffix}` will be replaced with their corresponding values, and the entire prompt will be fed into the LLM.

docs: add chat_template to model spec 2023-10-04 22:04:09 +00:00			The chat_template field is optional. When it is present, it is assumed that the model supports an instruct/chat-style interaction, and can be passed to `--chat-model`.

docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00			`### tokenizer.json`
			`This is the standard fast tokenizer file created using [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers). Most Hugging Face models already come with it in repository.`

			`### ctranslate2/`
			This directory contains binary files used by the [ctranslate2](https://github.com/OpenNMT/CTranslate2) inference engine. Tabby utilizes ctranslate2 for inference on both `cpu` and `cuda` devices.

			`With the [python package](https://pypi.org/project/ctranslate2) installed, you can acquire this directory by executing the following command in the HF model directory:`

			```bash
			`ct2-transformers-converter --model ./ --output_dir ctranslate2 --quantization=float16`
			```

			`Note that the model itself must be compatible with ctranslate2.`

			`### ggml/`
docs: Update MODEL_SPEC.md (#504) Fix #503 2023-10-04 15:33:51 +00:00			This directory contains binary files used by the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. Tabby utilizes ggml for inference on the `metal` device.
docs: add model spec (unstable) version (#457) * docs: add model spec (unstable) version * add link to hf tokenizers 2023-09-18 07:48:03 +00:00
			Currently, only `q8_0.gguf` in this directory is in use. You can refer to the instructions in llama.cpp to learn how to acquire it.