> Tabby currently operates with two inference backends: [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [llama.cpp](https://github.com/ggerganov/llama.cpp).
> The CUDA/CPU device utilizes ctranslate2 when the `--device cuda` or `--device cpu` options are specified, while the Metal (M1/M2) device employs llama.cpp with the `--device metal` option.
>
> It is possible to create a model that is only usable for a specific inference backend. However, in general, the Tabby team will maintain models that are usable on all devices.
Tabby organizes the model within a directory. This document provides an explanation of the necessary contents for supporting model serving. An example model directory can be found at https://huggingface.co/TabbyML/StarCoder-1B
The minimal Tabby model directory should include the following contents:
```
ctranslate2/
ggml/
tabby.json
tokenizer.json
```
### tabby.json
This file provides meta information about the model. An example file appears as follows:
The **auto_model** field can have one of the following values:
-`AutoModelForCausalLM`: This represents a decoder-only style language model, such as GPT or Llama.
-`AutoModelForSeq2SeqLM`: This represents an encoder-decoder style language model, like T5.
The **prompt_template** field is optional. When present, it is assumed that the model supports [FIM inference](https://arxiv.org/abs/2207.14255).
One example for the **prompt_template** is `<PRE>{prefix}<SUF>{suffix}<MID>`. In this format, `{prefix}` and `{suffix}` will be replaced with their corresponding values, and the entire prompt will be fed into the LLM.
The **chat_template** field is optional. When it is present, it is assumed that the model supports an instruct/chat-style interaction, and can be passed to `--chat-model`.
This is the standard fast tokenizer file created using [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers). Most Hugging Face models already come with it in repository.
### ctranslate2/
This directory contains binary files used by the [ctranslate2](https://github.com/OpenNMT/CTranslate2) inference engine. Tabby utilizes ctranslate2 for inference on both `cpu` and `cuda` devices.
With the [python package](https://pypi.org/project/ctranslate2) installed, you can acquire this directory by executing the following command in the HF model directory:
This directory contains binary files used by the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. Tabby utilizes ggml for inference on the `metal` device.