From ff65c59996a85ac79e50efc9f50544036ca96311 Mon Sep 17 00:00:00 2001
From: Meng Zhang
Date: Sun, 29 Oct 2023 01:24:22 -0700
Subject: [PATCH] docs: Update MODEL_SPEC.md. remove info about ctranslate2 backend (#660)

---
 MODEL_SPEC.md | 27 ++-------------------------
 1 file changed, 2 insertions(+), 25 deletions(-)

diff --git a/MODEL_SPEC.md b/MODEL_SPEC.md
index beec00b..7540e30 100644
--- a/MODEL_SPEC.md
+++ b/MODEL_SPEC.md
@@ -1,17 +1,10 @@
 # Tabby Model Specification (Unstable)
 
-> 💁 **INFO**
-> Tabby currently operates with two inference backends: [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [llama.cpp](https://github.com/ggerganov/llama.cpp).
-> The CUDA/CPU device utilizes ctranslate2 when the `--device cuda` or `--device cpu` options are specified, while the Metal (M1/M2) device employs llama.cpp with the `--device metal` option.
->
-> It is possible to create a model that is only usable for a specific inference backend. However, in general, the Tabby team will maintain models that are usable on all devices.
-
 Tabby organizes the model within a directory. This document provides an explanation of the necessary contents for supporting model serving. An example model directory can be found at https://huggingface.co/TabbyML/StarCoder-1B
 
 The minimal Tabby model directory should include the following contents:
 
 ```
-ctranslate2/
 ggml/
 tabby.json
 tokenizer.json
@@ -23,16 +16,11 @@ This file provides meta information about the model. An example file appears as
 
 ```json
 {
-    "auto_model": "AutoModelForCausalLM",
     "prompt_template": "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>",
     "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + ' ' }}{% endif %}{% endfor %}"
 }
 ```
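For reference, here is a minimal sketch (plain Python, no Jinja engine) of what an `[INST]`-style **chat_template** like the one above renders for a short conversation. The function name and example messages are illustrative, not part of Tabby:

```python
# Sketch: mirrors the chat_template logic — user turns are wrapped in
# "[INST] ... [/INST]", assistant turns are emitted with a trailing space.
def render_chat(messages):
    out = []
    for m in messages:
        if m["role"] == "user":
            out.append("[INST] " + m["content"] + " [/INST]")
        elif m["role"] == "assistant":
            out.append(m["content"] + " ")
    return "".join(out)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "Write a loop"},
]
# render_chat(messages)
# -> "[INST] Hello [/INST]Hi there [INST] Write a loop [/INST]"
```

In an actual deployment the template string from `tabby.json` would be evaluated by a Jinja-compatible engine rather than hand-rolled code.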
 
-The **auto_model** field can have one of the following values:
-- `AutoModelForCausalLM`: This represents a decoder-only style language model, such as GPT or Llama.
-- `AutoModelForSeq2SeqLM`: This represents an encoder-decoder style language model, like T5.
-
 The **prompt_template** field is optional. When present, it is assumed that the model supports [FIM inference](https://arxiv.org/abs/2207.14255).
 
 One example for the **prompt_template** is `<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>`. In this format, `{prefix}` and `{suffix}` will be replaced with their corresponding values, and the entire prompt will be fed into the LLM.
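Filling such a template is plain string substitution. The sketch below uses StarCoder-style sentinel tokens as an illustrative assumption; other model families use different tokens:

```python
# Sketch: filling a FIM-style prompt_template by string substitution.
# The <fim_*> sentinel names follow the StarCoder convention and are
# illustrative only.
template = "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = template.format(
    prefix="def add(a, b):\n    return ",  # code before the cursor
    suffix="\n",                           # code after the cursor
)
# The model is then asked to generate the "middle" that joins prefix and suffix.
```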
@@ -42,18 +30,7 @@ The **chat_template** field is optional. When it is present, it is assumed that
 ### tokenizer.json
 This is the standard fast tokenizer file created using [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers). Most Hugging Face models already include it in their repository.
 
-### ctranslate2/
-This directory contains binary files used by the [ctranslate2](https://github.com/OpenNMT/CTranslate2) inference engine. Tabby utilizes ctranslate2 for inference on both `cpu` and `cuda` devices.
-
-With the [python package](https://pypi.org/project/ctranslate2) installed, you can acquire this directory by executing the following command in the HF model directory:
-
-```bash
-ct2-transformers-converter --model ./ --output_dir ctranslate2 --quantization=float16
-```
-
-*Note that the model itself must be compatible with ctranslate2.*
-
 ### ggml/
-This directory contains binary files used by the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. Tabby utilizes ggml for inference on the `metal` device.
+This directory contains binary files used by the [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. Tabby utilizes ggml for inference on `cpu`, `cuda`, and `metal` devices.
 
-Currently, only `q8_0.gguf` in this directory is in use. You can refer to the instructions in llama.cpp to learn how to acquire it.
+Currently, only `q8_0.v2.gguf` in this directory is in use. You can refer to the instructions in llama.cpp to learn how to acquire it.
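The minimal layout described above can be checked mechanically. The helper below is a hypothetical sketch, not part of Tabby:

```python
# Hypothetical helper (not part of Tabby): report which entries of the
# minimal model layout are missing from a local directory.
from pathlib import Path

REQUIRED = ("tabby.json", "tokenizer.json", "ggml/q8_0.v2.gguf")

def missing_entries(root):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in REQUIRED if not (root / entry).exists()]
```

An empty return value means the directory satisfies the minimal spec.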