import CodeBlock from '@theme/CodeBlock';

# ⁉️ Frequently Asked Questions

### How much VRAM does an LLM consume?
By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
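As a rough rule of thumb, int8 inference needs about one byte per model parameter plus some headroom for the KV cache and activations. The sketch below illustrates that arithmetic (the overhead figure is an assumption for illustration, not Tabby's exact allocator behavior):

```python
def estimate_vram_gb(num_params_billions: float,
                     bytes_per_param: float = 1.0,
                     overhead_gb: float = 1.0) -> float:
    """Back-of-envelope VRAM estimate: int8 uses ~1 byte per parameter,
    plus an assumed ~1 GB of overhead for KV cache and activations."""
    return num_params_billions * bytes_per_param + overhead_gb

print(estimate_vram_gb(7))  # CodeLlama-7B at int8 -> 8.0 (GB)
```

This lines up with the ~8GB figure quoted above for CodeLlama-7B in int8 mode.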
### What GPUs are required for reduced-precision inference (e.g., int8)?

To determine the mapping between a GPU card type and its compute capability, please visit this page.

### How can I utilize multiple NVIDIA GPUs?

Tabby only supports the use of a single GPU. To utilize multiple GPUs, you can start multiple Tabby instances and set `CUDA_VISIBLE_DEVICES` accordingly.
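For example, a launch sketch that pins one instance to each of two GPUs (the model name, ports, and flags here are illustrative — check `tabby serve --help` for the options your version supports):

```shell
# Instance 1: visible only to GPU 0
CUDA_VISIBLE_DEVICES=0 tabby serve --model TabbyML/CodeLlama-7B --device cuda --port 8080 &

# Instance 2: visible only to GPU 1
CUDA_VISIBLE_DEVICES=1 tabby serve --model TabbyML/CodeLlama-7B --device cuda --port 8081 &
```

Each process then sees exactly one device, so clients can be spread across the instances by port.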

### How can I convert my own model for use with Tabby?

Since version 0.5.0, Tabby's inference is powered entirely by llama.cpp, so any model in the GGUF format can be used with Tabby. To enhance accessibility, we have curated and benchmarked a set of models, available at registry-tabby.

Users are free to fork the repository to create their own registry. If a user's registry is located at https://github.com/USERNAME/registry-tabby, the model ID will be `USERNAME/model`.

For details on the registry format, please refer to models.json.