import CodeBlock from '@theme/CodeBlock';

# ⁉️ Frequently Asked Questions

<details>
<summary>How much VRAM does an LLM consume?</summary>
<div>By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.</div>
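<p>To see how much VRAM is actually in use on your card while Tabby is running, you can query the NVIDIA driver directly (a standard nvidia-smi invocation, not Tabby-specific):</p>
<CodeBlock language="bash">{`nvidia-smi --query-gpu=memory.used,memory.total --format=csv`}</CodeBlock>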
</details>
<details>
<summary>What GPUs are required for reduced-precision inference (e.g., int8)?</summary>
<div>
<ul>
<li>int8: Compute Capability >= 7.0 or Compute Capability 6.1</li>
<li>float16: Compute Capability >= 7.0</li>
<li>bfloat16: Compute Capability >= 8.0</li>
</ul>
<p>
To determine the mapping between a GPU model and its compute capability, please visit <a href="https://developer.nvidia.com/cuda-gpus">this page</a>, or query the card directly as shown below.
</p>
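<p>On reasonably recent NVIDIA drivers (those that support the <code>compute_cap</code> query field), you can read the compute capability straight from the driver:</p>
<CodeBlock language="bash">{`nvidia-smi --query-gpu=name,compute_cap --format=csv`}</CodeBlock>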
</div>
</details>
<details>
<summary>How can I utilize multiple NVIDIA GPUs?</summary>
<div>
<p>Tabby only supports the use of a single GPU. To utilize multiple GPUs, you can initiate multiple Tabby instances and set CUDA_VISIBLE_DEVICES accordingly.</p>
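<p>A minimal sketch of a two-GPU setup (the model ID and ports are illustrative, not required values):</p>
<CodeBlock language="bash">{`# One Tabby instance per GPU, each pinned to a device and listening on its own port
CUDA_VISIBLE_DEVICES=0 tabby serve --device cuda --model TabbyML/CodeLlama-7B --port 8080 &
CUDA_VISIBLE_DEVICES=1 tabby serve --device cuda --model TabbyML/CodeLlama-7B --port 8081 &`}</CodeBlock>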
</div>
</details>
<details>
<summary>How can I convert my own model for use with Tabby?</summary>
<div>
<p>Since version 0.5.0, Tabby's inference runs entirely on llama.cpp, so any model available in the GGUF format can be used with Tabby. To make this more accessible, we have curated a set of models that we benchmarked, available at <a href="https://github.com/TabbyML/registry-tabby">https://github.com/TabbyML/registry-tabby</a>. Users are free to fork that repository to create their own registry. If a user's registry is located at https://github.com/USERNAME/registry-tabby, its model IDs take the form <code>USERNAME/model</code>, passed to Tabby as <code>--model USERNAME/model</code>.</p>
<p>For details on the registry format, please refer to <a href="https://github.com/TabbyML/registry-tabby/blob/main/models.json">models.json</a> in that repository.</p>
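<p>For example, serving a model from a forked registry might look like this (the username and model name below are placeholders, not real registry entries):</p>
<CodeBlock language="bash">{`tabby serve --device cuda --model USERNAME/MyModel`}</CodeBlock>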
</div>
</details>