import CodeBlock from '@theme/CodeBlock';

# ⁉️ Frequently Asked Questions
### How much VRAM does an LLM consume?
By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.
### What GPUs are required for reduced-precision inference (e.g., int8)?

To determine the compute capability of a given GPU model, please visit this page.
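As a rough illustration, the compute capability reported by a tool such as `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` can be compared against a minimum version. The sketch below is a hedged example, not part of Tabby itself, and the `(7, 0)` cutoff is an assumption based on NVIDIA's Tensor Core int8 support beginning with Volta; consult NVIDIA's documentation for your specific card.

```python
def meets_compute_capability(reported: str, required: tuple[int, int] = (7, 0)) -> bool:
    """Return True if a compute-capability string like "8.6" meets `required`.

    The (7, 0) default is an assumption (Tensor Core int8 support began with
    Volta); it is not a threshold stated on this page.
    """
    major, _, minor = reported.strip().partition(".")
    return (int(major), int(minor or "0")) >= required

# Example: an RTX 3090 reports compute capability 8.6.
print(meets_compute_capability("8.6"))  # True
print(meets_compute_capability("6.1"))  # False
```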

### How can I utilize multiple NVIDIA GPUs?

Tabby supports replicating models on multiple GPUs to increase throughput. You can specify the devices for model replication by using the --device-indices option.

<CodeBlock language="bash">
{`# Replicate the model to GPU 0 and GPU 1.
tabby serve ... --device-indices 0 --device-indices 1`}
</CodeBlock>