import CodeBlock from '@theme/CodeBlock';
# ⁉️ Frequently Asked Questions
<details>

<summary>How much VRAM does an LLM consume?</summary>

<div>By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.</div>

</details>
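As a rough rule of thumb, the figure above is the parameter count times the bytes per weight, plus runtime overhead. A minimal sketch of that arithmetic (the function name and the 20% overhead factor are illustrative assumptions, not Tabby internals):

```python
# Rough VRAM estimate: parameters x bytes per parameter, scaled by an
# assumed ~20% overhead for activations and caches. Illustrative only.
def estimate_vram_gb(num_params_billion: float, bytes_per_param: int,
                     overhead: float = 1.2) -> float:
    """Return an approximate VRAM requirement in GiB."""
    params = num_params_billion * 1e9
    return params * bytes_per_param * overhead / 2**30

# CodeLlama-7B with int8 weights (1 byte each) lands near the ~8GB figure.
print(round(estimate_vram_gb(7, 1), 1))  # → 7.8
```

The same model in float16 (2 bytes per weight) roughly doubles the requirement, which is why reduced-precision modes matter on consumer GPUs.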
<details>

<summary>What GPUs are required for reduced-precision inference (e.g., int8)?</summary>

<div>

<ul>

<li>int8: Compute Capability >= 7.0 or Compute Capability 6.1</li>

<li>float16: Compute Capability >= 7.0</li>

<li>bfloat16: Compute Capability >= 8.0</li>

</ul>

<p>

To find the compute capability of your GPU model, please visit <a href="https://developer.nvidia.com/cuda-gpus">this page</a>.

</p>

</div>

</details>
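The thresholds above can be expressed as a small lookup helper; a sketch under the assumption that the table is exhaustive (the function is illustrative, not part of Tabby's API):

```python
# Map a GPU's compute capability (major, minor) to the reduced-precision
# modes it supports, mirroring the table above. Illustrative helper only.
def supported_precisions(major: int, minor: int) -> list[str]:
    cc = major + minor / 10
    modes = []
    if cc >= 7.0 or (major, minor) == (6, 1):
        modes.append("int8")
    if cc >= 7.0:
        modes.append("float16")
    if cc >= 8.0:
        modes.append("bfloat16")
    return modes

# e.g. an Ampere-generation GPU (compute capability 8.6):
print(supported_precisions(8, 6))  # → ['int8', 'float16', 'bfloat16']
```

Note the special case: compute capability 6.1 (Pascal, e.g. GTX 10-series) supports int8 but neither float16 nor bfloat16 modes.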
<details>

<summary>How do I utilize multiple NVIDIA GPUs?</summary>

<div>

<p>Tabby supports replicating a model across multiple GPUs to increase throughput. Specify the target devices with the <b>--device-indices</b> option.</p>

<CodeBlock language="bash">
# Replicate the model to GPU 0 and GPU 1.{'\n'}
tabby serve ... --device-indices 0 --device-indices 1
</CodeBlock>

</div>

</details>
<details>

<summary>How can I convert my own model for use with Tabby?</summary>

<div>

<p>Follow the instructions provided in the <a href="https://github.com/TabbyML/tabby/blob/main/MODEL_SPEC.md">Model Spec</a>.</p>

<p>Please note that the spec is unstable and does not adhere to semver.</p>

</div>

</details>