tabby/website/docs/03-faq.mdx

import CodeBlock from '@theme/CodeBlock';
# ⁉️ Frequently Asked Questions
<details>
<summary>How much VRAM does an LLM consume?</summary>
<div>By default, Tabby operates in int8 mode with CUDA, requiring approximately 8GB of VRAM for CodeLlama-7B.</div>
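<p>To check actual VRAM usage on your machine while Tabby is running, you can query the NVIDIA driver directly (an illustrative command, not specific to Tabby):</p>
<CodeBlock language="bash">
# Show per-process GPU memory usage.{'\n'}
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
</CodeBlock>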
</details>
<details>
<summary>What GPUs are required for reduced-precision inference (e.g., int8)?</summary>
<div>
<ul>
<li>int8: Compute Capability >= 7.0 or Compute Capability 6.1</li>
<li>float16: Compute Capability >= 7.0</li>
<li>bfloat16: Compute Capability >= 8.0</li>
</ul>
<p>
To determine the compute capability of a given GPU model, please visit <a href="https://developer.nvidia.com/cuda-gpus">this page</a>.
</p>
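<p>On recent NVIDIA drivers you can also query the compute capability of the installed GPUs directly (an illustrative command; the <code>compute_cap</code> query field requires a reasonably new driver):</p>
<CodeBlock language="bash">
# Print each GPU's name and compute capability.{'\n'}
nvidia-smi --query-gpu=name,compute_cap --format=csv
</CodeBlock>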
</div>
</details>
<details>
<summary>How do I utilize multiple NVIDIA GPUs?</summary>
<div>
<p>Tabby supports replicating models on multiple GPUs to increase throughput. You can specify the devices for model replication by using the <b>--device-indices</b> option.</p>
<CodeBlock language="bash">
# Replicate model to GPU 0 and GPU 1.{'\n'}
tabby serve ... --device-indices 0 --device-indices 1
</CodeBlock>
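<p>When running Tabby via Docker, make sure the GPUs referenced by <b>--device-indices</b> are also exposed to the container (an illustrative sketch; adjust the port, volume, image tag, and remaining serve flags to your setup):</p>
<CodeBlock language="bash">
# Expose GPU 0 and GPU 1 to the container, then replicate the model across both.{'\n'}
docker run -it --gpus '"device=0,1"' -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve ... --device-indices 0 --device-indices 1
</CodeBlock>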
</div>
</details>