# Modal
Modal is a serverless GPU provider. By leveraging Modal, your Tabby instance will run on demand. When there are no requests to the Tabby server for a certain amount of time, Modal will schedule the container to sleep, thereby saving GPU costs.
## Setup
First we import the components we need from `modal`.
```python
from modal import Image, Mount, Secret, Stub, asgi_app, gpu, method
```
Next, we set which model to serve, taking care to specify a GPU configuration with enough VRAM to fit the model.
```python
MODEL_ID = "TabbyML/StarCoder-1B"
GPU_CONFIG = gpu.T4()
```
## Define the container image
We want to create a Modal image which has the Tabby model cache pre-populated. The benefit of this is that the container no longer has to re-download the model - instead, it will take advantage of Modal’s internal filesystem for faster cold starts.
### Download the weights
```python
def download_model():
    import subprocess

    subprocess.run(
        [
            "/opt/tabby/bin/tabby",
            "download",
            "--model",
            MODEL_ID,
        ]
    )
```
### Image definition
We’ll start from an image published by Tabby, and override the default ENTRYPOINT so that Modal can run its own, which enables seamless serverless deployments.
Next we run the download step to pre-populate the image with our model weights.
Finally, we install `asgi-proxy-lib` to interface with Modal's ASGI webserver over localhost.
```python
image = (
    Image.from_registry(
        "tabbyml/tabby:0.3.1",
        add_python="3.11",
    )
    .dockerfile_commands("ENTRYPOINT []")
    .run_function(download_model)
    .pip_install("asgi-proxy-lib")
)
```
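The decorators in the next section reference a `Stub` object (imported in the Setup step) that binds the app name to this image. The naming scheme sketched below is an assumption inferred from the endpoint URL Modal prints on deploy; see the complete app.py for the actual construction.

```python
MODEL_ID = "TabbyML/StarCoder-1B"

# Hypothetical app name, e.g. "tabby-server-starcoder-1b", derived from the
# model id (assumed scheme, matching the endpoint URL Modal prints on deploy).
stub_name = "tabby-server-" + MODEL_ID.split("/")[-1].lower()
print(stub_name)  # tabby-server-starcoder-1b

# In app.py, this name and the image above are then bound together:
#   stub = Stub(stub_name, image=image)
```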
### The app function
The endpoint function is decorated with Modal's `@stub.function`. Here, we:
1. Launch the Tabby process and wait for it to be ready to accept requests.
2. Create an ASGI proxy to tunnel requests from the Modal web endpoint to the local Tabby server.
3. Specify that each container is allowed to handle up to 10 requests simultaneously.
4. Keep idle containers for 2 minutes before spinning them down.
```python
@stub.function(
    gpu=GPU_CONFIG,
    allow_concurrent_inputs=10,
    container_idle_timeout=120,
    timeout=360,
)
@asgi_app()
def app():
    import socket
    import subprocess
    import time

    from asgi_proxy import asgi_proxy

    launcher = subprocess.Popen(
        [
            "/opt/tabby/bin/tabby",
            "serve",
            "--model",
            MODEL_ID,
            "--port",
            "8000",
            "--device",
            "cuda",
        ]
    )

    # Poll until the webserver at 127.0.0.1:8000 accepts connections before running inputs.
    def tabby_ready():
        try:
            socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
            return True
        except (socket.timeout, ConnectionRefusedError):
            # Check if the launcher webserving process has exited.
            # If so, a connection can never be made.
            retcode = launcher.poll()
            if retcode is not None:
                raise RuntimeError(f"launcher exited unexpectedly with code {retcode}")
            return False

    while not tabby_ready():
        time.sleep(1.0)

    print("Tabby server ready!")
    return asgi_proxy("http://localhost:8000")
```
### Serve the app
Once we deploy this app with `modal serve app.py`, Modal will print the URL of the web endpoint, in the form `https://<USERNAME>--tabby-server-starcoder-1b-app-dev.modal.run`. This URL can then be used as the Tabby server URL in Tabby's editor extensions!
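To sanity-check the deployment before wiring it into an editor, you can query the endpoint directly. The `/v1/health` route is an assumption based on Tabby's standard HTTP API; substitute the URL printed by your own `modal serve` run.

```shell
# Placeholder URL; replace with the one Modal prints for your deployment.
TABBY_URL="https://<USERNAME>--tabby-server-starcoder-1b-app-dev.modal.run"

# Basic health check against the Tabby server (assumed /v1/health route).
# The first request after an idle period will be slow while the container cold-starts.
curl -s "$TABBY_URL/v1/health"
```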
See [app.py](https://github.com/TabbyML/tabby/blob/main/website/docs/installation/modal/app.py) for a complete example.
## Feedback and support
If you have improvement suggestions or need specific support, please join the [Tabby Slack community](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) or reach out on [Tabby's GitHub repository](https://github.com/TabbyML/tabby).