# Modal
Modal is a serverless GPU provider. By deploying on Modal, your Tabby instance runs on demand: when the Tabby server receives no requests for a certain amount of time, Modal puts the container to sleep, saving GPU costs.
## Setup
First, we import the components we need from `modal`.
```python
from modal import Image, Mount, Secret, Stub, asgi_app, gpu, method
```
Next, we choose which model to serve, taking care to specify a GPU configuration with enough VRAM to fit the model.
```python
MODEL_ID = "TabbyML/StarCoder-1B"
GPU_CONFIG = gpu.T4()

# The Stub names the Modal app; @stub.function below attaches functions to it.
stub = Stub("tabby-server-starcoder-1b")
```
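A T4 (16 GB VRAM) is enough for a 1B-parameter model. For a larger model you would pick a GPU with more memory; as an illustrative sketch (this model ID and GPU choice are examples, not part of this deployment):

```python
# Illustrative alternative: a 7B model needs a GPU with more VRAM.
MODEL_ID_7B = "TabbyML/CodeLlama-7B"
GPU_CONFIG_7B = gpu.A10G()
```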
## Define the container image
We want to create a Modal image with the Tabby model cache pre-populated. The benefit is that the container no longer has to re-download the model; instead, it takes advantage of Modal’s internal filesystem for faster cold starts.
### Download the weights
```python
def download_model():
    import subprocess

    subprocess.run(
        [
            "/opt/tabby/bin/tabby",
            "download",
            "--model",
            MODEL_ID,
        ],
        check=True,  # fail the image build if the download fails
    )
```
### Image definition
We’ll start from the official Tabby image and override the default ENTRYPOINT so that Modal can run its own, which enables seamless serverless deployments.
Next, we run the download step to pre-populate the image with our model weights.
Finally, we install `asgi-proxy-lib` to proxy requests from Modal's ASGI web server to the local Tabby server over localhost.
```python
image = (
    Image.from_registry(
        "tabbyml/tabby:0.3.1",
        add_python="3.11",
    )
    .dockerfile_commands("ENTRYPOINT []")
    .run_function(download_model)
    .pip_install("asgi-proxy-lib")
)
```
### The app function
The endpoint function is decorated with Modal's `@stub.function`. Here, we:
1. Launch the Tabby process and wait for it to be ready to accept requests.
2. Create an ASGI proxy to tunnel requests from the Modal web endpoint to the local Tabby server.
3. Specify that each container is allowed to handle up to 10 requests simultaneously.
4. Keep idle containers alive for 2 minutes before spinning them down.
```python
@stub.function(
    gpu=GPU_CONFIG,
    allow_concurrent_inputs=10,
    container_idle_timeout=120,
    timeout=360,
)
@asgi_app()
def app():
    import socket
    import subprocess
    import time

    from asgi_proxy import asgi_proxy

    launcher = subprocess.Popen(
        [
            "/opt/tabby/bin/tabby",
            "serve",
            "--model",
            MODEL_ID,
            "--port",
            "8000",
            "--device",
            "cuda",
        ]
    )

    # Poll until the web server at 127.0.0.1:8000 accepts connections before serving inputs.
    def tabby_ready():
        try:
            socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
            return True
        except (socket.timeout, ConnectionRefusedError):
            # If the launcher process has exited, a connection can never be made.
            retcode = launcher.poll()
            if retcode is not None:
                raise RuntimeError(f"launcher exited unexpectedly with code {retcode}")
            return False

    while not tabby_ready():
        time.sleep(1.0)

    print("Tabby server ready!")
    return asgi_proxy("http://localhost:8000")
```
### Serve the app
Once we deploy this app with `modal serve app.py`, the command outputs the URL of the web endpoint, in the form `https://<USERNAME>--tabby-server-starcoder-1b-app-dev.modal.run`. This URL can be used as the Tabby server URL in Tabby's editor extensions!
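To sanity-check the deployment before wiring up an editor, you can query the server's health route. This is a minimal sketch, assuming your Tabby version exposes `GET /v1/health` and that the endpoint URL is the one printed by `modal serve`:

```python
import json
import urllib.request


def check_health(endpoint: str) -> dict:
    # Fetch <endpoint>/v1/health and decode the JSON body.
    url = endpoint.rstrip("/") + "/v1/health"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode())
```

Note that the first request after an idle period will be slow, since Modal has to cold-start the container and load the model.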
See [app.py](https://github.com/TabbyML/tabby/tree/main/website/docs/modal/app.py) for a complete example.
## Feedback and support
If you have improvement suggestions or need specific support, please join the [Tabby Slack community](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) or reach out on [Tabby’s GitHub repository](https://github.com/TabbyML/tabby).