docs: add back-pressure and cancellation blog (#479)
* docs: add back-pressure and cancellation blog
* fix(doc): format and content improvement for the back-pressure blog (#496)
* Minor editorial changes on cancellation blog.
* adjust blog structure
* rename blog title

---------

Co-authored-by: Wang Zixiao <wayne.wang0821@gmail.com>
Co-authored-by: Meng Zhang <meng@tabbyml.com>

release-0.2
parent 2171ba72ff
commit 6348018d38

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:93f559473aa7d0e2da81d2fb2fe2a733549a2b83cc84da8c95d92d0bc4d2e8b5
size 130476

@ -0,0 +1,139 @@
---
authors: [wwayne, gyxlucy, meng]

tags: [tech design]
---

# Stream laziness in Tabby

This blog focuses on understanding stream laziness in Tabby. You do not need to know this information to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workload.

## What is streaming?
Let's begin by setting up a simple example program:

![intro](./intro.png)

```javascript
const express = require('express');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function* llm() {
  let i = 1;
  while (true) {
    console.log(`producing ${i}`);
    yield i++;

    // Mimic LLM inference latency.
    await sleep(1000);
  }
}

function server(llm) {
  const app = express();
  app.get('/', async (req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/jsonstream',
      'Transfer-Encoding': 'chunked',
    });

    let value, done;
    do {
      ({ value, done } = await llm.next());
      res.write(JSON.stringify(value));
      res.write('\n');
    } while (!done);
  });

  app.listen(8080);
}

async function consumer() {
  const resp = await fetch('http://localhost:8080');

  // Read values from our stream
  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
  // We're only reading 3 items this time:
  for (let i = 0; i < 3; i++) {
    // we know our stream is infinite, so there's no need to check `done`.
    const { value } = await reader.read();
    console.log(`read ${value}`);
  }
}

server(llm());
consumer();
```

## Stream Laziness

If you run this program, you'll notice something interesting: the LLM keeps printing `producing ${i}` even after the consumer has finished reading three items. This might seem obvious, given that the LLM is generating an infinite stream of integers. However, it represents a problem: our server must maintain an ever-expanding queue of items that have been pushed in but not pulled out.

Moreover, the workload involved in creating the stream is typically both expensive and time-consuming, such as computation on the GPU. What if the client aborts the in-flight request, whether because of a network issue or intentionally?

This is where the concept of stream laziness comes into play. We should perform computations only when the client requests them. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.

![stream-flow](./stream-flow.png)

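To make the pull-based behavior concrete, here is a minimal sketch (not Tabby's actual implementation) showing that an async generator does no work until the consumer calls `next()`, and stops for good once the consumer calls `return()`. The `createLlm` helper and its `stats` counter are hypothetical names introduced only for this illustration.

```javascript
// Hypothetical helper: returns a lazy token stream plus a counter that
// records how many times the "expensive" production step actually ran.
function createLlm() {
  const stats = { produced: 0 };

  async function* tokens() {
    let i = 1;
    while (true) {
      stats.produced += 1; // stands in for an expensive inference step
      yield i++;
    }
  }

  return { stream: tokens(), stats };
}

async function demo() {
  const { stream, stats } = createLlm();

  // Creating the generator runs none of its body yet.
  const before = stats.produced;

  // Each next() call pulls exactly one item out of the stream.
  const read = [];
  for (let n = 0; n < 3; n++) {
    const { value } = await stream.next();
    read.push(value);
  }

  // The consumer signals that it needs nothing more.
  await stream.return();
  const afterReturn = await stream.next();

  return {
    before,
    read,
    produced: stats.produced,
    doneAfterReturn: afterReturn.done,
  };
}

demo().then((result) => console.log(result));
```

In this sketch the counter only advances when the consumer pulls, so three reads cost three production steps and nothing more is produced afterwards.
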
## How to handle cancellation?

The core idea is straightforward: on the server side, we listen for the `close` event and check whether the connection is still open before pulling more data from the LLM stream.

```js
app.get('/', async (req, res) => {
  ...
  let canceled;
  req.on('close', () => canceled = true);
  do {
    ({ value, done } = await llm.next());
    ...
  } while (!done && !canceled);
});
```

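The same check can be exercised without a real HTTP server. The sketch below is an illustration under stated assumptions, not the code Tabby ships: `pump` is a hypothetical helper, a plain `EventEmitter` stands in for the Express `req` object, and each request gets its own generator so that `return()` can release it once the client goes away.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical per-request token source standing in for the LLM stream.
async function* llm(log) {
  let i = 1;
  try {
    while (true) {
      log.push(`producing ${i}`);
      yield i++;
    }
  } finally {
    // Runs when the consumer calls return(); a place to free resources.
    log.push('stopped');
  }
}

// Pull from `source` only while `req` has not emitted `close`.
async function pump(source, req, write) {
  let canceled = false;
  req.on('close', () => { canceled = true; });

  while (!canceled) {
    const { value, done } = await source.next();
    // Re-check after the await: the client may have left in the meantime.
    if (done || canceled) break;
    write(value);
  }

  // Tell the generator that nobody will pull from it again.
  await source.return();
}

async function demo() {
  const log = [];
  const written = [];
  const req = new EventEmitter(); // stand-in for the request object

  // Simulate a client that disconnects after receiving three items.
  const write = (value) => {
    written.push(value);
    if (written.length === 3) req.emit('close');
  };

  await pump(llm(log), req, write);
  return { log, written };
}

demo().then((result) => console.log(result));
```

Checking the flag both before and after the `await` keeps the loop from writing an item that was produced while the client was disconnecting.
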
## Implement cancellation for Tabby

In Tabby, effective management of code completion cancellations is crucial for promptly responding to users' inputs while optimizing model usage to enhance performance.

On the client side, whenever we receive a new input from a user, it's essential to abort the previous query and promptly retrieve a new response from the server.

```js
// Demo code in the client side

let controller;

// `async` is required because the body awaits the fetch call.
const callServer = async (prompt) => {
  controller = new AbortController();
  const signal = controller.signal;
  // 2. calling server API to get the result with the prompt
  const response = await fetch("/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ prompt }),
    signal
  });
}

const onChange = (e) => {
  if (controller) controller.abort(); // Abort the previous request
  callServer(e.target.value);
};

// 1. Debounce the input 100ms for example
<input onChange={debounce(onChange)} />

```

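The demo above leaves `debounce` undefined and does not handle the rejection that an aborted `fetch` produces. The following sketch fills both gaps under stated assumptions: `debounce`, `latestOnly`, and `fakeFetch` are hypothetical helpers written for this illustration, and `fakeFetch` imitates a real `fetch` only in that it rejects with an error named `AbortError` when its signal is aborted.

```javascript
// Hypothetical debounce: runs `fn` once the calls have paused for `ms`.
function debounce(fn, ms = 100) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// Hypothetical wrapper: every new call aborts the one still in flight.
function latestOnly(fetchImpl) {
  let controller;
  return async (prompt) => {
    if (controller) controller.abort(); // Abort the previous request
    controller = new AbortController();
    try {
      return await fetchImpl(prompt, controller.signal);
    } catch (err) {
      // A superseded request is expected, not a failure.
      if (err.name === 'AbortError') return undefined;
      throw err;
    }
  };
}

// Stand-in for the network call so the sketch runs without a server.
function fakeFetch(prompt, signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(`completion for ${prompt}`), 20);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      const err = new Error('The operation was aborted');
      err.name = 'AbortError';
      reject(err);
    });
  });
}

async function demo() {
  const complete = latestOnly(fakeFetch);
  // The second call arrives while the first is still in flight.
  return Promise.all([complete('a'), complete('ab')]);
}

demo().then((results) => console.log(results));
```

Swallowing only `AbortError` keeps genuine network failures visible while treating a superseded request as a normal outcome.
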
By employing streaming and implementing laziness semantics appropriately, all components operate smoothly and efficiently!

![cancellation](./cancellation.png)

## That's it

We would love to invite you to join our Slack community! Please feel free to reach out to us on [Slack](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.

Happy hacking 😁💪🏻

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4113704a91e8355fb14ff43c0b26fb3fd926168dbc0eacca008f5b8657e1cb8e
size 111767

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:905ca358ae3198ccd587d7a837cc82fcbfad30c7c5b6115248a41ed9ae5468b8
size 107185

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:74734640c70fb1ebbfe6baa9a3ff8f8ef257577766d98df6bbdb066e44a9a17d
size 104103

@ -7,3 +7,13 @@ gyxlucy:
  name: Lucy Gao
  url: https://github.com/gyxlucy
  image_url: https://github.com/gyxlucy.png

icycodes:
  name: Zhiming Ma
  url: https://github.com/icycodes
  image_url: https://github.com/icycodes.png

wwayne:
  name: Wayne Wang
  url: https://github.com/wwayne
  image_url: https://github.com/wwayne.png