diff --git a/website/blog/2023-09-30-stream-laziness-in-tabby/cancellation.png b/website/blog/2023-09-30-stream-laziness-in-tabby/cancellation.png
new file mode 100644
index 0000000..6252d81
--- /dev/null
+++ b/website/blog/2023-09-30-stream-laziness-in-tabby/cancellation.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:93f559473aa7d0e2da81d2fb2fe2a733549a2b83cc84da8c95d92d0bc4d2e8b5
+size 130476
diff --git a/website/blog/2023-09-30-stream-laziness-in-tabby/index.md b/website/blog/2023-09-30-stream-laziness-in-tabby/index.md
new file mode 100644
index 0000000..363fcff
--- /dev/null
+++ b/website/blog/2023-09-30-stream-laziness-in-tabby/index.md
@@ -0,0 +1,139 @@
+---
+authors: [wwayne, gyxlucy, meng]
+
+tags: [tech design]
+---
+
+# Stream laziness in Tabby
+
+This blog focuses on understanding stream laziness in Tabby. You do not need to know this to use Tabby, but for those interested, it offers a deeper dive into why and how Tabby handles its LLM workloads.
+
+## What is streaming?
+
+Let's begin by setting up a simple example program:
+
+![intro](./intro.png)
+
+```javascript
+const express = require('express');
+
+function sleep(ms) {
+  return new Promise(resolve => setTimeout(resolve, ms));
+}
+
+async function* llm() {
+  let i = 1;
+  while (true) {
+    console.log(`producing ${i}`);
+    yield i++;
+
+    // Mimic LLM inference latency.
+    await sleep(1000);
+  }
+}
+
+function server(llm) {
+  const app = express();
+  app.get('/', async (req, res) => {
+    res.writeHead(200, {
+      'Content-Type': 'application/jsonstream',
+      'Transfer-Encoding': 'chunked',
+    });
+
+    let value, done;
+    do {
+      ({ value, done } = await llm.next());
+      res.write(JSON.stringify(value));
+      res.write('\n');
+    } while (!done);
+  });
+
+  app.listen(8080);
+}
+
+async function consumer() {
+  const resp = await fetch('http://localhost:8080');
+
+  // Read values from our stream.
+  const reader = resp.body.pipeThrough(new TextDecoderStream()).getReader();
+  // We're only reading 3 items this time:
+  for (let i = 0; i < 3; i++) {
+    // We know our stream is infinite, so there's no need to check `done`.
+    const { value } = await reader.read();
+    console.log(`read ${value}`);
+  }
+}
+
+server(llm());
+consumer();
+```
+
+## Stream Laziness
+
+If you were to run this program, you'd notice something interesting: the LLM keeps printing `producing ${i}` even after the consumer has finished its three reads. This might seem obvious, given that the LLM generates an infinite stream of integers. However, it reveals a problem: the server must maintain an ever-growing queue of items that have been pushed in but never pulled out.
+
+Moreover, the work that produces the stream is typically expensive and time-consuming, such as inference on the GPU. But what if the client aborts the in-flight request, whether because of a network issue or intentionally?
+
+This is where the concept of stream laziness comes into play: we should perform computation only when the client requests it. If the client no longer needs a response, we should halt production and pause the stream, thereby saving valuable GPU resources.
+
+![Cancellation](./cancellation.png)
+
+## How to handle cancellation?
+
+The core idea is straightforward: on the server side, listen for the `close` event and check that the connection is still alive before pulling the next value from the LLM stream.
+
+```js
+app.get('/', async (req, res) => {
+  ...
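+
+  // Flip a flag as soon as the client disconnects, so the read loop
+  // below can stop pulling from the LLM stream.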
+  let canceled = false;
+  req.on('close', () => canceled = true);
+  do {
+    ({ value, done } = await llm.next());
+    ...
+  } while (!done && !canceled);
+});
+```
+
+## Implement cancellation for Tabby
+
+In Tabby, handling code completion cancellations effectively is crucial: it lets us respond promptly to the user's latest input while avoiding wasted model computation.
+
+On the client side, whenever we receive new input from the user, we should abort the previous query and promptly request a fresh response from the server.
+
+```js
+// Demo code on the client side.
+
+// Simple debounce helper: collapse rapid calls into a single one.
+function debounce(fn, delay) {
+  let timer;
+  return (...args) => {
+    clearTimeout(timer);
+    timer = setTimeout(() => fn(...args), delay);
+  };
+}
+
+let controller;
+
+const callServer = async (prompt) => {
+  controller = new AbortController();
+  const { signal } = controller;
+  // 2. Call the server API to get the result for the prompt.
+  try {
+    const response = await fetch("/v1/completions", {
+      method: "POST",
+      headers: {
+        "Content-Type": "application/json"
+      },
+      body: JSON.stringify({ prompt }),
+      signal
+    });
+    return response;
+  } catch (err) {
+    // An aborted request rejects with an AbortError; that's expected here.
+    if (err.name !== "AbortError") throw err;
+  }
+};
+
+// 1. Debounce the input (100ms, for example) so that fast typing
+// doesn't fire a request on every keystroke.
+const onChange = debounce((e) => {
+  if (controller) controller.abort(); // Abort the previous request.
+  callServer(e.target.value);
+}, 100);
+```
+
+With streaming and laziness semantics in place, all the components work together smoothly and efficiently!
+
+![Streaming](./stream.png)
+
+## That's it
+
+We would love to invite you to join our Slack community! Please feel free to reach out to us on [Slack](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) - we have channels for discussing all aspects of the product and tech, and everyone is welcome to join the conversation.
+
+Happy hacking 😁💪🏻
diff --git a/website/blog/2023-09-30-stream-laziness-in-tabby/intro.png b/website/blog/2023-09-30-stream-laziness-in-tabby/intro.png
new file mode 100644
index 0000000..5806a1c
--- /dev/null
+++ b/website/blog/2023-09-30-stream-laziness-in-tabby/intro.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4113704a91e8355fb14ff43c0b26fb3fd926168dbc0eacca008f5b8657e1cb8e
+size 111767
diff --git a/website/blog/2023-09-30-stream-laziness-in-tabby/stream-flush.png b/website/blog/2023-09-30-stream-laziness-in-tabby/stream-flush.png
new file mode 100644
index 0000000..6e58078
--- /dev/null
+++ b/website/blog/2023-09-30-stream-laziness-in-tabby/stream-flush.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:905ca358ae3198ccd587d7a837cc82fcbfad30c7c5b6115248a41ed9ae5468b8
+size 107185
diff --git a/website/blog/2023-09-30-stream-laziness-in-tabby/stream.png b/website/blog/2023-09-30-stream-laziness-in-tabby/stream.png
new file mode 100644
index 0000000..d3a5415
--- /dev/null
+++ b/website/blog/2023-09-30-stream-laziness-in-tabby/stream.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74734640c70fb1ebbfe6baa9a3ff8f8ef257577766d98df6bbdb066e44a9a17d
+size 104103
diff --git a/website/blog/authors.yml b/website/blog/authors.yml
index 4df7b28..a03bbdf 100644
--- a/website/blog/authors.yml
+++ b/website/blog/authors.yml
@@ -7,3 +7,13 @@ gyxlucy:
   name: Lucy Gao
   url: https://github.com/gyxlucy
   image_url: https://github.com/gyxlucy.png
+
+icycodes:
+  name: Zhiming Ma
+  url: https://github.com/icycodes
+  image_url: https://github.com/icycodes.png
+
+wwayne:
+  name: Wayne Wang
+  url: https://github.com/wwayne
+  image_url: https://github.com/wwayne.png