
Commit 340a6b9

Reorder streaming guide (#5784)
1 parent 7182ed6 commit 340a6b9

File tree

2 files changed: +63 −45 lines changed

docs/core_docs/docs/concepts.mdx: +57 −45
@@ -639,6 +639,8 @@ For specifics on how to use callbacks, see the [relevant how-to guides here](/do
 
 ### Streaming
 
+<span data-heading-keywords="stream,streaming"></span>
+
 Individual LLM calls often run for much longer than traditional resource requests.
 This compounds when you build more complex chains or agents that require multiple reasoning steps.
 
@@ -648,65 +650,33 @@ around building apps with LLMs to help alleviate latency issues, and LangChain a
 
 Below, we'll discuss some concepts and considerations around streaming in LangChain.
 
-#### Tokens
-
-The unit that most model providers use to measure input and output is called a **token**.
-Tokens are the basic units that language models read and generate when processing or producing text.
-The exact definition of a token can vary depending on the specific way the model was trained -
-for instance, in English, a token could be a single word like "apple", or a part of a word like "app".
-
-When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a **tokenizer**.
-The model then streams back generated output tokens, which the tokenizer decodes into human-readable text.
-The below example shows how OpenAI models tokenize `LangChain is cool!`:
-
-![](/img/tokenization.png)
-
-You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.
-
-The reason language models use tokens rather than something more immediately intuitive like "characters"
-has to do with how they process and understand text. At a high level, language models iteratively predict their next generated output based on
-the initial input and their previous generations. Training the model on tokens enables language models to handle linguistic
-units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model
-to learn and understand the structure of the language, including grammar and context.
-Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.
-
-#### Callbacks
-
-The lowest-level way to stream outputs from LLMs in LangChain is via the [callbacks](/docs/concepts/#callbacks) system. You can pass a
-callback handler that handles the [`handleLLMNewToken`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMNewToken) event into LangChain components. When that component is invoked, any
-[LLM](/docs/concepts/#llms) or [chat model](/docs/concepts/#chat-models) contained in the component calls
-the callback with the generated token. Within the callback, you could pipe the tokens into some other destination, e.g. an HTTP response.
-You can also handle the [`handleLLMEnd`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMEnd) event to perform any necessary cleanup.
-
-You can see [this how-to section](/docs/how_to/#callbacks) for more specifics on using callbacks.
-
-Callbacks were the first technique for streaming introduced in LangChain. While powerful and generalizable,
-they can be unwieldy for developers. For example:
-
-- You need to explicitly initialize and manage some aggregator or other stream to collect results.
-- The execution order isn't explicitly guaranteed, and you could theoretically have a callback run after the `.invoke()` method finishes.
-- Providers would often make you pass an additional parameter to stream outputs instead of returning them all at once.
-- You would often ignore the result of the actual model call in favor of callback results.
-
 #### `.stream()`
 
-LangChain also includes the `.stream()` method as a more ergonomic streaming interface.
+Most modules in LangChain include the `.stream()` method as an ergonomic streaming interface.
 `.stream()` returns an iterator, which you can consume with a [`for await...of`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of) loop. Here's an example with a chat model:
 
 ```ts
 import { ChatAnthropic } from "@langchain/anthropic";
+import { concat } from "@langchain/core/utils/stream";
+import type { AIMessageChunk } from "@langchain/core/messages"; // needed for the type annotation below
 
 const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });
 
 const stream = await model.stream("what color is the sky?");
 
+let gathered: AIMessageChunk | undefined = undefined;
+
 for await (const chunk of stream) {
   console.log(chunk);
+  if (gathered === undefined) {
+    gathered = chunk;
+  } else {
+    gathered = concat(gathered, chunk);
+  }
 }
 ```
 
 For models (or other components) that don't support streaming natively, this iterator would just yield a single chunk, but
-you could still use the same general pattern. Using `.stream()` will also automatically call the model in streaming mode
+you could still use the same general pattern when calling them. Using `.stream()` will also automatically call the model in streaming mode
 without the need to provide additional config.
 
 The type of each outputted chunk depends on the type of component - for example, chat models yield [`AIMessageChunks`](https://api.js.langchain.com/classes/langchain_core_messages.AIMessageChunk.html).
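
Note: since the revised text says most modules expose `.stream()`, here is a minimal sketch of streaming a composed chain rather than a bare model, assuming the standard `@langchain/core` prompt and output parser modules (not shown in this diff); the prompt text and topic are illustrative:

```ts
// A minimal sketch: streaming a composed chain instead of a bare model.
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const prompt = ChatPromptTemplate.fromTemplate("Tell me a joke about {topic}");
const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

// Piping through StringOutputParser makes the chain yield plain string
// chunks instead of AIMessageChunks.
const chain = prompt.pipe(model).pipe(new StringOutputParser());

for await (const chunk of await chain.stream({ topic: "parrots" })) {
  console.log(chunk);
}
```
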
@@ -718,13 +688,15 @@ You can check out [this guide](/docs/how_to/streaming/#using-stream) for more de
 
 #### `.streamEvents()`
 
-While the `.stream()` method is easier to use than callbacks, it only returns one type of value. This is fine for single LLM calls,
+<span data-heading-keywords="astream_events,stream_events,stream events"></span>
+
+While the `.stream()` method is intuitive, it can only return the final generated value of your chain. This is fine for single LLM calls,
 but as you build more complex chains of several LLM calls together, you may want to use the intermediate values of
 the chain alongside the final output - for example, returning sources alongside the final generation when building a chat
 over documents app.
 
-There are ways to do this using the aforementioned callbacks, or by constructing your chain in such a way that it passes intermediate
-values to the end with something like [`.assign()`](/docs/how_to/passthrough/), but LangChain also includes an
+There are ways to do this [using callbacks](/docs/concepts/#callbacks-1), or by constructing your chain in such a way that it passes intermediate
+values to the end with something like chained [`.assign()`](/docs/how_to/passthrough/) calls, but LangChain also includes an
 `.streamEvents()` method that combines the flexibility of callbacks with the ergonomics of `.stream()`. When called, it returns an iterator
 which yields [various types of events](/docs/how_to/streaming/#event-reference) that you can filter and process according
 to the needs of your project.
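
Note: a minimal sketch of consuming `.streamEvents()` as described above. The `version: "v2"` option and the `on_chat_model_stream` event name follow the linked event reference; the payload handling here is illustrative:

```ts
// A minimal sketch of consuming .streamEvents() on a chat model.
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

const eventStream = model.streamEvents("what color is the sky?", {
  version: "v2",
});

for await (const event of eventStream) {
  // Other event types (e.g. "on_chat_model_start") expose intermediate
  // steps of the run; here we only print the streamed token chunks.
  if (event.event === "on_chat_model_stream") {
    console.log(event.data.chunk);
  }
}
```
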
@@ -759,6 +731,46 @@ You can roughly think of it as an iterator over callback events (though the form
 
 See [this guide](/docs/how_to/streaming/#using-stream-events) for more detailed information on how to use `.streamEvents()`.
 
+#### Tokens
+
+The unit that most model providers use to measure input and output is called a **token**.
+Tokens are the basic units that language models read and generate when processing or producing text.
+The exact definition of a token can vary depending on the specific way the model was trained -
+for instance, in English, a token could be a single word like "apple", or a part of a word like "app".
+
+When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a **tokenizer**.
+The model then streams back generated output tokens, which the tokenizer decodes into human-readable text.
+The below example shows how OpenAI models tokenize `LangChain is cool!`:
+
+![](/img/tokenization.png)
+
+You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.
+
+The reason language models use tokens rather than something more immediately intuitive like "characters"
+has to do with how they process and understand text. At a high level, language models iteratively predict their next generated output based on
+the initial input and their previous generations. Training the model on tokens enables language models to handle linguistic
+units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model
+to learn and understand the structure of the language, including grammar and context.
+Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.
+
+#### Callbacks
+
+The lowest-level way to stream outputs from LLMs in LangChain is via the [callbacks](/docs/concepts/#callbacks) system. You can pass a
+callback handler that handles the [`handleLLMNewToken`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMNewToken) event into LangChain components. When that component is invoked, any
+[LLM](/docs/concepts/#llms) or [chat model](/docs/concepts/#chat-models) contained in the component calls
+the callback with the generated token. Within the callback, you could pipe the tokens into some other destination, e.g. an HTTP response.
+You can also handle the [`handleLLMEnd`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMEnd) event to perform any necessary cleanup.
+
+You can see [this how-to section](/docs/how_to/#callbacks) for more specifics on using callbacks.
+
+Callbacks were the first technique for streaming introduced in LangChain. While powerful and generalizable,
+they can be unwieldy for developers. For example:
+
+- You need to explicitly initialize and manage some aggregator or other stream to collect results.
+- The execution order isn't explicitly guaranteed, and you could theoretically have a callback run after the `.invoke()` method finishes.
+- Providers would often make you pass an additional parameter to stream outputs instead of returning them all at once.
+- You would often ignore the result of the actual model call in favor of callback results.
+
 ### Structured output
 
 LLMs are capable of generating arbitrary text. This enables the model to respond appropriately to a wide
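
Note: the tokenization example moved in this hunk can be reproduced in code. A minimal sketch, assuming the third-party `js-tiktoken` package rather than anything in LangChain itself:

```ts
// A minimal sketch of tokenization using the third-party js-tiktoken
// package (an assumption; the docs only show an image of the result).
import { getEncoding } from "js-tiktoken";

// cl100k_base is the encoding used by several OpenAI chat models.
const enc = getEncoding("cl100k_base");

const tokens = enc.encode("LangChain is cool!");
console.log(tokens.length); // the docs' image shows 5 tokens for this string

// Decoding the token ids recovers the human-readable text.
console.log(enc.decode(tokens)); // "LangChain is cool!"
```
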

docs/core_docs/docs/how_to/streaming.ipynb: +6 −0

@@ -11,6 +11,8 @@
 "This guide assumes familiarity with the following concepts:\n",
 "\n",
 "- [Chat models](/docs/concepts/#chat-models)\n",
+"- [LangChain Expression Language](/docs/concepts/#langchain-expression-language-lcel)\n",
+"- [Output parsers](/docs/concepts/#output-parsers)\n",
 "\n",
 ":::\n",
 "\n",
@@ -25,6 +27,10 @@
 "\n",
 "Let’s take a look at both approaches!\n",
 "\n",
+":::info\n",
+"For a higher-level overview of streaming techniques in LangChain, see [this section of the conceptual guide](/docs/concepts/#streaming).\n",
+":::\n",
+"\n",
 "# Using Stream\n",
 "\n",
 "All `Runnable` objects implement a method called stream.\n",
