docs[patch]: Adds streaming conceptual docs #5728

Merged
merged 2 commits on Jun 11, 2024
4 changes: 4 additions & 0 deletions docs/core_docs/.gitignore
@@ -39,6 +39,8 @@ docs/tutorials/query_analysis.md
docs/tutorials/query_analysis.mdx
docs/tutorials/qa_chat_history.md
docs/tutorials/qa_chat_history.mdx
docs/tutorials/pdf_qa.md
docs/tutorials/pdf_qa.mdx
docs/tutorials/local_rag.md
docs/tutorials/local_rag.mdx
docs/tutorials/llm_chain.md
@@ -55,6 +57,8 @@ docs/how_to/tools_prompting.md
docs/how_to/tools_prompting.mdx
docs/how_to/tools_builtin.md
docs/how_to/tools_builtin.mdx
docs/how_to/tool_calls_multimodal.md
docs/how_to/tool_calls_multimodal.mdx
docs/how_to/tool_calling.md
docs/how_to/tool_calling.mdx
docs/how_to/structured_output.md
136 changes: 129 additions & 7 deletions docs/core_docs/docs/concepts.mdx
@@ -139,20 +139,20 @@ Some components LangChain implements, some components we rely on third-party int
<span data-heading-keywords="chat model,chat models"></span>

Language models that use a sequence of messages as inputs and return chat messages as outputs (as opposed to using plain text).
These are traditionally newer models (older models are generally `LLMs`, see above).
These are traditionally newer models (older models are generally `LLMs`, see below).
Chat models support the assignment of distinct roles to conversation messages, helping to distinguish messages from the AI, users, and instructions such as system messages.

Although the underlying models are messages in, message out, the LangChain wrappers also allow these models to take a string as input.
This makes them interchangeable with LLMs (and simpler to use).
This gives them the same interface as LLMs (and makes them simpler to use).
When a string is passed in as input, it will be converted to a HumanMessage under the hood before being passed to the underlying model.

LangChain does not provide any ChatModels, rather we rely on third party integrations.
LangChain does not host any Chat Models, rather we rely on third party integrations.

We have some standardized parameters when constructing ChatModels:

- `model`: the name of the model

ChatModels also accept other parameters that are specific to that integration.
Chat Models also accept other parameters that are specific to that integration.

For specifics on how to use chat models, see the [relevant how-to guides here](/docs/how_to/#chat-models).

@@ -161,13 +161,13 @@ For specifics on how to use chat models, see the [relevant how-to guides here](/
<span data-heading-keywords="llm,llms"></span>

Language models that take a string as input and return a string.
These are traditionally older models (newer models generally are `ChatModels`, see below).
These are traditionally older models (newer models generally are [Chat Models](/docs/concepts/#chat-models), see above).

Although the underlying models are string in, string out, the LangChain wrappers also allow these models to take messages as input.
This makes them interchangeable with ChatModels.
This gives them the same interface as [Chat Models](/docs/concepts/#chat-models).
When messages are passed in as input, they will be formatted into a string under the hood before being passed to the underlying model.

LangChain does not provide any LLMs, rather we rely on third party integrations.
LangChain does not host any LLMs, rather we rely on third party integrations.

For specifics on how to use LLMs, see the [relevant how-to guides here](/docs/how_to/#llms).

@@ -629,6 +629,128 @@ For specifics on how to use callbacks, see the [relevant how-to guides here](/do

## Techniques

### Streaming

Individual LLM calls often run for much longer than traditional resource requests.
This compounds when you build more complex chains or agents that require multiple reasoning steps.

Fortunately, LLMs generate output iteratively, which means it's possible to show sensible intermediate results
before the final response is ready. Consuming output as soon as it becomes available has therefore become a vital part of the UX
around building apps with LLMs to help alleviate latency issues, and LangChain aims to have first-class support for streaming.

Below, we'll discuss some concepts and considerations around streaming in LangChain.

#### Tokens

The unit that most model providers use to measure input and output is called a **token**.
Tokens are the basic units that language models read and generate when processing or producing text.
The exact definition of a token can vary depending on the specific way the model was trained -
for instance, in English, a token could be a single word like "apple", or a part of a word like "app".

When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a **tokenizer**.
The model then streams back generated output tokens, which the tokenizer decodes into human-readable text.
The below example shows how OpenAI models tokenize `LangChain is cool!`:

![](/img/tokenization.png)

You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.

The reason language models use tokens rather than something more immediately intuitive like "characters"
has to do with how they process and understand text. At a high level, language models iteratively predict their next generated output based on
the initial input and their previous generations. Training the model on tokens allows it to handle linguistic
units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model
to learn and understand the structure of the language, including grammar and context.
Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.
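
As a rough illustration (and assuming the `js-tiktoken` package, a JavaScript port of OpenAI's tokenizer, is installed), you can inspect how a string is split into tokens yourself:

```ts
import { getEncoding } from "js-tiktoken";

// "cl100k_base" is the encoding used by many recent OpenAI models.
const enc = getEncoding("cl100k_base");

const tokens = enc.encode("LangChain is cool!");
console.log(tokens.length);
// 5

// Decoding each token id individually shows the boundaries the model sees,
// e.g. [ "Lang", "Chain", " is", " cool", "!" ]
console.log(tokens.map((id) => enc.decode([id])));
```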

#### Callbacks

The lowest-level way to stream outputs from LLMs in LangChain is via the [callbacks](/docs/concepts/#callbacks) system. You can pass a
callback handler that handles the [`handleLLMNewToken`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMNewToken) event into LangChain components. When that component is invoked, any
[LLM](/docs/concepts/#llms) or [chat model](/docs/concepts/#chat-models) contained in the component calls
the callback with the generated token. Within the callback, you could pipe the tokens into some other destination, e.g. an HTTP response.
You can also handle the [`handleLLMEnd`](https://api.js.langchain.com/interfaces/langchain_core_callbacks_base.CallbackHandlerMethods.html#handleLLMEnd) event to perform any necessary cleanup.

You can see [this how-to section](/docs/how_to/#callbacks) for more specifics on using callbacks.
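
As a rough sketch (reusing the Anthropic chat model from the examples below, and assuming the provider requires a `streaming` flag to emit tokens incrementally), it might look like this:

```ts
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({
  model: "claude-3-sonnet-20240229",
  // Assumed here: some integrations require an explicit flag to stream
  // tokens instead of returning the whole response at once.
  streaming: true,
});

// You have to set up and manage your own aggregator for the streamed tokens.
const tokens: string[] = [];

await model.invoke("what color is the sky?", {
  callbacks: [
    {
      handleLLMNewToken(token) {
        // Called with each new token as soon as it is generated.
        tokens.push(token);
      },
      handleLLMEnd() {
        // Called once generation finishes - a good place for cleanup.
        console.log(tokens.join(""));
      },
    },
  ],
});
```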

Callbacks were the first technique for streaming introduced in LangChain. While powerful and generalizable,
they can be unwieldy for developers. For example:

- You need to explicitly initialize and manage some aggregator or other stream to collect results.
- The execution order isn't explicitly guaranteed, and you could theoretically have a callback run after the `.invoke()` method finishes.
- Providers would often make you pass an additional parameter to stream outputs instead of returning them all at once.
- You would often ignore the result of the actual model call in favor of callback results.

#### `.stream()`

LangChain also includes the `.stream()` method as a more ergonomic streaming interface.
`.stream()` returns an iterator, which you can consume with a [`for await...of`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of) loop. Here's an example with a chat model:

```ts
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

const stream = await model.stream("what color is the sky?");

for await (const chunk of stream) {
  console.log(chunk);
}
```

For models (or other components) that don't support streaming natively, this iterator would just yield a single chunk, but
you could still use the same general pattern. Using `.stream()` will also automatically call the model in streaming mode
without the need to provide additional config.
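
For instance, a prompt template has no native token streaming, so a minimal sketch like this yields just one chunk containing the fully formatted prompt value:

```ts
import { ChatPromptTemplate } from "@langchain/core/prompts";

const prompt = ChatPromptTemplate.fromTemplate("tell me a joke about {topic}");

// Prompt templates don't stream natively, so this yields exactly one chunk.
const stream = await prompt.stream({ topic: "parrot" });

for await (const chunk of stream) {
  // Logs once, with the whole formatted prompt value.
  console.log(chunk);
}
```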

The type of each outputted chunk depends on the type of component - for example, chat models yield [`AIMessageChunks`](https://api.js.langchain.com/classes/langchain_core_messages.AIMessageChunk.html).
Because this method is part of [LangChain Expression Language](/docs/concepts/#langchain-expression-language),
you can handle formatting differences from different outputs using an [output parser](/docs/concepts/#output-parsers) to transform
each yielded chunk.
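
For example, here's a small sketch that pipes a chat model into a `StringOutputParser` so that each streamed chunk arrives as a plain string rather than an `AIMessageChunk`:

```ts
import { StringOutputParser } from "@langchain/core/output_parsers";
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

// The parser transforms each AIMessageChunk into a string as it streams through.
const chain = model.pipe(new StringOutputParser());

const stream = await chain.stream("what color is the sky?");

for await (const chunk of stream) {
  // Each chunk is now a plain string.
  console.log(chunk);
}
```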

You can check out [this guide](/docs/how_to/streaming/#using-stream) for more detail on how to use `.stream()`.

#### `.streamEvents()`

While the `.stream()` method is easier to use than callbacks, it only returns one type of value. This is fine for single LLM calls,
but as you build more complex chains of several LLM calls together, you may want to use the intermediate values of
the chain alongside the final output - for example, returning sources alongside the final generation when building a chat
over documents app.

There are ways to do this using the aforementioned callbacks, or by constructing your chain in such a way that it passes intermediate
values to the end with something like [`.assign()`](/docs/how_to/passthrough/), but LangChain also includes an
`.streamEvents()` method that combines the flexibility of callbacks with the ergonomics of `.stream()`. When called, it returns an iterator
which yields [various types of events](/docs/how_to/streaming/#event-reference) that you can filter and process according
to the needs of your project.

Here's one small example that prints just events containing streamed chat model output:

```ts
import { StringOutputParser } from "@langchain/core/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229" });

const prompt = ChatPromptTemplate.fromTemplate("tell me a joke about {topic}");
const parser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(parser);

const eventStream = await chain.streamEvents(
  { topic: "parrot" },
  { version: "v2" }
);

for await (const event of eventStream) {
  const kind = event.event;
  if (kind === "on_chat_model_stream") {
    console.log(event);
  }
}
```
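
If you only want the streamed text rather than the full event objects, the chunk for `on_chat_model_stream` events is available under `event.data.chunk`. As a small sketch (assuming the `v2` event format used above), you could replace the loop with:

```ts
for await (const event of eventStream) {
  if (event.event === "on_chat_model_stream") {
    // For "on_chat_model_stream" events, `event.data.chunk` is the
    // AIMessageChunk streamed from the chat model.
    console.log(event.data.chunk.content);
  }
}
```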

You can roughly think of it as an iterator over callback events (though the format differs) - and you can use it on almost all LangChain components!

See [this guide](/docs/how_to/streaming/#using-stream-events) for more detailed information on how to use `.streamEvents()`.

### Function/tool calling

:::info
Binary file added docs/core_docs/static/img/tokenization.png