Building effective conversational AI applications requires careful management of the interaction history.
Store conversation history
Your application needs to store the sequence of interactions between the user and the model. A common approach is to maintain a list or array of message objects, each containing a role (such as user, assistant, or model) and the content (the text of the message).
Example:
```json
[
  { "role": "user", "content": "What are the main accessibility features?" },
  { "role": "model", "content": "The main features are X, Y, and Z." },
  { "role": "user", "content": "Tell me more about Y." }
]
```
Select the relevant history for each API call
You can't send the entire conversation history with every single request, especially for long conversations, for the following reasons:
- Context window limits: most models have a maximum number of tokens (pieces of words) they can process in a single prompt, combining both input and output. Exceeding this limit causes errors.
- Cost: API calls are often priced by the number of tokens processed, for both input and output, so sending long histories increases cost.
- Performance or focus: extremely long histories might dilute the model's focus on the most recent user query.
Therefore, you need a strategy to select the most relevant parts of the history to include in the prompt for the next API call:
- Sliding window: keep only the last N turns, such as the last five user messages and their corresponding model responses. This is simple to implement but risks losing important context from earlier in the conversation (a sketch follows this list).
- Summarization: periodically use the model itself, or another method, to summarize older parts of the conversation, then include that summary along with the most recent N turns. This preserves more context but can add cost or latency for the summarization step (a sketch also follows this list).
- Combination: always include the very first user message if it sets the overall goal, along with the last N turns.
- Vector search: embed conversation turns and retrieve the past turns most semantically similar to the current query. This is powerful but significantly more complex to implement.
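To illustrate the sliding window and combination strategies, the sketch below keeps the last N turns and optionally pins the first user message. It continues the storage sketch above; the function name `select_history` and the two-messages-per-turn assumption are illustrative, not part of any specific API:

```python
def select_history(history, n_turns=5, pin_first=True):
    """Select messages to send: the last n_turns exchanges, optionally
    preceded by the very first message (the 'combination' strategy).

    Each turn is assumed to be one user message plus one model reply,
    so n_turns exchanges correspond to the last 2 * n_turns messages.
    """
    recent = history[-2 * n_turns:]
    # Pin the opening message if it isn't already inside the window.
    if pin_first and history and history[0] not in recent:
        return [history[0]] + recent
    return recent

selected = select_history(history, n_turns=5)
```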
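A summarization-based selector might look like the following sketch. The `summarize_older_turns` helper is hypothetical: in a real application it would call the model (or another summarizer) with the older messages as input, and the role assigned to the summary message depends on what your API accepts:

```python
def summarize_older_turns(messages):
    """Hypothetical helper: stands in for a call to the model (or
    another summarizer) that condenses the older messages."""
    joined = " ".join(m["content"] for m in messages)
    return "Summary of earlier conversation: " + joined[:500]

def select_with_summary(history, n_turns=5):
    """Replace everything before the last n_turns exchanges with a
    single synthetic message containing a summary."""
    cutoff = max(len(history) - 2 * n_turns, 0)
    older, recent = history[:cutoff], history[cutoff:]
    if not older:
        return recent
    # "user" is a placeholder role; some APIs prefer a system message.
    summary = {"role": "user", "content": summarize_older_turns(older)}
    return [summary] + recent
```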
Format the prompt
To prepare your prompt for the API call, follow these steps:
- Combine the selected historical messages with the new user query.
- Format this combined list according to the specific requirements of the model API you are using. Ensure roles like user, assistant, and system are correctly assigned, as in the sketch below.
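The sketch below assembles a request for a generic chat-style API. The payload shape and the `"example-model"` name are assumptions for illustration; consult your provider's documentation for the real format:

```python
def build_request(selected_history, new_query, system_instruction=None):
    """Assemble a messages payload for a generic chat-style API.
    The exact payload shape varies by provider; this is a sketch."""
    messages = []
    if system_instruction:
        messages.append({"role": "system", "content": system_instruction})
    messages.extend(selected_history)
    messages.append({"role": "user", "content": new_query})
    return {"model": "example-model", "messages": messages}  # hypothetical model name

# Using the selection from the earlier sketch:
request = build_request(selected, "Does Y work on mobile?")
```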
Manage token count
Before making the API call, estimate the total token count of your constructed prompt, including selected history, new query, and any system instructions. Ensure this count is safely below the model's maximum context window limit. If it is too high, prune the history more aggressively, for example by reducing N in the sliding window or shortening the summary, as in the sketch below.
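Building on the earlier sketches, the following shows one way to shrink the window until the prompt fits. The limits are assumptions (check your model's documented context size), and the four-characters-per-token estimate is only a rough heuristic; for accurate counts, use the provider's own tokenizer:

```python
MAX_CONTEXT_TOKENS = 8192   # assumption: use your model's documented limit
RESPONSE_BUDGET = 1024      # tokens reserved for the model's reply

def estimate_tokens(messages):
    """Rough estimate (~4 characters per token); a provider tokenizer
    gives exact counts."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4

def fit_to_budget(history, new_query, n_turns=5):
    """Shrink the sliding window until the prompt fits the context limit."""
    while n_turns > 0:
        selected = select_history(history, n_turns=n_turns)
        request = build_request(selected, new_query)
        if estimate_tokens(request["messages"]) + RESPONSE_BUDGET <= MAX_CONTEXT_TOKENS:
            return request
        n_turns -= 1  # prune more aggressively
    return build_request([], new_query)  # fall back to the query alone
```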
Manage conversation context
Manually managing conversation context requires careful implementation within your application. You need to store the history, strategically select relevant parts for each API call while respecting the model's context window limit, and format the prompt correctly. This adds complexity compared to using models with built-in session handling but is essential for creating stateful conversational experiences with stateless APIs.