Prompt Caching
OpenAI now offers prompt caching, a feature that can significantly reduce both latency and costs for your API requests. This feature is particularly beneficial for prompts exceeding 1024 tokens, offering up to an 80% reduction in latency for longer prompts over 10,000 tokens.
Prompt Caching is enabled for the following models:
gpt-4o (excludes gpt-4o-2024-05-13)
gpt-4o-mini
o1-preview
o1-mini
Portkey supports OpenAI's prompt caching feature out of the box. Here is an example of how to use it:
import json

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="OPENAI_VIRTUAL_KEY",
)

# Define tools (for the function calling example)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

# Example: function calling with caching (use a caching-enabled model such as gpt-4o)
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can check the weather."},
        {"role": "user", "content": "What's the weather like in San Francisco?"},
    ],
    tools=tools,
    tool_choice="auto",
)

print(json.dumps(response.model_dump(), indent=2))
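No caching-specific parameters are required: OpenAI applies prompt caching automatically once the prompt prefix (for example, the system message and tools definition above) exceeds 1024 tokens and matches a recently used prefix. Keeping the static portions of your prompt at the beginning of the request maximizes cache hits.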
What can be cached
Messages: The complete messages array, encompassing system, user, and assistant interactions.
Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically across requests, as it impacts image tokenization.
Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement.
Structured outputs: The structured output schema serves as a prefix to the system message and can be cached (see the sketch below).
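For example, a structured output schema can be included so that it becomes part of the cached prefix. The snippet below is a minimal sketch that reuses the portkey client from the example above; the schema name and fields are illustrative, not taken from Portkey's or OpenAI's documentation.

# Minimal sketch: structured outputs with a reusable (cacheable) schema.
# Assumes the `portkey` client from the example above; the schema is illustrative.
weather_report_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "weather_report",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "temperature_c": {"type": "number"},
                "conditions": {"type": "string"},
            },
            "required": ["location", "temperature_c", "conditions"],
            "additionalProperties": False,
        },
    },
}

response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that reports the weather."},
        {"role": "user", "content": "Summarize the weather in San Francisco."},
    ],
    # The schema is prefixed to the system message, so it counts toward the cacheable prefix
    response_format=weather_report_schema,
)

Because the schema is identical on every call, it contributes to a stable prompt prefix and therefore to cache hits.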
What's Not Supported
Completions API (only Chat Completions API is supported)
Streaming responses (caching still works when streaming, but it does not change the streamed output itself)
Monitoring Cache Performance
Prompt caching requests & responses are reported based on OpenAI's calculations:

All requests, including those with fewer than 1024 tokens, will display a cached_tokens field in usage.prompt_tokens_details of the chat completions object, indicating how many of the prompt tokens were a cache hit. For requests under 1024 tokens, cached_tokens will be zero.
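To check cache performance programmatically, you can read this field from the SDK response object. A minimal sketch, assuming the response object returned by the function calling example above:

# Minimal sketch: inspect cache usage on a chat completion response.
# Assumes `response` from the example above; attribute access is guarded in case
# the SDK version or provider does not populate prompt_tokens_details.
usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached_tokens = getattr(details, "cached_tokens", 0) if details else 0
print(f"prompt_tokens={usage.prompt_tokens}, cached_tokens={cached_tokens}")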

Key Features:
Reduced Latency: Especially significant for longer prompts.
Lower Costs: Cached portions of prompts are billed at a discounted rate.
Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.