Prompt Caching
OpenAI offers prompt caching, a feature that can significantly reduce both latency and costs for your API requests. It applies to prompts of 1024 tokens or longer, and offers up to an 80% reduction in latency for prompts over 10,000 tokens.
Prompt caching is enabled for the following models:
gpt-4o (excludes gpt-4o-2024-05-13)
gpt-4o-mini
o1-preview
o1-mini
Portkey supports OpenAI's prompt caching feature out of the box. Here is an example of how to use it:
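A minimal sketch, assuming the Portkey Python SDK with an OpenAI virtual key; the key names and system prompt below are placeholders. Caching is applied automatically by OpenAI once the prompt prefix reaches 1024 tokens, so no extra parameters are required:

```python
from portkey_ai import Portkey

# Placeholder credentials: replace with your Portkey API key and the
# virtual key that points to your OpenAI account.
client = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="OPENAI_VIRTUAL_KEY",
)

# A long, static system prompt (1024+ tokens in practice) forms the
# cacheable prefix; only the user message changes between requests.
LONG_SYSTEM_PROMPT = "You are a support assistant. <long static instructions...>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)

print(response.choices[0].message.content)
```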
What can be cached
Messages: The complete messages array, encompassing system, user, and assistant interactions.
Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically, as it impacts image tokenization.
Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement (see the sketch after this list).
Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
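As a sketch of how these elements combine, continuing with the client and system prompt from the example above (the tool definition and image URL are hypothetical): keeping the static parts of the request identical across calls lets them count toward the cached prefix.

```python
# Hypothetical tool definition; what matters for caching is that the
# tools list and system prompt stay identical across requests.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this receipt show?"},
                {
                    # Keep the detail level identical between requests,
                    # since it changes how the image is tokenized.
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/receipt.png", "detail": "low"},
                },
            ],
        },
    ],
    tools=tools,
)
```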
What's Not Supported
Completions API (only Chat Completions API is supported)
Streaming responses (caching still works when streaming, but it does not change the streaming behavior itself)
Monitoring Cache Performance
Prompt caching requests and responses are tracked and billed based on OpenAI's published calculations.
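You can verify cache hits from the usage object on each response; OpenAI reports the cached portion under prompt_tokens_details.cached_tokens. A sketch continuing from the client above (the attribute access assumes an OpenAI-style response object; depending on your SDK version, usage may be returned as a dict):

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize our refund policy."},
    ],
)

usage = response.usage
# cached_tokens is 0 on a cache miss; on a hit it reports how much of
# the prompt prefix was served from the cache.
cached = usage.prompt_tokens_details.cached_tokens
print(f"prompt tokens: {usage.prompt_tokens}, cached: {cached}")
```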
Key Features:
Reduced Latency: Especially significant for longer prompts.
Lower Costs: Cached portions of prompts are billed at a discounted rate.
Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.
Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.