Cache (Simple & Semantic)
Semantic caching is available for and users.
Simple caching is available for all plans.
Speed up your LLM requests and save money by storing past responses in the Portkey cache. There are two cache modes:
Simple: Matches requests verbatim. Perfect for repeated, identical prompts. Works on all models including image generation models.
Semantic: Matches responses for requests that are semantically similar. Ideal for denoising requests with extra prepositions, pronouns, etc. Works on any model available on the /chat/completions or /completions routes.
Portkey cache serves requests up to 20x faster and cheaper.
To enable the Portkey cache, just add the cache params to your Config.
Simple cache performs an exact match on the input prompts. If the exact same request is received again, Portkey retrieves the response directly from the cache, bypassing the model execution.
To optimise for accurate cache hit rates, the semantic cache only works on requests with fewer than 8,191 input tokens and with 4 or fewer messages (user, assistant, and system combined).
When using the /chat/completions endpoint, Portkey requires at least two message objects in the messages array. The first message object, typically used for the system message, is not considered when determining semantic similarity for caching purposes.
For example:
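Such a request body might be sketched as a Python dict like the one below (the model name is an illustrative assumption):

```python
# Sketch of a /chat/completions request body with the minimum two messages.
# Only the user message is considered for semantic similarity; the system
# message (the first object) is ignored by the semantic cache.
request_body = {
    "model": "gpt-4o",  # illustrative model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Who is the president of the US?"},
    ],
}
```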
In this case, only the content of the user message ("Who is the president of the US?") is used for finding semantic matches in the cache. The system message ("You are a helpful assistant") is ignored.
This means that even if you change the system message while keeping the user message semantically similar, Portkey will still return a semantic cache hit.
This allows you to modify the behavior or context of the assistant without affecting the cache hits for similar user queries.
You can set the age (or "ttl") of your cached response with this setting. Cache age is also set in your Config object:
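A minimal sketch of such a Config, assuming the cache object takes a mode and a max_age field (in seconds):

```python
# Portkey Config fragment setting the cache mode and cache age (ttl).
config = {
    "cache": {
        "mode": "semantic",  # or "simple"
        "max_age": 60,       # cached responses expire after 60 seconds
    }
}
```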
In this example, your cache will automatically expire after 60 seconds. Cache age is set in seconds.
Ensure that a new response is fetched and stored in the cache even when there is an existing cached response for your request. Cache force refresh can only be done at the time of making a request, and it is not a part of your Config.
You can enable cache force refresh with this header:
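As a sketch, assuming the header is named x-portkey-cache-force-refresh, it can be passed alongside your other request headers (shown here as a Python dict):

```python
# Request headers forcing a fresh response to be fetched and cached,
# even if a cached response already exists for this request.
headers = {
    "x-portkey-cache-force-refresh": "true",
}
```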
Portkey generally partitions the cache across all the values passed in your request headers. With a custom cache namespace, you can ignore metadata and other headers, and partition the cache based only on the custom strings that you send.
This allows you to have finer control over your cached data and optimize your cache hit ratio.
To use Cache Namespaces, simply include the x-portkey-cache-namespace header in your API requests with any custom string value. Portkey will then use this namespace string as the sole basis for partitioning the cache, disregarding all other headers, including metadata.
For example, if you send the following header:
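Sketched as a Python dict of request headers:

```python
# The x-portkey-cache-namespace header value becomes the sole cache
# partition key; all other headers and metadata are disregarded.
headers = {
    "x-portkey-cache-namespace": "user-123",
}
```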
Portkey will cache the response under the namespace user-123, ignoring any other headers or metadata associated with the request.
Portkey shows you powerful stats on cache usage on the Analytics page. Just head over to the Cache tab, and you will see:
Your raw number of cache hits as well as daily cache hit rate
Your average latency for delivering results from cache and how much time it saves you
How much money the cache saves you
For each request we also calculate and show the cache response time and how much money you saved with each hit.
You can set cache at two levels:
Top-level cache that works across all the targets.
Target-level cache that works when that specific target is triggered.
If any of your targets have override_params, then the cache on that target will not work until that particular combination of params is also stored with the cache.
If there are no override_params for that target, then the cache will be active on that target even if it has never been triggered.
Semantic cache considers the contextual similarity between input requests. It uses cosine similarity to ascertain if the similarity between the input and a cached request exceeds a specific threshold. If the similarity threshold is met, Portkey retrieves the response from the cache, saving model execution time. Check out this for more details on how we do this.
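The similarity check can be sketched in Python over embedding vectors (the threshold value here is illustrative; Portkey's actual threshold and embedding model are internal):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_embedding, cached_embedding, threshold=0.95):
    # A cached response is served only when similarity meets the threshold.
    return cosine_similarity(query_embedding, cached_embedding) >= threshold
```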
On the Logs page, the cache status is shown in the Status column. You will see Cache Disabled when you are not using the cache, and one of Cache Miss, Cache Refreshed, Cache Hit, or Cache Semantic Hit based on the cache hit status. Read more.