Cache (Simple & Semantic)

Semantic caching is available for Production and Enterprise users.

Simple caching is available for all plans.

Speed up and save money on your LLM requests by storing past responses in the Portkey cache. There are 2 cache modes:

  • Simple: Matches requests verbatim. Perfect for repeated, identical prompts. Works on all models including image generation models.

  • Semantic: Matches responses for requests that are semantically similar. Ideal for denoising requests with extra prepositions, pronouns, etc. Works on any model available on /chat/completions or /completions routes.

Portkey cache serves requests up to 20x faster and cheaper.

Enable Cache in the Config

To enable Portkey cache, just add the cache params to your config object.

Simple Cache

"cache": { "mode": "simple" }

How it Works

Simple cache performs an exact match on the input prompts. If the exact same request is received again, Portkey retrieves the response directly from the cache, bypassing the model execution.
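
Here is a minimal sketch of how a simple-cache hit plays out (the config slug, virtual key, and prompt are placeholders; it assumes the referenced Config contains the simple cache block shown above):

from portkey_ai import Portkey

# Assumes "cache-config-xxx" is a saved Config that contains {"cache": {"mode": "simple"}}
portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="open-ai-xxx",
    config="cache-config-xxx"
)

request = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "model": "gpt-4"
}

# First call: cache miss; the request goes to the model and the response is stored
first = portkey.chat.completions.create(**request)

# Second, identical call: served directly from the cache
# (the cache status of each request is visible on the Logs page)
second = portkey.chat.completions.create(**request)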


Semantic Cache

"cache": { "mode": "semantic" }

How it Works

Semantic cache is a "superset" of both caches. Setting cache mode to "semantic" will work for when there are simple cache hits as well.

To optimise for accurate cache hit rates, semantic cache only works on requests with fewer than 8,191 input tokens and at most 4 messages (user, assistant, and system combined).
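
For illustration, here is a sketch of a semantic hit (the config slug "semantic-cache-config-xxx", the virtual key, and the prompts are placeholders; it assumes the referenced Config contains the semantic cache block shown above):

from portkey_ai import Portkey

# Assumes "semantic-cache-config-xxx" is a saved Config that contains {"cache": {"mode": "semantic"}}
portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="open-ai-xxx",
    config="semantic-cache-config-xxx"
)

# First request: cache miss, the response is generated and stored
portkey.chat.completions.create(
    messages=[{"role": "user", "content": "What are the ingredients of a margherita pizza?"}],
    model="gpt-4"
)

# A differently worded but semantically similar request can be
# answered from the cache as a semantic hit
portkey.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me what goes into a margherita pizza"}],
    model="gpt-4"
)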

Ignoring the First Message in Semantic Cache

When using the /chat/completions endpoint, Portkey requires at least two message objects in the messages array. The first message object, typically used for the system message, is not considered when determining semantic similarity for caching purposes.

For example:

messages = [
    { "role": "system", "content": "You are a helpful assistant" },
    { "role": "user", "content": "Who is the president of the US?" }
]

In this case, only the content of the user message ("Who is the president of the US?") is used for finding semantic matches in the cache. The system message ("You are a helpful assistant") is ignored.

This means that even if you change the system message while keeping the user message semantically similar, Portkey will still return a semantic cache hit.

This allows you to modify the behavior or context of the assistant without affecting the cache hits for similar user queries.
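
For instance, both of the requests below can resolve to the same cached entry, because the first (system) message is ignored when matching (a sketch; the config slug and virtual key are placeholders):

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="open-ai-xxx",
    config="semantic-cache-config-xxx"  # assumed to enable semantic cache
)

# Stored in the cache on the first call
portkey.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Who is the president of the US?"}
    ],
    model="gpt-4"
)

# Different system message, same user message: still a semantic cache hit
portkey.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a terse assistant that answers in one sentence"},
        {"role": "user", "content": "Who is the president of the US?"}
    ],
    model="gpt-4"
)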


Read more how to set cache in Configs.

Setting Cache Age

You can set the age (or "ttl") of your cached response with this setting. Cache age is also set in your Config object:

"cache": { 
    "mode": "semantic",
    "max_age": 60
}

In this example, your cache will automatically expire after 60 seconds. Cache age is set in seconds.

  • Minimum cache age is 60 seconds

  • Maximum cache age is 90 days (i.e. 7776000 seconds)

  • Default cache age is 7 days (i.e. 604800 seconds)
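
For example, a cache that should live for one day would set max_age to 86400 seconds (24 × 60 × 60):

"cache": {
    "mode": "semantic",
    "max_age": 86400
}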


Force Refresh Cache

Ensure that a new response is fetched and stored in the cache even when there is an existing cached response for your request. Cache force refresh can only be done at the time of making a request, and it is not a part of your Config.

You can enable cache force refresh with this header:

"x-portkey-cache-force-refresh": "True"
curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -H "x-portkey-virtual-key: open-ai-xxx" \
  -H "x-portkey-config: cache-config-xxx" \
  -H "x-portkey-cache-force-refresh: true" \
  -d '{
    "messages": [{"role": "user","content": "Hello!"}]
  }'

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="open-ai-xxx",
    config="pp-cache-xxx" 
)

response = portkey.with_options(
    cache_force_refresh = True
).chat.completions.create(
    messages = [{ "role": 'user', "content": 'Hello!' }],
    model = 'gpt-4'
)

import Portkey from 'portkey-ai';

const portkey = new Portkey({
    apiKey: "PORTKEY_API_KEY",
    config: "pc-cache-xxx",
    virtualKey: "open-ai-xxx"
})

async function main(){
    const response = await portkey.chat.completions.create({
        messages: [{ role: 'user', content: 'Hello' }],
        model: 'gpt-4',
    }, {
        cacheForceRefresh: true
    });
}

main()

  • Cache force refresh is only activated if a cache config is also passed along with your request. (setting cacheForceRefresh as true without passing the relevant cache config will not have any effect)

  • For requests that have previous semantic hits, force refresh is performed on ALL the semantic matches of your request.


Cache Namespace: Simplified Cache Partitioning

Portkey generally partitions the cache based on all the values passed in your request headers. With a custom cache namespace, you can ignore metadata and other headers and partition the cache based only on the custom string that you send.

This allows you to have finer control over your cached data and optimize your cache hit ratio.

How It Works

To use Cache Namespaces, simply include the x-portkey-cache-namespace header in your API requests with any custom string value. Portkey will then use this namespace string as the sole basis for partitioning the cache, disregarding all other headers, including metadata.

For example, if you send the following header:

"x-portkey-cache-namespace: user-123"

Portkey will cache the response under the namespace user-123, ignoring any other headers or metadata associated with the request.

import Portkey from 'portkey-ai';

const portkey = new Portkey({
    apiKey: "PORTKEY_API_KEY",
    config: "pc-cache-xxx",
    virtualKey: "open-ai-xxx"
})

async function main(){
    const response = await portkey.chat.completions.create({
        messages: [{ role: 'user', content: 'Hello' }],
        model: 'gpt-4',
    }, {
        cacheNamespace: 'user-123'
    });
}

main()

In this example, the response will be cached under the namespace user-123, ignoring any other headers or metadata.

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="open-ai-xxx",
    config="pp-cache-xxx" 
)

response = portkey.with_options(
    cache_namespace = "user-123"
).chat.completions.create(
    messages = [{ "role": 'user', "content": 'Hello!' }],
    model = 'gpt-4'
)

In this example, the response will be cached under the namespace user-123, ignoring any other headers or metadata.

curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -H "x-portkey-virtual-key: open-ai-xxx" \
  -H "x-portkey-config: cache-config-xxx" \
  -H "x-portkey-cache-namespace: user-123" \
  -d '{
    "messages": [{"role": "user","content": "Hello!"}]
  }'

In this example, the response will be cached under the namespace user-123, ignoring any other headers or metadata.


Cache in Analytics

Portkey shows you powerful stats on cache usage on the Analytics page. Just head over to the Cache tab, and you will see:

  • Your raw number of cache hits as well as daily cache hit rate

  • Your average latency for delivering results from cache and how much time it saves you

  • How much money the cache saves you

Cache in Logs

On the Logs page, the cache status is updated in the Status column. You will see Cache Disabled when you are not using the cache, and one of Cache Miss, Cache Refreshed, Cache Hit, or Cache Semantic Hit based on the cache status of each request. Read more here.

For each request, we also calculate and show the cache response time and how much money you saved with each hit.


How Cache works with Configs

You can set cache at two levels:

  • Top-level that works across all the targets.

  • Target-level that works when that specific target is triggered.

Set cache at the top level (applies across all targets):

{
  "cache": {"mode": "semantic", "max_age": 60},
  "strategy": {"mode": "fallback"},
  "targets": [
    {"virtual_key": "openai-key-1"},
    {"virtual_key": "openai-key-2"}
  ]
}

Set cache at the target level (applies when that specific target is triggered):

{
  "strategy": {"mode": "fallback"},
  "targets": [
    {
      "virtual_key": "openai-key-1",
      "cache": {"mode": "simple", "max_age": 200}
    },
    {
      "virtual_key": "openai-key-2",
      "cache": {"mode": "semantic", "max_age": 100}
    }
  ]
}

You can also set cache at both levels (top & target).

In this case, the target-level cache setting will be given preference over the top-level cache setting. You should start getting cache hits from the second request onwards for that specific target.
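
For illustration, here is a sketch of a Config that sets cache at both levels: the target-level simple cache on openai-key-1 takes precedence for that target, while openai-key-2 uses the top-level semantic cache:

{
  "cache": {"mode": "semantic", "max_age": 60},
  "strategy": {"mode": "fallback"},
  "targets": [
    {
      "virtual_key": "openai-key-1",
      "cache": {"mode": "simple", "max_age": 200}
    },
    {"virtual_key": "openai-key-2"}
  ]
}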

If any of your targets have override_params, then cache on that target will not work until that particular combination of params is also stored with the cache. If there are no override_params for that target, then cache will be active on that target even if it hasn't been triggered yet.
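
For example (a sketch; the override values are placeholders), with the Config below, openai-key-1 only starts returning cache hits once a response for this exact temperature has been stored, while cache is active on openai-key-2 from the start since it has no override_params:

{
  "strategy": {"mode": "fallback"},
  "targets": [
    {
      "virtual_key": "openai-key-1",
      "override_params": {"temperature": 0.2},
      "cache": {"mode": "simple", "max_age": 3600}
    },
    {
      "virtual_key": "openai-key-2",
      "cache": {"mode": "simple", "max_age": 3600}
    }
  ]
}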
