Semantic Caching Strategy
This document covers design principles and operational considerations for gateway-level semantic caching in LLM inference pipelines.
Implementation Guide: For tool comparison tables, gateway integration patterns, configuration examples, and deployment snippets, refer to Inference Gateway Setup Guide — Semantic Caching Implementation Options.
1. Overview
Why Semantic Cache is Needed
In large-scale LLM services, user queries are often semantically identical but expressed differently. Traditional caches (HTTP cache, Redis key-value) that match exact strings cannot eliminate such duplicates. Semantic Cache detects semantically similar requests via embedding-based similarity and reuses previous responses, addressing three problems at once (a minimal lookup sketch follows the list below):
- Token Cost Reduction: Skip LLM calls on cache HITs, saving API costs and GPU time
- Latency Reduction: Respond with vector lookup (few ms) instead of generation latency (hundreds of ms to seconds)
- GPU Capacity Expansion: Effectively increase throughput in self-hosted vLLM/llm-d environments
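Conceptually, the lookup is: embed the incoming query, compare it against stored query embeddings by cosine similarity, and return the stored response when the best match clears the threshold. The sketch below illustrates only that flow; the embed() helper is a placeholder for whichever embedding model you use, and the in-memory list stands in for a real vector store (RedisVL, Milvus, etc., as covered in the implementation guide).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model (e.g., text-embedding-3-small) here."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In-memory store of (query embedding, cached response) pairs; a vector DB in production.
cache: list[tuple[np.ndarray, str]] = []

def lookup(query: str, threshold: float = 0.85) -> str | None:
    """Return the most similar cached response if it clears the threshold, else None."""
    q = embed(query)
    best_score, best_response = 0.0, None
    for emb, response in cache:
        score = cosine(q, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

def store(query: str, response: str) -> None:
    cache.append((embed(query), response))
```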
Expected Savings by Threshold
Savings rates vary significantly based on user query repetitiveness, domain (FAQ/customer support/code generation), and prompt structure, so the figures below are general ranges observed in public implementation documentation and vendor blogs. Each organization must validate actual effects through progressive rollout and A/B evaluation.
| Similarity Threshold | Operation Policy | Observed Savings Range | Characteristics |
|---|---|---|---|
| 0.95 (Strict) | Cache only nearly identical queries | approximately 10-15% | Very low false positive risk, strict quality requirements |
| 0.85 (Balanced) | Allow same meaning with different expressions | approximately 30-40% | Recommended default for general LLM chat/assistants |
| 0.70 (Aggressive) | Group related topics together | approximately 50-60% | Only for FAQ/static KB with very high repetition |
Reference sources: Redis — Building an LLM semantic cache, Portkey Semantic Cache docs, Helicone Caching docs, GPTCache README.
The above figures are rough ranges based on public materials. Not all domains achieve the same HIT rate. Measure actual HIT rate and false-positive rate for your workload with dashboards (§6) before finalizing thresholds.
2. Cache Layer Distinctions
An LLM inference pipeline has three distinct cache layers. Each operates at a different position, stores a different unit, and has a different cost impact. Semantic Cache complements rather than replaces the other two layers.
3-Layer Cache Flow
Layer-by-Layer Comparison
| Aspect | KV Cache (vLLM PagedAttention) | Prompt Cache (Anthropic/OpenAI managed) | Semantic Cache (Gateway level) |
|---|---|---|---|
| Operation Location | Inside inference engine (GPU HBM) | Model provider side | Gateway (Bifrost/LiteLLM/Portkey) front |
| Storage Unit | Token-level KV blocks | Explicit cache_control marked sections | Entire response object (text/JSON) |
| Matching Method | Prefix exact match | Provider-internal hash-based exact match | Embedding cosine similarity |
| Primary Purpose | TTFT & throughput improvement | Repeated system prompt cost reduction | Eliminate duplicate LLM calls entirely |
| Cost Impact | GPU time savings (self-hosted) | Input token price discount (managed) | Skip API calls entirely |
| Failure Impact | Performance degradation only | Cache not applied → regular pricing | Direct response quality impact (false answer risk) |
| Related Docs | vLLM Model Serving | Provider official docs | This document |
Semantic Cache HIT → immediate response (skip LLM call). On MISS, provider call → Prompt Cache reduces system prompt input cost → inference engine KV Cache improves generation speed. The three layers are orthogonal to each other, so enabling all simultaneously is common.
Application Timing Comparison
- Prototype/single model: KV Cache (automatic) + Prompt Cache (if provider supports) is sufficient
- Multi-tenant/multi-provider: Add Gateway-level Semantic Cache — absorbs patterns where identical queries repeat across multiple users
- FAQ/chatbot/fixed KB: Lower Semantic Cache threshold (0.80-0.85) for aggressive reuse
- Code generation/IDE agents: Apply Semantic Cache conservatively (0.95) or disable — similar queries often have different file contexts making reuse risky
3. Similarity Threshold Design
Threshold Trade-offs
Threshold Selection Criteria
| Threshold | Suitable Workloads | Unsuitable Workloads | Notes |
|---|---|---|---|
| 0.95 and above | Code generation, legal/medical assistants, financial advisory | (None; broadly applicable) | Only HITs on identical queries with minimal expression differences |
| 0.85-0.94 (Recommended) | General chatbots, customer support, document summarization, product Q&A | Code generation (context-sensitive) | Same meaning with different expressions allowed. Default for most services |
| 0.75-0.84 | FAQ, static KB, internal document search explanations | Conversational reasoning, multi-turn | Increased false positives — response validation layer needed |
| 0.70 and below | Rarely used — limited to high-volume FAQ | All general services | Risk of grouping unrelated queries |
Considerations When Setting Thresholds
- User Error Tolerance: Lower if "closest answer" suffices like customer support; higher for code/calculations
- Domain Vocabulary Diversity: Domains with many term synonyms (medical/legal) tend to group meanings well even at lower thresholds
- Embedding Model Quality: Stronger embeddings (e.g., text-embedding-3-large, bge-m3) maintain safety even at lower thresholds
- Conversation Context: Multi-turn conversations must include previous turns in hash keys (§5)
- Language/Locale: Multilingual services should separate namespaces by language to prevent cross-contamination
Start conservatively at 0.90, then adjust by 0.05 increments while monitoring HIT rate and user dissatisfaction metrics (👎, regenerate clicks) on Langfuse/Grafana dashboards.
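That adjustment loop can be made explicit as a small review helper. The sketch below is a hypothetical policy, not a library API: it assumes the §6 dashboards already expose HIT rate and the 👎/regenerate rate on cache HITs, and it only proposes the next threshold for a human to confirm.

```python
def recommend_threshold(
    current: float,
    hit_rate: float,
    hit_dissatisfaction_rate: float,
    baseline_dissatisfaction_rate: float,
) -> float:
    """Propose the next similarity threshold for the periodic tuning review."""
    step = 0.05
    # False-positive signal: cached answers annoy users more than fresh ones do.
    if hit_dissatisfaction_rate > baseline_dissatisfaction_rate * 1.1:
        return min(current + step, 0.99)  # tighten
    # Quality is fine but savings are negligible: loosen cautiously, with a floor.
    if hit_rate < 0.05:
        return max(current - step, 0.75)
    return current  # keep the current threshold
```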
4. Implementation Considerations
When implementing Semantic Cache, evaluate candidate solutions against the following factors.
Key Considerations
- Existing Infrastructure Reusability: Can implement without additional backends if Redis/Milvus vector DB already exists
- Gateway Integration Needs: Whether to integrate routing, guardrails, and cache in unified management or separate layers
- Managed vs Self-hosted: Operational burden, compliance, cost trade-offs
- Observability Requirements: Cache HIT/MISS tracking, false-positive monitoring level
- Vector Search Engine Preference: Organization's standard stack among Redis/Milvus/FAISS/Qdrant
Implementation Patterns
Pattern A: Gateway All-in-one — Routing, cache, observability in single product (e.g., Portkey, Helicone)
- Pros: Integrated configuration, rapid deployment
- Cons: Vendor lock-in, advanced features depend on managed plans
Pattern B: Modular — Gateway (Bifrost/LiteLLM) + independent cache layer (RedisVL, GPTCache)
- Pros: Independent layer replacement possible, open source first
- Cons: Increased integration complexity
Pattern C: Managed — Redis Enterprise LangCache, Portkey SaaS
- Pros: Minimal operational burden, compliance certifications included
- Cons: Cost, region constraints
For specific tool comparison tables, configuration examples, and deployment snippets, refer to Inference Gateway Setup Guide — Semantic Caching Implementation Options.
5. Cache Key Design and Multi-tenancy
Since Semantic Cache sits at the gateway front and skips LLM calls entirely, cache key design and namespace separation directly impact response quality, security, and multi-tenancy.
Cache Key Components
The simplest key is just embedding(user_query), but in real services, the following elements must be included in the key (a key-construction sketch follows the list):
Required Components:
- model_id: Prevent cross-contamination between model types/versions (e.g., glm-5 ≠ qwen3-4b)
- system_prompt_hash: Different system prompts produce completely different answers
- tenant_id | user_id: Multi-tenant/per-user isolation
- language | locale: Prevent language cross-contamination
- tool_set_hash: Agent's available tool set
- embedding(user_query): Semantic similarity matching target
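A minimal sketch of combining these components into a deterministic namespace prefix, with the query embedding matched only inside that namespace. Function and field names are illustrative; the actual key format depends on the gateway or RedisVL index configuration described in the implementation guide.

```python
import hashlib
import json

def _short_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def build_cache_namespace(
    tenant_id: str,
    user_id: str,
    language: str,
    model_id: str,
    system_prompt: str,
    tool_names: list[str],
) -> str:
    """Deterministic prefix; semantic matching happens only within this namespace."""
    return ":".join([
        "cache",
        tenant_id,
        user_id,
        language,
        model_id,
        _short_hash(system_prompt),
        _short_hash(json.dumps(sorted(tool_names))),
    ])

# Any change in model, system prompt, tool set, tenant, or language yields a new namespace,
# so entries from one context can never be served in another.
ns = build_cache_namespace(
    tenant_id="acme", user_id="u42", language="ko",
    model_id="glm-5", system_prompt="You are a billing assistant.",
    tool_names=["get_invoice", "refund"],
)
```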
Multi-tenant Namespace Strategy
| Layer | Namespace Pattern Example | Isolation Purpose |
|---|---|---|
| Organization / Tenant | cache:{tenantId}:* | Data isolation, audit boundaries |
| User | cache:{tenantId}:{userId}:* | Prevent cross-user leakage of PII-containing queries |
| Language | cache:{tenantId}:ko:* / :en:* | Prevent cross-contamination in multilingual services |
| Domain | cache:{tenantId}:support:* / :billing:* | Block reuse between domains with different contexts |
| Model Version | cache:{...}:glm-5:v2026-03:* | Enable bulk invalidation on model upgrades |
Non-determinism Handling
Requests with temperature > 0, top_p < 1, or tool calls produce different responses each time, so simple reuse can degrade user experience.
Recommended Policy:
- Default cache disabled for streaming/agent-type requests
- Selectively allow only on endpoints with guaranteed reproducibility (e.g., /summarize, /classify)
- Recommend routing rules that cache only temperature=0 requests (a sketch of such an eligibility gate follows this list)
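The policy above can be expressed as a simple eligibility gate evaluated before the semantic lookup. Names and the endpoint allow-list below are hypothetical; real gateways (Portkey, LiteLLM, Bifrost) express the same rule through their routing/cache configuration rather than application code.

```python
from dataclasses import dataclass

CACHEABLE_ENDPOINTS = {"/summarize", "/classify"}  # reproducible, non-streaming routes

@dataclass
class RequestMeta:
    endpoint: str
    temperature: float = 0.0
    top_p: float = 1.0
    stream: bool = False
    has_tool_calls: bool = False

def is_cache_eligible(req: RequestMeta) -> bool:
    """Gate applied before the semantic lookup; when False, bypass the cache entirely."""
    if req.stream or req.has_tool_calls:
        return False  # streaming/agent traffic: cache disabled by default
    if req.endpoint not in CACHEABLE_ENDPOINTS:
        return False  # allow-list of endpoints with guaranteed reproducibility
    return req.temperature == 0.0 and req.top_p == 1.0
```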
For specific gateway integration patterns (kgateway, LiteLLM, Bifrost), configuration examples, and code snippets, refer to Inference Gateway Setup Guide — Semantic Caching Implementation Options.
6. Observability (Langfuse Integration)
Semantic Cache is a layer that directly affects user-visible responses, so it cannot be operated safely without observability. Collect the following with Langfuse or an equivalent observability stack:
Langfuse Trace Tags
Attach these attributes to each request trace (the Langfuse Python/TypeScript SDKs support this via metadata or tags); a minimal SDK sketch follows the list:
- cache_hit: true / false
- similarity_score: 0.92 (on HIT, highest matched similarity)
- cache_source: redis-semantic / portkey / helicone, etc.
- cache_namespace: {tenant}:{lang}:{domain} (no PII)
- cache_ttl_remaining_s: Remaining TTL (for debugging)
- cache_eviction_reason: MISS cause (below_threshold, namespace_miss, ttl_expired)
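A minimal sketch of attaching these attributes with the Langfuse Python SDK, assuming the v2-style low-level client (langfuse.trace with metadata and tags); newer OTel-based SDK versions carry the same fields as span attributes (see the OTel reference below). Values such as cache_source are illustrative.

```python
from langfuse import Langfuse  # v2-style low-level SDK; newer versions use the OTel API

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST from the environment

def record_cache_trace(query: str, hit: bool, score: float | None, namespace: str) -> None:
    """Record one request trace with the cache attributes listed above."""
    langfuse.trace(
        name="chat-request",
        input=query,  # redact PII before logging if Guardrails have not already done so
        tags=["cache:hit" if hit else "cache:miss"],
        metadata={
            "cache_hit": hit,
            "similarity_score": score,  # None on MISS
            "cache_source": "redis-semantic",
            "cache_namespace": namespace,  # no PII
        },
    )
```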
Recommended Dashboard Panels
Visualize the following with Langfuse custom dashboards or Prometheus + Grafana:
| Panel | Query/Metric | Target Value |
|---|---|---|
| Overall HIT Rate | count(cache_hit=true) / count(*) | 15-40% (varies by service characteristics) |
| HIT Rate by Namespace | group by cache_namespace | Monitor tenant variance |
| similarity_score Distribution | histogram of similarity_score on HIT | Watch for excessive bins near threshold |
| False-positive Proxy | 👎 feedback / regenerate click rate (where cache_hit=true) | No increase vs baseline |
| Total Saved Tokens | sum(tokens_saved) on HIT | Cost reporting |
| Cache Store Size | Redis DBSIZE, memory usage | Check TTL & eviction policy |
Alert Rules
| Alert | Condition | Severity |
|---|---|---|
| HIT Rate Plunge | HIT rate < 50% of previous 24h average | Warning — possible embedding/Redis failure |
| Abnormal HIT Rate Increase | HIT rate > 70% + false-positive proxy increase | Critical — suspected threshold misconfiguration |
| similarity_score Concentration | HIT ratio within threshold ±0.02 > 40% | Warning — excessive borderline matching |
| Redis Latency | P99 > 20ms | Warning — cache becoming bottleneck |
Langfuse OTel Integration Reference
For Bifrost/LiteLLM OTel transmission configuration, follow existing LLMOps Observability and Inference Gateway Setup Guide documents. Cache-related tags are added as span attributes at the application/gateway plugin layer.
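If traces already flow to Langfuse over OTel, the same tags can be set as span attributes at the plugin/application layer. A minimal sketch using the OpenTelemetry Python API; semantic_lookup is a hypothetical stand-in for the actual cache client.

```python
from opentelemetry import trace

tracer = trace.get_tracer("semantic-cache")

def semantic_lookup(query: str) -> tuple[bool, float, str | None]:
    """Stand-in for the real cache client (RedisVL, gateway plugin, etc.)."""
    return False, 0.0, None

def lookup_with_span(query: str) -> str | None:
    with tracer.start_as_current_span("semantic_cache.lookup") as span:
        hit, score, response = semantic_lookup(query)
        span.set_attribute("cache_hit", hit)  # surfaced in Langfuse alongside the trace
        span.set_attribute("cache_source", "redis-semantic")
        if hit:
            span.set_attribute("similarity_score", score)
        return response
```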
7. Practical Checklist
Security & Privacy
- Prohibit caching prompts containing PII (place Guardrails before Semantic Cache)
- Prohibit cache storage on prompt injection detection
- Prevent cross-tenant leakage (enforce namespace design with unit tests)
- Retain audit logs for minimum 90 days (HIT/MISS, namespace, similarity_score)
Operations & Lifecycle
- TTL: Static KB 7-30 days / product info 1-24h / news & time-series disable
- Model Version Replacement: Include version in key (glm-5:v2026-03) → natural expiration
- Embedding Model Replacement: Full rebuild required
- Failure Fallback: Fail-open on Redis failure (secure original rate limit in advance)
- Progressive Rollout: Validate new policies with A/B testing
Quality Guardrails
- Prohibit or use short TTL for large responses & tool call results
- Auto-evict entries on user 👎 feedback
- Weekly evaluation of cache HIT samples (Ragas/LLM-judge)
Pre-deployment Checklist
- Cache key includes model_id, system_prompt_hash, tenant_id, language
- Guardrails positioned before cache
- Record cache_hit, similarity_score in Langfuse traces
- Configure HIT rate / false-positive dashboard
- Validate Redis failure fail-open scenario (see the wrapper sketch below)
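The fail-open items above (Failure Fallback, Redis failure scenario) can be exercised with a thin wrapper like the sketch below; the lookup and generate callables are placeholders for your actual cache client and LLM call path.

```python
import logging
from typing import Callable, Optional

import redis

logger = logging.getLogger("semantic-cache")

def cached_or_generate(
    query: str,
    lookup: Callable[[str], Optional[str]],  # semantic lookup backed by Redis/RedisVL
    generate: Callable[[str], str],          # regular LLM call path
) -> str:
    """Fail-open wrapper: any cache-layer error falls back to a normal LLM call."""
    try:
        cached = lookup(query)
        if cached is not None:
            return cached
    except redis.RedisError as exc:
        # Fail-open: log and continue; provider rate limits must absorb the extra traffic.
        logger.warning("semantic cache unavailable, bypassing: %s", exc)
    return generate(query)
```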
8. Domain-specific Application Patterns
Even with the same Semantic Cache engine, key composition, thresholds, and TTL vary significantly by domain.
| Domain | Threshold | TTL | Characteristics |
|---|---|---|---|
| FAQ / Product Q&A | 0.80-0.85 | 24-72h | Repetitive queries, fixed answers. Key: tenant+language+product_version |
| Internal KB | 0.85-0.90 | 1-7d | Prioritize isolation by permission. Key: tenant+role_hash+language |
| Customer Support | 0.85 | 6-24h | Redact PII with Guardrails before embedding. Key: tenant+intent+language |
| Code Generation/IDE | 0.97+ or disabled | 30m-2h | High context dependency. Disable for refactoring/debugging |
Considerations:
- FAQ/Product Q&A: Natural invalidation via product_version key on product changes
- Internal KB: Flush user namespace on ACL changes
- Customer Support: PII (names, order numbers) must pass through Guardrails
- Code Generation: Different file/repo contexts require different answers for same queries
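One way to keep these per-domain settings auditable is a small policy table in code or configuration. The sketch below simply restates midpoints of the ranges in the table above; names and exact values are illustrative starting points to be tuned against measured HIT and false-positive rates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainCachePolicy:
    threshold: float
    ttl_seconds: int
    enabled: bool = True

DOMAIN_POLICIES = {
    "faq":              DomainCachePolicy(threshold=0.82, ttl_seconds=48 * 3600),
    "internal_kb":      DomainCachePolicy(threshold=0.88, ttl_seconds=3 * 24 * 3600),
    "customer_support": DomainCachePolicy(threshold=0.85, ttl_seconds=12 * 3600),
    "code_generation":  DomainCachePolicy(threshold=0.97, ttl_seconds=3600, enabled=False),
}
```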
9. FAQ
Q1. How do Semantic Cache and RAG differ?
RAG retrieves context from vector DB to generate new responses; Semantic Cache reuses existing complete responses. RAG augments input before LLM calls; Semantic Cache avoids LLM calls entirely.
Q2. Can streaming responses be cached?
Yes, but reassembly/replay complexity is high. Recommend starting with non-streaming endpoints.
Q3. What are embedding model selection criteria?
Multilingual: bge-m3, text-embedding-3-large. English-only: text-embedding-3-small. Full cache invalidation required on model changes.
Q4. Why is caching temperature > 0 requests risky?
Users deliberately set high temperature for diverse answers; returning the same answer violates expectations. Disable cache by default for creative endpoints.
Q5. What if cache HIT rate is low?
Check namespace over-segmentation → lower threshold by 0.05 → evaluate embedding model quality in that order. 10-15% HIT rate is normal for non-FAQ workloads.
Q6. What about compliance for cached responses?
Medical/financial/legal domains may require audit log recording even for cache HITs. Always log cache_hit=true and comply with regulatory retention periods.
10. References
Official Documentation & Repositories
- Redis — Semantic Caching (RedisVL)
- Redis LangCache (managed)
- Portkey — Semantic Cache
- Helicone — Caching
- LiteLLM — Caching
- Bifrost Official Docs
- GPTCache (Zilliz)
Related Documents
- Implementation Guide: Inference Gateway Setup Guide — Semantic Caching Implementation Options — Tool comparison tables, configuration examples, deployment snippets
- Inference Gateway Routing Strategy
- OpenClaw AI Gateway Deployment
- LLMOps Observability
- Milvus Vector Database
- Ragas Evaluation