Inference Gateway & LLM Gateway Routing Strategy

Created: 2025-02-05 | Updated: 2026-04-17 | Reading Time: ~15 minutes

This document covers design principles for 2-Tier gateway architecture and routing strategies (Cascade / Semantic Router / Hybrid). For actual deployment procedures including Helm installation, HTTPRoute manifests, and OTel integration, refer to Inference Gateway Deployment Guide.

Overview

In large-scale AI model serving environments, infrastructure traffic management and LLM provider abstraction must be separated. Combining both in a single gateway compounds complexity and makes each layer harder to optimize independently.

2-Tier Gateway Architecture:

  • L1 (Ingress Gateway): kgateway — Kubernetes Gateway API standard, traffic routing, mTLS, rate limiting
  • L2-A (Inference Gateway): Bifrost/LiteLLM — Provider integration, cascade routing, semantic caching
  • L2-B (Data Plane): agentgateway — MCP/A2A protocols, stateful session management

Each tier is managed independently, separating infrastructure and AI workloads.


2-Tier Gateway Architecture

Gateway Layer Separation

LLM inference platforms must clearly distinguish 3 different Gateway roles:

| Gateway Type | Role | Implementation | Location |
| --- | --- | --- | --- |
| Ingress Gateway | External traffic ingress, TLS termination, path-based routing | kgateway (NLB integration) | Tier 1 |
| Inference Gateway | Model selection, intelligent routing, request cascading | Bifrost / LiteLLM | Tier 2-A |
| Data Plane | MCP/A2A protocols, stateful sessions, tool routing | agentgateway | Tier 2-B |

Core Principles:

  • Ingress Gateway (kgateway): Handles network-level traffic control only. Does not include model selection logic
  • Inference Gateway (Bifrost/LiteLLM): Analyzes request complexity → Automatically selects appropriate model → Cost optimization
  • Data Plane (agentgateway): Handles AI-specific protocols (MCP/A2A), maintains stateful sessions

Overall Architecture

Responsibility Separation by Tier

| Tier | Component | Responsibility | Protocol |
| --- | --- | --- | --- |
| Tier 1 | kgateway (Envoy-based) | Traffic routing, mTLS, rate limiting, network policies | HTTP/HTTPS, gRPC |
| Tier 2-A | Bifrost / LiteLLM | Intelligent model selection, cost tracking, request cascading, semantic caching | OpenAI-compatible API |
| Tier 2-B | agentgateway | MCP/A2A session management, self-hosted inference routing, Tool Poisoning prevention | HTTP, JSON-RPC, MCP, A2A |

Traffic Flow

  • External LLM: Client → kgateway → Bifrost/LiteLLM (Cascade + Cache) → OpenAI → Response + Cost tracking
  • Self-hosted vLLM: Client → kgateway → agentgateway → vLLM → Response


kgateway (L1 Ingress Gateway)

Gateway API-based Routing

kgateway implements the Kubernetes Gateway API standard, enabling vendor-neutral configuration.

| Component | Role | Description |
| --- | --- | --- |
| GatewayClass | Gateway implementation definition | Designates the kgateway controller |
| Gateway | Entry point definition | Configures listeners, TLS, addresses |
| HTTPRoute | Routing rules | Path- and header-based routing |
| Backend | Model service | vLLM, TGI, and other inference servers |

Gateway API v1.2.0+ provides HTTPRoute improvements, GRPCRoute stabilization, and BackendTLSPolicy, fully supported by kgateway v2.0+.

Dynamic Routing Concepts

| Routing Type | Criteria | Use Case |
| --- | --- | --- |
| Header-based | x-model-id, x-provider | Backend selection by model/provider |
| Path-based | /v1/chat/completions, /v1/embeddings | Service separation by API type |
| Weight-based | backendRef weight | Canary deployment, A/B testing |
| Composite conditions | Headers + Path + Tier | Premium/standard customer backends |

Canary deployments start with 5-10% traffic and gradually increase, with immediate rollback via weight=0 on issues.

Load Balancing Strategies

| Strategy | Description | Suitable Scenario |
| --- | --- | --- |
| Round Robin | Sequential distribution (default) | Uniform model instances |
| Random | Random distribution | Large backend pools |
| Consistent Hash | Same key → same backend | KV Cache reuse, session affinity |

Consistent Hash is particularly useful for LLM inference. Routing requests from the same user to the same vLLM instance increases prefix cache hit rates, significantly improving TTFT (Time to First Token).
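As a conceptual illustration (not kgateway's actual implementation), the sketch below hashes a stable session key so that requests from the same user keep landing on the same vLLM instance; a production consistent-hash implementation would use a hash ring to minimize remapping when backends change, and the backend names here are assumptions.

import hashlib

# Hypothetical in-cluster vLLM instances.
VLLM_BACKENDS = ["vllm-0.inference.svc", "vllm-1.inference.svc", "vllm-2.inference.svc"]

def pick_backend(session_key: str) -> str:
    """Map a stable key (user ID, session ID) to one backend deterministically."""
    digest = hashlib.sha256(session_key.encode()).hexdigest()
    return VLLM_BACKENDS[int(digest, 16) % len(VLLM_BACKENDS)]

# The same user always reaches the same instance, so its prefix (KV) cache
# stays warm for that conversation and TTFT drops on follow-up requests.
print(pick_backend("user-1234"))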

Topology-Aware Routing (Kubernetes 1.33+)

Kubernetes 1.33+ topology-aware routing prioritizes same-AZ Pod communication to reduce cross-AZ data transfer costs.

🚀 Topology-Aware Routing Effects
| Metric | Before | Topology-Aware | Improvement |
| --- | --- | --- | --- |
| Cross-AZ Traffic | High | Minimized | 50% data transfer cost savings |
| Latency | High (cross-AZ) | Low (same AZ) | 30-40% P99 latency improvement |
| Network Bandwidth | Limited | Optimized | 20-30% throughput increase |

Failure Handling Concepts

| Mechanism | Description | LLM Inference Considerations |
| --- | --- | --- |
| Timeout | Maximum processing time per request | Long LLM responses can take tens of seconds; set an adequate timeout (120s+) |
| Retry | Auto-retry on 5xx, timeout, connection failure | Max 3 retries; unbounded retries cause system overload |
| Circuit Breaker | Temporarily block a backend after consecutive failures | Set maxEjectionPercent to 50% or below so at least half of the backends remain available |

For streaming responses, the backendRequest timeout applies to the first byte, while the request timeout applies to the total response time. POST retries require idempotency guarantees (use caution with tool calls).
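A minimal client-side sketch of these constraints, assuming an OpenAI-compatible endpoint; the status list, backoff, and idempotency flag are illustrative and not a kgateway feature:

import time
import httpx

MAX_RETRIES = 3                          # matches the "max 3 retries" guidance above
RETRYABLE_STATUS = {500, 502, 503, 504}

def post_with_retry(url: str, payload: dict, idempotent: bool) -> httpx.Response:
    """Retry a POST only when it is safe to replay (e.g. no side-effecting tool calls)."""
    attempts = MAX_RETRIES if idempotent else 1
    last_exc = None
    resp = None
    for attempt in range(attempts):
        try:
            resp = httpx.post(url, json=payload, timeout=120.0)   # generous LLM timeout
            if resp.status_code not in RETRYABLE_STATUS:
                return resp
            last_exc = None
        except httpx.TimeoutException as exc:
            last_exc = exc
        if attempt < attempts - 1:
            time.sleep(2 ** attempt)                              # exponential backoff
    if last_exc is not None:
        raise last_exc
    return resp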


LLM Gateway Solution Comparison

Major Solution Comparison Table

| Solution | Language | Key Features | Cascade Routing | License | Best For |
| --- | --- | --- | --- | --- | --- |
| Bifrost | Go/Rust | 50x faster, CEL Rules conditional routing, failover | CEL Rules + external classifier | Apache 2.0 | High performance, low cost, self-hosted |
| LiteLLM | Python | 100+ providers, native complexity-based routing | routing_strategy: complexity-based | MIT | Python ecosystem, rapid prototyping |
| vLLM Semantic Router | Python | vLLM-only, lightweight embedding-based routing | Embedding similarity-based | Apache 2.0 | vLLM standalone environment |
| Portkey | TypeScript | SOC2 certified, semantic caching, Virtual Keys | Supported | Proprietary + OSS | Enterprise, compliance |
| Kong AI Gateway | Lua/C | MCP support, leverages existing Kong infra | Plugin | Apache 2.0 / Enterprise | Existing Kong users |
| Helicone | Rust | Gateway + observability integrated, high performance | Supported | Apache 2.0 | High performance + observability needed |

Bifrost vs LiteLLM

Bifrost: Go/Rust implementation with 50x faster throughput than Python-based gateways and roughly 1/10 the memory usage. CEL Rules enable conditional routing (header-based cascade, failover). Helm Chart deployment, OpenAI-compatible API, proxy latency < 100us. Intelligent cascading follows the pattern: the application computes a complexity score, sends it in an x-complexity-score header, and CEL rules (or a Go plugin) branch on it.
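A sketch of that pattern from the application side; the scoring heuristic, endpoint, and model name are assumptions, while the x-complexity-score header follows the pattern described above:

import httpx

BIFROST_URL = "http://bifrost.ai-gateway.svc:8080/v1/chat/completions"  # assumed Service endpoint

def complexity_score(messages: list[dict]) -> float:
    """Crude heuristic: longer prompts and reasoning keywords imply higher complexity."""
    text = " ".join(m.get("content", "") for m in messages)
    score = min(len(text) / 2000, 1.0)                                   # length component
    if any(kw in text.lower() for kw in ("refactor", "architecture", "debug")):
        score = max(score, 0.8)                                          # keyword component
    return round(score, 2)

def chat(messages: list[dict]) -> dict:
    headers = {"x-complexity-score": str(complexity_score(messages))}
    # Bifrost expects provider/model names; a CEL rule can rewrite the model
    # (e.g. score >= 0.7 -> stronger model) based on the header above.
    body = {"model": "openai/gpt-4.1-mini", "messages": messages}
    return httpx.post(BIFROST_URL, json=body, headers=headers).json()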

LiteLLM: Supports 100+ providers, native complexity-based routing (enabled with a single routing_strategy: complexity-based config line), one-line Langfuse integration (success_callback: ["langfuse"]), and direct LangChain/LlamaIndex integration. However, it is Python-based, with lower throughput and higher memory usage.

Selection Criteria

| Use Case | Recommended Solution | Reason |
| --- | --- | --- |
| Intelligent cascade (convenience priority) | LiteLLM | Native complexity-based routing, 1-line config |
| Intelligent cascade (performance priority) | Bifrost | CEL Rules + external classifier, 50x faster |
| vLLM standalone environment | vLLM Semantic Router | vLLM native, lightweight routing |
| High performance, low cost self-hosted | Bifrost | 50x faster processing, low memory |
| Python ecosystem (LangChain) | LiteLLM | Native integration, 100+ providers |
| Enterprise compliance | Portkey | SOC2/HIPAA/GDPR, Semantic Cache |
| High performance + observability integrated | Helicone | Rust-based all-in-one |

| Scenario | Recommended Stack | Reason |
| --- | --- | --- |
| Startup/PoC | kgateway + LiteLLM | Low cost, 10-min deployment, 1-line complexity routing |
| Self-hosted focused (performance) | kgateway + Bifrost (CEL cascade) + agentgateway | High performance, 2-Tier external + self-hosted pools |
| Enterprise multi-provider | kgateway + Portkey + Langfuse | Compliance, 250+ providers |
| Hybrid (external + self-hosted) | kgateway + Bifrost/LiteLLM + agentgateway | External via Bifrost/LiteLLM, self-hosted via agentgateway |
| Global deployment | Cloudflare AI Gateway + kgateway | Edge caching, DDoS protection |

Request Cascading: Intelligent Model Routing

Concept

Request Cascading is an intelligent optimization technique that automatically analyzes request complexity and routes to appropriate models. Simple queries go to cheap and fast models, complex reasoning to powerful models, simultaneously improving cost and latency. IDEs use a single endpoint only; model selection is centrally controlled at platform level.

Three Cascading Patterns

| Pattern | Description | Implementation | Use Case |
| --- | --- | --- | --- |
| 1. Weight-based | Distribute traffic by fixed ratio | kgateway backendRef weight | A/B testing, gradual model migration |
| 2. Fallback-based | Auto-switch to another model on error | kgateway retry + multiple backendRefs | Availability improvement, rate limit avoidance |
| 3. Intelligent routing | Auto-select model after request analysis | LLM Classifier / LiteLLM / vLLM Semantic Router | Cost optimization, quality maintenance |

Practical Request Cascading Implementation

Intelligent cascade routing analyzes request complexity and auto-routes to appropriate models. This section focuses on verified approaches in self-hosted environments.

LLM Classifier is a Python FastAPI-based lightweight router that directly analyzes prompt content to auto-select SLM/LLM. It operates as ExtProc (External Processing) or independent service behind kgateway, with clients using only a single endpoint (/v1).

Classification Criteria:

| Criteria | weak (SLM) | strong (LLM) |
| --- | --- | --- |
| Keywords | None | Refactor, architecture, design, analysis, debug, optimization, migration, etc. |
| Input length | < 500 chars | ≥ 500 chars |
| Conversation turns | ≤ 5 turns | > 5 turns |

Core Classification Logic:

STRONG_KEYWORDS = ["refactor", "architect", "design", "analyze",
                   "optimize", "debug", "migration", "complex"]
TOKEN_THRESHOLD = 500  # compared against character length (rough proxy for token count)

def classify(messages: list[dict]) -> str:
    content = " ".join(m.get("content", "") for m in messages if m.get("content"))
    # Keyword matching
    if any(kw in content.lower() for kw in STRONG_KEYWORDS):
        return "strong"
    # Input length
    if len(content) > TOKEN_THRESHOLD:
        return "strong"
    # Conversation turns
    if len(messages) > 5:
        return "strong"
    return "weak"

Pros: No client modification needed, direct prompt analysis, direct Langfuse OTel transmission, simple deployment (single Pod)
Cons: Classification accuracy depends on heuristics (can be gradually improved with an ML classifier)

Why LLM Classifier is Optimal

Standard OpenAI-compatible clients (Aider, Cline, etc.) only configure a single base_url. LLM Classifier analyzes prompts behind this single endpoint and proxies directly to backend vLLM instances. Clients are completely unaware of model selection.
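From the client's perspective nothing changes beyond base_url; for example, with the openai Python SDK (the URL and the "auto" alias are placeholders):

from openai import OpenAI

# Point the standard SDK at the gateway's single endpoint; the classifier behind it
# decides which backend model actually serves the request.
client = OpenAI(base_url="http://llm-gateway.example.internal/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="auto",   # placeholder alias; model selection happens in the gateway
    messages=[{"role": "user", "content": "Refactor this service into two modules"}],
)
print(resp.choices[0].message.content)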

Bifrost Self-hosted Cascade Limitations

We attempted to use Bifrost for self-hosted vLLM cascade but switched to LLM Classifier due to the following limitations:

| Limitation | Description |
| --- | --- |
| provider/model format enforcement | Requires openai/glm-5 format. Standard OpenAI clients (Aider, etc.) expect single model names like model: "auto" |
| Single base_url per provider | Only one network_config.base_url per provider (e.g., openai). Cannot route within the same provider when the SLM and LLM run as different Services |
| No prompt access in CEL | CEL Rules only access request.headers. Cannot analyze the request body (prompt content) for routing |
| Model name normalization issues | Unpredictable normalization such as hyphen removal causes mismatches with vLLM served-model-name |

Bifrost is Best for External LLM Provider Integration

Bifrost is optimized for external provider integration (OpenAI/Anthropic/Bedrock) and failover. LLM Classifier is better suited for intelligent cascade routing between self-hosted vLLM instances.

RouteLLM Evaluation Results

RouteLLM is an open-source routing framework developed by LMSYS, with academically validated Matrix Factorization-based classification models (90%+ accuracy on LMSYS Chatbot Arena data).

However, the following issues were confirmed during K8s deployment:

  • Dependency conflicts: Large dependency trees like torch, transformers, sentence-transformers conflict with vLLM environments
  • Container size: Image size 10GB+ with classification model (unsuitable for lightweight router)
  • Deployment instability: High frequency of pip dependency resolution failures
  • Maintenance: Research project in nature; no production support

Conclusion: RouteLLM's MF classifier concept is valid, but for production deployment we recommend LLM Classifier (lightweight heuristics) or LiteLLM complexity routing (external provider environments).

Approach B: LiteLLM Native (External Provider Environment)

LiteLLM natively supports complexity-based routing. Adding just 1 line to the config file enables automatic request complexity analysis and model selection.

model_list:
  - model_name: gpt-4-turbo
    litellm_params:
      model: gpt-4-turbo-preview
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: complexity-based   # Enable with this 1 line
  complexity_threshold: 0.7            # ≥ 0.7 → stronger model

Pros: Activate with a 1-line config; auto-analyzes prompt length, code inclusion, and reasoning keywords; 100+ provider support
Cons: Python-based lower throughput, complexity algorithm not customizable, overhead for self-hosted vLLM

Approach C: vLLM Semantic Router (vLLM-only)

In vLLM environments, vLLM Semantic Router can be used for lightweight embedding-based routing. It matches embeddings to pre-defined "categories" to select models.

# vLLM Semantic Router configuration
from vllm import SemanticRouter

router = SemanticRouter(
    categories={
        "simple": ["basic question", "quick answer", "definition"],
        "complex": ["explain in detail", "analyze", "step by step"]
    },
    models={
        "simple": "qwen3-4b",
        "complex": "glm-5-744b"
    },
    threshold=0.85
)

# Auto-routing
response = router.route(prompt="Explain the architecture...")  # → glm-5-744b

Pros: vLLM native, lightweight embedding usage (inference latency < 5ms), simple configuration
Cons: vLLM-only, requires pre-defined categories

Cascade Routing Implementation Selection Guide

| Environment | Recommended Approach | Reason |
| --- | --- | --- |
| Self-hosted vLLM (Aider/Cline) | LLM Classifier | Direct prompt analysis, single endpoint, no client modification |
| External providers (OpenAI/Anthropic) | LiteLLM | 100+ providers native, 1-line complexity routing |
| vLLM standalone + embeddings available | vLLM Semantic Router | vLLM native, lightweight |
| Hybrid (external + self-hosted) | LLM Classifier + LiteLLM | Self-hosted via Classifier, external via LiteLLM |

Cascade Routing Strategy (Fallback-based)

Try cheap -> balanced -> frontier models progressively based on complexity.

Complexity Classification Criteria (as of 2026-04):

| Complexity | Conditions | Recommended Model | Cost ($/M tokens, per listed model) |
| --- | --- | --- | --- |
| Simple | Tokens < 200, no keywords | Haiku 4.5 / GPT-4.1 nano | $0.80 / $0.15 |
| Medium | Tokens 200-1000, code included | Sonnet 4.6 / Gemini 2.5 Flash | $3 / $0.10 |
| Complex | Tokens 1000+, reasoning keywords | Opus 4.7 / GPT-4.1 | $15 / $10 |

Fallback Conditions: HTTP 5xx, Rate Limit exceeded, Timeout, Quality Score < 0.7 (optional)
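A minimal sketch of the fallback cascade; the model order follows the table above, while the endpoint, status handling, and the omission of a quality check are assumptions of the sketch:

import httpx

# Ordered from cheapest to most capable, per the complexity table above.
CASCADE = ["haiku-4.5", "sonnet-4.6", "opus-4.7"]
FALLBACK_STATUS = {429, 500, 502, 503, 504}     # rate limit + 5xx

def cascade_completion(messages: list[dict], gateway_url: str) -> dict:
    last_error = None
    for model in CASCADE:
        try:
            resp = httpx.post(
                f"{gateway_url}/v1/chat/completions",
                json={"model": model, "messages": messages},
                timeout=120.0,
            )
            if resp.status_code not in FALLBACK_STATUS:
                return resp.json()              # success: stop escalating
            last_error = resp.status_code
        except httpx.TimeoutException:
            last_error = "timeout"              # timeouts also trigger fallback
    raise RuntimeError(f"All cascade models failed (last error: {last_error})")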

Cost Savings Effect (as of 2026-04)

10,000 requests/day scenario:

  • Simple (50%): Haiku 4.5 — 50 tok in, 100 tok out → $0.50/day
  • Medium (30%): Sonnet 4.6 — 500 tok in, 500 tok out → $2.70/day
  • Complex (15%): Opus 4.7 — 1500 tok in, 1000 tok out → $3.38/day
  • Very Complex (5%): Opus 4.7 — 3000 tok in, 2000 tok out → $3.00/day

Total cost: $9.58/day ($287/month)

Processing all requests with Opus 4.7: $45/day ($1,350/month) → 79% savings
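The totals above reduce to simple arithmetic; this sketch sums the stated per-tier daily costs and compares them with the all-Opus baseline:

# Stated per-tier daily costs from the scenario above (USD/day).
tier_costs = {"simple": 0.50, "medium": 2.70, "complex": 3.38, "very_complex": 3.00}

daily_total = sum(tier_costs.values())        # 9.58
monthly_total = daily_total * 30              # ~287
opus_only_daily = 45.0                        # all 10,000 requests on Opus 4.7
savings = 1 - daily_total / opus_only_daily   # ~0.787 -> 79%

print(f"${daily_total:.2f}/day, ${monthly_total:.0f}/month, {savings:.0%} saved")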

Self-hosted LLM Classifier Scenario (as of 2026-04):

  • Qwen3-4B (70% weak, L4 $0.3/hr × 24hr × 30d) = $216/month
  • GLM-5 744B (30% strong, H200 $12/hr × 24hr × 30d × 0.3) = $2,592/month
  • Langfuse + AMP/AMG = $200/month

Total cost: $3,008/month (vs GLM-5 alone $8,900/month → 66% savings)

Enterprise Model Routing Patterns

Implementation Location Priority: Gateway > IDE > Client

| Location | Advantages | Suitable Environment |
| --- | --- | --- |
| Gateway (LLM Classifier) | Prompt analysis, central control, no client modification | Self-hosted (Recommended) |
| Gateway (LiteLLM/Bifrost) | Multi-provider, policy consistency | External providers |
| IDE (Claude Code) | Context awareness | Dev tool vendors |
| Client (SDK) | High flexibility | Prototype |

Field Recommendation: In self-hosted environments, deploy with kgateway → LLM Classifier → vLLM structure for centralized routing. Developers use only a single endpoint (/v1), and platform teams manage classification policies. For detailed deployment guide, refer to Inference Gateway Deployment: LLM Classifier.


Research Reference: RouteLLM

RouteLLM is an open-source LLM routing framework developed by LMSYS. A lightweight classification model (Matrix Factorization) analyzes requests to automatically select strong/weak models.

| Item | RouteLLM (Research) | LLM Classifier (Production) |
| --- | --- | --- |
| Classification method | Matrix Factorization embedding | Keywords + token length + conversation turns |
| Input | User prompt + conversation history | Same |
| Output | Strong/Weak + confidence score | Strong/Weak |
| Added latency | < 10ms (MF inference) | < 1ms (rule-based) |
| Dependencies | torch, transformers, sentence-transformers | FastAPI, httpx (lightweight) |
| K8s deployment | Unstable (dependency conflicts) | Stable (50MB image) |

RouteLLM Production Deployment Caution

RouteLLM is a research project; K8s production deployment is not recommended. Dependency conflicts and large image size (10GB+) are problematic. The MF classifier concept is useful, but for production we recommend LLM Classifier (self-hosted) or LiteLLM complexity routing (external provider environments).

For detailed deployment code, refer to Inference Gateway Deployment Guide: LLM Classifier.


Gateway API Inference Extension

Kubernetes Gateway API enables managing LLM inference as Kubernetes-native resources through Inference Extension.

Core CRDs (Custom Resource Definitions)

| CRD | Role | Example |
| --- | --- | --- |
| InferenceModel | Defines per-model serving policies (criticality, routing rules) | criticality: high → dedicated GPU allocation |
| InferencePool | Model serving Pod group (vLLM replicas) | replicas: 3 → 3 vLLM instances |
| LLMRoute | Rules for routing requests to an InferenceModel | x-model-id: glm-5 → GLM-5 Pool |

For detailed YAML manifests, refer to Inference Gateway Deployment Guide.

Gateway API Inference Extension Integration

Gateway API Inference Extension integrates with kgateway + llm-d EPP to provide Kubernetes-native inference routing:

Current Status: Actively developed as a CNCF project. Alpha support is expected around Kubernetes 1.34+; production use is not currently recommended. For production deployment, refer to the Reference Architecture guides.


Semantic Caching

Semantic Caching detects semantically similar prompts and reuses previous responses, simultaneously reducing LLM API costs and latency. At the Gateway level (Bifrost/LiteLLM/Portkey), HIT/MISS is determined by embedding similarity, so it can be combined independently with KV Cache (vLLM) · Prompt Cache (provider-managed).

Recommended default threshold: 0.85 — allows same meaning, different expression
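As a conceptual sketch of the HIT/MISS decision (the embedding function, in-memory store, and helper names are placeholders, not the actual Bifrost/LiteLLM/Portkey implementation):

import numpy as np

SIMILARITY_THRESHOLD = 0.85                       # recommended default from above
cache: list[tuple[np.ndarray, str]] = []          # (prompt embedding, cached response)

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice, call a sentence-embedding model and normalize the vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.normal(size=384)
    return vec / np.linalg.norm(vec)

def lookup(prompt: str) -> str | None:
    query = embed(prompt)
    for vec, response in cache:
        if float(np.dot(query, vec)) >= SIMILARITY_THRESHOLD:   # cosine similarity (unit vectors)
            return response                                     # semantic HIT: reuse the response
    return None                                                 # MISS: call the LLM, then store()

def store(prompt: str, response: str) -> None:
    cache.append((embed(prompt), response))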

Design principles (3-tier cache comparison, similarity threshold tradeoffs, tool comparison table, cache key design, observability·production checklist) are covered in detail in separate documentation.


agentgateway Data Plane

Overview

agentgateway is an AI workload-dedicated data plane for kgateway. Traditional Envoy is optimized for stateless HTTP/gRPC, but AI agents have special requirements like stateful JSON-RPC sessions, MCP protocols, and Tool Poisoning prevention.

Envoy vs agentgateway Comparison

| Item | Envoy Data Plane | agentgateway |
| --- | --- | --- |
| Session Management | Stateless, HTTP cookie-based | Stateful JSON-RPC sessions, in-memory session store |
| Protocols | HTTP/1.1, HTTP/2, gRPC | MCP (Model Context Protocol), A2A (Agent-to-Agent) |
| Security | mTLS, RBAC | Tool Poisoning prevention, per-session Authorization |
| Routing | Path/header-based | Session ID-based, tool call validation |
| Observability | HTTP metrics, Access Log | LLM token tracking, tool call chains, cost |

Core Features

1. Stateful JSON-RPC Session Management: X-MCP-Session-ID header-based session tracking, Sticky Session routing, automatic inactive session cleanup (default 30 minutes)

2. Native MCP/A2A Protocol Support: /mcp/v1 (MCP protocol), /a2a/v1 (A2A agent communication) path support

3. Tool Poisoning Prevention: Allowed tool list, dangerous tool blocking (exec_shell, read_credentials), response size limits, integrity verification (SHA-256); see the sketch below

4. Per-session Authorization: JWT token verification, role-based tool access, session hijacking prevention
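To make the Tool Poisoning prevention above concrete, here is a minimal guard sketch; the allowed tool names, size limit, and checksum flow are assumptions based on the description, not agentgateway's actual API:

import hashlib

ALLOWED_TOOLS = {"search_docs", "run_tests", "list_files"}   # explicit allow-list (assumed names)
BLOCKED_TOOLS = {"exec_shell", "read_credentials"}           # dangerous tools named above
MAX_RESPONSE_BYTES = 1_000_000                               # response size limit

def validate_tool_call(tool_name: str) -> bool:
    """Reject anything blocked or not explicitly allowed."""
    return tool_name in ALLOWED_TOOLS and tool_name not in BLOCKED_TOOLS

def validate_tool_response(payload: bytes, expected_sha256: str) -> bool:
    """Enforce the size limit and verify integrity (SHA-256) before the result reaches the model."""
    if len(payload) > MAX_RESPONSE_BYTES:
        return False
    return hashlib.sha256(payload).hexdigest() == expected_sha256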

agentgateway Project Status

agentgateway is an AI-dedicated data plane separated from the kgateway project in late 2025, currently under active development. Features are continuously added to keep pace with rapid evolution of MCP and A2A protocols.


Monitoring & Observability

Core Metrics

Core metrics to monitor in AI inference gateways:

📊 Kgateway Prometheus Metrics

| Metric | Description | Usage |
| --- | --- | --- |
| kgateway_requests_total | Total request count | Traffic monitoring |
| kgateway_request_duration_seconds | Request processing time | Latency analysis |
| kgateway_upstream_rq_xx | Backend response codes | Error tracking |
| kgateway_upstream_cx_active | Active connections | Capacity planning |
| kgateway_retry_count | Retry count | Stability analysis |

| Metric Category | Key Item | Meaning |
| --- | --- | --- |
| Latency | TTFT (Time to First Token) | Time until the first token is generated; user-perceived responsiveness |
| Throughput | TPS (Tokens Per Second) | Tokens generated per second; model serving efficiency |
| Error Rate | 5xx / Total Requests | Backend failure ratio; act immediately if > 5% |
| Cache Hit Rate | Cache Hits / Total Requests | Semantic Cache efficiency; 30%+ recommended |
| Cost | Token usage by model × unit price | Real-time cost tracking |

Langfuse OTel Integration

Send OTel traces from Bifrost/LiteLLM to Langfuse to track prompts/completions, token usage, cost analysis, and tool call chains. Bifrost activates via otel plugin, LiteLLM via success_callback: ["langfuse"] config. For detailed configuration, refer to Monitoring Stack Setup.

Recommended alert rules:

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | 5xx > 5% (5 min) | Critical |
| High latency | P99 > 30s (5 min) | Warning |
| Circuit breaker activated | circuit_breaker_open == 1 | Critical |
| Cache hit rate drop | Cache hit < 30% | Warning |
| Budget approaching | Budget > 80% | Warning |

Production Deployment Guides

For actual code examples and YAML manifests, refer to Reference Architecture section:

  • Cost and Observability

