7 docs tagged with "inference-gateway"

Cascade Routing Production Tuning

Guide to tuning Inference Gateway cascade routing from production traces: classification thresholds, canary rollout, fallback, and cost drift alerts

Inference Gateway

Routing strategies, deployment, cascade tuning, and implementation examples for kgateway and Bifrost-based 2-Tier inference gateways

Inference Gateway Deployment Guide

Step-by-step deployment guide for kgateway-based Inference Gateway (basic/advanced/troubleshooting)

llm-d Based EKS Distributed Inference Guide

llm-d architecture concepts, KV cache-aware routing, disaggregated serving, and EKS Auto Mode integration strategy

Model Serving & Inference Infrastructure

A guide to the GPU infrastructure, inference framework, and inference optimization layers, with a single map of the end-to-end LLM inference request path and per-layer tuning levers — inference gateway, prefill/decode disaggregation, KV cache-aware routing, LMCache, and cache-hit strategy.

Semantic Caching Strategy

Semantic caching strategy at the LLM gateway level, with a comparison of implementation options (GPTCache, Redis Semantic Cache, Portkey, Helicone, Bifrost+Redis)

Tiered Gateway Architecture

Single definition of the Agentic AI Platform gateway layers (Tier 1 Ingress, Tier 2 Inference Routing (Inference Extension) and LLM API Gateway, and the Agent Data Plane), covering their role separation and how to fill each layer