Inference Gateway Deployment Guide
This document covers production deployment procedures for a kgateway + Bifrost-based inference gateway. For architecture concepts and routing strategies (Cascade, Semantic Router, 2-Tier structure), refer to Inference Gateway Routing.
This guide consists of three documents. Work through them in order, or jump to the sections you need.
Production Inference Pipeline Reference Architecture
The production inference pipeline runs on EKS Auto Mode. The complete request flow is: CloudFront (WAF/Shield) → NLB → kgateway, where ExtProc analyzes the prompt to determine LLM routing; the request then passes through the Bifrost governance layer and llm-d KV Cache-aware routing to reach the optimal model.
Deployment Stages Overview
Basic Deployment (Required)
Configure kgateway + HTTPRoute + Bifrost behind a single NLB endpoint to complete the basic inference pipeline.
Includes:
- kgateway installation and Gateway API CRD configuration
- GatewayClass, Gateway, HTTPRoute resource definitions
- Cross-namespace access via ReferenceGrant
- Bifrost Gateway Mode configuration (config.json + PVC)
- provider/model format and IDE compatibility (Aider, Cline, Continue.dev)
- SQLite initialization procedure (when config.json changes)
Learning Time: 30 min | Deployment Time: 45 min
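The Gateway, HTTPRoute, and ReferenceGrant resources above can be sketched as follows. All names, namespaces, and the Bifrost service port are illustrative assumptions, not values fixed by this guide; substitute your own.

```yaml
# Gateway owned by the platform team (name and namespace are assumptions)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: kgateway-system
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same
---
# HTTPRoute in the gateway namespace forwarding /v1 traffic to Bifrost
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: bifrost-route
  namespace: kgateway-system
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: bifrost        # Service in another namespace
          namespace: bifrost   # assumed Bifrost namespace
          port: 8080           # assumed Bifrost listen port
---
# ReferenceGrant in the backend namespace permitting the
# cross-namespace backendRef above
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-kgateway-to-bifrost
  namespace: bifrost
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: kgateway-system
  to:
    - group: ""
      kind: Service
```

Without the ReferenceGrant, the cross-namespace backendRef is rejected and the route returns 404, which is why it appears as a required step here.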
Advanced Features (Optional)
Add prompt-based automatic routing, a production security layer, and Semantic Caching to improve cost efficiency and security.
Includes:
- LLM Classifier deployment (prompt-based SLM/LLM automatic branching)
- CloudFront + WAF/Shield security layer
- Semantic Caching implementation options (GPTCache, RedisVL, Portkey, Helicone)
Learning Time: 45 min | Deployment Time: 60-90 min
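One common way to wire the classifier's SLM/LLM decision into routing is a header match on the HTTPRoute. This is a sketch only: the header name `x-llm-tier` and the backend Service names are hypothetical, not fixed by kgateway or Bifrost.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tiered-route
  namespace: kgateway-system   # assumed namespace
spec:
  parentRefs:
    - name: inference-gateway  # assumed Gateway name
  rules:
    # Requests the ExtProc classifier tags as simple go to the SLM pool
    - matches:
        - path:
            type: PathPrefix
            value: /v1
          headers:
            - name: x-llm-tier   # hypothetical header set by the classifier
              value: slm
      backendRefs:
        - name: slm-serving      # assumed SLM serving Service
          port: 8000
    # Everything else falls through to the LLM path via Bifrost
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: bifrost          # assumed Bifrost Service
          port: 8080
```

Rule order matters: the header-matched rule must precede the catch-all prefix rule, since more specific matches are listed first.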
Troubleshooting (Reference)
Common issues and solutions during deployment and operations.
Includes:
- 404 Not Found (HTTPRoute/Gateway configuration errors)
- Bifrost provider/model errors
- Bifrost model name normalization issues
- Langfuse Sub-path 404
- OTel Trace not arriving
Reference Frequency: During deployment or when issues occur
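As a quick sanity check for the provider/model errors above: Bifrost expects the model field in provider/model form. A request body sketch, with a hypothetical model name:

```json
{
  "model": "openai/gpt-4o",
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
```

A bare model name with no provider prefix (e.g. `gpt-4o`) is a common cause of the provider-resolution errors listed here.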
Learning Paths
Quick Start (Development/Test Environment)
- Configure kgateway + Bifrost with Basic Deployment
- Refer to Troubleshooting when issues occur
Time Required: 1-2 hours
Production Configuration (Complete Pipeline)
- Configure basic infrastructure with Basic Deployment
- Add LLM Classifier + CloudFront/WAF + Semantic Caching in Advanced Features
- Refer to Troubleshooting during operations
Time Required: 3-4 hours
Prerequisites
Verify the following before proceeding with any deployment stage.
Required
- EKS cluster (K8s 1.32+, DRA 1.35 GA)
- kubectl installed with cluster access
- Helm 3.x installed
- vLLM or llm-d based model serving Pods deployed
Recommended
- AWS Load Balancer Controller installed (for automatic NLB creation)
- Langfuse deployed (refer to Langfuse Deployment Guide)
- Production environment: ACM certificate issued (for CloudFront + TLS)
Next Steps
- Get Started: Navigate to Basic Deployment to begin kgateway installation.
- Understand Architecture: Read Inference Gateway Routing before deployment to grasp the overall structure.
- Prepare Monitoring: Configure observability stack by referring to Langfuse Deployment Guide.
References
- Inference Gateway Routing - kgateway architecture and routing strategy details
- Langfuse Deployment Guide - Helm installation, OTel integration, Redis/ClickHouse configuration
- Agent Monitoring - Langfuse architecture and components
- Kubernetes Gateway API Official Documentation
- kgateway Official Documentation
- Bifrost Official Documentation