Inference Gateway Deployment Guide

This document covers production deployment procedures for a kgateway + Bifrost-based inference gateway. For architecture concepts and routing strategies (Cascade, Semantic Router, 2-Tier structure), see Inference Gateway Routing.

Guide Structure

This guide consists of three documents. Work through them in order, or jump to the sections you need.

Production Inference Pipeline Reference Architecture

Complete request flow of the production inference pipeline on EKS Auto Mode: CloudFront (WAF/Shield) → NLB → kgateway, where ExtProc analyzes each prompt to determine LLM routing; the request then passes through the Bifrost governance layer and llm-d KV Cache-aware routing to reach the optimal model.


Deployment Stages Overview


Basic Deployment (Required)

Configure kgateway + HTTPRoute + Bifrost behind a single NLB endpoint to complete the basic inference pipeline.

Includes:

  • kgateway installation and Gateway API CRD configuration
  • GatewayClass, Gateway, HTTPRoute resource definitions
  • Cross-namespace access via ReferenceGrant
  • Bifrost Gateway Mode configuration (config.json + PVC)
  • provider/model format and IDE compatibility (Aider, Cline, Continue.dev)
  • SQLite initialization procedure (when config.json changes)
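As a minimal sketch of the Gateway API resources the steps above define — the names (`inference-gw`, `gateway-system`, `bifrost`), hostnames, and ports are placeholders, not values from this guide — a cross-namespace setup might look like:

```yaml
# Gateway owned by the platform team; kgateway provisions the data plane,
# and the AWS Load Balancer Controller fronts it with an NLB.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gw          # placeholder name
  namespace: gateway-system   # placeholder namespace
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 8080
      allowedRoutes:
        namespaces:
          from: All
---
# HTTPRoute sending OpenAI-compatible traffic to the Bifrost Service,
# which lives in a different namespace than the route.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: bifrost-route
  namespace: gateway-system
spec:
  parentRefs:
    - name: inference-gw
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: bifrost       # Service in the bifrost namespace
          namespace: bifrost
          port: 8080
---
# ReferenceGrant in the backend namespace, permitting the
# cross-namespace backendRef above (otherwise the route is rejected).
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-to-bifrost
  namespace: bifrost
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: gateway-system
  to:
    - group: ""
      kind: Service
```

The ReferenceGrant is only required because the HTTPRoute and its backend Service sit in different namespaces; if both live in one namespace it can be omitted.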

Learning Time: 30 min | Deployment Time: 45 min
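Bifrost exposes an OpenAI-compatible endpoint and identifies models with the provider/model format noted above. A hypothetical chat-completions request body illustrating that convention (the model name `openai/gpt-4o` is an example, not a value prescribed by this guide):

```json
{
  "model": "openai/gpt-4o",
  "messages": [
    { "role": "user", "content": "Summarize this deployment guide." }
  ]
}
```

IDE clients such as Aider, Cline, and Continue.dev can point their OpenAI-compatible base URL at the single NLB endpoint and use the same provider/model naming.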


Advanced Features (Optional)

Add prompt-based automatic routing, a production security layer, and Semantic Caching to improve cost efficiency and security.

Includes:

  • LLM Classifier deployment (prompt-based SLM/LLM automatic branching)
  • CloudFront + WAF/Shield security layer
  • Semantic Caching implementation options (GPTCache, RedisVL, Portkey, Helicone)

Learning Time: 45 min | Deployment Time: 60-90 min


Troubleshooting (Reference)

Common issues encountered during deployment and operations, with their solutions.

Includes:

  • 404 Not Found (HTTPRoute/Gateway configuration errors)
  • Bifrost provider/model errors
  • Bifrost model name normalization issues
  • Langfuse Sub-path 404
  • OTel Trace not arriving

Reference Frequency: During deployment or when issues occur


Learning Paths

Quick Start (Development/Test Environment)

  1. Configure kgateway + Bifrost with Basic Deployment
  2. Refer to Troubleshooting when issues occur

Time Required: 1-2 hours


Production Configuration (Complete Pipeline)

  1. Configure basic infrastructure with Basic Deployment
  2. Add LLM Classifier + CloudFront/WAF + Semantic Caching in Advanced Features
  3. Refer to Troubleshooting during operations

Time Required: 3-4 hours


Prerequisites

Verify the following before starting any deployment stage.

Required

  • EKS cluster (K8s 1.32+, DRA 1.35 GA)
  • kubectl installed with cluster access
  • Helm 3.x installed
  • vLLM or llm-d based model serving Pods deployed
  • AWS Load Balancer Controller installed (for automatic NLB creation)
  • Langfuse deployed (refer to Langfuse Deployment Guide)
  • Production environment: ACM certificate issued (for CloudFront + TLS)
