Inference Gateway Deployment Guide

This document covers production deployment procedures for a kgateway + Bifrost-based inference gateway. For architecture concepts and routing strategies (Cascade, Semantic Router, 2-Tier structure), see Inference Gateway Routing.

Guide Structure

This guide consists of three documents. Work through them in order, or jump to the sections you need.

Production Inference Pipeline Reference Architecture

Complete request flow of the production inference pipeline on EKS Auto Mode: CloudFront (WAF/Shield) → NLB → kgateway, where ExtProc analyzes each prompt to determine LLM routing; the request then passes through the Bifrost governance layer and llm-d KV Cache-aware routing to reach the optimal model.


Deployment Stages Overview


Basic Deployment (Required)

Configure kgateway + HTTPRoute + Bifrost behind a single NLB endpoint to complete the basic inference pipeline.

Includes:

  • kgateway installation and Gateway API CRD configuration
  • GatewayClass, Gateway, HTTPRoute resource definitions
  • Cross-namespace access via ReferenceGrant
  • Bifrost Gateway Mode configuration (config.json + PVC)
  • provider/model format and IDE compatibility (Aider, Cline, Continue.dev)
  • SQLite initialization procedure (when config.json changes)
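As a minimal sketch of the Gateway API resources the steps above define — the names (`inference-gw`, `gateway-system`, `bifrost`), hostnames, and ports are placeholders, not values from this guide — a cross-namespace setup might look like:

```yaml
# Gateway owned by the platform team; kgateway provisions the data plane,
# and the AWS Load Balancer Controller fronts it with an NLB.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gw          # placeholder name
  namespace: gateway-system   # placeholder namespace
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 8080
      allowedRoutes:
        namespaces:
          from: All
---
# HTTPRoute sending OpenAI-compatible traffic to the Bifrost Service,
# which lives in a different namespace than the route.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: bifrost-route
  namespace: gateway-system
spec:
  parentRefs:
    - name: inference-gw
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - name: bifrost       # Service in the bifrost namespace
          namespace: bifrost
          port: 8080
---
# ReferenceGrant in the backend namespace, permitting the
# cross-namespace backendRef above (otherwise the route is rejected).
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-to-bifrost
  namespace: bifrost
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: gateway-system
  to:
    - group: ""
      kind: Service
```

The ReferenceGrant is only required because the HTTPRoute and its backend Service sit in different namespaces; if both live in one namespace it can be omitted.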

Learning Time: 30 min | Deployment Time: 45 min
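Bifrost exposes an OpenAI-compatible endpoint and identifies models with the provider/model format noted above. A hypothetical chat-completions request body illustrating that convention (the model name `openai/gpt-4o` is an example, not a value prescribed by this guide):

```json
{
  "model": "openai/gpt-4o",
  "messages": [
    { "role": "user", "content": "Summarize this deployment guide." }
  ]
}
```

IDE clients such as Aider, Cline, and Continue.dev can point their OpenAI-compatible base URL at the single NLB endpoint and use the same provider/model naming.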


Advanced Features (Optional)

Add prompt-based automatic routing, a production security layer, and Semantic Caching to improve cost efficiency and security.

Includes:

  • LLM Classifier deployment (prompt-based SLM/LLM automatic branching)
  • CloudFront + WAF/Shield security layer
  • Semantic Caching implementation options (GPTCache, RedisVL, Portkey, Helicone)

Learning Time: 45 min | Deployment Time: 60-90 min


Troubleshooting (Reference)

Common issues encountered during deployment and operations, with their solutions.

Includes:

  • 404 Not Found (HTTPRoute/Gateway configuration errors)
  • Bifrost provider/model errors
  • Bifrost model name normalization issues
  • Langfuse Sub-path 404
  • OTel Trace not arriving

Reference Frequency: During deployment or when issues occur


Learning Paths

Quick Start (Development/Test Environment)

  1. Configure kgateway + Bifrost with Basic Deployment
  2. Refer to Troubleshooting when issues occur

Time Required: 1-2 hours


Production Configuration (Complete Pipeline)

  1. Configure basic infrastructure with Basic Deployment
  2. Add LLM Classifier + CloudFront/WAF + Semantic Caching in Advanced Features
  3. Refer to Troubleshooting during operations

Time Required: 3-4 hours


Prerequisites

Verify the following before starting any deployment stage.

Required

  • EKS cluster (K8s 1.32+, DRA 1.35 GA)
  • kubectl installed with cluster access
  • Helm 3.x installed
  • vLLM or llm-d based model serving Pods deployed
  • AWS Load Balancer Controller installed (for automatic NLB creation)
  • Langfuse deployed (refer to Langfuse Deployment Guide)
  • Production environment: ACM certificate issued (for CloudFront + TLS)
