Advanced Features

This document covers advanced configurations for production environments. Adding prompt-based automatic routing (LLM Classifier), a security layer (CloudFront + WAF/Shield), and cost optimization (Semantic Caching) completes the full inference pipeline.

Time Required

Learning: 45 min | Deployment: 60-90 min

Prerequisites

The components in this document assume Basic Deployment is complete. Verify that kgateway, HTTPRoute, and Bifrost are operational first.


1. LLM Classifier Deployment

1.1 Architecture Overview

The LLM Classifier is a lightweight Python/FastAPI router that operates behind kgateway. It receives OpenAI-compatible requests from clients (Aider, Cline, etc.), analyzes the prompt content, and automatically proxies them to the weak (SLM) or strong (LLM) backend.

Key Features:

  • Clients request with model: "auto" (or any model name) — unaware of model selection
  • Classification based on keyword matching + token length + conversation turn count
  • Direct trace transmission via Langfuse OTel SDK
  • Container image under 50MB (FastAPI + httpx)

1.2 Classification Logic (extproc_http.py)

"""LLM Classifier — Prompt-based automatic model routing"""
import os, httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

# --- Classification Settings ---
STRONG_KEYWORDS = [
"refactor", "architect", "design", "analyze", "optimize", "debug",
"migration", "complex", "performance", "security", "review",
]
TOKEN_THRESHOLD = 500
TURN_THRESHOLD = 5

# --- Backend Settings ---
WEAK_URL = os.getenv("WEAK_BACKEND", "http://qwen3-serving:8000")
STRONG_URL = os.getenv("STRONG_BACKEND", "http://glm5-serving:8000")

def classify(messages: list[dict]) -> str:
"""Analyze prompt content → decide weak / strong"""
content = " ".join(
m.get("content", "") for m in messages if m.get("content")
)
lower = content.lower()
# 1. Keyword matching
if any(kw in lower for kw in STRONG_KEYWORDS):
return "strong"
# 2. Input length
if len(content) > TOKEN_THRESHOLD:
return "strong"
# 3. Conversation turn count
if len(messages) > TURN_THRESHOLD:
return "strong"
return "weak"

@app.api_route("/v1/{path:path}", methods=["POST"])
async def proxy(path: str, request: Request):
body = await request.json()
messages = body.get("messages", [])
tier = classify(messages)
backend = STRONG_URL if tier == "strong" else WEAK_URL
target = f"{backend}/v1/{path}"

async with httpx.AsyncClient(timeout=300) as client:
if body.get("stream"):
req = client.build_request("POST", target, json=body)
resp = await client.send(req, stream=True)
return StreamingResponse(
resp.aiter_bytes(),
status_code=resp.status_code,
headers=dict(resp.headers),
)
resp = await client.post(target, json=body)
return resp.json()
Langfuse OTel Integration

Adding the OpenTelemetry SDK to the code above lets you record classification decisions and backend response times directly in Langfuse. Install the opentelemetry-sdk and opentelemetry-exporter-otlp packages and point OTEL_EXPORTER_OTLP_ENDPOINT at the Langfuse OTLP endpoint.
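
A minimal setup sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages and that OTEL_EXPORTER_OTLP_ENDPOINT points at your Langfuse OTLP endpoint (authentication headers, if your deployment requires them, go in OTEL_EXPORTER_OTLP_HEADERS). The attribute keys are illustrative choices, not part of the classifier above.

# Minimal OTel tracer setup sketch for the classifier
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())  # reads OTEL_EXPORTER_OTLP_ENDPOINT
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-classifier")

# Inside the proxy handler, wrap routing in a span:
# with tracer.start_as_current_span("classify") as span:
#     tier = classify(messages)
#     span.set_attribute("classifier.tier", tier)
#     span.set_attribute("classifier.backend", backend)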

1.3 Dockerfile

FROM python:3.11-slim
RUN pip install --no-cache-dir fastapi uvicorn httpx
COPY extproc_http.py /app/
WORKDIR /app
CMD ["uvicorn", "extproc_http:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "2"]

# Build and push to ECR
docker buildx build --platform linux/amd64 \
  -t <ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/llm-classifier:latest \
  --push .

1.4 K8s Deployment + Service

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-classifier
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-classifier
  template:
    metadata:
      labels:
        app: llm-classifier
    spec:
      containers:
      - name: classifier
        image: <ACCOUNT_ID>.dkr.ecr.us-east-2.amazonaws.com/llm-classifier:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: WEAK_BACKEND
          value: "http://qwen3-serving.ai-inference.svc.cluster.local:8000"
        - name: STRONG_BACKEND
          value: "http://glm5-serving.ai-inference.svc.cluster.local:8000"
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /docs
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /docs
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: llm-classifier
  namespace: ai-inference
spec:
  selector:
    app: llm-classifier
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  type: ClusterIP
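
Before wiring the HTTPRoute, a quick smoke test against the Service is useful. A minimal sketch, assuming a local port-forward (kubectl -n ai-inference port-forward svc/llm-classifier 8080:8080) and that the weak/strong backends from the basic deployment are reachable:

# Send one "simple" and one "complex" prompt through the classifier and
# confirm both return an OpenAI-compatible completion.
import httpx

BASE = "http://localhost:8080/v1/chat/completions"

for prompt in ["Hello", "Refactor this module and analyze the architecture"]:
    r = httpx.post(
        BASE,
        json={"model": "auto", "messages": [{"role": "user", "content": prompt}]},
        timeout=300,
    )
    print(prompt[:30], r.status_code, r.json()["choices"][0]["message"]["content"][:60])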

1.5 kgateway HTTPRoute Configuration

Route the /v1/* path to the LLM Classifier in kgateway. This replaces the direct vLLM route (or the Bifrost route) from the basic deployment.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-classifier-route
  namespace: ai-inference
spec:
  parentRefs:
  - name: unified-gateway
    namespace: ai-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/
    backendRefs:
    - name: llm-classifier
      port: 8080
    timeouts:
      request: 300s
      backendRequest: 300s

Timeout Settings

LLM inference can take tens of seconds. Set timeouts.request and timeouts.backendRequest generously (at least 120s for GLM-5 744B; 300s recommended).
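
On the client side, set a matching (or longer) timeout; for example, with the OpenAI Python SDK (endpoint placeholder as elsewhere in this guide):

from openai import OpenAI

# Match the HTTPRoute timeout so long GLM-5 completions are not cut off client-side.
client = OpenAI(base_url="http://<NLB_ENDPOINT>/v1", api_key="dummy", timeout=300)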

1.6 Aider/Cline Connection

With the LLM Classifier, all clients connect to a single endpoint. The model name can be any value (the Classifier ignores it and routes based on the prompt).

Aider

# LLM Classifier automatic branching — no double-prefix needed
OPENAI_API_BASE="http://<NLB_ENDPOINT>/v1" \
OPENAI_API_KEY="dummy" \
aider --model openai/auto

Cline

Settings -> API Provider -> OpenAI Compatible

  • Base URL: http://<NLB_ENDPOINT>/v1
  • Model: auto
  • API Key: dummy

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://<NLB_ENDPOINT>/v1",
    api_key="dummy",
)

# Simple request → Qwen3-4B (automatic)
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello"}],
)

# Complex request → GLM-5 744B (automatic)
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Refactor this code and analyze the architecture"}],
)

Advantages over Bifrost

The provider/model format (openai/glm-5) and Aider double-prefix trick (openai/openai/glm-5) required when routing via Bifrost are completely unnecessary. All clients connect with the same model: "auto".

1.7 Routing Endpoint Structure (With LLM Classifier)

http://<NLB_ENDPOINT>/v1/*          → LLM Classifier → Qwen3-4B or GLM-5 (automatic branching)
http://<NLB_ENDPOINT>/langfuse/*    → Langfuse (Observability UI)
http://<NLB_ENDPOINT>/_next/*       → Langfuse (Static Assets)
http://<NLB_ENDPOINT>/api/public/*  → Langfuse (API + OTel)
https://<AMG_ENDPOINT>              → Grafana (separate managed service)

2. CloudFront + WAF/Shield Security Layer

In production, do not expose the NLB directly; place CloudFront + WAF/Shield in front of it for DDoS defense, request filtering, and TLS termination.

Architecture

2.1 Configure NLB TLS Listener

Convert the existing HTTP Gateway listener to HTTPS. An ACM certificate is required.

# 1. Request ACM certificate (NLB region — us-east-2)
aws acm request-certificate \
--domain-name "api.your-company.com" \
--validation-method DNS \
--region us-east-2

# 2. Confirm ARN after DNS validation complete
export NLB_CERT_ARN=$(aws acm list-certificates --region us-east-2 \
--query "CertificateSummaryList[?DomainName=='api.your-company.com'].CertificateArn" \
--output text)

Update Gateway resource to HTTPS:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: unified-gateway
  namespace: ai-gateway
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    # TLS termination
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "${NLB_CERT_ARN}"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    # SG restriction: allow only CloudFront IP ranges
    service.beta.kubernetes.io/aws-load-balancer-security-groups: "${CF_RESTRICTED_SG_ID}"
spec:
  gatewayClassName: kgateway
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: nlb-tls-cert
    allowedRoutes:
      namespaces:
        from: All

Restrict NLB Security Group

The NLB's Security Group should allow only the CloudFront Managed Prefix List. Opening it to 0.0.0.0/0 will be blocked automatically by company policy.

# Look up the CloudFront Managed Prefix List ID
export CF_PREFIX_LIST_ID=$(aws ec2 describe-managed-prefix-lists \
  --filters "Name=prefix-list-name,Values=com.amazonaws.global.cloudfront.origin-facing" \
  --query "PrefixLists[0].PrefixListId" --output text)

# Allow only the CloudFront prefix list in the SG
aws ec2 authorize-security-group-ingress \
  --group-id ${CF_RESTRICTED_SG_ID} \
  --ip-permissions "IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=${CF_PREFIX_LIST_ID}}]"

2.2 Create WAF WebACL

# Create WAF WebACL (CLOUDFRONT-scope WebACLs must be created in us-east-1)
aws wafv2 create-web-acl \
  --name "inference-gateway-waf" \
  --scope CLOUDFRONT \
  --region us-east-1 \
  --default-action '{"Allow":{}}' \
  --rules '[
    {
      "Name": "AWSManagedRulesCommonRuleSet",
      "Priority": 1,
      "Statement": {
        "ManagedRuleGroupStatement": {
          "VendorName": "AWS",
          "Name": "AWSManagedRulesCommonRuleSet"
        }
      },
      "OverrideAction": {"None": {}},
      "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "CommonRuleSet"
      }
    },
    {
      "Name": "RateLimit",
      "Priority": 2,
      "Statement": {
        "RateBasedStatement": {
          "Limit": 2000,
          "AggregateKeyType": "IP"
        }
      },
      "Action": {"Block": {}},
      "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "RateLimit"
      }
    },
    {
      "Name": "AWSManagedRulesKnownBadInputsRuleSet",
      "Priority": 3,
      "Statement": {
        "ManagedRuleGroupStatement": {
          "VendorName": "AWS",
          "Name": "AWSManagedRulesKnownBadInputsRuleSet"
        }
      },
      "OverrideAction": {"None": {}},
      "VisibilityConfig": {
        "SampledRequestsEnabled": true,
        "CloudWatchMetricsEnabled": true,
        "MetricName": "KnownBadInputs"
      }
    }
  ]' \
  --visibility-config '{
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "InferenceGatewayWAF"
  }'

# Capture the WebACL ARN used by the CloudFront distribution (section 2.3)
export WAF_ACL_ARN=$(aws wafv2 list-web-acls --scope CLOUDFRONT --region us-east-1 \
  --query "WebACLs[?Name=='inference-gateway-waf'].ARN" --output text)

WAF Rule Configuration:

Rule                          | Purpose                                          | Configuration
AWSManagedRulesCommonRuleSet  | SQL Injection, XSS, general attack defense       | AWS Managed
RateLimit                     | Per-IP request limit                             | 2,000 req / 5 min (adjustable)
KnownBadInputsRuleSet         | Block Log4j and other known malicious patterns   | AWS Managed

2.3 Create CloudFront Distribution

# Confirm NLB DNS name
export NLB_DNS=$(kubectl get gateway unified-gateway -n ai-gateway \
  -o jsonpath='{.status.addresses[0].value}')

# Create CloudFront distribution
aws cloudfront create-distribution \
  --distribution-config "{
    \"CallerReference\": \"inference-gateway-$(date +%s)\",
    \"Origins\": {
      \"Quantity\": 1,
      \"Items\": [{
        \"Id\": \"nlb-origin\",
        \"DomainName\": \"${NLB_DNS}\",
        \"CustomOriginConfig\": {
          \"HTTPPort\": 80,
          \"HTTPSPort\": 443,
          \"OriginProtocolPolicy\": \"https-only\",
          \"OriginSslProtocols\": {\"Quantity\": 1, \"Items\": [\"TLSv1.2\"]}
        }
      }]
    },
    \"DefaultCacheBehavior\": {
      \"TargetOriginId\": \"nlb-origin\",
      \"ViewerProtocolPolicy\": \"https-only\",
      \"AllowedMethods\": {
        \"Quantity\": 7,
        \"Items\": [\"GET\",\"HEAD\",\"OPTIONS\",\"PUT\",\"POST\",\"PATCH\",\"DELETE\"],
        \"CachedMethods\": {\"Quantity\": 2, \"Items\": [\"GET\",\"HEAD\"]}
      },
      \"CachePolicyId\": \"4135ea2d-6df8-44a3-9df3-4b5a84be39ad\",
      \"OriginRequestPolicyId\": \"216adef6-5c7f-47e4-b989-5492eafa07d3\",
      \"Compress\": true
    },
    \"Enabled\": true,
    \"WebACLId\": \"${WAF_ACL_ARN}\",
    \"Comment\": \"Inference Gateway - kgateway + Bifrost\",
    \"PriceClass\": \"PriceClass_200\",
    \"ViewerCertificate\": {
      \"CloudFrontDefaultCertificate\": true
    }
  }"

Cache Policy

The LLM inference API (/v1/chat/completions) uses POST requests, so it is not cached by CloudFront. Use the CachingDisabled managed cache policy (4135ea2d-...) and the AllViewer origin request policy (216adef6-...) to forward all headers to the origin. Only Langfuse static assets (/_next/*) benefit from caching.
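
To confirm the behavior, compare the x-cache response header on a static asset versus an inference call; a rough sketch (the domain and asset path are illustrative placeholders):

# A repeated GET of a Langfuse static asset can show "Hit from cloudfront"
# if a caching behavior covers that path; POST inference calls are never cached.
import httpx

CF_DOMAIN = "dxxxxxxxxxxxxx.cloudfront.net"  # your distribution domain
asset = f"https://{CF_DOMAIN}/_next/static/css/app.css"  # illustrative asset path

for attempt in (1, 2):
    r = httpx.get(asset)
    print("asset", attempt, r.headers.get("x-cache"))

r = httpx.post(
    f"https://{CF_DOMAIN}/v1/chat/completions",
    json={"model": "auto", "messages": [{"role": "user", "content": "Hello"}]},
    timeout=300,
)
print("inference", r.headers.get("x-cache"))  # expected: Miss from cloudfront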

2.4 Shield Standard

CloudFront distributions automatically include AWS Shield Standard at no additional cost, providing L3/L4 DDoS protection.

For large-scale services, consider upgrading to Shield Advanced ($3,000/month):

  • L7 DDoS defense
  • AWS DDoS Response Team (DRT) support
  • WAF cost exemption
  • Cost protection (refund for scaling costs due to DDoS)

2.5 Change Client Endpoints

After deployment, access via CloudFront domain:

# Confirm CloudFront domain
export CF_DOMAIN=$(aws cloudfront list-distributions \
--query "DistributionList.Items[?Comment=='Inference Gateway - kgateway + Bifrost'].DomainName" \
--output text)

echo "Endpoint: https://${CF_DOMAIN}/v1"

IDE/Client Configuration Changes:

# Aider
OPENAI_API_BASE="https://${CF_DOMAIN}/v1" \
OPENAI_API_KEY="dummy" \
aider --model openai/auto

# Python SDK
from openai import OpenAI
client = OpenAI(
    base_url=f"https://{CF_DOMAIN}/v1",
    api_key="dummy",
)

2.6 Verification

# 1. Verify CloudFront → NLB → kgateway path
curl -s https://${CF_DOMAIN}/v1/models | jq .

# 2. Verify WAF operation (block SQL Injection patterns)
curl -s -o /dev/null -w "%{http_code}" \
"https://${CF_DOMAIN}/v1/models?id=1%20OR%201=1"
# Expected: 403 (WAF blocked)

# 3. Verify Rate Limit (the 2,000 req/5min rule triggers only once total
#    requests exceed the limit; scale the loop count accordingly)
for i in $(seq 1 100); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    https://${CF_DOMAIN}/v1/models &
done

# 4. Verify NLB direct access blocked (SG allows only CF prefix)
curl -s -o /dev/null -w "%{http_code}" \
"https://${NLB_DNS}/v1/models"
# Expected: timeout (direct access blocked)

2.7 Connection Path Summary

Before: Client → NLB (HTTP, public) → kgateway → Bifrost → vLLM
After:  Client → CloudFront (HTTPS, WAF/Shield) → NLB (HTTPS, CF only) → kgateway → Bifrost → vLLM

Segment                  | Protocol               | Security
Client → CloudFront      | HTTPS (TLS 1.2+)       | WAF rules + Shield Standard + Rate Limit
CloudFront → NLB         | HTTPS (TLS 1.2)        | SG: allow only CloudFront Prefix List
NLB → kgateway           | HTTP (inside cluster)  | VPC-internal communication, NetworkPolicy
kgateway → Bifrost/vLLM  | HTTP (inside cluster)  | Service-to-service communication

3. Semantic Caching Implementation Options (Advanced)

Concepts and Design Principles

For Semantic Caching concepts, similarity threshold design, cache key structure, and observability strategies, refer to Semantic Caching Strategy. This section covers tool comparisons and deployment configurations for actual implementation.

3.1 Implementation Tool Comparison (2026-04 baseline)

The major options below are organized based on official documentation and repositories. Features change rapidly, so always verify the official documentation at deployment time.

Tool                           | License                                     | Backend                                        | Key Advantages                                                                                    | Limitations                                                                                                          | Official Resources
GPTCache                       | OSS (MIT)                                   | Redis / Milvus / FAISS / SQLite                | Various backends, rich adapters, built for Semantic Cache from the start                         | Release frequency decreased after 2024; community-driven vs. LangChain/LiteLLM                                      | GitHub
Redis Semantic Cache (RedisVL) | OSS (MIT)                                   | Redis Stack / Redis 8+                         | Reuses existing Redis infrastructure, native SemanticCache class, built-in vector search         | Application must configure the embedding pipeline and TTL policies itself                                           | RedisVL — Semantic Cache
Portkey                        | SaaS + self-host (OSS gateway, Apache 2.0)  | Built-in store / Redis                         | All-in-one gateway (routing/guardrails/cache integrated), multi-tenant with Virtual Keys         | Advanced features depend on managed plans; self-host configuration is complex                                       | Portkey Semantic Cache
Helicone                       | OSS (Apache 2.0) / SaaS                     | ClickHouse (observability) + Redis/S3 (cache)  | Observability/logging and cache integrated, low latency with Rust gateway                        | Self-hosting the full stack has many dependencies; cache defaults to exact-match (semantic is an advanced feature)  | Helicone Caching
Bifrost + Redis                | OSS (Apache 2.0) + OSS Redis                | Redis                                          | Low latency with Go, customizable cache keys via CEL Rules, reuses existing Bifrost deployment   | Semantic Cache itself requires a plugin/sidecar you configure yourself                                              | Bifrost Documentation
LangCache (Redis Labs)         | Managed SaaS (Redis Enterprise)             | Redis Enterprise                               | Fully managed, includes embedding model and governance (GA H2 2025)                              | Enterprise only, region constraints, cost                                                                            | Redis LangCache

3.2 Tool Selection Decision Tree

3.3 Scenario-Based Recommendations

Scenario                        | Recommended Combination        | Reason
Existing EKS + Redis operations | Bifrost + Redis + RedisVL      | Reuse existing infrastructure without introducing new vendors
Managed + compliance            | Portkey managed or LangCache   | SOC2/HIPAA certifications, minimal operational burden
Observability priority         | Helicone                       | Cache, routing, and logging in a single product
Initial PoC / prototype         | LiteLLM + Redis (cache: true)  | Activates with 1-2 lines of configuration, fast validation
Strong open-source constraint   | GPTCache + Milvus              | No vendor lock-in, free backend choice

3.4 Gateway Integration Patterns

LiteLLM

Basic activation (exact-match):

# litellm_config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "redis-service.default.svc.cluster.local"
    port: 6379

Semantic Cache activation:

litellm_settings:
  cache: true
  cache_params:
    type: "redis-semantic-cache"
    host: "redis-service.default.svc.cluster.local"
    port: 6379
    similarity_threshold: 0.85
    embedding_model: "text-embedding-3-small"

For detailed options, refer to LiteLLM Caching documentation.

Bifrost + RedisVL Sidecar

Bifrost itself supports only exact-match caching, so Semantic Cache can be implemented in one of two ways.

Method A: Python proxy frontend — Deploy lightweight FastAPI proxy using RedisVL SemanticCache class in front of Bifrost

# SemanticCache lives in the llmcache extension (not session_manager);
# adjust the import path to your RedisVL version if needed.
from redisvl.extensions.llmcache import SemanticCache
from fastapi import FastAPI, Request
import httpx

app = FastAPI()

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://redis-service:6379",
    distance_threshold=0.15,  # 1 - similarity (0.85 similarity = 0.15 distance)
)

@app.post("/v1/{path:path}")
async def proxy(path: str, request: Request):
    body = await request.json()
    query = body["messages"][-1]["content"]

    # Semantic Cache lookup
    cached = cache.check(prompt=query)
    if cached:
        return {"choices": [{"message": {"content": cached[0]["response"]}}]}

    # MISS → Bifrost call
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"http://bifrost:8080/v1/{path}", json=body)
        result = resp.json()

    # Store response
    cache.store(prompt=query, response=result["choices"][0]["message"]["content"])
    return result

Method B: CEL Rules header-based branching — Use Bifrost CEL Rules to route only requests with x-cache-enabled: true header via Redis

{
  "plugins": [
    {
      "enabled": true,
      "name": "cel_rules",
      "config": {
        "rules": [
          {
            "condition": "request.header['x-cache-enabled'] == 'true'",
            "action": "route",
            "target": "redis-semantic-proxy"
          }
        ]
      }
    }
  ]
}
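
With this rule in place, a client opts into caching by sending the header the rule matches on. A sketch using the OpenAI SDK's default_headers option (the Bifrost address and openai/glm-5 model name follow the basic deployment):

from openai import OpenAI

# Requests carrying x-cache-enabled: true are routed by the CEL rule to the
# redis-semantic-proxy target; all other traffic bypasses the cache.
client = OpenAI(
    base_url="http://bifrost:8080/v1",
    api_key="dummy",
    default_headers={"x-cache-enabled": "true"},
)

resp = client.chat.completions.create(
    model="openai/glm-5",
    messages=[{"role": "user", "content": "Hello"}],
)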

Portkey

Portkey is an all-in-one gateway with built-in cache support.

import Portkey from "portkey-ai";

const portkey = new Portkey({
  apiKey: "YOUR_PORTKEY_API_KEY",
  config: {
    cache: {
      mode: "semantic",
      max_age: 3600, // TTL 1 hour
    },
    strategy: {
      mode: "fallback",
      targets: [
        { provider: "openai", model: "gpt-4o" },
        { provider: "anthropic", model: "claude-sonnet-4" },
      ],
    },
  },
});

const response = await portkey.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  model: "gpt-4o",
});

Cache policies can also be separated per tenant using Virtual Keys. For details, refer to the Portkey Semantic Cache documentation.

Helicone

Helicone controls cache with request headers.

curl https://oai.helicone.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_OPENAI_KEY" \
  -H "Helicone-Auth: Bearer YOUR_HELICONE_KEY" \
  -H "Helicone-Cache-Enabled: true" \
  -H "Helicone-Cache-Seed: prod-v1" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Semantic mode is an advanced feature; check the Helicone Caching documentation for availability.

3.5 Cache Key Design Example (YAML)

A pseudo-code example of how cache keys are generated in an actual implementation.

# Cache key generation logic (pseudo-code)
cache_key_components:
  model_id: "glm-5"                # Model type
  system_prompt_hash: "a3f2e1b"    # System prompt SHA256 (8 chars)
  tenant_id: "org-12345"           # Organization/tenant
  language: "ko"                   # Language
  tool_set_hash: "c9d8e7f"         # Agent tool set hash
  embedding: [0.12, -0.34, ...]    # User query embedding (stored in vector DB)

# Redis key format
redis_key: "cache:org-12345:ko:glm-5:a3f2e1b:c9d8e7f"
# The vector DB searches by embedding similarity → on a HIT above the threshold, the response is retrieved with redis_key
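
A minimal Python sketch of the same key-generation logic (the field names mirror the pseudo-code above; the hashing choices are illustrative):

# Build the Redis key from tenant, language, model, and short SHA256 hashes
# of the system prompt and tool set.
import hashlib

def make_cache_key(tenant_id, language, model_id, system_prompt, tool_set):
    sp_hash = hashlib.sha256(system_prompt.encode()).hexdigest()[:8]
    ts_hash = hashlib.sha256("".join(sorted(tool_set)).encode()).hexdigest()[:8]
    return f"cache:{tenant_id}:{language}:{model_id}:{sp_hash}:{ts_hash}"

# make_cache_key("org-12345", "ko", "glm-5", "You are ...", ["search", "calc"])
# -> "cache:org-12345:ko:glm-5:<sp_hash>:<ts_hash>"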

3.6 Pre-Deployment Checklist

  • Set initial threshold to 0.90 (conservative start)
  • Document TTL policies (apply differentially by domain)
  • Verify Guardrails (PII redaction) are placed before the cache
  • Add cache_hit and similarity_score tags to Langfuse traces
  • Verify fail-open behavior on Redis failure (see the sketch after this list)
  • Gradual rollout with A/B testing (traffic 10% → 50% → 100%)
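
A minimal fail-open sketch for the Redis-failure item above: cache errors are swallowed and the request falls through to the backend. The function and parameter names are illustrative.

# Fail-open wrapper: any cache error (connection refused, timeout, etc.)
# degrades to a normal backend call instead of failing the request.
def cached_completion(query, call_backend, cache):
    try:
        hit = cache.check(prompt=query)
        if hit:
            return hit[0]["response"]
    except Exception:
        pass  # cache unavailable → fail open, skip the lookup

    result = call_backend(query)

    try:
        cache.store(prompt=query, response=result)
    except Exception:
        pass  # storing is best-effort
    return result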

Next Steps

Advanced feature configuration is complete. Proceed to the next steps:

  1. Troubleshooting: If errors occurred during deployment, refer to Troubleshooting Guide.
  2. Enhanced Monitoring: Complete OTel integration and dashboards by referring to Langfuse Deployment Guide.
  3. Operational Processes: Establish production operations by referring to Agent Monitoring.
