
Deployment Strategies

Model Replacement Strategies

Shadow Testing

Concept: The new model receives a copy of actual production traffic, but its responses are never delivered to users. Only the old model's responses are returned; the new model's outputs are collected purely for logging and evaluation.

When to Use:

  • When you want to validate the new model's latency, error rate, and output quality without user-facing risk
  • When the cost overhead is acceptable (2x per request, since every request is served twice)

Implementation Example (Python, LiteLLM): LiteLLM has no native shadow-traffic feature, so implement it directly:

import asyncio
from litellm import acompletion

async def shadow_call(user_request):
    # Old model (production)
    old_task = acompletion(model="gpt-4", messages=user_request)
    # New model (shadow)
    new_task = acompletion(model="claude-3-5-sonnet-20241022", messages=user_request)

    # return_exceptions=True: a shadow-side failure must not break production
    old_resp, new_resp = await asyncio.gather(old_task, new_task, return_exceptions=True)

    # Logging: compare both responses (log_to_langfuse is your own logging helper)
    log_to_langfuse(user_request, old_resp, new_resp, shadow=True)

    # Return only the old model's response to the user
    return old_resp
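
Design note: because asyncio.gather runs both calls concurrently, the user-visible latency is that of the slower call rather than the sum of the two, and the shadow call never alters what the caller receives.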

Advantages:

  • No impact on user experience
  • Testing with actual traffic patterns

Disadvantages:

  • 2x cost
  • Cannot collect user feedback (users don't see shadow responses)

Canary Rollout

Concept: Start with a small percentage of traffic (5%) and gradually increase the ratio.

5% → observe (24h) → if no issues, 25% → 50% → 100%
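
Before a feature-flag platform is in place, the split itself needs nothing more than deterministic bucketing. The sketch below is an illustrative assumption, not part of any canary tooling: it hashes the user ID into a stable bucket so the same user always stays in or out of the canary.

import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    # Hash the user ID into a stable bucket in [0, 100); the same user
    # always lands in the same bucket, so canary membership is sticky.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10000 / 100
    return bucket < percent

model = "claude-3-5-sonnet-20241022" if in_canary("user-42", 5.0) else "gpt-4"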

When to Use:

  • When the new model is sufficiently validated, but full production replacement is high risk
  • When fast rollback is needed upon regression detection

Implementation Example (LaunchDarkly): Control model selection with a Feature Flag

import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("your-sdk-key"))
ld_client = ldclient.get()

def get_model_for_user(user_id: str) -> str:
    context = Context.builder(user_id).kind("user").build()
    model = ld_client.variation("llm-model-selection", context, default="gpt-4")
    return model

# In the LaunchDarkly console, set the "llm-model-selection" flag to
# 5% claude-3-5-sonnet / 95% gpt-4

Monitoring Criteria:

  • Canary group vs control group success rate (ratio of HTTP 200 responses)
  • Latency P50/P99 difference
  • User feedback (thumbs up/down) ratio
  • Cost (token usage)
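
These criteria can be checked programmatically. A minimal sketch against the Prometheus HTTP API, assuming the metric names used by the alert rule below and a hypothetical 2-percentage-point tolerance:

import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed endpoint

def success_rate(model: str) -> float:
    # Instant query: share of successful LLM calls over the last 5 minutes
    query = (
        f'sum(rate(llm_success_total{{model="{model}"}}[5m]))'
        f' / sum(rate(llm_requests_total{{model="{model}"}}[5m]))'
    )
    data = requests.get(PROM_URL, params={"query": query}, timeout=10).json()
    samples = data["data"]["result"]
    return float(samples[0]["value"][1]) if samples else 0.0

canary = success_rate("claude-3-5-sonnet")
control = success_rate("gpt-4")
if canary < control - 0.02:  # assumed tolerance: 2 percentage points
    print("Canary regression detected - consider rolling back")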

Automatic Rollback Trigger:

# Example: Prometheus alerting rule (alerts are routed via Alertmanager)
- alert: CanaryRegressionDetected
  expr: |
    (rate(llm_success_total{model="claude-3-5-sonnet"}[5m])
     / rate(llm_requests_total{model="claude-3-5-sonnet"}[5m]))
    < 0.95
  for: 10m
  annotations:
    summary: "Canary success rate below 95%, rollback needed"
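
To close the loop, Alertmanager can POST firing alerts to a webhook. A hedged sketch of a receiver (the /alerts route and the rollback_canary helper are hypothetical placeholders):

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/alerts")  # configure this URL as a webhook receiver in Alertmanager
async def handle_alerts(request: Request):
    payload = await request.json()
    for alert in payload.get("alerts", []):
        if alert.get("labels", {}).get("alertname") == "CanaryRegressionDetected":
            rollback_canary()
    return {"status": "ok"}

def rollback_canary():
    # Placeholder: call your feature-flag platform's API to send
    # 100% of traffic back to the old model.
    ...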

Advantages:

  • Progressive risk distribution
  • Can collect real user feedback

Disadvantages:

  • Longer deployment period (days to weeks)
  • Monitoring infrastructure required

A/B Testing

Concept: Randomly split traffic into two groups (A: old model, B: new model) and statistically compare business metrics (conversion rate, user satisfaction, etc.).

When to Use:

  • When you need to prove "Is the new model really better?" with statistical significance
  • Marketing, UX optimization (prompt tone changes, etc.)

Experiment Design:

  1. Null Hypothesis: "There is no performance difference between the new and old models"
  2. Alternative Hypothesis: "The new model improves the conversion rate by 5 percentage points or more"
  3. Sample Size Calculation: use an A/B test sample size calculator (see the sketch after this list)
    Example: baseline 10%, detect a 5 percentage point improvement, 80% power → roughly 700 per group (exact figures vary with the calculator's assumptions)
  4. Experiment Duration: Until sufficient samples are collected (typically 1-4 weeks)
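
A minimal sketch of the standard two-proportion sample size calculation (normal approximation, two-sided; calculators that use other formulas will give somewhat different numbers):

from scipy.stats import norm

def samples_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Two-sided z-test for the difference of two proportions
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

print(samples_per_group(0.10, 0.15))  # baseline 10% vs 15% → ~700 per group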

Implementation Example (Unleash):

import { initialize } from 'unleash-client';

const unleash = initialize({
  url: 'https://unleash.example.com/api',
  appName: 'agent-service',
  customHeaders: { Authorization: 'your-token' },
});

function selectModel(userId: string): string {
  const context = { userId };
  // 'ab-test-claude-vs-gpt' variants: 50% 'A', 50% 'B'
  const variant = unleash.getVariant('ab-test-claude-vs-gpt', context);
  return variant.name === 'B' ? 'claude-3-5-sonnet-20241022' : 'gpt-4';
}

Analysis: after the experiment completes, validate significance with a chi-square test

from scipy.stats import chi2_contingency

# A: gpt-4, B: claude-3-5-sonnet
# Success/failure contingency table
obs = [
    [2100, 300],  # A: 2100 successes, 300 failures
    [2200, 200],  # B: 2200 successes, 200 failures
]

chi2, p, dof, expected = chi2_contingency(obs)
print(f"p-value: {p}")
# p < 0.05 → the difference is statistically significant; read the direction
# from the raw rates (B: 91.7% vs A: 87.5% success)

Advantages:

  • Proves business impact with numbers
  • Persuasive for marketing and executive audiences

Disadvantages:

  • Long experiment period
  • Statistical expertise required
  • Sufficient traffic needed to ensure significance

Blue-Green Deployment

Concept: Operate old environment (Blue) and new environment (Green) simultaneously, then switch traffic to Green all at once. Immediately revert to Blue if issues occur.

When to Use:

  • When replacing the model serving infrastructure itself (vLLM 0.5 → 0.6)
  • Runtime changes rather than prompt changes

Implementation Example (Kubernetes Service + Ingress):

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm
      version: blue
  template:
    metadata:
      labels:
        app: llm
        version: blue
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
---
# green-deployment.yaml (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm
      version: green
  template:
    metadata:
      labels:
        app: llm
        version: green
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
---
# service.yaml (initially points to blue)
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
    version: blue  # ← change this to 'green' to switch
  ports:
    - port: 8000

Switch Procedure:

  1. Complete the Green deployment → verify health checks pass
  2. kubectl patch svc llm-service -p '{"spec":{"selector":{"version":"green"}}}'
  3. Monitor for 5 minutes → delete Blue if no issues appear
  4. If issues occur, roll back immediately: kubectl patch svc llm-service -p '{"spec":{"selector":{"version":"blue"}}}'
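
The switch can also be scripted. A minimal sketch with the official kubernetes Python client, reusing the service name and labels from the manifests above (namespace 'default' is an assumption):

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

def switch_traffic(version: str) -> None:
    # Patch only the Service selector; both Deployments keep running
    body = {"spec": {"selector": {"app": "llm", "version": version}}}
    v1.patch_namespaced_service(name="llm-service", namespace="default", body=body)

switch_traffic("green")   # cut over
# switch_traffic("blue")  # instant rollback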

Advantages:

  • Fastest rollback speed (seconds)
  • Simple switch process

Disadvantages:

  • 2x infrastructure cost (during switch period)
  • No progressive validation (all-or-nothing)

Feature Flag-Based Prompt Rollout

LaunchDarkly

LaunchDarkly is an enterprise-grade Feature Flag platform.

Prompt Rollout Example:

import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

def get_prompt_version(user_id: str, org_id: str) -> int:
    context = (
        Context.builder(user_id)
        .kind("user")
        .set("org_id", org_id)
        .build()
    )

    # Flag 'prompt-version-financial': organization-level targeting
    # Example: org_id='acme-corp' → version=5, others → version=4
    version = ld_client.variation("prompt-version-financial", context, default=4)
    return version

Kill Switch: revert all users to a safe version in an emergency

# Force 'prompt-version-financial' flag to 4 in LaunchDarkly console
# Applied immediately to all users without code changes

Targeting Rule Examples:

  • Beta users: user.beta == true → new version
  • Specific region: user.region == "us-east-1" → canary version
  • Organization tier: user.tier == "enterprise" → latest version priority
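
These rules are configured in the LaunchDarkly console; the SDK side only has to supply the matching attributes on the context. A sketch continuing the example above (attribute names mirror the rules listed):

context = (
    Context.builder("user-123")
    .kind("user")
    .set("beta", True)
    .set("region", "us-east-1")
    .set("tier", "enterprise")
    .build()
)
version = ld_client.variation("prompt-version-financial", context, default=4)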

Unleash

Unleash is an open-source Feature Flag platform.

Advantages:

  • Self-hosted capable
  • Postgres backend, RBAC, audit log provided by default

Prompt Rollout:

import { initialize } from 'unleash-client';

const unleash = initialize({
  url: 'https://unleash.internal.corp/api',
  appName: 'agent-gateway',
  customHeaders: { Authorization: 'token' },
});

function getPromptVariant(userId: string): string {
  const context = { userId, properties: { region: 'us-west-2' } };
  const variant = unleash.getVariant('prompt-experiment-2026-04', context);
  // variant.name: 'control', 'treatment-A', or 'treatment-B'
  // The payload carries the actual prompt text or version number
  return variant.payload?.value ?? 'control-prompt'; // fallback when the toggle is disabled
}

AWS AppConfig

AWS AppConfig supports Feature Flags and dynamic configuration.

Advantages:

  • AWS native, integrates with Lambda/ECS/EKS
  • Deployment strategy: Linear, Canary, All-at-once
  • CloudWatch alarm-based automatic rollback

Example:

import boto3
import json

appconfig = boto3.client('appconfigdata')

session = appconfig.start_configuration_session(
    ApplicationIdentifier='agent-app',
    EnvironmentIdentifier='production',
    ConfigurationProfileIdentifier='prompt-config',
)
session_token = session['InitialConfigurationToken']

config = appconfig.get_latest_configuration(ConfigurationToken=session_token)
prompt_config = json.loads(config['Configuration'].read())

print(prompt_config['version'])  # Example: 5
print(prompt_config['text'])
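
In a long-running service, note that get_latest_configuration returns an empty body when nothing has changed since the last poll, so keep the next token and cache the last non-empty payload. A minimal polling sketch continuing the example above:

import time

token = session['InitialConfigurationToken']
cached_config = None
while True:
    resp = appconfig.get_latest_configuration(ConfigurationToken=token)
    token = resp['NextPollConfigurationToken']  # always carry the newest token forward
    body = resp['Configuration'].read()
    if body:  # an empty body means the configuration is unchanged
        cached_config = json.loads(body)
    time.sleep(resp.get('NextPollIntervalInSeconds', 60))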

Deployment Strategy:

{
  "DeploymentStrategyId": "AppConfig.Canary10Percent20Minutes",
  "Description": "Deploy to 10% of users for 20 minutes, then expand"
}

AppConfig rolls back automatically when a CloudWatch alarm (e.g., LLMErrorRate > threshold) fires during the deployment.
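
The rollback hook works by attaching the CloudWatch alarm to the AppConfig environment as a monitor. A hedged sketch with the appconfig control-plane client (all ARNs and IDs below are placeholders):

import boto3

appconfig_ctl = boto3.client('appconfig')  # control plane, unlike 'appconfigdata' above

appconfig_ctl.update_environment(
    ApplicationId='app-id-placeholder',
    EnvironmentId='env-id-placeholder',
    Monitors=[{
        'AlarmArn': 'arn:aws:cloudwatch:us-east-1:123456789012:alarm:LLMErrorRate',
        'AlarmRoleArn': 'arn:aws:iam::123456789012:role/AppConfigMonitorRole',
    }],
)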


Deployment Strategy Comparison

| Strategy   | Risk   | Validation Speed | Cost               | User Feedback | Rollback Speed      |
|------------|--------|------------------|--------------------|---------------|---------------------|
| Shadow     | None   | Fast             | 2x                 | Not possible  | N/A                 |
| Canary     | Low    | Medium           | 1x                 | Possible      | Fast (minutes)      |
| A/B        | Medium | Slow             | 1x                 | Possible      | Medium (hours)      |
| Blue-Green | High   | Fast             | 2x (during switch) | Possible      | Very fast (seconds) |

Selection Guide:

  • Initial validation: Shadow → Canary 5%
  • Business impact measurement: A/B Testing
  • Infrastructure replacement: Blue-Green
  • Emergency rollback needed: Blue-Green + Canary combination


Next Steps

Once you've selected a deployment strategy:

  1. Governance & Automation — Build automatic regression detection and rollback system
  2. Prompt & Model Registry — Build version control system
  3. Agent Monitoring — Build real-time observability