Deployment Strategies
Model Replacement Strategies
Shadow Testing
Concept: The new model receives a copy of real production traffic, but its responses are never delivered to users. Only the old model's responses are returned; the new model's outputs are collected purely for logging and evaluation.
When to Use:
- When you want to validate the new model's latency, error rate, and output quality without user-facing risk
- When cost burden is acceptable (2x cost per request)
Implementation Example (Python, LiteLLM): LiteLLM has no native shadow feature, so implement it directly:

```python
import asyncio
from litellm import acompletion

async def shadow_call(user_request):
    # Fire the old model (production) and the new model (shadow) concurrently
    old_task = acompletion(model="gpt-4", messages=user_request)
    new_task = acompletion(model="claude-3-5-sonnet-20241022", messages=user_request)
    old_resp, new_resp = await asyncio.gather(old_task, new_task, return_exceptions=True)

    # Log both responses for offline comparison (log_to_langfuse is your own helper)
    log_to_langfuse(user_request, old_resp, new_resp, shadow=True)

    # Return only the old model's response; a shadow-side failure surfaces
    # in the logs but never reaches the user
    if isinstance(old_resp, Exception):
        raise old_resp
    return old_resp
```
Advantages:
- No impact on user experience
- Testing with actual traffic patterns
Disadvantages:
- 2x cost
- Cannot collect user feedback (users don't see shadow responses)
Canary Rollout
Concept: Route a small percentage of traffic (e.g., 5%) to the new model and gradually increase the ratio.

5% → observe (24h) → if no issues, 25% → 50% → 100%
When to Use:
- When the new model is sufficiently validated, but full production replacement is high risk
- When fast rollback is needed upon regression detection
Implementation Example (LaunchDarkly): control model selection with a feature flag:

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Initialize the SDK once at process startup
ldclient.set_config(Config("your-sdk-key"))
ld_client = ldclient.get()

def get_model_for_user(user_id: str) -> str:
    context = Context.builder(user_id).kind("user").build()
    # Falls back to gpt-4 if the flag cannot be evaluated
    return ld_client.variation("llm-model-selection", context, default="gpt-4")
```

In the LaunchDarkly console, set the "llm-model-selection" flag to a percentage rollout of 5% claude-3-5-sonnet / 95% gpt-4.
Monitoring Criteria:
- Canary group vs. control group success rate (ratio of HTTP 200 responses)
- Latency P50/P99 difference
- User feedback (thumbs up/down) ratio
- Cost (token usage)
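These criteria assume the gateway exports per-model metrics in the first place. Below is a minimal sketch using the prometheus_client library; the metric and label names match the alert rule that follows, but they are this article's convention rather than a standard:

```python
# Sketch: exporting the canary metrics that the alert rule below queries.
# Metric and label names follow this article's convention; adjust to your stack.
from prometheus_client import Counter, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "LLM requests", ["model"])
LLM_SUCCESS = Counter("llm_success_total", "Successful LLM responses", ["model"])
LLM_LATENCY = Histogram("llm_latency_seconds", "LLM response latency (s)", ["model"])

def record_call(model: str, latency_s: float, ok: bool) -> None:
    LLM_REQUESTS.labels(model=model).inc()
    if ok:
        LLM_SUCCESS.labels(model=model).inc()
    LLM_LATENCY.labels(model=model).observe(latency_s)
```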
Automatic Rollback Trigger:
```yaml
# Example: Prometheus alerting rule (evaluated by Prometheus, routed via Alertmanager)
- alert: CanaryRegressionDetected
  expr: |
    (rate(llm_success_total{model="claude-3-5-sonnet"}[5m])
      / rate(llm_requests_total{model="claude-3-5-sonnet"}[5m]))
    < 0.95
  for: 10m
  annotations:
    summary: "Canary success rate below 95%, rollback needed"
```
Advantages:
- Progressive risk distribution
- Can collect real user feedback
Disadvantages:
- Longer deployment period (days to weeks)
- Monitoring infrastructure required
A/B Testing
Concept: Randomly split traffic into two groups (A: old model, B: new model) and statistically compare business metrics (conversion rate, user satisfaction, etc.).
When to Use:
- When you need to prove "Is the new model really better?" with statistical significance
- Marketing and UX optimization (e.g., prompt tone changes)
Experiment Design:
- Null Hypothesis: "No performance difference between new and old models"
- Alternative Hypothesis: "New model improves conversion rate by 5% or more"
- Sample Size Calculation: use an A/B test calculator (see References). Example: baseline 10%, detect a 5%p improvement, 80% power → 2,348 per group needed (see the sketch below)
- Experiment Duration: until sufficient samples are collected (typically 1-4 weeks)
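As a cross-check on any online calculator, the required sample size can also be computed in code. A minimal sketch using statsmodels is below; be aware that different tools assume different test types and effect-size conventions, so their figures (including the one quoted above) can differ substantially.

```python
# Sketch: per-group sample size for comparing two conversion rates.
# Results vary with the effect-size convention and test assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # conversion rate of the old model (A)
target = 0.15            # baseline + 5%p minimum detectable improvement
effect = proportion_effectsize(target, baseline)   # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required samples per group: {n_per_group:.0f}")
```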
Implementation Example (Unleash):

```typescript
import { initialize } from 'unleash-client';

const unleash = initialize({
  url: 'https://unleash.example.com/api',
  appName: 'agent-service',
  customHeaders: { Authorization: 'your-token' },
});

function selectModel(userId: string): string {
  const context = { userId };
  // 'ab-test-claude-vs-gpt' variants: 50% 'A', 50% 'B'
  const variant = unleash.getVariant('ab-test-claude-vs-gpt', context);
  return variant.name === 'B' ? 'claude-3-5-sonnet-20241022' : 'gpt-4';
}
```
Analysis: validate significance with a chi-square test after the experiment completes:

```python
from scipy.stats import chi2_contingency

# A: gpt-4, B: claude-3-5-sonnet
# Success/failure contingency table
obs = [[2100, 300],   # A: 2100 successes, 300 failures (87.5% success)
       [2200, 200]]   # B: 2200 successes, 200 failures (91.7% success)

chi2, p, dof, expected = chi2_contingency(obs)
print(f"p-value: {p:.4f}")
# p < 0.05 → the difference is statistically significant; since B's observed
# success rate is the higher one, B is significantly better
```
Advantages:
- Prove business impact with numbers
- Favorable for marketing, executive persuasion
Disadvantages:
- Long experiment period
- Statistical expertise required
- Sufficient traffic needed to ensure significance
Blue-Green Deployment
Concept: Operate old environment (Blue) and new environment (Green) simultaneously, then switch traffic to Green all at once. Immediately revert to Blue if issues occur.
When to Use:
- When replacing the model serving infrastructure itself (e.g., vLLM 0.5 → 0.6)
- When the change is to the runtime rather than the prompt
Implementation Example (Kubernetes Service + Deployments):
```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm
      version: blue
  template:
    metadata:
      labels:
        app: llm
        version: blue
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
---
# green-deployment.yaml (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm
      version: green
  template:
    metadata:
      labels:
        app: llm
        version: green
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
---
# service.yaml (initially points to blue)
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
    version: blue   # ← change this to 'green' to switch
  ports:
    - port: 8000
```
Switch Procedure:
- Complete the Green deployment and verify health checks (e.g., `kubectl rollout status deployment/llm-green`)
- Switch traffic: `kubectl patch svc llm-service -p '{"spec":{"selector":{"version":"green"}}}'`
- Monitor for 5 minutes → delete Blue if no issues
- If issues occur, roll back immediately by patching the selector back to `version: blue`
Advantages:
- Fastest rollback speed (seconds)
- Simple switch process
Disadvantages:
- 2x infrastructure cost (during switch period)
- No progressive validation (all-or-nothing)
Feature Flag-Based Prompt Rollout
LaunchDarkly
LaunchDarkly is an enterprise-grade Feature Flag platform.
Prompt Rollout Example:
```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

def get_prompt_version(user_id: str, org_id: str) -> int:
    context = (
        Context.builder(user_id)
        .kind("user")
        .set("org_id", org_id)
        .build()
    )
    # Flag 'prompt-version-financial' uses organization-level targeting.
    # Example: org_id='acme-corp' → version=5, all others → version=4
    return ld_client.variation("prompt-version-financial", context, default=4)
```
Kill Switch: revert all users to the safe version in an emergency. Forcing the 'prompt-version-financial' flag back to 4 in the LaunchDarkly console takes effect immediately for all users, with no code changes or redeployment.
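If the kill switch also needs to be scriptable (for example, from an incident runbook or the rollback webhook above), the same change can be made through LaunchDarkly's REST API. The sketch below is an assumption: endpoint and payload shapes should be verified against the current API documentation, and the token and project key are placeholders.

```python
# Hypothetical sketch: turning a flag off via LaunchDarkly's REST API.
# Endpoint and payload shapes are assumptions -- verify against the API docs.
import requests

API_TOKEN = "api-..."      # placeholder
PROJECT_KEY = "default"    # placeholder
FLAG_KEY = "prompt-version-financial"

resp = requests.patch(
    f"https://app.launchdarkly.com/api/v2/flags/{PROJECT_KEY}/{FLAG_KEY}",
    headers={
        "Authorization": API_TOKEN,
        # Semantic patch expresses intent rather than a raw JSON patch
        "Content-Type": "application/json; domain-model=launchdarkly.semanticpatch",
    },
    json={
        "environmentKey": "production",
        "instructions": [{"kind": "turnFlagOff"}],
    },
)
resp.raise_for_status()
```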
Targeting Rule Examples (see the sketch below for how the attributes are supplied):
- Beta users: `user.beta == true` → new version
- Specific region: `user.region == "us-east-1"` → canary version
- Organization tier: `user.tier == "enterprise"` → latest version first
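These rules match against attributes carried by the evaluation context, so the application must set them when building it. A minimal sketch (the attribute names mirror the example rules above and must match whatever is configured in the console):

```python
from ldclient import Context

# Attribute names are illustrative; they must match the targeting rules
# configured in the LaunchDarkly console
context = (
    Context.builder("user-123")
    .kind("user")
    .set("beta", True)
    .set("region", "us-east-1")
    .set("tier", "enterprise")
    .build()
)
```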
Unleash
Unleash is an open-source Feature Flag platform.
Advantages:
- Self-hosted capable
- Postgres backend, RBAC, audit log provided by default
Prompt Rollout:
```typescript
import { initialize } from 'unleash-client';

const unleash = initialize({
  url: 'https://unleash.internal.corp/api',
  appName: 'agent-gateway',
  customHeaders: { Authorization: 'token' },
});

function getPromptVariant(userId: string): string {
  const context = { userId, properties: { region: 'us-west-2' } };
  const variant = unleash.getVariant('prompt-experiment-2026-04', context);
  // variant.name: 'control', 'treatment-A', or 'treatment-B'.
  // payload holds the actual prompt text or version number; it may be
  // absent when the toggle is disabled, so guard the access
  // (the fallback value here is illustrative)
  return variant.payload?.value ?? 'control-prompt';
}
```
AWS AppConfig
AWS AppConfig supports Feature Flags and dynamic configuration.
Advantages:
- AWS native, integrates with Lambda/ECS/EKS
- Deployment strategy: Linear, Canary, All-at-once
- CloudWatch alarm-based automatic rollback
Example:
```python
import boto3
import json

# AppConfig data plane client (appconfigdata, not appconfig)
appconfig = boto3.client('appconfigdata')

session = appconfig.start_configuration_session(
    ApplicationIdentifier='agent-app',
    EnvironmentIdentifier='production',
    ConfigurationProfileIdentifier='prompt-config',
)
token = session['InitialConfigurationToken']

# Returns the configuration plus a NextPollConfigurationToken for later polls
config = appconfig.get_latest_configuration(ConfigurationToken=token)
prompt_config = json.loads(config['Configuration'].read())
print(prompt_config['version'])  # Example: 5
print(prompt_config['text'])
```
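In a long-running service you would poll for changes rather than fetch once. A minimal sketch continuing the example above (it reuses the `appconfig` client and `session`); AppConfig returns an empty body when nothing has changed, so the last good configuration is retained:

```python
import json
import time

# Continues the example above: reuses `appconfig` and `session`
latest = None
token = session['InitialConfigurationToken']

while True:
    resp = appconfig.get_latest_configuration(ConfigurationToken=token)
    token = resp['NextPollConfigurationToken']
    body = resp['Configuration'].read()
    if body:                      # empty body means "unchanged since last poll"
        latest = json.loads(body)
    time.sleep(resp.get('NextPollIntervalInSeconds', 60))
```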
Deployment Strategy:
```json
{
  "DeploymentStrategyId": "AppConfig.Canary10Percent20Minutes",
  "Description": "Deploy to 10% of users for 20 minutes, then expand"
}
```
AppConfig rolls back the deployment automatically when a CloudWatch alarm (e.g., LLMErrorRate exceeding a threshold) fires during the rollout.
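Wiring this up means attaching the alarm to the AppConfig environment as a monitor. A sketch, assuming an existing alarm on a custom LLMErrorRate metric and an IAM role AppConfig can assume (the IDs and ARNs are placeholders):

```python
import boto3

# Control-plane client (appconfig, not appconfigdata)
control = boto3.client('appconfig')

control.update_environment(
    ApplicationId='agent-app-id',          # placeholder IDs
    EnvironmentId='production-env-id',
    Monitors=[
        {
            'AlarmArn': 'arn:aws:cloudwatch:us-east-1:123456789012:alarm:LLMErrorRate',
            'AlarmRoleArn': 'arn:aws:iam::123456789012:role/AppConfigMonitorRole',
        }
    ],
)
```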
Deployment Strategy Comparison
| Strategy | Risk | Validation Speed | Cost | User Feedback | Rollback Speed |
|---|---|---|---|---|---|
| Shadow | None | Fast | 2x | Not possible | N/A |
| Canary | Low | Medium | 1x | Possible | Fast (minutes) |
| A/B | Medium | Slow | 1x | Possible | Medium (hours) |
| Blue-Green | High | Fast | 2x (during switch) | Possible | Very fast (seconds) |
Selection Guide:
- Initial validation: Shadow → Canary 5%
- Business impact measurement: A/B Testing
- Infrastructure replacement: Blue-Green
- Emergency rollback needed: Blue-Green + Canary combination
References
Feature Flag Platforms
- LaunchDarkly: launchdarkly.com
- Unleash: getunleash.io
- AWS AppConfig: AWS Documentation
Deployment Strategies
- Canary Deployment Pattern: martinfowler.com/bliki/CanaryRelease.html
- Blue-Green Deployment: martinfowler.com/bliki/BlueGreenDeployment.html
- Shadow Testing: Google SRE Workbook - Canarying Releases
Statistical Testing
- A/B Test Calculator: evanmiller.org/ab-testing
- scipy.stats: docs.scipy.org/doc/scipy/reference/stats.html
Next Steps
Once you've selected a deployment strategy:
- Governance & Automation — Build automatic regression detection and rollback system
- Prompt & Model Registry — Build version control system
- Agent Monitoring — Build real-time observability