Inference Platform Benchmark: Bedrock AgentCore vs EKS Self-Managed
Created: 2026-03-18 | Status: Plan
Objective
Set Bedrock AgentCore as the default inference platform and quantitatively validate when and under what conditions self-managed EKS becomes necessary. Also compare performance/cost differences across LLM gateway (LiteLLM vs Bifrost) and cache-aware routing (llm-d) combinations for self-managed EKS.
Bedrock AgentCore is the default choice. As a managed service, AWS handles build time, operational burden, and scaling. Open-source/custom models are also supported via Custom Model Import, so model support alone does not justify self-management. Self-management is only justified when inference engine-level control, large-scale cost optimization, or cache routing is required.
Comparison Targets
| Configuration | Description | Validation Purpose |
|---|---|---|
| Baseline. AgentCore (Default Models) | Immediately use Bedrock-provided models | Reference point |
| Baseline+. AgentCore (Custom Models) | Serve custom models via Custom Model Import | Custom model performance/cost in managed environment |
| Alt A-1. EKS + LiteLLM + vLLM | LiteLLM gateway, standard load balancing | Self-managed with existing ecosystem |
| Alt A-2. EKS + Bifrost + vLLM | Bifrost gateway, standard load balancing | High-performance gateway effect validation |
| Alt B-1. EKS + LiteLLM + llm-d + vLLM | LiteLLM + cache-aware routing | Validate llm-d added value |
| Alt B-2. EKS + Bifrost + llm-d + vLLM | Bifrost + cache-aware routing | Validate optimal combination |
Architecture Configuration
Baseline: Client → AgentCore Gateway → Bedrock Inference (Default Models)
Baseline+: Client → AgentCore Gateway → Bedrock Inference (Custom Import Models)
Alt A-1: Client → LiteLLM → kgateway (RoundRobin) → vLLM Pods
Alt A-2: Client → Bifrost → vLLM Pods (Bifrost load balancing)
Alt B-1: Client → LiteLLM → llm-d (Prefix-Cache Aware) → vLLM Pods
Alt B-2: Client → Bifrost → llm-d (Prefix-Cache Aware) → vLLM Pods
llm-d provides OpenAI-compatible endpoints, so both LiteLLM and Bifrost can integrate simply by pointing their base_url to the llm-d service. Gateway selection and llm-d integration are independent.
LLM Gateway Comparison: LiteLLM vs Bifrost
The gateway choice directly impacts platform performance and operations for self-managed EKS.
| Item | LiteLLM (Python) | Bifrost (Go) |
|---|---|---|
| Gateway Overhead | Hundreds of us/req | ~11 us/req (40-50x faster) |
| Memory Footprint | Baseline | ~68% smaller |
| Provider Support | 100+ | 20+ (major providers native) |
| Cost Tracking | Built-in | Built-in (hierarchical: key/team/customer) |
| Observability | Langfuse native integration | Built-in (request tracing, Prometheus) |
| Semantic Caching | Built-in | Built-in (~5ms hit) |
| Guardrails | Built-in | Built-in |
| MCP Tool Filtering | Limited | Built-in (per Virtual Key) |
| Governance (Virtual Keys) | API Key management | Hierarchical (key/team/customer budget/permissions) |
| Rate Limiting | Built-in | Hierarchical (key/team/customer) |
| Fallback/Load Balancing | Built-in | Built-in |
| Web UI | Built-in | Built-in (real-time monitoring) |
| Langfuse Integration | Native plugin (configuration only) | Via OTel or Langfuse OpenAI SDK wrapper (app level) |
| Community/References | Mature (16k+ GitHub stars) | Growing (3k+ GitHub stars) |
Why Gateway Overhead Matters for Agentic AI
Agents make multiple sequential LLM calls within a single task. Gateway overhead accumulates with each call:
Agent 1 task = LLM call → Tool → LLM call → Tool → LLM call → Response
(gateway) (gateway) (gateway)
LiteLLM: ~300us x 5 calls = ~1.5ms cumulative
Bifrost: ~11us x 5 calls = ~0.055ms cumulative
As ratio of inference time (hundreds of ms to seconds): 1-3% vs 0.01-0.1%
Negligible for single requests, but high concurrency + agent multi-call environments may show tail latency differences.
AgentCore Provided Scope
| Area | AgentCore Provided | Required for Self-Managed |
|---|---|---|
| Inference (Default Models) | Claude, Llama, Mistral, etc. ready to use | vLLM + GPU + model deployment |
| Inference (Custom Models) | Custom Model Import / Marketplace | vLLM + GPU + model deployment |
| Scaling | Automatic (managed) | Karpenter + HPA/KEDA |
| Agent Runtime | Built-in Agent Runtime | LangGraph / Strands self-managed |
| MCP Connection | Built-in MCP Connector | Deploy/operate MCP servers |
| Guardrails | Bedrock Guardrails | Gateway built-in (Bifrost/LiteLLM) |
| Observability | CloudWatch integration | Langfuse + Bifrost/LiteLLM built-in + Prometheus |
| Security | IAM native, VPC integration | Pod Identity + NetworkPolicy |
| Operations | None (managed) | GPU monitoring, model updates, incident response |
Validation Questions
| # | Question | Scenario |
|---|---|---|
| Q1 | Does AgentCore default model performance meet production SLAs? | 1 |
| Q2 | How does Custom Model Import performance compare to direct vLLM serving? | 2 |
| Q3 | What are Custom Model Import constraints? (quantization, batch strategy, etc.) | 2 |
| Q4 | At what traffic scale does self-management become cost-effective? | 7 |
| Q5 | Can AgentCore handle complex agent workflow requirements? | 5 |
| Q6 | Is llm-d cache optimization effective enough to reverse cost differences? | 3, 6 |
| Q7 | How responsive is AgentCore during burst traffic? | 9 |
| Q8 | Is AgentCore isolation sufficient for multi-tenant environments? | 6 |
| Q9 | Is the LiteLLM vs Bifrost gateway overhead significant in practice? | 4 |
| Q10 | Does the Bifrost + llm-d combination operate stably? | 4 |
Test Environment
Region: us-east-1
Baseline (AgentCore Default Models):
- Bedrock Claude 3.5 Sonnet (on-demand + provisioned)
- Bedrock Llama 3.1 70B (on-demand)
- AgentCore Agent Runtime + MCP Connector
- Bedrock Guardrails, CloudWatch
Baseline+ (AgentCore Custom Models):
- Llama 3.1 70B fine-tuned model → Custom Model Import
- Same AgentCore runtime
Alt A-1 (EKS + LiteLLM + vLLM):
- EKS v1.32, Karpenter v1.2
- g5.2xlarge (A10G) x 4, vLLM v0.7.x
- Llama 3.1 70B (AWQ 4bit)
- LiteLLM v1.60+ → kgateway (RoundRobin)
- Langfuse v3.x + Prometheus
Alt A-2 (EKS + Bifrost + vLLM):
- Same EKS/vLLM configuration
- Bifrost (latest) → vLLM (Bifrost load balancing)
- Bifrost built-in observability + Prometheus
Alt B-1 (EKS + LiteLLM + llm-d + vLLM):
- Alt A-1 + llm-d v0.3+
Alt B-2 (EKS + Bifrost + llm-d + vLLM):
- Alt A-2 + llm-d v0.3+
- Bifrost base_url → llm-d service endpoint
Load Generation: Locust + LLMPerf
Test Scenarios
Scenario 1: Simple Inference — AgentCore Baseline Performance
- Different prompt each time, input 500 / output 1000 tokens
- Concurrency: 1, 10, 50, 100, 200
- Target: Baseline (default models)
- Validation: Do AgentCore TTFT, TPS meet production SLAs?
Scenario 2: Custom Model Import vs vLLM Direct Serving
- Same model (Llama 3.1 70B) served on Baseline+ vs Alt A-1/A-2
- Input 500 / output 1000 tokens, concurrency: 1, 10, 50, 100
- Measured: TTFT, TPS, E2E Latency
- Validation: Performance differences and constraints of Custom Import
- Quantization option comparison (Import supported range vs vLLM AWQ/GPTQ/FP8)
- Batch size / concurrent processing control availability
- Model update turnaround time (Import redeployment vs vLLM rolling update)