
Inference Platform Benchmark: Bedrock AgentCore vs EKS Self-Hosted

Written: 2026-03-18 | Status: Plan

Objective

Make Bedrock AgentCore the default inference platform and quantitatively identify the conditions under which EKS self-hosting becomes necessary. Within EKS self-hosting, also compare performance/cost across combinations of LLM gateway (LiteLLM vs Bifrost) and cache-aware routing (llm-d).

Base Assumption

Bedrock AgentCore is the default choice. As a managed service, it takes build time, operational burden, and scaling off the team's plate. Open-source and custom models are also supported via Custom Model Import, so model support alone is not a reason to self-host. Self-hosting is justified only when inference-engine-level control, large-scale cost optimization, or cache routing is required.


Comparison Targets

| Configuration | Description | Validation Purpose |
|---|---|---|
| Baseline. AgentCore (Base Models) | Use Bedrock-provided models immediately | Reference point |
| Baseline+. AgentCore (Custom Models) | Serve custom models via Custom Model Import | Custom model performance/cost in a managed environment |
| Alt A-1. EKS + LiteLLM + vLLM | LiteLLM gateway, standard load balancing | Self-hosting based on the existing ecosystem |
| Alt A-2. EKS + Bifrost + vLLM | Bifrost gateway, standard load balancing | Validate high-performance gateway effect |
| Alt B-1. EKS + LiteLLM + llm-d + vLLM | LiteLLM + cache-aware routing | Validate llm-d additional effect |
| Alt B-2. EKS + Bifrost + llm-d + vLLM | Bifrost + cache-aware routing | Validate optimal combination |

Architecture Configuration

Baseline:   Client → AgentCore Gateway → Bedrock Inference (base models)
Baseline+: Client → AgentCore Gateway → Bedrock Inference (Custom Import models)

Alt A-1: Client → LiteLLM → kgateway (RoundRobin) → vLLM Pods
Alt A-2: Client → Bifrost → vLLM Pods (Bifrost load balancing)

Alt B-1: Client → LiteLLM → llm-d (Prefix-Cache Aware) → vLLM Pods
Alt B-2: Client → Bifrost → llm-d (Prefix-Cache Aware) → vLLM Pods

llm-d Connection Method

Since llm-d provides an OpenAI-compatible endpoint, both LiteLLM and Bifrost can integrate simply by pointing base_url to the llm-d service. Gateway selection and llm-d integration are independent.
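As a concrete sketch of that independence, the following Python snippet builds the same OpenAI-compatible request regardless of which backend `base_url` points at. The in-cluster service hostnames are hypothetical placeholders, not actual llm-d or vLLM defaults:

```python
import json

# Hypothetical in-cluster addresses -- actual service names depend on the deployment.
LLMD_BASE_URL = "http://llm-d-gateway.llm-d.svc.cluster.local/v1"
VLLM_BASE_URL = "http://vllm.inference.svc.cluster.local/v1"

def chat_request(base_url: str, model: str, messages: list) -> tuple[str, bytes]:
    """Build an OpenAI-compatible /chat/completions request for any backend."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body

# Routing through llm-d instead of plain vLLM is only a base_url swap;
# the gateway in front (LiteLLM or Bifrost) sees the same OpenAI schema either way.
url, body = chat_request(LLMD_BASE_URL, "llama-3.1-70b",
                         [{"role": "user", "content": "hello"}])
```

This is also why the benchmark matrix can vary gateway and routing layers independently.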


LLM Gateway Comparison: LiteLLM vs Bifrost

Gateway selection directly impacts platform performance and operations in EKS self-hosting.

| Item | LiteLLM (Python) | Bifrost (Go) |
|---|---|---|
| Gateway Overhead | hundreds of µs/req | ~11 µs/req (40-50x faster) |
| Memory Footprint | baseline | ~68% smaller |
| Provider Support | 100+ | 20+ (major providers natively) |
| Cost Tracking | built-in | built-in (hierarchical: key/team/customer) |
| Observability | Langfuse native integration | built-in (request tracking, Prometheus) |
| Semantic Caching | built-in | built-in (~5 ms hit) |
| Guardrails | built-in | built-in |
| MCP Tool Filtering | limited | built-in (per Virtual Key) |
| Governance (Virtual Keys) | API Key management | hierarchical (key/team/customer budgets/permissions) |
| Rate Limiting | built-in | hierarchical (key/team/customer) |
| Fallback/Load Balancing | built-in | built-in |
| Web UI | built-in | built-in (real-time monitoring) |
| Langfuse Integration | native plugin (config-only integration) | via OTel or Langfuse OpenAI SDK wrapper (app level) |
| Community/References | mature (16k+ GitHub stars) | growing (3k+ GitHub stars) |

Why Gateway Overhead Matters in Agentic AI

Agents make multiple sequential LLM calls within a single task. Gateway overhead accumulates per call:

One agent task = LLM call → tool → LLM call → tool → LLM call → response
                 (gateway)         (gateway)         (gateway)

LiteLLM: ~300 µs × 5 calls ≈ 1.5 ms cumulative
Bifrost: ~11 µs × 5 calls ≈ 0.055 ms cumulative

Ratio to a single response's inference time (hundreds of ms to seconds): ~1-3% vs ~0.01-0.1%

While negligible in single requests, tail latency differences can emerge in high-concurrency + agent multi-call environments.
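The arithmetic above fits in a few lines. The 300 µs / 11 µs per-call figures and the ~100 ms per-response inference time are the assumed values from this section, not measurements:

```python
def cumulative_overhead_ms(per_call_us: float, calls: int) -> float:
    """Total gateway time across an agent task: overhead is paid once per LLM call."""
    return per_call_us * calls / 1000.0

litellm_ms = cumulative_overhead_ms(300, 5)  # 1.5 ms over the 5-call task
bifrost_ms = cumulative_overhead_ms(11, 5)   # 0.055 ms over the 5-call task

# Share of an assumed ~100 ms inference response spent in the gateway layer
litellm_share = litellm_ms / 100   # 0.015 -> ~1.5%
bifrost_share = bifrost_ms / 100   # 0.00055 -> ~0.055%
```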


AgentCore Coverage

| Area | AgentCore Provides | Self-Hosting Requires |
|---|---|---|
| Inference (Base Models) | Claude, Llama, Mistral, etc., immediately usable | vLLM + GPU + model deployment |
| Inference (Custom Models) | Custom Model Import / Marketplace | vLLM + GPU + model deployment |
| Scaling | automatic (managed) | Karpenter + HPA/KEDA |
| Agent Runtime | Agent Runtime built-in | LangGraph / Strands built directly |
| MCP Connection | MCP connector built-in | MCP servers deployed/operated directly |
| Guardrails | Bedrock Guardrails | gateway built-in (Bifrost/LiteLLM) |
| Observability | CloudWatch integration | Langfuse + Bifrost/LiteLLM built-in + Prometheus |
| Security | IAM native, VPC integration | Pod Identity + NetworkPolicy |
| Operations | none (managed) | GPU monitoring, model updates, incident response |

Validation Questions

| # | Question | Scenario |
|---|---|---|
| Q1 | Does AgentCore base model performance meet the production SLA? | 1 |
| Q2 | How does Custom Model Import performance compare to direct vLLM serving? | 2 |
| Q3 | What are Custom Model Import constraints? (quantization, batch strategy, etc.) | 2 |
| Q4 | At what traffic scale does self-hosting become cost-effective? | 7 |
| Q5 | Can AgentCore handle agent workflow complexity? | 5 |
| Q6 | Is llm-d cache optimization effective enough to reverse cost differences? | 3, 6 |
| Q7 | What is AgentCore responsiveness under burst traffic? | 9 |
| Q8 | Is AgentCore isolation sufficient in multi-tenant environments? | 6 |
| Q9 | Is LiteLLM vs Bifrost gateway overhead significant in actual measurements? | 4 |
| Q10 | Does the Bifrost + llm-d combination operate stably? | 4 |

Test Environment

Region: us-east-1

Baseline (AgentCore base models):
- Bedrock Claude 3.5 Sonnet (on-demand + provisioned)
- Bedrock Llama 3.1 70B (on-demand)
- AgentCore Agent Runtime + MCP connector
- Bedrock Guardrails, CloudWatch

Baseline+ (AgentCore custom models):
- Llama 3.1 70B fine-tuned model → Custom Model Import
- Same AgentCore runtime

Alt A-1 (EKS + LiteLLM + vLLM):
- EKS v1.32, Karpenter v1.2
- g5.2xlarge (A10G) x 4, vLLM v0.7.x
- Llama 3.1 70B (AWQ 4bit)
- LiteLLM v1.60+ → kgateway (RoundRobin)
- Langfuse v3.x + Prometheus

Alt A-2 (EKS + Bifrost + vLLM):
- Same EKS/vLLM configuration
- Bifrost (latest) → vLLM (Bifrost load balancing)
- Bifrost built-in observability + Prometheus

Alt B-1 (EKS + LiteLLM + llm-d + vLLM):
- Alt A-1 + llm-d v0.3+

Alt B-2 (EKS + Bifrost + llm-d + vLLM):
- Alt A-2 + llm-d v0.3+
- Bifrost base_url → llm-d service endpoint

Load generation: Locust + LLMPerf
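Before wiring up Locust/LLMPerf, the measurement shape can be sketched with a stdlib-only stand-in: a bounded worker pool firing requests and summarizing latency percentiles. The backend call here is a sleep placeholder, not a real gateway request; the real harness would stream tokens and record TTFT and TPS separately:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def call_backend(prompt: str) -> float:
    """Stand-in for one gateway request; returns end-to-end latency in ms."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for network + inference time
    return (time.perf_counter() - start) * 1000

def run_load(concurrency: int, total_requests: int) -> dict:
    """Fire total_requests through a bounded worker pool and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(call_backend,
                                  (f"req-{i}" for i in range(total_requests))))
    q = statistics.quantiles(latencies, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

metrics = run_load(concurrency=10, total_requests=100)
```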

Test Scenarios

Scenario 1: Simple Inference — AgentCore Base Performance

  • Different prompts each time, input 500 / output 1000 tokens
  • Concurrency: 1, 10, 50, 100, 200
  • Target: Baseline (base models)
  • Validation: Do AgentCore TTFT and TPS meet the production SLA?

Scenario 2: Custom Model Import vs Direct vLLM Serving

  • Same model (Llama 3.1 70B) serving in Baseline+ vs Alt A-1/A-2
  • Input 500 / output 1000 tokens, concurrency: 1, 10, 50, 100
  • Measure: TTFT, TPS, E2E Latency
  • Validation: Custom Import performance differences and constraints
    • Quantization option comparison (Import support range vs vLLM AWQ/GPTQ/FP8)
    • Batch size / concurrent processing control availability
    • Model update time requirements (Import redeployment vs vLLM rolling update)

Scenario 3: Repeated System Prompts — Caching Effects

  • 3 system prompts (2000 tokens each) fixed + user input only changes
  • Concurrency: 10, 50, 100
  • Target: Baseline (prompt caching) vs Alt A-1/A-2 vs Alt B-1/B-2 (llm-d)
  • Validation: Bedrock prompt caching vs llm-d prefix caching vs Bifrost semantic caching, TTFT/cost comparison
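A rough model of the per-request input cost this scenario compares. The 90% cached-token discount is an assumed figure for illustration, not a quoted Bedrock price; substitute the provider's actual cached-token pricing:

```python
def input_cost_usd(system_tokens: int, user_tokens: int, rate_per_1k: float,
                   prefix_cached: bool, cache_discount: float = 0.9) -> float:
    """Input-token cost for one request. The fixed system prompt is billed at a
    discounted rate on a cache hit (discount value is an assumption)."""
    system_rate = rate_per_1k * (1 - cache_discount) if prefix_cached else rate_per_1k
    return system_tokens / 1000 * system_rate + user_tokens / 1000 * rate_per_1k

# Scenario 3 shape: 2000-token fixed system prompt + changing user input
cold = input_cost_usd(2000, 500, rate_per_1k=0.003, prefix_cached=False)  # 0.0075
warm = input_cost_usd(2000, 500, rate_per_1k=0.003, prefix_cached=True)   # 0.0021
```

llm-d prefix caching saves GPU time rather than billed tokens, so the same comparison on the EKS side shows up as TTFT and throughput rather than a per-token discount.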

Scenario 4: Gateway Overhead — LiteLLM vs Bifrost

  • Use LiteLLM and Bifrost as gateways for same vLLM backend
  • Concurrency: 1, 10, 50, 100, 500, 1000
  • llm-d presence combinations: A-1 vs A-2, B-1 vs B-2
  • Measure: Gateway additional latency (p50/p95/p99), memory usage, CPU usage, error rate
  • Validation:
    • Q9 — Does gateway overhead create meaningful differences at high concurrency?
    • Q10 — Does Bifrost → llm-d connection operate stably?
    • Cumulative overhead difference in agent multi-calls (5 turns)
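Gateway additional latency can be derived by differencing percentile distributions measured with and without the gateway in the path, under identical load. A minimal sketch on synthetic data:

```python
import statistics

def added_latency_ms(direct_ms: list, via_gateway_ms: list) -> dict:
    """Per-percentile gateway-added latency: same backend and load, measured
    once directly against vLLM and once through the gateway, then differenced."""
    result = {}
    for label, idx in (("p50", 49), ("p95", 94), ("p99", 98)):
        result[label] = (statistics.quantiles(via_gateway_ms, n=100)[idx]
                         - statistics.quantiles(direct_ms, n=100)[idx])
    return result

# Synthetic illustration: a gateway adding a uniform 0.3 ms appears in every percentile
direct = [100 + 0.1 * i for i in range(200)]
via_gateway = [x + 0.3 for x in direct]
deltas = added_latency_ms(direct, via_gateway)
```

In real measurements the tails rarely shift uniformly; divergence between the p50 and p99 deltas is exactly what Q9 is probing.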

Scenario 5: Multi-turn Agent Workflow

  • 5-turn conversation + 3 tool calls (web search, DB query, calculation)
  • AgentCore: Agent Runtime + MCP connector
  • EKS: LangGraph + MCP server (Bifrost MCP tool filtering vs LiteLLM)
  • Validation: AgentCore Agent Runtime complex workflow handling capability, customization limits

Scenario 6: Multi-tenant

  • 5 tenants, each with different system prompts/guardrail policies
  • AgentCore: IAM-based isolation
  • EKS + LiteLLM: API Key-based isolation
  • EKS + Bifrost: Virtual Key hierarchical governance (team/customer budget, permissions)
  • EKS + llm-d: Per-tenant cache routing
  • Validation: AgentCore isolation level vs EKS, Bifrost Virtual Key governance effect

Scenario 7: Break-even Point Exploration

  • Gradual load increase: 1, 5, 10, 30, 50, 100 req/s
  • Calculate monthly cost for 6 configurations at each level
  • Validation: Derive exact cost crossover point
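The crossover point this scenario looks for can be pre-estimated with a simple model. The fixed cost, tokens per request, and per-1k-token price below are illustrative placeholders, to be replaced with measured Scenario 7 numbers:

```python
MONTH_SECONDS = 30 * 24 * 3600

def ondemand_monthly_cost(rps: float, tokens_per_req: int, price_per_1k: float) -> float:
    """AgentCore on-demand spend: purely proportional to tokens served."""
    return rps * MONTH_SECONDS * tokens_per_req / 1000 * price_per_1k

def break_even_rps(fixed_monthly: float, tokens_per_req: int, price_per_1k: float) -> float:
    """Traffic level where on-demand token spend equals the EKS fixed cost."""
    return fixed_monthly / (MONTH_SECONDS * tokens_per_req / 1000 * price_per_1k)

# Illustrative inputs only: $5,023/mo fixed (cost table below), 1500 tokens/req
rps = break_even_rps(fixed_monthly=5023, tokens_per_req=1500, price_per_1k=0.003)
```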

Scenario 8: Long-running Operations (24h)

  • 30 req/s, maintain for 24 hours
  • Total cost, stability (error rate), performance variance
  • Validation: AgentCore cost predictability vs EKS GPU idle cost

Scenario 9: Burst Traffic

  • Normal 10 req/s → 5 minutes at 100 req/s → back to 10 req/s
  • Validation: AgentCore throttling/queuing behavior vs EKS Karpenter scale-out delay

Measurement Metrics

| Category | Metric | Baseline | Baseline+ | A-1 (LiteLLM) | A-2 (Bifrost) | B-1 (LiteLLM+llm-d) | B-2 (Bifrost+llm-d) |
|---|---|---|---|---|---|---|---|
| Performance | TTFT (p50/p95/p99) | O | O | O | O | O | O |
| | TPS (output tokens/sec) | O | O | O | O | O | O |
| | E2E Latency | O | O | O | O | O | O |
| | Throughput (req/s) | O | O | O | O | O | O |
| | Cold Start | O | O | O | O | O | O |
| Gateway | Gateway additional latency | - | - | O | O | O | O |
| | Gateway memory usage | - | - | O | O | O | O |
| | Gateway CPU usage | - | - | O | O | O | O |
| Caching | Bedrock prompt caching savings rate | O | O | - | - | - | - |
| | Semantic cache hit rate | - | - | - | O | - | O |
| | KV cache hit rate | - | - | - | - | O | O |
| Cost | Monthly total cost (by traffic) | O | O | O | O | O | O |
| | Effective cost per token | O | O | O | O | O | O |
| | Idle cost | - | - | O | O | O | O |
| Governance | Tenant isolation level | O | O | O | O | O | O |
| | Budget/Rate Limit precision | O | O | O | O | O | O |
| Operations | Build time | O | O | O | O | O | O |
| | Failure recovery time | O | O | O | O | O | O |
| | Required personnel/skillset | O | O | O | O | O | O |

Cost Simulation

Fixed Costs (Monthly)

| Item | Baseline | Baseline+ | A-1/A-2 | B-1/B-2 |
|---|---|---|---|---|
| GPU instances (g5.2xlarge x4) | - | - | ~$4,800 | ~$4,800 |
| EKS cluster | - | - | $73 | $73 |
| llm-d (CPU Pod) | - | - | - | ~$50 |
| Gateway (LiteLLM/Bifrost) | - | - | ~$50 | ~$50 |
| Langfuse (self-hosted) | - | - | ~$100 | ~$100 |
| Bedrock provisioned | calculated separately | calculated separately | - | - |

Variable Costs

| Item | Baseline | Baseline+ | A-1/A-2 | B-1/B-2 |
|---|---|---|---|---|
| Billing method | per token | per token | GPU time allocation | GPU time allocation |
| Cache savings | prompt caching discount | prompt caching discount | semantic caching (Bifrost) | KV cache + semantic caching |
| Idle cost | none (on-demand) | none (on-demand) | GPU idle charges | GPU idle charges |
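The idle-cost asymmetry can be made concrete: a fixed-cost cluster's per-token rate depends entirely on how many tokens it actually serves. The $5,023/mo figure is the illustrative A-1/A-2 fixed-cost sum from the table above:

```python
def effective_cost_per_1k(monthly_fixed_usd: float, tokens_served: float) -> float:
    """Self-hosted GPUs bill for time, not tokens, so the effective per-token
    rate is the fixed cost spread over whatever traffic actually arrived."""
    return monthly_fixed_usd / (tokens_served / 1000)

# Illustrative: the same ~$5,023/mo cluster at two traffic levels
busy = effective_cost_per_1k(5023, 3_000_000_000)  # high utilization
quiet = effective_cost_per_1k(5023, 300_000_000)   # 10x less traffic -> 10x the rate
```

On-demand AgentCore has no such lever: its per-token rate is constant, which is what produces the crossover shape below.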

Expected Cost Curve

Monthly Cost
 ^
 |                        AgentCore on-demand
 |                       /
 |                      /      AgentCore provisioned
 |                     /      /
 |                    /      /
 |                   X      X       <-- break-even points
 | ----------------/------/-----------  A-1/A-2 fixed cost
 | ---------------/------/------------  B-1/B-2 fixed cost (llm-d)
 |               /      /
 +--------------------------------------------> Traffic (req/s)
         5     10     30     50    100
| Traffic Range | Recommendation | Reason |
|---|---|---|
| Below break-even | AgentCore on-demand | No GPU fixed cost, immediate start |
| Near break-even | AgentCore provisioned | Discounted throughput, still managed |
| Above break-even + varied prompts | Alt A-2 (Bifrost) | Low overhead, governance |
| Above break-even + repeated prompts | Alt B-2 (Bifrost + llm-d) | Cache effect + low overhead |

Decision Flowchart

(Figure not reproduced: a flowchart mapping workload characteristics such as traffic level, prompt repetition, and governance needs to the recommended configuration.)

Conditions Justifying EKS Self-Hosting

Only consider self-hosting when AgentCore is insufficient

EKS self-hosting is justified when one or more of the following conditions apply.

| Condition | Reason |
|---|---|
| Fine-grained inference engine control | Free choice of vLLM scheduling, batch strategy, quantization (AWQ/GPTQ/FP8) |
| Large-scale traffic cost optimization | Per-token cost reversal above the break-even point |
| KV cache routing | Maximize TTFT/GPU efficiency with llm-d prefix cache |
| Multi-tenant governance | Fine-grained per-team/customer budget and permission control with Bifrost Virtual Keys |
| Immediate latest model adoption | Use the newest community models before they reach Bedrock Import |
| Data sovereignty / air gap | Environments where Bedrock API calls are impossible |

Observability Stack Configuration

Observability stack varies based on gateway selection in EKS self-hosting.

LiteLLM-based (A-1, B-1)

Application (Langfuse SDK) ──→ Langfuse Server (Trace/Span)
LiteLLM ──→ Langfuse Server (native integration, request/cost logs)
vLLM + llm-d ──→ Prometheus → Grafana (GPU, KV cache metrics)

Bifrost-based (A-2, B-2)

Application (Langfuse SDK) ──→ Langfuse Server (Trace/Span)
Bifrost (OTel Plugin) ──→ OTLP Collector ──→ Langfuse Server (gateway-level trace)
Bifrost ──→ Prometheus → Grafana (cost/token/latency metrics)
Bifrost ──→ Bifrost Web UI (real-time monitoring)
vLLM + llm-d ──→ Prometheus → Grafana (GPU, KV cache metrics)

Langfuse is needed regardless of gateway

Bifrost's built-in observability monitors the gateway layer (request/cost/latency). Complete agent workflow tracing (multi-call connections, prompt quality evaluation, session tracking) is handled by Langfuse. The two layers are complementary, not replacements.


Result Report Structure (Planned)

| Section | Content |
|---|---|
| Executive Summary | Clear distinction between "when AgentCore is sufficient" and "when self-hosting is needed" |
| AgentCore Base Performance | Base model TTFT, TPS, Throughput benchmarks |
| Custom Import vs vLLM | Same-model performance/cost/constraint comparison |
| Gateway Comparison | LiteLLM vs Bifrost overhead, governance, stability |
| Caching Strategy Comparison | Bedrock prompt caching vs Bifrost semantic caching vs llm-d prefix caching |
| Agent Runtime Comparison | AgentCore Runtime vs LangGraph features/flexibility |
| Cost Break-even | 6-configuration cost graph by traffic range + crossover points |
| Observability Stack | Observability configuration comparison by gateway |
| Decision Guide | Workload characteristics → optimal configuration flowchart |
| Migration Path | Tasks and risks when transitioning AgentCore → EKS |