Skip to main content

Inference Platform Benchmark: Bedrock AgentCore vs EKS Self-Managed

Created: 2026-03-18 | Status: Plan

Objective

Set Bedrock AgentCore as the default inference platform and quantitatively validate when and under what conditions self-managed EKS becomes necessary. Also compare performance/cost differences across LLM gateway (LiteLLM vs Bifrost) and cache-aware routing (llm-d) combinations for self-managed EKS.

Default Premise

Bedrock AgentCore is the default choice. As a managed service, AWS handles build time, operational burden, and scaling. Open-source/custom models are also supported via Custom Model Import, so model support alone does not justify self-management. Self-management is only justified when inference engine-level control, large-scale cost optimization, or cache routing is required.


Comparison Targets

ConfigurationDescriptionValidation Purpose
Baseline. AgentCore (Default Models)Immediately use Bedrock-provided modelsReference point
Baseline+. AgentCore (Custom Models)Serve custom models via Custom Model ImportCustom model performance/cost in managed environment
Alt A-1. EKS + LiteLLM + vLLMLiteLLM gateway, standard load balancingSelf-managed with existing ecosystem
Alt A-2. EKS + Bifrost + vLLMBifrost gateway, standard load balancingHigh-performance gateway effect validation
Alt B-1. EKS + LiteLLM + llm-d + vLLMLiteLLM + cache-aware routingValidate llm-d added value
Alt B-2. EKS + Bifrost + llm-d + vLLMBifrost + cache-aware routingValidate optimal combination

Architecture Configuration

Baseline:   Client → AgentCore Gateway → Bedrock Inference (Default Models)
Baseline+: Client → AgentCore Gateway → Bedrock Inference (Custom Import Models)

Alt A-1: Client → LiteLLM → kgateway (RoundRobin) → vLLM Pods
Alt A-2: Client → Bifrost → vLLM Pods (Bifrost load balancing)

Alt B-1: Client → LiteLLM → llm-d (Prefix-Cache Aware) → vLLM Pods
Alt B-2: Client → Bifrost → llm-d (Prefix-Cache Aware) → vLLM Pods
llm-d Connection Method

llm-d provides OpenAI-compatible endpoints, so both LiteLLM and Bifrost can integrate simply by pointing their base_url to the llm-d service. Gateway selection and llm-d integration are independent.


LLM Gateway Comparison: LiteLLM vs Bifrost

The gateway choice directly impacts platform performance and operations for self-managed EKS.

ItemLiteLLM (Python)Bifrost (Go)
Gateway OverheadHundreds of us/req~11 us/req (40-50x faster)
Memory FootprintBaseline~68% smaller
Provider Support100+20+ (major providers native)
Cost TrackingBuilt-inBuilt-in (hierarchical: key/team/customer)
ObservabilityLangfuse native integrationBuilt-in (request tracing, Prometheus)
Semantic CachingBuilt-inBuilt-in (~5ms hit)
GuardrailsBuilt-inBuilt-in
MCP Tool FilteringLimitedBuilt-in (per Virtual Key)
Governance (Virtual Keys)API Key managementHierarchical (key/team/customer budget/permissions)
Rate LimitingBuilt-inHierarchical (key/team/customer)
Fallback/Load BalancingBuilt-inBuilt-in
Web UIBuilt-inBuilt-in (real-time monitoring)
Langfuse IntegrationNative plugin (configuration only)Via OTel or Langfuse OpenAI SDK wrapper (app level)
Community/ReferencesMature (16k+ GitHub stars)Growing (3k+ GitHub stars)

Why Gateway Overhead Matters for Agentic AI

Agents make multiple sequential LLM calls within a single task. Gateway overhead accumulates with each call:

Agent 1 task = LLM call → Tool → LLM call → Tool → LLM call → Response
(gateway) (gateway) (gateway)

LiteLLM: ~300us x 5 calls = ~1.5ms cumulative
Bifrost: ~11us x 5 calls = ~0.055ms cumulative

As ratio of inference time (hundreds of ms to seconds): 1-3% vs 0.01-0.1%

Negligible for single requests, but high concurrency + agent multi-call environments may show tail latency differences.


AgentCore Provided Scope

AreaAgentCore ProvidedRequired for Self-Managed
Inference (Default Models)Claude, Llama, Mistral, etc. ready to usevLLM + GPU + model deployment
Inference (Custom Models)Custom Model Import / MarketplacevLLM + GPU + model deployment
ScalingAutomatic (managed)Karpenter + HPA/KEDA
Agent RuntimeBuilt-in Agent RuntimeLangGraph / Strands self-managed
MCP ConnectionBuilt-in MCP ConnectorDeploy/operate MCP servers
GuardrailsBedrock GuardrailsGateway built-in (Bifrost/LiteLLM)
ObservabilityCloudWatch integrationLangfuse + Bifrost/LiteLLM built-in + Prometheus
SecurityIAM native, VPC integrationPod Identity + NetworkPolicy
OperationsNone (managed)GPU monitoring, model updates, incident response

Validation Questions

#QuestionScenario
Q1Does AgentCore default model performance meet production SLAs?1
Q2How does Custom Model Import performance compare to direct vLLM serving?2
Q3What are Custom Model Import constraints? (quantization, batch strategy, etc.)2
Q4At what traffic scale does self-management become cost-effective?7
Q5Can AgentCore handle complex agent workflow requirements?5
Q6Is llm-d cache optimization effective enough to reverse cost differences?3, 6
Q7How responsive is AgentCore during burst traffic?9
Q8Is AgentCore isolation sufficient for multi-tenant environments?6
Q9Is the LiteLLM vs Bifrost gateway overhead significant in practice?4
Q10Does the Bifrost + llm-d combination operate stably?4

Test Environment

Region: us-east-1

Baseline (AgentCore Default Models):
- Bedrock Claude 3.5 Sonnet (on-demand + provisioned)
- Bedrock Llama 3.1 70B (on-demand)
- AgentCore Agent Runtime + MCP Connector
- Bedrock Guardrails, CloudWatch

Baseline+ (AgentCore Custom Models):
- Llama 3.1 70B fine-tuned model → Custom Model Import
- Same AgentCore runtime

Alt A-1 (EKS + LiteLLM + vLLM):
- EKS v1.32, Karpenter v1.2
- g5.2xlarge (A10G) x 4, vLLM v0.7.x
- Llama 3.1 70B (AWQ 4bit)
- LiteLLM v1.60+ → kgateway (RoundRobin)
- Langfuse v3.x + Prometheus

Alt A-2 (EKS + Bifrost + vLLM):
- Same EKS/vLLM configuration
- Bifrost (latest) → vLLM (Bifrost load balancing)
- Bifrost built-in observability + Prometheus

Alt B-1 (EKS + LiteLLM + llm-d + vLLM):
- Alt A-1 + llm-d v0.3+

Alt B-2 (EKS + Bifrost + llm-d + vLLM):
- Alt A-2 + llm-d v0.3+
- Bifrost base_url → llm-d service endpoint

Load Generation: Locust + LLMPerf

Test Scenarios

Scenario 1: Simple Inference — AgentCore Baseline Performance

  • Different prompt each time, input 500 / output 1000 tokens
  • Concurrency: 1, 10, 50, 100, 200
  • Target: Baseline (default models)
  • Validation: Do AgentCore TTFT, TPS meet production SLAs?

Scenario 2: Custom Model Import vs vLLM Direct Serving

  • Same model (Llama 3.1 70B) served on Baseline+ vs Alt A-1/A-2
  • Input 500 / output 1000 tokens, concurrency: 1, 10, 50, 100
  • Measured: TTFT, TPS, E2E Latency
  • Validation: Performance differences and constraints of Custom Import
    • Quantization option comparison (Import supported range vs vLLM AWQ/GPTQ/FP8)
    • Batch size / concurrent processing control availability
    • Model update turnaround time (Import redeployment vs vLLM rolling update)

Scenario 3: Repeated System Prompts — Caching Effect

  • 3 fixed system prompts (2000 tokens each) + only user input varies
  • Concurrency: 10, 50, 100
  • Target: Baseline (prompt caching) vs Alt A-1/A-2 vs Alt B-1/B-2 (llm-d)
  • Validation: Bedrock prompt caching vs llm-d prefix caching vs Bifrost semantic caching, TTFT/cost comparison

Scenario 4: Gateway Overhead — LiteLLM vs Bifrost

  • LiteLLM and Bifrost each used as gateway for the same vLLM backend
  • Concurrency: 1, 10, 50, 100, 500, 1000
  • With/without llm-d combinations: A-1 vs A-2, B-1 vs B-2
  • Measured: Gateway added latency (p50/p95/p99), memory usage, CPU usage, error rate
  • Validation:
    • Q9 — Does gateway overhead create significant differences at high concurrency?
    • Q10 — Does Bifrost → llm-d connection operate stably?
    • Cumulative overhead difference for agent multi-call (5 turns)

Scenario 5: Multi-turn Agent Workflow

  • 5-turn conversation + 3 tool calls (web search, DB query, calculation)
  • AgentCore: Agent Runtime + MCP Connector
  • EKS: LangGraph + MCP Server (Bifrost MCP tool filtering vs LiteLLM)
  • Validation: AgentCore Agent Runtime complex workflow handling capability, customization limits

Scenario 6: Multi-tenant

  • 5 tenants, each with different system prompts/guardrail policies
  • AgentCore: IAM-based isolation
  • EKS + LiteLLM: API Key-based isolation
  • EKS + Bifrost: Virtual Key hierarchical governance (per team/customer budget, permissions)
  • EKS + llm-d: Per-tenant cache routing
  • Validation: AgentCore isolation level vs EKS, Bifrost Virtual Key governance effectiveness

Scenario 7: Break-even Point Discovery

  • Gradual load increase: 1, 5, 10, 30, 50, 100 req/s
  • Monthly cost calculation for 6 configurations at each level
  • Validation: Derive precise cost crossover point

Scenario 8: Extended Operation (24h)

  • 30 req/s, maintained for 24 hours
  • Total cost, stability (error rate), performance variance
  • Validation: AgentCore cost predictability vs EKS GPU idle costs

Scenario 9: Burst Traffic

  • Normal 10 req/s → 100 req/s for 5 min → back to 10 req/s
  • Validation: AgentCore throttling/queuing behavior vs EKS Karpenter scale-out delay

Measured Metrics

CategoryMetricBaselineBaseline+A-1 (LiteLLM)A-2 (Bifrost)B-1 (LiteLLM+llm-d)B-2 (Bifrost+llm-d)
PerformanceTTFT (p50/p95/p99)OOOOOO
TPS (output tokens/sec)OOOOOO
E2E LatencyOOOOOO
Throughput (req/s)OOOOOO
Cold StartOOOOOO
GatewayGateway Added Latency--OOOO
Gateway Memory Usage--OOOO
Gateway CPU Usage--OOOO
CachingBedrock Prompt Caching SavingsOO----
Semantic Cache Hit Rate---O-O
KV Cache Hit Rate----OO
CostMonthly Total Cost (per traffic level)OOOOOO
Effective Cost per TokenOOOOOO
Idle Cost--OOOO
GovernanceTenant Isolation LevelOOOOOO
Budget/Rate Limit PrecisionOOOOOO
OperationsBuild TimeOOOOOO
Disaster Recovery TimeOOOOOO
Required Personnel/Skill SetOOOOOO

Cost Simulation

Fixed Costs (Monthly)

ItemBaselineBaseline+A-1/A-2B-1/B-2
GPU Instances (g5.2xlarge x4)--~$4,800~$4,800
EKS Cluster--$73$73
llm-d (CPU Pod)---~$50
Gateway (LiteLLM/Bifrost)--~$50~$50
Langfuse (self-hosted)--~$100~$100
Bedrock ProvisionedSeparate calculationSeparate calculation--

Variable Costs

ItemBaselineBaseline+A-1/A-2B-1/B-2
Billing MethodPer tokenPer tokenGPU time allocationGPU time allocation
Cache SavingsPrompt caching discountPrompt caching discountSemantic caching (Bifrost)KV cache + semantic caching
Idle CostNone (on-demand)None (on-demand)Charged during GPU idleCharged during GPU idle

Expected Cost Curve

Monthly Cost
^
| AgentCore On-Demand
| \
| \ / A-1 (LiteLLM+vLLM)
| \ / A-2 (Bifrost+vLLM)
| \ /
| AgentCore \ / B-1 (LiteLLM+llm-d)
| Provisioned\ / / B-2 (Bifrost+llm-d)
| \ / / /
| \ / / /
| \ / / /
| X / / <-- Break-even point
| / \ / /
| EKS Fixed Cost-/---\--/-/----------
| / \/
+-------------------------------------------> Traffic (req/s)
5 10 30 50 100
Traffic RangeRecommendationReason
Below break-evenAgentCore On-DemandNo GPU fixed costs, instant start
Around break-evenAgentCore ProvisionedDiscounted throughput, still managed
Above break-even + diverse promptsAlt A-2 (Bifrost)Low overhead, governance
Above break-even + repeated promptsAlt B-2 (Bifrost+llm-d)Cache effect + low overhead

Decision Flowchart


Conditions Justifying EKS Self-Management

Only consider self-management when AgentCore is insufficient

Self-managed EKS is justified when one or more of the following conditions apply.

ConditionReason
Fine-grained inference engine controlFree choice of vLLM scheduling, batch strategy, quantization (AWQ/GPTQ/FP8)
Large-scale traffic cost optimizationCost per token reversal above break-even point
KV cache routingMaximize TTFT/GPU efficiency with llm-d prefix cache
Multi-tenant governanceFine-grained per-team/customer budget/permission control with Bifrost Virtual Keys
Immediate latest model adoptionUse community latest models before Bedrock Import
Data sovereignty / Air-gappedEnvironments where Bedrock API calls are impossible

Observability Stack Configuration

The observability stack differs based on gateway choice for self-managed EKS.

LiteLLM-based (A-1, B-1)

Application (Langfuse SDK) ──→ Langfuse Server (Trace/Span)
LiteLLM ──→ Langfuse Server (native integration, request/cost logs)
vLLM + llm-d ──→ Prometheus → Grafana (GPU, KV cache metrics)

Bifrost-based (A-2, B-2)

Application (Langfuse SDK) ──→ Langfuse Server (Trace/Span)
Bifrost (OTel Plugin) ──→ OTLP Collector ──→ Langfuse Server (gateway-level traces)
Bifrost ──→ Prometheus → Grafana (cost/token/latency metrics)
Bifrost ──→ Bifrost Web UI (real-time monitoring)
vLLM + llm-d ──→ Prometheus → Grafana (GPU, KV cache metrics)
Langfuse is needed regardless of gateway

Bifrost's built-in observability monitors the gateway layer (requests/cost/latency). Full agent workflow tracing (connecting multi-calls, prompt quality evaluation, session tracking) is handled by Langfuse. The two layers are complementary, not replacements.


Result Report Structure (Planned)

SectionContent
Executive SummaryClear distinction between "when AgentCore is sufficient" and "when self-management is needed"
AgentCore Baseline PerformanceDefault model TTFT, TPS, Throughput baselines
Custom Import vs vLLMSame model performance/cost/constraint comparison
Gateway ComparisonLiteLLM vs Bifrost overhead, governance, stability
Caching Strategy ComparisonBedrock prompt caching vs Bifrost semantic caching vs llm-d prefix caching
Agent Runtime ComparisonAgentCore Runtime vs LangGraph capabilities/flexibility
Cost Break-even6-configuration cost graph per traffic range + crossover points
Observability StackPer-gateway observability configuration comparison
Decision GuideWorkload characteristics → optimal configuration flowchart
Migration PathWork and risks when transitioning from AgentCore → EKS