
AI Platform Selection Guide: Managed vs Open Source vs Hybrid

When customers begin building AI in-house, the first question they face is "Should we use managed services or build with open source?" This document provides a decision framework for choosing the optimal approach among SageMaker Unified Studio, Bedrock AgentCore, and an EKS-based open-source architecture, based on customer circumstances.

AI platform construction paths are broadly divided into three categories:

  • (A) AWS Managed: Start with no infrastructure operations using Bedrock + Strands SDK + AgentCore
  • (B) EKS + Open Source: Secure maximum control with self-hosting using vLLM, llm-d, Langfuse, etc.
  • (C) Hybrid: Achieve balance of cost, control, and speed by combining Bedrock and EKS

Prerequisites

Before reading this document, refer to the following:


AWS AI Platform Service Landscape

AWS AI services are structured into 4 tiers. Customers start at lower tiers and move to higher tiers as needed.

Key Tier Distinctions:

  • Tier 1-3: AWS managed services allow you to start without infrastructure operations.
  • Tier 4: Choose when fine-grained control, cost optimization, or data sovereignty is required.
  • Most customers start at Tier 1 and expand incrementally, while enterprises tend to combine Tier 3 and Tier 4 in hybrid configurations.

SageMaker Unified Studio

Integrated AI Development Environment

SageMaker Unified Studio is an integrated AI development environment released in H2 2024, designed to perform ML/data/analytics tasks in a single IDE. Previously, teams had to use fragmented tools like SageMaker Studio Classic, Athena, and Glue Studio separately, but Unified Studio consolidates them into one.

Key Differentiators

| Feature | Description | Improvement vs Previous |
|---|---|---|
| Unified IDE | JupyterLab + SQL Editor + no-code interface | Data+ML integration vs SageMaker Studio Classic |
| Built-in MLflow | Experiment tracking, model registry, model comparison | No need to operate a separate MLflow server |
| Lakehouse Integration | Apache Iceberg tables, Glue Catalog native integration | One-stop data engineering → ML pipeline |
| Governance Collaboration | Amazon DataZone-based IAM sharing, data lineage tracking | Secure data/model sharing between teams |
| Unified Compute | Manage training, notebooks, pipelines in a single environment | Prevents resource fragmentation |

Positioning: When to Choose?

Key Message

SageMaker Unified Studio is a development environment (Tier 2). It has a complementary relationship with Bedrock (inference) or EKS (serving), and provides the greatest value when data teams and ML teams need to collaborate on a single platform.


Platform Comparison Matrix

The optimal approach varies based on customer circumstances. Compare each platform option across 5 key evaluation dimensions.

AI Platform 5-Axis Comparison Matrix
| Evaluation Axis | Bedrock + AgentCore | SageMaker Unified Studio | EKS + Open Source | Hybrid |
|---|---|---|---|---|
| Cost Structure | Usage-based pricing, no GPU management | Instance + usage hybrid, notebook/training billed separately | Spot/MIG optimization, upfront investment needed | Bedrock + self-hosted SLM, ~66% savings with Cascade |
| Operational Burden | Minimal — AWS fully managed | Low — minimal infra management, focus on ML workflows | Medium — K8s/GPU ops capability needed (reduced with Auto Mode) | Medium — understanding of both environments required |
| Data Sovereignty | Processed within AWS region | VPC isolation, training data stays in S3 | Full control — model and data isolated within VPC | Selective isolation per workload |
| Customization | Limited — Bedrock-supported models, within Guardrails scope | MLflow, custom pipelines, fine-tuning supported | Fully flexible — any open model, LoRA, custom gateway | Selective expansion as needed |
| Time-to-Value | 2-4 weeks — start with API calls only | 4-8 weeks — environment setup + pipeline configuration | 2-4 months — cluster + GPU + model serving setup | 1-3 months — start on Bedrock, gradually expand to EKS |
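
The five axes above can be turned into a simple weighted scorecard for customer workshops. The sketch below is illustrative only: the per-option scores (1-5) and the example weights are assumptions to be tuned with the customer, not published benchmarks.

```python
# Weighted scoring over the five evaluation axes.
# Scores (1-5 per axis) and weights are illustrative assumptions.

OPTIONS = {
    "Bedrock + AgentCore":      {"cost": 3, "ops": 5, "sovereignty": 2, "custom": 2, "ttv": 5},
    "SageMaker Unified Studio": {"cost": 3, "ops": 4, "sovereignty": 3, "custom": 4, "ttv": 3},
    "EKS + Open Source":        {"cost": 4, "ops": 2, "sovereignty": 5, "custom": 5, "ttv": 2},
    "Hybrid":                   {"cost": 4, "ops": 3, "sovereignty": 4, "custom": 4, "ttv": 3},
}

def best_option(weights: dict) -> str:
    """Return the option with the highest weighted score."""
    def score(axes: dict) -> float:
        return sum(weights[axis] * value for axis, value in axes.items())
    return max(OPTIONS, key=lambda name: score(OPTIONS[name]))

# A speed-focused startup weights time-to-value and low ops burden heavily:
print(best_option({"cost": 1, "ops": 2, "sovereignty": 1, "custom": 1, "ttv": 3}))
# → Bedrock + AgentCore
```

Re-running with sovereignty and customization weighted highest instead would favor the EKS open-source option, which is the point of the exercise: the "best" platform is a function of the customer's priorities.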
Detailed Cost Analysis

For detailed cost comparison between self-hosting and Bedrock (break-even points, Cascade Routing savings), refer to Coding Tools Cost Analysis.


Decision Flowchart

A decision flow you can use in customer meetings. Find the optimal approach by answering key questions.

The Flowchart is a Starting Point

This flowchart is the starting point for conversation, not the final conclusion. Actual customer situations are complex, and most enterprises converge on a hybrid approach.


Starting points and expansion paths vary based on the customer's current AI/ML maturity level.

AI Platform Maturity Path
| Maturity Level | Characteristics | Recommended Stack | Core Services | Timeline |
|---|---|---|---|---|
| Level 1 — AI Explorer | No AI/ML workloads, need fast PoC | AWS Managed First | Bedrock API + Strands SDK + AgentCore | 2-4 weeks |
| Level 2 — AI Builder | Some ML in production, training pipelines needed | SageMaker + Bedrock Hybrid | SageMaker Unified Studio + Bedrock + S3/Glue | 1-3 months |
| Level 3 — AI Optimizer | Large-scale inference, cost pressure, custom models | EKS Open Architecture + Cascade Routing | EKS + vLLM/llm-d + kgateway + Bifrost + Langfuse | 3-6 months |

Detailed Guide by Level:


Hybrid Combination Patterns

Most enterprises converge on hybrid approaches rather than a single path. Here are 4 proven combination patterns.

Hybrid Pattern Summary
| Pattern | Configuration | Best Fit Scenario | Complexity |
|---|---|---|---|
| Bedrock + EKS SLM | Bedrock (inference) + EKS self-hosted SLM (high-frequency) | Large-scale inference with urgent API cost reduction | ★★☆☆☆ |
| SageMaker Training + EKS Serving | SageMaker (training/experimentation) + EKS + vLLM (serving) | Organizations with separate ML and serving teams | ★★★☆☆ |
| AgentCore + Self-hosted Models | AgentCore (agent runtime) + EKS (custom model inference) | AWS-managed agent operations with self-hosted models | ★★★☆☆ |
| Full Stack | Unified Studio (dev) + Bedrock (external) + EKS (self-hosted) + AgentCore (ops) | Enterprise AI CoE, full AI lifecycle management | ★★★★☆ |

Pattern 1: Bedrock + EKS SLM (Cascade Routing)

When to use: When monthly inference volume exceeds 500K requests and 60-70% of requests are simple tasks (code completion, translation, summarization)

Core value: Maintain Bedrock API quality while reducing costs by 40-60%

Reference: Inference Gateway & Cascade Routing
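
The routing logic behind this pattern can be sketched in a few lines. This is a minimal illustrative sketch, not the gateway's actual API: the task taxonomy, length threshold, and the two stub model calls are all assumptions standing in for a real SLM endpoint on EKS and a real Bedrock invocation.

```python
# Minimal cascade-routing sketch: cheap simple tasks go to a self-hosted
# SLM; everything else escalates to Bedrock. All names are illustrative.
from dataclasses import dataclass

@dataclass
class RouteResult:
    model: str    # which tier served the request
    answer: str

SIMPLE_TASKS = {"completion", "translation", "summarization"}

def call_slm(prompt: str) -> str:
    """Stand-in for a self-hosted SLM on EKS (e.g. behind a vLLM endpoint)."""
    return f"slm:{prompt[:20]}"

def call_bedrock(prompt: str) -> str:
    """Stand-in for a Bedrock model invocation."""
    return f"bedrock:{prompt[:20]}"

def route(prompt: str, task: str) -> RouteResult:
    # The 60-70% of traffic that is simple (and short) stays on the SLM.
    if task in SIMPLE_TASKS and len(prompt) < 2000:
        return RouteResult("eks-slm", call_slm(prompt))
    return RouteResult("bedrock", call_bedrock(prompt))

print(route("Translate to French: hello", "translation").model)  # → eks-slm
print(route("Design a migration plan for ...", "reasoning").model)  # → bedrock
```

Production cascade routers typically add a confidence check (escalate when the SLM's answer scores below a threshold), but the cost lever is the same: keep the bulk of simple traffic off the per-token API.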


Pattern 2: SageMaker Training + EKS Serving

When to use: When training custom models and minimizing inference costs

Core value: SageMaker's managed training environment + EKS cost-efficient serving

Reference: SageMaker-EKS Integration


Pattern 3: AgentCore + Self-Hosted Models

When to use: When operating Agent runtime serverlessly but self-hosting specific domain models

Core value: AgentCore's serverless operability + custom model domain accuracy

Reference: AWS Native Platform


Pattern 4: Full Stack (SageMaker + Bedrock + EKS)

The most complex but most flexible pattern:

  • Data & Training: SageMaker Unified Studio + Pipelines
  • Production Inference: Bedrock API (high-reliability tasks) + EKS vLLM (high-volume tasks)
  • Agent Runtime: AgentCore (serverless) + Kagent (Kubernetes-native)
  • Observability: CloudWatch (managed) + Langfuse (self-hosted)

This pattern is chosen by large enterprises to meet different requirements across teams. Due to high architectural complexity, clear operational responsibility boundaries and a service catalog are essential.

Reference: For technical implementation of hybrid architecture, refer to SageMaker-EKS Integration.


Cost Simulation Summary

Optimal options and estimated costs based on monthly inference volume.

| Monthly Inference Volume | Optimal Option | Est. Monthly Cost | Notes |
|---|---|---|---|
| ~100K requests | Bedrock API | ~$300-500 | No GPU management, fastest start |
| ~500K requests | Bedrock + Cascade | ~$800-1,200 | Start separating simple requests with SLM |
| ~1.5M requests | Hybrid transition point | ~$2,500-3,500 | Near self-hosting break-even |
| ~5M+ requests | EKS self-hosting | ~$3,500-5,000 | 60%+ savings with Spot + Cascade |
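
The break-even logic behind the table can be reproduced with back-of-envelope arithmetic. All prices below are illustrative placeholders (not AWS list prices): the API is modeled as purely usage-based, while self-hosting carries a fixed GPU fleet cost plus a small marginal cost per request.

```python
# Back-of-envelope break-even sketch for Bedrock API vs. EKS self-hosting.
# All dollar figures are illustrative placeholders, not AWS list prices.

def monthly_cost_bedrock(requests: int, cost_per_1k: float = 3.0) -> float:
    """Pure usage-based pricing: cost scales linearly with volume."""
    return requests / 1000 * cost_per_1k

def monthly_cost_eks(requests: int,
                     gpu_fixed: float = 3000.0,
                     cost_per_1k: float = 0.4) -> float:
    """Fixed GPU fleet cost plus a small marginal cost per request."""
    return gpu_fixed + requests / 1000 * cost_per_1k

def break_even_requests(cost_per_1k_bedrock: float = 3.0,
                        gpu_fixed: float = 3000.0,
                        cost_per_1k_eks: float = 0.4) -> int:
    """Volume at which self-hosting becomes cheaper than the API."""
    return round(gpu_fixed / (cost_per_1k_bedrock - cost_per_1k_eks) * 1000)

print(monthly_cost_bedrock(100_000))   # → 300.0  (matches the ~100K row)
print(monthly_cost_eks(5_000_000))     # → 5000.0 (matches the ~5M+ row)
print(break_even_requests())           # ≈ 1.15M requests/month
```

With these placeholder prices the break-even lands near the ~1.5M-request transition point in the table; plugging in the customer's actual token mix and instance pricing shifts it accordingly.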
Detailed Cost Analysis

For detailed analysis of instance costs, Spot savings rates, and Cascade Routing effects, refer to Coding Tools Cost Analysis.


Customer Discovery Checklist

10 key questions to identify the optimal approach in customer meetings.

  1. Are you currently operating AI/ML workloads? → Determine maturity level
  2. What is your monthly inference request volume? → Cost optimization path
  3. Do you need to self-host Open Weight models? → EKS necessity
  4. Do you have data sovereignty or VPC isolation requirements? → Self-hosting/hybrid
  5. Does your team have Kubernetes operations experience? → Assess operational burden
  6. Do you perform ML training and data engineering together? → SageMaker Unified Studio
  7. What is your monthly budget range? → Cost structure matching
  8. When is your target production deployment date? → Time-to-value path
  9. Do you have multi-cloud or on-premises hybrid requirements? → EKS Hybrid Nodes
  10. What AWS services are you currently using? → Leverage existing investments
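
The first three checklist answers are usually enough to place a customer on the maturity path above. The scoring rules in this sketch are illustrative assumptions, not an official rubric; the 1.5M-request threshold reuses the break-even point from the cost table.

```python
# Sketch: mapping discovery answers to a maturity level from the
# AI Platform Maturity Path table. Thresholds are illustrative.

def maturity_level(running_ml_in_prod: bool,
                   monthly_requests: int,
                   needs_custom_models: bool) -> str:
    if not running_ml_in_prod:
        return "Level 1 - AI Explorer"    # fast PoC: Bedrock + AgentCore
    if monthly_requests >= 1_500_000 or needs_custom_models:
        return "Level 3 - AI Optimizer"   # EKS open architecture + Cascade
    return "Level 2 - AI Builder"         # SageMaker + Bedrock hybrid

print(maturity_level(False, 0, False))           # → Level 1 - AI Explorer
print(maturity_level(True, 5_000_000, True))     # → Level 3 - AI Optimizer
```

Questions 4-10 then refine the recommendation within a level, for example tipping a Level 2 customer toward VPC-isolated self-hosting or toward leveraging existing AWS service investments.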

References

Official Documentation

Papers / Technical Blogs