AI Platform Selection Guide: Managed vs Open Source vs Hybrid
When customers begin building AI capabilities in-house, the first question they face is "Should we use managed services or build with open source?" This document provides a decision framework for choosing among SageMaker Unified Studio, Bedrock AgentCore, and an EKS-based open architecture, according to each customer's circumstances.
AI platform construction paths are broadly divided into three categories:
- (A) AWS Managed: Start with no infrastructure operations using Bedrock + Strands SDK + AgentCore
- (B) EKS + Open Source: Secure maximum control with self-hosting using vLLM, llm-d, Langfuse, etc.
- (C) Hybrid: Achieve balance of cost, control, and speed by combining Bedrock and EKS
Before reading this document, refer to the following:
- Platform Architecture — 6-layer core design blueprint
- Technical Challenges — 5 core challenge analysis
AWS AI Platform Service Landscape
AWS AI services are structured into 4 tiers. Customers start at lower tiers and move to higher tiers as needed.
Key Tier Distinctions:
- Tier 1-3: AWS managed services allow you to start without infrastructure operations.
- Tier 4: Choose when fine-grained control, cost optimization, or data sovereignty is required.
- Most customers start at Tier 1 and expand incrementally, while enterprises tend to combine Tier 3 and Tier 4 in hybrid configurations.
SageMaker Unified Studio
Integrated AI Development Environment
SageMaker Unified Studio is an integrated AI development environment, released in H2 2024, that brings ML, data, and analytics work into a single IDE. Previously, teams had to juggle fragmented tools such as SageMaker Studio Classic, Athena, and Glue Studio; Unified Studio consolidates them into one.
Key Differentiators
| Feature | Description | Improvement vs Previous |
|---|---|---|
| Unified IDE | JupyterLab + SQL Editor + No-code Interface | Data+ML integration vs SageMaker Studio Classic |
| Built-in MLflow | Experiment tracking, model registry, model comparison | No need to operate separate MLflow server |
| Lakehouse Integration | Apache Iceberg tables, Glue Catalog native integration | One-stop data engineering → ML pipeline |
| Governance Collaboration | Amazon DataZone-based IAM sharing, data lineage tracking | Secure data/model sharing between teams |
| Unified Compute | Manage training, notebooks, pipelines in single environment | Prevent resource fragmentation |
Positioning: When to Choose?
SageMaker Unified Studio is a development environment (Tier 2). It has a complementary relationship with Bedrock (inference) or EKS (serving), and provides the greatest value when data teams and ML teams need to collaborate on a single platform.
Platform Comparison Matrix
The optimal approach varies based on customer circumstances. Compare each platform option across 5 key evaluation dimensions.
For detailed cost comparison between self-hosting and Bedrock (break-even points, Cascade Routing savings), refer to Coding Tools Cost Analysis.
Decision Flowchart
A decision flow you can use in customer meetings. Find the optimal approach by answering key questions.
This flowchart is a starting point for conversation, not a final conclusion. Actual customer situations are more complex, and most enterprises converge on a hybrid approach.
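As an illustration only, the flowchart's first few questions can be encoded as a small decision helper. The function name and thresholds below are hypothetical, chosen to match the volume bands in the cost simulation later in this document; real engagements weigh many more factors (data sovereignty, budget, time-to-value, existing AWS footprint).

```python
def recommend_platform(monthly_requests: int,
                       needs_self_hosted_models: bool,
                       has_k8s_experience: bool) -> str:
    """Toy decision helper mirroring the flowchart's opening questions.

    Thresholds are illustrative, aligned with the cost bands used in
    this guide (~500K and ~1.5M requests/month).
    """
    if needs_self_hosted_models:
        # Open Weight self-hosting requires EKS either way; Kubernetes
        # experience decides between pure EKS and a hybrid split.
        return "EKS self-hosting" if has_k8s_experience else "Hybrid (Bedrock + EKS)"
    if monthly_requests < 500_000:
        return "Bedrock API"          # fastest start, no GPU management
    if monthly_requests < 1_500_000:
        return "Bedrock + Cascade"    # peel off simple requests to an SLM
    return "Hybrid (Bedrock + EKS)"   # near/past the self-hosting break-even
```

Treat the output as the opening recommendation for a discovery conversation, not a final architecture decision.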
Recommended Path by Customer Maturity
Starting points and expansion paths vary based on the customer's current AI/ML maturity level.
Detailed Guide by Level:
- Level 1 (Exploration): → AWS Native Platform
- Level 2 (Build): → SageMaker-EKS Integration
- Level 3 (Optimization): → EKS-based Open Architecture, Inference Gateway
Hybrid Combination Patterns
Most enterprises converge on hybrid approaches rather than a single path. Here are 4 proven combination patterns.
Pattern 1: Bedrock + EKS SLM (Cascade Routing)
When to use: When monthly inference volume exceeds 500K requests and 60-70% of requests are simple tasks (code completion, translation, summarization)
Core value: Maintain Bedrock API quality while reducing costs by 40-60%
Reference: Inference Gateway & Cascade Routing
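A minimal sketch of the cascade idea. The classifier and model-call parameters here are hypothetical stubs: in production the classifier would be a lightweight model or heuristic inside the inference gateway, `call_slm` would hit a vLLM endpoint on EKS, and `call_bedrock` the Bedrock runtime API.

```python
from typing import Callable

# Task types the guide identifies as cheap enough for a small model.
SIMPLE_TASKS = {"completion", "translation", "summarization"}

def cascade_route(prompt: str,
                  classify: Callable[[str], str],
                  call_slm: Callable[[str], str],
                  call_bedrock: Callable[[str], str]) -> str:
    """Route simple tasks to a cheap self-hosted SLM; escalate the rest.

    All three callables are injected stubs in this sketch, so the routing
    logic can be shown (and tested) without any AWS or GPU dependency.
    """
    if classify(prompt) in SIMPLE_TASKS:
        return call_slm(prompt)      # cheap self-hosted small model on EKS
    return call_bedrock(prompt)      # full-quality managed frontier model
```

The 40-60% savings figure above assumes the simple-task share really is 60-70% of traffic; the router's classifier accuracy is what makes or breaks that number.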
Pattern 2: SageMaker Training + EKS Serving
When to use: When you train custom models but want to minimize inference serving costs
Core value: SageMaker's managed training environment + EKS cost-efficient serving
Reference: SageMaker-EKS Integration
Pattern 3: AgentCore + Self-Hosted Models
When to use: When operating Agent runtime serverlessly but self-hosting specific domain models
Core value: AgentCore's serverless operability + custom model domain accuracy
Reference: AWS Native Platform
Pattern 4: Full Stack (SageMaker + Bedrock + EKS)
The most complex but most flexible pattern:
- Data & Training: SageMaker Unified Studio + Pipelines
- Production Inference: Bedrock API (high-reliability tasks) + EKS vLLM (high-volume tasks)
- Agent Runtime: AgentCore (serverless) + Kagent (Kubernetes-native)
- Observability: CloudWatch (managed) + Langfuse (self-hosted)
This pattern is chosen by large enterprises to meet different requirements across teams. Due to high architectural complexity, clear operational responsibility boundaries and a service catalog are essential.
Reference: For technical implementation of hybrid architecture, refer to SageMaker-EKS Integration.
Cost Simulation Summary
Optimal options and estimated costs based on monthly inference volume.
| Monthly Inference Volume | Optimal Option | Est. Monthly Cost | Notes |
|---|---|---|---|
| ~100K requests | Bedrock API | ~$300-500 | No GPU management, fastest start |
| ~500K requests | Bedrock + Cascade | ~$800-1,200 | Start separating simple requests with SLM |
| ~1.5M requests | Hybrid transition point | ~$2,500-3,500 | Near self-hosting break-even |
| ~5M+ requests | EKS self-hosting | ~$3,500-5,000 | 60%+ savings with Spot + Cascade |
For detailed analysis of instance costs, Spot savings rates, and Cascade Routing effects, refer to Coding Tools Cost Analysis.
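As a back-of-envelope check on the table above, the break-even volume falls out of simple arithmetic. The unit costs below are illustrative placeholders, not the figures from the Coding Tools Cost Analysis document:

```python
def breakeven_requests(bedrock_cost_per_1k: float,
                       gpu_fixed_monthly: float) -> int:
    """Monthly request volume at which a fixed-cost self-hosted GPU fleet
    matches pay-per-request Bedrock pricing.

    Ignores ops overhead, autoscaling headroom, and Spot variability,
    so treat the result as a lower bound on the real break-even point.
    """
    return int(gpu_fixed_monthly / bedrock_cost_per_1k * 1000)

# Illustrative numbers only: $2.00 per 1K requests on Bedrock vs a
# $3,000/month self-hosted GPU fleet.
print(breakeven_requests(2.00, 3000))  # 1500000
```

With these placeholder inputs the break-even lands at 1.5M requests/month, consistent with the "hybrid transition point" row in the table.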
Customer Discovery Checklist
10 key questions to identify the optimal approach in customer meetings.
- Are you currently operating AI/ML workloads? → Determine maturity level
- What is your monthly inference request volume? → Cost optimization path
- Do you need to self-host Open Weight models? → EKS necessity
- Do you have data sovereignty or VPC isolation requirements? → Self-hosting/hybrid
- Does your team have Kubernetes operations experience? → Assess operational burden
- Do you perform ML training and data engineering together? → SageMaker Unified Studio
- What is your monthly budget range? → Cost structure matching
- When is your target production deployment date? → Time-to-value path
- Do you have multi-cloud or on-premises hybrid requirements? → EKS Hybrid Nodes
- What AWS services are you currently using? → Leverage existing investments
References
Official Documentation
- Amazon SageMaker Unified Studio — Integrated AI development environment
- Amazon Bedrock Documentation — Bedrock official documentation
- Amazon EKS Best Practices — EKS recommendations
- AWS Well-Architected Framework — Architecture framework
Papers / Technical Blogs
- Choosing the Right AI Platform — Platform selection guide
- Cost Optimization for LLM Inference — Cost optimization strategies
- Hybrid AI Architecture Patterns — Hybrid patterns
- Building Production ML Systems — Production ML guide
Related Documents (Internal)
- Platform Architecture — 6 core layers
- Technical Challenges — 5 core challenges
- AWS Native Platform — Managed service details
- EKS-based Open Architecture — Self-hosting details