EKS-based Agentic AI Open Architecture
Before reading this document, refer to the following documents first:
- Platform Architecture — Structure and core layers of Agentic AI Platform
- Technical Challenges — 5 core challenges
- AI Platform Selection Guide — Managed vs open-source decision-making
- AWS Native Platform — Managed service-based alternative approach (for comparison reference)
Why EKS-based Open Architecture?
AWS Native Platform is a powerful approach for getting started quickly. However, when the following requirements arise, EKS-based open architecture becomes necessary:
- Self-hosted Open Weight Models (Llama, Qwen, DeepSeek)
- Hybrid Architecture (on-premises GPU + cloud)
- Custom Agent Workflows (LangGraph, MCP/A2A)
- Multi-provider Routing (Bifrost 2-Tier Gateway)
- Fine-grained GPU Cost Optimization (Spot, MIG, Consolidation)
For a 5-axis comparison of AWS Native, SageMaker Unified Studio, EKS open architecture, and hybrid approaches, refer to AI Platform Selection Guide.
Key Message: AWS Native → EKS is a complementary relationship. A realistic approach is to start with AWS Native and expand to EKS as needed. Both approaches can coexist within the same VPC.
Quick Start with EKS Auto Mode
EKS Cluster Configuration Options: Control Plane and Data Plane
EKS cluster configuration is divided into two independent layers.
Provisioned Control Plane (PCP)
PCP is a premium option that provisions control plane capacity in advance with fixed tiers, ensuring consistent API server performance.
PCP Tier Specifications
| Tier | API Concurrency (seats) | Pod Scheduling | etcd DB | SLA | Cost |
|---|---|---|---|---|---|
| Standard | Dynamic (AWS auto-adjusted) | Dynamic | 8GB | 99.95% | $0.10/hr |
| XL | 1,700 | 167/sec | 16GB | 99.99% | - |
| 2XL | 3,400 | 283/sec | 16GB | 99.99% | - |
| 4XL | 6,800 | 400/sec | 16GB | 99.99% | - |
| 8XL | 13,600 | 400/sec | 16GB | 99.99% | - |
Source: AWS EKS Provisioned Control Plane Official Documentation (K8s 1.30+ baseline). For PCP tier pricing, refer to AWS official pricing page.
Tier Selection Criteria: Metric-based Judgment
PCP tier should be selected based on Kubernetes control plane metrics.
Key Monitoring Metrics:
| Metric | Prometheus Query | Judgment Criterion |
|---|---|---|
| API Inflight Seats (Most Important) | `apiserver_flowcontrol_current_executing_seats` | Sustained usage above 1,200 seats → XL or higher |
| Pod Scheduling Rate | `scheduler_schedule_attempts_total{result="scheduled"}` | 100/sec or higher → XL, 200/sec or higher → 2XL |
| etcd DB Size | `apiserver_storage_size_bytes` | Exceeds 10GB → XL or higher required |
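These thresholds can be wired directly into alerting. Below is a minimal sketch of a Prometheus Operator `PrometheusRule` for the seat-count signal, assuming the Prometheus Operator CRDs are installed and control plane metrics are being scraped; the rule name, namespace, and 30-minute window are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-pcp-tier-signals   # illustrative name
  namespace: monitoring        # illustrative namespace
spec:
  groups:
    - name: eks-control-plane-capacity
      rules:
        - alert: APIServerSeatsSustainedHigh
          # Sustained API inflight seats above 1,200 suggests moving to PCP tier-xl or higher
          expr: sum(apiserver_flowcontrol_current_executing_seats) > 1200
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "API inflight seats above 1,200 for 30m — evaluate PCP tier-xl or higher"
```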
PCP is a control plane capacity option, and Auto Mode is a data plane management option. Both features can be used in combination.
Control Plane × Data Plane Comparison and Combination
| Feature | Standard (Default) | Provisioned Control Plane (PCP) |
|---|---|---|
| Scaling | Dynamic auto-scaling (AWS managed) | Fixed tier (pre-provisioned) |
| API Concurrency (seats) | Dynamic (AWS auto-adjusted) | XL: 1,700 / 2XL: 3,400 / 4XL: 6,800 / 8XL: 13,600 |
| Pod Scheduling Rate | Dynamic | XL: 167 / 2XL: 283 / 4XL–8XL: 400 pods/sec |
| etcd DB Size | 8 GB | 16 GB |
| SLA | 99.95% | 99.99% |
| Cost | $0.10/hr ($73/mo) | Per-tier pricing (see AWS pricing page) |
| Tier Selection Criteria | - | API Inflight Seats + Pod Scheduling Rate + etcd DB Size (NOT node count) |
| Feature | Managed Node Groups | Karpenter | EKS Auto Mode |
|---|---|---|---|
| Node Provisioning | Manual (ASG-based) | Automatic (Pod-driven) | Fully automatic (AWS managed) |
| GPU Optimization | Manual instance selection | Auto GPU selection | Auto + default NodeClass |
| Scaling Speed | Slow (ASG → EC2) | Fast (direct EC2 API) | Fast (built-in Karpenter) |
| Add-on Mgmt | Manual (CNI, CSI, etc.) | Manual | ✅ Automatic |
| Security Patches | Manual AMI update | Manual | ✅ Automatic |
| Cost Optimization | Limited | Consolidation + Spot | Consolidation + 7.5% surcharge |
| Operational Burden | High | Medium | Low |
| Combination | Control Plane | Data Plane | Best For |
|---|---|---|---|
| General AI Service | Standard | Auto Mode | Small-mid inference, minimal ops |
| GPU-Optimized Platform | Standard | Karpenter | Multi-GPU, Spot, cost optimization |
| Large AI Platform | PCP (tier-xl+) | Auto Mode | API perf guaranteed (1,700+ seats) + auto ops |
| Ultra-Scale Training | PCP (tier-4xl+) | Karpenter | API concurrency 6,800+, fine GPU control |
- Small-scale (PoC/Demo): Standard + Auto Mode — minimal operational burden, 99.95% SLA (eksctl sketch after this list)
- Medium-scale (Production Inference): Standard + Karpenter — GPU cost optimization, 99.95% SLA
- Large-scale (Enterprise AI): PCP XL + Auto Mode — up to 1,700 API seats, 99.99% SLA
- Extra-large-scale (Training Cluster): PCP 4XL+ + Karpenter — 6,800+ API seats, fine-grained GPU control
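To illustrate the first combination, a minimal eksctl spec for a Standard control plane with Auto Mode enabled might look like the following; the cluster name, region, and `autoModeConfig` fields are illustrative and should be verified against the current eksctl documentation:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: agentic-ai        # illustrative cluster name
  region: us-east-1       # illustrative region
autoModeConfig:
  enabled: true
  # Built-in NodePools managed by Auto Mode; custom GPU NodePools
  # can be added later as Kubernetes resources.
  nodePools: ["general-purpose", "system"]
```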
Amazon EKS and Karpenter: Maximizing Kubernetes Advantages
The combination of Amazon EKS and Karpenter makes the most of Kubernetes' strengths to deliver fully automated, optimized infrastructure. Karpenter provides node provisioning optimized for AI workloads, enabling faster scaling and finer-grained instance selection than the traditional Cluster Autoscaler.
For Karpenter v1.2+ GA features, NodePool configuration, GPU instance comparison, and cost optimization strategies, refer to GPU Resource Management.
| Aspect | Traditional Cluster Autoscaler | Karpenter on EKS |
|---|---|---|
| Scaling Speed | 60-90 seconds (ASG-based) | 10-30 seconds (direct EC2 API) |
| Instance Selection | Limited by ASG pre-configuration | Dynamic selection from 600+ EC2 types |
| GPU Workloads | Requires separate ASGs per GPU type | Single NodePool handles all GPU types |
| Spot Optimization | Manual fallback configuration | Automatic spot-to-on-demand fallback |
| Cost Efficiency | Limited consolidation | Aggressive bin-packing and consolidation |
| AWS Integration | Indirect via ASG | Direct EC2/Spot API calls |
| Configuration | ASG + IAM + Launch Templates | Simple NodePool CRD |
EKS Auto Mode: Complete Automation
EKS Auto Mode automatically configures and manages core components including Karpenter.
EKS Auto Mode vs Karpenter: Decision Guide
| Your Situation | Recommendation |
|---|---|
| New EKS cluster for Agentic AI | **Karpenter** (native AWS integration) |
| Existing cluster with CA | **Migrate to Karpenter** (worth the effort) |
| Need GPU autoscaling | **Karpenter** (required for GPU efficiency) |
| Simple CPU-only workloads | **EKS Auto Mode** (easiest option) |
| Multi-tenant platform | **Karpenter** (better isolation and cost attribution) |
| Regulated industries | **EKS Auto Mode** (compliance-friendly) |
EKS Auto Mode Configuration for GPU Workloads
EKS Auto Mode automatically configures and manages Karpenter; adding a GPU NodePool (sketched below) enables immediate AI workload deployment.
For detailed configuration including GPU NodePool composition, Spot/On-Demand strategy, Consolidation policy, refer to GPU Resource Management.
EKS Auto Mode fully supports accelerated computing instances, including NVIDIA GPUs.
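As a minimal sketch (assuming Auto Mode's built-in `default` NodeClass and its well-known `eks.amazonaws.com/*` instance labels; the instance families, taint, and limits are illustrative), a custom GPU NodePool could look like this:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference            # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com # Auto Mode's built-in NodeClass API group
        kind: NodeClass
        name: default
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["g5", "g6"]   # illustrative GPU families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule     # keep non-GPU pods off GPU nodes
  limits:
    nvidia.com/gpu: 16           # illustrative cluster-wide GPU cap
```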
re:Invent 2024/2025 New Features:
- EKS Hybrid Nodes (GA): Integrate on-premises GPU infrastructure into EKS cluster
- Enhanced Pod Identity v2: Cross-account IAM role support
- Native Inferentia/Trainium Support: Automatic Neuron SDK configuration
- Provisioned Control Plane: Pre-provisioning for large-scale AI training workloads
Agentic AI Components Deployable on Auto Mode
All core components of the Agentic AI platform can be deployed on EKS Auto Mode.
Inference: vLLM + llm-d
vLLM is an LLM inference-dedicated engine, and llm-d provides intelligent routing considering KV Cache state.
- vLLM: Dedicated LLM inference engine for open-weight models (Llama, Qwen, DeepSeek, etc.) — PagedAttention-based KV Cache optimization
- Triton Inference Server: Handles non-LLM inference (embedding, reranking, Whisper STT)
- llm-d: Maximizes prefix cache hit rate with KV Cache-aware routing
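As a rough sketch of what a single-GPU vLLM deployment looks like on the cluster (the public `vllm/vllm-openai` image serves an OpenAI-compatible API; the model ID and resource sizing are illustrative, and gated models additionally require an HF token):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # Entrypoint is the OpenAI-compatible API server; model ID is illustrative
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--max-model-len", "8192"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```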
For detailed configuration, refer to vLLM Model Serving and llm-d Distributed Inference.
Gateway: kgateway + Bifrost (2-Tier Gateway)
The 2-Tier Gateway architecture separates traffic management from model routing:
- Tier 1 (kgateway): Gateway API-based authentication, rate limiting, traffic management
- Tier 2 (Bifrost): Model abstraction, fallback, cost tracking, cascade routing
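Since kgateway implements the Kubernetes Gateway API, Tier 1 → Tier 2 routing can be sketched with a standard `HTTPRoute`; the Gateway and Bifrost Service names and port below are illustrative assumptions:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-api
  namespace: ai-platform
spec:
  parentRefs:
    - name: kgateway             # hypothetical Tier 1 Gateway name
      namespace: kgateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1           # OpenAI-style API prefix
      backendRefs:
        - name: bifrost          # hypothetical Tier 2 Bifrost Service
          port: 8080             # illustrative port
```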
For detailed architecture, refer to Inference Gateway Routing.
Agent: LangGraph + NeMo Guardrails + MCP/A2A
Agent workflows on EKS consist of:
- LangGraph: Multi-step agent workflow definition, conditional branching, parallel execution
- NeMo Guardrails: Prompt injection defense, PII leak prevention, output validation — For tool comparison and implementation details, refer to AI Gateway Guardrails
- MCP: Lets agent-ready applications expose tools in a standardized way
- A2A: Safe and efficient agent-to-agent communication
- Redis (ElastiCache): State management with LangGraph checkpointer
Agent Pods autoscale based on Redis queue length via KEDA.
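A minimal sketch of that KEDA wiring, assuming a Redis list serves as the task queue (the ElastiCache endpoint, list name, and threshold are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker
  namespace: ai-platform
spec:
  scaleTargetRef:
    name: agent-worker           # Deployment running the LangGraph agents
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        address: my-redis.example.cache.amazonaws.com:6379  # hypothetical ElastiCache endpoint
        listName: agent-task-queue                          # illustrative queue key
        listLength: "10"         # target backlog per replica
```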
For details, refer to Kagent Agent Management and AWS Native Platform — AgentCore & MCP. For Guardrails technology stack (Input/Output Guard, Tool Allow-list, kgateway/Bifrost integration), refer to AI Gateway Guardrails.
RAG + Observability
- Milvus: Vector DB — Core of RAG system (Details)
- Langfuse: Production LLM tracing, token cost tracking (Architecture, Deployment Guide)
- Prometheus + Grafana: Infrastructure metrics monitoring
EKS-based Easy Deployment
| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads immediately after cluster creation without Karpenter installation/configuration |
| Automatic Upgrades | Automatic updates for core components like Karpenter, CNI, CSI |
| Automated Security Patching | Automatic application of security vulnerability patches |
| Extensible with Custom Configuration | Add custom settings like GPU NodePool, EFA NodeClass when needed |
EKS Deployment Methods by Solution
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Monitoring | DCGM + Prometheus | NodePool-based integrated management | 40% improved resource utilization |
| Dynamic Scaling | HPA + KEDA | Just-in-Time provisioning (auto-configured) | 50% reduced provisioning time |
| Cost Control | Namespace Quota | Spot + Consolidation (auto-enabled) | 50-70% cost reduction |
| FM Fine-tuning | Kubeflow Operator | Training NodePool + EFA | 30% improved training efficiency |
Easy Deployment Example
For deployment guide, refer to Reference Architecture.
For GPU cost optimization strategies including Spot instance usage, Consolidation, and schedule-based cost management, refer to GPU Resource Management document.
For GPU Pod security policies, Network Policy, IAM, MIG isolation, and GPU troubleshooting guide, refer to EKS GPU Node Strategy document.
Minimize Infrastructure Operational Burden with EKS Capability
What is EKS Capability?
EKS Capability is a platform-level feature that integrates proven open-source tools and AWS services to effectively operate specific workloads on Amazon EKS.
Core EKS Capabilities for Agentic AI
| EKS Capability | Role | Agentic AI Use | Delivery Method |
|---|---|---|---|
| ACK (AWS Controllers for Kubernetes) | Kubernetes-native management of AWS services | S3 model storage, RDS metadata, SageMaker training jobs | EKS Add-on |
| KRO (Kubernetes Resource Orchestrator) | Abstraction and templating of composite resources | One-click deployment of AI inference stacks and training pipelines | EKS Add-on |
| Argo CD | GitOps-based continuous deployment | Automated model-serving deployment, rollback, environment sync | EKS Add-on |
Argo Workflows is not officially supported as an EKS Capability, so direct installation is required.
For deployment guide, refer to Argo Workflows Official Documentation.
ACK (AWS Controllers for Kubernetes)
ACK directly provisions and manages AWS services through Kubernetes Custom Resources. It can be easily installed as an EKS Add-on.
ACK Usage Examples in AI Platform:
| AWS Service | ACK Controller | Agentic AI Use |
|---|---|---|
| S3 | `s3.services.k8s.aws` | Model artifact storage, training data buckets |
| RDS/Aurora | `rds.services.k8s.aws` | Langfuse backend, metadata store |
| SageMaker | `sagemaker.services.k8s.aws` | Model training jobs, endpoint deployment |
| Secrets Manager | `secretsmanager.services.k8s.aws` | API keys, model credential management |
| ECR | `ecr.services.k8s.aws` | Container image registry |
S3 Bucket Creation Example with ACK:
```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: agentic-ai-models
  namespace: ai-platform
spec:
  name: agentic-ai-models-prod
  versioning:
    status: Enabled
  encryption:
    rules:
      - applyServerSideEncryptionByDefault:
          sseAlgorithm: aws:kms
  tags:
    - key: Project
      value: agentic-ai
```
KRO (Kubernetes Resource Orchestrator)
KRO combines multiple Kubernetes resources and AWS resources into one abstracted unit to deploy complex infrastructure simply.
Deploy AI Inference Stack as Single Resource with KRO:
```yaml
# Deploy the entire stack as a single resource.
# AIInferenceStack is a custom kind defined by a KRO ResourceGraphDefinition;
# kro.run is KRO's default API group for generated kinds.
apiVersion: kro.run/v1alpha1
kind: AIInferenceStack
metadata:
  name: llama-inference
  namespace: ai-platform
spec:
  modelName: llama-3-70b
  gpuType: g5.12xlarge
  minReplicas: 2
  maxReplicas: 20
```
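For context, the `AIInferenceStack` kind above would be defined by a ResourceGraphDefinition roughly like the following condensed sketch; the simple-schema fields and the single ConfigMap resource are illustrative (a real graph would wire up a Deployment, Service, HPA, and ACK resources in the same way):

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: ai-inference-stack
spec:
  schema:
    apiVersion: v1alpha1
    kind: AIInferenceStack
    spec:
      modelName: string
      gpuType: string
      minReplicas: integer | default=1
      maxReplicas: integer | default=10
  resources:
    - id: modelConfig            # illustrative single resource
      template:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: ${schema.spec.modelName}-config
        data:
          model: ${schema.spec.modelName}
          gpuType: ${schema.spec.gpuType}
```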
Argo-based ML Pipeline Automation
Combining Argo Workflows and Argo CD enables full GitOps-style MLOps automation, from AI model training and evaluation through deployment.
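A condensed sketch of such a pipeline as an Argo Workflows DAG (the container images and scripts are hypothetical placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-train-eval-
  namespace: ai-platform
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: train
            template: train
          - name: evaluate
            template: evaluate
            dependencies: [train]    # run evaluation after training
    - name: train
      container:
        image: ghcr.io/example-org/trainer:latest    # hypothetical image
        command: [python, train.py]
    - name: evaluate
      container:
        image: ghcr.io/example-org/evaluator:latest  # hypothetical image
        command: [python, evaluate.py]
```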
ACK + KRO + ArgoCD Integration Architecture
| Component | Role | Automation Scope |
|---|---|---|
| Argo CD | GitOps deployment automation | Application deployment, rollback, synchronization |
| Argo Workflows | ML pipeline orchestration | Training, evaluation, and model registration workflows |
| KRO | Composite resource abstraction | Manages K8s + AWS resources as a single unit |
| ACK | Declarative AWS resource management | AWS services such as S3, RDS, SageMaker |
| Karpenter | GPU node provisioning | Just-in-Time instance provisioning |
- Developer: Model deployment with just Git push
- Platform Team: Minimize infrastructure management burden
- Cost Optimization: Dynamic provisioning of only necessary resources
- Consistency: Same deployment method across all environments
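To make the "Git push → deploy" flow concrete, a minimal Argo CD Application pointing at a GitOps repo might look like this (the repo URL and path are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama-inference
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ai-platform-gitops  # hypothetical repo
    targetRevision: main
    path: inference/llama-3-70b    # hypothetical manifest path
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-platform
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift
```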
Conclusion and Next Steps
Progressive Journey: AWS Native → Auto Mode → EKS Capability
EKS Auto Mode: Recommended Starting Point
| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads immediately after cluster creation, with no Karpenter installation/configuration |
| Automatic Upgrades | Automatic updates for core components such as Karpenter, CNI, and CSI |
| Automated Security Patching | Automatic application of security vulnerability patches |
| Extensible with Custom Configuration | Add custom settings such as GPU NodePool and EFA NodeClass when needed |
Solution Summary by Challenge
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Resource Mgmt | DCGM + Prometheus | NodePool + MIG | 40% utilization improvement |
| Inference Routing | kgateway + Bifrost | llm-d KV Cache-aware routing | 50% faster provisioning |
| LLMOps Observability | LangSmith (Dev) + Langfuse (Prod) | Spot + Consolidation | 50-70% cost reduction |
| Agent Orchestration | LangGraph + NeMo Guardrails | Agent Pod auto-scaling | Safety & scalability |
| Model Supply Chain | MLflow + Kubeflow + ArgoCD | Training NodePool + EFA | 30% training efficiency |
EKS Auto Mode GPU Limitations and Hybrid Strategy
EKS Auto Mode is optimal for general workloads and basic GPU inference, but has limitations for advanced GPU features.
| Workload Type | Auto Mode Suitability | Reason |
|---|---|---|
| API Gateway, Agent Framework | Suitable | Non-GPU, automatic scaling sufficient |
| Observability Stack | Suitable | Non-GPU, minimize management burden |
| Basic GPU Inference (Full GPU) | Suitable | AWS-managed GPU stack sufficient |
| MIG Partitioning Needed | Unsuitable | Cannot partition MIG with read-only NodeClass (GPU Operator itself can be installed) |
| Run:ai GPU Scheduling | Possible | Install GPU Operator, then disable the default Device Plugin via node label |
Recommended hybrid configuration: Operate Auto Mode (general workloads) + Karpenter (advanced GPU features) in a single cluster. For detailed configuration, refer to EKS GPU Node Strategy.
Gateway API Limitations and Workarounds
EKS Auto Mode's built-in load balancer does not directly support the Kubernetes Gateway API. To use kgateway, provision an NLB with a separate Service (type: LoadBalancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kgateway-proxy
  namespace: kgateway-system
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: kgateway-proxy
  ports:
    - name: https
      port: 443
      targetPort: 8443
```
For complete 2-Tier Gateway architecture design, refer to LLM Gateway 2-Tier Architecture.
Key Recommendations
- Start with EKS Auto Mode: Create new clusters with Auto Mode to leverage automatic Karpenter configuration
- Advanced GPU Features on Karpenter Nodes: Add Karpenter NodePool when GPU Operator needed for MIG, Run:ai, etc.
- GPU NodePool Custom Definition: Add GPU NodePool suited to workload characteristics (separate inference/training/experimentation)
- Aggressive Spot Instance Use: Operate 70%+ of inference workloads with Spot
- Enable Consolidation by Default: Leverage auto-enabled Consolidation in EKS Auto Mode
- KEDA Integration: Link metric-based Pod scaling with Karpenter node provisioning
Choose Deployment Path
- EKS Auto Mode (Recommended for Most)
- EKS + Karpenter (Maximum Control)
- Hybrid (Combine Advantages of Both)

EKS Auto Mode (Recommended for Most)
When Suitable:
- Startups and small teams
- Teams new to Kubernetes
- Standard Agentic AI workloads
Getting Started: For the deployment guide, refer to EKS Auto Mode Official Documentation.
Advantages: Zero infrastructure management burden, AWS-optimized default settings, automatic security patches

EKS + Karpenter (Maximum Control)
When Suitable:
- Large-scale production workloads
- Complex GPU requirements (mixed instance types)
- Cost optimization as the top priority
Getting Started: For the deployment guide, refer to Karpenter Official Documentation.
Advantages: Fine-grained instance control, maximum cost optimization (70-80% savings), custom AMIs

Hybrid (Combine Advantages of Both)
When Suitable:
- Growing platforms (start simple, expand as needed)
- Mixed workload types (CPU agents + GPU LLMs)
Getting Started: For the deployment guide, refer to Reference Architecture.
Advantages: Progressive complexity increase, GPU cost optimization, AWS-managed + custom combination
Reference Documents for Scaling
| Area | Document | Content |
|---|---|---|
| GPU Node Strategy | EKS GPU Node Strategy | Auto Mode + Karpenter + Hybrid Node + Security/Troubleshooting |
| GPU Resource Management | GPU Resource Management | Karpenter scaling, KEDA, DRA, cost optimization |
| NVIDIA GPU Stack | NVIDIA GPU Stack | GPU Operator, DCGM, MIG, Time-Slicing |
| Model Serving | vLLM Model Serving | vLLM configuration, performance optimization |
| Distributed Inference | llm-d Distributed Inference | KV Cache-aware routing |
| Training Infrastructure | NeMo Framework | Distributed training, EFA network |
References
Official Documentation
- Amazon EKS Documentation — EKS official documentation
- EKS Auto Mode — Auto Mode guide
- Karpenter Documentation — Karpenter official documentation
- KEDA - Kubernetes Event-driven Autoscaling — Event-driven autoscaling
Papers / Technical Blogs
- vLLM: Easy, Fast, and Cheap LLM Serving — vLLM official blog
- Efficient Memory Management for LLM Serving — PagedAttention paper
- AWS re:Invent 2024: EKS Auto Mode Deep Dive — Auto Mode session
- NVIDIA Developer Blog: AI on Kubernetes — GPU workload optimization
Related Documents (Internal)
- Platform Architecture — Overall system design
- Technical Challenges — 5 core challenges
- GPU Resource Management — Karpenter, KEDA, DRA
- vLLM Model Serving — vLLM deployment guide