
EKS-based Agentic AI Open Architecture

Prerequisite Documents

Before reading this document, refer to the following documents first:


Why EKS-based Open Architecture?

The AWS Native platform is a powerful way to get started quickly. However, an EKS-based open architecture becomes necessary when requirements such as the following arise:

  • Self-hosted Open Weight Models (Llama, Qwen, DeepSeek)
  • Hybrid Architecture (on-premises GPU + cloud)
  • Custom Agent Workflows (LangGraph, MCP/A2A)
  • Multi-provider Routing (Bifrost 2-Tier Gateway)
  • Fine-grained GPU Cost Optimization (Spot, MIG, Consolidation)
Platform Comparison

For a 5-axis comparison of AWS Native, SageMaker Unified Studio, EKS open architecture, and hybrid approaches, refer to AI Platform Selection Guide.

Key Message: AWS Native → EKS is a complementary relationship. A realistic approach is to start with AWS Native and expand to EKS as needed. Both approaches can coexist within the same VPC.


Quick Start with EKS Auto Mode

EKS Cluster Configuration Options: Control Plane and Data Plane

EKS cluster configuration is divided into two independent layers.

Provisioned Control Plane (PCP)

PCP is a premium option that provisions control plane capacity in advance with fixed tiers, ensuring consistent API server performance.

PCP Tier Specifications

| Tier | API Concurrency (seats) | Pod Scheduling | etcd DB | SLA | Cost |
|---|---|---|---|---|---|
| Standard | Dynamic (AWS auto-adjusted) | Dynamic | 8 GB | 99.95% | $0.10/hr |
| XL | 1,700 | 167/sec | 16 GB | 99.99% | - |
| 2XL | 3,400 | 283/sec | 16 GB | 99.99% | - |
| 4XL | 6,800 | 400/sec | 16 GB | 99.99% | - |
| 8XL | 13,600 | 400/sec | 16 GB | 99.99% | - |

Source: AWS EKS Provisioned Control Plane Official Documentation (K8s 1.30+ baseline). For PCP tier pricing, refer to AWS official pricing page.

Tier Selection Criteria: Metric-based Judgment

Worker Node Count Is Not a PCP Tier Selection Criterion

PCP tier should be selected based on Kubernetes control plane metrics.

Key Monitoring Metrics:

| Metric | Prometheus Query | Judgment Criterion |
|---|---|---|
| API Inflight Seats (most important) | `sum(apiserver_flowcontrol_current_executing_seats)` | Sustained above 1,200 seats → XL or higher |
| Pod Scheduling Rate | `rate(scheduler_schedule_attempts_total{result="scheduled"}[5m])` | 100/sec or higher → XL; 200/sec or higher → 2XL |
| etcd DB Size | `apiserver_storage_size_bytes` | Exceeds 10 GB → XL or higher required |
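
To turn these criteria into operational signals, the sketch below encodes the seat and etcd thresholds as Prometheus alerts. It assumes the kube-prometheus-stack operator (PrometheusRule CRD) is installed; the rule and namespace names are illustrative.

```yaml
# Illustrative PrometheusRule encoding the PCP tier-upgrade signals above.
# Assumes kube-prometheus-stack; thresholds match the table.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-control-plane-capacity
  namespace: monitoring
spec:
  groups:
    - name: eks-pcp-signals
      rules:
        - alert: APIInflightSeatsHigh
          # Sustained >1,200 executing seats suggests moving to PCP XL or higher
          expr: sum(apiserver_flowcontrol_current_executing_seats) > 1200
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "API server inflight seats exceed 1,200 - evaluate PCP XL+"
        - alert: EtcdDatabaseLarge
          # >10 GB etcd DB means the Standard tier's 8 GB limit is at risk
          expr: max(apiserver_storage_size_bytes) > 10e9
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "etcd DB size exceeds 10 GB - evaluate PCP XL+"
```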
PCP vs Auto Mode — Different Layers

PCP is a control plane capacity option, and Auto Mode is a data plane management option. Both features can be used in combination.

Control Plane × Data Plane Comparison and Combination

⬆️ Control Plane: Standard vs Provisioned (PCP)
| Feature | Standard (Default) | Provisioned Control Plane (PCP) |
|---|---|---|
| Scaling | Dynamic auto-scaling (AWS managed) | Fixed tier (pre-provisioned) |
| API Concurrency (seats) | Dynamic (AWS auto-adjusted) | XL: 1,700 / 2XL: 3,400 / 4XL: 6,800 / 8XL: 13,600 |
| Pod Scheduling Rate | Dynamic | XL: 167 / 2XL: 283 / 4XL–8XL: 400 pods/sec |
| etcd DB Size | 8 GB | 16 GB |
| SLA | 99.95% | 99.99% |
| Cost | $0.10/hr (≈$73/mo) | Per-tier pricing (see AWS pricing page) |
| Tier Selection Criteria | - | API inflight seats + pod scheduling rate + etcd DB size (NOT node count) |
⬇️ Data Plane: MNG vs Karpenter vs Auto Mode
| Feature | Managed Node Groups | Karpenter | EKS Auto Mode |
|---|---|---|---|
| Node Provisioning | Manual (ASG-based) | Automatic (Pod-driven) | Fully automatic (AWS managed) |
| GPU Optimization | Manual instance selection | Auto GPU selection | Auto + default NodeClass |
| Scaling Speed | Slow (ASG → EC2) | Fast (direct EC2 API) | Fast (built-in Karpenter) |
| Add-on Management | Manual (CNI, CSI, etc.) | Manual | ✅ Automatic |
| Security Patches | Manual AMI update | Manual | ✅ Automatic |
| Cost Optimization | Limited | Consolidation + Spot | Consolidation + 7.5% surcharge |
| Operational Burden | High | Medium | Low |
🔗 Recommended Combination Matrix
| Combination | Control Plane | Data Plane | Best For |
|---|---|---|---|
| General AI Service | Standard | Auto Mode | Small-to-mid inference, minimal ops |
| GPU-Optimized Platform | Standard | Karpenter | Multi-GPU, Spot, cost optimization |
| Large AI Platform | PCP (tier-xl+) | Auto Mode | Guaranteed API performance (1,700+ seats) + automated ops |
| Ultra-Scale Training | PCP (tier-4xl+) | Karpenter | API concurrency 6,800+, fine-grained GPU control |
Recommended Configuration by AI Platform Scale
  • Small-scale (PoC/Demo): Standard + Auto Mode — minimal operational burden, 99.95% SLA
  • Medium-scale (Production Inference): Standard + Karpenter — GPU cost optimization, 99.95% SLA
  • Large-scale (Enterprise AI): PCP XL + Auto Mode — up to 1,700 API seats, 99.99% SLA
  • Extra-large-scale (Training Cluster): PCP 4XL+ + Karpenter — 6,800+ API seats, fine-grained GPU control

Amazon EKS and Karpenter: Maximizing Kubernetes Advantages

The combination of Amazon EKS and Karpenter maximizes Kubernetes' advantages to deliver fully automated, optimized infrastructure. Karpenter provides node provisioning optimized for AI workloads, enabling faster scaling and finer-grained instance selection than the traditional Cluster Autoscaler.

Karpenter Detailed Guide

For Karpenter v1.2+ GA features, NodePool configuration, GPU instance comparison, and cost optimization strategies, refer to GPU Resource Management.

EKS + Karpenter + AWS Infrastructure Layers
| Aspect | Traditional Cluster Autoscaler | Karpenter on EKS |
|---|---|---|
| Scaling Speed | 60–90 seconds (ASG-based) | 10–30 seconds (direct EC2 API) |
| Instance Selection | Limited by ASG pre-configuration | Dynamic selection from 600+ EC2 types |
| GPU Workloads | Requires separate ASGs per GPU type | Single NodePool handles all GPU types |
| Spot Optimization | Manual fallback configuration | Automatic Spot-to-On-Demand fallback |
| Cost Efficiency | Limited consolidation | Aggressive bin-packing and consolidation |
| AWS Integration | Indirect via ASG | Direct EC2/Spot API calls |
| Configuration | ASG + IAM + Launch Templates | Simple NodePool CRD |

EKS Auto Mode: Complete Automation

EKS Auto Mode automatically configures and manages core components including Karpenter.
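
For a new cluster, Auto Mode can be enabled at creation time. A minimal sketch using an eksctl ClusterConfig follows; the cluster name and region are placeholders, and `autoModeConfig` is eksctl's switch for Auto Mode.

```yaml
# Minimal eksctl config for an EKS Auto Mode cluster (illustrative values).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: agentic-ai     # placeholder cluster name
  region: us-west-2    # placeholder region
autoModeConfig:
  enabled: true
  # Built-in NodePools managed by Auto Mode; omit to get both by default
  nodePools: ["general-purpose", "system"]
```

Apply with `eksctl create cluster -f cluster.yaml`.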

EKS Auto Mode vs Manual Configuration Comparison

| Your Situation | Recommendation |
|---|---|
| New EKS cluster for Agentic AI | **Karpenter** (native AWS integration) |
| Existing cluster with CA | **Migrate to Karpenter** (worth the effort) |
| Need GPU autoscaling | **Karpenter** (required for GPU efficiency) |
| Simple CPU-only workloads | **EKS Auto Mode** (easiest option) |
| Multi-tenant platform | **Karpenter** (better isolation and cost attribution) |
| Regulated industries | **EKS Auto Mode** (compliance-friendly) |

EKS Auto Mode Configuration for GPU Workloads

EKS Auto Mode automatically configures and manages Karpenter. Adding a GPU NodePool enables immediate AI workload deployment (a minimal sketch follows the note below).

NodePool Configuration Details

For detailed configuration including GPU NodePool composition, Spot/On-Demand strategy, Consolidation policy, refer to GPU Resource Management.
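
As a starting point, a minimal GPU NodePool for Auto Mode might look like the sketch below. It assumes Auto Mode's built-in `default` NodeClass; the instance families, capacity types, and GPU limit are illustrative, not prescriptive.

```yaml
# Minimal GPU NodePool sketch for EKS Auto Mode (illustrative values).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com   # Auto Mode's built-in NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: "eks.amazonaws.com/instance-family"
          operator: In
          values: ["g5", "g6"]           # illustrative GPU families
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]  # Spot-first with On-Demand fallback
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule             # keep non-GPU Pods off GPU nodes
  limits:
    nvidia.com/gpu: 16                   # cap total GPUs this pool may provision
```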

EKS Auto Mode and GPU Support

EKS Auto Mode fully supports accelerated computing instances, including NVIDIA GPUs.

re:Invent 2024/2025 New Features:

  • EKS Hybrid Nodes (GA): Integrate on-premises GPU infrastructure into EKS cluster
  • Enhanced Pod Identity v2: Cross-account IAM role support
  • Native Inferentia/Trainium Support: Automatic Neuron SDK configuration
  • Provisioned Control Plane: Pre-provisioning for large-scale AI training workloads

Agentic AI Components Deployable on Auto Mode

All core components of the Agentic AI platform can be deployed on EKS Auto Mode.

Inference: vLLM + llm-d

vLLM is a dedicated LLM inference engine, and llm-d provides intelligent routing that accounts for KV Cache state.

Model Serving Stack Configuration
  • vLLM: Dedicated LLM inference (Llama, Qwen, DeepSeek, etc.) — PagedAttention-based KV Cache optimization
  • Triton Inference Server: Handles non-LLM inference (embedding, reranking, Whisper STT)
  • llm-d: Maximizes prefix cache hit rate with KV Cache-aware routing

For detailed configuration, refer to vLLM Model Serving and llm-d Distributed Inference.
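
As a minimal illustration of this serving layer, the Deployment below runs vLLM's OpenAI-compatible server on a single GPU. The model name and image tag are placeholders; pin both in production.

```yaml
# Minimal vLLM serving sketch (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a version in production
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
          ports:
            - containerPort: 8000   # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: 1     # full-GPU inference
      tolerations:
        - key: nvidia.com/gpu       # matches the GPU NodePool taint above
          operator: Exists
          effect: NoSchedule
```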

Gateway: kgateway + Bifrost (2-Tier Gateway)

A 2-Tier Gateway architecture separates traffic management from model routing:

  • Tier 1 (kgateway): Gateway API-based authentication, rate limiting, traffic management
  • Tier 2 (Bifrost): Model abstraction, fallback, cost tracking, cascade routing

For detailed architecture, refer to Inference Gateway Routing.
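
A minimal sketch of the Tier 1 → Tier 2 wiring using a standard Gateway API HTTPRoute (the Gateway and Service names are hypothetical): client traffic enters kgateway, and `/v1` calls are forwarded to the Bifrost service for model routing.

```yaml
# Illustrative Tier 1 → Tier 2 wiring via a Gateway API HTTPRoute.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-api
  namespace: ai-platform
spec:
  parentRefs:
    - name: kgateway             # hypothetical Gateway (Tier 1)
      namespace: kgateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1           # OpenAI-compatible API surface
      backendRefs:
        - name: bifrost          # hypothetical Tier 2 Service
          port: 8080
```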

Agent: LangGraph + NeMo Guardrails + MCP/A2A

Agent workflows on EKS consist of:

  • LangGraph: Multi-step agent workflow definition, conditional branching, parallel execution
  • NeMo Guardrails: Prompt injection defense, PII leak prevention, output validation — For tool comparison and implementation details, refer to AI Gateway Guardrails
  • MCP: Agent-ready apps expose tools to agents in a standardized way
  • A2A: Safe and efficient communication between agents
  • Redis (ElastiCache): State management with LangGraph checkpointer

Agent Pods autoscale based on Redis queue length via KEDA.
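
A minimal sketch of that autoscaling loop, assuming a hypothetical `agent-worker` Deployment, a Redis list named `agent-tasks`, and a placeholder ElastiCache endpoint:

```yaml
# Illustrative KEDA ScaledObject: scale agent Pods on Redis queue length.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker
  namespace: ai-platform
spec:
  scaleTargetRef:
    name: agent-worker            # Deployment running LangGraph agents
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: my-redis.example.cache.amazonaws.com:6379  # placeholder endpoint
        listName: agent-tasks     # placeholder queue key
        listLength: "10"          # target pending tasks per replica
```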

For details, refer to Kagent Agent Management and AWS Native Platform — AgentCore & MCP. For Guardrails technology stack (Input/Output Guard, Tool Allow-list, kgateway/Bifrost integration), refer to AI Gateway Guardrails.

RAG + Observability

  • Milvus: Vector DB — Core of RAG system (Details)
  • Langfuse: Production LLM tracing, token cost tracking (Architecture, Deployment Guide)
  • Prometheus + Grafana: Infrastructure metrics monitoring

EKS-based Easy Deployment

| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads immediately after cluster creation, without installing or configuring Karpenter |
| Automatic Upgrades | Automatic updates for core components such as Karpenter, CNI, and CSI |
| Automated Security Patching | Security vulnerability patches applied automatically |
| Extensible with Custom Configuration | Add custom settings such as GPU NodePools and EFA NodeClasses when needed |

EKS Deployment Methods by Solution

EKS Integration Benefits
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Monitoring | DCGM + Prometheus | NodePool-based integrated management | 40% improved resource utilization |
| Dynamic Scaling | HPA + KEDA | Just-in-Time provisioning (auto-configured) | 50% reduced provisioning time |
| Cost Control | Namespace Quota | Spot + Consolidation (auto-enabled) | 50–70% cost reduction |
| FM Fine-tuning | Kubeflow Operator | Training NodePool + EFA | 30% improved training efficiency |

Easy Deployment Example

For deployment guide, refer to Reference Architecture.

GPU Cost Optimization Details

For GPU cost optimization strategies including Spot instance usage, Consolidation, and schedule-based cost management, refer to GPU Resource Management document.

GPU Security and Troubleshooting

For GPU Pod security policies, Network Policy, IAM, MIG isolation, and GPU troubleshooting guide, refer to EKS GPU Node Strategy document.


Minimize Infrastructure Operational Burden with EKS Capability

What is EKS Capability?

EKS Capability is a platform-level feature that integrates proven open-source tools and AWS services to effectively operate specific workloads on Amazon EKS.

Core EKS Capabilities for Agentic AI

EKS Advanced Capabilities
| EKS Capability | Role | Agentic AI Use | Delivered As |
|---|---|---|---|
| ACK (AWS Controllers for Kubernetes) | Kubernetes-native management of AWS services | S3 model storage, RDS metadata, SageMaker training jobs | EKS Add-on |
| KRO (Kubernetes Resource Orchestrator) | Abstraction and templating of composite resources | One-click deployment of AI inference stacks and training pipelines | EKS Add-on |
| Argo CD | GitOps-based continuous deployment | Automated model-serving deployment, rollback, environment sync | EKS Add-on |
Argo Workflows Requires Separate Installation

Argo Workflows is not officially supported as an EKS Capability, so direct installation is required.

For deployment guide, refer to Argo Workflows Official Documentation.


ACK (AWS Controllers for Kubernetes)

ACK directly provisions and manages AWS services through Kubernetes Custom Resources. It can be easily installed as an EKS Add-on.

ACK Usage Examples in AI Platform:

ACK Controllers Usage
| AWS Service | ACK Controller | Agentic AI Use |
|---|---|---|
| S3 | `s3.services.k8s.aws` | Model artifact storage, training data buckets |
| RDS/Aurora | `rds.services.k8s.aws` | Langfuse backend, metadata store |
| SageMaker | `sagemaker.services.k8s.aws` | Model training jobs, endpoint deployment |
| Secrets Manager | `secretsmanager.services.k8s.aws` | API keys, model credential management |
| ECR | `ecr.services.k8s.aws` | Container image registry |

S3 Bucket Creation Example with ACK:

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: agentic-ai-models
  namespace: ai-platform
spec:
  name: agentic-ai-models-prod
  versioning:
    status: Enabled
  encryption:
    rules:
      - applyServerSideEncryptionByDefault:
          sseAlgorithm: "aws:kms"
  tags:
    - key: Project
      value: agentic-ai
```

KRO (Kubernetes Resource Orchestrator)

KRO combines multiple Kubernetes resources and AWS resources into one abstracted unit to deploy complex infrastructure simply.

Deploy AI Inference Stack as Single Resource with KRO:

```yaml
# Deploy the entire stack as a single resource
apiVersion: kro.run/v1alpha1   # instance group served by KRO (default group assumed)
kind: AIInferenceStack
metadata:
  name: llama-inference
  namespace: ai-platform
spec:
  modelName: llama-3-70b
  gpuType: g5.12xlarge
  minReplicas: 2
  maxReplicas: 20
```
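
Behind such an instance sits a ResourceGraphDefinition that declares the schema and the resources KRO should create. A heavily trimmed sketch follows; the schema uses KRO's simple-schema syntax, and the resource templates are elided placeholders.

```yaml
# Trimmed ResourceGraphDefinition sketch backing AIInferenceStack (illustrative).
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: ai-inference-stack
spec:
  schema:
    apiVersion: v1alpha1
    kind: AIInferenceStack
    spec:
      modelName: string
      gpuType: string
      minReplicas: integer | default=1
      maxReplicas: integer | default=10
  resources:
    - id: serving
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.modelName}-serving
        # container spec, GPU requests, Service, HPA, etc. elided
```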

Argo-based ML Pipeline Automation

Combining Argo Workflows and Argo CD enables full GitOps-style MLOps pipeline automation, from AI model training and evaluation through deployment.

ACK + KRO + ArgoCD Integration Architecture

Automation Components
| Component | Role | Automation Scope |
|---|---|---|
| Argo CD | GitOps deployment automation | Application deployment, rollback, synchronization |
| Argo Workflows | ML pipeline orchestration | Training, evaluation, and model-registration workflows |
| KRO | Composite resource abstraction | Manage K8s + AWS resources as a single unit |
| ACK | Declarative AWS resource management | AWS services such as S3, RDS, SageMaker |
| Karpenter | GPU node provisioning | Just-in-Time instance provisioning |
Benefits of Complete Automation — Delegate Infrastructure Operations to EKS and Focus on Agent Development
  • Developer: Model deployment with just a Git push (see the sketch after this list)
  • Platform Team: Minimize infrastructure management burden
  • Cost Optimization: Dynamic provisioning of only necessary resources
  • Consistency: Same deployment method across all environments
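
Concretely, the Git push from the first bullet lands in a repository watched by an Argo CD Application like the one below; the repository URL and path are placeholders.

```yaml
# Illustrative Argo CD Application: sync model-serving manifests from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ai-platform.git  # placeholder repo
    targetRevision: main
    path: serving/overlays/prod                              # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-platform
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band changes
```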

Conclusion and Next Steps

Progressive Journey: AWS Native → Auto Mode → EKS Capability

EKS Auto Mode Benefits
| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads as soon as the cluster is created, without installing or configuring Karpenter |
| Automatic Upgrades | Automatic updates for core components such as Karpenter, CNI, and CSI |
| Automated Security Patching | Security vulnerability patches applied automatically |
| Custom Extensibility | Add custom settings such as GPU NodePools and EFA NodeClasses when needed |

Solution Summary by Challenge

Challenge Solutions Summary
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Resource Mgmt | DCGM + Prometheus | NodePool + MIG | 40% utilization improvement |
| Inference Routing | kgateway + Bifrost | llm-d KV Cache-aware routing | 50% faster provisioning |
| LLMOps Observability | LangSmith (Dev) + Langfuse (Prod) | Spot + Consolidation | 50–70% cost reduction |
| Agent Orchestration | LangGraph + NeMo Guardrails | Agent Pod auto-scaling | Safety & scalability |
| Model Supply Chain | MLflow + Kubeflow + ArgoCD | Training NodePool + EFA | 30% training efficiency |

EKS Auto Mode GPU Limitations and Hybrid Strategy

EKS Auto Mode is optimal for general workloads and basic GPU inference, but has limitations for advanced GPU features.

| Workload Type | Auto Mode Suitability | Reason |
|---|---|---|
| API Gateway, Agent Framework | Suitable | Non-GPU; automatic scaling is sufficient |
| Observability Stack | Suitable | Non-GPU; minimizes management burden |
| Basic GPU Inference (full GPU) | Suitable | AWS-managed GPU stack is sufficient |
| MIG Partitioning Needed | Unsuitable | Read-only NodeClass prevents MIG partitioning (the GPU Operator itself can still be installed) |
| Run:ai GPU Scheduling | Possible | Disable the Device Plugin via label after installing the GPU Operator |

Recommended hybrid configuration: Operate Auto Mode (general workloads) + Karpenter (advanced GPU features) in a single cluster. For detailed configuration, refer to EKS GPU Node Strategy.

Gateway API Limitations and Workarounds

EKS Auto Mode's built-in load balancer does not directly support the Kubernetes Gateway API. To use kgateway, provision an NLB with a separate Service (type: LoadBalancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kgateway-proxy
  namespace: kgateway-system
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: kgateway-proxy
  ports:
    - name: https
      port: 443
      targetPort: 8443
```

For complete 2-Tier Gateway architecture design, refer to LLM Gateway 2-Tier Architecture.

Key Recommendations

  1. Start with EKS Auto Mode: Create new clusters with Auto Mode to leverage automatic Karpenter configuration
  2. Advanced GPU Features on Karpenter Nodes: Add Karpenter NodePool when GPU Operator needed for MIG, Run:ai, etc.
  3. GPU NodePool Custom Definition: Add GPU NodePool suited to workload characteristics (separate inference/training/experimentation)
  4. Aggressive Spot Instance Use: Operate 70%+ of inference workloads with Spot
  5. Enable Consolidation by Default: Leverage auto-enabled Consolidation in EKS Auto Mode
  6. KEDA Integration: Link metric-based Pod scaling with Karpenter node provisioning

Choose Deployment Path

When Suitable:

  • Startups and small teams
  • Kubernetes beginner teams
  • Standard Agentic AI workloads

Getting Started:

For deployment guide, refer to EKS Auto Mode Official Documentation.

Advantages: Zero infrastructure management burden, AWS-optimized default settings, automatic security patches

Reference Documents for Scaling

| Area | Document | Content |
|---|---|---|
| GPU Node Strategy | EKS GPU Node Strategy | Auto Mode + Karpenter + Hybrid Nodes + security/troubleshooting |
| GPU Resource Management | GPU Resource Management | Karpenter scaling, KEDA, DRA, cost optimization |
| NVIDIA GPU Stack | NVIDIA GPU Stack | GPU Operator, DCGM, MIG, Time-Slicing |
| Model Serving | vLLM Model Serving | vLLM configuration, performance optimization |
| Distributed Inference | llm-d Distributed Inference | KV Cache-aware routing |
| Training Infrastructure | NeMo Framework | Distributed training, EFA networking |

References

Official Documentation

Papers / Technical Blogs