EKS-based Agentic AI Open Architecture
Before reading this document, refer to the following documents first:
- Platform Architecture — Structure and core layers of Agentic AI Platform
- Technical Challenges — 5 core challenges
- AI Platform Selection Guide — Managed vs open-source decision-making
- AWS Native Platform — Managed service-based alternative approach (for comparison reference)
Why EKS-based Open Architecture?
AWS Native Platform is a powerful approach for getting started quickly. However, when the following requirements arise, EKS-based open architecture becomes necessary:
- Self-hosted Open Weight Models (Llama, Qwen, DeepSeek)
- Hybrid Architecture (on-premises GPU + cloud)
- Custom Agent Workflows (LangGraph, MCP/A2A)
- Multi-provider Routing (Bifrost 2-Tier Gateway)
- Fine-grained GPU Cost Optimization (Spot, MIG, Consolidation)
For a 5-axis comparison of AWS Native, SageMaker Unified Studio, EKS open architecture, and hybrid approaches, refer to AI Platform Selection Guide.
Key Message: AWS Native → EKS is a complementary relationship. A realistic approach is to start with AWS Native and expand to EKS as needed. Both approaches can coexist within the same VPC.
Quick Start with EKS Auto Mode
EKS Cluster Configuration Options: Control Plane and Data Plane
EKS cluster configuration is divided into two independent layers.
Provisioned Control Plane (PCP)
PCP is a premium option that provisions control plane capacity in advance with fixed tiers, ensuring consistent API server performance.
PCP Tier Specifications
| Tier | API Concurrency (seats) | Pod Scheduling | etcd DB | SLA | Cost |
|---|---|---|---|---|---|
| Standard | Dynamic (AWS auto-adjusted) | Dynamic | 8GB | 99.95% | $0.10/hr |
| XL | 1,700 | 167/sec | 16GB | 99.99% | - |
| 2XL | 3,400 | 283/sec | 16GB | 99.99% | - |
| 4XL | 6,800 | 400/sec | 16GB | 99.99% | - |
| 8XL | 13,600 | 400/sec | 16GB | 99.99% | - |
Source: AWS EKS Provisioned Control Plane Official Documentation (K8s 1.30+ baseline). For PCP tier pricing, refer to AWS official pricing page.
Tier Selection Criteria: Metric-based Judgment
PCP tier should be selected based on Kubernetes control plane metrics.
Key Monitoring Metrics:
| Metric | Prometheus Query | Judgment Criterion |
|---|---|---|
| API Inflight Seats (Most Important) | `apiserver_flowcontrol_current_executing_seats` | Sustained usage above 1,200 seats → XL or higher |
| Pod Scheduling Rate | `scheduler_schedule_attempts_total{result="scheduled"}` | 100/sec or higher → XL, 200/sec or higher → 2XL |
| etcd DB Size | `apiserver_storage_size_bytes` | Exceeds 10GB → XL or higher required |
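These thresholds can be wired directly into alerting. Below is a minimal sketch of a Prometheus Operator `PrometheusRule` for the seat-count signal, assuming the Prometheus Operator CRDs are installed and control plane metrics are being scraped; the rule name, namespace, and 30-minute window are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-pcp-tier-signals   # illustrative name
  namespace: monitoring        # illustrative namespace
spec:
  groups:
    - name: eks-control-plane-capacity
      rules:
        - alert: APIServerSeatsSustainedHigh
          # Sustained API inflight seats above 1,200 suggests moving to PCP tier-xl or higher
          expr: sum(apiserver_flowcontrol_current_executing_seats) > 1200
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "API inflight seats above 1,200 for 30m — evaluate PCP tier-xl or higher"
```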
PCP is a control plane capacity option, and Auto Mode is a data plane management option. Both features can be used in combination.
Control Plane × Data Plane Comparison and Combination
| Feature | Standard (Default) | Provisioned Control Plane (PCP) |
|---|---|---|
| Scaling | Dynamic auto-scaling (AWS managed) | Fixed tier (pre-provisioned) |
| API Concurrency (seats) | Dynamic (AWS auto-adjusted) | XL: 1,700 / 2XL: 3,400 / 4XL: 6,800 / 8XL: 13,600 |
| Pod Scheduling Rate | Dynamic | XL: 167 / 2XL: 283 / 4XL–8XL: 400 pods/sec |
| etcd DB Size | 8 GB | 16 GB |
| SLA | 99.95% | 99.99% |
| Cost | $0.10/hr ($73/mo) | Per-tier pricing (see AWS pricing page) |
| Tier Selection Criteria | - | API Inflight Seats + Pod Scheduling Rate + etcd DB Size (NOT node count) |
| Feature | Managed Node Groups | Karpenter | EKS Auto Mode |
|---|---|---|---|
| Node Provisioning | Manual (ASG-based) | Automatic (Pod-driven) | Fully automatic (AWS managed) |
| GPU Optimization | Manual instance selection | Auto GPU selection | Auto + default NodeClass |
| Scaling Speed | Slow (ASG → EC2) | Fast (direct EC2 API) | Fast (built-in Karpenter) |
| Add-on Mgmt | Manual (CNI, CSI, etc.) | Manual | ✅ Automatic |
| Security Patches | Manual AMI update | Manual | ✅ Automatic |
| Cost Optimization | Limited | Consolidation + Spot | Consolidation + 7.5% surcharge |
| Operational Burden | High | Medium | Low |
| Combination | Control Plane | Data Plane | Best For |
|---|---|---|---|
| General AI Service | Standard | Auto Mode | Small-mid inference, minimal ops |
| GPU-Optimized Platform | Standard | Karpenter | Multi-GPU, Spot, cost optimization |
| Large AI Platform | PCP (tier-xl+) | Auto Mode | API perf guaranteed (1,700+ seats) + auto ops |
| Ultra-Scale Training | PCP (tier-4xl+) | Karpenter | API concurrency 6,800+, fine GPU control |
- Small-scale (PoC/Demo): Standard + Auto Mode — minimal operational burden, 99.95% SLA (eksctl sketch after this list)
- Medium-scale (Production Inference): Standard + Karpenter — GPU cost optimization, 99.95% SLA
- Large-scale (Enterprise AI): PCP XL + Auto Mode — up to 1,700 API seats, 99.99% SLA
- Extra-large-scale (Training Cluster): PCP 4XL+ + Karpenter — 6,800+ API seats, fine-grained GPU control
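To illustrate the first combination, a minimal eksctl spec for a Standard control plane with Auto Mode enabled might look like the following; the cluster name, region, and `autoModeConfig` fields are illustrative and should be verified against the current eksctl documentation:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: agentic-ai        # illustrative cluster name
  region: us-east-1       # illustrative region
autoModeConfig:
  enabled: true
  # Built-in NodePools managed by Auto Mode; custom GPU NodePools
  # can be added later as Kubernetes resources.
  nodePools: ["general-purpose", "system"]
```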
Amazon EKS and Karpenter: Maximizing Kubernetes Advantages
The combination of Amazon EKS and Karpenter makes the most of Kubernetes' strengths to deliver fully automated, optimized infrastructure. Karpenter provides node provisioning optimized for AI workloads, enabling faster scaling and finer-grained instance selection than the traditional Cluster Autoscaler.
For Karpenter v1.2+ GA features, NodePool configuration, GPU instance comparison, and cost optimization strategies, refer to GPU Resource Management.
| Aspect | Traditional Cluster Autoscaler | Karpenter on EKS |
|---|---|---|
| Scaling Speed | 60-90 seconds (ASG-based) | 10-30 seconds (direct EC2 API) |
| Instance Selection | Limited by ASG pre-configuration | Dynamic selection from 600+ EC2 types |
| GPU Workloads | Requires separate ASGs per GPU type | Single NodePool handles all GPU types |
| Spot Optimization | Manual fallback configuration | Automatic spot-to-on-demand fallback |
| Cost Efficiency | Limited consolidation | Aggressive bin-packing and consolidation |
| AWS Integration | Indirect via ASG | Direct EC2/Spot API calls |
| Configuration | ASG + IAM + Launch Templates | Simple NodePool CRD |
EKS Auto Mode: Complete Automation
EKS Auto Mode automatically configures and manages core components including Karpenter.
EKS Auto Mode vs Karpenter: Decision Guide
| Your Situation | Recommendation |
|---|---|
| New EKS cluster for Agentic AI | **Karpenter** (native AWS integration) |
| Existing cluster with CA | **Migrate to Karpenter** (worth the effort) |
| Need GPU autoscaling | **Karpenter** (required for GPU efficiency) |
| Simple CPU-only workloads | **EKS Auto Mode** (easiest option) |
| Multi-tenant platform | **Karpenter** (better isolation and cost attribution) |
| Regulated industries | **EKS Auto Mode** (compliance-friendly) |
EKS Auto Mode Configuration for GPU Workloads
EKS Auto Mode automatically configures and manages Karpenter; adding a GPU NodePool (sketched below) enables immediate AI workload deployment.
For detailed configuration including GPU NodePool composition, Spot/On-Demand strategy, Consolidation policy, refer to GPU Resource Management.
EKS Auto Mode fully supports accelerated computing instances, including NVIDIA GPUs.
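As a minimal sketch (assuming Auto Mode's built-in `default` NodeClass and its well-known `eks.amazonaws.com/*` instance labels; the instance families, taint, and limits are illustrative), a custom GPU NodePool could look like this:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference            # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com # Auto Mode's built-in NodeClass API group
        kind: NodeClass
        name: default
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["g5", "g6"]   # illustrative GPU families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule     # keep non-GPU pods off GPU nodes
  limits:
    nvidia.com/gpu: 16           # illustrative cluster-wide GPU cap
```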
re:Invent 2024/2025 New Features:
- EKS Hybrid Nodes (GA): Integrate on-premises GPU infrastructure into EKS cluster
- Enhanced Pod Identity v2: Cross-account IAM role support
- Native Inferentia/Trainium Support: Automatic Neuron SDK configuration
- Provisioned Control Plane: Pre-provisioning for large-scale AI training workloads
Agentic AI Components Deployable on Auto Mode
All core components of the Agentic AI platform can be deployed on EKS Auto Mode.
Inference: vLLM + llm-d
vLLM is an LLM inference-dedicated engine, and llm-d provides intelligent routing considering KV Cache state.
- vLLM: Dedicated LLM inference engine for open-weight models (Llama, Qwen, DeepSeek, etc.) — PagedAttention-based KV Cache optimization
- Triton Inference Server: Handles non-LLM inference (embedding, reranking, Whisper STT)
- llm-d: Maximizes prefix cache hit rate with KV Cache-aware routing
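As a rough sketch of what a single-GPU vLLM deployment looks like on the cluster (the public `vllm/vllm-openai` image serves an OpenAI-compatible API; the model ID and resource sizing are illustrative, and gated models additionally require an HF token):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # Entrypoint is the OpenAI-compatible API server; model ID is illustrative
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--max-model-len", "8192"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```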
For detailed configuration, refer to vLLM Model Serving and llm-d Distributed Inference.
Gateway: kgateway + Bifrost (2-Tier Gateway)
The 2-Tier Gateway architecture separates traffic management from model routing:
- Tier 1 (kgateway): Gateway API-based authentication, rate limiting, traffic management
- Tier 2 (Bifrost): Model abstraction, fallback, cost tracking, cascade routing
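Since kgateway implements the Kubernetes Gateway API, Tier 1 → Tier 2 routing can be sketched with a standard `HTTPRoute`; the Gateway and Bifrost Service names and port below are illustrative assumptions:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-api
  namespace: ai-platform
spec:
  parentRefs:
    - name: kgateway             # hypothetical Tier 1 Gateway name
      namespace: kgateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1           # OpenAI-style API prefix
      backendRefs:
        - name: bifrost          # hypothetical Tier 2 Bifrost Service
          port: 8080             # illustrative port
```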
For detailed architecture, refer to Inference Gateway Routing.
Agent: LangGraph + NeMo Guardrails + MCP/A2A
Agent workflows on EKS consist of:
- LangGraph: Multi-step agent workflow definition, conditional branching, parallel execution
- NeMo Guardrails: Prompt injection defense, PII leak prevention, output validation — For tool comparison and implementation details, refer to AI Gateway Guardrails
- MCP: Lets agent-ready applications expose tools in a standardized way
- A2A: Safe and efficient agent-to-agent communication
- Redis (ElastiCache): State management with LangGraph checkpointer
Agent Pods autoscale based on Redis queue length via KEDA.
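A minimal sketch of that KEDA wiring, assuming a Redis list serves as the task queue (the ElastiCache endpoint, list name, and threshold are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker
  namespace: ai-platform
spec:
  scaleTargetRef:
    name: agent-worker           # Deployment running the LangGraph agents
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        address: my-redis.example.cache.amazonaws.com:6379  # hypothetical ElastiCache endpoint
        listName: agent-task-queue                          # illustrative queue key
        listLength: "10"         # target backlog per replica
```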
For details, refer to Kagent Agent Management and AWS Native Platform — AgentCore & MCP. For Guardrails technology stack (Input/Output Guard, Tool Allow-list, kgateway/Bifrost integration), refer to AI Gateway Guardrails.
RAG + Observability
- Milvus: Vector DB — Core of RAG system (Details)
- Langfuse: Production LLM tracing, token cost tracking (Architecture, Deployment Guide)
- Prometheus + Grafana: Infrastructure metrics monitoring
EKS-based Easy Deployment
| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads immediately after cluster creation without Karpenter installation/configuration |
| Automatic Upgrades | Automatic updates for core components like Karpenter, CNI, CSI |
| Automated Security Patching | Automatic application of security vulnerability patches |
| Extensible with Custom Configuration | Add custom settings like GPU NodePool, EFA NodeClass when needed |
EKS Deployment Methods by Solution
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Monitoring | DCGM + Prometheus | NodePool-based integrated management | 40% improved resource utilization |
| Dynamic Scaling | HPA + KEDA | Just-in-Time provisioning (auto-configured) | 50% reduced provisioning time |
| Cost Control | Namespace Quota | Spot + Consolidation (auto-enabled) | 50-70% cost reduction |
| FM Fine-tuning | Kubeflow Operator | Training NodePool + EFA | 30% improved training efficiency |
Easy Deployment Example
For deployment guide, refer to Reference Architecture.
For GPU cost optimization strategies including Spot instance usage, Consolidation, and schedule-based cost management, refer to GPU Resource Management document.
For GPU Pod security policies, Network Policy, IAM, MIG isolation, and GPU troubleshooting guide, refer to EKS GPU Node Strategy document.
Minimize Infrastructure Operational Burden with EKS Capability
What is EKS Capability?
EKS Capability is a platform-level feature that integrates proven open-source tools and AWS services to effectively operate specific workloads on Amazon EKS.
Core EKS Capabilities for Agentic AI
| EKS Capability | Role | Agentic AI Use | Delivery Method |
|---|---|---|---|
| ACK (AWS Controllers for Kubernetes) | Kubernetes-native management of AWS services | S3 model storage, RDS metadata, SageMaker training jobs | EKS Add-on |
| KRO (Kubernetes Resource Orchestrator) | Abstraction and templating of composite resources | One-click deployment of AI inference stacks and training pipelines | EKS Add-on |
| Argo CD | GitOps-based continuous deployment | Automated model-serving deployment, rollback, environment sync | EKS Add-on |
Argo Workflows is not officially supported as an EKS Capability, so direct installation is required.
For deployment guide, refer to Argo Workflows Official Documentation.
ACK (AWS Controllers for Kubernetes)
ACK directly provisions and manages AWS services through Kubernetes Custom Resources. It can be easily installed as an EKS Add-on.
ACK Usage Examples in AI Platform:
| AWS Service | ACK Controller | Agentic AI Use |
|---|---|---|
| S3 | `s3.services.k8s.aws` | Model artifact storage, training data buckets |
| RDS/Aurora | `rds.services.k8s.aws` | Langfuse backend, metadata store |
| SageMaker | `sagemaker.services.k8s.aws` | Model training jobs, endpoint deployment |
| Secrets Manager | `secretsmanager.services.k8s.aws` | API keys, model credential management |
| ECR | `ecr.services.k8s.aws` | Container image registry |
S3 Bucket Creation Example with ACK:
```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: agentic-ai-models
  namespace: ai-platform
spec:
  name: agentic-ai-models-prod
  versioning:
    status: Enabled
  encryption:
    rules:
      - applyServerSideEncryptionByDefault:
          sseAlgorithm: aws:kms
  tags:
    - key: Project
      value: agentic-ai
```
KRO (Kubernetes Resource Orchestrator)
KRO combines multiple Kubernetes resources and AWS resources into one abstracted unit to deploy complex infrastructure simply.
Deploy AI Inference Stack as Single Resource with KRO:
```yaml
# Deploy the entire stack as a single resource.
# AIInferenceStack is a custom kind defined by a KRO ResourceGraphDefinition;
# kro.run is KRO's default API group for generated kinds.
apiVersion: kro.run/v1alpha1
kind: AIInferenceStack
metadata:
  name: llama-inference
  namespace: ai-platform
spec:
  modelName: llama-3-70b
  gpuType: g5.12xlarge
  minReplicas: 2
  maxReplicas: 20
```
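For context, the `AIInferenceStack` kind above would be defined by a ResourceGraphDefinition roughly like the following condensed sketch; the simple-schema fields and the single ConfigMap resource are illustrative (a real graph would wire up a Deployment, Service, HPA, and ACK resources in the same way):

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: ai-inference-stack
spec:
  schema:
    apiVersion: v1alpha1
    kind: AIInferenceStack
    spec:
      modelName: string
      gpuType: string
      minReplicas: integer | default=1
      maxReplicas: integer | default=10
  resources:
    - id: modelConfig            # illustrative single resource
      template:
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: ${schema.spec.modelName}-config
        data:
          model: ${schema.spec.modelName}
          gpuType: ${schema.spec.gpuType}
```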
Argo-based ML Pipeline Automation
Combining Argo Workflows and Argo CD enables full GitOps-style MLOps automation, from AI model training and evaluation through deployment.
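A condensed sketch of such a pipeline as an Argo Workflows DAG (the container images and scripts are hypothetical placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-train-eval-
  namespace: ai-platform
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: train
            template: train
          - name: evaluate
            template: evaluate
            dependencies: [train]    # run evaluation after training
    - name: train
      container:
        image: ghcr.io/example-org/trainer:latest    # hypothetical image
        command: [python, train.py]
    - name: evaluate
      container:
        image: ghcr.io/example-org/evaluator:latest  # hypothetical image
        command: [python, evaluate.py]
```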
ACK + KRO + ArgoCD Integration Architecture
| Component | Role | Automation Scope |
|---|---|---|
| Argo CD | GitOps deployment automation | Application deployment, rollback, synchronization |
| Argo Workflows | ML pipeline orchestration | Training, evaluation, and model registration workflows |
| KRO | Composite resource abstraction | Manages K8s + AWS resources as a single unit |
| ACK | Declarative AWS resource management | AWS services such as S3, RDS, SageMaker |
| Karpenter | GPU node provisioning | Just-in-Time instance provisioning |
- Developer: Model deployment with just Git push
- Platform Team: Minimize infrastructure management burden
- Cost Optimization: Dynamic provisioning of only necessary resources
- Consistency: Same deployment method across all environments
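To make the "Git push → deploy" flow concrete, a minimal Argo CD Application pointing at a GitOps repo might look like this (the repo URL and path are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama-inference
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ai-platform-gitops  # hypothetical repo
    targetRevision: main
    path: inference/llama-3-70b    # hypothetical manifest path
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-platform
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift
```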
Conclusion and Next Steps
Progressive Journey: AWS Native → Auto Mode → EKS Capability
EKS Auto Mode: Recommended Starting Point
| Benefit | Description |
|---|---|
| Immediate Start | Deploy GPU workloads immediately after cluster creation, with no Karpenter installation/configuration |
| Automatic Upgrades | Automatic updates for core components such as Karpenter, CNI, and CSI |
| Automated Security Patching | Automatic application of security vulnerability patches |
| Extensible with Custom Configuration | Add custom settings such as GPU NodePool and EFA NodeClass when needed |
Solution Summary by Challenge
| Challenge | Kubernetes-Based | EKS Auto Mode + Karpenter | Expected Effect |
|---|---|---|---|
| GPU Resource Mgmt | DCGM + Prometheus | NodePool + MIG | 40% utilization improvement |
| Inference Routing | kgateway + Bifrost | llm-d KV Cache-aware routing | 50% faster provisioning |
| LLMOps Observability | LangSmith (Dev) + Langfuse (Prod) | Spot + Consolidation | 50-70% cost reduction |
| Agent Orchestration | LangGraph + NeMo Guardrails | Agent Pod auto-scaling | Safety & scalability |
| Model Supply Chain | MLflow + Kubeflow + ArgoCD | Training NodePool + EFA | 30% training efficiency |
EKS Auto Mode GPU Limitations and Hybrid Strategy
EKS Auto Mode is optimal for general workloads and basic GPU inference, but has limitations for advanced GPU features.
| Workload Type | Auto Mode Suitability | Reason |
|---|---|---|
| API Gateway, Agent Framework | Suitable | Non-GPU, automatic scaling sufficient |
| Observability Stack | Suitable | Non-GPU, minimize management burden |
| Basic GPU Inference (Full GPU) | Suitable | AWS-managed GPU stack sufficient |
| MIG Partitioning Needed | Unsuitable | Cannot partition MIG with read-only NodeClass (GPU Operator itself can be installed) |
| Run:ai GPU Scheduling | Possible | Install GPU Operator, then disable the default Device Plugin via node label |
Recommended hybrid configuration: Operate Auto Mode (general workloads) + Karpenter (advanced GPU features) in a single cluster. For detailed configuration, refer to EKS GPU Node Strategy.
Gateway API Limitations and Workarounds
EKS Auto Mode's built-in load balancer does not directly support the Kubernetes Gateway API. To use kgateway, provision an NLB with a separate Service (type: LoadBalancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kgateway-proxy
  namespace: kgateway-system
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: kgateway-proxy
  ports:
    - name: https
      port: 443
      targetPort: 8443
```
For complete 2-Tier Gateway architecture design, refer to LLM Gateway 2-Tier Architecture.
Key Recommendations
- Start with EKS Auto Mode: Create new clusters with Auto Mode to leverage automatic Karpenter configuration
- Advanced GPU Features on Karpenter Nodes: Add Karpenter NodePool when GPU Operator needed for MIG, Run:ai, etc.
- GPU NodePool Custom Definition: Add GPU NodePool suited to workload characteristics (separate inference/training/experimentation)
- Aggressive Spot Instance Use: Operate 70%+ of inference workloads with Spot
- Enable Consolidation by Default: Leverage auto-enabled Consolidation in EKS Auto Mode
- KEDA Integration: Link metric-based Pod scaling with Karpenter node provisioning
Choose Deployment Path
- EKS Auto Mode (Recommended for Most)
- EKS + Karpenter (Maximum Control)
- Hybrid (Combine Advantages of Both)

EKS Auto Mode (Recommended for Most)
When Suitable:
- Startups and small teams
- Teams new to Kubernetes
- Standard Agentic AI workloads
Getting Started: For the deployment guide, refer to EKS Auto Mode Official Documentation.
Advantages: Zero infrastructure management burden, AWS-optimized default settings, automatic security patches

EKS + Karpenter (Maximum Control)
When Suitable:
- Large-scale production workloads
- Complex GPU requirements (mixed instance types)
- Cost optimization as the top priority
Getting Started: For the deployment guide, refer to Karpenter Official Documentation.
Advantages: Fine-grained instance control, maximum cost optimization (70-80% savings), custom AMIs

Hybrid (Combine Advantages of Both)
When Suitable:
- Growing platforms (start simple, expand as needed)
- Mixed workload types (CPU agents + GPU LLMs)
Getting Started: For the deployment guide, refer to Reference Architecture.
Advantages: Progressive complexity increase, GPU cost optimization, AWS-managed + custom combination
Reference Documents for Scaling
| Area | Document | Content |
|---|---|---|
| GPU Node Strategy | EKS GPU Node Strategy | Auto Mode + Karpenter + Hybrid Node + Security/Troubleshooting |
| GPU Resource Management | GPU Resource Management | Karpenter scaling, KEDA, DRA, cost optimization |
| NVIDIA GPU Stack | NVIDIA GPU Stack | GPU Operator, DCGM, MIG, Time-Slicing |
| Model Serving | vLLM Model Serving | vLLM configuration, performance optimization |
| Distributed Inference | llm-d Distributed Inference | KV Cache-aware routing |
| Training Infrastructure | NeMo Framework | Distributed training, EFA network |
References
Official Documentation
- Amazon EKS Documentation — EKS official documentation
- EKS Auto Mode — Auto Mode guide
- Karpenter Documentation — Karpenter official documentation
- KEDA - Kubernetes Event-driven Autoscaling — Event-driven autoscaling
Papers / Technical Blogs
- vLLM: Easy, Fast, and Cheap LLM Serving — vLLM official blog
- Efficient Memory Management for LLM Serving — PagedAttention paper
- AWS re:Invent 2024: EKS Auto Mode Deep Dive — Auto Mode session
- NVIDIA Developer Blog: AI on Kubernetes — GPU workload optimization
Related Documents (Internal)
- Platform Architecture — Overall system design
- Technical Challenges — 5 core challenges
- GPU Resource Management — Karpenter, KEDA, DRA
- vLLM Model Serving — vLLM deployment guide