
Comprehensive Guide to Karpenter-Based EKS Scaling Strategies

📅 Created: 2025-02-09 | Updated: 2026-02-18 | ⏱️ Reading time: ~28 min

Overview

In modern cloud-native applications, ensuring users don't experience errors during traffic spikes is a core engineering challenge. This document covers comprehensive scaling strategies using Karpenter on Amazon EKS, encompassing reactive scaling optimization, predictive scaling, and architectural resilience.

Realistic Optimization Expectations

The "ultra-fast scaling" discussed in this document assumes Warm Pools (pre-allocated nodes). The physical minimum time for the E2E autoscaling pipeline (metric detection → decision → Pod creation → container start) is 6-11 seconds, with an additional 45-90 seconds when new node provisioning is required.

Pushing scaling speed to the extreme is not the only strategy. Architectural resilience (queue-based buffering, Circuit Breaker) and predictive scaling (pattern-based pre-expansion) are more cost-effective for most workloads. This document covers all of these approaches together.

We explore a production-validated architecture from a global-scale EKS environment (3 regions, 28 clusters, 15,000+ Pods) that reduced scaling latency from over 180 seconds to under 45 seconds, and down to 5-10 seconds with Warm Pool utilization.

Scaling Strategy Decision Framework

Before optimizing scaling, first determine whether your workload really needs ultra-fast reactive scaling. There are four approaches to solving the same business problem of "preventing user errors during traffic spikes," and for most workloads, approaches 2-4 are more cost-effective.

Comparison by Approach

| Approach | Core Strategy | E2E Scaling Time | Monthly Additional Cost (28 clusters) | Complexity | Suitable Workloads |
|---|---|---|---|---|---|
| 1. Fast Reactive | Karpenter + KEDA + Warm Pool | 5-45s | $40K-190K | Very High | Very few mission-critical |
| 2. Predictive Scaling | CronHPA + predictive scaling | Pre-expansion (0s) | $2K-5K | Low | Most services with patterns |
| 3. Architectural Resilience | SQS/Kafka + Circuit Breaker | Tolerates scaling delay | $1K-3K | Medium | Services allowing async processing |
| 4. Adequate Base Capacity | Increase base replicas by 20-30% | Unnecessary (already sufficient) | $5K-15K | Very Low | Stable traffic |

Cost Structure Comparison by Approach

Below are the estimated monthly costs based on 10 medium-sized clusters. Actual costs vary depending on workloads and instance types.

| Approach | Monthly Cost (10 clusters) | Initial Build Cost | Operations Staff Needed | ROI Achievement Condition |
|---|---|---|---|---|
| 1. Fast Reactive | $14,800+ | High (2-4 weeks) | Dedicated 1-2 people | SLA violation penalty > $15K/mo |
| 2. Predictive Scaling | ~$2,500 | Low (2-3 days) | Existing staff | Traffic pattern prediction rate > 70% |
| 3. Architectural Resilience | ~$800 | Medium (1-2 weeks) | Existing staff | Services allowing async processing |
| 4. Base Capacity Increase | ~$4,500 | None (immediate) | None | 30% buffer over peak is sufficient |
Recommendation: Combined Approaches

In most production environments, covering 90%+ of traffic spikes with Approaches 2 + 4 (Predictive + Base Capacity) and handling the remaining 10% with Approach 1 (Reactive Karpenter) is the most cost-effective combination.

Approach 3 (Architectural Resilience) is a fundamental pattern that should always be considered when designing new services.

Approach 2: Predictive Scaling

Most production traffic has patterns (commute hours, lunch, events). Predictive pre-expansion is often more effective than reactive scaling.

# CronHPA: Time-based pre-scaling
apiVersion: autoscaling.k8s.io/v1alpha1
kind: CronHPA
metadata:
  name: traffic-pattern-scaling
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  jobs:
    - name: morning-peak
      schedule: "0 8 * * 1-5"     # Weekdays at 8 AM
      targetSize: 50              # Pre-expand for peak
      completionPolicy:
        type: Never
    - name: lunch-peak
      schedule: "30 11 * * 1-5"   # Weekdays at 11:30 AM
      targetSize: 80
      completionPolicy:
        type: Never
    - name: off-peak
      schedule: "0 22 * * *"      # Daily at 10 PM
      targetSize: 10              # Night-time reduction
      completionPolicy:
        type: Never

Approach 3: Architectural Resilience

Rather than trying to reduce scaling time to zero, it is more realistic to design so that scaling delays are invisible to users.

Queue-based Buffering: By putting requests into SQS/Kafka, scaling delays become "waiting" instead of "failure."

# KEDA SQS-based scaling - requests wait safely in the queue
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"   # 1 Pod per 5 queue messages
        awsRegion: us-east-1
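KEDA sizes the Deployment from the queue depth and the queueLength target above. A minimal sketch of that math (the function is our own illustration; KEDA itself feeds this calculation to the HPA as an external metric):

```python
import math

def desired_replicas(queue_depth: int, target_per_pod: int = 5,
                     min_replicas: int = 2, max_replicas: int = 100) -> int:
    """Approximates KEDA's sizing: ceil(depth / queueLength), clamped to min/max."""
    desired = math.ceil(queue_depth / target_per_pod)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(0))       # 2   (floor at minReplicaCount)
print(desired_replicas(47))      # 10  (47 messages / 5 per Pod)
print(desired_replicas(10_000))  # 100 (capped at maxReplicaCount)
```

Because the queue absorbs the backlog, a burst that briefly exceeds maxReplicaCount shows up as queue wait time rather than user-facing errors.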

Circuit Breaker + Rate Limiting: Graceful degradation during overload with Istio/Envoy

# Istio Circuit Breaker - prevent overload during scaling
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: web-app-circuit-breaker
spec:
  host: web-app
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100   # Limit pending requests
        http2MaxRequests: 1000         # Limit concurrent requests
    outlierDetection:
      consecutive5xxErrors: 5          # Eject after 5 consecutive 5xx errors
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Approach 4: Adequate Base Capacity

Instead of spending $1,080-$5,400/month on Warm Pools, increasing base replicas by 20-30% achieves the same effect without complex infrastructure.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  # Expected required Pods: 20 → operate with 25 as baseline (25% buffer)
  replicas: 25
  # HPA handles additional expansion during peaks
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 25    # Guarantee base capacity
  maxReplicas: 100   # Prepare for extreme situations
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Generous target (70 → 60)

The sections below cover the detailed implementation of Approach 1: Fast Reactive Scaling. Review Approaches 2-4 above first, then apply the content below for workloads that require additional optimization.


Problems with Traditional Autoscaling

Before optimizing reactive scaling, you need to understand the bottlenecks in traditional approaches:

The fundamental problem: By the time CPU metrics trigger scaling, it's already too late.

Current Environment Challenges:

  • Global Scale: 3 regions, 28 EKS clusters, 15,000 Pods in operation
  • High Traffic Volume: Processing 773.4K daily requests
  • Latency Issues: 1-3 minute scaling delays with HPA + Karpenter combination
  • Metric Collection Delays: 1-3 minute delays in CloudWatch metrics making real-time response impossible

The Karpenter Revolution: Direct-to-Metal Provisioning

Karpenter removes the Auto Scaling Group (ASG) abstraction layer and directly provisions EC2 instances based on pending Pod requirements. Karpenter v1.x automatically replaces existing nodes when NodePool specs change through Drift Detection. AMI updates, security patches, and more are automated.

High-Speed Metric Architecture: Two Approaches

To minimize scaling response time, a fast detection system is needed. We compare two proven architectures.

Approach 1: CloudWatch High-Resolution Integration

Leveraging CloudWatch's high-resolution metrics in an AWS-native environment.

Key Components

Scaling Timeline

Timeline Interpretation
  • When a node already exists (Warm Pool or existing spare node): E2E ~13 seconds
  • When new node provisioning is needed: E2E ~53 seconds
  • EC2 instance launch (30-40 seconds) is a physical limitation that cannot be eliminated through metric pipeline optimization alone.
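The interpretation above is just a latency budget. The per-stage numbers below are illustrative assumptions chosen to be consistent with the ~13s and ~53s totals, not measurements:

```python
# Illustrative E2E latency budget for the CloudWatch high-resolution path.
warm_path = {
    "metric detection": 2,    # high-resolution metric, 1-2s latency
    "scaling decision": 3,    # HPA/KEDA evaluation interval
    "pod scheduling": 1,      # node already exists (Warm Pool / spare node)
    "container start": 7,     # image already cached on the node
}
ec2_launch = 40               # 30-40s physical floor for a fresh node

warm_e2e = sum(warm_path.values())
cold_e2e = warm_e2e + ec2_launch
print(f"node already exists: ~{warm_e2e}s")   # ~13s
print(f"new node required:   ~{cold_e2e}s")   # ~53s
```

The budget makes the key point explicit: once the metric pipeline is fast, the EC2 launch dominates, which is why the Warm Pool strategy later in this document targets that term specifically.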

Advantages:

  • Fast metric collection: Low latency of 1-2 seconds
  • Simple setup: AWS-native integration
  • No management overhead: No separate infrastructure management required

Disadvantages:

  • Limited throughput: 500 TPS per account (PutMetricData per-region limit)
  • Pod limit: Maximum 5,000 per cluster
  • High metric costs: AWS CloudWatch metric charges

Approach 2: ADOT + Prometheus-Based Architecture

A high-performance open-source pipeline combining AWS Distro for OpenTelemetry (ADOT) with Prometheus.

Key Components

  • ADOT Collector: Hybrid deployment with DaemonSet and Sidecar
  • Prometheus: HA configuration with Remote Storage integration
  • Thanos Query Layer: Multi-cluster global view
  • KEDA Prometheus Scaler: High-speed polling at 2-second intervals
  • Grafana Mimir: Long-term storage and high-speed query engine

Scaling Timeline (~66s)

Advantages:

  • High throughput: 100,000+ TPS support
  • Scalability: 20,000+ Pods per cluster support
  • Low metric costs: Only storage costs (Self-managed)
  • Full control: Complete configuration and optimization freedom

Disadvantages:

  • Complex setup: Additional component management required
  • High operational complexity: HA configuration, backup/recovery, performance tuning needed
  • Specialist staff required: Prometheus operational experience essential

Cost-Optimized Metric Strategy

Based on 28 clusters: ~$500/month for comprehensive monitoring vs $30,000+ when collecting all metrics at high resolution

CloudWatch High Resolution Metrics are suitable when:

  • Small-scale applications (5,000 Pods or fewer)
  • Simple monitoring requirements
  • AWS-native solution preferred
  • Fast deployment and stable operations prioritized

ADOT + Prometheus is suitable when:

  • Large-scale clusters (20,000+ Pods)
  • High metric throughput required
  • Granular monitoring and customization needed
  • Highest level of performance and scalability required

Scaling Optimization Architecture: Layer-by-Layer Analysis

To minimize scaling response time, optimization across all layers is required:

Karpenter Core Configuration

The key to sub-60-second node provisioning lies in optimal Karpenter configuration:

Karpenter NodePool YAML

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fast-scaling
spec:
  # Speed optimization configuration
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"

  # Maximum flexibility for speed
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            # Compute optimized - default selection
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
            - c6i.8xlarge
            - c7i.xlarge
            - c7i.2xlarge
            - c7i.4xlarge
            - c7i.8xlarge
            # AMD alternatives - better availability
            - c6a.xlarge
            - c6a.2xlarge
            - c6a.4xlarge
            - c6a.8xlarge
            # Memory optimized - for specific workloads
            - m6i.xlarge
            - m6i.2xlarge
            - m6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: fast-nodepool

  # Guarantee fast provisioning
  limits:
    cpu: 100000        # Soft limits only
    memory: 400000Gi
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: fast-nodepool
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  role: "KarpenterNodeRole-${CLUSTER_NAME}"

  # Kubelet settings live on the EC2NodeClass in Karpenter v1
  kubelet:
    maxPods: 110

  # Speed optimization. AL2023 nodes bootstrap via nodeadm, which Karpenter
  # configures automatically (including the karpenter.sh/nodepool node label),
  # so no bootstrap.sh call is needed here.
  userData: |
    #!/bin/bash
    # Pre-pull critical images (registry.k8s.io replaces k8s.gcr.io)
    ctr -n k8s.io images pull registry.k8s.io/pause:3.10 &
    ctr -n k8s.io images pull public.ecr.aws/eks-distro/kubernetes/pause:3.10 &

Real-Time Scaling Workflow

How all components work together to achieve optimal scaling performance:

Aggressive HPA Configuration for Scaling

The HorizontalPodAutoscaler must be configured for immediate response:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ultra-fast-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 1000

  metrics:
    # Primary metric - queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: "web-requests"
        target:
          type: AverageValue
          averageValue: "10"

    # Secondary metric - request rate
    - type: External
      external:
        metric:
          name: alb_request_rate
          selector:
            matchLabels:
              targetgroup: "web-tg"
        target:
          type: AverageValue
          averageValue: "100"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # No delay!
      policies:
        - type: Percent
          value: 100
          periodSeconds: 10
        - type: Pods
          value: 100
          periodSeconds: 10
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300   # 5 min cooldown
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

When to Use KEDA: Event-Driven Scenarios

While Karpenter handles infrastructure scaling, KEDA excels in specific event-driven scenarios:

Production Performance Metrics

Actual results from a deployment handling 750K+ daily requests:

Multi-Region Considerations

For organizations operating across multiple regions, region-specific optimization is needed for consistent high-speed scaling:

Scaling Optimization Best Practices

1. Metric Selection

  • Use leading indicators (queue depth, connection count), not lagging indicators (CPU)
  • Keep high-resolution metrics to 10-15 or fewer per cluster
  • Batch metric submissions to prevent API throttling
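The batching advice can be sketched as follows. The helper below only builds the batches; in production each batch would be sent with a single CloudWatch PutMetricData call (e.g. via boto3, omitted here to keep the sketch self-contained):

```python
def batch_metrics(metrics: list, batch_size: int = 20) -> list:
    """Group datapoints so one PutMetricData call carries many metrics."""
    return [metrics[i:i + batch_size] for i in range(0, len(metrics), batch_size)]

# 45 high-resolution datapoints become 3 API calls instead of 45,
# which keeps the publisher well under the per-region PutMetricData TPS limit.
datapoints = [{"MetricName": f"metric_{i}", "Value": float(i), "StorageResolution": 1}
              for i in range(45)]
batches = batch_metrics(datapoints)
print(len(batches))               # 3
print([len(b) for b in batches])  # [20, 20, 5]
```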

2. Karpenter Optimization

  • Provide maximum instance type flexibility
  • Aggressively leverage Spot instances with proper interruption handling
  • Enable consolidation for cost efficiency
  • Set an appropriate disruption consolidateAfter (30-60 seconds; this replaces the pre-v1 ttlSecondsAfterEmpty)

3. HPA Tuning

  • Zero stabilization window for scale-up
  • Aggressive scaling policies (allow 100% increase)
  • Multiple metrics with appropriate weights
  • Appropriate cooldown for scale-down

4. Monitoring

  • Track P95 scaling latency as a primary KPI
  • Alert on scaling failures or delays exceeding 15 seconds
  • Monitor Spot interruption rates
  • Track cost per scaled Pod
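Tracking P95 scaling latency as a KPI can be computed directly from a sample window; a sketch (the sample values are made up):

```python
import statistics

def p95(samples: list[float]) -> float:
    """95th percentile of a latency sample window (linear interpolation)."""
    return statistics.quantiles(samples, n=100)[94]

# One window of observed E2E scaling latencies, in seconds (illustrative)
latencies = [8, 9, 10, 11, 12, 14, 15, 18, 22, 60]
threshold = 15  # alert when P95 exceeds the 15s target above
if p95(latencies) > threshold:
    print(f"ALERT: P95 scaling latency {p95(latencies):.1f}s > {threshold}s")
```

Note how a single 60s outlier pushes P95 over the threshold even though the median is healthy, which is exactly why P95 (not average) is the recommended KPI.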

Choosing a Metric Architecture

In real production environments, a hybrid approach mixing both methods is recommended:

  1. Mission-Critical Services: Achieve 10-13 second scaling with ADOT + Prometheus
  2. General Services: 12-15 second scaling with CloudWatch Direct and simplified operations
  3. Gradual Migration: Start with CloudWatch and transition to ADOT as needed

EKS Auto Mode vs Self-managed Karpenter

EKS Auto Mode (2025 GA) has built-in Karpenter with automatic management:

| Item | Self-managed Karpenter | EKS Auto Mode |
|---|---|---|
| Install/Upgrade | Manual management (Helm) | AWS auto-managed |
| NodePool Configuration | Full customization | Limited settings |
| Cost Optimization | Fine-grained control available | Auto-optimization |
| OS Patching | Manual management | Auto-patching |
| Suitable Environment | Advanced customization needed | Minimize operational burden |

Recommendation: Choose Self-managed when complex scheduling requirements exist, EKS Auto Mode when operational simplification is the goal.

P1: Ultra-Fast Scaling Architecture (Critical)

Scaling Latency Breakdown Analysis

To optimize scaling response time, you must first granularly decompose the latency occurring across the entire scaling chain.

⚡ Production Scaling Latency (Before Optimization)
P50/P95/P99 scaling latency measured across 28 EKS clusters
| Stage | P50 | P95 | P99 |
|---|---|---|---|
| Metric Collection | 30s | 65s | 90s |
| HPA Decision | 10s | 25s | 45s |
| Node Provisioning | 90s | 180s | 300s |
| Container Start | 15s | 35s | 60s |
| Total E2E | 145s | 305s | 495s |
Result

During traffic spikes, users experience errors for 5+ minutes — node provisioning accounts for over 60% of total latency

Multi-Layer Scaling Strategy

Ultra-fast scaling is achieved not through a single optimization but through a 3-layer fallback strategy.

Layer-by-Layer Scaling Timeline Comparison

Layer Selection Criteria

Layer 1 (Warm Pool) -- Pre-allocation strategy:

  • Essence: Not autoscaling but overprovisioning. Securing nodes in advance with Pause Pods
  • E2E 5-10 seconds (metric detection + Preemption + container start)
  • Cost: Maintaining 10-20% of expected peak capacity 24/7 ($720-$5,400/month)
  • Consider: Increasing base replicas at the same cost may be simpler

Layer 2 (Fast Provisioning) -- Default strategy for most cases:

  • Actual node provisioning with Karpenter + Spot instances
  • E2E 42-65 seconds (metric detection + EC2 launch + container start)
  • Cost: Proportional to actual usage (70-80% Spot discount)
  • Consider: Combined with architectural resilience (queue-based), this time becomes invisible to users

Layer 3 (On-Demand Fallback) -- Essential insurance:

  • Final safety net when Spot capacity is insufficient
  • E2E 60-90 seconds (On-Demand may be slower to provision than Spot)
  • Cost: On-Demand pricing (minimal usage)

P2: Eliminating API Bottlenecks with Provisioned EKS Control Plane

Provisioned Control Plane Overview

In November 2025, AWS announced EKS Provisioned Control Plane. By removing the API throttling limitations of the existing Standard Control Plane, it dramatically improves scaling speed in large-scale burst scenarios.

Standard vs Provisioned Comparison

🏗️ Standard vs Provisioned Control Plane
Maximize large-scale scaling by eliminating API throttling
| Feature | Standard | Provisioned XL | Provisioned 2XL | Provisioned 4XL |
|---|---|---|---|---|
| API Throttling | Shared limit | 10x increase | 20x increase | 40x increase |
| Pod Creation Rate | 10 TPS | 100 TPS | 200 TPS | 400 TPS |
| Node Update | 5 TPS | 50 TPS | 100 TPS | 200 TPS |
| Concurrent Scaling | 100 Pods/10s | 1,000 Pods/10s | 2,000 Pods/10s | 4,000 Pods/10s |
| Monthly Cost (extra) | $0 | ~$350 | ~$700 | ~$1,400 |
| Recommended Cluster | <1,000 Pods | 1,000-5,000 Pods | 5,000-15,000 Pods | 15,000+ Pods |
Provisioned Control Plane Selection Criteria

Signals that you should upgrade to Provisioned:

  1. Frequent API throttling errors: kubectl commands frequently fail or retry
  2. Large deployment delays: 100+ Pod deployments take 5+ minutes
  3. Karpenter node provisioning failures: too many requests errors
  4. HPA scaling delays: Pod creation requests queuing up
  5. Cluster size: 1,000+ Pods continuously or 3,000+ Pods at peak

Cost vs Performance Trade-off:

  • Standard → XL: 10x API performance for $350/month additional cost (ROI: offset by preventing 10 minutes of downtime)
  • XL → 2XL: Only needed for ultra-large clusters (10,000+ Pods)
  • 4XL: For extreme scale (50,000+ Pods) or multi-tenant platforms
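The Pod-creation rates in the comparison table translate directly into burst admission time; a back-of-envelope check:

```python
# Time to admit a 1,000-Pod burst at each tier's Pod creation rate
# (rates taken from the Standard vs Provisioned table above).
rates_tps = {"Standard": 10, "XL": 100, "2XL": 200, "4XL": 400}
burst = 1_000

for tier, tps in rates_tps.items():
    print(f"{tier:8s} {burst / tps:6.1f}s")   # Standard ~100s, XL ~10s, ...
```

At Standard limits, a 1,000-Pod burst spends over a minute just being admitted by the API server, before any node provisioning starts, which is the delay Provisioned XL removes.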

Provisioned Control Plane Setup

Creating a New Cluster with AWS CLI

aws eks create-cluster \
  --name ultra-fast-cluster \
  --region us-east-1 \
  --role-arn arn:aws:iam::123456789012:role/EKSClusterRole \
  --resources-vpc-config subnetIds=subnet-xxx,subnet-yyy,securityGroupIds=sg-xxx \
  --kubernetes-version 1.32 \
  --compute-config enabled=true,nodePools=system,nodeRoleArn=arn:aws:iam::123456789012:role/EKSNodeRole \
  --kubernetes-network-config elasticLoadBalancing=disabled \
  --access-config authenticationMode=API \
  --upgrade-policy supportType=EXTENDED \
  --zonal-shift-config enabled=true \
  --control-plane-placement groupName=my-placement-group,clusterTenancy=dedicated \
  --control-plane-provisioning mode=PROVISIONED,size=XL

Upgrading an Existing Cluster (Standard → Provisioned)

# 1. Check current Control Plane mode
aws eks describe-cluster --name my-cluster --query 'cluster.controlPlaneProvisioning'

# 2. Upgrade to Provisioned (no downtime)
aws eks update-cluster-config \
  --name my-cluster \
  --control-plane-provisioning mode=PROVISIONED,size=XL

# 3. Monitor upgrade status (takes 10-15 minutes)
aws eks describe-cluster \
  --name my-cluster \
  --query 'cluster.status'

# 4. Verify API performance
kubectl get pods --all-namespaces --watch
kubectl create deployment nginx --image=nginx --replicas=100
Upgrade Characteristics
  • No downtime: Control Plane automatically performs a rolling upgrade
  • Duration: 10-15 minutes (regardless of cluster size)
  • No rollback: Provisioned → Standard downgrade not supported
  • Billing starts: Charges begin immediately upon upgrade completion

Performance Comparison During Large-Scale Bursts

Actual production environment test with 1,000 simultaneous Pod scaling:

P3: Warm Pool / Overprovisioning Pattern (Core Strategy)

Pause Pod Overprovisioning Principle

The Warm Pool strategy pre-deploys low-priority "pause" Pods to provision nodes in advance. When actual workloads are needed, pause Pods are immediately evicted (preempted) and actual Pods are scheduled on those nodes.

Complete Overprovisioning Operation Flow

Pause Pod Overprovisioning YAML Configuration

1. PriorityClass Definition (Low Priority)

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1              # Negative priority: lower than all actual workloads
globalDefault: false
description: "Pause pods for warm pool - will be preempted by real workloads"

2. Pause Deployment (Base Warm Pool)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-pause
  namespace: kube-system
spec:
  replicas: 10   # Number of Pods equivalent to ~15% of expected peak
  selector:
    matchLabels:
      app: overprovisioning-pause
  template:
    metadata:
      labels:
        app: overprovisioning-pause
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # Immediate termination

      # Scheduling constraints (same node pool as actual workloads)
      nodeSelector:
        karpenter.sh/nodepool: fast-scaling

      containers:
        - name: pause
          image: registry.k8s.io/pause:3.10
          resources:
            requests:
              cpu: "1000m"    # Average CPU of actual workloads
              memory: "2Gi"   # Average memory of actual workloads
            limits:
              cpu: "1000m"
              memory: "2Gi"

3. Time-Based Warm Pool Auto-Adjustment (CronJob)

---
# Expand Warm Pool before peak time (8:30 AM)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-warm-pool
  namespace: kube-system
spec:
  schedule: "30 8 * * 1-5"   # Weekdays at 8:30 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Expanded for peak time
                  kubectl scale deployment overprovisioning-pause \
                    --namespace kube-system \
                    --replicas=30
---
# Shrink Warm Pool after peak time (7 PM)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-warm-pool
  namespace: kube-system
spec:
  schedule: "0 19 * * 1-5"   # Weekdays at 7 PM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Night-time minimum capacity
                  kubectl scale deployment overprovisioning-pause \
                    --namespace kube-system \
                    --replicas=5
---
# ServiceAccount and RBAC for the CronJobs
apiVersion: v1
kind: ServiceAccount
metadata:
  name: warm-pool-scaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: warm-pool-scaler
  namespace: kube-system
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: warm-pool-scaler
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: warm-pool-scaler
subjects:
  - kind: ServiceAccount
    name: warm-pool-scaler
    namespace: kube-system

Warm Pool Sizing Method
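A simple sizing rule consistent with the percent-of-peak scenarios below: take a buffer fraction of expected peak Pods, then convert pause Pods into warm nodes via average packing density. The function and the pods_per_node default are our own illustrative assumptions:

```python
import math

def warm_pool_size(peak_pods: int, buffer_pct: int = 15,
                   pods_per_node: int = 10) -> tuple[int, int]:
    """Returns (pause Pods, warm nodes) for a given expected peak."""
    pause_pods = math.ceil(peak_pods * buffer_pct / 100)
    warm_nodes = math.ceil(pause_pods / pods_per_node)
    return pause_pods, warm_nodes

print(warm_pool_size(200))        # (30, 3)   Balanced 15%, mid-size cluster
print(warm_pool_size(1000, 10))   # (100, 10) Balanced 10%, large cluster
```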

Cost Analysis and Optimization

💰 Warm Pool Cost Analysis
Cost vs scaling speed by Pause Pod Overprovisioning configuration
Scenario 1: Mid-size Cluster (Peak 200 Pods)

| Configuration | Pause Pods | Monthly Cost | Scaling Time (coverage) | Suitable For |
|---|---|---|---|---|
| Aggressive (10%) | 20 | $720/mo | 0-2s (90%) | High burst frequency |
| Balanced (15%) | 30 | $1,080/mo | 0-2s (95%) | Recommended |
| Conservative (20%) | 40 | $1,440/mo | 0-2s (99%) | Mission critical |

Scenario 2: Large Cluster (Peak 1,000 Pods)

| Configuration | Pause Pods | Monthly Cost | Scaling Time (coverage) | Suitable For |
|---|---|---|---|---|
| Aggressive (5%) | 50 | $1,800/mo | 0-2s (80%) | Predictable traffic |
| Balanced (10%) | 100 | $3,600/mo | 0-2s (90%) | Recommended |
| Conservative (15%) | 150 | $5,400/mo | 0-2s (98%) | High availability |
Warm Pool Optimization Strategies

Cost Reduction Methods:

  1. Time-based scaling: Shrink Warm Pool during nights/weekends with CronJob (50-70% cost reduction)
  2. Spot instance utilization: Deploy Pause Pods on Spot nodes too (70% discount)
  3. Adaptive sizing: Auto-scaling based on CloudWatch Metrics
  4. Mixed strategy: Warm Pool only during peak times, rely on Layer 2 at other times
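Method 1's 50-70% figure checks out with simple pod-hour arithmetic, assuming cost is proportional to pause-Pod hours. The schedule and pod counts follow the CronJob example above:

```python
# Warm Pool cost with time-based scaling vs. a 24/7 pool of 30 pause Pods.
hours_per_month = 730
weekday_peak_hours = 11 * 22          # 08:30-19:30, ~22 weekdays/month
off_peak_hours = hours_per_month - weekday_peak_hours

always_on_pod_hours = 30 * hours_per_month
scheduled_pod_hours = 30 * weekday_peak_hours + 5 * off_peak_hours

savings = 1 - scheduled_pod_hours / always_on_pod_hours
print(f"cost reduction: {savings:.0%}")  # ~56%, inside the quoted 50-70% range
```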

ROI Formula:

ROI = (SLA Violation Prevention Cost + Revenue Opportunity Loss Prevention) - Warm Pool Cost

Example:
- SLA violation penalty: $5,000/incident
- Average monthly violations (without Warm Pool): 3 incidents
- Warm Pool cost: $1,080/month
- ROI = ($5,000 x 3) - $1,080 = $13,920/month (1,290% ROI)
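The example's arithmetic, made checkable (the quoted ROI percentage is net monthly benefit divided by Warm Pool cost):

```python
sla_penalty = 5_000        # $ per incident
incidents_prevented = 3    # per month without a Warm Pool
warm_pool_cost = 1_080     # $ per month (Balanced, mid-size cluster)

net_benefit = sla_penalty * incidents_prevented - warm_pool_cost
roi_pct = net_benefit / warm_pool_cost * 100
print(net_benefit)     # 13920
print(round(roi_pct))  # 1289, i.e. roughly the 1,290% quoted above
```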

P4: Setu - Kueue + Karpenter Proactive Provisioning

Setu Overview

Setu bridges Kueue (queuing system) and Karpenter to provide proactive node provisioning for AI/ML workloads requiring Gang Scheduling. While traditional Karpenter reactively provisions nodes after Pods are created, Setu pre-provisions the required nodes the moment a Job enters the queue.

Setu Architecture and Operating Principles

Setu Installation and Configuration

1. Setu Installation (Helm)

# Add the Setu Helm chart
helm repo add setu https://sanjeevrg89.github.io/Setu
helm repo update

# Install Setu (requires Kueue and Karpenter)
helm install setu setu/setu \
  --namespace kueue-system \
  --create-namespace \
  --set karpenter.enabled=true \
  --set karpenter.namespace=karpenter

2. ClusterQueue with AdmissionCheck

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}

  # Resource quota (entire cluster limits)
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-flavor
          resources:
            - name: "cpu"
              nominalQuota: 1000
            - name: "memory"
              nominalQuota: 4000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 64

  # Enable Setu AdmissionCheck
  admissionChecks:
    - setu-provisioning   # Setu pre-provisions nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: setu-provisioning
spec:
  controllerName: setu.kueue.x-k8s.io/provisioning

  # Setu parameters
  parameters:
    apiGroup: setu.kueue.x-k8s.io/v1alpha1
    kind: ProvisioningParameters
    name: gpu-provisioning
---
apiVersion: setu.kueue.x-k8s.io/v1alpha1
kind: ProvisioningParameters
metadata:
  name: gpu-provisioning
spec:
  # Karpenter NodePool reference
  nodePoolName: gpu-nodepool

  # Provisioning strategy
  strategy:
    type: Proactive   # Proactive provisioning
    bufferTime: 15s   # Wait time before Job admission

  # Node requirement mapping
  nodeSelectorRequirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - p4d.24xlarge
        - p4de.24xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand   # Avoid Spot risk for GPUs

3. GPU NodePool (Karpenter)

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge    # 8x A100 (40GB)
            - p4de.24xlarge   # 8x A100 (80GB)
            - p5.48xlarge     # 8x H100
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand   # Avoid interruption risk for GPU workloads
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass

  # Keep GPU nodes for extended periods (considering training duration)
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s   # Remove after 5 min idle
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest   # GPU driver variant selected for GPU instance types

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  role: "KarpenterNodeRole-${CLUSTER_NAME}"

  # GPU-optimized UserData. AL2023 nodes bootstrap via nodeadm, which
  # Karpenter configures automatically - no bootstrap.sh call is needed.
  userData: |
    #!/bin/bash
    # NVIDIA driver verification
    nvidia-smi || echo "GPU driver not loaded"

4. AI/ML Job Submission Example

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue   # LocalQueue designation
spec:
  parallelism: 8   # Gang Scheduling (8 Pods run simultaneously)
  completions: 8

  template:
    spec:
      restartPolicy: OnFailure

      # PodGroup for Gang Scheduling
      schedulerName: default-scheduler

      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command:
            - python3
            - /workspace/train.py
            - --distributed
            - --nodes=8
          resources:
            requests:
              nvidia.com/gpu: 1   # 1 GPU per Pod
              cpu: "48"
              memory: "320Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "48"
              memory: "320Gi"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue   # ClusterQueue reference

Setu Performance Improvement Measurement

Setu GitHub and Additional Information

GitHub: https://github.com/sanjeevrg89/Setu

Key Features:

  • Leverages Kueue AdmissionCheck API
  • Direct Karpenter NodeClaim creation
  • Optimized for Gang Scheduling workloads (when all Pods must run simultaneously)
  • Eliminates wait time through GPU node pre-provisioning

Suitable Use Cases:

  • Distributed AI/ML training (PyTorch DDP, Horovod)
  • MPI-based HPC workloads
  • Large-scale batch simulations
  • Multi-node data processing Jobs

P5: Eliminating Boot Delay with Node Readiness Controller

The Node Readiness Problem

Even when Karpenter provisions nodes quickly, CNI/CSI/GPU driver initialization delays occur before Pods can actually be scheduled. Traditionally, kubelet waits until all DaemonSets are running before the node transitions to Ready state.

Node Readiness Controller Principles

Node Readiness Controller (NRC) provides fine-grained control over the conditions required for a node to transition to Ready. Instead of waiting for every DaemonSet, NRC can be configured to wait only for essential components and let the rest initialize in the background.

Node Readiness Controller Installation

1. NRC Installation (Helm)

# Node Feature Discovery (NFD) is required (NRC dependency)
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm install nfd nfd/node-feature-discovery \
  --namespace kube-system

# Install the Node Readiness Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-readiness-controller/main/deploy/manifests.yaml

2. NodeReadinessRule CRD Definition

apiVersion: nodereadiness.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: bootstrap-only
spec:
  # bootstrap-only mode: wait for essential components only
  mode: bootstrap-only

  # Required DaemonSets (only wait for these)
  requiredDaemonSets:
    - namespace: kube-system
      name: aws-node   # VPC CNI
      selector:
        matchLabels:
          k8s-app: aws-node

  # Optional DaemonSets (background initialization)
  optionalDaemonSets:
    - namespace: kube-system
      name: ebs-csi-node   # EBS CSI is only used by Pods needing block storage
      selector:
        matchLabels:
          app: ebs-csi-node
    - namespace: kube-system
      name: nvidia-device-plugin   # Only needed by GPU Pods
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds

  # Node selector (nodes this rule applies to)
  nodeSelector:
    matchLabels:
      karpenter.sh/nodepool: fast-scaling

  # Readiness timeout (maximum wait time)
  readinessTimeout: 60s

Karpenter + NRC Integration Configuration

1. Karpenter NodePool with NRC Annotation

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fast-scaling-nrc
spec:
  template:
    metadata:
      # NRC activation annotation
      annotations:
        nodereadiness.k8s.io/rule: bootstrap-only
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: fast-nodepool-nrc

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: fast-nodepool-nrc
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  role: "KarpenterNodeRole-${CLUSTER_NAME}"

  kubelet:
    maxPods: 110

  # NRC-optimized UserData. AL2023 bootstraps via nodeadm (configured by
  # Karpenter), and the VPC CNI runs as the aws-node DaemonSet rather than
  # a systemd unit, so nothing needs to be started here.
  userData: |
    #!/bin/bash
    # Intentionally minimal: NRC gates readiness on the aws-node DaemonSet
    echo "fast-scaling-nrc node booted"

2. VPC CNI Readiness Rule (Detailed Configuration)

apiVersion: nodereadiness.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: vpc-cni-only
spec:
  mode: bootstrap-only

  # Wait for VPC CNI only
  requiredDaemonSets:
    - namespace: kube-system
      name: aws-node
      selector:
        matchLabels:
          k8s-app: aws-node

  # CNI readiness check conditions
  readinessProbe:
    exec:
      command:
        - sh
        - -c
        - |
          # Verify aws-node Pod's aws-vpc-cni-init container completion
          kubectl wait --for=condition=Initialized \
            pod -l k8s-app=aws-node \
            -n kube-system \
            --timeout=30s
    initialDelaySeconds: 5
    periodSeconds: 2
    timeoutSeconds: 30
    successThreshold: 1
    failureThreshold: 3

  # All other DaemonSets are optional
  optionalDaemonSets:
    - namespace: kube-system
      name: "*"  # Wildcard: all other DaemonSets

  nodeSelector:
    matchLabels:
      karpenter.sh/nodepool: fast-scaling-nrc

  readinessTimeout: 60s

NRC Performance Comparison

Production environment 100-node scaling test results: with the bootstrap-only rule applied, node Ready time dropped from roughly 85s to 45s, an approximately 50% reduction.

Considerations When Using NRC

Advantages:

  • Node Ready time reduced by 50%
  • Pod scheduling delay minimized
  • API load reduced during large-scale scaling

Disadvantages and Risks:

  • Pods requiring CSI may fail: Pods mounting EBS volumes may enter CrashLoopBackOff if scheduled before CSI driver is ready
  • GPU Pod initialization delay: GPU Pods remain Pending during NVIDIA device plugin background initialization
  • Monitoring blind spots: Initial metrics may be missing if Prometheus node-exporter starts late

Solutions:

  1. Use PodSchedulingGate: Set manual gates for Pods requiring CSI/GPU
  2. NodeAffinity conditions: Wait for nodereadiness.k8s.io/csi-ready=true label
  3. InitContainer verification: Verify required drivers exist before Pod starts
# Example Pod requiring CSI (safe wait)
# Note: the CSI socket lives on the host filesystem, so it must be mounted
# into the initContainer via hostPath, and tested with -S (it is a unix
# socket, not a regular file).
apiVersion: v1
kind: Pod
metadata:
  name: app-with-ebs
spec:
  initContainers:
    - name: wait-for-csi
      image: busybox
      command:
        - sh
        - -c
        - |
          until [ -S /host-plugins/ebs.csi.aws.com/csi.sock ]; do
            echo "Waiting for EBS CSI driver..."
            sleep 2
          done
      volumeMounts:
        - name: kubelet-plugins
          mountPath: /host-plugins
          readOnly: true

  containers:
    - name: app
      image: my-app
      volumeMounts:
        - name: data
          mountPath: /data

  volumes:
    - name: kubelet-plugins
      hostPath:
        path: /var/lib/kubelet/plugins
    - name: data
      persistentVolumeClaim:
        claimName: ebs-pvc

Conclusion

Efficient autoscaling optimization in EKS is not optional -- it is essential. The combination of Karpenter's intelligent provisioning, high-resolution metrics for critical indicators, and appropriately tuned HPA configurations enables implementing optimal scaling strategies tailored to workload characteristics.

Key Takeaways:

  • Karpenter as the foundation: Minutes saved in scaling time through direct EC2 provisioning
  • Selective high-resolution metrics: Monitor what matters at 1-5 second intervals
  • Aggressive HPA configuration: Eliminate artificial delays in scaling decisions
  • Cost optimization through intelligence: Reduce over-provisioning with faster scaling
  • Architecture selection: Choose CloudWatch or Prometheus based on scale and requirements

P1 Ultra-Fast Scaling Strategy Summary:

  1. Multi-Layer Fallback Strategy: Warm Pool (0-2s) -> Fast Provisioning (5-15s) -> On-Demand Fallback (15-30s) covers all scenarios
  2. Provisioned Control Plane: API throttling elimination enables 10x faster Pod creation during large bursts ($350/month prevents 10-minute downtime)
  3. Pause Pod Overprovisioning: Time-based auto-adjustment achieves 0-2s scaling with 1,290% ROI (SLA violation prevention)
  4. Setu (Kueue-Karpenter): 30% latency reduction for AI/ML Gang Scheduling workloads by parallelizing node provisioning with queue wait
  5. Node Readiness Controller: 50% node Ready time reduction by waiting for CNI only (85s -> 45s)
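The multi-layer fallback in item 1 is just ordered tier selection: try the cheapest-latency tier first and fall through on insufficient capacity. A minimal Python sketch, using the latency figures quoted above with illustrative capacity numbers:

```python
# Sketch of the multi-layer fallback: try each capacity tier in order and
# take the first with enough headroom. Latencies (seconds) are the upper
# bounds from this document; the capacity dict is illustrative.
TIERS = [
    ("warm-pool", 2),           # 0-2s
    ("fast-provisioning", 15),  # 5-15s
    ("on-demand-fallback", 30), # 15-30s
]

def schedule(pods: int, capacity: dict) -> tuple:
    """Return (tier, worst-case latency) for the first tier that fits."""
    for tier, latency in TIERS:
        if capacity.get(tier, 0) >= pods:
            return tier, latency
    raise RuntimeError("no capacity in any tier")

# Warm pool has only 2 slots, so 5 pods spill to fast provisioning
print(schedule(5, {"warm-pool": 2, "fast-provisioning": 8, "on-demand-fallback": 100}))
```

The real decision is made by the Kubernetes scheduler plus Karpenter NodePool weights, but the worst-case latency budget composes exactly like this tier walk.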

The architectures presented here have been validated in production environments handling millions of requests daily. By implementing these patterns, you can ensure your EKS cluster scales as fast as business demands require -- measured in seconds, not minutes.

🎯 Practical Implementation Guide
Recommended strategies, expected performance, and costs by scenario

| Scenario | Recommended Strategy | Scaling Time | Monthly Extra Cost |
|---|---|---|---|
| Predictable peak times | Warm Pool (15%) | 0-2s | $1,080 |
| 🌊 Unpredictable traffic | Fast Provisioning (Spot) | 5-15s | Usage-based |
| 🏢 Large cluster (5,000+ Pods) | Provisioned XL + Fast | 5-10s | $350+ |
| 🤖 AI/ML training workloads | Setu + GPU NodePool | 15-30s | Usage-based |
| 🔒 Mission-critical SLA | Warm Pool + Provisioned + NRC | 0-2s | $1,430 |

Comprehensive Recommendations

The patterns above are powerful, but most workloads don't need all of them. When applying in practice, review in this order:

  1. First: Optimize basic Karpenter settings (diverse instance types in NodePool, Spot utilization) -- this alone achieves 180s -> 45-65s
  2. Next: HPA tuning (reduce stabilizationWindow, adopt KEDA) -- metric detection from 60s -> 2-5s
  3. Then: Design architectural resilience (queue-based, Circuit Breaker) -- scaling delay becomes invisible to users
  4. Only when needed: Warm Pool, Provisioned CP, Setu, NRC -- when mission-critical SLA requirements exist
Always Calculate Cost-Effectiveness

Warm Pool ($1,080/month) + Provisioned CP ($350/month) = $1,430/month in additional costs. For 28 clusters, that's $40,000/month. With the same budget, increasing base replicas by 30% can achieve similar effects without complex infrastructure. Always ask yourself: "Does this complexity justify the business value?"
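The arithmetic above is worth keeping explicit when you pitch this internally. A back-of-envelope sketch using the per-cluster figures from this document (adjust for your own environment):

```python
# Back-of-envelope cost check for the Warm Pool + Provisioned CP combination.
# Dollar figures are the ones quoted in this document, not AWS list prices.
WARM_POOL_MONTHLY = 1080       # per cluster, USD
PROVISIONED_CP_MONTHLY = 350   # per cluster, USD
CLUSTERS = 28

per_cluster = WARM_POOL_MONTHLY + PROVISIONED_CP_MONTHLY
fleet_total = per_cluster * CLUSTERS
print(f"Per cluster: ${per_cluster}/month, fleet: ${fleet_total:,}/month")
```

Running it reproduces the ~$40K/month fleet figure cited above, which is the number to weigh against simply raising base replicas.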


EKS Auto Mode Complete Guide

EKS Auto Mode (December 2024 GA)

EKS Auto Mode provides Karpenter as a fully managed service, including automatic infrastructure management, OS patching, and security updates. It supports ultra-fast scaling while minimizing operational complexity.

Managed Karpenter: Automatic Infrastructure Management

EKS Auto Mode automates the following:

  • Karpenter controller upgrades: AWS ensures compatibility with automatic updates
  • Security patches: AL2023 AMI automatic patching and node rolling replacement
  • NodePool default configuration: system and general-purpose pools are pre-configured
  • IAM roles: KarpenterNodeRole and KarpenterControllerRole automatically created

Auto Mode vs Self-managed Detailed Comparison

🔄 EKS Auto Mode vs Self-managed Karpenter
Operations complexity vs customization freedom tradeoff

| Feature | Self-managed | Auto Mode |
|---|---|---|
| Scaling Speed | 30-45s (optimized) | 30-45s (same) |
| 🔧 Customization | ⭐⭐⭐⭐⭐ Full control | ⭐⭐⭐ Limited |
| 🔥 Warm Pool | Self-implementable | Not supported |
| 🤖 Setu/Kueue | Full support | ⚠️ Limited |
| 💰 Cost | Free (resources only) | Free (resources only) |
| 📊 Ops Complexity | ⭐⭐⭐⭐ High | Low |
| 🛡️ OS Patching | Manual AMI mgmt | Auto patching |
| 🔍 Drift Detection | Manual setup | Enabled by default |
| 🎯 Best For | Advanced scheduling, Gang scheduling | Operations simplicity |

Ultra-Fast Scaling Methods in Auto Mode

Auto Mode uses the same Karpenter engine as Self-managed, so scaling speed is identical. However, the following optimizations are available:

  1. Leverage built-in NodePools: system and general-purpose pools are already optimized
  2. Expand instance types: Add more instance types to default pools
  3. Tune Consolidation policy: Enable WhenEmptyOrUnderutilized
  4. Adjust Disruption Budget: Minimize node replacement during spikes

Built-in NodePool Configuration

EKS Auto Mode provides two default NodePools:

# system pool (kube-system, monitoring, etc.)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t3.medium", "t3.large"]
      taints:
        - key: CriticalAddonsOnly
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s
---
# general-purpose pool (application workloads)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
            - m6i.xlarge
            - m6i.2xlarge
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"

Self-managed to Auto Mode Migration Guide

Migration Precautions

To ensure workload availability during migration, a blue/green transition approach is recommended.

Step-by-step Migration:

# Step 1: Create new Auto Mode cluster
aws eks create-cluster \
  --name my-cluster-auto \
  --version 1.33 \
  --compute-config enabled=true \
  --role-arn arn:aws:iam::ACCOUNT:role/EKSClusterRole \
  --resources-vpc-config subnetIds=subnet-xxx,subnet-yyy

# Step 2: Backup existing workloads
kubectl get all --all-namespaces -o yaml > workloads-backup.yaml

# Step 3: Create Custom NodePool (optional)
kubectl apply -f custom-nodepool.yaml

# Step 4: Gradually migrate workloads
# - Use DNS weighted routing for gradual traffic transition
# - From existing cluster -> Auto Mode cluster

# Step 5: Drain old nodes, then remove the existing cluster after validation
kubectl drain --ignore-daemonsets --delete-emptydir-data <node-name>

Auto Mode Cluster Creation YAML

# Using eksctl
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: auto-mode-cluster
  region: us-east-1
  version: "1.33"

# Enable Auto Mode
computeConfig:
  enabled: true
  nodePoolDefaults:
    instanceTypes:
      - c6i.xlarge
      - c6i.2xlarge
      - c6i.4xlarge
      - c7i.xlarge
      - c7i.2xlarge
      - m6i.xlarge
      - m6i.2xlarge

# VPC configuration
vpc:
  id: vpc-xxx
  subnets:
    private:
      us-east-1a: { id: subnet-xxx }
      us-east-1b: { id: subnet-yyy }
      us-east-1c: { id: subnet-zzz }

# IAM configuration (auto-created)
iam:
  withOIDC: true

Auto Mode NodePool Customization

# Custom NodePool for high-performance workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-performance
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c7i.4xlarge
            - c7i.8xlarge
            - c7i.16xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: high-perf-class

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 600s  # 10 min wait
    budgets:
      - nodes: "0"                   # Halt replacement during spikes
        schedule: "0 8 * * mon-fri"  # Business hours start (cron)
        duration: 10h                # 08:00-18:00 window
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: high-perf-class
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: auto-mode-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: auto-mode-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        throughput: 500

Karpenter v1.x Latest Features

Consolidation Policy: Speed vs Cost

Starting from Karpenter v1.0, the consolidationPolicy field has moved to the disruption section.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: optimized-pool
spec:
  template:
    spec:
      # In v1, expireAfter lives under template.spec (not disruption)
      expireAfter: 720h  # Auto-replace nodes after 30 days
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Policy Comparison:

| Policy | Behavior | Speed | Cost Optimization | Suitable Environment |
|---|---|---|---|---|
| WhenEmpty | Remove empty nodes only | Fast | Limited | Stable traffic |
| WhenEmptyOrUnderutilized | Empty nodes + consolidate underutilized nodes | Moderate | Excellent | Variable traffic |

Disruption Budgets: Configuration for Burst Traffic

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: burst-ready
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

    # Time-based disruption budgets (each is a cron start time + duration)
    budgets:
      - nodes: "0"                   # Halt replacement
        schedule: "0 8 * * mon-fri"  # Business hours
        duration: 10h                # 08:00-18:00
        reasons:
          - Drifted
          - Empty
          - Underutilized

      - nodes: "20%"                 # Allow up to 20% replacement
        schedule: "0 19 * * *"       # Nighttime
        duration: 12h                # 19:00-07:00
        reasons:
          - Drifted

      - nodes: "50%"                 # Aggressive optimization on weekends
        schedule: "0 0 * * sat"
        duration: 48h

Budget Strategies:

  • Events like Black Friday: nodes: "0" (completely halt replacement)
  • Normal operations: nodes: "10-20%" (gradual optimization)
  • Nights/weekends: nodes: "50%" (aggressive cost reduction)
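To sanity-check which budget window a given moment falls into, the three schedules above can be classified with a few lines of Python. This is a simplified model (real Karpenter evaluates cron schedules plus a duration; this just buckets a timestamp), useful for reasoning about the policy:

```python
# Simplified classifier mirroring the three budget windows above:
# weekends -> 50%, weekday business hours -> 0%, weekday nights -> 20%.
from datetime import datetime

def active_budget(ts: datetime) -> str:
    if ts.weekday() >= 5:        # Saturday (5) or Sunday (6)
        return "50%"             # aggressive weekend optimization
    if 8 <= ts.hour < 18:        # Mon-Fri business hours
        return "0%"              # halt voluntary disruption
    return "20%"                 # weekday nights

print(active_budget(datetime(2025, 3, 3, 10)))  # Monday 10:00 -> 0%
```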

Drift Detection: Automatic Node Replacement

Drift Detection automatically replaces existing nodes when the NodePool spec has changed.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: drift-enabled
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c6i.xlarge", "c7i.xlarge"]  # Drift detected on spec change

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: drift-class

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "20%"  # Control Drift replacement speed
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: drift-class
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # Auto Drift on AMI change

# AMI update scenario
# 1. AWS releases new AL2023 AMI
# 2. Karpenter detects Drift
# 3. Nodes replaced sequentially according to Budget

Drift Trigger Conditions:

  • NodePool instance type change
  • EC2NodeClass AMI change
  • userData script modification
  • blockDeviceMappings change

NodePool Weights: Spot to On-Demand Fallback

# Weight 100: Highest priority (Spot)
# Note: in Karpenter, HIGHER weight means HIGHER priority
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-primary
spec:
  weight: 100  # Higher weight = considered first
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
# Weight 50: Fallback when Spot unavailable
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 50
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]

Weight Strategy: give the Spot pool the highest weight so Karpenter tries it first, and keep an on-demand pool at a lower weight as the fallback when Spot capacity is unavailable.

Metric Collection Optimization

KEDA + Prometheus: Event-Driven Scaling (1-3s Response)

KEDA polls Prometheus metrics at 1-3 second intervals to achieve ultra-fast scaling.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ultra-fast-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app

  pollingInterval: 2  # Poll every 2 seconds
  cooldownPeriod: 60
  minReplicaCount: 10
  maxReplicaCount: 1000

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total[30s])) by (service)
        threshold: "100"

    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: p99_latency_ms
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[30s])) by (le)
          ) * 1000
        threshold: "500"  # Scale up when exceeding 500ms

  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 100
              periodSeconds: 5  # Allow 100% increase every 5 seconds

KEDA vs HPA Scaling Speed:

| Configuration | Metric Update | Scaling Decision | Total Time |
|---|---|---|---|
| HPA + Metrics API | 15s | 15s | 30s |
| KEDA + Prometheus | 2s | 1s | 3s |
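KEDA does not replace the HPA math; it feeds faster metrics into the standard formula desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). A small sketch using the 100 req/s threshold from the ScaledObject above:

```python
# The standard HPA scaling formula that KEDA-driven HPAs also use.
import math

def desired_replicas(current: int, metric_value: float, target: float) -> int:
    """desiredReplicas = ceil(current * metric / target)."""
    return math.ceil(current * metric_value / target)

# 10 replicas each seeing 250 req/s against a 100 req/s target -> 25 replicas
print(desired_replicas(10, 250, 100))
```

The speed difference in the table above comes entirely from how quickly `metric_value` reflects reality, not from a different formula.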

ADOT Collector Tuning: Minimizing Scrape Interval

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector-ultra-fast
spec:
  mode: daemonset
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            # Critical metrics: 1-second scrape
            - job_name: 'critical-metrics'
              scrape_interval: 1s
              scrape_timeout: 800ms
              static_configs:
                - targets: ['web-app:8080']
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: '(http_requests_total|http_request_duration_seconds.*|queue_depth)'
                  action: keep

            # Standard metrics: 15-second scrape
            - job_name: 'standard-metrics'
              scrape_interval: 15s
              static_configs:
                - targets: ['web-app:8080']

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
        send_batch_max_size: 2048

      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"

      prometheusremotewrite:
        endpoint: http://mimir:9009/api/v1/push
        headers:
          X-Scope-OrgID: "prod"

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheus, prometheusremotewrite]

CloudWatch Metric Streams

CloudWatch Metric Streams streams metrics to Kinesis Data Firehose in real-time.

# Create Metric Stream
aws cloudwatch put-metric-stream \
--name eks-metrics-stream \
--firehose-arn arn:aws:firehose:us-east-1:ACCOUNT:deliverystream/metrics \
--role-arn arn:aws:iam::ACCOUNT:role/CloudWatchMetricStreamRole \
--output-format json \
--include-filters Namespace=AWS/EKS \
--include-filters Namespace=ContainerInsights

Architecture: CloudWatch metrics -> Metric Stream -> Kinesis Data Firehose -> delivery destination (S3 or HTTP endpoint) -> monitoring backend.

Custom Metrics API HPA

apiVersion: v1
kind: Service
metadata:
  name: custom-metrics-api
spec:
  ports:
    - port: 443
      targetPort: 6443
  selector:
    app: custom-metrics-apiserver
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metrics-apiserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom-metrics-apiserver
  template:
    metadata:
      labels:
        app: custom-metrics-apiserver
    spec:
      containers:
        - name: custom-metrics-apiserver
          image: your-registry/custom-metrics-api:v1
          args:
            - --secure-port=6443
            - --logtostderr=true
            - --v=4
            - --prometheus-url=http://prometheus:9090
            - --cache-ttl=5s  # 5-second cache

Container Image Optimization

Relationship Between Image Size and Scaling Speed

Optimization Strategies:

  • Target image size under 500MB
  • Minimize runtime layers with multi-stage builds
  • Remove unnecessary packages
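A rough mental model for why size matters: pull time is approximately transfer time (size over effective bandwidth) plus per-layer registry overhead. The sketch below uses illustrative numbers, not benchmarks, with image sizes matching the Dockerfile examples later in this section:

```python
# Rough model: pull time ~= transfer time + per-layer overhead.
# Bandwidth and overhead values are illustrative assumptions.
def pull_seconds(size_mb: float, bandwidth_mbps: float = 2000,
                 layers: int = 10, per_layer_overhead_s: float = 0.2) -> float:
    transfer = size_mb * 8 / bandwidth_mbps  # megabits over megabits/s
    return transfer + layers * per_layer_overhead_s

for size in (500, 50, 20):  # Ubuntu-based, distroless, scratch
    print(f"{size} MB -> ~{pull_seconds(size):.1f}s")
```

The model also shows why layer count matters: below a few hundred MB, fixed per-layer overhead starts to dominate, which is why multi-stage builds that cut layers help even when size barely changes.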

ECR Pull-Through Cache

# Create Pull-Through Cache rule
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix docker-hub \
  --upstream-registry-url registry-1.docker.io \
  --region us-east-1

# Usage example
# Original: docker.io/library/nginx:latest
# Cached: ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest

Benefits:

  • Cached in ECR after first pull
  • 3-5x faster from second pull onward
  • Avoids DockerHub rate limits

Image Pre-pull: DaemonSet vs userData

Method 1: Image Pre-pull with DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        - name: prepull-web-app
          image: your-registry/web-app:v1.2.3
          command: ['sh', '-c', 'echo "Image pulled"']
        - name: prepull-sidecar
          image: your-registry/sidecar:v2.0.0
          command: ['sh', '-c', 'echo "Image pulled"']
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.9
          resources:
            requests:
              cpu: 10m
              memory: 20Mi

Method 2: Pre-pull in userData

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: prepull-class
spec:
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh ${CLUSTER_NAME}

    # Pre-pull critical images in parallel
    ctr -n k8s.io images pull your-registry.com/web-app:v1.2.3 &
    ctr -n k8s.io images pull your-registry.com/sidecar:v2.0.0 &
    ctr -n k8s.io images pull your-registry.com/init-db:v3.1.0 &
    wait

Comparison:

| Method | Timing | Effect on New Nodes | Maintenance |
|---|---|---|---|
| DaemonSet | After node Ready | Moderate | Easy |
| userData | During bootstrap | Best | Difficult |

Minimal Base Image: distroless, scratch

# Before optimization: Ubuntu-based (500MB)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y ca-certificates
COPY app /app
CMD ["/app"]

# After optimization: distroless (50MB)
FROM gcr.io/distroless/base-debian12
COPY app /app
CMD ["/app"]

# After optimization: scratch (20MB, static binary only)
FROM scratch
COPY app /app
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
CMD ["/app"]

SOCI (Seekable OCI) for Large Images

SOCI loads only the necessary parts without pulling the entire image.

# Create a SOCI index for the image
soci create your-registry/large-ml-model:v1.0.0

# Push the SOCI index to the registry
soci push your-registry/large-ml-model:v1.0.0

# Register the soci-snapshotter with containerd as a proxy plugin and
# point CRI at it (requires the soci-snapshotter-grpc daemon on the node)
cat <<EOF >> /etc/containerd/config.toml
[proxy_plugins.soci]
  type = "snapshot"
  address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "soci"
EOF

Results:

  • 5GB image starts in 10-15 seconds (previously 2-3 minutes)
  • Useful for ML models and large datasets

Bottlerocket Optimization

Bottlerocket is a container-optimized OS with 30% faster boot time compared to AL2023.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket-class
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest

  userData: |
    [settings.kubernetes]
    cluster-name = "${CLUSTER_NAME}"

    [settings.kubernetes.node-labels]
    "karpenter.sh/fast-boot" = "true"

In-Place Pod Vertical Scaling (K8s 1.33+)

Starting from Kubernetes 1.33, where in-place Pod resize is beta and enabled by default, you can adjust container resources without recreating the Pod.

apiVersion: v1
kind: Pod
metadata:
  name: resizable-pod
spec:
  containers:
    - name: app
      image: your-app:v1
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired  # CPU does not require restart
        - resourceName: memory
          restartPolicy: RestartContainer  # Memory resize restarts the container

Criteria for Choosing Between Scaling and Resizing:

| Scenario | Method | Reason |
|---|---|---|
| Traffic surge (2x or more) | HPA Scale-out | Load distribution needed |
| CPU utilization exceeds 80% | In-Place Resize | Single Pod needs more headroom |
| Memory OOM risk | In-Place Resize | Saves restart time |
| 10+ Pods needed | HPA Scale-out | Availability improvement |
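The decision criteria above can be captured as a tiny helper, useful as a starting point for an automation policy. This is a sketch of the table's logic, not an official API; the thresholds are the ones quoted in the table:

```python
# Sketch: encode the scale-out vs in-place-resize criteria from the table.
def choose(traffic_multiplier: float, cpu_util: float,
           pods_needed: int, oom_risk: bool = False) -> str:
    if traffic_multiplier >= 2 or pods_needed >= 10:
        return "HPA scale-out"      # load distribution / availability
    if cpu_util > 0.8 or oom_risk:
        return "in-place resize"    # single-Pod headroom without restart
    return "no action"

print(choose(traffic_multiplier=2.5, cpu_util=0.5, pods_needed=3))
```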

Advanced Patterns

Pod Scheduling Readiness Gates (K8s 1.30+)

Control scheduling timing with schedulingGates.

apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  schedulingGates:
    - name: "example.com/image-preload"  # Wait for image preload
    - name: "example.com/config-ready"   # Wait for ConfigMap ready
  containers:
    - name: app
      image: your-app:v1

Gate Removal Controller Example:

// Gate removal logic: patch away the schedulingGates once preconditions hold
func (c *Controller) removeGateWhenReady(ctx context.Context, pod *v1.Pod) {
	if imagePreloaded(pod) && configReady(pod) {
		patch := []byte(`{"spec":{"schedulingGates":null}}`)
		c.client.CoreV1().Pods(pod.Namespace).Patch(
			ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	}
}

ARC + Karpenter AZ Failure Recovery

Combining AWS Route 53 Application Recovery Controller (ARC) with Karpenter enables automatic recovery during AZ failures.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: az-resilient
spec:
  template:
    spec:
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

      # Automatic replacement on AZ failure
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: az-resilient-class
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: az-resilient-class
spec:
  subnetSelectorTerms:
    # ARC Zonal Shift integration: automatically exclude failed AZ
    - tags:
        karpenter.sh/discovery: my-cluster
        aws:cloudformation:logical-id: PrivateSubnet*

Zonal Shift Scenario:

  1. Failure occurs in us-east-1a
  2. ARC triggers Zonal Shift
  3. Karpenter excludes 1a subnet and creates nodes only in 1b and 1c
  4. After recovery, 1a is automatically re-included
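Step 3 of the scenario is essentially a subnet filter. A minimal sketch, using the placeholder subnet IDs from the configs in this document and a hypothetical set of shifted-away zones:

```python
# Sketch of zonal-shift subnet filtering: given zones shifted away, return
# the subnets Karpenter would still launch into. The subnet map reuses the
# placeholder IDs from this document.
SUBNETS = {
    "us-east-1a": "subnet-xxx",
    "us-east-1b": "subnet-yyy",
    "us-east-1c": "subnet-zzz",
}

def launchable(shifted_away: set) -> list:
    return [s for az, s in sorted(SUBNETS.items()) if az not in shifted_away]

print(launchable({"us-east-1a"}))  # 1a excluded during the shift
```

In practice Karpenter and ARC handle this via subnet selector terms and zonal shift state; the sketch only illustrates the set logic.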

📊 Comprehensive Scaling Benchmark
P95 scaling times measured in production (28 clusters, 15,000+ Pods)

| Configuration | Profile | E2E (P95) | Breakdown |
|---|---|---|---|
| Basic HPA + Karpenter | Basic setup | 90-120s | Detect 30-60s → Provision 45-60s → Pod 10-15s |
| Optimized Metrics + Karpenter | Mid-scale | 50-70s | Detect 5-10s → Provision 30-45s → Pod 10-15s |
| EKS Auto Mode | Simplified Ops | 45-70s | Detect 5-10s → Provision 30-45s → Pod 10-15s |
| KEDA + Karpenter | Event-driven | 42-65s | Detect 2-5s → Provision 30-45s → Pod 10-15s |
| Setu + Kueue (Gang) | ML/Batch | 37-60s | Detect 2-5s → Provision 30-45s → Pod 5-10s |
| Warm Pool (existing nodes) | Predictable traffic | 5-10s | (no node provisioning step) |

🎯 Selection Guide

| Requirement | Recommended Stack |
|---|---|
| 🚀 Sub-10s scaling required | Warm Pool + Provisioned CP |
| 🌊 Unpredictable traffic | KEDA + Karpenter |
| 🎯 Operational simplicity | EKS Auto Mode |
| 🤖 ML/Batch jobs | Setu + Kueue |
| 💰 Cost optimization first | Optimized Metrics + Karpenter |