Comprehensive Guide to Karpenter-Based EKS Scaling Strategies
📅 Created: 2025-02-09 | Updated: 2026-02-18 | ⏱️ Reading time: ~28 min
Overview
In modern cloud-native applications, ensuring users don't experience errors during traffic spikes is a core engineering challenge. This document covers comprehensive scaling strategies using Karpenter on Amazon EKS, encompassing reactive scaling optimization, predictive scaling, and architectural resilience.
The "ultra-fast scaling" discussed in this document assumes Warm Pools (pre-allocated nodes). The physical minimum time for the E2E autoscaling pipeline (metric detection → decision → Pod creation → container start) is 6-11 seconds, with an additional 45-90 seconds when new node provisioning is required.
Pushing scaling speed to the extreme is not the only strategy. Architectural resilience (queue-based buffering, Circuit Breaker) and predictive scaling (pattern-based pre-expansion) are more cost-effective for most workloads. This document covers all of these approaches together.
We explore a production-validated architecture from a global-scale EKS environment (3 regions, 28 clusters, 15,000+ Pods) that reduced scaling latency from over 180 seconds to under 45 seconds, and down to 5-10 seconds with Warm Pool utilization.
Scaling Strategy Decision Framework
Before optimizing scaling, you must first determine "Does our workload really need ultra-fast reactive scaling?" There are 4 approaches to solving the same business problem of "preventing user errors during traffic spikes," and for most workloads, approaches 2-4 are more cost-effective.
Comparison by Approach
| Approach | Core Strategy | E2E Scaling Time | Monthly Additional Cost (28 clusters) | Complexity | Suitable Workloads |
|---|---|---|---|---|---|
| 1. Fast Reactive | Karpenter + KEDA + Warm Pool | 5-45s | $40K-190K | Very High | Very few mission-critical |
| 2. Predictive Scaling | CronHPA + Predictive Scaling | Pre-expansion (0s) | $2K-5K | Low | Most services with patterns |
| 3. Architectural Resilience | SQS/Kafka + Circuit Breaker | Tolerates scaling delay | $1K-3K | Medium | Services allowing async processing |
| 4. Adequate Base Capacity | Increase base replicas by 20-30% | Unnecessary (already sufficient) | $5K-15K | Very Low | Stable traffic |
Cost Structure Comparison by Approach
Below are the estimated monthly costs based on 10 medium-sized clusters. Actual costs vary depending on workloads and instance types.
| Approach | Monthly Cost (10 clusters) | Initial Build Cost | Operations Staff Needed | ROI Achievement Condition |
|---|---|---|---|---|
| 1. Fast Reactive | $14,800+ | High (2-4 weeks) | Dedicated 1-2 people | SLA violation penalty > $15K/mo |
| 2. Predictive Scaling | ~$2,500 | Low (2-3 days) | Existing staff | Traffic pattern prediction rate > 70% |
| 3. Architectural Resilience | ~$800 | Medium (1-2 weeks) | Existing staff | Services allowing async processing |
| 4. Base Capacity Increase | ~$4,500 | None (immediate) | None | 30% buffer over peak is sufficient |
In most production environments, covering 90%+ of traffic spikes with Approaches 2 + 4 (Predictive + Base Capacity) and handling the remaining 10% with Approach 1 (Reactive Karpenter) is the most cost-effective combination.
Approach 3 (Architectural Resilience) is a fundamental pattern that should always be considered when designing new services.
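As a quick sanity check, the decision table above can be encoded as a small helper. This is an illustrative sketch only: the function name, inputs, and the $15K/month SLA-penalty break-even threshold (taken from the ROI table) are our assumptions, not a published API.

```python
def choose_scaling_approach(has_traffic_pattern: bool,
                            allows_async: bool,
                            sla_penalty_per_month: float) -> str:
    """Return the primary scaling approach suggested by the comparison table."""
    if has_traffic_pattern:
        return "2. Predictive Scaling"        # pre-expand on known patterns
    if allows_async:
        return "3. Architectural Resilience"  # queue-based buffering
    if sla_penalty_per_month > 15_000:
        return "1. Fast Reactive"             # only when penalties justify the cost
    return "4. Adequate Base Capacity"        # simplest default

print(choose_scaling_approach(True, False, 0))        # 2. Predictive Scaling
print(choose_scaling_approach(False, False, 20_000))  # 1. Fast Reactive
```

In practice these approaches combine rather than exclude each other; the helper only picks the primary strategy to start from.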
Approach 2: Predictive Scaling
Most production traffic has patterns (commute hours, lunch, events). Predictive pre-expansion is often more effective than reactive scaling.
# CronHPA: Time-based pre-scaling
# Note: apiVersion/kind and job field names vary by CronHPA controller implementation
apiVersion: autoscaling.k8s.io/v1alpha1
kind: CronHPA
metadata:
  name: traffic-pattern-scaling
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  jobs:
    - name: morning-peak
      schedule: "0 8 * * 1-5"    # Weekdays at 8 AM
      targetSize: 50             # Pre-expand for peak
      completionPolicy:
        type: Never
    - name: lunch-peak
      schedule: "30 11 * * 1-5"  # Weekdays at 11:30 AM
      targetSize: 80
      completionPolicy:
        type: Never
    - name: off-peak
      schedule: "0 22 * * *"     # Daily at 10 PM
      targetSize: 10             # Night-time reduction
      completionPolicy:
        type: Never
Approach 3: Architectural Resilience
Rather than trying to reduce scaling time to zero, it is more realistic to design so that scaling delays are invisible to users.
Queue-based Buffering: By putting requests into SQS/Kafka, scaling delays become "waiting" instead of "failure."
# KEDA SQS-based scaling - Requests wait safely in queue
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"  # 1 Pod per 5 queue messages
        awsRegion: us-east-1
Circuit Breaker + Rate Limiting: Graceful degradation during overload with Istio/Envoy
# Istio Circuit Breaker - Prevent overload during scaling
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: web-app-circuit-breaker
spec:
  host: web-app
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100  # Limit pending requests
        http2MaxRequests: 1000        # Limit concurrent requests
    outlierDetection:
      consecutive5xxErrors: 5  # Eject after 5 consecutive 5xx errors
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Approach 4: Adequate Base Capacity
Instead of spending $1,080-$5,400/month on Warm Pools, increasing base replicas by 20-30% achieves the same effect without complex infrastructure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  # Expected required Pods: 20 → Operate with 25 as baseline (25% buffer)
  replicas: 25
  # HPA handles additional expansion during peaks
  # (selector and Pod template omitted for brevity)
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 25   # Guarantee base capacity
  maxReplicas: 100  # Prepare for extreme situations
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Generous target (70 → 60)
The sections below cover the detailed implementation of Approach 1: Fast Reactive Scaling. Review Approaches 2-4 above first, then apply the content below for workloads that require additional optimization.
Problems with Traditional Autoscaling
Before optimizing reactive scaling, you need to understand the bottlenecks in traditional approaches:
The fundamental problem: By the time CPU metrics trigger scaling, it's already too late.
Current Environment Challenges:
- Global Scale: 3 regions, 28 EKS clusters, 15,000 Pods in operation
- High Traffic Volume: Processing 773.4K daily requests
- Latency Issues: 1-3 minute scaling delays with HPA + Karpenter combination
- Metric Collection Delays: 1-3 minute delays in CloudWatch metrics making real-time response impossible
The Karpenter Revolution: Direct-to-Metal Provisioning
Karpenter removes the Auto Scaling Group (ASG) abstraction layer and directly provisions EC2 instances based on pending Pod requirements. Karpenter v1.x automatically replaces existing nodes when NodePool specs change through Drift Detection. AMI updates, security patches, and more are automated.
High-Speed Metric Architecture: Two Approaches
To minimize scaling response time, a fast detection system is needed. We compare two proven architectures.
Approach 1: CloudWatch High-Resolution Integration
Leveraging CloudWatch's high-resolution metrics in an AWS-native environment.
Key Components
Scaling Timeline
- When a node already exists (Warm Pool or existing spare node): E2E ~13 seconds
- When new node provisioning is needed: E2E ~53 seconds
- EC2 instance launch (30-40 seconds) is a physical limitation that cannot be eliminated through metric pipeline optimization alone.
Advantages:
- ✅ Fast metric collection: Low latency of 1-2 seconds
- ✅ Simple setup: AWS-native integration
- ✅ No management overhead: No separate infrastructure management required
Disadvantages:
- ❌ Limited throughput: 500 TPS per account (PutMetricData per-region limit)
- ❌ Pod limit: Maximum 5,000 per cluster
- ❌ High metric costs: AWS CloudWatch metric charges
Approach 2: ADOT + Prometheus-Based Architecture
A high-performance open-source pipeline combining AWS Distro for OpenTelemetry (ADOT) with Prometheus.
Key Components
- ADOT Collector: Hybrid deployment with DaemonSet and Sidecar
- Prometheus: HA configuration with Remote Storage integration
- Thanos Query Layer: Multi-cluster global view
- KEDA Prometheus Scaler: High-speed polling at 2-second intervals
- Grafana Mimir: Long-term storage and high-speed query engine
Scaling Timeline (~66s)
Advantages:
- ✅ High throughput: 100,000+ TPS support
- ✅ Scalability: 20,000+ Pods per cluster support
- ✅ Low metric costs: Only storage costs (Self-managed)
- ✅ Full control: Complete configuration and optimization freedom
Disadvantages:
- ❌ Complex setup: Additional component management required
- ❌ High operational complexity: HA configuration, backup/recovery, performance tuning needed
- ❌ Specialist staff required: Prometheus operational experience essential
Cost-Optimized Metric Strategy
Based on 28 clusters: ~$500/month for comprehensive monitoring vs $30,000+ when collecting all metrics at high resolution
Recommended Use Cases
CloudWatch High Resolution Metrics are suitable when:
- Small-scale applications (5,000 Pods or fewer)
- Simple monitoring requirements
- AWS-native solution preferred
- Fast deployment and stable operations prioritized
ADOT + Prometheus is suitable when:
- Large-scale clusters (20,000+ Pods)
- High metric throughput required
- Granular monitoring and customization needed
- Highest level of performance and scalability required
Scaling Optimization Architecture: Layer-by-Layer Analysis
To minimize scaling response time, optimization across all layers is required:
Karpenter Core Configuration
The key to sub-60-second node provisioning lies in optimal Karpenter configuration:
Karpenter NodePool YAML
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fast-scaling
spec:
  # Speed optimization configuration
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"
  # Maximum flexibility for speed
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            # Compute optimized - default selection
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
            - c6i.8xlarge
            - c7i.xlarge
            - c7i.2xlarge
            - c7i.4xlarge
            - c7i.8xlarge
            # AMD alternatives - better availability
            - c6a.xlarge
            - c6a.2xlarge
            - c6a.4xlarge
            - c6a.8xlarge
            # Memory optimized - for specific workloads
            - m6i.xlarge
            - m6i.2xlarge
            - m6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: fast-nodepool
  # Guarantee fast provisioning
  limits:
    cpu: 100000  # Soft limits only
    memory: 400000Gi
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: fast-nodepool
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  # Speed optimization
  userData: |
    #!/bin/bash
    # Node startup time optimization
    # Note: AL2023 nodes are configured via nodeadm; the bootstrap.sh form below applies to AL2 AMIs
    /etc/eks/bootstrap.sh ${CLUSTER_NAME} \
      --b64-cluster-ca ${B64_CLUSTER_CA} \
      --apiserver-endpoint ${API_SERVER_URL} \
      --kubelet-extra-args '--node-labels=karpenter.sh/fast-scaling=true --max-pods=110'
    # Pre-pull critical images (registry.k8s.io replaces k8s.gcr.io)
    ctr -n k8s.io images pull registry.k8s.io/pause:3.10 &
    ctr -n k8s.io images pull public.ecr.aws/eks-distro/kubernetes/pause:3.10 &
Real-Time Scaling Workflow
How all components work together to achieve optimal scaling performance:
Aggressive HPA Configuration for Scaling
The HorizontalPodAutoscaler must be configured for immediate response:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ultra-fast-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 10
  maxReplicas: 1000
  metrics:
    # Primary metric - Queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: "web-requests"
        target:
          type: AverageValue
          averageValue: "10"
    # Secondary metric - Request rate
    - type: External
      external:
        metric:
          name: alb_request_rate
          selector:
            matchLabels:
              targetgroup: "web-tg"
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # No delay!
      policies:
        - type: Percent
          value: 100
          periodSeconds: 10
        - type: Pods
          value: 100
          periodSeconds: 10
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 min cooldown
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
When to Use KEDA: Event-Driven Scenarios
While Karpenter handles infrastructure scaling, KEDA excels in specific event-driven scenarios:
Production Performance Metrics
Actual results from a deployment handling 750K+ daily requests:
Multi-Region Considerations
For organizations operating across multiple regions, region-specific optimization is needed for consistent high-speed scaling:
Scaling Optimization Best Practices
1. Metric Selection
- Use leading indicators (queue depth, connection count), not lagging indicators (CPU)
- Keep high-resolution metrics to 10-15 or fewer per cluster
- Batch metric submissions to prevent API throttling
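For the "batch metric submissions" point above, here is a minimal boto3 sketch of grouping custom high-resolution metrics into fewer PutMetricData calls instead of one call per datapoint. The namespace and metric names are hypothetical, the helper functions are ours, and you should check current CloudWatch per-request limits before fixing a batch size; the boto3 import is deferred so the batching logic itself runs without AWS credentials.

```python
def chunk(datapoints: list, batch_size: int = 500) -> list:
    """Split datapoints into PutMetricData-sized batches (size is an assumption)."""
    return [datapoints[i:i + batch_size] for i in range(0, len(datapoints), batch_size)]

def put_metrics_batched(namespace: str, datapoints: list,
                        region: str = "us-east-1") -> None:
    """Send all datapoints in as few PutMetricData calls as possible."""
    import boto3  # deferred so chunk() is usable without AWS installed/configured
    cw = boto3.client("cloudwatch", region_name=region)
    for batch in chunk(datapoints):
        cw.put_metric_data(Namespace=namespace, MetricData=batch)

# Hypothetical datapoints; StorageResolution=1 marks high-resolution metrics
points = [
    {"MetricName": "queue_depth", "Value": 42.0, "StorageResolution": 1},
    {"MetricName": "active_connections", "Value": 180.0, "StorageResolution": 1},
]
# put_metrics_batched("MyApp/Scaling", points)  # requires AWS credentials
```

Batching this way trades a small delivery delay for far fewer API calls, which is what keeps you under the per-region PutMetricData TPS quota.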
2. Karpenter Optimization
- Provide maximum instance type flexibility
- Aggressively leverage Spot instances with proper interruption handling
- Enable consolidation for cost efficiency
- Set an appropriate consolidateAfter for empty nodes (30-60 seconds; in Karpenter v1 this replaces the legacy ttlSecondsAfterEmpty)
3. HPA Tuning
- Zero stabilization window for scale-up
- Aggressive scaling policies (allow 100% increase)
- Multiple metrics with appropriate weights
- Appropriate cooldown for scale-down
4. Monitoring
- Track P95 scaling latency as a primary KPI
- Alert on scaling failures or delays exceeding 15 seconds
- Monitor Spot interruption rates
- Track cost per scaled Pod
Common Troubleshooting
Hybrid Approach (Recommended)
In real production environments, a hybrid approach mixing both methods is recommended:
- Mission-Critical Services: Achieve 10-13 second scaling with ADOT + Prometheus
- General Services: 12-15 second scaling with CloudWatch Direct and simplified operations
- Gradual Migration: Start with CloudWatch and transition to ADOT as needed
EKS Auto Mode vs Self-managed Karpenter
EKS Auto Mode (2025 GA) has built-in Karpenter with automatic management:
| Item | Self-managed Karpenter | EKS Auto Mode |
|---|---|---|
| Install/Upgrade | Manual management (Helm) | AWS auto-managed |
| NodePool Configuration | Full customization | Limited settings |
| Cost Optimization | Fine-grained control available | Auto-optimization |
| OS Patching | Manual management | Auto-patching |
| Suitable Environment | Advanced customization needed | Minimize operational burden |
Recommendation: Choose Self-managed when complex scheduling requirements exist, EKS Auto Mode when operational simplification is the goal.
P1: Ultra-Fast Scaling Architecture (Critical)
Scaling Latency Breakdown Analysis
To optimize scaling response time, you must first granularly decompose the latency occurring across the entire scaling chain.
During traffic spikes, users experience errors for 5+ minutes — node provisioning accounts for over 60% of total latency
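The claim above can be made concrete with an illustrative latency budget. The per-stage values below are assumptions chosen to be consistent with the ranges quoted elsewhere in this document, not measurements:

```python
# Illustrative decomposition of the reactive scaling chain (values assumed)
stages_s = {
    "metric collection/detection": 15,
    "HPA decision + Pod creation": 10,
    "node provisioning (EC2 launch + bootstrap)": 75,
    "image pull + container start": 20,
}

total = sum(stages_s.values())
for stage, sec in stages_s.items():
    print(f"{stage}: {sec}s ({sec / total:.0%})")
# With these numbers, node provisioning alone is over 60% of the total,
# which is why the layers below attack that stage first.
```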
Multi-Layer Scaling Strategy
Ultra-fast scaling is achieved not through a single optimization but through a 3-layer fallback strategy.
Layer-by-Layer Scaling Timeline Comparison
Layer 1 (Warm Pool) -- Pre-allocation strategy:
- Essence: Not autoscaling but overprovisioning. Securing nodes in advance with Pause Pods
- E2E 5-10 seconds (metric detection + Preemption + container start)
- Cost: Maintaining 10-20% of expected peak capacity 24/7 ($720-$5,400/month)
- Consider: Increasing base replicas at the same cost may be simpler
Layer 2 (Fast Provisioning) -- Default strategy for most cases:
- Actual node provisioning with Karpenter + Spot instances
- E2E 42-65 seconds (metric detection + EC2 launch + container start)
- Cost: Proportional to actual usage (70-80% Spot discount)
- Consider: Combined with architectural resilience (queue-based), this time becomes invisible to users
Layer 3 (On-Demand Fallback) -- Essential insurance:
- Final safety net when Spot capacity is insufficient
- E2E 60-90 seconds (On-Demand may be slower to provision than Spot)
- Cost: On-Demand pricing (minimal usage)
P2: Eliminating API Bottlenecks with Provisioned EKS Control Plane
Provisioned Control Plane Overview
In November 2025, AWS announced EKS Provisioned Control Plane. By removing the API throttling limitations of the existing Standard Control Plane, it dramatically improves scaling speed in large-scale burst scenarios.
Standard vs Provisioned Comparison
Signals that you should upgrade to Provisioned:
- Frequent API throttling errors: `kubectl` commands frequently fail or retry
- Large deployment delays: 100+ Pod deployments take 5+ minutes
- Karpenter node provisioning failures: `too many requests` errors
- HPA scaling delays: Pod creation requests queuing up
- Cluster size: 1,000+ Pods continuously or 3,000+ Pods at peak
Cost vs Performance Trade-off:
- Standard → XL: 10x API performance for $350/month additional cost (ROI: offset by preventing 10 minutes of downtime)
- XL → 2XL: Only needed for ultra-large clusters (10,000+ Pods)
- 4XL: For extreme scale (50,000+ Pods) or multi-tenant platforms
Provisioned Control Plane Setup
Creating a New Cluster with AWS CLI
aws eks create-cluster \
  --name ultra-fast-cluster \
  --region us-east-1 \
  --role-arn arn:aws:iam::123456789012:role/EKSClusterRole \
  --resources-vpc-config subnetIds=subnet-xxx,subnet-yyy,securityGroupIds=sg-xxx \
  --kubernetes-version 1.32 \
  --compute-config enabled=true,nodePools=system,nodeRoleArn=arn:aws:iam::123456789012:role/EKSNodeRole \
  --kubernetes-network-config elasticLoadBalancing=disabled \
  --access-config authenticationMode=API \
  --upgrade-policy supportType=EXTENDED \
  --zonal-shift-config enabled=true \
  --control-plane-placement groupName=my-placement-group,clusterTenancy=dedicated \
  --control-plane-provisioning mode=PROVISIONED,size=XL
Upgrading an Existing Cluster (Standard → Provisioned)
# 1. Check current Control Plane mode
aws eks describe-cluster --name my-cluster --query 'cluster.controlPlaneProvisioning'

# 2. Upgrade to Provisioned (no downtime)
aws eks update-cluster-config \
  --name my-cluster \
  --control-plane-provisioning mode=PROVISIONED,size=XL

# 3. Monitor upgrade status (takes 10-15 minutes)
aws eks describe-cluster \
  --name my-cluster \
  --query 'cluster.status'

# 4. Verify API performance
kubectl get pods --all-namespaces --watch
kubectl create deployment nginx --image=nginx --replicas=100
- No downtime: Control Plane automatically performs a rolling upgrade
- Duration: 10-15 minutes (regardless of cluster size)
- No rollback: Provisioned → Standard downgrade not supported
- Billing starts: Charges begin immediately upon upgrade completion
Performance Comparison During Large-Scale Bursts
Actual production environment test with 1,000 simultaneous Pod scaling:
P3: Warm Pool / Overprovisioning Pattern (Core Strategy)
Pause Pod Overprovisioning Principle
The Warm Pool strategy pre-deploys low-priority "pause" Pods to provision nodes in advance. When actual workloads are needed, pause Pods are immediately evicted (preempted) and actual Pods are scheduled on those nodes.
Complete Overprovisioning Operation Flow
Pause Pod Overprovisioning YAML Configuration
1. PriorityClass Definition (Low Priority)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1  # Negative priority: lower than all actual workloads
globalDefault: false
description: "Pause pods for warm pool - will be preempted by real workloads"
2. Pause Deployment (Base Warm Pool)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-pause
  namespace: kube-system
spec:
  replicas: 10  # Number of Pods equivalent to 15% of expected peak
  selector:
    matchLabels:
      app: overprovisioning-pause
  template:
    metadata:
      labels:
        app: overprovisioning-pause
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0  # Immediate termination
      # Scheduling constraints (same node pool as actual workloads)
      nodeSelector:
        karpenter.sh/nodepool: fast-scaling
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.10
          resources:
            requests:
              cpu: "1000m"   # Average CPU of actual workloads
              memory: "2Gi"  # Average memory of actual workloads
            limits:
              cpu: "1000m"
              memory: "2Gi"
3. Time-Based Warm Pool Auto-Adjustment (CronJob)
---
# Expand Warm Pool before peak time (8:30 AM)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-warm-pool
  namespace: kube-system
spec:
  schedule: "30 8 * * 1-5"  # Weekdays at 8:30 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl scale deployment overprovisioning-pause \
                    --namespace kube-system \
                    --replicas=30  # Expanded for peak time
---
# Shrink Warm Pool after peak time (7 PM)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-warm-pool
  namespace: kube-system
spec:
  schedule: "0 19 * * 1-5"  # Weekdays at 7 PM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: warm-pool-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl scale deployment overprovisioning-pause \
                    --namespace kube-system \
                    --replicas=5  # Night-time minimum capacity
---
# ServiceAccount and RBAC for CronJob
apiVersion: v1
kind: ServiceAccount
metadata:
  name: warm-pool-scaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: warm-pool-scaler
  namespace: kube-system
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: warm-pool-scaler
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: warm-pool-scaler
subjects:
  - kind: ServiceAccount
    name: warm-pool-scaler
    namespace: kube-system
Warm Pool Sizing Method
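A hedged sketch of the sizing arithmetic implied by the rest of this document: hold 10-20% of expected peak capacity as pause Pods, each sized like an average workload Pod. The 15% ratio and the $0.05/Pod-hour figure below are assumptions (the latter chosen so the example reproduces the $1,080/month Warm Pool cost used in the ROI example); substitute your own node pricing.

```python
import math

def warm_pool_replicas(expected_peak_pods: int, buffer_ratio: float = 0.15) -> int:
    """Pause Pods to hold, each sized like an average workload Pod (10-20% of peak)."""
    return math.ceil(expected_peak_pods * buffer_ratio)

def warm_pool_monthly_cost(replicas: int, pod_hourly_cost: float = 0.05) -> float:
    """24/7 holding cost; pod_hourly_cost is a placeholder for your node pricing."""
    return replicas * pod_hourly_cost * 24 * 30

n = warm_pool_replicas(expected_peak_pods=200)  # 15% buffer -> 30 pause Pods
print(n, f"${warm_pool_monthly_cost(n):,.0f}/month")  # 30 $1,080/month
```

Combine this with the time-based CronJob adjustment above: size the daytime pool from daytime peak and shrink to a minimal night-time pool.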
Cost Analysis and Optimization
Cost Reduction Methods:
- Time-based scaling: Shrink Warm Pool during nights/weekends with CronJob (50-70% cost reduction)
- Spot instance utilization: Deploy Pause Pods on Spot nodes too (70% discount)
- Adaptive sizing: Auto-scaling based on CloudWatch Metrics
- Mixed strategy: Warm Pool only during peak times, rely on Layer 2 at other times
ROI Formula:
ROI = (SLA Violation Prevention Cost + Revenue Opportunity Loss Prevention) - Warm Pool Cost
Example:
- SLA violation penalty: $5,000/incident
- Average monthly violations (without Warm Pool): 3 incidents
- Warm Pool cost: $1,080/month
- ROI = ($5,000 x 3) - $1,080 = $13,920/month (1,290% ROI)
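The formula and worked example above, expressed as a function (the numbers mirror the example; the exact ROI percentage depends on rounding):

```python
def warm_pool_roi(penalty_per_incident: float,
                  incidents_prevented_per_month: float,
                  warm_pool_cost: float) -> float:
    """Monthly net benefit = prevented SLA losses - Warm Pool holding cost."""
    return penalty_per_incident * incidents_prevented_per_month - warm_pool_cost

net = warm_pool_roi(5_000, 3, 1_080)
print(f"${net:,.0f}/month net benefit")  # $13,920/month net benefit
print(f"{net / 1_080:.0%} ROI")          # ~1,289% ROI
```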
P4: Setu - Kueue + Karpenter Proactive Provisioning
Setu Overview
Setu bridges Kueue (queuing system) and Karpenter to provide proactive node provisioning for AI/ML workloads requiring Gang Scheduling. While traditional Karpenter reactively provisions nodes after Pods are created, Setu pre-provisions the required nodes the moment a Job enters the queue.
Setu Architecture and Operating Principles
Setu Installation and Configuration
1. Setu Installation (Helm)
# Add Setu Helm chart
helm repo add setu https://sanjeevrg89.github.io/Setu
helm repo update

# Install Setu (requires Kueue and Karpenter)
helm install setu setu/setu \
  --namespace kueue-system \
  --create-namespace \
  --set karpenter.enabled=true \
  --set karpenter.namespace=karpenter
2. ClusterQueue with AdmissionCheck
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  # Resource quota (entire cluster limits)
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-flavor
          resources:
            - name: "cpu"
              nominalQuota: 1000
            - name: "memory"
              nominalQuota: 4000Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 64
  # Enable Setu AdmissionCheck
  admissionChecks:
    - setu-provisioning  # Setu pre-provisions nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: setu-provisioning
spec:
  controllerName: setu.kueue.x-k8s.io/provisioning
  # Setu parameters
  parameters:
    apiGroup: setu.kueue.x-k8s.io/v1alpha1
    kind: ProvisioningParameters
    name: gpu-provisioning
---
apiVersion: setu.kueue.x-k8s.io/v1alpha1
kind: ProvisioningParameters
metadata:
  name: gpu-provisioning
spec:
  # Karpenter NodePool reference
  nodePoolName: gpu-nodepool
  # Provisioning strategy
  strategy:
    type: Proactive  # Proactive provisioning
    bufferTime: 15s  # Wait time before Job Admission
  # Node requirement mapping
  nodeSelectorRequirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - p4d.24xlarge
        - p4de.24xlarge
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand  # Avoid Spot risk for GPUs
3. GPU NodePool (Karpenter)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge   # 8x A100 (40GB)
            - p4de.24xlarge  # 8x A100 (80GB)
            - p5.48xlarge    # 8x H100
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand  # Avoid interruption risk for GPU workloads
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
  # Keep GPU nodes for extended periods (considering training duration)
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s  # Remove after 5 min idle
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # Includes GPU drivers
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  # GPU-optimized UserData
  userData: |
    #!/bin/bash
    # EKS optimized GPU AMI setup
    # Note: AL2023 nodes are configured via nodeadm; the bootstrap.sh form below applies to AL2 AMIs
    /etc/eks/bootstrap.sh ${CLUSTER_NAME} \
      --b64-cluster-ca ${B64_CLUSTER_CA} \
      --apiserver-endpoint ${API_SERVER_URL} \
      --kubelet-extra-args '--node-labels=nvidia.com/gpu=true --max-pods=110'
    # NVIDIA driver verification
    nvidia-smi || echo "GPU driver not loaded"
4. AI/ML Job Submission Example
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue  # LocalQueue designation
spec:
  parallelism: 8  # Gang Scheduling (8 Pods run simultaneously)
  completions: 8
  template:
    spec:
      restartPolicy: OnFailure
      # PodGroup for Gang Scheduling
      schedulerName: default-scheduler
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          command:
            - python3
            - /workspace/train.py
            - --distributed
            - --nodes=8
          resources:
            requests:
              nvidia.com/gpu: 1  # 1 GPU per Pod
              cpu: "48"
              memory: "320Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "48"
              memory: "320Gi"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue  # ClusterQueue reference
Setu Performance Improvement Measurement
GitHub: https://github.com/sanjeevrg89/Setu
Key Features:
- Leverages Kueue AdmissionCheck API
- Direct Karpenter NodeClaim creation
- Optimized for Gang Scheduling workloads (when all Pods must run simultaneously)
- Eliminates wait time through GPU node pre-provisioning
Suitable Use Cases:
- Distributed AI/ML training (PyTorch DDP, Horovod)
- MPI-based HPC workloads
- Large-scale batch simulations
- Multi-node data processing Jobs
P5: Eliminating Boot Delay with Node Readiness Controller
The Node Readiness Problem
Even when Karpenter provisions nodes quickly, CNI/CSI/GPU driver initialization delays occur before Pods can actually be scheduled. Traditionally, kubelet waits until all DaemonSets are running before the node transitions to Ready state.
Node Readiness Controller Principles
Node Readiness Controller (NRC) provides fine-grained control over the conditions required for a node to transition to Ready state. By default, kubelet waits until all DaemonSets are running, but NRC can be configured to selectively wait only for essential components.
Node Readiness Controller Installation
1. NRC Installation (Helm)
# Node Feature Discovery (NFD) required (NRC dependency)
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm install nfd nfd/node-feature-discovery \
  --namespace kube-system

# Install Node Readiness Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/node-readiness-controller/main/deploy/manifests.yaml
2. NodeReadinessRule CRD Definition
apiVersion: nodereadiness.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: bootstrap-only
spec:
  # bootstrap-only mode: wait for essential components only
  mode: bootstrap-only
  # Required DaemonSets (only wait for these)
  requiredDaemonSets:
    - namespace: kube-system
      name: aws-node  # VPC CNI
      selector:
        matchLabels:
          k8s-app: aws-node
  # Optional DaemonSets (background initialization)
  optionalDaemonSets:
    - namespace: kube-system
      name: ebs-csi-node  # EBS CSI only used by Pods needing block storage
      selector:
        matchLabels:
          app: ebs-csi-node
    - namespace: kube-system
      name: nvidia-device-plugin  # Only needed by GPU Pods
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
  # Node Selector (nodes to apply this rule to)
  nodeSelector:
    matchLabels:
      karpenter.sh/nodepool: fast-scaling
  # Readiness timeout (maximum wait time)
  readinessTimeout: 60s
Karpenter + NRC Integration Configuration
1. Karpenter NodePool with NRC Annotation
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fast-scaling-nrc
spec:
  template:
    metadata:
      # NRC activation annotation
      annotations:
        nodereadiness.k8s.io/rule: bootstrap-only
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: fast-nodepool-nrc
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: fast-nodepool-nrc
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  # NRC-optimized UserData
  userData: |
    #!/bin/bash
    # EKS bootstrap (minimal options)
    # Note: AL2023 nodes are configured via nodeadm; the bootstrap.sh form below applies to AL2 AMIs
    /etc/eks/bootstrap.sh ${CLUSTER_NAME} \
      --b64-cluster-ca ${B64_CLUSTER_CA} \
      --apiserver-endpoint ${API_SERVER_URL} \
      --kubelet-extra-args '--node-labels=karpenter.sh/fast-scaling=true,nodereadiness.k8s.io/enabled=true --max-pods=110'
    # VPC CNI fast initialization (required)
    systemctl enable --now aws-node || true
2. VPC CNI Readiness Rule (Detailed Configuration)
apiVersion: nodereadiness.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: vpc-cni-only
spec:
  mode: bootstrap-only
  # Wait for VPC CNI only
  requiredDaemonSets:
    - namespace: kube-system
      name: aws-node
      selector:
        matchLabels:
          k8s-app: aws-node
  # CNI readiness check conditions
  readinessProbe:
    exec:
      command:
        - sh
        - -c
        - |
          # Verify aws-node Pod's aws-vpc-cni-init container completion
          kubectl wait --for=condition=Initialized \
            pod -l k8s-app=aws-node \
            -n kube-system \
            --timeout=30s
    initialDelaySeconds: 5
    periodSeconds: 2
    timeoutSeconds: 30
    successThreshold: 1
    failureThreshold: 3
  # All other DaemonSets are optional
  optionalDaemonSets:
    - namespace: kube-system
      name: "*"  # Wildcard: all other DaemonSets
  nodeSelector:
    matchLabels:
      karpenter.sh/nodepool: fast-scaling-nrc
  readinessTimeout: 60s
NRC Performance Comparison
Production environment 100-node scaling test results:
Advantages:
- Node Ready time reduced by 50%
- Pod scheduling delay minimized
- API load reduced during large-scale scaling
Disadvantages and Risks:
- Pods requiring CSI may fail: Pods mounting EBS volumes may enter CrashLoopBackOff if scheduled before CSI driver is ready
- GPU Pod initialization delay: GPU Pods remain Pending during NVIDIA device plugin background initialization
- Monitoring blind spots: Initial metrics may be missing if Prometheus node-exporter starts late
Solutions:
- Use PodSchedulingGate: Set manual gates for Pods requiring CSI/GPU
- NodeAffinity conditions: Wait for the `nodereadiness.k8s.io/csi-ready=true` label
- InitContainer verification: Verify required drivers exist before Pod starts
# Example Pod requiring CSI (safe wait)
apiVersion: v1
kind: Pod
metadata:
  name: app-with-ebs
spec:
  initContainers:
  - name: wait-for-csi
    image: busybox
    command:
    - sh
    - -c
    - |
      until [ -f /host-plugins/ebs.csi.aws.com/csi.sock ]; do
        echo "Waiting for EBS CSI driver..."
        sleep 2
      done
    # The CSI socket lives on the host filesystem, so it must be
    # mounted into the init container to be visible
    volumeMounts:
    - name: csi-plugins
      mountPath: /host-plugins
      readOnly: true
  containers:
  - name: app
    image: my-app
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: csi-plugins
    hostPath:
      path: /var/lib/kubelet/plugins
  - name: data
    persistentVolumeClaim:
      claimName: ebs-pvc
Conclusion
Efficient autoscaling optimization in EKS is not optional -- it is essential. The combination of Karpenter's intelligent provisioning, high-resolution metrics for critical indicators, and appropriately tuned HPA configurations enables implementing optimal scaling strategies tailored to workload characteristics.
Key Takeaways:
- Karpenter as the foundation: Minutes saved in scaling time through direct EC2 provisioning
- Selective high-resolution metrics: Monitor what matters at 1-5 second intervals
- Aggressive HPA configuration: Eliminate artificial delays in scaling decisions
- Cost optimization through intelligence: Reduce over-provisioning with faster scaling
- Architecture selection: Choose CloudWatch or Prometheus based on scale and requirements
P1 Ultra-Fast Scaling Strategy Summary:
- Multi-Layer Fallback Strategy: Warm Pool (0-2s) -> Fast Provisioning (5-15s) -> On-Demand Fallback (15-30s) covers all scenarios
- Provisioned Control Plane: API throttling elimination enables 10x faster Pod creation during large bursts ($350/month prevents 10-minute downtime)
- Pause Pod Overprovisioning: Time-based auto-adjustment achieves 0-2s scaling with 1,290% ROI (SLA violation prevention)
- Setu (Kueue-Karpenter): 30% latency reduction for AI/ML Gang Scheduling workloads by parallelizing node provisioning with queue wait
- Node Readiness Controller: 50% node Ready time reduction by waiting for CNI only (85s -> 45s)
The architectures presented here have been validated in production environments handling millions of requests daily. By implementing these patterns, you can ensure your EKS cluster scales as fast as business demands require -- measured in seconds, not minutes.
Comprehensive Recommendations
The patterns above are powerful, but most workloads don't need all of them. When applying in practice, review in this order:
- First: Optimize basic Karpenter settings (diverse instance types in NodePool, Spot utilization) -- this alone achieves 180s -> 45-65s
- Next: HPA tuning (reduce stabilizationWindow, adopt KEDA) -- metric detection from 60s -> 2-5s
- Then: Design architectural resilience (queue-based, Circuit Breaker) -- scaling delay becomes invisible to users
- Only when needed: Warm Pool, Provisioned CP, Setu, NRC -- when mission-critical SLA requirements exist
Warm Pool ($1,080/month) + Provisioned CP ($350/month) = $1,430/month in additional costs. For 28 clusters, that's $40,000/month. With the same budget, increasing base replicas by 30% can achieve similar effects without complex infrastructure. Always ask yourself: "Does this complexity justify the business value?"
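The cost comparison above can be sanity-checked with simple arithmetic (figures taken from the text; the per-cluster costs are the document's own estimates):

```python
# Monthly costs per cluster, from the text above
WARM_POOL = 1080        # $/month
PROVISIONED_CP = 350    # $/month
CLUSTERS = 28

per_cluster = WARM_POOL + PROVISIONED_CP
fleet_total = per_cluster * CLUSTERS

print(per_cluster)   # 1430
print(fleet_total)   # 40040 -> roughly $40K/month across the fleet
```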
EKS Auto Mode Complete Guide
EKS Auto Mode provides Karpenter as a fully managed service, including automatic infrastructure management, OS patching, and security updates. It supports ultra-fast scaling while minimizing operational complexity.
Managed Karpenter: Automatic Infrastructure Management
EKS Auto Mode automates the following:
- Karpenter controller upgrades: AWS ensures compatibility with automatic updates
- Security patches: AL2023 AMI automatic patching and node rolling replacement
- NodePool default configuration: system and general-purpose pools are pre-configured
- IAM roles: KarpenterNodeRole and KarpenterControllerRole automatically created
Auto Mode vs Self-managed Detailed Comparison
Ultra-Fast Scaling Methods in Auto Mode
Auto Mode uses the same Karpenter engine as Self-managed, so scaling speed is identical. However, the following optimizations are available:
- Leverage built-in NodePools: `system` and `general-purpose` pools are already optimized
- Expand instance types: Add more instance types to default pools
- Tune Consolidation policy: Enable `WhenEmptyOrUnderutilized`
- Adjust Disruption Budget: Minimize node replacement during spikes
Built-in NodePool Configuration
EKS Auto Mode provides two default NodePools:
# system pool (kube-system, monitoring, etc.)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: system
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["t3.medium", "t3.large"]
taints:
- key: CriticalAddonsOnly
value: "true"
effect: NoSchedule
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 300s
---
# general-purpose pool (application workloads)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general-purpose
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values:
- c6i.xlarge
- c6i.2xlarge
- c6i.4xlarge
- m6i.xlarge
- m6i.2xlarge
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
budgets:
- nodes: "10%"
Self-managed to Auto Mode Migration Guide
To ensure workload availability during migration, a blue/green transition approach is recommended.
Step-by-step Migration:
# Step 1: Create new Auto Mode cluster
aws eks create-cluster \
--name my-cluster-auto \
--version 1.33 \
--compute-config enabled=true \
--role-arn arn:aws:iam::ACCOUNT:role/EKSClusterRole \
--resources-vpc-config subnetIds=subnet-xxx,subnet-yyy
# Step 2: Backup existing workloads
kubectl get all --all-namespaces -o yaml > workloads-backup.yaml
# Step 3: Create Custom NodePool (optional)
kubectl apply -f custom-nodepool.yaml
# Step 4: Gradually migrate workloads
# - Use DNS weighted routing for gradual traffic transition
# - From existing cluster -> Auto Mode cluster
# Step 5: After validation, drain the old cluster's nodes, then delete the cluster
kubectl drain --ignore-daemonsets --delete-emptydir-data <node-name>
aws eks delete-cluster --name my-cluster
Auto Mode Cluster Creation YAML
# Using eksctl
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: auto-mode-cluster
region: us-east-1
version: "1.33"
# Enable Auto Mode
computeConfig:
enabled: true
nodePoolDefaults:
instanceTypes:
- c6i.xlarge
- c6i.2xlarge
- c6i.4xlarge
- c7i.xlarge
- c7i.2xlarge
- m6i.xlarge
- m6i.2xlarge
# VPC configuration
vpc:
id: vpc-xxx
subnets:
private:
us-east-1a: { id: subnet-xxx }
us-east-1b: { id: subnet-yyy }
us-east-1c: { id: subnet-zzz }
# IAM configuration (auto-created)
iam:
withOIDC: true
Auto Mode NodePool Customization
# Custom NodePool for high-performance workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: high-performance
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values:
- c7i.4xlarge
- c7i.8xlarge
- c7i.16xlarge
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a", "us-east-1b"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: high-perf-class
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 600s # 10 min wait
    budgets:
    - nodes: "0" # Halt replacement during business hours
      schedule: "0 8 * * mon-fri"
      duration: 10h # schedule and duration must be set together
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: high-perf-class
spec:
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: auto-mode-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: auto-mode-cluster
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 10000
throughput: 500
Karpenter v1.x Latest Features
Consolidation Policy: Speed vs Cost
Starting from Karpenter v1.0, the consolidationPolicy field has moved to the disruption section.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: optimized-pool
spec:
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
# Consolidation exclusion conditions
expireAfter: 720h # Auto-replace nodes after 30 days
Policy Comparison:
| Policy | Behavior | Speed | Cost Optimization | Suitable Environment |
|---|---|---|---|---|
| `WhenEmpty` | Remove empty nodes only | Fast | Limited | Stable traffic |
| `WhenEmptyOrUnderutilized` | Empty nodes + consolidate underutilized nodes | Moderate | Excellent | Variable traffic |
Scaling Speed Impact Analysis:
Disruption Budgets: Configuration for Burst Traffic
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: burst-ready
spec:
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
    # Time-based Disruption Budgets (schedule and duration must be defined together)
    budgets:
    - nodes: "0" # Halt replacement during business hours
      schedule: "0 8 * * mon-fri"
      duration: 10h # 08:00-18:00
      reasons:
      - Drifted
      - Underutilized
    - nodes: "20%" # Allow up to 20% replacement at night
      schedule: "0 19 * * *"
      duration: 12h
      reasons:
      - Drifted
      - Empty
    - nodes: "50%" # Aggressive optimization on weekends
      schedule: "0 0 * * sat"
      duration: 48h
Budget Strategies:
- Events like Black Friday: `nodes: "0"` (completely halt replacement)
- Normal operations: `nodes: "10-20%"` (gradual optimization)
- Nights/weekends: `nodes: "50%"` (aggressive cost reduction)
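The schedule-driven budgets above can be modeled as a small selector: given the current hour and weekday, return the active replacement budget. This is a simplified model of the intent (Karpenter evaluates real cron expressions and applies the most restrictive active budget; the hour ranges below are illustrative):

```python
def active_budget(hour, weekday):
    """weekday: 0=Mon .. 6=Sun. Mirrors the three budget tiers above."""
    if weekday >= 5:                 # Saturday/Sunday: aggressive optimization
        return "50%"
    if 8 <= hour < 18:               # business hours, Mon-Fri: halt replacement
        return "0"
    return "20%"                     # weekday nights: gradual optimization

print(active_budget(10, 2))  # '0'   (Wednesday 10:00)
print(active_budget(22, 2))  # '20%' (Wednesday 22:00)
print(active_budget(12, 6))  # '50%' (Sunday noon)
```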
Drift Detection: Automatic Node Replacement
Drift Detection automatically replaces existing nodes when the NodePool spec has changed.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: drift-enabled
spec:
template:
spec:
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values: ["c6i.xlarge", "c7i.xlarge"] # Drift detected on spec change
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: drift-class
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
budgets:
- nodes: "20%" # Control Drift replacement speed
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: drift-class
spec:
amiSelectorTerms:
- alias: al2023@latest # Auto Drift on AMI change
# AMI update scenario
# 1. AWS releases new AL2023 AMI
# 2. Karpenter detects Drift
# 3. Nodes replaced sequentially according to Budget
Drift Trigger Conditions:
- NodePool instance type change
- EC2NodeClass AMI change
- userData script modification
- blockDeviceMappings change
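Conceptually, drift detection is a hash comparison: Karpenter records a hash of the spec a node was launched with and compares it against the current NodePool/EC2NodeClass spec. A toy model of that comparison (the real controller stores the hash in node annotations; field names here are illustrative):

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Deterministic short hash over the drift-triggering fields."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

launched = spec_hash({"instanceTypes": ["c6i.xlarge"], "ami": "al2023@latest"})
desired  = spec_hash({"instanceTypes": ["c6i.xlarge", "c7i.xlarge"], "ami": "al2023@latest"})

# Spec changed since launch -> node is Drifted and gets replaced per Budget
print(launched != desired)  # True
```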
NodePool Weights: Spot to On-Demand Fallback
# Weight 100: Highest priority (Spot)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spot-primary
spec:
  weight: 100 # Higher weight = higher priority
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
---
# Weight 50: Fallback when Spot unavailable
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: on-demand-fallback
spec:
weight: 50
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
Weight Strategy:
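Karpenter orders NodePools by descending weight and tries the highest-weight pool first, falling through to lower weights when capacity is unavailable. A sketch of that ordering (simplified; real Karpenter also evaluates requirements, limits, and instance availability):

```python
def provisioning_order(pools):
    """Return NodePools in the order Karpenter considers them:
    highest .spec.weight first."""
    return sorted(pools, key=lambda p: p["weight"], reverse=True)

order = provisioning_order([
    {"name": "on-demand-fallback", "weight": 50},
    {"name": "spot-primary", "weight": 100},
])
print([p["name"] for p in order])  # ['spot-primary', 'on-demand-fallback']
```

So the Spot pool is attempted first, and the on-demand pool only serves as fallback when Spot capacity cannot be fulfilled.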
Metric Collection Optimization
KEDA + Prometheus: Event-Driven Scaling (1-3s Response)
KEDA polls Prometheus metrics at 1-3 second intervals to achieve ultra-fast scaling.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ultra-fast-scaler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
pollingInterval: 2 # Poll every 2 seconds
cooldownPeriod: 60
minReplicaCount: 10
maxReplicaCount: 1000
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: http_requests_per_second
query: |
sum(rate(http_requests_total[30s])) by (service)
threshold: "100"
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: p99_latency_ms
query: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[30s])) by (le)
) * 1000
threshold: "500" # Scale up when exceeding 500ms
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 5 # Allow 100% increase every 5 seconds
KEDA vs HPA Scaling Speed:
| Configuration | Metric Update | Scaling Decision | Total Time |
|---|---|---|---|
| HPA + Metrics API | 15s | 15s | 30s |
| KEDA + Prometheus | 2s | 1s | 3s |
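The totals in the table follow directly from the per-stage timings (values taken from the table; real latencies vary with polling phase and metric freshness):

```python
# Worst-case per-stage latencies, in seconds, from the table above
hpa  = {"metric_update": 15, "scaling_decision": 15}
keda = {"metric_update": 2,  "scaling_decision": 1}

print(sum(hpa.values()))                          # 30
print(sum(keda.values()))                         # 3
print(sum(hpa.values()) // sum(keda.values()))    # 10x faster decision loop
```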
ADOT Collector Tuning: Minimizing Scrape Interval
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
name: adot-collector-ultra-fast
spec:
mode: daemonset
config: |
receivers:
prometheus:
config:
scrape_configs:
# Critical metrics: 1-second scrape
- job_name: 'critical-metrics'
scrape_interval: 1s
scrape_timeout: 800ms
static_configs:
- targets: ['web-app:8080']
metric_relabel_configs:
- source_labels: [__name__]
regex: '(http_requests_total|http_request_duration_seconds.*|queue_depth)'
action: keep
# Standard metrics: 15-second scrape
- job_name: 'standard-metrics'
scrape_interval: 15s
static_configs:
- targets: ['web-app:8080']
processors:
batch:
timeout: 1s
send_batch_size: 1024
send_batch_max_size: 2048
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
prometheusremotewrite:
endpoint: http://mimir:9009/api/v1/push
headers:
X-Scope-OrgID: "prod"
service:
pipelines:
metrics:
receivers: [prometheus]
processors: [memory_limiter, batch]
exporters: [prometheus, prometheusremotewrite]
CloudWatch Metric Streams
CloudWatch Metric Streams streams metrics to Kinesis Data Firehose in real-time.
# Create Metric Stream
aws cloudwatch put-metric-stream \
--name eks-metrics-stream \
--firehose-arn arn:aws:firehose:us-east-1:ACCOUNT:deliverystream/metrics \
--role-arn arn:aws:iam::ACCOUNT:role/CloudWatchMetricStreamRole \
--output-format json \
--include-filters Namespace=AWS/EKS \
--include-filters Namespace=ContainerInsights
Architecture:
Custom Metrics API HPA
apiVersion: v1
kind: Service
metadata:
name: custom-metrics-api
spec:
ports:
- port: 443
targetPort: 6443
selector:
app: custom-metrics-apiserver
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: custom-metrics-apiserver
spec:
replicas: 2
template:
spec:
containers:
- name: custom-metrics-apiserver
image: your-registry/custom-metrics-api:v1
args:
- --secure-port=6443
- --logtostderr=true
- --v=4
- --prometheus-url=http://prometheus:9090
- --cache-ttl=5s # 5-second cache
Container Image Optimization
Relationship Between Image Size and Scaling Speed
Optimization Strategies:
- Target image size under 500MB
- Minimize runtime layers with multi-stage builds
- Remove unnecessary packages
ECR Pull-Through Cache
# Create Pull-Through Cache rule
aws ecr create-pull-through-cache-rule \
--ecr-repository-prefix docker-hub \
--upstream-registry-url registry-1.docker.io \
--region us-east-1
# Usage example
# Original: docker.io/library/nginx:latest
# Cached: ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest
Benefits:
- Cached in ECR after first pull
- 3-5x faster from second pull onward
- Avoids DockerHub rate limits
Image Pre-pull: DaemonSet vs userData
Method 1: Image Pre-pull with DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: image-prepull
spec:
selector:
matchLabels:
app: image-prepull
template:
metadata:
labels:
app: image-prepull
spec:
initContainers:
- name: prepull-web-app
image: your-registry/web-app:v1.2.3
command: ['sh', '-c', 'echo "Image pulled"']
- name: prepull-sidecar
image: your-registry/sidecar:v2.0.0
command: ['sh', '-c', 'echo "Image pulled"']
containers:
- name: pause
image: public.ecr.aws/eks-distro/kubernetes/pause:3.9
resources:
requests:
cpu: 10m
memory: 20Mi
Method 2: Pre-pull in userData
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: prepull-class
spec:
userData: |
#!/bin/bash
/etc/eks/bootstrap.sh ${CLUSTER_NAME}
# Pre-pull critical images
ctr -n k8s.io images pull your-registry.com/web-app:v1.2.3 &
ctr -n k8s.io images pull your-registry.com/sidecar:v2.0.0 &
ctr -n k8s.io images pull your-registry.com/init-db:v3.1.0 &
wait
Comparison:
| Method | Timing | Effect on New Nodes | Maintenance |
|---|---|---|---|
| DaemonSet | After node Ready | Moderate | Easy |
| userData | During bootstrap | Best | Difficult |
Minimal Base Image: distroless, scratch
# Before optimization: Ubuntu-based (500MB)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y ca-certificates
COPY app /app
CMD ["/app"]
# After optimization: distroless (50MB)
FROM gcr.io/distroless/base-debian12
COPY app /app
CMD ["/app"]
# After optimization: scratch (20MB, static binary only)
FROM scratch
COPY app /app
# CA certs must come from a build stage or the build context
# (COPY reads from the build context, never the host filesystem);
# assumes a prior multi-stage build stage named "build"
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
CMD ["/app"]
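The startup impact of image size is roughly pull time = size / effective pull bandwidth. Assuming a sustained pull rate of ~50 MB/s (an illustrative figure, not a benchmark), the three images above compare as follows:

```python
BANDWIDTH_MB_S = 50  # assumed sustained registry pull rate, MB/s

for name, size_mb in [("ubuntu-based", 500), ("distroless", 50), ("scratch", 20)]:
    print(f"{name}: ~{size_mb / BANDWIDTH_MB_S:.1f}s pull time")
# ubuntu-based: ~10.0s, distroless: ~1.0s, scratch: ~0.4s
```

Even under optimistic bandwidth assumptions, a 500MB base image adds ~10 seconds of cold-start latency per node that smaller images avoid entirely.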
SOCI (Seekable OCI) for Large Images
SOCI loads only the necessary parts without pulling the entire image.
# Create SOCI index
soci create your-registry/large-ml-model:v1.0.0
# Push SOCI index to registry
soci push your-registry/large-ml-model:v1.0.0
# Containerd configuration: register the SOCI snapshotter as a proxy plugin
# (the soci-snapshotter-grpc daemon must be running, and the runtime must
# be configured to use the "soci" snapshotter)
cat <<EOF >> /etc/containerd/config.toml
[proxy_plugins]
  [proxy_plugins.soci]
    type = "snapshot"
    address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
EOF
Results:
- 5GB image starts in 10-15 seconds (previously 2-3 minutes)
- Useful for ML models and large datasets
Bottlerocket Optimization
Bottlerocket is a container-optimized OS with 30% faster boot time compared to AL2023.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: bottlerocket-class
spec:
amiSelectorTerms:
- alias: bottlerocket@latest
userData: |
[settings.kubernetes]
cluster-name = "${CLUSTER_NAME}"
[settings.kubernetes.node-labels]
"karpenter.sh/fast-boot" = "true"
In-Place Pod Vertical Scaling (K8s 1.33+)
Starting from K8s 1.33, in-place Pod resize is beta and enabled by default, so you can adjust a container's CPU and memory without recreating the Pod.
apiVersion: v1
kind: Pod
metadata:
name: resizable-pod
spec:
containers:
- name: app
image: your-app:v1
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "1000m"
memory: "1Gi"
resizePolicy:
- resourceName: cpu
restartPolicy: NotRequired # CPU does not require restart
- resourceName: memory
restartPolicy: RestartContainer # Memory requires restart
Criteria for Choosing Between Scaling and Resizing:
| Scenario | Method | Reason |
|---|---|---|
| Traffic surge (2x or more) | HPA Scale-out | Load distribution needed |
| CPU utilization exceeds 80% | In-Place Resize | Single Pod performance insufficient |
| Memory OOM risk | In-Place Resize | Saves restart time |
| 10+ Pods needed | HPA Scale-out | Availability improvement |
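The decision table can be expressed as a small helper. The thresholds come from the table; the function and parameter names are illustrative, not a real API:

```python
def choose_action(traffic_multiplier, cpu_util, oom_risk, pods_needed):
    """Mirrors the table: scale out for load distribution and availability,
    resize in place for single-Pod resource pressure."""
    if traffic_multiplier >= 2 or pods_needed >= 10:
        return "hpa-scale-out"        # load distribution / availability
    if cpu_util > 0.80 or oom_risk:
        return "in-place-resize"      # single Pod under resource pressure
    return "no-action"

print(choose_action(2.5, 0.50, False, 3))  # hpa-scale-out
print(choose_action(1.2, 0.90, False, 3))  # in-place-resize
print(choose_action(1.0, 0.40, True, 3))   # in-place-resize
```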
Advanced Patterns
Pod Scheduling Readiness Gates (K8s 1.30+)
Control scheduling timing with schedulingGates.
apiVersion: v1
kind: Pod
metadata:
name: gated-pod
spec:
schedulingGates:
- name: "example.com/image-preload" # Wait for image preload
- name: "example.com/config-ready" # Wait for ConfigMap ready
containers:
- name: app
image: your-app:v1
Gate Removal Controller Example:
// Gate removal logic: clear the schedulingGates once all preconditions are met
func (c *Controller) removeGateWhenReady(ctx context.Context, pod *v1.Pod) error {
	if !imagePreloaded(pod) || !configReady(pod) {
		return nil // not ready yet; retry on the next reconcile
	}
	patch := []byte(`{"spec":{"schedulingGates":null}}`)
	_, err := c.client.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
ARC + Karpenter AZ Failure Recovery
Combining AWS Route 53 Application Recovery Controller (ARC) with Karpenter enables automatic recovery during AZ failures.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: az-resilient
spec:
template:
spec:
requirements:
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a", "us-east-1b", "us-east-1c"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
# Automatic replacement on AZ failure
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: az-resilient-class
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: az-resilient-class
spec:
subnetSelectorTerms:
# ARC Zonal Shift integration: automatically exclude failed AZ
- tags:
karpenter.sh/discovery: my-cluster
aws:cloudformation:logical-id: PrivateSubnet*
Zonal Shift Scenario:
- Failure occurs in us-east-1a
- ARC triggers Zonal Shift
- Karpenter excludes 1a subnet and creates nodes only in 1b and 1c
- After recovery, 1a is automatically re-included
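The subnet exclusion step can be modeled as filtering discovery-tagged subnets by healthy AZs. This is a conceptual sketch of the behavior described above, not Karpenter's actual selection code:

```python
def eligible_subnets(subnets, impaired_azs):
    """Exclude subnets in impaired AZs from node provisioning decisions."""
    return [s for s in subnets if s["az"] not in impaired_azs]

subnets = [
    {"id": "subnet-xxx", "az": "us-east-1a"},
    {"id": "subnet-yyy", "az": "us-east-1b"},
    {"id": "subnet-zzz", "az": "us-east-1c"},
]

# During a zonal shift away from us-east-1a, only 1b/1c remain eligible
print([s["id"] for s in eligible_subnets(subnets, {"us-east-1a"})])
# ['subnet-yyy', 'subnet-zzz']

# After recovery the impaired set empties and 1a is eligible again
print(len(eligible_subnets(subnets, set())))  # 3
```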