
EKS PCP Tier Sizing & Performance Validation Guide

Purpose: This guide provides detailed specifications for EKS Provisioned Control Plane (PCP) tiers, explains control plane architecture improvements, and outlines performance validation methodologies.

Related Documentation

For Control Plane architecture overview, CRD impact analysis, monitoring setup, and CRD design best practices, see EKS Control Plane & CRD at Scale Comprehensive Guide.


In this post

Organizations running large-scale Kubernetes workloads on Amazon EKS face a critical question: how do you ensure your control plane can handle peak load without over-provisioning? This technical deep dive explores three key areas:

  1. PCP tier specifications and practical object limits — Understanding API request concurrency (seats), pod scheduling rates, and etcd database sizing with real-world examples
  2. EKS control plane architecture improvements — How AWS engineering enhancements deliver consistent performance and higher availability
  3. Performance validation methodology — Using ClusterLoader2 and comprehensive metrics to verify control plane capacity

Whether you're planning a 10,000-node cluster or troubleshooting API throttling, this guide provides the technical details and measurement strategies you need to right-size your EKS control plane.


1. PCP Tier Specifications and Practical Object Limits

Key takeaway: API Request Concurrency (Seats) represents "concurrent seat capacity," not "concurrent request count." A single LIST request can consume up to 10 seats depending on the number of objects returned. Customer-facing concurrency numbers (e.g., 4XL = 6,800 seats) apply cluster-wide. For a 10,000-node / 1,000,000-pod environment, you need ~8.2 GB etcd DB capacity at peak, ~1,155 seats, and ~370 pods/sec for AZ failure recovery — making 4XL the recommended tier. Kubernetes upstream officially supports up to 5,000 nodes / 150,000 pods, though AWS has benchmarked both 5K and 10K node configurations. Measure actual APF seat usage via apiserver_flowcontrol_current_executing_seats in CloudWatch (free) over a 1-week period to determine the appropriate tier.

1.1 Large-Scale Single Cluster Benchmarks

The following reference data is based on public documentation and AWS benchmarks for large single-cluster deployments.

Kubernetes Upstream and EKS Official Test Limits

| Benchmark | Nodes | Total Pods | Total K8s Objects | Notes |
|---|---|---|---|---|
| K8s SIG-Scalability Official Limit | 5,000 | 150,000 | ~300,000 | Upstream SLI/SLO guarantee scope |
| EKS 5K Node Benchmark | 5,000 | ~150,000 | ~300,000 | AWS validated |
| EKS 10K Node Benchmark | 10,000 | ~500,000+ | ~760,000 | PCP 4XL, API P99 < 1s achieved |

Note: While Kubernetes upstream's official SLI/SLO guarantee covers 5,000 nodes / 150,000 pods, this represents a conservative baseline applicable to all Kubernetes distributions. EKS PCP is designed to support beyond this threshold into 10K+ node environments.

Confirmed Customer Cases

| Case | Object Count | Tier | Result |
|---|---|---|---|
| Company S (Cloud/SaaS, cert-manager) | ~200K CRDs + ~400K related = ~600K | PCP recommended | Stable operations |
| Company C (Networking/Security, accessrulegroups) | ~12,500 CRDs (~300 KB each) | - | LIST timeout issues |
| Kyverno admissionreports leak (open-source controller) | 1,565,106 CRDs | Standard | etcd DB exceeded 8GB → failure |

Important Notes on Cluster Scale

Some large customers claim to operate "tens of thousands of nodes in a single cluster." However, actual control plane load is not determined solely by node/pod count. Two 10,000-node clusters can require completely different PCP tiers depending on workload patterns.

Accurate tier sizing requires measuring actual APF seat usage, not claimed scale. Refer to section 1.9 "APF Seat Usage Monitoring Guide" to measure your cluster's actual concurrency consumption.

Note: Most large customers operate multiple clusters segmented by workload, region, and environment, rather than scaling a single cluster indefinitely.

Note: AWS has benchmarked PCP performance in both 5K and 10K node environments.

Key Bottlenecks in Single Cluster Scaling

| Scale | Primary Bottleneck | Description |
|---|---|---|
| ~1,000 nodes | Generally none | Standard tier sufficient for most workloads |
| ~3,000 nodes | etcd DB size, API concurrency | XL+ required if CRD-heavy |
| ~5,000 nodes | Scheduler throughput, LIST latency | Approaching K8s upstream official limit; 2XL+ recommended |
| ~10,000 nodes | All components can saturate | 4XL required; consider AZ failure recovery time |
| ~15,000+ nodes | etcd 16GB limit, API Server horizontal scaling limits | 8XL or consider cluster splitting |

1.2 Official Tier Specifications

Amazon EKS Provisioned Control Plane lets customers select a control plane scaling tier directly, with capacity pre-provisioned in advance. While Standard mode auto-scales based on workload, PCP guarantees the minimum performance floor of the selected tier.

| Tier | API Request Concurrency (seats) | Pod Scheduling Rate (pods/sec) | Cluster DB Size | SLA | Price ($/hr) |
|---|---|---|---|---|---|
| Standard | Auto-scaling | Auto-scaling | 8 GB | 99.95% | $0.10 |
| XL | 1,700 | 167 | 16 GB | 99.99% | $1.65 |
| 2XL | 3,400 | 283 | 16 GB | 99.99% | $3.40 |
| 4XL | 6,800 | 400 | 16 GB | 99.99% | $6.90 |
| 8XL | 13,600 | 400 | 16 GB | 99.99% | $14.00 |

Note: Standard tier auto-scales based on workload. XL+ tiers guarantee the minimum performance floor for that tier, with auto-scaling available beyond the baseline as needed. For current pricing, see the AWS EKS pricing page.

1.3 Detailed Control Plane Parameters by Tier

Performance differences across tiers are determined by core parameters in kube-apiserver, kube-scheduler, and kube-controller-manager.

| Parameter | XL | 2XL | 4XL | 8XL |
|---|---|---|---|---|
| API Server max-requests-inflight | 567 | 1,134 | 1,511 | 1,511 |
| API Server max-mutating-requests-inflight | 283 | 566 | 756 | 756 |
| Total APF Seats (inflight sum) | 850 | 1,700 | 2,267 | 2,267 |
| Scheduler kube-api-qps | 167 | 283 | 400 | 400 |
| Scheduler kube-api-burst | 167 | 283 | 400 | 400 |
| KCM kube-api-qps | 180 | 340 | 500 | 500 |
| KCM kube-api-burst | 180 | 340 | 500 | 500 |
| KCM concurrent-gc-syncs | 35 | 50 | 50 | 50 |
| KCM concurrent-hpa-syncs | 29 | 50 | 50 | 50 |
| KCM concurrent-job-syncs | 180 | 340 | 500 | 500 |

Note: Standard tier automatically adjusts control plane parameters based on workload.

1.4 What Each Metric Actually Means

API Request Concurrency (Seats)

"API Request Concurrency = 1,700 seats" does not mean the system can handle 1,700 simultaneous simple requests.

  • Seat is the concurrency unit in APF (API Priority and Fairness). max-requests-inflight + max-mutating-requests-inflight sum to the API Server's Total Concurrency Limit, which is proportionally distributed across PriorityLevelConfigurations.
  • Simple requests (GET/POST/PUT/DELETE): 1 seat consumed
  • Large LIST requests: Consume multiple seats proportional to the number of objects returned (up to 10 seats via Work Estimator)
  • WATCH requests: Consume 1 seat during initial notification burst, then released
  • WRITE requests: Continue occupying additional seat time for WATCH notification processing even after write completion

Note: AWS official spec API Request Concurrency is cluster-wide. EKS control planes run multiple API Servers for high availability, and the sum of APF seats across all servers equals the cluster-wide Concurrency.

Behavior when exceeded:

  1. Total concurrency limit exceeded → requests wait in APF queue
  2. Queue full → rejected with HTTP 429 (Too Many Requests)
  3. Monitor via apiserver_flowcontrol_rejected_requests_total metric

Why 1,700 Seats Isn't as Small as It Sounds

Seats are weighted concurrency, not a simple connection count. The key factor is occupation duration — seats are returned immediately when a request completes.

| Request Type | Seat Cost | Typical Duration | Throughput per Seat per Second |
|---|---|---|---|
| Simple GET | 1 | ~5ms | ~200 req/s |
| LIST (< 500 objects) | 1 | ~100ms | ~10 req/s |
| LIST (5,000 objects) | 10 | ~3s | ~0.3 req/s |
| CREATE/UPDATE | 1 | ~60ms (write + WATCH propagation) | ~16 req/s |

Streaming analogy: Think of seats as bandwidth, not connections. A 4K stream consumes 25 Mbps while SD uses 3 Mbps — "1 Gbps bandwidth" doesn't mean 1,000 concurrent users if they're all streaming 4K. Similarly, kubectl get pods -A (LIST all) is "4K streaming" (10 seats), while kubectl get pod my-pod is "SD streaming" (1 seat).

Real-world production example (~200 nodes, XL tier = 1,700 seats):

Steady-state load:
kubelet heartbeats (200 nodes × 10s interval) → ~20 seats
20 controllers in reconcile loops → ~50 seats
Prometheus scraping → ~5 seats
General kubectl usage → ~10 seats
─────────────────────────────────────────────────────────────
Total: ~85 seats (5% of 1,700)

Peak burst scenario (simultaneous):
500 Deployment rollouts → +500 seats
Monitoring dashboards running large LISTs → +30 seats
HPA simultaneous scaling → +100 seats
AZ failure → pod rescheduling burst → +300 seats
─────────────────────────────────────────────────────────────
Total: ~1,015 seats (60% of 1,700)

Tier selection is driven by peak bursts, not steady-state. 1,700 seats (XL) becomes insufficient when:

  • 500+ nodes with AZ failure triggering 1/3 pod rescheduling
  • 10+ large CRD controllers reconciling simultaneously
  • CI/CD pipelines deploying hundreds of Deployments at once

In these cases, upgrade to 2XL (3,400 seats) or 4XL (6,800 seats).

Pod Scheduling Rate (pods/sec)

  • Represents the number of pods the Scheduler can bind per second.
  • Determined by kube-api-qps and kube-api-burst parameters that control how fast the Scheduler can make API Server requests.
  • At 4XL+, Scheduler QPS plateaus at 400, but bottlenecks are mitigated by increased API Server count (3+).
  • Actual throughput can be verified via scheduler_schedule_attempts_total metric.

Cluster DB Size (etcd)

  • The upper limit of logical data size storable in etcd.
  • Standard: 8 GB
  • XL+: 16 GB
  • Due to etcd's MVCC characteristics, frequent updates cause revision accumulation, making actual DB size 2-5x the data size.
  • Compaction runs every 5 minutes to delete old revisions, but extremely high update frequencies can fill the DB between compaction cycles.
  • When quota exceeded, all writes are rejected → cluster effectively down

1.5 API Request Concurrency vs Inflight Seats — Concept Deep Dive with Examples

Terminology: Two Different Layers

"API Request Concurrency" and "Inflight Seats" are often used interchangeably, but they represent different layers.

┌─────────────────────────────────────────────────────────────────┐
│ AWS Official Spec                                               │
│ "API Request Concurrency = 6,800 seats" (4XL)                   │
│                                                                 │
│ = Total "seat capacity" for concurrent requests cluster-wide    │
│ = Sum of individual API Server APF seats × API Server count     │
└──────────────────────┬──────────────────────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
 ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
 │ API Server #1  │ │ API Server #2  │ │ API Server #N  │
 │                │ │                │ │                │
 │   APF Seats    │ │   APF Seats    │ │   APF Seats    │
 └────────────────┘ └────────────────┘ └────────────────┘

Cluster Total Concurrency = Individual Server APF Seats × API Server Count

| Concept | Scope | Description |
|---|---|---|
| max-requests-inflight | Individual API Server | Maximum concurrent non-mutating (read-only) requests |
| max-mutating-requests-inflight | Individual API Server | Maximum concurrent mutating requests |
| Individual Server APF Total Seats | Individual API Server | Sum of the above two values. Proportionally distributed to APF PriorityLevels |
| API Request Concurrency | Cluster-wide | Individual Server APF Seats × API Server Count. Value published in AWS official specs |
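A quick arithmetic check ties the per-server parameters (section 1.3) to the published cluster-wide figures. The API Server counts below are illustrative assumptions chosen to reproduce the published numbers; EKS does not document the exact server count per tier:

```python
# Per-server APF seats = max-requests-inflight + max-mutating-requests-inflight
# (values from the tier parameter table in section 1.3)
per_server_seats = {
    "XL":  567 + 283,    # = 850
    "2XL": 1134 + 566,   # = 1,700
    "4XL": 1511 + 756,   # = 2,267
    "8XL": 1511 + 756,   # = 2,267
}

# Assumed API Server counts per tier (illustrative, not an official figure)
api_servers = {"XL": 2, "2XL": 2, "4XL": 3, "8XL": 6}

for tier, seats in per_server_seats.items():
    # Cluster-wide concurrency = per-server seats × server count
    print(tier, seats * api_servers[tier])
```

With these assumed counts, 4XL gives 6,801 and 8XL gives 13,602, which round to the published 6,800 and 13,600 seat figures.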

Core Difference: "Concurrent Request Count" vs "Concurrent Seat Count"

Seat (capacity) does not equal 1 request = 1 seat. Seats consumed vary by request type:

| Request Type | Seat Consumption | Occupation Duration | Description |
|---|---|---|---|
| Simple GET (e.g., kubectl get pod my-pod) | 1 | Until response complete | Single object retrieval |
| Simple CREATE/UPDATE/DELETE | 1 | Write complete + WATCH notification propagation time | Mutating requests occupy additional time post-write |
| Small LIST (< 500 objects returned) | 1 | Until response complete | Work Estimator calculates as 1 seat |
| Large LIST (1,000 objects returned) | ~2 | Until response complete | Increases proportional to object count |
| Large LIST (5,000 objects returned) | ~10 | Until response complete | Work Estimator maximum |
| WATCH | 1 initially → 0 | Released after initial burst | Long-lived connection, but seat released |

Concrete Scenario Example (4XL Cluster)

Scenario: 4XL cluster (total 6,800 seats) with the following simultaneous requests

┌─ Concurrent Requests ───────────────────────────────────────────┐
│ │
│ [1] kubectl get pods -A (all namespaces LIST, 50,000 pods) │
│ → Work Estimator: 10 seats × 3s response time = 10 seats │
│ │
│ [2] 20 controllers each running reconciliation loop │
│ → Each controller averages 5 GET + 2 UPDATE concurrent │
│ → 20 × 7 = 140 seats │
│ │
│ [3] CI/CD pipeline deploying 500 Deployments simultaneously │
│ → Each CREATE 1 seat + WATCH notification additional time │
│ → Peak ~500 seats │
│ │
│ [4] Prometheus scraping /metrics endpoints │
│ → Multiple API Servers × 1 seat = few seats │
│ │
│ [5] Other system components (kubelet heartbeat, node status) │
│ → 10,000 nodes × kubelet avg 0.1 concurrent = ~1,000 seats│
│ │
│ Total: 10 + 140 + 500 + few + 1,000 = ~1,653 seats (of 6,800) │
│ → Headroom: ~75% ✅ │
└──────────────────────────────────────────────────────────────────┘

Same scenario on an XL cluster:

  • XL total seats = 1,700
  • The same load of 1,653 seats → ~97% utilization, approaching the limit
  • In 10,000-node environments, kubelet heartbeats and node status updates occur continuously
  • During peak LIST request bursts, seat consumption spikes, causing 429 errors
  • In practice, 4XL or higher is recommended

APF PriorityLevel Distribution Example (4XL Basis)

Cluster-wide APF Seats are proportionally distributed to PriorityLevelConfigurations on each API Server. Below is an individual API Server example:

Individual API Server APF Seat Distribution Example

├─ system (highest priority) ─── ~5% = ~113 seats ← kube-system core components
├─ leader-election ─── ~5% = ~113 seats ← Leader election requests
├─ node-high ─── ~10% = ~227 seats ← kubelet core requests
├─ workload-high ─── ~10% = ~227 seats ← Critical workloads
├─ workload-low ─── ~15% = ~340 seats ← General workloads
├─ global-default ─── ~15% = ~340 seats ← Unclassified requests
├─ catch-all ─── ~5% = ~113 seats ← Lowest priority
└─ exempt ─── Unlimited ← system:masters, etc.

Key point: Even with sufficient total seats, if a specific PriorityLevel saturates, only requests in that group get rejected with 429. For example, if the 340 seats allocated to workload-low saturate, regular user kubectl requests may be rejected.

1.6 Large-Scale Cluster Scenario: 10,000 Nodes × 100 Pods Environment PCP Sizing

Assumptions

Cluster Scale:
- Worker Nodes: 10,000
- Pods per Node: 100
- Total Pods: 1,000,000 (1 million)

CRD Usage Scenario:
- CRD Type A (network policy): 1 per node = 10,000 × ~2 KB = ~20 MB
- CRD Type B (service mesh sidecar config): 1 per pod = 1,000,000 × ~1 KB = ~1 GB
- CRD Type C (certificate management): 1 per service = 5,000 × ~3 KB = ~15 MB
- CRD Type D (monitoring rules): 1 per namespace = 200 × ~5 KB = ~1 MB

Step 1: etcd DB Size Estimation

[K8s Built-in Objects]
Pod: 1,000,000 × ~1.5 KB = ~1.5 GB
Node: 10,000 × ~5 KB = ~50 MB
Service: 5,000 × ~1 KB = ~5 MB
Endpoint/EndpointSlice: 15,000 × ~2 KB = ~30 MB
ConfigMap: 10,000 × ~1 KB = ~10 MB
Secret: 20,000 × ~1 KB = ~20 MB
Deployment/ReplicaSet: 10,000 × ~2 KB = ~20 MB
Namespace: 200 × ~0.5 KB = ~0.1 MB
ServiceAccount: 10,000 × ~0.5 KB = ~5 MB
Event: 50,000 × ~1 KB = ~50 MB ← Separate partition on XL+
──────────────────────────────────────────────────────
Subtotal: ~1.69 GB

[CRD Objects]
Type A (network policy): 10,000 × 2 KB = ~20 MB
Type B (sidecar config): 1,000,000 × 1 KB = ~1.0 GB
Type C (certificates): 5,000 × 3 KB = ~15 MB
Type D (monitoring rules): 200 × 5 KB = ~1 MB
──────────────────────────────────────────────────────
Subtotal: ~1.04 GB

[MVCC Revision Overhead]
Pod status updates: Every 30s × 1,000,000 pods → ~33,333 updates/sec
CRD Type B updates: Every 60s → ~16,667 updates/sec
Compaction cycle: 5 minutes = 300 seconds

Accumulated revisions in 5 min = (33,333 + 16,667) × 300 = ~15,000,000 revisions
Additional size per revision ≈ avg ~0.1 KB (changed fields only)
→ MVCC overhead: ~15,000,000 × 0.1 KB = ~1.5 GB (at peak)

※ Immediately after compaction, this overhead approaches zero
※ In reality, compaction and updates proceed simultaneously, so
steady-state MVCC overhead ≈ 1-2x data size estimated

[Total etcd DB Size Estimate]
─────────────────────────────────────────────────────────
Built-in objects: ~1.69 GB
CRD objects: ~1.04 GB
MVCC Revision overhead (steady-state): ~2.73 GB (1x multiplier applied)
─────────────────────────────────────────────────────────
Total: ~5.46 GB
Peak (pre-compaction): ~8.19 GB (1.5x multiplier applied)
─────────────────────────────────────────────────────────

Verdict: At ~8.2 GB peak, Standard's 8 GB limit is exceeded. XL+ (16 GB) is required and provides safe margin.
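The walkthrough above reduces to the sizing formula summarized later in this section: required size = (built-in data + CRD data) × MVCC multiplier. A minimal sketch, where the multipliers are the update-frequency heuristics from this guide rather than measured values:

```python
def etcd_required_gb(builtin_gb: float, crd_gb: float, mvcc_multiplier: float) -> float:
    """Estimated etcd DB size: (built-in + CRD data) × MVCC multiplier."""
    return (builtin_gb + crd_gb) * mvcc_multiplier

# Figures from the 10K-node / 1M-pod walkthrough above
steady = etcd_required_gb(1.69, 1.04, 2.0)  # data + ~1x revision overhead
peak   = etcd_required_gb(1.69, 1.04, 3.0)  # pre-compaction worst case
print(f"steady ≈ {steady:.2f} GB, peak ≈ {peak:.2f} GB")
# steady ≈ 5.46 GB, peak ≈ 8.19 GB → peak exceeds Standard's 8 GB limit
```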

Step 2: API Concurrency (Seats) Requirement Estimation

[Continuous API Load — Ongoing Requests]

kubelet heartbeat (NodeStatus):
10,000 nodes × (1 UPDATE / 10s) = 1,000 req/sec
Concurrent processing (avg 50ms response time):
1,000 × 0.05 = ~50 seats (1 seat each)

kubelet Pod status updates:
Only changed pods → avg ~500 UPDATE/sec
Concurrent: 500 × 0.05 = ~25 seats

kube-controller-manager:
GC, HPA, Job, etc. multiple controllers → avg ~100 concurrent seats

kube-scheduler:
New/reschedule pods → avg ~50 concurrent seats

CRD controllers (4 types):
Each controller's reconciliation loop → avg ~200 concurrent seats

Other systems (DNS, CNI, monitoring, etc.):
→ ~100 concurrent seats

──────────────────────────────────────────
Baseline Seats Consumption: ~525 seats
──────────────────────────────────────────

[Peak Additional Load]

Large rolling update (100 Deployments simultaneously):
→ +500 seats (CREATE/UPDATE surge)

Full Pod LIST (monitoring dashboard, kubectl):
→ LIST 1,000,000 pods = ~10 seats × 3 concurrent = +30 seats
→ Response time lengthens, increasing seat occupation time

HPA scaling events:
→ +100 seats

──────────────────────────────────────────
Peak Total Seats Consumption: ~1,155 seats
──────────────────────────────────────────

Step 3: Scheduling Throughput Requirement Estimation

[Normal Operations]
Daily avg deployments: ~200
Avg pods per deployment: ~50
Daily scheduling total: 200 × 50 = 10,000 pods/day
Per-second avg: ~0.12 pods/sec → All tiers sufficient

[Peak Scenario — Large Rollout]
10 simultaneous Deployments × 100 replicas = 1,000 pods in 5 minutes
Required throughput: 1,000 / 300s = ~3.3 pods/sec → All tiers sufficient

[Extreme Scenario — Node Failure Mass Rescheduling]
AZ failure, 3,333 nodes (1/3) with 333,300 pods need rescheduling
Target recovery time 15 minutes: 333,300 / 900s = ~370 pods/sec
→ 4XL (400 pods/sec) or higher required

Step 4: Comprehensive PCP Tier Sizing Result

┌───────────────────────────────────────────────────────────────────┐
│ 10K Nodes × 100 Pods Environment Comprehensive Sizing │
├──────────────────┬──────────┬──────────┬────────────┬────────────┤
│ Evaluation Item │ Required │ Tier │ Standard │ Verdict │
├──────────────────┼──────────┼──────────┼────────────┼────────────┤
│ etcd DB Size │ ~8.2 GB │ XL+ │ 8GB limit │ ❌ Exceeded│
│ (at peak) │ (peak) │ (16GB) │ No margin │ │
├──────────────────┼──────────┼──────────┼────────────┼────────────┤
│ API Concurrency │ ~1,155 │ XL │ Auto-scale │ Near floor │
│ (peak seats) │ seats │ (1,700) │ │ │
├──────────────────┼──────────┼──────────┼────────────┼────────────┤
│ Pod Scheduling │ ~370 │ 4XL │ Auto-scale │ ❌ Insufficient │
│ (AZ failure) │ pods/sec │ (400) │ │ │
├──────────────────┼──────────┼──────────┼────────────┼────────────┤
│ SLA requirement │ 99.99% │ XL+ │ 99.95% │ Not met │
├──────────────────┴──────────┴──────────┴────────────┴────────────┤
│ │
│ ✅ Final Recommendation: 4XL │
│ │
│ Rationale: │
│ 1. etcd 16GB provides sufficient margin at peak (8.2/16 = 51%) │
│ 2. API Concurrency 6,800 seats adequate for peak (1,155/6,800=17%)│
│ 3. AZ failure requires 370 pods/sec recovery → 4XL's 400 needed │
│ 4. Multiple API Servers via horizontal scaling → distributes │
│ large LIST load │
│ 5. 99.99% SLA guarantee │
│ │
│ ⚠️ If AZ failure recovery time can be relaxed to 30 minutes: │
│ 333,300 / 1,800s = ~185 pods/sec → 2XL (283 pods/sec) viable│
│ │
└───────────────────────────────────────────────────────────────────┘

PCP Tier Sizing Formula Summary

[Formula 1: etcd DB Size]
Required etcd size = (Built-in object total + CRD object total) × MVCC multiplier

MVCC multiplier:
- Low update frequency (< hundreds/min): 1.5x
- Medium update frequency (thousands/min): 2.0x
- High update frequency (thousands/sec): 3.0x ~ 5.0x

Standard suitable: Required < 6.4 GB (8 GB limit, 20% safety margin)
XL+ suitable: Required < 12.8 GB (16 GB limit, 20% safety margin)

[Formula 2: API Concurrency (Seats)]
Peak Seats = Σ(per-component req/sec × avg response time) + LIST additional seats

Individual request seats = 1 (simple GET/POST/PUT/DELETE)
LIST request seats = min(ceil(expected returned objects / 500), 10)
WRITE additional seats = seat × (1 + watch_notification_factor)

Required tier:
Peak Seats < 1,700 → Standard or XL
Peak Seats < 3,400 → 2XL
Peak Seats < 6,800 → 4XL
Peak Seats < 13,600 → 8XL

[Formula 3: Scheduling Throughput]
Required Scheduling Rate = Concurrent reschedule pod count / target recovery time (sec)

Required tier (per the Pod Scheduling Rate column in section 1.2):
Rate < 100 → Standard
Rate < 167 → XL
Rate < 283 → 2XL
Rate < 400 → 4XL / 8XL (same)

[Final Tier = max(Formula1 result, Formula2 result, Formula3 result)]
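The three formulas can be combined into a single tier lookup. A minimal sketch using the cluster-wide tier limits from section 1.2; the 20% safety margin and the helper itself are illustrative, not an official sizing tool:

```python
# (tier name, cluster-wide seats, pods/sec, etcd GB) from section 1.2
TIERS = [
    ("XL",  1_700, 167, 16),
    ("2XL", 3_400, 283, 16),
    ("4XL", 6_800, 400, 16),
    ("8XL", 13_600, 400, 16),
]

def required_tier(peak_seats: int, pods_per_sec: float, etcd_gb: float,
                  margin: float = 0.2) -> str:
    """Smallest PCP tier satisfying all three formulas.

    A 20% safety margin is applied to seats and etcd size; the scheduling
    rate is compared directly, matching the worked example above.
    """
    for name, seats, rate, db in TIERS:
        if (peak_seats <= seats * (1 - margin)
                and pods_per_sec <= rate
                and etcd_gb <= db * (1 - margin)):
            return name
    return "cluster splitting"

# 10K-node walkthrough: ~1,155 seats, ~370 pods/sec, ~8.2 GB peak
print(required_tier(1_155, 370, 8.2))  # 4XL (driven by the scheduling rate)
print(required_tier(1_155, 185, 8.2))  # 2XL (30-minute AZ recovery relaxation)
```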

1.7 Production Environment Practical Object Quantities

Theoretical Maximum Based on etcd DB Size (PCP 16GB Basis)

| Object Type | Typical Size | Theoretical Maximum Count | Practical Recommended Limit (50% safety margin) |
|---|---|---|---|
| Small CRD (< 1 KB) | ~0.5 - 1 KB | Millions ~ 16M+ | ~8M |
| Typical CRD (1 ~ 5 KB) | ~2 - 3 KB | 3M ~ 8M | ~1.5M ~ 4M |
| Medium CRD (5 ~ 10 KB) | ~5 - 10 KB | 1.5M ~ 3M | ~750K ~ 1.5M |
| Large CRD (100 KB+) | ~100 - 300 KB | 50K ~ 160K | ~25K ~ 80K |
| etcd single object maximum | 1.5 MiB (hard limit) | - | - |

Why 50% safety margin on practical limits: Must account for MVCC revision accumulation, update frequency, and space occupied by existing K8s built-in objects (Pod, ConfigMap, Secret, etc.).

Actual Benchmarks and Customer Cases

| Case | Object Count | Tier | Result |
|---|---|---|---|
| AWS PCP Official Benchmark | ~760,000 K8s objects | 4XL | API P99 < 1s, Scheduler ~350 pods/sec maintained |
| Company S (Cloud/SaaS, cert-manager) | ~200K CRDs + ~400K related = ~600K | PCP recommended | Stable operations |
| Company C (Networking/Security, accessrulegroups) | ~12,500 CRDs | - | ~300 KB each → LIST timeout (size issue) |
| Kyverno admissionreports leak (open-source controller) | 1,565,106 | Standard | etcd DB exceeded → failure |

Tier Guidance by Object Count

| Tier | Total K8s Objects | CRD Avg Size | API Concurrency Demand | Suitable Use Cases | Monthly Cost Reference |
|---|---|---|---|---|---|
| Standard | < 100K | < 10 KB | Low | Small/medium clusters, dev/staging | ~$73 |
| XL | 100K ~ 300K | < 10 KB | Medium | Medium production, typical CRD usage | ~$1,277 |
| 2XL | 300K ~ 500K | < 10 KB | High | Large production, multiple controllers | ~$2,555 |
| 4XL | 500K ~ 760K+ | < 50 KB | Very High | Ultra-large scale, heavy CRD workloads | ~$5,110 |

Specific Impact of CRDs on Control Plane

CRD operations have unique performance characteristics distinct from built-in resources:

| Impact Area | Description | Risk Level |
|---|---|---|
| DB Size Growth | CRD objects directly occupy etcd storage | High |
| Watch Stream Load | CRD controllers create Watch streams increasing etcd gRPC load | High |
| Request Size | Individual CRD objects can exceed the 1.5MB etcd request limit | Medium |
| List Call Cost | CRDs use JSON encoding (not protobuf) → LIST/WATCH performance significantly degraded vs built-in resources | High |

1.8 Tier Selection Decision Tree

Calculate Total CRD Object Capacity

├─ Total objects × avg size < 5 GB
│ ├─ Low update frequency (< hundreds/min) → Standard
│ └─ High update frequency (thousands/min+) → XL (revision accumulation buffer)

├─ Total objects × avg size = 5 ~ 10 GB
│ ├─ API concurrency < 1,700 seats → XL
│ └─ API concurrency > 1,700 seats → 2XL

├─ Total objects × avg size = 8 ~ 16 GB
│ ├─ API concurrency < 3,400 seats → 2XL
│ └─ API concurrency > 3,400 seats → 4XL

└─ Total objects × avg size > 16 GB (exceeds XL+ etcd limit)
└─ Not viable as single cluster → Consider cluster splitting

PCP Core Design Principles:

  1. Tier determined by K8s metrics that drive billing (inflight requests, scheduler QPS, etcd DB size)
  2. Availability prioritized over cost
  3. Standard tier guarantees minimum Kubernetes upstream defaults or higher

1.9 APF Seat Actual Usage Monitoring Guide — Determine Tier by "Measurement," Not "Claims"

Cluster scale (node count, pod count) alone cannot accurately determine required PCP tier. Even with identical 10,000 nodes, actual seat consumption can differ by 10x+ depending on workload patterns. Therefore, measure your cluster's actual APF seat usage before determining tier.

Method 1: CloudWatch Vended Metrics (Free, Simplest)

For K8s 1.28+ clusters, available in CloudWatch AWS/EKS namespace without additional setup.

Key Metric: apiserver_flowcontrol_current_executing_seats

CloudWatch Console Path:
CloudWatch → Metrics → AWS/EKS → ClusterName
→ apiserver_flowcontrol_current_executing_seats

Recommended Settings:
- Statistic: Maximum (use Max, not Average, to capture peaks)
- Period: 1 minute
- Observation period: Minimum 1 week (including business peaks)

CloudWatch Alarm Setup Example:

Alarm Condition: apiserver_flowcontrol_current_executing_seats
Maximum > (80% of current tier limit) for 5 datapoints within 5 minutes

Example (XL tier):
Maximum > 1,360 (= 1,700 × 80%) → Alert to consider 2XL upgrade
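The alarm above can be scripted. A hedged sketch that builds `put_metric_alarm` parameters with boto3, left commented out; the metric name and `ClusterName` dimension are assumptions to verify against your account's vended metrics before use:

```python
# Cluster-wide seat limits per tier (section 1.2)
TIER_SEATS = {"XL": 1_700, "2XL": 3_400, "4XL": 6_800, "8XL": 13_600}

def seat_alarm_params(cluster_name: str, tier: str, margin: float = 0.8) -> dict:
    """CloudWatch alarm parameters: alert at 80% of the tier's seat limit."""
    return {
        "AlarmName": f"{cluster_name}-apf-seats-{tier}",
        "Namespace": "AWS/EKS",
        "MetricName": "apiserver_flowcontrol_current_executing_seats",
        "Statistic": "Maximum",
        "Period": 60,             # 1-minute datapoints
        "EvaluationPeriods": 5,   # 5 datapoints within 5 minutes
        "DatapointsToAlarm": 5,
        "Threshold": round(TIER_SEATS[tier] * margin),
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
    }

params = seat_alarm_params("prod-cluster", "XL")
print(params["Threshold"])  # 1360, matching the XL example above
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```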

Method 2: Prometheus Direct Scraping (Detailed Analysis)

Verify per-PriorityLevel seat distribution and consumption to analyze which workloads consume most seats.

# Direct API Server metrics query
kubectl get --raw=/metrics | grep apiserver_flowcontrol

# Or use PromQL if Prometheus is deployed

4 Core PromQL Queries:

# ① Current total seats in use (cluster-wide, most important)
sum(apiserver_flowcontrol_current_executing_seats{})

# ② Usage vs limit by PriorityLevel — identify saturation
# (usage)
sum by (priority_level)(apiserver_flowcontrol_current_executing_seats{})
# (limit)
sum by (priority_level)(apiserver_flowcontrol_nominal_limit_seats{})
# (utilization %)
sum by (priority_level)(apiserver_flowcontrol_current_executing_seats{})
/ sum by (priority_level)(apiserver_flowcontrol_nominal_limit_seats{})
* 100

# ③ Requests waiting in APF queue (> 0 indicates capacity shortage)
sum by (priority_level)(apiserver_flowcontrol_current_inqueue_requests{})

# ④ Requests rejected by APF (429 occurrences — should be 0)
sum(rate(apiserver_flowcontrol_rejected_requests_total{}[5m]))

Method 3: kubectl One-liner — Check Right Now

Even without Prometheus, you can check directly from the API Server metrics endpoint.

# Check current total seats in use
kubectl get --raw=/metrics | grep 'apiserver_flowcontrol_current_executing_seats{' \
| awk '{sum+=$2} END {print "Current seats in use:", sum}'

# Seat usage by PriorityLevel
kubectl get --raw=/metrics | grep 'apiserver_flowcontrol_current_executing_seats{' \
| sort -t' ' -k2 -rn | head -10

# Allocated limit by PriorityLevel
kubectl get --raw=/metrics | grep 'apiserver_flowcontrol_nominal_limit_seats{' \
| sort -t' ' -k2 -rn

# Check rejected requests (if not 0, immediate action needed)
kubectl get --raw=/metrics | grep 'apiserver_flowcontrol_rejected_requests_total{' \
| awk '{sum+=$2} END {print "Total rejected requests:", sum}'

# Check etcd DB size
kubectl get --raw=/metrics | grep 'apiserver_storage_size_bytes{' \
| awk '{sum+=$2} END {printf "etcd DB size: %.2f GB\n", sum/1024/1024/1024}'
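The same sums can be computed in Python from the raw text returned by `kubectl get --raw=/metrics` (Prometheus exposition format), which is convenient when collecting snapshots programmatically. `SAMPLE` below is fabricated illustrative output, not from a real cluster:

```python
SAMPLE = """\
apiserver_flowcontrol_current_executing_seats{flow_schema="kube-system",priority_level="system"} 12
apiserver_flowcontrol_current_executing_seats{flow_schema="probes",priority_level="exempt"} 3
apiserver_flowcontrol_rejected_requests_total{priority_level="workload-low",reason="queue-full"} 0
"""

def metric_sum(text: str, name: str) -> float:
    """Sum all samples of a metric across label sets (mirrors the awk one-liners)."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith(name + "{"):
            total += float(line.rsplit(" ", 1)[1])
    return total

print(metric_sum(SAMPLE, "apiserver_flowcontrol_current_executing_seats"))  # 15.0
print(metric_sum(SAMPLE, "apiserver_flowcontrol_rejected_requests_total"))  # 0.0
```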

Measurement Result Interpretation Guide

Measured Peak Seat Usage

├─ Peak < 1,000 seats
│ └─ Standard or XL sufficient
│ (However, if even a single 429 error occurs, XL+ is needed)

├─ Peak 1,000 ~ 1,400 seats
│ └─ XL recommended (1,700 seats, ~18-41% headroom)

├─ Peak 1,400 ~ 2,700 seats
│ └─ 2XL recommended (3,400 seats, ~21-59% headroom)

├─ Peak 2,700 ~ 5,400 seats
│ └─ 4XL recommended (6,800 seats, ~21-60% headroom)

└─ Peak > 5,400 seats
└─ 8XL (13,600 seats) or consider cluster splitting

⚠️ Important: Maintain minimum 20% safety margin.
When peak reaches 80% of limit, evaluate higher tier.
Reason: Need buffer for unexpected bursts (mass retry after
deploy failure, runaway controller infinite LIST, etc.).

Customer Measurement Request Template

Share the following with customers to collect 1 week of data for appropriate tier determination:

[Request]
Please collect the following 3 metrics from your current cluster over 1 week (including business peaks).

1. APF Seat Peak Usage:
CloudWatch → AWS/EKS → apiserver_flowcontrol_current_executing_seats
→ Maximum value (1-minute interval), max over 1 week

2. 429 Error Occurrences:
CloudWatch → AWS/EKS → apiserver_request_total_429
→ Sum value, whether any non-zero timepoints exist

3. etcd DB Size:
CloudWatch → AWS/EKS → apiserver_storage_size_bytes
→ Maximum value, max over 1 week

[Additional Helpful Information]
- Total node count, total pod count
- CRD types and counts (kubectl get crd results)
- Total CRD objects by resource type
- Daily deployment frequency and scale

2. EKS Control Plane Architecture Improvements

Key takeaway: EKS has continuously improved etcd architecture to achieve consistent latency, enhanced availability, etcd DB 16GB expansion (XL+), Event Sharding, and API Server horizontal scaling. Monitor etcd DB size using the apiserver_storage_size_bytes metric.

2.1 Overview

AWS continuously enhances the EKS control plane etcd architecture, delivering higher performance and availability. These improvements provide direct benefits to customers across all PCP tiers.

2.2 Performance Improvement Benefits for Customers

| Area | Improvement | Detailed Description |
|---|---|---|
| Predictable Performance | Consistent etcd latency | Architecture improvements reduce etcd write latency variance, providing stable API response times |
| Enhanced Data Durability | Stronger data consistency | Data inconsistency potential significantly reduced |
| Improved Availability | Infrastructure optimization | Reduced failure points improve overall availability |
| etcd DB Size Expansion | 16 GB etcd DB (XL+) | 2x expansion vs Standard's 8 GB, accommodating large-scale CRD workloads |
| etcd Event Sharding | Event objects isolated to separate partition | On XL+ tiers, events don't impact main etcd |
| API Server Horizontal Scaling | Multiple API Server operations | Higher tiers enable API Server horizontal scaling for load distribution |

2.3 Features Available Only on XL+ Tiers

| Feature | Standard | XL+ |
|---|---|---|
| API Server Horizontal Scaling | Basic configuration | Scalable |
| etcd DB Size | 8 GB | 16 GB |
| etcd Event Sharding | Not supported | Supported (events in separate partition) |
| SLA | 99.95% | 99.99% |

3. EKS Control Plane Performance Validation Methodology

Key takeaway: ClusterLoader2 (CL2) is the standard load testing tool used by both AWS and the Kubernetes community, including in AWS PCP official benchmarks. Testing follows a 5-phase strategy (Baseline → Ramp-up → Sustained Peak → Burst → Recovery), but accurate bottleneck analysis requires, at minimum, deploying Prometheus to collect detailed APF and etcd metrics. Success criteria follow official Kubernetes SLI/SLOs: API Mutating P99 ≤ 1s, Cluster LIST P99 ≤ 30s, Pod Scheduling P99 ≤ 5s. CloudWatch free metrics cover: 429 errors, API P99 latency, etcd DB size, APF seat usage, scheduling attempts. Prometheus is required for: etcd latency, APF queue depth, KCM workqueue depth, per-PriorityLevel saturation analysis.

3.1 Testing Tool: ClusterLoader2 (CL2)

Both AWS and the Kubernetes community use ClusterLoader2 as the standard load testing tool. AWS PCP launch blog benchmarks were performed with this tool.

Installation and Build

git clone https://github.com/kubernetes/perf-tests.git \
"/Users/$USER/go/src/k8s.io/perf-tests"
cd "/Users/$USER/go/src/k8s.io/perf-tests/clusterloader2"
GOPROXY=direct go build -o /tmp/clusterloader ./cmd/

Execution Method

# Create override file
cat > /tmp/overrides.yaml <<EOL
NODES_PER_NAMESPACE: 50
PODS_PER_NODE: 30
CL2_LOAD_TEST_THROUGHPUT: 80
BIG_GROUP_SIZE: 25
MEDIUM_GROUP_SIZE: 10
SMALL_GROUP_SIZE: 5
SMALL_STATEFUL_SETS_PER_NAMESPACE: 0
MEDIUM_STATEFUL_SETS_PER_NAMESPACE: 0
CL2_ENABLE_PVS: false
PROMETHEUS_SCRAPE_KUBE_PROXY: false
ENABLE_SYSTEM_POD_METRICS: false
EOL

# Run test
/tmp/clusterloader \
  --kubeconfig ~/.kube/config \
  --testconfig testing/load/config.yaml \
  --testoverrides /tmp/overrides.yaml \
  --nodes <NODE_COUNT> \
  --provider "eks" \
  --report-dir ./results \
  --alsologtostderr
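Once a run completes, CL2 writes its measurement results into the directory passed via --report-dir. A quick way to gate CI on the outcome is to count failures in the junit-style report; the file name and layout below are assumptions to adjust for your CL2 version:

```shell
# Sketch: count failed CL2 measurements from a junit-style report.
# The sample file below stands in for a real report; the exact file
# name and XML layout vary by CL2 version (assumption).
mkdir -p ./results
cat > ./results/junit-sample.xml <<'EOF'
<testsuite failures="1" tests="3">
  <testcase name="APIResponsivenessPrometheus"/>
  <testcase name="PodStartupLatency"/>
  <testcase name="SchedulingThroughput"><failure>below threshold</failure></testcase>
</testsuite>
EOF
# Each failed measurement is recorded as a <failure> element.
FAILS=$(grep -c '<failure>' ./results/junit-sample.xml)
echo "failed measurements: ${FAILS}"
```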

Key Override Parameters

| Parameter | Description | Small Test | Large Test |
| --- | --- | --- | --- |
| NODES_PER_NAMESPACE | Nodes per namespace | 10 | 50 |
| PODS_PER_NODE | Pods per node | 10 | 30 |
| CL2_LOAD_TEST_THROUGHPUT | Client-side requests per second | 50 | 1200 |
| BIG_GROUP_SIZE | Large Deployment size | 25 | 25 |
| MEDIUM_GROUP_SIZE | Medium Deployment size | 10 | 10 |
| SMALL_GROUP_SIZE | Small Deployment size | 5 | 5 |
| CL2_SCHEDULER_THROUGHPUT_THRESHOLD | Scheduler throughput threshold | 20 | 100 |
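Before launching a run, it helps to sanity-check the scale a given override file implies. A minimal sketch of the arithmetic, assuming the Large Test column values on a hypothetical 10,000-node cluster:

```shell
# Derive the namespace count and steady-state pod count implied by
# the overrides. Variable names mirror the override keys above;
# the 10,000-node figure is a hypothetical example.
NODES=10000
NODES_PER_NAMESPACE=50
PODS_PER_NODE=30

NAMESPACES=$(( NODES / NODES_PER_NAMESPACE ))  # nodes partitioned into namespaces
TOTAL_PODS=$(( NODES * PODS_PER_NODE ))        # pod count the test drives toward

echo "namespaces=${NAMESPACES} total_pods=${TOTAL_PODS}"
```

For the values above this yields 200 namespaces and 300,000 pods, which you can check against your target tier's object limits before starting the run.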

3.2 Test Scenario Types

| Test Type | Purpose | CL2 Config |
| --- | --- | --- |
| Load Test | Measure service behavior at expected peak load | testing/load/config.yaml |
| Density Test | Verify stability at specific node/pod density | testing/density/config.yaml |
| Scheduler Throughput | Measure pod scheduling throughput limits | CL2 + scheduler throughput override |
| API Request Benchmark | Measure latency/throughput per API verb | testing/request-benchmark |
| Stress Test | Apply load exceeding normal operating range, observe recovery | CL2 + gradual load increase |

3.3 5-Phase Load Testing Strategy

Phase 1: Baseline Measurement
├── Collect key metrics under current workload
├── Analyze API request patterns (by verb, by resource)
└── Record etcd DB size and object counts

Phase 2: Ramp-up
├── Gradually increase pods/deployments with CL2
├── Monitor SLI/SLO thresholds at each step
└── Record when 429 errors or P99 > SLO occurs

Phase 3: Sustained Peak
├── Maintain target load for 30+ minutes
├── Verify stability (no metric fluctuation)
└── Observe control plane auto-scaling (Standard)

Phase 4: Burst Testing
├── Simulate sudden load spikes
├── For PCP, verify immediate response capability
└── For Standard, measure auto-scaling reaction time

Phase 5: Recovery Testing
├── Measure metric normalization time after load removal
└── Verify residual queue depth, latency, etc.
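For Phase 2, the ramp-up can be driven by generating one override file per step and re-running CL2 with each. A sketch with illustrative throughput steps (not AWS-recommended values):

```shell
# Sketch: generate one CL2 override file per ramp-up step (Phase 2).
# The doubling QPS sequence is illustrative; pick steps that bracket
# your expected peak.
mkdir -p /tmp/cl2-phases
STEP=0
for QPS in 20 40 80 160; do
  STEP=$(( STEP + 1 ))
  cat > "/tmp/cl2-phases/overrides-step${STEP}.yaml" <<EOF
CL2_LOAD_TEST_THROUGHPUT: ${QPS}
PODS_PER_NODE: 30
EOF
done
ls /tmp/cl2-phases
```

Each step then runs clusterloader with the corresponding --testoverrides file, recording at which step 429 errors or P99 > SLO first appear.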

3.4 Simple Script-Based Testing (Without CL2)

# 1. Mass Deployment creation for API load test (500 Deployments x 10 replicas)
for i in $(seq 1 500); do
  kubectl create deployment test-$i --image=nginx --replicas=10 &
done
wait

# 2. Mass ConfigMap creation for etcd write load
for i in $(seq 1 10000); do
  kubectl create configmap test-cm-$i --from-literal=key=value &
done
wait

# 3. Mass LIST calls for read load (runs until interrupted with Ctrl-C)
while true; do kubectl get pods --all-namespaces > /dev/null; done

3.5 Official Kubernetes SLI/SLO Standards (Validation Success Criteria)

| SLI | SLO | Metric |
| --- | --- | --- |
| API call latency (Mutating, resource-scope) | P99 ≤ 1s | apiserver_request_sli_duration_seconds |
| API call latency (Read-only, resource-scope) | P99 ≤ 1s | apiserver_request_sli_duration_seconds |
| API call latency (Namespace-scope LIST) | P99 ≤ 30s | apiserver_request_sli_duration_seconds |
| API call latency (Cluster-scope LIST) | P99 ≤ 30s | apiserver_request_sli_duration_seconds |
| Pod startup latency | P99 ≤ 5s (excluding image pull/init) | kubelet_pod_start_sli_duration_seconds |
| Pod scheduling latency | P99 ≤ 5s | scheduler_pod_scheduling_sli_duration_seconds |
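These thresholds can be wired into a simple pass/fail gate once you have measured P99 values from CloudWatch or Prometheus. A minimal sketch with placeholder measurements:

```shell
# Minimal SLO gate: compare measured P99 values (seconds) against the
# upstream thresholds in the table above. The measured numbers below
# are illustrative placeholders, not real benchmark results.
check_slo() {  # usage: check_slo <name> <measured_p99> <threshold>
  # awk handles the fractional comparison; exit 0 means within SLO
  awk -v m="$2" -v t="$3" 'BEGIN { exit !(m <= t) }' \
    && echo "PASS $1 (${2}s <= ${3}s)" \
    || echo "FAIL $1 (${2}s > ${3}s)"
}
check_slo "mutating_api_p99"   0.42 1
check_slo "cluster_list_p99"   12.7 30
check_slo "pod_scheduling_p99" 6.1  5
```

In this example the first two checks pass and the scheduling check fails, which under the 5-phase strategy would mark the current load step as the SLO breach point.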

3.6 Key Monitoring Metrics — Availability by Collection Path

EKS provides five channels of Control Plane observability:

| # | Channel | Cost | Setup | Data Provided | PCP Support |
| --- | --- | --- | --- | --- | --- |
| 1 | CloudWatch Vended Metrics | Free | Automatic (v1.28+) | Core K8s metrics (time series) | Includes tier usage metrics |
| 2 | Prometheus Endpoint | Free (scraping) | Manual configuration | KCM/KSH/etcd detailed metrics | Scalable |
| 3 | Control Plane Logging | CloudWatch standard rates | Manual activation | Logs (API/Audit/Auth/CM/Sched) |  |
| 4 | Cluster Insights | Free | Automatic | Cluster health/upgrade recommendations | PCP tier recommendations (future) |
| 5 | EKS Console Dashboard | Free | Automatic | Visualized metrics + log queries | Tier information displayed |

CloudWatch Vended Metrics (Free, Automatic)

Automatically published to the AWS/EKS CloudWatch namespace for clusters on Kubernetes 1.28+.

| Component | Metric | Description | Priority |
| --- | --- | --- | --- |
| API Server | apiserver_request_total | Total API requests | Critical |
| API Server | apiserver_request_total_4xx | 4xx error requests | Critical |
| API Server | apiserver_request_total_5xx | 5xx error requests | Critical |
| API Server | apiserver_request_total_429 | 429 throttling requests | Critical |
| API Server | apiserver_request_duration_seconds | API request latency | Recommended |
| API Server | apiserver_storage_size_bytes | etcd storage size | Critical |
| API Server | apiserver_flowcontrol_current_executing_seats | Current APF seats in use (PCP core metric) | Critical |
| Scheduler | scheduler_schedule_attempts_total | Total scheduling attempts | Recommended |
| Scheduler | scheduler_schedule_attempts_SCHEDULED | Successful schedules | Critical |
| Scheduler | scheduler_schedule_attempts_UNSCHEDULABLE | Unschedulable count | Recommended |

Prometheus Scraping Endpoints (K8s 1.28+)

# API Server metrics (existing)
kubectl get --raw=/metrics

# Kube-Controller-Manager metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics

# Kube-Scheduler metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics

# etcd metrics (support varies by cluster version)
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/etcd/container/metrics

Note: Using Amazon Managed Prometheus (AMP) Agentless Collector (Poseidon) enables automatic collection of Control Plane metrics to AMP workspace without installing Prometheus in-cluster.

3.7 Load Testing Checklist (10 Items)

| # | Verification Item | Metric/Method | CW Free |
| --- | --- | --- | --- |
| 1 | Are API requests rejected with 429? | apiserver_request_total_429 (CW) or apiserver_flowcontrol_rejected_requests_total (Prometheus) | O |
| 2 | Is API P99 latency within 1 second? | apiserver_request_duration_seconds_*_P99 (CW) or apiserver_request_sli_duration_seconds (Prometheus) | O |
| 3 | Is etcd the bottleneck? | Compare etcd_request_duration_seconds vs apiserver_request_duration_seconds | X (Prometheus needed) |
| 4 | Is the APF queue full? | apiserver_flowcontrol_current_inqueue_requests | X (Prometheus needed) |
| 5 | Which APF priority group is saturated? | Compare apiserver_flowcontrol_nominal_limit_seats vs actual usage | X (Prometheus needed) |
| 6 | Is pod scheduling delayed? | scheduler_pending_pods (CW), scheduler_pod_scheduling_sli_duration_seconds (Prometheus) | Partial |
| 7 | Is etcd DB size approaching the limit? (Standard 8 GB, XL+ 16 GB) | apiserver_storage_size_bytes | O |
| 8 | Is there asymmetric traffic? | Per-API-server inflight request count (check max, not avg) | O |
| 9 | Is a specific client making excessive LISTs? | Analyze LIST frequency/latency by userAgent in audit logs | CW Logs |
| 10 | Are KCM controller queues backing up? | workqueue_depth | X (Prometheus needed) |

Recommendation: During load testing, strongly recommend deploying at minimum Prometheus to collect detailed APF metrics and etcd metrics.

3.8 Useful PromQL Queries

# API request latency heatmap (most important)
max(increase(apiserver_request_duration_seconds_bucket{
subresource!="status",subresource!="token",subresource!="scale",
subresource!="/healthz",subresource!="binding",subresource!="proxy",
verb!="WATCH"
}[$__rate_interval])) by (le)

# APF nominal seat limits per PriorityLevel (PCP tier capacity)
max without(instance)(apiserver_flowcontrol_nominal_limit_seats)

# APF seats currently executing (compare against the limits above)
sum by (priority_level)(apiserver_flowcontrol_current_executing_seats)

# 429 error rate
sum(rate(apiserver_request_total{code="429"}[5m]))
/ sum(rate(apiserver_request_total[5m]))

# 5xx error rate
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m]))

3.9 Useful CloudWatch Logs Insights Queries

-- Find slowest API calls
fields @timestamp, @message
| filter @logStream like "kube-apiserver-audit"
| filter ispresent(requestURI)
| filter verb = "list"
| parse requestReceivedTimestamp /\d+-\d+-(?<StartDay>\d+)T(?<StartHour>\d+):(?<StartMinute>\d+):(?<StartSec>\d+).(?<StartMsec>\d+)Z/
| parse stageTimestamp /\d+-\d+-(?<EndDay>\d+)T(?<EndHour>\d+):(?<EndMinute>\d+):(?<EndSec>\d+).(?<EndMsec>\d+)Z/
| fields (StartHour*3600+StartMinute*60+StartSec+StartMsec/1000000) as StartTime,
(EndHour*3600+EndMinute*60+EndSec+EndMsec/1000000) as EndTime,
(EndTime-StartTime) as DeltaTime
| stats avg(DeltaTime) as AvgLatency, count(*) as Count by requestURI, userAgent
| filter Count >= 50
| sort AvgLatency desc

-- Analyze CRD API call patterns
fields @timestamp, userAgent, verb, requestURI
| filter requestURI like /customresourcedefinitions/
| stats count(*) by verb, userAgent
| sort count(*) desc
| limit 20

-- API QPS from KCM by controller
fields @timestamp, userAgent, @message
| filter @logStream like "kube-apiserver-audit"
| filter user.username like "system:serviceaccount:kube-system:"
| filter verb not like "WATCH"
| stats count(*) as calls by user.username, bin(1m)
| sort calls desc

3.10 API vs etcd Bottleneck Identification

API latency high?

├─ etcd_request_duration_seconds also high?
│ └─ YES → etcd is bottleneck (etcd overload, disk I/O, etc.)

├─ etcd normal but API slow?
│ ├─ Webhook latency high? → Admission Webhook is bottleneck
│ ├─ APF queue wait high? → API Server concurrency insufficient → Consider tier upgrade
│ └─ Only LIST requests slow? → Optimize large LISTs (server-side filtering, pagination)

└─ Both normal but 429 occurring?
└─ Review APF configuration (specific priority group saturation)
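The tree above reduces to a small amount of branching logic. A sketch that takes 0/1 flags you would derive from the named metrics (the wiring is illustrative, not an official AWS tool):

```shell
# Sketch of the bottleneck decision tree. Each argument is a 0/1 flag
# derived from the metrics in the tree (e.g. etcd_slow = 1 when
# etcd_request_duration_seconds P99 is elevated). Thresholds for
# deriving the flags are left to the operator.
diagnose() {  # usage: diagnose <api_slow> <etcd_slow> <webhook_slow> <apf_queued>
  local api_slow=$1 etcd_slow=$2 webhook_slow=$3 apf_queued=$4
  if [ "$api_slow" -eq 0 ]; then echo "healthy"; return; fi
  if [ "$etcd_slow" -eq 1 ]; then echo "etcd bottleneck"; return; fi
  if [ "$webhook_slow" -eq 1 ]; then echo "admission webhook bottleneck"; return; fi
  if [ "$apf_queued" -eq 1 ]; then echo "APF concurrency exhausted - consider tier upgrade"; return; fi
  echo "investigate LIST patterns (filtering, pagination)"
}
diagnose 1 0 0 1
```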

3.11 PCP Tier Upgrade Decision Criteria Summary

| Current Tier | Key Monitoring Metric | Upgrade Condition | Action |
| --- | --- | --- | --- |
| Standard | apiserver_request_total_429 | > 0 sustained | Consider XL+ upgrade |
| XL | apiserver_flowcontrol_current_executing_seats | > 80% of limit (~1,360) | Consider 2XL upgrade |
| 2XL | apiserver_flowcontrol_current_executing_seats | > 80% of limit (~2,720) | Consider 4XL upgrade |
| XL+ | apiserver_storage_size_bytes | > 12.8 GB (16 GB limit) | Storage optimization needed |
| All tiers | scheduler_schedule_attempts_UNSCHEDULABLE | > 0 sustained | Check node resource shortage |
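The 80% seat-utilization trigger from the table is easy to script. A sketch assuming the XL = 1,700 and 2XL = 3,400 seat limits implied by the ~1,360 and ~2,720 thresholds above:

```shell
# Sketch of the 80% seat-utilization upgrade trigger. The tier limits
# (XL 1,700 / 2XL 3,400) are back-calculated from the ~1,360 / ~2,720
# thresholds in the table; confirm against current AWS documentation.
seat_check() {  # usage: seat_check <current_executing_seats> <tier_seat_limit>
  local pct=$(( $1 * 100 / $2 ))  # integer percentage is precise enough here
  if [ "$pct" -gt 80 ]; then
    echo "utilization=${pct}% - consider upgrading to the next tier"
  else
    echo "utilization=${pct}% - within capacity"
  fi
}
seat_check 1450 1700  # XL tier example
```

In practice the first argument would be the peak of apiserver_flowcontrol_current_executing_seats observed over the 1-week measurement window recommended earlier.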

AWS Official Documentation

AWS Blogs

Kubernetes Upstream