EKS PCP Tier Selection and Performance Validation Guide
Purpose: This guide provides detailed specifications for EKS Provisioned Control Plane (PCP) tiers, explains control plane architecture improvements, and outlines performance validation methodologies.
For a Control Plane architecture overview, CRD impact analysis, monitoring setup, and CRD design best practices, see the comprehensive EKS Control Plane & CRD at Scale guide.
In this post
Organizations running large-scale Kubernetes workloads on Amazon EKS face a critical question: how do you ensure your control plane can handle peak load without over-provisioning? This technical deep dive explores three key areas:
- PCP tier specifications and practical object limits — Understanding API request concurrency (seats), pod scheduling rates, and etcd database sizing with real-world examples
- EKS control plane architecture improvements — How AWS engineering enhancements deliver consistent performance and higher availability
- Performance validation methodology — Using ClusterLoader2 and comprehensive metrics to verify control plane capacity
Whether you're planning a 10,000-node cluster or troubleshooting API throttling, this guide provides the technical details and measurement strategies you need to right-size your EKS control plane.
1. PCP Tier Specifications and Practical Object Limits
Key takeaway: API Request Concurrency (Seats) represents "concurrent seat capacity," not "concurrent request count." A single LIST request can consume up to 10 seats depending on the number of objects returned. Customer-facing concurrency numbers (e.g., 4XL = 6,800 seats) apply cluster-wide. For a 10,000-node / 1,000,000-pod environment, you need ~8.2 GB etcd DB capacity at peak, ~1,155 seats, and ~370 pods/sec for AZ failure recovery — making 4XL the recommended tier. Kubernetes upstream officially supports up to 5,000 nodes / 150,000 pods, though AWS has benchmarked both 5K and 10K node configurations. Measure actual APF seat usage via
apiserver_flowcontrol_current_executing_seats in CloudWatch (free) over a 1-week period to determine the appropriate tier.
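To make that week-long measurement repeatable, here is a minimal sketch that pulls the weekly peak of apiserver_flowcontrol_current_executing_seats from CloudWatch with boto3. The AWS/EKS namespace and ClusterName dimension are assumptions; confirm the exact namespace and dimension names your cluster vends before trusting the result.

```python
# Sketch: weekly peak of APF executing seats from CloudWatch.
# Assumptions: metrics are vended to the "AWS/EKS" namespace with a
# "ClusterName" dimension -- verify both names in your account.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EKS",                                              # assumption
    MetricName="apiserver_flowcontrol_current_executing_seats",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],      # assumption
    StartTime=start,
    EndTime=end,
    Period=3600,             # hourly maxima keep us under the 1,440-datapoint limit
    Statistics=["Maximum"],
)

peak = max((dp["Maximum"] for dp in resp["Datapoints"]), default=0)
print(f"Peak executing seats over 7 days: {peak:.0f}")
# Compare against cluster-wide tier capacity: XL=1,700, 2XL=3,400, 4XL=6,800, 8XL=13,600
```

Size against the observed peak plus the burst scenarios discussed in section 1.4, not against the steady-state average.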
1.1 Large-Scale Single Cluster Benchmarks
The following reference data is based on public documentation and AWS benchmarks for large single-cluster deployments.
Kubernetes Upstream and EKS Official Test Limits
| Benchmark | Nodes | Total Pods | Total K8s Objects | Notes |
|---|---|---|---|---|
| K8s SIG-Scalability Official Limit | 5,000 | 150,000 | ~300,000 | Upstream SLI/SLO guarantee scope |
| EKS 5K Node Benchmark | 5,000 | ~150,000 | ~300,000 | AWS validated |
| EKS 10K Node Benchmark | 10,000 | ~500,000+ | ~760,000 | PCP 4XL, API P99 < 1s achieved |
Note: While Kubernetes upstream's official SLI/SLO guarantee covers 5,000 nodes / 150,000 pods, this represents a conservative baseline applicable to all Kubernetes distributions. EKS PCP is designed to support beyond this threshold into 10K+ node environments.
Confirmed Customer Cases
| Case | Object Count | Tier | Result |
|---|---|---|---|
| Company S (Cloud/SaaS, cert-manager) | ~200K CRDs + ~400K related = ~600K | PCP recommended | Stable operations |
| Company C (Networking/Security, accessrulegroups) | ~12,500 CRDs (~300 KB each) | - | LIST timeout issues |
| Kyverno admissionreports leak (open-source controller) | 1,565,106 CRDs | Standard | etcd DB exceeded 8GB → failure |
Important Notes on Cluster Scale
Some large customers claim to operate "tens of thousands of nodes in a single cluster." However, actual control plane load is not determined solely by node/pod count. Two 10,000-node clusters can require completely different PCP tiers depending on workload patterns.
Accurate tier sizing requires measuring actual APF seat usage, not claimed scale. Refer to section 1.9 "APF Seat Usage Monitoring Guide" to measure your cluster's actual concurrency consumption.
Note: Most large customers operate multiple clusters segmented by workload, region, and environment, rather than scaling a single cluster indefinitely.
Note: AWS has benchmarked PCP performance in both 5K and 10K node environments.
Key Bottlenecks in Single Cluster Scaling
| Scale | Primary Bottleneck | Description |
|---|---|---|
| ~1,000 nodes | Generally none | Standard tier sufficient for most workloads |
| ~3,000 nodes | etcd DB size, API Concurrency | XL+ required if CRD-heavy |
| ~5,000 nodes | Scheduler throughput, LIST latency | Approaching K8s upstream official limit, 2XL+ recommended |
| ~10,000 nodes | All components can saturate | 4XL required, consider AZ failure recovery time |
| ~15,000+ nodes | etcd 16GB limit, API Server horizontal scaling limits | 8XL or consider cluster splitting |
1.2 Official Tier Specifications
Amazon EKS Provisioned Control Plane lets customers select a control plane scaling tier directly, so capacity is pre-provisioned rather than scaled reactively. While Standard mode auto-scales based on workload, PCP guarantees the minimum performance floor of the selected tier.
| Tier | API Request Concurrency (seats) | Pod Scheduling Rate (pods/sec) | Cluster DB Size | SLA | Price ($/hr) |
|---|---|---|---|---|---|
| Standard | Auto-scaling | Auto-scaling | 8 GB | 99.95% | $0.10 |
| XL | 1,700 | 167 | 16 GB | 99.99% | $1.65 |
| 2XL | 3,400 | 283 | 16 GB | 99.99% | $3.40 |
| 4XL | 6,800 | 400 | 16 GB | 99.99% | $6.90 |
| 8XL | 13,600 | 400 | 16 GB | 99.99% | $14.00 |
Note: Standard tier auto-scales based on workload. XL+ tiers guarantee the minimum performance floor for that tier, with auto-scaling available beyond the baseline as needed. For current pricing, see the AWS EKS pricing page.
1.3 Detailed Control Plane Parameters by Tier
Performance differences across tiers are determined by core parameters in kube-apiserver, kube-scheduler, and kube-controller-manager.
| Parameter | XL | 2XL | 4XL | 8XL |
|---|---|---|---|---|
| API Server max-requests-inflight | 567 | 1,134 | 1,511 | 1,511 |
| API Server max-mutating-requests-inflight | 283 | 566 | 756 | 756 |
| Total APF Seats (inflight sum) | 850 | 1,700 | 2,267 | 2,267 |
| Scheduler kube-api-qps | 167 | 283 | 400 | 400 |
| Scheduler kube-api-burst | 167 | 283 | 400 | 400 |
| KCM kube-api-qps | 180 | 340 | 500 | 500 |
| KCM kube-api-burst | 180 | 340 | 500 | 500 |
| KCM concurrent-gc-syncs | 35 | 50 | 50 | 50 |
| KCM concurrent-hpa-syncs | 29 | 50 | 50 | 50 |
| KCM concurrent-job-syncs | 180 | 340 | 500 | 500 |
Note: Standard tier automatically adjusts control plane parameters based on workload.
1.4 What Each Metric Actually Means
API Request Concurrency (Seats)
"API Request Concurrency = 1,700 seats" does not mean the system can handle 1,700 simultaneous simple requests.
- Seat is the concurrency unit in APF (API Priority and Fairness).
- max-requests-inflight + max-mutating-requests-inflight sum to the API Server's Total Concurrency Limit, which is proportionally distributed across PriorityLevelConfigurations.
- Simple requests (GET/POST/PUT/DELETE): 1 seat consumed
- Large LIST requests: Consume multiple seats proportional to the number of objects returned (up to 10 seats via Work Estimator)
- WATCH requests: Consume 1 seat during initial notification burst, then released
- WRITE requests: Continue occupying additional seat time for WATCH notification processing even after write completion
Note: The API Request Concurrency figure in the AWS official spec is cluster-wide. EKS control planes run multiple API Servers for high availability, and the sum of APF seats across all servers equals the cluster-wide concurrency.
Behavior when exceeded:
- Total concurrency limit exceeded → requests wait in APF queue
- Queue full → rejected with HTTP 429 (Too Many Requests)
- Monitor via the apiserver_flowcontrol_rejected_requests_total metric
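If you export or dump API server metrics, the rejection counter is easy to inspect offline. A minimal sketch, assuming you have captured a dump with kubectl get --raw /metrics > metrics.txt; any rejections on workload-critical priority levels are a signal that the tier or APF configuration needs attention.

```python
# Sketch: list APF rejections by priority level from a raw /metrics dump.
# Assumes the dump was captured with: kubectl get --raw /metrics > metrics.txt
from prometheus_client.parser import text_string_to_metric_families

with open("metrics.txt") as f:
    text = f.read()

for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name == "apiserver_flowcontrol_rejected_requests_total" and sample.value > 0:
            # priority_level shows who is being throttled; reason distinguishes
            # queue-full rejections from queue-wait timeouts
            print(sample.labels.get("priority_level"),
                  sample.labels.get("reason"),
                  int(sample.value))
```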
Why 1,700 Seats Isn't as Small as It Sounds
Seats are weighted concurrency, not a simple connection count. The key factor is occupation duration — seats are returned immediately when a request completes.
| Request Type | Seat Cost | Typical Duration | Throughput per Seat per Second |
|---|---|---|---|
| Simple GET | 1 | ~5ms | ~200 req/s |
| LIST (< 500 objects) | 1 | ~100ms | ~10 req/s |
| LIST (5,000 objects) | 10 | ~3s | ~0.03 req/s |
| CREATE/UPDATE | 1 | ~60ms (write + WATCH propagation) | ~16 req/s |
Streaming analogy: Think of seats as bandwidth, not connections. A 4K stream consumes 25 Mbps while SD uses 3 Mbps — "1 Gbps bandwidth" doesn't mean 1,000 concurrent users if they're all streaming 4K. Similarly, kubectl get pods -A (LIST all) is "4K streaming" (10 seats), while kubectl get pod my-pod is "SD streaming" (1 seat).
Real-world production example (~200 nodes, XL tier = 1,700 seats):
Steady-state load:
kubelet heartbeats (200 nodes × 10s interval) → ~20 seats
20 controllers in reconcile loops → ~50 seats
Prometheus scraping → ~5 seats
General kubectl usage → ~10 seats
─────────────────────────────────────────────────────────────
Total: ~85 seats (5% of 1,700)
Peak burst scenario (simultaneous):
500 Deployment rollouts → +500 seats
Monitoring dashboards running large LISTs → +30 seats
HPA simultaneous scaling → +100 seats
AZ failure → pod rescheduling burst → +300 seats
─────────────────────────────────────────────────────────────
Total: ~1,015 seats (60% of 1,700)
Tier selection is driven by peak bursts, not steady-state. 1,700 seats (XL) becomes insufficient when:
- 500+ nodes with AZ failure triggering 1/3 pod rescheduling
- 10+ large CRD controllers reconciling simultaneously
- CI/CD pipelines deploying hundreds of Deployments at once
In these cases, upgrade to 2XL (3,400 seats) or 4XL (6,800 seats).
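The same budgeting arithmetic is simple to script. Below is a minimal sketch that mirrors the worked example above; every per-activity seat cost is a planning assumption to be replaced with measured values from your own cluster.

```python
# Sketch: back-of-the-envelope APF seat budget (numbers mirror the example above;
# treat every per-activity cost as an assumption, not a measurement).
TIER_SEATS = {"XL": 1_700, "2XL": 3_400, "4XL": 6_800, "8XL": 13_600}

steady_state = {
    "kubelet heartbeats (200 nodes, 10s interval)": 20,
    "20 controllers in reconcile loops": 50,
    "Prometheus scraping": 5,
    "general kubectl usage": 10,
}
peak_burst = {
    "500 Deployment rollouts": 500,
    "monitoring dashboards running large LISTs": 30,
    "HPA simultaneous scaling": 100,
    "AZ failure pod rescheduling burst": 300,
}

def report(label: str, budget: dict, tier: str = "XL") -> None:
    total = sum(budget.values())
    print(f"{label}: {total} seats ({total / TIER_SEATS[tier]:.0%} of {tier})")

report("Steady state", steady_state)                  # ~85 seats, ~5% of XL
report("Peak burst", {**steady_state, **peak_burst})  # ~1,015 seats, ~60% of XL
```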
Pod Scheduling Rate (pods/sec)
- Represents the number of pods the Scheduler can bind per second.
- Determined by the kube-api-qps and kube-api-burst parameters, which control how fast the Scheduler can make API Server requests.
- At 4XL+, Scheduler QPS plateaus at 400, but bottlenecks are mitigated by the increased API Server count (3+).
- Actual throughput can be verified via the scheduler_schedule_attempts_total metric.
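The ~370 pods/sec figure from the key takeaway falls out of the AZ-failure math. A minimal sketch, assuming pods are spread evenly across 3 AZs and a 15-minute recovery target; both are assumptions to replace with your own numbers.

```python
# Sketch: scheduler throughput needed to recover from a single-AZ failure.
# Assumptions: 1,000,000 pods spread evenly across 3 AZs, 15-minute recovery target.
total_pods = 1_000_000
azs = 3
recovery_target_s = 15 * 60                       # assumption

pods_to_reschedule = total_pods / azs             # ~333,333 pods lose their AZ
required_rate = pods_to_reschedule / recovery_target_s
print(f"Required scheduling rate: {required_rate:.0f} pods/sec")   # ~370 pods/sec

# How long each tier's pod scheduling rate would take for the same burst
for tier, rate in {"XL": 167, "2XL": 283, "4XL": 400}.items():
    print(f"{tier}: ~{pods_to_reschedule / rate / 60:.0f} min to reschedule one AZ")
```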
Cluster DB Size (etcd)
- The upper limit of logical data size storable in etcd.
- Standard: 8 GB
- XL+: 16 GB
- Due to etcd's MVCC characteristics, frequent updates cause revision accumulation, making actual DB size 2-5x the data size.
- Compaction runs every 5 minutes to delete old revisions, but extremely high update frequencies can fill the DB between compaction cycles.
- When the quota is exceeded, all writes are rejected → the cluster is effectively down
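For capacity planning, the relationship between object counts and DB size can be sketched with a rough estimator. Everything in the block below (per-object sizes, the revision multiplier) is an illustrative assumption; it shows the shape of the calculation, not the actual footprint of any specific cluster.

```python
# Sketch: rough etcd DB sizing from object counts.
# Per-object sizes and the revision multiplier are illustrative assumptions --
# replace them with measured averages from your own cluster.
AVG_OBJECT_BYTES = {
    "pods": 4 * 1024,        # assumption: ~4 KB per pod object
    "nodes": 15 * 1024,      # assumption: node status payloads are large
    "other": 2 * 1024,       # assumption: services, endpoints, CRs, ...
}
counts = {"pods": 1_000_000, "nodes": 10_000, "other": 200_000}
REVISION_MULTIPLIER = 2.0    # assumption: MVCC revisions inflate live data 2-5x

logical = sum(counts[k] * AVG_OBJECT_BYTES[k] for k in counts)
peak = logical * REVISION_MULTIPLIER
print(f"Logical data: {logical / 2**30:.1f} GiB, "
      f"peak DB estimate: {peak / 2**30:.1f} GiB (XL+ limit: 16 GB)")
```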
1.5 API Request Concurrency vs Inflight Seats — Concept Deep Dive with Examples
Terminology: Two Different Layers
"API Request Concurrency" and "Inflight Seats" are often used interchangeably, but they represent different layers.
┌─────────────────────────────────────────────────────────────────┐
│ AWS Official Spec │
│ "API Request Concurrency = 6,800 seats" (4XL) │
│ │
│ = Total "seat capacity" for concurrent requests cluster-wide │
│ = Individual API Server APF seats sum × API Server count │
└──────────────────────┬──────────────────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ API Server #1 │ │ API Server #2 │ │ API Server #N │
│ │ │ │ │ │
│ APF Seats │ │ APF Seats │ │ APF Seats │
└────────────────┘ └────────────────┘ └────────────────┘
Cluster Total Concurrency = Individual Server APF Seats × API Server Count
| Concept | Scope | Description |
|---|---|---|
| max-requests-inflight | Individual API Server | Maximum concurrent non-mutating (read-only) requests |
| max-mutating-requests-inflight | Individual API Server | Maximum concurrent mutating requests |
| Individual Server APF Total Seats | Individual API Server | Sum of the above two values. Proportionally distributed to APF PriorityLevels |
| API Request Concurrency | Cluster-wide | Individual Server APF Seats × API Server Count. Value published in AWS official specs |
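The per-server values in section 1.3 and the cluster-wide figures in section 1.2 can be reconciled with this formula. A minimal sketch follows; the API Server counts are inferred from the arithmetic and are an assumption, since AWS does not publish instance counts per tier.

```python
# Sketch: reconcile per-server APF seats (section 1.3) with cluster-wide
# API Request Concurrency (section 1.2). Server counts are an assumption
# inferred from the published numbers, not documented by AWS.
per_server_seats = {
    "XL": 567 + 283,        # 850
    "2XL": 1_134 + 566,     # 1,700
    "4XL": 1_511 + 756,     # 2,267
    "8XL": 1_511 + 756,     # 2,267
}
assumed_api_servers = {"XL": 2, "2XL": 2, "4XL": 3, "8XL": 6}   # assumption
published = {"XL": 1_700, "2XL": 3_400, "4XL": 6_800, "8XL": 13_600}

for tier, seats in per_server_seats.items():
    derived = seats * assumed_api_servers[tier]
    print(f"{tier}: {seats} seats/server x {assumed_api_servers[tier]} servers "
          f"= {derived} (published: {published[tier]})")  # 4XL/8XL differ only by rounding
```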
Core Difference: "Concurrent Request Count" vs "Concurrent Seat Count"
A seat measures capacity, not request count: 1 request does not always equal 1 seat. Seats consumed vary by request type:
| Request Type | Seat Consumption | Occupation Duration | Description |
|---|---|---|---|
| Simple GET (e.g., kubectl get pod my-pod) | 1 | Until response complete | Single object retrieval |
| Simple CREATE/UPDATE/DELETE | 1 | Write complete + WATCH notification propagation time | Mutating requests occupy additional time post-write |
| Small LIST (< 500 objects returned) | 1 | Until response complete | Work Estimator calculates as 1 seat |
| Large LIST (1,000 objects returned) | ~2 | Until response complete | Increases proportional to object count |
| Large LIST (5,000 objects returned) | ~10 | Until response complete | Work Estimator maximum |
| WATCH | 1 initially → 0 | Released after initial burst | Long-lived connection but seat released |
Concrete Scenario Example (4XL Cluster)
Scenario: 4XL cluster (total 6,800 seats) with the following simultaneous requests
┌─ Concurrent Requests ───────────────────────────────────────────┐
│ │
│ [1] kubectl get pods -A (all namespaces LIST, 50,000 pods) │
│ → Work Estimator: 10 seats × 3s response time = 10 seats │
│ │
│ [2] 20 controllers each running reconciliation loop │
│ → Each controller averages 5 GET + 2 UPDATE concurrent │
│ → 20 × 7 = 140 seats │