EKS Control Plane Deep Dive — CRD at Scale Comprehensive Guide
Written: 2026-03-24 | Reading time: ~25 min
When you operate CRD-based platforms on EKS, the Control Plane is typically the first bottleneck. This guide covers Control Plane internals, the impact of CustomResourceDefinitions (CRDs), the Provisioned Control Plane (PCP), and monitoring strategies.
1. EKS Control Plane Internal Architecture
1.1 Physical Infrastructure Layout
EKS Control Plane (AWS Managed)
├── kube-apiserver (min 2, multi-AZ)
├── kube-controller-manager
├── kube-scheduler
├── etcd (distributed key-value store)
└── Network Load Balancer (API Server endpoint)
- Components are distributed across multiple AZs for high availability
- A single API Server endpoint is exposed through the NLB
- The Control Plane is fully managed by AWS and runs outside the customer VPC
1.2 etcd — The Heart of the Control Plane
| Characteristic | Description | CRD Impact |
|---|---|---|
| DB Size Limit | Standard 8GB, Provisioned 16GB | More CRD objects increase DB size |
| Request Size Limit | Single object max 1.5MB | Large CR specs approach the limit |
| Watch Stream | Real-time change propagation | Load increases with more CRD controller Watches |
| Raft Consensus | Majority agreement for writes | Added latency in write-heavy CRD patterns |
AWS continues improving the EKS etcd layer for predictable performance, data durability, and availability.
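As a rough illustration of the request size limit in the table above, the sketch below measures the serialized size of a custom resource and flags it when it gets close to the limit. The `cr_size_report` helper, the `warn_ratio` threshold, and the example `Pipeline` resource are hypothetical, and the 1.5MB limit is interpreted here as 1.5 MiB.

```python
import json

# Request size limit quoted in the table above, taken here as 1.5 MiB (assumption).
ETCD_REQUEST_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def cr_size_report(custom_resource: dict, warn_ratio: float = 0.5) -> dict:
    """Return the serialized size of a custom resource and its share of the limit.

    warn_ratio is a hypothetical threshold: flag objects above 50% of the limit.
    """
    # Compact JSON is only an approximation of what ends up stored for a CR.
    payload = json.dumps(custom_resource, separators=(",", ":")).encode("utf-8")
    ratio = len(payload) / ETCD_REQUEST_LIMIT_BYTES
    return {"bytes": len(payload), "ratio": ratio, "warn": ratio >= warn_ratio}


# Hypothetical CR that embeds a large blob directly in its spec.
example = {
    "apiVersion": "example.com/v1",
    "kind": "Pipeline",
    "metadata": {"name": "demo"},
    "spec": {"inlineConfig": "x" * 1_000_000},
}

report = cr_size_report(example)
print(report["bytes"], round(report["ratio"], 2), report["warn"])
```

A check like this could run in CI or in an admission webhook so that oversized specs are caught before they reach the cluster.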
2. Control Plane Auto-Scaling
EKS automatically vertically scales Control Plane instances based on API Server load, etcd load, scheduling load, and data plane size.
Standard tier etcd DB Size is fixed at 8GB. This is the first bottleneck for CRD-heavy platforms — auto-scaling CPU/Memory does not expand etcd capacity.
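To see how quickly custom resources can consume the fixed 8GB, a back-of-the-envelope estimate is often enough. The sketch below multiplies object counts by average object size per kind. The kinds, counts, sizes, and the `overhead_factor` (a stand-in for retained revisions and fragmentation) are all assumptions for illustration, not measured values.

```python
GIB = 1024 ** 3


def estimate_etcd_usage(object_counts, avg_object_bytes,
                        db_limit_gib=8.0, overhead_factor=2.0):
    """Estimate etcd usage from per-kind object counts and average sizes.

    overhead_factor is a hypothetical multiplier for revisions that have not
    been compacted yet and for on-disk fragmentation.
    """
    raw_bytes = sum(count * avg_object_bytes[kind]
                    for kind, count in object_counts.items())
    estimated = raw_bytes * overhead_factor
    return {
        "estimated_gib": estimated / GIB,
        "utilization": estimated / (db_limit_gib * GIB),
    }


# Hypothetical platform with two custom resource kinds.
counts = {"Pipeline": 50_000, "Task": 200_000}
sizes = {"Pipeline": 20 * 1024, "Task": 4 * 1024}

standard = estimate_etcd_usage(counts, sizes, db_limit_gib=8.0)
provisioned = estimate_etcd_usage(counts, sizes, db_limit_gib=16.0)
print(f"{standard['estimated_gib']:.2f} GiB, "
      f"{standard['utilization']:.0%} of Standard, "
      f"{provisioned['utilization']:.0%} of Provisioned")
```

Under these assumptions, the same data set uses twice the share of a Standard cluster that it would use on a Provisioned tier, which is why DB size tends to be the first limit reached.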
3. EKS Provisioned Control Plane (PCP)
PCP became generally available at re:Invent 2025. Selecting a tier sets a performance floor for the Control Plane.
| Tier | etcd DB | SLA | Hourly Price |
|---|---|---|---|
| Standard | 8GB | 99.95% | $0.10 |
| XL | 16GB | 99.99% | $1.65 |
| 2XL | 16GB | 99.99% | $3.40 |
| 4XL | 16GB | 99.99% | $6.90 |
| 8XL | 16GB | 99.99% | $13.90 |
| Feature | Standard | XL+ |
|---|---|---|
| API Server horizontal scaling (>2) | Limited to 2 | Yes |
| etcd DB Size 16GB | Fixed 8GB | 16GB |
| etcd Event Sharding | No | Yes |
| 99.99% SLA | 99.95% | 99.99% |
The first limit in CRD platforms is etcd DB Size. Provisioned doubles it to 16GB and offloads event pressure via Event Sharding.
To create a new cluster on the XL tier:
aws eks create-cluster --name prod \
  --role-arn arn:aws:iam::012345678910:role/eks-service-role \
  --resources-vpc-config subnetIds=subnet-xxx,securityGroupIds=sg-xxx \
  --control-plane-scaling-config tier=XL
To move an existing cluster to the XL tier:
aws eks update-cluster-config --name example \
  --control-plane-scaling-config tier=XL
4. Impact of CRDs on Control Plane
4.1 Impact on etcd
| Factor | Mechanism | Severity |
|---|---|---|
| DB Size Growth | CRD objects occupy etcd storage | High |
| Watch Stream Load | Controllers create Watch streams | High |
| Request Size | Objects approach 1.5MB limit | Medium |
| List Call Cost | JSON encoding (not protobuf) | High |
4.2 Impact on API Server
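- JSON vs Protobuf: custom resources are encoded as JSON rather than protobuf, so List and Watch responses cost more CPU and memory to serialize than those for built-in types
- APF (API Priority and Fairness): a single List request can occupy up to 10 seats, so large List calls crowd out other requests
- Watch Cache: the default size is 100 entries per resource type
As a rough heuristic: Control Plane Load ≈ CRD Type Count × Object Size × Controller Pattern (List/Watch Frequency)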
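The 10-seat cap on List requests can be made concrete with a simplified estimator. The sketch below assumes that a List is charged roughly one seat per 100 objects it has to process. The `OBJECTS_PER_SEAT` value is an assumption for illustration, and the real API server estimator also considers factors such as whether the request is served from the watch cache, which this sketch ignores.

```python
import math

MAX_SEATS = 10          # cap quoted in the list above
MIN_SEATS = 1
OBJECTS_PER_SEAT = 100  # assumption for illustration


def estimated_list_seats(object_count, limit=None):
    """Estimate APF seats for a List call over object_count stored objects.

    limit models a paginated request (the List `limit` parameter).
    """
    objects = object_count if limit is None else min(object_count, limit)
    seats = math.ceil(objects / OBJECTS_PER_SEAT)
    return max(MIN_SEATS, min(MAX_SEATS, seats))


for count, limit in [(50, None), (450, None), (30_000, None), (30_000, 500)]:
    print(count, limit, estimated_list_seats(count, limit))
```

In this model an unpaginated List over 30,000 custom resources hits the cap, while the same List with a page size of 500 is charged half as many seats per request, which is one reason pagination helps under APF.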
5. EKS Control Plane Monitoring
Four observability dimensions:
- CloudWatch Vended Metrics (automatic, free, v1.28+)
- Prometheus Endpoints (kube-controller-manager (KCM), kube-scheduler (KSH), and etcd; manual setup)
- Control Plane Logging (5 log types to CloudWatch)
- Cluster Insights (automatic health/upgrade checks)
Prometheus-format metrics for KCM, KSH, and etcd can be read directly through the API server:
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/etcd/container/metrics
To enable all five Control Plane log types:
aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
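The metrics endpoints return plain Prometheus text, so a quick inspection does not need a full Prometheus stack. The sketch below is a minimal parser for one metric name. The `parse_metric` helper and the sample text are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
def parse_metric(exposition_text, metric_name):
    """Return {label_string: value} for every sample of metric_name.

    Assumes the simple 'name{labels} value' form without timestamps.
    """
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name == metric_name:
            samples[labels.rstrip("}")] = float(value)
    return samples


# Hypothetical excerpt of what a metrics endpoint might return.
SAMPLE = """
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="deployment"} 4
workqueue_depth{name="job"} 0
process_start_time_seconds 1.7e+09
"""

depths = parse_metric(SAMPLE, "workqueue_depth")
print(depths)
```

In practice the text would come from the `kubectl get --raw` output rather than a literal string.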
| Channel | Cost | Setup | PCP Support |
|---|---|---|---|
| CloudWatch Vended Metrics | Free | Automatic (v1.28+) | Tier usage metrics |
| Prometheus Endpoint | Free | Manual | Extensible |
| Control Plane Logging | CloudWatch Logs rates | Manual | N/A |
| Cluster Insights | Free | Automatic | Future tier recommendations |
6. CRD Design Best Practices
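- Minimize Object Size: keep CR specs small, store large data outside etcd, and reference it from the CR
- Manage CRD Count: consolidate similar resources and remove unused CRDs
- Controller Optimization: use SharedInformers, paginate List calls, and retry with exponential backoff
- Keep K8s Current: K8s 1.33+ supports streaming List responses, which lowers API Server memory use for large collections
- Cluster Architecture: run CRD-heavy platforms in clusters separate from general workload clusters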
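The exponential backoff mentioned under Controller Optimization can be sketched as follows. This is an illustration of the idea only: the `backoff_delays` helper and its defaults are hypothetical, and controller frameworks normally ship their own rate limiters that should be preferred in real controllers.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=6,
                   jitter=0.0, rng=None):
    """Return retry delays in seconds that grow exponentially up to cap.

    jitter is the fraction of each delay that may be randomly subtracted,
    which spreads out retries from many controllers failing at once.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        delay -= delay * jitter * rng.random()
        delays.append(delay)
    return delays


print(backoff_delays())                  # no jitter, no cap reached
print(backoff_delays(cap=10.0))          # delays stop growing at the cap
print(backoff_delays(jitter=0.5, rng=random.Random(0)))
```

Capping the delay keeps a failing reconcile from waiting indefinitely, while jitter avoids synchronized retry bursts against the API Server.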
7. Recommendations & Adoption Roadmap
| Workload Profile | Tier | Rationale | Monthly Cost (hourly price × 730 h) |
|---|---|---|---|
| ~50 nodes, basic add-ons (Karpenter, cert-manager) | Standard | Default auto-scaling is sufficient | ~$73 |
| ~200 nodes, 5+ operators (ArgoCD, Prometheus, custom controllers) | XL | etcd 16GB, 99.99% SLA | ~$1,204 |
| ~500 nodes, service mesh + GitOps + multi-tenant | 2XL | Enhanced API Server throughput | ~$2,482 |
| 1,000+ nodes, AI/ML operators + large-scale CRD pipelines | 4XL | API Server horizontal scaling | ~$5,037 |
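The monthly figures in the table follow from the hourly tier prices listed in section 3, assuming 730 hours per month. The short sketch below reproduces that arithmetic. The `monthly_cost` helper is hypothetical, and the result covers the listed hourly price only.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12

# Hourly prices in USD, copied from the tier table in section 3.
HOURLY_PRICE_USD = {
    "Standard": 0.10,
    "XL": 1.65,
    "2XL": 3.40,
    "4XL": 6.90,
    "8XL": 13.90,
}


def monthly_cost(tier):
    """Return the approximate monthly cost in USD for a tier."""
    return round(HOURLY_PRICE_USD[tier] * HOURS_PER_MONTH, 2)


for tier in HOURLY_PRICE_USD:
    print(f"{tier}: ${monthly_cost(tier):,.2f}")
```

For example, XL works out to 1.65 × 730 = 1,204.50 USD, which the table shows as ~$1,204.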
Control Plane Metrics Reference by Scale
Approximate reference values for the key metrics that drive EKS Control Plane scaling decisions. Actual values vary by workload pattern, so treat these as starting points and investigate when a threshold is exceeded.
| Metric | ~50 Nodes (Standard) | ~200 Nodes (XL) | ~500 Nodes (2XL) | 1,000+ Nodes (4XL) |
|---|---|---|---|---|
| etcd DB Size | 0.5–1.5 GB | 2–5 GB | 5–10 GB | 10–16 GB (tier limit) |
| etcd Object Count | ~5,000 | ~30,000 | ~100,000 | 300,000+ |
| API QPS (req/sec) | 20–50 | 100–300 | 300–800 | 1,000–3,000 |
| API Latency (p99) | < 200ms | < 500ms | < 1s | < 1.5s (target) |
| 429 Throttle (/min) | 0 | < 5 | < 20 | Upgrade trigger |
| Watch Connections | ~200 | ~1,500 | ~5,000 | 15,000+ |
| CRD Types (ref) | 5–15 | 15–40 | 40–80 | 80+ |
| Controller Reconcile/sec | 5–20 | 50–150 | 150–500 | 500–2,000 |
- etcd DB Size: `apiserver_storage_size_bytes` (CloudWatch or Prometheus)
- API QPS: `apiserver_request_total` rate (split by verb recommended)
- 429 Throttle: `apiserver_request_total{code="429"}` (investigate immediately if non-zero)
- Watch Connections: `apiserver_longrunning_requests{verb="WATCH"}` (scales with controllers/nodes)
- Reconcile Rate: `controller_runtime_reconcile_total` rate per controller
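Several of the metrics above are counters, so the useful number is their per-second rate between two scrapes. The sketch below shows that calculation with simple counter-reset handling. The `per_second_rate` helper and the sample values are hypothetical.

```python
def per_second_rate(earlier, later):
    """Compute a per-second rate from two (unix_seconds, counter_value) samples.

    If the counter went backwards, the process is assumed to have restarted
    and the later value is treated as the increase since the reset.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("later sample must be newer than earlier sample")
    delta = v1 - v0
    if delta < 0:
        delta = v1  # counter reset between the two scrapes
    return delta / (t1 - t0)


# Hypothetical apiserver_request_total samples taken 60 seconds apart.
qps = per_second_rate((1_700_000_000, 1_000), (1_700_000_060, 4_000))
after_reset = per_second_rate((1_700_000_000, 5_000), (1_700_000_060, 600))
print(qps, after_reset)
```

With Prometheus in place, the built-in `rate()` function performs the same kind of calculation over a time window.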
etcd DB Size alert thresholds by tier:
- Standard: Warning at > 6GB → consider XL
- XL/2XL: Warning at > 12GB → clean up unused CRs or upgrade tier
- 4XL: Critical as the DB approaches the 16GB tier limit → consider architecture split (multi-cluster)
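The Standard and XL/2XL warning thresholds above both sit at 75% of the tier's DB size limit (6 of 8GB, 12 of 16GB), which makes them easy to encode in an alerting helper. The sketch below covers only those two warning levels. The `etcd_size_status` helper is hypothetical, and GB is interpreted as GiB.

```python
GIB = 1024 ** 3

# Warning thresholds in GB and the suggested action, from the list above.
WARN_RULES = {
    "Standard": (6, "consider XL"),
    "XL": (12, "clean up unused CRs or upgrade tier"),
    "2XL": (12, "clean up unused CRs or upgrade tier"),
}


def etcd_size_status(tier, db_size_bytes):
    """Return (status, action) for a tier and an etcd DB size in bytes."""
    threshold_gb, action = WARN_RULES[tier]
    if db_size_bytes > threshold_gb * GIB:
        return "warning", action
    return "ok", None


print(etcd_size_status("Standard", 6.5 * GIB))
print(etcd_size_status("XL", 6.5 * GIB))
```

The `db_size_bytes` input would typically be the latest `apiserver_storage_size_bytes` value.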
| Phase | Timeline | Activities |
|---|---|---|
| 1: Basic | 1 week | CloudWatch alarms, Control Plane Logging |
| 2: Prometheus | 2 weeks | Amazon Managed Service for Prometheus (AMP) scraper, Grafana dashboards |
| 3: PCP | 1 week | Select and apply PCP tier |
| 4: Optimize | Ongoing | Insights, tier adjustments, controller tuning |