EKS Control Plane Deep Dive: A Comprehensive Guide to CRDs at Scale

Published 2026-03-24 · Updated 2026-06-30 · 13 min read

When you operate CRD-based platforms on EKS, the Control Plane is usually the first bottleneck. This guide covers Control Plane internals, the impact of CRDs, Provisioned Control Plane (PCP), and monitoring strategies.

1. EKS Control Plane Internal Architecture

1.1 Physical Infrastructure Layout

EKS Control Plane (AWS Managed)
├── kube-apiserver (min 2, multi-AZ)
├── kube-controller-manager
├── kube-scheduler
├── etcd (distributed key-value store)
└── Network Load Balancer (API Server endpoint)

Components are distributed across multiple AZs for high availability.
A single API Server endpoint is exposed via the NLB.
The Control Plane is fully managed by AWS and runs outside the customer VPC.

1.2 etcd — The Heart of the Control Plane

Characteristic	Description	CRD Impact
DB Size Limit	Standard 8GB, Provisioned 16GB	More CRD objects increase DB size
Request Size Limit	Single object max 1.5MB	Large CR specs approach the limit
Watch Stream	Real-time change propagation	Load increases with more CRD controller Watches
RAFT Consensus	Majority agreement for writes	Latency in write-heavy CRD patterns
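
To make the Request Size Limit row concrete, here is a minimal Python sketch that measures how close a custom resource's serialized JSON is to the 1.5MB limit in the table. The helper name `cr_size_ratio`, the sample `Widget` resource, and reading 1.5MB as 1.5 × 1024 × 1024 bytes are assumptions for illustration. The size the API Server actually sends to etcd can differ somewhat from a local `json.dumps`.

```python
import json

# Assumption: the 1.5MB limit from the table, read as 1.5 MiB.
ETCD_REQUEST_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def cr_size_ratio(custom_resource: dict) -> float:
    """Return the compact JSON size of a CR as a fraction of the request limit."""
    encoded = json.dumps(custom_resource, separators=(",", ":")).encode("utf-8")
    return len(encoded) / ETCD_REQUEST_LIMIT_BYTES


# Hypothetical CR that embeds a large blob in its spec.
widget = {
    "apiVersion": "example.com/v1",
    "kind": "Widget",
    "metadata": {"name": "demo"},
    "spec": {"payload": "x" * 1_000_000},
}

ratio = cr_size_ratio(widget)
print(f"{ratio:.0%} of the request size limit")
```

A check like this could run in CI against rendered manifests, so oversized specs are caught before they reach the cluster.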

etcd Architecture Evolution

AWS continues improving the EKS etcd layer for predictable performance, data durability, and availability.

2. Control Plane Auto-Scaling

EKS automatically vertically scales Control Plane instances based on API Server load, etcd load, scheduling load, and data plane size.

Key Insight

Standard tier etcd DB Size is fixed at 8GB. This is the first bottleneck for CRD-heavy platforms — auto-scaling CPU/Memory does not expand etcd capacity.

3. EKS Provisioned Control Plane (PCP)

Provisioned Control Plane became generally available at re:Invent 2025. Selecting a tier sets a performance floor for the Control Plane.

Tier	etcd DB	SLA	Hourly Price
Standard	8GB	99.95%	$0.10
XL	16GB	99.99%	$1.65
2XL	16GB	99.99%	$3.40
4XL	16GB	99.99%	$6.90
8XL	16GB	99.99%	$13.90

Feature	Standard	XL+
API Server horizontal scaling (>2)	Limited to 2	Yes
etcd DB Size 16GB	Fixed 8GB	16GB
etcd Event Sharding	No	Yes
99.99% SLA	99.95%	99.99%

Why Provisioned for CRD Platforms

The first limit in CRD platforms is etcd DB Size. Provisioned doubles it to 16GB and offloads event pressure via Event Sharding.

# Create a new cluster on the XL tier
aws eks create-cluster --name prod \
  --role-arn arn:aws:iam::012345678910:role/eks-service-role \
  --resources-vpc-config subnetIds=subnet-xxx,securityGroupIds=sg-xxx \
  --control-plane-scaling-config tier=XL

# Move an existing cluster to the XL tier
aws eks update-cluster-config --name example \
  --control-plane-scaling-config tier=XL

4. Impact of CRDs on Control Plane

4.1 Impact on etcd

Factor	Mechanism	Severity
DB Size Growth	CRD objects occupy etcd storage	High
Watch Stream Load	Controllers create Watch streams	High
Request Size	Objects approach 1.5MB limit	Medium
List Call Cost	JSON encoding (not protobuf)	High

4.2 Impact on API Server

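JSON vs Protobuf: custom resources are encoded as JSON rather than protobuf, so List/Watch performance is significantly worse than for built-in types
APF (API Priority and Fairness): a single List request can occupy up to 10 seats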
Watch Cache: Defaults to 100

CRD Load Formula

Control Plane Load = CRD Type Count x Object Size x Controller Pattern (List/Watch Frequency)
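
The formula is a heuristic rather than a measured model, so it is most useful for comparing two designs. The sketch below multiplies the three factors into a unitless score. The function name `relative_crd_load`, the choice of units (KiB per object, List/Watch calls per minute), and the sample numbers are all assumptions for illustration.

```python
def relative_crd_load(crd_types: int, avg_object_kib: float, list_watch_per_min: float) -> float:
    """Unitless score: the product of the three factors in the formula above."""
    return crd_types * avg_object_kib * list_watch_per_min


# Hypothetical platform before and after trimming CR specs from 40 KiB to 10 KiB.
before = relative_crd_load(crd_types=30, avg_object_kib=40, list_watch_per_min=12)
after = relative_crd_load(crd_types=30, avg_object_kib=10, list_watch_per_min=12)
print(f"load reduced to {after / before:.0%} of the original")
```

Because the factors multiply, shrinking any one of them reduces the score proportionally, which is why object size and List/Watch frequency are both worth tuning.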

5. EKS Control Plane Monitoring

Four observability dimensions:

CloudWatch Vended Metrics (automatic, free, v1.28+)
Prometheus Endpoints (kube-controller-manager (KCM), kube-scheduler (KSH), and etcd; manual setup)
Control Plane Logging (5 log types to CloudWatch)
Cluster Insights (automatic health/upgrade checks)

# Fetch Prometheus-format metrics for kube-controller-manager (kcm), kube-scheduler (ksh), and etcd
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/etcd/container/metrics

# Enable all five Control Plane log types
aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

Channel	Cost	Setup	PCP Support
CloudWatch Vended Metrics	Free	Automatic (v1.28+)	Tier usage metrics
Prometheus Endpoint	Free	Manual	Extensible
Control Plane Logging	CW rates	Manual	—
Cluster Insights	Free	Automatic	Future tier recommendations

6. CRD Design Best Practices

Minimize Object Size: keep CR specs small and offload large data to external storage
Manage CRD Count: consolidate similar resources and clean up unused CRDs
Controller Optimization: use SharedInformer, paginated List calls, and exponential backoff on retries
Keep K8s Current: K8s 1.33+ streams List responses, which lowers API Server memory use on large lists
Cluster Architecture: separate CRD-heavy clusters from workload clusters
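
As an illustration of the exponential backoff item, the sketch below generates retry delays that double on each attempt up to a cap, with optional jitter. The helper name `backoff_delays` and the defaults (1s base, factor 2, 60s cap) are illustrative assumptions, not values taken from any particular controller framework.

```python
import random


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0,
                   retries: int = 6, jitter: float = 0.0, rng=None):
    """Yield retry delays in seconds: base * factor**attempt, capped, then jittered."""
    rng = rng or random.Random()
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        # Jitter spreads retries out so many controllers do not retry in lockstep.
        yield delay * (1 + rng.uniform(-jitter, jitter))


# Without jitter the sequence is deterministic.
print(list(backoff_delays(retries=8)))

# With 20% jitter each delay lands within +/-20% of its capped value.
print(list(backoff_delays(retries=4, jitter=0.2, rng=random.Random(0))))
```

The cap matters as much as the doubling: without it, a controller that has been failing for a while would wait far too long to notice recovery.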

7. Recommendations & Adoption Roadmap

Workload Profile	Tier	Rationale	Monthly Cost
~50 nodes, basic add-ons (Karpenter, cert-manager)	Standard	Default auto-scaling is sufficient	~$73
~200 nodes, 5+ operators (ArgoCD, Prometheus, custom controllers)	XL	etcd 16GB, 99.99% SLA	~$1,204
~500 nodes, service mesh + GitOps + multi-tenant	2XL	Enhanced API Server throughput	~$2,482
1,000+ nodes, AI/ML operators + large-scale CRD pipelines	4XL	API Server horizontal scaling	~$5,037
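
The Monthly Cost column follows from the hourly prices in Section 3. Here is a minimal sketch of that arithmetic, assuming a 730-hour month (8,760 hours / 12) and truncation to whole dollars. Both assumptions are inferred from the table rather than stated by it, and `monthly_cost` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12

# Hourly prices from the tier table in Section 3, in cents to avoid float rounding.
HOURLY_PRICE_CENTS = {"Standard": 10, "XL": 165, "2XL": 340, "4XL": 690, "8XL": 1390}


def monthly_cost(tier: str) -> int:
    """Approximate monthly Control Plane cost in whole USD (truncated)."""
    return HOURLY_PRICE_CENTS[tier] * HOURS_PER_MONTH // 100


for tier in HOURLY_PRICE_CENTS:
    print(f"{tier}: ~${monthly_cost(tier):,}")
```

These figures cover the Control Plane only. Nodes, data transfer, and logging are billed separately.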

Control Plane Metrics Reference by Scale

Industry-average reference values for key metrics that drive EKS control plane scaling decisions. Actual values vary by workload pattern — investigate when thresholds are exceeded.

Metric	~50 Nodes (Standard)	~200 Nodes (XL)	~500 Nodes (2XL)	1,000+ Nodes (4XL)
etcd DB Size	0.5–1.5 GB	2–5 GB	5–10 GB	10–20 GB
etcd Object Count	~5,000	~30,000	~100,000	300,000+
API QPS (req/sec)	20–50	100–300	300–800	1,000–3,000
API Latency (p99)	< 200ms	< 500ms	< 1s	< 1.5s (target)
429 Throttle (/min)	0	< 5	< 20	Upgrade trigger
Watch Connections	~200	~1,500	~5,000	15,000+
CRD Types (ref)	5–15	15–40	40–80	80+
Controller Reconcile/sec	5–20	50–150	150–500	500–2,000

How to Measure

etcd DB Size: apiserver_storage_size_bytes (CloudWatch or Prometheus)
API QPS: apiserver_request_total rate (split by verb recommended)
429 Throttle: apiserver_request_total{code="429"} — investigate immediately if non-zero
Watch Connections: apiserver_longrunning_requests{verb="WATCH"} — scales with controllers/nodes
Reconcile Rate: controller_runtime_reconcile_total rate per controller
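
The metrics above are exposed in Prometheus text format, for example in the output of `kubectl get --raw /metrics`. As a rough sketch, the helper below pulls the samples for one metric out of such text. The function name `metric_samples` and the sample payload (including its label names and values) are made up for illustration, and the parser assumes lines carry no trailing timestamp.

```python
def metric_samples(exposition_text: str, metric_name: str) -> list[tuple[str, float]]:
    """Return (label_string, value) pairs for one metric from Prometheus text format."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        if name == metric_name:
            samples.append((labels.rstrip("}"), float(value)))
    return samples


# Made-up sample payload, not real cluster output.
sample = """\
# TYPE apiserver_storage_size_bytes gauge
apiserver_storage_size_bytes{cluster="etcd-0"} 2.5e+09
# TYPE apiserver_request_total counter
apiserver_request_total{code="429",verb="LIST"} 3
"""

for labels, value in metric_samples(sample, "apiserver_storage_size_bytes"):
    print(f"{labels}: {value / 1e9:.1f} GB")
```

For anything beyond a quick spot check, a Prometheus server or the CloudWatch vended metrics is the better home for these numbers, since rates such as QPS need samples over time.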

etcd Size Alert Thresholds

Standard: Warning at > 6GB → consider XL
XL/2XL: Warning at > 12GB → clean up unused CRs or upgrade tier
4XL: Critical as the DB approaches the 16GB limit → consider architecture split (multi-cluster)
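
The Standard and XL/2XL warning thresholds above both sit at 75% of the tier's DB size limit from Section 3 (6 of 8GB, 12 of 16GB). A minimal sketch of that rule, with `etcd_size_alert` as a hypothetical helper name and the 75% ratio derived from those two rows rather than stated by AWS:

```python
def etcd_size_alert(db_size_gb: float, db_limit_gb: float, warn_ratio: float = 0.75) -> str:
    """Return "warning" once the DB size exceeds warn_ratio of the tier limit."""
    if db_size_gb > db_limit_gb * warn_ratio:
        return "warning"
    return "ok"


print(etcd_size_alert(6.5, db_limit_gb=8))    # Standard tier, above 6GB
print(etcd_size_alert(11.0, db_limit_gb=16))  # XL tier, below 12GB
```

Expressing the threshold as a ratio keeps one alert definition valid across tiers, so a tier change only needs the limit updated.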

Phase	Timeline	Activities
1: Basic	1 week	CloudWatch alarms, Control Plane Logging
2: Prometheus	2 weeks	AMP Scraper, Grafana dashboards
3: PCP	1 week	Select and apply PCP tier
4: Optimize	Ongoing	Insights, tier adjustments, controller tuning
