
EKS Control Plane Deep Dive — CRD at Scale Comprehensive Guide

Written: 2026-03-24 | Reading time: ~25 min

When operating CRD-based platforms on EKS, the Control Plane is usually the first component to hit its limits. This guide covers Control Plane internals, the impact of CRDs, the Provisioned Control Plane (PCP), and monitoring strategies.


1. EKS Control Plane Internal Architecture

1.1 Physical Infrastructure Layout

EKS Control Plane (AWS Managed)
├── kube-apiserver (min 2, multi-AZ)
├── kube-controller-manager
├── kube-scheduler
├── etcd (distributed key-value store)
└── Network Load Balancer (API Server endpoint)

  • Components distributed across multiple AZs for HA
  • Single API Server endpoint exposed via NLB
  • Fully managed by AWS, separate from customer VPC

1.2 etcd — The Heart of the Control Plane

| Characteristic | Description | CRD Impact |
|---|---|---|
| DB Size Limit | Standard 8GB, Provisioned 16GB | More CRD objects increase DB size |
| Request Size Limit | Single object max 1.5MB | Large CR specs approach the limit |
| Watch Stream | Real-time change propagation | Load increases with more CRD controller Watches |
| Raft Consensus | Majority agreement for writes | Latency in write-heavy CRD patterns |

etcd Architecture Evolution

AWS continues improving the EKS etcd layer for predictable performance, data durability, and availability.
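The 1.5MB single-object limit listed in section 1.2 can be checked before a custom resource is applied. The sketch below is only an estimate: it assumes the limit means 1.5 MiB and that compact JSON is a fair proxy for the stored size. The function name `cr_size_report`, the 80% warning ratio, and the `Widget` resource are hypothetical, chosen for illustration.

```python
import json

# Assumption: the 1.5MB limit is interpreted as 1.5 MiB.
ETCD_REQUEST_LIMIT_BYTES = int(1.5 * 1024 * 1024)


def cr_size_report(custom_resource: dict, warn_ratio: float = 0.8) -> dict:
    """Estimate the serialized size of a custom resource and compare it to the limit."""
    # Compact JSON encoding, used here as a rough stand-in for the stored object.
    encoded = json.dumps(custom_resource, separators=(",", ":")).encode("utf-8")
    size = len(encoded)
    return {
        "size_bytes": size,
        "ratio": size / ETCD_REQUEST_LIMIT_BYTES,
        "near_limit": size >= warn_ratio * ETCD_REQUEST_LIMIT_BYTES,
    }


if __name__ == "__main__":
    # Hypothetical resources: one small spec and one carrying a large inline blob.
    small = {"apiVersion": "example.com/v1", "kind": "Widget", "spec": {"replicas": 3}}
    big = {"apiVersion": "example.com/v1", "kind": "Widget", "spec": {"blob": "x" * 1_400_000}}
    print(cr_size_report(small))
    print(cr_size_report(big))
```

A check like this could run in CI or in an admission step so that oversized specs are caught before they reach the API server.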


2. Control Plane Auto-Scaling

EKS automatically vertically scales Control Plane instances based on API Server load, etcd load, scheduling load, and data plane size.

Key Insight

Standard tier etcd DB Size is fixed at 8GB. This is the first bottleneck for CRD-heavy platforms — auto-scaling CPU/Memory does not expand etcd capacity.


3. EKS Provisioned Control Plane (PCP)

Provisioned Control Plane became generally available at re:Invent 2025. Selecting a tier sets a performance floor for the control plane.

| Tier | etcd DB | SLA | Hourly Price |
|---|---|---|---|
| Standard | 8GB | 99.95% | $0.10 |
| XL | 16GB | 99.99% | $1.65 |
| 2XL | 16GB | 99.99% | $3.40 |
| 4XL | 16GB | 99.99% | $6.90 |
| 8XL | 16GB | 99.99% | $13.90 |

| Feature | Standard | XL+ |
|---|---|---|
| API Server horizontal scaling (>2) | Limited to 2 | Yes |
| etcd DB Size | Fixed 8GB | 16GB |
| etcd Event Sharding | No | Yes |
| SLA | 99.95% | 99.99% |

Why Provisioned for CRD Platforms

The first limit in CRD platforms is etcd DB Size. Provisioned doubles it to 16GB and offloads event pressure via Event Sharding.

# Create a new cluster on the XL tier
aws eks create-cluster --name prod \
  --role-arn arn:aws:iam::012345678910:role/eks-service-role \
  --resources-vpc-config subnetIds=subnet-xxx,securityGroupIds=sg-xxx \
  --control-plane-scaling-config tier=XL

# Move an existing cluster to the XL tier
aws eks update-cluster-config --name example \
  --control-plane-scaling-config tier=XL

4. Impact of CRDs on Control Plane

4.1 Impact on etcd

| Factor | Mechanism | Severity |
|---|---|---|
| DB Size Growth | CRD objects occupy etcd storage | High |
| Watch Stream Load | Controllers create Watch streams | High |
| Request Size | Objects approach 1.5MB limit | Medium |
| List Call Cost | JSON encoding (not protobuf) | High |

4.2 Impact on API Server

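  1. JSON vs Protobuf: custom resources are serialized as JSON rather than protobuf, so List and Watch calls are significantly slower to encode and decode
  2. APF (API Priority and Fairness): a single List request can occupy up to 10 seats
  3. Watch Cache: the default size is 100

CRD Load Formula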

Control Plane Load = CRD Type Count x Object Size x Controller Pattern (List/Watch Frequency)
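The formula above is qualitative, but it can be turned into a rough relative score for comparing CRD types. The sketch below is one possible reading, not a measured model: it assumes List traffic dominates and adds a per-type object count, which the formula does not name. `CrdProfile` and all field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrdProfile:
    name: str
    object_count: int        # assumption: number of stored objects of this type
    avg_object_bytes: int    # "Object Size" in the formula
    lists_per_minute: float  # "Controller Pattern": full List calls per minute


def list_bytes_per_second(profile: CrdProfile) -> float:
    """Approximate bytes per second the API server must serialize for List calls."""
    return profile.object_count * profile.avg_object_bytes * profile.lists_per_minute / 60.0


def relative_load(profiles: list) -> float:
    """Sum over CRD types ("CRD Type Count" in the formula)."""
    return sum(list_bytes_per_second(p) for p in profiles)


if __name__ == "__main__":
    profiles = [
        CrdProfile("widgets.example.com", 1_000, 10_000, 6),
        CrdProfile("pipelines.example.com", 200, 200_000, 1),
    ]
    # Print types from heaviest to lightest estimated List cost.
    for p in sorted(profiles, key=list_bytes_per_second, reverse=True):
        print(p.name, round(list_bytes_per_second(p)))
    print("total", round(relative_load(profiles)))
```

The absolute numbers mean little on their own. The ranking is the useful part, since it shows which CRD type to shrink or whose controller to tune first.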


5. EKS Control Plane Monitoring

Four observability dimensions:

  1. CloudWatch Vended Metrics (automatic, free, v1.28+)
  2. Prometheus Endpoints (KCM = kube-controller-manager, KSH = kube-scheduler, plus etcd; manual setup)
  3. Control Plane Logging (5 log types to CloudWatch)
  4. Cluster Insights (automatic health/upgrade checks)

# Query the Prometheus-format metrics endpoints directly
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics
kubectl get --raw=/apis/metrics.eks.amazonaws.com/v1/etcd/container/metrics

# Enable all five control plane log types
aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

| Channel | Cost | Setup | PCP Support |
|---|---|---|---|
| CloudWatch Vended Metrics | Free | Automatic (v1.28+) | Tier usage metrics |
| Prometheus Endpoint | Free | Manual | Extensible |
| Control Plane Logging | CloudWatch rates | Manual | |
| Cluster Insights | Free | Automatic | Future tier recommendations |

6. CRD Design Best Practices

  1. Minimize Object Size: keep CR specs small and move large payloads out of the CR
  2. Manage CRD Count: consolidate similar resources and remove unused CRDs
  3. Controller Optimization: use SharedInformers, paginate List calls, and retry with exponential backoff
  4. Keep Kubernetes Current: Kubernetes 1.33+ supports streaming List responses
  5. Cluster Architecture: run CRD-heavy platforms in clusters separate from workload clusters
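The exponential backoff mentioned in item 3 can be sketched as a small delay calculator. This is a generic illustration, not the rate limiter shipped with any particular controller framework. The function name, the 1-second base, the 300-second cap, and the full-jitter strategy are all assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  jitter: bool = True, rng=None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    delay = min(cap, base * (2 ** exponent))
    if jitter:
        # Full jitter: pick a random delay in [0, delay] so that many failing
        # controllers do not retry against the API server at the same instant.
        rng = rng or random
        delay = rng.uniform(0, delay)
    return delay


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, backoff_delay(attempt, jitter=False))
```

Without jitter the delays double on each attempt until they reach the cap. With jitter the same values act as an upper bound, which spreads retries out after a shared failure such as an API server throttling event.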

7. Recommendations & Adoption Roadmap

| Workload Profile | Tier | Rationale | Monthly Cost |
|---|---|---|---|
| ~50 nodes, basic add-ons (Karpenter, cert-manager) | Standard | Default auto-scaling is sufficient | ~$73 |
| ~200 nodes, 5+ operators (ArgoCD, Prometheus, custom controllers) | XL | etcd 16GB, 99.99% SLA | ~$1,204 |
| ~500 nodes, service mesh + GitOps + multi-tenant | 2XL | Enhanced API Server throughput | ~$2,482 |
| 1,000+ nodes, AI/ML operators + large-scale CRD pipelines | 4XL | API Server horizontal scaling | ~$5,037 |
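The monthly figures in this table appear to follow from the hourly prices in section 3 multiplied by roughly 730 hours per month. The 730-hour month is an assumption (8,760 hours per year divided by 12), and the helper name `monthly_cost` is hypothetical. A minimal sketch of the arithmetic:

```python
# Assumption: an average month has 8,760 / 12 = 730 hours.
HOURS_PER_MONTH = 730

# Hourly prices as listed in the tier table in section 3.
HOURLY_PRICE_USD = {
    "Standard": 0.10,
    "XL": 1.65,
    "2XL": 3.40,
    "4XL": 6.90,
    "8XL": 13.90,
}


def monthly_cost(tier: str, clusters: int = 1) -> float:
    """Approximate monthly control plane cost in USD for `clusters` clusters."""
    return HOURLY_PRICE_USD[tier] * HOURS_PER_MONTH * clusters


if __name__ == "__main__":
    for tier in HOURLY_PRICE_USD:
        print(f"{tier}: ~${monthly_cost(tier):,.0f}/month")
```

The figures cover the control plane only. Worker nodes, data transfer, and logging are billed separately.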

Control Plane Metrics Reference by Scale

Industry-average reference values for key metrics that drive EKS control plane scaling decisions. Actual values vary by workload pattern — investigate when thresholds are exceeded.

| Metric | ~50 Nodes (Standard) | ~200 Nodes (XL) | ~500 Nodes (2XL) | 1,000+ Nodes (4XL) |
|---|---|---|---|---|
| etcd DB Size | 0.5–1.5 GB | 2–5 GB | 5–10 GB | 10–16 GB |
| etcd Object Count | ~5,000 | ~30,000 | ~100,000 | 300,000+ |
| API QPS (req/sec) | 20–50 | 100–300 | 300–800 | 1,000–3,000 |
| API Latency (p99) | < 200ms | < 500ms | < 1s | < 1.5s (target) |
| 429 Throttle (/min) | 0 | < 5 | < 20 | Upgrade trigger |
| Watch Connections | ~200 | ~1,500 | ~5,000 | 15,000+ |
| CRD Types (ref) | 5–15 | 15–40 | 40–80 | 80+ |
| Controller Reconcile/sec | 5–20 | 50–150 | 150–500 | 500–2,000 |

How to Measure
  • etcd DB Size: apiserver_storage_size_bytes (CloudWatch or Prometheus)
  • API QPS: apiserver_request_total rate (split by verb recommended)
  • 429 Throttle: apiserver_request_total{code="429"} — investigate immediately if non-zero
  • Watch Connections: apiserver_longrunning_requests{verb="WATCH"} — scales with controllers/nodes
  • Reconcile Rate: controller_runtime_reconcile_total rate per controller

etcd Size Alert Thresholds
  • Standard: Warning at > 6GB → consider XL
  • XL/2XL: Warning at > 12GB → clean up unused CRs or upgrade tier
  • 4XL: Critical as usage approaches the 16GB limit → consider architecture split (multi-cluster)
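The Standard and XL/2XL warnings above both sit at 75% of the tier's etcd limit (6GB of 8GB, 12GB of 16GB), so they can be expressed as a single ratio check. The sketch below assumes GB means GiB and applies the same 75% rule to every tier, which goes beyond what the list states for 4XL. The function name is hypothetical, and the input is meant to be the value of apiserver_storage_size_bytes.

```python
GIB = 1024 ** 3

# etcd DB limits per tier, from the tier table in section 3.
TIER_LIMIT_GB = {"Standard": 8, "XL": 16, "2XL": 16, "4XL": 16, "8XL": 16}

# 6GB of 8GB and 12GB of 16GB are both 75% of the limit.
WARN_RATIO = 0.75


def etcd_size_status(tier: str, size_bytes: int) -> str:
    """Return "warning" when the DB size exceeds 75% of the tier limit, else "ok"."""
    limit_bytes = TIER_LIMIT_GB[tier] * GIB
    if size_bytes / limit_bytes > WARN_RATIO:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(etcd_size_status("Standard", 5 * GIB))
    print(etcd_size_status("Standard", 7 * GIB))
    print(etcd_size_status("XL", 13 * GIB))
```

In practice the same comparison would usually live in an alerting rule rather than a script. The function form just makes the threshold logic explicit.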

| Phase | Timeline | Activities |
|---|---|---|
| 1: Basic | 1 week | CloudWatch alarms, Control Plane Logging |
| 2: Prometheus | 2 weeks | AMP Scraper, Grafana dashboards |
| 3: PCP | 1 week | Select and apply PCP tier |
| 4: Optimize | Ongoing | Insights, tier adjustments, controller tuning |