CoreDNS Monitoring and Performance Optimization Complete Guide

Published 2025-05-20 · Updated 2026-06-30 · 15 min read

In Amazon EKS and other modern Kubernetes clusters, CoreDNS is the core component responsible for all in-cluster service discovery and external domain name resolution. Because CoreDNS performance and availability directly affect application response times and stability, an effective monitoring and optimization setup is critical. This article covers CoreDNS performance monitoring metrics, TTL configuration, monitoring architecture best practices, and AWS recommendations with real-world cases. Each section uses Prometheus metrics and Amazon EKS examples.

1. CoreDNS Performance Monitoring: Key Prometheus Metrics

CoreDNS exposes Prometheus-format metrics through the metrics plugin, available by default in EKS on port 9153 of the kube-dns service. The core metrics cover DNS request throughput, latency, errors, and caching efficiency, enabling rapid detection of DNS performance bottlenecks or failure indicators.

CoreDNS 4 Golden Signals

Core monitoring indicators based on Google SRE methodology:

📈 Throughput (coredns_dns_requests_total): DNS queries per second (QPS). Check per-Pod load balance; consider scale-out on sustained growth.
⏱️ Latency (coredns_dns_request_duration_seconds): P99 response time. If elevated, check upstream DNS latency or CoreDNS CPU/memory saturation.
❌ Errors (coredns_dns_responses_total{rcode="SERVFAIL"}): On a SERVFAIL/REFUSED spike, check external connectivity or ACL issues. An NXDOMAIN surge indicates lookups of wrong domains.
💻 Resource (CPU / memory utilization): EKS default memory request/limit is 70Mi/170Mi; alert above 150Mi. CPU throttling at the limit causes DNS latency.

CoreDNS Key Prometheus Metrics

Core metrics, exposed on port 9153 (/metrics) by default on EKS:

| Metric | Type | Signal | Description | PromQL |
| --- | --- | --- | --- | --- |
| coredns_dns_requests_total | Counter | Throughput | Total DNS requests (by proto/type). Use rate() for QPS | rate(coredns_dns_requests_total[5m]) |
| coredns_dns_request_duration_seconds | Histogram | Latency | DNS processing time distribution. Check upstream/resources if P99 > 100ms | histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m])) |
| coredns_dns_responses_total | Counter | Errors | DNS response code distribution. Track SERVFAIL/NXDOMAIN ratios | rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) |
| coredns_cache_hits_total | Counter | Cache | Cache hits (success/denial). Used to calculate cache hit ratio | rate(coredns_cache_hits_total[5m]) |
| coredns_cache_misses_total | Counter | Cache | Cache misses. Hit ratio = hits / (hits + misses) | rate(coredns_cache_misses_total[5m]) |
| coredns_forward_requests_total | Counter | Forward | Requests forwarded to upstream DNS. Triggered on cache miss | rate(coredns_forward_requests_total[5m]) |
| coredns_forward_responses_total | Counter | Forward | Upstream DNS responses (by rcode). Monitor upstream errors | rate(coredns_forward_responses_total[5m]) |
| coredns_panics_total | Counter | Stability | CoreDNS panic count. Investigate immediately if non-zero | coredns_panics_total |

Beyond these, additional metrics such as request/response size (coredns_dns_request_size_bytes, ...response_size_bytes), DO bit presence (coredns_dns_do_requests_total), and plugin-specific metrics are available. For example, the Forward plugin provides upstream query time (coredns_forward_request_duration_seconds), and the kubernetes plugin provides API update latency (coredns_kubernetes_dns_programming_duration_seconds).

Key Metric Meanings and Usage

Track the per-second rate of coredns_dns_requests_total for DNS QPS, broken down per CoreDNS Pod to verify load balance. If QPS grows consistently, evaluate whether CoreDNS needs to scale out. When the p99 of coredns_dns_request_duration_seconds rises above its normal level, CoreDNS is responding slowly: check for upstream DNS delays or CoreDNS CPU/memory saturation. If the cache hit ratio (coredns_cache_hits_total relative to hits plus misses) is low, check whether the TTL is too short. If coredns_dns_responses_total shows rising SERVFAIL or REFUSED counts, check CoreDNS external connectivity or access control. A spike in NXDOMAIN for specific domains may indicate applications querying incorrect domain names.
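
The signals described above can be precomputed as Prometheus recording rules. The following is a minimal sketch, not a definitive rule set: the group name and the record names are illustrative assumptions, and the per-Pod breakdown assumes the scrape configuration attaches a pod label to each series.

```yaml
# Sketch: recording rules for the CoreDNS signals discussed above.
# Group and record names are hypothetical; rename to fit your conventions.
groups:
  - name: coredns-signals
    rules:
      # Throughput: QPS per CoreDNS Pod (assumes a `pod` label on the series)
      - record: coredns:dns_requests:rate5m
        expr: sum by (pod) (rate(coredns_dns_requests_total[5m]))

      # Latency: P99 of request duration across all Pods
      - record: coredns:dns_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))

      # Cache hit ratio = hits / (hits + misses)
      - record: coredns:cache_hit_ratio:rate5m
        expr: |
          sum(rate(coredns_cache_hits_total[5m]))
          /
          (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))

      # Errors: SERVFAIL share of all responses
      - record: coredns:servfail_ratio:rate5m
        expr: |
          sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
          /
          sum(rate(coredns_dns_responses_total[5m]))
```

Recording the ratios once keeps dashboard and alert expressions short and ensures both use the same definition of hit ratio and error rate.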

System resource metrics (CPU/memory) are also important. Monitor CoreDNS Pod CPU/memory utilization and alert when usage approaches the resource limits. The EKS default CoreDNS memory request/limit is 70Mi/170Mi, so track whether usage exceeds 150Mi. CPU throttling at the container's CPU limit adds DNS latency, so consider scaling when CPU usage approaches the limit.
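
As a rough sketch, the resource checks above could be written as alerting rules over cAdvisor container metrics. Several things here are assumptions: the container name coredns and the kube-system namespace labels, the alert names, the 10-minute windows, and the 25% throttling threshold.

```yaml
# Sketch: resource alerts for CoreDNS Pods (names and thresholds are assumptions).
groups:
  - name: coredns-resources
    rules:
      - alert: CoreDNSHighMemory
        # 150Mi expressed in bytes; the EKS default limit is 170Mi
        expr: |
          max by (pod) (
            container_memory_working_set_bytes{namespace="kube-system", container="coredns"}
          ) > 150 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          description: "CoreDNS Pod {{ $labels.pod }} memory above 150Mi"

      - alert: CoreDNSCPUThrottled
        # Share of CFS periods in which the container was throttled.
        # Only produces data when a CPU limit is set on the container.
        expr: |
          sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="kube-system", container="coredns"}[5m]))
          /
          sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="kube-system", container="coredns"}[5m]))
          > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          description: "CoreDNS Pod {{ $labels.pod }} throttled in more than 25% of CPU periods"
```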

VPC ENI DNS Packet Limit

Each node ENI allows only 1024 DNS packets per second. Even if you increase CoreDNS max_concurrent, the ENI PPS limit (1024 PPS) may prevent reaching desired performance.

2. CoreDNS TTL Configuration Guide and Amazon EKS Examples

TTL (Time-To-Live) defines the DNS record cache validity period, balancing DNS traffic load and information freshness. CoreDNS handles TTL at two levels:

Authoritative zone record (SOA) TTL: The kubernetes plugin response TTL for internal cluster domains (cluster.local, etc.), defaulting to 5 seconds. Configurable in the kubernetes section of the Corefile with the ttl option (min 0, max 3600 seconds).
Cache TTL: The cache plugin maximum cache retention time, defaulting to 3600 seconds (success responses). The specified TTL acts as an upper limit — if the actual DNS record TTL is shorter, the cache respects the shorter value.

⚙️ CoreDNS TTL Configuration Guide

Optimal balance between DNS traffic load and data freshness:

| Item | Plugin | Setting | Default | Recommended | Notes |
| --- | --- | --- | --- | --- | --- |
| Kubernetes Internal Domains | kubernetes | ttl 30 | 5s | 30s | Response TTL for cluster.local records. 30s recommended for a better cache hit ratio |
| DNS Response Cache (Global) | cache | cache 30 | 3600s (max) | 30s | CoreDNS internal cache ceiling. EKS default is 30s. Success and denial are separately configurable |
| Negative Cache (NXDOMAIN) | cache | denial 2000 10 | 1800s (max) | 5-10s | NXDOMAIN response cache. Too long a value delays discovery of new services |
| Prefetch | cache | prefetch 5 60s | Disabled | 5 60s | Refreshes an entry before its TTL expires once the same query has been seen 5+ times with no gap of 60s or more. Keeps the cache fresh |

💡 TTL Tuning Principle: A short TTL (< 5s) reflects changes quickly but increases CoreDNS load. A long TTL (minutes+) reduces load but risks stale records. 30s is a reasonable baseline for most EKS environments.

Amazon EKS Default CoreDNS Configuration

The default EKS CoreDNS Corefile leaves the kubernetes plugin at its 5-second default TTL (no explicit ttl) and uses cache 30 to cap cached DNS responses at 30 seconds. Because the cache honors a record's own TTL when it is shorter than the cap, internal service records (TTL 5 seconds in the response packet) are cached for about 5 seconds, while external domains are cached for up to 30 seconds or their own TTL, whichever is shorter.

TTL Configuration Guide

Short TTLs (≤5s) reflect DNS changes quickly but increase CoreDNS load. Long TTLs (minutes+) reduce query frequency but delay change propagation. The recommended approach is to moderately increase TTL (tens of seconds) to improve cache hit rate while avoiding severe information delays. Many Kubernetes environments use 30 seconds as a baseline.

Amazon EKS Application Examples

To adjust TTL in EKS, modify the CoreDNS ConfigMap: add ttl 30 to the kubernetes cluster.local ... block. Note that the standard Linux glibc resolver does not cache responses, so without NodeLocal DNSCache a higher TTL mainly reduces CoreDNS's own load.
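
The change described above might look like the ConfigMap below. This is a sketch under assumptions: the surrounding plugin list is what a typical EKS default Corefile resembles and may differ in your cluster, so keep whatever is already there and add only the ttl line. If CoreDNS is installed as an EKS managed add-on, direct ConfigMap edits may be overwritten by add-on updates, depending on how conflicts are resolved.

```yaml
# Sketch: CoreDNS ConfigMap with an explicit ttl in the kubernetes block.
# Everything except the `ttl 30` line is assumed to match the existing Corefile.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30    # added: raises the cluster.local response TTL from the 5s default
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```

With the reload plugin enabled, CoreDNS picks up the modified Corefile without a Pod restart once the ConfigMap change has propagated to the Pod.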

Aurora DNS Load Balancing Issue

Amazon Aurora uses a very low TTL (1 second) for DNS load balancing. The CoreDNS cache plugin's default minimum TTL of 5 seconds over-caches these 1-second records, distorting traffic distribution across the Aurora reader endpoint. Apply domain-specific low TTL settings for such cases.
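
One way to express a domain-specific low TTL is a dedicated server block for the affected zone. The sketch below assumes the amazonaws.com zone (it can be narrowed to the zone your Aurora endpoints resolve under) and uses the cache plugin's optional third argument to lower the minimum TTL to 1 second. The capacity of 9984 and the use of /etc/resolv.conf as upstream are also assumptions.

```yaml
# Sketch: separate server block with a 1s cache floor and ceiling for amazonaws.com.
# The default ".:53" block must be kept as it is in your cluster; it is only
# outlined here and is not a complete configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    amazonaws.com:53 {
        errors
        cache 1 {
            success 9984 1 1   # capacity, max TTL 1s, min TTL 1s (default min is 5s)
            denial 9984 1 1    # same limits for negative responses
        }
        forward . /etc/resolv.conf
    }
    .:53 {
        # existing default server block goes here, unchanged
    }
```

Queries for names under amazonaws.com are handled by the more specific block, so the rest of the cluster keeps the longer cache settings.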

3. CoreDNS Monitoring Architecture Best Practices

🏗️ CoreDNS Monitoring Architecture

AMP + ADOT vs. CloudWatch Container Insights Comparison

AMP + ADOT (managed OSS)
Pipeline: ADOT Collector / Prometheus scrapes CoreDNS metrics → AMP remote write → Grafana (AMG) visualization
Pros: native PromQL queries; long-term retention and large-scale support; Terraform automation accelerator
Considerations: requires ADOT/Prometheus installation; charges based on ingested metrics

CloudWatch Container Insights (AWS native)
Pipeline: CloudWatch Agent DaemonSet → kube-dns:9153 scrape → CloudWatch Metrics storage → CloudWatch dashboards/alarms
Pros: AWS managed, no extra infrastructure; native CloudWatch Alarm integration; usable as an AMG data source
Considerations: CloudWatch metrics collection/storage charges; CloudWatch query syntax instead of PromQL

Pipeline Layers (in order)
Collection: ADOT Collector · Prometheus · CloudWatch Agent · Fluent Bit
Storage: AMP (Prometheus) · CloudWatch Metrics · CloudWatch Logs
Visualization: AMG (Grafana) · CloudWatch Dashboards
Alerting: Alertmanager · CloudWatch Alarms · SNS / PagerDuty / Slack

💡 Recommended: With Prometheus Operator (kube-prometheus-stack), a ServiceMonitor can auto-scrape the kube-system/kube-dns service (k8s-app=kube-dns) on port 9153.

Metric Collection and Storage

Two common approaches in Amazon EKS:

Amazon Managed Service for Prometheus (AMP): Fully managed Prometheus-compatible service. Install ADOT Collector or Prometheus to scrape and forward CoreDNS metrics.
CloudWatch Container Insights: Use CloudWatch agent as DaemonSet to scrape CoreDNS metrics from kube-dns service port 9153.

ServiceMonitor Configuration

EKS's kube-dns service provides a metrics port. With Prometheus Operator, create a ServiceMonitor targeting the k8s-app=kube-dns label on port 9153.
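
A ServiceMonitor along these lines could look like the sketch below. It assumes the kube-dns Service names its 9153 port metrics, and that the Prometheus instance selects ServiceMonitors by a release label; both the label key and its value are assumptions that depend on how kube-prometheus-stack was installed.

```yaml
# Sketch: ServiceMonitor for the kube-dns Service in kube-system.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coredns
  namespace: kube-system
  labels:
    release: kube-prometheus-stack   # assumption: must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      k8s-app: kube-dns
  endpoints:
    - port: metrics      # assumption: name of the 9153 port on the Service
      interval: 30s
```

Because no jobLabel is set, the resulting job label defaults to the Service name (kube-dns), which is worth keeping in mind when writing alert expressions that filter on job.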

Log Collection

Enable log or errors plugins as needed. Use Fluent Bit or Fluentd DaemonSets to collect CoreDNS stdout/stderr logs and export to CloudWatch Logs.

Log Collection Caution

Avoid excessive logging overhead. Set metadata caching (Kube_Meta_Cache_TTL=60) and reduce unnecessary field collection.

Visualization and Dashboards

Use Grafana (or Amazon Managed Grafana) to visualize CoreDNS metrics: QPS, latency histograms, error rates (rcode distribution), cache hit rates.

Alerting

Set alerts via Prometheus Alertmanager or CloudWatch Alarms:

CoreDNSDown: CoreDNS metrics unreported for 15+ minutes
HighDNSLatency: p99 latency exceeding 100ms
DNSErrorsSpike: SERVFAIL/NXDOMAIN ratio above threshold
ENIThrottling: ENI DNS packet limit exceeded
HighCoreDNSCPU/Memory: Resource utilization alerts
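
Two of the alerts listed above, sketched as Prometheus alerting rules. The job="kube-dns" selector is an assumption about how the scrape job is named, and the severities and the 5-minute latency window are illustrative choices.

```yaml
# Sketch: CoreDNSDown and HighDNSLatency from the list above.
groups:
  - name: coredns-alerts
    rules:
      - alert: CoreDNSDown
        # Fires when no CoreDNS target has reported as up for 15 minutes.
        # The job name is an assumption; adjust it to your scrape config.
        expr: absent(up{job="kube-dns"} == 1)
        for: 15m
        labels:
          severity: critical
        annotations:
          description: "No CoreDNS metrics target has been up for 15 minutes"

      - alert: HighDNSLatency
        # P99 request duration above 100ms
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CoreDNS P99 latency above 100ms for 5 minutes"
```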

4. Amazon EKS Best Practices and Customer Cases

🛡️ EKS Best Practices & Real-World Cases

AWS-recommended CoreDNS optimization strategies and incident response cases:

📈 Cluster Proportional Autoscaler (linear DNS QPS scaling): The EKS default is 2 CoreDNS replicas. Auto-scale in proportion to node count or CPU cores to distribute DNS load.
🗄️ NodeLocal DNSCache (reduced RTT, ENI bottleneck relieved): Run a DNS cache agent (DaemonSet) on every node for local DNS. This cuts network round trips and spreads queries across each node's own ENI allowance.
🔒 DNS Packet Limit & Traffic Distribution (avoid the ENI PPS bottleneck): Each VPC ENI is limited to 1024 DNS packets/sec. Spread CoreDNS Pods across nodes (Pod anti-affinity) so the load is distributed across ENIs.
🔄 Graceful Termination with lameduck (zero-downtime DNS rolling updates): Prevent transient DNS failures during CoreDNS restart or scale-down. Configure lameduck 30s plus a readiness probe on /ready.

Real-World Incident Cases

Case 1: DNS Latency from ENI PPS Limit

Symptom: Specific service DNS response delay → 1s+ added to total response time

Cause: VPC DNS Resolver hit ENI PPS limit (1024 PPS) causing packet drops

Solution: Deploy NodeLocal DNSCache + CoreDNS Pod node distribution (Anti-Affinity)

Case 2: Aurora Reader Skew from DNS TTL Caching

Symptom: Aurora reader node session skew → Some readers overloaded

Cause: Aurora Reader endpoint DNS TTL is 1s but CoreDNS min TTL 5s causes over-caching

Solution: Configure cache 1 with success/denial 1 for amazonaws.com in NodeLocal DNSCache

⚠️ ENI DNS Packet Limit: Each node ENI allows only 1024 DNS packets per second. Even with higher CoreDNS max_concurrent, ENI PPS limit (1024 PPS) may constrain performance.

CoreDNS Horizontal Scaling

Default 2 replicas; use Cluster Proportional Autoscaler to scale based on node count or CPU cores.
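
In linear mode, the Cluster Proportional Autoscaler reads its parameters from a ConfigMap and sets the replica count to the larger of ceil(cores / coresPerReplica) and ceil(nodes / nodesPerReplica), bounded by min and max. The sketch below is illustrative: the ConfigMap name dns-autoscaler and all numeric values are assumptions to be tuned per cluster.

```yaml
# Sketch: linear-mode parameters for cluster-proportional-autoscaler.
# The name must match the ConfigMap the autoscaler is configured to watch.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "max": 10,
      "preventSinglePointFailure": true
    }
```

With these values, a 48-node cluster with 384 cores would get max(ceil(384/256), ceil(48/16)) = max(2, 3) = 3 replicas.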

NodeLocal DNSCache

For large clusters or high DNS traffic, deploy NodeLocal DNSCache DaemonSet for local DNS caching on every node.

DNS Packet Limits and Traffic Distribution

VPC DNS packet limit is 1024 PPS/ENI. Ensure CoreDNS Pods are distributed across nodes.
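
Spreading CoreDNS Pods across nodes can be requested with Pod anti-affinity on the hostname topology key. The fragment below is a sketch of the relevant part of the CoreDNS Deployment's Pod template (spec.template.spec); the weight and the choice of a preferred rather than required rule are assumptions.

```yaml
# Sketch: fragment of spec.template.spec in the coredns Deployment.
affinity:
  podAntiAffinity:
    # "preferred" keeps Pods schedulable on small clusters; use
    # requiredDuringSchedulingIgnoredDuringExecution for a hard rule.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: k8s-app
                operator: In
                values:
                  - kube-dns
          topologyKey: kubernetes.io/hostname
```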

Graceful Termination (Lameduck & Ready Plugin)

Apply lameduck 30s and configure Readiness Probe on /ready endpoint.
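
On the Deployment side, the readiness probe and a matching termination grace period might be set as in this sketch. It assumes the Corefile enables the ready plugin (default port 8181) and health with lameduck 30s (default port 8080), that the container is named coredns, and that a 45-second grace period is an acceptable margin above the lameduck duration.

```yaml
# Sketch: strategic-merge patch fragment for the coredns Deployment.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45   # assumption: longer than lameduck 30s
      containers:
        - name: coredns
          readinessProbe:
            httpGet:
              path: /ready
              port: 8181      # default port of the ready plugin
              scheme: HTTP
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080      # default port of the health plugin
              scheme: HTTP
            periodSeconds: 10
            failureThreshold: 5
```

Keeping the grace period above the lameduck duration gives CoreDNS time to finish its delayed shutdown before the kubelet force-kills the container.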

Higher QPS Requirements

Increase max_concurrent to 2000+
Scale CoreDNS horizontally or deploy NodeLocal DNSCache
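Monitor ENI limits via aws_ec2_eni_allowance_exceeded (the ENA driver reports the underlying link-local counter as linklocal_allowance_exceeded in ethtool -S output)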

Key Summary

🎯 Performance Benchmarks & Tuning Guide

CoreDNS core performance targets and tuning parameters:

| Metric | Target | Critical | Note |
| --- | --- | --- | --- |
| Query Latency (P99) | < 50ms | > 100ms | 99% of DNS queries complete within 50ms |
| Throughput (QPS/Pod) | > 10K | < 5K | Process 10,000+ queries per second per Pod |
| Cache Hit Ratio | > 80% | < 50% | 80%+ cache utilization with TTL 30s baseline |
| Error Rate (SERVFAIL) | < 0.1% | > 1% | Keep SERVFAIL response ratio under 0.1% |
| CPU Utilization | < 60% | > 80% | CPU throttling at limit causes DNS latency |
| Memory Utilization | < 120Mi | > 150Mi | EKS default limit 170Mi. Alert above 150Mi |

Tuning Parameters

| Parameter | Baseline | Higher QPS | Note |
| --- | --- | --- | --- |
| max_concurrent | 1000 | 2000+ | Concurrent query limit. Budget about 2KB of memory per concurrent query (2000 queries ≈ 4MB) |
| Replica Count | Auto-proportional | | Apply Cluster Proportional Autoscaler |
| lameduck | 30s | | Prevent DNS failures during rolling updates |

💡 Benchmark Tool: dnsperf -s <COREDNS_IP> -d queries.txt -c 10 -T 10 to measure CoreDNS QPS and latency.

Monitoring Metrics: requests_total, request_duration_seconds, cache_hits/misses, responses_total{rcode}, CPU/memory
Recommended TTL: Service records 30s, cache (success 30, denial 5-10), prefetch 5 60s
Monitoring: kube-prometheus-stack dashboards + Alertmanager rules, NodeLocal DNSCache for scale-out

Appendix: Configuration Examples

Recommended Corefile

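.:53 {
  kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30           # Service/Pod record TTL (default 5s)
  }

  cache 30 {         # max 30s retention
    success 10000 30 # capacity 10k entries, max TTL 30s
    denial 2000 10   # negative cache: 2k entries, max TTL 10s
    prefetch 5 60s   # refresh before expiry after 5+ identical queries with no 60s gap
  }

  forward . /etc/resolv.conf {
    max_concurrent 2000  # cap on concurrent upstream queries (about 2KB memory each)
    prefer_udp
  }

  prometheus :9153   # metrics endpoint (/metrics)
  health {
    lameduck 30s     # delay shutdown by 30s so clients can drain
  }
  ready              # readiness endpoint (/ready)
  reload
  log                # per-query logging; consider removing under high QPS to limit overhead
}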

Alertmanager Rule Examples

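- alert: CoreDNSHighErrorRate
  # Counts SERVFAIL only, matching the SERVFAIL error-rate benchmark (critical > 1%).
  # NXDOMAIN is left out because search-path expansion makes it common in healthy clusters.
  expr: >
    (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) /
     sum(rate(coredns_dns_responses_total[5m]))) > 0.01
  for: 10m
  labels:
    severity: critical
  annotations:
    description: "CoreDNS SERVFAIL rate > 1% for 10 min"

- alert: CoreDNSP99Latency
  expr: >
    histogram_quantile(0.99,
      sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "CoreDNS P99 latency > 50ms for 5 min"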
Large Clusters (>100 Nodes or QPS > 5k)

NodeLocal DNSCache (DaemonSet) for local caching and RTT reduction
CloudWatch Container Insights as alternative when Prometheus collection is difficult
