CoreDNS Monitoring and Optimization Guide

📅 Written: 2025-05-20 | Last Modified: 2026-02-18 | ⏱️ Reading Time: ~13 min

In Amazon EKS and modern Kubernetes clusters, CoreDNS serves as the core component responsible for all in-cluster service discovery and external domain name resolution. Because CoreDNS performance and availability directly impact application response times and stability, establishing an effective monitoring and optimization architecture is critical. This article analyzes CoreDNS performance monitoring metrics, TTL configuration guidelines, monitoring architecture best practices, AWS recommendations, and real-world case studies. Each section leverages Prometheus metrics and Amazon EKS environment examples to explore CoreDNS monitoring strategies.

1. CoreDNS Performance Monitoring: Key Prometheus Metrics and Their Meaning

CoreDNS provides Prometheus-format metrics through the metrics plugin, exposed by default on port 9153 of the kube-dns service in EKS. Core metrics reveal DNS request throughput, latency, errors, and caching efficiency, enabling rapid detection of DNS performance bottlenecks or failure indicators through monitoring.

CoreDNS 4 Golden Signals

Core monitoring indicators based on Google SRE methodology:

Throughput (coredns_dns_requests_total): DNS queries per second (QPS). Check per-Pod load balance; consider scale-out on sustained growth.
Latency (coredns_dns_request_duration_seconds): P99 response time. If elevated, check upstream DNS latency or CoreDNS CPU/memory saturation.
Errors (coredns_dns_responses_total{rcode="SERVFAIL"}): On a SERVFAIL/REFUSED spike, check external connectivity or ACL issues. An NXDOMAIN surge indicates lookups for wrong domains.
Resource (CPU / memory utilization): EKS default memory request/limit is 70Mi/170Mi. Alert above 150Mi. CPU throttling at the limit causes DNS latency.

CoreDNS Key Prometheus Metrics

Exposed via port 9153 (/metrics) by default on EKS.

| Metric | Type | Signal | Description | PromQL |
| --- | --- | --- | --- | --- |
| coredns_dns_requests_total | Counter | Throughput | Total DNS requests (by proto/type). Use rate() for QPS | rate(coredns_dns_requests_total[5m]) |
| coredns_dns_request_duration_seconds | Histogram | Latency | DNS processing time distribution. Check upstream/resources if P99 > 100ms | histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m])) |
| coredns_dns_responses_total | Counter | Errors | DNS response code distribution. Track SERVFAIL/NXDOMAIN ratios | rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]) |
| coredns_cache_hits_total | Counter | Cache | Cache hits (success/denial). Used to calculate cache hit ratio | rate(coredns_cache_hits_total[5m]) |
| coredns_cache_misses_total | Counter | Cache | Cache misses. Hit ratio = hits / (hits + misses) | rate(coredns_cache_misses_total[5m]) |
| coredns_forward_requests_total | Counter | Forward | Upstream DNS forwarded requests. Triggered on cache miss | rate(coredns_forward_requests_total[5m]) |
| coredns_forward_responses_total | Counter | Forward | Upstream DNS responses (by rcode). Monitor upstream errors | rate(coredns_forward_responses_total[5m]) |
| coredns_panics_total | Counter | Stability | CoreDNS panic count. Investigate immediately if non-zero | coredns_panics_total |

Beyond these, additional metrics like request/response size (coredns_dns_request_size_bytes, ...response_size_bytes), DO bit settings (coredns_dns_do_requests_total), and plugin-specific metrics are available. For example, the Forward plugin provides upstream query time (coredns_forward_request_duration_seconds), and the kubernetes plugin offers API update latency (coredns_kubernetes_dns_programming_duration_seconds).

Key Metrics Meaning and Application

Use the rate of increase of coredns_dns_requests_total to determine DNS QPS, and divide it by the CoreDNS pod count to verify that load is evenly distributed. If QPS rises continuously, evaluate whether CoreDNS needs to scale out. When the 99th percentile of coredns_dns_request_duration_seconds exceeds its normal level, CoreDNS is responding slowly: check for upstream DNS latency or CoreDNS CPU/memory saturation. If the cache hit rate (derived from coredns_cache_hits_total and coredns_cache_misses_total) is low, verify whether a too-short TTL is reducing cache effectiveness and adjust accordingly. When coredns_dns_responses_total shows increased SERVFAIL or REFUSED rates, check the logs for CoreDNS external communication issues or permission problems. If NXDOMAIN responses spike for specific domains, applications may be querying incorrect domains, so correct the application code.
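The arithmetic behind these checks can be sketched in a few lines of Python. This is only an illustration of what rate() and the hit-ratio formula compute; the counter samples, the 300-second scrape window, and the pod count of 2 are made-up values, and the helper names are hypothetical.

```python
def per_second_rate(prev: float, curr: float, interval_s: float) -> float:
    """Average per-second increase of a counter between two samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter that went backwards usually means the process restarted.
    # As a simplification, count the post-restart value as the increase.
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s


def cache_hit_ratio(hit_rate: float, miss_rate: float) -> float:
    """Hit ratio = hits / (hits + misses); 0.0 when there is no traffic."""
    total = hit_rate + miss_rate
    return hit_rate / total if total > 0 else 0.0


# Hypothetical samples of the CoreDNS counters taken 300 seconds apart.
requests_qps = per_second_rate(1_200_000, 1_560_000, 300)
qps_per_pod = requests_qps / 2  # assumes 2 CoreDNS pods with even load
hits = per_second_rate(900_000, 1_170_000, 300)
misses = per_second_rate(300_000, 390_000, 300)
ratio = cache_hit_ratio(hits, misses)

print(f"QPS={requests_qps:.0f} per-pod={qps_per_pod:.0f} hit-ratio={ratio:.2f}")
```

In practice Prometheus performs this calculation itself; the sketch is mainly useful for sanity-checking a dashboard value by hand.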

Additionally, system resource metrics (CPU/memory) are crucial. Monitor CoreDNS pod CPU/memory usage and alert when a pod approaches its resource limits. For example, the EKS default CoreDNS memory request/limit is 70Mi/170Mi, so alert when memory usage exceeds 150Mi and respond by increasing the memory limit or adding pods. If CPU usage reaches a configured limit, the Linux kernel's CPU quota enforcement throttles the CoreDNS container, causing DNS delays; when CPU usage approaches the limit, consider scaling out or increasing the resource allocation.

VPC ENI DNS Packet Limit

Each node ENI allows only 1024 DNS packets per second to the VPC DNS resolver. Even if CoreDNS's max_concurrent limit is raised, this per-ENI limit may prevent CoreDNS from reaching the desired performance.

2. CoreDNS TTL Configuration Guide and Amazon EKS Application Examples

TTL (Time-To-Live) defines the valid cache time for DNS records, and appropriate TTL settings balance DNS traffic load versus information freshness. CoreDNS handles TTL at two levels:

Authoritative Zone Record (SOA, Start of Authority) TTL: The kubernetes plugin response TTL for in-cluster domains (cluster.local, etc.), with a default of 5 seconds. Modify via the ttl option in the CoreDNS Corefile kubernetes section, configurable from minimum 0 seconds (no caching) to maximum 3600 seconds.
Cache TTL: The maximum time the cache plugin retains cached items. It defaults to 3600 seconds for success responses (1800 seconds for denial responses) and is set with the cache [TTL] directive in the Corefile. The specified TTL acts as an upper bound: if the actual DNS record TTL is shorter, items are removed from cache according to that shorter value. (The cache plugin's default minimum TTL is 5 seconds, adjustable via MINTTL).
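How the upper and lower bounds interact can be illustrated with a small clamp function. This is a simplified model of the behaviour described above, not CoreDNS source code; the function name and the example TTL values are chosen for illustration only.

```python
def effective_cache_ttl(record_ttl: int, max_ttl: int = 3600, min_ttl: int = 5) -> int:
    """Clamp a record's TTL into the [min_ttl, max_ttl] range used for caching."""
    # Raise very short TTLs to the minimum first, then cap at the maximum.
    return min(max(record_ttl, min_ttl), max_ttl)


# With `cache 30` (max 30s) and the default minimum of 5s:
long_lived = effective_cache_ttl(300, max_ttl=30)  # capped by the cache ceiling
in_cluster = effective_cache_ttl(5, max_ttl=30)    # record TTL is already shorter
one_second = effective_cache_ttl(1, max_ttl=30)    # raised to the 5s minimum

# With both bounds lowered to 1s, a 1-second record keeps its original TTL:
respected = effective_cache_ttl(1, max_ttl=1, min_ttl=1)

print(long_lived, in_cluster, one_second, respected)
```

The third case is the one to watch for external services that rely on very short TTLs: the minimum bound, not the ceiling, decides how long the answer is cached.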

⚙️ CoreDNS TTL Configuration Guide

Optimal balance between DNS traffic load and data freshness.

| Item | Plugin | Setting | Default | Recommended | Notes |
| --- | --- | --- | --- | --- | --- |
| Kubernetes Internal Domains | kubernetes | ttl 30 | 5s | 30s | Response TTL for cluster.local records. 30s recommended for better cache hit ratio |
| DNS Response Cache (Global) | cache | cache 30 | 3600s (max) | 30s | CoreDNS internal cache ceiling. EKS default 30s. Success/denial separately configurable |
| Negative Cache (NXDOMAIN) | cache | denial 2000 10 | 1800s (max) | 5-10s | NXDOMAIN response cache. Too long a value delays discovery of new services |
| Prefetch | cache | prefetch 5 60s | Disabled | 5 60s | Refreshes an entry before its TTL expires once the same query has been seen 5+ times with no gap of 60s or more. Keeps cache fresh |

💡 TTL Tuning Principle: Short TTL (< 5s) reflects changes quickly but increases CoreDNS load. Long TTL (minutes+) reduces load but risks stale records. 30s is a reasonable baseline for most EKS environments.

Amazon EKS Default CoreDNS Configuration

In the default CoreDNS Corefile deployed in EKS, the kubernetes plugin has no explicit ttl, so it uses the default of 5 seconds, while the cache 30 setting caches all DNS responses for a maximum of 30 seconds. Because the cache ceiling is only an upper bound, in-cluster service records are returned with a 5-second TTL and are held in the CoreDNS cache no longer than that TTL, while responses with longer TTLs are capped at 30 seconds. For external domain queries, this means that even records with very long TTLs are refreshed after 30 seconds, which avoids retaining excessively stale DNS information.

TTL Configuration Guidelines

Generally, short TTL (e.g., 5 seconds or less) has the advantage of rapidly reflecting DNS record changes (new service IPs, pod IP changes), but increases repeated queries from clients or DNS caches, raising CoreDNS load. Conversely, long TTL (several minutes or more) reduces DNS query frequency to improve performance, but delays change propagation, increasing the possibility of temporary connection failures due to stale information. The recommended approach is to moderately increase TTL (in tens of seconds units) according to cluster size and workload patterns to raise cache hit rates while avoiding serious information delays. Many Kubernetes environments use TTL around 30 seconds as a baseline.
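The trade-off can be made concrete with a deliberately rough model: assume a single name is queried at a constant rate and that each TTL window costs exactly one cache miss. Both assumptions, and the 2 QPS figure, are illustrative and not measurements from a real cluster.

```python
def steady_state_hit_ratio(qps: float, ttl_s: float) -> float:
    """Approximate cache hit ratio for one name queried at a constant rate.

    Assumes one miss per TTL window and qps * ttl_s queries in that window.
    """
    queries_per_window = qps * ttl_s
    if queries_per_window <= 1:
        return 0.0  # the entry expires before it can be reused
    return 1.0 - 1.0 / queries_per_window


# Same name, 2 queries per second, three candidate TTLs.
for ttl in (5, 30, 300):
    print(f"ttl={ttl:>3}s hit ratio={steady_state_hit_ratio(2.0, ttl):.3f}")
```

Under these assumptions most of the gain comes from moving from 5 to 30 seconds; going to several minutes adds little hit ratio while lengthening the window in which a stale record can be served.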

Amazon EKS Application Example

To adjust TTL in EKS, modify the CoreDNS ConfigMap. For example, to increase internal domain cache time, add ttl 30 to the kubernetes cluster.local ... block in Corefile. This increases the cluster internal DNS response TTL field to 30 seconds, allowing client-side (NodeLocal DNSCache, application runtimes) caching for 30 seconds without re-querying. However, in Kubernetes environments, typical Linux glibc resolvers don't self-cache and query CoreDNS every time, so without auxiliary caches like NodeLocal DNSCache, increasing TTL provides limited client-side benefits. TTL adjustments primarily reduce CoreDNS load.

Aurora DNS Load Balancing Issue

Services like AWS Aurora use very low TTL (1 second) for DNS load balancing. In this case, CoreDNS's default minimum 5-second TTL over-caches the original 1-second TTL, distorting Aurora reader endpoint traffic distribution. In such situations, introduce domain-specific low TTL settings.

Real-world cases resolved this by configuring the CoreDNS instance inside NodeLocal DNSCache with cache 1 and success/denial TTLs of 1 second for the amazonaws.com zone, so that the Aurora endpoint's original 1-second TTL is respected. External service TTL policies must therefore be considered when tuning CoreDNS TTL and cache strategies.

3. CoreDNS Monitoring Architecture Best Practices

The ideal CoreDNS monitoring architecture is built as a comprehensive observability pipeline including metric collection (Prometheus, etc.), log collection (Fluent Bit, etc.), plus visualization and alerting systems. In Amazon EKS environments, stable and scalable monitoring systems can be implemented by combining managed services and open-source tools.

🏗️ CoreDNS Monitoring Architecture

AMP + ADOT vs CloudWatch Container Insights Comparison

AMP + ADOT (Managed OSS)
Pipeline: ADOT Collector / Prometheus scrapes CoreDNS metrics → AMP remote write → Grafana (AMG) visualization
Pros:
+ Native PromQL queries
+ Long-term retention & large-scale support
+ Terraform automation accelerator
Considerations:
- Requires ADOT/Prometheus installation
- Charges based on ingested metrics

CloudWatch Container Insights (AWS Native)
Pipeline: CloudWatch Agent DaemonSet → kube-dns:9153 scrape → CloudWatch Metrics storage → CloudWatch dashboard/alarms
Pros:
+ AWS managed - no extra infra
+ Native CloudWatch Alarm integration
+ Usable as AMG data source
Considerations:
- CloudWatch metrics collection/storage charges
- CloudWatch query syntax instead of PromQL

Pipeline Layers

Collection: ADOT Collector · Prometheus · CloudWatch Agent · Fluent Bit
↓
Storage: AMP (Prometheus) · CloudWatch Metrics · CloudWatch Logs
↓
Visualization: AMG (Grafana) · CloudWatch Dashboards
↓
Alerting: Alertmanager · CloudWatch Alarms · SNS / PagerDuty / Slack

💡 Recommended: With Prometheus Operator (kube-prometheus-stack), ServiceMonitor can auto-scrape kube-system/kube-dns (k8s-app=kube-dns) service on port 9153.

Metrics Collection and Storage

Amazon EKS typically uses two approaches for collecting CoreDNS Prometheus metrics:

Amazon Managed Service for Prometheus (AMP): A fully managed, Prometheus-compatible AWS service that stores cluster metrics received via remote write in a scalable time-series database. Deploy the ADOT (AWS Distro for OpenTelemetry) Collector or a Prometheus server in the EKS cluster to scrape CoreDNS metrics and send them to AMP. Metrics stored in AMP can be queried with PromQL, and the service is suitable for long-term retention and large clusters.
CloudWatch Container Insights (and CloudWatch Agent): Collects Prometheus metrics into CloudWatch. Deploy the CloudWatch agent as a DaemonSet and configure it to scrape CoreDNS metrics from port 9153 of the kube-system/kube-dns service.

ServiceMonitor Setup

Amazon EKS's kube-dns service provides a metrics port, so with Prometheus Operator, create a ServiceMonitor targeting the k8s-app=kube-dns labeled service in kube-system namespace to scrape port 9153.

Log Collection

CoreDNS query logs and error logs are useful information sources for diagnosing performance issues or security monitoring (e.g., flood queries for specific domains). The default CoreDNS Corefile lacks the log plugin, but you can activate log or errors plugins as needed. In practice, the common pattern is to collect logs written to CoreDNS pod stdout/stderr using Fluent Bit or Fluentd as DaemonSet and export to CloudWatch Logs.

Log Collection Caution

Avoid excessive load from over-collection by logging only necessary levels. EKS best practices recommend metadata caching (Kube_Meta_Cache_TTL=60, etc.) for agents like Fluent Bit to prevent repeated Kubernetes API queries and reduce unnecessary field collection.

Visualization and Dashboards

Collected CoreDNS metrics are typically visualized via Grafana monitoring dashboards. Amazon Managed Grafana (AMG) natively integrates with AMP or CloudWatch as data sources and controls access via IAM-integrated SSO. When building CoreDNS dashboards in Grafana, configure panels for request rate (QPS), response latency (histogram), error rate (rcode distribution), and cache hit rate.

Alarms/Alerting

Use Prometheus Alertmanager or CloudWatch Alarms to set alerts for DNS anomaly indicators. Representative CoreDNS-related alerting rule examples:

CoreDNSDown: Alert when CoreDNS metrics (up{job="kube-dns"}, etc.) aren't reported for a period (e.g., for: 15m).
HighDNSLatency: Alert when coredns_dns_request_duration_seconds p99 latency exceeds a threshold (for example, 100ms) and stays above its usual level.
DNSErrorsSpike: Alert when the rate of coredns_dns_responses_total with rcode label SERVFAIL or NXDOMAIN exceeds a threshold.
ENIThrottling: Alert when the AWS-specific metrics that track the EC2 network interface (ENI) DNS packet limit show the limit being exceeded.
HighCoreDNSCPU/Memory: Alert when CoreDNS pod CPU or memory usage approaches its limits.

4. Amazon EKS Best Practices and Customer Case Studies (DNS Bottleneck Resolution)

AWS provides EKS DNS operational best practices via documentation and blogs. Key recommendations and frequently-encountered scenarios:

🛡️ EKS Best Practices & Real-World Cases

AWS recommended CoreDNS optimization strategies and incident response cases:

Cluster Proportional Autoscaler (linear DNS QPS scaling): The EKS default CoreDNS replica count is 2. Auto-scale proportionally to node count/CPU cores to distribute DNS load.
NodeLocal DNSCache (reduced RTT, relieved ENI bottleneck): Run a DNS cache agent (DaemonSet) on all nodes for local DNS. Reduces network latency and pressure on ENI limits.
DNS Packet Limit & Traffic Distribution (avoid ENI PPS bottleneck): Each VPC ENI is limited to 1024 DNS packets/sec. Spread CoreDNS Pods across nodes (Pod Anti-Affinity) so the load is spread across ENIs.
Graceful Termination (Lameduck) (zero-downtime DNS rolling updates): Prevent transient DNS failures during CoreDNS restart/scale-down. Configure lameduck 30s + /ready Readiness Probe.

Real-World Incident Cases

Case 1: DNS Latency from ENI PPS Limit

Symptom: Specific service DNS response delay → 1s+ added to total response time

Cause: VPC DNS Resolver hit ENI PPS limit (1024 PPS) causing packet drops

Solution: Deploy NodeLocal DNSCache + CoreDNS Pod node distribution (Anti-Affinity)

Case 2: Aurora Reader Skew from DNS TTL Caching

Symptom: Aurora reader node session skew → Some readers overloaded

Cause: Aurora Reader endpoint DNS TTL is 1s but CoreDNS min TTL 5s causes over-caching

Solution: Configure cache 1 with success/denial 1 for amazonaws.com in NodeLocal DNSCache

⚠️ ENI DNS Packet Limit: Each node ENI allows only 1024 DNS packets per second. Even with higher CoreDNS max_concurrent, ENI PPS limit (1024 PPS) may constrain performance.

CoreDNS Horizontal Scaling (Replica Adjustment)

The default CoreDNS Deployment replica count at EKS cluster creation is fixed at 2, but horizontal scaling may be needed as node count and workload increase. AWS best practice is using Cluster Proportional Autoscaler to automatically increase CoreDNS replicas proportional to node count or CPU core count.
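As a rough sketch of what proportional scaling means, the following Python mirrors the linear mode documented for Cluster Proportional Autoscaler: take the larger of the node-based and core-based replica counts, then apply the minimum and maximum. The ratios of 16 nodes and 256 cores per replica and the minimum of 2 are illustrative assumptions, not EKS defaults.

```python
import math
from typing import Optional


def linear_replicas(
    nodes: int,
    cores: int,
    nodes_per_replica: float = 16,   # assumed ratio for illustration
    cores_per_replica: float = 256,  # assumed ratio for illustration
    min_replicas: int = 2,           # assumed floor, matching the 2 initial replicas
    max_replicas: Optional[int] = None,
) -> int:
    """Replica count in linear mode: max(ceil(cores/cpr), ceil(nodes/npr)), bounded."""
    by_nodes = math.ceil(nodes / nodes_per_replica)
    by_cores = math.ceil(cores / cores_per_replica)
    replicas = max(by_nodes, by_cores)
    if max_replicas is not None:
        replicas = min(replicas, max_replicas)
    return max(replicas, min_replicas)


small = linear_replicas(nodes=10, cores=40)        # floor of 2 applies
medium = linear_replicas(nodes=100, cores=800)     # node count dominates
dense = linear_replicas(nodes=100, cores=3200)     # core count dominates
capped = linear_replicas(nodes=100, cores=3200, max_replicas=10)

print(small, medium, dense, capped)
```

Choosing the ratios is a tuning exercise: smaller values per replica add replicas sooner, at the cost of more pods to schedule and monitor.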

NodeLocal DNSCache Adoption

In large-scale clusters or workloads with very frequent DNS traffic, the central CoreDNS processing approach can become bottlenecked by network latency and ENI limits. Kubernetes's official add-on NodeLocal DNSCache runs a DNS cache agent (CoreDNS-based) as DaemonSet on all nodes, providing local DNS on each node.

DNS Packet Limit and Traffic Distribution

A common AWS bottleneck is the VPC DNS packet limit (1024 PPS/ENI). In real-world cases, when applications make massive numbers of external DNS queries and both CoreDNS pods run on the same node, all external DNS queries exit through that node's single ENI, risking exceeding the limit.
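A back-of-the-envelope check shows why pod placement matters. The sketch below assumes upstream queries are split evenly across CoreDNS pods, that each query is one packet, and that each node sends them through a single ENI; the node names and the 1500 QPS figure are hypothetical.

```python
from typing import Dict, List

VPC_DNS_PPS_LIMIT = 1024  # per-ENI limit quoted in this guide


def upstream_pps_per_node(total_upstream_qps: float, pods_per_node: Dict[str, int]) -> Dict[str, float]:
    """Split the upstream query rate evenly across pods and sum it per node."""
    total_pods = sum(pods_per_node.values())
    if total_pods == 0:
        raise ValueError("at least one CoreDNS pod is required")
    per_pod = total_upstream_qps / total_pods
    return {node: per_pod * count for node, count in pods_per_node.items()}


def nodes_over_limit(pps_by_node: Dict[str, float], limit: int = VPC_DNS_PPS_LIMIT) -> List[str]:
    """Names of nodes whose estimated upstream rate exceeds the per-ENI limit."""
    return sorted(node for node, pps in pps_by_node.items() if pps > limit)


# Both CoreDNS pods on one node: 1500 upstream QPS leaves through one ENI.
colocated = upstream_pps_per_node(1500, {"node-a": 2})
# Anti-affinity places one pod per node: 750 QPS per ENI.
spread = upstream_pps_per_node(1500, {"node-a": 1, "node-b": 1})

print(nodes_over_limit(colocated), nodes_over_limit(spread))
```

Real traffic is rarely split this evenly, so treat the result as a lower bound on the busiest node rather than a prediction.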

Graceful Termination Configuration (Lameduck & Ready Plugin)

This configuration prevents transient DNS failures during CoreDNS pod restart or scale-down. AWS best practice is to apply the lameduck 30s setting to CoreDNS and to point the Readiness Probe at the /ready endpoint.

When Higher QPS is Needed

Increase max_concurrent: Adjustable above 2000, but consider memory usage (2 KB × concurrent query count) and upstream DNS latency together.
CoreDNS Horizontal Scaling: Increase replica count or use Cluster Proportional Autoscaler, HPA, or NodeLocal DNSCache to distribute queries to node level.
Monitor ENI Limits: Set alarms on aws_ec2_eni_allowance_exceeded (CloudWatch) or linklocal_allowance_exceeded metrics for early detection of ENI PPS overage.
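The memory side of the max_concurrent trade-off is simple arithmetic. The sketch below applies the 2 KB per concurrent query rule of thumb from the list above, treating KB as KiB, and compares the result with the 170Mi EKS default memory limit; the candidate values are examples only.

```python
KIB_PER_CONCURRENT_QUERY = 2   # rule of thumb: about 2 KB per in-flight query
EKS_DEFAULT_LIMIT_MIB = 170    # EKS default CoreDNS memory limit


def forward_memory_mib(max_concurrent: int) -> float:
    """Approximate extra memory (MiB) used when max_concurrent queries are in flight."""
    return max_concurrent * KIB_PER_CONCURRENT_QUERY / 1024


def share_of_limit(max_concurrent: int, limit_mib: float = EKS_DEFAULT_LIMIT_MIB) -> float:
    """Fraction of the memory limit consumed by in-flight forwarded queries."""
    return forward_memory_mib(max_concurrent) / limit_mib


for value in (1000, 2000, 20000):
    print(f"max_concurrent={value:>5}: ~{forward_memory_mib(value):.1f} MiB "
          f"({share_of_limit(value):.1%} of the default limit)")
```

At these values the memory cost is small compared with the limit, which suggests that upstream latency and the ENI packet limit are more likely to be the binding constraint.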

Key Summary

🎯 Performance Benchmarks & Tuning Guide

CoreDNS core performance targets and tuning parameters

| Metric | Target | Critical | Note |
| --- | --- | --- | --- |
| Query Latency (P99) | < 50ms | > 100ms | 99% of DNS queries complete within 50ms |
| Throughput (QPS/Pod) | > 10K | < 5K | Process 10,000+ queries per second per Pod |
| Cache Hit Ratio | > 80% | < 50% | 80%+ cache utilization with TTL 30s baseline |
| Error Rate (SERVFAIL) | < 0.1% | > 1% | Keep SERVFAIL response ratio under 0.1% |
| CPU Utilization | < 60% | > 80% | CPU throttling at limit causes DNS latency |
| Memory Utilization | < 120Mi | > 150Mi | EKS default limit 170Mi. Alert above 150Mi |
| Tuning Parameters | | | |
| max_concurrent | 1000 | 2000+ | Concurrent query limit. Consider memory 2KB × concurrent queries |
| Replica Count | Auto-proportional | | Apply Cluster Proportional Autoscaler |
| lameduck | 30s | | Prevent DNS failures during rolling updates |

💡 Benchmark Tool: dnsperf -s <COREDNS_IP> -d queries.txt -c 10 -T 10 to measure CoreDNS QPS and latency.
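One way to apply the targets listed above to measured values is a small classifier. The sketch below encodes the Target and Critical columns as (target, critical) pairs; the "warning" band between them, the key names, and the sample values are assumptions added for illustration.

```python
from typing import Dict, Tuple

# (target, critical) pairs taken from the benchmark targets above.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "p99_latency_ms": (50, 100),
    "qps_per_pod": (10_000, 5_000),
    "cache_hit_ratio": (0.80, 0.50),
    "servfail_ratio": (0.001, 0.01),
    "cpu_utilization": (0.60, 0.80),
    "memory_mib": (120, 150),
}


def classify(value: float, target: float, critical: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one metric value."""
    if critical > target:  # higher values are worse (latency, errors, usage)
        if value > critical:
            return "critical"
        return "ok" if value < target else "warning"
    # lower values are worse (throughput, cache hit ratio)
    if value < critical:
        return "critical"
    return "ok" if value > target else "warning"


def evaluate(sample: Dict[str, float]) -> Dict[str, str]:
    """Classify every metric in the sample that has a known threshold."""
    return {name: classify(value, *THRESHOLDS[name])
            for name, value in sample.items() if name in THRESHOLDS}


# Hypothetical measurements, e.g. from a dnsperf run plus pod metrics.
report = evaluate({
    "p99_latency_ms": 42, "qps_per_pod": 7000, "cache_hit_ratio": 0.45,
    "servfail_ratio": 0.0005, "cpu_utilization": 0.85, "memory_mib": 130,
})
print(report)
```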

Monitoring Metrics: requests_total, request_duration_seconds, cache_hits/misses, responses_total{rcode}, CPU/memory
TTL Recommendations: Service records 30s, cache (success 30, denial 5-10), prefetch 5 60s
Monitoring: kube-prometheus-stack default dashboard + Alertmanager rules, scale-out with NodeLocal DNSCache if needed

Appendix: Configuration Examples

Recommended Corefile Configuration

.:53 {
  kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30           # Service/POD record TTL
  }

  cache 30 {         # Maximum 30-second retention
    success 10000 30 # capacity 10k, maxTTL 30s
    denial 2000 10   # negative cache 2k, maxTTL 10s
    prefetch 5 60s   # prefetch names queried 5+ times with no gap of 60s or more
  }

  forward . /etc/resolv.conf {
    max_concurrent 2000
    prefer_udp
  }

  prometheus :9153
  health {
    lameduck 30s
  }
  ready
  reload
  log
}

Alertmanager Rule Examples

- alert: CoreDNSHighErrorRate
  expr: >
    (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) /
     sum(rate(coredns_dns_requests_total[5m]))) > 0.01
  for: 10m
  labels:
    severity: critical
  annotations:
    description: "CoreDNS SERVFAIL rate > 1% for 10 min"

- alert: CoreDNSP99Latency
  expr: >
    histogram_quantile(0.99,
      sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)) > 0.05
  for: 5m
  labels:
    severity: warning

Large-Scale Clusters (>100 nodes or QPS > 5k)

NodeLocal DNSCache (DaemonSet form) caches at node level to reduce RTT
- Collect nodelocaldns metrics in Prometheus to compare with CoreDNS
CloudWatch Container Insights (EKS-specific)
- If Prometheus collection is difficult, use cwagent + adot-internal-metrics option to send CoreDNS container metrics to CloudWatch (separate charges apply)
