CoreDNS Monitoring and Optimization Guide
📅 Written: 2025-05-20 | Last Modified: 2026-02-18 | ⏱️ Reading Time: ~13 min
In Amazon EKS and modern Kubernetes clusters, CoreDNS serves as the core component responsible for all in-cluster service discovery and external domain name resolution. Because CoreDNS performance and availability directly impact application response times and stability, establishing an effective monitoring and optimization architecture is critical. This article analyzes CoreDNS performance monitoring metrics, TTL configuration guidelines, monitoring architecture best practices, AWS recommendations, and real-world case studies. Each section leverages Prometheus metrics and Amazon EKS environment examples to explore CoreDNS monitoring strategies.
1. CoreDNS Performance Monitoring: Key Prometheus Metrics and Their Meaning
CoreDNS provides Prometheus-format metrics through the metrics plugin, exposed by default on port 9153 of the kube-dns service in EKS. Core metrics reveal DNS request throughput, latency, errors, and caching efficiency, enabling rapid detection of DNS performance bottlenecks or failure indicators through monitoring.
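If Prometheus is not already scraping CoreDNS, a scrape job along the following lines can pick up the 9153 endpoints. This is a minimal sketch assuming pod-based service discovery and the standard k8s-app=kube-dns pod label used in EKS; adapt the relabeling to your environment (with the Prometheus Operator, the equivalent would be a PodMonitor).

```yaml
scrape_configs:
  - job_name: coredns
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]
    relabel_configs:
      # Keep only CoreDNS pods (labeled k8s-app=kube-dns in EKS).
      - source_labels: [__meta_kubernetes_pod_label_k8s_app]
        regex: kube-dns
        action: keep
      # Point the scrape target at the metrics port, 9153.
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: $1:9153
        target_label: __address__
      # Carry the pod name into a "pod" label for per-pod queries.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```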
CoreDNS 4 Golden Signals
CoreDNS Key Prometheus Metrics
Beyond these, additional metrics like request/response size (coredns_dns_request_size_bytes, coredns_dns_response_size_bytes), DO-bit counts (coredns_dns_do_requests_total), and plugin-specific metrics are available. For example, the forward plugin provides upstream query time (coredns_forward_request_duration_seconds), and the kubernetes plugin offers DNS programming latency relative to the API (coredns_kubernetes_dns_programming_duration_seconds).
Meaning and Application of Key Metrics
For example, use the rate of increase of coredns_dns_requests_total to determine DNS QPS, and divide by the CoreDNS pod count to verify balanced load distribution; if QPS rises continuously, evaluate whether CoreDNS needs to scale out. When the 99th percentile of coredns_dns_request_duration_seconds exceeds normal levels, CoreDNS is experiencing response delays: check for upstream DNS latency or CoreDNS CPU/memory saturation. If the cache hit rate (coredns_cache_hits_total) is low, verify whether a too-short TTL is reducing cache effectiveness and adjust accordingly. When coredns_dns_responses_total shows a rising SERVFAIL or REFUSED rate, check the logs for external-communication or permission problems. If NXDOMAIN responses spike for specific domains, applications may be querying incorrect domains; correct the application code.
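As a concrete starting point, the checks above can be expressed as PromQL along these lines (a sketch; it assumes a pod label from scrape relabeling, and cache metric names have shifted slightly across CoreDNS versions):

```promql
# DNS QPS, total and per pod (per-pod values should be roughly even)
sum(rate(coredns_dns_requests_total[5m]))
sum(rate(coredns_dns_requests_total[5m])) by (pod)

# 99th-percentile request latency
histogram_quantile(0.99,
  sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))

# Cache hit ratio
sum(rate(coredns_cache_hits_total[5m]))
/ (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))

# Share of SERVFAIL responses
sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
/ sum(rate(coredns_dns_responses_total[5m]))
```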
Additionally, system resource metrics (CPU/memory) are crucial. Monitor CoreDNS pod CPU/memory usage and alert when a pod approaches its resource limits. For example, the EKS default CoreDNS memory request/limit is 70Mi/170Mi, so alert when memory usage exceeds roughly 150Mi and take action such as raising the memory limit or adding pods. If CPU usage reaches the limit, the kernel's CFS quota throttles the CoreDNS process, causing DNS delays; when CPU usage approaches the limit, consider scaling out or increasing the resource allocation.
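These thresholds can be encoded as Prometheus alerting rules, for example as below. This is a sketch using cAdvisor container metrics; exact label names depend on your scrape configuration, and the thresholds follow the EKS defaults mentioned above.

```yaml
groups:
  - name: coredns-resources
    rules:
      - alert: CoreDNSHighMemory
        expr: |
          max(container_memory_working_set_bytes{namespace="kube-system",
              container="coredns"}) by (pod) > 150 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CoreDNS pod {{ $labels.pod }} memory above 150Mi (limit 170Mi)"
      - alert: CoreDNSCPUThrottled
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{namespace="kube-system",
              container="coredns"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CoreDNS pod {{ $labels.pod }} is being CPU throttled"
```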
Also note that each ENI can send at most 1024 packets per second to the Amazon-provided VPC DNS resolver. Even if CoreDNS's max_concurrent limit is raised, this ENI PPS limit (1024 PPS) may prevent reaching the desired throughput.
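On an EKS worker node, the ENA driver exposes counters that reveal whether this allowance is being hit; a quick check (run on the node itself, and note the interface name may differ):

```
# Packets queued or dropped because a network allowance was exceeded;
# linklocal_allowance_exceeded covers traffic to the VPC DNS resolver.
ethtool -S eth0 | grep -E "pps_allowance_exceeded|linklocal_allowance_exceeded"
```

A steadily increasing linklocal_allowance_exceeded counter indicates DNS packets are being dropped at the ENI, regardless of how CoreDNS itself is tuned.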
2. CoreDNS TTL Configuration Guide and Amazon EKS Application Examples
TTL (Time-To-Live) defines the valid cache time for DNS records, and appropriate TTL settings balance DNS traffic load versus information freshness. CoreDNS handles TTL at two levels:
- Authoritative record TTL: the TTL the kubernetes plugin attaches to responses for in-cluster domains (cluster.local, etc.), with a default of 5 seconds. Modify it via the ttl option in the kubernetes section of the CoreDNS Corefile; it is configurable from a minimum of 0 seconds (no caching) to a maximum of 3600 seconds.
- Cache TTL: the maximum time the cache plugin retains cached items, defaulting to 3600 seconds for successful responses and adjustable via the cache [TTL] form in the CoreDNS config. The specified TTL acts as an upper bound: if the actual DNS record TTL is shorter, items expire from the cache according to that shorter value. (The cache plugin's default minimum TTL is 5 seconds, adjustable via the MINTTL argument of its success/denial options.)
Amazon EKS Default CoreDNS Configuration
In the default CoreDNS Corefile deployed in EKS, no explicit TTL is specified for the kubernetes plugin, so the 5-second default applies, while the cache 30 setting caches all DNS responses for a maximum of 30 seconds. This means the TTL on internal service records in response packets is 5 seconds, but CoreDNS itself caches responses via the cache plugin for up to 30 seconds, avoiding repeated Kubernetes API lookups for identical requests. External domain query results are likewise cached for at most 30 seconds, so even external records with very long TTLs refresh after 30 seconds rather than lingering as excessively stale DNS information.
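For reference, the default EKS Corefile looks roughly like the following (exact contents vary by EKS and CoreDNS add-on version):

```
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

Note the absence of a ttl option in the kubernetes block (so the 5-second default applies) alongside the explicit cache 30.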
TTL Configuration Guidelines
Generally, a short TTL (e.g., 5 seconds or less) reflects DNS record changes (new service IPs, pod IP changes) quickly, but increases repeated queries from clients and DNS caches, raising CoreDNS load. Conversely, a long TTL (several minutes or more) reduces DNS query frequency and improves performance, but delays change propagation, increasing the chance of temporary connection failures due to stale records. The recommended approach is to raise the TTL moderately (into the tens of seconds) according to cluster size and workload patterns, increasing cache hit rates while avoiding serious staleness. Many Kubernetes environments use a TTL of around 30 seconds as a baseline.
Amazon EKS Application Example
To adjust TTL in EKS, modify the CoreDNS ConfigMap. For example, to increase internal domain cache time, add ttl 30 to the kubernetes cluster.local ... block in Corefile. This increases the cluster internal DNS response TTL field to 30 seconds, allowing client-side (NodeLocal DNSCache, application runtimes) caching for 30 seconds without re-querying. However, in Kubernetes environments, typical Linux glibc resolvers don't self-cache and query CoreDNS every time, so without auxiliary caches like NodeLocal DNSCache, increasing TTL provides limited client-side benefits. TTL adjustments primarily reduce CoreDNS load.
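The edit described above amounts to something like the following in the Corefile (reached via kubectl -n kube-system edit configmap coredns):

```
kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30    # raise the record TTL from the 5-second default to 30 seconds
}
```

CoreDNS's reload plugin picks up the ConfigMap change automatically after a short delay, so no pod restart is needed in the default EKS deployment.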
Some AWS services, such as Amazon Aurora, use a very low TTL (1 second) for DNS-based load balancing. In this case, the cache plugin's default 5-second minimum TTL over-caches the original 1-second TTL, distorting traffic distribution across Aurora reader endpoints. In such situations, introduce domain-specific low-TTL settings.
One real-world case resolved this by configuring the NodeLocal DNSCache Corefile with cache 1 (success and denial TTLs of 1 second) for the amazonaws.com zone, respecting the Aurora endpoint's original 1-second TTL. External services' TTL policies must therefore be considered when tuning CoreDNS TTL and cache strategy.
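A NodeLocal DNSCache server block along these lines implements that fix. This is a sketch modeled on the standard node-local-dns ConfigMap template; the __PILLAR__ placeholder, capacity value, and bind address all come from that template and deployment.

```
amazonaws.com:53 {
    errors
    cache {
        success 9984 1    # cap cached success-response TTL at 1 second
        denial 9984 1     # cap cached negative-response TTL at 1 second
    }
    reload
    loop
    bind 169.254.20.10
    forward . __PILLAR__UPSTREAM__SERVERS__
    prometheus :9253
}
```

Because this zone block is more specific than the catch-all block, only amazonaws.com queries get the 1-second cap, leaving cluster-internal caching behavior untouched.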
3. CoreDNS Monitoring Architecture Best Practices
The ideal CoreDNS monitoring architecture is built as a comprehensive observability pipeline including metric collection (Prometheus, etc.), log collection (Fluent Bit, etc.), plus visualization and alerting systems. In Amazon EKS environments, stable and scalable monitoring systems can be implemented by combining managed services and open-source tools.