CoreDNS Monitoring and Optimization Guide

Written: 2025-05-20 | Updated: 2026-02-18 | Reading time: ~13 min

In Amazon EKS and modern Kubernetes clusters, CoreDNS is the core component responsible for all in-cluster service discovery and external domain name resolution. Since CoreDNS performance and availability directly impact application response times and stability, building an effective monitoring and optimization architecture is critical. This article analyzes CoreDNS performance monitoring metrics, TTL configuration guide, monitoring architecture best practices, and AWS recommendations with real-world cases. Each section leverages Prometheus metrics and Amazon EKS environment examples.

1. CoreDNS Performance Monitoring: Key Prometheus Metrics

CoreDNS exposes Prometheus-format metrics through the metrics plugin, available by default in EKS on port 9153 of the kube-dns service. The core metrics cover DNS request throughput, latency, errors, and caching efficiency, enabling rapid detection of DNS performance bottlenecks or failure indicators.
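As a rough illustration of what that endpoint returns, the sketch below sums one counter across all of its label sets from a Prometheus text-format payload. The sample payload and the helper name sum_metric are invented for illustration; a real scrape of port 9153 carries many more series and labels.

```python
# Hypothetical sample of the Prometheus text format; all values are invented.
SAMPLE = """\
# TYPE coredns_dns_requests_total counter
coredns_dns_requests_total{proto="udp",server="dns://:53",type="A",zone="."} 1200
coredns_dns_requests_total{proto="udp",server="dns://:53",type="AAAA",zone="."} 800
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="success",zone="."} 1500
"""


def sum_metric(text: str, name: str) -> float:
    """Sum every sample of `name` across label sets (samples without timestamps)."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]  # drop the label set
        if metric == name:
            total += float(value)
    return total


print(sum_metric(SAMPLE, "coredns_dns_requests_total"))  # 2000.0
```

In practice a Prometheus server or the ADOT Collector does this parsing; the sketch only shows the shape of the data behind the metrics discussed below.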

CoreDNS 4 Golden Signals

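Core monitoring signals, based on the Google SRE methodology:

Traffic (coredns_dns_requests_total): DNS requests per second (QPS). Check that load is balanced across Pods, and consider scaling out if it keeps growing.
Latency (coredns_dns_request_duration_seconds): P99 response time. When it rises above normal, check upstream DNS latency or CoreDNS CPU/memory saturation.
Errors (coredns_dns_responses_total{rcode="SERVFAIL"}): When the share of SERVFAIL/REFUSED responses increases, check external connectivity or access control. A spike in NXDOMAIN points to queries for incorrect domain names.
Resource usage (CPU / memory utilization): The EKS default memory request/limit is 70Mi/170Mi; set an alert for usage above 150Mi. Reaching the CPU limit causes throttling and DNS latency.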

CoreDNS Key Prometheus Metrics

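Exposed by default in EKS on port 9153 (/metrics). Each entry gives the metric, its type and signal, a description, and a PromQL example.

coredns_dns_requests_total (Counter, traffic): Total DNS requests, by protocol and type. Use rate() to compute QPS.
PromQL: rate(coredns_dns_requests_total[5m])

coredns_dns_request_duration_seconds (Histogram, latency): Distribution of DNS processing time. When P99 exceeds 100ms, check upstream DNS and resources.
PromQL: histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))

coredns_dns_responses_total (Counter, errors): Distribution of DNS response codes. Track the share of SERVFAIL/NXDOMAIN.
PromQL: rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])

coredns_cache_hits_total (Counter, cache): Cache hits (success/denial). Used to compute the cache hit ratio.
PromQL: rate(coredns_cache_hits_total[5m])

coredns_cache_misses_total (Counter, cache): Cache misses. Hit ratio = hits / (hits + misses).
PromQL: rate(coredns_cache_misses_total[5m])

coredns_forward_requests_total (Counter): Requests forwarded to upstream DNS. These occur on cache misses.
PromQL: rate(coredns_forward_requests_total[5m])

coredns_forward_responses_total (Counter): Upstream DNS responses, by rcode. Use it to monitor upstream errors.
PromQL: rate(coredns_forward_responses_total[5m])

coredns_panics_total (Counter, stability): Number of CoreDNS panics. Investigate immediately if nonzero.
PromQL: coredns_panics_total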

Beyond these, additional metrics such as request/response size (coredns_dns_request_size_bytes, ...response_size_bytes), DO bit presence (coredns_dns_do_requests_total), and plugin-specific metrics are available. For example, the Forward plugin provides upstream query time (coredns_forward_request_duration_seconds), and the kubernetes plugin provides API update latency (coredns_kubernetes_dns_programming_duration_seconds).

Key Metric Meanings and Usage

Track the per-second rate of coredns_dns_requests_total to get DNS QPS, and break it down per CoreDNS Pod to verify that load is balanced. If QPS grows consistently, evaluate whether CoreDNS needs to scale out. When the p99 of coredns_dns_request_duration_seconds rises above its normal level, CoreDNS is responding slowly; check for upstream DNS delays or CoreDNS CPU/memory saturation. If the cache hit ratio (coredns_cache_hits_total relative to hits plus coredns_cache_misses_total) is low, check whether the TTL is too short. If coredns_dns_responses_total shows a rising share of SERVFAIL or REFUSED, check CoreDNS external connectivity or access control. A spike in NXDOMAIN for specific domains may indicate applications querying incorrect domain names.
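The arithmetic behind QPS and cache hit ratio can be sketched as follows. This is a minimal illustration with invented counter samples; the function names are hypothetical, and unlike PromQL's rate() the sketch does not handle counter resets.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter between two samples."""
    return (last - first) / window_seconds


def cache_hit_ratio(hits_per_s: float, misses_per_s: float) -> float:
    """hits / (hits + misses); returns 0.0 when there was no traffic."""
    total = hits_per_s + misses_per_s
    return hits_per_s / total if total else 0.0


# Invented counter samples taken 300 seconds apart.
qps = per_second_rate(first=1_000_000, last=1_450_000, window_seconds=300)
hits = per_second_rate(first=700_000, last=1_060_000, window_seconds=300)
misses = per_second_rate(first=300_000, last=390_000, window_seconds=300)

print(f"QPS: {qps:.0f}")                                  # QPS: 1500
print(f"hit ratio: {cache_hit_ratio(hits, misses):.2f}")  # hit ratio: 0.80
```

A ratio well below the level you normally see is the cue to revisit TTL settings.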

System resource metrics (CPU/memory) are also important. Monitor CoreDNS Pod CPU and memory utilization and alert when usage approaches the resource limits. The EKS default CoreDNS memory request/limit is 70Mi/170Mi, so track whether usage exceeds 150Mi. CPU throttling at the container's CPU limit adds DNS latency, so consider scaling when CPU usage approaches the limit.

VPC ENI DNS Packet Limit

Each ENI on a node can send only 1024 DNS packets per second to the VPC DNS resolver. Even if you raise the CoreDNS max_concurrent setting, this ENI limit (1024 PPS) can keep you from reaching the desired performance.
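A back-of-the-envelope check against this limit might look like the sketch below. It assumes, as a simplification, that every cache miss becomes exactly one packet to the upstream resolver, which overstates the load because cluster-internal names are answered without forwarding. The function names and the 4000 QPS figure are invented.

```python
ENI_DNS_PPS_LIMIT = 1024  # per-ENI limit quoted in this guide


def upstream_pps(node_qps: float, cache_hit_ratio: float) -> float:
    """Queries per second that miss the cache and are forwarded upstream."""
    return node_qps * (1.0 - cache_hit_ratio)


def eni_headroom(node_qps: float, cache_hit_ratio: float) -> float:
    """Packets per second left before the ENI limit; negative means over the limit."""
    return ENI_DNS_PPS_LIMIT - upstream_pps(node_qps, cache_hit_ratio)


# 4000 QPS arriving at the CoreDNS Pods on one node (invented figure).
print(f"{eni_headroom(4000, 0.8):.0f}")  # 224
print(f"{eni_headroom(4000, 0.5):.0f}")  # -976
```

In this model, raising the cache hit ratio or spreading CoreDNS Pods across more nodes changes the result; raising max_concurrent does not.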

2. CoreDNS TTL Configuration Guide and Amazon EKS Examples

TTL (Time-To-Live) defines the DNS record cache validity period, balancing DNS traffic load and information freshness. CoreDNS handles TTL at two levels:

Authoritative zone record (SOA) TTL: The kubernetes plugin response TTL for internal cluster domains (cluster.local, etc.), defaulting to 5 seconds. Configurable in the kubernetes section of the Corefile with the ttl option (min 0, max 3600 seconds).
Cache TTL: The cache plugin maximum cache retention time, defaulting to 3600 seconds (success responses). The specified TTL acts as an upper limit — if the actual DNS record TTL is shorter, the cache respects the shorter value.

CoreDNS TTL settings at a glance (balancing DNS traffic load against information freshness):

Kubernetes internal domains (kubernetes plugin): setting ttl 30; default 5s.

Amazon EKS Default CoreDNS Configuration

The default EKS CoreDNS Corefile leaves the kubernetes plugin at its 5-second default TTL (no explicit ttl) and sets cache 30, which caps how long any DNS response is cached at 30 seconds. Because the cache honors a shorter record TTL, internal service records (TTL 5 seconds in the response packet) are cached for about 5 seconds, while external domains are cached for up to 30 seconds.

TTL Configuration Guide

Short TTLs (≤5s) reflect DNS changes quickly but increase CoreDNS load. Long TTLs (minutes+) reduce query frequency but delay change propagation. The recommended approach is to moderately increase TTL (tens of seconds) to improve cache hit rate while avoiding severe information delays. Many Kubernetes environments use 30 seconds as a baseline.

Amazon EKS Application Examples

To adjust TTL in EKS, modify the CoreDNS ConfigMap. Add ttl 30 to the kubernetes cluster.local ... block. Note that standard Linux glibc resolver doesn't cache — without NodeLocal DNSCache, TTL increases mainly reduce CoreDNS's own load.

Aurora DNS Load Balancing Issue

Amazon Aurora uses a very low TTL (1 second) for DNS load balancing. The CoreDNS cache plugin's default minimum TTL of 5 seconds holds these 1-second records for 5 seconds, which distorts traffic distribution across the Aurora reader endpoint. Apply domain-specific low TTL settings for such cases.
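The interaction between a record's own TTL and the cache bounds can be modeled as a simple clamp. The sketch below is a simplified model, not the CoreDNS implementation; the function name is hypothetical and the defaults mirror the values quoted in this guide (3600-second maximum for success responses, 5-second minimum).

```python
def effective_cache_ttl(record_ttl: int, max_ttl: int = 3600, min_ttl: int = 5) -> int:
    """Clamp the record TTL into the range [min_ttl, max_ttl]."""
    return min(max(record_ttl, min_ttl), max_ttl)


print(effective_cache_ttl(1))                # 5  (Aurora's 1s TTL is raised to 5s)
print(effective_cache_ttl(300, max_ttl=30))  # 30 (capped by cache 30)
print(effective_cache_ttl(1, min_ttl=1))     # 1  (minimum lowered for that zone)
```

The last call corresponds to the domain-specific low TTL setting: once the minimum is lowered for the Aurora domain, the 1-second TTL is respected again.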

3. CoreDNS Monitoring Architecture Best Practices

CoreDNS monitoring architecture: AMP + ADOT vs. CloudWatch Container Insights

AMP + ADOT (managed open source)
Pipeline: ADOT Collector / Prometheus scrapes CoreDNS metrics → AMP remote write → Grafana (AMG) visualization
Pros:
+ Native PromQL queries
+ Long-term storage and support for large clusters
+ Terraform automation accelerator
Considerations:
- Requires installing ADOT or Prometheus
- Cost based on ingested metrics

CloudWatch Container Insights (AWS native)
Pipeline: CloudWatch Agent DaemonSet → scrape kube-dns:9153 → CloudWatch Metrics storage → CloudWatch dashboards/alarms
Pros:
+ AWS managed, no additional infrastructure required
+ Native CloudWatch Alarm integration
+ Usable as a data source in AMG
Considerations:
- CloudWatch metric collection and storage costs
- Uses CloudWatch query syntax rather than PromQL

Pipeline layers

Collection: ADOT Collector · Prometheus · CloudWatch Agent · Fluent Bit
↓
Storage: AMP (Prometheus) · CloudWatch Metrics · CloudWatch Logs
↓
Visualization: AMG (Grafana) · CloudWatch Dashboards
↓
Alerting: Alertmanager · CloudWatch Alarms · SNS / PagerDuty / Slack

Recommendation: With Prometheus Operator (kube-prometheus-stack), a ServiceMonitor can automatically scrape port 9153 of the kube-system/kube-dns service (k8s-app=kube-dns).

Metric Collection and Storage

Two common approaches in Amazon EKS:

Amazon Managed Service for Prometheus (AMP): Fully managed Prometheus-compatible service. Install ADOT Collector or Prometheus to scrape and forward CoreDNS metrics.
CloudWatch Container Insights: Use CloudWatch agent as DaemonSet to scrape CoreDNS metrics from kube-dns service port 9153.

ServiceMonitor Configuration

EKS's kube-dns service provides a metrics port. With Prometheus Operator, create a ServiceMonitor targeting the k8s-app=kube-dns label on port 9153.

Log Collection

Enable log or errors plugins as needed. Use Fluent Bit or Fluentd DaemonSets to collect CoreDNS stdout/stderr logs and export to CloudWatch Logs.

Log Collection Caution

Avoid excessive logging overhead. Set metadata caching (Kube_Meta_Cache_TTL=60) and reduce unnecessary field collection.

Visualization and Dashboards

Use Grafana (or Amazon Managed Grafana) to visualize CoreDNS metrics: QPS, latency histograms, error rates (rcode distribution), cache hit rates.

Alerting

Set alerts via Prometheus Alertmanager or CloudWatch Alarms:

CoreDNSDown: CoreDNS metrics unreported for 15+ minutes
HighDNSLatency: p99 latency exceeding 100ms
DNSErrorsSpike: SERVFAIL/NXDOMAIN ratio above threshold
ENIThrottling: ENI DNS packet limit exceeded
HighCoreDNSCPU/Memory: Resource utilization alerts
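For the HighDNSLatency alert, the p99 value comes from histogram buckets rather than from individual query timings. The sketch below shows the usual linear-interpolation estimate in simplified form, assuming cumulative buckets that end with a +Inf bucket. The bucket counts are invented and the function is an illustration, not Prometheus code.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative bucket counts (upper bounds in seconds) for one window.
BUCKETS = [(0.001, 500), (0.01, 900), (0.05, 980), (0.1, 996),
           (1.0, 1000), (math.inf, 1000)]

p99 = histogram_quantile(0.99, BUCKETS)
print(f"p99 = {p99 * 1000:.2f} ms, alert = {p99 > 0.1}")  # p99 = 81.25 ms, alert = False
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the alert threshold.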

4. Amazon EKS Best Practices and Customer Cases

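AWS-recommended CoreDNS optimization strategies and incident response cases:

Cluster Proportional Autoscaler (linear scaling with DNS QPS): The EKS default CoreDNS replica count is 2. Scale automatically with node count or CPU core count to spread DNS load.
NodeLocal DNSCache (lower RTT, ENI bottleneck removed): Run a DNS caching agent on every node (DaemonSet) to serve DNS locally, avoiding the network round trip and easing pressure on the ENI limit.
DNS packet limit and traffic distribution (avoiding the ENI PPS bottleneck): The VPC ENI limit is 1024 DNS packets per second. Spread CoreDNS Pods across nodes (Pod anti-affinity) so the load is spread across ENIs.
Graceful termination (lameduck, zero-downtime DNS rolling updates): Prevent transient DNS failures when CoreDNS restarts or scales in. Configure lameduck 30s plus a readiness probe on /ready.

Incident response cases

Case 1: DNS latency caused by the ENI PPS limit
Symptom: DNS responses for specific services were delayed, adding more than 1 second to overall response time.
Cause: Queries from CoreDNS to the VPC DNS Resolver hit the ENI PPS limit (1024 PPS), causing packet loss.
Resolution: Introduced NodeLocal DNSCache and spread CoreDNS Pods across nodes (anti-affinity).

Case 2: Reader skew caused by Aurora DNS TTL caching
Symptom: Sessions skewed across Aurora reader nodes, overloading some readers.
Cause: The Aurora reader endpoint DNS TTL is 1 second, but the CoreDNS minimum TTL of 5 seconds caused over-caching.
Resolution: In NodeLocal DNSCache, configured cache 1 with success/denial 1 for amazonaws.com.

Warning (ENI DNS packet limit): Each node ENI allows only 1024 DNS packets per second. Even if you raise the CoreDNS max_concurrent setting, the ENI PPS limit (1024 PPS) can still constrain performance.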

CoreDNS Horizontal Scaling

Default 2 replicas; use Cluster Proportional Autoscaler to scale based on node count or CPU cores.
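The scaling rule can be sketched as below, in the style of the autoscaler's linear mode: take the larger of the core-based and node-based replica counts, then apply the bounds. The parameter values (16 nodes and 256 cores per replica, a maximum of 100) are illustrative assumptions, not EKS defaults; the minimum of 2 matches the default replica count above.

```python
import math


def linear_replicas(nodes: int, cores: int, *, nodes_per_replica: int = 16,
                    cores_per_replica: int = 256, min_replicas: int = 2,
                    max_replicas: int = 100) -> int:
    """Replica count from cluster size, bounded by min_replicas and max_replicas."""
    by_cores = math.ceil(cores / cores_per_replica)
    by_nodes = math.ceil(nodes / nodes_per_replica)
    return max(min_replicas, min(max_replicas, max(by_cores, by_nodes)))


print(linear_replicas(nodes=3, cores=12))          # 2   (minimum applies)
print(linear_replicas(nodes=100, cores=800))       # 7   (node count dominates)
print(linear_replicas(nodes=5000, cores=40000))    # 100 (maximum applies)
```

Scaling on whichever ratio is larger keeps replica counts reasonable both for clusters of many small nodes and for clusters of a few large ones.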

NodeLocal DNSCache

For large clusters or high DNS traffic, deploy NodeLocal DNSCache DaemonSet for local DNS caching on every node.

DNS Packet Limits and Traffic Distribution

VPC DNS packet limit is 1024 PPS/ENI. Ensure CoreDNS Pods are distributed across nodes.

Graceful Termination (Lameduck & Ready Plugin)

Apply lameduck 30s and configure Readiness Probe on /ready endpoint.

Higher QPS Requirements

Increase max_concurrent to 2000+
Scale CoreDNS horizontally or deploy NodeLocal DNSCache
Monitor ENI limits via aws_ec2_eni_allowance_exceeded
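Before raising max_concurrent, it can help to estimate the extra memory it may consume, using the rule of thumb of roughly 2KB per concurrent query given in the summary of this guide. The sketch below is simple arithmetic under that assumption (treating KB as KiB); the function name is hypothetical.

```python
KB_PER_CONCURRENT_QUERY = 2  # rule of thumb quoted in this guide


def forward_memory_mib(max_concurrent: int) -> float:
    """Approximate memory held by in-flight forwarded queries, in MiB."""
    return max_concurrent * KB_PER_CONCURRENT_QUERY / 1024


for limit in (1000, 2000, 5000):
    print(f"max_concurrent {limit}: ~{forward_memory_mib(limit):.1f} MiB")
# max_concurrent 1000: ~2.0 MiB
# max_concurrent 2000: ~3.9 MiB
# max_concurrent 5000: ~9.8 MiB
```

Against the 170Mi default memory limit these amounts are small, which suggests that the ENI packet limit, not memory, is usually the first constraint when raising this setting.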

Key Summary

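Performance baselines and tuning guide: core CoreDNS performance targets and tuning parameters.

Query latency (P99): target < 50ms; critical > 100ms. 99% of DNS queries complete within 50ms.
Throughput (QPS/Pod): target > 10K; critical < 5K. Each Pod handles more than 10,000 queries per second.
Cache hit rate: target > 80%; critical < 50%. With a 30s TTL baseline, more than 80% of queries are served from cache.
Error rate (SERVFAIL): target < 0.1%; critical > 1%. Keep the share of SERVFAIL responses below 0.1%.
CPU utilization: target < 60%; critical > 80%. When CPU reaches its limit, throttling causes DNS latency.
Memory usage: target < 120Mi; critical > 150Mi. The EKS default limit is 170Mi; set an alert above 150Mi.

Tuning parameters

max_concurrent: 1000, or 2000+ for higher QPS requirements. Concurrent query limit; budget roughly 2KB of memory per concurrent query.
Replica count: automatic, proportional to node count. Apply Cluster Proportional Autoscaler.
lameduck: 30s. Prevents DNS failures during rolling updates.

Benchmarking tool: dnsperf -s <COREDNS_IP> -d queries.txt -c 10 -T 10 measures CoreDNS QPS and latency.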
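One way to put these thresholds to work is a small classifier that maps a measurement to ok, warning, or critical. The sketch below covers a subset of the rows in the table above. The metric keys are hypothetical names, and treating the band between target and critical as "warning" is an assumption rather than something the table states.

```python
# (target, critical, higher_is_better), taken from the table above.
THRESHOLDS = {
    "p99_latency_ms": (50, 100, False),
    "cache_hit_ratio": (0.80, 0.50, True),
    "servfail_ratio": (0.001, 0.01, False),
    "cpu_utilization": (0.60, 0.80, False),
    "memory_mib": (120, 150, False),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one measurement."""
    target, critical, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > target:
            return "ok"
        return "critical" if value < critical else "warning"
    if value < target:
        return "ok"
    return "critical" if value > critical else "warning"


print(classify("p99_latency_ms", 81.25))  # warning
print(classify("cache_hit_ratio", 0.85))  # ok
print(classify("memory_mib", 160))        # critical
```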

Monitoring Metrics: requests_total, request_duration_seconds, cache_hits/misses, responses_total{rcode}, CPU/memory
Recommended TTL: Service records 30s, cache (success 30, denial 5-10), prefetch 5 60s
Monitoring: kube-prometheus-stack dashboards + Alertmanager rules, NodeLocal DNSCache for scale-out

Appendix: Configuration Examples

Recommended Corefile

.:53 {
  kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30           # Service/Pod record TTL
  }

  cache 30 {         # Max 30s retention
    success 10000 30 # capacity 10k, maxTTL 30s
    denial 2000 10   # negative cache 2k, maxTTL 10s
    prefetch 5 60s   # refresh before expiry if 5+ identical queries
  }

  forward . /etc/resolv.conf {
    max_concurrent 2000
    prefer_udp
  }

  prometheus :9153
  health {
    lameduck 30s
  }
  ready
  reload
  log                # logs every query; consider removing on high-QPS clusters to limit logging overhead
}

Alertmanager Rule Examples

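- alert: CoreDNSHighErrorRate
  # SERVFAIL share of all responses. NXDOMAIN is excluded because it is
  # routine in Kubernetes (search-path expansion) and would mask real errors.
  expr: >
    (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) /
     sum(rate(coredns_dns_responses_total[5m]))) > 0.01
  for: 10m
  labels:
    severity: critical
  annotations:
    description: "CoreDNS SERVFAIL rate > 1% for 10 min"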

- alert: CoreDNSP99Latency
  expr: >
    histogram_quantile(0.99,
      sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)) > 0.05
  for: 5m
  labels:
    severity: warning

Large Clusters (>100 Nodes or QPS > 5k)

NodeLocal DNSCache (DaemonSet) for local caching and RTT reduction
CloudWatch Container Insights as alternative when Prometheus collection is difficult
