EKS Hybrid Nodes Complete Guide
📅 Written: 2025-08-20 | Last Modified: 2026-02-14 | ⏱️ Reading Time: ~6 min
Table of Contents
- Overview
- Prerequisites
- Networking and DNS Configuration
- Harbor Private Registry Installation
- EKS Hybrid Nodes Configuration
- Harbor and EKS Integration
- GPU Server Integration
- Cost Analysis and Optimization
- Dynamic Resource Allocation (DRA)
- Operations and Maintenance
Overview
This guide provides a complete adoption path for Amazon EKS Hybrid Nodes. Officially released in December 2024, EKS Hybrid Nodes lets you attach on-premises infrastructure to AWS EKS as worker nodes, so high-performance GPU servers and cloud resources can be managed within a single Kubernetes cluster.
Key Features:
- Unified management of on-premises and cloud
- Harbor 2.13 private registry integration
- H100 GPU server support
- Dynamic Resource Allocation (DRA)
- Flexible workload placement
Prerequisites
System Requirements
On-premises Nodes:
- Operating System: Ubuntu 20.04/22.04/24.04 LTS or RHEL 8/9
- Docker Engine 20.10.10+ (for Harbor)
- Container Runtime: containerd 1.6.x or later
- Minimum Hardware: 2 CPU cores, 4GB RAM
GPU Servers (Optional):
- NVIDIA Driver 550.x or later
- NVIDIA Container Toolkit
- H100/H200 GPU support
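Before moving on, it can help to verify these requirements on each candidate node. The commands below are a minimal sketch assuming Ubuntu or RHEL defaults; the GPU checks apply only to GPU servers.
# Verify OS, container runtime, and hardware prerequisites (run on each node)
grep PRETTY_NAME /etc/os-release
docker --version          # Harbor hosts: 20.10.10 or later
containerd --version      # 1.6.x or later
echo "CPU cores: $(nproc), Memory: $(free -h | awk '/Mem:/{print $2}')"
# GPU servers only: driver and container toolkit presence
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo "no NVIDIA driver detected"
nvidia-ctk --version 2>/dev/null || echo "NVIDIA Container Toolkit not installed"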
Network Requirements
| Item | Requirement |
|---|---|
| Bandwidth | Minimum 10Gbps (Direct Connect or VPN) |
| Latency | 5ms or less recommended |
| MTU | Jumbo Frame (9000) recommended |
Networking and DNS Configuration
Required Firewall Settings
Configure necessary firewall ports between on-premises and AWS:
| Protocol | Port | Direction | Purpose |
|---|---|---|---|
| TCP | 443 | Bidirectional | Kubernetes API server communication |
| TCP | 10250 | AWS → On-premises | Kubelet API |
| TCP/UDP | 53 | Bidirectional | DNS queries |
| TCP | 6443 | On-premises → AWS | Kubernetes API (alternative) |
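As an illustration, the rules below show one way to open the 443 and 10250 paths. The security group ID, VPC CIDR, and on-premises CIDR are placeholders, and the on-premises side assumes ufw; adapt them to your environment.
# Allow on-premises ranges to reach the EKS cluster security group on 443 (API server)
aws ec2 authorize-security-group-ingress \
  --group-id sg-eks-cluster \
  --protocol tcp \
  --port 443 \
  --cidr 192.168.0.0/16
# On the on-premises firewall, allow the AWS VPC CIDR to reach kubelet on 10250
sudo ufw allow from 10.0.0.0/16 to any port 10250 proto tcp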
Pod CIDR Firewall Configuration
It is recommended to register the entire Pod CIDR range in the firewall.
Configuration Methods:
- Complete CIDR Registration (Recommended): e.g., 10.244.0.0/16 (see the example rule after this list)
  - Flexible adaptation to dynamic Pod IP allocation
  - No additional firewall settings needed during pod scaling
  - Reduced management complexity
- Fixed IP Worker Nodes Only (Not Recommended):
  - Need to update firewall rules whenever pod IPs change
  - Increased operational complexity
  - Increased service disruption risk
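As referenced in the recommended option above, registering the whole Pod CIDR can be a single rule; the security group ID is a placeholder:
# Allow the entire on-premises Pod CIDR into the EKS cluster security group (all protocols)
aws ec2 authorize-security-group-ingress \
  --group-id sg-eks-cluster \
  --protocol -1 \
  --cidr 10.244.0.0/16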
Istio + Calico CNI Mixed Mode
Additional port configuration when using Istio service mesh and Calico CNI together:
| Component | Port | Purpose |
|---|---|---|
| Envoy Proxy | 15001 | Outbound traffic |
| Envoy Proxy | 15006 | Inbound traffic |
| Pilot | 15010 | xDS server |
| Istio Telemetry | 15004 | Mixer policy |
| Calico BGP | 179 | BGP peering |
| Calico Felix | 9099 | Metrics |
# Firewall rule example (AWS Security Group)
aws ec2 authorize-security-group-ingress \
--group-id sg-hybrid-nodes \
--protocol tcp \
--port 15001 \
--source-group sg-eks-cluster
aws ec2 authorize-security-group-ingress \
--group-id sg-hybrid-nodes \
--protocol tcp \
--port 179 \
--source-group sg-hybrid-nodes
DNS Configuration
Route 53 Resolver Inbound Endpoint (On-premises → AWS)
Purpose: Enable on-premises servers to query AWS internal domains
# Create Route 53 Resolver Inbound Endpoint
aws route53resolver create-resolver-endpoint \
--creator-request-id unique-id-123 \
--name hybrid-inbound-endpoint \
--security-group-ids sg-resolver-xxxxx \
--direction INBOUND \
--ip-addresses SubnetId=subnet-xxxxx,Ip=10.0.1.100 \
SubnetId=subnet-yyyyy,Ip=10.0.2.100
On-premises DNS Configuration (Example: BIND):
# /etc/named.conf
zone "eks.amazonaws.com" {
type forward;
forward only;
forwarders { 10.0.1.100; 10.0.2.100; };
};
Route 53 Resolver Outbound Endpoint (AWS → On-premises)
Purpose: Enable AWS worker nodes to query on-premises internal domains
# Create Outbound Endpoint
aws route53resolver create-resolver-endpoint \
--creator-request-id unique-id-456 \
--name hybrid-outbound-endpoint \
--security-group-ids sg-resolver-xxxxx \
--direction OUTBOUND \
--ip-addresses SubnetId=subnet-xxxxx \
SubnetId=subnet-yyyyy
# Create Resolver Rule
aws route53resolver create-resolver-rule \
--creator-request-id unique-id-789 \
--name on-prem-dns-rule \
--rule-type FORWARD \
--domain-name company.local \
--target-ips Ip=192.168.1.53,Port=53 Ip=192.168.1.54,Port=53 \
--resolver-endpoint-id rslvr-out-xxxxx
Bidirectional DNS Query Validation
# Query AWS domains from on-premises
dig @10.0.1.100 my-service.eks.amazonaws.com
# Query on-premises domains from AWS
dig harbor.company.local
CIDR Design
CIDR Design Principles
AWS VPC CIDR:
- Primary: 10.0.0.0/16 (65,536 IPs)
- Secondary (if needed): 10.1.0.0/16
On-premises CIDR:
- Existing network: 192.168.0.0/16
- Pod CIDR: 10.244.0.0/16
- Service CIDR: 10.96.0.0/16
Preventing Overlaps:
# Check CIDR overlaps
aws ec2 describe-vpcs --query 'Vpcs[*].CidrBlock'
# Check on-premises routing table
ip route show
Routing Configuration
# Create AWS Transit Gateway
aws ec2 create-transit-gateway \
--description "Hybrid connectivity" \
--options AmazonSideAsn=64512,AutoAcceptSharedAttachments=enable
# Create VPN Connection
aws ec2 create-vpn-connection \
--type ipsec.1 \
--customer-gateway-id cgw-xxxxx \
--transit-gateway-id tgw-xxxxx \
--options TunnelInsideIpVersion=ipv4,TunnelOptions=[{TunnelInsideCidr=169.254.10.0/30}]
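The Transit Gateway and VPN connection alone do not complete the path. As a sketch with placeholder resource IDs, the VPC also needs a Transit Gateway attachment and a route toward the on-premises range:
# Attach the VPC to the Transit Gateway
aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id tgw-xxxxx \
  --vpc-id vpc-xxxxx \
  --subnet-ids subnet-xxxxx subnet-yyyyy
# Route on-premises traffic from the VPC route table through the Transit Gateway
aws ec2 create-route \
  --route-table-id rtb-xxxxx \
  --destination-cidr-block 192.168.0.0/16 \
  --transit-gateway-id tgw-xxxxx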
Harbor Private Registry Installation
Download Harbor 2.13.2
# Download Harbor 2.13.2 (latest stable version)
wget https://github.com/goharbor/harbor/releases/download/v2.13.2/harbor-offline-installer-v2.13.2.tgz
# Extract archive
tar xvf harbor-offline-installer-v2.13.2.tgz
cd harbor
SSL/TLS Certificate Configuration
Generate Self-Signed Certificate
# 1. Generate CA certificate
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -sha512 -days 3650 \
-key ca.key \
-out ca.crt \
-subj "/C=US/ST=California/L=San Francisco/O=MyOrganization/CN=Harbor-CA"
# 2. Generate server certificate
openssl genrsa -out harbor.key 4096
openssl req -new -sha512 \
-key harbor.key \
-out harbor.csr \
-subj "/C=US/ST=California/L=San Francisco/O=MyOrganization/CN=harbor.yourdomain.com"
# 3. Create v3.ext file (SAN configuration)
cat > v3.ext <<EOF
authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1=harbor.yourdomain.com
DNS.2=yourdomain.com
IP.1=192.168.1.100
EOF
# 4. Sign certificate
openssl x509 -req -sha512 -days 3650 \
-extfile v3.ext \
-CA ca.crt -CAkey ca.key -CAcreateserial \
-in harbor.csr \
-out harbor.crt
# 5. Create certificate directory and copy files
mkdir -p /data/cert
cp harbor.crt /data/cert/
cp harbor.key /data/cert/
Harbor Configuration File Setup
# Copy and edit harbor.yml file
cp harbor.yml.tmpl harbor.yml
vi harbor.yml
Key configuration content:
# Hostname setting
hostname: harbor.yourdomain.com
# HTTPS configuration
https:
port: 443
certificate: /data/cert/harbor.crt
private_key: /data/cert/harbor.key
# Harbor admin password
harbor_admin_password: Harbor12345!
# Database configuration
database:
password: root123
max_idle_conns: 100
max_open_conns: 900
conn_max_lifetime: 5m
conn_max_idle_time: 0
# Data storage path
data_volume: /data
# Logging configuration
log:
level: info
local:
rotate_count: 50
rotate_size: 200M
location: /var/log/harbor
# Trivy vulnerability scanner configuration
trivy:
ignore_unfixed: false
skip_update: false
offline_scan: false
insecure: false
# Metrics configuration
metric:
enabled: true
port: 9090
path: /metrics
Harbor Installation
# Run installation preparation script
sudo ./prepare
# Install Harbor (with Trivy)
sudo ./install.sh --with-trivy
# Verify installation
docker-compose ps
Create Robot Account
# Create via Harbor UI or use API
curl -X POST "https://harbor.yourdomain.com/api/v2.0/robots" \
-H "Content-Type: application/json" \
-u "admin:Harbor12345!" \
-d '{
"name": "k8s-robot",
"duration": 365,
"description": "Robot account for Kubernetes",
"disable": false,
"level": "system",
"permissions": [
{
"namespace": "*",
"kind": "project",
"access": [
{
"resource": "repository",
"action": "pull"
}
]
}
]
}'
EKS Hybrid Nodes Configuration
Install nodeadm
# For x86_64 architecture
curl -OL 'https://hybrid-assets.eks.amazonaws.com/releases/latest/bin/linux/amd64/nodeadm'
# For ARM architecture (if needed)
# curl -OL 'https://hybrid-assets.eks.amazonaws.com/releases/latest/bin/linux/arm64/nodeadm'
# Grant execute permission
chmod +x nodeadm
sudo mv nodeadm /usr/local/bin/
# Verify version
nodeadm version
Install Required Components
# Install Kubernetes 1.33 supporting components
sudo nodeadm install 1.33 --credential-provider ssm
# Or when using IAM Roles Anywhere
# sudo nodeadm install 1.33 --credential-provider iam-ra
Create NodeConfig File
# nodeconfig.yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
cluster:
name: my-hybrid-cluster
region: ap-northeast-2
# Hybrid node configuration using SSM
hybrid:
ssm:
activationCode: "YOUR-ACTIVATION-CODE"
activationId: "YOUR-ACTIVATION-ID"
# Containerd configuration (Harbor registry setup)
containerd:
config: |
version = 2
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.yourdomain.com"]
endpoint = ["https://harbor.yourdomain.com"]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.yourdomain.com"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.yourdomain.com".auth]
username = "robot$k8s-robot"
password = "YOUR-ROBOT-TOKEN"
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.yourdomain.com".tls]
ca_file = "/etc/ssl/certs/harbor-ca.crt"
insecure_skip_verify = false
# Kubelet configuration
kubelet:
config:
shutdownGracePeriod: 30s
maxPods: 110
flags:
- --node-labels=node-type=hybrid,registry=harbor
Install Certificate
# Add CA certificate to system trust store
sudo cp ca.crt /usr/local/share/ca-certificates/harbor-ca.crt
sudo update-ca-certificates
# Create certificate directory for containerd
sudo mkdir -p /etc/containerd/certs.d/harbor.yourdomain.com
# Copy certificate
sudo cp ca.crt /etc/containerd/certs.d/harbor.yourdomain.com/ca.crt
# Restart containerd
sudo systemctl restart containerd
Node Initialization
# Initialize node using NodeConfig
sudo nodeadm init --config-source file://nodeconfig.yaml
# Verify node status
kubectl get nodes
Harbor and EKS Integration
Network Configuration
# Allow EKS nodes to access Harbor security group
aws ec2 authorize-security-group-ingress \
--group-id sg-harbor-xxxxx \
--protocol tcp \
--port 443 \
--source-group sg-eks-nodes-xxxxx \
--region ap-northeast-2
CoreDNS Configuration
# Modify CoreDNS ConfigMap
kubectl edit configmap coredns -n kube-system
# Add the following content
data:
Corefile: |
.:53 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
# Add Harbor DNS
hosts {
192.168.1.100 harbor.yourdomain.com
fallthrough
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
Create Kubernetes Secret
# Test docker login
docker login harbor.yourdomain.com
# Username: robot$k8s-robot
# Password: YOUR-ROBOT-TOKEN
# Create Kubernetes Secret
kubectl create secret docker-registry harbor-registry \
--docker-server=harbor.yourdomain.com \
--docker-username='robot$k8s-robot' \
--docker-password='YOUR-ROBOT-TOKEN' \
--docker-email=admin@yourdomain.com
# Copy Secret to all namespaces (optional)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
kubectl get secret harbor-registry -o yaml | \
sed "s/namespace: default/namespace: $ns/" | \
kubectl apply -f -
done
Testing and Validation
# 1. Verify network connectivity
curl -k https://harbor.yourdomain.com/api/v2.0/health
# 2. Test image pull directly from node
sudo crictl pull harbor.yourdomain.com/library/nginx:latest
# 3. Test Kubernetes Pod deployment
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: harbor-test
spec:
containers:
- name: nginx
image: harbor.yourdomain.com/library/nginx:latest
imagePullSecrets:
- name: harbor-registry
EOF
# 4. Verify Pod status
kubectl get pod harbor-test
kubectl describe pod harbor-test
GPU Server Integration
H100 GPU Server Integration
Validation Goal: Integrate 10 H100 GPU servers as EKS Hybrid Nodes
GPU Node Configuration
# nodeconfig-gpu.yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
cluster:
name: gpu-hybrid-cluster
region: ap-northeast-2
hybrid:
ssm:
activationCode: "ACTIVATION-CODE"
activationId: "ACTIVATION-ID"
kubelet:
config:
maxPods: 110
shutdownGracePeriod: 30s
flags:
- --node-labels=node-type=hybrid,gpu=h100,gpu-count=8
- --register-with-taints=nvidia.com/gpu=present:NoSchedule
containerd:
config: |
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
Deploy NVIDIA Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
# Verify GPU resources
kubectl get nodes -o json | jq '.items[].status.allocatable."nvidia.com/gpu"'
GPU Test
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
restartPolicy: Never
containers:
- name: cuda
image: nvidia/cuda:12.3.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
gpu: h100
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
On-premises Storage Access
NFS Mount Test
# Run on AWS worker node
sudo mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 \
192.168.1.100:/export/data /mnt/onprem-storage
# Performance measurement
dd if=/dev/zero of=/mnt/onprem-storage/testfile bs=1M count=1000 oflag=direct
PersistentVolume Configuration
apiVersion: v1
kind: PersistentVolume
metadata:
name: onprem-storage-pv
spec:
capacity:
storage: 10Ti
accessModes:
- ReadWriteMany
nfs:
server: 192.168.1.100
path: /export/data
mountOptions:
- vers=4.1
- rsize=1048576
- wsize=1048576
- hard
- timeo=600
- retrans=2
Cost Analysis and Optimization
Hybrid Nodes Pricing Structure
Base Pricing (February 2025):
- Per vCPU: $0.1099/hour
- Based on 730 hours/month: approximately $80.23/vCPU
H100 GPU Server Cost Analysis
H100/H200 GPU Server Specifications (DGX H200 basis):
- CPU: 224 vCPU (2x Intel Xeon Platinum 8592+)
- RAM: 2TB
- GPU: 8x H200 (141GB HBM3e)
Monthly Cost Calculation:
Single Node:
- 224 vCPU × $80.23 = $17,971.52/month
10 Nodes:
- $17,971.52 × 10 = $179,715.20/month
Due to the high number of vCPUs in H100 GPU servers, Hybrid Nodes costs are substantial. Review the following optimization approaches:
- Selective Workload Placement: Place only GPU-intensive workloads on Hybrid Nodes
- Spot Instances Mix: Utilize Spot instances for AWS workers
- Auto Scaling: Remove nodes during non-usage hours
- Reserved Capacity: Negotiate reserved options for long-term usage
Cost Reduction Strategies
1. Hybrid Workload Distribution
# GPU workload → On-premises
apiVersion: v1
kind: Pod
metadata:
name: gpu-training
spec:
nodeSelector:
gpu: h100
node-type: hybrid
containers:
- name: training
image: pytorch/pytorch:2.1-cuda12.1
resources:
limits:
nvidia.com/gpu: 8
---
# CPU workload → AWS EC2
apiVersion: v1
kind: Pod
metadata:
name: web-api
spec:
nodeSelector:
eks.amazonaws.com/compute-type: ec2
containers:
- name: api
image: nginx:latest
2. Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
- name: cluster-autoscaler
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --skip-nodes-with-system-pods=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
3. Cost Monitoring
# Track Hybrid Nodes costs using AWS Cost Explorer API
aws ce get-cost-and-usage \
--time-period Start=2025-02-01,End=2025-02-28 \
--granularity MONTHLY \
--metrics BlendedCost \
--filter file://filter.json
# filter.json
{
"Dimensions": {
"Key": "SERVICE",
"Values": ["Amazon Elastic Kubernetes Service - Hybrid Nodes"]
}
}
Workload Distribution Strategy
On-premises GPU Workers:
- AI/ML training workloads
- High-performance inference services
- Data-intensive processing
AWS CPU Workers:
- Web applications and APIs
- Microservices
- Lightweight batch jobs
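To confirm the distribution strategy is working, a quick check like the following can help; it assumes the node labels set earlier via the kubelet flags (node-type, gpu) are in place.
# List nodes with their placement labels
kubectl get nodes -L node-type,gpu
# Show which pods landed on the first H100 hybrid node
kubectl get pods -A -o wide --field-selector spec.nodeName=$(kubectl get nodes -l gpu=h100 -o jsonpath='{.items[0].metadata.name}')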
Dynamic Resource Allocation (DRA)
Dynamic Resource Allocation (DRA) is a feature introduced in Kubernetes 1.26 that enables Pods to dynamically allocate resources such as GPUs, NPUs, and specialized accelerators when requested. This improves upon traditional static resource allocation methods, enabling more efficient resource utilization.
Enable DRA
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: hybrid-dra-cluster
region: ap-northeast-2
version: "1.30"
kubernetesNetworkConfig:
serviceIPv4CIDR: 10.100.0.0/16
managedNodeGroups:
- name: cpu-nodes
instanceType: m5.xlarge
desiredCapacity: 3
minSize: 1
maxSize: 10
labels:
node-type: cpu
workload: general
- name: gpu-nodes
instanceType: g5.xlarge
desiredCapacity: 2
minSize: 1
maxSize: 5
labels:
node-type: gpu
workload: ml
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
Deploy DRA Driver
# Add DRA driver helm repository
helm repo add dra-driver https://charts.dra.io
helm repo update
# Install DRA driver
helm install dra-driver dra-driver/dra-driver \
--namespace kube-system \
--set driver.name=eks-hybrid-dra \
--set driver.enableGPU=true \
--set driver.enableCPU=true
Define Resource Classes
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
name: gpu-compute-class
spec:
driverName: eks-hybrid-dra
suitableNodes:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["gpu"]
parametersRef:
name: gpu-parameters
namespace: kube-system
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
name: cpu-compute-class
spec:
driverName: eks-hybrid-dra
suitableNodes:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["cpu"]
parametersRef:
name: cpu-parameters
namespace: kube-system
Configure Resource Claims
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
name: ml-training-claim
namespace: ml-workloads
spec:
spec:
resourceClassName: gpu-compute-class
allocationMode: WaitForFirstConsumer
parametersRef:
name: ml-training-params
namespace: ml-workloads
---
apiVersion: v1
kind: ConfigMap
metadata:
name: ml-training-params
namespace: ml-workloads
data:
gpu-count: "1"
gpu-memory: "16Gi"
cuda-version: "12.2"
ML Training Job Example
apiVersion: batch/v1
kind: Job
metadata:
name: ml-training-job
namespace: ml-workloads
spec:
template:
metadata:
labels:
app: ml-training
spec:
resourceClaims:
- name: gpu-resource
source:
resourceClaimTemplateName: ml-training-claim
containers:
- name: training
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
command: ["python", "train.py"]
resources:
claims:
- name: gpu-resource
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
restartPolicy: OnFailure
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
DRA Monitoring
Monitor these critical metrics for DRA performance:
- dra_allocation_duration_seconds - Time to allocate resources
- dra_allocation_errors_total - Failed allocation attempts
- dra_resource_utilization_ratio - Resource usage efficiency
- dra_pending_claims_total - Unscheduled resource claims
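One way to pull these series is through the Prometheus HTTP API, sketched below; it assumes the DRA driver exposes metrics under exactly these names and that Prometheus is reachable at the address shown.
# 95th-percentile allocation latency over the last 5 minutes (assumes a histogram metric)
curl -s 'http://prometheus.kube-system.svc:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(dra_allocation_duration_seconds_bucket[5m]))'
# Allocation error rate and currently pending claims
curl -s 'http://prometheus.kube-system.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(dra_allocation_errors_total[5m]))'
curl -s 'http://prometheus.kube-system.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(dra_pending_claims_total)'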
Operations and Maintenance
Security Hardening
# Enable Harbor vulnerability scan automation
curl -X PUT "https://harbor.yourdomain.com/api/v2.0/projects/1" \
-H "Content-Type: application/json" \
-u "admin:Harbor12345!" \
-d '{
"metadata": {
"auto_scan": "true",
"prevent_vul": "true",
"severity": "high"
}
}'
# Image signing: Harbor 2.6+ no longer bundles Notary, so Harbor 2.13 relies on
# Cosign/Notation for signing. The legacy Docker Content Trust variables below apply
# only if a separate Notary server is available.
export DOCKER_CONTENT_TRUST=1
export DOCKER_CONTENT_TRUST_SERVER=https://harbor.yourdomain.com:4443
Backup and Recovery
#!/bin/bash
# harbor-backup.sh
BACKUP_DIR="/backup/harbor-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR
# 1. Backup Harbor configuration
cp -r /data/harbor $BACKUP_DIR/
# 2. Backup database
docker exec harbor-db pg_dump -U postgres registry > $BACKUP_DIR/registry.sql
# 3. Backup registry data (optional)
tar -czf $BACKUP_DIR/registry-data.tar.gz /data/registry
echo "Backup completed: $BACKUP_DIR"
Monitoring
Prometheus Metrics Collection
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'harbor'
static_configs:
- targets: ['harbor.yourdomain.com:9090']
metrics_path: '/metrics'
Key Monitoring Metrics
- Registry request rate
- Authentication failure count
- Storage usage
- Database connection count
- API response time
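A quick way to see which of these are available is to scrape the metrics endpoint enabled in harbor.yml (port 9090, path /metrics) and filter for likely series; exact metric names vary by Harbor version, so treat the grep pattern as a starting point.
# Inspect Harbor metrics and filter for request, auth, and storage-related series
curl -s http://harbor.yourdomain.com:9090/metrics | grep -iE 'request|auth|quota|storage' | head -30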
Performance Validation
Direct Connect Performance Test
# 1. Basic connectivity test
ping -c 100 <aws-endpoint>
# 2. Check MTU optimization
ping -M do -s 8972 <aws-endpoint>
# 3. Trace route
traceroute -n <aws-endpoint>
# 4. Measure bandwidth
iperf3 -c <aws-endpoint> -t 60 -P 10 -w 512K
Performance Baseline
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Latency | < 5ms | 5-10ms | > 10ms |
| Jitter | < 2ms | 2-5ms | > 5ms |
| Packet Loss | < 0.01% | 0.01-0.1% | > 0.1% |
| Bandwidth | > 10Gbps | 5-10Gbps | < 5Gbps |
Troubleshooting
ImagePullBackOff Error
# Diagnose problem
kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>
# Check Secret
kubectl get secret harbor-registry -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Certificate Error
# Install CA certificate on all nodes
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: harbor-ca-installer
namespace: kube-system
spec:
selector:
matchLabels:
name: harbor-ca-installer
template:
metadata:
labels:
name: harbor-ca-installer
spec:
hostNetwork: true
hostPID: true
containers:
- name: installer
image: busybox
command: ['sh', '-c']
args:
- |
echo "Installing Harbor CA certificate..."
cp /ca-cert/ca.crt /host/usr/local/share/ca-certificates/harbor-ca.crt
chroot /host update-ca-certificates
chroot /host systemctl restart containerd
sleep 3600
volumeMounts:
- name: ca-cert
mountPath: /ca-cert
- name: host
mountPath: /host
securityContext:
privileged: true
volumes:
- name: ca-cert
configMap:
name: harbor-ca
- name: host
hostPath:
path: /
EOF
DNS Resolution Failure
# Test DNS
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup harbor.yourdomain.com
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
Conclusion
EKS Hybrid Nodes provides an integrated Kubernetes environment spanning on-premises and cloud. Key success factors covered in this guide:
- Proper Networking Configuration: Complete Pod CIDR firewall registration and bidirectional DNS configuration
- Certificate Management: When using self-signed certificates, install the CA certificate on every node
- Cost Optimization: Establish hybrid distribution strategies according to workload characteristics
- Dynamic Resource Allocation: Efficient GPU resource management using DRA
- Continuous Validation: Configuration validation through step-by-step testing
Before adoption, prioritize reviewing:
- Securing low-latency connectivity through Direct Connect
- Cost optimization strategies for the high vCPU count of H100 GPU servers
- Verifying real-world performance and stability through a PoC