Cross-Cluster Object Replication (HA) Architecture Guide

📅 Created: 2026-03-24 | Updated: 2026-03-24 | ⏱️ Reading time: ~12 min

📌 Reference Environment: EKS 1.32+, ArgoCD 2.13+, Flux v2.4+, Velero 1.15+

1. Overview

Relying on a single EKS cluster in production environments means that cluster failure results in complete service outage. Cross-Cluster Object Replication is a strategy to ensure high availability by consistently replicating Kubernetes objects (ConfigMap, Secret, RBAC, CRD, NetworkPolicy, etc.) across multiple clusters.

Current Situation

EKS does not provide a managed Cross-Cluster Object Replication feature. Therefore, you must implement it yourself by combining open-source tools and architecture patterns. This guide compares the pros and cons of each pattern and presents selection criteria based on workload types.

Scope of This Guide

| Included | Excluded |
|---|---|
| K8s object replication (ConfigMap, Secret, CRD, RBAC, etc.) | Application data replication (DB replicas) |
| GitOps-based declarative synchronization | Service mesh-based traffic routing |
| Stateful object backup/restore (Velero) | Storage layer replication (EBS, EFS) |
| DNS failover strategy | Application-level HA patterns |

2. Multi-Cluster Architecture Pattern Comparison

There are three core patterns for implementing Cross-Cluster Object Replication.

Pattern 1: API Proxy (Push Model)

A central routing layer directly proxies CRUD requests to each cluster's API Server.

  • How it works: Central layer makes direct API calls to each cluster
  • Advantages: Lightweight and intuitive
  • Limitations: Weak credential security (the central layer must hold credentials for every cluster), no efficient Watch across clusters, connection complexity grows with cluster count

Pattern 2: Multi-cluster Controller (Kubefed Family)

A central controller monitors each cluster's state via Informer-based List-Watch and synchronizes through CRDs.

  • How it works: Central controller watches and synchronizes each cluster's state
  • Advantages: Dynamic cluster discovery, Federation policy application
  • Limitations: Watch event overflow at ~10+ clusters, Informer cache size limits, risk of plaintext credential storage
Kubefed Project Status

Kubefed (v2) in Kubernetes SIG is effectively in maintenance mode. Not recommended for new projects.

Pattern 3: Agent-based Pull Model

Agents in each cluster Pull the desired state from a central source (Git or a hub cluster) and Reconcile locally. This follows the same principle as the kubelet receiving Pod specs and executing them locally.

  • How it works: Each cluster agent independently Pulls desired state and Reconciles locally
  • Advantages: High scalability, Eventual Consistency, maintains local operation despite central failure
  • Limitations: Requires agent deployment to all clusters
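
The pull-and-reconcile principle above can be sketched in a few lines. This is illustrative only: real agents such as Flux and ArgoCD operate against the Kubernetes API with informers and server-side apply, and the object keys here are simplified stand-ins.

```python
# Sketch of the agent-based pull model: each cluster's agent pulls the
# desired state, diffs it against the live state, and reconciles locally.

def diff_states(desired: dict, live: dict) -> dict:
    """Compute the actions a local reconciler would take."""
    actions = {"create": [], "update": [], "delete": []}
    for key, spec in desired.items():
        if key not in live:
            actions["create"].append(key)
        elif live[key] != spec:
            actions["update"].append(key)  # drift detected
    for key in live:
        if key not in desired:
            actions["delete"].append(key)  # pruning, if enabled
    return actions

def reconcile(desired: dict, live: dict) -> dict:
    """Apply the diff locally. A central outage only stops new pulls;
    the last-known desired state keeps being enforced."""
    for key in diff_states(desired, live)["delete"]:
        live.pop(key)
    live.update(desired)
    return live
```

Because the loop runs entirely inside the cluster, this is also where the failure-isolation property comes from: one agent crashing affects only its own cluster.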

Pattern Comparison Summary

| Aspect | API Proxy | Multi-cluster Controller | Agent-based Pull |
|---|---|---|---|
| Operation Mode | Central → Cluster Push | Central Watch + CRD Sync | Cluster → Central Pull |
| Scalability | Low (proportional to connections) | Medium (~10 clusters) | High (hundreds of clusters) |
| Complexity | Low | High | Medium |
| Security | Weak (multiple credentials) | Weak (plaintext storage) | Strong (agent local permissions) |
| Failure Isolation | Low | Medium | High |
| Drift Detection | None | Partial | Built-in |
| Recommended Scenario | PoC, small scale | Legacy environments | Production (recommended) |

Decision Flowchart

(Decision flowchart omitted. In short, per the comparison above: production at scale → Agent-based Pull; small-scale PoC → API Proxy; legacy Federation environments → Multi-cluster Controller.)

3. Recommended Architecture: GitOps-Based Replication

Option A: Independent GitOps Agents per Cluster (Recommended)

Use a Git repository as the Single Source of Truth, and have each cluster's GitOps agent independently Pull & Reconcile.

Key Benefits:

  • Drift Detection: Automatically detects and recovers when cluster state differs from Git
  • Audit Trail: All change history preserved as Git commits
  • Declarative Management: Declare the desired state in Git; agents Reconcile each cluster toward it
  • Failure Isolation: Agent failure in one cluster doesn't affect other clusters

Active-Active Configuration:

Both clusters independently Pull from the same Git repository. DNS (Route 53) distributes traffic, and when one cluster fails, the remaining cluster immediately handles all traffic.

Active-Passive Configuration:

Only the Active cluster enables the GitOps agent. The Passive cluster keeps its agent in Suspended state and activates it during failover.
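
With Flux, keeping the Passive cluster's agent suspended is a one-field toggle. A minimal sketch, assuming Flux v2 is installed and the `apps` Kustomization and `platform-repo` GitRepository names are illustrative:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  suspend: true          # Passive cluster: reconciliation paused
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: platform-repo
  path: ./clusters/prod
  prune: true
```

Failover then amounts to patching `spec.suspend` to `false` (or running `flux resume kustomization apps`) on the Passive cluster.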

Option B: ArgoCD Hub-and-Spoke Model

Install ArgoCD on a Management Cluster and deploy to multiple workload clusters via ApplicationSets.

HA Configuration Strategies:

| Strategy | Description | Suitable Scenario |
|---|---|---|
| Active-Passive Mirroring | Deploy ArgoCD in two regions, disable the controller in Passive; manual scale-up during failover | Environments with low DR requirements |
| Active-Active Sync Windows | Two ArgoCD instances sync in non-overlapping time windows (Sync Windows feature) | Active-Active requiring conflict prevention |
ApplicationSets Generator

Using ArgoCD ApplicationSets' Cluster Generator, you can automatically deploy applications to all clusters registered with ArgoCD. Replication starts immediately when a new cluster is added without additional configuration.
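
As a sketch, a Cluster Generator ApplicationSet looks like the following (the app name, repo URL, and path are placeholders; `{{name}}` and `{{server}}` are filled in by ArgoCD for each registered cluster):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: guestbook
  namespace: argocd
spec:
  generators:
  - clusters: {}            # matches every cluster registered with ArgoCD
  template:
    metadata:
      name: '{{name}}-guestbook'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/platform-repo
        targetRevision: HEAD
        path: guestbook
      destination:
        server: '{{server}}'
        namespace: guestbook
```

An empty `clusters: {}` generator selects all clusters; label selectors can narrow it to a subset.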

Option C: Custom Controller (MirrorController Pattern)

When fine-grained control over object replication is needed, develop a dedicated controller to manage synchronization between source and target clusters.

Application Scenarios:

  • Selective replication of objects with specific Labels/Annotations only
  • Object transformation needed during replication (e.g., Namespace changes, field modifications)
  • Custom conflict resolution logic implementation required
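
The first two scenarios can be sketched as a selection-and-transform step. This is a hypothetical illustration (the label key is invented, and real controllers work through the Kubernetes API rather than plain dicts):

```python
import copy

# Hypothetical opt-in label for replication.
REPLICATE_LABEL = "replication.example.com/enabled"

def select_for_replication(objects: list[dict]) -> list[dict]:
    """Keep only objects that explicitly opted in via label."""
    return [
        o for o in objects
        if o.get("metadata", {}).get("labels", {}).get(REPLICATE_LABEL) == "true"
    ]

def transform(obj: dict, target_namespace: str) -> dict:
    """Deep-copy and rewrite fields that must differ on the target cluster."""
    mirrored = copy.deepcopy(obj)
    mirrored["metadata"]["namespace"] = target_namespace
    # Strip cluster-specific fields the target API server will regenerate.
    mirrored["metadata"].pop("resourceVersion", None)
    mirrored["metadata"].pop("uid", None)
    return mirrored
```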

Pros and Cons:

| Advantages | Disadvantages |
|---|---|
| Clear separation of concerns | Additional operational overhead |
| Reduced core logic complexity | Potential synchronization delays |
| Fine-grained replication policy control | Increased debugging complexity |
| Customizable conflict resolution | Direct development/maintenance required |

4. Active-Active vs Active-Passive Decision

Comparison Table

| Aspect | Active-Active | Active-Passive |
|---|---|---|
| Object Synchronization | Both clusters Pull independently from the same Git source | Only Active Reconciles; Passive waits |
| Failover Time | Near zero (both already serving) | A few minutes (Passive activation needed) |
| Conflict Resolution | Write conflicts possible; prevent with Sync Windows, etc. | No conflicts (single writer) |
| Operational Complexity | High (object IDs, DNS, state sync) | Low (standard failover model) |
| Cost | High (both operate at full capacity) | Low (Passive can run at reduced capacity) |
| Suitable Scenario | Multi-region HA, global load balancing | DR, cost-sensitive HA |

5. Supporting Tool Stack

Object replication alone cannot achieve complete Cross-Cluster HA. Combine the following tools to build the full stack.

| Tool | Role | Notes |
|---|---|---|
| Flux / ArgoCD | K8s object replication (GitOps) | Core replication mechanism |
| Route 53 | DNS-based failover/load balancing | Health Check + Failover Routing |
| Global Accelerator | Anycast IP-based global routing | For multi-region Active-Active |
| Velero | Stateful object backup/restore (PV, etcd) | Integrates with S3 Cross-Region Replication |
| External Secrets Operator | Secret synchronization | AWS Secrets Manager → both clusters |
| Crossplane / ACK | AWS resource definition sync | Manage IaC as K8s objects |
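
The External Secrets Operator row deserves a concrete sketch, since it removes Secrets from the Git replication path entirely. Assuming ESO is installed and a `ClusterSecretStore` named `aws-secrets-manager` exists (the store, object names, and Secrets Manager key are illustrative), the same manifest applied to both clusters keeps their Secrets in sync from one shared source:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: prod
spec:
  refreshInterval: 1h          # re-pull from AWS Secrets Manager hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-credentials      # K8s Secret created/updated in-cluster
  data:
  - secretKey: password
    remoteRef:
      key: prod/app/credentials
      property: password
```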

Tool Combination Architecture

(Architecture diagram omitted: GitOps agents in both clusters pull from the shared Git repository, Route 53 fronts both clusters with health-checked routing, Velero backs up to S3 with Cross-Region Replication, and External Secrets Operator syncs from AWS Secrets Manager.)

6. Current Limitations and Future Outlook

There are features in EKS multi-cluster management that are not yet provided as managed services.

| Area | Current Status | Alternative |
|---|---|---|
| Managed ClusterSets | Not released | Group cross-account with RAM (Resource Access Manager) |
| Built-in Cross-Cluster Replication | Not released | GitOps (Flux/ArgoCD) |
| Multi-Region EKS Cluster | Not released | Independent cluster per region + GitOps sync |
| Managed ArgoCD | In development | Self-install/operate ArgoCD |

Practical Approach

Until these features are released, the GitOps + supporting tool stack combination is the most mature and proven approach. Approximately 10% of EKS customers have already adopted Flux/ArgoCD-based GitOps.


7. Recommended Configuration Summary

The final recommended tool combination to eliminate single-cluster dependency:

| Purpose | Recommended Tool | Configuration Method |
|---|---|---|
| K8s Object Replication | GitOps (Flux or ArgoCD) | Both clusters Pull from the same Git repo |
| Stateful Data Protection | Velero + S3 Cross-Region Replication | Regular backup + cross-region replication |
| Secret Synchronization | External Secrets Operator | AWS Secrets Manager as shared source |
| DNS Failover | Route 53 Health Checks | Active-Active or Failover Routing |
| CRD/Custom Resources | Include in GitOps repo | Manage like standard K8s objects |
| AWS Resource Definitions | Crossplane or ACK | Synchronize IaC K8s-natively |
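
For the stateful-data row, scheduled backups are declared to Velero as a CRD, so the backup policy itself can live in the GitOps repo. A minimal sketch (schedule, name, and TTL are illustrative; the S3 bucket behind Velero is assumed to have Cross-Region Replication enabled):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"        # cron: daily at 01:00
  template:
    includedNamespaces:
    - "*"
    ttl: 720h0m0s              # retain backups for 30 days
```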

8. Implementation Priority

  1. P0: Deploy GitOps agents + Design Git repo structure
  2. P1: Configure External Secrets Operator + Route 53 Health Check
  3. P2: Establish Velero backup policy + S3 Cross-Region Replication
  4. P3: Synchronize AWS resources with Crossplane/ACK (if needed)

