Reference Architecture

This section provides production deployment and configuration guides for the Agentic AI Platform. Concepts and design principles are covered in the Documentation section; here we focus on specific configurations, YAML manifests, and verification procedures for deploying and operating on actual clusters.

Documentation vs Reference Architecture
| Aspect | Documentation | Reference Architecture |
| --- | --- | --- |
| Focus | Architecture concepts, design principles, technology comparison | Production deployment procedures, manifests, verification |
| Audience | Decision makers, architects | Platform engineers, DevOps |
| Deliverables | Architecture documents, decision records | Deployable YAML, scripts, checklists |
| Update Cadence | On design changes | As deployment/operations experience accumulates |

Platform Architecture

The complete architecture of the Agentic AI Platform, including the Ontology-based Knowledge Feature Store, 6-layer structure, and model serving/fine-tuning pipelines.

Architecture Overview

The Reference Architecture consists of six areas, deployed in the sequence described below.

Deployment Sequence

The Reference Architecture is deployed in the following order. Each phase depends on the outputs of the previous one, so the order must be followed.

Phase 1: GPU Infrastructure Setup

Configure the EKS cluster and GPU node groups. This phase covers the differences between Auto Mode and Standard Mode, along with considerations when installing the GPU Operator; a configuration sketch follows the table below.

| Item | Details |
| --- | --- |
| EKS Version | 1.32+ (1.33 recommended) |
| Node Group | MNG p5en.48xlarge (Spot) |
| GPU Operator | devicePlugin.enabled=false (to prevent Auto Mode conflicts) |
| Monitoring Agents | DCGM Exporter, GFD, Node Status Exporter |
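
A minimal eksctl sketch for this phase is shown below. The cluster name, region, and sizing values are placeholders rather than values prescribed by this architecture; treat it as a starting point, not a final manifest.

```yaml
# Hypothetical eksctl ClusterConfig for the Phase 1 GPU node group.
# Name, region, and capacity figures are illustrative placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: agentic-ai-platform   # placeholder name
  region: us-west-2           # placeholder region
  version: "1.33"             # 1.32+ required, 1.33 recommended

managedNodeGroups:
  - name: gpu-p5en-spot
    instanceTypes: ["p5en.48xlarge"]
    spot: true                # Principle 2: Spot pricing
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
    volumeSize: 500           # GiB; room for images and model staging
    labels:
      workload: llm-inference
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule    # keep non-GPU pods off these nodes
```

When the GPU Operator chart is installed afterwards, passing devicePlugin.enabled=false (as the table notes) avoids conflicting with the device plugin that Auto Mode already manages.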

Phase 2: Model Deployment

Serve large open-source models with vLLM. This phase covers custom image building, S3 model caching, and multi-node deployment considerations; a Deployment sketch follows the table below.

| Item | Details |
| --- | --- |
| Serving Engine | vLLM (custom image) |
| Model Cache | S3 → s5cmd → NVMe emptyDir |
| Parallelism | Tensor Parallelism (single node recommended) |
| Validation | OpenAI-compatible API endpoint |
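
The Deployment below sketches the S3 → s5cmd → NVMe emptyDir path and single-node tensor parallelism. The images, bucket, and model paths are placeholders, and the exact vLLM arguments depend on the custom image's entrypoint.

```yaml
# Illustrative vLLM Deployment: an init container stages weights from S3
# onto a node-local emptyDir via s5cmd before the server starts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llm
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llm }
  template:
    metadata:
      labels: { app: vllm-llm }
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      volumes:
        - name: model-cache
          emptyDir: {}                   # node-local scratch; on p5en this sits on instance NVMe
      initContainers:
        - name: fetch-model
          image: peakcom/s5cmd:latest    # placeholder s5cmd image
          args: ["cp", "s3://my-model-bucket/llm/*", "/models/"]   # placeholder bucket
          volumeMounts:
            - { name: model-cache, mountPath: /models }
      containers:
        - name: vllm
          image: <ecr-registry>/vllm-custom:latest   # placeholder custom image
          args:
            - "--model=/models"
            - "--tensor-parallel-size=8" # Principle 1: single node, 8 GPUs
            - "--port=8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - { name: model-cache, mountPath: /models }
```

Validation, per the last table row, is a call against the OpenAI-compatible surface once the pod is Ready, for example GET /v1/models on port 8000.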

Phase 3: Inference Gateway

Configure the 2-Tier inference gateway based on kgateway plus Bifrost/LiteLLM, including Complexity-based Cascade Routing, Semantic Caching, and Guardrails; a routing sketch follows the table below.

| Item | Details |
| --- | --- |
| L1 Gateway | kgateway (Gateway API, mTLS, rate limiting) |
| L2-A Gateway | Bifrost (CEL Rules conditional routing, failover) or LiteLLM (native complexity-based routing) |
| Load Balancer | NLB (TCP/TLS) |
| Routing Strategy | Complexity-based Cascade (SLM → LLM), Hybrid Routing, Fallback |
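
As a sketch of the L1 tier, the Gateway API resources below route all /v1 traffic to the L2 gateway Service. Names and ports are placeholders, and the kgateway-specific policies for mTLS and rate limiting are omitted here.

```yaml
# Illustrative L1 config: a Gateway handled by kgateway forwards
# OpenAI-style /v1 traffic to the L2 gateway (Bifrost or LiteLLM).
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1                 # OpenAI-compatible API surface
      backendRefs:
        - name: bifrost                # placeholder L2 Service name
          port: 8080
```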

Phase 4: Monitoring and Observability

Configure the monitoring stack based on Prometheus, Amazon Managed Service for Prometheus (AMP), Amazon Managed Grafana (AMG), and Langfuse; a remote-write sketch follows the table below.

| Item | Details |
| --- | --- |
| Metrics Collection | Prometheus → AMP (Pod Identity authentication) |
| Dashboards | AMG Grafana (SigV4, ec2_iam_role) |
| LLM Observability | Langfuse (OTel traces, cost tracking) |
| GPU Metrics | DCGM Exporter (GPU utilization, VRAM, temperature) |
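
A minimal Prometheus configuration sketch for the AMP path is shown below. The workspace URL and region are placeholders; with Pod Identity, SigV4 signing picks up credentials from the pod's associated IAM role, so no keys appear in the config.

```yaml
# Illustrative Prometheus config: SigV4-signed remote_write to AMP,
# plus a scrape job that keeps only DCGM Exporter endpoints.
remote_write:
  - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write  # placeholder workspace
    sigv4:
      region: us-west-2                 # placeholder region

scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter            # placeholder Service name
        action: keep
```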

Phase 5: Pipelines

Configure the LoRA fine-tuning and Cascade Routing pipelines; a hot-reload sketch follows the table below.

| Item | Details |
| --- | --- |
| Fine-tuning | LoRA adapter training → S3 storage → vLLM hot-reload |
| Cascade Routing | SLM (8B) → LLM (744B) cost optimization |
| Evaluation | Ragas + custom benchmarks |
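
The snippet below sketches the vLLM settings involved in the hot-reload step; the image and paths are placeholders, and the flag values are illustrative rather than tuned.

```yaml
# Partial vLLM container spec enabling runtime LoRA loading.
containers:
  - name: vllm
    image: <ecr-registry>/vllm-custom:latest   # placeholder custom image
    env:
      - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
        value: "True"                  # enables the runtime adapter endpoints
    args:
      - "--model=/models/base"         # placeholder base model path
      - "--enable-lora"
      - "--max-loras=4"                # illustrative concurrent-adapter cap
```

With these settings, an adapter synced down from S3 can be registered through vLLM's /v1/load_lora_adapter endpoint (a POST carrying lora_name and lora_path) without restarting the server.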

Phase 6: Coding Tool Integration

Connect AI coding tools such as Aider and Cline to the self-hosted models; a sample client configuration follows the table below.

| Item | Details |
| --- | --- |
| Coding Tools | Aider, Cline, Continue.dev |
| Protocol | OpenAI-compatible API |
| Connection Path | Coding tool → NLB → kgateway → Bifrost/LiteLLM → vLLM |
| Monitoring | Bifrost/LiteLLM OTel → Langfuse (per-request tracing) |
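
As an example of the client side, a hypothetical .aider.conf.yml pointing Aider at the self-hosted stack might look like this; the hostname, key, and model id are placeholders for whatever the gateway actually exposes.

```yaml
# Hypothetical Aider configuration targeting the self-hosted endpoint.
openai-api-base: https://llm-gateway.example.com/v1   # placeholder NLB hostname
openai-api-key: sk-placeholder                        # whatever key the gateway expects
model: openai/qwen3-8b                                # placeholder model id; openai/ prefix selects the OpenAI-compatible provider
```

Cline and Continue.dev take the same base URL and model name through their own settings, since all three tools speak the OpenAI-compatible protocol listed in the table.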

Documents

Core Design Principles

The Reference Architecture follows these principles.

1. Single-Node First

Multi-node distribution significantly increases complexity and failure potential. Prefer instances with enough VRAM (p5en, p6) to serve each model with Tensor Parallelism on a single node.

2. Spot Instance Utilization

GPU Spot instances are 80-85% cheaper than On-Demand. Inference workloads are stateless, so when Spot capacity is reclaimed they can restart immediately on a new instance, with model weights restored rapidly from S3.
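
For clusters that provision GPU capacity with Karpenter (the scheduling option listed under Principle 3 below), Spot-only capacity can be expressed declaratively. The NodePool below is a hypothetical sketch; the NodeClass reference and limits are placeholders.

```yaml
# Illustrative Karpenter NodePool restricted to Spot GPU capacity.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]             # Spot only, per this principle
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5en.48xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws       # placeholder; Auto Mode uses an eks.amazonaws.com NodeClass instead
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: "16"               # cap total GPU capacity
```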

3. Standard Toolchain

Use standard tools from the CNCF and Kubernetes ecosystem wherever possible.

| Area | Standard Tool | Alternative |
| --- | --- | --- |
| GPU Scheduling | Karpenter / MNG | Auto Mode NodePool |
| Model Serving | vLLM | SGLang, llm-d |
| AI Gateway | Bifrost / LiteLLM | OpenClaw, Helicone |
| Metrics | Prometheus + AMP | CloudWatch |
| LLM Observability | Langfuse | Helicone, LangSmith |
| Distributed Training | LeaderWorkerSet (LWS) | KubeRay |

4. Layered Cost Optimization

Cost optimization uses a layered approach rather than a single technique: Spot capacity at the infrastructure layer (Principle 2), single-node serving to avoid multi-node overhead (Principle 1), and Complexity-based Cascade Routing plus Semantic Caching at the gateway layer (Phases 3 and 5).

Prerequisites

Prerequisites for deploying the Reference Architecture.

AWS Account and Permissions

  • EKS cluster creation permissions (IAM, VPC, EC2, EKS)
  • GPU instance Spot quotas (p5en.48xlarge: 192+ vCPUs)
  • S3 bucket creation permissions
  • AMP/AMG creation permissions (for monitoring setup)
  • ECR registry creation permissions (for custom image builds)

Tools

| Tool | Minimum Version | Purpose |
| --- | --- | --- |
| eksctl | 0.200+ | EKS cluster management |
| kubectl | 1.32+ | Kubernetes resource management |
| helm | 3.16+ | Chart deployment |
| aws CLI | 2.22+ | AWS resource management |
| docker | 27+ | Custom image builds |
| s5cmd | 2.2+ | High-speed S3 sync |

Networking

  • Public subnets: For NLB deployment (when coding tools need external access)
  • Private subnets: For GPU nodes, vLLM, Bifrost deployment
  • NAT Gateway: For S3, ECR, HuggingFace Hub access
  • VPC Endpoints (recommended): S3, ECR, AMP

Next Steps

For concepts and architecture design, refer to the Documentation section.


Feedback

This Reference Architecture is continuously updated based on production deployment experience. If you have improvement suggestions or additional use cases, please open an issue.