Technical Challenges of Agentic AI Workloads

Introduction

When building and operating an Agentic AI platform, platform engineers and architects face technical challenges that differ fundamentally from those of traditional web applications. This document analyzes the five key challenges.

Prerequisite

Before reading this document, review the overall structure of the Agentic AI Platform in Platform Architecture.

Why a Single LLM Is Not Enough

The first question organizations face in the Agentic AI era is "Can't we just use one large, expensive LLM?" In enterprise environments, relying entirely on a single massive LLM runs into the following limitations.

4 Limitations of a Single LLM in Enterprise Practice

Limitation Area	Problem Organizations Face	Platform Response
Cost	Token costs for 70B+ models can reach tens of millions of won per month at high traffic volumes, and the same per-token price applies to simple agent-internal tasks such as tool calls and formatting. Some research estimates that 40-70% of agent LLM calls could be handled by SLMs.	Bifrost 2-Tier routing sends simple calls to self-hosted SLMs and routes only complex reasoning to LLMs
Performance · Latency	Large models have a long time to first token (TTFT), which degrades user experience in real-time customer service (AICC) and conversational agents. Domain-specific SLMs can respond up to 10x faster on the same tasks.	3-Tier Orchestration: Tier 1 (SLM direct) responds in ~50ms; Tier 2 (LLM) is used only for complex reasoning
Information Accuracy	Hallucination is a structural characteristic of LLMs and is critical in tasks that require accuracy, such as billing calculations and terms verification. The Transformer architecture also has inherent limitations in complex arithmetic and logical operations.	Tool Delegation: arithmetic is delegated to rule engines and fact verification to Knowledge Graphs, so LLMs focus only on natural language understanding
Governance · Security	Enterprises must prevent sensitive data (PII/PHI) from leaking to external LLM APIs, keep audit trails of autonomous agent actions, and enforce team-level access control and budget management.	NeMo Guardrails (I/O filtering) + LangGraph HITL (human approval gates) + Langfuse (audit trails)
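
The cost and latency rows above rest on the same idea: keep routine calls off the large model. The following is a minimal Python sketch of such a two-tier decision. It is not Bifrost's actual configuration or API; the tier names, the task labels in SIMPLE_TASKS, and the 2,048-token cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical tier names; a real deployment would map these to gateway backends.
SLM_TIER = "tier1-slm"
LLM_TIER = "tier2-llm"

# Assumed task labels that an agent framework attaches to each call.
SIMPLE_TASKS = {"tool_call", "formatting", "classification", "extraction"}


@dataclass(frozen=True)
class AgentCall:
    task_type: str      # e.g. "tool_call" or "planning"
    prompt_tokens: int  # prompt size in tokens


def choose_tier(call: AgentCall, max_slm_prompt_tokens: int = 2048) -> str:
    """Send short, routine calls to the SLM tier and everything else to the LLM tier."""
    if call.task_type in SIMPLE_TASKS and call.prompt_tokens <= max_slm_prompt_tokens:
        return SLM_TIER
    return LLM_TIER


def slm_share(calls: list[AgentCall]) -> float:
    """Fraction of calls that this heuristic keeps off the large model."""
    if not calls:
        return 0.0
    return sum(choose_tier(c) == SLM_TIER for c in calls) / len(calls)
```

Running slm_share over a sample of real agent traces is one way to check whether the 40-70% estimate holds for a given workload before investing in self-hosted SLMs.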

Infrastructure Optimization: Direction of Superintelligence Research Companies and K8s Ecosystem

Operating such a multi-model ecosystem efficiently requires turning infrastructure into a platform. This is not merely a matter of cost reduction; leading AI companies universally treat it as a core investment priority.

Meta invests heavily in optimizing its own AI infrastructure alongside superintelligence (ASI) research. Grand Teton (GPU server architecture), MTIA (custom inference chip), and PyTorch ecosystem inference optimization (torch.compile, ExecuTorch) all stem from the recognition that infrastructure efficiency is as important as model performance.

The CNCF Kubernetes ecosystem is also rapidly expanding capabilities for AI workloads:

K8s AI Feature	Version	Role	Significance for Multi-Model Ecosystem
DRA (Dynamic Resource Allocation)	1.32 Beta (GA in 1.34)	Fine-grained GPU allocation at MIG level	SLMs on MIG partitions and LLMs on full GPUs coexist in a single cluster
Gateway API + Inference Extension	2025	Standardized routing for LLM inference requests	Intelligent routing based on KV Cache state, per-model traffic distribution
Kueue	GA	AI workload queuing and scheduling	Fair GPU resource distribution for training/inference, per-team quotas
LeaderWorkerSet	kubernetes-sigs project (installed separately)	Distributed inference/training workload pattern	K8s-native management of Tensor Parallel distributed inference for 70B+ models
KAI Scheduler	2025	GPU-aware Pod scheduling	Optimal placement considering GPU topology (NVLink, NVSwitch)

Kubernetes is thus evolving beyond a simple container orchestrator into foundational infrastructure for AI workloads, and it is currently one of the most mature platforms for operating multi-model ecosystems.

Conclusion: Multi-Model Ecosystem and Infrastructure Platformization

Organizations must move beyond single LLM dependency to build a heterogeneous multi-model ecosystem, supported by a robust infrastructure platform.

Strategic planning · Complex reasoning     Routine tasks · Domain-specific
┌─────────────────────┐                    ┌─────────────────────┐
│  LLM Orchestrator   │        Task        │  SLM Expert Pool    │
│  (Claude, GPT etc.) │───Distribution────→│  (7B/14B + LoRA)    │
│  Tier 2 workflow    │                    │  Tier 1 direct      │
└─────────────────────┘                    └─────────────────────┘
           │                                          │
           └── External tool delegation ──────────────┘
              (Arithmetic, search, knowledge graph)
                                │
                ┌───────────────┴──────────────┐
                │  Kubernetes Infra Platform   │
                │  DRA · Gateway API · Kueue   │
                │  Karpenter · vLLM · Bifrost  │
                └──────────────────────────────┘

Below, we analyze the 5 key challenges that the platform must address to efficiently operate this ecosystem in a Kubernetes-native environment.

5 Key Challenges of the Agentic AI Platform

Agentic AI systems leveraging Frontier Models (state-of-the-art large language models) have fundamentally different infrastructure requirements compared to traditional web applications.

Challenge Summary

Agentic AI Platform Core Challenges: legacy infrastructure limitations and problems to solve

Challenge	Core Problem	Legacy Limitation
GPU Resource Management & Cost Optimization	Lack of multi-cluster GPU visibility, generation-specific workload matching, GPU idle costs	Manual monitoring, static allocation, no cost visibility
Intelligent Inference Routing & Gateway	Unpredictable traffic, multi-model routing, dynamic scaling	Slow provisioning, fixed capacity, manual routing
LLMOps Observability & Cost Governance	Difficulty tracking at token level, no cost visibility, inadequate quality evaluation	Manual tracking, no optimization, only post-analysis
Agent Orchestration & Safety	Agent workflow complexity, tool integration challenges, inadequate safety guarantees	Manual orchestration, lack of standardization, insufficient guardrails
Model Supply Chain Management	Distributed training infrastructure complexity, resource provisioning delays, model deployment pipeline	Manual cluster management, low utilization, no pipeline automation

Limitations of Traditional Infrastructure Approaches

Traditional VM-based infrastructure or manual management approaches cannot effectively handle the dynamic and unpredictable workload patterns of Agentic AI. The high cost of GPU resources and complex distributed system requirements make automated infrastructure management essential.

Challenge 1: GPU Resource Management and Cost Optimization

GPUs are the most expensive resource in the Agentic AI Platform. Appropriate GPU allocation strategies are needed based on model size and workload characteristics.

Why it's difficult:

High cost: GPU instances are 10-100x more expensive than CPU instances (an 8x H100 instance costs ~$98/hr)
Varied model sizes: GPU memory requirements vary dramatically, from 3B-parameter models to 70B+ models
Dynamic workloads: Inference traffic can fluctuate by more than 10x depending on time of day
Idle waste: Provisioned GPUs that sit at low utilization still accrue their full hourly cost
Multi-tenancy: Multiple models and teams must share limited GPUs
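
To make the idle-waste point concrete, the arithmetic can be sketched in a few lines of Python. The ~$98/hr rate comes from the list above; the 730-hour month and the 30% utilization figure used in the note below are assumptions for illustration, not measurements.

```python
def monthly_cost(hourly_rate_usd: float, hours: float = 730.0) -> float:
    """Total bill for a node that stays provisioned for the whole month."""
    return hourly_rate_usd * hours


def monthly_idle_cost(hourly_rate_usd: float, utilization: float, hours: float = 730.0) -> float:
    """Part of the monthly bill that pays for time the GPUs do no useful work."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate_usd * hours * (1.0 - utilization)


def cost_per_useful_hour(hourly_rate_usd: float, utilization: float) -> float:
    """Effective price of one fully used GPU-node hour at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate_usd / utilization
```

Under these assumptions, a node billed at $98/hr costs about $71,500 per month, and at 30% utilization roughly $50,000 of that pays for idle time. The effective price of a useful hour rises to about $327.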

Model Size	GPU Requirements	Cost Pressure
70B+ parameters	8x full GPUs (H100/A100)	$30-$98/hr
7B-30B parameters	1-2 GPUs or MIG partition	$1-$10/hr
Under 3B parameters	Time-Slicing or shared GPU	$0.5-$2/hr
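
The bands in this table follow largely from how much memory the model weights need. A rough Python sketch of that estimate is shown below. The bytes-per-parameter values are standard for each data type, but the 80 GB GPU size, the 30% memory reserve for KV cache and runtime overhead, and the thresholds used to fill the gaps between the table's bands (3B-7B and 30B-70B) are assumptions.

```python
import math

# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone (no KV cache, activations, or runtime overhead)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus(params_billion: float, dtype: str = "fp16",
             gpu_memory_gb: float = 80.0, reserve_fraction: float = 0.3) -> int:
    """Lower bound on GPU count, keeping reserve_fraction of each GPU free (assumed)."""
    usable_gb = gpu_memory_gb * (1.0 - reserve_fraction)
    return max(1, math.ceil(weight_memory_gb(params_billion, dtype) / usable_gb))


def allocation_strategy(params_billion: float) -> str:
    """Map a model size to a band of the table; gap thresholds are assumptions."""
    if params_billion >= 70:
        return "full GPUs"
    if params_billion >= 7:
        return "1-2 GPUs or MIG partition"
    return "time-slicing or shared GPU"
```

A 70B model in FP16 needs about 140 GB for weights alone, which already exceeds a single 80 GB GPU. The result of min_gpus is only a lower bound: it ignores tensor-parallel layout constraints and the KV cache growth that comes with long contexts and large batches.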

Challenge 2: Intelligent Inference Routing and Gateway

Agentic AI workloads use multiple models and providers simultaneously. Simple load balancing is not enough; routing must take model characteristics into account.

Why it's difficult:

Multi-model operations: Diverse models such as Llama, Qwen, Claude, and GPT run simultaneously on a single platform
KV Cache efficiency: Routing that ignores which replica already holds a request's KV Cache forces the shared prefix to be recomputed, which significantly degrades performance
Cost-performance tradeoff: The platform must choose dynamically between low-cost and high-performance models based on task complexity
Provider diversification: Self-hosted models and external APIs (Bedrock, OpenAI) must be managed through one integrated layer
Canary and A/B deployment: Traffic must be shifted safely to new model versions
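
One simple approximation of KV Cache-aware routing is prefix affinity: requests that share a prompt prefix (for example, the same system prompt) go to the same replica, so that replica can reuse its cached prefix. The Python sketch below uses rendezvous hashing for this. It is a simplification that does not inspect real cache state, as a production inference gateway might, and the replica names and the 256-character prefix length are assumptions.

```python
import hashlib


def _score(replica: str, prefix_key: str) -> int:
    """Deterministic pseudo-random score for a (replica, prefix) pair."""
    digest = hashlib.sha256(f"{replica}|{prefix_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_replica(prompt: str, replicas: list[str], prefix_chars: int = 256) -> str:
    """Rendezvous hashing on the prompt prefix.

    Prompts with the same first prefix_chars characters map to the same replica,
    and removing an unrelated replica does not move them elsewhere.
    """
    if not replicas:
        raise ValueError("no replicas available")
    prefix_key = prompt[:prefix_chars]
    return max(replicas, key=lambda replica: _score(replica, prefix_key))
```

Pure affinity can overload a replica when one prefix dominates traffic, so a real router would typically combine it with load or queue-depth signals.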

Challenge 3: LLMOps Observability and Cost Governance

LLM-based systems have fundamentally different observability requirements compared to traditional applications. Token-level cost tracking, agent workflow debugging, and prompt quality monitoring are required.

Why it's difficult:

Non-deterministic output: Different outputs for the same input make traditional testing/monitoring insufficient
Token cost tracking: Must track both infrastructure costs (GPU) and application costs (tokens)
Multi-step debugging: Identifying bottlenecks in complex chains where agents call multiple tools is challenging
Prompt quality: Must detect prompt performance degradation in production in real time
Per-team budgets: Need per-team cost allocation and limit management across shared AI infrastructure

Observability Area	Traditional Applications	LLM Applications
Cost tracking	Infrastructure costs only	Dual tracking: infrastructure + token costs
Debugging	Request-response logs	Multi-step Agent Trace
Quality monitoring	Error rate, latency	Faithfulness, Relevance, Hallucination
Budget management	Resource-based	Per-model/per-team token budgets
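
The dual tracking and per-team budgets in the table can be sketched as a small ledger. This is an illustrative Python example, not Langfuse's API or data model; the model names, the prices in PRICE_PER_MILLION, and the rule that teams without a configured limit have no budget are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1M tokens as (input, output). Not real list prices.
PRICE_PER_MILLION = {
    "external-llm": (3.00, 15.00),
    "self-hosted-slm": (0.10, 0.10),
}


def self_hosted_price_per_million(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Amortize GPU cost over sustained throughput to get USD per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


@dataclass
class TokenLedger:
    monthly_limit_usd: dict[str, float]
    spent_usd: dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def record(self, team: str, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call's token cost to the team's running total and return it."""
        in_price, out_price = PRICE_PER_MILLION[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spent_usd[team] += cost
        return cost

    def over_budget(self, team: str) -> bool:
        # Assumption: a team without a configured limit has a budget of zero.
        return self.spent_usd[team] > self.monthly_limit_usd.get(team, 0.0)
```

The helper self_hosted_price_per_million links the two cost views: for self-hosted models, the token price is derived from GPU cost and measured throughput rather than from a provider's price list.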

Challenge 4: Agent Orchestration and Safety

In Agentic AI systems, agents autonomously invoke tools and interact with external systems. This autonomy creates new challenges in terms of safety and controllability.

Why it's difficult:

Autonomous actions: Agents decide on their own which tools to call, which can lead to unexpected behavior
Prompt injection: Risk of malicious inputs causing agents to perform unintended actions
Tool integration standardization: Need standards for safely connecting diverse external systems (DBs, APIs, files) to agents
Multi-agent communication: Safe and efficient communication protocols needed when multiple agents collaborate
State management: State persistence, recovery, and checkpointing needed for long-running agents
Scaling: Agent workloads run on CPUs, but their traffic patterns are irregular, which makes efficient scaling difficult
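
Two of the controls implied by this list, limiting which tools an agent may call and requiring human approval for risky ones, can be sketched in plain Python. This is not the LangGraph or NeMo Guardrails API; the agent name, the tool names, and the policy tables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    agent: str
    tool: str
    args: dict


# Hypothetical policy: which tools each agent may call at all.
ALLOWED_TOOLS = {"billing-agent": {"lookup_invoice", "issue_refund"}}
# Hypothetical policy: tools whose side effects require a human decision.
NEEDS_APPROVAL = {"issue_refund"}


def authorize(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Return 'denied', 'allowed', 'approved', or 'rejected' for a tool call.

    approve is the human-in-the-loop hook; it is only consulted for tools
    listed in NEEDS_APPROVAL.
    """
    if call.tool not in ALLOWED_TOOLS.get(call.agent, set()):
        return "denied"
    if call.tool in NEEDS_APPROVAL:
        return "approved" if approve(call) else "rejected"
    return "allowed"
```

Because the check uses an allowlist, an agent that is manipulated through prompt injection into requesting an unknown tool is denied by default instead of relying on the model to refuse.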

Challenge 5: Model Supply Chain Management

Beyond simply deploying models, the entire model lifecycle (training → evaluation → registry → deployment → feedback) must be systematically managed.

Why it's difficult:

Model version management: Managing diverse artifacts including foundation models, fine-tuned models, and adapters (LoRA)
Distributed training infrastructure: Large-scale model fine-tuning requires multi-node GPU clusters and high-speed networking such as AWS Elastic Fabric Adapter (EFA)
Evaluation pipelines: Must automatically evaluate model quality and set deployment gates
Safe deployment: Minimize service impact during model updates with Canary/Blue-Green deployment
Hybrid environments: Model transfer and synchronization between on-premises and cloud GPUs
RAG data pipelines: Continuous update pipelines for document processing, embedding generation, and vector storage
Feedback loops: Continuous improvement systems that incorporate production tracing data into retraining
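
The evaluation-pipeline item can be illustrated with a simple deployment gate: a candidate model ships only if it clears absolute score floors and does not regress against the current production model. The Python sketch below assumes every metric is a higher-is-better score in [0, 1]; the metric names, thresholds, and the 0.02 regression tolerance are illustrative.

```python
def evaluate_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    min_scores: dict[str, float],
    max_regression: float = 0.02,
) -> list[str]:
    """Return the reasons a candidate fails the gate; an empty list means it passes."""
    failures = []
    for metric, floor in min_scores.items():
        score = candidate.get(metric)
        if score is None:
            failures.append(f"{metric}: no score reported")
        elif score < floor:
            failures.append(f"{metric}: {score:.2f} is below the floor {floor:.2f}")
        elif baseline.get(metric, score) - score > max_regression:
            failures.append(f"{metric}: regressed from {baseline[metric]:.2f} to {score:.2f}")
    return failures


def may_deploy(candidate: dict[str, float], baseline: dict[str, float],
               min_scores: dict[str, float]) -> bool:
    """True when the candidate may proceed to a canary rollout."""
    return not evaluate_gate(candidate, baseline, min_scores)
```

Returning the list of failure reasons rather than a bare boolean keeps the gate's decision auditable, which matters when the same pipeline promotes foundation models, fine-tuned models, and LoRA adapters.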

Next Steps: Approaches to Solving the Challenges

We present two approaches to solving these 5 challenges:

AWS Native Platform: An approach that minimizes infrastructure operational burden using AWS managed services (Bedrock, AgentCore) to focus on agent development
EKS-Based Open Architecture: An approach that achieves fine-grained control and cost optimization using Amazon EKS and the open-source ecosystem

These two approaches are complementary and can be combined based on workload characteristics.

Criterion	AWS Native	EKS-Based Open Architecture
GPU management	Not required (serverless)	Karpenter auto-provisioning
Model selection	Bedrock-supported models	All open-weight models
Operational burden	Minimal	Medium (reduced with EKS Auto Mode)
Cost optimization	Usage-based pricing	Fine-grained control: Spot, Consolidation
Customization	Limited	Full flexibility

Which approach to choose?

Quick start with a focus on agent logic: AWS Native Platform
Open-weight models, hybrid environments, and cost optimization: EKS-Based Open Architecture
Pragmatic choice: Combine both approaches (start with AWS Native, expand to EKS as needed)

References

Official Documentation

Kubernetes Gateway API - K8s official gateway API specification
CNCF AI/ML Landscape - Cloud Native AI/ML ecosystem overview
NVIDIA GPU Operator Documentation - GPU Operator official guide
AWS EKS Best Practices for AI/ML - EKS AI/ML workload optimization

Papers / Technical Blogs

vLLM: Easy, Fast, and Cheap LLM Serving - PagedAttention mechanism explanation
Efficient Memory Management for LLM Serving (SOSP 2023) - KV Cache optimization research
Cost-Effective LLM Inference at Scale - Production cost optimization cases
NVIDIA Blog: Optimizing AI Workloads - GPU optimization technology blog

Related Documents (Internal)

Platform Architecture - Overall system design blueprint
AWS Native Platform - Managed service approach
EKS-Based Open Architecture - Self-hosting approach
GPU Resource Management - GPU cost optimization details
