
Technical Challenges of Agentic AI Workloads

Introduction

When building and operating an Agentic AI platform, platform engineers and architects face technical challenges that are fundamentally different from traditional web applications. This document analyzes the 5 key challenges.

Prerequisite

Before reading this document, review the overall structure of the Agentic AI Platform in Platform Architecture.

Why a Single LLM Is Not Enough

In the Agentic AI era, the first question organizations face is "Can't we just use one large, expensive LLM?" In practice, relying entirely on a single massive LLM in enterprise environments leads to the following practical limitations.

4 Limitations of a Single LLM in Enterprise Practice

| Limitation Area | Problem Organizations Face | Platform Response |
|---|---|---|
| Cost | Token pricing for 70B+ models can reach tens of millions of won per month at high traffic volumes, and the same cost applies to simple tasks like tool calls and formatting within agents. Research shows that 40-70% of agent LLM calls can be replaced by SLMs. | Bifrost 2-Tier routing separates simple calls to self-hosted SLMs, routing only complex reasoning to LLMs |
| Performance · Latency | Large models have long response latency (TTFT), degrading user experience in real-time customer service (AICC) and conversational agents. Domain-specific SLMs can deliver 10x faster responses for the same tasks. | 3-Tier Orchestration: Tier 1 (SLM direct) responds in ~50ms; Tier 2 (LLM) is used only for complex reasoning |
| Information Accuracy | LLM hallucination is a structural characteristic, which is critical in tasks requiring accuracy such as billing calculations and terms verification. Transformer architectures have inherent limitations in complex arithmetic and logical operations. | Tool Delegation: arithmetic is delegated to rule engines, fact verification to Knowledge Graphs; LLMs focus only on natural language understanding |
| Governance · Security | Risks of sensitive data (PII/PHI) leaking to external LLM APIs; audit trails for autonomous agent actions, team-level access control, and budget management are all required. | NeMo Guardrails (I/O filtering) + LangGraph HITL (human approval gates) + Langfuse (audit trails) |
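
The cost row above hinges on routing simple calls away from the frontier model. The following is a minimal sketch of such a 2-tier router; the task categories and token cutoff are illustrative assumptions, not Bifrost's actual API.

```python
# Hypothetical 2-tier router: cheap, frequent agent-internal calls go to a
# self-hosted SLM; only complex reasoning reaches the frontier LLM.
# Task categories and the 2000-token cutoff are illustrative assumptions.
SIMPLE_TASKS = {"tool_call", "formatting", "classification", "extraction"}

def route_request(task_type: str, prompt_tokens: int) -> str:
    """Return the tier that should serve this call."""
    if task_type in SIMPLE_TASKS and prompt_tokens < 2000:
        return "tier1-slm"   # self-hosted 7B/14B, low latency, low cost
    return "tier2-llm"       # frontier model for complex reasoning
```

In this sketch, `route_request("formatting", 300)` returns `"tier1-slm"`, while anything outside the simple-task set, or with a large context, falls through to the LLM tier.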

Infrastructure Optimization: The Direction of Superintelligence Research Companies and the K8s Ecosystem

To efficiently operate such a multi-model ecosystem, infrastructure platformization is essential. This is not merely a cost-reduction issue: it is an area in which leading AI companies universally invest as a core priority.

Meta invests heavily in optimizing its own AI infrastructure alongside superintelligence (ASI) research. Grand Teton (GPU server architecture), MTIA (custom inference chip), and PyTorch ecosystem inference optimization (torch.compile, ExecuTorch) all stem from the recognition that infrastructure efficiency is as important as model performance.

The CNCF Kubernetes ecosystem is also rapidly expanding capabilities for AI workloads:

| K8s AI Feature | Version | Role | Significance for Multi-Model Ecosystem |
|---|---|---|---|
| DRA (Dynamic Resource Allocation) | 1.31 Beta | Fine-grained GPU allocation at the MIG level | SLMs on MIG partitions and LLMs on full GPUs, coexisting in a single cluster |
| Gateway API + Inference Extension | 2025 | Standardized routing for LLM inference requests | Intelligent routing based on KV Cache state, per-model traffic distribution |
| Kueue | GA | AI workload queuing and scheduling | Fair GPU resource distribution across training/inference, per-team quotas |
| LeaderWorkerSet | 1.31 | Distributed inference/training workload pattern | K8s-native management of Tensor Parallel distributed inference for 70B+ models |
| KAI Scheduler | 2025 | GPU-aware Pod scheduling | Optimal placement considering GPU topology (NVLink, NVSwitch) |

As such, Kubernetes is evolving beyond a simple container orchestrator to become the foundational infrastructure for AI workloads, and is the most mature platform for operating multi-model ecosystems.

Conclusion: Multi-Model Ecosystem and Infrastructure Platformization

Organizations must move beyond single LLM dependency to build a heterogeneous multi-model ecosystem, supported by a robust infrastructure platform.

 Strategic planning · Complex reasoning       Routine tasks · Domain-specific
 ┌──────────────────┐                       ┌──────────────────┐
 │ LLM Orchestrator │   Task Distribution   │  SLM Expert Pool │
 │ (Claude, GPT etc)│──────────────────────→│  (7B/14B + LoRA) │
 │ Tier 2 workflow  │                       │  Tier 1 direct   │
 └──────────────────┘                       └──────────────────┘
          │                                          │
          └────── External tool delegation ──────────┘
             (Arithmetic, search, knowledge graph)
                             │
              ┌──────────────┴──────────────┐
              │ Kubernetes Infra Platform   │
              │ DRA · Gateway API · Kueue   │
              │ Karpenter · vLLM · Bifrost  │
              └─────────────────────────────┘

Below, we analyze the 5 key challenges that the platform must address to efficiently operate this ecosystem in a Kubernetes-native environment.


5 Key Challenges of the Agentic AI Platform

Agentic AI systems leveraging Frontier Models (state-of-the-art large language models) have fundamentally different infrastructure requirements compared to traditional web applications.

Challenge Summary

🚀 Agentic AI Platform Core Challenges
Legacy infrastructure limitations and the problems to solve

| Challenge | Core Problem | Legacy Limitation |
|---|---|---|
| 🎯 GPU Resource Management & Cost Optimization | Lack of multi-cluster GPU visibility, generation-specific workload matching, GPU idle costs | Manual monitoring, static allocation, no cost visibility |
| 🔀 Intelligent Inference Routing & Gateway | Unpredictable traffic, multi-model routing, dynamic scaling | Slow provisioning, fixed capacity, manual routing |
| 💰 LLMOps Observability & Cost Governance | Difficulty tracking at the token level, no cost visibility, inadequate quality evaluation | Manual tracking, no optimization, post-hoc analysis only |
| 🤖 Agent Orchestration & Safety | Agent workflow complexity, tool integration challenges, inadequate safety guarantees | Manual orchestration, lack of standardization, insufficient guardrails |
| 🔧 Model Supply Chain Management | Distributed training infrastructure complexity, resource provisioning delays, model deployment pipeline gaps | Manual cluster management, low utilization, no pipeline automation |

Limitations of Traditional Infrastructure Approaches

Traditional VM-based infrastructure or manual management approaches cannot effectively handle the dynamic and unpredictable workload patterns of Agentic AI. The high cost of GPU resources and complex distributed system requirements make automated infrastructure management essential.


Challenge 1: GPU Resource Management and Cost Optimization

GPUs are the most expensive resource in the Agentic AI Platform. Appropriate GPU allocation strategies are needed based on model size and workload characteristics.

Why it's difficult:

  • High cost: GPU instances are 10-100x more expensive than CPU (H100 x8: ~$98/hr)
  • Varied model sizes: GPU memory requirements vary dramatically from 3B parameter models to 70B+
  • Dynamic workloads: Inference traffic fluctuates by more than 10x depending on time of day
  • Idle waste: Low utilization after GPU provisioning leads to massive cost waste
  • Multi-tenancy: Multiple models and teams must share limited GPUs

| Model Size | GPU Requirements | Cost Pressure |
|---|---|---|
| 70B+ parameters | Full GPU (H100/A100) x8 | $30-$98/hr |
| 7B-30B parameters | 1-2 GPUs or MIG partition | $1-$10/hr |
| Under 3B parameters | Time-Slicing or shared GPU | $0.5-$2/hr |
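
The sizing tiers above can be turned into a simple capacity heuristic. A hedged sketch follows, covering FP16 weights only; real sizing must also budget KV cache and activation memory, and the tier cutoffs here are assumptions.

```python
def gpu_plan(params_b: float) -> tuple[float, str]:
    """Rough FP16 weight memory (GB) and allocation tier for a model size.
    1B parameters ≈ 2 GB at FP16; KV cache and activations come on top."""
    mem_gb = params_b * 2
    if params_b >= 70:
        tier = "full GPUs (H100/A100 x8, tensor parallel)"
    elif params_b >= 7:
        tier = "1-2 GPUs or MIG partition"
    else:
        tier = "Time-Slicing or shared GPU"
    return mem_gb, tier
```

For example, a 70B model needs roughly 140 GB for weights alone, which is why it cannot fit on a single 80 GB H100 and must be sharded across a multi-GPU node.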

Challenge 2: Intelligent Inference Routing and Gateway

Agentic AI workloads leverage multiple models and providers simultaneously. Intelligent routing that understands model characteristics is needed, beyond simple load balancing.

Why it's difficult:

  • Multi-model operations: Running diverse models like Llama, Qwen, Claude, and GPT simultaneously on a single platform
  • KV Cache efficiency: Routing that doesn't consider LLM KV Cache state significantly degrades performance
  • Cost-performance tradeoff: Must dynamically choose between low-cost and high-performance models based on task complexity
  • Provider diversification: Must integrate management of self-hosted models and external APIs (Bedrock, OpenAI)
  • Canary/A-B deployment: Must safely transition traffic to new model versions

Challenge 3: LLMOps Observability and Cost Governance

LLM-based systems have fundamentally different observability requirements compared to traditional applications. Token-level cost tracking, agent workflow debugging, and prompt quality monitoring are required.

Why it's difficult:

  • Non-deterministic output: Different outputs for the same input make traditional testing/monitoring insufficient
  • Token cost tracking: Must track both infrastructure costs (GPU) and application costs (tokens)
  • Multi-step debugging: Identifying bottlenecks in complex chains where agents call multiple tools is challenging
  • Prompt quality: Must detect prompt performance degradation in production in real-time
  • Per-team budgets: Need per-team cost allocation and limit management across shared AI infrastructure

| Observability Area | Traditional Applications | LLM Applications |
|---|---|---|
| Cost tracking | Infrastructure costs only | Dual tracking: infrastructure + token costs |
| Debugging | Request-response logs | Multi-step agent traces |
| Quality monitoring | Error rate, latency | Faithfulness, Relevance, Hallucination |
| Budget management | Resource-based | Per-model/per-team token budgets |
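
The "dual tracking" and per-team budget rows can be made concrete. The sketch below aggregates token spend per team; the per-1K-token prices are placeholders, not real provider rates.

```python
# Placeholder per-1K-token prices in USD; not real provider rates.
PRICE_PER_1K_TOKENS = {"slm-7b": 0.0002, "llm-frontier": 0.015}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Token cost of one LLM call; GPU infrastructure cost is tracked separately."""
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K_TOKENS[model]

def team_spend(calls: list[dict]) -> dict[str, float]:
    """Aggregate token spend per team, the basis for budgets and limits."""
    spend: dict[str, float] = {}
    for c in calls:
        cost = call_cost(c["model"], c["prompt_tokens"], c["completion_tokens"])
        spend[c["team"]] = spend.get(c["team"], 0.0) + cost
    return spend
```

In production this aggregation is what a tool like Langfuse derives from traces; the point of the sketch is that token cost is an application-level ledger, kept alongside (not instead of) the GPU infrastructure bill.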

Challenge 4: Agent Orchestration and Safety

In Agentic AI systems, agents autonomously invoke tools and interact with external systems. This autonomy creates new challenges in terms of safety and controllability.

Why it's difficult:

  • Autonomous actions: Agents make their own decisions to call tools, enabling unexpected behavior
  • Prompt injection: Risk of malicious inputs causing agents to perform unintended actions
  • Tool integration standardization: Need standards for safely connecting diverse external systems (DBs, APIs, files) to agents
  • Multi-agent communication: Safe and efficient communication protocols needed when multiple agents collaborate
  • State management: State persistence, recovery, and checkpointing needed for long-running agents
  • Scaling: Agent workloads are CPU-based but have irregular traffic patterns, making efficient scaling difficult
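
The autonomy and injection risks above are commonly mitigated with an approval gate in front of high-risk tools (the HITL pattern mentioned earlier). A minimal sketch follows; the risk list, function signature, and return shape are assumptions, not any framework's API.

```python
# Tools whose side effects warrant a human approval gate (illustrative list).
HIGH_RISK_TOOLS = {"delete_record", "send_payment", "modify_billing"}

def execute_tool(tool: str, args: dict, approver=None) -> dict:
    """Run a tool call, pausing high-risk actions until a human approves."""
    if tool in HIGH_RISK_TOOLS:
        approved = approver(tool, args) if approver else False
        if not approved:
            return {"status": "blocked", "reason": "human approval required"}
    return {"status": "executed", "tool": tool}
```

The key property is that the gate sits outside the model: even if a prompt injection convinces the agent to request `send_payment`, the call cannot execute without an out-of-band human decision.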

Challenge 5: Model Supply Chain Management

Beyond simply deploying models, the entire model lifecycle (training → evaluation → registry → deployment → feedback) must be systematically managed.

Why it's difficult:

  • Model version management: Managing diverse artifacts including foundation models, fine-tuned models, and adapters (LoRA)
  • Distributed training infrastructure: Large-scale model fine-tuning requires multi-node GPU clusters and high-speed networking (EFA)
  • Evaluation pipelines: Must automatically evaluate model quality and set deployment gates
  • Safe deployment: Minimize service impact during model updates with Canary/Blue-Green deployment
  • Hybrid environments: Model transfer and synchronization between on-premises and cloud GPUs
  • RAG data pipelines: Continuous update pipelines for document processing, embedding generation, and vector storage
  • Feedback loops: Continuous improvement systems that incorporate production tracing data into retraining
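
The evaluation-pipeline bullet implies automated promotion gates between the registry and deployment. A hedged sketch, with invented metric names and thresholds:

```python
# Quality gates a candidate model must clear before canary rollout.
# Metric names and thresholds are illustrative, not a standard.
GATES = {"faithfulness": 0.90, "relevance": 0.85}

def passes_gates(scores: dict[str, float]) -> bool:
    """A missing metric counts as a failure."""
    return all(scores.get(metric, 0.0) >= threshold
               for metric, threshold in GATES.items())

def promote(version: str, scores: dict[str, float]) -> str:
    if passes_gates(scores):
        return f"{version}: promoted to canary"
    return f"{version}: held in registry"
```

Treating a missing metric as a failure is a deliberate design choice: an evaluation pipeline that silently skips a metric should block deployment rather than pass by default.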

Next Steps: Approaches to Solving the Challenges

We present two approaches to solving these 5 challenges:

  1. AWS Native Platform: An approach that minimizes infrastructure operational burden using AWS managed services (Bedrock, AgentCore) to focus on agent development
  2. EKS-Based Open Architecture: An approach that achieves fine-grained control and cost optimization using Amazon EKS and the open-source ecosystem

These two approaches are complementary and can be combined based on workload characteristics.

| Criteria | AWS Native | EKS-Based Open Architecture |
|---|---|---|
| GPU management | Not required (serverless) | Karpenter auto-provisioning |
| Model selection | Bedrock-supported models | All open-weight models |
| Operational burden | Minimal | Medium (reduced with Auto Mode) |
| Cost optimization | Usage-based pricing | Fine-grained control: Spot, Consolidation |
| Customization | Limited | Full flexibility |

Which approach to choose?
  • Quick start, focus on agent logic: AWS Native Platform
  • Open weight models + hybrid + cost optimization: EKS-based open architecture
  • Realistic optimum: Combine both approaches (start with AWS Native, expand to EKS as needed)
