Technical Challenges of Agentic AI Workloads
Introduction
When building and operating an Agentic AI platform, platform engineers and architects face technical challenges that are fundamentally different from those of traditional web applications. This document analyzes the 5 key challenges.
Before reading this document, review the overall structure of the Agentic AI Platform in Platform Architecture.
Why a Single LLM Is Not Enough
In the Agentic AI era, the first question organizations face is "Can't we just use one large, expensive LLM?" In enterprise environments, relying entirely on a single massive LLM runs into the following practical limitations.
4 Limitations of a Single LLM in Enterprise Practice
| Limitation Area | Problem Organizations Face | Platform Response |
|---|---|---|
| Cost | Token pricing for 70B+ models can reach tens of millions of won per month at high traffic volumes, and the same rates apply to simple tasks like tool calls and formatting within agents. Research shows that 40-70% of agent LLM calls can be replaced by SLMs. | Bifrost 2-Tier routing sends simple calls to self-hosted SLMs and routes only complex reasoning to LLMs |
| Performance · Latency | Large models have high time-to-first-token (TTFT), degrading user experience in real-time customer service (AICC) and conversational agents. Domain-specific SLMs can deliver 10x faster responses on the same tasks. | 3-Tier Orchestration: Tier 1 (SLM direct) responds in ~50ms; Tier 2 (LLM) is used only for complex reasoning |
| Information Accuracy | Hallucination is a structural characteristic of LLMs and becomes a critical risk in accuracy-sensitive tasks such as billing calculations and terms verification. The Transformer architecture has inherent limitations in complex arithmetic and logical operations. | Tool Delegation: arithmetic goes to rule engines and fact verification to Knowledge Graphs, while LLMs focus solely on natural language understanding |
| Governance · Security | Risks of sensitive data (PII/PHI) leaking to external LLM APIs, audit trails for autonomous agent actions, team-level access control and budget management are all required. | NeMo Guardrails (I/O filtering) + LangGraph HITL (human approval gates) + Langfuse (audit trails) |
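As a concrete illustration of the Tool Delegation row above, here is a minimal Python sketch in which the LLM's only job is to extract structured parameters, while a deterministic rule engine performs the arithmetic. The plan names, rates, and extraction output are hypothetical assumptions, not part of any specific product.

```python
# Sketch of the tool-delegation pattern from the table above: the LLM only
# extracts structured parameters; a deterministic rule engine does the math.
# BILLING_RULES and the plan names are hypothetical.
BILLING_RULES = {"basic": 9.90, "pro": 29.90}  # monthly price per plan

def compute_bill(plan: str, months: int, discount_pct: float) -> float:
    """Deterministic arithmetic: never left to the LLM."""
    return round(BILLING_RULES[plan] * months * (1 - discount_pct / 100), 2)

# The LLM's job is reduced to producing this structured call, e.g. from
# "How much is 6 months of Pro with a 10% loyalty discount?":
extracted = {"plan": "pro", "months": 6, "discount_pct": 10.0}
print(compute_bill(**extracted))  # 161.46
```

The design point is that the model's hallucination surface is confined to parameter extraction, which is verifiable, while the numeric result is exact by construction.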
Infrastructure Optimization: Directions from Superintelligence Research Labs and the K8s Ecosystem
To operate such a multi-model ecosystem efficiently, infrastructure platformization is essential. This is not merely a cost-reduction issue; it is an area in which leading AI companies universally invest as a core priority.
Meta invests heavily in optimizing its own AI infrastructure alongside superintelligence (ASI) research. Grand Teton (GPU server architecture), MTIA (custom inference chip), and PyTorch ecosystem inference optimization (torch.compile, ExecuTorch) all stem from the recognition that infrastructure efficiency is as important as model performance.
The CNCF Kubernetes ecosystem is also rapidly expanding capabilities for AI workloads:
| K8s AI Feature | Status | Role | Significance for Multi-Model Ecosystem |
|---|---|---|---|
| DRA (Dynamic Resource Allocation) | Beta (K8s 1.32) | Fine-grained GPU allocation down to the MIG level | SLMs on MIG partitions and LLMs on full GPUs coexist in a single cluster |
| Gateway API Inference Extension | Released 2025 | Standardized routing for LLM inference requests | Intelligent routing based on KV Cache state, per-model traffic distribution |
| Kueue | K8s SIG project | AI workload queuing and scheduling | Fair GPU resource distribution for training/inference, per-team quotas |
| LeaderWorkerSet | K8s SIG project | Distributed inference/training workload pattern | K8s-native management of Tensor Parallel distributed inference for 70B+ models |
| KAI Scheduler | Open-sourced by NVIDIA (2025) | GPU-aware Pod scheduling | Optimal placement considering GPU topology (NVLink, NVSwitch) |
As such, Kubernetes is evolving beyond a simple container orchestrator to become the foundational infrastructure for AI workloads, and is the most mature platform for operating multi-model ecosystems.
Conclusion: Multi-Model Ecosystem and Infrastructure Platformization
Organizations must move beyond single LLM dependency to build a heterogeneous multi-model ecosystem, supported by a robust infrastructure platform.
```
Strategic planning · Complex reasoning      Routine tasks · Domain-specific
┌──────────────────┐                        ┌──────────────────┐
│ LLM Orchestrator │          Task          │ SLM Expert Pool  │
│ (Claude, GPT etc)│────Distribution───────→│ (7B/14B + LoRA)  │
│ Tier 2 workflow  │                        │ Tier 1 direct    │
└──────────────────┘                        └──────────────────┘
         │                                           │
         └───────── External tool delegation ────────┘
             (Arithmetic, search, knowledge graph)
                              │
                 ┌────────────┴───────────────┐
                 │ Kubernetes Infra Platform  │
                 │ DRA · Gateway API · Kueue  │
                 │ Karpenter · vLLM · Bifrost │
                 └────────────────────────────┘
```
Below, we analyze the 5 key challenges that the platform must address to efficiently operate this ecosystem in a Kubernetes-native environment.
5 Key Challenges of the Agentic AI Platform
Agentic AI systems leveraging Frontier Models (state-of-the-art large language models) have fundamentally different infrastructure requirements compared to traditional web applications.
Challenge Summary
Traditional VM-based infrastructure or manual management approaches cannot effectively handle the dynamic and unpredictable workload patterns of Agentic AI. The high cost of GPU resources and complex distributed system requirements make automated infrastructure management essential.
Challenge 1: GPU Resource Management and Cost Optimization
GPUs are the most expensive resource in the Agentic AI Platform. Appropriate GPU allocation strategies are needed based on model size and workload characteristics.
Why it's difficult:
- High cost: GPU instances are 10-100x more expensive than comparable CPU instances (H100 x8: ~$98/hr)
- Varied model sizes: GPU memory requirements vary dramatically from 3B parameter models to 70B+
- Dynamic workloads: Inference traffic fluctuates by more than 10x depending on time of day
- Idle waste: Low utilization after GPU provisioning leads to massive cost waste
- Multi-tenancy: Multiple models and teams must share limited GPUs
| Model Size | GPU Requirements | Cost Pressure |
|---|---|---|
| 70B+ parameters | Full GPU (H100/A100) x8 | $30-$98/hr |
| 7B-30B parameters | 1-2 GPUs or MIG partition | $1-$10/hr |
| Under 3B parameters | Time-Slicing or shared GPU | $0.5-$2/hr |
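To make the memory side of this table concrete, here is a rough rule-of-thumb sketch in Python. It is an assumption-laden estimate, not a sizing tool: serving memory is approximately parameter count × bytes per parameter, with headroom for KV cache, activations, and runtime buffers folded into a single overhead factor.

```python
def estimate_gpu_memory_gb(params_b: float, bytes_per_param: float = 2.0,
                           overhead: float = 1.2) -> float:
    """Rule-of-thumb serving memory: params x precision x overhead.

    params_b: model size in billions of parameters
    bytes_per_param: 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4
    overhead: assumed headroom for KV cache, activations, CUDA buffers
    """
    return params_b * bytes_per_param * overhead

# A 70B model in FP16 needs ~168 GB including headroom, i.e. it cannot fit
# on a single 80 GB H100 and requires tensor parallelism across 2+ GPUs,
# which is why the table above pairs 70B+ with full multi-GPU nodes.
for size in (3, 7, 14, 70):
    print(f"{size}B params -> ~{estimate_gpu_memory_gb(size):.0f} GB")
```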
Challenge 2: Intelligent Inference Routing and Gateway
Agentic AI workloads leverage multiple models and providers simultaneously. Beyond simple load balancing, intelligent routing that understands model characteristics is needed; a minimal routing sketch follows the list below.
Why it's difficult:
- Multi-model operations: Running diverse models like Llama, Qwen, Claude, and GPT simultaneously on a single platform
- KV Cache efficiency: Routing that doesn't consider LLM KV Cache state significantly degrades performance
- Cost-performance tradeoff: Must dynamically choose between low-cost and high-performance models based on task complexity
- Provider diversification: Must integrate management of self-hosted models and external APIs (Bedrock, OpenAI)
- Canary/A-B deployment: Must safely transition traffic to new model versions
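The sketch below shows 2-tier, complexity-based routing with a canary slice. The model names, the keyword heuristic, and the 10% canary ratio are all illustrative assumptions; production gateways typically use trained classifiers and KV-Cache-aware placement signals rather than keyword matching.

```python
import random

# Hypothetical 2-tier model pool; names are illustrative, not a real config.
MODEL_POOL = {
    "slm": ["qwen-7b-a", "qwen-7b-b"],    # self-hosted, cheap, low TTFT
    "llm": ["claude-sonnet", "gpt-4o"],   # external API, expensive
}

def classify_complexity(prompt: str) -> str:
    """Toy heuristic; a real router would use a trained classifier."""
    needs_reasoning = any(k in prompt.lower()
                          for k in ("why", "plan", "analyze", "compare"))
    return "llm" if needs_reasoning or len(prompt) > 2000 else "slm"

def route(prompt: str, canary_ratio: float = 0.1) -> str:
    tier = classify_complexity(prompt)
    candidates = MODEL_POOL[tier]
    # Canary: send a small slice of traffic to the newest model version.
    if random.random() < canary_ratio:
        return candidates[-1]
    return candidates[0]

print(route("Format this JSON response"))          # -> SLM tier
print(route("Analyze churn drivers and plan Q3"))  # -> LLM tier
```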
Challenge 3: LLMOps Observability and Cost Governance
LLM-based systems have fundamentally different observability requirements compared to traditional applications. Token-level cost tracking, agent workflow debugging, and prompt quality monitoring are required.
Why it's difficult:
- Non-deterministic output: Different outputs for the same input make traditional testing/monitoring insufficient
- Token cost tracking: Must track both infrastructure costs (GPU) and application costs (tokens)
- Multi-step debugging: Identifying bottlenecks in complex chains where agents call multiple tools is challenging
- Prompt quality: Must detect prompt performance degradation in production in real-time
- Per-team budgets: Need per-team cost allocation and limit management across shared AI infrastructure
| Observability Area | Traditional Applications | LLM Applications |
|---|---|---|
| Cost tracking | Infrastructure costs only | Dual tracking: infrastructure + token costs |
| Debugging | Request-response logs | Multi-step Agent Trace |
| Quality monitoring | Error rate, latency | Faithfulness, Relevance, Hallucination |
| Budget management | Resource-based | Per-model/per-team token budgets |
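The dual-tracking requirement in the first row of this table can be expressed in a few lines. Below is a minimal sketch of per-team cost aggregation; all prices, model names, and the per-GPU rate derivation are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    team: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    gpu_seconds: float  # 0 for external API calls

# Assumed rates: ($ per prompt token, $ per completion token).
TOKEN_PRICE = {"claude-sonnet": (3e-6, 15e-6),
               "qwen-7b": (0.0, 0.0)}          # self-hosted: billed as GPU time
GPU_PRICE_PER_SECOND = 98 / 8 / 3600            # one H100 of an 8-GPU $98/hr node

def cost(rec: CallRecord) -> float:
    """Dual tracking: application (token) cost + infrastructure (GPU) cost."""
    p_in, p_out = TOKEN_PRICE[rec.model]
    token_cost = rec.prompt_tokens * p_in + rec.completion_tokens * p_out
    infra_cost = rec.gpu_seconds * GPU_PRICE_PER_SECOND
    return token_cost + infra_cost

calls = [CallRecord("billing-team", "claude-sonnet", 1200, 400, 0.0),
         CallRecord("billing-team", "qwen-7b", 800, 200, 1.5)]
print(f"billing-team spend: ${sum(map(cost, calls)):.4f}")
```

In practice this aggregation is what tools like Langfuse automate per trace; the point of the sketch is that neither cost stream alone gives a true per-team number.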
Challenge 4: Agent Orchestration and Safety
In Agentic AI systems, agents autonomously invoke tools and interact with external systems. This autonomy creates new challenges in safety and controllability; a minimal approval-gate sketch follows the list below.
Why it's difficult:
- Autonomous actions: Agents make their own decisions to call tools, enabling unexpected behavior
- Prompt injection: Risk of malicious inputs causing agents to perform unintended actions
- Tool integration standardization: Need standards for safely connecting diverse external systems (DBs, APIs, files) to agents
- Multi-agent communication: Safe and efficient communication protocols needed when multiple agents collaborate
- State management: State persistence, recovery, and checkpointing needed for long-running agents
- Scaling: Agent workloads are CPU-based but have irregular traffic patterns, making efficient scaling difficult
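Below is a minimal, framework-agnostic sketch of a human-in-the-loop approval gate for high-risk tools. The tool names and risk policy are assumptions; in LangGraph-style frameworks this is the point where execution is interrupted and checkpointed until a reviewer responds.

```python
# Assumed risk policy: which tools require a human sign-off before running.
HIGH_RISK_TOOLS = {"delete_record", "issue_refund", "send_email"}

def guarded_tool_call(tool_name: str, args: dict, approve) -> str:
    """Run a tool only after it passes the risk policy.

    approve: callback that asks a human and returns True/False;
    a production system would persist agent state here and resume
    from a checkpoint once the decision arrives.
    """
    if tool_name in HIGH_RISK_TOOLS and not approve(tool_name, args):
        return f"REJECTED: {tool_name} requires human approval"
    return f"EXECUTED: {tool_name}({args})"

# Usage: low-risk tools pass through; high-risk tools are escalated.
print(guarded_tool_call("search_docs", {"q": "refund policy"},
                        approve=lambda t, a: False))
print(guarded_tool_call("issue_refund", {"amount": 120},
                        approve=lambda t, a: False))
```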
Challenge 5: Model Supply Chain Management
Beyond simply deploying models, the entire model lifecycle (training → evaluation → registry → deployment → feedback) must be systematically managed; a minimal evaluation-gate sketch follows the list below.
Why it's difficult:
- Model version management: Managing diverse artifacts including foundation models, fine-tuned models, and adapters (LoRA)
- Distributed training infrastructure: Large-scale model fine-tuning requires multi-node GPU clusters and high-speed networking (EFA)
- Evaluation pipelines: Must automatically evaluate model quality and set deployment gates
- Safe deployment: Minimize service impact during model updates with Canary/Blue-Green deployment
- Hybrid environments: Model transfer and synchronization between on-premises and cloud GPUs
- RAG data pipelines: Continuous update pipelines for document processing, embedding generation, and vector storage
- Feedback loops: Continuous improvement systems that incorporate production tracing data into retraining
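As a sketch of the evaluation-pipeline bullet above, the snippet below promotes a model version to canary only when every quality gate holds. The metric names, thresholds, and model identifier are illustrative assumptions, not a prescribed evaluation suite.

```python
# Hypothetical deployment gates: thresholds are assumptions for illustration.
EVAL_GATES = {"faithfulness": 0.90, "relevance": 0.85, "toxicity_max": 0.01}

def passes_gates(eval_report: dict) -> bool:
    """Promote a model version only if every gate holds."""
    return (eval_report["faithfulness"] >= EVAL_GATES["faithfulness"]
            and eval_report["relevance"] >= EVAL_GATES["relevance"]
            and eval_report["toxicity"] <= EVAL_GATES["toxicity_max"])

candidate = {"model": "billing-slm-v3", "faithfulness": 0.93,
             "relevance": 0.88, "toxicity": 0.004}
if passes_gates(candidate):
    print(f"{candidate['model']}: promote to 10% canary")
else:
    print(f"{candidate['model']}: blocked, keep current version")
```

Automating this check as a registry gate is what turns "evaluation" from a one-off report into a safe-deployment control.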
Next Steps: Approaches to Solving the Challenges
We present two approaches to solving these 5 challenges:
- AWS Native Platform: An approach that minimizes infrastructure operational burden using AWS managed services (Bedrock, AgentCore) to focus on agent development
- EKS-Based Open Architecture: An approach that achieves fine-grained control and cost optimization using Amazon EKS and the open-source ecosystem
These two approaches are complementary and can be combined based on workload characteristics.
| Criteria | AWS Native | EKS-Based Open Architecture |
|---|---|---|
| GPU management | Not required (serverless) | Karpenter auto-provisioning |
| Model selection | Bedrock-supported models | All open weight models |
| Operational burden | Minimal | Medium (reduced with Auto Mode) |
| Cost optimization | Usage-based pricing | Fine-grained control: Spot, Consolidation |
| Customization | Limited | Full flexibility |
Decision guide:
- Quick start with a focus on agent logic: AWS Native Platform
- Open weight models, hybrid deployment, and aggressive cost optimization: EKS-Based Open Architecture
- Realistic optimum: combine both (start with AWS Native, expand to EKS as needed)
References
Official Documentation
- Kubernetes Gateway API — K8s official gateway API specification
- CNCF AI/ML Landscape — Cloud Native AI/ML ecosystem overview
- NVIDIA GPU Operator Documentation — GPU Operator official guide
- AWS EKS Best Practices for AI/ML — EKS AI/ML workload optimization
Papers / Technical Blogs
- vLLM: Easy, Fast, and Cheap LLM Serving — PagedAttention mechanism explanation
- Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — KV Cache optimization research
- Cost-Effective LLM Inference at Scale — Production cost optimization cases
- NVIDIA Blog: Optimizing AI Workloads — GPU optimization technology blog
Related Documents (Internal)
- Platform Architecture — Overall system design blueprint
- AWS Native Platform — Managed service approach
- EKS-Based Open Architecture — Self-hosting approach
- GPU Resource Management — GPU cost optimization details