ADR — Self-Improving Agent Loop Adoption Decision

Status: Proposed (consensus required before production deployment)
Date: 2026-04-19
Related Documents
- Design: Self-Improving Agent Loop
- Implementation: Continuous Training Pipeline
- Operations: Cascade Routing Tuning

Context

In the Phase 3 documentation reorganization, two documents mapping Andrej Karpathy's autosearch discourse to enterprise environments were added. The design (self-improving-agent-loop.md) and implementation (continuous-training/) are in draft status, and operational principle consensus must precede internal review.

While automated learning loops are technically appealing, in enterprise environments the risk of losses from reward hacking, data leakage, and governance gaps exceeding benefits is significant. This ADR serves as meeting material for agreeing on "what to automate and where to draw boundaries."

Decision Points (Consensus Items)

1. Fixed Scope — Self-hosted SLM Only

This loop is exclusively for self-hosted open-weight models: Qwen3 / Llama 4 / GLM-5.
Excludes managed closed models (Bedrock AgentCore Claude/Nova, OpenAI GPT-4.1, Gemini 2.5).
While Ragas/LLM-judge evaluation may use shared resources, target models receiving weight updates are limited to self-hosted models.

Rationale: Closed models lack weight access and risk vendor ToS violations. Address through prompt/routing tuning rather than loop application.

2. Loop Automation Boundary — Only 3 of 5 Stages Automated

Stage	Automation	Human Intervention
1. Rollout (trace collection)	✅ Automated	—
2. Score (Reward labeling)	✅ Automated	Weekly 1–2% manual sample verification
3. Filter (data curation)	✅ Automated	Mandatory PII scanner pass
4. Train (GRPO/DPO/RLAIF)	⚠ Manual trigger	Deployment engineer approval required
5. Deploy (Canary)	⚠ Manual approval + Canary	10% → 50% → 100% staged gates

Rationale: Unattended automation through Train/Deploy dramatically increases reward hacking risk and rollback costs. Start Train with minimum weekly cadence manual job triggers.

3. Reward Model Owner Designation

Changes to reward model policy (LLM-judge prompts, Ragas weights, user feedback weights) require a single Directly Responsible Individual (DRI).
Changes require PR review + backtest results on past 30 days of traces.
Decision owner: Platform MLOps Lead (TBD)

Rationale: Reward signals determine overall learning direction. Indiscriminate weight adjustments are the root cause of silent drift.

4. Data Governance — Mandatory 4-Gate Passage

Traces promoted to training data must pass the following 4 gates sequentially:

PII Gate — Mask personal information with Presidio/Comprehend. Discard failed traces.
Consent Gate — Exclude traces without collection consent from training data (consent=true tag required).
Regional Gate — Traces from regulated regions (EU GDPR, KR PIPA) only used by training jobs within corresponding regional data boundaries.
Confidentiality Gate — Customer confidential/internal-only traces reflected only in internal-only models.

Rationale: Using traces for training without compliance team pre-review causes cost explosion for retraining-level data deletion requests.

5. Cost Guardrails

Monthly GPU-hour ceiling: Pre-specified per environment (default proposal: 1,000 GPU-hour/month on p5en Spot, ~$15K ceiling).
Auto-halt on quality degradation: Automatic rollback if Ragas faithfulness drops 5%p or misroute rate rises 10%p after Canary deployment.
Train job scheduling prohibited without FinOps review when ceiling breach imminent.

Rationale: One GRPO iteration requires multi-hour GPU usage. Unlimited automation makes budget management impossible.

6. Rollback Boundary — Which Versions Must Allow Immediate Reversion

Model versions: Retain last 3 generation weights in registry (deletion prohibited).
Reward model versions: Maintain last 5 versions.
Router classifier versions: Maintain last 10 versions.
Periodically conduct Game day exercises verifying automatic reversion within 5 minutes to previous generation on Canary failure.

Rationale: The most common failure mode in self-learning loops is "worked yesterday, broken today" pattern. Without fast reversion paths, operational risk is excessive.

7. Organizational Boundaries — Single Team Pilot → Platform Expansion

Phase 1 (0–3 months): Classifier v7 / router only as self-learning targets. Single team pilot.
Phase 2 (3–6 months): Expand to one specific self-hosted SLM (Qwen3-4B).
Phase 3 (6+ months): Re-evaluate expansion to large models like GLM-5. Separate ADR required.

Rationale: Technical feasibility differs from organizational operability. Start with small scope, accumulate operational data, then expand.

Consequences

Positive

Explicitly defined loop introduction scope, speed, and responsibilities increase predictability.
Excludes closed models, avoiding vendor contract risks.
3-stage automation reduces blast radius in incident scenarios.

Negative / Trade-offs

Manual Train/Deploy approval sacrifices some speed advantages of "fully automated autosearch."
4-gate data gates may significantly reduce initial data collection volume, slowing learning pace.
Single DRI model risks operational disruption on personnel departure, requiring backup DRI designation.

Next Actions

Circulate this ADR draft to relevant leads (MLOps / Security / FinOps / Compliance).
Designate Reward model DRI (internal platform team discussion).
Create 4-gate data pipeline design work item → Enhance continuous-training/trace-to-dataset.md § Stage 3.
Draft Game day rollback scenarios → Enhance cascade-routing-tuning.md § Canary rollout.
Update status to Proposed → Accepted after consensus.

References

Official Documentation

Kubernetes Canary Deployment — Progressive rollout standard
MLflow Model Registry — Model version management reference

Papers / Tech Blogs

Andrej Karpathy — "LLMs doing autosearch" (2026 Q1 discourse)
LMSYS RouteLLM — Cascade routing classifier design
Langfuse OTel — Production trace standard

Self-Improving Agent Loop Design — 5-Stage Loop detailed architecture
Continuous Training Pipeline Implementation — Actual pipeline specifications
Cascade Routing Tuning Operations — Canary operations guide

Context​

Decision Points (Consensus Items)​

1. Fixed Scope — Self-hosted SLM Only​

2. Loop Automation Boundary — Only 3 of 5 Stages Automated​

3. Reward Model Owner Designation​

4. Data Governance — Mandatory 4-Gate Passage​

5. Cost Guardrails​

6. Rollback Boundary — Which Versions Must Allow Immediate Reversion​

7. Organizational Boundaries — Single Team Pilot → Platform Expansion​

Consequences​

Positive​

Negative / Trade-offs​

Next Actions​

References​

Official Documentation​

Papers / Tech Blogs​

Related Documents (Internal)​