This loop applies exclusively to self-hosted open-weight models (Qwen3, Llama 4, GLM-5, etc.). Managed closed models on AgentCore such as Claude and Nova cannot be put through this self-learning loop and are out of scope.
Before production deployment, consensus on scope, automation boundaries, data governance, and rollback criteria is needed. See ADR — Self-Improving Agent Loop Decision for detailed consensus items.
Self-Improving Agent Loop (Autosearch)
Autosearch Discourse and Enterprise Interpretation
Karpathy's Core Argument
Andrej Karpathy argued that LLMs will evolve beyond simple "next token prediction" machines into autosearch systems. Core mechanisms:
- Tool-use Rollout: LLM explores multiple reasoning paths using tools (code execution, web search, calculator, etc.)
- Success as Signal: Successful paths (reaching correct answer, completing tasks) become signals for next learning
- Self-Supervised Loop: Accumulate own success/failure data without human labeling and retrain with reinforcement learning
- Compound Growth: Stronger models generate more success traces → stronger models (virtuous cycle)
Example: Math problem-solving Agent
- Rollout: "53 × 47 = ?" with 5 approaches (direct calculation, Python execution, Wolfram Alpha, approximation, decomposition)
- Success: Python execution and decomposition reach correct answer 2491
- Training: DPO learning with success paths as preferred samples, failure paths as rejected
- Next Iteration: The retrained model becomes more likely to try Python execution first for complex calculations (sketched below)
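Expressed as a minimal sketch (rollout outcomes and variable names are illustrative), the successful rollouts become chosen samples and the failed ones rejected samples for the next DPO iteration:
# Hypothetical rollout outcomes for "53 × 47 = ?" — (approach, answer, reached_correct_answer)
rollouts = [
    ("direct calculation", "2481",    False),
    ("python execution",   "2491",    True),
    ("wolfram alpha",      "timeout", False),
    ("approximation",      "~2500",   False),
    ("decomposition",      "2491",    True),
]

prompt = "53 × 47 = ?"
# Every successful rollout is preferred over every failed one → DPO-style pairs
pairs = [
    {"prompt": prompt, "chosen": good_answer, "rejected": bad_answer}
    for (_, good_answer, good_ok) in rollouts if good_ok
    for (_, bad_answer, bad_ok) in rollouts if not bad_ok
]
# 2 successes × 3 failures = 6 preference pairs feeding the next training iteration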
Enterprise Environment Constraints
Applying Karpathy's vision in an enterprise environment requires addressing the following constraints:
| Constraint | Description | Solution Direction |
|---|---|---|
| Data Governance | Production traces may contain PII, confidential information | Presidio PII scanner, k-anonymity, consent tracking |
| Cost | LLM calls increase N-fold per rollout (N=number of exploration paths) | Optimize cost-quality trade-off, prioritize low-cost models |
| Reward Modeling | Definition of "success" is ambiguous (customer satisfaction? accuracy? latency?) | Composite reward: LLM-as-judge + Ragas + user feedback |
| Mode Collapse | Generate only specific patterns repeatedly (diversity loss) | Entropy regularization, diverse sampling |
| Regulatory | Audit log and model card update required for each model change | Version control, audit trail, Agent Versioning integration |
Self-improving loop should be interpreted as "automated reinforcement under human supervision", not "full automation". Quality gates and human-in-the-loop verification are essential at each iteration.
5-Stage Loop Architecture
Overall Architecture
Rollout (production trace collection) → Score (reward calculation) → Filter (curation & PII gate) → Train (preference tuning) → Deploy (regression verification & gradual rollout)
Stage 1: Rollout — Production Traffic Collection
Goal: Collect Agent execution traces for actual user requests.
Execution Cycle: Continuous (Real-time)
Input: User request, context, Agent state
Output: Trace (prompt, tool calls, intermediate reasoning, final response, latency, token count)
Collection Mechanism:
from langfuse import Langfuse

langfuse = Langfuse()

# Manual span instrumentation shown here; Langfuse's decorator-based auto-tracing is an alternative.
# vector_db, llm, build_prompt, execute_tool are project-specific helpers.
def execute_agent(user_query: str, context: dict):
    trace = langfuse.trace(name="agent-execution", metadata={"user_id": context["user_id"]})

    span = trace.span(name="retrieval")
    docs = vector_db.search(user_query)
    span.end()

    span = trace.span(name="reasoning")
    response = llm.generate(prompt=build_prompt(user_query, docs))
    span.end()

    if response.requires_tool:
        span = trace.span(name="tool-execution")
        response.tool_result = execute_tool(response.tool_name, response.tool_args)
        span.end()

    trace.event(name="completion", metadata={"tokens": response.token_count})
    return response
Diversity Assurance: Generate 3 responses for the same request with temperature variation (0.7/0.9/1.1) so rollouts cover diverse reasoning paths
Failure Recovery: The user response is returned normally even if trace collection fails — logging runs asynchronously (see the sketch below)
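A minimal sketch of that fail-safe path, assuming an async entry point; log_trace and run_agent are hypothetical helpers standing in for the Langfuse/ETL writer and the agent call:
import asyncio
import logging

logger = logging.getLogger("trace-logging")

async def log_trace_safe(trace_payload: dict):
    # A logging failure must never propagate to the user-facing path
    try:
        await log_trace(trace_payload)  # hypothetical async writer (Langfuse ingestion / ETL queue)
    except Exception:
        logger.exception("Trace collection failed; the user response was already returned")

async def handle_request(user_query: str, context: dict):
    response = await run_agent(user_query, context)  # hypothetical agent entry point
    # Fire-and-forget: logging runs in the background and never blocks the response
    asyncio.create_task(log_trace_safe({"query": user_query, "response": response}))
    return response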
Stage 2: Score — Reward Calculation
Goal: Quantify "how good each response is" by assigning 0-1 score to each trace.
Execution Cycle: Hourly batch
Input: Langfuse trace ID batch
Output: {trace_id: reward_score} table
Composite Reward Formula:
reward_score = (
    w1 * llm_judge_score        # LLM-as-Judge (0-1)
    + w2 * ragas_faithfulness   # Ragas faithfulness (0-1)
    + w3 * ragas_context_recall # Ragas context recall (0-1)
    + w4 * user_feedback_score  # Thumbs up=1, down=0, neutral=0.5
    - w5 * latency_penalty      # Subtracted when latency exceeds the P99 budget
)
# Default weights (adjust experimentally)
w1, w2, w3, w4, w5 = 0.3, 0.25, 0.2, 0.2, 0.05
LLM-as-Judge Prompt:
judge_prompt = f"""
Evaluate the following Agent response:
**Question**: {question}
**Context**: {context}
**Answer**: {answer}
Evaluation Criteria:
1. Accuracy: Factual accuracy based on context
2. Completeness: Covers all aspects of the question
3. Clarity: Easy for user to understand
4. Conciseness: Delivers only core without unnecessary information
Return score between 0-1 and reasoning in JSON.
{{"score": 0.85, "reasoning": "Accurate and complete but slightly verbose"}}
"""
judge_response = cheap_llm.generate(judge_prompt) # Use Qwen3-7B (cost reduction)
Ragas Evaluation:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall

eval_data = {
    "question": [question],
    "answer": [answer],
    "contexts": [contexts],
}
if ground_truth is not None:  # ground truth is optional; context_recall requires it
    eval_data["ground_truth"] = [ground_truth]

ragas_result = evaluate(Dataset.from_dict(eval_data), metrics=[faithfulness, context_recall])
User Feedback Integration:
# Query user feedback from Langfuse
feedback = langfuse.get_scores(trace_id=trace_id, name="user-feedback")
user_score = 1.0 if feedback.value == "positive" else 0.0 if feedback.value == "negative" else 0.5
Cost Optimization:
- LLM-as-Judge uses low-cost model (Qwen3-7B, Llama 4 Scout)
- Ragas with caching (reuse same question+context combinations)
- Prioritize user feedback — skip LLM-as-Judge if feedback exists
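A sketch of that prioritization order, assuming traces expose user_feedback, question, answer, and context fields and reusing the llm_judge_evaluate helper from the Reward Design section (the cache-key choice is illustrative):
import hashlib

judge_cache: dict[str, float] = {}  # question+context hash → judge score

def judge_score_for(trace) -> float:
    # 1. Explicit user feedback wins — no LLM-as-Judge call needed
    if trace.user_feedback == "positive":
        return 1.0
    if trace.user_feedback == "negative":
        return 0.0
    # 2. Cache hit: the same question+context combination was already judged
    key = hashlib.sha256((trace.question + trace.context).encode()).hexdigest()
    if key in judge_cache:
        return judge_cache[key]
    # 3. Fall back to the cheap judge model (e.g. Qwen3-7B)
    score = llm_judge_evaluate(trace.question, trace.answer, trace.context)
    judge_cache[key] = score
    return score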
Stage 3: Filter — Data Curation & PII Gate
Goal: Select only high-quality traces as training data and remove sensitive information.
Execution Cycle: Hourly batch
Input: Scored traces
Output: Clean training dataset (S3 Iceberg table)
Quality Gates:
def filter_traces(scored_traces):
filtered = []
for trace in scored_traces:
# 1. Minimum score threshold
if trace.reward_score < 0.7:
continue
# 2. Remove latency outliers (P99 > 30 seconds)
if trace.latency > 30:
continue
# 3. Exclude traces with errors
if trace.error_count > 0:
continue
# 4. Remove duplicates (same question+answer combination)
if is_duplicate(trace):
continue
filtered.append(trace)
return filtered
PII Scanning (Presidio):
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
def scan_and_anonymize(text: str) -> tuple[str, bool]:
    """Detect PII, then anonymize. Returns (anonymized text, whether PII was found)."""
    # 'ko' assumes the AnalyzerEngine was built with an NLP engine configured for Korean
    results = analyzer.analyze(text=text, language='ko')
if not results:
return text, False # No PII
# PII found → anonymize
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
return anonymized.text, True
# Process trace
for trace in filtered_traces:
trace.question, q_has_pii = scan_and_anonymize(trace.question)
trace.answer, a_has_pii = scan_and_anonymize(trace.answer)
if q_has_pii or a_has_pii:
trace.metadata["pii_detected"] = True
k-Anonymity Check (a query pattern must come from at least k distinct users before it can be used as training data):
from collections import defaultdict

def check_k_anonymity(traces, k=5):
    """Drop traces whose query pattern occurs fewer than k times"""
    query_counts = defaultdict(int)
for trace in traces:
query_pattern = extract_pattern(trace.question) # Extract pattern after entity removal
query_counts[query_pattern] += 1
return [t for t in traces if query_counts[extract_pattern(t.question)] >= k]
Storage — S3 + Iceberg:
from pyiceberg.catalog import load_catalog

catalog = load_catalog("training_data")
table = catalog.load_table("agent_traces")
# Append to the Iceberg table (pyiceberg's append expects an Arrow table; the dict rows below are schematic)
table.append([
{"trace_id": t.id, "question": t.question, "answer": t.answer,
"reward": t.reward_score, "timestamp": t.timestamp}
for t in filtered_traces
])
Regulatory Compliance:
- GDPR/PIPA: Verified user consent and an opt-out/erasure path are required before traces can be used as training data
- Data Retention Period: Delete within 90 days after training (policy setting)
- Audit Log: Record all PII detection/anonymization events in CloudTrail/Audit DB
Stage 4: Train — Preference Tuning
Goal: Retrain model with reinforcement learning using high-quality traces.
Execution Cycle: Weekly or Monthly
Input: S3 Iceberg table (preference pairs)
Output: Candidate model checkpoint
Preference Pair Construction:
Among multiple responses to the same question, the loop treats high-reward responses as preferred (chosen) and low-reward responses as rejected.
from collections import defaultdict

def build_preference_pairs(traces):
    """Group traces by question and generate chosen/rejected pairs"""
    grouped = defaultdict(list)
for trace in traces:
grouped[trace.question].append(trace)
pairs = []
for question, trace_list in grouped.items():
if len(trace_list) < 2:
continue # Cannot pair
# Sort by reward
sorted_traces = sorted(trace_list, key=lambda t: t.reward_score, reverse=True)
# Top 1 vs Bottom 1 pair
preferred = sorted_traces[0]
rejected = sorted_traces[-1]
# Reward difference must be large enough for meaningful pair
if preferred.reward_score - rejected.reward_score < 0.2:
continue
pairs.append({
"prompt": question,
"chosen": preferred.answer,
"rejected": rejected.answer,
"reward_diff": preferred.reward_score - rejected.reward_score
})
return pairs
Training Method Selection Guide:
| Method | Data Requirement | GPU-hours (7B model) | Convergence Stability | Suitable Scenario |
|---|---|---|---|---|
| GRPO | 1k+ pairs | ~50 (4×H100) | ⭐⭐⭐ | Initial self-improvement, fast iteration |
| DPO | 5k+ pairs | ~200 (8×H100) | ⭐⭐⭐⭐ | Stable learning after sufficient data |
| RLAIF | 10k+ pairs + reward model | ~500 (8×H100) | ⭐⭐ | When complex reward modeling needed |
| RFT | 10k+ high-quality traces | ~300 (8×H100) | ⭐⭐⭐⭐⭐ | When golden dataset for supervised learning available |
- Initial (data <2k pairs): GRPO — Fastest and effective with minimal data
- Mid-stage (data 5k-10k pairs): DPO — Balance of stability and effectiveness
- Maturity (data >10k): RLAIF or RFT — Complex reward modeling
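The guide above can be encoded as a simple decision helper; thresholds come straight from the table, while the function name and signature are illustrative:
def select_training_method(num_pairs: int, has_golden_dataset: bool, needs_complex_reward: bool) -> str:
    """Pick a preference-tuning method following the selection guide above."""
    if has_golden_dataset and num_pairs >= 10_000:
        return "RFT"    # supervised reinforcement on expert-validated data
    if needs_complex_reward and num_pairs >= 10_000:
        return "RLAIF"  # reward model + PPO for complex reward shapes
    if num_pairs >= 5_000:
        return "DPO"    # stable once enough pairs have accumulated
    return "GRPO"       # fastest start with minimal data (~1k pairs)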
GRPO Training Example (NeMo-RL):
from nemo.collections.nlp.models.language_modeling import MegatronGPTSFTModel
from nemo_aligner.algorithms.grpo import GRPOTrainer
# Load base model
model = MegatronGPTSFTModel.restore_from("qwen3-7b-base.nemo")
# GRPO configuration
grpo_config = {
"num_rollouts": 4, # Generate 4 responses per question
"kl_coef": 0.05, # KL divergence penalty (prevent policy drift)
"clip_range": 0.2,
"learning_rate": 1e-6,
"batch_size": 16,
"gradient_accumulation": 4,
}
trainer = GRPOTrainer(model=model, config=grpo_config)
# Execute training — GRPO samples its own rollouts per prompt and scores them online,
# so the training set effectively supplies prompts plus a reward signal rather than fixed chosen/rejected pairs
trainer.fit(train_dataset=preference_pairs, val_dataset=golden_dataset)
# Save checkpoint
model.save_to("qwen3-7b-grpo-2026-04-18.nemo")
DPO Training Example (TRL):
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-7B-Instruct")
dpo_config = DPOConfig(
beta=0.1, # Temperature for DPO loss
learning_rate=5e-7,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
max_length=2048,
num_train_epochs=1,
)
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
trainer.train()
model.save_pretrained("qwen3-7b-dpo-2026-04-18")
Training Monitoring:
# Track real-time metrics with Wandb integration
import wandb
wandb.init(project="self-improving-agent", name="grpo-2026-04-18")
# Tracked Metrics
- Reward mean/std (per batch)
- KL divergence (policy drift vs base model)
- Loss curve
- Validation accuracy (golden dataset)
- Training time per epoch
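A sketch of how those metrics might be logged per training step (metric names are illustrative; the KL value would come from the trainer or the drift check described under Safety):
import statistics
import wandb

def log_training_step(step: int, rewards: list[float], kl_div: float, loss: float,
                      val_accuracy: float | None = None):
    """Log the tracked metrics listed above for one training step."""
    metrics = {
        "reward/mean": statistics.mean(rewards),
        "reward/std": statistics.pstdev(rewards),
        "kl_divergence_vs_base": kl_div,  # policy drift relative to the base model
        "loss": loss,
    }
    if val_accuracy is not None:  # golden-dataset accuracy, evaluated periodically
        metrics["val/golden_accuracy"] = val_accuracy
    wandb.log(metrics, step=step)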
Cost Estimate (Qwen3-7B, 5k pairs, DPO):
- GPU: 8× H100 × 25 hours = 200 GPU-hours
- Cloud cost (p5.48xlarge, $98.32/hr): ~$2,458
- Comparison: Weekly training $10k/month, Monthly training $2.5k/month
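The arithmetic behind those figures, as a quick sketch (the hourly rate is the assumed on-demand p5.48xlarge price; actual pricing varies by region and purchase option):
GPU_COUNT, WALL_CLOCK_HOURS = 8, 25
P5_48XLARGE_HOURLY = 98.32                            # assumed on-demand node price (covers all 8 GPUs)

gpu_hours_per_run = GPU_COUNT * WALL_CLOCK_HOURS      # 200 GPU-hours
cost_per_run = WALL_CLOCK_HOURS * P5_48XLARGE_HOURLY  # billed per node-hour → ≈ $2,458

weekly_cadence = cost_per_run * 4                     # ≈ $9,832/month ("~$10k")
monthly_cadence = cost_per_run                        # ≈ $2,458/month ("~$2.5k")
print(f"{gpu_hours_per_run} GPU-hours/run, ${cost_per_run:,.0f}/run, "
      f"weekly: ${weekly_cadence:,.0f}/mo, monthly: ${monthly_cadence:,.0f}/mo")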
Stage 5: Deploy — Regression Verification & Gradual Deployment
Goal: Verify new trained model has not regressed vs baseline before deploying to production.
Execution Cycle: Once after training completion
Input: Candidate model checkpoint
Output: Production deployment or rollback
Golden Dataset Evaluation:
from statistics import mean
from ragas import evaluate        # used inside evaluate_model (project helper)
from datasets import Dataset
# Golden Dataset (100-200 QA pairs validated by domain experts)
golden_data = load_golden_dataset("s3://golden-eval/agent-qa-v2.jsonl")   # project helper
# Evaluate baseline model (evaluate_model returns a list of per-sample scores)
baseline_results = evaluate_model(baseline_model, golden_data)
# Evaluate candidate model
candidate_results = evaluate_model(candidate_model, golden_data)
# Statistical comparison
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(baseline_results, candidate_results)
if p_value < 0.05 and mean(candidate_results) > mean(baseline_results):
print("✅ Candidate model is statistically significantly better")
decision = "PROCEED_TO_SHADOW"
elif mean(candidate_results) < mean(baseline_results) * 0.95:
print("❌ 5%+ regression detected → rollback")
decision = "ROLLBACK"
else:
print("⚠️ No significant difference → additional verification needed")
decision = "MANUAL_REVIEW"
Shadow Test (5% Traffic):
# Inference Gateway configuration (LiteLLM + feature flag)
import asyncio
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
ld_client = ldclient.get()

def select_model(user_id: str) -> str:
    context = Context.builder(user_id).kind("user").build()
    # 95% of users evaluate to "control" (baseline), 5% to "candidate" (shadow)
    variant = ld_client.variation("agent-model-shadow-test", context, "control")
    return "qwen3-7b-baseline" if variant == "control" else "qwen3-7b-candidate"
# Shadow response logged only, baseline returned to user
async def execute_with_shadow(query: str, user_id: str):
baseline_task = agent_call(model="qwen3-7b-baseline", query=query)
candidate_task = agent_call(model="qwen3-7b-candidate", query=query, shadow=True)
baseline_resp, candidate_resp = await asyncio.gather(baseline_task, candidate_task)
# Comparison logging
log_shadow_comparison(query, baseline_resp, candidate_resp)
return baseline_resp # Only baseline to user
Regression Monitoring (24 hours):
# Prometheus query: Candidate vs Baseline error rate
rate(agent_errors_total{model="candidate"}[1h]) / rate(agent_requests_total{model="candidate"}[1h])
vs
rate(agent_errors_total{model="baseline"}[1h]) / rate(agent_requests_total{model="baseline"}[1h])
# Latency P99
histogram_quantile(0.99, rate(agent_latency_bucket{model="candidate"}[1h]))
vs
histogram_quantile(0.99, rate(agent_latency_bucket{model="baseline"}[1h]))
# User Feedback ratio
sum(rate(user_feedback_positive{model="candidate"}[1h])) / sum(rate(user_feedback_total{model="candidate"}[1h]))
Auto-rollback Triggers:
# Prometheus alerting rule (fires to Alertmanager, which triggers the rollback webhook)
- alert: CandidateModelRegression
expr: |
(rate(agent_errors_total{model="candidate"}[30m])
/ rate(agent_requests_total{model="candidate"}[30m]))
> 1.5 *
(rate(agent_errors_total{model="baseline"}[30m])
/ rate(agent_requests_total{model="baseline"}[30m]))
for: 30m
annotations:
summary: "Candidate model error rate 1.5x increase → automatic rollback"
# Webhook → Lambda → LaunchDarkly API (change variant weight to 0%)
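A sketch of that rollback hook. The Alertmanager webhook payload shape is standard; disable_candidate_rollout is a placeholder, since the exact LaunchDarkly call depends on how the flag is managed (REST API, SDK, or IaC):
import json

def lambda_handler(event, context):
    """Receive the Alertmanager webhook and disable the candidate rollout."""
    payload = json.loads(event["body"])
    firing = [a for a in payload.get("alerts", []) if a.get("status") == "firing"]
    for alert in firing:
        if alert["labels"].get("alertname") == "CandidateModelRegression":
            # Placeholder: set the candidate variant weight to 0% / turn the flag off.
            disable_candidate_rollout(flag_key="agent-model-shadow-test")
    return {"statusCode": 200}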
Canary Deployment (on Shadow success):
# Gradually increase ratio in LaunchDarkly console
# Day 1: 5% (shadow) → 5% (live)
# Day 2: 25%
# Day 3: 50%
# Day 4: 100%
# 24-hour monitoring at each stage → proceed to next if no regression
Reward Design
LLM-as-Judge + Ragas + User Feedback Weights
Default weights (adjust with experiments):
REWARD_WEIGHTS = {
"llm_judge": 0.30, # LLM-as-Judge evaluation
"faithfulness": 0.25, # Ragas faithfulness (prevent hallucination)
"context_recall": 0.20, # Ragas context recall (retrieval quality)
"user_feedback": 0.20, # Thumbs up/down
"latency_penalty": 0.05, # Penalty if P99 exceeds
}
def compute_reward(trace):
score = 0.0
# 1. LLM-as-Judge
judge_score = llm_judge_evaluate(trace.question, trace.answer, trace.context)
score += REWARD_WEIGHTS["llm_judge"] * judge_score
    # 2. Ragas faithfulness (shown schematically — in practice computed via ragas.evaluate as in Stage 2)
    faith_score = ragas.faithfulness.score(trace.answer, trace.context)
    score += REWARD_WEIGHTS["faithfulness"] * faith_score
    # 3. Ragas context recall (schematic, as above)
    recall_score = ragas.context_recall.score(trace.context, trace.ground_truth)
    score += REWARD_WEIGHTS["context_recall"] * recall_score
# 4. User feedback
feedback_score = 1.0 if trace.user_feedback == "positive" else \
0.0 if trace.user_feedback == "negative" else 0.5
score += REWARD_WEIGHTS["user_feedback"] * feedback_score
# 5. Latency penalty (deduct if P99 > 10 seconds)
if trace.latency > 10:
penalty = min(0.05, (trace.latency - 10) / 100) # Maximum 5% penalty
score -= penalty
return max(0.0, min(1.0, score)) # Clamp to 0-1 range
Weight Tuning Experiment
Optimal Weight Search with A/B Test:
# Define experiment groups
experiments = [
{"name": "baseline", "weights": {"llm_judge": 0.3, "faithfulness": 0.25, ...}},
{"name": "user-first", "weights": {"llm_judge": 0.2, "user_feedback": 0.4, ...}},
{"name": "quality-first", "weights": {"faithfulness": 0.4, "context_recall": 0.3, ...}},
]
# Run separate training pipeline for each experiment group
for exp in experiments:
model = train_with_rewards(base_model, preference_pairs, reward_weights=exp["weights"])
# Golden dataset evaluation
results = evaluate(model, golden_dataset)
# Production test (Canary 5%)
production_metrics = deploy_canary(model, traffic_pct=0.05, duration_hours=24)
# Track business metrics
print(f"{exp['name']}: Accuracy={results.accuracy}, User Satisfaction={production_metrics.satisfaction}")
Iterative Optimization:
- Train model with initial weights
- Collect business metrics after production deployment (user satisfaction, task completion rate)
- Retrain after weight adjustment
- Finalize optimal combination after 2-3 iterations
Data Curation & PII Gate
Langfuse Trace → S3 Iceberg Table
Data Flow: Langfuse trace DB (Postgres) → hourly Lambda ETL → quality gate + Presidio PII scan → S3 Iceberg table (training_data.agent_traces)
Lambda ETL Job:
import os
import boto3
import psycopg2
import psycopg2.extras
from presidio_analyzer import AnalyzerEngine
from pyiceberg.catalog import load_catalog

def lambda_handler(event, context):
    # 1. Query traces from the last hour in the Langfuse DB
    conn = psycopg2.connect(os.environ["LANGFUSE_DB_URL"])
    cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)  # rows as dicts
    cursor.execute("""
        SELECT id, input, output, metadata, score
        FROM traces
        WHERE created_at > NOW() - INTERVAL '1 hour'
          AND score > 0.7
    """)
    traces = cursor.fetchall()
# 2. PII scanning
analyzer = AnalyzerEngine()
clean_traces = []
for trace in traces:
input_results = analyzer.analyze(text=trace["input"], language="ko")
output_results = analyzer.analyze(text=trace["output"], language="ko")
if input_results or output_results:
# PII found → anonymize or discard
if should_anonymize(trace):
trace = anonymize_trace(trace, input_results, output_results)
else:
continue # Discard
clean_traces.append(trace)
    # 3. Save to the Iceberg table (pyiceberg's append expects an Arrow table; conversion from dict rows omitted)
    catalog = load_catalog("glue", **{"s3.endpoint": "https://s3.amazonaws.com"})
    table = catalog.load_table("training_data.agent_traces")
    table.append(clean_traces)
return {"status": "success", "traces_processed": len(clean_traces)}
Presidio PII Scanner
Supported Entities (Korean):
- Name, email, phone number, SSN, credit card number, address, IP address
Add Custom Recognizers:
from presidio_analyzer import Pattern, PatternRecognizer
# Korean account number pattern
account_number_recognizer = PatternRecognizer(
supported_entity="KR_ACCOUNT_NUMBER",
patterns=[Pattern("account", r"\d{3}-\d{2}-\d{6}", 0.8)],
)
analyzer.registry.add_recognizer(account_number_recognizer)
k-Anonymity
Concept: A query pattern must be shared by at least k users before it is considered low-risk for re-identifying an individual.
Implementation:
from collections import defaultdict
def apply_k_anonymity(traces, k=5):
"""Remove traces failing k-anonymity criteria"""
# 1. Extract query pattern (remove named entities)
pattern_groups = defaultdict(list)
for trace in traces:
pattern = extract_pattern(trace.question) # "John Doe" → "[NAME]", "2026-04-18" → "[DATE]"
pattern_groups[pattern].append(trace)
# 2. Remove groups with less than k
filtered = []
for pattern, group in pattern_groups.items():
if len(group) >= k:
filtered.extend(group)
else:
print(f"⚠️ Pattern '{pattern}' removed (k={len(group)} < {k})")
return filtered
def extract_pattern(text: str) -> str:
"""Replace named entities with placeholders"""
# Extract entities with NER model and replace
entities = ner_model.predict(text)
for entity in entities:
text = text.replace(entity.text, f"[{entity.label}]")
return text
Terms & Regional Storage Requirements
Korean PIPA (Personal Information Protection Act, 개인정보보호법):
- Prohibits automated profile-based decisions without user consent → opt-in consent required
- Separate consent required for overseas transfer → storage in domestic region (ap-northeast-2)
GDPR:
- Right to be forgotten → Delete within 7 days upon user request
- Data minimization → Delete original traces within 90 days after training
Consent Tracking:
# User consent table
consent_table = {
"user_id": "u123",
"consent_to_training": True,
"consent_date": "2026-04-01",
"withdraw_date": None,
}
# Verify consent when collecting traces
for trace in collected_traces:
    if not user_consents[trace.user_id].consent_to_training:
        continue  # Cannot be used as training data
Preference Tuning Selection Guide
GRPO (Group Relative Policy Optimization)
Principle: Update the policy based on the relative rewards of multiple responses (rollouts) to the same prompt. A PPO variant that replaces the learned critic (value model) with a group-relative baseline, so no separate value model is needed.
Advantages:
- Effective even with small data (starting from 1k pairs)
- Fast convergence (50 GPU-hours)
- No separate critic/value model required → saves memory
Disadvantages:
- Unstable convergence (sensitive to learning rate adjustment)
- Difficult to handle complex reward functions
Usage Example:
# NeMo-Aligner GRPO
from nemo_aligner.algorithms.grpo import GRPOTrainer
trainer = GRPOTrainer(
model=base_model,
num_rollouts=4, # Generate 4 responses per question
kl_coef=0.05, # KL penalty
learning_rate=1e-6,
batch_size=16,
)
trainer.fit(train_dataset)
Suitable Scenario: Initial self-improvement when fast iteration required
DPO (Direct Preference Optimization)
Principle: Learn implicit reward using preferred/rejected pairs directly. Directly optimize policy without reward model.
Advantages:
- Stable convergence
- Automatic KL divergence control with reference model
- Simple implementation (TRL library)
Disadvantages:
- Requires sufficient data (5k+ pairs)
- Long training time (200 GPU-hours)
Usage Example:
from trl import DPOTrainer, DPOConfig
config = DPOConfig(
beta=0.1, # DPO temperature
learning_rate=5e-7,
max_length=2048,
num_train_epochs=1,
)
trainer = DPOTrainer(
model=base_model,
args=config,
train_dataset=preference_dataset, # {"prompt", "chosen", "rejected"} format
tokenizer=tokenizer,
)
trainer.train()
Suitable Scenario: Stable learning after sufficient data
RLAIF (Reinforcement Learning from AI Feedback)
Principle: Learn reward model from AI-generated feedback and optimize policy with PPO. RLHF variant replacing "Human" with "AI".
Advantages:
- Can express complex reward functions
- Advantageous for large-scale training
Disadvantages:
- Reward model training overhead (additional GPU-hours)
- Unstable convergence (sensitive to hyperparameters)
- High implementation complexity
Usage Example:
# 1. Train reward model
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

reward_model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-7B", num_labels=1)
reward_trainer = Trainer(  # schematic — a pairwise preference loss is needed (TRL's RewardTrainer provides this)
    model=reward_model,
    args=TrainingArguments(output_dir="reward-model", learning_rate=1e-5, num_train_epochs=1),
    train_dataset=labeled_comparisons,  # (prompt, response_a, response_b, preference)
)
reward_trainer.train()
# 2. Optimize policy with PPO
from trl import PPOTrainer
ppo_trainer = PPOTrainer(
model=base_model,
ref_model=reference_model,
reward_model=reward_model,
config=ppo_config,
)
ppo_trainer.train()
Suitable Scenario: When complex reward modeling needed (e.g., multi-step reasoning, creativity evaluation)
RFT (Rejection Sampling Fine-Tuning)
Principle: Select only high-reward responses from multiple rollouts for supervised fine-tuning. Reinforce with SFT without RL.
Advantages:
- Most stable convergence
- Simple implementation (same as SFT)
- Best efficiency when high-quality dataset available
Disadvantages:
- Requires golden dataset (10k+ high-quality traces)
- Lack of exploration (only selected responses learned)
Usage Example:
# 1. Select high-reward traces
high_quality_traces = [t for t in traces if t.reward_score > 0.9]
# 2. Construct SFT dataset
sft_dataset = [
{"prompt": t.question, "completion": t.answer}
for t in high_quality_traces
]
# 3. SFT training
from transformers import Trainer, TrainingArguments
trainer = Trainer(  # schematic — sft_dataset must be tokenized first (or use TRL's SFTTrainer)
    model=base_model,
    train_dataset=sft_dataset,
    args=TrainingArguments(output_dir="rft-checkpoint", learning_rate=2e-5, num_train_epochs=3),
)
trainer.train()
Suitable Scenario: When golden dataset validated by domain experts available
Practical Comparison (Qwen3-7B, 5k pairs baseline)
| Metric | GRPO | DPO | RLAIF | RFT |
|---|---|---|---|---|
| GPU-hours | 50 | 200 | 500 | 300 |
| Minimum Data | 1k | 5k | 10k | 10k |
| Convergence Stability | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Implementation Complexity | Medium | Low | High | Low |
| Reward Flexibility | Low | Medium | High | Low |
| Cloud Cost | $500 | $2,000 | $5,000 | $3,000 |
Recommended Roadmap:
- Phase 1 (1-2 months): Fast proof-of-concept with GRPO
- Phase 2 (3-6 months): Switch to DPO after data accumulation
- Phase 3 (6+ months): Introduce RLAIF when complex reward needed, or parallel RFT when golden dataset available
Safety — Reward Hacking Detection and Defense
What is Reward Hacking?
The phenomenon where the model learns to produce responses that score well on the reward function rather than responses that are genuinely good.
Examples:
- Excessive verbosity: Write long to increase completeness score → unnecessarily long answers
- Template repetition: "Follow these steps: 1) ... 2) ..." pattern scores high → all answers follow same format
- Overconfidence: "Absolutely certain" assertive expressions get high LLM-as-Judge scores → confident hallucinations
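These patterns can be screened for heuristically before traces enter training. A sketch, assuming trace objects expose answer and reward_score; all thresholds and marker phrases are illustrative:
from collections import Counter

def detect_reward_hacking(traces, length_factor=2.0, template_share=0.5):
    """Flag high-reward traces that match the patterns above."""
    high_reward = [t for t in traces if t.reward_score >= 0.8]
    if not high_reward:
        return []
    avg_len = sum(len(t.answer) for t in traces) / len(traces)
    # Template repetition: share of high-reward answers starting with the same 30 characters
    prefix_counts = Counter(t.answer[:30] for t in high_reward)
    top_prefix, top_count = prefix_counts.most_common(1)[0]
    confidence_markers = ("absolutely certain", "definitely", "guaranteed")
    flagged = []
    for t in high_reward:
        reasons = []
        if len(t.answer) > length_factor * avg_len:
            reasons.append("excessive_verbosity")
        if top_count / len(high_reward) > template_share and t.answer.startswith(top_prefix):
            reasons.append("template_repetition")
        if any(marker in t.answer.lower() for marker in confidence_markers):
            reasons.append("overconfidence")
        if reasons:
            flagged.append((t, reasons))
    return flagged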
Diverse Rollout Sampling
Strategy: Sample multiple responses to the same question with varied decoding parameters so that rollouts remain diverse.
def diverse_rollout(prompt: str, n=4):
"""Sampling for diversity"""
responses = []
for i in range(n):
# Temperature, top_p variation
temp = 0.7 + i * 0.1 # 0.7, 0.8, 0.9, 1.0
top_p = 0.9 - i * 0.05 # 0.9, 0.85, 0.8, 0.75
response = llm.generate(
prompt=prompt,
temperature=temp,
top_p=top_p,
max_tokens=512,
)
responses.append(response)
return responses
Diversity Metric Monitoring:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
def measure_diversity(responses: list[str]) -> float:
    """Diversity score = 1 - average pairwise cosine similarity (higher means more diverse)"""
embeddings = embedder.encode(responses)
similarities = cosine_similarity(embeddings)
# Exclude diagonal (similarity with self)
avg_sim = (similarities.sum() - len(responses)) / (len(responses) * (len(responses) - 1))
return 1 - avg_sim # Diversity score (higher is more diverse)
# Alert configuration
if measure_diversity(batch_responses) < 0.3:
alert("⚠️ Insufficient response diversity → possibility of mode collapse")
Entropy Regularization
Purpose: Maintain entropy of output distribution to prevent model from being excessively biased toward specific patterns.
import torch
import torch.nn.functional as F
def entropy_regularized_loss(logits, labels, entropy_coef=0.01):
"""Cross-entropy loss + entropy regularization"""
# 1. Base loss
ce_loss = F.cross_entropy(logits, labels)
# 2. Calculate entropy of output distribution
probs = F.softmax(logits, dim=-1)
entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=-1).mean()
# 3. Subtract entropy from loss to prefer high-entropy
total_loss = ce_loss - entropy_coef * entropy
return total_loss
Entropy Monitoring:
# Track batch entropy during training
wandb.log({"output_entropy": entropy.item()})
# Alert mode collapse if entropy drops sharply
if entropy < 2.0: # Adjust threshold based on vocab size
alert("⚠️ Low entropy detected → possible mode collapse")
Policy Drift Monitoring (KL Divergence)
Purpose: Track KL divergence to prevent model from drifting too far from base model after retraining.
import torch
import torch.nn.functional as F

def compute_kl_divergence(base_model, new_model, test_prompts):
    """Average KL(new || base) over a prompt set (prompts assumed pre-tokenized to input ids)"""
    kl_divs = []
    for prompt in test_prompts:
        with torch.no_grad():
            # Base model token distribution
            base_logits = base_model(prompt).logits
            base_probs = F.softmax(base_logits, dim=-1)
            # New (candidate) model token distribution
            new_logits = new_model(prompt).logits
            new_probs = F.softmax(new_logits, dim=-1)
            # F.kl_div(input=log q, target=p) computes KL(p || q), so this is KL(new || base)
            kl = F.kl_div(base_probs.log(), new_probs, reduction='batchmean')
        kl_divs.append(kl.item())
    return sum(kl_divs) / len(kl_divs)
# Pre-deployment check
kl_threshold = 0.5 # Empirically adjust
avg_kl = compute_kl_divergence(base_model, candidate_model, golden_prompts)
if avg_kl > kl_threshold:
alert(f"⚠️ KL divergence {avg_kl:.3f} > {kl_threshold} → excessive policy drift")
decision = "ROLLBACK"
Human-in-the-Loop Verification
Strategy: Verify quality with weekly human review of 1-2% of total training data.
import random

def sample_for_human_review(traces, sample_rate=0.02):
    """Random sampling plus prioritized edge cases"""
    # 1. Random sample (half of the review budget)
    random_sample = random.sample(traces, int(len(traces) * sample_rate * 0.5))
# 2. Edge case priority sample (high reward + low user feedback)
edge_cases = sorted(
traces,
key=lambda t: abs(t.reward_score - t.user_feedback_score),
reverse=True
)[:int(len(traces) * sample_rate * 0.5)]
return random_sample + edge_cases
# Weekly review
review_batch = sample_for_human_review(last_week_traces)
# Send to labeling UI
for trace in review_batch:
send_to_labeling_ui(trace, reviewer="domain_expert")
Review Result Feedback:
# Human review results
human_labels = load_human_reviews("s3://reviews/week-2026-04-18.json")
# Calculate correlation coefficient between reward function and human evaluation
from scipy.stats import spearmanr
corr, p_value = spearmanr(
[h.reward_score for h in human_labels],
[h.human_score for h in human_labels]
)
if corr < 0.7:
alert(f"⚠️ Reward-human correlation {corr:.2f} < 0.7 → need reward function readjustment")
Organizational Decision Checklist
Cost-Benefit Analysis
Investment Cost (Monthly):
| Item | Cost (USD) | Note |
|---|---|---|
| GPU Training | $2,500 | Weekly DPO training, 8×H100 × 25h |
| Trace Storage | $300 | S3 + Iceberg (1TB) |
| LLM-as-Judge Inference | $500 | Qwen3-7B, 10k evaluations per hour |
| Ragas Evaluation | $200 | With caching |
| Infrastructure Operations | $500 | Lambda, Glue, Athena |
| Total | $4,000 | Monthly operational cost |
Expected Effect (3 months baseline):
| Metric | Before | After | Improvement |
|---|---|---|---|
| Exact Match | 0.78 | 0.85 | +9% |
| User Satisfaction | 3.5/5 | 4.2/5 | +20% |
| Task Completion | 72% | 83% | +11%p |
| Escalation Rate | 15% | 9% | -40% |
ROI Calculation:
- Monthly cost: $4,000
- Save 1 human agent (annual salary $60k) → monthly $5,000 savings
- Payback Period: 0.8 months
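The payback figure follows directly from those two numbers (a sketch; the savings model assumes one fully offloaded support-agent workload):
monthly_cost = 4_000                                  # loop operating cost (table above)
monthly_savings = 60_000 / 12                         # one agent at $60k/year → $5,000/month
payback_months = monthly_cost / monthly_savings       # 0.8 months
net_monthly_benefit = monthly_savings - monthly_cost  # $1,000/month once running
print(f"payback: {payback_months:.1f} months, net benefit: ${net_monthly_benefit:,.0f}/month")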
Governance
Model Card Update:
# model-card.yaml
model_name: "qwen3-7b-agent-v2"
version: "2.0"
training_date: "2026-04-18"
base_model: "Qwen/Qwen3-7B-Instruct"
training_data:
source: "Production traces (2026-01 ~ 2026-03)"
size: "5,247 preference pairs"
pii_filtered: true
consent_verified: true
training_method:
algorithm: "DPO"
hyperparameters:
beta: 0.1
learning_rate: 5e-7
epochs: 1
evaluation:
golden_dataset: "agent-qa-v2 (150 samples)"
exact_match: 0.85
faithfulness: 0.88
user_satisfaction: 4.2/5
safety:
pii_scanning: "Presidio v2.2"
k_anonymity: 5
human_review_rate: 0.02
approval:
approved_by: "Jane Doe (Lead ML Engineer)"
approval_date: "2026-04-18"
deployment_stage: "Canary 5%"
Audit Log:
-- Record all training events
CREATE TABLE training_audit_log (
id UUID PRIMARY KEY,
event_type VARCHAR(50), -- 'training_started', 'model_deployed', 'rollback'
model_version VARCHAR(50),
triggered_by VARCHAR(100),
timestamp TIMESTAMP,
metadata JSONB
);
-- Example query: "Who deployed models in April 2026?"
SELECT * FROM training_audit_log
WHERE event_type = 'model_deployed'
AND timestamp BETWEEN '2026-04-01' AND '2026-04-30';
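Pipeline steps can write to this table directly. A sketch using psycopg2; the connection environment variable and the example payload are illustrative:
import os
import uuid
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import Json

def record_audit_event(event_type: str, model_version: str, triggered_by: str, metadata: dict):
    """Insert one row into training_audit_log (schema above)."""
    conn = psycopg2.connect(os.environ["AUDIT_DB_URL"])  # assumed connection env var
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO training_audit_log (id, event_type, model_version, triggered_by, timestamp, metadata)
            VALUES (%s, %s, %s, %s, %s, %s)
            """,
            (str(uuid.uuid4()), event_type, model_version, triggered_by,
             datetime.now(timezone.utc), Json(metadata)),
        )
    conn.close()

# Example: record a deployment event
record_audit_event("model_deployed", "qwen3-7b-agent-v2", "jane.doe", {"stage": "canary-5pct"})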
Team Capability Check
Required Capabilities:
| Capability | Necessity | Current Level | Gap Closure Plan |
|---|---|---|---|
| RL Expertise | ⭐⭐⭐ | - | External consulting or hiring |
| MLOps Maturity | ⭐⭐⭐⭐ | - | Build CI/CD pipeline |
| LLM Evaluation Experience | ⭐⭐⭐ | - | Ragas/Langfuse training |
| Production Operations | ⭐⭐⭐⭐⭐ | - | SRE team collaboration |
| Data Governance | ⭐⭐⭐⭐ | - | Link with Legal/Compliance team |
Minimum Team Composition:
- ML Engineer (RL experience) × 1
- MLOps Engineer × 1
- Data Engineer × 1
- SRE × 0.5 (part-time)
- Domain Expert (labeling) × 1
Go/No-Go Criteria
Go (Proceed) Conditions:
- ✅ $4k monthly budget secured
- ✅ Minimum 3 months production trace accumulation (>2k traces)
- ✅ Golden dataset prepared (>100 samples)
- ✅ MLOps pipeline established (CI/CD, monitoring)
- ✅ Legal/Compliance approval (PII handling, consent)
- ✅ Secure RL/MLOps expertise (internal or external)
No-Go (Stop) Conditions:
- ❌ Insufficient data (<1k traces)
- ❌ Insufficient team capability (no RL expertise)
- ❌ Unresolved compliance (no PII handling plan)
- ❌ Negative ROI (cost > expected effect)
Phase-by-Phase Decision:
- Phase 0 (Pilot, 1 month): Small-scale experiment with GRPO, 500 traces, $500 budget
- Go Criteria: Exact Match +3%p improvement or more
- Phase 1 (PoC, 3 months): Expand with DPO, 5k traces, $12k budget
- Go Criteria: User Satisfaction +10% or more, no regression
- Phase 2 (Production, 6+ months): Establish regular learning loop
- Go Criteria: ROI > 1.5, quality gate pass rate >95%
References
Official Documentation
- TRL (Transformer Reinforcement Learning) — HuggingFace RL library
- NeMo-Aligner — NVIDIA reinforcement learning toolkit
- Presidio PII Scanner — Microsoft PII detection
- Ragas Documentation — RAG evaluation framework
Papers / Technical Blogs
- DPO: Direct Preference Optimization (NeurIPS 2023) — DPO paper
- DeepSeek-R1: GRPO for Reasoning (2024) — GRPO paper
- Constitutional AI: RLAIF (Anthropic 2022) — RLAIF paper
- Andrej Karpathy on Autosearch — Autosearch concept
Related Documents (Internal)
- Agent Versioning — Model version management
- Agent Monitoring — Langfuse tracing
- Ragas Evaluation — RAG quality evaluation
- Cascade Routing Tuning — Routing optimization
The self-improving loop cannot be fully automated. Reward hacking, mode collapse, and policy drift can occur at any time, so human-in-the-loop verification and statistical monitoring are essential; blind automation can degrade model quality.
Next Steps
If considering Self-improving loop adoption:
- Cascade Routing Tuning — Ensure training data diversity by prioritizing low-cost models first
- Continuous Training Pipeline — Design automated regular training pipeline
- Agent Versioning — Model version management and progressive deployment strategy
- Agent Monitoring — Langfuse-based trace collection and cost tracking