
AIDLC Evaluation Framework

Reading Time: ~12 minutes

Unlike traditional SDLC, AIDLC (AI Development Life Cycle) deals with stochastic outputs: LLM/agent responses vary even for the same input, and passing a unit test once does not guarantee the behavior is "always correct." This document describes how to embed evaluation into AIDLC's three loops (Inner/Middle/Outer), along with the benchmarks, tools, and architectures used in production as of April 2026.


1. Why Evaluation-driven Loop

1.1 SDLC TDD vs AIDLC Evaluation-driven

| Aspect | Traditional SDLC (TDD) | AIDLC (Evaluation-driven) |
|---|---|---|
| Output Nature | Deterministic (same input → same output) | Stochastic (same input → distribution) |
| Correct Answer | Single expected value | Acceptable range + quality metric distribution |
| Failure Signal | Assertion failure = bug | Metric drop = drift · regression · quality degradation candidate |
| Reproducibility | 100% reproducible | Approximate reproduction with fixed seed/temperature |
| Gate Condition | All tests green | Evaluation metric threshold met (e.g., Faithfulness ≥ 0.90) |
| Iteration Cycle | Per commit | Commit + dataset replacement + production sampling |

Where TDD iterated through "failing test → implementation → refactor", AIDLC's Evaluation-driven Loop iterates through "evaluation dataset → agent/prompt/model change → metric comparison → gate pass". A single feature addition can drop 2 of 10 metrics, so multi-dimensional metric dashboards become the default rather than simple pass/fail.

1.2 CI Role in Training → Deployment Flow

In traditional SDLC, CI was "build + unit test." In AIDLC, CI's responsibilities expand:

  1. When prompt/agent/model changes are committed, compare against evaluation dataset baseline
  2. Check if key metrics (faithfulness, task success rate, tool-use accuracy, etc.) are within acceptable range
  3. Measure cost metrics (tokens·latency) for regression
  4. Determine drift against production samples
  5. Proceed to deployment pipeline only on gate pass

CI's meaning shifts from "does the code compile" to "does the agent still meet its quality bar."
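
A minimal sketch of such a gate step is shown below. It assumes hypothetical baseline.json and candidate.json files produced by an earlier evaluation job; the metric names, thresholds, and file layout are illustrative, not part of any specific tool.

```python
# ci_gate.py - hypothetical CI gate: compare candidate metrics against a stored baseline.
# File names, metric names, and tolerances here are illustrative assumptions.
import json
import sys

# Absolute floors a release must never fall below.
HARD_THRESHOLDS = {"faithfulness": 0.90, "task_success_rate": 0.80}
# Maximum allowed drop relative to the current baseline (regression tolerance).
MAX_REGRESSION = 0.02


def load(path: str) -> dict[str, float]:
    with open(path) as f:
        return json.load(f)


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    failures = []
    for metric, floor in HARD_THRESHOLDS.items():
        if candidate.get(metric, 0.0) < floor:
            failures.append(f"{metric}={candidate.get(metric)} is below hard floor {floor}")
    for metric, base_value in baseline.items():
        drop = base_value - candidate.get(metric, 0.0)
        if drop > MAX_REGRESSION:
            failures.append(f"{metric} regressed by {drop:.3f} vs baseline {base_value:.3f}")
    return failures


if __name__ == "__main__":
    problems = gate(load("baseline.json"), load("candidate.json"))
    for p in problems:
        print(f"GATE FAIL: {p}")
    sys.exit(1 if problems else 0)  # non-zero exit blocks the deployment pipeline
```

Cost metrics (tokens, latency) can be gated the same way, with the regression check inverted so that increases beyond tolerance fail the build.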

1.3 Relationship with Inner / Middle / Outer Loop

AIDLC divides evaluation into three tiers to balance cost, speed, and accuracy.

  • Inner Loop (seconds ~ minutes): When developers tweak a prompt or function, immediately check for local regressions against 10-20 samples. Tools like promptfoo and pytest are suitable (a pytest sketch follows this list)
  • Middle Loop (minutes ~ tens of minutes): CI per Pull Request. Run Ragas/DeepEval with hundreds of samples, gate against baseline tolerance. Run in GitHub Actions/CodeBuild
  • Outer Loop (continuous): Sample production traces for async evaluation. Monitor drift·regression·safety violations on dashboards, periodically update evaluation datasets
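
A minimal Inner Loop sketch in pytest, assuming a hypothetical generate_answer() function under development and a small local samples.json file; the keyword assertions stand in for whatever cheap checks a team can run in seconds.

```python
# test_inner_loop.py - quick local regression on ~10-20 samples (Inner Loop).
# generate_answer() and samples.json are hypothetical stand-ins for your own code.
import json

import pytest

from my_agent import generate_answer  # hypothetical module under development

with open("samples.json") as f:
    SAMPLES = json.load(f)  # e.g. [{"question": "...", "must_contain": ["..."]}, ...]


@pytest.mark.parametrize("sample", SAMPLES, ids=lambda s: s["question"][:40])
def test_answer_contains_required_facts(sample):
    # Pin temperature to 0 for rough (not perfect) reproducibility of the check.
    answer = generate_answer(sample["question"], temperature=0.0)
    for keyword in sample["must_contain"]:
        assert keyword.lower() in answer.lower(), f"missing '{keyword}'"
```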

2. Official Benchmarks (as of April 2026)

In AIDLC, team-specific datasets alone make it difficult to compare overall capability. Public benchmarks serve as external references.

2.1 Coding Agent Specialized Benchmarks

| Benchmark | Scale | Focus | 2026-04 SOTA Range | URL |
|---|---|---|---|---|
| SWE-bench Verified | 500 human-verified GitHub issues | Real PR-style bug fixes | 70%+ pass@1 (top agents) | swebench.com |
| SWE-bench Multimodal | Web UI bug fixes (with screenshots) | Vision + code combined | Early stage | swebench.com/multimodal |
| TerminalBench | Real shell/CLI tasks | Terminal manipulation · filesystem | ~50% success rate | tbench.ai |
| AgentBench | 8 environments (OS, DB, KG, Web, etc.) | Multi-turn tool use | Large variance by model | github.com/THUDM/AgentBench |
| MLE-bench | 75 Kaggle-style ML challenges | End-to-end ML engineering | Medal acquisition rate metric | github.com/openai/mle-bench |
  • SWE-bench Verified is a 500-issue set human-verified by Princeton + OpenAI in 2024, serving as the de facto standard reference for agent performance comparison as of April 2026
  • MLE-bench is OpenAI's public ML engineering capability assessment, measuring how many medals models acquire in Kaggle-style challenges

SWE-bench Verified Structure

The original SWE-bench (2,294 items) had large variance in difficulty and reproducibility. Verified's 500 items were filtered by these criteria:

  1. Specification Clarity: The issue description and reproduction steps are understandable to a human reviewer
  2. Test Reliability: Evaluation tests accurately capture the bug (exclude flaky tests)
  3. Environment Reproducibility: Container images reproduce deterministically
  4. Appropriate Scope: Exclude overly broad or infeasible cases

From an AIDLC perspective, it's important as the single public reference for whether agents can complete the "specification → design → implementation → verification" cycle at actual PR scale.

Benchmark Usage Precautions

  • Training Contamination: Public benchmarks may appear in pretraining data → prefer benchmarks such as LiveCodeBench that periodically add new problems
  • Sample Size and Significance: A 68% vs. 70% gap between agents A and B over 500 issues may not be statistically significant → check with a bootstrap confidence interval (see the sketch after this list)
  • Cost vs. Discriminative Power: A full benchmark run can cost thousands of dollars for top models, so it is not suitable for every PR in CI → run weekly or per release
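
A quick sketch of the bootstrap comparison, assuming per-issue pass/fail arrays for the two agents; the arrays below are synthetic, and the resample count and confidence level are arbitrary choices.

```python
# bootstrap_ci.py - is a 70% vs 68% pass@1 gap over 500 issues statistically meaningful?
# The pass/fail arrays are synthetic; in practice, load per-issue results from your runs.
import random

random.seed(0)
N, RESAMPLES = 500, 10_000
agent_a = [1] * 340 + [0] * 160   # 68% pass@1
agent_b = [1] * 350 + [0] * 150   # 70% pass@1

diffs = []
for _ in range(RESAMPLES):
    # Resample the same issue indices for both agents (paired bootstrap).
    idx = [random.randrange(N) for _ in range(N)]
    diff = sum(agent_b[i] for i in idx) / N - sum(agent_a[i] for i in idx) / N
    diffs.append(diff)

diffs.sort()
lo, hi = diffs[int(0.025 * RESAMPLES)], diffs[int(0.975 * RESAMPLES)]
print(f"95% bootstrap CI for (B - A): [{lo:.3f}, {hi:.3f}]")
# If the interval includes 0, the 2-point gap is not clearly significant.
```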

2.2 General LLM/Reasoning Benchmarks (Reference)

Not directly used for coding agents but serve as first-pass filters for model selection.

| Benchmark | Focus | Precautions |
|---|---|---|
| MMLU-Pro | 14-field multiple-choice expert knowledge (improved MMLU) | Top models converge at 80%+ as of 2026-04; reduced discriminative power |
| GPQA Diamond | Graduate-level science problems (198 items) | Frequently used in Google/OpenAI reasoning-model evaluations |
| MATH | High school competition math | Near saturation |
| HumanEval / HumanEval+ | Python function generation | Nearly saturated; recommend replacing with LiveCodeBench |
| LiveCodeBench | Real-time updated coding problems | Prevents training contamination; monthly additions |

Caution: Don't judge service quality by benchmark numbers alone; combining a domain-specific dataset with public benchmarks is the practical standard.

2.3 METR task-length doubling

METR (Model Evaluation & Threat Research)'s "Measuring AI Ability to Complete Long Tasks" study presents an important observation:

  • The length of tasks (measured in human working time) that models can complete autonomously doubles approximately every 7 months (a rough extrapolation is sketched after this list)
  • 2019: seconds → 2024-2025: tens of minutes → if the trend continues, 2027-2028: hours
  • Measurement method: Using HCAST (Human-Calibrated Autonomy Software Tasks), estimate the task length an agent can complete with a 50% success rate, calibrated against how long the same tasks take humans
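
A rough worked extrapolation of the doubling trend, assuming a clean 7-month doubling period and an illustrative baseline of about one hour in early 2025; both numbers are simplifications, not METR's exact fit.

```python
# task_length_extrapolation.py - naive extrapolation of the "doubles every ~7 months" trend.
# Baseline and doubling period are illustrative assumptions, not METR's published curve.
BASELINE_MINUTES = 60      # ~1 hour autonomy horizon, assumed for early 2025
DOUBLING_MONTHS = 7

for months_ahead in (0, 12, 24, 36):
    horizon = BASELINE_MINUTES * 2 ** (months_ahead / DOUBLING_MONTHS)
    print(f"+{months_ahead:>2} months: ~{horizon / 60:.1f} hours")
# +24 months gives roughly 10x the baseline, which is why "one human hour" tasks
# are likely to cross the automation threshold within a 1-2 year planning horizon.
```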

Enterprise perspective implications:

  1. Today's "tasks that take a human one hour" may not be automation candidates yet, but are likely to cross the threshold within 1-2 years
  2. Evaluation datasets must periodically expand to include longer-horizon tasks
  3. Guardrails · Audit · HITL frameworks must strengthen alongside task-length increase

URL: metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks


3. Evaluation Tools Comparison (as of April 2026)

Detailed comparison organized from AIDLC Middle Loop perspective (CI integration, production linkage).

| Tool | License | Key Metrics | CI Integration | Production Sampling | Strengths | Limitations |
|---|---|---|---|---|---|---|
| Ragas v0.2+ | Apache 2.0 | faithfulness, context_precision, context_recall, answer_relevancy, noise_sensitivity | Python SDK, GH Actions, CodeBuild | Official support (Langfuse/Phoenix integration) | Most mature in RAG evaluation, rich references | LLM-as-judge call costs |
| DeepEval | Apache 2.0 | 30+ (G-Eval, Toxicity, PII, Hallucination, Bias, Correctness, etc.) | PyTest-like DSL (@pytest.mark.llm_eval) | Native Confident AI integration | Most familiar to PyTest users, custom metric DSL | Mid ecosystem maturity, some metrics need validation |
| LangSmith | SaaS + self-host beta | Trace, Dataset, Auto/Custom Evaluator, LLM-as-judge | langsmith evaluate CLI, GH Actions | Managed (LangChain native) | LangChain/LangGraph integration, A/B experiment management | SaaS dependency, data governance issues |
| Braintrust | SaaS + self-host Enterprise | Dataset, Grading, Replay, Playground | braintrust eval CLI | Managed, log SDK | Excellent developer experience, superior Playground UX | Vendor lock-in, on-premise constraints |
| AWS Labs aidlc-evaluator | Apache 2.0 (early, v0.1.6+) | AIDLC phase deliverable compliance · Common Rules fitness · Stage Transition metrics | scripts/ execution (Python) | - | Targets AIDLC methodology fitness evaluation itself | Lacks general quality metrics → use with Ragas/DeepEval |
| Promptfoo | MIT | Assertions, LLM-as-judge, classifiers | YAML config + promptfoo eval + GH Actions | Partial | Lightweight, declarative, strong in prompt comparison | Constraints for agent evaluation and complex workflows |
| Inspect AI (UK AISI) | Apache 2.0 | Agent safety/capability (solver + scorer) | Python/CLI, GH Actions | - | Government agency standard, sandbox execution support | Learning curve, smaller community |

3.1 Tool Selection Guide

  • RAG pipeline focus → Ragas + Langfuse (open source combination)
  • Python/PyTest-centric teams → DeepEval
  • LangChain/LangGraph users → LangSmith (native)
  • Top-level DX + team experiment management → Braintrust
  • AIDLC methodology compliance audit → AWS Labs aidlc-evaluator
  • Simple prompt A/B comparison → Promptfoo
  • Agent safety/capability evaluation → Inspect AI

In practice, combinations like Ragas (quality) + Inspect AI (safety) + aidlc-evaluator (methodology compliance) or Braintrust (experiments) + Langfuse (observability) are common.
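
As one concrete Middle Loop example, the sketch below shows a DeepEval-style pytest check; the metric classes and constructor arguments may differ across DeepEval versions, and generate_answer()/retrieve_contexts() are hypothetical stand-ins for the system under test.

```python
# test_middle_loop.py - PR-level CI check in the DeepEval style (Middle Loop).
# generate_answer()/retrieve_contexts() are hypothetical; verify metric names against
# the DeepEval version you actually pin, as the API evolves between releases.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from my_rag_app import generate_answer, retrieve_contexts  # hypothetical module

EVAL_SET = [
    {"question": "How do I rotate the API key?"},
    # ... in practice, hundreds of cases loaded from the team's evaluation dataset
]


@pytest.mark.parametrize("case", EVAL_SET)
def test_rag_quality(case):
    contexts = retrieve_contexts(case["question"])
    test_case = LLMTestCase(
        input=case["question"],
        actual_output=generate_answer(case["question"], contexts),
        retrieval_context=contexts,
    )
    # Thresholds mirror the gate values used elsewhere in the pipeline (illustrative).
    assert_test(test_case, [FaithfulnessMetric(threshold=0.9),
                            AnswerRelevancyMetric(threshold=0.7)])
```

The same test file can run locally in the Inner Loop on a trimmed sample set and in CI against the full PR-level dataset, which keeps the two loops from drifting apart.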

[Sections 3.2-9 continue in the full document, following the same structure.]