Open-Weight Models
When operating the AI Development Lifecycle (AIDLC) in enterprise environments, data residency and cost efficiency are critical decision factors. Open-weight models offer three advantages over cloud APIs (Claude, GPT-4): data sovereignty, predictable TCO, and domain-specific customization.
Why Open-Weight Models
Three Core Drivers
1. Data Residency Requirements
In finance, healthcare, and public sectors, external transmission of sensitive data is restricted by regulations.
- Compliance Obligations: GDPR, HIPAA, and financial privacy laws strictly limit data processing locations
- Internal Codebase Protection: Process source code on-premises without sending to external APIs
- Sovereign AI: Nations/enterprises directly control AI inference infrastructure
2. Cost Optimization
At sustained volumes of tens of millions of tokens per month, self-hosted open models can undercut cloud APIs.
- Move Off Pay-Per-Use: Shift from per-API-call charges to fixed infrastructure costs
- Maximize GPU Utilization: Minimize idle time in 24-hour operations
- Break-Even Point: Self-hosting becomes advantageous when processing over 100M tokens/month (varies by GPU type)
3. Domain Customization
Open-weight models can be optimized for specific domains through fine-tuning, prompt engineering, and ontology injection.
- Improve Technical Terminology Accuracy: Enhance medical, legal, financial term processing
- Control Output Format: Enforce JSON schema and code style guide compliance
- Ontology Integration: Inject domain knowledge combined with Ontology Engineering
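Output-format control in particular is something the serving stack can enforce rather than merely request in the prompt: vLLM's OpenAI-compatible server supports guided decoding, which constrains generation to a JSON schema. A minimal sketch, assuming a vLLM endpoint like the ones deployed in the patterns below (verify the exact extra_body parameter name against your vLLM version):
# guided_output.py — constrain a self-hosted model's output to a JSON schema via vLLM guided decoding
from openai import OpenAI

# Endpoint URL is an assumption; see the deployment patterns below
client = OpenAI(base_url="http://on-prem-vllm.internal:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
    },
    "required": ["severity", "summary"],
}

resp = client.chat.completions.create(
    model="glm5",
    messages=[{"role": "user", "content": "Classify this incident report."}],
    extra_body={"guided_json": schema},  # vLLM-specific structured-output option
)
print(resp.choices[0].message.content)  # output is constrained to match the schema above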
Model Landscape (April 2026)
| Model | Provider | Parameters | Key Features | License | Recommended Deployment |
|---|---|---|---|---|---|
| GLM-5 | THUDM | 405B | Multilingual (KO/ZH/EN) strength, excellent math/reasoning | Apache 2.0 | p5en.48xlarge (H200×8) |
| Qwen3-Coder | Alibaba | 4B-32B | Coding specialized, fast inference speed | Apache 2.0 | g6.xlarge (L4×1) |
| Qwen3-235B | Alibaba | 235B | MoE architecture, multimodal | Apache 2.0 | p5.48xlarge (H100×8) |
| DeepSeek-R1 | DeepSeek | 671B | CoT reasoning specialized, RL-based training | MIT | p5en.48xlarge (H200×8) |
| Llama 4 | Meta | 70B-405B | Broad ecosystem, stable performance | Llama 4 License | p4d.24xlarge (A100×8) |
| Mistral Large 2 | Mistral | 123B | European data sovereignty considered design | Mistral License | p4d.24xlarge (A100×8) |
Selection Criteria
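In practice the drivers above reduce to two questions: does the workload touch regulated data, and what is the sustained monthly token volume? (License review is covered separately under Security and Compliance.) The sketch below encodes these checks as a rough pre-selection heuristic over the deployment patterns that follow; the thresholds mirror the break-even guide later on this page and are illustrative, not fixed rules.
# model_selection.py — rough pre-selection heuristic (thresholds mirror the break-even guide below)
from dataclasses import dataclass

@dataclass
class Workload:
    handles_regulated_data: bool  # GDPR/HIPAA/financial-privacy data in scope?
    monthly_tokens: int           # sustained monthly token volume

def recommend_deployment(w: Workload) -> str:
    """Map the decision drivers to one of the deployment patterns described below."""
    if w.handles_regulated_data:
        # Regulated data never leaves controlled infrastructure (Pattern 1 or 2).
        return "self-hosted: on-premises or private VPC"
    if w.monthly_tokens >= 100_000_000:
        # Past the rough break-even point, fixed GPU cost can beat per-token pricing (Pattern 1).
        return "self-hosted: EKS + vLLM"
    if w.monthly_tokens >= 10_000_000:
        return "hybrid: route by task sensitivity (Pattern 3)"
    return "cloud API: no management overhead"

# Example: a compliance-heavy codebase with moderate volume
print(recommend_deployment(Workload(handles_regulated_data=True, monthly_tokens=30_000_000)))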
Deployment Patterns
Pattern 1: EKS + vLLM Serving (Cloud)
Serve open models inside your own VPC so data stays within a controlled boundary while delegating infrastructure management to AWS.
# GLM-5 405B deployment example (EKS Standard Mode)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: glm5-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: glm5-vllm
  template:
    metadata:
      labels:
        app: glm5-vllm
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: p5en.48xlarge
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.18.2
          args:
            - --model
            - THUDM/glm-5-405b
            - --served-model-name
            - glm5
            - --tensor-parallel-size
            - "8"
            - --max-model-len
            - "8192"
            - --trust-remote-code
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "8"
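The vLLM container exposes an OpenAI-compatible API on port 8000. Assuming a Kubernetes Service named glm5-vllm in front of the Deployment (not shown above), a minimal client call looks like the following sketch; the in-cluster URL is an assumption for illustration.
# Query the self-hosted endpoint with the OpenAI SDK (Service name assumed)
from openai import OpenAI

client = OpenAI(
    base_url="http://glm5-vllm.default.svc.cluster.local:8000/v1",  # in-cluster Service DNS
    api_key="dummy",  # vLLM does not require a real key unless --api-key is set
)

resp = client.chat.completions.create(
    model="glm5",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Summarize the AIDLC governance policy in three bullets."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)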
Advantages:
- Cost savings with Auto Scaling and Spot instances
- Dynamic node provisioning via Karpenter
- CloudWatch and Prometheus integrated monitoring
Disadvantages:
- Hourly GPU instance costs (p5en.48xlarge: ~$98/h)
- Data stays within VPC but infrastructure depends on cloud
Pattern 2: On-Premises Bare Metal + vLLM
Used when complete data sovereignty is required or cloud transmission is prohibited.
# Deploy vLLM on an NVIDIA H100×8 server (--ipc=host gives NCCL enough shared memory for tensor parallelism)
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v /data/models:/models \
  vllm/vllm-openai:v0.18.2 \
  --model /models/glm-5-405b \
  --served-model-name glm5 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --trust-remote-code
Infrastructure Requirements:
- GLM-5 405B: H200 ×8 (1,128GB VRAM) for FP16 weights (~810GB); H100 ×8 (640GB) requires FP8/INT8 quantization (see the VRAM estimate sketch below)
- Qwen3-Coder 4B: L4 1 card (FP16 ~8GB VRAM)
- Network: Internal network-only endpoints, no external internet required
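The VRAM figures above come from a simple back-of-the-envelope rule: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A rough sketch follows; the 20% headroom factor is an assumption.
# vram_estimate.py — rough weight-memory estimate (headroom factor is an assumption)
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """FP16/BF16 = 2 bytes, FP8/INT8 = 1 byte, INT4 = 0.5 bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

for name, params in [("GLM-5 405B", 405), ("Qwen3-Coder 4B", 4)]:
    fp16 = weight_memory_gb(params)   # weights only
    total = fp16 * 1.2                # +20% headroom for KV cache/activations (assumed)
    print(f"{name}: ~{fp16:.0f} GB weights (FP16), plan ~{total:.0f} GB with headroom")
# GLM-5 405B: ~810 GB weights → H200×8 (1,128 GB) at FP16, or FP8 quantization for H100×8 (640 GB)
# Qwen3-Coder 4B: ~8 GB weights → fits on a single L4 (24 GB)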
Advantages:
- Absolute data control
- Minimize network latency (on-premises 10G network)
- Long-term cloud cost savings
Disadvantages:
- CapEx burden (H100 server ~$300K)
- Operational personnel required (GPU management, model updates)
- Power consumption even during idle time
Pattern 3: Hybrid Configuration
Mix on-premises and cloud APIs based on task sensitivity.
Implementation Example (LiteLLM Routing):
# litellm-config.yaml
model_list:
- model_name: sensitive-tasks
litellm_params:
model: openai/glm5
api_base: http://on-prem-vllm.internal:8000/v1
api_key: dummy
- model_name: general-tasks
litellm_params:
model: bedrock/anthropic.claude-sonnet-4-20250514
aws_region_name: us-east-1
router_settings:
routing_strategy: simple-shuffle
fallbacks:
- sensitive-tasks: [] # No fallback (external transmission prohibited)
- general-tasks: [openai/gpt-4o]
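From the caller's side, routing is just a matter of choosing the model_name alias when talking to the LiteLLM proxy. A minimal sketch, assuming the proxy runs at http://litellm.internal:4000 and a simple path-based sensitivity rule (both are assumptions for illustration):
# route_by_sensitivity.py — pick a model alias before calling the LiteLLM proxy (proxy URL assumed)
from openai import OpenAI

client = OpenAI(base_url="http://litellm.internal:4000/v1", api_key="dummy")

def model_for(file_path: str) -> str:
    """Internal core code goes to the on-prem model; everything else may use the cloud API."""
    return "sensitive-tasks" if file_path.startswith("src/core/") else "general-tasks"

resp = client.chat.completions.create(
    model=model_for("src/core/payments.py"),
    messages=[{"role": "user", "content": "Review this module for injection vulnerabilities."}],
)
print(resp.choices[0].message.content)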
TCO Comparison Framework
Cost Items
Cloud API (Claude, GPT-4)
| Item | Claude Sonnet 4.5 | Notes |
|---|---|---|
| Input Tokens | $3/1M tokens | Bedrock pricing |
| Output Tokens | $15/1M tokens | Output is 5x more expensive than input |
| Operations Staff | $0 | No management required |
| Initial Investment | $0 | Pay-as-you-go |
Monthly Cost Calculation Example:
- Process 50M input tokens, 10M output tokens per month
- Input: 50M × $3/1M = $150
- Output: 10M × $15/1M = $150
- Total: $300/month
Self-Hosted Open Model (EKS + vLLM)
| Item | Qwen3-Coder 4B (g6.xlarge) | GLM-5 405B (p5en.48xlarge) |
|---|---|---|
| GPU Instance | $1.01/h × 730h = $737/month | $98/h × 730h = $71,540/month |
| Storage (Model) | ~$5/month (8GB) | ~$400/month (810GB) |
| Network Egress | Internal traffic free | Internal traffic free |
| Operations Staff | 0.2 FTE (~$2,000/month) | 0.5 FTE (~$5,000/month) |
| Total | ~$2,742/month | ~$76,940/month |
On-Premises Bare Metal (3-year amortization):
| Item | H100×8 Server | Notes |
|---|---|---|
| Hardware | $300,000 / 36 months = $8,333/month | Initial CapEx |
| Power | 10.2kW × $0.12/kWh × 730h = $893/month | Regional electricity rate variation |
| Data Center | ~$1,000/month | Cooling, space |
| Operations Staff | 1 FTE (~$10,000/month) | 24/7 support |
| Total | ~$20,226/month | ~$11,893/month after 3 years (hardware cost excluded) |
Break-Even Point Guide
Decision Criteria:
- Less than 10M tokens/month: Use Claude API or GPT-4 (no management overhead)
- 10M-100M tokens/month: Hybrid configuration based on task type and sensitivity
- Over 100M tokens/month: Consider self-hosting (EKS or on-premises)
- CapEx investment possible + 3+ year operation planned: On-premises bare metal is cheapest
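Because the crossover depends heavily on which model tier you compare against and on the input/output mix, it is worth recomputing with your own numbers rather than relying on a fixed threshold. A minimal sketch using the Claude Sonnet rates and self-hosting totals from the tables above (the 80/20 input/output split is an assumption):
# tco_breakeven.py — find the monthly token volume where self-hosting matches API cost
API_INPUT_PER_M = 3.0     # $ per 1M input tokens (Claude Sonnet on Bedrock, table above)
API_OUTPUT_PER_M = 15.0   # $ per 1M output tokens
OUTPUT_SHARE = 0.2        # assumed 80/20 input/output split

BLENDED_RATE = (1 - OUTPUT_SHARE) * API_INPUT_PER_M + OUTPUT_SHARE * API_OUTPUT_PER_M  # $/1M tokens

def api_cost(monthly_tokens: float) -> float:
    """Blended API cost in $/month for a given total token volume."""
    return monthly_tokens / 1e6 * BLENDED_RATE

def breakeven_tokens(self_host_monthly: float) -> float:
    """Monthly token volume at which self-hosting becomes cheaper than the API."""
    return self_host_monthly / BLENDED_RATE * 1e6

# Self-hosting totals from the tables above
for name, monthly in [("Qwen3-Coder 4B (g6.xlarge)", 2_742), ("GLM-5 405B (p5en.48xlarge)", 76_940)]:
    print(f"{name}: break-even at ~{breakeven_tokens(monthly) / 1e6:,.0f}M tokens/month")
With these particular assumptions the crossover for the small Qwen3-Coder deployment lands well above 100M tokens/month; a heavier output share, batch workloads that keep the GPU saturated around the clock, or comparison against a more expensive API tier pull it lower, which is why the 100M figure above is a point to start evaluating rather than a hard switch.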
AIDLC Integration
Kiro + Open-Weight Models
The AI coding agent Kiro can use open-weight models as a backend for Spec-Driven development.
// kiro-config.ts
export const kiroConfig = {
models: {
sensitive: {
provider: 'vllm',
endpoint: 'http://on-prem-vllm.internal:8000/v1',
model: 'glm5',
use_cases: ['code-review', 'security-audit', 'refactoring']
},
general: {
provider: 'bedrock',
model: 'anthropic.claude-sonnet-4-20250514',
region: 'us-east-1',
use_cases: ['documentation', 'translation', 'test-generation']
}
},
routing: {
strategy: 'by-file-path',
rules: [
{ pattern: 'src/core/**', model: 'sensitive' },
{ pattern: 'docs/**', model: 'general' }
]
}
};
Steering Files: Model-Specific Prompt Optimization
Open-weight models have different training data and architectures, so they may respond differently to the same prompts.
GLM-5 Steering File:
# .aider.glm5.yml
model: glm5
edit_format: diff
use_git: true
auto_commits: false
stream: true
# GLM-5 performs better with Chinese/Korean mixed use
prompts:
system: |
You are a professional software engineer.
When modifying code, always respond in unified diff format.
Write explanations of changes in Korean, but keep technical terms in English.
Qwen3-Coder Steering File:
# .aider.qwen3.yml
model: qwen3-coder
edit_format: whole
use_git: true
# Qwen3-Coder is more stable with complete file replacement approach
prompts:
system: |
You are a coding assistant specialized in Python and TypeScript.
Always return the complete modified file.
Use type hints and follow PEP 8 style guide.
Ontology Injection
Inject the domain ontology built in Ontology Engineering into the open model's context to improve accuracy.
# ontology_injection.py
import json
from pathlib import Path
from typing import Dict

import yaml  # PyYAML

class OntologyInjector:
    def __init__(self, ontology_path: str):
        self.ontology = self.load_ontology(ontology_path)

    def inject_context(self, prompt: str, domain: str) -> str:
        """Prepend the domain ontology to the prompt"""
        domain_terms = self.ontology.get(domain, {})
        context = "# Domain Knowledge\n"
        for term, definition in domain_terms.items():
            context += f"- {term}: {definition}\n"
        return f"{context}\n# Task\n{prompt}"

    def load_ontology(self, path: str) -> Dict[str, Dict[str, str]]:
        """Load an ontology file in JSON or YAML format"""
        text = Path(path).read_text(encoding="utf-8")
        if path.endswith((".yaml", ".yml")):
            return yaml.safe_load(text)
        return json.loads(text)
# Usage example
injector = OntologyInjector("/data/ontology/finance.yaml")
prompt = injector.inject_context(
"Analyze the following transaction records to find anomaly patterns.",
domain="finance"
)
Effects:
- Improve financial terminology (e.g., arbitrage, short selling) interpretation accuracy
- Strengthen domain-specific rule (e.g., KYC, AML) compliance
- Save tokens by replacing few-shot examples with ontology
Security and Compliance
Model License Review
Even open-weight models may have commercial use restrictions depending on license.
| License | Commercial Use | Derivative Model Distribution | Notes |
|---|---|---|---|
| Apache 2.0 | ✅ Allowed | ✅ Allowed | Includes patent protection clause |
| MIT | ✅ Allowed | ✅ Allowed | Copyright and license notice must be retained |
| Llama 4 License | ✅ Allowed (MAU < 700M) | ⚠️ Limited | Large-scale services require separate agreement |
| Mistral License | ✅ Allowed | ⚠️ Limited | Notify when distributing fine-tuned models |
Recommended Process:
- Review license with legal team (especially Llama, Mistral)
- Track model provenance (check Hugging Face Model Card)
- Review dataset licenses when fine-tuning
Output Auditing
Open models may reproduce sensitive information present in their training data, so output filtering is necessary.
# output_filter.py
import re

class OutputFilter:
    def __init__(self):
        self.patterns = [
            (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN-REDACTED]'),          # SSN
            (r'\b[\w\.-]+@[\w\.-]+\.\w+\b', '[EMAIL-REDACTED]'),   # Email
            (r'\b\d{4}-\d{4}-\d{4}-\d{4}\b', '[CARD-REDACTED]')    # Card number
        ]

    def filter(self, text: str) -> str:
        """Automatically redact sensitive information and log how many matches were removed"""
        redacted_count = 0
        for pattern, replacement in self.patterns:
            text, count = re.subn(pattern, replacement, text)
            redacted_count += count
        self.audit_log(text, redacted_count)
        return text

    def audit_log(self, text: str, redacted_count: int) -> None:
        """Record filtering history"""
        # Send to CloudWatch Logs or S3
        pass
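In an inference pipeline the filter sits between the model response and whatever consumes it. A minimal usage sketch, reusing the on-premises vLLM endpoint assumed in the hybrid configuration above:
# Apply the output filter to a self-hosted model response before returning it
from openai import OpenAI

from output_filter import OutputFilter

client = OpenAI(base_url="http://on-prem-vllm.internal:8000/v1", api_key="dummy")
output_filter = OutputFilter()

resp = client.chat.completions.create(
    model="glm5",
    messages=[{"role": "user", "content": "Summarize the customer onboarding records."}],
)
raw = resp.choices[0].message.content
safe = output_filter.filter(raw)  # redacts SSNs, emails, card numbers and logs the count
print(safe)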
AI Act Compliance
The EU AI Act, Korea's AI Basic Act, and similar regulations impose the following requirements on high-risk AI systems (healthcare, finance, recruitment):
- Explainability: Provide model decision basis (attention weights, RAG sources)
- Human Oversight: Humans perform final decision-making
- Bias Monitoring: Measure output fairness by race, gender, age
- Incident Response Procedures: Model malfunction rollback process
AIDLC Application:
- Establish model governance policies in Governance Framework
- Log all inference requests/responses via Langfuse
- Perform bias audits quarterly (e.g., RAGAS evaluation)
References
Internal Documentation
- Governance Framework — Data sovereignty policies
- Cost Effectiveness — Detailed TCO calculations
- AI Coding Agents — Kiro integration guide
- Ontology Engineering — Ontology injection patterns
Official Model Documentation
- GLM-5 GitHub — Apache 2.0 license
- Qwen3 Model Card — MoE architecture details
- DeepSeek-R1 Paper — RL training methodology
- vLLM Documentation — Serving optimization
Cost Analysis Tools
- AWS Pricing Calculator — EKS GPU instance costs
- Hugging Face LLM Leaderboard — Model performance comparison
- LLM TCO Calculator — Self-hosting vs API comparison
Next Steps
- Pilot Project: Deploy Qwen3-Coder 4B to a small project (one g6.xlarge)
- Cost Tracking: Compare cloud API vs self-hosting TCO over 2 months
- Sensitivity Classification: Classify codebase as sensitive/general to design hybrid configuration
- Governance Policy: Specify open model usage rules in Governance Framework
- Scale Up: Expand to GLM-5 405B or DeepSeek-R1 after pilot success