Custom Model Pipeline Guide
Overview
Why You Need a Custom Model Pipeline
SaaS-based AI coding tools (e.g., Kiro, GitHub Copilot) offer a quick start, but they hit fundamental limitations in enterprise environments.
| Constraint | SaaS (Kiro, etc.) | Self-hosted Pipeline |
|---|---|---|
| LoRA Fine-tuning | Not possible | Domain-specific adapter training |
| Data Sovereignty | Code sent externally | Stays within VPC |
| Model Selection | Limited to provided models | Free choice of open-source models |
| Cost Control | Fixed per-token pricing | 66% savings possible with SLM Cascade |
| Per-customer Optimization | Shared general-purpose model | Multi-LoRA for customer-specific specialization |
The Base Model + LoRA adapter pattern serves multiple domain-specialized models simultaneously on a single GPU. Since base model weights are shared, GPU memory efficiency is maximized.
End-to-End Pipeline Flow
The training pipeline trains domain data with QLoRA, and only adapters that pass evaluation are registered in the registry. The serving pipeline loads multiple adapters simultaneously with vLLM Multi-LoRA and performs cost-optimized routing between SLM/LLM through Bifrost Cascade.
- Operations & MLOps - Full operations architecture
- Custom Model Deployment Guide - Includes Kiro vs. self-hosted comparison
LoRA Training & Deployment Pipeline (Domain Specialization)
This section covers how to implement LoRA Fine-tuning and Multi-LoRA hot-swap deployment in the domain specialization strategy. For strategic background and decision criteria for domain specialization, see Domain Customization (LoRA + RAG).
LoRA Fine-tuning Pipeline
QLoRA GPU Savings
QLoRA (Quantized LoRA) trains only the LoRA adapter while keeping the base model quantized to INT4. This dramatically reduces GPU requirements compared to full fine-tuning.
| Llama-3.3-70B | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| GPUs | H100×32 (impractical) | H100×8 | H100×4 |
| VRAM | 280 GB | 80 GB | 40 GB |
| Training Time | - | 5 days | 2-3 days |
| Cost | - | $8,000 | $2,000 |
QLoRA keeps base model weights in INT4 during training, so tasks requiring extremely precise numerical computations (e.g., financial calculations) may show slight accuracy differences compared to LoRA (FP16). Always validate in the domain evaluation stage.
Training Data Format
Prepare training data as input-output pairs in JSONL format (the record below is pretty-printed for readability; in the actual file each record occupies a single line).
{
"input": "COBOL: PERFORM CALC-INTEREST USING WS-PRINCIPAL WS-RATE.",
"output": "Java: @Transactional public BigDecimal calcInterest(BigDecimal principal, BigDecimal rate) { return principal.multiply(rate).setScale(2, RoundingMode.HALF_UP); }"
}
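To feed these pairs to a trainer, load them and flatten each into a single text field. A minimal sketch using the Hugging Face datasets library; the prompt template is illustrative and should be matched to the base model's chat format:

from datasets import load_dataset

# Load the JSONL pairs and format each into the single "text" field SFTTrainer expects
dataset = load_dataset("json", data_files="cobol_to_java.jsonl", split="train")

def to_text(example):
    return {"text": f"### Input:\n{example['input']}\n\n### Output:\n{example['output']}"}

dataset = dataset.map(to_text)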
Data Collection Strategy:
| Source | Transformation Method | Expected Data Volume |
|---|---|---|
| Legacy COBOL code | Generate COBOL → Java translation pairs | 10,000+ modules |
| Internal frameworks | Framework pattern → code pairs | 5,000+ patterns |
| Code review history | Pre-fix → post-fix pairs | 20,000+ commits |
| Technical docs | Documentation → implementation code pairs | 3,000+ pages |
Quality matters more than quantity. 1,000 high-quality pairs reviewed by senior developers are more effective than 10,000 auto-generated ones. Start with at least 500 reviewed pairs.
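A simple hygiene pass before human review helps; a minimal sketch that drops exact duplicates and trivially short pairs (the length thresholds are illustrative):

import json

seen, clean = set(), []
with open("cobol_to_java.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        key = (pair["input"], pair["output"])
        # Keep only unique, non-trivial pairs; tune thresholds per domain
        if key not in seen and len(pair["input"]) > 20 and len(pair["output"]) > 20:
            seen.add(key)
            clean.append(pair)

with open("cobol_to_java.clean.jsonl", "w") as f:
    f.writelines(json.dumps(p) + "\n" for p in clean)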
Training Frameworks
NeMo Framework (NVIDIA)
NVIDIA's official framework optimized for large-scale model training. Natively supports multi-GPU and multi-node distributed training.
python train_lora.py \
--config-path=conf \
--config-name=llama3_70b_lora \
model.data.train_ds.file_path=cobol_to_java.jsonl \
model.peft.lora_tuning.adapter_dim=16
- adapter_dim (rank): 16 is typical; increase to 32-64 for complex domains
- adapter_dropout: 0.05 recommended (prevents overfitting)
- target_modules: attention layers (q_proj, k_proj, v_proj, o_proj)
Unsloth (2× Faster Training)
An open-source library that doubles LoRA/QLoRA training speed on a single node while reducing memory usage by up to 50%.
from unsloth import FastLanguageModel
from trl import SFTTrainer

# Load the base model with INT4 quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA: INT4 quantization
)

# Attach LoRA adapters to the attention projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=32,  # LoRA scaling factor
    target_modules=["q_proj", "k_proj", "v_proj"],
)

# dataset: the formatted JSONL pairs (see the loading sketch in Training Data Format)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=4096,
)
trainer.train()
| Framework | Strengths | Best For |
|---|---|---|
| NeMo | Multi-node distributed training, official NVIDIA support | H100 cluster available, large-scale training |
| Unsloth | 2× faster training, memory savings, simple API | Single node, rapid prototyping |
Checkpoint Management
Trained LoRA adapters are stored in S3 with version management via MLflow.
# Adapter storage structure
s3://model-registry/
└── lora-adapters/
├── bank-ledger/
│ ├── v1.0/adapter_model.safetensors
│ ├── v1.1/adapter_model.safetensors
│ └── latest -> v1.1
├── stock-order/
│ └── v1.0/adapter_model.safetensors
└── insurance-contract/
└── v1.0/adapter_model.safetensors
Recording training metrics (loss, accuracy) alongside adapter paths in MLflow lets you track which dataset and hyperparameter combinations are optimal.
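A minimal sketch of registering one adapter version in MLflow (the run name, metric values, and tag key are illustrative):

import mlflow

with mlflow.start_run(run_name="bank-ledger-v1.1"):
    mlflow.log_params({
        "base_model": "meta-llama/Llama-3.3-70B-Instruct",
        "lora_rank": 16,
        "lora_alpha": 32,
        "dataset": "cobol_to_java.clean.jsonl",
    })
    mlflow.log_metrics({"train_loss": 0.42, "eval_loss": 0.51})
    # Point the run at the adapter artifact in the registry bucket
    mlflow.set_tag("adapter_s3_path", "s3://model-registry/lora-adapters/bank-ledger/v1.1/")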
- Reference: NeMo Framework Checkpoint Management
Multi-LoRA Hot-swap Deployment
Architecture
Leveraging vLLM's Multi-LoRA feature, you can load multiple LoRA adapters simultaneously on top of a single base model to serve customized responses per customer.
The base model (e.g., 70B) is loaded into GPU memory only once. Each LoRA adapter at rank 16 is approximately 100-200MB, so even loading 10 adapters simultaneously adds less than 2GB of additional memory.
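A back-of-the-envelope check on that figure, assuming 80 transformer layers, hidden size 8192, four target projections, and FP16 adapter weights (grouped-query attention makes the k/v projections smaller, so this slightly overestimates):

# Rough rank-16 adapter size for a Llama-70B-class model
layers, hidden, modules, rank = 80, 8192, 4, 16
params = layers * modules * 2 * rank * hidden  # matrices A (d×r) and B (r×d) per module
print(f"{params / 1e6:.0f}M params ≈ {params * 2 / 1e6:.0f} MB in FP16")
# → ~84M params ≈ ~168 MB, consistent with the 100-200 MB range above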
vLLM Multi-LoRA Configuration
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-lora \
  --lora-modules \
    bank-ledger=/models/lora/bank \
    stock-order=/models/lora/stock \
    insurance-contract=/models/lora/insurance \
  --max-lora-rank 16 \
  --max-loras 3
Key Options:
| Option | Description | Default |
|---|---|---|
| --enable-lora | Enable Multi-LoRA | false |
| --lora-modules | Register adapters as name=path | - |
| --max-lora-rank | Maximum LoRA rank | 16 |
| --max-loras | Maximum adapters loaded simultaneously | 1 |
| --max-cpu-loras | Number of adapters cached in CPU memory | - |
vLLM loads adapters into GPU memory at request time. Using more adapters than --max-loras causes swap latency (hundreds of ms). Set --max-loras to match the number of frequently used adapters.
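Adapters can also be swapped at runtime without a server restart, which is what enables hot-swap deployment of new versions. A minimal sketch, assuming the server was started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True and the new adapter has been synced from S3 to local disk:

import requests

BASE = "http://vllm-svc:8000"  # assumed in-cluster service address

# Register the newly trained adapter version alongside the old one
requests.post(f"{BASE}/v1/load_lora_adapter",
              json={"lora_name": "bank-ledger-v1.1", "lora_path": "/models/lora/bank-v1.1"})

# Retire the old version once traffic has drained
requests.post(f"{BASE}/v1/unload_lora_adapter",
              json={"lora_name": "bank-ledger"})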
Specifying Adapters in Requests
Make requests via the OpenAI-compatible API, passing the registered adapter name as the model. vLLM resolves it to the matching LoRA adapter; requests that use the base model name are served without an adapter.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-svc:8000/v1", api_key="dummy")  # assumed service address

response = client.chat.completions.create(
    model="bank-ledger",  # registered adapter name from --lora-modules
    messages=[{"role": "user", "content": "Convert this COBOL ledger code to Java"}],
)
Per-customer Routing (Bifrost + X-Customer-Domain Header)
Use kgateway's HTTPRoute for HTTP header-based per-customer LoRA adapter routing.
# kgateway HTTPRoute - Per-customer LoRA routing
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: lora-routing
spec:
  parentRefs:
  - name: agentic-gateway  # Gateway this route attaches to (name illustrative)
  rules:
- matches:
- headers:
- name: X-Customer-Domain
value: bank
backendRefs:
- name: vllm-svc
port: 8000
Client (Aider/Cline) → Sets X-Customer-Domain: bank header → kgateway → Bifrost → vLLM (auto-maps to lora_name=bank-ledger)
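On the client side, the header can be attached once per customer session; a sketch using the OpenAI SDK's default_headers (the gateway URL is illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="http://kgateway.internal/v1",  # assumed gateway endpoint
    api_key="dummy",
    default_headers={"X-Customer-Domain": "bank"},  # drives LoRA adapter selection
)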
Per-customer Inference Tracking and Cost Billing
Track each customer's inference requests with an LLM tracing system to monitor per-adapter performance and bill monthly costs.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

# One trace per inference request; user_id and metadata drive per-customer reports
trace = langfuse.trace(
    name="inference",
    user_id="customer-bank-A",
    metadata={
        "lora": "bank-ledger",  # adapter that served the request
        "model": "glm-5-32b",
        "domain": "ledger"
    }
)
Monthly Cost Billing Example:
| Customer | Requests | Tokens | GPU Time | Cost |
|---|---|---|---|---|
| Bank A | 100,000 | 500M | 50 hours | $2,500 |
| Securities B | 50,000 | 250M | 25 hours | $1,250 |
| Insurance C | 30,000 | 150M | 15 hours | $750 |
For implementation details, see Agent Monitoring and LLM Tracing Deployment.
SLM Cascade Routing (Cost Optimization)
Cascade Architecture
Sending every request to a large model (LLM) is wasteful: in typical coding-assistant traffic, roughly 70% of requests can be handled adequately by a small model (SLM). The analysis below assumes that 70:30 split.
Cost Analysis
| Metric | SLM Only | LLM Only | Cascade (70:30) |
|---|---|---|---|
| Monthly Cost | $500 | $8,900 | $3,020 |
| Accuracy | 70% | 95% | 92% |
| Cost Savings | - | - | 66% |
Adopting Cascade saves $5,880/month ($70,560/year). Setup takes only 1-2 days, making it immediately worthwhile.
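The cascade figure is a simple traffic-weighted blend of the two single-model costs (assuming cost scales linearly with traffic share):

# 70% of traffic served by the SLM, 30% by the LLM
slm_full, llm_full = 500, 8900             # $/month when serving 100% of traffic
cascade = 0.7 * slm_full + 0.3 * llm_full  # 3020.0
savings = llm_full - cascade               # 5880.0 (~66% of the LLM-only cost)
print(cascade, savings)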
Bifrost Cascade Config
{
"providers": {
"openai": {
"keys": [
{
"name": "slm",
"value": "dummy",
"weight": 0.7,
"models": ["llama-8b"]
},
{
"name": "llm",
"value": "dummy",
"weight": 0.3,
"models": ["glm5"]
}
],
"network_config": {
"base_url": "http://glm5-serving:8000"
}
}
}
}
Bifrost's current cascade routing operates at the provider level and does not support automatic routing based on request complexity. It works with simple weight-based distribution or fallback conditions (5xx, latency exceeded). Complexity-based routing must be implemented with llm-d or custom logic.
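A minimal sketch of such custom complexity-based routing at the client or middleware layer; the heuristic, thresholds, and gateway address are illustrative and should be tuned against real traffic:

from openai import OpenAI

client = OpenAI(base_url="http://bifrost:8080/v1", api_key="dummy")  # assumed gateway address

def pick_model(prompt: str) -> str:
    # Naive heuristic: long prompts or structurally hard requests go to the LLM
    hard_markers = ("refactor", "architecture", "migrate", "multi-file")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return "glm5"  # large model
    return "llama-8b"  # small model

prompt = "Convert this COBOL paragraph to Java"
response = client.chat.completions.create(
    model=pick_model(prompt),
    messages=[{"role": "user", "content": prompt}],
)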
SLM Deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-slm
namespace: agentic-serving
spec:
replicas: 1
selector:
matchLabels:
app: vllm-slm
template:
metadata:
labels:
app: vllm-slm
spec:
nodeSelector:
node.kubernetes.io/instance-type: g6.xlarge
containers:
- name: vllm
image: vllm/vllm-openai:latest
command: ["vllm", "serve"]
args:
- "meta-llama/Llama-3.3-8B-Instruct"
- "--served-model-name=llama-8b"
- "--tensor-parallel-size=1"
- "--max-model-len=32768"
- "--host=0.0.0.0"
- "--port=8000"
resources:
limits:
nvidia.com/gpu: 1
ports:
- containerPort: 8000
name: http
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: vllm-slm-svc
namespace: agentic-serving
spec:
selector:
app: vllm-slm
ports:
- port: 8000
targetPort: 8000
protocol: TCP
- GPU: 1× NVIDIA L4 (24GB VRAM)
- Cost: ~$0.31/hr (On-Demand), ~$0.09/hr (Spot)
- Sufficient for serving 8B models
- Reference: Cost Threshold Analysis
Evaluation Pipeline
LoRA Adapter Evaluation Matrix
Trained adapters must pass multiple evaluations before deployment.
| Evaluation Method | Purpose | Tool | Automation |
|---|---|---|---|
| RAGAS | RAG accuracy (faithfulness, relevancy) | ragas | CI/CD integration |
| SWE-bench | Coding quality (real issue resolution) | swe-bench | CI/CD integration |
| Domain Expert Review | Business correctness validation | Langfuse Annotation | Manual |
| Red-teaming | Security/safety (prompt injection, etc.) | Garak | CI/CD integration |
Minimum criteria for adapter deployment:
- RAGAS Faithfulness: ≥ 0.85
- SWE-bench Resolved: ≥ 30%
- Domain expert approval: At least 2 out of 3
- Garak security test: 0 critical findings
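These thresholds can be enforced as a single CI gate before adapter registration; a sketch with illustrative metric names:

def passes_deployment_gate(m: dict) -> bool:
    # Thresholds mirror the deployment criteria above
    return (m["ragas_faithfulness"] >= 0.85
            and m["swebench_resolved_rate"] >= 0.30
            and m["expert_approvals"] >= 2
            and m["garak_critical_findings"] == 0)

assert passes_deployment_gate({
    "ragas_faithfulness": 0.88,
    "swebench_resolved_rate": 0.34,
    "expert_approvals": 3,
    "garak_critical_findings": 0,
})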
LoRA A/B Testing
Before deploying a new adapter version, A/B test it against the current one using LLM tracing system tags: recording the LoRA version as a tag in request metadata enables per-adapter performance comparison in the dashboard.
For implementation examples, see Agent Monitoring - A/B Testing.
A/B Test Comparison Metrics:
| Metric | Measurement | Meaning |
|---|---|---|
| Accuracy | SWE-bench / domain tests | Code conversion correctness |
| Latency | LLM tracing p50/p95 | Response speed |
| Token Efficiency | output_tokens / input_tokens | Answer conciseness |
| User Satisfaction | Annotation Score | Real user evaluation |
- Reference: RAGAS Evaluation Framework
- Reference: LLMOps Observability Evaluation Pipeline
Phased Implementation Roadmap
| Phase | Timeline | Components | Cost (USD) | Key Actions |
|---|---|---|---|---|
| 1 | Immediate | Base Model + Steering | $8,900/mo (GPU) | vLLM deployment, Bifrost + Langfuse integration |
| 2 | 1-2 weeks | + VectorRAG | +infra | Milvus deployment, internal document embedding |
| 3 | 2-4 weeks | + SLM Cascade | +$500/mo | SLM deployment, Bifrost cascade configuration |
| 4 | 1-2 months | + LoRA Fine-tuning | +$2K (one-time) | Training data collection → QLoRA → Evaluation → Multi-LoRA deployment |
Upon completing Phase 4:
- COBOL→Java migration: 10,000 modules × 1.5 hours saved = 15,000 hours saved (~$750K)
- LoRA training cost: $2,000 (one-time)
- Monthly operational savings: $5,880 (Cascade effect)
- ROI: 375×
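The arithmetic behind these figures (the ~$750K valuation implies roughly $50 per engineering hour):

hours_saved = 10_000 * 1.5    # modules × hours saved per module = 15,000 hours
value_usd = hours_saved * 50  # ≈ $750,000 at $50/hour
roi = value_usd / 2_000       # vs. the one-time LoRA training cost → 375×
print(hours_saved, value_usd, roi)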
References
Official Documentation
| Resource | Link |
|---|---|
| LoRA Paper (Hu et al., 2021) | arxiv.org/abs/2106.09685 |
| QLoRA Paper (Dettmers et al., 2023) | arxiv.org/abs/2305.14314 |
| vLLM Multi-LoRA | docs.vllm.ai/en/latest/models/lora.html |
| Unsloth Fast Training | github.com/unslothai/unsloth |
| NeMo Framework | docs.nvidia.com/nemo-framework |
| RAGAS Evaluation | docs.ragas.io |
| Bifrost AI Gateway | docs.getbifrost.ai |
| Agent Monitoring | agent-monitoring.md |
| LLM Tracing Deployment | monitoring-observability-setup.md |
| Custom Model Deployment Guide | custom-model-deployment.md |