Custom Model Pipeline Guide

Overview

Why You Need a Custom Model Pipeline

SaaS-based AI coding tools (e.g., Kiro, GitHub Copilot) offer a quick start, but they hit fundamental limitations in enterprise environments.

| Constraint | SaaS (Kiro, etc.) | Self-hosted Pipeline |
|---|---|---|
| LoRA Fine-tuning | Not possible | Domain-specific adapter training |
| Data Sovereignty | Code sent externally | Stays within VPC |
| Model Selection | Limited to provided models | Free choice of open-source models |
| Cost Control | Fixed per-token pricing | 66% savings possible with SLM Cascade |
| Per-customer Optimization | Shared general-purpose model | Multi-LoRA for customer-specific specialization |
Core Strategy

The Base Model + LoRA adapter pattern serves multiple domain-specialized models simultaneously on a single GPU. Since base model weights are shared, GPU memory efficiency is maximized.

End-to-End Pipeline Flow

The training pipeline fine-tunes on domain data with QLoRA, and only adapters that pass evaluation are registered in the registry. The serving pipeline loads multiple adapters simultaneously with vLLM Multi-LoRA and performs cost-optimized routing between the SLM and LLM through the Bifrost Cascade.

Related Documentation

LoRA Training & Deployment Pipeline (Domain Specialization)

This section covers how to implement LoRA Fine-tuning and Multi-LoRA hot-swap deployment in the domain specialization strategy. For strategic background and decision criteria for domain specialization, see Domain Customization (LoRA + RAG).


LoRA Fine-tuning Pipeline

QLoRA GPU Savings

QLoRA (Quantized LoRA) trains only the LoRA adapter while keeping the base model quantized to INT4. This dramatically reduces GPU requirements compared to full fine-tuning.

| Llama-3.3-70B | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| GPUs | H100×32 (impractical) | H100×8 | H100×4 |
| VRAM | 280 GB | 80 GB | 40 GB |
| Training Time | - | 5 days | 2-3 days |
| Cost | - | $8,000 | $2,000 |
INT4 Quantization Precision

QLoRA keeps base model weights in INT4 during training, so tasks requiring extremely precise numerical computations (e.g., financial calculations) may show slight accuracy differences compared to LoRA (FP16). Always validate in the domain evaluation stage.

Training Data Format

Prepare training data as input-output pairs in JSONL format.

{
  "input": "COBOL: PERFORM CALC-INTEREST USING WS-PRINCIPAL WS-RATE.",
  "output": "Java: @Transactional public BigDecimal calcInterest(BigDecimal principal, BigDecimal rate) { return principal.multiply(rate).setScale(2, RoundingMode.HALF_UP); }"
}

Data Collection Strategy:

| Source | Transformation Method | Expected Data Volume |
|---|---|---|
| Legacy COBOL code | Generate COBOL → Java translation pairs | 10,000+ modules |
| Internal frameworks | Framework pattern → code pairs | 5,000+ patterns |
| Code review history | Pre-fix → post-fix pairs | 20,000+ commits |
| Technical docs | Documentation → implementation code pairs | 3,000+ pages |
Data Quality Determines Model Quality

Quality matters more than quantity. 1,000 high-quality pairs reviewed by senior developers are more effective than 10,000 auto-generated ones. Start with at least 500 reviewed pairs.
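
As a minimal sketch of the data-preparation step (file name and length threshold are assumptions), the snippet below deduplicates reviewed pairs and writes them in the JSONL format shown above:

import hashlib
import json
from pathlib import Path

def write_training_jsonl(pairs, out_path="cobol_to_java.jsonl", min_len=20):
    """Deduplicate reviewed (input, output) pairs and write them as JSONL."""
    seen = set()
    kept = 0
    with Path(out_path).open("w", encoding="utf-8") as f:
        for inp, out in pairs:
            # Drop trivially short or empty samples
            if len(inp.strip()) < min_len or len(out.strip()) < min_len:
                continue
            # Deduplicate on a hash of the input so each COBOL module appears once
            key = hashlib.sha256(inp.strip().encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            f.write(json.dumps({"input": inp, "output": out}, ensure_ascii=False) + "\n")
            kept += 1
    return kept

# Example: a pair collected from code review history (pre-fix -> post-fix)
pairs = [
    ("COBOL: PERFORM CALC-INTEREST USING WS-PRINCIPAL WS-RATE.",
     "Java: public BigDecimal calcInterest(BigDecimal principal, BigDecimal rate) { ... }"),
]
print(write_training_jsonl(pairs), "pairs written")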

Training Frameworks

NeMo Framework (NVIDIA)

NVIDIA's official framework optimized for large-scale model training. Natively supports multi-GPU and multi-node distributed training.

python train_lora.py \
  --config-path=conf \
  --config-name=llama3_70b_lora \
  model.data.train_ds.file_path=cobol_to_java.jsonl \
  model.peft.lora_tuning.adapter_dim=16
Key NeMo Configuration Parameters
  • adapter_dim (rank): 16 is typical. Can increase to 32-64 for complex domains
  • adapter_dropout: 0.05 recommended (prevents overfitting)
  • target_modules: attention layers (q_proj, k_proj, v_proj, o_proj)

Unsloth (2× Faster Training)

An open-source library that doubles LoRA/QLoRA training speed on a single node while reducing memory usage by up to 50%.

from datasets import load_dataset
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.3-70B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA: INT4 quantization
)

# Attach LoRA adapters to the attention projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank
    lora_alpha=32,   # LoRA scaling factor
    target_modules=["q_proj", "k_proj", "v_proj"],
)

# Load the JSONL pairs prepared above and flatten them into a single text field
# (the prompt template below is illustrative)
dataset = load_dataset("json", data_files="cobol_to_java.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": f"### Input\n{ex['input']}\n### Output\n{ex['output']}"}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()
| Framework | Strengths | Best For |
|---|---|---|
| NeMo | Multi-node distributed training, official NVIDIA support | H100 cluster available, large-scale training |
| Unsloth | 2× faster training, memory savings, simple API | Single node, rapid prototyping |

Checkpoint Management

Trained LoRA adapters are stored in S3 with version management via MLflow.

# Adapter storage structure
s3://model-registry/
└── lora-adapters/
    ├── bank-ledger/
    │   ├── v1.0/adapter_model.safetensors
    │   ├── v1.1/adapter_model.safetensors
    │   └── latest -> v1.1
    ├── stock-order/
    │   └── v1.0/adapter_model.safetensors
    └── insurance-contract/
        └── v1.0/adapter_model.safetensors
MLflow Integration

Recording training metrics (loss, accuracy) alongside adapter paths in MLflow lets you track which dataset and hyperparameter combinations are optimal.
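
A minimal sketch of this integration (run name, metric values, and the tag key are illustrative):

import mlflow

# Assumes an MLflow tracking server is configured via MLFLOW_TRACKING_URI
with mlflow.start_run(run_name="bank-ledger-v1.1"):
    # Hyperparameters and dataset identity for later comparison
    mlflow.log_params({
        "base_model": "meta-llama/Llama-3.3-70B-Instruct",
        "lora_rank": 16,
        "lora_alpha": 32,
        "dataset": "cobol_to_java.jsonl",
    })
    # Final metrics from the training run (illustrative values)
    mlflow.log_metrics({"train_loss": 0.42, "eval_accuracy": 0.91})
    # Record where the adapter weights were uploaded
    mlflow.set_tag("adapter_s3_path", "s3://model-registry/lora-adapters/bank-ledger/v1.1/")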


Multi-LoRA Hot-swap Deployment

Architecture

Leveraging vLLM's Multi-LoRA feature, you can load multiple LoRA adapters simultaneously on top of a single base model to serve customized responses per customer.

Multi-LoRA Memory Efficiency

The base model (e.g., 70B) is loaded into GPU memory only once. Each LoRA adapter at rank 16 is approximately 100-200MB, so even loading 10 adapters simultaneously adds less than 2GB of additional memory.

vLLM Multi-LoRA Configuration

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-lora \
  --lora-modules \
    bank-ledger=/models/lora/bank \
    stock-order=/models/lora/stock \
    insurance-contract=/models/lora/insurance \
  --max-lora-rank 16

Key Options:

| Option | Description | Default |
|---|---|---|
| --enable-lora | Enable Multi-LoRA | false |
| --lora-modules | Register adapters as name=path | - |
| --max-lora-rank | Maximum LoRA rank | 16 |
| --max-loras | Maximum adapters loaded simultaneously | 1 |
| --max-cpu-loras | Number of adapters cached in CPU memory | - |
Hot-swap Considerations

vLLM loads adapters into GPU memory at request time. Using more adapters than --max-loras causes swap latency (hundreds of ms). Set --max-loras to match the number of frequently used adapters.

Specifying Adapters in Requests

Make requests via the OpenAI-compatible API and select the adapter by passing its registered name (from --lora-modules) as the model.

from openai import OpenAI

# vLLM's OpenAI-compatible endpoint (adjust host/port to your deployment)
client = OpenAI(base_url="http://vllm-svc:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="bank-ledger",  # registered LoRA adapter name; use the base model name for no adapter
    messages=[{"role": "user", "content": "Convert this COBOL ledger code to Java"}],
)

Per-customer Routing (Bifrost + X-Customer-Domain Header)

Use kgateway's HTTPRoute for HTTP header-based per-customer LoRA adapter routing.

# kgateway HTTPRoute - Per-customer LoRA routing
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: lora-routing
spec:
  rules:
    - matches:
        - headers:
            - name: X-Customer-Domain
              value: bank
      backendRefs:
        - name: vllm-svc
          port: 8000
Routing Flow

Client (Aider/Cline) → sets the X-Customer-Domain: bank header → kgateway → Bifrost → vLLM (the request is mapped to the bank-ledger adapter)
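
A minimal client-side sketch of this flow (the gateway URL is a placeholder; adapter selection itself happens in the gateway configuration above):

from openai import OpenAI

# Send requests through the gateway; the customer-domain header drives LoRA routing
client = OpenAI(
    base_url="http://kgateway.internal/v1",  # hypothetical gateway endpoint
    api_key="dummy",
    default_headers={"X-Customer-Domain": "bank"},
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Convert this COBOL ledger code to Java"}],
)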

Per-customer Inference Tracking and Cost Billing

Track each customer's inference requests with an LLM tracing system to monitor per-adapter performance and bill monthly costs.

from langfuse import Langfuse

langfuse = Langfuse()

trace = langfuse.trace(
    name="inference",
    user_id="customer-bank-A",
    metadata={
        "lora": "bank-ledger",
        "model": "glm-5-32b",
        "domain": "ledger",
    },
)

Monthly Cost Billing Example:

| Customer | Requests | Tokens | GPU Time | Cost |
|---|---|---|---|---|
| Bank A | 100,000 | 500M | 50 hours | $2,500 |
| Securities B | 50,000 | 250M | 25 hours | $1,250 |
| Insurance C | 30,000 | 150M | 15 hours | $750 |
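
As a sketch of how these rows can be produced, the snippet below aggregates per-customer GPU hours into a monthly bill. The $50/GPU-hour rate is implied by the table above; real usage records would come from the tracing backend.

# Aggregate per-customer usage into a monthly bill
GPU_RATE_PER_HOUR = 50.0  # USD, implied by the table above ($2,500 / 50 hours)

usage = [
    {"customer": "Bank A", "requests": 100_000, "gpu_hours": 50},
    {"customer": "Securities B", "requests": 50_000, "gpu_hours": 25},
    {"customer": "Insurance C", "requests": 30_000, "gpu_hours": 15},
]

for row in usage:
    cost = row["gpu_hours"] * GPU_RATE_PER_HOUR
    print(f"{row['customer']}: {row['requests']:,} requests, {row['gpu_hours']} GPU hours -> ${cost:,.0f}")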

For implementation details, see Agent Monitoring and LLM Tracing Deployment.


SLM Cascade Routing (Cost Optimization)

Cascade Architecture

Sending every request to a large model (LLM) is wasteful. 70% of requests can be handled adequately by a small model (SLM).

Cost Analysis

| | SLM Only | LLM Only | Cascade (70:30) |
|---|---|---|---|
| Monthly Cost | $500 | $8,900 | $3,020 |
| Accuracy | 70% | 95% | 92% |
| Cost Savings | - | - | 66% |
ROI Calculation

With a 70:30 split, the blended cost is 0.7 × $500 + 0.3 × $8,900 = $3,020/month, saving $5,880/month ($70,560/year) versus LLM-only. Setup takes only 1-2 days, making Cascade adoption immediately worthwhile.

Bifrost Cascade Config

{
  "providers": {
    "openai": {
      "keys": [
        {
          "name": "slm",
          "value": "dummy",
          "weight": 0.7,
          "models": ["llama-8b"]
        },
        {
          "name": "llm",
          "value": "dummy",
          "weight": 0.3,
          "models": ["glm5"]
        }
      ],
      "network_config": {
        "base_url": "http://glm5-serving:8000"
      }
    }
  }
}
Bifrost Cascade Limitations

Bifrost's current cascade routing operates at the provider level and does not support automatic routing based on request complexity. It works with simple weight-based distribution or fallback conditions (5xx, latency exceeded). Complexity-based routing must be implemented with llm-d or custom logic.
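
Since complexity-based routing is not built into Bifrost, the sketch below shows one way to implement it as custom logic in front of the gateway. The heuristics, model names, and gateway URL are illustrative assumptions, not part of Bifrost.

from openai import OpenAI

SLM = "llama-8b"  # small model served by vllm-slm-svc (deployment below)
LLM = "glm5"      # large model behind the same gateway

client = OpenAI(base_url="http://bifrost.internal/v1", api_key="dummy")  # hypothetical gateway URL

def pick_model(prompt: str) -> str:
    """Crude complexity heuristic: long prompts or heavyweight tasks go straight to the LLM."""
    hard_keywords = ("refactor", "architecture", "migrate", "multi-file")
    if len(prompt) > 2000 or any(k in prompt.lower() for k in hard_keywords):
        return LLM
    return SLM

def cascade_complete(prompt: str):
    model = pick_model(prompt)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Escalate to the LLM if the SLM returns an empty answer
    if model == SLM and not resp.choices[0].message.content:
        resp = client.chat.completions.create(
            model=LLM,
            messages=[{"role": "user", "content": prompt}],
        )
    return resp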

SLM Deployment YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-slm
  namespace: agentic-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-slm
  template:
    metadata:
      labels:
        app: vllm-slm
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g6.xlarge
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["vllm", "serve"]
          args:
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--served-model-name=llama-8b"
            - "--tensor-parallel-size=1"
            - "--max-model-len=32768"
            - "--host=0.0.0.0"
            - "--port=8000"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
              name: http
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-slm-svc
  namespace: agentic-serving
spec:
  selector:
    app: vllm-slm
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
g6.xlarge Instance Specs
  • GPU: 1× NVIDIA L4 (24GB VRAM)
  • Cost: ~$0.31/hr (On-Demand), ~$0.09/hr (Spot)
  • Sufficient for serving 8B models

Evaluation Pipeline

LoRA Adapter Evaluation Matrix

Trained adapters must pass multiple evaluations before deployment.

| Evaluation Method | Purpose | Tool | Automation |
|---|---|---|---|
| RAGAS | RAG accuracy (faithfulness, relevancy) | ragas | CI/CD integration |
| SWE-bench | Coding quality (real issue resolution) | swe-bench | CI/CD integration |
| Domain Expert Review | Business correctness validation | Langfuse Annotation | Manual |
| Red-teaming | Security/safety (prompt injection, etc.) | Garak | CI/CD integration |
Evaluation Thresholds

Minimum criteria for adapter deployment:

  • RAGAS Faithfulness: ≥ 0.85
  • SWE-bench Resolved: ≥ 30%
  • Domain expert approval: At least 2 out of 3
  • Garak security test: 0 critical findings
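
A minimal CI gating sketch for these thresholds (metric names and example scores are illustrative; in practice they come from the RAGAS, SWE-bench, and Garak jobs plus expert review):

import sys

# Evaluation results for a candidate adapter (illustrative values)
results = {
    "ragas_faithfulness": 0.88,
    "swe_bench_resolved": 0.33,
    "expert_approvals": 2,          # out of 3 reviewers
    "garak_critical_findings": 0,
}

passed = (
    results["ragas_faithfulness"] >= 0.85
    and results["swe_bench_resolved"] >= 0.30
    and results["expert_approvals"] >= 2
    and results["garak_critical_findings"] == 0
)

if not passed:
    print("Adapter failed the evaluation gate; blocking registry promotion.")
    sys.exit(1)
print("Adapter passed all gates; promoting to registry.")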

LoRA A/B Testing

Before deploying a new adapter version, compare performance using LLM tracing system tags for A/B testing. Recording the lora version in request metadata as a tag allows per-adapter performance comparison in the dashboard.
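
A minimal tagging sketch (tag and metadata names are illustrative):

from langfuse import Langfuse

langfuse = Langfuse()

# Tag each inference trace with the adapter version so the dashboard can compare v1.0 vs v1.1
trace = langfuse.trace(
    name="inference",
    user_id="customer-bank-A",
    tags=["lora:bank-ledger", "lora-version:v1.1"],
    metadata={"lora": "bank-ledger", "lora_version": "v1.1"},
)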

For implementation examples, see Agent Monitoring - A/B Testing.

A/B Test Comparison Metrics:

| Metric | Measurement | Meaning |
|---|---|---|
| Accuracy | SWE-bench / domain tests | Code conversion correctness |
| Latency | LLM tracing p50/p95 | Response speed |
| Token Efficiency | output_tokens / input_tokens | Answer conciseness |
| User Satisfaction | Annotation Score | Real user evaluation |

Phased Implementation Roadmap

| Phase | Timeline | Components | Cost (USD) | Key Actions |
|---|---|---|---|---|
| 1 | Immediate | Base Model + Steering | $8,900/mo (GPU) | vLLM deployment, Bifrost + Langfuse integration |
| 2 | 1-2 weeks | + VectorRAG | + infra | Milvus deployment, internal document embedding |
| 3 | 2-4 weeks | + SLM Cascade | + $500/mo | SLM deployment, Bifrost cascade configuration |
| 4 | 1-2 months | + LoRA Fine-tuning | + $2K (one-time) | Training data collection → QLoRA → Evaluation → Multi-LoRA deployment |
Return on Investment (ROI)

Upon completing Phase 4:

  • COBOL→Java migration: 10,000 modules × 1.5 hours saved = 15,000 hours saved (~$750K at ~$50/hour)
  • LoRA training cost: $2,000 (one-time)
  • Monthly operational savings: $5,880 (Cascade effect)
  • ROI: 375× ($750K saved ÷ $2,000 one-time training cost)

References

Official Documentation

| Resource | Link |
|---|---|
| LoRA Paper (Hu et al., 2021) | arxiv.org/abs/2106.09685 |
| QLoRA Paper (Dettmers et al., 2023) | arxiv.org/abs/2305.14314 |
| vLLM Multi-LoRA | docs.vllm.ai/en/latest/models/lora.html |
| Unsloth Fast Training | github.com/unslothai/unsloth |
| NeMo Framework | docs.nvidia.com/nemo-framework |
| RAGAS Evaluation | docs.ragas.io |
| Bifrost AI Gateway | docs.getbifrost.ai |
| Agent Monitoring | agent-monitoring.md |
| LLM Tracing Deployment | monitoring-observability-setup.md |
| Custom Model Deployment Guide | custom-model-deployment.md |