
GRPO/DPO Training Job

Overview

This document covers Stage 3 (Preference Tuning) of the Continuous Training Pipeline. It takes the labeled trace dataset (with a final_reward column) produced by the previous stage and fine-tunes the model using NeMo-RL-based GRPO or TRL-based DPO. Karpenter Spot node pools and Volcano gang scheduling keep GPU cost and idle time to a minimum.

GRPO vs DPO Concepts

GRPO (Group Relative Policy Optimization)

GRPO ranks multiple responses to the same prompt by reward and learns from the relative ordering.

Prompt: "What are the advantages of EKS Auto Mode?"

Response A (reward=0.9): "AWS fully manages nodes, reducing operational burden..."
Response B (reward=0.6): "Auto Mode is convenient..."
Response C (reward=0.3): "I don't know."

Training: Policy optimization with ranking A > B > C

Advantages:

  • Learns relative ranking instead of absolute scores → robust to labeling noise
  • Generates multiple responses per prompt → data-efficient
  • Simpler than RLHF (no separate reward model training required)
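
The relative ordering can be made concrete: within each prompt's group of sampled responses, rewards are normalized and the result is used as the advantage for the policy update. A minimal NumPy sketch of this group-relative baseline (illustrative only, not the NeMo-RL API), using the rewards from the example above:

import numpy as np

# Rewards for the three sampled responses to the prompt above (A, B, C)
group_rewards = np.array([0.9, 0.6, 0.3])

# Group-relative advantage: normalize rewards within the group.
# The policy update then raises the likelihood of responses with positive
# advantage (A) and lowers it for negative advantage (C); B is roughly neutral.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # ≈ [ 1.22, 0.00, -1.22]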

DPO (Direct Preference Optimization)

DPO directly learns from preferred/non-preferred pairs.

Prompt: "What are Karpenter's main features?"

Preferred (reward >= 0.7):
"Karpenter provides automatic node provisioning, bin-packing optimization..."

Non-preferred (reward < 0.5):
"Karpenter is a scaling tool." (too short)

Training: Increase probability of preferred responses ↑, decrease non-preferred ↓

Advantages:

  • Trains with a single loss; no separate reward model or value function as in PPO-based RLHF
  • Stable training (simpler hyperparameter tuning than PPO)
  • Proven in production post-training (e.g., Llama 3.1 uses DPO in its post-training recipe)
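
Behind the "preferred ↑ / non-preferred ↓" intuition is a single logistic loss computed against a frozen reference model. A minimal PyTorch sketch of the DPO objective (illustrative only; the TRL DPOTrainer used later computes this internally):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) pairs.

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model. beta plays the same role as
    DPOConfig.beta below: higher values enforce a larger preference margin.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the gap between the chosen and rejected margins.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()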

Selection Criteria

| Situation | Recommended Method | Reason |
|---|---|---|
| Can generate diverse responses per prompt | GRPO | Higher data efficiency through ranking |
| Clear preferred/non-preferred distinction | DPO | Simple and stable |
| High labeling noise | GRPO | Relative ranking more robust than absolute scores |
| Rapid prototyping | DPO | Simple hyperparameter tuning |

NeMo-RL-based GRPO Training

NeMo-RL is part of NVIDIA's NeMo Framework ecosystem for large-scale model training, focused on reinforcement-learning post-training (GRPO, DPO, SFT) at scale.

# nemo_grpo_training.py
from nemo.collections.llm import GRPO, GPTModel
from nemo.collections.nlp.data import PreferenceDataset
from nemo.lightning import Trainer  # Lightning-based trainer shipped with NeMo 2.x

# Load training data
dataset = PreferenceDataset(
    data_path='s3://training-data-lake/labeled-dataset/',
    reward_column='final_reward',
    min_reward_threshold=0.5,  # Exclude traces with reward below 0.5
)

# Load base model
model = GPTModel.from_pretrained('glm-5-32b')

# GRPO configuration (no value-function term: GRPO uses the group-relative baseline)
grpo_config = GRPO(
    num_iterations=1000,
    batch_size=32,
    learning_rate=1e-5,
    kl_coeff=0.1,   # KL divergence penalty (prevents drifting from the base model)
    cliprange=0.2,  # PPO-style clipping of the policy ratio
)

# Distributed training execution
trainer = Trainer(
    devices=8,        # 8 H200s per node
    num_nodes=3,      # 3 nodes = 24 GPUs
    precision='bf16',
    strategy='fsdp',  # Fully Sharded Data Parallel
)

trainer.fit(model, grpo_config, dataset)

TRL-based DPO Training

TRL (Transformer Reinforcement Learning) is Hugging Face's post-training library, covering SFT, DPO, PPO, and related methods.

# trl_dpo_training.py
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained('glm-5-32b', torch_dtype='bfloat16')
tokenizer = AutoTokenizer.from_pretrained('glm-5-32b')

# Prepare preferred/non-preferred dataset
def format_dpo_dataset(example):
    """Map a labeled trace to DPO format; the rejected response is matched separately."""
    return {
        'prompt': example['input'],
        'chosen': example['output'],
        'rejected': None,  # Non-preferred examples are paired in a separate matching step
    }

dataset = load_dataset('parquet', data_files='s3://training-data-lake/labeled-dataset/*.parquet')
dpo_dataset = (
    dataset['train']
    .filter(lambda x: x['final_reward'] >= 0.7)  # Keep preferred responses only
    .map(format_dpo_dataset)
)

# DPO training configuration
training_args = DPOConfig(
    output_dir='/output/glm-5-dpo',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    max_length=4096,
    beta=0.1,  # DPO temperature (higher values emphasize preference differences)
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
    save_strategy='steps',
    save_steps=100,
)

# Execute training
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

trainer.train()
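
In the script above, rejected is left empty and filled in a separate matching step. A minimal sketch of that pairing under this document's thresholds (reward >= 0.7 preferred, < 0.5 non-preferred); build_dpo_pairs is a hypothetical helper operating on the same input/output/final_reward fields:

from collections import defaultdict

def build_dpo_pairs(examples):
    """Group labeled traces by prompt and emit (prompt, chosen, rejected) pairs."""
    by_prompt = defaultdict(list)
    for ex in examples:
        by_prompt[ex['input']].append(ex)

    pairs = []
    for prompt, traces in by_prompt.items():
        chosen = [t for t in traces if t['final_reward'] >= 0.7]
        rejected = [t for t in traces if t['final_reward'] < 0.5]
        # Pair the best-rewarded response with the worst-rewarded one for this prompt.
        if chosen and rejected:
            pairs.append({
                'prompt': prompt,
                'chosen': max(chosen, key=lambda t: t['final_reward'])['output'],
                'rejected': min(rejected, key=lambda t: t['final_reward'])['output'],
            })
    return pairs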

Kubernetes Job YAML

# grpo-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: grpo-training-glm5
  namespace: training-pipeline
spec:
  completions: 3            # One completion per node
  parallelism: 3            # 3-node parallel execution
  completionMode: Indexed   # Stable pod index for distributed rank assignment
  template:
    metadata:
      labels:
        app: grpo-training
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: p5en.48xlarge  # 8 H200s per node
        karpenter.sh/capacity-type: spot                  # Use Spot instances
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: karpenter.sh/capacity-type
          operator: Equal
          value: spot
          effect: NoSchedule

      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: training-checkpoints

      containers:
        - name: nemo-trainer
          image: nvcr.io/nvidia/nemo:26.02
          command:
            - python
            - /workspace/nemo_grpo_training.py
          args:
            - --data-path=s3://training-data-lake/labeled-dataset/
            - --output-path=/checkpoints/grpo-run-001
            - --num-nodes=3
            - --devices=8
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: 1600Gi  # Host RAM request; leaves headroom on the 2 TiB p5en.48xlarge node
            limits:
              nvidia.com/gpu: 8
              memory: 1600Gi
          env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: NCCL_MIN_NCHANNELS
              value: "16"
            - name: FI_PROVIDER
              value: "efa"
            - name: FI_EFA_USE_DEVICE_RDMA
              value: "1"

      restartPolicy: OnFailure
---
# Karpenter NodePool - Spot instances
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: training-spot-pool
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5en.48xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-2a", "us-east-2b"]

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: training-gpu-class

      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
        - key: karpenter.sh/capacity-type
          value: spot
          effect: NoSchedule

Volcano Batch Scheduling

Volcano is a batch scheduler for AI/ML workloads on Kubernetes. Its gang scheduling waits until every pod in a job can be scheduled, then starts them all simultaneously.

# volcano-job.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: grpo-training-volcano
spec:
  minAvailable: 3          # Wait until all 3 pods can be scheduled
  schedulerName: volcano
  queue: training-queue
  tasks:
    - name: trainer
      replicas: 3
      template:
        spec:
          # (Same container spec as the batch Job above)

Why Gang Scheduling:

Standard Kubernetes:
Node 1: Starts immediately → Waiting for others → GPU idle
Node 2: Starts after 5 minutes
Node 3: Starts after 10 minutes
→ Node 1's GPUs wasted for 10 minutes

Volcano Gang Scheduling:
Nodes 1, 2, 3: Wait until all ready
→ Start simultaneously after 10 minutes → All GPUs utilized immediately

Cost Example

| Resource | Spec | Hourly Cost | Training Time (1 epoch) | Total Cost |
|---|---|---|---|---|
| p5en.48xlarge Spot | 8 H200s × 3 nodes | $10-15/GPU-hr | 4-6 hours | $960-2,160 |
| FSx for Lustre (training data) | 1.2 MB/s/TiB | $0.14/GB-month | - | ~$50 |
| S3 checkpoint storage | - | $0.023/GB-month | - | ~$10 |
| Total cost per iteration | - | - | - | $1,020-2,220 |
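
The compute line item follows directly from GPU count, Spot rate, and run time; a quick sanity check of the arithmetic (rates and durations taken from the table above):

gpus = 8 * 3                    # 8 H200s per p5en.48xlarge node x 3 nodes = 24 GPUs
rate_low, rate_high = 10, 15    # assumed $/GPU-hour Spot range
hours_low, hours_high = 4, 6    # training time for one epoch

print(gpus * rate_low * hours_low)    # 960  -> lower bound of $960
print(gpus * rate_high * hours_high)  # 2160 -> upper bound of $2,160
# Storage (~$60 of FSx + S3) brings the per-iteration total to roughly $1,020-2,220.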
Cost Disclaimer

p5en Spot pricing varies with demand. Automatic checkpoint saving is required to survive Spot interruptions. Assuming 24 training iterations per year, the annual cost falls in the $24K-53K range.
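
Because a Spot reclaim gives only a short warning before the pod is terminated, the training process should flush a checkpoint on SIGTERM in addition to its periodic save_steps checkpoints. A minimal sketch for the TRL trainer above (install_spot_interruption_handler is a hypothetical helper):

import signal
import sys

def install_spot_interruption_handler(trainer, checkpoint_dir):
    """Save a final checkpoint when Kubernetes sends SIGTERM (Spot reclaim or eviction)."""
    def handle_sigterm(signum, frame):
        trainer.save_model(checkpoint_dir)  # Hugging Face Trainer API; DPOTrainer inherits it
        sys.exit(0)
    signal.signal(signal.SIGTERM, handle_sigterm)

# Usage after constructing the DPOTrainer:
#   install_spot_interruption_handler(trainer, '/checkpoints/dpo-latest')
#   trainer.train(resume_from_checkpoint=True)  # resume from the latest save_steps checkpoint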
