NeMo Framework

NVIDIA NeMo is an end-to-end framework for training, fine-tuning, and optimizing large language models (LLMs). It supports distributed training and efficient model deployment in Kubernetes environments.

Overview

Problems NeMo Solves

General-purpose LLMs (GPT-4, Claude, etc.) used in agentic AI platforms face the following limitations:

  • Lack of Domain Knowledge: Insufficient understanding of industry/company-specific terminology and context
  • Cost Issues: API costs surge at high call volumes (per-token billing)
  • Latency: Response delays from external API calls
  • Data Privacy: Cannot transmit sensitive data to external services
  • On-premises Requirements: Regulated industries (finance/healthcare) need self-hosted infrastructure

NeMo addresses these problems through domain-specific model fine-tuning.

NeMo Core Features

NeMo Framework Components

Component       | Role               | Description
NeMo Core       | Core framework     | Model definition, training loop
NeMo Curator    | Data processing    | Data filtering, deduplication
NeMo Aligner    | Alignment training | RLHF, DPO, SFT
NeMo Guardrails | Safety             | Input/output filtering

Key Benefits:

  • Efficient Fine-tuning: Train only 0.1% of total parameters with LoRA/QLoRA
  • Distributed Training: Automatic multi-node, multi-GPU parallelization (Tensor/Pipeline/Data Parallelism)
  • Inference Optimization: 2-4x performance improvement through TensorRT-LLM conversion
  • Enterprise Support: Checkpoint management, monitoring, production deployment pipeline

EKS Deployment Architecture

NeMo on EKS Configuration

Container Configuration

NeMo Container Image:

nvcr.io/nvidia/nemo:25.02
├── PyTorch 2.5.1
├── CUDA 12.6
├── NCCL 2.23+
├── Megatron-LM (NeMo integrated)
├── TensorRT-LLM 0.13+
└── Triton Inference Server 2.50+

Key Dependencies:

  • Kubeflow Training Operator: Distributed training orchestration via the PyTorchJob CRD (see the sketch after this list)
  • GPU Operator: Automatic NVIDIA driver, Device Plugin, DCGM installation
  • EFA Device Plugin: Enable inter-node RDMA communication
  • Karpenter: GPU node autoscaling
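
For example, a distributed NeMo training run is typically submitted as a PyTorchJob. Below is a minimal sketch using the official kubernetes Python client; the namespace, job name, script, and resource values are illustrative, not a prescribed configuration:

import datetime
from kubernetes import client, config

config.load_kube_config()

# Shared pod template: NeMo container requesting 8 GPUs per replica
container = {
    "name": "pytorch",  # PyTorchJob expects the main container to be named "pytorch"
    "image": "nvcr.io/nvidia/nemo:25.02",
    "command": ["python", "nemo_lora_finetune.py"],
    "resources": {"limits": {"nvidia.com/gpu": 8}},
}
template = {"spec": {"containers": [container]}}

# Minimal PyTorchJob: one master and one worker replica
pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "nemo-lora-finetune", "namespace": "training"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "template": template},
            "Worker": {"replicas": 1, "template": template},
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="training",
    plural="pytorchjobs", body=pytorchjob,
)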

GPU Node Requirements

Model Size | Minimum GPU  | Recommended Instance       | Memory Required
8B         | 1x A100 80GB | p4d.24xlarge               | 80GB+
13B        | 2x A100 80GB | p4d.24xlarge               | 160GB+
70B        | 8x A100 80GB | p4d.24xlarge / p5.48xlarge | 640GB+
405B       | 32x H100     | p5.48xlarge x4             | 2.5TB+

Fine-tuning Guide

SFT (Supervised Fine-Tuning) Concept

What is SFT?: A method that improves performance on specific tasks by further training a pre-trained model on domain-specific instruction-response data.

Pre-trained Model (general) → SFT → Domain-specific Model

When to Use?

  • Customer FAQ chatbot: Train on specific product/service Q&A
  • Financial report generation: Train on financial terminology and formats
  • Medical diagnostic assistance: Train on medical terminology and diagnostic patterns

Data Format:

{"input": "What is EKS Auto Mode?", "output": "EKS Auto Mode is a fully managed Kubernetes compute option where AWS automatically handles node provisioning, scaling, and security patching."}
{"input": "What are Karpenter's key features?", "output": "Karpenter provides automatic node provisioning, bin-packing optimization, Spot instance integration, and drift detection capabilities."}

PEFT/LoRA: Efficient Fine-tuning

PEFT (Parameter-Efficient Fine-Tuning): Instead of training all model parameters, train only adapter layers to save memory and time.

LoRA (Low-Rank Adaptation): The most widely used PEFT method; it freezes the original weights and trains only two small low-rank matrices (A, B).

Original weights W (freeze) + LoRA delta (A × B) = Final weights

LoRA Key Parameters:

Parameter      | Description               | Recommended    | Impact
r (rank)       | Rank of low-rank matrices | 8-64           | Higher = more expressive, more memory
alpha          | Scaling factor            | Same as r      | Controls LoRA weight influence
dropout        | Dropout rate              | 0.1            | Prevents overfitting
target_modules | Layers to train           | q_proj, v_proj | Attention layer selection
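
To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative only; NeMo's internal implementation differs):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank delta (B × A)."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 32, dropout: float = 0.1):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze original weights W
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scaling = alpha / r  # alpha controls the influence of the LoRA delta
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Final output = frozen W·x + scaled low-rank update (B·A)·x
        delta = self.dropout(x) @ self.lora_a.T @ self.lora_b.T
        return self.base(x) + delta * self.scaling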

Memory Savings:

  • Full Fine-Tuning (7B model): approximately 120GB VRAM needed (A100 80GB × 2)
  • LoRA Fine-Tuning (7B model): approximately 24GB VRAM needed (A100 80GB × 1)
  • Savings: approximately 80% memory reduction
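
These figures can be sanity-checked with simple arithmetic. The sketch below is a simplification that excludes activation memory (which depends on batch size and sequence length):

# Back-of-the-envelope VRAM estimate for a 7B-parameter model
params = 7e9

# Full fine-tuning with mixed-precision Adam:
# bf16 weights (2B) + bf16 grads (2B) + fp32 master weights (4B) + Adam m/v (8B) = 16 bytes/param
full_ft_gb = params * 16 / 1e9  # ~112 GB, consistent with the ~120GB figure above

# LoRA: frozen bf16 weights only; adapter params/grads/optimizer states add <1% extra
lora_gb = params * 2 / 1e9      # ~14 GB; activations push real usage toward ~24GB

print(f"Full fine-tuning: ~{full_ft_gb:.0f} GB, LoRA: ~{lora_gb:.0f} GB + activations")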

Fine-tuning Execution Example

# nemo_lora_finetune.py
from nemo.collections.llm import finetune
from nemo.collections.llm.peft import LoRA

# LoRA configuration
lora_config = LoRA(
    r=32,        # rank
    alpha=32,    # scaling
    dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# Execute fine-tuning
model = finetune(
    model_path="/models/llama-3.1-8b.nemo",
    data_path="/data/train.jsonl",
    peft_config=lora_config,
    trainer_config={
        "devices": 8,           # 8 GPUs
        "max_epochs": 3,
        "precision": "bf16",    # BFloat16 (A100/H100)
    },
    output_path="/output/llama-3.1-8b-finetuned",
)

Detailed Pipeline: For data preprocessing, multi-node distributed training, and hyperparameter tuning, refer to the Custom Model Pipeline document.


Checkpoint Management

S3-based Checkpoint Storage

NeMo periodically saves checkpoints (model state snapshots) during training. This enables:

  • Training Resumption: Restart from last checkpoint on failure
  • Optimal Model Selection: Select checkpoint with lowest validation loss
  • Version Management: Compare checkpoints across experiments

S3 Storage Structure:

s3://nemo-checkpoints/
└── llama-3.1-8b-finetune/
    ├── checkpoint-epoch=1-step=500/
    │   ├── model_weights.ckpt
    │   ├── optimizer_states.ckpt
    │   └── metadata.yaml
    ├── checkpoint-epoch=2-step=1000/
    └── checkpoint-epoch=3-step=1500/
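
If training writes checkpoints to local or shared storage first, they can be synced to S3 with a small helper. A minimal sketch using boto3 (bucket and paths are illustrative):

import os
import boto3

def upload_checkpoint(local_dir: str, bucket: str, prefix: str) -> None:
    """Recursively upload a checkpoint directory to S3."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
            s3.upload_file(local_path, bucket, key)

upload_checkpoint(
    "/output/checkpoint-epoch=1-step=500",
    "nemo-checkpoints",
    "llama-3.1-8b-finetune/checkpoint-epoch=1-step=500",
)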

Large Model Checkpoint Sharding

For large models (70B+), a single checkpoint file can reach hundreds of GB, so NeMo uses sharding to split the checkpoint across multiple files.

Large Model Checkpoint Sharding Strategy

Model Size | Shard Size | Estimated Shards | Save Time
70B        | 10GB       | ~40 shards       | 5-10 min
175B       | 10GB       | ~100 shards      | 15-20 min
405B       | 10GB       | ~230 shards      | 30-40 min

Sharding Configuration:

trainer:
  checkpoint:
    save_sharded_checkpoint: true
    shard_size_gb: 10    # Split in 10GB units
    num_workers: 8       # Parallel save workers
    compression: "gzip"  # Compression (optional)

Sharded Storage Structure:

s3://checkpoints/llama-405b/
└── checkpoint-step=1000/
    ├── shard-00000-of-00040.ckpt (10GB)
    ├── shard-00001-of-00040.ckpt (10GB)
    ├── ...
    └── shard-00039-of-00040.ckpt (10GB)

Checkpoint Conversion

# NeMo → HuggingFace conversion
python -m nemo.collections.llm.scripts.convert_nemo_to_hf \
  --input_path /checkpoints/llama-finetuned.nemo \
  --output_path /models/llama-finetuned-hf \
  --model_type llama

TensorRT-LLM Conversion

What is TensorRT-LLM?

NVIDIA TensorRT-LLM is an optimization engine for LLM inference. It converts PyTorch models into highly optimized execution graphs, improving inference speed by 2-4x.

Performance Improvement Comparison

Optimization Technique | Memory Savings | Speed Improvement | Description
FP8 Quantization       | 50%            | 1.5-2x            | BFloat16 → FP8 (H100 only)
PagedAttention         | 40%            | -                 | Dynamic KV cache memory management
In-flight Batching     | -              | 2-3x              | Continuous batch processing
Kernel Fusion          | -              | 1.3-1.5x          | Fused operation kernels
Combined Effect        | 60-70%         | 2-4x              | Compound effect of the above techniques

Conversion Concept

from tensorrt_llm import LLM

# Convert a HuggingFace model to a TensorRT-LLM engine
llm = LLM(
    model="/models/llama-finetuned-hf",
    max_input_len=4096,
    max_output_len=2048,
    max_batch_size=64,
    dtype="fp8",                  # FP8 quantization
    enable_paged_kv_cache=True,
    enable_chunked_context=True,
)

# Save engine
llm.save("/engines/llama-finetuned-trt")

Conversion Time: Approximately 10-20 minutes for a 7B model (1x A100)


Triton Inference Server

Triton and NeMo Relationship

Triton Inference Server is NVIDIA's production inference server that serves TensorRT-LLM engines via HTTP/gRPC API.

Client → Triton Server → TensorRT-LLM Backend → GPU

Triton Architecture Concept

Core Features:

  • Dynamic Batching: Automatically group multiple requests to optimize GPU utilization
  • Model Ensemble: Connect multiple models in a pipeline (e.g., Tokenizer → LLM → Detokenizer)
  • Backend Support: TensorRT-LLM, PyTorch, ONNX, TensorFlow, etc.
  • Metrics Collection: Prometheus-compatible metrics (throughput, latency, GPU utilization)

Model Repository Structure

/models/
└── llama-finetuned/
    ├── config.pbtxt              # Triton configuration file
    ├── 1/                        # Version 1
    │   └── model.plan            # TensorRT-LLM engine
    └── tokenizer/
        ├── tokenizer.json
        └── tokenizer_config.json

config.pbtxt Core Settings:

name: "llama-finetuned"
backend: "tensorrtllm"
max_batch_size: 64

parameters {
key: "max_tokens_in_paged_kv_cache"
value: { string_value: "8192" }
}

parameters {
key: "batch_scheduler_policy"
value: { string_value: "inflight_fused_batching" }
}
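
With the repository in place, clients can call the server over HTTP. The sketch below assumes the generate endpoint and the text_input/text_output tensor names used by the standard TensorRT-LLM backend ensemble; the host and model names are examples:

import requests

# Triton generate endpoint (host and model name are illustrative)
url = "http://triton.example.internal:8000/v2/models/llama-finetuned/generate"

payload = {
    "text_input": "What is EKS Auto Mode?",
    "max_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])  # generated completion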

NCCL Distributed Communication

NCCL's Role

NCCL (NVIDIA Collective Communication Library) is the core library responsible for high-speed multi-GPU communication in distributed GPU training.

Why is it Important?

NCCL Importance in Distributed Training

Parallelism Type         | Importance | Role
Model Parallelism        | High       | Optimizes activation/gradient transfer between GPUs
Data Parallelism         | Very High  | Fast gradient synchronization via AllReduce
Pipeline Parallelism     | High       | Optimizes activation transfer between stages
Mixed Precision Training | Medium     | Optimizes compressed gradient communication

Collective Operation Concepts

1. AllReduce (Most Important)

Sums data from all GPUs and distributes the result to all GPUs.

Initial state:
GPU 0: [1, 2, 3]
GPU 1: [4, 5, 6]
GPU 2: [7, 8, 9]
GPU 3: [10, 11, 12]

After AllReduce:
All GPUs: [22, 26, 30] # Element-wise sum

Use Case: Averaging gradients from each GPU in distributed training

2. AllGather

Collects data from all GPUs and distributes the full data to each GPU.

Initial state:
GPU 0: [1, 2]
GPU 1: [3, 4]

After AllGather:
All GPUs: [1, 2, 3, 4]

Use Case: Gathering distributed tensors in Tensor Parallelism

3. ReduceScatter

First sums data, then partitions and distributes to each GPU (inverse of AllGather).

Initial state:
GPU 0: [1, 2, 3, 4]
GPU 1: [5, 6, 7, 8]

After ReduceScatter:
GPU 0: [6, 8] # (1+5), (2+6)
GPU 1: [10, 12] # (3+7), (4+8)

Use Case: Sharded gradient reduction in data-parallel training (e.g., ZeRO/FSDP) and sequence parallelism

4. Broadcast

Copies one GPU's data to all GPUs.

Initial state:
GPU 0: [1, 2, 3]
GPU 1: [0, 0, 0]

After Broadcast:
All GPUs: [1, 2, 3]

Use Case: Distributing model checkpoints from the master GPU
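
The four collectives map directly onto torch.distributed calls, which use NCCL as the backend on GPUs. A minimal sketch, launched with torchrun (the script name is an example):

import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=2 collectives_demo.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
device = torch.device(f"cuda:{rank}")  # one GPU per process on a single node

# AllReduce: every rank ends up with the element-wise sum
t = torch.tensor([1.0, 2.0], device=device) * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)

# AllGather: every rank ends up with all shards concatenated
shard = torch.tensor([float(rank)], device=device)
gathered = [torch.empty_like(shard) for _ in range(world)]
dist.all_gather(gathered, shard)

# ReduceScatter: sum across ranks, then each rank keeps only its partition
inputs = [torch.ones(1, device=device) * i for i in range(world)]
out = torch.empty(1, device=device)
dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

# Broadcast: copy rank 0's tensor to every rank
b = torch.tensor([1.0, 2.0, 3.0], device=device) if rank == 0 else torch.empty(3, device=device)
dist.broadcast(b, src=0)

dist.destroy_process_group()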

Network Topology Optimization

NCCL automatically detects the physical connection topology between GPUs and selects the optimal path.

Per-topology Algorithm Selection:

  • NVSwitch (H100 nodes): Tree algorithm (parallel broadcast)
  • NVLink (A100 nodes): Ring algorithm (circular transfer)
  • EFA inter-node: Hierarchical algorithm (intra-node Ring → inter-node Tree)

NCCL Tuning Parameters

# Core NCCL environment variables

# 1. Algorithm selection
export NCCL_ALGO=Ring # or Tree

# 2. Protocol
export NCCL_PROTO=Simple # Simple (throughput) or LL (latency)

# 3. Channel count (important!)
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=8 # More = higher bandwidth, more overhead

# 4. EFA settings (AWS)
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_IB_DISABLE=0

# 5. Debug
export NCCL_DEBUG=INFO # Useful for diagnosing performance issues

Recommended Channel Counts:

  • 8 GPU intra-node: 4-8 channels
  • Multi-node (16+ GPUs): 8-16 channels
  • Large-scale (64+ GPUs): 16-32 channels

Monitoring

Key Metrics

Key Monitoring Metrics

Metric                    | Description      | Target
training_loss             | Training loss    | Continuous decrease
validation_loss           | Validation loss  | Similar to training loss
gpu_utilization           | GPU utilization  | > 80%
gpu_memory_used           | GPU memory usage | < 95%
throughput_tokens_per_sec | Throughput       | Monitor trend

Monitoring Stack: Prometheus + Grafana + DCGM Exporter

For detailed monitoring setup, refer to Monitoring and Observability Setup.


Recommendations

  • Before Fine-tuning: Measure baseline performance with the base model
  • LoRA First: ~80% memory savings vs. full fine-tuning
  • TensorRT-LLM Essential: 2-4x inference performance improvement
  • NCCL Tuning: 20-30% performance improvement is possible through channel-count and algorithm optimization in multi-node training

Cautions

  • GPU Costs: Large-scale training can cost hundreds of dollars per hour. Actively use Spot instances and checkpoints
  • Checkpoints Essential: Configure automatic saving to persistent storage such as S3 (prepare for node failure)
  • EFA Security Groups: When using EFA, all traffic must be allowed within the same security group
  • Memory Overflow: On OOM, decrease micro_batch_size or enable gradient_checkpointing