NeMo Framework

NVIDIA NeMo is an end-to-end framework for training, fine-tuning, and optimizing large language models (LLMs). It supports distributed training and efficient model deployment in Kubernetes environments.

Overview

Problems NeMo Solves

General-purpose LLMs (GPT-4, Claude, etc.) used in agentic AI platforms face the following limitations:

  • Lack of Domain Knowledge: Insufficient understanding of industry/company-specific terminology and context
  • Cost Issues: API costs surge at high call volumes (per-token billing)
  • Latency: Response delays from external API calls
  • Data Privacy: Cannot transmit sensitive data to external services
  • On-premises Requirements: Regulated industries (finance/healthcare) need self-hosted infrastructure

NeMo addresses these problems through domain-specific model fine-tuning.

NeMo Core Features

NeMo Framework Components

Component       | Role               | Description
NeMo Core       | Core framework     | Model definition, training loop
NeMo Curator    | Data processing    | Data filtering, deduplication
NeMo Aligner    | Alignment training | RLHF, DPO, SFT
NeMo Guardrails | Safety             | Input/output filtering

Key Benefits:

  • Efficient Fine-tuning: Train only 0.1% of total parameters with LoRA/QLoRA
  • Distributed Training: Automatic multi-node, multi-GPU parallelization (Tensor/Pipeline/Data Parallelism)
  • Inference Optimization: 2-4x performance improvement through TensorRT-LLM conversion
  • Enterprise Support: Checkpoint management, monitoring, production deployment pipeline

EKS Deployment Architecture

NeMo on EKS Configuration

Container Configuration

NeMo Container Image:

nvcr.io/nvidia/nemo:25.02
├── PyTorch 2.5.1
├── CUDA 12.6
├── NCCL 2.23+
├── Megatron-LM (NeMo integrated)
├── TensorRT-LLM 0.13+
└── Triton Inference Server 2.50+

Key Dependencies:

  • Kubeflow Training Operator: Distributed training orchestration via the PyTorchJob CRD (see the sketch after this list)
  • GPU Operator: Automatic NVIDIA driver, Device Plugin, DCGM installation
  • EFA Device Plugin: Enable inter-node RDMA communication
  • Karpenter: GPU node autoscaling
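
For example, a distributed NeMo training run is typically submitted as a PyTorchJob. Below is a minimal sketch using the official kubernetes Python client; the namespace, job name, script, and resource values are illustrative, not a prescribed configuration:

import datetime
from kubernetes import client, config

config.load_kube_config()

# Shared pod template: NeMo container requesting 8 GPUs per replica
container = {
    "name": "pytorch",  # PyTorchJob expects the main container to be named "pytorch"
    "image": "nvcr.io/nvidia/nemo:25.02",
    "command": ["python", "nemo_lora_finetune.py"],
    "resources": {"limits": {"nvidia.com/gpu": 8}},
}
template = {"spec": {"containers": [container]}}

# Minimal PyTorchJob: one master and one worker replica
pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "nemo-lora-finetune", "namespace": "training"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "template": template},
            "Worker": {"replicas": 1, "template": template},
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="training",
    plural="pytorchjobs", body=pytorchjob,
)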

GPU Node Requirements

Model Size | Minimum GPU  | Recommended Instance       | Memory Required
8B         | 1x A100 80GB | p4d.24xlarge               | 80GB+
13B        | 2x A100 80GB | p4d.24xlarge               | 160GB+
70B        | 8x A100 80GB | p4d.24xlarge / p5.48xlarge | 640GB+
405B       | 32x H100     | p5.48xlarge x4             | 2.5TB+

Fine-tuning Guide

SFT (Supervised Fine-Tuning) Concept

What is SFT?: A method that improves performance on specific tasks by further training a pre-trained model on domain-specific instruction-response data.

Pre-trained Model (general) → SFT → Domain-specific Model

When to Use?

  • Customer FAQ chatbot: Train on specific product/service Q&A
  • Financial report generation: Train on financial terminology and formats
  • Medical diagnostic assistance: Train on medical terminology and diagnostic patterns

Data Format:

{"input": "What is EKS Auto Mode?", "output": "EKS Auto Mode is a fully managed Kubernetes compute option where AWS automatically handles node provisioning, scaling, and security patching."}
{"input": "What are Karpenter's key features?", "output": "Karpenter provides automatic node provisioning, bin-packing optimization, Spot instance integration, and drift detection capabilities."}

PEFT/LoRA: Efficient Fine-tuning

PEFT (Parameter-Efficient Fine-Tuning): Instead of training all model parameters, train only adapter layers to save memory and time.

LoRA (Low-Rank Adaptation): The most widely used PEFT method; it freezes the original weights and trains only two small low-rank matrices (A, B).

Original weights W (freeze) + LoRA delta (A × B) = Final weights

LoRA Key Parameters:

Parameter      | Description               | Recommended    | Impact
r (rank)       | Rank of low-rank matrices | 8-64           | Higher = more expressive, more memory
alpha          | Scaling factor            | Same as r      | Controls LoRA weight influence
dropout        | Dropout rate              | 0.1            | Prevents overfitting
target_modules | Layers to train           | q_proj, v_proj | Attention layer selection
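
To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative only; NeMo's internal implementation differs):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank delta (B × A)."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 32, dropout: float = 0.1):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze original weights W
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scaling = alpha / r  # alpha controls the influence of the LoRA delta
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Final output = frozen W·x + scaled low-rank update (B·A)·x
        delta = self.dropout(x) @ self.lora_a.T @ self.lora_b.T
        return self.base(x) + delta * self.scaling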

Memory Savings:

  • Full Fine-Tuning (7B model): approximately 120GB VRAM needed (A100 80GB × 2)
  • LoRA Fine-Tuning (7B model): approximately 24GB VRAM needed (A100 80GB × 1)
  • Savings: approximately 80% memory reduction
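
These figures can be sanity-checked with simple arithmetic. The sketch below is a simplification that excludes activation memory (which depends on batch size and sequence length):

# Back-of-the-envelope VRAM estimate for a 7B-parameter model
params = 7e9

# Full fine-tuning with mixed-precision Adam:
# bf16 weights (2B) + bf16 grads (2B) + fp32 master weights (4B) + Adam m/v (8B) = 16 bytes/param
full_ft_gb = params * 16 / 1e9  # ~112 GB, consistent with the ~120GB figure above

# LoRA: frozen bf16 weights only; adapter params/grads/optimizer states add <1% extra
lora_gb = params * 2 / 1e9      # ~14 GB; activations push real usage toward ~24GB

print(f"Full fine-tuning: ~{full_ft_gb:.0f} GB, LoRA: ~{lora_gb:.0f} GB + activations")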

Fine-tuning Execution Example

# nemo_lora_finetune.py
from nemo.collections.llm import finetune
from nemo.collections.llm.peft import LoRA

# LoRA configuration
lora_config = LoRA(
    r=32,        # rank
    alpha=32,    # scaling
    dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# Execute fine-tuning
model = finetune(
    model_path="/models/llama-3.1-8b.nemo",
    data_path="/data/train.jsonl",
    peft_config=lora_config,
    trainer_config={
        "devices": 8,           # 8 GPUs
        "max_epochs": 3,
        "precision": "bf16",    # BFloat16 (A100/H100)
    },
    output_path="/output/llama-3.1-8b-finetuned",
)

Detailed Pipeline: For data preprocessing, multi-node distributed training, and hyperparameter tuning, refer to the Custom Model Pipeline document.


Checkpoint Management

S3-based Checkpoint Storage

NeMo periodically saves checkpoints (model state snapshots) during training. This enables:

  • Training Resumption: Restart from last checkpoint on failure
  • Optimal Model Selection: Select checkpoint with lowest validation loss
  • Version Management: Compare checkpoints across experiments

S3 Storage Structure:

s3://nemo-checkpoints/
└── llama-3.1-8b-finetune/
    ├── checkpoint-epoch=1-step=500/
    │   ├── model_weights.ckpt
    │   ├── optimizer_states.ckpt
    │   └── metadata.yaml
    ├── checkpoint-epoch=2-step=1000/
    └── checkpoint-epoch=3-step=1500/
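
If training writes checkpoints to local or shared storage first, they can be synced to S3 with a small helper. A minimal sketch using boto3 (bucket and paths are illustrative):

import os
import boto3

def upload_checkpoint(local_dir: str, bucket: str, prefix: str) -> None:
    """Recursively upload a checkpoint directory to S3."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, local_dir)}"
            s3.upload_file(local_path, bucket, key)

upload_checkpoint(
    "/output/checkpoint-epoch=1-step=500",
    "nemo-checkpoints",
    "llama-3.1-8b-finetune/checkpoint-epoch=1-step=500",
)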

Large Model Checkpoint Sharding

For large models (70B+), a single checkpoint file can reach hundreds of GB, so NeMo uses sharding to split the checkpoint across multiple files.

Large Model Checkpoint Sharding Strategy

Model Size | Shard Size | Estimated Shards | Save Time
70B        | 10GB       | ~40 shards       | 5-10 min
175B       | 10GB       | ~100 shards      | 15-20 min
405B       | 10GB       | ~230 shards      | 30-40 min

Sharding Configuration:

trainer:
  checkpoint:
    save_sharded_checkpoint: true
    shard_size_gb: 10    # Split in 10GB units
    num_workers: 8       # Parallel save workers
    compression: "gzip"  # Compression (optional)

Sharded Storage Structure:

s3://checkpoints/llama-405b/
└── checkpoint-step=1000/
    ├── shard-00000-of-00040.ckpt (10GB)
    ├── shard-00001-of-00040.ckpt (10GB)
    ├── ...
    └── shard-00039-of-00040.ckpt (10GB)

Checkpoint Conversion

# NeMo → HuggingFace conversion
python -m nemo.collections.llm.scripts.convert_nemo_to_hf \
  --input_path /checkpoints/llama-finetuned.nemo \
  --output_path /models/llama-finetuned-hf \
  --model_type llama

TensorRT-LLM Conversion

What is TensorRT-LLM?

NVIDIA TensorRT-LLM is an optimization engine for LLM inference. It converts PyTorch models into highly optimized execution graphs, improving inference speed by 2-4x.

Performance Improvement Comparison

Optimization Technique | Memory Savings | Speed Improvement | Description
FP8 Quantization       | 50%            | 1.5-2x            | BFloat16 → FP8 (H100 only)
PagedAttention         | 40%            | -                 | Dynamic KV cache memory management
In-flight Batching     | -              | 2-3x              | Continuous batch processing
Kernel Fusion          | -              | 1.3-1.5x          | Fused operation kernels
Combined Effect        | 60-70%         | 2-4x              | Compound effect of the above techniques

Conversion Concept

from tensorrt_llm import LLM

# Convert a HuggingFace model to a TensorRT-LLM engine
llm = LLM(
    model="/models/llama-finetuned-hf",
    max_input_len=4096,
    max_output_len=2048,
    max_batch_size=64,
    dtype="fp8",                  # FP8 quantization
    enable_paged_kv_cache=True,
    enable_chunked_context=True,
)

# Save engine
llm.save("/engines/llama-finetuned-trt")

Conversion Time: Approximately 10-20 minutes for a 7B model (1x A100)


Triton Inference Server

Triton and NeMo Relationship

Triton Inference Server is NVIDIA's production inference server that serves TensorRT-LLM engines via HTTP/gRPC API.

Client → Triton Server → TensorRT-LLM Backend → GPU

Triton Architecture Concept

Core Features:

  • Dynamic Batching: Automatically group multiple requests to optimize GPU utilization
  • Model Ensemble: Connect multiple models in a pipeline (e.g., Tokenizer → LLM → Detokenizer)
  • Backend Support: TensorRT-LLM, PyTorch, ONNX, TensorFlow, etc.
  • Metrics Collection: Prometheus-compatible metrics (throughput, latency, GPU utilization)

Model Repository Structure

/models/
└── llama-finetuned/
    ├── config.pbtxt              # Triton configuration file
    ├── 1/                        # Version 1
    │   └── model.plan            # TensorRT-LLM engine
    └── tokenizer/
        ├── tokenizer.json
        └── tokenizer_config.json

config.pbtxt Core Settings:

name: "llama-finetuned"
backend: "tensorrtllm"
max_batch_size: 64

parameters {
key: "max_tokens_in_paged_kv_cache"
value: { string_value: "8192" }
}

parameters {
key: "batch_scheduler_policy"
value: { string_value: "inflight_fused_batching" }
}
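
With the repository in place, clients can call the server over HTTP. The sketch below assumes the generate endpoint and the text_input/text_output tensor names used by the standard TensorRT-LLM backend ensemble; the host and model names are examples:

import requests

# Triton generate endpoint (host and model name are illustrative)
url = "http://triton.example.internal:8000/v2/models/llama-finetuned/generate"

payload = {
    "text_input": "What is EKS Auto Mode?",
    "max_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])  # generated completion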

NCCL Distributed Communication

NCCL's Role

NCCL (NVIDIA Collective Communication Library) is the core library responsible for high-speed multi-GPU communication in distributed GPU training.

Why is it Important?

NCCL Importance in Distributed Training

Parallelism Type         | Importance | Role
Model Parallelism        | High       | Optimizes activation/gradient transfer between GPUs
Data Parallelism         | Very High  | Fast gradient synchronization via AllReduce
Pipeline Parallelism     | High       | Optimizes activation transfer between stages
Mixed Precision Training | Medium     | Optimizes compressed gradient communication

Collective Operation Concepts

1. AllReduce (Most Important)

Sums data from all GPUs and distributes the result to all GPUs.

Initial state:
GPU 0: [1, 2, 3]
GPU 1: [4, 5, 6]
GPU 2: [7, 8, 9]
GPU 3: [10, 11, 12]

After AllReduce:
All GPUs: [22, 26, 30] # Element-wise sum

Use Case: Averaging gradients from each GPU in distributed training

2. AllGather

Collects data from all GPUs and distributes the full data to each GPU.

Initial state:
GPU 0: [1, 2]
GPU 1: [3, 4]

After AllGather:
All GPUs: [1, 2, 3, 4]

Use Case: Gathering distributed tensors in Tensor Parallelism

3. ReduceScatter

First sums data, then partitions and distributes to each GPU (inverse of AllGather).

Initial state:
GPU 0: [1, 2, 3, 4]
GPU 1: [5, 6, 7, 8]

After ReduceScatter:
GPU 0: [6, 8] # (1+5), (2+6)
GPU 1: [10, 12] # (3+7), (4+8)

Use Case: Sharded gradient reduction in data-parallel training (e.g., ZeRO/FSDP) and sequence parallelism

4. Broadcast

Copies one GPU's data to all GPUs.

Initial state:
GPU 0: [1, 2, 3]
GPU 1: [0, 0, 0]

After Broadcast:
All GPUs: [1, 2, 3]

Use Case: Distributing model checkpoints from the master GPU
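
The four collectives map directly onto torch.distributed calls, which use NCCL as the backend on GPUs. A minimal sketch, launched with torchrun (the script name is an example):

import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=2 collectives_demo.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
device = torch.device(f"cuda:{rank}")  # one GPU per process on a single node

# AllReduce: every rank ends up with the element-wise sum
t = torch.tensor([1.0, 2.0], device=device) * (rank + 1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)

# AllGather: every rank ends up with all shards concatenated
shard = torch.tensor([float(rank)], device=device)
gathered = [torch.empty_like(shard) for _ in range(world)]
dist.all_gather(gathered, shard)

# ReduceScatter: sum across ranks, then each rank keeps only its partition
inputs = [torch.ones(1, device=device) * i for i in range(world)]
out = torch.empty(1, device=device)
dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

# Broadcast: copy rank 0's tensor to every rank
b = torch.tensor([1.0, 2.0, 3.0], device=device) if rank == 0 else torch.empty(3, device=device)
dist.broadcast(b, src=0)

dist.destroy_process_group()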

Network Topology Optimization

NCCL automatically detects the physical connection topology between GPUs and selects the optimal path.

Per-topology Algorithm Selection:

  • NVSwitch (H100 nodes): Tree algorithm (parallel broadcast)
  • NVLink (A100 nodes): Ring algorithm (circular transfer)
  • EFA inter-node: Hierarchical algorithm (intra-node Ring → inter-node Tree)

NCCL Tuning Parameters

# Core NCCL environment variables

# 1. Algorithm selection
export NCCL_ALGO=Ring # or Tree

# 2. Protocol
export NCCL_PROTO=Simple # Simple (throughput) or LL (latency)

# 3. Channel count (important!)
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=8 # More = higher bandwidth, more overhead

# 4. EFA settings (AWS)
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_IB_DISABLE=0

# 5. Debug
export NCCL_DEBUG=INFO # Useful for diagnosing performance issues

Recommended Channel Counts:

  • 8 GPU intra-node: 4-8 channels
  • Multi-node (16+ GPUs): 8-16 channels
  • Large-scale (64+ GPUs): 16-32 channels

Monitoring

Key Metrics

Key Monitoring Metrics

Metric                    | Description      | Target
training_loss             | Training loss    | Continuous decrease
validation_loss           | Validation loss  | Similar to training loss
gpu_utilization           | GPU utilization  | > 80%
gpu_memory_used           | GPU memory usage | < 95%
throughput_tokens_per_sec | Throughput       | Monitor trend

Monitoring Stack: Prometheus + Grafana + DCGM Exporter

For detailed monitoring setup, refer to Monitoring and Observability Setup.


Recommendations

  • Before Fine-tuning: Measure baseline performance with the base model
  • LoRA First: ~80% memory savings vs. full fine-tuning
  • TensorRT-LLM Essential: 2-4x inference performance improvement
  • NCCL Tuning: 20-30% performance improvement is possible through channel-count and algorithm optimization in multi-node training

Cautions

  • GPU Costs: Large-scale training can cost hundreds of dollars per hour. Actively use Spot instances and checkpoints
  • Checkpoints Essential: Configure automatic saving to persistent storage such as S3 (prepare for node failure)
  • EFA Security Groups: When using EFA, all traffic must be allowed within the same security group
  • Memory Overflow: On OOM, decrease micro_batch_size or enable gradient_checkpointing