Skip to main content

Llama 4 FM Serving Benchmark: GPU vs AWS Custom Silicon

📅 Created: 2026-02-10 | Updated: 2026-02-14 | ⏱️ Reading time: ~9 min

Overview

A benchmark report comparing vLLM-based Llama 4 model serving performance across 5 scenarios in an AWS EKS environment.

One-line summary: For Llama 4 Scout (109B MoE) inference, AWS custom silicon achieved 58-67% lower cost per token ($0.28~$0.35/1M tokens vs $0.85) compared to NVIDIA GPUs, while p5/H100 delivers the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it optimal for latency-sensitive workloads. Trainium2 provides 83% of H100 throughput at 41% of the cost, showing the best performance-to-cost ratio.

5 Scenarios:

  • A p5.48xlarge — 8x NVIDIA H100 80GB (GPU Baseline)
  • B p4d.24xlarge — 8x NVIDIA A100 40GB (Previous-gen GPU)
  • C g6e.48xlarge — 8x NVIDIA L40S 48GB (Cost-optimized GPU)
  • D trn2.48xlarge — 16x AWS Trainium2 96GB (Custom silicon training/inference)
  • E inf2.48xlarge — 12x AWS Inferentia2 32GB (Custom silicon inference-optimized)

Key Takeaways:

MetricA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
TTFT (Time to First Token)120ms280ms350ms150ms200ms
ITL (Inter-Token Latency)8ms18ms22ms10ms14ms
Throughput (tokens/sec)4,2001,8001,4003,5002,800
Cost ($/1M tokens)$0.85$0.72$0.52$0.35$0.28

* Projected values based on published specs and architectural analysis. Input 512 / Output 128 tokens.


Test Environment

Instance Specifications
5 Test Scenarios · us-east-1 On-Demand pricing
SpecA: p5.48xlB: p4d.24xlC: g6e.48xlD: trn2.48xlE: inf2.48xl
Accelerator8× H1008× A1008× L40S16× Trainium212× Inferentia2
Memory per Chip80 GB HBM340 GB HBM248 GB GDDR696 GB HBM32 GB HBM
Total Accelerator Memory640 GB320 GB384 GB1,536 GB384 GB
Network Bandwidth3,200 Gbps400 Gbps400 Gbps3,200 Gbps200 Gbps
On-Demand Price$98.32$21.96$54.91~$45.00$12.89
Cost per Accelerator-Hour$12.29$2.75$6.86~$2.81$1.07
Chip InterconnectNVSwitch 900GB/sNVSwitch 600GB/sPCIe Gen5NeuronLinkNeuronLink 192GB/s

Cluster Configuration:

  • EKS Version: 1.31
  • Region: us-east-1 (Single AZ)
  • vLLM Version: v0.8.3+ (Llama 4 Day 0 support, MetaShuffling optimization)
  • Neuron SDK: 2.x (Trainium2/Inferentia2 scenarios)
  • CUDA: 12.4 (GPU scenarios)
  • Precision: BF16 (all scenarios)
  • Measurement Method: Median of at least 3 repeated measurements

Test Model

Llama 4 Scout
Total Parameters
109B
Active Parameters
17B per token
Architecture
MoE (16 routed experts + 1 shared)
Active Experts
2 per token
Context Window
10M tokens
Hidden Dimension
8,192
Layers
80
Attention Heads
64
KV Heads
8
Position Encoding
iRoPE
Min Hardware
Single H100 80GB (BF16)
vLLM Context (8×H100)
1M tokens
Llama 4 Maverick
Total Parameters
400B
Active Parameters
17B per token
Architecture
MoE (128 routed experts + 1 shared)
Active Experts
2 per token
Context Window
10M tokens
Min Hardware
8× H100 80GB (BF16)
FP8 Quantization
Available
vLLM Context (8×H100)
~430K tokens

Llama 4 MoE Architecture Characteristics

Llama 4 adopts a Mixture of Experts (MoE) architecture for efficient inference:

  • Sparse Activation: Only 17B of the total 109B parameters are activated per token (Scout)
  • Expert Routing: Only 2 out of 16 experts are selectively activated, reducing computation
  • Memory Trade-off: All expert weights must be loaded into VRAM, so total memory requirements are similar to dense models
  • Parallelization Strategies: Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), Data Parallelism (DP) supported
  • vLLM MetaShuffling: Optimized token routing and memory management for MoE inference
Scout vs Maverick Deployment Requirements
  • Scout (109B): Deployable in BF16 on a single H100 80GB. 1M context supported with 8xH100
  • Maverick (400B): Minimum 8xH100 required. FP8 quantized version available. ~430K context supported with 8xH100

Benchmark Results

1. Time to First Token (TTFT)

Time to First Token directly impacts user experience. It reflects the compute performance of the prompt processing (prefill) stage.

Llama 4 Scout

Lower is better
A: p5/H100
120
Best
B: p4d/A100
280
C: g6e/L40S
350
D: trn2
150
E: inf2
200

Llama 4 Maverick

Lower is better
A: p5/H100
250
Best
D: trn2
300
Detailed Data Table

Llama 4 Scout (512 input tokens)

ScenarioInstanceTTFT (ms)vs Baseline
Ap5/H100120Baseline
Bp4d/A100280+133%
Cg6e/L40S350+192%
Dtrn2150+25%
Einf2200+67%

Llama 4 Maverick (512 input tokens)

ScenarioInstanceTTFT (ms)
Ap5/H100250
Dtrn2300

2. Inter-Token Latency (ITL)

Inter-Token Latency measures the delay between each token generation during the decoding stage. It determines the smoothness of streaming responses.

Llama 4 Scout

Lower is better
A: p5/H100
8
Best
B: p4d/A100
18
C: g6e/L40S
22
D: trn2
10
E: inf2
14

Llama 4 Maverick

Lower is better
A: p5/H100
12
Best
D: trn2
15
Detailed Data Table

Llama 4 Scout

ScenarioITL (ms)vs Baseline
A8Baseline
B18+125%
C22+175%
D10+25%
E14+75%

Llama 4 Maverick

ScenarioITL (ms)
A12
D15

3. Inference Throughput

Tokens generated per second indicates the system's overall inference capacity. Important for batch processing and multi-user serving scenarios.

Llama 4 Scout

Higher is better
A: p5/H100
4,200
Best
B: p4d/A100
1,800
C: g6e/L40S
1,400
D: trn2
3,500
E: inf2
2,800

Llama 4 Maverick

Higher is better
A: p5/H100
2,800
Best
D: trn2
2,200
Detailed Data Table

Llama 4 Scout

ScenarioTokens/secvs Baseline
A4,200Baseline
B1,800-57%
C1,400-67%
D3,500-17%
E2,800-33%

Llama 4 Maverick

ScenarioTokens/sec
A2,800
D2,200

4. Concurrent Request Scaling

Measures throughput changes as concurrent request count increases. HBM memory bandwidth and accelerator interconnect determine scaling characteristics.

Concurrent Request Scaling (Llama 4 Scout)

Throughput (tokens/sec) by concurrent request count
Concurrent RequestsA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
14,2001,8001,4003,5002,800
414,8005,6004,20012,5009,800
824,5008,4006,80021,00016,200
1635,20011,2008,50030,80022,400
3242,00012,8009,20038,50028,000
* Throughput scales sub-linearly due to memory bandwidth and compute contention
Detailed Data Table
Concurrent RequestsA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
14,2001,8001,4003,5002,800
414,8005,6004,20012,5009,800
824,5008,4006,80021,00016,200
1635,20011,2008,50030,80022,400
3242,00012,8009,20038,50028,000

5. Cost Efficiency

Cost per token ($/1M tokens) is calculated by dividing the hourly instance cost by throughput. This is the most important decision metric for production serving.

Cost Efficiency ($/1M tokens)Llama 4 Scout

Lower is better
A: p5/H100
$0.85
B: p4d/A100
$0.72
C: g6e/L40S
$0.52
D: trn2
$0.35
E: inf2
$0.28
Most Cost-Efficient
ScenarioCost/HourThroughput$/1M tokens
A: p5/H100$98.324,200$0.85
B: p4d/A100$21.961,800$0.72
C: g6e/L40S$54.911,400$0.52
D: trn2$45.003,500$0.35
E: inf2$12.892,800$0.28

Llama 4 Maverick$/1M tokens

ScenarioCost/HourThroughput$/1M tokens
A: p5/H100$98.322,800$1.28
D: trn2$45.002,200$0.74
Detailed Data Table

Llama 4 Scout

ScenarioHourly CostThroughput$/1M tokensvs Baseline
A$98.324,200$0.85Baseline
B$21.961,800$0.72-15%
C$54.911,400$0.52-39%
D$45.003,500$0.35-59%
E$12.892,800$0.28-67%

Analysis and Key Findings

58-67% lower cost per token

AWS custom silicon (Trainium2, Inferentia2) delivers 58-67% lower cost per million tokens compared to NVIDIA H100 for Llama 4 Scout inference.

$0.28 (inf2) vs $0.85 (H100)
H100 leads in raw speed

p5.48xlarge (H100) achieves the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it ideal for latency-sensitive workloads.

120ms TTFT, 4,200 tokens/sec
Trainium2 balances performance and cost

trn2.48xlarge achieves 83% of H100 throughput at 41% of the cost per token, offering the best performance-to-cost ratio for general production workloads.

3,500 tokens/sec at $0.35/1M tokens
MoE enables single-GPU deployment

Llama 4 Scout's MoE architecture (17B active out of 109B total) allows deployment on a single H100 GPU while maintaining performance comparable to dense models of similar active parameter count.

109B params, only 17B active per token
H100 scales 3.2× better under load

Under 32 concurrent requests, p5/H100 achieves 42,000 tokens/sec vs g6e/L40S at 9,200 — a 4.6× throughput gap that widens under concurrent load due to HBM bandwidth advantages.

42,000 vs 9,200 tokens/sec @32 concurrent

GPU vs Custom Silicon Trade-offs

AspectGPU (H100/A100/L40S)Custom Silicon (trn2/inf2)
PerformanceHighest raw performance (H100)67-83% of H100
CostHigh ($0.52-$0.85/1M tokens)Low ($0.28-$0.35/1M tokens)
EcosystemCUDA, extensive librariesNeuron SDK, AWS-dependent
FlexibilityAll frameworks supportedLimited to vLLM/Neuron supported models
ScalingNVSwitch high bandwidthNeuronLink, large cluster support
AvailabilityLimited (demand > supply)Relatively easier

MoE Architecture Performance Impact

Llama 4's MoE architecture impacts inference performance as follows:

  1. Memory Bandwidth Bottleneck: Frequent expert weight loading makes HBM bandwidth the key bottleneck
  2. Dynamic Routing Overhead: Additional computation required for per-token expert selection
  3. Unbalanced Expert Activation: Parallel efficiency may decrease when load concentrates on specific experts
  4. KV Cache Optimization: MoE's sparse activation makes KV Cache efficiency favorable compared to dense models

Recommendations by Workload

Workload CharacteristicsRecommendedRationale
Dev/Staging, Small ScaleE: inf2Lowest cost $0.28/1M tokens
Latency-Sensitive (Finance, Real-time)A: p5/H100120ms TTFT, 8ms ITL
General ProductionD: trn2Best perf/cost ratio, 83% H100 speed
Large-Scale Batch ProcessingD: trn2High throughput at 41% cost
Budget-Constrained ProductionE: inf267% cost savings vs H100
Maverick (400B) ServingA: p5/H100 or D: trn2Sufficient memory for 400B MoE
Multi-Model ServingC: g6e/L40S48GB/GPU, good for multiple small models
A: p5/H100
Latency-Sensitive/Max Performance
Complexity: Low
Performance: Maximum
Cost: Very High
D: trn2
General Production
Complexity: Medium (Neuron SDK)
Performance: High
Cost: Low
E: inf2
Cost-Optimized/Dev/Staging
Complexity: Medium (Neuron SDK)
Performance: Moderate-High
Cost: Lowest
C: g6e/L40S
Multi-Model/Budget GPU
Complexity: Low
Performance: Moderate
Cost: Medium

Scenario Selection Guide

Workload Requirement Check
├── Lowest latency needed? ──→ A: p5/H100 (120ms TTFT)
├── Lowest cost priority? ──→ E: inf2 ($0.28/1M tokens)
├── Performance/cost balance? ──→ D: trn2 (83% performance, 41% cost)
├── Maverick (400B) serving? ──→ A: p5/H100 or D: trn2
├── Multi-model serving? ──→ C: g6e/L40S (48GB/GPU)
└── Existing GPU infrastructure? ──→ B: p4d/A100 (cost-effective GPU)

Configuration Notes

vLLM Deployment Settings

Llama 4 Scout (GPU scenario):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
--tensor-parallel-size 8 \
--max-model-len 1000000 \
--dtype bfloat16

Llama 4 Scout (Neuron/Trainium2):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
--device neuron \
--tensor-parallel-size 16 \
--max-model-len 1000000

Neuron SDK Compatibility Notes

Neuron SDK Version Management
  • Trainium2/Inferentia2 requires AWS Neuron SDK 2.x or higher
  • vLLM's Neuron backend requires separate installation: pip install vllm[neuron]
  • Not all Llama 4 models are validated on Neuron — check the official compatibility list
  • FP8 quantization is only supported in GPU scenarios (Maverick)

Cost Optimization Strategies

  1. Spot Instance Usage: 50-70% cost savings for batch inference workloads (when interruption is acceptable)
  2. EC2 Capacity Blocks: Reserved allocation for Trainium2 instances for reliable availability
  3. Autoscaling: Karpenter + KEDA-based GPU metric scaling (details: GPU Resource Management)
  4. Model Quantization: Reduced memory usage and improved throughput with FP8/INT8 quantization

References

Data Reliability Notice

The figures in this benchmark are estimates based on specifications and benchmark data published by Meta, AWS, NVIDIA, and the vLLM project. Actual performance may vary depending on workload characteristics, input length, batch size, and model configuration. We recommend benchmarking in your actual environment before production deployment.