Skip to main content

Llama 4 FM Serving Benchmark: GPU vs AWS Custom Silicon

📅 Written: 2026-02-10 | Last Modified: 2026-02-14 | ⏱️ Reading Time: ~7 min

Overview

This benchmark report compares Llama 4 model serving performance across 5 scenarios in AWS EKS environment using vLLM.

One-line summary: In Llama 4 Scout (109B MoE) inference, AWS custom silicon achieves 58-67% lower cost per token ($0.28~$0.35/1M tokens vs $0.85) compared to NVIDIA GPUs, while p5/H100 delivers the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec) for latency-sensitive workloads. Trainium2 provides 83% of H100 throughput at 41% of the cost, offering the best performance-to-cost ratio.

5 Scenarios:

  • A p5.48xlarge — 8× NVIDIA H100 80GB (GPU baseline)
  • B p4d.24xlarge — 8× NVIDIA A100 40GB (previous generation GPU)
  • C g6e.48xlarge — 8× NVIDIA L40S 48GB (cost-optimized GPU)
  • D trn2.48xlarge — 16× AWS Trainium2 96GB (custom silicon training/inference)
  • E inf2.48xlarge — 12× AWS Inferentia2 32GB (custom silicon inference-specialized)

Key Findings:

MetricA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
TTFT (Time to First Token)120ms280ms350ms150ms200ms
ITL (Inter-Token Latency)8ms18ms22ms10ms14ms
Throughput (tokens/sec)4,2001,8001,4003,5002,800
Cost ($/1M tokens)$0.85$0.72$0.52$0.35$0.28

* Projected values based on published specs and architectural analysis. Input 512 / Output 128 tokens.


Test Environment

Instance Specifications
5 Test Scenarios · us-east-1 On-Demand pricing
SpecA: p5.48xlB: p4d.24xlC: g6e.48xlD: trn2.48xlE: inf2.48xl
Accelerator8× H1008× A1008× L40S16× Trainium212× Inferentia2
Memory per Chip80 GB HBM340 GB HBM248 GB GDDR696 GB HBM32 GB HBM
Total Accelerator Memory640 GB320 GB384 GB1,536 GB384 GB
Network Bandwidth3,200 Gbps400 Gbps400 Gbps3,200 Gbps200 Gbps
On-Demand Price$98.32$21.96$54.91~$45.00$12.89
Cost per Accelerator-Hour$12.29$2.75$6.86~$2.81$1.07
Chip InterconnectNVSwitch 900GB/sNVSwitch 600GB/sPCIe Gen5NeuronLinkNeuronLink 192GB/s

Cluster Configuration:

  • EKS Version: 1.31
  • Region: us-east-1 (single AZ)
  • vLLM Version: v0.8.3+ (Llama 4 Day 0 support, MetaShuffling optimization)
  • Neuron SDK: 2.x (Trainium2/Inferentia2 scenarios)
  • CUDA: 12.4 (GPU scenarios)
  • Precision: BF16 (all scenarios)
  • Measurement Method: Median value from minimum 3 repeated measurements

Test Models

Llama 4 Scout
Total Parameters
109B
Active Parameters
17B per token
Architecture
MoE (16 routed experts + 1 shared)
Active Experts
2 per token
Context Window
10M tokens
Hidden Dimension
8,192
Layers
80
Attention Heads
64
KV Heads
8
Position Encoding
iRoPE
Min Hardware
Single H100 80GB (BF16)
vLLM Context (8×H100)
1M tokens
Llama 4 Maverick
Total Parameters
400B
Active Parameters
17B per token
Architecture
MoE (128 routed experts + 1 shared)
Active Experts
2 per token
Context Window
10M tokens
Min Hardware
8× H100 80GB (BF16)
FP8 Quantization
Available
vLLM Context (8×H100)
~430K tokens

Llama 4 MoE Architecture Characteristics

Llama 4 adopts Mixture of Experts (MoE) architecture for efficient inference:

  • Sparse Activation: Only 17B out of 109B total parameters active per token (Scout)
  • Expert Routing: Selectively activates only 2 out of 16 experts to reduce computation
  • Memory Trade-off: All expert weights must be loaded into VRAM, so total memory requirement is similar to dense models
  • Parallelization Strategy: Supports Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), Data Parallelism (DP)
  • vLLM MetaShuffling: Token routing and memory management optimized for MoE inference
Scout vs Maverick Deployment Requirements
  • Scout (109B): Can be deployed on single H100 80GB with BF16. Supports 1M context with 8×H100
  • Maverick (400B): Requires minimum 8×H100. FP8 quantized version available. Supports ~430K context with 8×H100

Benchmark Results

1. Time to First Token (TTFT)

Time to First Token is a key metric that directly impacts user experience. It reflects the computational performance of the prompt processing (prefill) stage.

Llama 4 Scout

Lower is better
A: p5/H100
120
Best
B: p4d/A100
280
C: g6e/L40S
350
D: trn2
150
E: inf2
200

Llama 4 Maverick

Lower is better
A: p5/H100
250
Best
D: trn2
300
📊 Detailed Data Table

Llama 4 Scout (512 input tokens)

ScenarioInstanceTTFT (ms)vs Baseline
Ap5/H100120Baseline
Bp4d/A100280+133%
Cg6e/L40S350+192%
Dtrn2150+25%
Einf2200+67%

Llama 4 Maverick (512 input tokens)

ScenarioInstanceTTFT (ms)
Ap5/H100250
Dtrn2300

2. Inter-Token Latency (ITL)

Inter-Token Latency measures the delay between each token generation during the decoding stage. It determines the smoothness of streaming responses.

Llama 4 Scout

Lower is better
A: p5/H100
8
Best
B: p4d/A100
18
C: g6e/L40S
22
D: trn2
10
E: inf2
14

Llama 4 Maverick

Lower is better
A: p5/H100
12
Best
D: trn2
15
📊 Detailed Data Table

Llama 4 Scout

ScenarioITL (ms)vs Baseline
A8Baseline
B18+125%
C22+175%
D10+25%
E14+75%

Llama 4 Maverick

ScenarioITL (ms)
A12
D15

3. Inference Throughput

Tokens generated per second represents the overall inference capability of the system. Important for batch processing and multi-user serving scenarios.

Llama 4 Scout

Higher is better
A: p5/H100
4,200
Best
B: p4d/A100
1,800
C: g6e/L40S
1,400
D: trn2
3,500
E: inf2
2,800

Llama 4 Maverick

Higher is better
A: p5/H100
2,800
Best
D: trn2
2,200
📊 Detailed Data Table

Llama 4 Scout

ScenarioTokens/secvs Baseline
A4,200Baseline
B1,800-57%
C1,400-67%
D3,500-17%
E2,800-33%

Llama 4 Maverick

ScenarioTokens/sec
A2,800
D2,200

4. Concurrent Request Scaling

Measures throughput changes as the number of concurrent requests increases. HBM memory bandwidth and accelerator interconnect determine scaling characteristics.

Concurrent Request Scaling (Llama 4 Scout)

Throughput (tokens/sec) by concurrent request count
Concurrent RequestsA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
14,2001,8001,4003,5002,800
414,8005,6004,20012,5009,800
824,5008,4006,80021,00016,200
1635,20011,2008,50030,80022,400
3242,00012,8009,20038,50028,000
* Throughput scales sub-linearly due to memory bandwidth and compute contention
📊 Detailed Data Table
Concurrent RequestsA: p5/H100B: p4d/A100C: g6e/L40SD: trn2E: inf2
14,2001,8001,4003,5002,800
414,8005,6004,20012,5009,800
824,5008,4006,80021,00016,200
1635,20011,2008,50030,80022,400
3242,00012,8009,20038,50028,000

5. Cost Efficiency

Cost per token ($/1M tokens) is calculated by dividing instance hourly cost by throughput. The most important decision metric for production serving.

Cost Efficiency ($/1M tokens)Llama 4 Scout

Lower is better
A: p5/H100
$0.85
B: p4d/A100
$0.72
C: g6e/L40S
$0.52
D: trn2
$0.35
E: inf2
$0.28
Most Cost-Efficient
ScenarioCost/HourThroughput$/1M tokens
A: p5/H100$98.324,200$0.85
B: p4d/A100$21.961,800$0.72
C: g6e/L40S$54.911,400$0.52
D: trn2$45.003,500$0.35
E: inf2$12.892,800$0.28

Llama 4 Maverick$/1M tokens

ScenarioCost/HourThroughput$/1M tokens
A: p5/H100$98.322,800$1.28
D: trn2$45.002,200$0.74
📊 Detailed Data Table

Llama 4 Scout

ScenarioHourly CostThroughput$/1M tokensvs Baseline
A$98.324,200$0.85Baseline
B$21.961,800$0.72-15%
C$54.911,400$0.52-39%
D$45.003,500$0.35-59%
E$12.892,800$0.28-67%

Analysis and Key Findings

58-67% lower cost per token

AWS custom silicon (Trainium2, Inferentia2) delivers 58-67% lower cost per million tokens compared to NVIDIA H100 for Llama 4 Scout inference.

$0.28 (inf2) vs $0.85 (H100)
H100 leads in raw speed

p5.48xlarge (H100) achieves the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it ideal for latency-sensitive workloads.

120ms TTFT, 4,200 tokens/sec
Trainium2 balances performance and cost

trn2.48xlarge achieves 83% of H100 throughput at 41% of the cost per token, offering the best performance-to-cost ratio for general production workloads.

3,500 tokens/sec at $0.35/1M tokens
MoE enables single-GPU deployment

Llama 4 Scout's MoE architecture (17B active out of 109B total) allows deployment on a single H100 GPU while maintaining performance comparable to dense models of similar active parameter count.

109B params, only 17B active per token
H100 scales 3.2× better under load

Under 32 concurrent requests, p5/H100 achieves 42,000 tokens/sec vs g6e/L40S at 9,200 — a 4.6× throughput gap that widens under concurrent load due to HBM bandwidth advantages.

42,000 vs 9,200 tokens/sec @32 concurrent

GPU vs Custom Silicon Trade-offs

PerspectiveGPU (H100/A100/L40S)Custom Silicon (trn2/inf2)
PerformanceHighest raw performance (H100)67-83% of H100 level
CostHigh ($0.52-$0.85/1M tokens)Low ($0.28-$0.35/1M tokens)
EcosystemCUDA, extensive librariesNeuron SDK, AWS-dependent
FlexibilityAll frameworks supportedLimited to vLLM/Neuron supported models
ScalingNVSwitch high bandwidthNeuronLink, large-scale clusters
AvailabilityLimited (demand > supply)Relatively easier

MoE Architecture Performance Impact

Llama 4's MoE architecture has the following impacts on inference performance:

  1. Memory Bandwidth Bottleneck: Frequent expert weight loading makes HBM bandwidth the key bottleneck
  2. Dynamic Routing Overhead: Additional computation required for per-token expert selection
  3. Imbalanced Expert Activation: Parallel efficiency may degrade when specific experts are overloaded
  4. KV Cache Optimization: MoE's sparse activation provides better KV cache efficiency compared to dense models

Workload-Based Recommendations

Workload CharacteristicsRecommendedRationale
Dev/Staging, Small ScaleE: inf2Lowest cost $0.28/1M tokens
Latency-Sensitive (Finance, Real-time)A: p5/H100120ms TTFT, 8ms ITL
General ProductionD: trn2Best perf/cost ratio, 83% H100 speed
Large-Scale Batch ProcessingD: trn2High throughput at 41% cost
Budget-Constrained ProductionE: inf267% cost savings vs H100
Maverick (400B) ServingA: p5/H100 or D: trn2Sufficient memory for 400B MoE
Multi-Model ServingC: g6e/L40S48GB/GPU, good for multiple small models
A: p5/H100
Latency-Sensitive/Max Performance
Complexity: Low
Performance: Maximum
Cost: Very High
D: trn2
General Production
Complexity: Medium (Neuron SDK)
Performance: High
Cost: Low
E: inf2
Cost-Optimized/Dev/Staging
Complexity: Medium (Neuron SDK)
Performance: Moderate-High
Cost: Lowest
C: g6e/L40S
Multi-Model/Budget GPU
Complexity: Low
Performance: Moderate
Cost: Medium

Scenario Selection Guide

Check Workload Requirements
├── Need lowest latency? ──→ A: p5/H100 (120ms TTFT)
├── Lowest cost priority? ──→ E: inf2 ($0.28/1M tokens)
├── Performance/cost balance? ──→ D: trn2 (83% performance, 41% cost)
├── Serving Maverick (400B)? ──→ A: p5/H100 or D: trn2
├── Multi-model serving? ──→ C: g6e/L40S (48GB/GPU)
└── Existing GPU infrastructure? ──→ B: p4d/A100 (cost-efficient GPU)

Configuration Considerations

vLLM Deployment Setup

Llama 4 Scout (GPU scenarios):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
--tensor-parallel-size 8 \
--max-model-len 1000000 \
--dtype bfloat16

Llama 4 Scout (Neuron/Trainium2):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
--device neuron \
--tensor-parallel-size 16 \
--max-model-len 1000000

Neuron SDK Compatibility Notes

Neuron SDK Version Management
  • Trainium2/Inferentia2 require AWS Neuron SDK 2.x or later
  • vLLM's Neuron backend requires separate installation: pip install vllm[neuron]
  • Not all Llama 4 models are validated on Neuron — check official compatibility list
  • FP8 quantization is only supported in GPU scenarios (Maverick)

Cost Optimization Strategies

  1. Spot Instance Utilization: 50-70% cost savings for batch inference workloads (when interruption is acceptable)
  2. EC2 Capacity Blocks: Secure stable availability through reserved allocation for Trainium2 instances
  3. Auto-scaling: GPU metric-based scaling with Karpenter + KEDA (details: GPU Resource Management)
  4. Model Quantization: Reduce memory usage and improve throughput with FP8/INT8 quantization

References

Data Reliability Notice

The figures in this benchmark are estimates based on specifications and benchmark data published by Meta, AWS, NVIDIA, and the vLLM project. Actual performance may vary depending on workload characteristics, input length, batch size, and model configuration. We recommend benchmarking in your actual environment before production deployment.