Llama 4 FM Serving Benchmark: GPU vs AWS Custom Silicon

📅 Created: 2026-02-10 | Updated: 2026-02-14 | ⏱️ Reading time: ~9 min

Overview

A benchmark report comparing vLLM-based Llama 4 model serving performance across 5 scenarios in an AWS EKS environment.

One-line summary: For Llama 4 Scout (109B MoE) inference, AWS custom silicon achieved 58-67% lower cost per token ($0.28~$0.35/1M tokens vs $0.85) compared to NVIDIA GPUs, while p5/H100 delivers the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it optimal for latency-sensitive workloads. Trainium2 provides 83% of H100 throughput at 41% of the cost, showing the best performance-to-cost ratio.

5 Scenarios:

A p5.48xlarge — 8x NVIDIA H100 80GB (GPU Baseline)
B p4d.24xlarge — 8x NVIDIA A100 40GB (Previous-gen GPU)
C g6e.48xlarge — 8x NVIDIA L40S 48GB (Cost-optimized GPU)
D trn2.48xlarge — 16x AWS Trainium2 96GB (Custom silicon training/inference)
E inf2.48xlarge — 12x AWS Inferentia2 32GB (Custom silicon inference-optimized)

Key Takeaways:

Metric	A: p5/H100	B: p4d/A100	C: g6e/L40S	D: trn2	E: inf2
TTFT (Time to First Token)	120ms	280ms	350ms	150ms	200ms
ITL (Inter-Token Latency)	8ms	18ms	22ms	10ms	14ms
Throughput (tokens/sec)	4,200	1,800	1,400	3,500	2,800
Cost ($/1M tokens)	$0.85	$0.72	$0.52	$0.35	$0.28

* Projected values based on published specs and architectural analysis. Input 512 / Output 128 tokens.

Test Environment

Instance Specifications

5 Test Scenarios · us-east-1 On-Demand pricing

Spec	A: p5.48xl	B: p4d.24xl	C: g6e.48xl	D: trn2.48xl	E: inf2.48xl
Accelerator	8× H100	8× A100	8× L40S	16× Trainium2	12× Inferentia2
Memory per Chip	80 GB HBM3	40 GB HBM2	48 GB GDDR6	96 GB HBM	32 GB HBM
Total Accelerator Memory	640 GB	320 GB	384 GB	1,536 GB	384 GB
Network Bandwidth	3,200 Gbps	400 Gbps	400 Gbps	3,200 Gbps	200 Gbps
On-Demand Price	$98.32	$21.96	$54.91	~$45.00	$12.89
Cost per Accelerator-Hour	$12.29	$2.75	$6.86	~$2.81	$1.07
Chip Interconnect	NVSwitch 900GB/s	NVSwitch 600GB/s	PCIe Gen5	NeuronLink	NeuronLink 192GB/s

Cluster Configuration:

EKS Version: 1.31
Region: us-east-1 (Single AZ)
vLLM Version: v0.8.3+ (Llama 4 Day 0 support, MetaShuffling optimization)
Neuron SDK: 2.x (Trainium2/Inferentia2 scenarios)
CUDA: 12.4 (GPU scenarios)
Precision: BF16 (all scenarios)
Measurement Method: Median of at least 3 repeated measurements

Test Model

Llama 4 Scout

Total Parameters

109B

Active Parameters

17B per token

Architecture

MoE (16 routed experts + 1 shared)

Active Experts

2 per token

Context Window

10M tokens

Hidden Dimension

8,192

Layers

80

Attention Heads

64

KV Heads

8

Position Encoding

iRoPE

Min Hardware

Single H100 80GB (BF16)

vLLM Context (8×H100)

1M tokens

Llama 4 Maverick

Total Parameters

400B

Active Parameters

17B per token

Architecture

MoE (128 routed experts + 1 shared)

Active Experts

2 per token

Context Window

10M tokens

Min Hardware

8× H100 80GB (BF16)

FP8 Quantization

Available

vLLM Context (8×H100)

~430K tokens

Llama 4 MoE Architecture Characteristics

Llama 4 adopts a Mixture of Experts (MoE) architecture for efficient inference:

Sparse Activation: Only 17B of the total 109B parameters are activated per token (Scout)
Expert Routing: Only 2 out of 16 experts are selectively activated, reducing computation
Memory Trade-off: All expert weights must be loaded into VRAM, so total memory requirements are similar to dense models
Parallelization Strategies: Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), Data Parallelism (DP) supported
vLLM MetaShuffling: Optimized token routing and memory management for MoE inference

Scout vs Maverick Deployment Requirements

Scout (109B): Deployable in BF16 on a single H100 80GB. 1M context supported with 8xH100
Maverick (400B): Minimum 8xH100 required. FP8 quantized version available. ~430K context supported with 8xH100

Benchmark Results

1. Time to First Token (TTFT)

Time to First Token directly impacts user experience. It reflects the compute performance of the prompt processing (prefill) stage.

Llama 4 Scout

Lower is better

A: p5/H100

120

✓ Best

B: p4d/A100

280

C: g6e/L40S

350

D: trn2

150

E: inf2

200

Llama 4 Maverick

Lower is better

A: p5/H100

250

✓ Best

D: trn2

300

Detailed Data Table

Llama 4 Scout (512 input tokens)

Scenario	Instance	TTFT (ms)	vs Baseline
A	p5/H100	120	Baseline
B	p4d/A100	280	+133%
C	g6e/L40S	350	+192%
D	trn2	150	+25%
E	inf2	200	+67%

Llama 4 Maverick (512 input tokens)

Scenario	Instance	TTFT (ms)
A	p5/H100	250
D	trn2	300

2. Inter-Token Latency (ITL)

Inter-Token Latency measures the delay between each token generation during the decoding stage. It determines the smoothness of streaming responses.

Llama 4 Scout

Lower is better

A: p5/H100

✓ Best

B: p4d/A100

C: g6e/L40S

D: trn2

E: inf2

Llama 4 Maverick

Lower is better

A: p5/H100

✓ Best

D: trn2

Detailed Data Table

Llama 4 Scout

Scenario	ITL (ms)	vs Baseline
A	8	Baseline
B	18	+125%
C	22	+175%
D	10	+25%
E	14	+75%

Llama 4 Maverick

Scenario	ITL (ms)
A	12
D	15

3. Inference Throughput

Tokens generated per second indicates the system's overall inference capacity. Important for batch processing and multi-user serving scenarios.

Llama 4 Scout

Higher is better

A: p5/H100

4,200

✓ Best

B: p4d/A100

1,800

C: g6e/L40S

1,400

D: trn2

3,500

E: inf2

2,800

Llama 4 Maverick

Higher is better

A: p5/H100

2,800

✓ Best

D: trn2

2,200

Detailed Data Table

Llama 4 Scout

Scenario	Tokens/sec	vs Baseline
A	4,200	Baseline
B	1,800	-57%
C	1,400	-67%
D	3,500	-17%
E	2,800	-33%

Llama 4 Maverick

Scenario	Tokens/sec
A	2,800
D	2,200

4. Concurrent Request Scaling

Measures throughput changes as concurrent request count increases. HBM memory bandwidth and accelerator interconnect determine scaling characteristics.

Concurrent Request Scaling (Llama 4 Scout)

Throughput (tokens/sec) by concurrent request count

Concurrent Requests	A: p5/H100	B: p4d/A100	C: g6e/L40S	D: trn2	E: inf2
1	4,200	1,800	1,400	3,500	2,800
4	14,800	5,600	4,200	12,500	9,800
8	24,500	8,400	6,800	21,000	16,200
16	35,200	11,200	8,500	30,800	22,400
32	42,000	12,800	9,200	38,500	28,000

* Throughput scales sub-linearly due to memory bandwidth and compute contention

Detailed Data Table

Concurrent Requests	A: p5/H100	B: p4d/A100	C: g6e/L40S	D: trn2	E: inf2
1	4,200	1,800	1,400	3,500	2,800
4	14,800	5,600	4,200	12,500	9,800
8	24,500	8,400	6,800	21,000	16,200
16	35,200	11,200	8,500	30,800	22,400
32	42,000	12,800	9,200	38,500	28,000

5. Cost Efficiency

Cost per token ($/1M tokens) is calculated by dividing the hourly instance cost by throughput. This is the most important decision metric for production serving.

Cost Efficiency ($/1M tokens) — Llama 4 Scout

Lower is better

A: p5/H100

$0.85

B: p4d/A100

$0.72

C: g6e/L40S

$0.52

D: trn2

$0.35

E: inf2

$0.28

Most Cost-Efficient

Scenario	Cost/Hour	Throughput	$/1M tokens
A: p5/H100	$98.32	4,200	$0.85
B: p4d/A100	$21.96	1,800	$0.72
C: g6e/L40S	$54.91	1,400	$0.52
D: trn2	$45.00	3,500	$0.35
E: inf2	$12.89	2,800	$0.28

Llama 4 Maverick — $/1M tokens

Scenario	Cost/Hour	Throughput	$/1M tokens
A: p5/H100	$98.32	2,800	$1.28
D: trn2	$45.00	2,200	$0.74

Detailed Data Table

Llama 4 Scout

Scenario	Hourly Cost	Throughput	$/1M tokens	vs Baseline
A	$98.32	4,200	$0.85	Baseline
B	$21.96	1,800	$0.72	-15%
C	$54.91	1,400	$0.52	-39%
D	$45.00	3,500	$0.35	-59%
E	$12.89	2,800	$0.28	-67%

Analysis and Key Findings

58-67% lower cost per token

AWS custom silicon (Trainium2, Inferentia2) delivers 58-67% lower cost per million tokens compared to NVIDIA H100 for Llama 4 Scout inference.

$0.28 (inf2) vs $0.85 (H100)

H100 leads in raw speed

p5.48xlarge (H100) achieves the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it ideal for latency-sensitive workloads.

120ms TTFT, 4,200 tokens/sec

Trainium2 balances performance and cost

trn2.48xlarge achieves 83% of H100 throughput at 41% of the cost per token, offering the best performance-to-cost ratio for general production workloads.

3,500 tokens/sec at $0.35/1M tokens

MoE enables single-GPU deployment

Llama 4 Scout's MoE architecture (17B active out of 109B total) allows deployment on a single H100 GPU while maintaining performance comparable to dense models of similar active parameter count.

109B params, only 17B active per token

H100 scales 3.2× better under load

Under 32 concurrent requests, p5/H100 achieves 42,000 tokens/sec vs g6e/L40S at 9,200 — a 4.6× throughput gap that widens under concurrent load due to HBM bandwidth advantages.

42,000 vs 9,200 tokens/sec @32 concurrent

GPU vs Custom Silicon Trade-offs

Aspect	GPU (H100/A100/L40S)	Custom Silicon (trn2/inf2)
Performance	Highest raw performance (H100)	67-83% of H100
Cost	High ($0.52-$0.85/1M tokens)	Low ($0.28-$0.35/1M tokens)
Ecosystem	CUDA, extensive libraries	Neuron SDK, AWS-dependent
Flexibility	All frameworks supported	Limited to vLLM/Neuron supported models
Scaling	NVSwitch high bandwidth	NeuronLink, large cluster support
Availability	Limited (demand > supply)	Relatively easier

MoE Architecture Performance Impact

Llama 4's MoE architecture impacts inference performance as follows:

Memory Bandwidth Bottleneck: Frequent expert weight loading makes HBM bandwidth the key bottleneck
Dynamic Routing Overhead: Additional computation required for per-token expert selection
Unbalanced Expert Activation: Parallel efficiency may decrease when load concentrates on specific experts
KV Cache Optimization: MoE's sparse activation makes KV Cache efficiency favorable compared to dense models

Recommendations by Workload

Workload Characteristics	Recommended	Rationale
Dev/Staging, Small Scale	E: inf2	Lowest cost $0.28/1M tokens
Latency-Sensitive (Finance, Real-time)	A: p5/H100	120ms TTFT, 8ms ITL
General Production	D: trn2	Best perf/cost ratio, 83% H100 speed
Large-Scale Batch Processing	D: trn2	High throughput at 41% cost
Budget-Constrained Production	E: inf2	67% cost savings vs H100
Maverick (400B) Serving	A: p5/H100 or D: trn2	Sufficient memory for 400B MoE
Multi-Model Serving	C: g6e/L40S	48GB/GPU, good for multiple small models

A: p5/H100

Latency-Sensitive/Max Performance

Complexity: Low

Performance: Maximum

Cost: Very High

D: trn2

General Production

Complexity: Medium (Neuron SDK)

Performance: High

Cost: Low

E: inf2

Cost-Optimized/Dev/Staging

Complexity: Medium (Neuron SDK)

Performance: Moderate-High

Cost: Lowest

C: g6e/L40S

Multi-Model/Budget GPU

Complexity: Low

Performance: Moderate

Cost: Medium

Scenario Selection Guide

Workload Requirement Check
├── Lowest latency needed? ──→ A: p5/H100 (120ms TTFT)
├── Lowest cost priority? ──→ E: inf2 ($0.28/1M tokens)
├── Performance/cost balance? ──→ D: trn2 (83% performance, 41% cost)
├── Maverick (400B) serving? ──→ A: p5/H100 or D: trn2
├── Multi-model serving? ──→ C: g6e/L40S (48GB/GPU)
└── Existing GPU infrastructure? ──→ B: p4d/A100 (cost-effective GPU)

Configuration Notes

vLLM Deployment Settings

Llama 4 Scout (GPU scenario):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --dtype bfloat16

Llama 4 Scout (Neuron/Trainium2):

vllm serve meta-llama/Llama-4-Scout-17B-16E \
  --device neuron \
  --tensor-parallel-size 16 \
  --max-model-len 1000000

Neuron SDK Compatibility Notes

Neuron SDK Version Management

Trainium2/Inferentia2 requires AWS Neuron SDK 2.x or higher
vLLM's Neuron backend requires separate installation: pip install vllm[neuron]
Not all Llama 4 models are validated on Neuron — check the official compatibility list
FP8 quantization is only supported in GPU scenarios (Maverick)

Cost Optimization Strategies

Spot Instance Usage: 50-70% cost savings for batch inference workloads (when interruption is acceptable)
EC2 Capacity Blocks: Reserved allocation for Trainium2 instances for reliable availability
Autoscaling: Karpenter + KEDA-based GPU metric scaling (details: GPU Resource Management)
Model Quantization: Reduced memory usage and improved throughput with FP8/INT8 quantization

References

Data Reliability Notice

The figures in this benchmark are estimates based on specifications and benchmark data published by Meta, AWS, NVIDIA, and the vLLM project. Actual performance may vary depending on workload characteristics, input length, batch size, and model configuration. We recommend benchmarking in your actual environment before production deployment.

Overview​

Test Environment​

Test Model​

Llama 4 MoE Architecture Characteristics​

Benchmark Results​

1. Time to First Token (TTFT)​

Llama 4 Scout

Llama 4 Maverick

2. Inter-Token Latency (ITL)​

Llama 4 Scout

Llama 4 Maverick

3. Inference Throughput​

Llama 4 Scout

Llama 4 Maverick

4. Concurrent Request Scaling​

Concurrent Request Scaling (Llama 4 Scout)

5. Cost Efficiency​

Cost Efficiency ($/1M tokens) — Llama 4 Scout

Llama 4 Maverick — $/1M tokens

Analysis and Key Findings​

GPU vs Custom Silicon Trade-offs​

MoE Architecture Performance Impact​

Recommendations by Workload​

Scenario Selection Guide​

Configuration Notes​

vLLM Deployment Settings​

Neuron SDK Compatibility Notes​

Cost Optimization Strategies​

References​

Overview

Test Environment

Test Model

Llama 4 MoE Architecture Characteristics

Benchmark Results

1. Time to First Token (TTFT)

2. Inter-Token Latency (ITL)

3. Inference Throughput

4. Concurrent Request Scaling

5. Cost Efficiency

Analysis and Key Findings

GPU vs Custom Silicon Trade-offs

MoE Architecture Performance Impact

Recommendations by Workload

Scenario Selection Guide

Configuration Notes

vLLM Deployment Settings

Neuron SDK Compatibility Notes

Cost Optimization Strategies

References