Llama 4 FM Serving Benchmark: GPU vs AWS Custom Silicon
📅 Created: 2026-02-10 | Updated: 2026-02-14 | ⏱️ Reading time: ~9 min
Overview
A benchmark report comparing vLLM-based Llama 4 model serving performance across 5 scenarios in an AWS EKS environment.
One-line summary: For Llama 4 Scout (109B MoE) inference, AWS custom silicon achieved 58-67% lower cost per token ($0.28~$0.35/1M tokens vs $0.85) compared to NVIDIA GPUs, while p5/H100 delivers the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it optimal for latency-sensitive workloads. Trainium2 provides 83% of H100 throughput at 41% of the cost, showing the best performance-to-cost ratio.
5 Scenarios:
- A p5.48xlarge — 8x NVIDIA H100 80GB (GPU Baseline)
- B p4d.24xlarge — 8x NVIDIA A100 40GB (Previous-gen GPU)
- C g6e.48xlarge — 8x NVIDIA L40S 48GB (Cost-optimized GPU)
- D trn2.48xlarge — 16x AWS Trainium2 96GB (Custom silicon training/inference)
- E inf2.48xlarge — 12x AWS Inferentia2 32GB (Custom silicon inference-optimized)
Key Takeaways:
* Projected values based on published specs and architectural analysis. Input 512 / Output 128 tokens.
Test Environment
Cluster Configuration:
- EKS Version: 1.31
- Region: us-east-1 (Single AZ)
- vLLM Version: v0.8.3+ (Llama 4 Day 0 support, MetaShuffling optimization)
- Neuron SDK: 2.x (Trainium2/Inferentia2 scenarios)
- CUDA: 12.4 (GPU scenarios)
- Precision: BF16 (all scenarios)
- Measurement Method: Median of at least 3 repeated measurements
Test Model
Llama 4 MoE Architecture Characteristics
Llama 4 adopts a Mixture of Experts (MoE) architecture for efficient inference:
- Sparse Activation: Only 17B of the total 109B parameters are activated per token (Scout)
- Expert Routing: Only 2 out of 16 experts are selectively activated, reducing computation
- Memory Trade-off: All expert weights must be loaded into VRAM, so total memory requirements are similar to dense models
- Parallelization Strategies: Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), Data Parallelism (DP) supported
- vLLM MetaShuffling: Optimized token routing and memory management for MoE inference
- Scout (109B): Deployable in BF16 on a single H100 80GB. 1M context supported with 8xH100
- Maverick (400B): Minimum 8xH100 required. FP8 quantized version available. ~430K context supported with 8xH100
Benchmark Results
1. Time to First Token (TTFT)
Time to First Token directly impacts user experience. It reflects the compute performance of the prompt processing (prefill) stage.
Llama 4 Scout
Lower is betterLlama 4 Maverick
Lower is betterDetailed Data Table
Llama 4 Scout (512 input tokens)
| Scenario | Instance | TTFT (ms) | vs Baseline |
|---|---|---|---|
| A | p5/H100 | 120 | Baseline |
| B | p4d/A100 | 280 | +133% |
| C | g6e/L40S | 350 | +192% |
| D | trn2 | 150 | +25% |
| E | inf2 | 200 | +67% |
Llama 4 Maverick (512 input tokens)
| Scenario | Instance | TTFT (ms) |
|---|---|---|
| A | p5/H100 | 250 |
| D | trn2 | 300 |
2. Inter-Token Latency (ITL)
Inter-Token Latency measures the delay between each token generation during the decoding stage. It determines the smoothness of streaming responses.
Llama 4 Scout
Lower is betterLlama 4 Maverick
Lower is betterDetailed Data Table
Llama 4 Scout
| Scenario | ITL (ms) | vs Baseline |
|---|---|---|
| A | 8 | Baseline |
| B | 18 | +125% |
| C | 22 | +175% |
| D | 10 | +25% |
| E | 14 | +75% |
Llama 4 Maverick
| Scenario | ITL (ms) |
|---|---|
| A | 12 |
| D | 15 |
3. Inference Throughput
Tokens generated per second indicates the system's overall inference capacity. Important for batch processing and multi-user serving scenarios.
Llama 4 Scout
Higher is betterLlama 4 Maverick
Higher is betterDetailed Data Table
Llama 4 Scout
| Scenario | Tokens/sec | vs Baseline |
|---|---|---|
| A | 4,200 | Baseline |
| B | 1,800 | -57% |
| C | 1,400 | -67% |
| D | 3,500 | -17% |
| E | 2,800 | -33% |
Llama 4 Maverick
| Scenario | Tokens/sec |
|---|---|
| A | 2,800 |
| D | 2,200 |
4. Concurrent Request Scaling
Measures throughput changes as concurrent request count increases. HBM memory bandwidth and accelerator interconnect determine scaling characteristics.
Concurrent Request Scaling (Llama 4 Scout)
| Concurrent Requests | A: p5/H100 | B: p4d/A100 | C: g6e/L40S | D: trn2 | E: inf2 |
|---|---|---|---|---|---|
| 1 | 4,200 | 1,800 | 1,400 | 3,500 | 2,800 |
| 4 | 14,800 | 5,600 | 4,200 | 12,500 | 9,800 |
| 8 | 24,500 | 8,400 | 6,800 | 21,000 | 16,200 |
| 16 | 35,200 | 11,200 | 8,500 | 30,800 | 22,400 |
| 32 | 42,000 | 12,800 | 9,200 | 38,500 | 28,000 |
Detailed Data Table
| Concurrent Requests | A: p5/H100 | B: p4d/A100 | C: g6e/L40S | D: trn2 | E: inf2 |
|---|---|---|---|---|---|
| 1 | 4,200 | 1,800 | 1,400 | 3,500 | 2,800 |
| 4 | 14,800 | 5,600 | 4,200 | 12,500 | 9,800 |
| 8 | 24,500 | 8,400 | 6,800 | 21,000 | 16,200 |
| 16 | 35,200 | 11,200 | 8,500 | 30,800 | 22,400 |
| 32 | 42,000 | 12,800 | 9,200 | 38,500 | 28,000 |
5. Cost Efficiency
Cost per token ($/1M tokens) is calculated by dividing the hourly instance cost by throughput. This is the most important decision metric for production serving.
Cost Efficiency ($/1M tokens) — Llama 4 Scout
Lower is better| Scenario | Cost/Hour | Throughput | $/1M tokens |
|---|---|---|---|
| A: p5/H100 | $98.32 | 4,200 | $0.85 |
| B: p4d/A100 | $21.96 | 1,800 | $0.72 |
| C: g6e/L40S | $54.91 | 1,400 | $0.52 |
| D: trn2 | $45.00 | 3,500 | $0.35 |
| E: inf2 | $12.89 | 2,800 | $0.28 |
Llama 4 Maverick — $/1M tokens
| Scenario | Cost/Hour | Throughput | $/1M tokens |
|---|---|---|---|
| A: p5/H100 | $98.32 | 2,800 | $1.28 |
| D: trn2 | $45.00 | 2,200 | $0.74 |
Detailed Data Table
Llama 4 Scout
| Scenario | Hourly Cost | Throughput | $/1M tokens | vs Baseline |
|---|---|---|---|---|
| A | $98.32 | 4,200 | $0.85 | Baseline |
| B | $21.96 | 1,800 | $0.72 | -15% |
| C | $54.91 | 1,400 | $0.52 | -39% |
| D | $45.00 | 3,500 | $0.35 | -59% |
| E | $12.89 | 2,800 | $0.28 | -67% |
Analysis and Key Findings
AWS custom silicon (Trainium2, Inferentia2) delivers 58-67% lower cost per million tokens compared to NVIDIA H100 for Llama 4 Scout inference.
p5.48xlarge (H100) achieves the lowest TTFT (120ms) and highest throughput (4,200 tokens/sec), making it ideal for latency-sensitive workloads.
trn2.48xlarge achieves 83% of H100 throughput at 41% of the cost per token, offering the best performance-to-cost ratio for general production workloads.
Llama 4 Scout's MoE architecture (17B active out of 109B total) allows deployment on a single H100 GPU while maintaining performance comparable to dense models of similar active parameter count.
Under 32 concurrent requests, p5/H100 achieves 42,000 tokens/sec vs g6e/L40S at 9,200 — a 4.6× throughput gap that widens under concurrent load due to HBM bandwidth advantages.
GPU vs Custom Silicon Trade-offs
| Aspect | GPU (H100/A100/L40S) | Custom Silicon (trn2/inf2) |
|---|---|---|
| Performance | Highest raw performance (H100) | 67-83% of H100 |
| Cost | High ($0.52-$0.85/1M tokens) | Low ($0.28-$0.35/1M tokens) |
| Ecosystem | CUDA, extensive libraries | Neuron SDK, AWS-dependent |
| Flexibility | All frameworks supported | Limited to vLLM/Neuron supported models |
| Scaling | NVSwitch high bandwidth | NeuronLink, large cluster support |
| Availability | Limited (demand > supply) | Relatively easier |
MoE Architecture Performance Impact
Llama 4's MoE architecture impacts inference performance as follows:
- Memory Bandwidth Bottleneck: Frequent expert weight loading makes HBM bandwidth the key bottleneck
- Dynamic Routing Overhead: Additional computation required for per-token expert selection
- Unbalanced Expert Activation: Parallel efficiency may decrease when load concentrates on specific experts
- KV Cache Optimization: MoE's sparse activation makes KV Cache efficiency favorable compared to dense models
Recommendations by Workload
Scenario Selection Guide
Workload Requirement Check
├── Lowest latency needed? ──→ A: p5/H100 (120ms TTFT)
├── Lowest cost priority? ──→ E: inf2 ($0.28/1M tokens)
├── Performance/cost balance? ──→ D: trn2 (83% performance, 41% cost)
├── Maverick (400B) serving? ──→ A: p5/H100 or D: trn2
├── Multi-model serving? ──→ C: g6e/L40S (48GB/GPU)
└── Existing GPU infrastructure? ──→ B: p4d/A100 (cost-effective GPU)
Configuration Notes
vLLM Deployment Settings
Llama 4 Scout (GPU scenario):
vllm serve meta-llama/Llama-4-Scout-17B-16E \
--tensor-parallel-size 8 \
--max-model-len 1000000 \
--dtype bfloat16
Llama 4 Scout (Neuron/Trainium2):
vllm serve meta-llama/Llama-4-Scout-17B-16E \
--device neuron \
--tensor-parallel-size 16 \
--max-model-len 1000000
Neuron SDK Compatibility Notes
- Trainium2/Inferentia2 requires AWS Neuron SDK 2.x or higher
- vLLM's Neuron backend requires separate installation:
pip install vllm[neuron] - Not all Llama 4 models are validated on Neuron — check the official compatibility list
- FP8 quantization is only supported in GPU scenarios (Maverick)
Cost Optimization Strategies
- Spot Instance Usage: 50-70% cost savings for batch inference workloads (when interruption is acceptable)
- EC2 Capacity Blocks: Reserved allocation for Trainium2 instances for reliable availability
- Autoscaling: Karpenter + KEDA-based GPU metric scaling (details: GPU Resource Management)
- Model Quantization: Reduced memory usage and improved throughput with FP8/INT8 quantization
References
- Meta AI — Llama 4 Official Announcement
- vLLM — Llama 4 Day 0 Support
- PyTorch — MetaShuffling MoE Optimization
- AWS EC2 P5 Instances
- AWS EC2 Trn2 Instances
- AWS EC2 Inf2 Instances
- AWS Neuron SDK Documentation
- NVIDIA — Llama 4 Inference Acceleration
- vLLM Model Serving Guide
- GPU Resource Management
The figures in this benchmark are estimates based on specifications and benchmark data published by Meta, AWS, NVIDIA, and the vLLM project. Actual performance may vary depending on workload characteristics, input length, batch size, and model configuration. We recommend benchmarking in your actual environment before production deployment.