MoE Model Serving Concept Guide
Current version: vLLM v0.18+ / v0.19.x (as of April 2026)
Overview
Mixture of Experts (MoE) is an architecture that improves the efficiency of large language models. By activating only a subset of Experts out of the total parameters for each token, MoE models achieve quality comparable to Dense models with less computation.
This document covers the core concepts of MoE architecture, per-model resource requirements, and distributed deployment strategies.
For practical deployment including EKS deployment YAML, helm commands, and multi-node configuration for MoE models, refer to the Custom Model Deployment Guide.
Understanding MoE Architecture
Expert Network Structure
MoE models consist of multiple "Expert" networks and a "Router (Gate)" network that selects them.
Routing Mechanisms
The core of MoE models is the routing mechanism that selects appropriate Experts based on input tokens.
- Gate Computation: Pass the input token's hidden state through the Gate network
- Expert Selection: Select Top-K Experts from Softmax output
- Parallel Processing: Selected Experts process the input in parallel
- Weighted Summation: Combine Expert outputs with Gate weights
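The four steps above can be expressed in a few lines of PyTorch. This is a minimal, illustrative sketch with made-up dimensions and a per-token Python loop; production MoE layers batch tokens per Expert and fuse these steps into optimized kernels.

```python
import torch
import torch.nn as nn

hidden_dim, num_experts, top_k = 64, 8, 2      # illustrative sizes

gate = nn.Linear(hidden_dim, num_experts, bias=False)                     # Router (Gate) network
experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))

x = torch.randn(4, hidden_dim)                                            # hidden states of 4 tokens

probs = gate(x).softmax(dim=-1)                                           # 1. Gate computation
weights, idx = torch.topk(probs, top_k, dim=-1)                           # 2. Top-K Expert selection
weights = weights / weights.sum(dim=-1, keepdim=True)                     #    renormalize gate weights

out = torch.zeros_like(x)
for t in range(x.size(0)):                                                # 3. selected Experts process the token
    for j in range(top_k):                                                # 4. weighted sum of Expert outputs
        out[t] += weights[t, j] * experts[int(idx[t, j])](x[t])
```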
MoE vs Dense Model Comparison
- Computational Efficiency: Faster inference by activating only a portion of total parameters
- Scalability: Model capacity expandable by adding Experts
- Specialization: Each Expert can specialize in specific domains/tasks
GPU Memory Requirements
MoE models activate only a fraction of their parameters per token, but every Expert must still be loaded into GPU memory.
| Model | Total Parameters | Active Parameters | FP16 Memory | INT8 Memory | Recommended GPU |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | ~94GB | ~47GB | 2x A100 80GB |
| Mixtral 8x22B | 141B | 39B | ~282GB | ~141GB | 4x H100 80GB |
| DeepSeek-V3 | 671B | 37B | ~800GB* | ~400GB* | 8x H100 80GB |
| DeepSeek-MoE 16B | 16.4B | 2.8B | ~33GB | ~17GB | 1x A100 40GB |
| Qwen2.5-MoE-A14B | ~50B | 14B | ~100GB | ~50GB | 2x A100 80GB |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | ~29GB | ~15GB | 1x A100 40GB |
| DBRX | 132B | 36B | ~264GB | ~132GB | 4x H100 80GB |
| GLM-5 | 744B | 40B | ~1.5TB | ~744GB | 2x p5.48xlarge (PP=2) |
| Kimi K2.5 | ~1T | 32B | ~2TB | ~500GB (INT4) | 1x p5.48xlarge (INT4) |
DeepSeek-V3: Uses Multi-head Latent Attention (MLA) architecture to significantly reduce KV cache memory. Achieves approximately 40% memory savings compared to traditional MHA, so actual memory requirements may be lower than listed values.
GLM-5 (released February 2026): 744B total parameters / 40B active, 8 of 256 experts activated. SWE-bench Verified 77.8%, Agentic Coding #1 (55.00), MIT license. FP8 quantized version requires approximately 744GB VRAM (2x p5.48xlarge, PP=2). HuggingFace: zai-org/GLM-5-FP8
Kimi K2.5 (released January 2026): approximately 1T total parameters / 32B active, Modified DeepSeek V3 MoE architecture. SWE-bench Verified 76.8%, HumanEval 99%, Agent Swarm support. INT4 quantized version requires approximately 500GB VRAM (1x p5.48xlarge, TP=8). HuggingFace: moonshotai/Kimi-K2.5
Exact memory requirements vary with batch size and sequence length, so profile under realistic workloads. Beyond the model weights, account for the following:
- KV Cache: Additional memory needed based on batch size and sequence length
- Activation Memory: Storage space for intermediate activation values during inference
- CUDA Context: Approximately 1-2GB CUDA overhead per GPU
- Safety Margin: Recommended 10-20% headroom in production
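The back-of-the-envelope sketch below combines these factors into a rough estimate. The Mixtral-like shapes (32 layers, 8 KV heads, head dim 128) and the 15% headroom are illustrative assumptions; profile the actual deployment for real numbers.

```python
def weight_mem_gb(total_params_b: float, bytes_per_param: float) -> float:
    """All Experts must be resident, so total (not active) parameters count."""
    return total_params_b * bytes_per_param          # 1B params at 1 byte/param ~= 1 GB

def kv_cache_gb(batch: int, seq_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """K and V per token: 2 * layers * kv_heads * head_dim * bytes."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token / 1024**3

# Example: Mixtral 8x7B-like shapes in FP16
weights = weight_mem_gb(46.7, 2)                                                # ~93 GB of weights
kv = kv_cache_gb(batch=32, seq_len=4096, layers=32, kv_heads=8, head_dim=128)   # ~16 GB
total = (weights + kv) * 1.15 + 2                                               # +15% headroom, ~2 GB CUDA context
print(f"weights ~{weights:.0f} GB, KV cache ~{kv:.0f} GB, plan for ~{total:.0f} GB")
```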
Distributed Deployment Strategies
Large MoE models cannot be loaded on a single GPU, making distributed deployment essential.
Tensor Parallelism Configuration
Tensor Parallelism distributes each model layer across multiple GPUs.
- NVLink Utilization: Use NVLink-supported instances for high-speed inter-GPU communication
- TP Size Selection: Choose minimum TP size based on model size and GPU memory
- Communication Overhead: Larger TP size increases All-Reduce communication
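As a concrete example, the sketch below loads Mixtral 8x7B with TP=2 through vLLM's offline LLM API; the OpenAI-compatible server exposes the same setting as --tensor-parallel-size. The model name and values are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,          # split each layer across 2 GPUs (e.g. 2x A100 80GB)
    dtype="float16",
    gpu_memory_utilization=0.90,     # leave headroom for CUDA context and fragmentation
)

outputs = llm.generate(
    ["Explain Mixture of Experts in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```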
Expert Parallelism
Expert Parallelism distributes MoE model Experts across multiple GPUs. In vLLM v0.19.x, Experts are automatically distributed within TP.
Expert Activation Patterns
Understanding Expert activation patterns is important for MoE model performance optimization.
- Auxiliary Loss: A load-balancing loss added during training to encourage even token distribution across Experts
- Capacity Factor: Upper bound on the number of tokens each Expert processes per batch (see the sketch after this list)
- Token Dropping: Tokens exceeding Expert capacity are dropped (recommended to disable during inference)
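The sketch below illustrates these knobs with Switch-Transformer-style formulas. The exact loss definition varies by model family, and the routing statistics here are random placeholders rather than values from a real router.

```python
import math
import torch

num_experts, capacity_factor, alpha = 8, 1.25, 0.01
tokens_in_batch = 4096

# Capacity Factor: max tokens any single Expert may process in this batch
expert_capacity = math.ceil(tokens_in_batch / num_experts * capacity_factor)

# Auxiliary Loss (Switch-Transformer style): alpha * N * sum(f_i * P_i), where
# f_i = fraction of tokens routed to Expert i, P_i = mean router probability for Expert i
f = torch.rand(num_experts); f = f / f.sum()     # placeholder routing fractions
p = torch.rand(num_experts); p = p / p.sum()     # placeholder router probabilities
aux_loss = alpha * num_experts * torch.sum(f * p)

print(expert_capacity, aux_loss.item())
```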
700B+ MoE Model Multi-node Deployment Concepts
700B+ MoE models like GLM-5 and Kimi K2.5 cannot be loaded on a single node, making multi-node deployment essential. vLLM v0.18+ supports multi-node deployment based on LeaderWorkerSet (LWS).
| Model | Total Parameters | Active Parameters | Recommended Config | VRAM Requirement |
|---|---|---|---|---|
| GLM-5 FP8 | 744B | 40B | 2x p5.48xlarge, PP=2, TP=8 | approximately 744GB |
| Kimi K2.5 INT4 | approximately 1T | 32B | 1x p5.48xlarge, TP=8 | approximately 500GB |
| DeepSeek-V3 FP8 | 671B | 37B | 2x p5.48xlarge, PP=2, TP=8 | approximately 671GB |
| Mixtral 8x22B | 141B | 39B | 1x p5.48xlarge, TP=4 | approximately 282GB |
| Mixtral 8x7B | 47B | 13B | 1x p4d.24xlarge, TP=2 | approximately 94GB |
- Use LeaderWorkerSet: Kubernetes-native multi-node deployment without Ray dependency
- Pipeline Parallelism: PP=2 or more to partition layers across nodes
- FP8 Quantization: Memory savings (GLM-5 FP8 version recommended)
- Network Optimization: NCCL configuration for inter-node communication optimization (EFA recommended)
- INT4/AWQ Quantization: Consider when single-node deployment is possible (Kimi K2.5)
- Network Bandwidth: Overhead from inter-node All-Reduce communication (EFA recommended)
- Loading Time: 700B+ models may take 20-30 minutes for initial loading
- Memory Headroom: 10-15% safety margin required
- LeaderWorkerSet CRD: LWS Operator must be installed on the cluster
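As a sanity check on the table above, the arithmetic below estimates per-GPU weight memory for GLM-5 FP8 on 2x p5.48xlarge (PP=2, TP=8). All figures are rough assumptions for illustration, not measured values.

```python
total_params_b  = 744          # GLM-5 total parameters (billions)
bytes_per_param = 1            # FP8 weights
gpus            = 2 * 8        # PP=2 nodes x TP=8 GPUs per node (p5.48xlarge: 8x H100 80GB)
hbm_per_gpu_gb  = 80

weights_gb      = total_params_b * bytes_per_param   # ~744 GB of weights in total
weights_per_gpu = weights_gb / gpus                  # ~46.5 GB per GPU
headroom        = hbm_per_gpu_gb - weights_per_gpu   # left for KV cache, activations,
                                                     # CUDA context, and safety margin
print(f"~{weights_per_gpu:.1f} GB weights per GPU, ~{headroom:.1f} GB headroom per GPU")
```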
vLLM-Based MoE Serving Features
vLLM v0.18+ provides the following optimizations for MoE models:
- Expert Parallelism: Expert distribution across multiple GPUs
- Tensor Parallelism: Intra-layer tensor splitting
- PagedAttention: Efficient KV Cache management
- Continuous Batching: Dynamic batch processing
- FP8 KV Cache: 2x memory savings
- Improved Prefix Caching: 400%+ throughput improvement on workloads with long shared prefixes
- Multi-LoRA Serving: Simultaneous serving of multiple LoRA adapters on a single base model
- GGUF Quantization: GGUF format quantized model support
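Several of these features map directly to engine arguments. The sketch below shows one possible combination; the argument names follow recent vLLM releases and should be verified against the version you deploy.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",            # FP8 KV cache (~2x KV memory savings vs FP16)
    enable_prefix_caching=True,      # reuse KV blocks for shared prompt prefixes
    enable_lora=True,                # Multi-LoRA serving on a single base model
    max_loras=4,
)
```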
Text Generation Inference (TGI) entered maintenance mode in 2025; use vLLM for new deployments. When migrating from an existing TGI deployment, vLLM's OpenAI-compatible API minimizes client code changes.
AWS Trainium2-Based MoE Inference
AWS Trainium2 / Inferentia2 offer a cost-efficient alternative to GPUs for large-scale MoE models (DBRX, Mixtral 8x22B, Llama 4 MoE, etc.), typically at a lower per-token cost. The Neuron stack maps Expert Parallelism and Tensor Parallelism onto NeuronCores and serves models through NxD Inference or the vLLM Neuron backend.
Summary
| Item | Overview |
|---|---|
| Hardware | trn2.48xlarge (16 Trainium2 chips / 128 NeuronCores / 1.5TB HBM), inf2 series |
| SDK | AWS Neuron SDK 2.x, torch-neuronx, neuronx-cc |
| Inference Framework | NxD Inference (AWS official), vLLM Neuron backend, TGI Neuron fork |
| Quantization | BF16/FP16/FP8 (E4M3/E5M2); partial AWQ/GPTQ support; GGUF not supported |
| Suitable MoE | DBRX 132B, Mixtral 8x7B/8x22B, Llama 4 MoE (within NxD support scope) |
For Neuron SDK architecture, instance lineup, Device Plugin deployment, Karpenter NodePool, inference framework comparison (NxD / vLLM Neuron / TGI Neuron), supported model matrix, observability, limitations and considerations, refer to the dedicated document below.
→ AWS Neuron Stack — Trainium2/Inferentia2 on EKS
For NVIDIA vs Neuron decision-making at the node selection stage, refer to EKS GPU Node Strategy.
Performance Optimization Concepts
KV Cache Optimization
KV Cache is a key factor significantly impacting inference performance.
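As a rough sizing exercise, the sketch below estimates how many concurrent sequences fit in the KV cache after weights are loaded, using assumed Mixtral-like shapes; vLLM logs the authoritative number of available KV cache blocks at startup.

```python
# Mixtral-8x7B-like shapes with an FP16 KV cache (illustrative assumptions)
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V, ~128 KiB/token

# 2x A100 80GB at 90% utilization, minus ~94 GB of FP16 weights
kv_budget_gb = 2 * 80 * 0.90 - 94
max_tokens = kv_budget_gb * 1024**3 / kv_per_token
print(f"~{int(max_tokens)} cached tokens -> ~{int(max_tokens // 4096)} concurrent 4k-token sequences")
```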
Speculative Decoding
Speculative Decoding uses a small draft model to propose several tokens ahead, which the target model then verifies in a single forward pass, improving inference speed.
- Speed Improvement: 1.5x - 2.5x throughput increase (varies by workload)
- Quality Maintained: Output quality is identical (guaranteed by verification process)
- Additional Memory: Extra GPU memory needed for the draft model
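The sketch below shows one way to enable speculative decoding with a draft model in vLLM. The speculative_config argument and its keys follow recent vLLM releases and may differ in the version this guide targets; the draft model is a placeholder chosen only because it shares a tokenizer with the target.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",        # target model
    tensor_parallel_size=2,
    speculative_config={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",   # small draft model (needs extra GPU memory)
        "num_speculative_tokens": 5,                     # draft tokens proposed per step
    },
)
```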
Batch Processing Optimization
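The main lever here is the continuous batching scheduler mentioned above. The sketch below shows the two knobs that most affect the throughput/latency trade-off; argument names follow recent vLLM releases, and the defaults are usually a reasonable starting point.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    max_model_len=4096,              # cap context length (also bounds KV cache per sequence)
    max_num_seqs=128,                # max concurrent sequences per scheduling step
    max_num_batched_tokens=8192,     # token budget per step (higher favors throughput over latency)
)
```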
Monitoring Metrics
Key Monitoring Metrics
Key alert criteria:
| Metric | Threshold | Severity | Description |
|---|---|---|---|
| P95 Response Latency | > 30s | Warning | MoE model response delay |
| KV Cache Utilization | > 95% | Critical | May reject new requests |
| Waiting Request Count | > 100 | Warning | Scale-out needed |
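A minimal sketch of checking these thresholds against vLLM's Prometheus /metrics endpoint is shown below. The metric names follow recent vLLM releases and the service URL is a placeholder; in production these checks belong in Prometheus alerting rules rather than ad-hoc scripts.

```python
import requests

def metric_value(text: str, name: str) -> float:
    """Return the first sample of a metric from Prometheus text exposition format."""
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return 0.0

body = requests.get("http://vllm-service:8000/metrics", timeout=5).text
kv_usage = metric_value(body, "vllm:gpu_cache_usage_perc")   # 0.0 - 1.0
waiting = metric_value(body, "vllm:num_requests_waiting")

if kv_usage > 0.95:
    print("CRITICAL: KV cache utilization above 95%, new requests may be rejected")
if waiting > 100:
    print("WARNING: request queue above 100, consider scaling out")
```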
Summary
Key Points
- Architecture Understanding: Grasp the operating principles of Expert networks and routing mechanisms
- Memory Planning: Secure sufficient GPU memory as all Experts must be loaded
- Distributed Deployment: Appropriately combine Tensor Parallelism and Expert Parallelism
- Inference Engine Selection: vLLM recommended (latest optimization techniques and active updates)
- Performance Optimization: Apply KV Cache, Speculative Decoding, and batch processing optimization
Next Steps
- GPU Resource Management - GPU cluster dynamic resource allocation
- Inference Gateway Routing - Multi-model routing strategies
- Agentic AI Platform Architecture - Overall platform structure