MoE Model Serving Concept Guide

Current version: vLLM v0.18+ / v0.19.x (as of April 2026)

Overview

Mixture of Experts (MoE) is an architecture for improving the efficiency of large language models. By activating only a subset of Experts out of the total parameters for each token, MoE models achieve quality comparable to Dense models with substantially less computation.

This document covers the core concepts of MoE architecture, per-model resource requirements, and distributed deployment strategies.

Production Deployment Guide

For practical deployment including EKS deployment YAML, helm commands, and multi-node configuration for MoE models, refer to the Custom Model Deployment Guide.


Understanding MoE Architecture

Expert Network Structure

MoE models consist of multiple "Expert" networks and a "Router (Gate)" network that selects them.

Routing Mechanisms

The core of MoE models is the routing mechanism that selects appropriate Experts based on input tokens.

MoE Routing Mechanisms
| Mechanism | Description | Representative Model |
|---|---|---|
| 🎯 Top-K Routing | Each token activates only its top K experts | Mixtral (K=2), Switch Transformer (K=1) |
| 🔄 Expert Choice | Each expert selects the tokens it processes | Expert Choice MoE (Google) |
| ⚖️ Soft MoE | Distributes soft weights across all experts | Soft MoE |
| #️⃣ Hash Routing | Hash-based deterministic routing | Hash Layers |
Routing Operation Principles
  1. Gate Computation: Pass the input token's hidden state through the Gate network
  2. Expert Selection: Select Top-K Experts from Softmax output
  3. Parallel Processing: Selected Experts process the input in parallel
  4. Weighted Summation: Combine Expert outputs with Gate weights
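
The following is a minimal PyTorch sketch of the four routing steps above. The tensor shapes, the hidden_dim/num_experts/top_k values, and the use of a single Linear layer per expert are illustrative assumptions, not vLLM internals (real MoE layers use full expert MLPs and fused kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not tied to any specific model)
hidden_dim, num_experts, top_k = 4096, 8, 2

gate = nn.Linear(hidden_dim, num_experts, bias=False)   # Router (Gate) network
experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    # 1. Gate computation: hidden states -> per-expert logits
    logits = gate(x)                                         # [num_tokens, num_experts]
    # 2. Expert selection: softmax, then keep the top-K experts per token
    probs = F.softmax(logits, dim=-1)
    weights, expert_ids = torch.topk(probs, top_k, dim=-1)   # [num_tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize gate weights
    # 3 + 4. Run the selected experts and combine their outputs with the gate weights
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(num_experts):
            mask = expert_ids[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k:k+1] * experts[e](x[mask])
    return out

y = moe_forward(torch.randn(16, hidden_dim))   # 16 example tokens
```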

MoE vs Dense Model Comparison

| Characteristic | Dense Model | MoE Model |
|---|---|---|
| Parameter Activation | 100% (all) | 10-25% (a subset of experts) |
| 🔢 Inference Computation | High | Relatively low |
| 💾 Memory Requirements | Proportional to parameter count | All parameters must still be loaded |
| 📚 Learning Efficiency | Standard | Learns efficiently from more data |
| 📈 Scalability | Linear growth | Scales efficiently by adding experts |

Advantages of MoE Models
  • Computational Efficiency: Faster inference by activating only a portion of total parameters
  • Scalability: Model capacity expandable by adding Experts
  • Specialization: Each Expert can specialize in specific domains/tasks

GPU Memory Requirements

MoE models activate fewer parameters but must load all Experts into memory.

MoE Model GPU Memory Requirements
| Model | Total Parameters | Active Parameters | FP16 Memory | INT8 Memory | Recommended GPU |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | ~94GB | ~47GB | 2x A100 80GB |
| Mixtral 8x22B | 141B | 39B | ~282GB | ~141GB | 4x H100 80GB |
| DeepSeek-V3 | 671B | 37B | ~800GB* | ~400GB* | 8x H100 80GB |
| DeepSeek-MoE 16B | 16.4B | 2.8B | ~33GB | ~17GB | 1x A100 40GB |
| Qwen2.5-MoE-A14B | ~50B | 14B | ~100GB | ~50GB | 2x A100 80GB |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | ~29GB | ~15GB | 1x A100 40GB |
| DBRX | 132B | 36B | ~264GB | ~132GB | 4x H100 80GB |
| GLM-5 | 744B | 40B | ~1.5TB | ~744GB (FP8) | 2x p5.48xlarge (PP=2) |
| Kimi K2.5 | ~1T | 32B | ~2TB | ~500GB (INT4) | 1x p5.48xlarge (INT4) |
Latest MoE Model Memory Optimization

DeepSeek-V3: Uses Multi-head Latent Attention (MLA) architecture to significantly reduce KV cache memory. Achieves approximately 40% memory savings compared to traditional MHA, so actual memory requirements may be lower than listed values.

GLM-5 (released February 2026): 744B total parameters / 40B active, 8 of 256 experts activated. SWE-bench Verified 77.8%, Agentic Coding #1 (55.00), MIT license. FP8 quantized version requires approximately 744GB VRAM (2x p5.48xlarge, PP=2). HuggingFace: zai-org/GLM-5-FP8

Kimi K2.5 (released January 2026): approximately 1T total parameters / 32B active, Modified DeepSeek V3 MoE architecture. SWE-bench Verified 76.8%, HumanEval 99%, Agent Swarm support. INT4 quantized version requires approximately 500GB VRAM (1x p5.48xlarge, TP=8). HuggingFace: moonshotai/Kimi-K2.5

Exact memory requirements vary with batch size and sequence length, so profiling is recommended.

Memory Calculation Considerations
  • KV Cache: Additional memory needed based on batch size and sequence length
  • Activation Memory: Storage space for intermediate activation values during inference
  • CUDA Context: Approximately 1-2GB CUDA overhead per GPU
  • Safety Margin: Recommended 10-20% headroom in production
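
As a rough planning aid, the sketch below combines the factors listed above into a single estimate. The bytes-per-parameter values, the flat KV-cache budget, and the overhead figures are simplifying assumptions; actual usage depends on batch size and sequence length and should be confirmed by profiling.

```python
def estimate_serving_memory_gb(
    total_params_b: float,      # total parameters in billions (all experts must be resident)
    bytes_per_param: float,     # 2.0 for FP16/BF16, 1.0 for INT8/FP8, 0.5 for INT4
    kv_cache_gb: float,         # depends on batch size, sequence length, and attention layout
    num_gpus: int,
    cuda_overhead_gb: float = 1.5,   # per-GPU CUDA context (assumed 1-2GB)
    headroom: float = 0.15,          # 10-20% production safety margin
) -> float:
    weights_gb = total_params_b * bytes_per_param   # e.g. 46.7B * 2 bytes ≈ 93.4GB
    overhead_gb = cuda_overhead_gb * num_gpus
    return (weights_gb + kv_cache_gb + overhead_gb) * (1 + headroom)

# Example: Mixtral 8x7B in FP16 on 2 GPUs with ~10GB budgeted for KV cache
print(f"{estimate_serving_memory_gb(46.7, 2.0, 10, 2):.0f} GB")   # ≈ 122 GB total
```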

Distributed Deployment Strategies

Large MoE models cannot be loaded on a single GPU, making distributed deployment essential.

MoE Model Parallelization Strategies
| Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| 🔷 Tensor Parallelism (TP) | Split tensors within layers across GPUs | Low latency | High communication overhead |
| 🎯 Expert Parallelism (EP) | Distribute experts across GPUs | Optimized for MoE | Requires all-to-all communication |
| 📊 Pipeline Parallelism (PP) | Split layers sequentially across GPUs | Memory efficient | Pipeline bubble overhead |

Tensor Parallelism Configuration

Tensor Parallelism distributes each model layer across multiple GPUs.

vLLM Tensor Parallelism Recommended Configuration
| Model | Recommended TP Size | GPU Configuration | Memory/GPU |
|---|---|---|---|
| Mixtral 8x7B | 2 | 2x A100 80GB | ~47GB |
| Mixtral 8x22B | 4 | 4x H100 80GB | ~70GB |
| DeepSeek-MoE 16B | 1 | 1x A100 40GB | ~33GB |
| DBRX | 4-8 | 4-8x H100 80GB | ~33-66GB |

Tensor Parallelism Optimization
  • NVLink Utilization: Use NVLink-supported instances for high-speed inter-GPU communication
  • TP Size Selection: Choose minimum TP size based on model size and GPU memory
  • Communication Overhead: Larger TP size increases All-Reduce communication
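
For illustration, a minimal vLLM offline-inference sketch with tensor parallelism. The model name and parameter values follow the recommendation table above, but defaults can differ between vLLM releases, so treat this as a starting point to validate against your installed version.

```python
from vllm import LLM, SamplingParams

# Mixtral 8x7B split across 2 GPUs via tensor parallelism (TP=2)
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,          # TP size from the recommendation table
    gpu_memory_utilization=0.90,     # leave headroom for CUDA context and fragmentation
    dtype="float16",
)

outputs = llm.generate(
    ["Explain what a Mixture of Experts model is."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```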

Expert Parallelism

Expert Parallelism distributes MoE model Experts across multiple GPUs. In vLLM v0.19.x, Experts are automatically distributed within TP.
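
To make the idea concrete, here is a toy sketch (an illustration under simple assumptions, not vLLM's implementation) of how expert parallelism assigns experts to GPU ranks and why all-to-all communication is needed: each rank owns a subset of experts, so tokens routed to an expert on another rank must be exchanged before and after the expert computation.

```python
# Toy expert-parallel placement: illustrative only, not vLLM internals.
num_experts, ep_size = 8, 4   # 8 experts sharded over 4 GPU ranks

def owner_rank(expert_id: int) -> int:
    # Contiguous sharding: experts 0-1 -> rank 0, 2-3 -> rank 1, ...
    return expert_id // (num_experts // ep_size)

# Routing decisions for a few tokens (token_id -> selected expert)
routed = {0: 5, 1: 0, 2: 6, 3: 3}

# Group tokens by destination rank: this grouping is what the
# all-to-all exchange ships between GPUs before expert execution.
dispatch: dict[int, list[int]] = {}
for token, expert in routed.items():
    dispatch.setdefault(owner_rank(expert), []).append(token)

print(dispatch)   # e.g. {2: [0], 0: [1], 3: [2], 1: [3]}
```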

Expert Activation Patterns

Understanding Expert activation patterns is important for MoE model performance optimization.

Expert Load Balancing
  • Auxiliary Loss: Auxiliary loss during training to encourage even distribution across Experts
  • Capacity Factor: Maximum token limit per Expert
  • Token Dropping: Drop tokens on capacity overflow (recommended to disable during inference)
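
The auxiliary loss mentioned above can be sketched in the Switch-Transformer style: the product of each expert's token fraction and its mean router probability, scaled by the expert count. This is a generic training-time sketch under those assumptions, not code from any particular serving stack.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_ids: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss encouraging even expert usage.

    router_logits: [num_tokens, num_experts] gate outputs
    expert_ids:    [num_tokens] int64 index of the expert each token was routed to
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i
    token_fraction = F.one_hot(expert_ids, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    prob_fraction = probs.mean(dim=0)
    return num_experts * torch.sum(token_fraction * prob_fraction)
```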

700B+ MoE Model Multi-node Deployment Concepts

700B+ MoE models like GLM-5 and Kimi K2.5 cannot be loaded on a single node, making multi-node deployment essential. vLLM v0.18+ supports multi-node deployment based on LeaderWorkerSet (LWS).

| Model | Total Parameters | Active Parameters | Recommended Config | VRAM Requirement |
|---|---|---|---|---|
| GLM-5 FP8 | 744B | 40B | 2x p5.48xlarge, PP=2, TP=8 | ~744GB |
| Kimi K2.5 INT4 | ~1T | 32B | 1x p5.48xlarge, TP=8 | ~500GB |
| DeepSeek-V3 | 671B | 37B | 2x p5.48xlarge, PP=2, TP=8 | ~671GB |
| Mixtral 8x22B | 141B | 39B | 1x p5.48xlarge, TP=4 | ~282GB |
| Mixtral 8x7B | 47B | 13B | 1x p4d.24xlarge, TP=2 | ~94GB |
700B+ MoE Model Deployment Recommendations
  • Use LeaderWorkerSet: Kubernetes-native multi-node deployment without Ray dependency
  • Pipeline Parallelism: PP=2 or more to partition layers across nodes
  • FP8 Quantization: Memory savings (GLM-5 FP8 version recommended)
  • Network Optimization: NCCL configuration for inter-node communication optimization (EFA recommended)
  • INT4/AWQ Quantization: Consider when single-node deployment is possible (Kimi K2.5)
Multi-node Deployment Cautions
  • Network Bandwidth: Overhead from inter-node All-Reduce communication (EFA recommended)
  • Loading Time: 700B+ models may take 20-30 minutes for initial loading
  • Memory Headroom: 10-15% safety margin required
  • LeaderWorkerSet CRD: LWS Operator must be installed on the cluster
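
As a sanity check for the configurations above, a short sketch of how per-GPU weight memory falls out of a PP x TP layout. The VRAM figures are the table's approximate values, not measured numbers, and KV cache and overhead come on top.

```python
def per_gpu_weight_gb(total_weight_gb: float, pp: int, tp: int) -> float:
    # Weights are split first across pipeline stages, then across the TP group.
    return total_weight_gb / (pp * tp)

# GLM-5 FP8 (~744GB of weights) on 2x p5.48xlarge: PP=2, TP=8 -> 16 GPUs
print(f"{per_gpu_weight_gb(744, pp=2, tp=8):.1f} GB/GPU")   # ~46.5 GB/GPU, within H100 80GB

# Kimi K2.5 INT4 (~500GB) on a single p5.48xlarge: TP=8 -> 8 GPUs
print(f"{per_gpu_weight_gb(500, pp=1, tp=8):.1f} GB/GPU")   # ~62.5 GB/GPU before KV cache
```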

vLLM-Based MoE Serving Features

vLLM v0.18+ provides the following optimizations for MoE models:

  • Expert Parallelism: Expert distribution across multiple GPUs
  • Tensor Parallelism: Intra-layer tensor splitting
  • PagedAttention: Efficient KV Cache management
  • Continuous Batching: Dynamic batch processing
  • FP8 KV Cache: 2x memory savings
  • Improved Prefix Caching: 400%+ throughput improvement
  • Multi-LoRA Serving: Simultaneous serving of multiple LoRA adapters on a single base model
  • GGUF Quantization: GGUF format quantized model support
TGI Maintenance Mode

Text Generation Inference (TGI) entered maintenance mode in 2025. Use vLLM for new deployments. When migrating from existing TGI, vLLM provides an OpenAI-compatible API, minimizing client code changes.
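
For reference, a minimal client call against a vLLM OpenAI-compatible endpoint; the base URL, model name, and API key are placeholder assumptions for a locally started vllm serve instance.

```python
from openai import OpenAI

# Point the standard OpenAI client at a vLLM server started with, for example:
#   vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Summarize what an MoE router does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```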

vLLM vs TGI Performance Comparison

| Characteristic | vLLM (Recommended) | TGI (Legacy Reference) |
|---|---|---|
| Throughput (tokens/s) | High | Medium-High |
| ⏱️ Latency (TTFT) | Low | Medium |
| 💾 Memory Efficiency | Very High (PagedAttention) | High |
| 🎯 MoE Optimization | Excellent | Good |
| 🔢 Quantization Support | AWQ, GPTQ, SqueezeLLM | AWQ, GPTQ, EETQ |
| 🔌 API Compatibility | OpenAI compatible | Custom API + OpenAI compatible |
| 👥 Community | Active | Active |

AWS Trainium2-Based MoE Inference

AWS Trainium2 / Inferentia2 provide a cost-efficient alternative to GPUs for large-scale MoE models (DBRX, Mixtral 8x22B, Llama 4 MoE, etc.), with lower per-token costs. The Neuron stack maps Expert Parallelism and Tensor Parallelism to NeuronCore units and serves via NxD Inference or vLLM Neuron backend.

Summary

| Item | Overview |
|---|---|
| Hardware | trn2.48xlarge (16 Trainium2 chips / 128 NeuronCores / 1.5TB HBM), inf2 series |
| SDK | AWS Neuron SDK 2.x, torch-neuronx, neuronx-cc |
| Inference Framework | NxD Inference (AWS official), vLLM Neuron backend, TGI Neuron fork |
| Quantization | BF16/FP16/FP8 (E4M3/E5M2); partial AWQ/GPTQ; GGUF not supported |
| Suitable MoE Models | DBRX 132B, Mixtral 8x7B/8x22B, Llama 4 MoE (within NxD support scope) |

GPU vs Trainium2 Cost Comparison

AWS Trainium2 Instance Types
| Instance | Trainium2 Chips | Accelerator Memory | Network Bandwidth | Example Models |
|---|---|---|---|---|
| trn2.48xlarge | 16 | 512GB | 800 Gbps | Mixtral 8x7B, Llama 3.1 70B |
| trn2.48xlarge (UltraServer) | 32 | 1TB | 1600 Gbps | Mixtral 8x22B, Llama 3.1 405B |

GPU vs Trainium2 Cost Comparison
| Configuration | Instance | Hourly Cost | Monthly Cost (730h) | Relative Cost |
|---|---|---|---|---|
| 🎮 GPU (NVIDIA) | p5.48xlarge (8x H100) | $98.32 | $71,774 | 100% |
| 🎮 GPU (NVIDIA) | p4d.24xlarge (8x A100) | $32.77 | $23,922 | 33% |
| Trainium2 | trn2.48xlarge (16 chips) | $21.50 | $15,695 | 22% |

💡 Cost Savings: Trainium2 costs approximately 78% less than H100 GPUs and 34% less than A100 GPUs.
Refer to Separate Document for Detailed Guide

For Neuron SDK architecture, instance lineup, Device Plugin deployment, Karpenter NodePool, inference framework comparison (NxD / vLLM Neuron / TGI Neuron), supported model matrix, observability, limitations and considerations, refer to the dedicated document below.

AWS Neuron Stack — Trainium2/Inferentia2 on EKS

For NVIDIA vs Neuron decision-making at the node selection stage, refer to EKS GPU Node Strategy.


Performance Optimization Concepts

KV Cache Optimization

KV Cache is a key factor significantly impacting inference performance.

vLLM KV Cache Configuration Parameters
| Parameter | Description | Recommended |
|---|---|---|
| 💾 --gpu-memory-utilization | GPU memory usage ratio | 0.85-0.92 |
| 📏 --max-model-len | Maximum context length | Within the model's supported range |
| 🔢 --max-num-batched-tokens | Maximum tokens per batch | Adjust based on available memory |
| --enable-chunked-prefill | Enable chunked prefill | Recommended |
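
Here is a hedged sketch of the same parameters applied through vLLM's offline Python API. The CLI flags above map to these constructor arguments in the versions I am aware of, but names and defaults can shift between releases, and the model and values are example assumptions.

```python
from vllm import LLM

# Assumed example model; substitute your own deployment target.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,     # within the recommended 0.85-0.92 range
    max_model_len=16384,             # stay within the model's supported context
    max_num_batched_tokens=8192,     # tune to available KV cache memory
    enable_chunked_prefill=True,     # let long prefills interleave with decode
)
```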

Speculative Decoding

Speculative Decoding uses a small draft model to improve inference speed.

Speculative Decoding Effect
  • Speed Improvement: 1.5x - 2.5x throughput increase (varies by workload)
  • Quality Maintained: Output quality is identical (guaranteed by verification process)
  • Additional Memory: Extra GPU memory needed for the draft model
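
As an illustrative configuration sketch: vLLM exposes speculative decoding through engine arguments, but the exact argument names have changed across releases, so the draft model choice and parameter keys below are assumptions to check against your version's documentation. The draft model must share the target model's tokenizer and should be much smaller than the target.

```python
from vllm import LLM

# Draft-model speculative decoding sketch; argument names vary by vLLM release.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",        # target (verifier) model
    tensor_parallel_size=2,
    speculative_config={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",    # assumed tokenizer-compatible draft
        "num_speculative_tokens": 5,                      # draft tokens proposed per step
    },
)
```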

Batch Processing Optimization

Batch Processing Optimization Techniques
| Technique | Description | Effect |
|---|---|---|
| 🔄 Continuous Batching | Dynamically add/remove requests in the running batch | 2-3x throughput improvement |
| Chunked Prefill | Split prefill into chunks so decode can run concurrently | Reduced latency |
| 🎯 Dynamic SplitFuse | Dynamically split and fuse prefill and decode work | Improved GPU utilization |

Monitoring Metrics

Key Monitoring Metrics

MoE Model Key Monitoring Metrics
| Metric | Description | Alert Threshold | Severity |
|---|---|---|---|
| vllm:num_requests_running | Requests currently being processed | - | Info |
| vllm:num_requests_waiting | Requests waiting in the queue | > 100 | Warning |
| vllm:gpu_cache_usage_perc | KV Cache utilization | > 95% | Critical |
| vllm:avg_prompt_throughput_toks_per_s | Prompt processing throughput | - | Info |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput | - | Info |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization | > 90% | Warning |
| DCGM_FI_DEV_FB_USED | GPU memory usage | > 95% | Critical |

Key alert criteria:

| Metric | Threshold | Severity | Description |
|---|---|---|---|
| P95 Response Latency | > 30s | Warning | MoE model response delay |
| KV Cache Utilization | > 95% | Critical | May reject new requests |
| Waiting Request Count | > 100 | Warning | Scale-out needed |
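
For ad-hoc checks, a small sketch that pulls these metrics directly from vLLM's Prometheus /metrics endpoint. The host/port and the threshold values mirror the assumptions in the tables above; in production these checks would normally live in Prometheus alerting rules instead.

```python
import requests

METRICS_URL = "http://localhost:8000/metrics"   # vLLM OpenAI server exposes Prometheus metrics here
WATCH = {
    "vllm:num_requests_waiting": 100,    # Warning above this
    "vllm:gpu_cache_usage_perc": 0.95,   # Critical above this (reported as a 0-1 ratio)
}

def check_metrics() -> None:
    body = requests.get(METRICS_URL, timeout=5).text
    for line in body.splitlines():
        if line.startswith("#"):
            continue                     # skip HELP/TYPE comment lines
        for name, threshold in WATCH.items():
            if line.startswith(name):
                value = float(line.rsplit(" ", 1)[-1])
                status = "ALERT" if value > threshold else "ok"
                print(f"{status}: {name} = {value}")

check_metrics()
```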

Summary

Key Points

  1. Architecture Understanding: Grasp the operating principles of Expert networks and routing mechanisms
  2. Memory Planning: Secure sufficient GPU memory as all Experts must be loaded
  3. Distributed Deployment: Appropriately combine Tensor Parallelism and Expert Parallelism
  4. Inference Engine Selection: vLLM recommended (latest optimization techniques and active updates)
  5. Performance Optimization: Apply KV Cache, Speculative Decoding, and batch processing optimization
