MoE Model Serving Concept Guide
Current version: vLLM v0.18+ / v0.19.x (as of April 2026)
Overview
Mixture of Experts (MoE) is an architecture that improves the efficiency of large language models. By activating only a subset of Experts out of the total parameters for each token, MoE models achieve quality comparable to Dense models with less computation.
This document covers the core concepts of MoE architecture, per-model resource requirements, and distributed deployment strategies.
For practical deployment including EKS deployment YAML, helm commands, and multi-node configuration for MoE models, refer to the Custom Model Deployment Guide.
Understanding MoE Architecture
Expert Network Structure
MoE models consist of multiple "Expert" networks and a "Router (Gate)" network that selects them.
Routing Mechanisms
The core of MoE models is the routing mechanism that selects appropriate Experts based on input tokens.
- Gate Computation: Pass the input token's hidden state through the Gate network
- Expert Selection: Select Top-K Experts from Softmax output
- Parallel Processing: Selected Experts process the input in parallel
- Weighted Summation: Combine Expert outputs with Gate weights
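The four steps above can be expressed in a few lines of PyTorch. This is a minimal, illustrative sketch with made-up dimensions and a per-token Python loop; production MoE layers batch tokens per Expert and fuse these steps into optimized kernels.

```python
import torch
import torch.nn as nn

hidden_dim, num_experts, top_k = 64, 8, 2      # illustrative sizes

gate = nn.Linear(hidden_dim, num_experts, bias=False)                     # Router (Gate) network
experts = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts))

x = torch.randn(4, hidden_dim)                                            # hidden states of 4 tokens

probs = gate(x).softmax(dim=-1)                                           # 1. Gate computation
weights, idx = torch.topk(probs, top_k, dim=-1)                           # 2. Top-K Expert selection
weights = weights / weights.sum(dim=-1, keepdim=True)                     #    renormalize gate weights

out = torch.zeros_like(x)
for t in range(x.size(0)):                                                # 3. selected Experts process the token
    for j in range(top_k):                                                # 4. weighted sum of Expert outputs
        out[t] += weights[t, j] * experts[int(idx[t, j])](x[t])
```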
MoE vs Dense Model Comparison
- Computational Efficiency: Faster inference by activating only a portion of total parameters
- Scalability: Model capacity expandable by adding Experts
- Specialization: Each Expert can specialize in specific domains/tasks
GPU Memory Requirements
MoE models activate only a fraction of their parameters per token, but every Expert must still be loaded into GPU memory.
| Model | Total Parameters | Active Parameters | FP16 Memory | INT8 Memory | Recommended GPU |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | ~94GB | ~47GB | 2x A100 80GB |
| Mixtral 8x22B | 141B | 39B | ~282GB | ~141GB | 4x H100 80GB |
| DeepSeek-V3 | 671B | 37B | ~800GB* | ~400GB* | 8x H100 80GB |
| DeepSeek-MoE 16B | 16.4B | 2.8B | ~33GB | ~17GB | 1x A100 40GB |
| Qwen2.5-MoE-A14B | ~50B | 14B | ~100GB | ~50GB | 2x A100 80GB |
| Qwen1.5-MoE-A2.7B | 14.3B | 2.7B | ~29GB | ~15GB | 1x A100 40GB |
| DBRX | 132B | 36B | ~264GB | ~132GB | 4x H100 80GB |
| GLM-5 | 744B | 40B | ~1.5TB | ~744GB | 2x p5.48xlarge (PP=2) |
| Kimi K2.5 | ~1T | 32B | ~2TB | ~500GB (INT4) | 1x p5.48xlarge (INT4) |
DeepSeek-V3: Uses Multi-head Latent Attention (MLA) architecture to significantly reduce KV cache memory. Achieves approximately 40% memory savings compared to traditional MHA, so actual memory requirements may be lower than listed values.
GLM-5 (released February 2026): 744B total parameters / 40B active, 8 of 256 experts activated. SWE-bench Verified 77.8%, Agentic Coding #1 (55.00), MIT license. FP8 quantized version requires approximately 744GB VRAM (2x p5.48xlarge, PP=2). HuggingFace: zai-org/GLM-5-FP8
Kimi K2.5 (released January 2026): approximately 1T total parameters / 32B active, Modified DeepSeek V3 MoE architecture. SWE-bench Verified 76.8%, HumanEval 99%, Agent Swarm support. INT4 quantized version requires approximately 500GB VRAM (1x p5.48xlarge, TP=8). HuggingFace: moonshotai/Kimi-K2.5
Exact memory requirements vary with batch size and sequence length, so profile under realistic workloads. Beyond the model weights, account for the following:
- KV Cache: Additional memory needed based on batch size and sequence length
- Activation Memory: Storage space for intermediate activation values during inference
- CUDA Context: Approximately 1-2GB CUDA overhead per GPU
- Safety Margin: Recommended 10-20% headroom in production
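The back-of-the-envelope sketch below combines these factors into a rough estimate. The Mixtral-like shapes (32 layers, 8 KV heads, head dim 128) and the 15% headroom are illustrative assumptions; profile the actual deployment for real numbers.

```python
def weight_mem_gb(total_params_b: float, bytes_per_param: float) -> float:
    """All Experts must be resident, so total (not active) parameters count."""
    return total_params_b * bytes_per_param          # 1B params at 1 byte/param ~= 1 GB

def kv_cache_gb(batch: int, seq_len: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """K and V per token: 2 * layers * kv_heads * head_dim * bytes."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token / 1024**3

# Example: Mixtral 8x7B-like shapes in FP16
weights = weight_mem_gb(46.7, 2)                                                # ~93 GB of weights
kv = kv_cache_gb(batch=32, seq_len=4096, layers=32, kv_heads=8, head_dim=128)   # ~16 GB
total = (weights + kv) * 1.15 + 2                                               # +15% headroom, ~2 GB CUDA context
print(f"weights ~{weights:.0f} GB, KV cache ~{kv:.0f} GB, plan for ~{total:.0f} GB")
```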
Distributed Deployment Strategies
Large MoE models cannot be loaded on a single GPU, making distributed deployment essential.
Tensor Parallelism Configuration
Tensor Parallelism distributes each model layer across multiple GPUs.
- NVLink Utilization: Use NVLink-supported instances for high-speed inter-GPU communication
- TP Size Selection: Choose minimum TP size based on model size and GPU memory
- Communication Overhead: Larger TP size increases All-Reduce communication
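As a concrete example, the sketch below loads Mixtral 8x7B with TP=2 through vLLM's offline LLM API; the OpenAI-compatible server exposes the same setting as --tensor-parallel-size. The model name and values are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,          # split each layer across 2 GPUs (e.g. 2x A100 80GB)
    dtype="float16",
    gpu_memory_utilization=0.90,     # leave headroom for CUDA context and fragmentation
)

outputs = llm.generate(
    ["Explain Mixture of Experts in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```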
Expert Parallelism
Expert Parallelism distributes MoE model Experts across multiple GPUs. In vLLM v0.19.x, Experts are automatically distributed within TP.
Expert Activation Patterns
Understanding Expert activation patterns is important for MoE model performance optimization.
- Auxiliary Loss: A load-balancing loss added during training to encourage even token distribution across Experts
- Capacity Factor: Upper bound on the number of tokens each Expert processes per batch (see the sketch after this list)
- Token Dropping: Tokens exceeding Expert capacity are dropped (recommended to disable during inference)
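The sketch below illustrates these knobs with Switch-Transformer-style formulas. The exact loss definition varies by model family, and the routing statistics here are random placeholders rather than values from a real router.

```python
import math
import torch

num_experts, capacity_factor, alpha = 8, 1.25, 0.01
tokens_in_batch = 4096

# Capacity Factor: max tokens any single Expert may process in this batch
expert_capacity = math.ceil(tokens_in_batch / num_experts * capacity_factor)

# Auxiliary Loss (Switch-Transformer style): alpha * N * sum(f_i * P_i), where
# f_i = fraction of tokens routed to Expert i, P_i = mean router probability for Expert i
f = torch.rand(num_experts); f = f / f.sum()     # placeholder routing fractions
p = torch.rand(num_experts); p = p / p.sum()     # placeholder router probabilities
aux_loss = alpha * num_experts * torch.sum(f * p)

print(expert_capacity, aux_loss.item())
```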
700B+ MoE Model Multi-node Deployment Concepts
700B+ MoE models like GLM-5 and Kimi K2.5 cannot be loaded on a single node, making multi-node deployment essential. vLLM v0.18+ supports multi-node deployment based on LeaderWorkerSet (LWS).
| Model | Total Parameters | Active Parameters | Recommended Config | VRAM Requirement |
|---|---|---|---|---|
| GLM-5 FP8 | 744B | 40B | 2x p5.48xlarge, PP=2, TP=8 | approximately 744GB |
| Kimi K2.5 INT4 | approximately 1T | 32B | 1x p5.48xlarge, TP=8 | approximately 500GB |
| DeepSeek-V3 FP8 | 671B | 37B | 2x p5.48xlarge, PP=2, TP=8 | approximately 671GB |
| Mixtral 8x22B | 141B | 39B | 1x p5.48xlarge, TP=4 | approximately 282GB |
| Mixtral 8x7B | 47B | 13B | 1x p4d.24xlarge, TP=2 | approximately 94GB |
- Use LeaderWorkerSet: Kubernetes-native multi-node deployment without Ray dependency
- Pipeline Parallelism: PP=2 or more to partition layers across nodes
- FP8 Quantization: Memory savings (GLM-5 FP8 version recommended)
- Network Optimization: NCCL configuration for inter-node communication optimization (EFA recommended)
- INT4/AWQ Quantization: Consider when single-node deployment is possible (Kimi K2.5)
- Network Bandwidth: Overhead from inter-node All-Reduce communication (EFA recommended)
- Loading Time: 700B+ models may take 20-30 minutes for initial loading
- Memory Headroom: 10-15% safety margin required
- LeaderWorkerSet CRD: LWS Operator must be installed on the cluster
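As a sanity check on the table above, the arithmetic below estimates per-GPU weight memory for GLM-5 FP8 on 2x p5.48xlarge (PP=2, TP=8). All figures are rough assumptions for illustration, not measured values.

```python
total_params_b  = 744          # GLM-5 total parameters (billions)
bytes_per_param = 1            # FP8 weights
gpus            = 2 * 8        # PP=2 nodes x TP=8 GPUs per node (p5.48xlarge: 8x H100 80GB)
hbm_per_gpu_gb  = 80

weights_gb      = total_params_b * bytes_per_param   # ~744 GB of weights in total
weights_per_gpu = weights_gb / gpus                  # ~46.5 GB per GPU
headroom        = hbm_per_gpu_gb - weights_per_gpu   # left for KV cache, activations,
                                                     # CUDA context, and safety margin
print(f"~{weights_per_gpu:.1f} GB weights per GPU, ~{headroom:.1f} GB headroom per GPU")
```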
vLLM-Based MoE Serving Features
vLLM v0.18+ provides the following optimizations for MoE models:
- Expert Parallelism: Expert distribution across multiple GPUs
- Tensor Parallelism: Intra-layer tensor splitting
- PagedAttention: Efficient KV Cache management
- Continuous Batching: Dynamic batch processing
- FP8 KV Cache: 2x memory savings
- Improved Prefix Caching: 400%+ throughput improvement on workloads with long shared prefixes
- Multi-LoRA Serving: Simultaneous serving of multiple LoRA adapters on a single base model
- GGUF Quantization: GGUF format quantized model support
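Several of these features map directly to engine arguments. The sketch below shows one possible combination; the argument names follow recent vLLM releases and should be verified against the version you deploy.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",            # FP8 KV cache (~2x KV memory savings vs FP16)
    enable_prefix_caching=True,      # reuse KV blocks for shared prompt prefixes
    enable_lora=True,                # Multi-LoRA serving on a single base model
    max_loras=4,
)
```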
Text Generation Inference (TGI) entered maintenance mode in 2025; use vLLM for new deployments. When migrating from an existing TGI deployment, vLLM's OpenAI-compatible API minimizes client code changes.
AWS Trainium2-Based MoE Inference
AWS Trainium2 / Inferentia2 offer a cost-efficient alternative to GPUs for large-scale MoE models (DBRX, Mixtral 8x22B, Llama 4 MoE, etc.), typically at a lower per-token cost. The Neuron stack maps Expert Parallelism and Tensor Parallelism onto NeuronCores and serves models through NxD Inference or the vLLM Neuron backend.
Summary
| Item | Overview |
|---|---|
| Hardware | trn2.48xlarge (16 Trainium2 chips / 128 NeuronCores / 1.5TB HBM), inf2 series |
| SDK | AWS Neuron SDK 2.x, torch-neuronx, neuronx-cc |
| Inference Framework | NxD Inference (AWS official), vLLM Neuron backend, TGI Neuron fork |
| Quantization | BF16/FP16/FP8 (E4M3/E5M2); partial AWQ/GPTQ support; GGUF not supported |
| Suitable MoE | DBRX 132B, Mixtral 8x7B/8x22B, Llama 4 MoE (within NxD support scope) |
For Neuron SDK architecture, instance lineup, Device Plugin deployment, Karpenter NodePool, inference framework comparison (NxD / vLLM Neuron / TGI Neuron), supported model matrix, observability, limitations and considerations, refer to the dedicated document below.
→ AWS Neuron Stack — Trainium2/Inferentia2 on EKS
For NVIDIA vs Neuron decision-making at the node selection stage, refer to EKS GPU Node Strategy.
Performance Optimization Concepts
KV Cache Optimization
KV Cache is a key factor significantly impacting inference performance.
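As a rough sizing exercise, the sketch below estimates how many concurrent sequences fit in the KV cache after weights are loaded, using assumed Mixtral-like shapes; vLLM logs the authoritative number of available KV cache blocks at startup.

```python
# Mixtral-8x7B-like shapes with an FP16 KV cache (illustrative assumptions)
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V, ~128 KiB/token

# 2x A100 80GB at 90% utilization, minus ~94 GB of FP16 weights
kv_budget_gb = 2 * 80 * 0.90 - 94
max_tokens = kv_budget_gb * 1024**3 / kv_per_token
print(f"~{int(max_tokens)} cached tokens -> ~{int(max_tokens // 4096)} concurrent 4k-token sequences")
```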
Speculative Decoding
Speculative Decoding uses a small draft model to propose several tokens ahead, which the target model then verifies in a single forward pass, improving inference speed.
- Speed Improvement: 1.5x - 2.5x throughput increase (varies by workload)
- Quality Maintained: Output quality is identical (guaranteed by verification process)
- Additional Memory: Extra GPU memory needed for the draft model
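The sketch below shows one way to enable speculative decoding with a draft model in vLLM. The speculative_config argument and its keys follow recent vLLM releases and may differ in the version this guide targets; the draft model is a placeholder chosen only because it shares a tokenizer with the target.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",        # target model
    tensor_parallel_size=2,
    speculative_config={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",   # small draft model (needs extra GPU memory)
        "num_speculative_tokens": 5,                     # draft tokens proposed per step
    },
)
```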
Batch Processing Optimization
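The main lever here is the continuous batching scheduler mentioned above. The sketch below shows the two knobs that most affect the throughput/latency trade-off; argument names follow recent vLLM releases, and the defaults are usually a reasonable starting point.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    max_model_len=4096,              # cap context length (also bounds KV cache per sequence)
    max_num_seqs=128,                # max concurrent sequences per scheduling step
    max_num_batched_tokens=8192,     # token budget per step (higher favors throughput over latency)
)
```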
Monitoring Metrics
Key Monitoring Metrics
Key alert criteria:
| Metric | Threshold | Severity | Description |
|---|---|---|---|
| P95 Response Latency | > 30s | Warning | MoE model response delay |
| KV Cache Utilization | > 95% | Critical | May reject new requests |
| Waiting Request Count | > 100 | Warning | Scale-out needed |
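A minimal sketch of checking these thresholds against vLLM's Prometheus /metrics endpoint is shown below. The metric names follow recent vLLM releases and the service URL is a placeholder; in production these checks belong in Prometheus alerting rules rather than ad-hoc scripts.

```python
import requests

def metric_value(text: str, name: str) -> float:
    """Return the first sample of a metric from Prometheus text exposition format."""
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return 0.0

body = requests.get("http://vllm-service:8000/metrics", timeout=5).text
kv_usage = metric_value(body, "vllm:gpu_cache_usage_perc")   # 0.0 - 1.0
waiting = metric_value(body, "vllm:num_requests_waiting")

if kv_usage > 0.95:
    print("CRITICAL: KV cache utilization above 95%, new requests may be rejected")
if waiting > 100:
    print("WARNING: request queue above 100, consider scaling out")
```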
Summary
Key Points
- Architecture Understanding: Grasp the operating principles of Expert networks and routing mechanisms
- Memory Planning: Secure sufficient GPU memory as all Experts must be loaded
- Distributed Deployment: Appropriately combine Tensor Parallelism and Expert Parallelism
- Inference Engine Selection: vLLM recommended (latest optimization techniques and active updates)
- Performance Optimization: Apply KV Cache, Speculative Decoding, and batch processing optimization
Next Steps
- GPU Resource Management - GPU cluster dynamic resource allocation
- Inference Gateway Routing - Multi-model routing strategies
- Agentic AI Platform Architecture - Overall platform structure