Custom Model Deployment Guide
This document is a hands-on guide to deploying large open-source models with vLLM on EKS. It uses the GLM-5.1 744B MoE FP8 model as a working example, but the same patterns apply to other large models such as DeepSeek-V3, Mixtral, and Qwen-MoE.
This document focuses less on "here's how to do it" and more on "here's what we ran into, and how we solved it." It helps you anticipate and address issues you may encounter during actual production deployments.
1. Model Selection Criteria
Evaluate the following criteria when choosing a model to deploy.
| Criterion | What to Check | Notes |
|---|---|---|
| License | Commercial use permitted (MIT, Apache 2.0, etc.) | Some models have non-commercial licenses |
| Model Size (VRAM) | Required VRAM at FP8/FP16 | Directly affects GPU instance selection |
| vLLM Compatibility | Official vLLM support, transformers version | Custom image needed if unsupported |
| Benchmark Performance | Target task scores (coding, reasoning, conversation, etc.) | SWE-bench, HumanEval, etc. |
| Context Length | Maximum supported token count | 200K+ recommended for agentic workloads |
| MoE Architecture | Total vs. active parameters | MoE activates only a subset of parameters per token, which lowers compute per token; VRAM must still hold all parameters |
Example: Key Features of GLM-5.1
- GLM-5.1 vs. GLM-5: same weights as GLM-5, with only additional post-training RL for coding tasks
- Architecture: 744B MoE (40B active), with 8 of 256 experts activated per token
- HuggingFace: zai-org/GLM-5-FP8
- License: MIT License
- Context: 200K tokens supported
- Performance:
  - #1 open-source model on Agentic Coding benchmarks (55.00 points)
  - SWE-bench 77.8% (vs. GPT-4.1 62.3%)
GLM-5.1 is MIT-licensed, which permits commercial use, and it outperforms OpenAI GPT-4o on Agentic Coding tasks. Its SWE-bench score of 77.8% indicates particular strength in code generation and bug fixing.
With an LLM Classifier, clients send requests to a single endpoint (/v1), and the system automatically selects an SLM or LLM based on prompt content. Simple requests are routed to Qwen3-4B (L4 $0.3/hr), while complex requests (refactoring, architecture, design, etc.) are routed to GLM-5 744B (H200 $12/hr). See Inference Gateway Setup: LLM Classifier for configuration details.
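The routing decision described above can be illustrated with a minimal sketch. This is not the actual LLM Classifier implementation: the keyword pattern, the `classify` and `route` function names, and the `ROUTES` table are hypothetical stand-ins, and only the model names, GPU types, and hourly prices come from the paragraph above.

```python
import re

# Routing targets taken from the paragraph above; the dict layout is an assumption.
ROUTES = {
    "slm": {"model": "Qwen3-4B", "gpu": "L4", "usd_per_hour": 0.3},
    "llm": {"model": "GLM-5 744B", "gpu": "H200", "usd_per_hour": 12.0},
}

# Hypothetical keyword list standing in for a real prompt classifier.
COMPLEX_PATTERN = re.compile(r"\b(refactor\w*|architecture|design)\b", re.IGNORECASE)


def classify(prompt: str) -> str:
    """Return "llm" for prompts that look complex, otherwise "slm"."""
    return "llm" if COMPLEX_PATTERN.search(prompt) else "slm"


def route(prompt: str) -> dict:
    """Pick the backend entry that a single /v1 endpoint would forward to."""
    return ROUTES[classify(prompt)]


if __name__ == "__main__":
    for text in ("What does HTTP 404 mean?", "Refactor this module into services"):
        print(text, "->", route(text)["model"])
```

A production classifier would likely use a model or richer heuristics rather than keywords; the point here is only that the client sees one endpoint while the backend choice is made per request.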
Model Specs (GLM-5.1 Example)
| Item | Details |
|---|---|
| Parameters | 744B (total) / 40B (active) |
| MoE Structure | 256 experts, top-8 routing |
| Precision | FP8 |
| Model Size | ~704GB (weights) |
| Required VRAM | ~744GB (single-node loading) |
| Minimum GPUs | 8x H200 (1,128GB) or 8x B200 (1,536GB) |
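The arithmetic behind the "Minimum GPUs" row can be checked with a short sketch. The per-GPU figures are derived from the 8-GPU totals quoted in this guide (640GB, 1,128GB, 1,536GB); the function names are illustrative, and the sketch assumes the ~744GB requirement from the table is the only constraint.

```python
# Per-GPU memory in GB, derived from the 8-GPU totals quoted in this guide.
GPU_VRAM_GB = {"H100": 80, "H200": 141, "B200": 192}

REQUIRED_VRAM_GB = 744  # "Required VRAM" row in the table above


def total_vram_gb(gpu: str, count: int = 8) -> int:
    """Aggregate VRAM of a node with `count` GPUs of the given type."""
    return GPU_VRAM_GB[gpu] * count


def fits_single_node(gpu: str, count: int = 8, required_gb: int = REQUIRED_VRAM_GB) -> bool:
    """True when the node can hold the model without pipeline parallelism."""
    return total_vram_gb(gpu, count) >= required_gb


def headroom_gb(gpu: str, count: int = 8, required_gb: int = REQUIRED_VRAM_GB) -> int:
    """VRAM left after loading the model (negative means it does not fit)."""
    return total_vram_gb(gpu, count) - required_gb


if __name__ == "__main__":
    for gpu in GPU_VRAM_GB:
        print(gpu, total_vram_gb(gpu), fits_single_node(gpu), headroom_gb(gpu))
```

Under these assumptions, 8x H100 falls short of the requirement, while 8x H200 and 8x B200 both leave several hundred GB free.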
GPU Instance Selection Matrix
The most critical decision when deploying large models is the GPU instance type. Select instances based on the model's VRAM requirements.
| Instance | GPU | VRAM | 744B Single Node? | PP=2 Multi-node | Spot Price (us-east-2) | Recommendation |
|---|---|---|---|---|---|---|
| p5.48xlarge | H100×8 | 640GB | ❌ (744GB > 640GB) | ⚠️ vLLM deadlock | $12/hr | ⚠️ |
| p5en.48xlarge | H200×8 | 1,128GB | ✅ | ✅ (unnecessary) | $12/hr | ✅ Optimal |
| p6-b200.48xlarge | B200×8 | 1,536GB | ✅ | ✅ (unnecessary) | $18/hr | ✅ Headroom |
If the model's VRAM requirement exceeds the instance's total VRAM, multi-node Pipeline Parallelism (PP) is required. However, the vLLM V1 engine is prone to deadlocks with multi-node PP (see Section 6), which makes stable deployment difficult. We recommend choosing an instance with enough VRAM to deploy on a single node.
Instance Selection Principles
Choose the cheapest Spot instance with sufficient VRAM.
- Same price: If p5en Spot and p5 Spot are both $12/hr, choose p5en for its larger VRAM (1,128GB vs. 640GB)
- VRAM headroom: Provision at least 1.5x the model size in VRAM so the KV cache has room
- Simplicity: Eliminate multi-node complexity
- Stability: Avoid PP deadlock issues
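The principles above can be sketched as a small selection function. The instance names, VRAM totals, and Spot prices are copied from the selection matrix; the `Instance` type, the `pick_instance` name, and the tie-break on larger VRAM are illustrative assumptions, and real Spot prices change over time.

```python
from typing import NamedTuple, Optional, Sequence


class Instance(NamedTuple):
    name: str
    vram_gb: int
    spot_usd_per_hour: float


# Values copied from the GPU instance selection matrix in this guide.
CANDIDATES = [
    Instance("p5.48xlarge", 640, 12.0),
    Instance("p5en.48xlarge", 1128, 12.0),
    Instance("p6-b200.48xlarge", 1536, 18.0),
]


def pick_instance(
    candidates: Sequence[Instance], model_size_gb: float, headroom: float = 1.5
) -> Optional[Instance]:
    """Cheapest instance whose VRAM covers model size times the headroom factor.

    Ties on price are broken in favor of the instance with more VRAM.
    Returns None when no single node is large enough.
    """
    needed_gb = model_size_gb * headroom
    eligible = [c for c in candidates if c.vram_gb >= needed_gb]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.spot_usd_per_hour, -c.vram_gb))


if __name__ == "__main__":
    # ~704GB of weights needs about 1,056GB with 1.5x headroom.
    print(pick_instance(CANDIDATES, 704))
```

With the ~704GB weight size from the spec table, the 1.5x rule asks for about 1,056GB, which excludes p5 and makes p5en the cheapest eligible option in this table.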
```bash
# Spot price lookup example (us-east-2)
aws ec2 describe-spot-price-history \
  --instance-types p5en.48xlarge \
  --region us-east-2 \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --product-descriptions "Linux/UNIX" \
  --query 'SpotPriceHistory[0].[SpotPrice,Timestamp]' \
  --output table
```
EKS Deployment Mode Selection
The choice between EKS Auto Mode and Standard Mode depends on the GPU instance type you plan to use.
| Mode | p5.48xlarge | p5en.48xlarge | p6-b200.48xlarge | Stability |
|---|---|---|---|---|
| Auto Mode | ✅ | ❌ NoCompatibleInstanceTypes |