Frequently Asked Questions (FAQ)¶
Common questions and answers about the GenAI on EKS Starter Kit.
Configuration¶
How do the config files work?¶
.env and config.json are loaded first as defaults. Values from .env.local and config.local.json, if those files exist, are then merged on top and override the defaults.
Loading order:
- .env and config.json (defaults)
- .env.local and config.local.json (overrides, if present)
Best practice:
- Keep defaults in .env and config.json (tracked in git)
- Keep your customizations in .env.local and config.local.json (gitignored)
Example:
The final value will be us-east-1.
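The override behavior can be sketched in a few lines of Python. This is only an illustration of "defaults first, local values win", not the kit's actual loader; the file contents, the us-west-2 default, and the cluster name below are hypothetical.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def merge_env(defaults_text: str, local_text: str = None) -> dict:
    """Load defaults first, then let local values override them."""
    merged = parse_env(defaults_text)
    if local_text is not None:
        merged.update(parse_env(local_text))
    return merged


# Hypothetical contents of .env (defaults) and .env.local (overrides).
env = "REGION=us-west-2\nEKS_CLUSTER_NAME=genai-on-eks\n"
env_local = "REGION=us-east-1\n"

final = merge_env(env, env_local)
print(final["REGION"])  # us-east-1
```

Keys that appear only in the defaults (here EKS_CLUSTER_NAME) pass through unchanged.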
Domain & Networking¶
How can I use this starter kit without having a Route 53 hosted zone?¶
There are two ingress configurations:
With a domain (recommended):
- Single shared ALB with HTTPS
- Wildcard ACM certificate
- Route 53 DNS records
- All services accessible at <service>.<DOMAIN>
- Example: litellm.example.com, openwebui.example.com
Without a domain:
- Multiple ALBs with HTTP
- No DNS records required
- Only one service requiring Nginx Ingress basic auth (e.g., Milvus, Qdrant) can be exposed
- Access services via ALB DNS names (long URLs)
To use without a domain:
- Leave DOMAIN empty in .env
- Each public-facing service will get its own ALB
- Find service URLs with kubectl get ingress --all-namespaces
Limitations without domain:
- Cannot expose multiple services with basic auth
- HTTP only (no HTTPS)
- Long ALB DNS names instead of friendly URLs
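As a rough illustration of the two modes, the sketch below derives the URL a client would use in each case. The function name and the ALB DNS name are hypothetical; only the <service>.<DOMAIN> pattern and the HTTPS/HTTP split come from the text above.

```python
def service_url(service: str, domain: str, alb_dns_name: str = None) -> str:
    """Return the URL for a service with or without a configured domain."""
    if domain:
        # With a domain: shared ALB, HTTPS, <service>.<DOMAIN>.
        return f"https://{service}.{domain}"
    if not alb_dns_name:
        raise ValueError("without a domain, the service's ALB DNS name is required")
    # Without a domain: per-service ALB, HTTP only.
    return f"http://{alb_dns_name}"


print(service_url("litellm", "example.com"))
# Hypothetical ALB DNS name, as it might appear in the ingress ADDRESS column.
print(service_url("litellm", "", "k8s-litellm-0123456789.us-west-2.elb.amazonaws.com"))
```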
Model Management¶
How can I configure and update the LiteLLM proxy model list?¶
LiteLLM automatically discovers self-hosted models (vLLM, SGLang, Ollama, TGI) running in the cluster.
For Bedrock models, the model list is configured in config.json:
{
"bedrock": {
"llm": {
"models": [
{
"name": "amazon-nova-premier",
"model": "us.amazon.nova-premier-v1:0"
},
{
"name": "claude-4-opus",
"model": "us.anthropic.claude-opus-4-20250514-v1:0"
}
]
}
}
}
To update:
- Edit config.json or config.local.json (recommended)
- Reinstall LiteLLM: ./cli ai-gateway litellm install
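Since customizations are recommended in config.local.json, it can help to picture how an override file might combine with the defaults. The sketch below assumes nested objects are merged key by key and that lists and scalars are replaced wholesale; the kit's real merge semantics may differ, so treat deep_merge as a hypothetical helper.

```python
import json


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Merge overrides into defaults; nested dicts merge, other values replace."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: lists (such as "models") are replaced, not appended.
            merged[key] = value
    return merged


config = json.loads("""
{"bedrock": {"llm": {"models": [
  {"name": "amazon-nova-premier", "model": "us.amazon.nova-premier-v1:0"}
]}}}
""")
# Hypothetical config.local.json that only sets a Bedrock region.
config_local = json.loads('{"bedrock": {"region": "us-east-1"}}')

merged = deep_merge(config, config_local)
print(json.dumps(merged, indent=2))
```

Under these assumptions, the override adds bedrock.region while the default model list is kept as-is.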
Self-hosted models:
Self-hosted models are automatically detected from running pods and added to LiteLLM's config.
To add a new model:
To update models:
Infrastructure¶
How can I change the EC2 GPU instance families and purchasing options?¶
The default instance families are g6e, g6, and g5. The default purchasing options are spot and on-demand.
To change:
- Edit terraform/0-common.tf directly, OR
- Edit config.json or config.local.json:
{
"terraform": {
"vars": {
"instance_families": ["g6e", "p5", "p4d"],
"purchasing_options": ["on-demand"]
}
}
}
- Apply the changes:
Note: Model deployment manifests use nodeSelector to lock to specific instance families:
You'll need to adjust the instanceFamily field in model configurations accordingly.
Common instance families:
- g6e - NVIDIA L40S (newest, best price/performance)
- g6 - NVIDIA L4
- g5 - NVIDIA A10G
- p5 - NVIDIA H100 (highest performance)
- p4d - NVIDIA A100
- inf2 - AWS Inferentia 2 (for Neuron)
Neuron & Inferentia¶
How can I use LLM models with AWS Neuron and EC2 Inferentia 2?¶
Supported models have the -neuron suffix. To enable Neuron support:
- Enable Neuron in config.json or config.local.json
- Install the component (this builds the vLLM Neuron image and takes ~20-30 mins)
First-time deployment:
When a Neuron model is deployed for the first time, Neuron performs just-in-time (JIT) compilation, which takes ~20-30 minutes. The compiled model is cached on EFS and reused by subsequent deployments.
Using INT8 quantization on inf2.xlarge:
Models like Llama-3.1-8B-Instruct, DeepSeek-R1-Distill-Llama-8B, and Mistral-7B-Instruct-v0.3 support INT8 quantization to run on a single inf2.xlarge. However, compilation still requires inf2.8xlarge.
Process:
- Deploy with compile: true (uses inf2.8xlarge for compilation)
- Wait for compilation to complete (~20-30 mins)
- Change to compile: false in config
- Delete the model deployment
- Redeploy (now uses inf2.xlarge with the cached model)
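The relationship between the compile flag and the instance type in the process above can be summarized as a small lookup. This is a sketch of the rule as described in this FAQ, not code from the kit; neuron_instance_type is a hypothetical name.

```python
def neuron_instance_type(compile_model: bool) -> str:
    """Instance type for an INT8 Neuron model, following the FAQ text."""
    # Compilation needs the larger instance; serving the cached model does not.
    return "inf2.8xlarge" if compile_model else "inf2.xlarge"


for flag in (True, False):
    print(f"compile: {str(flag).lower()} -> {neuron_instance_type(flag)}")
```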
Docker & Multi-Arch¶
How can I disable the multi-arch container image build?¶
By default, Docker Buildx is used to build multi-arch container images (linux/amd64 and linux/arm64).
To disable:
- Edit config.json or config.local.json
- Set arch based on your machine's CPU architecture:
  - Intel/AMD Mac: linux/amd64
  - M1/M2/M3 Mac: linux/arm64
  - Intel/AMD Linux: linux/amd64
  - ARM Linux: linux/arm64
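If you are unsure which value applies to your machine, a short Python check can map the reported CPU architecture to the matching platform string. The mapping table is an assumption covering the common names returned by platform.machine(); other values raise an error rather than guessing.

```python
import platform

# Common platform.machine() values mapped to Docker platform strings.
ARCH_TO_DOCKER_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}


def docker_platform(machine: str = None) -> str:
    """Return the Docker platform string for a machine architecture name."""
    name = (machine or platform.machine()).lower()
    if name not in ARCH_TO_DOCKER_PLATFORM:
        raise ValueError(f"unrecognized architecture: {name!r}")
    return ARCH_TO_DOCKER_PLATFORM[name]


print(docker_platform("x86_64"))  # linux/amd64
print(docker_platform("arm64"))   # linux/arm64
```

Calling docker_platform() with no argument uses the architecture of the machine running the script.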
When to disable:
- Faster builds during development
- Don't need ARM64 support
- Buildx not available
When to keep enabled:
- Production deployments
- Need both AMD64 and ARM64 support
- Using Karpenter with mixed instance types
Bedrock¶
How can I use a different AWS region for Bedrock?¶
By default, Bedrock uses the same region as your EKS cluster (REGION environment variable).
To use a different region:
- Edit config.json or config.local.json to set bedrock.region (see the example below)
- Reinstall LiteLLM: ./cli ai-gateway litellm install
Use cases:
- Access models not available in your EKS region
- Use Bedrock in a region with lower latency
- Comply with data residency requirements
Example:
{
"bedrock": {
"region": "us-east-1",
"llm": {
"models": [
{
"name": "claude-4-opus",
"model": "us.anthropic.claude-opus-4-20250514-v1:0"
}
]
}
}
}
Multi-Cluster¶
How can I provision and manage multiple EKS clusters?¶
You can manage multiple clusters by changing the values of REGION, EKS_CLUSTER_NAME, and DOMAIN in .env or .env.local.
The Terraform workspace and kubectl context automatically follow these values when you run ./cli commands.
Example workflow:
- Configure first cluster
- Deploy first cluster
- Configure second cluster
- Deploy second cluster
How it works:
- Terraform uses a workspace named after EKS_CLUSTER_NAME
- The kubectl context is automatically selected based on the cluster name
- Each cluster has independent state
Switch between clusters:
# Edit .env.local to change EKS_CLUSTER_NAME
# All subsequent commands target the specified cluster
./cli ai-gateway litellm install
ECR Pull Through Cache¶
What is ECR Pull Through Cache and should I enable it?¶
ECR Pull Through Cache caches external container images (from Docker Hub, GitHub Container Registry) in your private ECR registry.
Benefits:
- Avoids rate limits from public registries
- Faster pulls from within AWS
- Images stay within your AWS infrastructure
- More reliable (no external registry downtime)
Default: Disabled (enable_ecr_pull_through_cache = false)
Why disabled by default?
- Cached images stored in ECR incur storage costs
- Public registries work fine for most use cases (EKS nodes have internet access)
- Requires Docker Hub and GitHub authentication
Rate limits:
- Docker Hub anonymous: 100 pulls/6 hours
- Docker Hub authenticated: 200 pulls/6 hours
- GitHub: Generally higher limits
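To get a feel for when these limits matter, the sketch below compares a rough pull count against the Docker Hub figures listed above. It assumes every node pulls every image once within the same 6-hour window and that all nodes are counted against a single limit (for example, because they share one egress IP); the node and image counts are hypothetical.

```python
# Pull limits per 6-hour window, taken from the list above.
DOCKER_HUB_LIMITS = {"anonymous": 100, "authenticated": 200}


def pulls_needed(nodes: int, images_per_node: int) -> int:
    """Rough pull count if each node pulls each image once."""
    return nodes * images_per_node


def exceeds_limit(nodes: int, images_per_node: int, tier: str = "anonymous") -> bool:
    """True if the estimated pulls exceed the tier's 6-hour limit."""
    return pulls_needed(nodes, images_per_node) > DOCKER_HUB_LIMITS[tier]


# Hypothetical rollout: 30 nodes, 4 Docker Hub images each = 120 pulls.
print(exceeds_limit(30, 4, "anonymous"))      # True
print(exceeds_limit(30, 4, "authenticated"))  # False
```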
When to enable:
- Hitting rate limits (frequent large-scale deployments)
- Need faster, more reliable pulls
- Organization requires private registry storage
- Air-gapped or restricted network environments
To enable:
- Get credentials:
  - Docker Hub access token
  - GitHub Personal Access Token with the read:packages scope
- Add them to config.local.json
- Apply the changes
Supported registries:
- vllm/* → Docker Hub
- lmsysorg/* → Docker Hub
- ollama/* → Docker Hub
- huggingface/* → GitHub Container Registry
Cleanup:
When you run terraform destroy, cache rules are deleted but cached repositories remain.
Manual cleanup:
aws ecr describe-repositories --region $REGION | \
jq -r '.repositories[] | select(.repositoryName | startswith("vllm/") or startswith("lmsysorg/") or startswith("ollama/") or startswith("huggingface/")) | .repositoryName' | \
xargs -I {} aws ecr delete-repository --repository-name {} --force --region $REGION
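Before running the deletion, it may be worth previewing which repositories the prefix filter would match. The Python sketch below mirrors the jq startswith filter on a plain list of names; the sample repository names are hypothetical, and in practice the list would come from the describe-repositories output.

```python
# Prefixes used by the pull through cache rules listed above.
CACHED_PREFIXES = ("vllm/", "lmsysorg/", "ollama/", "huggingface/")


def cached_repositories(names: list) -> list:
    """Return only the repository names created by the pull through cache."""
    # str.startswith accepts a tuple and matches any of the prefixes.
    return [name for name in names if name.startswith(CACHED_PREFIXES)]


# Hypothetical repository names in the account.
repos = ["vllm/vllm-openai", "my-app", "ollama/ollama", "vllm-tools"]
print(cached_repositories(repos))  # ['vllm/vllm-openai', 'ollama/ollama']
```

Because each prefix includes the trailing slash, a repository such as vllm-tools is not matched.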
Troubleshooting¶
Components fail to install¶
Problem: ./cli <category> <component> install fails
Common causes:
- Prerequisites not installed:
  - Check: kubectl version, helm version, docker version
  - Solution: Install missing tools
- Cluster not accessible:
  - Check: kubectl get nodes
  - Solution: Run aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
- Dependencies not installed:
  - Some components depend on others (e.g., examples need LiteLLM)
  - Solution: Install dependencies first
- Resource limits:
  - Insufficient GPU nodes
  - Solution: Check node availability: kubectl get nodes -l eks.amazonaws.com/compute-type=gpu
Models fail to deploy¶
Problem: Model pod stays in Pending, ImagePullBackOff, or CrashLoopBackOff
Solutions:
- Pending (no nodes):
  - Solution: Wait for Karpenter to provision nodes (~5 mins)
  - Or check that the instance families are available in your region
- ImagePullBackOff:
  - Solution: Verify HF_TOKEN is set for gated models
  - Or check Docker Hub rate limits (enable ECR Pull Through Cache)
- CrashLoopBackOff:
  - Solution: Increase GPU memory utilization or use a larger instance
Services not accessible¶
Problem: Cannot access services via ingress
Solutions:
- With domain:
  - Verify Route 53 hosted zone exists
  - Check DNS records: nslookup litellm.$DOMAIN
  - Check ALB: kubectl get ingress -n litellm
- Without domain:
  - Find ALB DNS name: kubectl get ingress --all-namespaces
  - Use ALB DNS name directly
- Certificate issues:
  - Check ACM certificate status in AWS console
  - Ensure DNS validation records exist