Ollama¶
Run open-source LLMs locally with a simple API. Ollama bundles model weights, configurations, and runtime into a single package, making it easy to deploy and manage models on Kubernetes.
| Category | llm-model |
| Official Docs | Ollama Documentation |
| CLI Install | ./cli llm-model ollama install |
| CLI Uninstall | ./cli llm-model ollama uninstall |
| Namespace | ollama |
Overview¶
Ollama provides a lightweight runtime for running LLMs with:
- Simple Model Management: Pull models with a single command
- OpenAI-Compatible API: Serves OpenAI-style endpoints such as /v1/chat/completions, so existing OpenAI clients can be pointed at it
- GPU Acceleration: Automatic GPU detection and utilization
- Model Library: Access to a wide range of pre-packaged models
- Concurrent Models: Run multiple models simultaneously
- Embedding Support: Built-in embedding model support
Installation¶
The installer:
1. Creates the ollama namespace and a PVC for model storage
2. Generates a ConfigMap with the selected model list
3. Deploys the Ollama server with GPU support
4. Creates the Service and Ingress
5. Auto-pulls the configured models on startup
Model Selection¶
Models are configured as a simple list in config.json:
{
  "llm-model": {
    "ollama": {
      "models": [
        "qwen3:32b",
        "qwen3:30b",
        "gemma3:27b",
        "deepseek-r1:8b",
        "nomic-embed-text:v1.5"
      ]
    }
  }
}
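As a rough illustration of the structure above, the following Python sketch reads the model list out of a config.json document and does a light sanity check on each entry. The helper name `load_ollama_models` and the `name:tag` check are assumptions made for this example; they are not part of the installer.

```python
import json


def load_ollama_models(config_text: str) -> list[str]:
    """Return the model tags listed under llm-model.ollama.models."""
    config = json.loads(config_text)
    models = config.get("llm-model", {}).get("ollama", {}).get("models", [])
    for tag in models:
        # Assumption: every entry is pinned as "name:tag", as in the example above.
        name, sep, version = tag.partition(":")
        if not (name and sep and version):
            raise ValueError(f"expected 'name:tag', got {tag!r}")
    return list(models)
```

Calling `load_ollama_models(open("config.json").read())` before reinstalling is one way to catch a malformed entry early, rather than discovering it when the pull fails at startup.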
Verification¶
# Check Ollama pods
kubectl get pods -n ollama
# Check service
kubectl get svc -n ollama
# Port-forward for testing
kubectl port-forward svc/ollama 11434:11434 -n ollama --address 0.0.0.0 &
# List models
curl http://localhost:11434/api/tags
# Test chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:32b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
# Test with Ollama native API
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:32b",
"prompt": "Explain quantum computing",
"stream": false
}'
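The OpenAI-compatible curl call above can also be made from a script. The sketch below uses only the Python standard library and assumes the port-forward shown earlier is still active on localhost:11434; the function names `build_chat_request` and `chat` are hypothetical helpers, not part of Ollama.

```python
import json
import urllib.request

# Assumption: the port-forward from the commands above is listening here.
OLLAMA_URL = "http://localhost:11434"


def build_chat_request(model: str, prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

For example, `print(chat("qwen3:32b", "Hello!"))` should print the model's reply once the model has been pulled. Building the request is kept separate from sending it so the payload can be inspected without a running server.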
Configuration¶
Adding Models¶
Edit the models list in config.json, then rerun the installer (./cli llm-model ollama install) to apply the change.
Environment Variables¶
The Ollama deployment automatically adapts to the EKS mode:
| Variable | Description |
|---|---|
| EKS_MODE | auto or standard - affects Karpenter node selector prefix |
| DOMAIN | Domain for ingress (optional) |
Integration with LiteLLM¶
Ollama endpoints are automatically discovered by LiteLLM:
model_list:
  - model_name: qwen3-32b-ollama
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://ollama.ollama:11434
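The entry above follows a regular pattern, so entries for other models can be derived from their tags. The Python sketch below shows one way to do that. The naming rule (replace ":" with "-" and append "-ollama") is inferred from the single example above and is an assumption, as are the helper names; how the platform actually performs discovery is not described on this page.

```python
def litellm_entry(tag: str, api_base: str = "http://ollama.ollama:11434") -> dict:
    """Build one LiteLLM model_list entry for an Ollama model tag."""
    # Assumption: naming rule inferred from the qwen3-32b-ollama example above.
    return {
        "model_name": tag.replace(":", "-") + "-ollama",
        "litellm_params": {"model": f"ollama/{tag}", "api_base": api_base},
    }


def litellm_model_list(tags: list[str]) -> dict:
    """Build the full model_list mapping for several tags."""
    return {"model_list": [litellm_entry(tag) for tag in tags]}
```

The resulting dictionary has the same shape as the YAML above and can be serialized with any YAML library if a static LiteLLM config is needed.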
Troubleshooting¶
Models not pulling¶
# Check init container logs (model pull happens at startup)
kubectl logs -n ollama -l app=ollama -c init-pull
# Check main container logs
kubectl logs -n ollama -l app=ollama
# Manually pull a model
kubectl exec -it -n ollama <ollama-pod> -- ollama pull qwen3:32b
Slow inference¶
# Check GPU is being used
kubectl exec -it -n ollama <ollama-pod> -- nvidia-smi
# Check if model fits in GPU memory
kubectl exec -it -n ollama <ollama-pod> -- ollama ps
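The same check can be scripted against the JSON returned by Ollama's /api/ps endpoint. The sketch below assumes each entry in that response carries `name`, `size`, and `size_vram` fields (verify this against your Ollama version) and computes how much of each loaded model sits in GPU memory; `gpu_residency` is a hypothetical helper name.

```python
def gpu_residency(ps_response: dict) -> dict[str, float]:
    """Map each loaded model to the fraction of its bytes held in GPU memory."""
    result = {}
    for model in ps_response.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        # Assumption: size is the total loaded size and size_vram the GPU-resident part.
        result[model["name"]] = vram / size if size else 0.0
    return result
```

A value well below 1.0 suggests the model is partly running on CPU, which is a common cause of slow inference; a smaller model or a quantized variant may fit entirely on the GPU.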
PVC storage full¶
# Check PVC usage
kubectl exec -it -n ollama <ollama-pod> -- df -h /root/.ollama
# Remove unused models
kubectl exec -it -n ollama <ollama-pod> -- ollama rm <model-name>
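To decide which models to remove, it can help to rank them by size. The sketch below works on the JSON returned by /api/tags (the endpoint used in the Verification section) and assumes each entry has `name` and `size` (bytes) fields; `models_by_size` is a hypothetical helper name.

```python
def models_by_size(tags_response: dict) -> list[tuple[str, float]]:
    """Return (name, size in GiB) pairs, largest first."""
    pairs = [
        (model["name"], model.get("size", 0) / 2**30)
        for model in tags_response.get("models", [])
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Feeding it the parsed output of `curl http://localhost:11434/api/tags` gives a quick view of which models take up the most space on the PVC.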