
AWS Neuron Stack — Trainium2/Inferentia2 on EKS

AWS Neuron is a software stack for running training and inference workloads on AWS-designed AI accelerators (Trainium, Inferentia). Just as NVIDIA's CUDA plus the GPU Operator exposes NVIDIA GPUs as Kubernetes resources, the Neuron SDK plus the Neuron Device Plugin exposes Trainium/Inferentia chips as Kubernetes resources on EKS.

This document covers the Neuron software stack, Device Plugin, Karpenter configuration, and inference framework selection criteria for operating Trainium2/Inferentia2 instances on EKS. For NVIDIA GPU-based stacks, refer to NVIDIA GPU Stack; for overall node type selection, see EKS GPU Node Strategy.

Layer | Role | Core Components
Infrastructure Automation | Neuron driver, runtime, Device Plugin | aws-neuron-dkms, neuron-device-plugin
Compiler | Model → NEFF (Neuron Executable) compilation | neuronx-cc (Neuron Compiler)
Runtime | NeuronCore execution, memory management | aws-neuron-runtime, neuronx-collectives
Inference Framework | Large-scale LLM serving | NxD Inference, vLLM Neuron backend, TGI Neuron
Observability | NeuronCore metrics, profiling | neuron-monitor, neuron-top, neuron-ls

1. Why Neuron

1.1 Three Reasons to Choose Neuron

1) Cost Efficiency (Per-Token TCO)

According to AWS official materials, Trainium2/Inferentia2 offer lower per-token costs compared to similar-performance GPUs. The effect is particularly significant under these conditions:

  • Stable inference traffic sustained long-term (>3 months)
  • Standard Transformer-family models based on FP8/INT8/BF16
  • Workloads eligible for AWS Reserved/Savings Plans

2) Capacity Availability

When NVIDIA H100/H200/B200 supply is tight, Trainium2 is relatively easier to procure. Particularly when p5/p5en inventory is scarce in certain US/Asia regions, Neuron becomes a practical alternative.

3) Continuity with Bedrock

Some Foundation Models served by Bedrock (Claude, Llama, Titan, etc.) run internally on the Neuron stack. Choosing Neuron in the Bedrock → Self-hosted migration path allows reuse of compiled artifacts and operational patterns.

1.2 Suitable/Unsuitable Workloads

Category | Workloads
Suitable | Standard Llama/Mistral/Qwen-family inference, large-scale long-term operations, FP8/BF16-based serving, Bedrock-style governance
Caution Required | Newly released models with novel architectures (support delay), workloads dependent on custom CUDA kernels, some AWQ/GPTQ quantization formats
Unsuitable | Research/experimental environments with frequent model architecture changes, code tightly coupled to CUDA-only libraries (e.g., Triton Inference Server custom kernels)
Neuron vs NVIDIA Decision Principles
  • Model ecosystem recency is critical → NVIDIA GPU (H100/H200/B200)
  • Long-term operational TCO / Capacity is critical → Trainium2 / Inferentia2
  • Hybrid operations with Bedrock → Prioritize Neuron review

2. Instance Lineup

The Neuron instance lineup below is current as of April 2026, based on AWS official product pages and the EC2 user guide. Actual regional availability must be verified in the AWS console.

2.1 Inference-Only Instances (Inferentia2)

Instance | Chips | NeuronCores | Total Accelerator Memory | vCPU | Memory | Network
inf2.xlarge | 1× Inferentia2 | 2 | 32 GB | 4 | 16 GB | Up to 15 Gbps
inf2.8xlarge | 1× Inferentia2 | 2 | 32 GB | 32 | 128 GB | Up to 25 Gbps
inf2.24xlarge | 6× Inferentia2 | 12 | 192 GB | 96 | 384 GB | 50 Gbps
inf2.48xlarge | 12× Inferentia2 | 24 | 384 GB | 192 | 768 GB | 100 Gbps

2.2 Training/Inference Dual-Purpose Instances (Trainium1/Trainium2)

Instance | Chips | NeuronCores | Total Accelerator Memory | vCPU | Memory | Network
trn1.2xlarge | 1× Trainium1 | 2 | 32 GB | 8 | 32 GB | Up to 12.5 Gbps
trn1.32xlarge | 16× Trainium1 | 32 | 512 GB | 128 | 512 GB | 800 Gbps EFA
trn1n.32xlarge | 16× Trainium1 | 32 | 512 GB | 128 | 512 GB | 1,600 Gbps EFA
trn2.48xlarge | 16× Trainium2 | 128 | 1.5 TB (HBM3) | 192 | 2 TiB | 3.2 Tbps EFA v3
trn2 Ultra (trn2u.48xlarge, preview/limited availability) | 64× Trainium2 (4× trn2 via NeuronLink) | 512 | 6 TB (HBM3) | - | - | 12.8 Tbps
Version and Specification Notes
  • NeuronCore counts and memory capacity are based on AWS official materials, and reporting units may vary with SDK releases. Verify actual devices with neuron-ls before deployment (see the diagnostic Pod sketch after this list).
  • trn2 Ultra (trn2u) is defined in AWS official announcements as "an ultraserver combining 4 trn2s with NeuronLink." As of April 2026, it may be in preview or limited availability; general availability, regional scope, and Spot support must be confirmed with AWS account teams or official documentation.
  • inf1 (1st generation Inferentia) is not covered in this document. Use Inferentia2/Trainium2 for new deployments.
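
As a concrete version of the neuron-ls check above, the following is a minimal one-shot diagnostic Pod sketch. The instance type and image tag are placeholders, and it assumes the Neuron DLC image referenced elsewhere in this document ships the neuron-ls CLI; it requests every chip on the node so the full device set is visible inside the container.

# Diagnostic Pod: run neuron-ls once on a Neuron node, then exit
apiVersion: v1
kind: Pod
metadata:
  name: neuron-ls-check
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: trn2.48xlarge # adjust to the instance type under test
  tolerations:
    - key: aws.amazon.com/neuron
      operator: Exists
      effect: NoSchedule
  containers:
    - name: neuron-ls
      image: public.ecr.aws/neuron/pytorch-inference-neuronx:2.x # placeholder tag; pin a real SDK release
      command: ["neuron-ls"]
      resources:
        limits:
          aws.amazon.com/neuron: "16" # request all chips so neuron-ls sees the whole node

kubectl logs neuron-ls-check should then report the chip and NeuronCore counts listed in the tables above for the chosen instance type.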

3. Neuron SDK 2.x Stack Architecture

3.1 Layer Structure

The stack is layered as summarized in the layer/role/component table at the top of this document: the kernel driver and Device Plugin at the bottom, the compiler and runtime in the middle, and inference frameworks plus observability tooling on top.

3.2 Core Components

Component | Description | Deployment Form
aws-neuron-dkms | Linux kernel module. Creates /dev/neuron* device nodes | Pre-installed in AMI or DKMS package
aws-neuron-runtime (libnrt) | NeuronCore execution, memory management, scheduling | Included in container image
aws-neuronx-collectives | Collectives for distributed training/inference (AllReduce, AllGather, etc.) | Included in container image
neuronx-cc | Graph compiler. Converts PyTorch/JAX models to NEFF | Used in development/build stage
torch-neuronx | PyTorch 2.x frontend. torch.compile(backend="neuronx") | pip package
neuron-device-plugin | Kubernetes Device Plugin. Registers aws.amazon.com/neuron* resources | DaemonSet
neuron-monitor / neuron-top / neuron-ls | Observability and profiling tools | Container image/CLI
Neuron SDK 2.x (Latest stable version as of April 2026)

Neuron SDK is regularly updated in the 2.x release train. Key features of the latest stable version as of April 2026:

  • Official support for Trainium2 (trn2) + trn2 Ultra (NeuronLink)
  • NxD Inference LLM library (pre-compiled checkpoints for Llama 3/4, DBRX, Mistral families)
  • Official vLLM Neuron backend support (continuous batching, PagedAttention-like structure)
  • PyTorch 2.5+ / JAX compatibility
  • FP8 (E4M3/E5M2) inference path

For exact minor version, check AWS Neuron SDK Release Notes.

3.3 Compilation Model and NEFF

Neuron uses an Ahead-of-Time (AOT) compilation model. It does not run directly in PyTorch eager mode; neuronx-cc must convert the computation graph into NeuronCore hardware instructions (NEFF, Neuron Executable File Format) for execution.

PyTorch / JAX model
↓ torch-neuronx trace/compile
Neuron IR (HLO)
↓ neuronx-cc
NEFF (Neuron Executable) — Initial compilation 5-30 min, then cache reuse
↓ aws-neuron-runtime
Trainium / Inferentia hardware execution

Operational Implications:

  • First Pod startup can take 20-30+ minutes for NEFF compilation → Pre-compile and cache in S3/ECR (see the cache sketch after this list)
  • Model weight changes require recompilation → Manage NEFF artifacts in CI pipeline
  • NxD Inference provides pre-compiled checkpoints for official models to reduce initial startup time
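
A minimal sketch of the cache wiring referenced in the list above, assuming the Neuron persistent compilation cache can be pointed at S3 via the NEURON_COMPILE_CACHE_URL environment variable; the variable name, image tag, and bucket path are assumptions to verify against the Neuron SDK release in use.

# Sketch: reuse NEFF artifacts compiled in CI instead of recompiling on Pod start
apiVersion: v1
kind: Pod
metadata:
  name: llama3-70b-neuron-cached
spec:
  tolerations:
    - key: aws.amazon.com/neuron
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: public.ecr.aws/neuron/pytorch-inference-neuronx:2.x # placeholder tag
      env:
        - name: NEURON_COMPILE_CACHE_URL # assumed persistent-cache variable; confirm in SDK docs
          value: s3://example-bucket/neff-cache/llama3-70b/ # hypothetical bucket/prefix
      resources:
        limits:
          aws.amazon.com/neuron: "16"

On a cache hit the runtime loads the cached NEFF instead of invoking neuronx-cc, so Pod startup should drop from tens of minutes to roughly the model-load time.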

4. EKS Integration

4.1 Neuron Device Plugin Deployment

Neuron Device Plugin registers the node's /dev/neuron* devices as Kubernetes extended resources. Use AWS official YAML/Helm charts.

# Example deployment with official YAML
kubectl apply -f https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
kubectl apply -f https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml

After deployment, the following appears in node resources:

kubectl describe node <trn2-node> | grep aws.amazon.com
# Allocatable:
# aws.amazon.com/neuron: 16 # trn2.48xlarge: Trainium2 chip count
# aws.amazon.com/neuroncore: 128 # Total NeuronCore count
# aws.amazon.com/neurondevice: 16 # Device file count

4.2 Resource Request Patterns

When requesting Neuron resources in Pod specs, choose one of three units:

Resource | Meaning | When to Use
aws.amazon.com/neuron | Neuron chip unit (Trainium2 chips in trn2) | When chip-level allocation is clear
aws.amazon.com/neuroncore | NeuronCore unit (8 per trn2 chip) | For fine-grained core-level scheduling
aws.amazon.com/neurondevice | /dev/neuron* device file unit | For legacy / specific tool compatibility
# Example: Use entire trn2.48xlarge (16 chips = 128 NeuronCores)
apiVersion: v1
kind: Pod
metadata:
  name: llama3-70b-neuron
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: trn2.48xlarge
  tolerations:
    - key: aws.amazon.com/neuron
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: public.ecr.aws/neuron/pytorch-inference-neuronx:2.x
      resources:
        limits:
          aws.amazon.com/neuron: "16"
        requests:
          aws.amazon.com/neuron: "16"

# Example: request only 4 NeuronCores (2 Inferentia2 chips' worth) for fine-grained core-level scheduling
resources:
  limits:
    aws.amazon.com/neuroncore: "4"

4.3 Node Taint / Toleration Pattern

The same pattern as NVIDIA GPU nodes is recommended.

# Apply taint to node (configured in Karpenter NodePool)
taints:
  - key: aws.amazon.com/neuron
    effect: NoSchedule

# Declare toleration in Pod
tolerations:
  - key: aws.amazon.com/neuron
    operator: Exists
    effect: NoSchedule

4.4 AMI Selection

AMI | Neuron Driver | Recommended Use
EKS Optimized AMI (Neuron) | Pre-installed | Production standard (--ami-type AL2023_x86_64_NEURON or equivalent)
Deep Learning AMI (Neuron) | Pre-installed + Neuron SDK tools | Development/debugging nodes
General AL2023 | Manual installation (DKMS) | Not recommended

When the Neuron-optimized AMI is used with EKS managed node groups, the Neuron driver comes pre-installed and nodeadm bootstraps the node without additional configuration.


5. Karpenter NodePool Examples

5.1 trn2 Training/Large Inference NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: neuron-trn2
spec:
  template:
    metadata:
      labels:
        accelerator: neuron
        accelerator-family: trainium2
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["trn2"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-2a", "us-east-2b", "us-east-2c"]
      taints:
        - key: aws.amazon.com/neuron
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: neuron-nodeclass
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 10m
  limits:
    aws.amazon.com/neuron: "64"

5.2 inf2 Low-Cost Inference NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: neuron-inf2
spec:
  template:
    metadata:
      labels:
        accelerator: neuron
        accelerator-family: inferentia2
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["inf2"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: aws.amazon.com/neuron
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: neuron-nodeclass
  limits:
    aws.amazon.com/neuron: "48"

5.3 EC2NodeClass (Neuron AMI)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: neuron-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest # When using a Neuron-optimized AMI variant, an explicit id is recommended
  role: KarpenterNodeRole-eks-genai
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-genai
        subnet-type: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-genai
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi # NEFF cache + model weights storage
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
  metadataOptions:
    httpTokens: required
AMI Selection Notes

In Karpenter EC2NodeClass amiSelectorTerms, the al2023 alias refers to the standard AL2023 AMI. To use variants with pre-installed Neuron drivers, specify the SSM parameter or explicit AMI ID for the Neuron-optimized AMI published by AWS. Installing Neuron DKMS via UserData is possible but not recommended.
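
A sketch of the pinned-AMI variant described above; the AMI ID is a placeholder and the SSM parameter path shown in the comment is an assumption to confirm against the EKS optimized-AMI documentation.

amiSelectorTerms:
  - id: ami-0123456789abcdef0 # placeholder: Neuron-optimized EKS AMI for your cluster version
# Resolve the current ID from the AWS-published SSM parameter, e.g. (path to verify):
#   /aws/service/eks/optimized-ami/<k8s-version>/amazon-linux-2023/x86_64/neuron/recommended/image_id
# Newer Karpenter releases may also accept an ssmParameter selector term; check the Karpenter docs.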


6. Inference Frameworks

There are three main frameworks for serving LLMs on Neuron.

6.1 NxD Inference (Neuron Distributed Inference)

NxD Inference is AWS's officially maintained library for large-scale LLM inference. Built on PyTorch, it provides Tensor/Pipeline Parallelism, Continuous Batching, PagedAttention-like memory management, and Speculative Decoding for the Llama families and other major public models.

Features:

  • Official provision of pre-compiled checkpoints for Llama 3/4, DBRX, Mistral, Mixtral, etc.
  • TP/PP configuration API at NeuronCore level
  • Optimization profile similar to Bedrock internal serving path
  • Apache 2.0 / AWS Official Support

6.2 vLLM Neuron backend

vLLM's Neuron backend was introduced experimentally in 2024, and feature parity has improved rapidly through 2025-2026. As of April 2026, continuous batching and OpenAI-compatible API serving are available for major LLMs (Llama 3/4, Qwen, Mistral); a deployment sketch follows the feature list below.

Features:

  • Compatible with existing vLLM deployment scripts in the form vllm --device neuron --tensor-parallel-size N
  • While PagedAttention itself is a CUDA implementation, the Neuron backend provides equivalent block-based KV management
  • Neuron parity for latest vLLM features (speculative decoding, chunked prefill, prefix caching) varies by feature, so release notes must be checked
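
A hedged Deployment sketch for the OpenAI-compatible serving path described above, using an inf2 node and NeuronCore-level resource requests. The image name and model ID are placeholders, the container is assumed to use the vllm CLI as its entrypoint, and the flags follow the vllm --device neuron pattern quoted in the list; verify the flags and the TP-to-NeuronCore mapping against the vLLM Neuron backend documentation for your version.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-neuron
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-neuron
  template:
    metadata:
      labels:
        app: vllm-neuron
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: inf2.48xlarge
      tolerations:
        - key: aws.amazon.com/neuron
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm-neuron:example # hypothetical image built with the Neuron backend
          args:
            - serve
            - meta-llama/Llama-3.1-8B-Instruct # placeholder model
            - --device=neuron
            - --tensor-parallel-size=8 # assumed to map to NeuronCores; verify per version
          ports:
            - containerPort: 8000 # OpenAI-compatible API
          resources:
            limits:
              aws.amazon.com/neuroncore: "8" # 4 Inferentia2 chips' worth of cores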

6.3 TGI (Text Generation Inference) Neuron fork

HuggingFace maintains a TGI fork based on optimum-neuron. HuggingFace models can be easily compiled and served on Neuron through optimum[neuronx].

Features:

  • Tightly integrated with HuggingFace Hub-based workflows
  • TGI itself has been in maintenance mode since 2025 → new features are slower compared to vLLM
  • Compatibility with SageMaker's HuggingFace LLM DLC

6.4 Framework Comparison

Aspect | NxD Inference | vLLM Neuron backend | TGI Neuron fork
Maintainer | AWS Official | vLLM community + AWS contributions | HuggingFace + AWS contributions
Model Coverage | AWS-selected official models (Llama/Mistral/DBRX, etc.) | Neuron-ported vLLM-supported models | optimum-neuron supported HuggingFace models
Pre-compiled checkpoint | Provided | Partial | Partial
OpenAI-compatible API | Supported | Supported | Supported
Continuous Batching | Supported | Supported | Supported
Speculative Decoding | Supported (model-specific) | Partially supported | Limited
Prefix Caching | Model-specific | Limited | Limited
Update Speed | AWS release cycle | vLLM release cycle (fast) | Slow (maintenance mode)
Recommended Use | Large-scale production with AWS official models | Diverse models and latest vLLM features | HuggingFace ecosystem continuity
Framework Selection Guide
  • Large-scale Llama production → NxD Inference (pre-compiled checkpoint advantage)
  • Diverse models and latest vLLM features → vLLM Neuron backend
  • Existing HuggingFace Hub-based pipelines → TGI Neuron fork
  • New projects are recommended to choose between NxD Inference or vLLM Neuron

7. Supported Model Matrix

Based on AWS Neuron official Model Zoo and NxD Inference support matrix. For the latest support coverage, check AWS Neuron Samples GitHub and NxD Inference documentation.

7.1 Major Officially Supported Models (as of April 2026)

Model | Size | Recommended Instance | Pre-compiled | Notes
Llama 3.x | 8B / 70B | inf2.48xlarge / trn2.48xlarge | Provided | NxD Official
Llama 4 (Scout / Maverick) | 17B-400B | trn2.48xlarge / trn2 Ultra | Check release status | NxD support expanding
Mistral 7B / Mixtral 8x7B | 7B / 47B | inf2.48xlarge | Provided | NxD/vLLM Supported
Mixtral 8x22B | 141B | trn2.48xlarge | Partial | MoE; expert parallelism (EP) required
Qwen3 series | 4B-32B | inf2 / trn2 | Partial | vLLM Neuron backend recommended
DBRX 132B | 132B | trn2.48xlarge | Provided | NxD Official (MoE)
DeepSeek V3 | 671B (MoE) | trn2 Ultra | Limited | Check compilation & memory constraints
Future Model Support

Latest large MoE models (DeepSeek V3, Llama 4 Maverick, GLM-5, etc.) are being added to Neuron support incrementally. Before deployment, always check the NxD Inference support matrix and Release Notes at that time. This document's table is for reference purposes and does not guarantee specific version support.

7.2 Quantization Support

Format | Neuron Support
BF16 | Default
FP16 | Supported
FP8 (E4M3, E5M2) | Supported on Trainium2, limited on Inferentia2
INT8 (weights) | Supported (model-specific)
AWQ | Limited (check model & version)
GPTQ | Limited
GGUF | Not supported

8. Observability

8.1 Neuron-Specific Tools

Tool | Role | When to Use
neuron-ls | List Neuron devices on the node | Initial diagnosis
neuron-top | Real-time NeuronCore utilization, memory, power | Real-time monitoring
neuron-monitor | Stream metrics in JSON format | Prometheus exporter input
neuron-profile | Profile NEFF execution | Performance optimization

8.2 Prometheus / CloudWatch Integration

Collection chain:

  • neuron-monitor outputs NeuronCore utilization, HBM usage, device temperature, execution latency, etc. as JSON stream
  • The OSS neuron-monitor-prometheus exporter converts this stream to Prometheus format (a DaemonSet sketch follows this list)
  • AMP (Amazon Managed Prometheus) collects via remote-write and dashboards in AMG (Amazon Managed Grafana)
  • CloudWatch Container Insights Neuron metrics can also be utilized
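
A DaemonSet sketch for the exporter step above, assuming an image that bundles the Neuron tools and the neuron-monitor-prometheus.py script commonly shown in Neuron monitoring examples; the image name, port, and device-access requirements are assumptions to check against the Neuron monitoring documentation.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: neuron-monitor-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: neuron-monitor-exporter
  template:
    metadata:
      labels:
        app: neuron-monitor-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      nodeSelector:
        accelerator: neuron # label applied by the NodePools in section 5
      tolerations:
        - key: aws.amazon.com/neuron
          operator: Exists
          effect: NoSchedule
      containers:
        - name: exporter
          image: neuron-monitor-exporter:example # hypothetical image bundling Neuron tools
          command: ["/bin/sh", "-c"]
          args:
            - neuron-monitor | neuron-monitor-prometheus.py --port 8000
          ports:
            - containerPort: 8000
          # NOTE: neuron-monitor needs access to the Neuron driver; depending on the SDK
          # version this may require privileged mode or mounting the /dev/neuron* devices.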

For detailed AMP/AMG configuration, refer to Monitoring & Observability Setup.

8.3 Key Metrics

Metric | Description | Usage
neuron_core_utilization | NeuronCore utilization (%) | HPA/KEDA trigger
neuron_device_memory_used | HBM usage (MB) | OOM prevention, capacity planning
neuron_execution_latency | Inference request processing latency | SLO monitoring
neuron_hardware_ecc_events | ECC error count | Hardware health check
neuron_power_watts | Power per chip (W) | Thermal & cost management
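
For the HPA/KEDA trigger use of neuron_core_utilization in the table above, a KEDA ScaledObject with the Prometheus scaler is one common wiring. The Deployment name, AMP workspace URL, and metric labels below are placeholders; querying AMP additionally requires SigV4 authentication via a TriggerAuthentication (omitted here).

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-neuron-scaler
spec:
  scaleTargetRef:
    name: vllm-neuron # Deployment to scale (placeholder)
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://aps-workspaces.us-east-2.amazonaws.com/workspaces/ws-EXAMPLE
        query: avg(neuron_core_utilization{app="vllm-neuron"}) # label set is an assumption
        threshold: "70" # target average NeuronCore utilization (%)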

9. Limitations and Considerations

9.1 Feature Constraints

Category | Constraint
Custom Kernels | CUDA-only kernels (custom FlashAttention implementations, etc.) require direct porting to Neuron
Quantization | Some AWQ/GPTQ variants and GGUF not supported
Compilation Time | New model initial compilation 20-30+ minutes → NEFF cache essential
Debugging | Telemetry tool ecosystem is narrower compared to nvidia-smi on GPUs
Open Source Ecosystem | No equivalent "integrated orchestrator" like NVIDIA GPU Operator / DCGM; combine Neuron Device Plugin with a separate exporter

9.2 Operational Considerations

Pre-Production Deployment Checklist
  • Verify that target model is supported in current version by NxD Inference or vLLM Neuron backend
  • Pre-compile model NEFF in CI stage and manage as artifact in S3/ECR
  • Configure readinessProbe / startupProbe for the first Pod's startup delay (sufficiently large initialDelaySeconds or startupProbe failureThreshold); see the probe sketch after this checklist
  • Verify that neuron_core_utilization is properly collected in HPA/KEDA trigger metrics
  • Verify Trainium2 capacity by region and establish Spot interrupt policy
  • Document model version synchronization policy for hybrid operations with Bedrock
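
A probe sketch for the startup-delay item in the checklist above; the health endpoint, port, and timing numbers are illustrative and should be tuned to the observed compile/load time of the model.

# Container-level excerpt for a Neuron serving Pod
startupProbe:
  httpGet:
    path: /health # assumed health endpoint of the serving framework
    port: 8080
  periodSeconds: 30
  failureThreshold: 60 # allows ~30 minutes before the kubelet restarts the container
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10

With a startupProbe sized like this, the readinessProbe no longer needs an oversized initialDelaySeconds; traffic is simply withheld until the first successful startup check.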

9.3 Workloads to Avoid on Neuron

  • R&D experiments changing model architecture weekly — compilation cost occurs repeatedly
  • Existing stack deeply coupled with Triton Inference Server + custom Python backend
  • Infrequent, very low-volume traffic (<10 tokens/sec), where warmup and compilation overhead dominate

References