CUDA Toolkit 13.0 for Jetson Thor: Unified Arm Ecosystem and More
Source: https://developer.nvidia.com/blog/whats-new-in-cuda-toolkit-13-0-for-jetson-thor-unified-arm-ecosystem-and-more/ (NVIDIA Developer Blog)
Overview
CUDA Toolkit 13.0 for Jetson Thor delivers a unified CUDA toolchain for Arm platforms, eliminating the need to maintain separate toolchains for embedded devices and SBSA-compliant servers. Powered by the NVIDIA Blackwell GPU architecture, Jetson Thor gains Unified Virtual Memory (UVM) with full coherence, enabling pageable host memory to be accessed by the GPU through the host page tables. This aligns Jetson platforms with dGPU systems in memory-sharing semantics and reduces the burden of managing memory across CPU and GPU domains. The release aims to streamline development, simulation, testing, and deployment by consolidating toolchains and container images into a single, portable lineage across edge and server targets; an important exception is Orin (sm_87), which continues on its current path for now.
The broader impact is a streamlined workflow: build once, simulate on high-end systems, and deploy the same binary to embedded targets like Jetson Thor without code changes. The unification also extends to containers, enabling a shared image lineage across simulation and edge deployments and paving the way for concurrent use of the integrated GPU (iGPU) and a discrete GPU (dGPU) on Jetson and IGX platforms.
Beyond unification, CUDA 13.0 introduces several features that improve GPU utilization, developer productivity, and interoperability. OpenRM-based memory sharing via dmabuf enables zero-copy buffer exchange across subsystems where the platform supports it, complementing the existing EGL- and NvSci-based sharing options on Tegra. NUMA support is introduced for Tegra devices, helping multi-socket or NUMA-aware workloads place memory closer to the CPUs that use it. The NVIDIA Management Library (NVML) and the nvidia-smi utility are now supported on Jetson Thor, giving developers familiar APIs and tooling for monitoring GPU usage.
Jetson Thor also gains capabilities that improve multi-process and real-time workloads. Multi-Process Service (MPS) lets multiple processes share the GPU concurrently with reduced context-switch overhead, improving occupancy and throughput. Green contexts provide lightweight, pre-allocated GPU resources that isolate and deterministically assign SMs to latency-sensitive tasks. The combination of MPS, green contexts, and future Multi-Instance GPU (MIG) capabilities enables more predictable multi-process execution across robotics, SLAM, perception, and planning workloads.
On the interoperability side, CUDA 13.0 supports importing a dmabuf into CUDA memory and exporting CUDA allocations as dmabuf fds on OpenRM platforms, enabling cross-stack sharing with other kernel drivers and userspace components through a standardized, zero-copy buffer exchange mechanism. The Driver API exposes cuMemGetHandleForAddressRange() for exporting memory ranges as dmabufs and cuDeviceGetAttribute(…, CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED) to query support for host-allocated DMA-BUF sharing.
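With NVML and nvidia-smi now available on Jetson Thor, the same monitoring code used on data-center GPUs largely carries over. The following is a minimal sketch, not taken from the article, assuming the NVML header and library are installed (link with -lnvidia-ml); which individual queries succeed on Thor may vary, per the limitations noted under Cons below.
#include <nvml.h>
#include <stdio.h>
int main() {
    // Initialize NVML and grab the first GPU (device index 0 is an assumption)
    if (nvmlInit_v2() != NVML_SUCCESS) { printf("NVML init failed\n"); return 1; }
    nvmlDevice_t device;
    if (nvmlDeviceGetHandleByIndex_v2(0, &device) == NVML_SUCCESS) {
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        if (nvmlDeviceGetName(device, name, sizeof(name)) == NVML_SUCCESS)
            printf("GPU 0: %s\n", name);
        // GPU/memory utilization; not every query is supported on every platform
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(device, &util) == NVML_SUCCESS)
            printf("gpu util %u%%, mem util %u%%\n", util.gpu, util.memory);
    }
    nvmlShutdown();
    return 0;
}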
Key features
- Unified CUDA toolkit for Arm across server-class SBSA targets and embedded Jetson Thor (Orin remains on current path).
- Unified Virtual Memory (UVM) with full coherence on Jetson Thor; host memory accessible via host page tables.
- OpenRM/dmabuf interoperability for memory sharing with external drivers and stacks.
- NUMA support for Tegra platforms to improve memory placement on multi-socket systems.
- Multi-Process Service (MPS) to consolidate lightweight workloads into a single GPU context; easier adoption with no app changes.
- Green contexts to pre-allocate SMs for deterministic execution; they can be combined with MPS and, in the future, MIG (a minimal sketch follows after this list).
- Improved developer tooling: nvidia-smi and NVIDIA Management Library (NVML) on Jetson Thor.
- Shared container lineage across simulation and edge deployment to reduce rebuilds and CI overhead.
- Concurrent iGPU/dGPU usage on Jetson and IGX platforms for unified compute experiences.
- OpenRM-based memory sharing: import/export dmabuf using CUDA External Resource Interoperability.
- CUDA-related memory sharing flows with EGL/NvSci on Tegra platforms as part of a broader interoperability story.
- CUDA 13.0 memory visibility improvements: pageable host memory can be mapped into the GPU address space, and cudaMallocManaged allocations are not cached on the GPU, keeping coherence behavior consistent.
- MPS binaries: nvidia-cuda-mps-control and nvidia-cuda-mps-server under /usr/bin; MPS client runs with the same pipe/log dirs as the daemon.
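As referenced in the green contexts bullet above, a green context carves out a fixed group of SMs before any work is launched, so streams created from it run only on that reserved slice. Below is a minimal CUDA Driver API sketch, not taken from the article: device index 0, the 8-SM split size, and the absence of error checking are illustrative assumptions, and the SM granularity accepted by the split call varies by GPU.
#include <cuda.h>
#include <stdio.h>
int main() {
    CUdevice dev; CUcontext primary;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&primary, dev);
    cuCtxSetCurrent(primary);
    // Query the device's SM resource, then split off a small fixed group of SMs
    CUdevResource sm_resource, split, remaining;
    cuDeviceGetDevResource(dev, &sm_resource, CU_DEV_RESOURCE_TYPE_SM);
    unsigned int nb_groups = 1;
    cuDevSmResourceSplitByCount(&split, &nb_groups, &sm_resource, &remaining, 0, 8);
    // Build a resource descriptor from the split and create a green context over it
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &split, 1);
    CUgreenCtx green;
    cuGreenCtxCreate(&green, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    // Streams created from the green context run only on the reserved SMs
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, green, CU_STREAM_NON_BLOCKING, 0);
    printf("green-context stream created on a dedicated SM partition\n");
    cuStreamDestroy(stream);
    cuGreenCtxDestroy(green);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}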
Common use cases
- Robotics and edge AI: running SLAM, object detection, and motion planning concurrently with real-time constraints.
- Multi-process applications that wish to share a GPU efficiently without large context-switch overhead.
- Simulation-to-edge workflows: develop and simulate on high-performance systems (e.g., GB200, DGX Spark) and deploy the exact same binaries to embedded targets.
- Real-time, latency-sensitive workloads requiring deterministic SM allocation and resource isolation via green contexts and, in the future, MIG slices.
- Cross-stack memory sharing scenarios using OpenRM/dmabuf for zero-copy integration with third-party device stacks (EGL, NvSci, etc.).
Setup & installation
Notes drawn from the article: MPS ships as two binaries, nvidia-cuda-mps-control and nvidia-cuda-mps-server, typically installed under /usr/bin. To run an application as an MPS client, export the same pipe and log directories as the daemon, then run the application normally; example commands follow below. Logs are written to $CUDA_MPS_LOG_DIRECTORY/control.log and $CUDA_MPS_LOG_DIRECTORY/server.log. For memory sharing, OpenRM/dmabuf interop uses cuMemGetHandleForAddressRange() and cuDeviceGetAttribute(…, CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED).
MPS setup (example commands)
# The two MPS binaries live under /usr/bin
ls -l /usr/bin/nvidia-cuda-mps-control /usr/bin/nvidia-cuda-mps-server
# Set the pipe and log directories shared by the daemon and all clients
export CUDA_MPS_PIPE_DIRECTORY=/path/to/mps/pipe
export CUDA_MPS_LOG_DIRECTORY=/path/to/mps/logs
# Start the MPS control daemon in background mode; it launches
# nvidia-cuda-mps-server on demand when the first client connects
nvidia-cuda-mps-control -d
# Run an application as an MPS client: export the same pipe/log directories
# in the client's environment, then run the application normally
./my_cuda_app
# Logs: control.log and server.log are written under $CUDA_MPS_LOG_DIRECTORY
# Stop the control daemon when finished
echo quit | nvidia-cuda-mps-control
OpenRM/dmabuf interop setup (example concepts)
#include <cuda.h>
#include <stdio.h>
int main() {
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    int host_dma_buf_supported = 0;
    cuDeviceGetAttribute(&host_dma_buf_supported,
                         CU_DEVICE_ATTRIBUTE_HOST_ALLOC_DMA_BUF_SUPPORTED, dev);
    printf("HOST_ALLOC_DMA_BUF_SUPPORTED=%d\n", host_dma_buf_supported);
    // If supported, DMABUF memory can be imported/exported via
    // cuMemGetHandleForAddressRange() and the OpenRM integration described in the article
    return 0;
}
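Building on the query above, the sketch below shows the export direction: a device allocation is turned into a dmabuf fd with cuMemGetHandleForAddressRange(). This is not the article's code; the 1 MiB cuMemAlloc() allocation and the immediate close() of the fd are illustrative assumptions, and real code would hand the fd to another driver or userspace stack instead.
#include <cuda.h>
#include <stdio.h>
#include <unistd.h>
int main() {
    // Set up a context and a device allocation to export (1 MiB is an arbitrary size)
    CUdevice dev; CUcontext ctx; CUdeviceptr dptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);
    size_t size = 1 << 20;
    cuMemAlloc(&dptr, size);
    // Export the address range as a dmabuf file descriptor
    int dmabuf_fd = -1;
    CUresult rc = cuMemGetHandleForAddressRange(&dmabuf_fd, dptr, size,
                      CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);
    if (rc == CUDA_SUCCESS) {
        printf("exported dmabuf fd = %d\n", dmabuf_fd);
        close(dmabuf_fd);  // in real code, pass this fd to another driver or stack
    } else {
        printf("dmabuf export not supported on this platform (CUresult %d)\n", (int)rc);
    }
    cuMemFree(dptr);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}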
Quick start: minimal runnable example (mmap-based memory path)
#include <cuda_runtime.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
// Simple CUDA kernel that builds a histogram over a small byte range
__global__ void hist_kernel(const unsigned char* data, unsigned int* hist, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&hist[data[i]], 1);
}
int main(int argc, char** argv) {
    // mmap an input file; with full-coherence UVM on Jetson Thor the GPU can
    // read this pageable host mapping directly through the host page tables
    const char* path = (argc > 1) ? argv[1] : "input.bin";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    size_t sz = (size_t)st.st_size, n = sz;
    unsigned char* data = (unsigned char*)mmap(NULL, sz, PROT_READ, MAP_PRIVATE, fd, 0);
    // Anonymous host mapping for the histogram output (zero-initialized by mmap)
    unsigned int* hist = (unsigned int*)mmap(NULL, 256 * sizeof(unsigned int),
        PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    // Pass the host mappings to the kernel directly; no cudaMemcpy is needed
    const unsigned char* d_data = data;
    unsigned int* d_hist = hist;
    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    hist_kernel<<<blocks, threads>>>(d_data, d_hist, n);
    cudaDeviceSynchronize();
    // Read results (hist now contains counts in host-mapped region)
    for (int i = 0; i < 256; ++i) {
        printf("%d: %u\n", i, hist[i]);
    }
    munmap(data, sz);
    munmap(hist, 256 * sizeof(unsigned int));
    close(fd);
    return 0;
}
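To try the example, a plain nvcc build is enough (the file name histogram.cu and the input file are assumptions; on platforms without pageable host-memory access from the GPU, the mapped pointers would typically need to be registered with cudaHostRegister() first):
nvcc histogram.cu -o histogram
./histogram input.bin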
Pros and cons
- Pros
- Single, unified CUDA toolkit for Arm across embedded Thor and SBSA servers simplifies builds and CI.
- UVM with full coherence enables direct access to pageable host memory from the GPU, reducing explicit copies.
- MPS reduces context-switch overhead for multi-process GPU workloads and helps improve occupancy and throughput.
- Green contexts enable deterministic SM allocation, aiding latency-sensitive modules; future MIG will extend this capability.
- OpenRM/dmabuf interop broadens interoperability with non-CUDA subsystems, enabling zero-copy sharing where supported.
- NUMA support and NVML/nvidia-smi on Jetson Thor provide visibility and performance tuning capabilities familiar to data-center workflows.
- Container image unification reduces CI overhead and helps maintain a single build lineage across simulation and edge deployments.
- Cons
- Orin (sm_87) remains on its current path for now; unification does not cover Orin yet.
- Some features in nvidia-smi, such as clock, power, thermal queries, per-process utilization, and SoC memory monitoring, are not yet available on Jetson Thor.
- While the DMABUF/OpenRM path enables interoperability, the exact level of support and maturity varies by platform and stack.
Alternatives (brief comparisons)
- Maintaining separate toolchains for SBSA servers and embedded devices remains an approach, but it increases CI overhead and code duplication; CUDA 13.0 aims to reduce this fragmentation.
- EGL- and NvSci-based memory sharing and OpenRM/dmabuf interop offer different interoperability paths; OpenRM/dmabuf adds a standardized memory-sharing channel that can work across vendors and open stacks, complementing EGL/NvSci-based approaches on Tegra.
- In environments without unified toolchains, you may rely on platform-specific simulations and cross-compilation workflows, which CUDA 13.0 for Jetson Thor seeks to minimize by enabling the same binary to run on embedded devices.
Pricing or License
Not specified in the source material.