
Rust threads on the GPU


GPU code can now use Rust's threads. We share the implementation approach and what this unlocks for GPU programming.

At VectorWare, we are building the first GPU-native software company. Today, we are excited to announce that we can successfully use Rust's std::thread on the GPU. This milestone marks a significant step towards our vision of enabling developers to write complex, high-performance applications that leverage the full power of GPU hardware using familiar Rust abstractions.

Execution models

CPUs and GPUs execute programs in fundamentally different ways. A CPU program begins on a single thread and spawns additional threads as needed. Each thread runs independently and the programmer controls when and how concurrency is introduced.

[Diagram: main() begins on the main thread; thread::spawn() starts a new thread alongside it; thread::join() merges it back into the main thread; exit()]

GPU programs work differently. A GPU program consists of one or more kernels. Each kernel is launched with many instances that run in parallel. Concurrency is not something the programmer introduces explicitly. It is inherent in the way GPU programs are run by the hardware.

[Diagram: launch_by_cpu() starts kernel() on Warp 0 through Warp N simultaneously; all warps run to exit()]

This model works well for uniform workloads like matrix multiplication, image processing, and graphics rendering where every warp does the same thing to different data.

As GPU programs grow more sophisticated, developers use warp specialization to activate different parts of the same program on different warps concurrently.

[Diagram: launch_by_cpu() starts kernel() on Warp 0 through Warp N; each warp activates a different specialized function a(), b(), c(), d(), ..., n() before exit()]

Functions as programs

Most CPU programming models begin with a main function as the program's entry point. Because execution begins with exactly one thread, representing the program as a function makes sense: the function body describes the work performed by that single thread.

fn main() {
    // Single threaded CPU code
}

Surprisingly, most GPU programming models use a function as their entry point as well. The programmer writes the function as if it executes once but the hardware then launches it thousands of times in parallel. GPU kernels are functions that look like normal CPU functions but behave very differently.

// CUDA C kernel
__global__ void scale(float* data) {
    // This code is executed thousands of times in parallel,
    // each with the same function parameters, but different values for the global
    // indexes such as `blockIdx` and `threadIdx`.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

This mismatch between programming model and execution model is part of what makes GPU programming so hard. A function that runs once has very different semantics from one that runs thousands of times in parallel, yet neither the compiler nor the programmer can easily tell the two apart by looking at the code alone. In practice the programmer is responsible for manually upholding invariants such as indexing correctly into shared data and avoiding data races.

GPU programs written in Rust follow the same pattern and are modeled as functions. Consider the same GPU kernel written in Rust:

use core::arch::nvptx::*;
 
pub unsafe extern "ptx-kernel" fn scale(data: *mut f32) {
    // This code is executed thousands of times in parallel,
    // each with the same function parameters, but different values
    // for the global indexes such as `_block_idx_x` and `_thread_idx_x`.
    let i = (_block_idx_x() * _block_dim_x() + _thread_idx_x()) as usize;
    *data.add(i) *= 2.0;
}

The kernel requires unsafe and takes a *mut f32 raw pointer rather than a reference. Because the GPU runs thousands of instances of this function simultaneously and each instance receives the same pointer, there is no way to express this safely as a function using Rust's ownership model. Rust was designed around the CPU's execution model where fn main runs on a single thread and the language can enforce safety. The GPU's execution model is foreign to the language and the kernel boundary is treated like an FFI boundary: raw pointers, unsafe, and no compiler guarantees. While this works, ideally Rust's safety guarantees would extend to the GPU as well.

We could introduce new types and annotations to capture GPU-specific semantics, but that would create a new programming model that is separate from ordinary Rust. It would require programmers to learn new abstractions and write GPU-specific code. We want GPU code to look like ordinary Rust code that integrates natively with the Rust ecosystem.

We could teach Rust about the GPU execution model and extend the compiler to reason about GPU-specific invariants. That is a long-term project that will take a lot of careful design work. As members of the Rust compiler team we are keen to contribute to this effort, but we want to write safe GPU code in Rust today.

Some safety can be recovered by building a CPU harness that invokes the function in ways similar to how it runs on the GPU. Each GPU instance can be modeled as a CPU thread, which allows tools like miri to explore possible interleavings and check for undefined behavior under Rust's memory model. At VectorWare we use such a harness.
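As a sketch of what such a harness can look like (the names here are illustrative, not our actual harness), each GPU instance becomes a scoped CPU thread operating on a disjoint element, which the borrow checker verifies statically and Miri can check dynamically:

```rust
use std::thread;

// Hypothetical safe equivalent of the `scale` kernel body:
// one "instance" scales one element.
fn scale_instance(elem: &mut f32) {
    *elem *= 2.0;
}

// Harness sketch: model each GPU instance as a CPU thread. The disjoint
// `&mut` borrows produced by `iter_mut` let the borrow checker (and Miri,
// when the harness runs under it) verify there are no data races between
// instances.
fn run_harness(data: &mut [f32]) {
    thread::scope(|s| {
        for elem in data.iter_mut() {
            s.spawn(move || scale_instance(elem));
        }
    });
}

fn main() {
    let mut data = vec![1.0f32, 2.0, 3.0, 4.0];
    run_harness(&mut data);
    println!("{data:?}"); // [2.0, 4.0, 6.0, 8.0]
}
```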

A CPU-based harness may be able to capture some aspects of GPU execution but it will always remain a leaky abstraction. The ideal end-state is for GPU programs to behave like CPU programs so the Rust compiler can reason about the same invariants in both environments. Single-function kernel entry points make this difficult because concurrency is implicit. Models where concurrency is explicit are easier for both the programmer and the compiler to reason about. Threads are one such model.

Why support std::thread on the GPU?

Rust programs use two primary models for concurrency: futures and threads.

In a previous post we demonstrated futures and async/await running on the GPU for the first time. However, when we brought Rust's std to the GPU we did not implement threads. It was unclear how to do so and we already had ergonomic concurrency via async/await for writing GPU-native apps.

Yet much of the Rust ecosystem is built around threads rather than futures. Widely used thread pools such as rayon, async runtimes like tokio, and many libraries for parallelism all depend on std::thread. Supporting threads unlocks a large portion of the existing ecosystem.

Why not map std::thread to GPU threads?

Within a warp, GPUs have many threads (also called lanes). An obvious approach is to map each std::thread to one of these lanes.

[Diagram: launch_by_cpu() starts Warp 0 through Warp N; each warp contains 32 lanes (Lane 0, Lane 1, Lane 2, ...); all run to exit()]

But a GPU "thread" is not what a CPU programmer means by "thread." A GPU thread is a single lane within a warp, more analogous to a SIMD lane on a CPU than an independent execution context.

A CPU thread has its own stack, its own program counter, and can be independently scheduled. GPU lanes do not work this way. Lanes within a warp advance together in lockstep. Mapping std::thread to GPU lanes would violate the semantics that Rust expects.

It would also be slow. When lanes within a warp take different branches, the GPU hardware masks off inactive lanes and can execute each path sequentially. This is called divergence. If thread::spawn() mapped to a lane, the spawned lane and the calling lane would be in the same warp running different code. The hardware might serialize them, negating any concurrency benefit.

[Diagram: within Warp 0, thread::spawn() causes divergence: the parent thread's code runs first, the spawned thread's code runs second, and thread::join() reconverges the warp]

A world first: std::thread on the GPU

Running threads on the GPU is difficult to demonstrate visually because the code looks and runs like ordinary Rust. By design, the same syntax used on the CPU runs unchanged on the GPU. Here is a Rust program that spawns two threads, does work on each, and joins them:

use std::thread;
 
fn main() {
    let a = thread::spawn(|| {
        let mut sum = 0u64;
        for i in 0..1000 {
            sum += i;
        }
        sum
    });
 
    let b = thread::spawn(|| {
        let mut product = 1u64;
        for i in 1..20 {
            product *= i;
        }
        product
    });
 
    let sum = a.join().unwrap();
    let factorial = b.join().unwrap();
 
    println!("sum: {sum}, factorial: {factorial}");
}

Below is a recording of this program running on the GPU. The code is compiled as a GPU kernel and launched on the device. Both threads execute on separate warps and their results are printed from the GPU using our std support.

Implementation

Supporting std::thread on the GPU is enabled by three key observations:

  1. Warps can behave similarly to CPU threads. Each warp has its own program counter, its own register file, and can execute independently from other warps. The GPU's warp scheduler switches between warps to hide latency, much like an OS scheduler switches between CPU threads. Anything computationally a CPU thread can do, a warp can do too.
  2. A GPU kernel does not need to be concurrent at launch. The default GPU execution model starts all warps simultaneously, but nothing requires them to do useful work right away. By starting with a single active warp and enabling others on demand, we recover the CPU's execution model: one thread of control that introduces concurrency explicitly.
  3. Warp specialization is manual thread partitioning with no language support. GPU developers already assign different tasks to different warps. std::thread is the same concept behind a language-provided API with ownership, type checking, and lifetime enforcement built in.

Here is how it works. We map each std::thread to a GPU warp. When a kernel starts, only Warp 0 is active. Warp 0 runs main, just like the main thread on a CPU. All other warps sleep. Calling thread::spawn() wakes a sleeping warp to run the spawned closure. Calling thread::join() blocks the parent warp until the child warp finishes.

[Diagram: kernel() starts Warp 0 through Warp N; Warp 0 runs the main thread while the others check an enabled flag and sleep; the first thread::spawn() enables Warp 1, the second enables Warp 2; thread::join() waits for the children; exit()]
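The mechanism can be modeled on the CPU with ordinary threads and channels. The sketch below is an analogy of this behavior, not our actual runtime, and all names in it are illustrative: sleeping "warps" are worker threads blocked waiting for work, spawning hands one of them a closure, and joining waits on its result.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() -> u64 + Send>;

// CPU analogy of the warp pool. Each "warp" is a worker thread that
// sleeps (blocks on `recv`) until a job is assigned to it.
struct WarpPool {
    jobs: Sender<(Job, Sender<u64>)>,
}

impl WarpPool {
    fn new(warps: usize) -> Self {
        let (jobs, rx) = channel::<(Job, Sender<u64>)>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..warps {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Sleeping warp: block here until spawn() assigns work.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok((job, done)) => {
                        // Run the closure and report the result back.
                        let _ = done.send(job());
                    }
                    Err(_) => break, // pool dropped: warp exits
                }
            });
        }
        WarpPool { jobs }
    }

    // Analog of thread::spawn(): wake a sleeping warp with a closure.
    fn spawn(&self, job: Job) -> Receiver<u64> {
        let (done, handle) = channel();
        self.jobs.send((job, done)).unwrap();
        handle // analog of a JoinHandle: .recv() blocks like join()
    }
}

fn main() {
    let pool = WarpPool::new(4);
    let a = pool.spawn(Box::new(|| (0u64..1000).sum::<u64>()));
    let b = pool.spawn(Box::new(|| (1u64..20).product::<u64>()));
    println!("sum: {}, factorial: {}", a.recv().unwrap(), b.recv().unwrap());
}
```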

Here is the same program annotated to show which warp runs each section of code:

use std::thread;
 
fn main() {
    // Warp 0
    let a = thread::spawn(|| {
        // Warp 1
        let mut sum = 0u64;
        for i in 0..1000 {
            sum += i;
        }
        sum
    });
 
    // Warp 0
    let b = thread::spawn(|| {
        // Warp 2
        let mut product = 1u64;
        for i in 1..20 {
            product *= i;
        }
        product
    });
 
    // Warp 0
    let sum = a.join().unwrap();
    let factorial = b.join().unwrap();
 
    println!("sum: {sum}, factorial: {factorial}");
}

With the warp-as-thread model in place, the rest of std::thread follows naturally. thread::current() returns an identifier for the current warp. thread::sleep() puts the warp to sleep using nanosleep. thread::yield_now() yields the warp back to the scheduler. Thread names, thread IDs, and builder patterns all work as expected.
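For example, the standard builder API carries over unchanged. The snippet below is ordinary std code with nothing GPU-specific in it, which is the point:

```rust
use std::thread;

// Spawn a named thread with the standard builder API and read its name
// back from inside the thread via thread::current().
fn spawned_thread_name() -> Option<String> {
    thread::Builder::new()
        .name("worker".into())
        .spawn(|| thread::current().name().map(String::from))
        .expect("spawn failed")
        .join()
        .unwrap()
}

fn main() {
    println!("spawned thread was named: {:?}", spawned_thread_name());
}
```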

Our implementation targets NVIDIA GPUs but there is nothing CUDA-specific about the approach. Vulkan has subgroups and HIP/ROCm has wavefronts, both of which can be used to implement the same warp-as-thread mapping.

Benefits

A major benefit of this approach is that the programmer does not need to think about warps, blocks, or grids. Just like on the CPU, spawned threads get their own execution context, run different code from their parent, and can be joined. The mechanics of how this happens are abstracted away.

Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur. The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution. Code that needs to go wide across lanes can still do so explicitly within a thread's closure using warp-level intrinsics like warp_shuffle_idx.

Most importantly, with this approach Rust's borrow checker and lifetimes just work. These are the semantics Rust programmers already know and existing code is written against. We are not introducing a new GPU programming model to Rust. We are mapping Rust's programming model onto the GPU. At VectorWare, we are making GPUs behave like a normal Rust platform.

With both std::thread and async/await now working on the GPU, a large portion of the Rust ecosystem becomes viable on GPU hardware. Libraries that use threads for parallelism, async for I/O, or a combination of both can now target the GPU with minimal or no changes. Future blog posts will showcase some of the exciting applications this enables.

That said, we do not think the end goal is merely to run existing CPU-oriented software on the GPU. We are excited to write new GPU-native applications that take advantage of the hardware in ways CPU-native software cannot.

Downsides

Warps are a finite resource. Spawning too many threads will exhaust available warps and require either queuing or failing. In practice, this is less of a concern than it might seem because most Rust code already adapts to hardware. The standard library provides std::thread::available_parallelism() to query the number of threads the hardware can run concurrently. On the GPU, we make this function return the number of available warps.
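Code that sizes its worker pool from available_parallelism() therefore adapts to the warp count automatically. A minimal sketch:

```rust
use std::thread;

// Size a worker pool from what the hardware reports: OS threads on a CPU,
// warps under the GPU mapping described above.
fn pool_size() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    let n = pool_size();
    // Spawn exactly as many workers as the platform can run concurrently.
    let handles: Vec<_> = (0..n).map(|i| thread::spawn(move || i * i)).collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{n} workers computed sum of squares: {total}");
}
```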

Thread synchronization on the GPU is more expensive than on the CPU. Mutexes, condition variables, and other blocking primitives require warp-level scheduling decisions that can reduce occupancy and throughput.

The warp-as-thread model means each spawned thread consumes an entire warp. If the spawned code does not use all lanes, GPU hardware is underutilized. This is acceptable for complex, divergent workloads but wasteful for simple parallel tasks where data parallelism would be more appropriate.

Warps that have not yet been assigned a thread sit idle, consuming hardware resources without doing useful work. We have mitigations for this and will be discussing our approach in a future post.

Finally, stack memory for threads must be allocated from GPU memory, which is more constrained than CPU memory. Deep call stacks or many concurrent threads can exhaust available memory. The default CUDA stack size is too small for most complex applications and must be raised via cudaDeviceSetLimit with cudaLimitStackSize.

Is VectorWare only focused on Rust?

We completed this work months ago. The speed at which we are able to make progress on the GPU is a testament to the power of Rust's abstractions and ecosystem.

As a company, we understand that not everyone uses Rust. Our future products will support multiple programming languages and runtimes. However, we believe Rust is uniquely well suited to building high-performance, reliable GPU-native applications and that is what we are most excited about.

Follow along

Follow us on X, Bluesky, LinkedIn, or subscribe to our blog to stay updated on our progress. We will be sharing more about our work in the coming months. You can also reach us at hello@vectorware.com.