Un-locking Zig
For the Zigtoberfest 2025 I gave a talk about lock-free — or non-blocking — data structures. This page provides a small summary of the talk.
Not in a reading mood? Let me do the talking over on YouTube:
Intro
Do you like cooking? Even with multiple people in the kitchen at the same time? If so, you might have experienced the problem of multiple cooks trying to use the same utensil at once. You are constantly blocking each other from using the single available knife, someone is always cutting the veggies wrong, and in the end, it takes longer than if you had just cooked alone.
It’s the same with programming. When multiple threads try to access the same data structure at the same time, they need to coordinate their access. Usually, this is done with locks like mutexes, semaphores, or read-write locks. And while locks are a simple and often effective solution, it is easy to overlook their flaws.
Just to name a few hidden costs:
- Contention: When multiple threads try to acquire the same lock, they have to wait for each other, leading to performance degradation.
- Deadlocks: If two or more threads are waiting for each other to release locks, they can end up in a deadlock situation where none of them can proceed.
- Priority Inversion: A lower-priority thread holding a lock can block a higher-priority thread, leading to suboptimal scheduling and performance issues.
- Composability: Combining multiple lock-based data structures can lead to complex locking hierarchies, increasing the risk of deadlocks and making reasoning about the code more difficult.
So … can we synchronize access without using locks?
How to go lock-free
To get rid of locks, we need to rethink the critical section. If it were so small that multiple threads couldn’t enter it at the same time, locks would be obsolete. This is where atomic operations come into play.
Atomic operations are hardware-supported, meaning that the CPU directly ensures that no intermediate results are visible to other threads. Zig provides access to those instructions through built-ins like @atomicLoad, @atomicStore, and @cmpxchgStrong. The latter is especially interesting, as it powers most lock-free data structures. We call it Compare And Swap (CAS). This is how it could be implemented non-atomically:
fn cas(comptime T: type, ptr: *T, expected: T, new: T) ?T {
    const old = ptr.*;
    if (old != expected) {
        return old; // failure: report the value we actually saw
    }
    ptr.* = new;
    return null; // success
}
The idea is simple: We want to update the value at ptr to new if, and only if, its current value is expected (so nobody else changed the state underneath us). If the current value is not expected, we fail, change nothing, and return the value we observed; on success, we return null. In reality, Zig’s @cmpxchgStrong will directly compile down to the respective CPU instruction without any of the above logic happening in software.
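This null-on-success return convention takes a moment to get used to, so here is a minimal illustration on a plain integer (orderings chosen conservatively; the test name is ours):

```zig
const std = @import("std");

test "cmpxchgStrong return convention" {
    var x: u32 = 5;
    // Succeeds: x holds 5, so it becomes 6 and null is returned.
    try std.testing.expectEqual(@as(?u32, null), @cmpxchgStrong(u32, &x, 5, 6, .seq_cst, .seq_cst));
    // Fails: x is now 6, not 5; x stays unchanged and the observed value (6) is returned.
    try std.testing.expectEqual(@as(?u32, 6), @cmpxchgStrong(u32, &x, 5, 7, .seq_cst, .seq_cst));
}
```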
Typically, CAS is further composed into a CAS-loop. I’ll show this using a Stack’s push operation:
fn push(self: *Stack, value: T) !void {
    // assumes: head: ?*Node, with Node = struct { value: T, next: ?*Node }
    const new_head = try self.allocator.create(Node);
    new_head.* = Node{
        .value = value,
        .next = null,
    };
    while (true) {
        const old_head = @atomicLoad(?*Node, &self.head, .acquire);
        new_head.next = old_head;
        if (@cmpxchgStrong(?*Node, &self.head, old_head, new_head,
            .release, .acquire) == null) {
            return; // success, exit the function
        }
        // failed, another thread changed head in the meantime; try again
    }
}
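For symmetry, here is what a matching pop could look like. This is a sketch under the same assumed layout (head: ?*Node, nodes with value and next fields). Note that it deliberately does not destroy the popped node: freeing it right away is unsafe without a reclamation scheme, which is exactly where the ABA problem discussed later comes from.

```zig
fn pop(self: *Stack) ?T {
    while (true) {
        // No head? The stack is empty.
        const old_head = @atomicLoad(?*Node, &self.head, .acquire) orelse return null;
        // Try to swing head over to the second node.
        if (@cmpxchgStrong(?*Node, &self.head, old_head, old_head.next,
            .acq_rel, .acquire) == null) {
            // Do NOT destroy old_head here; another thread may still be
            // reading it (see the ABA problem and hazard pointers below).
            return old_head.value;
        }
        // Another thread changed head in the meantime; retry.
    }
}
```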
Now that you have a rough idea of how lock-free data structures work, let’s further look at some properties and considerations.
Freedom
In contrast to lock-based data structures, where arguing about progression guarantees is tricky, non-blocking data structures can fulfill different levels of freeness.
- Obstruction-freedom: The weakest form of freeness. A thread can make progress if it runs in isolation. However, if multiple threads are contending, they might prevent each other from making progress.
- Lock-freedom: A stronger guarantee. At least one thread will make progress in a finite number of steps, regardless of contention. This ensures system-wide progress, but individual threads might still starve.
- Wait-freedom: The strongest guarantee. Every thread will make progress in a finite number of steps, regardless of contention. This ensures both system-wide progress and individual thread progress, eliminating starvation.
With this hierarchy, stronger guarantees automatically entail weaker ones.
Every non-blocking data structure is always at least obstruction-free. Achieving lock-freedom is also easy, as most CAS-loops are lock-free by design. Think about it: A CAS of one thread can only fail if another thread successfully changed the state by progressing. Wait-freedom, however, is harder to achieve and often requires more complex algorithms or additional helping mechanisms. E.g., threads could reserve certain memory slots for themselves to ensure they can always make progress when accessing those. Threads could also help each other by completing each other’s operations (non-blocking queue implementations often use this technique).
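To make the contrast concrete: a shared counter bumped with a single atomic fetch-add is wait-free, because the hardware completes the operation unconditionally and there is no retry loop in which a thread could starve. A minimal sketch:

```zig
var counter: u64 = 0;

/// Wait-free: @atomicRmw compiles down to a single atomic add, so every
/// caller finishes in a bounded number of steps, no matter what the other
/// threads are doing. It returns the previous value of the counter.
fn increment() u64 {
    return @atomicRmw(u64, &counter, .Add, 1, .monotonic);
}
```

Compare this with the CAS-loop in push above, which is only lock-free: an unlucky thread can fail its CAS indefinitely while others keep winning.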
Considerations
Another thing we can reason about is the correct memory ordering of our operations. In essence, we have the problem that the CPU as well as the compiler will reorder instructions to, e.g., hide latencies. While this is usually beneficial for performance, it can lead to unexpected behavior in concurrent programs, as semantic data dependencies might not be obvious to the optimizer.
To have control over this, we can introduce certain types of memory barriers or orderings to our atomic operations. Zig (through LLVM) supports the following:
- unordered: No ordering guarantees. Don’t bother with this.
- monotonic: Operations accessing the same address will happen in a consistent order.
- acquire: Ensures that subsequent reads and writes cannot be reordered before this operation.
- release: Ensures that previous reads and writes cannot be reordered after this operation.
- acq_rel: Combines the effects of both acquire and release.
- seq_cst: Sequentially consistent, the strongest ordering. Ensures a single total order of these operations across all threads.
Know what? Most of the time, this will serve you well: Put a .release on your writes and an .acquire on your reads. When in doubt, go with .seq_cst; it will cost you a bit of performance, but ensure correctness.
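The classic use of this release/acquire pairing is publishing data through a flag. A minimal sketch (the names payload, ready, publish, and consume are illustrative):

```zig
var payload: u32 = 0;
var ready: bool = false;

// Producer: the plain write to payload cannot be reordered
// after the releasing store to ready.
fn publish() void {
    payload = 42;
    @atomicStore(bool, &ready, true, .release);
}

// Consumer: once the acquiring load observes true, it synchronizes
// with the releasing store, so the read of payload is guaranteed
// to see 42.
fn consume() ?u32 {
    if (@atomicLoad(bool, &ready, .acquire)) {
        return payload;
    }
    return null;
}
```

With .monotonic instead, the consumer could see ready == true but still read a stale payload.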
Challenges
While lock-free data structures have many advantages, they also come with their own set of challenges. A common one is the ABA problem. Consider the following illustration:
[Illustration: a driver dozing off at a red light while it cycles to green and back to red]
Here, the driver is waiting at a red light until it turns green. However, since they have to wait for so long, they get sleepy and don’t notice when the light turns green. Only after the light has turned red again do they wake up. From the driver’s perspective, the light never changed, even though it did.
This can also happen in lock-free data structures. Imagine a thread reads a value A from a shared variable. Before it can perform a CAS operation to update the value, another thread changes the value from A to B and then back to A. When the first thread performs the CAS operation, it sees that the value is still A and successfully updates it, unaware that the value had changed in the meantime.
When working with pointers, this can quickly corrupt the data structure, leading to undefined and definitely unwanted behavior.
To mitigate the ABA problem, several techniques can be employed:
- DCAS: Double CAS can atomically compare and swap two memory locations. To mitigate the ABA problem, one location can hold the actual data pointer, while the other holds a version counter. Each time the data is modified, the version counter is incremented. This way, even if the data pointer returns to its original value, the version counter will differ, allowing threads to detect changes.
- Tagged Pointers: Since DCAS is not always available and memory alignment usually leaves some pointer bits unused, we can also use those bits to store a counter. Similar to DCAS, this counter is incremented with each modification, helping to detect changes even if the pointer value returns to its original state.
- Hazard Pointers: A memory reclamation technique where threads announce which pointers they might access. Before reclaiming memory, a thread checks if any other thread has announced a hazard pointer pointing to that memory. If so, the memory is not reclaimed, preventing some ABA related issues.
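As a sketch of the tagged-pointer idea: on common 64-bit targets, user-space addresses fit in the lower 48 bits, so the upper 16 bits can carry a version tag that is bumped on every modification. The helper names and the exact bit split here are illustrative:

```zig
const TAG_SHIFT = 48;
const PTR_MASK: u64 = (@as(u64, 1) << TAG_SHIFT) - 1;

// Pack an address and a version tag into one 64-bit word.
fn pack(addr: u64, tag: u16) u64 {
    return (addr & PTR_MASK) | (@as(u64, tag) << TAG_SHIFT);
}

fn addrOf(word: u64) u64 {
    return word & PTR_MASK;
}

fn tagOf(word: u64) u16 {
    return @truncate(word >> TAG_SHIFT);
}

// A CAS on the packed word now fails if only the tag changed,
// even when the address has returned to its old value.
```

The trade-off: the tag can wrap around, so this narrows the ABA window rather than closing it completely, and it relies on target-specific pointer layouts.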
TL;DR
Now how do I decide whether to use lock-free data structures or not? Here are some guidelines:
- Your application is not performance critical: Don’t bother, slap a Mutex on it
- You want to orchestrate threads and not just protect data: Again, go with locks
- Optimizing for throughput under high contention is your dream: Lock-free it is
- You need real-time guarantees and low latency: Lock-free is the way to go
Don’t be afraid to use locks, but don’t limit yourself to them either!