CAS loops rely on the compare-and-swap instruction (here we use C#'s Interlocked class, which abstracts the platform specifics away), which “compares two values for equality and, if they are equal, replaces the first value.” Since we don't want callers of Add() to worry about the operation potentially failing, a loop is used to retry when the CAS fails because some other thread beat us to updating m_Sum.
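For reference, here is a minimal sketch of what such a CAS-based Add() might look like, assuming m_Sum is a float field on the containing class (this mirrors the shape of the snippet discussed above rather than reproducing it exactly):

```csharp
using System.Threading;

class Accumulator
{
    float m_Sum;

    public void Add(float value)
    {
        float currentSum, newSum;
        do
        {
            // Snapshot the shared sum, then compute the new sum locally.
            currentSum = m_Sum;
            newSum = currentSum + value;
        }
        // CompareExchange writes newSum only if m_Sum still equals currentSum;
        // it returns the value m_Sum held, so a mismatch means another thread
        // won the race and we must retry with a fresh snapshot.
        while (Interlocked.CompareExchange(ref m_Sum, newSum, currentSum) != currentSum);
    }
}
```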

This retry loop is, in essence, a “busy-wait” loop, which has a nasty implication for performance scaling: if multiple threads enter the CAS loop at the same time, only one can leave it at a time, serializing the operations the threads are performing. Fortunately, CAS loops generally do an intentionally small amount of work, but contention can still hurt performance badly. As more cores execute the loop in parallel, each thread takes longer to complete the loop while the threads are contending.

Further, because CAS loops rely on atomic reads and writes to shared memory, each iteration typically invalidates the other threads' cached copies of the contended cache line, causing additional overhead. This overhead can be very expensive compared to the cost of redoing the calculation inside the CAS loop (in the case above, redoing the work of adding two numbers together), so the true cost of a CAS loop can be non-obvious at first glance.

Under the 2017.3 job scheduler, worker threads that were not running jobs looked for work in a shared, lock-free stack or queue. Both of these data structures used at least one CAS loop to remove work, so as more cores became available, the cost of taking work from the stack or queue increased whenever the data structures were contended. In particular, when jobs were small, worker threads spent proportionally more of their time looking for work in the queue or stack.
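To make the contention concrete, here is a hypothetical sketch of a lock-free (Treiber-style) stack pop built around a CAS loop; it illustrates the pattern such a work stack would use, not Unity's actual implementation:

```csharp
using System.Threading;

class WorkStack
{
    class Node { public object Item; public Node Next; }
    Node m_Head;

    public bool TryPop(out object item)
    {
        Node head;
        do
        {
            head = Volatile.Read(ref m_Head);
            if (head == null) { item = null; return false; } // no work available
        }
        // If another worker pushed or popped between our read and the CAS,
        // the head no longer matches and we retry with a fresh snapshot.
        while (Interlocked.CompareExchange(ref m_Head, head.Next, head) != head);
        item = head.Item;
        return true;
    }
}
```

Every failed CompareExchange here costs a trip around the loop plus a cache-line transfer between cores, which is exactly the overhead described above.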

In a small project, I’ve generated deterministic job graphs resembling those a typical game might produce for its frame update. Each graph is composed of single jobs and parallel jobs (each parallel job fanning out into 1–100 parallel work items), where each job may have 0–10 job dependencies, and the main thread has occasional explicit sync points where it must wait for certain jobs to finish before scheduling more. If I generate 500 jobs in the job graph and make each take a fixed amount of time to execute (each portion of a parallel job takes this time as well), you can see in the chart below that, as more cores are used, overhead in the job system increases.
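For anyone who wants to build a similar experiment, a sketch of how such a deterministic graph could be generated follows; the names and exact distribution choices here are illustrative assumptions, not the code used for these measurements:

```csharp
using System;
using System.Collections.Generic;

struct JobNode
{
    public int ParallelCount;       // 1 for a single job, up to 100 for a parallel job
    public List<int> Dependencies;  // indices of earlier jobs this job waits on (0-10)
}

static class JobGraphGenerator
{
    public static List<JobNode> Generate(int jobCount, int seed)
    {
        var rng = new Random(seed); // fixed seed makes the graph deterministic
        var graph = new List<JobNode>(jobCount);
        for (int i = 0; i < jobCount; i++)
        {
            var node = new JobNode
            {
                // Half the jobs are single jobs; the rest fan out into 1-100 items.
                ParallelCount = rng.Next(2) == 0 ? 1 : rng.Next(1, 101),
                Dependencies = new List<int>()
            };
            // Depend only on earlier jobs so the graph stays acyclic.
            int depCount = rng.Next(0, Math.Min(11, i + 1));
            for (int d = 0; d < depCount; d++)
                node.Dependencies.Add(rng.Next(i)); // duplicates possible; fine for a sketch
            graph.Add(node);
        }
        return graph;
    }
}
```

Seeding the generator (e.g., JobGraphGenerator.Generate(500, 12345)) yields the same graph on every run, so timings across core counts compare the same workload.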

[Chart: job system overhead versus number of cores used. Source: Unity Technologies Blog]