Concurrency Hazards: False Sharing

False sharing and how to detect and avoid it.

By Nicholas Butler

Publicly recommend on Google

Note: The demo requires .NET 4.0 Beta 2.

Graph 1: Speedup

Introduction
Nutshell
Details
Demo application
Solution
Detection
- Qualitative
- Quantitative
Points of interest
- Two threads case
Conclusion

Introduction

There are many reasons why a program doesn't scale well across multiple cores. Synchronization is often to blame, or it could be a saturated memory bus. In this article, I will concentrate on another concurrency hazard: false sharing.

In the graph above, the green line shows good scaling for up to 8 cores. There is close to linear speedup, which is as good as it gets for most applications. However, the other three lines show an actual slowdown as more threads are used. This means that as you add more hardware, the wall-clock time to complete the processing actually gets longer.

There is very little difference in the data structures used for each of the four tests, but their performances are dramatically different. In fact, comparing the best and worse times ( the red and green lines for 8 threads ) shows a difference of about 100 times the speed. That is, instead of achieving a speedup of 7.76 x, the slowest case, in fact, shows a slowdown of 12.6 x. In seconds, that is the difference between 8.5 ms and 830 ms.

This article explores the cause of this anomaly and how it can be detected and avoided. I will also explain why one intrinsic data structure in .NET ( Array ) is particularly susceptible.

Nutshell

False sharing is also known as cache-line ping-ponging. It is caused by one or more cores repeatedly invalidating the caches of the other cores, even while accessing isolated state. This forces the other cores to read data from main memory instead of their local cache, which slows them down considerably: in the demo, by up to two orders of magnitude.

Details

Cache lines

The data in a cache is grouped into blocks called cache-lines, which are typically 64 or 128 bytes wide. These are the smallest units of memory that can be read from, or written to, main memory. This works well in most programs as data that is close in memory is often needed close in time by a particular thread. However, this is the root of the false sharing problem.

Cache coherence

When a program writes a value to memory, it goes firstly to the cache of the core that ran the code. If any other caches hold a copy of that cache line, their copy is marked as invalid and cannot be used. The new value is written to main memory, and the other caches must re-read it from there if they need it. Although this synchronization is implemented in hardware, it still takes time. And, of course, reading from main memory takes a few hundred clock cycles by itself.

Modern processors use the MESI protocol to implement cache coherence. This basically means each cache line can be in one of four states:

M odified
E xclusive
S hared
I nvalid

When a core modifies any data in a cache line, it transitions to "Modified", and any other caches that hold a copy of the same cache line are forced to "Invalid". The other cores must then read the data from main memory next time they need it. That's all I need to say about this here. A detailed explanation is available on Wikipedia[^], if you're interested.

False sharing

Imagine two different variables are being used by two different threads on two different cores. This appears to be embarrassingly parallel, as the different threads are using isolated data. However, if the two variables are located in the same cache line and at least one is being written, then there will be contention for that cache line. This is false sharing.

It is called false sharing because even though the different threads are not sharing data, they are, unintentionally, sharing a cache line.

Demo application

The demo basically increments counters in an int[] and measures the time taken in different configurations. For each of the four layouts of the array ( padding and spacing ), it measures the time taken, using from one to Environment.ProcessorCount threads. It then calculates the speedup ( sequential time / concurrent time ) and the efficiency ( speedup / thread count ).

Note: the demo should be built without optimizations, so we know exactly what code is being run.

Basic implementation

Here is a simple version of the test method:

partial class Worker
{
  const int ITERS = ( int ) 1e7;

  public TimeSpan Work( int threadCount )
  {
    // declare the counters
    int[] data = new int[ threadCount ];
    
    // each thread does an equal amount of the work
    int iters = ITERS / threadCount;

    // synchronization 
    var mre = new ManualResetEvent( false );
    var countdown = new CountdownEvent( threadCount );

    // spawn threads 
    for ( int thread = 0 ; thread < threadCount ; thread++ )
    {
      int iThread = thread; // capture 

      new Thread( () =>
      {
        // anchor each thread to a core 
        SetThreadAffinityMask( GetCurrentThread(), new UIntPtr( 1u << iThread ) );

        int offset = iThread;

        mre.WaitOne();

        for ( int x = 0 ; x < iters ; x++ )
            data[ offset ]++; // the 'work' 

        countdown.Signal();
    
      } ) { IsBackground = true }.Start();
    }
    
    var sw = StopWatch.StartNew();
  
    mre.Set();
  
    countdown.Wait();
  
    return sw.Elapsed;
}

Efficiency graph

I will use graphs of efficiency from here on, as it is easier to make comparisons than when using speedup.

Graph 2: Efficiency

No padding, no spacing

The red line shows the performance just using a basic int[]. Below is a diagram of how the counters ( C₀ - C₇ ) are laid out in memory. The numbers below the counters are the offsets in bytes.

As you can see, all the counters together occupy just 32 bytes, so all of them can fit in a single cache line. This means that all eight cores are reading and writing to the same cache line as fast as they can. This causes massive contention, which is shown by the slow time measurements. This is as bad as it gets for false sharing.

No padding, spacing

Let's try to separate the counters so that they are on different cache lines. We can do this by adding space between them in the array.

We do this by altering the basic code just a little:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ threadCount * spacing ];
...
  // int offset = iThread;
  int offset = iThread * spacing;
...

The results of this test are shown in purple in the graphs. As you can see, there is some improvement, but performance is still nowhere near linear efficiency. To understand why this is, I need to take a quick look at the way Array is implemented in .NET.

Array in .NET

Every heap-allocated object has some metadata, similar to the vtable in C++, which is stored in memory just before the actual data. This metadata includes things like an object header and a pointer to a method table. The Array object has some additional metadata. This depends on whether it is a CLR SZARRAY ( single-dimension, zero-based ) or an MDARRAY ( multi-dimension ), but both include the element type and array length.

We are interested in the array size entry. This is read most times your code indexes into the array to check that your index is within the bounds of the array. I say most because for certain code patterns, some of the checks can be optimized away, but it's mostly true. The checks are done so that if your index points outside your array, you will get an IndexOutOfRangeException.

So basically, every time you write data[ offset ], there is a read of the memory just prior to the actual data in the array.

Padding, no spacing

In the code so far, the first counter has always been stored at index 0 of the data array. This is, very probably, in the same cache line as the array size member of the Array metadata. Since one core is continuously writing to this counter, the cache line is being invalidated continuously, which causes contention with all the other threads that are checking the size every time they index into the array. This is false sharing again, and is the reason why just spacing the array didn't work.

So I have added padding before the first counter in the array so that it no longer shares a cache line with the hidden size metadata.

This is achieved by this alteration of the basic code:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ padding + threadCount ];
...
  // int offset = iThread;
  int offset = padding + iThread;
...

These results are shown in blue in the graphs. They are better than the basic array results, but apart from the two threads case, they are worse than the previous results using spacing. See the "Points of Interest" section below for an explanation of the two threads case.

In this test, we have removed the hidden size false sharing, but the counters still share a cache line. This is again causing false sharing and poor performance.

Padding, spacing

The last test case incorporates both the above solutions.

The final code looks like this:

...
  // declare the counters
  // int[] data = new int[ threadCount ];
  int[] data = new int[ padding + ( threadCount * spacing ) ];
...
  // int offset = iThread;
  int offset = padding + ( iThread * spacing );
...

These results are shown in green on the graphs, and give near-linear efficiency. So with this optimization, I have finally removed all the false sharing.

This approach is an optimization that trades space for time performance. In this case, for an eight-core machine, I use about 1 KB of space instead of 32 bytes, but the performance has increased by about a factor of 100.

Mostly readers

The last thing I want to show you with the demo is that false sharing can occur even when you are mostly reading, as long as there is at least one writer. With the check box checked, only the first counter is written and the rest are only read. Here is the results graph:

Graph 3: Mostly readers

The performance is slightly better than when all the threads are writing, but it is still terrible for the three false sharing tests. This shows that it just takes one core, continuously writing to a cache line, to cause enough contention to thwart the other seven cores.

Solution

The previous section on the evolutions of the demo show how to solve the problem of false sharing, and it is quite simple: move hot variables apart in memory by at least the size of a cache line - usually 64 or 128 bytes. This is actually the opposite of the hot / cold optimization technique for sequential programs where you try to put the most frequently accessed data close in memory.

It really is that simple. The "art" is in diagnosing false sharing in the first place and locating the data and code paths that are causing the problem. This is addressed in the next section.

Detection

Qualitative

There are many possible causes of poor performance in concurrent software. As I said, synchronization is often to blame, if you have had to use it. Diagnosing a problem is still more art than science, but the clues are there.

If you suspect poor scaling, you probably have a rough feeling of how fast your code should be executing for linear efficiency. Your wall clock measurements tell you that you are falling short of that. It's a start, but you must also have a good understanding of how your implementation fits together. If you don't, you won't be able to reason about how the measurements you make relate to your code.

The first tool you will use is Task Manager. For false sharing, you will see 100% processor usage across all cores. This contrasts with say, an asynchronous IO implementation, which would show low processor usage. In the Task Manager, you can also check the memory usage, the number of threads, and the number of handles held. These won't be a problem for false sharing, but they can rule out other causes.

Next, it is useful to accurately measure performance as I have done in the graphs: measuring times using from one to all cores. Obviously, to do this properly, you will need a machine that has at least as many cores as your expected clients' machines. For false sharing, you will see efficiencies significantly less than 100%, and which vary a fair bit between test runs. This is in contrast to say, a coarse grained lock convoy, which would show a consistent speedup of 1. That is to say, as you add cores, the wall clock time would stay pretty much constant.

Just to make things interesting, your measurements are unlikely to be as conclusive as those in this article. I have used a very small unit of work ( just incrementing a counter ) to highlight the effects of false sharing. In real code, your false sharing will probably affect a much smaller percentage of the work and so the effect on run times will be masked.

Quantitative

The previous section may sound like a lot of hand-waving, but with experience you can perform a basic triage of possible problems and rank them for further investigation. This section presumes that you suspect false sharing and shows you how to use Visual Studio to verify or eliminate false sharing as a cause.

I am using VS 2010 beta 2, but these profiling tools were available in some versions of VS 2008. There are also a whole load of excellent new debugging tools in VS 2010, but that is a subject for another article.

I will now walk through the process of detecting false sharing. Create a new performance session using sampling and open its properties. In the Sampling page you can change the sample event from "Clock Cycles" to "Performance Counter". This will present you with a bewildering range of counters to choose from. Select either "L2 Misses" ( if it is available ) or "L2 Lines In" as shown here:

Performance properties

You can also set the "Sampling Interval" to something smaller than the default. I have chosen to sample at each 100,000 events.

Now, launch the profiler and exercise the suspect functionality. If false sharing is occurring, you will see a lot of cache misses. This is effectively measuring the number of times a cache line has been marked as "Invalid" because another core has written to it. The summary screen will helpfully show your "Hot Path" like this:

Profile summary

If you click on the offending method, you will be taken to the "Function Details" page. Here, even more helpfully, the offending statement(s) will be highlighted:

Worker.cs

This result has confirmed that the increment statement is causing a lot of cache misses, which is an indication of false sharing. You now know where to look in your code and what to look for. Hopefully, you will be able to relocate the hot data in memory and fix the bug.

Points of interest

Two threads case

The test results for the two threads case look suspiciously good. The reason for this is the hardware cache layout in my machine and the fact that I am P/Invoking SetThreadAffinityMask to anchor each thread to a particular core.

In my machine, each core has 32 KB of L1 data cache, and each pair of cores share 6 MB of unified L2 cache. So, when testing with two threads on cores 0 and 1, they have a shared L2 cache. This means that only this L2 cache needs to be involved in cache coherence for the false-shared cache line. Although the L2 cache is slower than the L1 cache ( 17 cycles instead of 3 cycles ), it is still a lot quicker than main memory ( hundreds of cycles ). This mitigates the false sharing effect, although it is still measurable.

Conclusion

Well, that's about it. False sharing is a real world problem that exists in a lot of concurrent software. As always, prevention is better than cure, and I hope I have raised awareness of this hazard and that you will avoid it altogether. If you do suspect false sharing in existing code, I have shown you how to diagnose and fix it.

I hope you've enjoyed reading this article. Thanks.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License.

Contact

If you have any feedback, please feel free to contact me.

Publicly recommend on Google:

Simply Genius .NET

Software that makes you smile