
Saturday, February 25, 2017

Volatile variable, false sharing and @Contended


 Introduction:

         Today I will try to explain the use of volatile variables in Java while touching on the concepts of cache lines, false sharing and the new Java 8 annotation @Contended. We will also see how to use @Contended in Java 9. Let's first look at where we can and cannot use volatile.

 Use of volatile:

         Volatile variables came into our life with Java 5. They provide a lock-free mechanism for concurrent applications. Below you can find a list of characteristics of a volatile variable.

1.) Happens-before guarantee: A write to a volatile variable happens-before every subsequent read of that variable.

2.) No reordering: The JVM can normally reorder your code for various reasons, like performance, which may cause problems for concurrent applications. If you define a variable as volatile, you tell the JVM not to reorder accesses to that variable.

3.) Guaranteed visibility: This is actually a result of the happens-before principle. In multithreaded applications that operate in a multi-CPU environment, each thread may copy a variable from main memory into the CPU cache to increase performance. In that case, if one thread changes a variable, another thread may not see the latest version of that variable. To solve that issue we can define the variable as volatile, and any change made by one thread is guaranteed to be visible to all of the threads.

4.) Atomic read of long and double variables: This point is becoming obsolete, since it only concerns 32-bit operating systems or 32-bit Java virtual machines. The point is that long and double variables are 64 bits long in Java, and on a 32-bit system there is no way of reading a 64-bit variable in one go, so two separate reads are needed to get its value. If we use the volatile modifier we are guaranteed an atomic read of long and double values. As I said, however, for 64-bit environments there is no such requirement.
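As a minimal sketch of the visibility guarantee (point 3), here is a hypothetical example using a volatile boolean as a stop flag between two threads:

```java
// A volatile boolean used as a stop flag between two threads.
// Without volatile the worker could in theory spin forever, because the
// JIT may hoist the read of a non-volatile flag out of the loop.
public class VolatileFlagDemo {
    private static volatile boolean stop = false;

    static boolean runDemo() {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // busy-wait; every iteration re-reads the latest value of stop
            }
        });
        try {
            worker.start();
            Thread.sleep(50);   // let the worker spin for a moment
            stop = true;        // this write is guaranteed to become visible
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();  // true if the worker saw the flag and exited
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runDemo());
    }
}
```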

Those are the abilities that the volatile modifier adds to Java. On the other hand, we should be aware that the volatile modifier does not give us atomicity; a volatile variable is only safe when we are making a plain assignment. For example, we cannot use it on a variable which is incremented by multiple threads. Say two independent threads read a volatile variable at the same time, each increments the value by one, and each then flushes it to main memory since it is a volatile variable. In this case we may see only one increment, since we did not use any synchronization on the variable. Volatile does not provide that ability; we should use the synchronized keyword, which provides a monitor lock and guarantees ordered access to the variable by multiple threads.
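The lost-increment scenario above can be demonstrated with a small sketch. This example (hypothetical class and field names) contrasts a volatile counter with java.util.concurrent.atomic.AtomicInteger, which is one way to get atomic increments without a synchronized block:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicityDemo {
    static volatile int volatileCount = 0;
    static final AtomicInteger atomicCount = new AtomicInteger();

    // Runs the given task on two threads concurrently and waits for both.
    static void onTwoThreads(Runnable task) {
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // volatileCount++ is a read-modify-write: two threads can read the same
        // value, both add one, and one increment gets lost.
        onTwoThreads(() -> { for (int i = 0; i < 100_000; i++) volatileCount++; });
        // incrementAndGet() uses an atomic compare-and-swap, so no update is lost.
        onTwoThreads(() -> { for (int i = 0; i < 100_000; i++) atomicCount.incrementAndGet(); });
        System.out.println("volatile counter: " + volatileCount + " (expected 200000, often less)");
        System.out.println("atomic counter:   " + atomicCount.get());
    }
}
```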

We can use the volatile modifier for common variables like flags shared between threads, or in the double-checked locking idiom of the singleton pattern, which is broken without a volatile variable as a result of Java memory model behaviour.


public class DoubleCheckedLocking {
    private volatile Singleton singleton;

    public Singleton getSingleton() {
        if (singleton == null) {
            synchronized (this) {
                if (singleton == null) {
                    singleton = new Singleton();
                }
            }
        }
        return singleton;
    }
}

At this point let's go into some details about the CPU cache, cache lines and false sharing.

 
CPU Cache:

     
Generally speaking, in a multi-core environment each CPU has its own cache, which sits between the CPU and the main memory. Although the situation may change depending on the CPU architecture, which differs per vendor, the CPU cache generally consists of 3 levels, L1, L2 and L3, where L1 is the closest to the CPU and L3 is the closest to main memory.

The L1 cache is the fastest one and usually lives on the processor chip. Its capacity is typically between 8 KB and 64 KB, and it uses SRAM, which is faster than the DRAM generally used for main memory.
The L2 cache comes between the L1 cache and the RAM and has a bigger storage capacity, typically between 64 KB and 4 MB.
The L3 cache comes after the L2 cache and is closest to the RAM. Depending on the architecture, it lives on the processor die or on the motherboard, and it is used to keep data common to multiple cores.

When fetching data, first the L1 cache is looked up, then the L2 cache, then the L3 cache, and finally, if the data cannot be found in any CPU cache, it is read from main memory.
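The lookup order can be sketched with a toy software model. This illustrates only the search order described above, not how real hardware works, and all class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheLookupModel {
    // Levels ordered from fastest/smallest (L1) to slowest/largest (L3).
    static final List<Map<Long, byte[]>> levels = new ArrayList<>();
    static final Map<Long, byte[]> mainMemory = new HashMap<>();
    static {
        for (int i = 0; i < 3; i++) levels.add(new HashMap<>());
    }

    static String read(long address) {
        // Check L1, then L2, then L3.
        for (int i = 0; i < levels.size(); i++) {
            if (levels.get(i).containsKey(address)) {
                return "hit in L" + (i + 1);
            }
        }
        // Miss everywhere: load a 64-byte line from main memory and fill the caches.
        byte[] line = mainMemory.getOrDefault(address, new byte[64]);
        for (Map<Long, byte[]> level : levels) level.put(address, line);
        return "miss, loaded from main memory";
    }

    public static void main(String[] args) {
        System.out.println(read(0x40L)); // first access misses
        System.out.println(read(0x40L)); // second access hits in L1
    }
}
```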

So we can say that the read/write performance of a volatile variable will likely be worse than that of a normal variable, since volatile semantics restrict the caching and reordering optimizations the CPU and JVM can apply, and in many cases the value must be fetched from further away than the L1 cache.

As an aside, different caching strategies also exist in multi-node environments. For example, a replicated cache holds all the keys on all nodes, while a distributed cache holds keys only on some of the nodes to provide redundancy and fault tolerance, which gives a more scalable solution. You can find more information on this here.

Now, let's go deeper into CPU caching by examining the cache line and false sharing concepts.

 
Cache Lines and False Sharing:

    CPUs read memory in blocks of bytes, usually 64 bytes; we call such a block a cache line. Generally a CPU maintains consistency on a cache-line basis, which means that if any single byte of a cache line is changed, the whole cache line is invalidated, and this invalidation takes place in all CPUs that hold a copy of that line.

For example, let's say we have 2 threads which operate on 2 different cores in a multi-core environment. If we have a shared variable x which is read by thread 1 and thread 2, both threads will pull the cache line holding that variable into their CPU caches. Then, if thread 1 updates the variable x, thread 1's cache line must be updated and thread 2 is told to invalidate its copy of the cache line. This is expected behaviour and it is called true sharing. If we make the variable x volatile we can guarantee that behaviour; for non-volatile variables, on the other hand, no memory barrier forces the CPU to publish the change, so we may not see the invalidation.

Say we have an additional variable y in the same cache line as x. Say thread 1 updates x and thread 2 only wants to operate on y. Both threads hold the cache line containing both x and y, since we read a 64-byte block of memory. Even in that case thread 2 is told to invalidate its cache line because of the change to variable x, which thread 2 does not care about. This situation is called false sharing.

Say we have 2 independent variables x and y which are not shared between thread 1 and thread 2: thread 1 operates on x and thread 2 operates on y. If x and y still sit in the same cache line, any change to x or y will invalidate the cache line in both thread 1's and thread 2's CPU caches, which is also false sharing.

If this false sharing starts to happen many times, our system can suffer performance problems, since a CPU will need to wait for the cache line to be reloaded while it could have done many iterations of work in that time. This is called a stall, and it can cause a silent performance problem.
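The scenario with two independent variables can be sketched with an array, where the distance between the written slots decides whether they share a cache line. This is only an illustration: the timings are machine-dependent, the JIT may optimize the plain (non-volatile) writes, and the slot indices assume a 64-byte cache line:

```java
public class FalseSharingSketch {
    // Two threads write to two slots of one long[]. Slots 0 and 1 are adjacent
    // and likely share a 64-byte cache line; slots 0 and 16 are 128 bytes apart
    // and should land on different cache lines.
    static long run(int indexA, int indexB) {
        long[] data = new long[32];
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 20_000_000L; i++) data[indexA] = i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 20_000_000L; i++) data[indexB] = i;
        });
        long start = System.nanoTime();
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        System.out.println("adjacent slots: " + run(0, 1) + " ns");
        System.out.println("padded slots:   " + run(0, 16) + " ns");
    }
}
```

Run it repeatedly and average the results, as with the detailed example at the end of this post; a single run proves nothing.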

To solve this performance problem and prevent false sharing, we can try the padding technique or, even better, use the new @Contended annotation which comes with Java 8. Let's explain them in detail.

 
Padding and @Contended:

     
The main cause of false sharing is unintentionally sharing variables in cache lines. To prevent it we can try to pad our data structure or variable to span a whole cache line. For example, if we have a 64-byte cache line and an int variable which consumes 4 bytes, we can use 7 dummy long variables (56 bytes) and 1 dummy int variable (4 bytes) along with our targeted int variable to fill the whole 64-byte space. In that case the cache line will hold only our int variable and some dummy variables. See below;


public class TestPadding {
    private volatile int myValue;     // 4 bytes, the variable we care about
    private volatile int dummyInt;    // 4 bytes of padding
    private volatile long dummyLong1; // 7 longs = 56 bytes of padding
    private volatile long dummyLong2;
    private volatile long dummyLong3;
    private volatile long dummyLong4;
    private volatile long dummyLong5;
    private volatile long dummyLong6;
    private volatile long dummyLong7; // total: 4 + 4 + 56 = 64 bytes
}


The problem with this approach is that the JVM may eliminate the unused fields or reorder the field layout on the heap, so the padding may not end up where we expect. To reduce that risk we can use the volatile modifier as I did above, but even this does not guarantee the layout in some cases. We should also carefully investigate the machine and JVM details, such as whether it is a 32-bit or 64-bit environment, and determine the real cache line size. We should also find out how much space our objects occupy, by checking the CompressedOops JVM option and examining the data structure in detail (you can see my previous post about actual memory consumption here), all of which are difficult and time-consuming jobs.

A better approach is to use the @Contended annotation, which does a similar job in a safer manner by delegating it to the JVM.

As described in JEP 142, we can annotate fields or classes with the @Contended annotation when we think they are candidates for false sharing. JEP 142 is generally based on doing the padding at class load time, and it touches the allocation code (to make sure objects are allocated correctly aligned), the JIT (to know which allocations need to be aligned) and the GC (to make sure objects remain aligned after GC).

This annotation is a hint for the JVM to place the annotated sections in different cache lines. The result may be padding or a combination of padding with aligned allocation. The side effect is of course increased memory usage, since we are using additional space for padding.

Note that the @Contended annotation does not work on the user classpath by default; it only works for classes on the bootclasspath. So we need to add the -XX:-RestrictContended VM argument on JVM startup.

import sun.misc.Contended; // Java 8 location of the annotation

public class TestContended {
      private String str1;

      @Contended
      private String str2;

      @Contended
      private int x;
}

In this example we want to keep str2 in one padded cache line and x in a different padded cache line.

If we want to keep 2 variables in the same cache line, we can force it by defining a contention group value for the @Contended annotation, like below.

public class TestContended {
      @Contended("test")
      private String str2;

      @Contended("test")
      private int x;
}

Note that the @Contended annotation is already used in many JDK classes, like Thread, ForkJoinPool and ConcurrentHashMap.

We can use the jol tool to see whether the @Contended annotation works. Add the following dependency for that.

    <dependency>
        <groupId>org.openjdk.jol</groupId>
        <artifactId>jol-core</artifactId>
        <version>0.9</version>
    </dependency>

And run the following code for the TestPaddingAndContended example below.

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

System.out.println(VM.current().details());
System.out.println(ClassLayout.parseClass(TestPaddingAndContended.class).toPrintable());

It will print the following by default (along with the VM details). As we can see, no padding took place.

falsesharing.TestPaddingAndContended object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                                VALUE
      0    12       (object header)                            N/A
     12     4   int TestPaddingAndContended.myVolatileValue1   N/A
     16     4   int TestPaddingAndContended.myVolatileValue2   N/A
     20     4       (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total


But with the -XX:-RestrictContended JVM arg we will see the following.

falsesharing.TestPaddingAndContended object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                                VALUE
      0    12       (object header)                            N/A
     12   128       (alignment/padding gap)
    140     4   int TestPaddingAndContended.myVolatileValue1   N/A
    144   128       (alignment/padding gap)
    272     4   int TestPaddingAndContended.myVolatileValue2   N/A
    276     4       (loss due to the next object alignment)
Instance size: 280 bytes
Space losses: 256 bytes internal + 4 bytes external = 260 bytes total

We see that we have 256 bytes of space loss because of the padding introduced by the @Contended annotation.

Let's finish with a concrete and more detailed example. In this example we will use 2 different int variables and 2 threads that each perform a number of assignment operations on one of those variables. Running the code below for the following different cases gives these results.

* 2 non-shared volatile int variables without padding, supposed to be on the same cache line.
   Avg. running time: 3.34 sec

* 2 non-shared volatile int variables with padding, supposed to be on different cache lines.
  Avg. running time: 3.07 sec

* 2 non-shared volatile int variables with @Contended, supposed to be on different cache lines.
  Avg. running time: 1.39 sec

* 2 non-shared non-volatile int variables without padding, supposed to be on the same cache line.
  Avg. running time: 1.11 sec

* 2 non-shared non-volatile int variables with padding, supposed to be on different cache lines.
  Avg. running time: 1.16 sec

As you can see, using a volatile variable definitely affects read/write performance when compared with non-volatile variables. On the other hand, when we use custom padding or the @Contended annotation we get much better performance. It is important to note that the 2 threads may or may not be running on different cores; they may run on different cores some of the time and on the same core the rest of the time, and there is no standard Java way of pinning a thread to a specific CPU core. The aim here is to run the code repeatedly and take an average, in order to understand the false sharing concept. If we increase the iteration count we can see that the padding and especially the @Contended approach really improve performance, by preventing false sharing and reducing stall time in the CPU. We see a clearer result with @Contended, since it also takes care of allocation on the heap and informs the JIT and GC about the contended objects.

package falsesharing;

import sun.misc.Contended;

/**
 * This class is used to test the false sharing concept.
 * It uses 2 threads changing 2 independent int variables.
 * Test variations: make the variables volatile or
 * non-volatile, mark or unmark them as @Contended,
 * or add 7 long and 1 int dummy variables to span a
 * whole 64-byte cache line around a volatile value.
 * At the bottom of the class you can find various results
 * of the tests with this class.
 * The @Contended annotation does not work on the user
 * classpath by default and only works for classes on
 * the bootclasspath, so we need to add the
 * -XX:-RestrictContended VM argument on JVM startup.
 * @author Ali Gelenler
 */

public class TestPaddingAndContended {

    private static final long NUM_OF_ITERATION = 100000000L;

    public static final TestPaddingAndContended INSTANCE =
            new TestPaddingAndContended();

    private TestPaddingAndContended() {

    }

    @Contended
    private volatile int myVolatileValue1; // 4 bytes

// fields used for padding to prevent false sharing
// uncomment to test padding after removing @Contended above
// private volatile int dummyInt1;   // 4 bytes
// private volatile long dummyLong1; // 8 bytes
// private volatile long dummyLong2;
// private volatile long dummyLong3;
// private volatile long dummyLong4;
// private volatile long dummyLong5;
// private volatile long dummyLong6;
// private volatile long dummyLong7;

    @Contended
    private volatile int myVolatileValue2; // 4 bytes

    public static void main(String[] args) {
        Thread thread1 = new Thread(new Thread1Runnable());
        Thread thread2 = new Thread(new Thread2Runnable());
        thread1.start();
        thread2.start();
    }

    public static class Thread1Runnable implements Runnable {

        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                INSTANCE.myVolatileValue1 = (int) i;
            }
            System.out.println("End of thread 1, last value of"
                    + " myVolatileValue1 is " + INSTANCE.myVolatileValue1
                    + " it took "
                    + (System.nanoTime() - start) + " nanoseconds");
        }
    }

    public static class Thread2Runnable implements Runnable {

        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                INSTANCE.myVolatileValue2 = (int) i;
            }
            System.out.println("End of thread 2, last value of"
                    + " myVolatileValue2 is " + INSTANCE.myVolatileValue2
                    + " it took "
                    + (System.nanoTime() - start) + " nanoseconds");
        }
    }
}



/**
 2 non-shared volatile without padding, supposed to be
 on the same cache line
 First Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3133182407 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3167983652 nanoseconds
 Second Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3383002049 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3418189666 nanoseconds
 Third Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3480275412 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3492587998 nanoseconds
 Avg. running time: 3.34 sec
 --------------------------------------------------
 2 non-shared volatile with padding, supposed to be
 on different cache line
 First Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3000347735 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3032000048 nanoseconds
 Second Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3104793729 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3143762961 nanoseconds
 Third Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 3091972217 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 3106549306 nanoseconds
 Avg. running time: 3.07 sec
 ----------------------------------------------------
 2 non-shared volatile with @Contended, supposed to be
 on different cache line
 First Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 1406085998 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 1684038215 nanoseconds
 Second Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 1329552500 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 1374891105 nanoseconds
 Third Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 1093849103 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 1489967853 nanoseconds
 Avg. running time: 1.39 sec
 --------------------------------------------------------
 2 non-shared non-volatile without padding, supposed to be
 on the same cache line
 First Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 112383522 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 114422239 nanoseconds
 Second Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 115202830 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 115687606 nanoseconds
 Third Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 105320160 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 106646504 nanoseconds
 Avg. running time: 1.11 sec
 ------------------------------------------------------
 2 non-shared non-volatile with padding, supposed to be
 on different cache line
 First Attempt:
 End of thread 1, last value of myVolatileValue1 is 1
 it took 113062087 nanoseconds
 End of thread 2, last value of myVolatileValue2 is 1
 it took 113854150 nanoseconds
 Second Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 126021245 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 126020036 nanoseconds
 Third Attempt:
 End of thread 2, last value of myVolatileValue2 is 1
 it took 109028728 nanoseconds
 End of thread 1, last value of myVolatileValue1 is 1
 it took 109101776 nanoseconds
 Avg. running time: 1.16 sec
 */

 Update for Java 9:

With Java 9 the Contended annotation moved from the sun.misc package to the jdk.internal.vm.annotation package. The jdk.internal.* packages are not accessible to us by default, so if we try to compile our example with Java 9 we will get the error "package jdk.internal.vm.annotation is declared in module java.base, which does not export it to module java9" at compile time. To overcome this error we need to add the following command line argument to the javac command.
--add-exports java.base/jdk.internal.vm.annotation=java9
or for all versions
--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED

Note that here we named our module java9 using the module-info.java on our classpath.

To set the above argument in IntelliJ IDEA you can go to Settings -> Build, Execution, Deployment -> Java Compiler and set it in the "Additional command line parameters" section.

Note that here we used --add-exports instead of --add-opens because we only want access to the public Contended class, which is not exported by default; since the class is public, exporting is enough. If we wanted to use private fields or methods, we would need --add-opens, which must be added to the java command, not the javac command.
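As a sketch, the full compile and run commands could look like this. The src and out paths are hypothetical and depend on your project layout; the module name java9 matches the module-info.java described above:

```shell
# Compile the module "java9", exporting the non-exported package to it:
javac --add-exports java.base/jdk.internal.vm.annotation=java9 \
      -d out src/module-info.java src/falsesharing/TestPaddingAndContended.java

# At run time, -XX:-RestrictContended is still needed so that @Contended
# is honored for classes outside the bootclasspath:
java -XX:-RestrictContended -cp out falsesharing.TestPaddingAndContended
```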

 Conclusion:

     We have seen that volatile variables have an important place in concurrent multithreaded applications. We should use them carefully to get ordering and visibility guarantees. Remember that we cannot use them for atomicity purposes, for which we need the synchronized keyword. We also examined the CPU cache in detail: it generally uses cache lines to hold data in the L1, L2 or L3 caches to get quick access to data instead of reading it from main memory.
Finally, we saw the false sharing concept, which occurs through cache line invalidation when we unintentionally hold our variables in the same cache line. To prevent it we may use manual padding or, even better, the @Contended annotation that comes with Java 8, which does the padding for us by directing the JVM, JIT and GC. We finally saw how to enable @Contended in Java 9.

You can download the source code for the detailed example here.

References:
  • http://openjdk.java.net/jeps/142
  • http://beautynbits.blogspot.com/2012/11/the-end-for-false-sharing-in-java.html
  • http://www.cs.wustl.edu/~schmidt/PDF/DC-Locking.pdf
  • https://en.wikipedia.org/wiki/Double-checked_locking
  • https://docs.oracle.com/cd/E15357_01/coh.360/e15723/cache_intro.htm#COHDG5049
  • http://openjdk.java.net/projects/code-tools/jol/