Introduction:
Today I will try to explain the use of volatile variables in Java while touching on the concepts of cache lines, false sharing and the new Java 8 annotation @Contended. We will also see how to use @Contended in Java 9. Let's first look at the cases where we can and cannot use volatile.
Use of volatile:
Volatile variables came into our lives with Java 5. The volatile modifier generally provides a lock-free mechanism for concurrent applications. Below you can find a list of the characteristics of a volatile variable.
1.) Happens-before guarantee: A write to a volatile variable happens before any subsequent read of it.
2.) No reordering: The JVM can normally reorder your code for various reasons like performance, which may cause problems in concurrent applications. If you define a variable as volatile you tell the JVM not to reorder accesses to that variable.
3.) Guaranteed visibility: This is actually a consequence of the happens-before principle. In multithreaded applications running in a multi-CPU environment, each thread may copy variables from main memory into its CPU cache to increase performance. In that case, if one thread changes a variable, another thread may not see the latest version of that variable. To solve that issue we can define the variable as volatile, and any change made by a thread is guaranteed to be visible to all other threads (see the flag sketch after this list).
4.) Atomic reads of long and double variables: This point is about to become obsolete since it is only relevant for 32-bit operating systems or 32-bit Java virtual machines. long and double variables are 64 bits long in Java, and on a 32-bit system there is no way of reading a 64-bit variable at once, so two subsequent reads are needed to get the value. If we use the volatile modifier we can guarantee an atomic read of long and double values. As I said, for 64-bit environments there is no such requirement.
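As a quick illustration of the happens-before and visibility points, here is a minimal sketch of the classic volatile flag pattern (class and field names are mine). Without volatile on running, the JIT could hoist the read out of the loop and the worker might spin forever:
public class VolatileFlag {

    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // Each iteration performs a volatile read, so the update below
            // is guaranteed to become visible to this thread.
            while (running) {
                // busy work
            }
            System.out.println("Worker observed running == false and exits");
        });
        worker.start();
        Thread.sleep(1000);
        running = false; // volatile write, happens-before the worker's next read
    }
}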
Those are the abilities that the volatile modifier adds to Java. On the other hand, we should be aware that the volatile modifier does not give us atomicity; you can only safely use a volatile variable for plain assignments. For example, we cannot use it for a variable which is incremented by multiple threads. Say 2 independent threads read a volatile variable at the same time, each increments the value by one, and each flushes it to main memory since it is a volatile variable. In this case we may observe only one increment, since we did not use any synchronization on the variable. Volatile does not provide that ability; we should use the synchronized keyword, which provides a monitor lock and guarantees ordered access to the variable by multiple threads.
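Here is a small sketch of that lost update (names are mine). It also shows AtomicInteger as one lock-free alternative to synchronized for a plain counter:
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdate {

    private static volatile int volatileCounter = 0;                        // broken for ++
    private static final AtomicInteger atomicCounter = new AtomicInteger(); // correct

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                volatileCounter++;               // read-modify-write: three steps, not atomic
                atomicCounter.incrementAndGet(); // single atomic CAS-based increment
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("volatile counter: " + volatileCounter + " (usually less than 200000)");
        System.out.println("atomic counter  : " + atomicCounter.get() + " (always 200000)");
    }
}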
We can use the volatile modifier for common variables like flags between threads, or we can use it in the double-checked locking idiom of the singleton pattern, which is broken without a volatile variable as a result of Java memory model behaviour.
public class DoubleCheckedLocking {

    private volatile Singleton singleton; // volatile prevents returning a half-constructed instance

    public Singleton getSingleton() {
        if (singleton == null) {             // first check, without locking
            synchronized (this) {
                if (singleton == null) {     // second check, under the lock
                    singleton = new Singleton();
                }
            }
        }
        return singleton;
    }
}
At this point let's go into some details about the CPU cache, cache lines and false sharing.
CPU Cache:
Generally speaking, in a multi-core environment each CPU has its own cache, which sits between the CPU and main memory. Although the situation may change depending on the CPU architecture, which differs between vendors, the CPU cache generally consists of 3 caches L1, L2 and L3, where L1 is the closest one to the CPU and L3 is the closest to main memory.
The L1 cache is the fastest one and usually resides on the processor chip. Its capacity is typically between 8KB and 64KB and it uses SRAM, which is faster than the DRAM generally used for main memory.
The L2 cache comes between the L1 cache and the RAM and has a bigger storage capacity, typically between 64KB and 4MB.
The L3 cache comes after the L2 cache and is closer to RAM. The L3 cache generally exists outside the individual cores and is used to keep the data common to multiple cores.
During data fetching, first the L1 cache is looked up, then the L2 cache, then the L3 cache, and in the end, if the data cannot be found in any CPU cache, it is read from main memory.
So we can say that the read performance of a volatile variable will possibly be worse than that of a normal variable, since to read the variable we must go to main memory instead of the CPU cache in many cases.
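If you want to quantify this yourself, a micro-benchmark with JMH is the sound way to do it. Below is a minimal sketch, assuming the jmh-core and jmh-generator-annprocess dependencies are on the classpath; the class and method names are mine and the absolute numbers will differ per CPU and JIT:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// JMH consumes the returned values, so the reads cannot be optimized away.
@State(Scope.Thread)
public class VolatileReadBench {

    private int plainValue = 42;
    private volatile int volatileValue = 42;

    @Benchmark
    public int readPlain() {
        return plainValue; // ordinary read, freely cacheable in a register
    }

    @Benchmark
    public int readVolatile() {
        return volatileValue; // volatile read, must observe the latest value
    }
}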
As an aside, for caching in general there exist different strategies in multi-node environments. For example a replicated cache holds all the keys on all nodes; a distributed cache, on the other hand, holds each key only on some of the nodes to provide redundancy and fault tolerance, which is a more scalable solution. You can find more information on this here.
Now, let's go deeper into CPU caching by examining the cache line and false sharing concepts.
Cache Lines and False Sharing:
CPUs read memory in blocks of bytes, usually 64 bytes, and we call these blocks cache lines. Generally a CPU maintains consistency on a cache line basis, which means that if any single byte of a cache line is changed, the whole cache line is invalidated, and this invalidation takes place on all CPUs in the system. The sketch below hints at that granularity.
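The following rough sketch (my own, not a rigorous benchmark) illustrates the 64-byte line size: a stride of 16 ints skips 64 bytes per step, so it does one sixteenth of the work of stride 1, yet typically takes a comparable amount of time, because both variants touch exactly the same number of cache lines:
public class CacheLineDemo {

    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024]; // 64 MB of ints
        for (int stride : new int[]{1, 16}) {
            long start = System.nanoTime();
            for (int i = 0; i < data.length; i += stride) {
                data[i]++; // every access pulls in a whole 64-byte line
            }
            System.out.printf("stride %2d -> %d ms%n",
                    stride, (System.nanoTime() - start) / 1_000_000);
        }
    }
}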
For example, let's say we have 2 threads which operate on 2 different cores in a multi-core environment. If we have a shared variable x which is read by thread 1 and thread 2, both threads will pull that variable into their CPU cache lines. Then, if thread 1 updates the variable x, thread 1's cache line must be invalidated and thread 2 is also told to invalidate its cache line. This is expected behaviour, and it is called true sharing. If we make the variable x volatile we can guarantee that behaviour; on the other hand, for non-volatile variables, since we do not force the CPU to insert the necessary memory barriers, we may not see the invalidation.
Now say we have an additional variable y in the same cache line as x, thread 1 updates x, and thread 2 only wants to operate on y. Both threads hold the cache line containing both x and y, since memory is read in 64-byte blocks. Even in this case thread 2 is told to invalidate its cache line because of the change to variable x, which thread 2 does not care about. This situation is called false sharing.
Similarly, say we have 2 independent variables x and y which are not shared between thread 1 and thread 2: thread 1 operates on x and thread 2 operates on y. If x and y are still in the same cache line, any change to x or y will invalidate the cache lines in both thread 1's and thread 2's CPU caches, which is also false sharing.
If this false sharing starts to happen many times, our system can suffer performance problems, since a CPU has to wait for the cache line to be reloaded while it could have done many iterations of work in that time. This is called a stall, and it can cause a silent performance problem.
To solve this performance problem and to protect ourselves from false sharing we can try the padding technique, or even better, use the new @Contended annotation which comes with Java 8. Let's explain them in detail.
Padding and @Contended:
The main cause of false sharing is unintentionally sharing variables in cache lines. To prevent it we can try to pad our data structure or variable so that it spans the whole cache line. For example, if we have a 64-byte cache line and an int variable which consumes 4 bytes, we can use 7 dummy long variables and 1 dummy int variable along with our targeted int variable to span the whole 64-byte space (4 + 4 + 7 * 8 = 64 bytes). In that case the cache line will hold only our int variable and some dummy variables. See below:
public class TestPadding {

    private volatile int myValue; // the field we actually use (4 bytes)

    // dummy fields below fill the rest of a 64-byte cache line:
    // 4 (myValue) + 4 (dummyInt) + 7 * 8 (longs) = 64 bytes
    private volatile int dummyInt;
    private volatile long dummyLong1;
    private volatile long dummyLong2;
    private volatile long dummyLong3;
    private volatile long dummyLong4;
    private volatile long dummyLong5;
    private volatile long dummyLong6;
    private volatile long dummyLong7;
}
The problem with this approach is that the JVM could possibly eliminate or reorder the unused fields, or lay the object out on the heap differently than we expect. To reduce this risk we can use the volatile modifier as I did above, but even this does not guarantee that the fields survive in all cases. We should also carefully investigate the machine and JVM details, such as whether it is a 32-bit or 64-bit environment, and determine the real cache line size. We should also be sure how much space our objects occupy by checking the CompressedOops JVM flag and inspecting the data structure in detail (you can see my previous post about actual memory consumption here), none of which is an easy or quick job.
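As one small aid, on Linux the kernel exposes the cache line size of each cache level through sysfs; the sketch below (my own, Linux-only assumption; most current x86 machines print 64) simply reads it:
import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheLineSize {

    public static void main(String[] args) throws Exception {
        // sysfs publishes the coherency line size of the L1 data cache
        byte[] bytes = Files.readAllBytes(Paths.get(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"));
        System.out.println("L1 cache line size: " + new String(bytes).trim() + " bytes");
    }
}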
A better approach is to use the @Contended annotation, which does a similar job in a safer manner by delegating it to the JVM.
As described in JEP 142, we can annotate the fields or classes which we think are candidates for false sharing with the @Contended annotation. JEP 142 generally relies on doing the padding at class load time and touches the allocation code (to be sure objects are allocated correctly aligned), the JIT (to know which allocations need to be aligned) and the GC (to be sure objects remain aligned after GC).
This annotation is a hint for the JVM to place the annotated sections in different cache lines. The result may be padding or a combination of padding with aligned allocation. The side effect is of course increased memory usage, since we are using additional space for the padding.
Note that the Contended annotation does not work on the user classpath by default; it only works for classes on the bootclasspath. So for our own classes we need to add the -XX:-RestrictContended VM argument on JVM startup.
public class TestContended {

    private String str1;

    @Contended
    private String str2;

    @Contended
    private int x;
}
In this example we want to keep str2 in a padded cache line and x in a different padded cache line.
If we want to keep 2 variables in the same cache line we can force it by defining a contention group value for the Contended annotation, like below.
public class TestContended {

    @Contended("test") // fields with the same group value share a cache line
    private String str2;

    @Contended("test")
    private int x;
}
Note that the @Contended annotation is already used in many JDK classes like Thread, ForkJoinPool and ConcurrentHashMap.
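For example, the ThreadLocalRandom support fields of java.lang.Thread sit in one contention group, roughly like the following (paraphrased from the OpenJDK 8 sources, where the annotation is sun.misc.Contended):
// The three fields share the "tlr" group: padded away from the other
// Thread fields, but kept together with each other.
@sun.misc.Contended("tlr")
long threadLocalRandomSeed;

@sun.misc.Contended("tlr")
int threadLocalRandomProbe;

@sun.misc.Contended("tlr")
int threadLocalRandomSecondarySeed;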
We can use the jol tool to see whether the Contended annotation works. Add the following dependency for that.
<dependency>
<groupId>org.openjdk.jol</groupId>
<artifactId>jol-core</artifactId>
<version>0.9</version>
</dependency>
And run the following code for the TestPaddingAndContended example below.
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.vm.VM;

System.out.println(VM.current().details());
System.out.println(ClassLayout.parseClass(TestPaddingAndContended.class).toPrintable());
It will print the following by default (along with the VM details). As we see, no padding took place.
falsesharing.TestPaddingAndContended object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                                 VALUE
      0    12       (object header)                             N/A
     12     4   int TestPaddingAndContended.myVolatileValue1    N/A
     16     4   int TestPaddingAndContended.myVolatileValue2    N/A
     20     4       (loss due to the next object alignment)
Instance size: 24 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
But with the -XX:-RestrictContended JVM argument we will see the following.
falsesharing.TestPaddingAndContended object internals:
 OFFSET  SIZE  TYPE DESCRIPTION                                 VALUE
      0    12       (object header)                             N/A
     12   128       (alignment/padding gap)                     
    140     4   int TestPaddingAndContended.myVolatileValue1    N/A
    144   128       (alignment/padding gap)                     
    272     4   int TestPaddingAndContended.myVolatileValue2    N/A
    276     4       (loss due to the next object alignment)
Instance size: 280 bytes
Space losses: 256 bytes internal + 4 bytes external = 260 bytes total
We see that we have 256 bytes of space loss because of the padding resulting from the Contended annotation.
Let's finish with a concrete and more detailed example. In this example we will use 2 different int variables, and 2 threads that each perform a number of assignment operations on those variables. If we run the code below for the following different cases, we get the results listed.
* 2 non-shared volatile int variables without padding, supposed to be on the same cache line.
  Avg. running time: 3.34 sec
* 2 non-shared volatile int variables with padding, supposed to be on different cache lines.
  Avg. running time: 3.07 sec
* 2 non-shared volatile int variables with @Contended, supposed to be on different cache lines.
  Avg. running time: 1.39 sec
* 2 non-shared non-volatile int variables without padding, supposed to be on the same cache line.
  Avg. running time: 0.11 sec
* 2 non-shared non-volatile int variables with padding, supposed to be on different cache lines.
  Avg. running time: 0.12 sec
As you can see, using volatile variables definitely affects read/write performance when compared with non-volatile variables. On the other hand, when we use custom padding or the @Contended annotation we get much better performance. It is important to note that the 2 threads may or may not be running on different cores, or they may run on different cores some of the time and on the same core the rest of the time; there is no way of forcing a thread to run on a specific CPU core. The aim here is to run the code repeatedly and derive an average result in order to understand the false sharing concept. If we increase the iteration count we can see that padding, and especially the @Contended approach, really improves performance by preventing false sharing and reducing the stall time in the CPU. We see a clearer result with @Contended since it also takes care of allocation on the heap and informs the JIT and GC about the contended objects.
package falsesharing;

import sun.misc.Contended;

/**
 * This class is used to test the false sharing concept.
 * It uses 2 threads changing 2 independent int variables.
 * Variations to test: changing the variables to volatile and
 * non-volatile, marking and unmarking the variables as
 * Contended, and adding 7 long and 1 int dummy variables to
 * span a whole 64-byte cache line for a volatile value.
 * At the bottom of the file you can find various results of
 * the tests with this class.
 *
 * The Contended annotation does not work on the user classpath
 * by default and only works for classes on the bootclasspath,
 * so we need to add the -XX:-RestrictContended VM argument on
 * JVM startup.
 *
 * @author Ali Gelenler
 */
public class TestPaddingAndContended {

    private static final long NUM_OF_ITERATION = 100000000L;

    public static final TestPaddingAndContended INSTANCE = new TestPaddingAndContended();

    private TestPaddingAndContended() {
    }

    @Contended
    private volatile int myVolatileValue1; // 4 bytes

    // Fields used for padding to prevent false sharing.
    // Uncomment to test padding after removing @Contended above.
    // private volatile int dummyInt1;   // 4 bytes
    // private volatile long dummyLong1; // 8 bytes
    // private volatile long dummyLong2;
    // private volatile long dummyLong3;
    // private volatile long dummyLong4;
    // private volatile long dummyLong5;
    // private volatile long dummyLong6;
    // private volatile long dummyLong7;

    @Contended
    private volatile int myVolatileValue2; // 4 bytes

    public static void main(String[] args) {
        Thread thread1 = new Thread(new Thread1Runnable());
        Thread thread2 = new Thread(new Thread2Runnable());
        thread1.start();
        thread2.start();
    }

    public static class Thread1Runnable implements Runnable {
        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                INSTANCE.myVolatileValue1 = (int) i;
            }
            System.out.println("End of thread 1, last value of" +
                    " myVolatileValue1 is " + INSTANCE.myVolatileValue1 +
                    " it took " + (System.nanoTime() - start) + " nanoseconds");
        }
    }

    public static class Thread2Runnable implements Runnable {
        public void run() {
            long start = System.nanoTime();
            long i = NUM_OF_ITERATION;
            while (--i != 0) {
                INSTANCE.myVolatileValue2 = (int) i;
            }
            System.out.println("End of thread 2, last value of" +
                    " myVolatileValue2 is " + INSTANCE.myVolatileValue2 +
                    " it took " + (System.nanoTime() - start) + " nanoseconds");
        }
    }
}
/*
2 non-shared volatile without padding, supposed to be on the same cache line

First Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 3133182407 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 3167983652 nanoseconds
Second Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 3383002049 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 3418189666 nanoseconds
Third Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 3480275412 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 3492587998 nanoseconds
Avg. running time: 3.34 sec
--------------------------------------------------
2 non-shared volatile with padding, supposed to be on different cache lines

First Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 3000347735 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 3032000048 nanoseconds
Second Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 3104793729 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 3143762961 nanoseconds
Third Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 3091972217 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 3106549306 nanoseconds
Avg. running time: 3.07 sec
--------------------------------------------------
2 non-shared volatile with @Contended, supposed to be on different cache lines

First Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 1406085998 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 1684038215 nanoseconds
Second Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 1329552500 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 1374891105 nanoseconds
Third Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 1093849103 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 1489967853 nanoseconds
Avg. running time: 1.39 sec
--------------------------------------------------
2 non-shared non-volatile without padding, supposed to be on the same cache line

First Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 112383522 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 114422239 nanoseconds
Second Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 115202830 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 115687606 nanoseconds
Third Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 105320160 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 106646504 nanoseconds
Avg. running time: 0.11 sec
--------------------------------------------------
2 non-shared non-volatile with padding, supposed to be on different cache lines

First Attempt:
End of thread 1, last value of myVolatileValue1 is 1 it took 113062087 nanoseconds
End of thread 2, last value of myVolatileValue2 is 1 it took 113854150 nanoseconds
Second Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 126021245 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 126020036 nanoseconds
Third Attempt:
End of thread 2, last value of myVolatileValue2 is 1 it took 109028728 nanoseconds
End of thread 1, last value of myVolatileValue1 is 1 it took 109101776 nanoseconds
Avg. running time: 0.12 sec
*/
Update for Java 9:
With Java 9 the Contended annotation moved from the sun.misc package to the jdk.internal.vm.annotation package. The jdk.internal.* packages are not available to us by default. So if we try to compile our example with Java 9 we will get a "package jdk.internal.vm.annotation is declared in module java.base, which does not export it to module java9" error at compile time. To overcome this error we need to add the following command line argument to the javac command.
--add-exports java.base/jdk.internal.vm.annotation=java9
or, if our code stays on the classpath (the unnamed module):
--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED
Note that here we named our module java9 by using a module-info.java on our classpath.
To set the above command in IntelliJ IDEA you can go to Settings -> Build, Execution, Deployment -> Java Compiler and set the command in the "Additional command line parameters" section.
Note that here we used --add-exports instead of --add-opens because we only want access to the public Contended class, whose package is simply not exported by default. Exporting is enough here since the class is public.
If we wanted to access private fields or methods reflectively, we would need to use --add-opens, which must be added to the java command, not the javac command.
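Putting the Java 9 pieces together, a minimal version of the earlier example only needs a different import; the class body stays as before:
// Java 9+: the annotation moved out of sun.misc
import jdk.internal.vm.annotation.Contended;

public class TestContended {

    @Contended
    private volatile int x;
}
Compile it with the --add-exports argument shown above, and still run with -XX:-RestrictContended so the padding takes effect.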
Conclusion:
We have seen that volatile variables have an important place in concurrent multithreaded applications. We should use them carefully to obtain ordering and visibility guarantees. Remember that we cannot use volatile for atomicity purposes, for which we need the synchronized keyword. We also looked at the CPU cache in detail, which uses cache lines to hold data in the L1, L2 and L3 caches so that data can be accessed quickly instead of being read from main memory.
Finally, we saw the false sharing concept, which manifests through cache line invalidations when we unintentionally hold independent variables in the same cache line. To prevent it we may use manual padding, or even better we can try the @Contended annotation that comes with Java 8, which does the padding for us by directing the JVM, JIT and GC. We also saw how to enable Contended in Java 9.
References:
- http://openjdk.java.net/jeps/142
- http://beautynbits.blogspot.com/2012/11/the-end-for-false-sharing-in-java.html
- http://www.cs.wustl.edu/~schmidt/PDF/DC-Locking.pdf
- https://en.wikipedia.org/wiki/Double-checked_locking
- https://docs.oracle.com/cd/E15357_01/coh.360/e15723/cache_intro.htm#COHDG5049
- http://openjdk.java.net/projects/code-tools/jol/