Monday, May 4, 2015

Memory Usage of String and String-Related Objects in Java

 Introduction:

     Strings are everywhere, and almost every application uses them heavily. For example, in web applications all the data coming from the view arrives as Strings and is converted to its real data types only after it enters the application. Hence it's crucial to understand the memory consumption of String and related objects. In this post we'll examine the history of the String object across Java versions, looking at some String methods such as substring; then we'll work out the memory consumption of the StringBuilder object; and finally we'll see how to canonicalize String objects to reduce the memory occupied by String and related objects.

History of the Well Known String Object:

     We examine the String class in three parts according to Java version: before 1.7.0_06, after 1.7.0_06 and before 1.8, and after 1.8.

Before 1.7.0_06;

     Before 1.7.0_06 the String class has a private char array and three int fields: offset, count and hash. That means a String "A" object occupies 56 bytes (on a 64-bit JVM with compressed references, which gives the 12-byte header and 4-byte reference used here), as explained below. For details of the memory calculations you can refer to this post.

12 header + 4 ref to char[] + (4*3) for 3 int = 28 rounded to 32 bytes; plus
12 header  + 4 len + 1(len of "A") * 2 = 18 rounded to 24 bytes; totally 56 bytes.

This implementation has the offset and count fields so that the char array inside a String object can be shared with the substrings created by the substring method.

Let's see this in the following example, which is run on a Java version before 1.7.0_06.

Example 1:

String s1 = "My String";  
String s2 = s1.substring(3);  
System.out.println(s1); // My String  
System.out.println(s2); // String  

// Needs: import java.lang.reflect.Field;
Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
// Get the char[] that backs s1 and overwrite part of it in place
char[] valueInString = (char[]) field.get(s1);
valueInString[3] = 'J';
valueInString[4] = 'a';
valueInString[5] = 'v';
valueInString[6] = 'a';
valueInString[7] = ' ';
valueInString[8] = ' ';

System.out.println(s1); // My Java  
System.out.println(s2); // Java  

Here we see that the String object s2 obtained from the substring method shares the underlying char array of s1, so any change to the char array of s1 directly affects s2. With this approach a substring is created in constant time, just by setting the offset and count values of s2 and referencing the same char array. At the same time we may get a memory leak if we no longer need s1, since the substring keeps the whole char array alive rather than only the part it uses. To avoid this possible leak we can wrap the String obtained from substring in the new String(string) constructor.
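
The defensive copy can be sketched as follows. This is only an illustration: the class and method names are hypothetical. On a JVM older than 1.7.0_06 the new String(...) call copies just the characters the substring needs; on later versions substring already copies, so the extra call is harmless but unnecessary.

```java
class SubstringCopy {
    // Hypothetical helper: returns a substring that does not keep the
    // source's full char[] alive on JVMs older than 1.7.0_06.
    static String detachedSubstring(String source, int beginIndex) {
        return new String(source.substring(beginIndex));
    }

    public static void main(String[] args) {
        String s1 = "My String";
        String s2 = detachedSubstring(s1, 3);
        System.out.println(s2); // String
    }
}
```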

An important change was made to the String class after 1.7.0_06; let's examine it now.

After 1.7.0_06, Before 1.8;

     After 1.7.0_06 and before 1.8, the String class has a private final char array and two int fields, hash and hash32. The offset and count fields are removed from the implementation, and the char[] is no longer shared with substrings. The char[] is a final field, which signals its immutability, although we can still change its contents through reflection or array manipulation. In this version, the String "A" object occupies 48 bytes, as explained below. For details of the memory calculations you can refer to this post.

12 header + 4 ref to char[] + (4*2) for 2 int = 24 bytes; plus
12 header  + 4 len + 1(len of "A") * 2 = 18 rounded to 24 bytes; totally 48 bytes.

Let's run the code we saw in Example 1 with this String version.

Example 2:

String s1 = "My String";  
String s2 = s1.substring(3);  
System.out.println(s1); // My String  
System.out.println(s2); // String  

// Needs: import java.lang.reflect.Field;
Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
// Get the char[] that backs s1 and overwrite part of it in place
char[] valueInString = (char[]) field.get(s1);
valueInString[3] = 'J';
valueInString[4] = 'a';
valueInString[5] = 'v';
valueInString[6] = 'a';
valueInString[7] = ' ';
valueInString[8] = ' ';

System.out.println(s1); // My Java  
System.out.println(s2); // String  

We see that the change to the char[] of the first String doesn't affect the second String obtained from its substring method. We also see that once reflection is used we can no longer say that String is immutable; immutability holds only as long as the String object is used without reflection.

After 1.8;

     After 1.8 the String class has a private final char array and one int field, hash. The hash32 field, which was used as an alternative hash when there were too many collisions with the default hashing algorithm in HashMap, is removed from the implementation. With Java 8 the HashMap implementation changed: if there are too many collisions in a bucket, HashMap dynamically switches that bucket from a linked list to a balanced tree, an ad-hoc implementation similar to TreeMap. This gives O(log n) lookup within the bucket instead of O(n), since the bucket's elements are now ordered. HashMap orders the elements of the colliding bucket by hash code and, when the keys are Comparable, uses their compareTo methods to break ties between equal hash codes. Thanks to this new implementation, the hash32 field was removed from the String implementation.
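
To make the collision case concrete, here is a small sketch with a deliberately bad key. The class names are hypothetical, and the constant hashCode exists only to force every entry into the same bucket; on Java 8 and later such a bucket is kept as a tree ordered with compareTo.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeys {
    // Hypothetical key whose hashCode always collides.
    static final class Key implements Comparable<Key> {
        final int id;

        Key(int id) { this.id = id; }

        @Override public int hashCode() { return 42; } // every key lands in one bucket

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        // Used by HashMap (Java 8+) to order keys with equal hash codes.
        @Override public int compareTo(Key other) {
            return Integer.compare(id, other.id);
        }
    }

    static Map<Key, Integer> build(int n) {
        Map<Key, Integer> map = new HashMap<Key, Integer>();
        for (int i = 0; i < n; i++) {
            map.put(new Key(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Key, Integer> map = build(1000);
        System.out.println(map.get(new Key(500))); // 500
    }
}
```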

In this version, the String "A" object will occupy 48 bytes as explained below.

12 header + 4 ref to char[] + (4*1) for 1 int = 20 bytes rounded to 24 bytes; plus
12 header  + 4 len + 1(len of "A") * 2 = 18 rounded to 24 bytes; totally 48 bytes.
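
The arithmetic for the three String layouts can be reproduced with a small sketch. It assumes, as the calculations above do, a 12-byte object header, 4-byte references and 8-byte alignment; the class and method names are hypothetical.

```java
class StringSizeEstimate {
    // Round up to the next multiple of 8 bytes (assumed object alignment).
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // char[]: 12 header + 4 length + 2 bytes per char.
    static long charArraySize(int length) {
        return align8(12 + 4 + 2L * length);
    }

    // String: 12 header + 4 ref to char[] + 4 bytes per int field, plus the char[].
    static long stringSize(int intFields, int length) {
        return align8(12 + 4 + 4L * intFields) + charArraySize(length);
    }

    public static void main(String[] args) {
        System.out.println(stringSize(3, 1)); // 56: before 1.7.0_06 (offset, count, hash)
        System.out.println(stringSize(2, 1)); // 48: after 1.7.0_06, before 1.8 (hash, hash32)
        System.out.println(stringSize(1, 1)); // 48: after 1.8 (hash)
    }
}
```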

StringBuilder Memory Consumption:

     Now let's look at StringBuilder in terms of memory usage, using the StringBuilder implementation of the latest Java version covered here (1.8). The StringBuilder class holds a char[] and an int field, count, both declared in its abstract superclass AbstractStringBuilder. Since the default capacity of a StringBuilder is 16, an empty StringBuilder object occupies 72 bytes, as explained below.

12 header + 4 ref to char[] + (4*1) for count = 20 bytes rounded to 24 bytes; plus
12 header  + 4 len + 16 * 2 = 48 bytes; totally 72 bytes.

The initial capacity of a non-empty StringBuilder is 16 + length, and we can read it with the capacity() method. At any time we can shrink the capacity to the current length with the trimToSize() method if we are pretty sure the length won't increase; otherwise the effort is wasted, because the capacity has to be increased again. In the latest implementation, when we append content with the append method, StringBuilder checks whether the required new length (old length + new String's length) is greater than the current capacity of the inner char[], and if so increases the capacity to "old capacity * 2 + 2". If this value is still less than the required minimum capacity, the capacity is set to the required minimum capacity, that is, old length + new String's length.
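
The growth rule can be observed directly through capacity(). The sketch below uses a hypothetical helper class; the values in the comments follow from the rule described above, and other JDK versions may grow differently.

```java
import java.util.Arrays;

class BuilderCapacityDemo {
    // Capacity of an initially empty builder after appending n chars in one call.
    static int capacityAfterAppending(int n) {
        StringBuilder sb = new StringBuilder(); // default capacity 16
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        sb.append(chars);
        return sb.capacity();
    }

    public static void main(String[] args) {
        System.out.println(new StringBuilder().capacity());      // 16
        System.out.println(new StringBuilder("abc").capacity()); // 19 = 16 + 3
        System.out.println(capacityAfterAppending(17));          // 34 = 16 * 2 + 2
        System.out.println(capacityAfterAppending(40));          // 40, because 34 < 40

        StringBuilder sb = new StringBuilder("abc");
        sb.trimToSize(); // shrink the inner array to the current length
        System.out.println(sb.capacity());                       // 3
    }
}
```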

Once we are sure we have finished appending to a StringBuilder object, we can shrink it with the trimToSize() method to reduce the amount of memory the StringBuilder object occupies.

Canonicalization of String:

With canonicalization we can be sure that there is only one instance of an object for each distinct content. The JVM already maintains a canonicalization technique for String objects, which is called pooling. If we define Strings as hardcoded literals such as s = "a", they are pooled automatically. The pooled String objects were kept in the perm gen area before Java 7 and were moved from perm gen to the heap in Java 7 and later versions.

At this point, a little information about perm gen space and String pooling will be helpful. As we said, the String pool was moved from perm gen to the heap in Java 7. In Java 8 an important change was made and the permanent generation was removed completely. Class metadata is now held in native memory, in an area called "Metaspace", while interned Strings and static variables are held in the heap. This way, instead of limiting the memory for class metadata, interned Strings and static variables with -XX:MaxPermSize, the JVM allocates and frees the machine's native memory dynamically for class metadata, and the heap, tuned with the -Xms and -Xmx parameters, is used for static variables and interned Strings, which are subject to garbage collection like normal Java objects. With this change the difficulty of tuning perm gen space is gone.
So, keeping these important changes in mind, we can say that it is important to tune the JVM again when you move to Java 8.
You can limit the Metaspace size with -XX:MaxMetaspaceSize, which has no limit by default; Metaspace has a 21 MB initial size on a 64-bit JVM. To print Metaspace-related statistics we have to set -XX:+UnlockDiagnosticVMOptions.

Note: There is also canonicalization for all the wrapper classes (except Double and Float), implemented as a cached array inside the wrapper classes.
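
As a quick illustration of the wrapper cache, Integer.valueOf returns the same instance for small values. The sketch below uses a hypothetical class name; only the range -128 to 127 is guaranteed to be cached, so nothing is claimed about larger values.

```java
class WrapperCacheDemo {
    // True when two valueOf calls for the same int return the same object.
    static boolean sameInstance(int value) {
        return Integer.valueOf(value) == Integer.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(sameInstance(127));  // true: inside the guaranteed cache range
        System.out.println(sameInstance(-128)); // true: inside the guaranteed cache range
        // Outside the range the result depends on the JVM and its settings.
        System.out.println(sameInstance(100000));
    }
}
```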

If your Strings come from different sources, such as a database or a file, the values are not pooled. In this case you may use the native intern method to force the JVM to pool the String objects. But be careful when calling intern on user-generated Strings, as it may cause a memory leak and an OutOfMemoryError if an attacker sends a large number of different Strings to your application.
Also reconsider using the intern method if you are on Java 6 or earlier, since the pooled Strings are kept in perm gen, and perm gen has a fixed size determined by the -XX:MaxPermSize JVM parameter.

If we use the intern method heavily, on Java 7 and later we should use the -XX:StringTableSize JVM parameter to set the map size of the String pool (the default value is 60013 in Java 8). We have to carefully estimate the maximum number of distinct Strings that the application may hold and set the map size accordingly.
We can also use the -XX:+PrintStringTableStatistics JVM parameter to see the usage of the String pool.

Using the intern method is pretty straightforward, as shown in the example below;

Example 3:

String s1 = readAsStringFromSomeExternalSource(); // say its value is "a"
s1 = s1.intern(); // intern() returns the pooled instance, so the result must be assigned
String s2 = "a";
if (s1 == s2)
   System.out.println("String is pooled by intern method");

When we use the intern method we may use the == operator instead of the equals method, which can give a slight performance advantage. However, the gain is negligible, since the first statement of the equals method also uses the == operator to check for identity, and the call to equals may be inlined by the JVM. Also be careful: if you forget to intern at some point, the == comparison will fail.

For a String pool we may also use our own implementation based on a ConcurrentHashMap or a WeakHashMap.
WeakHashMap is the more suitable choice, since it removes a String from the map when there is no other reference to it, which is also the default behaviour of the JVM String pool. Yes, the String pool is also subject to garbage collection: if there is no live reference to an object in the native JVM String pool, it will be garbage collected. Note that a WeakHashMap entry is only released if its value does not strongly reference the key, so the pooled value should be held through a WeakReference.
If we don't use WeakHashMap, we have to control the size of the map manually, as it may grow too much over time.
Remember also to use a synchronized version of WeakHashMap if you're in a multithreaded environment. You may also use MapMaker from the Google Collections API to get a concurrent counterpart of WeakHashMap.
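
A minimal sketch of such a pool, assuming a synchronized method is enough for thread safety. The class name WeakStringPool and the method canonicalize are hypothetical. The values are wrapped in WeakReference because a WeakHashMap entry is never released while its value strongly references its key.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakStringPool {
    // Keys are held weakly by WeakHashMap; values are weak too, so the
    // pool itself never keeps a String alive.
    private final Map<String, WeakReference<String>> pool =
            new WeakHashMap<String, WeakReference<String>>();

    // Returns the pooled instance equal to s, adding s if none exists yet.
    synchronized String canonicalize(String s) {
        WeakReference<String> ref = pool.get(s);
        String existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        pool.put(s, new WeakReference<String>(s));
        return s;
    }

    public static void main(String[] args) {
        WeakStringPool pool = new WeakStringPool();
        String a = new String("hello");
        String b = new String("hello");
        System.out.println(pool.canonicalize(a) == a); // true: a becomes the canonical instance
        System.out.println(pool.canonicalize(b) == a); // true: b maps to the pooled a
    }
}
```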

Using a Map instead of the intern method means creating one extra String object for every String used in the application, because you have to pass a String object to the custom pool method, and this parameter immediately becomes eligible for garbage collection when an equal String is already pooled. Subsequent passes of the same parameter will be pooled by the JVM. One advantage of a map-based pool over the intern method is that the map can be cleared, whereas there is no way to undo an intern operation.

See the following example;

Example 4:

// Needs: import java.util.concurrent.ConcurrentHashMap; import java.util.concurrent.ConcurrentMap;
private final ConcurrentMap<String, String> stringPool = new ConcurrentHashMap<String, String>(500);

public String getCanonicalString(String param) {
    // If param was created from a literal, the JVM has already put it into the interned
    // String pool, so later calls with the same literal reuse that object instead of
    // creating a new one on the heap. In that case every String added to our pool is
    // also in the interned String pool.
    // putIfAbsent returns the previously pooled instance, or null if param was just added.
    String pooledVersion = stringPool.putIfAbsent(param, param);
    return (pooledVersion == null) ? param : pooledVersion;
}

Remember to clear the stringPool if its size grows too much.

Using byte[] instead of String:

We may use byte[] instead of the String class in some situations to reduce memory usage, although this has limited use: with a single-byte charset each byte can represent only one of 256 characters, not the full Unicode range, so some characters cannot be stored. See the following example;

Example 5:

byte[] arr = new byte[5];
arr[0] = 'a';
arr[1] = 'b';
arr[2] = 'c';
arr[3] = 'd';
arr[4] = 'e';
System.out.println(SizeUtil.fullSizeOf(arr)); // 24 bytes
String s ="abcde";
System.out.println(SizeUtil.fullSizeOf(s));   // 56 bytes

For the implementation of SizeUtil, see this post. You can see that the byte array takes significantly less memory. Until you need the String representation you can keep the byte array, and then use
String s2 = new String(arr, CHARSET_NAME); to obtain the String object. This technique is especially useful when sending objects across the network or in similar situations.
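
The round trip between String and byte[] can be sketched like this. It is an illustration only: the class and method names are hypothetical, and ISO-8859-1 is assumed as the single-byte charset in place of CHARSET_NAME. The last line shows the limitation mentioned above, where a character outside the charset is replaced when encoding.

```java
import java.nio.charset.StandardCharsets;

class ByteArrayText {
    // Encode with one byte per character (assumed charset: ISO-8859-1).
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decode back to a String only when the text form is needed.
    static String toText(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] arr = toBytes("abcde");
        System.out.println(arr.length);  // 5
        System.out.println(toText(arr)); // abcde
        // The euro sign is not part of ISO-8859-1, so it is replaced on encoding.
        System.out.println(toText(toBytes("\u20AC"))); // ?
    }
}
```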

Conclusion:

     Today we saw how String and related objects are kept in memory, and how we can reduce their memory usage and get higher performance. We went through the history of the String object across Java versions and looked at some methods to reduce memory usage. For example, by holding a byte[] instead of a String object we can reduce memory usage by half, or we can trim a StringBuilder object to its current size and reduce memory usage that way. We also have to consider the Java version we're using and be careful with the implementation changes. Finally, we should consider canonicalizing String objects to reduce their memory usage.
