As we did with our previous guest post, we asked our friends Martijn Verburg and Kirk Pepperdine at jClarity if we could share this great blog post with RebelLabs readers. The original post, Why didn’t my app throw an OutOfMemoryError?, was crafted by Kirk back in January, and we are posting it here with their permission.
Every once in a while I run into someone that has a Hotspot JVM that is running back to back garbage collections and yet the heap is still almost full after each attempt! When they discover that their problem is related to the JVM not having enough memory they often ask the question, why didn’t the JVM throw an
OutOfMemoryError (OOME)? After all, my application is not making any forward progress and the reason is Java heap is almost completely exhausted, right?
First things first, if you’re lucky, you may never realize that your JVM is running a Ponzi scheme on you. It’s telling your application that you have a infinite amount of memory so go ahead and just use it. The JVM’s garbage collectors are responsible for maintaining this illusion that all’s well in memory land.
It’s not the only Ponzi scheme in town mind you! This behaviour is similar to how many O/S’s deal with memory allocations for native applications. You’ll keep on allocating native memory until you hit swap and then you’re in a bad way – not completely paused but fairly close to it due to low disk speed compared to main memory.
In order to maintain this illusion the garbage collector has to magically freeze time using a “safe-pointing” spell so that the application doesn’t even realize what happened. However, as much as time may have stopped for your application, time hasn’t stopped for those using your application. If your application keeps a lot of live data the garbage collector is going to have to work much harder to maintain the illusion. While your application may not notice how often or for how long it’s been frozen for, your users will!
Because your application trusts the JVM and the JVM will work hard, heroically (even foolishly) to maintain the illusion when it’s clear the gig is up leaving people wondering, “why didn’t my app throw an OOME?”
An example using a ParNew Young collection (with CMS)
Lets follow a GC log fragment to see if we can sort things out, here we start at the beginning of about 10 seconds worth of log.
85.578: [GC 85.578: [ParNew: 17024K->2110K(19136K), 0.0352412 secs] 113097K->106307K(126912K), 0.0353280 secs]
This is a normally completing parallel collection of Young most likely triggered by an allocation failure in Eden. Let’s break out the numbers:
- The live bytes left behind total is 106307K
- The live bytes in Survivor space is 2110K
- This implies the occupancy of Old is now at 104197K (106307-2110)
Driving into the numbers a little deeper:
- However total heap size is 126912K
- Of which Young is 19136K
- This implies that Tenured is 107776K
With a little more math we can see that Old is 104197K/107776K or just under 97% occupied, which is dangerously close to being completely full!
The CMS that follows
The next set of records show that the previous ParNew was running inside a CMS cycle and that cycle did manage to complete. Yet the next record following the end of the CMS cycle is another CMS cycle. The reason why, is that the previous cycle recovered 104197K – 101397K = 2800K, which is approximately 2.5% of Tenured leaving the GC just able to continue, but one hiccup from being exposed!
86.306: [CMS-concurrent-abortable-preclean: 0.744/1.546 secs] 86.306: [GC[YG occupancy: 10649 K (19136 K)]86.306: [Rescan (parallel) , 0.0039103 secs]86.310: [weak refs processing, 0.0005278 secs] [1 CMS-remark: 104196K(107776K)] 114846K(126912K), 0.0045393 secs] 86.311: [CMS-concurrent-sweep-start] 86.366: [CMS-concurrent-sweep: 0.055/0.055 secs] 86.366: [CMS-concurrent-reset-start] 86.367: [CMS-concurrent-reset: 0.001/0.001 secs] 86.808: [GC [1 CMS-initial-mark: 101397K(107776K)] 119665K(126912K), 0.0156781 secs]
A Concurrent Mode Failure (CMF) almost seems inevitable at this stage and sure enough…
The following Concurrent Mode Failure (CMF)
The next records show that unfortunately for the collector, that one bad thing has just happened, the CMS concurrent-mark barely gets a chance to do it’s thing and now that pesky ParNew wants to promote a lot of data into tenured and can’t as there isn’t enough space to do so.
86.824: [CMS-concurrent-mark-start] 86.875: [GC 86.875: [ParNew: 19134K->19134K(19136K), 0.0000167 secs]86.875: [CMS87.090: [CMS-concurrent-mark: 0.265/0.265 secs] (concurrent mode failure): 101397K->107775K(107776K), 0.7588176 secs] 120531K->108870K(126912K), [CMS Perm : 15590K->15589K(28412K)], 0.7589328 secs]
To make things worse, the ParNew’s attempts to get memory caused the CMS cycle to fail (concurrent mode failure) and rather than let the application know all is not well and to keep the game going, GC decides to call in the big gun, a Full GC. But even with this move, things aren’t looking good as the Full GC ends up leaving 107775K in Old and it’s only 107776K in size! That’s close to 100% full. However, we still can move on as Young contains 1095K (108870K-107775K) which is all stuffed into a Survivor space. That was a close call but GC still has to scramble if it is to keep the Ponzi scheme from being exposed…
A following Full GC
Another Full GC now occurs to paper over the problem that the JVM has run out of memory. This one happens at 87.734 seconds after JVM startup. The previous pause time was 0.7589328. Add that to 86.875 and you get 87.634 meaning the application got just about 100ms of run time before it was once again interrupted.
But once again the heroic effort buys the GC a wee bit more time before it attempts yet another CMS cycle, a ParNew that fails and reverts (silently in this case) to yet another Full GC all the while stringing along the application into believing all is well when it isn’t.
87.734: [Full GC 87.734: [CMS: 107775K->107775K(107776K), 0.5109214 secs] 111054K->109938K(126912K), [CMS Perm : 15589K->15589K(28412K)], 0.5110117 secs]
And the misery continues
Yet more back to back CMS cycles with concurrent mode failures all indicating that the garbage collector is just able to keep things going but in all honesty you have to start wondering if the cost is too high and if at this point the better thing to do would be to throw some sort of warning or error.
88.246: [GC [1 CMS-initial-mark: 107775K(107776K)] 109938K(126912K), 0.0040875 secs] 88.250: [CMS-concurrent-mark-start] 88.640: [CMS-concurrent-mark: 0.390/0.390 secs] 88.640: [CMS-concurrent-preclean-start] 88.844: [CMS-concurrent-preclean: 0.204/0.204 secs] 88.844: [CMS-concurrent-abortable-preclean-start] 88.844: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] 88.844: [GC[YG occupancy: 11380 K (19136 K)]88.844: [Rescan (parallel) , 0.0109385 secs]88.855: [weak refs processing, 0.0002293 secs] [1 CMS-remark: 107775K(107776K)] 119156K(126912K), 0.0112696 secs] 88.855: [CMS-concurrent-sweep-start] 88.914: [CMS-concurrent-sweep: 0.059/0.059 secs] 88.914: [CMS-concurrent-reset-start] 88.915: [CMS-concurrent-reset: 0.001/0.001 secs] 89.260: [GC 89.260: [ParNew: 19135K->19135K(19136K), 0.0000156 secs]89.260: [CMS: 105875K->107775K(107776K), 0.5703972 secs] 125011K->116886K(126912K), [CMS Perm : 15589K->15584K(28412K)], 0.5705219 secs] 89.831: [GC [1 CMS-initial-mark: 107775K(107776K)] 117010K(126912K), 0.0090772 secs] 89.840: [CMS-concurrent-mark-start] 90.192: [CMS-concurrent-mark: 0.351/0.352 secs] 90.192: [CMS-concurrent-preclean-start] 90.343: [Full GC 90.343: [CMS90.379: [CMS-concurrent-preclean: 0.187/0.187 secs] (concurrent mode failure): 107775K->104076K(107776K), 0.5815666 secs] 126911K->104076K(126912K), [CMS Perm : 15586K->15585K(28412K)], 0.5816572 secs] 90.973: [GC [1 CMS-initial-mark: 104076K(107776K)] 104883K(126912K), 0.0025093 secs] 90.976: [CMS-concurrent-mark-start] 91.335: [CMS-concurrent-mark: 0.359/0.359 secs] 91.335: [CMS-concurrent-preclean-start] 91.367: [CMS-concurrent-preclean: 0.031/0.032 secs] 91.367: [CMS-concurrent-abortable-preclean-start] 92.136: [GC 92.136: [ParNew: 17024K->17024K(19136K), 0.0000167 secs]92.136: [CMS92.136: [CMS-concurrent-abortable-preclean: 0.054/0.769 secs] (concurrent mode failure): 104076K->107775K(107776K), 0.5377208 secs] 121100K->110436K(126912K), [CMS Perm : 15588K->15586K(28412K)], 0.5378416 secs] 92.838: [GC [1 CMS-initial-mark: 107775K(107776K)] 112679K(126912K), 0.0050877 secs] 92.843: [CMS-concurrent-mark-start] 93.209: [CMS-concurrent-mark: 0.366/0.366 secs] 93.209: [CMS-concurrent-preclean-start] 93.425: [CMS-concurrent-preclean: 0.215/0.215 secs] 93.425: [CMS-concurrent-abortable-preclean-start] 93.425: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] 93.425: [GC[YG occupancy: 13921 K (19136 K)]93.425: [Rescan (parallel) , 0.0130859 secs]93.438: [weak refs processing, 0.0002302 secs] [1 CMS-remark: 107775K(107776K)] 121697K(126912K), 0.0134232 secs] 93.439: [CMS-concurrent-sweep-start] 93.505: [CMS-concurrent-sweep: 0.067/0.067 secs] 93.506: [CMS-concurrent-reset-start] 93.506: [CMS-concurrent-reset: 0.001/0.001 secs]
So what exactly is Out of Memory for the JVM?
Defining Out Of Memory
The elephant in the room is that Java heap is too small to support this application. A larger heap would help GC perpetuate the Ponzi scheme indefinitely. While the application may not realize that all is not well, end users surely do. We’d really like the application to complain before the users do. Unfortunately we don’t have an
AlmostOutOfMemoryError but we can redefine when an
OutOfMemoryError is thrown but adjusting the
The default values for
GCTimeLimit is 98% meaning that if 98% of recent time is spent in GC than an OutOfMemoryError will be thrown.
GCHeapFreeLimit is the amount of free heap available after a collection. The default value for this variable is 2%.
If we run the numbers from the GC log fragment we can see that GC does just well enough that it never exceeds
either of these thresholds. Thus the GC is able to keep the Ponzi scheme limping along. And since these values are somewhat arbitrary, one could redefine them to say; if you have to work so hard to maintain the illusion maybe it’s best to just throw in the towel and let the application know that it’s struggling for memory. In this case setting the
GCHeapFreeLimit to 5% and or setting
GCTimeLimit to 90% would have triggered an
OutOfMemoryError. This would have been the biggest hint as to why this application was not responding well and would have helped the victims of the Ponzi scheme know that they’d been had!
Thanks to Kirk, Martijn and jClarity for their great content, check out their website for more info about what they do!