Sizing Kubernetes pods for JVM apps without fearing the OOM Killer
Lately at work I’ve been helping out some teams migrating their workloads from on-prem / EC2 infrastructure to Kubernetes. It’s being a good boot camp on Kubernetes.
In this post I will go over one of the issues we faced recently, related to resource assignments for JVM applications. I will explain how running in Kubernetes forces us to think about capacity planning more than we’re used to, and changes some of the assumptions we made before containers.
I will assume familiarity with basic Kubernetes concepts (you know what are nodes, pods, etc..) and the JVM.
Help! My application is getting OOM killed
Resource limits are one of the most frequent pain points for teams new to Kubernetes. Our cluster configuration is relatively (too?) aggresive in promoting small, canonical microservices that scale horizontally. Developers are used to being liberal allocating memory for JVMs that usually run on their very own EC2 instance. Finding themselves suddenly constrained within pods under 5GB often results in various flavours of OOM kills or OutOfMemoryErrors
An engineer from one of our product teams found himself in this situation last week and was asking for help configuring their deployment. The Kubernetes manifest declared:
resources:
limits:
memory: 4Gi
cpu: 1
requests:
memory: 1Gi
cpu: 1
And the relevant part of the Dockerfile
was:
FROM openjdk:8u181-jre-slim
...
ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada
Pods were being OOMKilled at inconvenient times, so he came asking for help in sizing his pods and JVM.
What are pod request
/ limits
in Kubernetes
Let’s first see what are the resources
parameters for.
When we ask Kubernetes to run our application, the Kubernetes
scheduler
looks for nodes in the cluster where our pods can run. It will search
based on multiple criteria, where the most basic is whether a node has
enough memory and CPU to run the container. This is what the parameters
in the resources
section are for. requests
sets the minimum amount
of resources each container in our
Pod
needs to start successfully. Kubernetes will schedule our pods only in
suitable nodes. If there are not enough, then our pod will become
Unschedulable.
In the snippet above, the application declares a minimum of 1Gi of
memory and 1 CPU as requests
.
But requests
is not a hard limit. Once it’s running, if the container
needs additional memory / CPU it can ask the kernel for extra capacity
from its node. This elasticity is useful to handle bursts of load, but
also creates a risk that greedy pods hog too many resources. limits
is
there to control this by fixing a maximum memory / CPU that each
container in the
pod
will be allowed to use.
The Kubernetes documentation has a good
explanation
on the tradeoffs that developers can play with by using requests
and
limits
By configuring memory requests and limits for the Containers that run in
your cluster, you can make efficient use of the memory resources
available on your cluster’s Nodes. By keeping a Pod’s memory request
low, you give the pod a good chance of being scheduled. By having a
memory limit that is greater than the memory request, you accomplish two
things:
- The pod can have bursts of activity where it makes use of memory
that happens to be available.
- The amount of memory a pod can use during a burst is limited to
some reasonable amount
Why is our Pod getting OOM killed?
Coming back to our Dockerfile
, the first thing that calls our attention is:
ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada
The JVM will allocate 1GiB of heap up front, consuming the full capacity
that the container we got through requests
. Given that the JVM needs
additional memory (code cache, off-heap, thread stacks, GC
datastructures..), as does the operating system, our containers are born
undersized.
It’s evident that requests
is too small.
Wait: didn’t Hotspot 1.8 have problems tracking memory in containers?
This is the first thing to suspect when containerized JVMs are getting OOM killed. Our teams have hit this a few times. Long story short, when you use Hotspot JVM at any version below 8u121, you can start a JVM inside a container limited to 2GB of memory, and find that the JVM tries to start with a much larger heap and gets killed. The container is exceeding memory limits, but it’s the JVM’s fault for being too greedy. Why?
The cause of this problem was that the JVM used to retrieve the available memory. It used to look at a system file that contains the host’s memory capacity, not the container’s (or more precisely a kernel’s control group that is used to implement the container.) This issue was dealt with in JDK-8189497. From u181, the Hotspot JVM added a couple of flags that let you instruct the JVM to use the cgroup memory. From u121 this behaviour became default. We’re using a version > u181.
However, this issue was only relevant when the JVM had to calculate the
min/max heap sizes by itself. In our case, since we’re passing the
-Xmx
and -Xms
flags explicitly, so we’re not affected. We can
confirm because both heap committed metrics above stay withn the 1.5GiB,
as we’ll see below.
What would be a reasonable requests
value then?
Here is a snapshot of the application under normal load. I’m showing
the committed
value as this represents memory actually reserved by the
JVM, as opposed to usage
(which = committed - not used).
metric | memory |
---|---|
jvm_memory_committed_bytes{area=”nonheap”,id=”Code Cache”,} | 0.05 GiB |
jvm_memory_committed_bytes{area=”nonheap”,id=”Metaspace”,} | 0.07 GiB |
jvm_memory_committed_bytes{area=”nonheap”,id=”Compressed Class Space”,} | 0.01 GiB |
jvm_memory_committed_bytes{area=”heap”,id=”PS Eden Space”,} | 0.52 GiB |
jvm_memory_committed_bytes{area=”heap”,id=”PS Survivor Space”,} | 0.01 GiB |
jvm_memory_committed_bytes{area=”heap”,id=”PS Old Gen”,} | 1.06 GiB |
total committed | 1.72 GiB |
Notice that our JVM heap is not 1Gib but 1.5GiB. This is to be expected
as the JVM may expand the heap once the application starts doing actual
work and puts pressure on the heap, specially under load. However,
thanks to the explicit -Xmx
flag, we can trust that the heap won’t use
more than 1.5GiB. requests=1Gi
is therefore too low. But still, but
limits=4Gi
should be giving plenty of head room to prevent an OOM
Kill.
What else could be pushing our memory footprint over limits
?
One answer was outside the JVM. We looked at
tmpfs mounts. These
volumes look as ordinary directories in our filesystem, but they
actually reside in memory and thus contribute to the overall memory
consumption. This application is based on Kafka Streams, which uses a
RocksDB
instance for an internal cache that is persisted to disk. The team
decided to use the /tmp
mount to avoid IO overhead, which was a good
idea, but did not account for the implicit tradeoff in additional memory
usage. To make things worse, the /tmp
folder was also storing logs
for the application. All in all, RocksDB and logs added almost 2GiB to
the container’s memory footprint. With limits
set to 4GiB we were
already flirting with the OOM killer.
This is the typical change of assumptions that appears when moving from physical machines or VMs (which tend to be large) to containers (which in our cluster tend to be much smaller.)
Our second measure was to relocate the RocksDB storage to disk (chosing
less memory utilization over IO load) and ensure that the application
kept log files small, and frequently rotated. After this change, the
/tmp
directory stays under 20MiB.
OOM killer keeps acting
While these changes made the situation better for a while, OOM kills kept happening.
So far we’ve just focused on the heap, but we know that the JVM uses memory for other purposes (some of which appear above: Code Cache, Metaspace, Compressed Class Space…). Let’s take a look at the real memory usage of the JVM process.
root@myapp-5567b547f-tk54j:/# cat /proc/1/status | grep Rss
RssAnon: 3677552 kB
RssFile: 15500 kB
RssShmem: 5032 kB
I believe these three add up to the RSS (Resident Set Size), which tells us the memory that our JVM process has in main memory at this point in time. Note that JVM alone is enough to consume 3.6GiB out of our 4GiB limit.
And we didn’t look at the operating system either. In containers
memory information is found under the /sys/fs/cgroup
tree, let’s check
there:
root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.stat
cache 435699712
rss 3778998272
rss_huge 3632267264
shmem 9814016
mapped_file 5349376
dirty 143360
writeback 8192
swap 0
pgpgin 2766200501
pgpgout 2766061173
pgfault 2467595
pgmajfault 379
inactive_anon 2650112
active_anon 3786137600
inactive_file 198434816
active_file 227450880
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 435699712
total_rss 3778998272
total_rss_huge 3632267264
total_shmem 9814016
total_mapped_file 5349376
total_dirty 143360
total_writeback 8192
total_swap 0
total_pgpgin 2766200501
total_pgpgout 2766061173
total_pgfault 2467595
total_pgmajfault 379
total_inactive_anon 2650112
total_active_anon 3786137600
total_inactive_file 198434816
total_active_file 227450880
total_unevictable 0
All these are bytes. The full Resident Set Size for our container is
calculated with the rss
+ mapped_file
rows, ~3.8GiB.
We see that mapped_file
, which includes the tmpfs
mounts, is low
since we moved the RocksDB data out of /tmp
. So in practise, non-JVM
footprint is small (~0.2 GiB). The remaining 3.6GiB are consumed by our
JVM. Our container runs with ~200 MiB to spare, so it’s likely that any
burst in memory usage will tip us over the ‘limit’. cat
/sys/fs/cgroup/memory/memory.failcnt
will tell us how many times we
hit memory usage
limits
Interestingly, while our JVM is consuming 3.6GiB, we just use ~1.72GiB of heap committed according to the metrics we showed above. This means we have almost 2GiB unaccounted for in the JVM. The first suspect should be off-heap memory. A quick look at the JVM metrics discards this:
jvm_buffer_total_capacity_bytes{id="direct",} 284972.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
That’s barely 0.2MiB, so negligible in this JVM. Understanding what in the JVM is consuming the additional 2GiB will take another post.
For now I will wrap up this post by highlighting a couple of things I learned from this exercise, and some action points that we could already apply.
requests
is a guarantee, limits
is an obligation
There is a subtle change of semantics when we go from requests
to
limits
. For the application developer, requests
is a guarantee
offered by Kubernetes that any pod scheduled will have at least the
minumum amount of memory. limits
is an obligation to stay under
the maximum amount of memory, which will be enforced by the kernel.
In other words: containers can’t rely on being able to grow from their
initial requests
capacity to the maximum allowance set in limits
.
This is problematic. Maybe it was the JVM that needed more capacity in order to grow the heap. If it doesn’t get it, we can expect at best a saturated heap with constant GC cycles, application pauses and increased CPU load. At worst, an OutOfMemoryError. Maybe it was the OS that needed more capacity for $purposes. In any case, if we fail to increase memory when needed, we will be implicitly asking the OOM killer to find a process to kill. The JVM has most tickets in that raffle, as it’s by far the biggest memory consumer.
Suggested changes
First, although not related to the OMM kill: just set -Xmx
= -Xms
and ensure your JVM reserves all the heap it will use up-front.
Having both JVM and Docker (or cgroups to be precise) growing memory dynamically just makes it harder to reason about capacity or understanding this type of issues.
Second, requests
should be at least bigger than -Xmx
, and ensure
that to add enough head room for the JVM and the OS. How much more?
Depends on what runs inside your container. A simple microservice will
probably be fine with requests
~%25 higher than -Xmx
. But does your
microservice (or any library inside it) use off-heap memory? Write logs
in a tmpfs
volume? Read/Write volumes shared with other containers in
the pod? All these (and other factors) impact your memory footprint.
You need to know this in with reasonable detail if you want to avoid
surprises.
Third, setting limits
much higher than requests
is meant to handle
bursts of activity where your pod makes use of “memory that happens to
be available”. Notice the wording. If getting that extra memory is a
necessity for your app to do its job, don’t gamble and just reserve it
up-front with requests
.
Fourth, running applications inside containers forces us to have a solid
idea of the memory requirements. I left unanswered why our JVM is using
~3.6GiB despite having 1.5GiB max heap, and no off-heap memory, I’ll try
to write a followup to go deeper into this. But, assuming we know our
memory footprint, my intuition (as a non-Kubernetes expert) is that we
should prefer to have requests
== limits
at least for memory.
One last learning: don’t try to predict OS footprint
I mentioned that we should ensure that our requests
leaves extra
memory for the operating system. My temptation when I first looked at
this data was to use the values we got from
/sys/fs/cgroup/memory/memory.stat
to predict future OS footprint
(~0.2GiB) while running this application. By reading through the
kernel
and Docker
documentation I realized this would probably not be wise.
When the kernel watches for the limits assigned to a container, it
counts all of the RSS, as well as part of the page cache (the cache
row, see see 2.2.1 and
2.3 for
concrete accounting details). The page cache is used by the OS to speed
up access to data in disk by keeping the most accessed pages in memory
(whichcounts for memory utilization). But, what happens if several
containers access the same file? The
Docker docs
explain this clearly:
Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge is split between the control groups. It’s nice, but it
also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.
The memory footprint of our container depends on both its neighbours and
which files on the node’s disk they read. Because containers come and
go, share of the memory accounted to a given container may change. If
we’re already flirting with our limits
this might push us into OOM
Kill territory.
Any thoughts? send me an email!
To get notified on new posts, follow me on Bluesky / Twitter, or subscribe via RSS feed or email:
Archive
- Why aren't we all serverless yet?
- Identifiers are better off without meaning
- Alert on symptoms, not causes
- How about we forget the concept of test types?
- How organisations cripple engineering teams with good intentions
- Migrating an Eureka-based microservice fleet to Kubernetes
- Talk write-up: "How to build a PaaS for 1500 engineers"
- Kubernetes made my latency 10x higher
- Sizing Kubernetes pods for JVM apps without fearing the OOM Killer
- GC forensics by example: multi-second pauses and allocation pressure
- How does the default hashCode() work?
- Frugal memory management on the JVM (Meetup)
- DirectBuffer creation / disposal has hidden contention on sun.misc.Cleaner