Galo Navarro

Identifiers are better off without meaning

Wed, 01 May 2024 22:01:00 +0000

Once at Last.fm we had an integer overflow in an identify field. I can’t recall where exactly. But I do remember that the inconvenience of having a bunch of Hadoop jobs disrupted while we rushed to update the relevant type couldn’t spoil the collective pride for having more than 2 billion of whatever needed so many ids.

Being frugal with identifiers is seldom a good idea, but for me the worst identifier-related headaches came from IDs that had semantic value.

At Tuenti (once the largest social network in Spain) there was a concept similar to Facebook pages. Pages had types and subtypes. A page type might have been “group”, which had subtypes “business” or “community”. Another type could be “place” with subtypes like “store” or “landmark”.

Page identifiers were strings composed by concatenating numeric identifiers of the type, subtype, and then an increment field in a DB. If you visited https://tuenti.com/p/3_2_6691 you’d instantly know the meaning. It was a “place” (3) of type “store” (2), and the store ID was 6691. Page IDs decomposed in this way would be useful for multiple purposes. Chosing a type-specific implementation of controllers to compose the relevant page, routing to a database shard, that kind of thing.

At some point Product wanted to change the classification of pages. To their frustration, this became problematic because the entire taxonomy was encrusted across the code all the way from URLs to databases.

Another example are New Relic entities. Entities are an abstraction that broadly represents anything that can send telemetry to New Relic. A host, a Kubernetes cluster, an application, a JVM or a network router can be entities. Of course those are all things you want to identify, so entities have Global Unique IDentifier, or GUID. Every single telemetry datapoint is stamped with the GUID of the entity that produced it, so they act as the keystone of the New Relic platform. Features like Service maps, distributed tracing, entity relationships, and many others are built upon them.

Entity GUIDs have meaning. An example of a GUID is 1|APM|APPLICATION|23 where 1 is the account, APM is the domain, APPLICATION is the unique type within that domain, and 23 is a unique identifier within the domain and type. That application might be running in a host 1|INFRA|HOST|12. If we wanted to store that relation, we’d have an entry in some database saying:

"1|INFRA|HOST|12" RUNS "1|APM|APPLICATION|23".

Like at Tuenti, these semantics come handy. If you’re processing millions of telemetry datapoints per second, it’s useful to tell the type of reporting entity on the fly by decomposing the GUID, rather than perform an expensive lookup to an external service. Account IDs can be used to route data to cells (this talk from Andrew Bloomgarden explains how NR used this pattern to scale).

Domains like INFRA or APM corresponded to the original product verticals at New Relic. Years later Product decided (with good criteria) that they created unnecessary fragmentation on the user experience. Types change, get renamed or merged with others. Sometimes (more often than it seems) entities had to migrate from one account to another. In all these cases you would be altering the identifier of many entities.

It was painful but possible to work around many of these problems. But replacing GUIDs with semantic-free identifiers was straight impossible. By virtue of being present in thousands of URLs, NRQL queries, etc. GUIDs had become a public API that thousands of customers relied upon. A technical solution to replace identifiers would have been a major project, but doable. What wasn’t possible was to run a find/replace across the private documentation and workflows of your entire customer base.

If you look closely, the world is full of semantic identifiers. Sometimes they are hard to avoid, but almost always a pain in the neck. Because they embed a specific model of the world. But models become obsolete faster than we’d like.

Addresses make notable examples. The “complex and idiosyncratic” Japanese address system reflects the organic growth of its urban areas. In British postal codes the final part can designate anything from a street to a flat depending on the amount of mail received by the premises.

When I was a kid, license plates would give up the province where the owner lived causing an array of nuisances. They were alleviated with the adoption of European standards. But partly. The root problem being that, like the Domain Name System identifiers (“Galo’s website”) remain tied to administrative authorities (“.net”), which can change regulations, or even disappear.

Nowadays I find most semantic identifiers in resource management. For some reason, when infrastructure teams define access rules to the resources of a particular service, they prefer to create a group named after the team that owns that service, something like “owner-team-access-list”. Identifiers tied to the org chart don’t like it when the service moves to another team, or owner-team is reorged away.

–

Update: a commenter in Reddit pointed at another great example of problems of identifiers with semantics: the German tank problem. Do send more!

Update: Discussion in HackerNews

Alert on symptoms, not causes

Wed, 06 Mar 2024 12:00:00 +0000

When you are bringing a new system to production you know that you ought to define SLIs, set up instrumentation, alerting, etc. Nowadays there is an abundance of tooling and infrastructure to extract data from your service and the entire stack it runs on. But this leaves you with a problem. What can we do with that data? Should you put all of it on a dashboard, in many? What should trigger an alert and wake you up in the middle of the night? The possibilities are endless, as Lou Reed would sing.

Teams usually start sieving through their data from generalised monitoring dashboards available in every major observability platform. They find the usual charts with resources like CPU, memory, IO, and threads. JVM users get heap and non-heap usage, pauses, and collection times. APIs expose latency, throughput, and HTTP response codes, all exhaustively broken down by endpoint. There is an understandable urge to watch for any source of trouble. All that data seems important, so the next step is setting alert triggers on all of them with whatever thresholds seem reasonable at the time.

What follows is a Niagara of alerts that aren’t quite one. It could be that a sudden CPU spike to 91% crosses the threshold, only to be dismissed as being due to the JVM just cleaning some garbage. The team will try to fine tune the alert (”maybe trigger only if it’s 90% for more than 2 minutes”), only to be hit by another false positive on whatever other of the dozens of alerts. After much stitching, false positives will go down. Then, an actual incident will slip undetected through the patchwork of alert thresholds.

Sometimes teams will treat this as part of the inevitable toil that goes with operating production software, like carrying extra weight in your backpack during the on-call shift.

Being perhaps too lazy, when I build systems I aspire to make operational toil so small that on-call feels like a free bonus. Of course it’s hard to get there because operating software creates problems that sometimes require human effort. But a deluge of noisy instrumentation isn’t one of those problems. Don’t resign yourself to it.

Symptoms before causes

To turn instrumentation data into high signal my default approach is to start by focusing on symptoms before causes.

This often feels counter intuitive to engineers. Our instinct says that high CPU usage is a harbinger for trouble, so it would seem wise to alert when it exceeds that threshold. While there is nothing intrinsically wrong with this, it quickly becomes impractical. There are innumerable other causes for trouble. Just like high CPU usage, you could take each of the dozens of “seems quite important” default metrics produced by your instrumentation framework, and think of thresholds above which things might start going awry.

It’s a case of how an aggregation of good individual decisions sometimes produces a negative outcome. High CPU usage might signal a failure with, for example, 75% success rate. But that also means a 25% chance of being false positive. Every time this alert triggers preemptively you place a burden on the operator to confirm whether it is actually relevant. As you add more metrics for the many other possible causes of trouble, the odds of suffering false positives accumulates to unpleasant levels quite fast. Discomfort is not even the biggest problem with noisy alerts. Lots of alerts with low signal ratio slowly induce your team to ignore them, miss actual incidents and worsen their impact.

Going beyond “just in case” alerting

The cause-first approach is “just in case” alerting. A symptom-first approach is less about probabilities than certainties. It’s not about answering “what might cause a problem” but “what external behaviour manifests that my service is not healthy”.

An example of a symptom could be an API’s latency going through the roof. We don’t know what is causing it, it could be CPU or a million other things. But we are positive that the service is not doing its job as expected, so we know that the PagerDuty horn that’s about to bring the sleeping engineer into cardiac arrest is worth blowing. It is at this point that the knowledge and intuitions about probable causes and the fancy instrumentation you collected from the system become useful and usable. You start with the symptom: latency. Then you move on to check probable causes, like CPU usage or GC activity. If those aren’t the actual causes, you have a treasure trove of telemetry at your disposal to search for anomalies that might explain the problem that you do know you have.

Causes for failure are permanent residents

Another problem with focusing on causes is that after a system reaches a certain level of complexity (by 2024, even the most trivial software sits on top of complex systems), causes for failure are not only diverse. They are also everywhere. This is Richard Cook on “How complex systems fail”:

3. Complex systems contain changing mixtures of failures latent within
them. The complexity of these systems makes it impossible for them to
run without multiple flaws being present. […]

4. Complex systems run in degraded mode. Complex systems run as broken
systems. The system continues to function because it contains so many
redundancies and because people can make it function, despite the
presence of many flaws.

His point is that in a complex system, there are always potential causes of trouble lying somewhere in between the cogs. You can’t blow the horn continuously just in case the potential cause did create an actual problem this time. When something isn’t exceptional, alerting on its presence loses most of its value and, as we noted above, tends to become counter-productive by desensitising operators.

Instead, your system will be better off treating the presence of potential causes of failure as normal operation. It should work despite being exposed to them, and trigger an alert only when it is unable to perform its function and attention from operators is really necessary.

Good SLOs are almost always symptoms

Answers to “what external behaviour manifests that my service is not healthy” tend to overlap with answers to “how would a user tell that the service doesn’t work” or “how would the business tell that we’re making money“. Symptoms get you talking about page loading times, conversions, successful page loads, and so on. Terms closer to user experience and business objectives. They may seem too far away from the guts of the system, but will give you clarity and purpose. It’s almost impossible to figure out how much time per month you can tolerate with >90% CPU utilisation. It’s much easier to figure out what is the acceptable percentage of failed requests.

Being on the same wavelength as the business gives engineers more control over alert fatigue and toil. The handful of symptoms that become SLOs allow you to reduce drastically the amount of alerts we need to configure and attend to.

SLOs also imply an error budget, and knowing how much failure is acceptable in a period of time gives your team the option to keep the PagerDuty horn silent until your error budget is at risk. This doesn’t mean you should become complacent and tolerate errors, but that you can (and should!) administer the demand for your team’s limited energy. Similarly, associating each cause of failure with its effect on SLOs helps prioritise investing in the most impactful (which won’t necessarily be the most frequent!).

Summary

When you define health alerts for your systems it’s more useful to start with symptoms of trouble, rather than potential causes.

Causes usually look at resources (CPU, memory, IO, etc.) or internal processes (GC, thread scheduling, etc.), but the presence of a cause does not guarantee an issue. This adds the burden of confirmation on the operator, which gets multiplied because there are many potential causes of failure. In a complex system, they are also always present. All these factors lead to excess alert noise, fatigue operators, and make them less effective at keeping the system healthy.

Symptoms look at high-level function (requests served in time, data flowing, payments being processed). Symptoms are few, and provide reliable indicators that the system is unable to perform its job. An alert gives near certainty that attention for the operator is necessary. Focus on symptoms aligns engineering priorities with those of the business, and helps define SLOs and error budgets that guide engineering towards more effective and efficient use of their effort, and gives control over operational toil.

(Some) further reading

“Love (and Alerting) in the Time of Cholera (and Observability)”.
“Choosing good SLIs”.
Google’s SRE bookshelf.

How about we forget the concept of test types?

Tue, 06 Feb 2024 21:00:00 +0000

I have found that the concept of test types (unit, integration, and so on) does more harm than good. People often give me odd looks when I say this, so having the explanation in a URL will come handy for future reference. I will use it as an introduction to a series of practical case studies on the topic of testing software.

Test types shape the way engineers and even adjacent professionals reason about testing software. This goes far beyond how an engineer may decide to test a single change. It also influences how teams and entire organisations analyse, design, implement, and evolve their testing pipelines. Which is no joke, given that test writing, test infrastructure, and test tooling accounts for around 30-40% of the cost of maintaining software (ChatGPT or Google queries are consistent in that ballpark, which roughly matches my experience). This cost is well justified by the impact of testing on any company whose product depends on building software.

Imagine you’re in one of those organisations. It might be new, and growing. How do you structure the testing pipelines, the infrastructure, the principles and “best practices” that govern work inside each individual engineering team and across them. How do you find the right balance between quality and velocity? Do we need a QA team? How will all those elements behave when growing pains appear? Will they adapt and keep up with the business, or will your engineering machinery grind to a halt?

What happens if the organisation was consolidated, and already at that breaking point? Suffering quality issues, slow development, gruelling and unpredictable delivery cycles. How do you approach the task of improving the quality of the overall product, and each of its components? How do you identify the necessary changes in the existing testing pipeline? How do you convince leadership to fund them? How do you execute those changes without disrupting the business? When these changes impact the way product teams develop, test, and distribute their software, how do we exercise an effective influence on their (likely reluctant) engineers, their tech leads, their engineering and product managers? What type of shared testing infrastructure needs to be built to support all those teams, and which should never be built, even if teams ask for it? Was there a QA team already? Does their role change? How different are the analysis and the solutions if the product is not just a web site with a bunch of server-side components, but also has a mobile application, or native ones, or firmware? How do we know if we’re making progress?

Having solid foundations to reason about your testing is essential to answer any of those questions.

Test types consolidated on that foundational role when Michael Cohn introduced the Test Pyramid in “Succeeding with Agile” (2010). The key concept it put in the collective mindset was that you can classify tests in types which are then laid down as layers. Bottom to top these were ”unit”, “service” (nowadays perhaps more commonly known as “integration”), and “user interface”. You want to have more of the lower ones, fewer of the upper ones.

Here is Cohn himself:

"At the base of the test automation pyramid is unit testing. Unit testing should be the foundation of a solid test automation strategy and as such represents the largest part of the pyramid. Automated unit tests are wonderful because [...]"

This says something about their importance relative to other types but nothing about how to distinguish them. I couldn’t find a clearer definition in “Succeeding with Agile”. In my experience, when people talk about unit tests they imply a focus on verifying a narrow surface of a code base, although it’s unclear how narrow. Wikipedia says that a “Unit is the smallest component that can be isolated within the complex structure of an app. It could be a function, a subroutine, a method or property”, but it comes along with a “citation needed” and if I take that definition seriously, then Cohn’s later point that Service testing “fill[s] the gap between unit and user interface testing” sounds like defining water surfaces as “ponds” and “everything else”.

So let’s consult more consultants. Martin Fowler explains that when writing unit tests at C3 they would “set up a test fixture that created that object with all the necessary dependencies so it could execute its methods.” Now we seem to be at the granularity of an object. But in the same explanation he provides a quote from Kent Beck’s Extreme Programming, which Fowler reads as meaning that “’unit test’ means anything written by the programmers as opposed to a separate testing team”.

That scope seems panoramic compared with “the smaller component that can be isolated within […] an app” that we read in Wikipedia and many other sources. Besides, linking the type of test to who writes it is also problematic: I know of many organisations where programmers write most tests, even those that verify large areas of the system or the UI. Does this mean that tests will be “unit” in one company but not in another?

According to Martin Fowler, there was “considerable discussion” about Kent Beck’s formulation to the point that “one test expert vigorously lambasted Kent for his usage”. They asked the expert for his definition of unit testing and he replied that “in the first morning of my training course I cover 24 different definitions of unit test.”

We can’t expect much of a conceptual framework based on test types when the key terms, in Fowler’s own words, “have always been rather murky, even by the slippery standards of most software terminology”. It certainly explains why using them in any conversation about testing among engineers works like the proverbial can of worms. Once you open it, the slimy creeps cause trouble everywhere.

But it’s worse than that. If engineers can’t have a meaningful conversation about testing, then the communication with business stakeholders about quality and software delivery is doomed to be nothing but dysfunctional.

I have seen my share of the consequences. Organisations that take the pyramid at heart and over invest on the basis with a multitude of hyper granular unit tests for individual functions and methods. They run into the law of diminishing returns. Issues proliferate at the articulation between pieces that were considered individually correct. Business stakeholders are unimpressed by the high coverage number and would rather see the product work after assembling the pieces. The “two unit tests, zero integration tests” memes circulate. Teams get burnt by those problems, some dismiss unit tests as “encumbering and superficial” and conclude that unit testing is overrated. They decide that they need less of that type, and more of the comprehensive types that exercise complete use cases across wider surfaces of the system. A good amount of those teams later end up buried under the weight of byzantine test infrastructures that silently grew slow and unscalable in technical, organisational, or both dimensions. In the meantime business grows more and more frustrated with the slow pace of delivery.

All the way through this mess, people try to figure out what went wrong. They have opinionated debates around the company’s or the internet’s water cooler that seldom reach a conclusion. Maybe the problem is in the choice of shape? What if the Pyramid has fallen out of style? Perhaps we should get more creative. Let’s try the Test Trophy. The Test Honeycomb. The Test Diamond. The Test Crab. We were not able to define one type properly but why not add more? Component tests. API tests. Fowler himself proposes Subcutaneous tests, Broad stack tests, Solitary and Sociable tests. Sometimes they depend on how much surface of code is touched. Others it’s on how that code is structured. On what part of the stack the code belongs to. How are tests written, or who writes them. Anything goes.

The whole thing reminds of Borges’ Celestial Emporium of Benevolent Knowledge. All are valid classifications, none authoritative, and their utility limited to some. Which is fine for certain topics. But when it comes about testing software it seems that the importance of the subject would deserve, if not demand, that software engineers have a solid vocabulary to hold a rational conversation, among themselves and with stakeholders.

How about we stop being content with “slippery standards” on our professional terminology and pay less attention to these murky terms? What if instead, we focus on what we want to achieve and frame the problem as what it is: a system. After all a test pipeline is just that, a system whose inputs are code and related artefacts produced by development teams, and whose function is to give as much confidence as possible on whether they work as expected. We want this system to remain efficient, performant, cost-effective, and scalable to keep up with the needs of the business. The usual drill.

With this approach the problem stops being about shoehorning the complexity of modern software in awkwardly shaped geometric ideals that fit someone’s wooden box. Instead, we are designing a system with a clear purpose. We are doing engineering.

What do we find if we look at a testing pipeline from a systems perspective? Well, one can think about three basic, familiar properties we always care about: latency, throughput, error rate.

Latency tells us how long it takes to verify a given change, to run a test, or the full test suite. It measures the length of our feedback loops.
Throughput tells us how many of those verifications I can run per unit of time.
Error rate tells us what percentage of test executions fail to do their job. Note that this is not the same as failed tests (that is a successful execution!). Errors would be false positives (regressions that pass all tests and therefore slip through our safety net), or false negatives (often flakes that fail tests even though there wasn’t an actual regression).

These are not ground breaking ideas! They permeate the literature about software testing in one form or another (the reason for preferring unit tests boils down to trade-offs around latency, error rate, throughput). But for some reason types, categories, and shapes take the spotlight and dominate the discussion. Bringing back the focus to the domain of systems design rather than abstract classification games, helps reason about problems around testing much more productively.

As I mentioned above this is meant to be an introduction to a short series of posts about testing software. The next posts will be practical applications of a systems perspective:

Analyze a testing pipeline with multiple teams involved, which suffers many of the pathologies described above. It will be based on real examples that I have found in the wild. I will model these pipelines as an actual system and show how this gives us a much better understanding of what we can do to improve the situation.
Show concrete interventions that can be implemented, from the individual team level to the larger organisation, and use our model of the system to observe and measure the impact. I will also try to reference real work done in some of the companies I’ve worked with.

I’m curious about how people design their own testing pipelines, challenges, and useful patterns, so anything you want to share will be very welcome.

How organisations cripple engineering teams with good intentions

Tue, 09 Jan 2024 11:00:00 +0000

I believe that engineers are at their best when they complement strong technical expertise with skills from other disciplines such as product, project and people management, customer support, HR, finance, UX, and many others. I believe that any software engineer should structure their growth plan to acquire the basics of some of those disciplines. I recommend undergoing a tour of duty wearing one of those hats.

I also believe that engineering teams are better when they are not limited to executing technical work, but also understand why. Engineers power-up when they have a clear understanding of the business and product strategy. When they are involved with other domain experts (product, project, people managers, customer support, HR, finance, UX, etc.) in designing the organisational structure and processes that govern their day-to-day technical work.

Some may disagree with this. That’s fine, but that’d be a different discussion. Here I assume that we agree on those points, and therefore we should want to design organisations to pursue those goals. Where engineers are supported and encouraged to develop a diverse toolbox of skills from other disciplines, and where engineers are supported and encouraged to participate in more aspects of the business than merely typing code.

Here I want to discuss how attempts to implement these worthy objectives often backfire.

Let me present a simplified version of a pattern I’ve witnessed a few times.

Leadership learns from product manager feedback that technical decisions are not well aligned with user needs. They diagnose (rightly) that engineering is not close enough to the user. Since we have an interest in augmenting engineers with product management skills, it seems like a good idea to introduce a change in the organisation’s processes so that engineers spend more quality time with PMs and customers when defining epics / stories. It’s hard to argue with this! It makes total sense.
Some time later UX designers raise that the software is disconnected from the actual user experience. After a similar analysis, it seems like a good idea to modify our processes to allow engineers to spend more quality time with UX designers when designing features. Again, it’s hard to argue with this! It makes total sense.
Some time later leadership notices that project management work is falling through the organisational cracks which harms delivery, quality, etc. They realise that this is a good opportunity to help engineers develop their project management skills, so we incorporate some project management responsibilities into engineering teams. Again, this makes sense!
Then leadership meets with Customer Support.
Then, with Sales.

You see where I’m going, right?

Here is a simplified version of another common pattern.

Product stakeholders are defining product priorities for the quarter. We want an inclusive work environment where any engineer can contribute ideas for the product. This all makes sense, so we ask the managers to work with their teams in proposing ideas for initiatives and projects.
HR are looking to improve social media presence and attract leads to the hiring pipeline. Having Engineer-generated content in the corporate blog would be powerful! It also helps engineers build writing skills, get a public presence. Let’s ask engineers to write!
Hiring processes are about to be redefined. We dig inclusiveness. We want engineers engaged and involved. The hiring managers and HR ask teams to get the engineering hive mind to work and crunch some proposals.
Customer support needs new standards, some consolidation of processes and tools. Engineer feedback and engagement is valuable! We ask each team’s manager to collect feedback and ideas from their teams.
And so on.

A sum of individual good ideas doesn’t guarantee a good outcome

It’s a common mistake to assume that by adding up rational individual decisions you get to a good aggregate outcome. It’s usually the opposite. In an evacuation, running for the exit is a sensible individual decision, but aggregate them and you get a lethal stampede.

Something similar happens in our little stories. Making process changes to bring PM, UX, Customer Support, etc. closer to engineers, or asking engineering teams to propose ideas for organisational aspects makes sense as individual decisions. But put together, they can have bad consequences which harm engineering teams (and by extension, the larger organisation). As we implement them, we suddenly realise that engineers barely spend time in actual engineering.

Now, I know this is a trigger for many people. They think. “Hah! Here we go again! An engineer arguing that engineers should be left alone with the code”. This is not what I’m saying (I said the very opposite in the first paragraph!). But my strong belief in multi dimensional engineers that are engaged with the larger organisation is compatible with the belief that engineers should spend most of their time in engineering.

That statement is not even controversial if you swap industries. Think construction. There is value in bricklayers, plumbers, electricians, designers and architects knowing some of each other’s skills. Still, bricklayers must spend a substantial amount of their time laying bricks. Electricians wiring. Plumbers plumbing. Architects architecting. Or else the building won’t get done.

Software engineering is not different. Engineers need to spend time doing engineering. Who else will do the technical designs, write, review, test, operate software?

A more measured objection to my point is to accept that while engineers should spend time engineering, they should still spend some of their time in activities outside of their core expertise area. Sure. I agree. But “some” is quite broad. Let’s be more precise. What % of time spent in core engineering activities are we talking about? 50%? 25%? Give me a ballpark.

Minimum reasonable focus

Or rather, let’s change places. What is the minimum % of time in activities not related to the core expertise that you consider reasonable for your own role? What is the minimum % of time that a PM should spend on product management tasks? Is 50% reasonable or does it seem too low? It does sound low to me! Isn’t it more like 70%? Would it be reasonable for a UX designer to spend less than 70% of their time in core UX design activities? What about people managers? Customer Support?

But let’s go further. I guess we agree that multi dimensionality also applies to non-engineering disciplines: that they are also enriched by acquiring the basics of engineering (among others). So let me ask the PMs, UX designers, People Managers, HR, Sales and customer support folks in the room. What % of your time do you spend doing core engineering activities like code, tests, code reviews, technical documents, operations and so on. I am fairly confident that if I take the average of this poll, it would round up to low single digits. That sounds reasonable! But doesn’t that mean that having engineers spend low single-digit %s of their time on those non-engineering activities is also reasonable?

Building a house needs bricklayers, architects, plumbers, designers, electricians, and so on. All matter. All are valuable. All benefit from learning the basics about the other’s expertise area. And yet, plumbers spend most of their time plumbing. The many experts involved in building software are in exactly the same situation. All matter. All are valuable. All are richer when they learn the basics of the other’s expertise area. And yet, all need to dedicate a substantial % of their focus and dedication to their core activities.

We can now go back to that sequence of individual, rational, sound decisions that the leaders who design organisations tend to make. Of course it’s great to design organisations that help engineers acquire a diverse set of skills, that they engage with defining strategy, organisation, process. But because time is limited, we must be conscious that every time we direct an engineer’s attention away from engineering, we’re chipping away from that minimum reasonable allocation of time that they, like any other professional, need for their core activities. I reiterate that this is not just writing code. It’s also about code reviews, technical designs, and so on. I struggle to see a minimum reasonable allocation for those core engineering activities of less than 70%.

You might think that a budget of 30% non-engineering time doesn’t seem so bad. It’s 12h in a standard 40h week! It can fit a lot of stuff.

But notice that we didn’t even talk about the baseline of day-to-day overhead that goes into every individual engineering team. I’m thinking of activities involved in backlog grooming or ordinary human coordination that already consume a good chunk of that non-engineering budget. Those activities tend to be inflated with a proliferation of rituals, meetings, paperwork, rich in post-its and generally under the umbrella of a methodology, that go well beyond the necessary to achieve their purpose with pragmatism. Not much is left of those 12h.

Project/product/people management specialists easily overlook that overhead and inflation because, from their perspective, that time seems well spent (engineers are project managing! Growing multi disciplinary skills! Applying the latest methodologies! Good stuff!) But what happens if we now add some time working on requirement gathering? And on customer support? And on designing interfaces? And on pitching project ideas? And writing posts for the corporate blog? Great learning! Inclusiveness! But Ars longa, vita brevis. Engineers spend less time in engineering.

Would you, product manager, people manager, UX designer, $insert_your_discipline_here, be able to do your job properly if you had to spend 5% of your time in each of 10 other disciplines? If your core activity was loaded with a crust of unnecessary ritual? Of course not!

Work fragmentation hurts engineers

There is a factor that makes this problem even worse in engineering (as well as in other disciplines). I will refer here to the well-known Maker’s schedule, Manager’s schedule. An engineer’s schedule is like a glass jar that you want to fill with stones, pebbles and sand. You can only succeed in that exact order. If you try to put the pebbles and sand first, the stones won’t fit. Maker-type work is primarily like stones, it requires solid blocks of uninterrupted time. Manager-type work is mostly like sand or pebbles, it can fit in a more fragmented schedule with small blocks of time. Maker schedules don’t work like that: it’s not just a matter of how much time is spent in non-engineering work. It’s also the fragmentation. Put diverse, varied activities into an engineering team and the quality of the engineering will go down because the engineering rocks won’t fit. This isn’t to say that Manager-type work is less important! It’s equally important! But they are different jobs for reasons like this one.

Consequences

When the minimum reasonable threshold of time dedicated to core engineering tasks is broken, things backfire. I’ve seen this in two main varieties:

Teams neglect their engineering standards.

This is unsurprising if they don’t have time because they are doing too much work in adjacent expertise areas, like figuring out requirements, talking to customers, writing blog posts, or having meeting after meeting to push JIRAs around. Of course all those activities are important. Of course engineers grow by doing those activities. Of course they should individually do them at some point in their careers. But time available for those is relatively small. They can’t do all, plus engineering at the right standards. It just can’t happen.

The amount of time matters. But for engineers and other maker roles, so does fragmentation. Put diverse, varied activities into an engineering team and the quality of the engineering will go down. The engineering rocks won’t fit.

It burns people out.

On one hand, red tape and bureaucracy are well known demotivators for engineers (“I could fix this in less time than I write the JIRAs”). On the other hand, we have seen quite a few instances of the following pattern. A gap appears in the people management / product management / project management area. Senior engineer is spotted as capable of plugging that hole. The engineer’s manager makes the case that taking those activities will broaden their toolbox. This makes sense! Engineer accepts. Does less engineering, more people/product/project management. Gradually, no engineering. Some of these people become happy project, product or people managers. But a good chunk of those end up stuck in a position they don’t quite enjoy, not knowing how to go back, constrained by a web of pressure points (e.g. lack of equally clear growth plan in the engineering track as for management roles), until they burn out and interview elsewhere for an engineering position. And yes, they now shine as a multi-dimensional engineer. But shine elsewhere.

Both are bad outcomes for the organisation, even if they derive from an accumulation of individual decisions that are rational and hard to disagree with.

So I guess I have two messages

First, to leaders and managers outside of engineering.

We are all aligned on the value of multidimensional engineers, on transparency, on inclusiveness. You should design your organisation accordingly. Sometimes, engineers will have to be strong-armed against their will or their preference! Many times, the “we have project management work that’s falling through the cracks” or “we need UX and engineering to be closer” serve as great opportunities to learn the basics. All that is welcome.

But please, you need to balance this with an awareness that time matters and context matters. That you cannot have engineers participate in product and project management, UX, HR, customer support and three more things at the same time, as part of their day to day, full-time job as engineer. That sometimes yes, it’d be great to have engineers join sales, or HR, or customer support, or something else, but this is incompatible with keeping a “maker schedule” that is vital to healthy engineering. That involving the team in that well-intentioned brainstorming session to figure out the next quarter’s priorities can disable the same team from delivering the last quarter’s goals up to the right standards. That sometimes, if there is a hole in <some_discipline> work, maybe the solution is not to throw an engineer at the problem, but rather go and get the <discipline_domain_experts> to plug it. That you may be interested in old or new methodologies, tools of your trade, etc. but introducing them in engineering teams may inflate the amount of non-engineering time they have to deal with unnecessarily, to the detriment of the time available for their core responsibility.

That all of this does not mean devaluing your discipline. It just means that engineering is a different one.

Second, to Individual Contributors in engineering leadership roles.

You have to convey to engineers around you the importance of understanding the Why of your work. The value of growing a diverse toolbox of skills. Be a role model in this. Help them engage with the business. Push them to get out of the code sometime, and walk the organisation learning what’s beyond the brick laying. To get into the project manager or the UX designer’s shoes, to go help customers ensure that disaster of an API we designed behind our noise-cancelling headphones. To do the project management that’s falling through the cracks, and learn from the experience.

This is all essential. But it is equally essential that you help engineers keep their focus, attention, raw, solid, uninterrupted quality time, on core engineering activities. Ensure that their jars fill with big stones first. Sometimes you do this by pushing back when the organisation wants to use some of the engineer’s time budget into for purposes that seem good, well motivated and rational, but can have unintentional side effects on the engineers’ ability to write, review, operate, deliver and maintain the engineering dimension of high quality software up to high standards.

This doesn’t mean that engineering is the only dimension that matters. Nor that it’s the most important one. Nor should you think of yourself and your team as sacred cows among lesser professionals.

It just means that you’re engineers, and you have a job to do.

Migrating an Eureka-based microservice fleet to Kubernetes

Wed, 12 Feb 2020 12:00:00 +0000

I have published some articles recently related to the Kubernetes-based PaaS that my team builds for the engineers behind the online marketplaces of Adevinta. Some concerned low level issues that we found as we brought teams in to the platform, like dealing with OOM kills with JVM apps or understanding why one service suddenly exhibited 10x latency after deploying it in Kubernetes.

The last one was more high level, focusing on our vision and strategy. I spoke there about how a lot of the value we provide in the glue between systems. This article will go in depth into one of those bits of glue. I will talk first about motivation, our architectural decisions, outcomes, and future plans. At the end there is a deeper dive in the implementation and testing strategy for that bit of glue.

If you want to receive notifications for new posts, subscribe to the RSS feed or follow me on Twitter.

Context and goals

We are currently busy migrating a large fleet of microservices from one tenant into our PaaS. This tenant covers Adevinta’s online marketplaces in Spain (MilAnuncios, Infojobs, Fotocasa, Habitaclia, and others.) Most of these share a microservice platform based on EC2 that leverages tools in the NetflixOSS stack. A key component is Eureka, which provides service discovery and underpins client-side load balancing.

Kubernetes is able to deal with both service discovery and load balancing on its own, although using very different approaches. This change works for us in the grand scheme of things, but I here won’t discuss client vs. server-side load balancing (you can read about them here or here). Migrating to Kubernetes does enable us to eliminate moving pieces from the legacy infrastructure: we can remove Eureka, microservices can drop the corresponding integration, and forget about load balancing.

But performing a migration from EC2 to Kubernetes while introducing major architecture changes is not a good idea for reasons that are, I hope, obvious enough to omit. We know from the beginning of the project that the migration would happen gradually, avoiding any change that is not strictly necessary for our goal in order to minimize risk and impact on product teams. Any product engineer should be able to migrate a microservice in 1-2h and get on with their work.

That means for the duration of the project (months) we’ll have services fully deployed on the legacy EC2 infrastructure, others only in Kubernetes, others inbetween worlds. We had to figure out a way to make the differences in environments transparent until we are ready to deprecate the old infrastructure of this tenant.

For our Platform team, there is a second goal: bring teams from Adevinta Spain into the fold of our Golden Path, together with other Adevinta Marketplaces. Reducing software sprawl means that Platform teams use economies of scale to provide more value for the business.

Understanding our legacy architecture

The legacy microservice platform is deployed in a handful of AWS accounts following a fairly standard architecture. Here is a (very) simplified view of one account:

Eureka architecture (source)

There are about 200 services, all communicate with HTTP. In the diagram I show two, Bob and Alice represented with stars and pentagons, 3 replicas each. Microservices talk to each other and may serve requests coming from the Internet.

Internet traffic reaches backends through a single API Gateway that is shared by all AWS accounts. We use Zuul, an L7 gateway that can handle TLS termination, header manipulation, dynamic routing, etc. It is configured to map names such as bob.spain.adevinta.com to a given backend service, find an instance, and proxy traffic to it.

But which replica? Zuul and all microservices decide based on the information maintained in a service registry, Netflix Eureka. All instances in the network (both microservices and Zuul) interact with the service registry in two ways:

They register their service name, IP, port, health endpoint, and other metadata to Eureka as soon as they come up, and keep refreshing the same information with regular heartbeats as long while they are available. When heartbeats stop, Eureka evicts the instance from the registry after a timeout.
They poll Eureka to learn the set of available instances for each service and implement richer logic on top (load balancing policies, retries, etc.) We rely on Netflix Ribbon for this.

Let’s see an example. Given the diagram above, Eureka’s registry would look something like this:

Service	Instance ID	Endpoint	Port	…
Bob	Bob1	168.1.1.21	80	…
	Bob2	168.1.1.22	80	…
	Bob3	168.1.1.23	80	…
Alice	Alice1	168.1.2.31	80	…
	Alice2	168.1.2.32	80	…
	Alice3	168.1.2.33	80	…

When someone wants to send a request to Bob, it would do something like this:

ribbon.sendRequest("http://bob/hello", ..)

Behind the scenes, Ribbon would look up which instances are associated to service bob, choose an instance and send the request there.

We can then distinguish two channels of communication.

The Data Plane (blue), used to exchange application data.
The Control Plane (red), used to exchange information that determines how traffic will flow through the Data Plane (e.g. which instances can receive traffic.) Eureka is the core component of the Control Plane in the legacy infrastructure.

Data and control plane may actually use the same networks, but are fundamentally different concerns.

Understanding the target architecture in Kubernetes

To deploy microservices like Bob and Alice we use ordinary Deployment objects that generate a number of Pods running the microservice. For each Deployment we also create a Service that defines how to reach those Pods. This is how Kubernetes wires things up:

How we deploy a microservice in Kubernetes

Each Service object gets a private DNS record like carol.svc.mycluster that resolves to a virtual IP (172.17.0.1 in the diagram). When any Pod talks to carol.svc.mycluster Kubernetes will do its magic and route traffic to healthy Pods running Carol.

But this is problematic for us. Our Kubernetes cluster runs on its own AWS account (not EKS, for $reasons.) We use VPC peerings to link a private subnet in that AWS account, to in each of the AWS accounts of the legacy environments. This gives connectivity, but only up to the Kubernetes underlay network, that is, to the Kubernetes nodes. Not to any of the resources in the overlay network like Pods or Services.

Let’s see the diagram updated with a VPC peering between both AWS accounts.

Our starting point: legacy and new infrastructures

When our Pods come up, they will be able to register in Eureka (we also configured network policies to allow traffic from Kubernetes to AWS accounts). But they will report their container’s IP on the overlay network, which is private to Kubernetes. Eureka’s registry will contain the following entries:

Service	Instance ID	Endpoint	Port	…
Bob	Bob1	168.1.1.21	80	…
	Bob2	168.1.1.22	80	…
	Bob3	168.1.1.23	80	…
Alice	Alice1	168.1.2.31	80	…
	Alice2	168.1.2.32	80	…
	Alice3	168.1.2.33	80	…
	Alice-k8s-1	10.0.1.7	80	…
	Alice-k8s-2	10.0.1.8	80	…
	Alice-k8s-3	10.0.1.9	80	…
Carol	Carol-k8s-1	10.0.0.1	80	…
	Carol-k8s-2	10.0.0.2	80	…
	Carol-k8s-3	10.0.0.3	80	…

All 10.0.*.* IPs are inaccessible from the legacy environment, so this will be an utter mess. All requests from Bob (only deployed in EC2) to Carol (only in Kubernetes) will fail. Half of Bob’s requests to Alice will also fail. Ribbon will be clever enough to blacklist unreachable instances from the local caches of the service registry in each microservice, but other cases will be completely broken. For example, half of Alice’s instances in EC2 won’t be able to reach Carol. And Carol might be able to send requests to Bob and Alice just depending on how network policies are configured.

This is clearly not what we want.

Fixing the control plane

Our problem is that the control plane is broken. Instances of our microservices that run in Kubernetes feed the service registry with data that induce bad decisions about where application traffic should be addressed (e.g. sending requests to unreachable IPs.)

There are some options to fix this. One is to adapt the Data Plane, wiring both networks so that the Kubernetes overlay belongs in the private subnet of the cluster’s AWS account. This, combined with the VPC peerings, means that we can route to/from the tenant’s AWS account. This option would work well with EKS.

We chose a different approach for various reasons. The main one relates to the fact that we maintain several multi tenant clusters. Tailoring their architecture to each tenant’s infrastructure quirks is a slippery slope which inflates our operational burden and has bitten us hard in the past. Instead, we prefer to keep our Kubernetes infrastructure as standard as possible, and treat our clusters as cattle.

The approach we followed has two parts:

Exposing the Service objects outside the cluster through standard Kubernetes constructs.
Joining the Kubernetes Control Plane to the legacy Control Plane.

Exposing `Service` objects outside the cluster

Kubernetes offers three well defined mechanisms for this. The two most basic ones. are NodePort and LoadBalancer, which cover up to L4 (TCP/UDP). For higher level functionality (TLS termination, HTTP path based routing, etc.) you can use an Ingress Controller. This is the standard in our platform.

When we have a Service that needs to be exposed outside the cluster, we create an Ingress for it, with a standard host name (<service_name>.k8s.adevinta.com). The Ingress Controller (NGINX in our case) will handle requests to external endpoints, and route them to a Service. Here is the diagram updated after adding two Ingress objects for Alice and Carol.

Adding the Ingress

NGINX will map HTTP requests with a Host header carol.k8s.adevinta.com to the service’s private DNS record, carol.svc.mycluster, which resolves to the virtual IP, etc.

We give our tenants the choice of various Ingress classes depending on whether they want their services reachable from the Internet, or through a private ELB accessible only via VPC peerings. In this case, we’re going for the latter as we want to keep all traffic among microservices private, that’s why the Internet cloud is not connected to the ELB at the bottom.

After creating an Ingress, the remaining challenge is making the legacy control plane aware of the workloads running in Kubernetes.

Joining control planes

As noted before, every instance of our microservices should not register their container IP, but use an address that hits the Ingress controller: the private ELB’s IP. Let’s say it is 10.28.16.110, then Eureka’s registry should look like this:

Service	Instance ID	Endpoint	Port	…
Bob	Bob1	168.1.1.21	80	…
	Bob2	168.1.1.22	80	…
	Bob3	168.1.1.23	80	…
Alice	Alice1	168.1.2.31	80	…
	Alice2	168.1.2.32	80	…
	Alice3	168.1.2.33	80	…
	Alice-k8s-1	10.28.16.110	80	…
	Alice-k8s-2	10.28.16.110	80	…
	Alice-k8s-3	10.28.16.110	80	…
Carol	Carol-k8s-1	10.28.16.110	80	…
	Carol-k8s-2	10.28.16.110	80	…
	Carol-k8s-3	10.28.16.110	80	…

Each microservice now has an entry for its Pods. Notice how all have the private ELB’s IP, 10.28.16.110. This is allowed in Eureka as long as the instance IDs is different.

But this approach has downsides:

We have to modify microservices so they report their own host’s or the ELB’s depending on where they are running. This is custom logic so it would need code changes.
If our ELB changes IP (which happens sometimes), we have to propagate the change to every instance running in Kubernetes so they can in turn update their entries in Eureka.

To avoid them, we decided to follow a different approach. We built Eurek8s, a Kubernetes controller that glues together the Control Plane in Kubernetes and the legacy microservice platform of Adevinta Spain.

Eurek8s implements a simple control loop that propagates the state of each microservice to entries in Eureka. This gives us some benefits:

We need no changes in microservices.
We resolve the ELB’s IP dynamically in the controller before sending a heartbeat to Eureka. This way, changes in the ELB’s IP are propagated seamlessly.

The full picture

I’ll add a section at the end of this article with some implementation details of Eurek8s, but focus first on the main implications for our product engineers, and our plans moving forward.

Deprecating client-side load balancing

We realized that there was too much overhead in monitoring every microservice’s Pod lifecycle events and their status to decide when and how to register them in Eureka. Moreover, this would mean sneaking through a backdoor into the abstractions why we came to Kubernetes in the first place. What determines if our microservice is healthy or not is a whole Deployment, we shouldn’t care about individual Pods.

Eventually, we decided to ignore the Pods and focus on the higher level abstraction. Eurek8s registers only one endpoint per microservice (a Deployment in practice), rather than one endpoint per Pod. Therefore, based on the diagram above, Carol and Alice would only have one endpoint regardless of the number of replicas:

Service	Instance ID	Endpoint	Port	…
Bob	Bob1	168.1.1.21	80	…
	Bob2	168.1.1.22	80	…
	Bob3	168.1.1.23	80	…
Alice	Alice1	168.1.2.31	80	…
	Alice2	168.1.2.32	80	…
	Alice3	168.1.2.33	80	…
	Alice-k8s	10.28.16.110	80	…
Carol	Carol-k8s	10.28.16.110	80	…

If you think this breaks client-side load balancing, you are quite right.

The microservices we migrate used originally a strategy that round-robins requests evenly across instances. For example, when one wants to talk to Alice, it would check the service registry table (above), see 4 endpoints, and split traffic sending 1/4th of the traffic to each one.

But Eurek8s is lying, it reports an endpoint , Alice-k8s, that actually hides 3 replicas. In reality, each instance in EC2 would receive 1/4th of the traffic while those in Kubernetes would receive 1/12th.

This was just not such a big issue for us. Why?

Each service will run in both environments temporarily for just a few days while the product team that owns it shifts traffic over from EC2 to Kubernetes checking that nothing breaks. This limits the time where our load balancing is broken. As soon as 100% of the traffic is going to Kubernetes, client-side load balancing is a NOOP, given there will only be 1 endpoint per microservice, like in Carol’s case above.
Moving forward, we are OK with Kubernetes taking over load balancing functions, the effort in keeping client-side load balancing working for a couple of weeks longer would not be well spent.
Product teams can be made aware of this detail.

Outcome

After moving a few dozens of services, our assumptions worked out well. Because we did a major investment in automating most of the migration burden, it is indeed possible for engineers to get a service running in Kubernetes and start routing traffic there in just a few hours. Most services handle 100% of their traffic in Kubernetes in less than one week. Eurek8s has been quite stable so far, the focus on simplicity has surely helped making this possible. We discovered with minor edge cases discovered in the first batches of migrations that were easy to resolve.

The temporary Load Balancing distorsion has been harmless.

Future plans

Eurek8s will continue to enable the full migration of all workloads. By the time we’re done, both environments will look like this:

All microservices are migrated to Kubernetes

So Eureka registry will be quite boring:

Service	Instance ID	Endpoint	Port	…
Bob	Bob-k8s	10.28.16.110	80	…
Alice	Alice-k8s	10.28.16.110	80	…
Carol	Carol-k8s	10.28.16.110	80	…

Eureka will have become a dumb 1:1 mapping from service name to our Ingress IP. At this point, removing it from our infrastructure will be easy:

For service-to-service communication, a trivial change in each service disables the Eureka client and replaces the endpoints of its dependencies (e.g. from bob to bob.svc.mycluster). We can automate this change making it transparent to product engineers.
On the API Gateway, Zuul, we can disable Eureka and configure the 1:1 mapping from the public host names to our internal ones (e.g. bob.adevinta.com to bob.k8s.adevinta.com).
Decommission the Eureka server.

The fundamental change behind this process is to outsource Load Balancing and Service Registry to Kubernetes. This will initiate a larger conversation about the API Gateway.

Traditionally, API Gateways have been in charge of solving functions such as load balancing, TLS termination, header manipulation, dynamic traffic routing. After migrating to our PaaS, there is a significant scope overlap in the capabilities avaliable in Kubernetes and its ecosystem (e.g. from Ingress controllers, to service mesh solutions like Istio or Linkerd). We plan to spend some thinking time to define the split of responsibilities between the API Gateway and Kubernetes. This conversation is particularly relevant for other projects we’re working on, such as supporting transparent multi-cluster deploys that enable resiliency to full cluster failures, etc. Christian Posta has an interesting post on the role of API Gateways within the current industry landscape.

Eurek8s internals, for the curious

Finally, I’m adding an overview of how Eurek8s is implemented. Here is a high level diagram:

Eurek8s

Eurek8s has three channels that are used to signal different types of events.

The first channel is fed from a Kubernetes Informer, that subscribes to changes in Ingress objects for a namespace with a particular annotation. Depending on whether the Ingress was created/updated or deleted, a message is sent in one or the other channel. Why listen on the Ingress and not the Deployment? Conceptually it’s the same, we’re just monitoring “a microservice”. We considered an Ingress preferable because:

We need the URL for the service to resolve the ELB’s IP. This URL is configured in the Ingress object, so it’s cheaper to load it straight away, rather than listen to the deployment, and then load its Ingress.
Ingress objects are updated less frequently. But each Deployment changes every deploy). Less notification churn reduces pressure on the Kubernetes API.

The other two channels are fed from timers that emit one message at set intervals. One signals when we should refresh the health status from Kubernetes. The other determines how often we send send heartbeats to Eureka.

All work is done by a single go routine, so we don’t need to coordinate access to shared datastructures. This is roughly how each message is handled:

When an event appears in the “Created/Updated/Deleted Ingress” channel, the worker updates a local cache of Eureka entries, and immediately registers / updates / deregisters the entry in Eureka.
When an event appears in the “Sync k8s status” channel, the worker iterates through all known Ingresses in the local cache, and updates the entry with the health of the relevant Deployment object in Kubernetes.
When an event appears in the “Send heartbeats” channel, the worker does just that, for all known Ingresses.

Both timed events at (2) and (3) involve multiple API calls to Kubernetes and Eureka so processing a single microservice can take ~250ms. With ~200s microservices, handling each sync / heartbeat loop would take ~50s. That is too much lag between state in Kubernetes and Eureka.

To reduce this time but keep the simplicity of a single-threaded application, we just deploy Eurek8s per Kubenetes Namespace. This achieves the same purpose (less Ingress objects to handle) and also reduces the blast radius of one Eurek8s crash. This decision is easy to reverse, using a single Eurek8s instance per cluster and splitting (2) and (3) in separate Go routines.

Another decision was to run just one Eurek8s Pod per Namespace. What’s the tradeoff?

Replicating Eurek8s increases resiliency at the expense of complexity coordinating instances so that replicas compete for ownership of a subset of Ingress objects in their namespace. This is doable, but not trivial.
Having a single Pod implies that losing it will stop all heartbeats. As a result, Eureka could deregister all instances from services in that Namespace. This would be bad, but the risk in practise is quite low. First, if the Eurek8s Pod crashes, Kubernetes will spawn a new one, reducing the period of downtime to a few seconds (~5 in our tests.) This is actually so harmless that it happens every time we deploy a new version of Eurek8s. Heartbeats stop for ~5s and resume shortly after. Given that the endpoint timeout in Eureka is much higher (30s), such restarts are imperceptible. Again, this strategy is easy to reverse if necessary. Heartbeats are idempotent, so we can scale up to two Pods anytime while we implement the coordination logic (at the expense of some load in Kubernetes and Eureka APIs.)

Testing Eurek8s

We have some basic unit tests, but given the nature of the tool we wanted an e2e testing suite that resembled production conditions. Kind has been beyond awesome for this purpose. If you haven’t heard about it, you should really check it out. It is a “tool for running local Kubernetes clusters using Docker container ‘nodes’. Kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.” It does the job beautifully.

We build a simple harness that uses Kind to create a local Kubernetes cluster, we deploy a Eureka server inside it for convenience, along with Eurek8s, and then implement a bunch of tests that mimic our production use cases. The whole thing is quite lightweight and verifies our tool in minutes.

Done

If you made it so far, thanks. This has grown definitely longer than I expected. Feel free to send me questions / corrections / feedback via Twitter, or the email you’ll find in my GitHub profile.

Talk write-up: "How to build a PaaS for 1500 engineers"

Thu, 02 Jan 2020 10:00:00 +0000

This article is based on a presentation I gave as part of AdevintaTalks in Barcelona on November 2019. I’m experimenting with this format: I went through the slides typing what I’d speak over them, edited the text, and added some of the most relevant slides inbetween paragraphs. Let me know how it works.

Discussion in Hacker News

–

Today’s topic is about Technical Infrastructure, a term I found first in a great talk by Will Larson. He defines Technical Infrastructure as “the software and systems to create, evolve, and operate our businesses.” That includes Cloud services, build tools, compilers, editors, source control systems, data infrastructure (Kafka, Hadoop, Airflow…), routing and messaging systems (Envoy, gRPC, Thrift…), Chef, Consul, Puppet, Terraform, and so on.

Companies that reach a certain size typically create one or more teams that take care of different subsets of those tools. They get named something like “infrastructure”, “platform”, “foundations”… I will use “Platform team” in this text.

If you have been part of one, you will know that life for a Platform team is tough. I could find an actual picture that shows one of them on the field:

Pieter Brueghel the Elder - "The Triumph of Death" (1562)

A good deal of the job is ultimately about finding the right balances between standardization and autonomy. To make meaningful impact, Platform teams depend on having standards in their organization. Trying to too support every possible language ecosystem, framework, DB, messaging system, and whatnot spreads Platforms teams too thin to be effective.

On the other hand, it is also wise to respect every other team’s autonomy to make their own technical decisions. Oppinionated Platform teams risk coming across as a patronizing, ivory-tower-dwelling jerks that impose capricious choices on other engineering teams. Standardization and autonomy are complex factors to juggle.

What if you multiplied this problem by 20? What if you had 20 companies to support, each with their own cocktail of technologies, cultures, philias and phobias, genealogies of decisions… and their own Platform team!

That’s my team’s job at Adevinta. Adevinta builds online marketplaces. Each of them are and need to be special in their own way. At the same time, they all share many other bits and pieces that don’t need to be different. That are, in fact, more cost-effective if they are built once and shared by all. Much of this falls in the category of Technical Infrastructure. My team works on that type of plumbing.

Generating value as a Platform team

You know who these people are?

They are the way fiction represents a profit centre. Those are parts of a company that bring money in, they generate revenue.

You know who these other people are?

They don’t smell of money, right? Fiction represents them in this way because they belong to a cost centre. They pay AWS bills, buy computers. In other words, money exits the company through them.

Profit centres tend to have a crisp value proposition that is easy to see and understand. It’s the opposite for cost centres. This doesn’t mean that they give no value to the company. On the contrary: that Kubernetes cluster is critical for the business. What is hard is explaining to the people in the first picture why the engineering team should spend cycles migrating from EC2 to Kubernetes instead of just shipping more money-generating features.

In a way, cost centres are valuable to the extent that they power-up the profit centres. This is why Platform teams should be aware that they live under constant scrutiny, visible or not. The business will always wonder, why are we paying this team of expensive engineers? Can’t Amazon, or Google, or Microsoft, or Digital Ocean, or Heroku, or whoever else, provide whatever these people do? Can’t we make use of those many open source projects funded by dozens of commercial companies, and use the headcount elsewhere?

In 2020, Technical Infrastructure is a hot market. While I gave this talk, Amazon was running Re:Invent in Las Vegas. The first day they announced Quantum computing as a service. They continued for an entire week running presentations that felt like..

So when a Platform team comes with their quarterly demo the perception is hopelessly.. small.

And those internal demos are great, don’t get me wrong, a lot of effort and that. But when a Platform team is (intentionally or not) building systems that have 3rd party alternatives they are competing in an uneven playing field. A Platform team should really avoid competing against AWS, Google, or any commercial company. It doesn’t matter if their homegrown CI system is superior, today, to $commercial_ci_system. The market will catch up, faster than expected, and make that team redundant. Every Platform team should be asking themselves: what is our differentiator? What do we offer than makes it worthwhile for our company to invest in our team, rather than throwing those engineers at the product?

To recap so far: because Platform teams are cost centers, they must really focus on formulating a clear and realistic value proposition that is relevant to the business. They must also ensure that their impact is intelligible to the business, and visible. Finally, they must ensure that those properties remain within a fast-moving industry.

Our PaaS

My team focuses on a PaaS that helps engineers across Adevinta build, deploy and operate software in the cloud. We focus mainly on stateless workloads (e.g. microservices.)

Our pitch for product engineers is simple: “look, we know you’re fighting in a competitive market. Our mission is to help you win that competition. You prefer to do the plumbing yourself? It’s your call, we respect that. But if you prefer to focus your attention on your product we have a team of domain experts that do the heavy lifing for you. We take care of providing you with the technical infrastructure that you need so you can focus on winning races”.

We define a Golden Path: a reduced set of sane, proven choices of tools that are effective to build, deploy, and operate microservices (the core systems we support are in the left-hand side slide below). Is each tool the best in their category? Probably not. But we know they get the job done, are well supported, maintained, and standardised in the industry. It’s not about local optimums, but global maximums.

It also not about chosing a bunch of random pieces and give product teams an Ikea-like platform that they must assemble piece by piece.

We bundle GitHub, Travis, Artifactory, Spinnaker, FIAAS, Kubernetes, Prometheus, Datadog, Sumologic, ELK.

The main value we provide is in the joints, the articulation, the glue. In how we integrate together all these systems. Our PaaS favours composition of multiple modular components, trying to become a pierceable abstraction. The main benefits are:

Avoid losing battles with commercial companies. Each of our components has at least one whole company behind it. It’s hardly realistic to expect that we can dedicate 5-7 of our engineers (1/4th of the team!) to build in-house alternatives to a CI system, a CD system, a metrics platform, etc. against businesses that have 20 or 50 times that capacity. Instead, we focus on what is specific for our company, tailoring off-the-shelf solutions to our needs. Commercial competitors are more likely to focus on what’s generic to larger portions of the industry.
Well defined articulations become escape hatches for those teams that can’t or won’t use the entire bundle. This is a requirement to be able to support a highly heterogeneous population of companies and teams as we do. But it’s also good for us as it increases adoption, and creates low-friction “gateway drugs” to the full PaaS. We often see how teams that adopt one tool gradually adopt more organically as they gain trust and realize we can relieve them from doing a lot of undifferentiated heavy lifting.
The same flexibility enables us to replace individual pieces when it makes sense, causing minimal impact for the users. One of our current initiatives is to ensure that upgrading or switching any of those pieces is completely transparent to users.

Small impact, repeated many times

One of the strategies we use is to spot where we can introduce tools that generate small impact to a wide surface. This worked quite well applied to reducing toil on the development process.

Of most of the tasks involved in getting a new change merged to the main tree, there may be only two that really need involving human brains: writing the code, and reviewing it. We set out to automate lots of the other small chores in the dev process: assigning reviewers, analyzing coverage and static analysis reports, propagating dependency updates, keeping branches up to date with their base, merging approved PRs… Each of these actions may have a tiny impact. But multiplied by a population of hundreds of engineers, month after month, you get economies of scale.

The image above is the profile for the service user that executes all these actions in our internal GitHub Enterprise instance. Assuming each action is worth 1 minute (many of them are actually more than that) it adds up to 62 engineer-days per year. That’s impact that can be very easily translated to money, to terms that the rest of the business can understand.

At this point you should have a siren wailing in your heads. “Wait, you said before that we should avoid competing with commercial companies. GitHub released Actions earlier this year, and some of the automation you just mentioned seems very similar to what is available in the public GitHub. Does that mean your team has just become obsolete?”

Remember the point about creating differentiators. The core functionality that makes the PaaS (things like “running builds” or “store these metrics and let me query them”, or “do $something when a PR is created”) will become commoditized sooner or later. But glue is a different story. Our differentiator here is simple: GitHub Actions can only react to GitHub events. Our automation can react to events in the entire Adevinta development ecosystem. All of it.

Because we didn’t spend that much time building the core tools, we could focus on the glue. The slide above shows Devhose, a component that collects events from every tool in our dev ecosystem (GitHub, Travis, Spinnaker, Artifactory, JIRA, Slack, Kubernetes… and even several tools outside of the Golden Path), stores them in a log, and broadcasts them in an “engineering event bus”. We also built some tooling around it that gives us the ability to easily implement new functionalities that interact back with the ecosystem. For example, one bot we prototyped recently listens to events in Kubernetes, detects killed pods, collects troubleshooting information and ships it to a Slack channel so the team that owns the service is alerted.

Thanks to having invested in glue, when GitHub Actions reaches the Enterprise version, it will bring value that we can leverage, rather than an existential threat.

Virtuous feedback loops

Having everything that happens in our dev ecosystem registered and broadcast in an event bus turned out to be valuable for multiple purposes. One of them was to build insights into the development process itself.

We built a system called Ledger to help with this. It is an event consumer that reads from Devhose’s event bus and crunches all types of productivity metrics. Which ones? One of our references is Accelerate and their annual “State of Devops” reports (2018, 2019). Their main claim is that the performance of software delivery teams can and does provide a competitive advantage to companies. This is backed by extensive industry research that links specific practises to the most effective and efficient ways to develop and deliver technology. This is precisely our team’s mission.

The authors identify four key metrics that capture the effectiveness of the dev & delivery process: Lead Time for Changes, Change Failure Rate, Deployment Frequency, and Time to Restore. These can be used as high level performance indicators that reliably gauge the ability of an organisation to achieve its goals.

Because we provide the plumbing for most engineering processes, we are in the right place to measure them. Here is one of our dashboards about Continuous Delivery, including Deployment Frequency, one of the Accelerate indicators.

Teams that use our PaaS get these out of the box. Along with a lot more metrics. Build durations, code coverage, static analysis issues, security issues, lead time for changes, stats about the code review process, etc.

I have a certain liking to the Code tab, that shows the correlation between Pull Request size and time to merge in their team. Here is an example:

You have surely noticed at least two interesting details:

It’s obvious that small PRs get merged faster. In fact, differences of just a couple dozen lines double the time to merge from hours to days.
Even though time to merge grows with PR size, it looks like very large PRs (the last bucket) take a lot less to merge. You can guess why.

This example is a good way to show how Ledger helps us influence best practises without confrontation, enforcement, or alienating engineers. We bring no opinions to the table. We show over 2 years worth of data that is contextualized for a whole team, a whole company, but never exposes personally identifiable information. We care about how teams perform, not how many points of code coverage are accountable to $engineer. This is not a tool for managers to measure performance, but for teams to understand and make informed decisions about their processes.

Again, does this have some overlap with tools like SonarQube? Definitely. But we have differentiators. We can analyze everything in the dev process, not just code quality. We can tailor to the deploy and release workflows most common to our teams. We can enrich data with organisational information specific to Adevinta’s org chart. We can correlate quality with other phases in the process (e.g. the space we want to move into next is answering questions like “Does high code coverage correlate to less incidents?”). We can reprocess over 2 years of raw data on demand and generate new stats as we develop them (or fix bugs :).

The Accelerate report notes that these indicators can be used in two ways. Teams can inform improvements in their software delivery performance. Organisations can learn how to support engineering productivity with metrics that are intelligible to non-technical stakeholders. In other words, by providing these metrics, we facilitate a conversation between tech and business, between cost and profit centres. And precisely because we measure the productivity of teams, we can also measure in what degree our job delivers on our mission: support teams to deliver at their best.

Investing in a component

Sometimes we do invest in components of the Platform.

In the talk I didn’t have time to go over Spinnaker, where we’ve been (and remain) active contributors for ~4 years along with Netflix, Google, Target and many others. Being part of the community made it easier for us to upstream features that made sense for us, such as Cloud Formation support or integrations with Travis, as well as dozens of bug fixes and other improvements.

But the best example of in-house investments are the Kubernetes clusters we build and operate on EC2. The inevitable question is: why not use EKS, GKE, or other managed solutions?

The “glue” strategy should already come to mind.

Below is a simplified comparison of what a raw installation of Kubernetes provides, and what our clusters provide. A bare Kubernetes cluster works, you can schedule containers, but that’s pretty much it. GKE, EKS, etc. provide slightly more than the left picture: they may autoscale nodes and handle other basic ops burden. But they are still far from covering the typical needs of product teams that wish to run production workloads securely, with little to no operational burden.

Some examples:

Our clusters are multi-tenant, allowing several marketplaces to share infrastructure securely while optimizing costs through higher density. We deal with all the cons of multitenancy. We handle network isolation for each tenant. We provide sensible defaults, pod security policies, etc. We added a validating webhook to the NGINX ingress controller that we contributed upstream, to reduce the blast radius of ingresses that break the NGINX configuration.

We maintain clusters in several geographical regions. This is instrumental for some of Adevinta’s central teams that build centralised functions like Messaging, Reputation systems, etc. that are meant to be used in several marketplaces that serve users in distant geographical locations (from Europe to Latin America). Our clusters offer a homogeneous, managed runtime environments, close to all Marketplaces, where central functions can be deployed.

In every cluster we provide integrations that work out of the box. You get automatic certificates leveraging cert-manager and Let’s Encrypt. Users can use authentication tokens generated through our company’s SSO. They get metrics automatically scraped and sent to Datadog or our internal Prometheus, as they chose with a simple config option. The same feature is provided for logs.

Users can avoid learning the full Kubernetes and use FIAAS, a commodity abstraction on top of Kubernetes. It was created in-house at one of our marketplaces ~7 years ago and was OSS’d in early 2019.

In every cluster we run automated canary tests periodically that deploy canonical applications and test connectivity and most integrations, we try to detect problems earlier than our users. Our team provides dedicated, 24/7/365 on-call.

We do full-stack aware upgrades. Because Google or AWS may upgrade your Kubernetes version, but will not care about the integrity of everything you’ve got running there. We upgrade at a slower pace than GKE/EKS, but when we do we ensure that the entire stack bundled in our cluster works, not just the core of Kubernetes.

And we provide insights into your costs down to the pod level, informing users about potential savings if they are over-provisioning. Here is a PoC of a Grafana dashboard with this info:

I made a point before about creating articulations that facilitate pivoting to commercial choices once they become commoditized. That’s our strategy with EKS. We keep a close eye on its roadmap and have given feedback about our needs to AWS. As soon as EKS is a suitable base Kubernetes installation for us, we are ready to lift-and-shift our integrations on top.

Zero-friction onboarding

All this is (possibly) quite cool, but if engineers need weeks or months to start using it, nobody will. That was one of our top challenges a year ago. The core components in the PaaS were mostly there, but each team would have to spend days or weeks configuring each of them by hand on their repositories. Throughout the last year we’ve invested heavily in streamlining the onboarding process. We turned what was an Ikea-like experience into an almost seamless process. Users get a web interface, enter their repository URL, click on a button, and our automation takes it from there. It takes ~10 minutes for their repository to be automatically configured with CI/CD pipelines, deploying to their team’s private namespace, and integrated with metrics and logging systems. If something fails or manual action is required from our Platform teams, the onboarding tool notifies, keeps track of the issue in JIRA, and resumes the process when we’ve unblocked it.

There is a point about the importance of automation here, but I want to stress something else. If you buy the proposition that Platform teams have a significant degree of competing scope with commercial companies, UX spacialists are a must-have. A team big enough to serve hundreds of engineers must reserve headcount for a good UX designer. It will pay off. Not only will you stop inflicting backend-made-UIs to your engineers, but also because the Platform team will learn how to understand the needs of their users, test their assumptions (most of which will be wrong), the impact they make with their work, and deliver a more professional product.

The team in charge of those automation and UIs has been using instrumentation data collected in the onboarding process to polish the experience, improve the failure rate and cover corner cases. Last quarter they moved on to solving other pain points in the platform. For example, troubleshooting a failed deploy. Right after Christmas we plan to release a team dashboard with all applications maintained by a given team. It summarizes status for each, and highlights when something fails in the build, deployment pipelines, or runtime, with relevant information collected from any system in the PaaS. Again, glue.

Change hurts, we should feel it too

Regardless of the effort we put in to improving the onboarding experience for our users, at the end of the day we’re moving engineers from a known territory, their on-prem infra, EC2, or wherever they run their services today, to an unknown one. At some point, every migration feels like this.

Théodore Géricault - The Raft of the Medusa. (1819)

When this happens, someone from our team should sail in that raft with them.

Specially for mid/large sized migration projects, we allocate at least 1-2 of our engineers to support teams on-site. To give an example, we now have an ongoing migration project for ~200 microservices from all the Spanish sites. For several months we’ve had engineers from our team, as well as the local Platform team inside Adevinta Spain sitting together in a shared seating area. Both share OKRs, regular plannings, weekly syncs, etc. In our previous engagement with the Subito.it team, three of our engineers travelled to Italy for several weeks during the quarter. We are also creating different workshops that we reuse and adapt for newly onboarded teams, including Kubernetes basics, hands-on exercises, etc.

There are two key outcomes to working closely to our users:

Trust: engineers in the product teams stop perceiving those in the Platform teams as a Slack handle, but actual people with faces that they can confidently ask questions to. The local Platform teams in the marketplaces start seeing us as partners rather than an existential threat.
Engineers in our Platform team learn the needs of product engineers, and gain experience on how it feels to use the tools we build. They also leverage the expertise of local Platform teams that have been working in this space for years in their respective marketplaces. While it’s hard to leave your team for weeks or months, the experience is an eye-opener and everyone brings invaluable insights back to our teams.

Is it working?

It’s a bumpy ride by nature, but yes. How do we measure it?

As I mentioned above, metrics like those proposed by Accelerate seemed relevant as they measure precisely the properties we wish to improve. After getting some help from a Itamar Gilad, a professional product management consultant, we settled for Successful Deployments per week as our North Star.

We believe it is a good proxy for productive habits that we want to incentivise: deployments are easy and automated, work reaches users early, and defects can be repaired quickly. For us, more frequent deployments have two positive implications: more services use our PaaS, and those that use it behave productively by deploying more often. We only count “Successful” deployments to ensure that our tools help and incentivise pushing healthy code out.

Here is our North Star tracking dashboard for the past year (the dip at the right is just the last incomplete period.)

This quarter we’ve spent time collecting a whole set of other metrics that influence the North Star (e.g. number of active repos, build durations, etc.) and are fundamental to defining good OKRs.

Harder to quantify, but easier to appreciate, is feedback from our users. A couple of days before the talk we found this in Twitter from an engineer in one of the Spanish marketplaces that were onboarding the Platform in those days.

Translation: "The transition to Kubernetes is going so well that I'm not feeling it. Basically everything works perfectly. Hat tip to everyone who made it."

We have lots of rough edges and potential for improvement, but it feels we are in the right track. My +10 for that hat tip to all the excellent colleagues that have contributed to making this possible, from infrastructure to UX and everything inbetween.

Kubernetes made my latency 10x higher

Tue, 22 Oct 2019 06:07:31 +0000

Update: it looks this post has gotten way more attention than I anticipated. I’ve seen / received feedback that the title is misleading and some people get dissapointed. I see why, so at the risk of spoiling the surprise, let me clarify what this is about before starting. As we migrate teams over to Kubernetes, I’m observing that every time someone has an issue, like “latency went up after migrating”, there is a knee-jerk reaction that “Kubernetes is to blame”, and investigations almost invariably result in “well, not really”. This post exemplifies one of these cases. The title is what our dev said, by the end you’ll see it wasn’t Kubernetes at all. There are no break through revelations to be learned about Kubernetes here, but I think there are good lessons about complex systems.

Last week my team was busy with the migration of one microservice to our central platform, which bundles CI/CD, a Kubernetes based runtime, metrics and other goodies. This exercise is meant to provide a blueprint to accelerate moving a fleet of ~150 more in the coming months, all of which power some of the main online marketplaces in Spain (Infojobs, Fotocasa, etc.)

After we had the application deployed in Kubernetes and routed some production traffic to it, things started looking worrisome. Request latencies in the Kubernetes deployment were up to x10 higher than on EC2. Unless we found a solution this would not only become a hard blocker to migrate this microservice, but could also jettison the entire project.

Why is latency so much higher in Kubernetes than EC2?

To pinpoint the bottleneck we collected metrics for the entire request path. The architecture is simple, an API Gateway (Zuul) that proxies requests to the microservice instances in EC2 or Kubernetes. In Kubernetes, we use the NGINX Ingress controller and backends are ordinary Deployment objects running a JVM application based in Spring.

                                  EC2
                            +---------------+
                            |  +---------+  |
                            |  |         |  |
                       +-------> BACKEND |  |
                       |    |  |         |  |
                       |    |  +---------+  |                   
                       |    +---------------+
             +------+  |
Public       |      |  |
      -------> ZUUL +--+
traffic      |      |  |              Kubernetes
             +------+  |    +-----------------------------+
                       |    |  +-------+      +---------+ |
                       |    |  |       |  xx  |         | |
                       +-------> NGINX +------> BACKEND | |
                            |  |       |  xx  |         | |
                            |  +-------+      +---------+ |
                            +-----------------------------+

The problem seemed to be upstream latency at the backend (I marked the connection with “xx” in the graph). When the application was deployed in EC2 it took ~20ms to respond. In Kubernetes it was taking 100-200 ms.

We quickly discarded likely suspects that could have appeared with the change of runtime. The JVM version was identical. Issues related to containerisation were discarded as the application already ran in containers on EC2. It wasn’t related to load, as we saw high latencies even with 1 request per second. GC pauses were negligible.

One of our Kubernetes admins asked whether the application had any external dependencies as DNS resolution had caused similar problems in the past, this was our best hypothesis so far.

Hypothesis 1: DNS resolution

On every request, our application makes 1-3 queries to an AWS ElasticSearch instance at a domain similar to elastic.spain.adevinta.com. We got a shell inside the containers and could verify that DNS resolution to that domain was taking too long.

DNS queries from the container:

[root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 22 msec
;; Query time: 22 msec
;; Query time: 29 msec
;; Query time: 21 msec
;; Query time: 28 msec
;; Query time: 43 msec
;; Query time: 39 msec

The same queries from one of the EC2 instances that run this application:

bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 77 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec

Given the ~30ms resolution time, it seemed clear that our application was adding DNS resolution overhead talking to its ElasticSearch.

But this was strange for two reasons:

We already have many applications in Kubernetes that communicate with AWS resources and don’t suffer such high latencies. Whatever the cause it had to be be specific to this one.
We know that the JVM implements in-memory DNS caching. Looking at the configuration in these images, the TTL configured at $JAVA_HOME/jre/lib/security/java.security and it was set to networkaddress.cache.ttl = 10. The JVM should be caching all DNS queries for 10 seconds.

To confirm the DNS hypothesis we decided to avoid DNS resolution and see if the problem disappeared. Our first attempt was to have the application talk directly to the Elasticsearch IP, rather than the domain name. This required changing code and a new deploy so instead we simply added a line mapping the domain to its IP in /etc/hosts:

34.55.5.111 elastic.spain.adevinta.com

This way the container would resolve the IP almost instantly. We did observe a relative improvement, but nowhere near our target latency. Even though DNS resolution was too high, there real cause was still hidden.

Network plumbing

We decided to tcpdump from the container so we could see exactly what was going on with the network.

[root@be-851c76f696-alf8z /]# tcpdump -leni any -w capture.pcap

We then sent a few requests and downloaded the capture (kubectl cp my-service:/capture.pcap capture.pcap) to inspect it with Wireshark.

There was nothing suspicious with DNS queries (except a detail I’ll mention later.) But something in the way our service handled each request seemed strange. Below is a screenshot of the capture, showing the reception of a request until the start of the response.

The packet numbers are shown in the first column. I coloured the different TCP streams for clarity.

The green stream starting at packet 328 shows the client (172.17.22.150) opened a TCP connection to our container (172.17.36.147). After the initial handshake (328-330), packet 331 brought an HTTP GET /v1/.., the incoming request to our service. The whole process was over in 1ms.

The grey stream from packet 339 shows that our service sent an HTTP request to the Elasticsearch instance (you don’t see the TCP handshake because it was using an existing TCP connection.) This took 18ms.

So far this makes sense, and the times roughly fit in the overall response latencies we expected (~20-30ms measured from the client).

But inbetween both exchanges, the blue section consumes 86ms. What was going on there? At packet 333, our service sent an HTTP GET request to /latest/meta-data/iam/security-credentials, and right after, on the same TCP connection, another GET to /latest/meta-data/iam/security-credentials/arn:...

We verified that this was happening on every single request for the entire trace. DNS resolution is indeed a bit slower in our containers (the explanation is interesting, I will leave that for another post). But the actual cause for the high latencies were queries to the AWS Instance Metadata service on every single request.

Hypothesis 2: rogue calls to AWS

Both endpoints are part of the AWS Instance Metadata API. This service is used from our microservice during reads from Elasticsearch. The two calls are a basic authorisation workflow. The endpoint queried in the first request yields the IAM role associated to the instance.

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
arn:aws:iam::<account_id>:role/some_role

The second request queries the second endpoint for temporary credentials for that instance:

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role`
{
    "Code" : "Success",
    "LastUpdated" : "2012-04-26T16:39:16Z",
    "Type" : "AWS-HMAC",
    "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
    "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "Token" : "token",
    "Expiration" : "2017-05-17T15:09:54Z"
}

The client is able to use them for a short period of time, and is expected to retrieve new ones periodically (before the Expiration deadline). The model is simple: AWS rotates temporary keys often for security reasons, but clients can cache them for a few minutes amortizing the performance penalty of retrieving new credentials.

The AWS Java SDK should be taking care of this process for us but, for some reason, it is not.

Searching among its GitHub issues we landed on #1921 which gave us the clue we needed.

The AWS SDK refreshes credentials when one of two conditions is met:

Expiration is within an EXPIRATION_THRESHOLD, hardcoded to 15 minutes.
The last attempt to refresh credentials is greater than the REFRESH_THRESHOLD, hardcoded to 60 minutes.

We wanted to see the actual expiration time in the certificates we were getting so we ran the two cURL commands shown above against the AWS API, both from the container and EC2 instance. The one retrieved from the container was much shorter: exactly 15 mins.

The problem was now clear: our service would fetch temporary credentials for the first request. Since these had a 15 min expiration time, in the next request, the AWS SDK would decide to eagerly refresh them. The same would happen on every request.

Why was the credential expiration time shorter?

The AWS Instance Metadata Service is meant to be used from an EC2 instance, not Kubernetes. It is still convenient to let applications retain the same interface. For this we use KIAM, a tool that runs an agent on each Kubernetes node, allowing users (engineers deploying applications to the cluster) to associate IAM roles to Pod containers as if they were EC2 instances. It works by intercepting calls to the AWS Instance Metadata service and serving them from its own cache, pre-fetched from AWS. From the point of view of the application, there is no difference with running in EC2.

KIAM happens to provide short-lived credentials to Pods, which makes sense as it’s fair to assume that the average lifetime of a Pod is shorter than EC2 instances. The default is precisely 15 min.

But if you put both defaults together, you have a problem. Each certificate provided to the application has a 15 min expiration time. The AWS Java SDK will force refreshing any certificate with less than 15 min expiration time left.

The result is that every request will be forced to refresh the temporary certificate, which requires two calls to the AWS API that add a huge latency penalty to each request. We later found a feature request in the AWS Java SDK that mentions this same issue.

The fix was easy. We reconfigured KIAM to request credentials with a longer expiration period. Once this change was applied, requests started being served without involving the AWS Metadata service and returned to an even lower latency than in EC2.

Takeaways

In our experience with these migrations, one of the most frequent sources of problems is not bugs in Kubernetes or other pieces of the platform. It isn’t either about fundamental flaws in the microservices we’re migrating. Problems often appear just because we put some pieces together in the first place.

We are blending complex systems that had never interacted together before with the expectation that they collaborate forming a single, larger system. More moving pieces, more surface for failures, more entropy.

In this case, our high latency wasn’t the result of bugs or bad decisions in Kubernetes, KIAM, the AWS Java SDK, or our microservice. It was a behaviour resulting from the combination of two independent defaults, in KIAM and the AWS Java SDK. Both choices make sense in isolation: an eager credential refresh policy in the AWS Java SDK, and the lower default expiration in KIAM. It is when they come together that the results are pathological. Two independently sound decisions don’t necessarily make sense together.

Sizing Kubernetes pods for JVM apps without fearing the OOM Killer

Wed, 29 May 2019 21:10:54 +0000

Lately at work I’ve been helping out some teams migrating their workloads from on-prem / EC2 infrastructure to Kubernetes. It’s being a good boot camp on Kubernetes.

In this post I will go over one of the issues we faced recently, related to resource assignments for JVM applications. I will explain how running in Kubernetes forces us to think about capacity planning more than we’re used to, and changes some of the assumptions we made before containers.

I will assume familiarity with basic Kubernetes concepts (you know what are nodes, pods, etc..) and the JVM.

Help! My application is getting OOM killed

Resource limits are one of the most frequent pain points for teams new to Kubernetes. Our cluster configuration is relatively (too?) aggresive in promoting small, canonical microservices that scale horizontally. Developers are used to being liberal allocating memory for JVMs that usually run on their very own EC2 instance. Finding themselves suddenly constrained within pods under 5GB often results in various flavours of OOM kills or OutOfMemoryErrors

An engineer from one of our product teams found himself in this situation last week and was asking for help configuring their deployment. The Kubernetes manifest declared:

resources:
  limits:
    memory: 4Gi
    cpu: 1
  requests:
    memory: 1Gi
    cpu: 1

And the relevant part of the Dockerfile was:

FROM openjdk:8u181-jre-slim
...
ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada

Pods were being OOMKilled at inconvenient times, so he came asking for help in sizing his pods and JVM.

What are pod `request` / `limits` in Kubernetes

Let’s first see what are the resources parameters for.

When we ask Kubernetes to run our application, the Kubernetes scheduler looks for nodes in the cluster where our pods can run. It will search based on multiple criteria, where the most basic is whether a node has enough memory and CPU to run the container. This is what the parameters in the resources section are for. requests sets the minimum amount of resources each container in our Pod needs to start successfully. Kubernetes will schedule our pods only in suitable nodes. If there are not enough, then our pod will become Unschedulable. In the snippet above, the application declares a minimum of 1Gi of memory and 1 CPU as requests.

But requests is not a hard limit. Once it’s running, if the container needs additional memory / CPU it can ask the kernel for extra capacity from its node. This elasticity is useful to handle bursts of load, but also creates a risk that greedy pods hog too many resources. limits is there to control this by fixing a maximum memory / CPU that each container in the pod will be allowed to use.

The Kubernetes documentation has a good explanation on the tradeoffs that developers can play with by using requests and limits

By configuring memory requests and limits for the Containers that run in
your cluster, you can make efficient use of the memory resources
available on your cluster’s Nodes. By keeping a Pod’s memory request
low, you give the pod a good chance of being scheduled. By having a
memory limit that is greater than the memory request, you accomplish two
things:

- The pod can have bursts of activity where it makes use of memory
  that happens to be available.
- The amount of memory a pod can use during a burst is limited to
  some reasonable amount

Why is our Pod getting OOM killed?

Coming back to our Dockerfile, the first thing that calls our attention is:

ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada

The JVM will allocate 1GiB of heap up front, consuming the full capacity that the container we got through requests. Given that the JVM needs additional memory (code cache, off-heap, thread stacks, GC datastructures..), as does the operating system, our containers are born undersized.

It’s evident that requests is too small.

Wait: didn’t Hotspot 1.8 have problems tracking memory in containers?

This is the first thing to suspect when containerized JVMs are getting OOM killed. Our teams have hit this a few times. Long story short, when you use Hotspot JVM at any version below 8u121, you can start a JVM inside a container limited to 2GB of memory, and find that the JVM tries to start with a much larger heap and gets killed. The container is exceeding memory limits, but it’s the JVM’s fault for being too greedy. Why?

The cause of this problem was that the JVM used to retrieve the available memory. It used to look at a system file that contains the host’s memory capacity, not the container’s (or more precisely a kernel’s control group that is used to implement the container.) This issue was dealt with in JDK-8189497. From u181, the Hotspot JVM added a couple of flags that let you instruct the JVM to use the cgroup memory. From u121 this behaviour became default. We’re using a version > u181.

However, this issue was only relevant when the JVM had to calculate the min/max heap sizes by itself. In our case, since we’re passing the -Xmx and -Xms flags explicitly, so we’re not affected. We can confirm because both heap committed metrics above stay withn the 1.5GiB, as we’ll see below.

What would be a reasonable `requests` value then?

Here is a snapshot of the application under normal load. I’m showing the committed value as this represents memory actually reserved by the JVM, as opposed to usage (which = committed - not used).

metric	memory
jvm_memory_committed_bytes{area=”nonheap”,id=”Code Cache”,}	0.05 GiB
jvm_memory_committed_bytes{area=”nonheap”,id=”Metaspace”,}	0.07 GiB
jvm_memory_committed_bytes{area=”nonheap”,id=”Compressed Class Space”,}	0.01 GiB
jvm_memory_committed_bytes{area=”heap”,id=”PS Eden Space”,}	0.52 GiB
jvm_memory_committed_bytes{area=”heap”,id=”PS Survivor Space”,}	0.01 GiB
jvm_memory_committed_bytes{area=”heap”,id=”PS Old Gen”,}	1.06 GiB
total committed	1.72 GiB

Notice that our JVM heap is not 1Gib but 1.5GiB. This is to be expected as the JVM may expand the heap once the application starts doing actual work and puts pressure on the heap, specially under load. However, thanks to the explicit -Xmx flag, we can trust that the heap won’t use more than 1.5GiB. requests=1Gi is therefore too low. But still, but limits=4Gi should be giving plenty of head room to prevent an OOM Kill.

What else could be pushing our memory footprint over `limits`?

One answer was outside the JVM. We looked at tmpfs mounts. These volumes look as ordinary directories in our filesystem, but they actually reside in memory and thus contribute to the overall memory consumption. This application is based on Kafka Streams, which uses a RocksDB instance for an internal cache that is persisted to disk. The team decided to use the /tmp mount to avoid IO overhead, which was a good idea, but did not account for the implicit tradeoff in additional memory usage. To make things worse, the /tmp folder was also storing logs for the application. All in all, RocksDB and logs added almost 2GiB to the container’s memory footprint. With limits set to 4GiB we were already flirting with the OOM killer.

This is the typical change of assumptions that appears when moving from physical machines or VMs (which tend to be large) to containers (which in our cluster tend to be much smaller.)

Our second measure was to relocate the RocksDB storage to disk (chosing less memory utilization over IO load) and ensure that the application kept log files small, and frequently rotated. After this change, the /tmp directory stays under 20MiB.

OOM killer keeps acting

While these changes made the situation better for a while, OOM kills kept happening.

So far we’ve just focused on the heap, but we know that the JVM uses memory for other purposes (some of which appear above: Code Cache, Metaspace, Compressed Class Space…). Let’s take a look at the real memory usage of the JVM process.

root@myapp-5567b547f-tk54j:/# cat /proc/1/status | grep Rss
RssAnon:         3677552 kB
RssFile:           15500 kB
RssShmem:           5032 kB

I believe these three add up to the RSS (Resident Set Size), which tells us the memory that our JVM process has in main memory at this point in time. Note that JVM alone is enough to consume 3.6GiB out of our 4GiB limit.

And we didn’t look at the operating system either. In containers memory information is found under the /sys/fs/cgroup tree, let’s check there:

root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.stat
cache 435699712
rss 3778998272
rss_huge 3632267264
shmem 9814016
mapped_file 5349376
dirty 143360
writeback 8192
swap 0
pgpgin 2766200501
pgpgout 2766061173
pgfault 2467595
pgmajfault 379
inactive_anon 2650112
active_anon 3786137600
inactive_file 198434816
active_file 227450880
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 435699712
total_rss 3778998272
total_rss_huge 3632267264
total_shmem 9814016
total_mapped_file 5349376
total_dirty 143360
total_writeback 8192
total_swap 0
total_pgpgin 2766200501
total_pgpgout 2766061173
total_pgfault 2467595
total_pgmajfault 379
total_inactive_anon 2650112
total_active_anon 3786137600
total_inactive_file 198434816
total_active_file 227450880
total_unevictable 0

All these are bytes. The full Resident Set Size for our container is calculated with the rss + mapped_file rows, ~3.8GiB.

We see that mapped_file, which includes the tmpfs mounts, is low since we moved the RocksDB data out of /tmp. So in practise, non-JVM footprint is small (~0.2 GiB). The remaining 3.6GiB are consumed by our JVM. Our container runs with ~200 MiB to spare, so it’s likely that any burst in memory usage will tip us over the ‘limit’. cat /sys/fs/cgroup/memory/memory.failcnt will tell us how many times we hit memory usage limits

Interestingly, while our JVM is consuming 3.6GiB, we just use ~1.72GiB of heap committed according to the metrics we showed above. This means we have almost 2GiB unaccounted for in the JVM. The first suspect should be off-heap memory. A quick look at the JVM metrics discards this:

jvm_buffer_total_capacity_bytes{id="direct",} 284972.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

That’s barely 0.2MiB, so negligible in this JVM. Understanding what in the JVM is consuming the additional 2GiB will take another post.

For now I will wrap up this post by highlighting a couple of things I learned from this exercise, and some action points that we could already apply.

`requests` is a guarantee, `limits` is an obligation

There is a subtle change of semantics when we go from requests to limits. For the application developer, requests is a guarantee offered by Kubernetes that any pod scheduled will have at least the minumum amount of memory. limits is an obligation to stay under the maximum amount of memory, which will be enforced by the kernel.

In other words: containers can’t rely on being able to grow from their initial requests capacity to the maximum allowance set in limits.

This is problematic. Maybe it was the JVM that needed more capacity in order to grow the heap. If it doesn’t get it, we can expect at best a saturated heap with constant GC cycles, application pauses and increased CPU load. At worst, an OutOfMemoryError. Maybe it was the OS that needed more capacity for $purposes. In any case, if we fail to increase memory when needed, we will be implicitly asking the OOM killer to find a process to kill. The JVM has most tickets in that raffle, as it’s by far the biggest memory consumer.

Suggested changes

First, although not related to the OMM kill: just set -Xmx = -Xms and ensure your JVM reserves all the heap it will use up-front.

Having both JVM and Docker (or cgroups to be precise) growing memory dynamically just makes it harder to reason about capacity or understanding this type of issues.

Second, requests should be at least bigger than -Xmx, and ensure that to add enough head room for the JVM and the OS. How much more? Depends on what runs inside your container. A simple microservice will probably be fine with requests ~%25 higher than -Xmx. But does your microservice (or any library inside it) use off-heap memory? Write logs in a tmpfs volume? Read/Write volumes shared with other containers in the pod? All these (and other factors) impact your memory footprint. You need to know this in with reasonable detail if you want to avoid surprises.

Third, setting limits much higher than requests is meant to handle bursts of activity where your pod makes use of “memory that happens to be available”. Notice the wording. If getting that extra memory is a necessity for your app to do its job, don’t gamble and just reserve it up-front with requests.

Fourth, running applications inside containers forces us to have a solid idea of the memory requirements. I left unanswered why our JVM is using ~3.6GiB despite having 1.5GiB max heap, and no off-heap memory, I’ll try to write a followup to go deeper into this. But, assuming we know our memory footprint, my intuition (as a non-Kubernetes expert) is that we should prefer to have requests == limits at least for memory.

One last learning: don’t try to predict OS footprint

I mentioned that we should ensure that our requests leaves extra memory for the operating system. My temptation when I first looked at this data was to use the values we got from /sys/fs/cgroup/memory/memory.stat to predict future OS footprint (~0.2GiB) while running this application. By reading through the kernel and Docker documentation I realized this would probably not be wise.

When the kernel watches for the limits assigned to a container, it counts all of the RSS, as well as part of the page cache (the cache row, see see 2.2.1 and 2.3 for concrete accounting details). The page cache is used by the OS to speed up access to data in disk by keeping the most accessed pages in memory (whichcounts for memory utilization). But, what happens if several containers access the same file? The Docker docs explain this clearly:

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge is split between the control groups. It’s nice, but it
also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

The memory footprint of our container depends on both its neighbours and which files on the node’s disk they read. Because containers come and go, share of the memory accounted to a given container may change. If we’re already flirting with our limits this might push us into OOM Kill territory.

GC forensics by example: multi-second pauses and allocation pressure

Sat, 28 Apr 2018 09:31:22 +0000

This post will analyze a Hotspot GC log exhibiting large GC pauses (> 1 min) leading to allocation pressure and system load as a cause of pathological behaviour on Hotspot’s garbage collector.

I will assume some familiarity with the JVM, GC, that you understand the basic lifecycle of an object, have notions of what a generational collector is and how the G1 collector structures the heap. If you want more info on these topics I recommend Oracle JDK documentation as a good starting point.

The log

The application ran with Oracle’s 1.8 JVM mostly on default settings (I didn’t keep the exact parameters though). It did use the G1 collector (-XX:+UseG1GC) and aproximately the following settings regarding GC logs:

-Xloggc:$PATH
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+PrintPromotionFailure
-XX:+PrintGCApplicationStoppedTime

This would output logs at $PATH, with content similar to these:

...
2017-03-15T17:45:47.751+0100: 0.621: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 4194304 bytes, new threshold 15 (max 15)
, 0.0213976 secs]
   [Parallel Time: 9.5 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 621.3, Avg: 623.5, Max: 630.7, Diff: 9.4]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 1.4, Diff: 1.4, Sum: 3.1]
...

Which is a detailed log of each GC event in the JVM. They’ll be useful to us later, but it’s not the best format to get a sense of how GC is behaving over time. A tool like GC Viewer comes handy in these situations.

This section correspond to one of the large pauses that we will be looking into:

The horizontal axis shows time. Both memory size in MBs and pause time are shown in seconds on the vertical axis.

Two stacked areas represent the size of the Young and Old Generation (yellow and pink respectively). We had about 9.5GB Old Gen plus 8.5GB of Young Gen. These only show memory reserved by the JVM for the heap. Usage is represented with thin lines inside each area. In the Old Generation (pink area), a darker pink line shows Used Old Gen. In the Young Gen, usage is represented with a light grey line.

The see-saw pattern that we see for the first half of the graph represents a stable cycle. In each iteration, the application allocated new objects in the Young Gen until it eventually filled up (identifiable by the grey line hitting the top limit of the yellow area.). Then the JVM was forced to make room in the Young Gen by triggering a Minor Collection. Memory was released and utilisation in the Young Gen dropped back to the bottom of the yellow area. The application continued creating objects and the cycle repeated.

Minor collections recover memory in two ways. First, they release memory consumed by objects residing in the Young Gen when they are no longer referenced. This allowed the JVM to keep up with the application memory usage by collecting all 8.5GB on each iteration.

But this changes right after 20:00:00, with growth in both the Old Gen (pink area) and its utilisation (pink line) grow. This growth is explained because the JVM promotes objects from the Young Gen based on a maximum age: the tenuring threshold. This age is defined as the number of minor collections that an object survived in the Young Gen). After an object exceeds that threshold, the collector will promote it to the Old Gen.

According to our logs, the JVM was promoting so many objects that the Old Gen filled up, so it was necessary to expand it. (I will keep speaking about Young/Old Gen as individual contiguous space, in G1 this is valid from a logical point of view, but the actual heap layout is different.)

The graph shows how our GC event resulted in over half of the Young Gen being moved to Old Gen. Because the JVM had a fixed heap of 18GB (-Xmx18g) it wasn’t able to simply expand the heap. Instead, it shrunk the Young Gen to reassign the space to the old Gen. This brought two interesting effects.

First, the smaller Young Gen now filled up much faster as the service kept allocating new objects at the same rate. Right below the 20:02:00 mark we see a tighter see-saw pattern that indicates more frequent Young Collections (of course, each iteration releases just a bit of memory.)
Second, after two cycles in the tigher see-saw pattern, utilisation in the Old Gen (pink line) suddenly drops to the same level as it was prior to the resize. The JVM had just moved ~6GB to Old Gen, only to garbage-collect it shortly after. (╯°□°）╯︵ ┻━┻

This behaviour is common under high allocation pressure. The application creates too many short-lived objects that fill the Young Gen fast. This forces the JVM to trigger more frequent Minor Collections, which in turn causes live objects to age faster. That makes them hit the tenuring threshold earlier, causing premature promotions to the Old Gen as they will soon be unused and eligible for collection.

Why is this bad?

Take a look at the grey rectangle at the bottom-right of the graph, precisely in the interval where the JVM is growing the Old Gen. This is how GcViewer represents a GC Pause. Looking at the timestamps, this is well over a minute.

Why the pause? Well, none of this memory shuffling is cheap. To make things worse, these minor collections are also stop-the-world (STW) events: this means that the application was completely stopped. Let’s look at the GC logs to understand the details. Below is the relevant log for the event.

2017-03-15T20:00:26.471+0100: 8079.342: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 593494016 bytes, new threshold 15 (max 15)
- age   1:      60136 bytes,      60136 total
- age   2:       4312 bytes,      64448 total
- age   3:       3864 bytes,      68312 total
- age   4:       3784 bytes,      72096 total
- age   5:       3784 bytes,      75880 total
- age   6:       7568 bytes,      83448 total
- age   7:       3784 bytes,      87232 total
- age   8:       3784 bytes,      91016 total
- age   9:       3784 bytes,      94800 total
- age  10:       7568 bytes,     102368 total
- age  11:       3784 bytes,     106152 total
- age  12:       3784 bytes,     109936 total
- age  13:       3784 bytes,     113720 total
- age  14:       7568 bytes,     121288 total
- age  15:       3784 bytes,     125072 total
(to-space exhausted), 80.1974703 secs]
[Parallel Time: 77390.8 ms, GC Workers: 18]
  [GC Worker Start (ms): Min: 8079342.1, Avg: 8079342.2, Max: 8079342.2, Diff: 0.1]
  [Ext Root Scanning (ms): Min: 0.2, Avg: 0.3, Max: 0.4, Diff: 0.1, Sum: 5.2]
  [Update RS (ms): Min: 155.4, Avg: 267.9, Max: 567.9, Diff: 412.4, Sum: 4821.3]
     [Processed Buffers: Min: 2, Avg: 7.3, Max: 14, Diff: 12, Sum: 131]
  [Scan RS (ms): Min: 0.1, Avg: 0.4, Max: 3.1, Diff: 3.0, Sum: 6.3]
  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
  [Object Copy (ms): Min: 76814.9, Avg: 77116.1, Max: 77231.5, Diff: 416.5, Sum: 1388089.1]
  [Termination (ms): Min: 0.0, Avg: 5.9, Max: 7.2, Diff: 7.2, Sum: 106.6]
  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 2.1]
  [GC Worker Total (ms): Min: 77390.4, Avg: 77390.6, Max: 77390.8, Diff: 0.3, Sum: 1393030.8]
  [GC Worker End (ms): Min: 8156732.7, Avg: 8156732.8, Max: 8156732.9, Diff: 0.2]
[Code Root Fixup: 0.0 ms]
[Code Root Migration: 0.3 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 1.6 ms]
[Other: 2804.7 ms]
  [Evacuation Failure: 2778.0 ms]
  [Choose CSet: 0.0 ms]
  [Ref Proc: 5.6 ms]
  [Ref Enq: 0.0 ms]
  [Redirty Cards: 10.0 ms]
  [Free CSet: 1.5 ms]
[Eden: 9044.0M(9044.0M)->0.0B(320.0M) Survivors: 4096.0K->1132.0M Heap: 16.2G(18.0G)->14.8G(18.0G)]
  [Times: user=77.62 sys=2.54, real=80.20 secs]

There is a lot of information there, but let’s focus on the key relevant points for the topic we’re investigating.

Lines 1-18 yield a summary of the event enclosed in brackets [GC Pause ... The first two lines explain that the event was triggered due to G1 Evacuation pause (young), meaning that the Young Gen filled up and needed to be cleaned up. Lines 3-17 give details about the distribution of object ages in the Young Gen, which I’ll skip as it’s out of scope of this post. Line 18 ((to-space exhausted)...) indicates that the target space (Old Gen) is not big enough to hold all the objects that need to be moved. The full event takes 80.2s.

Line 19 explains that this collection used 18 threads and took over 77s.

19 [Parallel Time: 77390.8 ms, GC Workers: 18]

Lines 20-30 cover how long it took to execute the different phases of the collection. Most times went to Object Copy:

26 [Object Copy (ms): Min: 76814.9, Avg: 77116.1, Max: 77231.5, Diff: 416.5, Sum: 1388089.1]

Min/Max/Avg are stats per thread. The Sum is the aggregate for all worker threads (hence, Sum = N threads * Avg). Each one of the 18 threads are spending 77s on average simply to copy objects around.

We have non-negligible times in other phases. Line 22 reveals that almost 5s aggregated were spent in the Update RS phase.

22 [Update RS (ms): Min: 155.4, Avg: 267.9, Max: 567.9, Diff: 412.4, Sum: 4821.3]

RS stands for Remembered Set, an internal data structure kept by the JVM for each region. Given a region X, its RS would track references from other regions that point to objects that reside in X. Whenever the application updates the references (e.g. when we assign a reference field in an object), the JVM takes care of updating the RS.

For efficiency reasons those updates are buffered and processed concurrently with the application. However, when the JVM starts a collection it holds pending updates in this buffer while GC workers perform an initial scan of the heap (Ext Root Scan at L21) looking for live objects. Only when this scan is complete are RS updates in the buffer drained. It looks like this step is taking significant time. We’ll try to explain possible causes later.

Finally, we spot one more phase with significant times:

36 [Evacuation Failure: 2778.0 ms]

This line indicates that the JVM wasn’t able to move any more objects to either Survivor space or the Old Gen because the target space is already at max capacity. Evacuation failures can be expensive as handling them creates additional work for the JVM.

Moving on. Line 42:

42 [Eden: 9044.0M(9044.0M)->0.0B(320.0M) Survivors: 4096.0K->1132.0M Heap: 16.2G(18.0G)->14.8G(18.0G)]

Young Gen is both Eden (containing brand new objects) plus Survivor spaces (containing objects that survived >1 Minor collections, but < tenuring threshold). Eden was completely emptied (9044MB -> 0), and it was also shrunk from 9044MB to 320MB. Some of the objects evacuated from Eden stayed in Survivor spaces as they were not yet over the tenuring threshold, 1132MB in total.

The total heap size remains at 18GB (as determined by the JVM option -Xmx18g), and usage went from 16.2GB -> 14.8GB. Overall, of the 9GB evacuated from Young Gen. 2GB were released, 1GB remained in the survivor space, and 6GB were moved over to Old Gen.

Using -XX:+G1PrintHeapRegions and -XX:+PrintAdaptiveSizePolicy would have generated more details on this process, including a more verbose description of the resizing decisions taken by the JVM.

-XX:+PrintAdaptiveSizePolicy

Unfortunately it was disabled at the time that the log was collected.

Finally, line 43 summarizes the times.

43 [Times: user=77.62 sys=2.54, real=80.20 secs]

User here is CPU time spent by the JVM. sys is time in OS calls or waiting for system events. Real is clock time where the application was stopped. So, to summarize:

18 threads moved ~6GB worth of objects.
The full copy took about 77s for each thread.
Update RS took in aggregate 5s.
Evacuation Failures consumed over 2s.
Real pause time was 80s, where 77.6s were in user space, 2.5s in OS.

Let’s look at the two main issues.

The problem with premature promotions

The most obvious observation is that we’re running into trouble as a result of allocation pressure. Frequent Minor GCs are needed to scan the live objects and move those that remain across survivor spaces. This frequency ages objects faster, the JVM ends up spilling objects to Old Gen (and in the particular case shown in the graph, triggering a region resize.)

Even though those objects are being promoted objects, their lifespan is not that large. Soon after being promoted to Old Gen most of them become garbage (no other objects reference them) and will need to be collected at some point by a Major collection (minor -> young gen, major -> old gen). These are the relevant logs from our example:

2017-03-15T20:01:47.849+0100: 8160.719: [GC concurrent-root-region-scan-start]
2017-03-15T20:01:47.849+0100: 8160.719: [GC concurrent-root-region-scan-end, 0.0001351 secs]
2017-03-15T20:01:47.849+0100: 8160.719: [GC concurrent-mark-start]
2017-03-15T20:01:51.521+0100: 8164.392: [GC concurrent-mark-end, 3.6726632 secs]
2017-03-15T20:01:51.522+0100: 8164.392: [GC remark 8164.393: [GC ref-proc, 0.0003795 secs], 0.1287467 secs]
 [Times: user=0.17 sys=0.01, real=0.12 secs]
2017-03-15T20:01:51.714+0100: 8164.585: Total time for which application threads were stopped: 0.0634409 seconds
2017-03-15T20:01:51.714+0100: 8164.585: [GC concurrent-cleanup-start]
2017-03-15T20:01:51.797+0100: 8164.667: [GC concurrent-cleanup-end, 0.0826101 secs]

Which correspond to a light blue line in the graphs. This is relatively cheap under G1 because the JVM is able to do most of the work concurrently with the application. Nevertheless, our app still still suffered a 60ms freeze, which is really not good if it’s minimally sensitive.

That Major collection needs an initial phase, which can be spotted with the “initial-mark” tag in a Minor collection (the Old Gen collector piggy backs on it.) I’m cutting the details, but note the additional time:

2017-03-15T20:01:46.983+0100: 8159.853: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
Desired survivor size 75497472 bytes, new threshold 1 (max 15)
[...]
 [Times: user=10.49 sys=0.13, real=0.86 secs]

That’s another 0.86s freeze.

This is in fact a relatively good case. If the pattern of prematurely pushing objects to Old Gen continues, it may fill Old Gen and force the JVM to trigger a Full GC, which is even more expensive.

Preventing premature promotions

In cases such as this one, we would want tune our workloads for two properties:

Minimize Minor collections: so that we avoid scans over the live set in the Young Gen, and copying between survivor spaces.
Make each Minor collection count: by tuning your app’s workload so that each collection releases sufficient space in the Young gen to avoid promoting any objects.

The solution is usually a combination of both.

Having larger a Young Gen helps accomodate more objects (therefore less Minor GCs) and clean up more on each one. Some handy options for this are -XX:NewRatio, XX:G1MaxNewSizePercent, -XX:G1NewSizePercent which are well documented. However fine tuning region sizes is a fragile equilibrium and hard to achieve in non trivial applications. In practise, it’s better to just use -XX:MaxGCPauseMillis=<number> so the JVM can try to size regions (and tune other parameters) in the best possible way to meet those targets. They are, however, not guarantees but best-effort. Note that fixing Young Gen sizes overrides the target pause goal

Another useful flag in this context -XX:G1ReservePercent which reserves a percentage of heap to be available the Survivor space during evacuation (default is 10%.)

In my experience having a relatively precise idea of your app’s allocation rate is important and effective. The memory allocated per unit of work (e.g. a request handled) also tells you the garbage that the JVM will have to clean after it’s completed. This information, plus heap sizes, helps doing reasonable capacity planning on your servers. For example, at 1MB/request and 100 requests/s, a Young gen with 100MB will fill up roughly once per second.

That was the route we followed at the time, along with taking advantage of several opportunities to reduce the amount of garbage generated per request (that’s a topic for another post.)

Poor GC throughput

Although our application’s workload is certainly causing too much work on the garbage collector, looking at the actual times they look excessive.

Having 18 threads move ~6GB in 77s per thread implies that each thread’s throughput was a tiny 4.3MB/s. Just for the sake of comparison, the same application running in similar hardware reported 407 MB/s. Regardless of the theoretical memory and JVM throughput, a 100x difference indicates that something is impacting the performance of GC worker threads on that specific environment. It’s not about only about the amount of garbage we’re generating.

LinkedIn has a couple of posts touching on some common causes, also mentioned in the Oracle documentation.

The JVM is fighting for CPU time with other processes in the system.
The JVM is having to allocate memory pages.
The JVM is returning memory pages to the OS, which swaps them out to disk making later accesses more expensive.
GC workers may be recruited by the OS for disk flushes, hitting them with syscalls and waits in the IO stack.

We could discard the second because the JVM pre-allocated the full heap capacity up front. The swap hypothesis may be coming into play, but having sys time at 2s, swapping doesn’t really explain the 77s in user land.

The main hypothesis to confirm is whether the system is overloaded. Second, whether we’re having heavy I/O on the disk where logs are being written to.

We do find later in the logs some events that reinforce the overloaded option:

41034    [Eden: 804.0M(804.0M)->0.0B(804.0M) Survivors: 116.0M->116.0M Heap: 4585.4M(18.0G)->4620.3M(18.0G)]
41035  [Times: user=7.52 sys=0.25, real=131.86 secs]

As seen here, the sys time is negligible, and instead GC threads were spending a lot of time idle waiting for CPU time. This service was indeed sharing the machine with a couple other processes that were found to be the culprits for this. Unfortunatley I did not keep sufficient evidence from the time to share more details here.

Wrap up

The post has grown quite large so far so I’ll leave it here. To summarize, we’ve seen an example of a Hotspot GC log exhibiting big application pauses. A more detailed review of the logs showed that excessive allocation pressure was causing premature promotion, copying large sections of memory reserved for objects that would be eligible for collection soon after being tenured. We explored some measures that would help gain more insights into the process and options to improve the situation. Finally, we observed how GC times were also being impacted by neighbouring processes on the same host that caused significant CPU contention.

How does the default hashCode() work?

Mon, 30 Jan 2017 08:29:02 +0000

In which scratching the surface of hashCode() leads to a speleology trip through the JVM source reaching object layout, biased locking, and surprising performance implications of relying on the default hashCode().

Abundant thanks to Gil Tene and Duarte Nunes reviewing drafts of this article and their very valuable insights, suggestions and edits. Any remaining errors are my own.

A trivial mystery

Last week at work I submitted a trivial change to a class, an implementation of toString() so logs would be meaningful. To my surprise, the change caused a ~5% coverage drop in the class. I knew that all new code was covered by existing unit tests so, what could be wrong? Comparing coverage reports a sharper colleague noticed that the implementation of hashCode() was covered before the change but not after. Of course, that made sense: the default toString() calls hashCode():

public String toString() {
    return getClass().getName() + "@" + Integer.toHexString(hashCode());
}

After overriding toString(), our custom hashCode() was no longer being called. We were missing a test.

Everyone knew the default toString() but..

What is the default implementation of `hashCode()`?

The value returned by the default implementation of hashCode() is called identity hash code so I will use this term from now on to distinguish it from the hash provided by overriden implementations of hashCode(). FYI: even if a class overrides hashCode(), you can always get the identity hash code of an object o by calling System.identityHashCode(o).

Common wisdom is that the identity hash code uses the integer representation of the memory address. That’s also what the J2SE JavaDocs for Object.hashCode() imply:

... is typically implemented by converting the internal address of
the object into an integer, but this implementation technique is not
required by the Java™ programming language.

Still, this seems problematic as the method contract requires that:

Whenever it is invoked on the same object more than once during an
execution of a Java application, the hashCode method must consistently
return the same integer.

Given that the JVM will relocate objects (e.g. during garbage collection cycles due to promotion or compaction), after we calculate an object’s identity hash we must be able to retain it in a way that survives object relocation.

A possibility could be to take the current memory position of the object on the first call to hashCode(), and save it somewhere along with the object, like the object’s header. That way, if the object is moved to a different memory location, it would carry the original hash with it. A caveat of this method is that it won’t prevent two objects from having the same identity hash, but that’s allowed by the spec.

The best confirmation would be to to look at the source. Unfortunately, the default java.lang.Object::hashCode() is a native function:

public native int hashCode();

Helmets on.

Will the real `hashCode()` please stand up

Note that the identity hashCode() implementation is dependant on the JVM. Since I will only look at OpenJDK sources, you should assume this specific implementation whenever I talk about the JVM. All links refer to changeset 5820:87ee5ee27509 of the Hotspot tree, I assume that most of it will also be applicable to Oracle’s JVM, but things could (in fact, are) different in others (more about this later.)

OpenJDK defines entry points for hashCode() at src/share/vm/prims/jvm.h and src/share/vm/prims/jvm.cpp. The latter has:

JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
 JVMWrapper("JVM_IHashCode");
 // as implemented in the classic virtual machine; return 0 if object is NULL
 return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
JVM_END

ObjectSynchronizer::FastHashCode() is also called from identity_hash_value_for, which is used from a few other call sites (e.g.: System.identityHashCode())

intptr_t ObjectSynchronizer::identity_hash_value_for(Handle obj) {
 return FastHashCode (Thread::current(), obj()) ;
}

One might naively expect ObjectSynchronizer::FastHashCode() to do something like:

if (obj.hash() == 0) {
    obj.set_hash(generate_new_hash());
}
return obj.hash();

But it turns out to be a hundred line function that seems to be far more complicated. At least we can spot a couple of if-not-exists-generate blocks like:

685   mark = monitor->header();
...
687   hash = mark->hash();
688   if (hash == 0) {
689     hash = get_next_hash(Self, obj);
...
701   }
...
703   return hash;

Which seems to confirm our hypothesis. Let’s ignore that monitor for now, and be satisfied that it gives us the object header. It is kept at mark, a pointer to an instance of markOop, which represents the mark word that belongs in the low bits of the object header. So, tries to get a hash inside the mark word. If it’s not there, it’s generated using get_next_hash, saved, and returned.

The actual identity hash generation

As we saw, this happens at get_next_hash. This function offers six methods based on the value of some hashCode variable.

A randomly generated number.
A function of memory address of the object.
A hardcoded 1 (used for sensitivity testing.)
A sequence.
The memory address of the object, cast to int.
Thread state combined with xorshift (https://en.wikipedia.org/wiki/Xorshift)

So what’s the default method? OpenJDK 8 seems to default on 5 according to globals.hpp:

1127   product(intx, hashCode, 5,                                                \
1128           "(Unstable) select hashCode generation algorithm")                \

OpenJDK 9 keeps the same default. Looking at previous versions, both OpenJDK 7 and OpenJDK 6 use the first method, a random number generator.

So, unless I’m looking at the wrong place the default hashCode implementation in OpenJDK has nothing to do with the memory address, at least since version 6.

Object headers and synchronization

Let’s go back a couple of points that we left unexamined. First, ObjectSynchronizer::FastHashCode() seems overly complex, needing over 100 lines to perform what we though was a trivial get-or-generate operation. Second, who is this monitor and why does it have our object’s header?

The structure of the mark word is a good place to start making progress. In OpenJDK, it looks like this

// The markOop describes the header of an object.
//
// Note that the mark is not a real oop but just a word.
// It is placed in the oop hierarchy for historical reasons.
//
// Bit-format of an object header (most significant first, big endian layout below):
//
//  32 bits:
//  --------
//             hash:25 ------------>| age:4    biased_lock:1 lock:2 (normal object)
//             JavaThread*:23 epoch:2 age:4    biased_lock:1 lock:2 (biased object)
//             size:32 ------------------------------------------>| (CMS free block)
//             PromotedObject*:29 ---------->| promo_bits:3 ----->| (CMS promoted object)
//
//  64 bits:
//  --------
//  unused:25 hash:31 -->| unused:1   age:4    biased_lock:1 lock:2 (normal object)
//  JavaThread*:54 epoch:2 unused:1   age:4    biased_lock:1 lock:2 (biased object)
//  PromotedObject*:61 --------------------->| promo_bits:3 ----->| (CMS promoted object)
//  size:64 ----------------------------------------------------->| (CMS free block)
//
//  unused:25 hash:31 -->| cms_free:1 age:4    biased_lock:1 lock:2 (COOPs && normal object)
//  JavaThread*:54 epoch:2 cms_free:1 age:4    biased_lock:1 lock:2 (COOPs && biased object)
//  narrowOop:32 unused:24 cms_free:1 unused:4 promo_bits:3 ----->| (COOPs && CMS promoted object)
//  unused:21 size:35 -->| cms_free:1 unused:7 ------------------>| (COOPs && CMS free block)

The format is slightly different on 32 and 64 bits. The latter has two variants depending on whether Compressed Object Pointers are enabled. Both Oracle and OpenJDK 8 do by default.

Object headers may thus relate to a free block or an actual object, in which case there are multiple possible states. In the simplest, (“normal object”) the identity hash is stored directly in the low addresses of the header.

But in other states, we find a pointer to a JavaThread or a PromotedObject. The plot thickens: if we put the identity hash in a “normal object”, will someone take it away? Where? If the object is biased, where can we get/set the hash? What is a biased object?

Let’s try to answer those questions.

Biased locking

Biased objects appear as a result of Biased Locking. A (patented!) feature enabled by default from HotSpot 6 that tries to alleviate the cost of locking objects. Such operations are expensive because their implementation often relies on atomic CPU instructions (CAS) in order to safely handle lock/unlock requests on the object from different threads. It was observed that in most applications, the majority of objects are only ever locked by one thread so paying the cost of the atomic operation was often a waste. To avoid it, JVMs with biased locking allow threads to try and “bias” an object towards themselves. While an object is biased, the lucky thread can lock/unlock the object without atomic instructions. As long as there are no threads contending for the same object, we’ll gain performance.

The biased_lock bit in the header indicates whether an object is biased by the thread pointed at by JavaThread*. The lock bits indicate whether the object is locked.

Precisely because OpenJDK’s implementation of biased locking requires writing a pointer in the mark word, it also needs to relocate the real mark word (which contains the identity hash.)

This could explain the additional complexity in FastHashCode. The header not only holds the identity hash code, but also locking state (like the pointer to the lock’s owner thread). So we need to consider all cases and find where the identity hash resides.

Let’s go read FastHashCode. The first thing we find is:

intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
 if (UseBiasedLocking) {
   if (obj->mark()->has_bias_pattern()) {
          ...
     BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current());
          ...
     assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now");
   }
 }

Wait. It just revoked existing biases, and disabled biased locking on the object (the false means “don’t attempt rebias”). A few lines down, this is indeed an invariant:

637   // object should remain ineligible for biased locking
638   assert (!mark->has_bias_pattern(), "invariant") ;

If I’m reading correctly, this means that simply asking for the identity hash code of an object will disable biased locking, which in turn forces any attempt to lock the object to use expensive atomic instructions. Even if there is only one thread.

Oh boy.

Why does keeping biased locking state conflict with keeping the identity hash code?

To answer this question we must understand which are the possible locations of the mark word (that contains the identity hash) depending on the lock state of the object. The transitions are illustrated in this diagram from the HotSpot Wiki:

My (fallible) reasoning is the following.

For the 4 states at the top of the diagram, the OpenJDK will be able to use “thin” lock representations. In the simplest case (no locks) this means having the identity hash and other data directly in the object’s space for the mark word:

46 //  unused:25 hash:31 -->| unused:1   age:4    biased_lock:1 lock:2 (normal object)

in more complex cases, it needs that space to keep a pointer to the “lock record”. The mark word will thus be “displaced” and put somewhere else.

While we have only one thread trying to lock the object, that pointer will actually refer to a memory location in the thread’s own stack. Which is twice good: it’s fast (no contention or coordination to access that memory location), and it suffices for the thread to identify that it owns the lock (because the memory location points to its own stack.)

But this won’t work in all cases. If we have contended objects (e.g. objects used on synchronized statements that many threads traverse) we will need a more complex structure that fits not only a copy of the object’s header (again, “displaced”), but also a list of waiters. A similar need for a list of waiters appears if a thread executes object.wait().

This richer data structure is the ObjectMonitor, which is referred to as a the “heavyweight” monitor in the diagram. The value left in the object’s header doesn’t point to a “displaced mark word” anymore, but to an actual object (the monitor). Accessing the identity hash code will now require “inflating the monitor”: chasing a pointer to an object and reading/mutating whichever field contains the displaced mark word. Which is more expensive and requires coordination.

FastHashCode does have work to do.

Lines L640 to L680 deal with finding the header and checking for a cached identity hash. I believe these are a fast path that probe for cases that don’t need to inflate the monitor.

From L682 it needs to bite the bullet:

 // Inflate the monitor to set hash code
 monitor = ObjectSynchronizer::inflate(Self, obj);

 // Load displaced header and check it has hash code
 mark = monitor->header();
...
 hash = mark->hash();

At this point, if the id. hash is there (hash != 0), the JVM can return. Otherwise we’ll get one from get_next_hash and safely store it in the displaced header kept by the ObjectMonitor.

This seems to offer a reasonable explanation to why calling hashCode() on an object of a class that doesn’t override the default implementation makes the object ineligible for biased locking:

In order to keep the identity hash of an object consistent after relocation we need to store the hash in the object’s header.
Threads asking for the identity hash may not even care about locking the object, but in practise they will be sharing data structures used by the locking mechanism. This is a complex beast in itself that might be not only mutating, but also moving (displacing) the header contents.
Biased locking helped perform lock/unlock operations without atomic operations, and this was effective as long as only one thread locked the object because we could keep the lock state in the mark word. I’m not 100% sure here, but I understand that since other threads may ask for the identity hash, even if there is a single thread interested in the lock, the header word will be contended and require atomic operations to be handled correctly. Which defeats the whole point of biased locking.

Recap

The default hashCode() implementation (identity hash code) has nothing to do with the object’s memory address, at least in OpenJDK. In versions 6 and 7 it is a randomly generated number. In 8 and, for now, 9, it is a number based on the thread state. Here is a test that yields the same conclusion.
- Proving that “implementation-dependent” warns are not aesthetic: Azul’s Zing does generate the identity hash from the object’s memory address.
In HotSpot, the result of the identity hash generation is generated once, and cached in the mark word of the object’s header.
- Zing uses a different solution to keep it consistent despite object relocations, in which they delay storing the id. hash until the object relocates. At that point, it’s stored in a “pre-header”
In HotSpot, calling the default hashCode(), or System.identityHashCode() will make the object ineligible for biased locking.
- This implies that if you are synchronizing on objects that have no contention, you’d better override the default hashCode() implementation or you’ll miss out on JVM optimizations.
It is possible to disable biased locking in HotSpot, on a per-object basis.
- This can be very useful. I’ve seen applications very heavy on contended producer/consumer queues where biased locking was causing more trouble than benefit, so we disabled the feature completely. Turns out, we could’ve done this only on specific objects/classes simply by calling System.identityHashCode() on them.
~~I have found no HotSpot flag that allows changing the default generator, so experimenting with other options might need to compile from source~~.
- Admittedly, I didn’t look much. Michael Rasmussen kindly pointed out that -XX:hashCode=2 can be used to change the default. Thanks!

Benchmarks

I wrote a simple JMH harness to verify those conclusions.

The benchmark (source) does something equivalent to this:

object.hashCode();
while(true) {
    synchronized(object) {
        counter++;
    }
}

One configuration (withIdHash) synchronizes on an object that uses the identity hash, so we expect that biased locking will be disabled as soon as hashCode() is invoked. A second configuration (withoutIdHash) implements a custom hash code so biased locking should not be disabled. Each configuration is ran first with one thread, then with two threads (these have the suffix “Contended”.)

By the way, we must enable -XX:BiasedLockingStartupDelay=0 as otherwise the JVM will take 4s to trigger the optimisation distorting the results.

The first execution:

Benchmark                                       Mode  Cnt       Score      Error   Units
BiasedLockingBenchmark.withIdHash              thrpt  100   35168,021 ±  230,252  ops/ms
BiasedLockingBenchmark.withoutIdHash           thrpt  100  173742,468 ± 4364,491  ops/ms
BiasedLockingBenchmark.withIdHashContended     thrpt  100   22478,109 ± 1650,649  ops/ms
BiasedLockingBenchmark.withoutIdHashContended  thrpt  100   20061,973 ±  786,021  ops/ms

We can see that the using a custom hash code makes the lock/unlock loop work 4x faster than the one using the identity hash code (which disables biased locking.) When two threads contend for the lock, biased locking is disabled anyway so there is no significative difference between both hash methods.

A second run disables biased locking (-XX:-UseBiasedLocking) in all configurations.

Benchmark                                       Mode  Cnt      Score      Error   Units
BiasedLockingBenchmark.withIdHash              thrpt  100  37374,774 ±  204,795  ops/ms
BiasedLockingBenchmark.withoutIdHash           thrpt  100  36961,826 ±  214,083  ops/ms
BiasedLockingBenchmark.withIdHashContended     thrpt  100  18349,906 ± 1246,372  ops/ms
BiasedLockingBenchmark.withoutIdHashContended  thrpt  100  18262,290 ± 1371,588  ops/ms

The hash method no longer has any impact and withoutIdHash loses its advantage.

(All benchmarks were ran on a 2,7 GHz Intel Core i5.)

References

Whatever is not wild speculation and my weak reasoning trying to make sense of the JVM sources, comes from stitching together various sources about layout, biased locking, etc. The main ones are below: