Identifiers are better off without meaning
Once at Last.fm we had an integer overflow in an identify field. I can’t recall where exactly. But I do remember that the inconvenience of having a bunch of Hadoop jobs disrupted while we rushed to update the relevant type couldn’t spoil the collective pride for having more than 2 billion of whatever needed so many ids.
Being frugal with identifiers is seldom a good idea, but for me the worst identifier-related headaches came from IDs that had semantic value.
At Tuenti (once the largest social network in Spain) there was a concept similar to Facebook pages. Pages had types and subtypes. A page type might have been “group”, which had subtypes “business” or “community”. Another type could be “place” with subtypes like “store” or “landmark”.
Page identifiers were strings composed by concatenating numeric
identifiers of the type, subtype, and then an increment field in a DB.
If you visited https://tuenti.com/p/3_2_6691
you’d instantly know
the meaning. It was a “place” (3) of type “store” (2), and the store ID
was 6691. Page IDs decomposed in this way would be useful for multiple
purposes. Chosing a type-specific implementation of controllers to
compose the relevant page, routing to a database shard, that kind of
thing.
At some point Product wanted to change the classification of pages. To their frustration, this became problematic because the entire taxonomy was encrusted across the code all the way from URLs to databases.
Another example are New Relic entities. Entities are an abstraction that broadly represents anything that can send telemetry to New Relic. A host, a Kubernetes cluster, an application, a JVM or a network router can be entities. Of course those are all things you want to identify, so entities have Global Unique IDentifier, or GUID. Every single telemetry datapoint is stamped with the GUID of the entity that produced it, so they act as the keystone of the New Relic platform. Features like Service maps, distributed tracing, entity relationships, and many others are built upon them.
Entity GUIDs have
meaning.
An example of a GUID is 1|APM|APPLICATION|23
where 1 is the account,
APM
is the domain, APPLICATION
is the unique type within that
domain, and 23
is a unique identifier within the domain and type. That
application might be running in a host 1|INFRA|HOST|12
. If we wanted
to store that relation, we’d have an entry in some database saying:
"1|INFRA|HOST|12" RUNS "1|APM|APPLICATION|23"
.
Like at Tuenti, these semantics come handy. If you’re processing millions of telemetry datapoints per second, it’s useful to tell the type of reporting entity on the fly by decomposing the GUID, rather than perform an expensive lookup to an external service. Account IDs can be used to route data to cells (this talk from Andrew Bloomgarden explains how NR used this pattern to scale).
Domains like INFRA
or APM
corresponded to the original product
verticals at New Relic. Years later Product decided (with good criteria)
that they created unnecessary fragmentation on the user experience.
Types change, get renamed or merged with others. Sometimes (more often
than it seems) entities had to migrate from one account to another. In
all these cases you would be altering the identifier of many entities.
It was painful but possible to work around many of these problems. But replacing GUIDs with semantic-free identifiers was straight impossible. By virtue of being present in thousands of URLs, NRQL queries, etc. GUIDs had become a public API that thousands of customers relied upon. A technical solution to replace identifiers would have been a major project, but doable. What wasn’t possible was to run a find/replace across the private documentation and workflows of your entire customer base.
If you look closely, the world is full of semantic identifiers. Sometimes they are hard to avoid, but almost always a pain in the neck. Because they embed a specific model of the world. But models become obsolete faster than we’d like.
Addresses make notable examples. The “complex and idiosyncratic” Japanese address system reflects the organic growth of its urban areas. In British postal codes the final part can designate anything from a street to a flat depending on the amount of mail received by the premises.
When I was a kid, license plates would give up the province where the owner lived causing an array of nuisances. They were alleviated with the adoption of European standards. But partly. The root problem being that, like the Domain Name System identifiers (“Galo’s website”) remain tied to administrative authorities (“.net”), which can change regulations, or even disappear.
Nowadays I find most semantic identifiers in resource management. For some reason, when infrastructure teams define access rules to the resources of a particular service, they prefer to create a group named after the team that owns that service, something like “owner-team-access-list”. Identifiers tied to the org chart don’t like it when the service moves to another team, or owner-team is reorged away.
–
Update: a commenter in Reddit pointed at another great example of problems of identifiers with semantics: the German tank problem. Do send more!
Update: Discussion in HackerNews
Any thoughts? send me an email!
To get notifications for new posts, subscribe to my mailing list, the RSS feed or follow me on Twitter / Mastodon.
Archive
- Identifiers are better off without meaning
- Alert on symptoms, not causes
- How about we forget the concept of test types?
- How organisations cripple engineering teams with good intentions
- Migrating an Eureka-based microservice fleet to Kubernetes
- Talk write-up: "How to build a PaaS for 1500 engineers"
- Kubernetes made my latency 10x higher
- Sizing Kubernetes pods for JVM apps without fearing the OOM Killer
- GC forensics by example: multi-second pauses and allocation pressure
- How does the default hashCode() work?
- Frugal memory management on the JVM (Meetup)
- DirectBuffer creation / disposal has hidden contention on sun.misc.Cleaner