Shared Objects and Content Addressing: a Survey of Techniques
10-26, 17:10–17:35 (Europe/Berlin), Main stage

When shipping software and systems at scale, it's desirable to bundle libraries into files that can be shared between applications, in order to reduce the system size. However, this comes with a tradeoff: in order to share things, we have to organize. Concretely, any applications that want to share libraries and data objects have to agree on file naming conventions.

This simple need leads to a much bigger problem: organizational agreement doesn't scale. Individual package managers solve this problem in distinct ways. Even the question of which libraries and data objects can be shared is a point of contention: some systems are based on ideas like "major version number is enough" (which, spoiler, invariably creates problems). The end result? Packages from different ecosystems can't share dependencies; whole Linux distributions rapidly become balkanized because of library versioning and filename collision issues; and in the worst scenarios, it becomes impossible to install different versions of some software and libraries on a single system, due to name collisions.

There has to be a solution. Where else in computing have we seen (and solved) the problem of "there are many small variations of a piece of data, and we need to keep all of them, even though naming each one is an inhuman problem"? Right: version control. And what was ultimately the solution in version control? Content addressing: hash the thing, and index the storage by that hash.
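
As a rough illustration of "hash the thing, and index the storage by that", here is a minimal sketch of a content-addressed store. The function names `put` and `get`, the flat directory layout, and the choice of SHA-256 are all assumptions made for this example, not a convention taken from any existing tool.

```python
import hashlib
import os
import tempfile

def put(store_dir: str, data: bytes) -> str:
    """Store data under its own SHA-256 digest and return that digest."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):  # identical content lands on the same name
        with open(path, "wb") as f:
            f.write(data)
    return digest

def get(store_dir: str, digest: str) -> bytes:
    """Fetch data by digest; the name is derived from the bytes themselves."""
    with open(os.path.join(store_dir, digest), "rb") as f:
        return f.read()

store = tempfile.mkdtemp()
v1 = put(store, b"libfoo, build one")
v2 = put(store, b"libfoo, build two")        # a small variation gets its own name
v1_again = put(store, b"libfoo, build one")  # same bytes, same name, no new file

print(v1 == v1_again, v1 != v2, len(os.listdir(store)))  # True True 2
```

Nobody had to pick a name for either variant, and nobody could have picked a colliding one: the names fall out of the content.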

So can we use the same solution to usher in a new golden age where shared objects and shared libraries are easy, reliable, and conflict-free?

In this talk, we'll explore the problem space: What hurdles are there to sharing? What really needs to count as unique? How can we wire up existing library loading systems to meet our goals (without rewriting the universe)? We'll survey some prior art, and we'll wrap up with some questions that we hope can lead to a better future for all!



  • briefly, what is content addressing and why is it a great conflict-remover?
  • briefly, what do I mean by shared objects? stuff on the filesystem that should be readable and referenceable... by more than one other "thing" (program; linker; etc).
  • but has anyone done this before? (Sort of? Not really. And not in ways shared beyond one tool!)
  • the numbers on dot-so files obviously aren't it
  • guix and nix use big mangles... But no that's not it either (it's a hash, but it's not content addressing; also, there's some weird funkiness about absolute paths which makes a whollllle bunch of knock-on problems that I don't even have time to unpack in this talk)
  • git is really the only familiar tool that actually does CA (content addressing). And it's great! But it's not used to do shared objects in packaging and install management
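
One way to read the "it's a hash, but it's not content addressing" point in the list above: a hash of what was asked for is a different thing from a hash of what was produced. The toy below contrasts the two. The names `input_address`, `content_address`, and `build` are hypothetical, the "build" is a stand-in, and this is not how any particular tool computes its store paths.

```python
import hashlib

def input_address(recipe: str) -> str:
    """Hash of the build instructions (what was asked for)."""
    return hashlib.sha256(recipe.encode()).hexdigest()

def content_address(output: bytes) -> str:
    """Hash of the resulting bytes (what was actually produced)."""
    return hashlib.sha256(output).hexdigest()

def build(recipe: str) -> bytes:
    # Stand-in for a real build: every recipe here yields the same bytes.
    return b"identical library bytes"

recipe_a = "cc -O2 foo.c -o libfoo.so"
recipe_b = "cc -O2 foo.c -o libfoo.so  # comment changed"

# Different instructions, so the input-derived names differ...
print(input_address(recipe_a) == input_address(recipe_b))                  # False
# ...even though the outputs are byte-identical and could be shared.
print(content_address(build(recipe_a)) == content_address(build(recipe_b)))  # True
```

Only the second kind of name lets two independently produced copies of the same bytes converge on one entry.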

So what if we did?

  • there are still questions
  • there's still stuff to align on (!!!) (sharing a convention across tools precisely enough to get dedup and other good outcomes requires agreeing on enough to make hashes convergent.)
  • what things do we include in the hash? (What blobs? What filenames? What posix bits? Some things are obviously in; others less obvious.)
  • when we unpack, do we also include advisory info in the path... like package name as well as hash? (The purist answer is "no", but in a survey of every system anyone has heard of, all answered "yes": guix, nix, stow, go mod, etc.)
  • what do we do at unpack time when there's more metadata required by the reality of unpacking (e.g. all the POSIX bits) that we (intentionally) didn't track in our content convergence hash?
  • how does this shape up on multiuser systems? (what's needed to confront that, vs what does it look like when we ignore this?)
  • what do we do about any weirdo software that does require more bits than we agree on? (ssh config files are a rare example... but what about setuid? what about xattrs?)
  • how do we integrate these things with real systems? (Linux SO files; how is that challenge different from source files for $lang-pkg-mgr; etc)
  • how much can we make a convention that's language/toolchain/etc agnostic and shared?
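
To make the "what do we include in the hash?" question above concrete, here is a small sketch of why the choice matters for convergence. Everything in it is an assumption for illustration: the `FileRecord` type, the two hypothetical conventions `narrow_hash` and `wide_hash`, and the length-prefixed encoding.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRecord:
    path: str
    content: bytes
    mode: int   # POSIX permission bits, e.g. 0o755
    uid: int    # owning user on the machine that unpacked it
    mtime: int  # modification time in seconds

def _digest(rows) -> str:
    h = hashlib.sha256()
    for row in rows:
        for field in row:
            # Length-prefix each field so field boundaries are unambiguous.
            h.update(len(field).to_bytes(8, "big"))
            h.update(field)
    return h.hexdigest()

def narrow_hash(records) -> str:
    """Hash only path, content, and whether any execute bit is set."""
    return _digest(
        (r.path.encode(), b"x" if r.mode & 0o111 else b"-", r.content)
        for r in sorted(records, key=lambda r: r.path)
    )

def wide_hash(records) -> str:
    """Hash every field, including exact mode, owner, and timestamp."""
    return _digest(
        (r.path.encode(), r.content, str(r.mode).encode(),
         str(r.uid).encode(), str(r.mtime).encode())
        for r in sorted(records, key=lambda r: r.path)
    )

# The same library, unpacked by two users with different umasks and clocks.
alice = [FileRecord("lib/libfoo.so", b"fake library bytes", 0o755, 1000, 1700000000)]
bob = [FileRecord("lib/libfoo.so", b"fake library bytes", 0o775, 1001, 1700000999)]

print(narrow_hash(alice) == narrow_hash(bob))  # True: the two copies converge
print(wide_hash(alice) == wide_hash(bob))      # False: incidental metadata splits them
```

The narrow convention gets dedup but has to invent the missing metadata at unpack time; the wide one is faithful but almost never converges. That is the tension the questions above are circling.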

A modest proposal / conversation starter:

  • let's take a quick peek at the git tree hash
  • blobs, paths, symlinks, and the x bit -- how closely does that match the real practical needs?
  • call to discuss: what else would you need? and what defaults/fallbacks/polyfills are valid where $X doesn't apply?
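
For the quick peek at the git tree hash, the sketch below computes git-style tree IDs over an in-memory tree, covering exactly the four ingredients named above: blobs, paths, symlinks, and the x bit. The nested-dict input shape and the `"file"` / `"exec"` / `"symlink"` tags are assumptions made for this example; the object framing and SHA-1 follow git's classic object format.

```python
import hashlib

def git_object_id(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 over the header '<kind> <size>' + NUL, then the body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()

# Git records only these file modes: no owner, no timestamps, no xattrs.
MODES = {"file": b"100644", "exec": b"100755", "symlink": b"120000"}

def tree_id(tree: dict) -> bytes:
    """Hash a nested dict: name -> subtree dict, or (kind, data) for a leaf."""
    entries = []
    for name, node in tree.items():
        raw_name = name.encode()
        if isinstance(node, dict):            # subdirectory
            mode, oid = b"40000", tree_id(node)
            sort_key = raw_name + b"/"        # git sorts directories as "name/"
        else:
            kind, data = node                 # for symlinks, data is the target path
            mode, oid = MODES[kind], git_object_id(b"blob", data)
            sort_key = raw_name
        entries.append((sort_key, mode + b" " + raw_name + b"\0" + oid))
    body = b"".join(entry for _, entry in sorted(entries))
    return git_object_id(b"tree", body)

pkg = {
    "bin": {"tool": ("exec", b"#!/bin/sh\necho hi\n")},
    "lib": {
        "libfoo.so.1": ("file", b"fake library bytes"),
        "libfoo.so": ("symlink", b"libfoo.so.1"),
    },
}
print(tree_id(pkg).hex())
print(tree_id({}).hex())  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (git's empty tree)
```

Anything not in that entry format (setuid, ownership, xattrs, timestamps) simply cannot affect the ID, which is the starting point for the "what else would you need?" discussion.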