Python Resolution Evolution: Decoupling Metadata from Downloads in Pip
10-26, 14:50–15:15 (Europe/Berlin), Main stage

Over the past few years, the Python community has largely unified around pip's new backtracking resolver, but widely used ML frameworks that bundle large amounts of binary code have historically pushed the boundaries of pip's performance envelope and continue to require further innovation. In 2019, I began investigating how to reduce the size and bandwidth requirements of ML models deployed by the Twitter Cortex ML team. That investigation produced initial drafts of the work that would later be upstreamed into pip as the install --report and --use-feature=fast-deps features. In this talk, I walk through the motivating use cases from Twitter, how these ideas were collectively translated over time into coherent standards, and how to take advantage of these improvements when building Python applications.


Until 2020, Python applications were built with pip's "greedy", non-backtracking resolver, which led to significant angst and workarounds: larger organizations pinned every dependency, while open source projects were historically forced to under-constrain their declared dependencies. There was a lot of discussion at the time about whether to adopt a SAT solver or other techniques requiring native dependencies, but it ended up being much easier for a bootstrapping tool like pip to use a pure-Python solution.
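
To make the difference concrete, here is a minimal sketch of greedy versus backtracking resolution over a toy index. Everything in it (the INDEX contents, the package names, the resolve function) is hypothetical and invented for illustration; pip's actual resolver is considerably more sophisticated. As a further simplification, the greedy mode here reports failure on a conflict rather than continuing with an inconsistent set.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy index: package -> {version: {dependency: allowed versions}}.
INDEX: Dict[str, Dict[int, Dict[str, Set[int]]]] = {
    "app": {1: {"lib": {1, 2}, "plugin": {1}}},
    "lib": {1: {}, 2: {}},
    "plugin": {1: {"lib": {1}}},
}

Requirement = Tuple[str, Set[int]]


def resolve(
    requirements: List[Requirement],
    pinned: Dict[str, int],
    backtrack: bool,
) -> Optional[Dict[str, int]]:
    """Return a consistent {package: version} mapping, or None on failure."""
    if not requirements:
        return pinned
    (name, allowed), rest = requirements[0], requirements[1:]
    if name in pinned:
        # Already chosen: the earlier choice must satisfy this requirement too.
        if pinned[name] in allowed:
            return resolve(rest, pinned, backtrack)
        return None
    # Try the newest allowed version first.
    candidates = sorted((v for v in INDEX[name] if v in allowed), reverse=True)
    for version in candidates:
        dependencies = list(INDEX[name][version].items())
        result = resolve(rest + dependencies, {**pinned, name: version}, backtrack)
        if result is not None:
            return result
        if not backtrack:
            # Greedy mode: the first choice is final and never revisited.
            return None
    return None


greedy_result = resolve([("app", {1})], {}, backtrack=False)
backtracking_result = resolve([("app", {1})], {}, backtrack=True)
print("greedy:", greedy_result)
print("backtracking:", backtracking_result)
```

In this toy index the greedy mode commits to the newest lib before it learns that plugin only accepts the older one, while the backtracking mode revisits that choice and finds a consistent set.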

At the same time, other rumblings have been heard across the Python community. Spack, another package manager written in Python, ended up rewriting its concretizer to use an ASP logic solver written in C++. As machine learning continues to become hotter and hotter, Python itself has finally become able to reconsider the Global Interpreter Lock to allow for greater control by native code. When I first began looking at this problem, I had been contributing to the pants build tool and helping to define our Python-level API to orchestrate build steps executed in parallel with Rust. In all of these cases, there is a desire to retain the programmability and hackability of Python without experiencing growing pains as it gets applied to more and more tasks.

However, I was surprised to find that improvements to pip's resolve performance, and to the time needed to create deployable applications from that resolution, came largely from managing I/O more effectively, and not (as is often alleged) from overcoming any inherent slowness of Python as an implementation language. Later in the talk, I demonstrate a case involving the production of single-file zipapp (pex) executables, as used by the Twitter Cortex team. Although I have created a Rust project and Python extension that slightly improve the performance of this task, the orders-of-magnitude speedup came simply from avoiding redundant work: not re-compressing very large ML frameworks every time. As a result, the pex project achieves this speedup without taking on any native dependencies.

Most significantly, the pip project has recently put immense effort into enabling "virtualized" metadata-only requirements, which allow the resolver to make progress without downloading massive binary wheels. After I posted an initial prototype that used zip file hackery with HTTP range requests to work around Python not yet having a metadata standard, github.com/McSinyx made that code production-ready as --use-feature=fast-deps in a Google Summer of Code project. Later, to address the case of sdists, which cannot be manipulated in this way, github.com/uranusjr proposed PEP 658 and saw it accepted; it now provides metadata for wheels on PyPI. Meanwhile, github.com/sbidoul took over and shipped pip install --report, which makes pip's resolution algorithm available to downstream consumers that handle download and install separately; this was the key innovation underlying the feature I shipped for the Cortex team. With recent work, pip install --dry-run --report will be able to avoid downloading any artifacts at all, so that users like Twitter Cortex, who necessarily build very large applications, can avoid performing large amounts of network I/O just to figure out what they need to include in their binary.
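
The idea behind the zip file trick can be sketched without any network access. A zip archive keeps its table of contents at the end of the file, so a reader that can seek only needs the tail of the archive plus the bytes of the one member it wants. The sketch below is not pip's implementation: it uses an in-memory reader that counts bytes, where each read stands in for one HTTP range request, and the wheel contents (the bigml name, the 2 MB payload) are made up for illustration.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote file; each read() models a range request."""

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_fake_wheel() -> bytes:
    """Build a hypothetical wheel-shaped zip with a large binary payload."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("bigml/_native.so", b"\0" * 2_000_000)
        archive.writestr(
            "bigml-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: bigml\nVersion: 1.0\nRequires-Dist: numpy\n",
        )
    return buffer.getvalue()


def read_metadata(fileobj) -> str:
    """Read only the METADATA member from a seekable wheel-like file object."""
    with zipfile.ZipFile(fileobj) as archive:
        name = next(
            n for n in archive.namelist() if n.endswith(".dist-info/METADATA")
        )
        return archive.read(name).decode("utf-8")


wheel_bytes = build_fake_wheel()
remote = CountingReader(wheel_bytes)
metadata = read_metadata(remote)
print(metadata)
print(f"read {remote.bytes_read} of {len(wheel_bytes)} bytes")
```

The resolver only needs the Requires-Dist lines to keep going, so the large binary payload is never touched.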

This talk will discuss:
- The workloads from the Twitter Cortex team and representative speedups from that work (see https://github.com/pantsbuild/pants/pull/8793).
- How pip maintainers responded to my initial proposal (https://github.com/pypa/pip/issues/7819), and how other stakeholders weighed in on its general utility, expressed faith in the idea, and invested effort to move it forward (see saga at https://github.com/pypa/pip/issues/53).
- How tools like pex have integrated the pip resolver to form extremely slick interfaces you can build other tools on top of (especially showcasing https://github.com/pantsbuild/pex/pull/2175).
- What remains to be done to make pip the fastest resolver in the west (especially work at https://github.com/pypa/pip/issues/12184).
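
As a rough illustration of building on pip's resolver output, the sketch below reads an installation report of the kind produced by pip install --report and turns it into a list of pinned requirements with their download URLs. The sample report is hand-written and abridged, the package names and URLs are placeholders, and pinned_downloads is a hypothetical helper, not part of pip or pex.

```python
import json
from typing import Dict, List

# A report like this can be written to stdout with, for example:
#   pip install --dry-run --ignore-installed --quiet --report - <requirements>
# The sample below is hand-written and abridged; all values are placeholders.
SAMPLE_REPORT = """
{
  "version": "1",
  "install": [
    {
      "download_info": {
        "url": "https://example.invalid/bigml-1.0-py3-none-any.whl",
        "archive_info": {"hashes": {"sha256": "0000"}}
      },
      "is_direct": false,
      "requested": true,
      "metadata": {"name": "bigml", "version": "1.0", "requires_dist": ["numpy"]}
    },
    {
      "download_info": {
        "url": "https://example.invalid/numpy-1.26.0-py3-none-any.whl",
        "archive_info": {"hashes": {"sha256": "1111"}}
      },
      "is_direct": false,
      "requested": false,
      "metadata": {"name": "numpy", "version": "1.26.0"}
    }
  ]
}
"""


def pinned_downloads(report: Dict) -> List[Dict[str, str]]:
    """Extract 'name==version' pins and download URLs from a parsed report."""
    if report.get("version") != "1":
        raise ValueError("unsupported report format version")
    downloads = []
    for item in report["install"]:
        metadata = item["metadata"]
        downloads.append(
            {
                "pin": f'{metadata["name"]}=={metadata["version"]}',
                "url": item["download_info"]["url"],
            }
        )
    return downloads


downloads = pinned_downloads(json.loads(SAMPLE_REPORT))
for entry in downloads:
    print(entry["pin"], entry["url"])
```

A downstream tool can then fetch, cache, or package each URL on its own schedule, which is the separation of resolution from download that the report is meant to enable.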

Typing free software to break the shoulders of giants from golden handcuffs. Working on extending the Signal protocol to replace gpg.

Have previously worked on:
- spack at LLNL (https://llnl.gov)
- pants at Twitter

Can be found at:
- @hipsterelectron on Twitter,
- @[email protected] on Mastodon,
- @cosmicexplorer on GitHub.