Learning to Predict and Improve Build Successes in Package Ecosystems
10-27, 14:50–15:15 (Europe/Berlin), Main stage

Software complexity has been increasing: a typical application now relies on tens or even hundreds of packages. Finding compatible versions and configuring builds for these packages poses a significant challenge. This talk introduces a method that leverages graph neural networks (GNNs) and advanced package management techniques to address the challenges of managing software ecosystems. We analyze a prominent HPC software ecosystem, the Exascale Computing Project (ECP) software stack E4S. Using E4S as an example, and leveraging Spack's parameterized package recipes, we demonstrate that GNNs can be trained to understand the build incompatibilities in a large software ecosystem and to identify configurations that will not work, without the need to actually build them.
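To make the idea concrete, here is a minimal, self-contained sketch of how a build configuration can be encoded as a dependency graph and scored by a GNN-style forward pass. The package names, feature vectors, and the single averaging message-passing round are all invented for illustration; the actual BuildCheck model described in the talk is a learned, much richer network.

```python
import math

# Toy encoding: each node is a (package, version) choice with a small
# feature vector; edges point from a package to its dependencies.
# All values below are illustrative stand-ins, not real Spack data.
features = {
    "hdf5@1.12": [1.0, 0.0],
    "zlib@1.2":  [0.0, 1.0],
    "mpi@3.1":   [0.5, 0.5],
}
edges = [("hdf5@1.12", "zlib@1.2"), ("hdf5@1.12", "mpi@3.1")]

def message_pass(features, edges):
    """One round: each node averages its own and its dependencies' features."""
    out = {}
    for node, feat in features.items():
        neigh = [features[dst] for (src, dst) in edges if src == node]
        stacked = [feat] + neigh
        out[node] = [sum(col) / len(stacked) for col in zip(*stacked)]
    return out

def readout(features):
    """Mean-pool node embeddings and squash to a build-success score in (0, 1)."""
    pooled = [sum(col) / len(features) for col in zip(*features.values())]
    return 1.0 / (1.0 + math.exp(-sum(pooled)))  # toy logistic readout

h = message_pass(features, edges)
print(round(readout(h), 3))  # a single "will this configuration build?" score
```

In a trained model, the averaging step would be replaced by learned weight matrices and several rounds of message passing, and the readout would be trained against observed build outcomes (success/failure labels).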


Modern software has reached an unprecedented level of complexity: a single application may depend on tens or even hundreds of packages. To tackle this complexity, software ecosystems rely on automated package managers to analyze compatibility constraints among packages and select a compatible set of package versions to install. Current approaches rely on experts with in-depth knowledge of packages and their constraints to identify compatible versions, and in practice users often have to explore many combinations of package versions before finding one that builds successfully. In this talk, we present a tool, called BuildCheck, to understand build incompatibilities, predict bad configurations, and assist developers in managing version constraints. BuildCheck combines graph neural networks with advanced package management technology to manage package dependencies. Evaluated on the E4S software ecosystem with 45,837 data points, it predicts build outcomes with 91% accuracy, eliminating very expensive trial-and-error exercises to find working builds. Furthermore, our novel self-supervised pre-training method using masked modeling improves prediction accuracy when only a limited amount of data is available.
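The masked-modeling pre-training mentioned above can be sketched in miniature: hide one node's features and reconstruct them from its neighbors, using the reconstruction error as the self-supervised training signal. The graph, features, and mean-of-neighbors "predictor" below are hypothetical stand-ins; in the real method the predictor is the GNN encoder being pre-trained.

```python
# Toy graph: illustrative (package, version) nodes with invented features.
features = {
    "hdf5@1.12": [1.0, 0.0],
    "zlib@1.2":  [0.0, 1.0],
    "mpi@3.1":   [0.5, 0.5],
}
# Undirected adjacency used as reconstruction context.
neighbors = {
    "hdf5@1.12": ["zlib@1.2", "mpi@3.1"],
    "zlib@1.2":  ["hdf5@1.12"],
    "mpi@3.1":   ["hdf5@1.12"],
}

def masked_reconstruction_loss(features, neighbors, masked_node):
    """Mask one node, predict its features as the mean of its neighbors,
    and return the squared error -- the self-supervised pre-training loss."""
    ctx = [features[n] for n in neighbors[masked_node]]
    pred = [sum(col) / len(ctx) for col in zip(*ctx)]
    true = features[masked_node]
    return sum((p - t) ** 2 for p, t in zip(pred, true))

loss = masked_reconstruction_loss(features, neighbors, "hdf5@1.12")
print(round(loss, 3))
```

Because this loss needs no build-outcome labels, it can be minimized on unlabeled configurations first; the pre-trained encoder is then fine-tuned on the (scarcer) labeled build results, which is where the accuracy gain in the low-data regime comes from.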

I am a Research Scientist in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory, which I joined as a postdoctoral research staff member in 2016. My research focuses on approximate computing, floating-point mixed precision, machine learning, and fault tolerance of HPC applications. I also have expertise in load-balancing algorithms, cosmology simulation applications, and HPC runtime systems.

I received my Ph.D. (2016) and M.S. (2012) in Computer Science from the University of Illinois Urbana-Champaign (UIUC). Prior to enrolling in graduate studies, I was a software engineer at Google.