
“ML Container Toolkit”: a container system for streamlining AI model deployment on Bede/HPC platforms

Ben Thorpe University of York


Contents

1. The Problem
2. The Solution
3. Why it helps AI4Science
4. What value did the dRTP bring?
5. The dRTP Experience
6. Useful links


1. The Problem

Scientific researchers who use AI/ML software as part of their research methodology often need to move that software to a different machine. This may be to go from a small-scale development environment to a large-scale “train and test” environment on an HPC cluster, because old hardware has been replaced with new, or because they want to release their software as open source for other researchers to run on their own machines. Regardless of the reason, a key requirement is that, given identical inputs, the software should run and produce identical results in the new environment.

Moving ML and AI software pipelines between machines is, however, far from straightforward and can be a pain point for both researchers and HPC teams. The entire software environment on one machine (including code, libraries and tools) typically needs to be replicated, and this can be time-consuming, frustrating and error-prone, even for an expert, due to differences in hardware, operating systems, software versions and complex dependency chains.

Further issues compound the problem: researchers often lack the root privileges, time and dRTP ‘know-how’ needed to manually recreate an entire software environment. HPC admin and support teams cannot practically support every library that may be required, and while dRTPs may be able to assist with deployment, their skills are in high demand and they may not be available.


2. The Solution

An increasingly popular solution to this problem is to use a containerisation toolkit such as Apptainer.

Containerisation provides a secure and efficient solution for moving code between environments.

Containers work by bundling a software application, with all its necessary libraries, data and configurations, into a single portable file. For Apptainer this is a Singularity Image Format (SIF) file, which is read-only: you can think of it as a ‘snapshot’ of your entire software environment, frozen in time. By isolating your environment from the underlying system, containers eliminate installation headaches, protect code from future system updates and ensure that work is reproducible: given the same inputs, it will run in exactly the same way on your machine, on a supercomputer, or on any other machine, at any future time.
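As a rough sketch of what “building a SIF” involves (the file names here are illustrative, not taken from the toolkit), creating an image from an Apptainer definition file comes down to a single `apptainer build` invocation; the helper below just assembles that command line without executing it:

```python
# Sketch: assemble the `apptainer build` command that turns a
# definition file (a build recipe) into a read-only SIF image.
# File names are illustrative, not from the ML Container Toolkit.

def build_command(def_file: str, sif_file: str, fakeroot: bool = False) -> list[str]:
    """Return the argv list for `apptainer build` (not executed here)."""
    cmd = ["apptainer", "build"]
    if fakeroot:
        cmd.append("--fakeroot")  # unprivileged build on hosts that allow it
    cmd += [sif_file, def_file]   # apptainer build [options] <target.sif> <recipe.def>
    return cmd

print(build_command("model.def", "model.sif"))
```

On a machine with Apptainer installed, the returned list could be passed to `subprocess.run()` to perform the actual build.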

“ML Container Toolkit”: an easy-to-use toolkit for setting up containers with Apptainer

To make Apptainer quicker and easier to use for researchers and dRTPs who want to set up containers for ML models, Ben Thorpe, the dRTP lead for this intervention, developed a wrapper for Apptainer that provides a simplified interface to the sophisticated functionality Apptainer offers. Called the “ML Container Toolkit”, this wrapper means a user can set up a container for their ML model, guided by prompts and a few simple commands, without any significant scripting or the need to consult lengthy documentation.

Why use Apptainer? Apptainer is HPC-specific, integrates with SLURM, does not require root access, supports encryption (enhancing security), and in this case was conveniently pre-installed on Bede.
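To illustrate the SLURM integration (the partition name, GPU request syntax and image path below are generic placeholders, not Bede-specific values), a batch job that runs a containerised workload on a GPU node typically wraps `apptainer exec --nv`:

```python
# Sketch: generate a SLURM batch script that runs a containerised
# command with GPU access. Partition, GPU request and paths are
# placeholders; real values depend on the cluster's configuration.

def slurm_script(sif_file: str, command: str, partition: str = "gpu") -> str:
    """Return the text of a minimal SLURM job script for a container."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        "#SBATCH --gres=gpu:1",    # request one GPU (syntax varies by site)
        "#SBATCH --time=01:00:00",
        # --nv exposes the host's NVIDIA driver and libraries in the container
        f"apptainer exec --nv {sif_file} {command}",
        "",
    ])

print(slurm_script("model.sif", "python train.py"))
```

The resulting script would be submitted with `sbatch` in the usual way.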

ML Container Toolkit - how things work:

  • Consists of a Python script that acts as a wrapper for Apptainer.
  • Provides a set of simple commands for users to interface with Apptainer (e.g., load model_name).
  • Built primarily for use on the Grace Hopper nodes of the N8 HPC cluster Bede; however, it can also be installed on a local machine or used to set up containers on other HPC clusters.
  • Simplifies complex GPU setup.
  • Allows users to seamlessly use local definition files.
  • Saves time: seconds to run, versus hours or more spent writing various configuration scripts.
  • Avoids the need to engage with lengthy documentation.
  • Removes redundancy in scientific workflows.
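The “simple commands” idea can be sketched as a thin dispatcher that maps a one-line user command such as `load model_name` onto the underlying Apptainer invocation. The command names, the `--nv` GPU flag placement and the container paths below are hypothetical illustrations, not the toolkit’s actual interface:

```python
# Minimal sketch of a wrapper that turns a simple user command
# (e.g. "load model_name") into an Apptainer command line.
# Command names and paths are hypothetical, not the toolkit's real API.

CONTAINER_DIR = "containers"  # assumed location of built SIF images

def dispatch(user_input: str) -> list[str]:
    """Translate a one-line user command into an apptainer argv list."""
    verb, *args = user_input.split()
    if verb == "load" and len(args) == 1:
        # --nv passes the host's NVIDIA GPU stack through to the container
        return ["apptainer", "run", "--nv", f"{CONTAINER_DIR}/{args[0]}.sif"]
    if verb == "shell" and len(args) == 1:
        return ["apptainer", "shell", "--nv", f"{CONTAINER_DIR}/{args[0]}.sif"]
    raise ValueError(f"Unknown command: {user_input!r}")

print(dispatch("load resnet50"))
```

The point of the design is that the user only ever sees the short command on the left; the GPU flags and file paths on the right are filled in for them.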

The ML Container Toolkit is available for use on Bede and can be used on other HPC systems. You can find documentation on how to use it here:

https://bede-container-docs.readthedocs.io/en/latest/

Please also visit the GitHub project page, where you can find additional documentation on the container toolkit: https://github.com/bjthorpe/Bede_containers


3. Why it helps AI4Science

When applying AI in their scientific research, scientists often need to compare and test multiple models to find the best performance for their task. By greatly simplifying the various steps involved in installing and running AI models on HPC, the ML Container Toolkit will hopefully encourage scientists to spend more time exploring and testing the space of relevant models, at scale (and less time on tedious dependency management and getting things to run).

Furthermore, the ML Container Toolkit will hopefully encourage researchers to use containers in their workflows, promoting greater software mobility and ensuring that the software they run is reproducible, producing results in exactly the same way regardless of the machine and operating system it is run on (and so assisting good practice).

You can learn more about a use case for the ML Container Toolkit in one of our other intervention case studies:

Streamlining Materials Modeling: A Unified Entry Point for Matbench Discovery Models in CASTEP

Multiple containers were created with the ML Container Toolkit as part of work to streamline access to the top-ranked models of the materials leaderboard for comparison and testing.


4. What value did the dRTP bring?

Digital Research Technical Professionals (dRTPs) act as the vital bridge between complex infrastructure and scientific application. This case study demonstrates how dRTP Ben Thorpe, a Research Software Engineer at the University of York, could apply his skillset and specialist knowledge to analyse the problem and create the scripts needed to provide a container-based solution using Apptainer, specifically tuned for the Bede HPC environment. Ben has more than five years’ prior experience working in HPC on several different systems, including experience on Bede with both the Power9 and Grace Hopper architectures.

Because Apptainer runs entirely in “user space” on an HPC system, you do not need administrator privileges to run it. However, building a container for your model does require some additional, non-trivial scripting (e.g. for various configuration files). Working with Apptainer also involves following some fairly technical documentation, which may prove intimidating or off-putting to science researchers with limited time and/or dRTP skills.


5. The dRTP Experience

Ben used various skills in this intervention, including scripting and interfacing with other system processes using Python, as well as Linux system administration and containerisation. Such skills are largely specific to dRTPs, and Ben suggested that even a cursory understanding of both these areas would be useful to many researchers interested in deploying models on HPC.

Ben also acquired new skills when he developed the toolkit so that it can effectively manage errors, providing simple error messages to users. The main new skill was error handling between different system processes, as Ben explained: “Python has robust tools and procedures for handling its own errors. However, when it comes to errors thrown by other programs (potentially written in other languages), even doing simple things like providing useful error messages can be non-trivial. Thus I have had to learn how to handle such errors in a robust way, such that if/when something does go wrong, users are not sent on a ‘wild goose chase’ looking through various cryptic secondary errors that can often hide the root cause of the problem.”
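The cross-process error-handling problem Ben describes can be sketched as follows: run the external tool, capture its stderr, and surface one clear message instead of a cascade of secondary errors. The message wording here is illustrative, not the toolkit’s actual output:

```python
# Sketch: run an external program (e.g. apptainer) and translate
# failures into a single, readable error message. Message wording
# is illustrative, not the ML Container Toolkit's actual output.
import subprocess

def run_tool(argv: list[str]) -> str:
    """Run an external program; raise one clear RuntimeError on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        # The program itself is missing -- say so plainly.
        raise RuntimeError(f"'{argv[0]}' is not installed or not on PATH") from None
    if result.returncode != 0:
        # Keep only the last stderr line: it is often the most relevant,
        # rather than a wall of cryptic secondary errors.
        stderr = result.stderr.strip()
        tail = stderr.splitlines()[-1] if stderr else "no error output"
        raise RuntimeError(f"'{' '.join(argv)}' failed: {tail}")
    return result.stdout

if __name__ == "__main__":
    import sys
    print(run_tool([sys.executable, "-c", "print('hello from a child process')"]))
```

The same pattern extends naturally: the wrapper can match known stderr patterns and map them to suggestions (“check your definition file”, “module not loaded”, etc.) before falling back to the raw tail.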

There were benefits from working on a group project with RSEs based at other universities. Ben found it useful to get other RSEs’ perspectives on the project and to see and try alternative workflows, gaining insight into what works for other people in the team and whether there was anything he could take away. The Bede team was very helpful and supportive throughout, and access to HPC was generally good.

We asked Ben if he had any advice for researchers who want to use HPC and AI to accelerate their research:

“The best advice I can give is to try not to chase the latest and greatest. It can be overwhelming as there is seemingly an endless number of complex options for software and tools to learn and wade through. Especially in the AI space, where things move so quickly you can feel like you are constantly on shifting sands.

Instead, pick a handful of software tools, get stuck in, and try to find something that works for you. It does not need to be perfect. It’s far better to spend a few days or weeks getting something that mostly works than to spend months or years constantly switching and chasing the perfect tools while not doing any meaningful research. Also, once you have something that works, you are in a far better position to know which software features would be useful versus merely ‘nice to have’, or even unnecessary.”

Thank you Ben!


6. Useful links

Apptainer documentation: https://apptainer.org/docs/

ML Container Toolkit documentation: https://bede-container-docs.readthedocs.io/en/latest/

ML Container Toolkit on GitHub: https://github.com/bjthorpe/Bede_containers

