Deep-Learning Driven Adaptive Molecular Simulations on BEDE

Nikolai Juraschko University of Oxford


1. The Problem

DeepDriveMD integrates machine learning with atomistic molecular dynamics to overcome the challenges of simulating complex protein folding landscapes. By automatically identifying key biophysical coordinates and using them to drive exploration, it enables efficient sampling of novel conformational states that conventional simulations struggle to reach on accessible timescales. It is a powerful tool for anyone running molecular simulations, including studies of protein folding and ligand binding.

While DeepDriveMD has demonstrated its utility on large-scale HPC clusters in the United States, its deployment on UK infrastructure, specifically the BEDE cluster, was hindered by several critical factors:

Architecture and Scale Incompatibility: The original software was optimised for the US Vista HPC system (and other large US HPC systems) and required a minimum of 8 nodes. This scale is prohibitive for initial testing and general use on BEDE. Furthermore, BEDE has two CPU architectures, both different from Vista's: the older Power9 CPUs paired with Nvidia V100 GPUs, and the newer Nvidia Grace Hopper nodes with ARM64 CPUs. The original code cannot run on either architecture without changes.

Out of Date Software: In academic research, incentives prioritise publishing results over software maintenance. Consequently, the DeepDriveMD codebase suffered from unfinished code segments, outdated libraries, and an incomplete dependency list.

Documentation Gaps: A significant lack of "get started" documentation meant that researchers could spend weeks attempting to install packages into a single environment without success.

2. The Solution

To address these barriers, significant work was done to make DeepDriveMD accessible to UK researchers. This included the following:

Downscaling and Optimisation: The code was modified so that the pipeline can run on as few as 2 nodes, which makes testing easier and resource allocation more flexible. This involved converting certain parallel processes to serial where necessary to reduce compute demand. In practice this matters because on smaller Tier 2 HPC systems like BEDE, it can take an impractical length of time to get access to 8 nodes.

Environment Management: Installing a large number of packages into a single software environment often fails because of conflicts between package versions, a situation commonly called "dependency hell". Rather than fighting these conflicts in one environment, the solution splits the workflow into three distinct, manageable environments. This makes version conflicts less likely and the code much easier to install. It also lets a developer run only the part of the workflow they care about, without installing software they do not need.

Code Patches and Portability: Significant package updates and code patches were applied to ensure compatibility with the Grace Hopper and Power9 architectures.
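As an illustration of the three-environment split described above, the following minimal sketch builds the command that launches each workflow stage inside its own environment. It assumes the environments are managed with conda and launched through `conda run`; the stage names, environment names, and script names are hypothetical placeholders, not the ones used in the forks.

```python
import shlex

# Hypothetical mapping of workflow stage -> environment name.
# The real stage and environment names are defined in the forked repositories.
STAGE_ENVIRONMENTS = {
    "simulation": "ddmd-sim",
    "training": "ddmd-train",
    "inference": "ddmd-infer",
}


def build_stage_command(stage, script, *args):
    """Return the argv list that runs `script` inside the stage's own environment."""
    try:
        env_name = STAGE_ENVIRONMENTS[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
    # `conda run -n <env> <command>` executes a command inside a named environment.
    return ["conda", "run", "-n", env_name, "python", script, *args]


if __name__ == "__main__":
    # Print the command for each stage; a job script could pass each list
    # to subprocess.run(cmd, check=True) instead.
    for stage in STAGE_ENVIRONMENTS:
        cmd = build_stage_command(stage, f"run_{stage}.py", "--config", "config.yaml")
        print(shlex.join(cmd))
```

Because each stage resolves its own environment, a developer interested only in, say, the training step would need to install just that one environment.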

The ultimate goal was to ensure that a researcher could follow the provided instructions and have the system running on their own cluster in under an hour, rather than weeks. Two repositories have been created as forks of the original DeepDriveMD to document these changes: one for the Power9 architecture and one for the Grace Hopper nodes. Both are listed under Useful links.
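Since there is one fork per architecture, a quick check of the node's CPU architecture tells a user which set of instructions applies. The sketch below is an illustrative helper, not part of either repository; it relies only on the standard-library `platform.machine()` call, which on Linux reports `ppc64le` for Power9 and `aarch64` for ARM64 CPUs. The function name and labels are assumptions made for this example.

```python
import platform

# Machine strings reported on Linux for the two BEDE node types.
NODE_TYPES = {
    "ppc64le": "Power9 + V100",
    "aarch64": "Grace Hopper (ARM64)",
}


def bede_node_type(machine=None):
    """Return a label for the node type, or None if the architecture is not recognised.

    `machine` defaults to the architecture of the node running this code.
    """
    machine = (machine or platform.machine()).lower()
    return NODE_TYPES.get(machine)


if __name__ == "__main__":
    label = bede_node_type()
    if label is None:
        print(f"Unrecognised architecture: {platform.machine()}")
    else:
        print(f"Detected node type: {label}")
```

Running this on a compute node rather than a login node is advisable, since login and compute nodes on a cluster can differ in architecture.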

3. Why it helps AI4Science

DeepDriveMD serves as a full-pipeline solution for Enhanced Sampling in simulations. It utilises a four-step iterative loop:

Generate data: Spawning Molecular Dynamics simulations.
Learn representation: Using ML-based analysis to process simulation data.
Query: Identifying novel "outliers" or start points.
Spawn: Initiating new simulations based on these AI-selected variables.
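The four steps above can be sketched as a toy loop. This is a deliberately simplified stand-in and not the DeepDriveMD implementation: a 1-D random walk replaces the molecular dynamics engine, a mean and spread replace the learned representation, and all function names and parameters are invented for this example. The simulations run one after another here, in the spirit of the downscaled serial execution.

```python
import random
import statistics


def simulate(start, n_steps, rng):
    """Generate data: a 1-D random walk standing in for an MD trajectory."""
    x, frames = start, []
    for _ in range(n_steps):
        x += rng.gauss(0.0, 1.0)
        frames.append(x)
    return frames


def learn_representation(frames):
    """Learn representation: here just the mean and spread of all frames seen."""
    mean = statistics.fmean(frames)
    spread = statistics.pstdev(frames) or 1.0  # fall back to 1.0 if spread is zero
    return mean, spread


def query_outliers(frames, model, n_select):
    """Query: rank frames by distance from the bulk and keep the most unusual."""
    mean, spread = model
    ranked = sorted(frames, key=lambda x: abs(x - mean) / spread, reverse=True)
    return ranked[:n_select]


def adaptive_sampling(n_iterations=5, n_walkers=4, n_steps=50, seed=0):
    """Run the generate / learn / query / spawn loop and return every frame sampled."""
    rng = random.Random(seed)
    starts = [0.0] * n_walkers
    history = []
    for _ in range(n_iterations):
        for start in starts:  # serial execution of each simulation
            history.extend(simulate(start, n_steps, rng))
        model = learn_representation(history)
        # Spawn: the selected outliers become the next round's starting points.
        starts = query_outliers(history, model, n_walkers)
    return history


if __name__ == "__main__":
    frames = adaptive_sampling()
    print(f"sampled {len(frames)} frames, range {min(frames):.1f} to {max(frames):.1f}")
```

Restarting from outliers pushes the walkers towards regions they have rarely visited, which is the intuition behind letting the learned model steer the sampling.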

By allowing AI to steer the sampling, researchers can significantly reduce bias and explore massive sampling spaces across long time scales that would be impossible with traditional molecular dynamics alone.

Crucially, DeepDriveMD runs this entire end-to-end workflow, which makes it a very powerful research tool. Doing so efficiently requires significant compute, so the software must be run on HPC rather than locally on a laptop.

4. What value did the dRTP bring?

The dRTP (Nikolai Juraschko) acted as a vital bridge between high-level scientific research and complex system architecture. In its original state, DeepDriveMD could not be run on any HPC system in the UK. The work done in this case study edited the code and added documentation so that it can be run on both architectures of the BEDE HPC. Furthermore, the Grace Hopper architecture is also used on Isambard-AI, the new UK Tier 1 cluster, so this work is directly transferable to other HPC systems.

Making this codebase accessible was only possible for someone with software and hardware expertise, including experience with HPC workflows and familiarity with the DeepDriveMD codebase itself. This intersection of skills is not commonly found amongst researchers and highlights the value that dRTPs bring to research.

5. The dRTP Experience

Refactoring an established codebase can often present a greater challenge than developing new software from scratch. Furthermore, traditional funding structures frequently prioritise novel methodologies over the maintenance and improvement of existing tools, creating a critical gap in software sustainability. This project presented an exciting challenge to address that disparity by adapting DeepDriveMD (DDMD) for state-of-the-art UK HPC environments. Although the feasibility of porting DDMD across all BEDE architectures was initially questioned, the project achieved full success. It is great to see that DDMD is now accessible to researchers nationwide, facilitating its use on major UK clusters and significantly lowering the barrier to entry for advanced adaptive molecular simulation workflows.

6. Useful links

DeepDriveMD-BEDE

KhalidLab/Deepdrive_we-BEDE
