
Evaluating AI Workflows using the Causal Testing Framework on Bede

Farhad Allian, University of Sheffield


Contents

1. The Problem
2. The Solution
3. Why it helps AI4Science
4. What value did the dRTP bring?
5. The dRTP Experience
6. Useful links


1. The Problem

The rapid adoption of AI and ML in scientific discovery has significantly outpaced the development of the methodologies required to verify these systems. Traditional software testing approaches often struggle where there is no "ground truth", making it difficult to define a correct output for any given input. Furthermore, the vast parameter spaces of modern models make exhaustive testing computationally prohibitive at scale.

These challenges are compounded by the stochastic nature of many scientific simulations, where identical inputs can yield different outputs, making it hard to distinguish between expected variance and genuine modeling errors. Without rigorous verification, these "black box" models pose a significant risk to the reproducibility and accuracy of scientific research, especially when researcher priorities are understandably focused more on discovery than on complex software testing protocols.


2. The Solution

To address these issues, the Causal Testing Framework (CTF) can be used. This domain-agnostic solution utilizes causal inference and Directed Acyclic Graphs (DAGs) to model input-output relationships more effectively than conventional methods.

Imagine you are testing a driving simulator: instead of checking if one specific car stops at a specific line, you map out a causal graph where rain and speed both have an expected relationship with Braking Distance. You then define a "causal test" that compares a dry control group to a rainy treatment group to see if the weather change actually causes the stopping distance to increase as expected. The framework uses your simulation data to determine whether this is true, verifying that the software correctly reflects the relationship.
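The braking-distance test above can be sketched in a few lines of plain Python. This is a minimal, self-contained illustration of the idea, not the CTF's own API: the names `simulate` and `causal_effect_of_rain`, and all the numbers in the toy simulator, are hypothetical choices made for this example.

```python
import random

random.seed(0)

def simulate(n=20000):
    """Toy driving simulator producing (speed, rain, braking distance) runs."""
    rows = []
    for _ in range(n):
        rain = random.random() < 0.5
        # Confounding: drivers tend to slow down in the rain, so a naive
        # rainy-vs-dry comparison of braking distances would be biased.
        speed = random.uniform(20, 50) if rain else random.uniform(30, 60)
        dist = 0.05 * speed ** 2 + (10.0 if rain else 0.0) + random.gauss(0, 2)
        rows.append((speed, rain, dist))
    return rows

def causal_effect_of_rain(rows, bin_width=5):
    """Estimate the effect of rain on braking distance, adjusting for speed
    by comparing rainy and dry runs only within the same speed bin."""
    bins = {}
    for speed, rain, dist in rows:
        b = int(speed // bin_width)
        bins.setdefault(b, {True: [], False: []})[rain].append(dist)
    diffs = []
    for groups in bins.values():
        if groups[True] and groups[False]:
            diffs.append(sum(groups[True]) / len(groups[True])
                         - sum(groups[False]) / len(groups[False]))
    return sum(diffs) / len(diffs)

effect = causal_effect_of_rain(simulate())
# Should recover the +10 m rain effect that was built into the simulator.
print(round(effect, 1))
```

Under these assumptions, the naive rainy-vs-dry difference in mean braking distance is actually negative (rainy drivers go slower), while the speed-adjusted estimate recovers the +10 m effect built into the simulator. Verifying that the software reproduces exactly this kind of expected causal relationship is what a causal test does.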

The advantage of the CTF is that it uses causal graphs to test whether specific inputs are the actual drivers behind a model's decisions. This allows you to scientifically verify fairness and probe the model's robustness against "what-if" scenarios, ensuring it stays reliable even when the data changes.
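A "what-if" check of this kind can also be sketched directly: hold every other input fixed, intervene only on the variable of interest, and assert the expected direction of the effect. In the hypothetical sketch below, `braking_model` stands in for the system under test; neither it nor `what_if_test` is part of the CTF.

```python
import random

def braking_model(speed, rain, rng):
    # Hypothetical stand-in for the (stochastic) system under test.
    return 0.05 * speed ** 2 + (10.0 if rain else 0.0) + rng.gauss(0, 2)

def what_if_test(speeds, n_reps=200):
    """Interventional check: at each fixed speed, toggle only the weather
    and require that rain never reduces the expected braking distance."""
    rng = random.Random(1)
    for speed in speeds:
        dry = sum(braking_model(speed, False, rng) for _ in range(n_reps)) / n_reps
        wet = sum(braking_model(speed, True, rng) for _ in range(n_reps)) / n_reps
        if wet <= dry:
            return False
    return True

ok = what_if_test([30, 40, 50])
print(ok)
```

Averaging over repeated runs at each speed is what lets the test distinguish the stochastic variance of the simulator from a genuine violation of the expected causal relationship.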

As part of this work, a bespoke tool called causal-ai has been developed that enables better testing of black-box systems using causal testing. It includes:

  • Bede Integration: The tool is hosted centrally on the Bede HPC, allowing users to access it via the simple command module load causal_testing. This allows near instant access to CTF without worrying about installation.

  • Case Studies: User guides have been written giving examples of how to run the CTF on different types of data, including image classification and video-based models, specifically focusing on domain adaptation.


3. Why it helps AI4Science

Scientific AI models often involve complex, multi-variable relationships that traditional testing cannot adequately map, making them difficult for stakeholders to trust. By providing an intuitive, accessible path to causal analysis, the CTF bridges this gap, transforming "black boxes" into interpretable systems. Since researchers often prioritize model development over complex validation, lowering the barrier to entry for robust testing is essential to gaining the stakeholder buy-in necessary for real-world adoption.


4. What value did the dRTP bring?

Digital Research Technical Professionals (dRTPs) serve as the vital link between theoretical science and robust software, translating the complex causal concepts of the CTF for researchers in other domains. Because deep testing is often a low priority for scientists, dRTPs are uniquely positioned to minimize friction by integrating these tools directly into existing workflows on platforms like the Bede HPC. By combining an understanding of AI reliability with high-level software expertise, dRTPs can provide the streamlined access and comprehensive guidance necessary to turn a sophisticated framework into an enduring, high-adoption research standard.


5. The dRTP Experience

Beyond the core project deliverables, this work also created an opportunity to contribute directly to other open-source research software. While working with the PyKale framework, two bugs were identified and fixed upstream via pull requests (PR #517, PR #523): a missing AdamW optimiser implementation and a silent multiprocessing issue in the data-loading pipeline. These contributions, though outside the original project scope, highlight how dRTPs are well positioned to strengthen the open research ecosystem through upstream collaboration, ensuring that fixes benefit the broader scientific community rather than remaining isolated workarounds.


6. Useful links

