
Bridging the Gap: an onboarding guide for streamlined access to a biological imaging machine learning toolkit

Tamora James, University of Sheffield


Contents

1. The Problem
2. The Solution
3. Why it helps AI4Science
4. What value did the dRTP bring?
5. The dRTP Experience
6. Useful links

As high-dimensional imaging has become a cornerstone of modern bioscience, we are seeing new examples of researcher-led, custom AI toolkits designed to process high-dimensional images for specific scientific workflows. However, the practical application and wider uptake of such software, especially by teams outside the original development team, is often challenging because of the complexities typically present in research code. In this digital Research Technical Professional (dRTP) intervention we focused on softening some of the barriers to uptake by showing how an onboarding, quick-start guide can streamline deployment for new users. We focus on an exciting example of new custom AI research software: PhenoCLR, a comprehensive framework for analyzing organismal colour patterns using artificial intelligence.


1. The Problem

Biological research frequently demands the analysis of complex specimens, in tasks such as classification and segmentation, and extends to tasks that seek to characterise complex visual information such as colour or pattern. Biological image data now often transcends the visible spectrum and may contain information from channels such as fluorescence, thermal, and hyperspectral wavelengths.

AI provides a robust and attractive framework for researchers seeking to accelerate discovery from these sophisticated datasets. It has the potential to enhance research in a number of ways: by providing the speed and scalability needed to process vast datasets while ensuring consistent results free from human error, and by recognizing subtle patterns and managing complexity in the data, allowing precision in tasks such as classification and segmentation that may exceed what manual analysis can achieve.

So why isn't everyone using AI image tools in their research?

The problem is that most pre-trained, “off the shelf” AI image toolkits are built for handling standard photos, which makes them ill-suited and inefficient for analysing high-dimensional images and the extra layers of information that these images may contain.

The solution is to use "custom" models. In bioscience, as in many other research fields, these are rarely built from scratch. Instead, researchers typically take "off-the-shelf" computer vision architectures and adapt them, e.g. through transfer learning, fine-tuning, or structural modifications, so that they can handle the unique properties of specialist data and scientific tasks. The problem then is that such research software is rarely available for re-use in a "plug and play" fashion by users outside the originating team. Research software is often developed iteratively, sometimes over multiple projects, with limited funding (compared with commercial software products), and by different researchers whose focus, understandably, may be to produce academic outputs. This means that scientists who seek to implement and redeploy custom AI research models for science workflows often have to navigate complex layers of legacy code: access to code repositories may be unclear, paths may be hardcoded, documentation may be minimal, and there may be hidden or undocumented dependencies. This can be off-putting to those without a computer science background, and the problem is compounded when there is a need to move from a local machine to a High-Performance Computing (HPC) environment, where complex software dependency chains can create further barriers.
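To give a flavour of what such adaptation involves, the sketch below (a minimal illustration using PyTorch and torchvision, with a hypothetical five-channel multispectral input) shows one common pattern: replacing the first convolution of a pretrained RGB backbone so that it accepts additional image bands, and swapping the classification head for an embedding head ready for fine-tuning. It is a generic example of the technique, not the specific approach taken in PhenoCLR.

```python
# Illustrative sketch: adapting an "off-the-shelf" RGB backbone to
# high-dimensional imagery. Assumes PyTorch/torchvision; the 5-channel
# input is a hypothetical multispectral example, not PhenoCLR's own code.
import torch
import torch.nn as nn
from torchvision import models

def adapt_backbone(num_channels: int = 5, embedding_dim: int = 128) -> nn.Module:
    """Return a ResNet-18 whose first layer accepts `num_channels` bands."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Replace the 3-channel input convolution with an n-channel one,
    # re-using the pretrained RGB weights by averaging them across bands.
    old_conv = backbone.conv1
    new_conv = nn.Conv2d(num_channels, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding, bias=False)
    with torch.no_grad():
        mean_weight = old_conv.weight.mean(dim=1, keepdim=True)
        new_conv.weight.copy_(mean_weight.repeat(1, num_channels, 1, 1))
    backbone.conv1 = new_conv

    # Swap the classification head for a small embedding head,
    # ready for fine-tuning or contrastive training.
    backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
    return backbone

model = adapt_backbone()
dummy_batch = torch.randn(2, 5, 224, 224)   # batch of 5-band images
print(model(dummy_batch).shape)             # torch.Size([2, 128])
```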


2. The Solution

In this project the immediate goal was to carry out an example of a simple intervention that can lower the barrier to entry for researchers accessing a custom AI research software toolkit, in this case the contrastive learning model PhenoCLR. Developed by researchers at the University of Sheffield across successive funded projects, PhenoCLR is a generalisable toolkit for extracting and analysing colour pattern information from biological images. The software has the potential for wide applicability across the biosciences and to transform the ability of researchers to rapidly characterise colour pattern phenotypes from image datasets.

The PhenoCLR toolkit provides a useful case study for the challenge of improving access to and encouraging the reuse of valuable analytical research tools. The codebase is powerful yet complex, which can make it challenging for researchers without a computer science background to deploy. The lead researchers are keen to make the framework open, to keep this software moving and to have it packaged in a form such that it can be easily used by others in their lab and beyond, to greatly benefit the field.

Our intervention initially focused on creating an onboarding guide containing user-friendly documentation. Subsequently it became clear that it would also be valuable to provide recommendations for improving the codebase in line with FAIR principles, to support reproducibility and reusability.

PhenoCLR: a comprehensive framework for analyzing organismal colour patterns using artificial intelligence.

PhenoCLR is a generalisable toolkit for extracting and analysing colour pattern information from high-dimensional images. It can be used to support biological research that seeks to understand colour pattern variation within and between biological species.

PhenoCLR originated in a seed-funded project led by Dr Chris Cooney, lecturer in Biosciences, supported by Dr Tamora James of the Sheffield Research Software Engineering (RSE) Team, both at the University of Sheffield. With a BBSRC-funded project* to follow, the research team (including Eleftherios Ioannou) developed a powerful toolkit that combines state-of-the-art machine learning techniques (computer vision and contrastive learning methods) to analyze biological colour patterns across multiple spectral modalities. The framework is built around Google’s SimCLR (Simple Contrastive Learning of Visual Representations) and provides tools for both training models and evaluating embeddings.
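For readers unfamiliar with contrastive learning, the sketch below illustrates the core idea behind SimCLR-style training: embeddings of two augmented views of each image are pulled together by a normalised temperature-scaled cross-entropy (NT-Xent) loss, while embeddings of other images in the batch are pushed apart. This is a generic PyTorch illustration of the technique, with an assumed temperature value and random tensors standing in for encoder outputs; it is not PhenoCLR's implementation.

```python
# Generic SimCLR-style loss (illustration only, not PhenoCLR's code):
# two augmented views per image -> encoder -> projection head -> NT-Xent loss.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Normalised temperature-scaled cross-entropy over a batch of view pairs."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit vectors
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    n = z1.shape[0]
    # Mask out self-similarities so they never count as positives.
    sim.fill_diagonal_(float("-inf"))
    # For sample i, the positive is its other augmented view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random tensors standing in for projected embeddings of two views.
z_view1, z_view2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z_view1, z_view2))
```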

Key features

  • Multi-Modal Imaging: Supports everything from standard RGB and Multispectral (UV) to high-resolution Hyperspectral data (408+ bands).
  • Advanced AI Learning: Uses self-supervised learning and Kornia-powered data augmentation to create meaningful biological embeddings.
  • Evaluation and Visualization: Includes built-in tools for UMAP/t-SNE mapping and statistical evaluation (Silhouette scores) to help visualize phenotypic clusters (see the sketch after this list for a flavour of this kind of evaluation).
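As a rough indication of what such evaluation looks like in practice, the sketch below projects a set of embeddings to two dimensions with UMAP and computes a silhouette score against known labels. It assumes the widely used umap-learn and scikit-learn packages and uses random data in place of real embeddings; it is illustrative only and does not reproduce PhenoCLR's built-in evaluation scripts.

```python
# Illustrative evaluation of learned embeddings (not PhenoCLR's own scripts):
# project embeddings to 2-D with UMAP and score cluster separation with a
# silhouette coefficient. Random embeddings and labels stand in for real outputs.
import numpy as np
import umap                                  # from the umap-learn package
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))     # e.g. one 128-d vector per image
labels = rng.integers(0, 4, size=200)        # e.g. species or morph labels

projection = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
print("2-D projection shape:", projection.shape)
print("Silhouette score:", silhouette_score(embeddings, labels))
```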

The software is ideal for researchers working in:

  • Biological Pattern Analysis - Understanding colour variation in organisms.
  • Computer Vision - Developing robust visual representations for biological data.
  • Comparative Biology - Analyzing phenotypic variation across species.
  • Spectral Imaging - Working with multispectral and hyperspectral data.

The public codebase, documentation and more information on how it works are available here.

* PhenoCLR development work was supported by the BBSRC award BB/Y513830/1: Unlocking the complexity of organismal colour patterns using AI. The toolkit is built on top of Google's SimCLR implementation.

What was done:

The first step was to review existing documentation, particularly a “Quick Start” guide for basic analysis using the PhenoCLR toolkit. The review immediately revealed that the documentation was incomplete in places and that there were inconsistencies with the codebase. For example, paths specified in the documentation were not matched by the structure of the public codebase. Any user following the quick start guide would need to navigate a number of challenges including setting up the computational environment, modifying example configurations to supply missing options, correcting paths to scripts and understanding which configuration options should be adjusted to match available resources. Additionally, the documentation was very technical and did not provide much context for users to understand the purpose of the software.
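As an indication of the kind of step the onboarding guide now walks users through, the sketch below shows how a baseline training configuration might be adjusted to point at a user's own data and to match modest local resources. The configuration keys and values are invented for illustration and are not PhenoCLR's actual configuration schema.

```python
# Hypothetical illustration of adjusting a training configuration for a
# quick-start run. The keys below are invented for this sketch, not
# PhenoCLR's real configuration options.
import yaml  # provided by the PyYAML package

# Stand-in for a baseline config shipped with the codebase.
config = {
    "data": {"image_dir": "/original/hardcoded/path", "channels": 5},
    "training": {"batch_size": 256, "max_epochs": 100, "device": "cuda"},
}

# Point the pipeline at the user's own data rather than a hardcoded path.
config["data"]["image_dir"] = "/path/to/your/images"

# Scale the run down to match modest local resources.
config["training"]["batch_size"] = 32
config["training"]["max_epochs"] = 10
config["training"]["device"] = "cpu"   # switch back to "cuda" on an HPC GPU node

# Save the adjusted configuration for the quick-start run.
with open("my_quickstart_config.yaml", "w") as fh:
    yaml.safe_dump(config, fh)
```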

Examining the codebase, it became clear that the software had developed organically from its original incarnation, with multiple copies of scripts across two different repositories. While this structure met the needs of the research project, it did not make the codebase straightforward to work with, nor did it adhere to design principles, such as “Don’t Repeat Yourself” (DRY), that can contribute to code that is easier to maintain, understand and reuse.

The documentation review led to the following aims for the on-boarding guide:

  • Present a worked example so that novice users can build up their understanding of the toolkit
  • Provide details about what each step of the pipeline is intended to achieve and what outputs are expected
  • Use a subset of open source data to allow model training with limited resources (see the sketch after this list)
  • Use standard configurations from the codebase and describe how and why to adjust configuration options
  • Select configuration options so that model training doesn’t take too long
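A minimal sketch of the data-subsetting step is shown below, under the assumption of a flat directory of PNG images; the paths and sample size are placeholders rather than part of the PhenoCLR documentation.

```python
# Illustrative way to carve out a small subset of an open image dataset so
# that the worked example trains quickly on limited hardware. The directory
# layout, file extension and sample size are assumptions for this sketch.
import random
import shutil
from pathlib import Path

source = Path("data/full_dataset")          # hypothetical full dataset location
subset = Path("data/quickstart_subset")
subset.mkdir(parents=True, exist_ok=True)

images = sorted(source.glob("*.png"))
random.seed(42)                             # fixed seed for a reproducible subset
sample = random.sample(images, k=min(200, len(images)))
for path in sample:
    shutil.copy(path, subset / path.name)

print(f"Copied {len(sample)} of {len(images)} images to {subset}")
```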

The onboarding guide is to be reviewed (and tested) by domain researchers. When complete it will be made available via the PhenoCLR documentation pages.


3. Why it helps AI4Science

Empowering bioscience researchers to make use of cutting-edge image analysis pipelines has the potential to streamline investigation of key biological research questions. Given the breadth of the domain, improving access to ML image analysis for biological image data would have multiple applications, for example in cell biology, plant physiology, evolutionary biology, ecology and conservation. By easing barriers to access for the specific example of a custom image toolkit, PhenoCLR, this intervention makes a powerful new software tool more readily available to researchers. This has the potential to help catalyse existing bioscience research programmes and to open up new fundamental and applied research areas, involving colour pattern phenotyping, that are currently intractable.

Applying FAIR principles to research software helps support open science research goals such as transparency, reproducibility and reusability.


4. What value did the dRTP bring?

The dRTP community, and organisations such as the Software Sustainability Institute (SSI), have played a key role in changing the culture and promoting the application of FAIR (Findable, Accessible, Interoperable, Reusable) principles and open research practices, such as reproducibility, to the development of research software.

The dRTP leading this intervention, Dr Tamora James, was involved in the initial phase of software development and is also an expert in FAIR principles for research software (FAIR4RS). She leads the delivery of RSE training courses on FAIR and reproducible research software development to the Sheffield research community, helping to upskill researchers in applying FAIR principles in their research software projects.

In her evaluation of the research codebase, Tamora was guided by FAIR principles, and her on-boarding guide and recommendations are designed to help others to find and access the software, to help with interoperability and to encourage reuse.

FAIR principles for research software (FAIR4RS)

While the historical focus has been on "FAIR data", over the past few years there has been growing recognition that research software should also adhere to FAIR guidelines, as well as meeting broader open research goals such as reproducibility and open access. A modified set of FAIR principles for research software (FAIR4RS), allowing for the inherent differences between data and software, was developed in 2022 and provides a framework for the development of FAIR research software (see references below).

dRTP teams across the country can help researchers to apply FAIR principles in their research software projects including AI for science research, both through direct collaboration throughout a project lifecycle and through providing training courses for researchers.

For further reading about the use of FAIR principles in research software development see:

Barker et al., ‘Introducing the FAIR Principles for Research Software’. doi:10.1038/s41597-022-01710-x

Chue Hong et al., ‘FAIR Principles for Research Software (FAIR4RS Principles)’. doi:10.15497/RDA00068


5. The dRTP Experience

For the dRTP, who has research experience in the domain, this was a relatively rare opportunity to revisit a project that they had worked on previously. While the post-doctoral researcher who developed the codebase has now moved on to a new position, the dRTP was able to provide some continuity in terms of understanding the project’s aims and objectives. This background knowledge was valuable both for producing onboarding guidance and for making recommendations for further work to improve this research software.


6. Useful links

