Paul Niklas Ruth and Samantha Finnigan, Durham University
Contents
1. The Problem
2. The Solution
3. Why it helps AI4Science
4. What value did the dRTP bring?
5. The dRTP Experience
1. The Problem
In many domains, such as the social sciences, researchers rely heavily on transcribing large volumes of interview data. However, traditional commercial Speech-to-Text (S2T) services present three primary obstacles:
- Financial Sustainability: Subscription-based models often expire with project funding, making it difficult to conduct follow-up interviews or longitudinal studies once a grant has ended.
- Reproducibility: Commercial providers frequently update their models or algorithms without transparency, making it impossible for researchers to replicate identical transcription results over time.
- Data Sovereignty and Ethics: Sending sensitive, identifiable patient or interviewee data to external, third-party servers raises significant legal and moral concerns regarding GDPR and data privacy.
Although running a model on local infrastructure addresses these issues, many researchers lack the technical background to set up a local speech-to-text model. Furthermore, to do this at scale, the model needs to run on a High Performance Computing (HPC) system, which is another technical hurdle for many researchers.
2. The Solution
To support large-scale speech-to-text tasks, OpenAI's Whisper model has been made available on Bede. This project implemented whisper.cpp (a high-performance C++ port of the original model) as an accessible module on the Bede HPC that can be used with very little effort, and documented it in the official Bede user documentation (link will be added to this article when it goes live).
Some of the key features of the implementation are:
- Optimized Implementation: While the project began with a PyTorch implementation, it transitioned to the C++ version (whisper.cpp) to improve efficiency and ease of use in a cluster environment.
- User Guide: A set of simplified instructions covering the entire workflow, including account setup, data upload, running the transcription, and downloading results.
- Parameter Tuning: To handle noisy data, such as interviews with background music, the solution makes it easy to change model parameters, including temperature adjustments and silence thresholding to prevent model hallucinations. This is vital for attaining good performance with the model.
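On an HPC system such as Bede, a transcription like this is typically submitted as a batch job. The following is an illustrative sketch only, not the exact script from the Bede documentation: the module name (`whispercpp`), project code, model file, and partition are placeholders, and the flag names (`-tp` for temperature, `-nth` for the no-speech threshold) follow whisper.cpp's command-line interface and should be checked against the installed version.

```shell
#!/bin/bash
#SBATCH --account=<project>   # your Bede project code (placeholder)
#SBATCH --partition=gpu       # GPU partition name may differ on Bede
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Load the whisper.cpp module (the exact module name on Bede may differ).
module load whispercpp

# Transcribe one interview recording to plain text.
#   -m    path to a downloaded ggml model file
#   -f    input audio (whisper.cpp expects 16 kHz WAV)
#   -tp   sampling temperature (lower values reduce hallucinations)
#   -nth  no-speech threshold, used to skip silent segments
#   -otxt / -of  write the transcript to interview01.txt
whisper-cli -m ggml-large-v3.bin -f interview01.wav \
    -tp 0.0 -nth 0.6 -otxt -of interview01
```

Submitting the script with `sbatch` queues the transcription on a GPU node, so large batches of interviews can run unattended.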
3. Why it helps AI4Science
The deployment of OpenAI's Whisper on Bede provides a resource for AI-driven research in the medical and social sciences, specifically by addressing the technical and ethical barriers associated with sensitive data.
As part of the case study, Whisper was used to transcribe medical interviews from the patient voices dataset. The dRTPs optimized the model to handle 800 complex files containing music and silence, using silence thresholding and temperature tuning to eliminate hallucinations and error propagation. This high-performance solution enables the rapid, scalable transcription of large audio and video datasets that would otherwise be prohibitively labor-intensive to transcribe manually, establishing a foundation for future multimodal AI research. Local deployment on Bede addresses critical ethical and technical barriers in AI research by ensuring data sovereignty for sensitive patient archives.
4. What value did the dRTP bring?
The work was carried out by two dRTPs, Samantha Finnigan and Paul Niklas Ruth, who work as RSEs at the Advanced Research Computing Centre, Durham. They brought a mix of dRTP skills to address the problem:
Deploying whisper.cpp on Bede is non-trivial. It requires expertise in C/C++ compilation, something which is uncommon in social science domains. A further complication is that whisper.cpp is not compatible with the Grace Hopper GPU nodes (the latest GPUs in Bede) by default. This is not something that can be installed by simply following instructions from a user manual. Furthermore, the Whisper model is not designed to work out of the box: it needs to be tuned to the dataset by changing parameters and observing the results.
Beyond handling the initial setup, the dRTPs acted as essential research partners by collaborating with the researcher to understand these requirements. For example, when working with the researcher and the data, it became clear that rapid hyper-parameter tuning was important, so it was implemented straight away.
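To give a flavour of the compilation work involved, a build of whisper.cpp with CUDA support targeting the Hopper architecture might look like the sketch below. This is a hedged illustration, not the exact recipe used on Bede: the `GGML_CUDA` option and `CMAKE_CUDA_ARCHITECTURES=90` (Grace Hopper's compute capability) follow recent whisper.cpp CMake conventions, and older releases used a different flag (`WHISPER_CUBLAS`), so the build options should be checked against the version being installed.

```shell
# Build whisper.cpp from source with CUDA enabled for Grace Hopper (sm_90).
# Flag names follow recent whisper.cpp CMake conventions and may differ
# between releases; check the project's README for the installed version.
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j --config Release
```

Getting a build like this right on cluster hardware, with matching CUDA toolkits and compilers, is exactly the kind of step that is hard to convey through a user manual alone.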
5. The dRTP Experience
The overall task was solved in an iterative workflow between Samantha Finnigan and Paul Niklas Ruth, involving three main steps. First, getting the software to run on the Bede hardware, along with an initial rough integration into Bede's module system, using publicly available examples as a smoke test (Paul Niklas Ruth). Second, using these prototype implementations to fine-tune the plethora of available options against the demonstration dataset to produce the required transcriptions (Samantha Finnigan). Finally, combining the runnable prototypes, the fine-tuned values, and the lessons learned from fine-tuning into a script that allows non-expert users to run the developed workflows, along with detailed documentation of the workflow (Paul Niklas Ruth). This provides a ready-to-use solution for transcribing a large corpus of audio into text.
(Samantha Finnigan noted: manual tuning for the dataset was iterative, long, and arduous, checking and re-checking outputs for reasonable results. Around 50% of the 35 hours I spent on the task went into tuning.)
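The tuning step described above amounts to a small parameter sweep: transcribe a sample file under each combination of settings and compare the outputs by hand. The sketch below is purely illustrative, not the script used on Bede: the binary name (`whisper-cli`), model file, and flag names (`-tp` for temperature, `-nth` for the no-speech threshold) follow whisper.cpp's CLI and should be checked against the installed version; here the commands are only printed so they can be reviewed before running.

```shell
# Hypothetical parameter sweep for tuning whisper.cpp on one sample file:
# enumerate (temperature, no-speech threshold) combinations and print the
# transcription command for each, so the outputs can be generated and
# compared side by side.
sweep() {
  for t in 0.0 0.2 0.4; do    # sampling temperatures to try
    for n in 0.4 0.6; do      # no-speech (silence) thresholds to try
      echo "whisper-cli -m ggml-large-v3.bin -f sample.wav" \
           "-tp $t -nth $n -otxt -of out_t${t}_n${n}"
    done
  done
}

sweep
```

Piping the output to `sh` would run all six combinations; keeping each transcript in a separately named file makes the manual check-and-re-check loop at least mechanically simple, even if judging the results stays laborious.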