Research Project: Easing access to high-throughput GPUs for Deep Learning researchers
Why did you apply for this internship?
I applied because I'm interested in becoming a research software engineer, a role with the potential to do work that makes a real impact on people.
I've always wanted to work at the forefront of innovation and be excited by the work I do, and this seems like the perfect opportunity to realise that. I love working with data and finding ways to improve systems, and I know that I'll learn a significant amount during this internship that I can take into the future.
I aim to learn as much as possible during these couple of months. I worked on mainframe applications during a previous placement, and I would like to build on that experience by exploring different high-throughput systems and the impact they can have, particularly within research.
I would love to build my experience as much as possible in areas that I cannot explore as much in my own time, with limited resources, as well as improve my skills when working in teams.
Finally, it would be great to contribute to a project that makes a tangible difference to research and learn about new topics like deep learning and high-throughput GPUs, as well as toolkits such as HTCondor.
What did you hope to gain in completing this project?
I hoped to learn a lot of new skills working with HTCondor and other software, as well as gain an insight into what it's like working in research. This is an area I'm extremely interested in, so I also hoped to confirm whether it is something I'd want to pursue in the future.
Project Overview
There is strong demand within the university for running large numbers of deep learning training runs. Doing this in the cloud or by buying dedicated hardware would be very costly.
However, many computers around the university have high-quality GPU cards that could be put to use when their normal users aren't using them, for example during the night. To begin to solve this problem, I worked on developing a prototype HTCondor cluster within Newcastle University, along with tooling that makes the system easier for non-computing experts to use.
What were the key results of your research project?
- I successfully created a small HTCondor cluster on campus from unused university computers. These computers could communicate with each other, allowing me to submit jobs from the head node, which then ran on the other nodes based on availability and resource requirements. This proved that a university-wide cluster would be possible, as it would only require adding more machines in the same manner.
- I tested the performance of this cluster with and without Microsoft Defender. Defender didn't have a significant impact on performance, meaning that it can stay enabled and the University's security requirements can be satisfied.
- I built a user interface that non-computing experts and students can use to access the HTCondor cluster, allowing jobs to be submitted without using the command line.
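
To illustrate the kind of job submission described above, here is a minimal sketch in Python that builds an HTCondor submit description requesting a GPU. It is not the project's actual configuration: the function name `build_submit_description`, the script name `train.sh`, and the resource values are all assumptions chosen for illustration.

```python
def build_submit_description(executable, arguments="", gpus=1, cpus=1,
                             memory="4GB", job_name="train"):
    """Return the text of an HTCondor submit description file.

    All defaults here are hypothetical; a real cluster would choose
    values that match its own machines and policies.
    """
    lines = [
        f"executable = {executable}",
        f"arguments = {arguments}",
        # Resource requests let HTCondor match the job to a suitable node.
        f"request_gpus = {gpus}",
        f"request_cpus = {cpus}",
        f"request_memory = {memory}",
        # $(Cluster) and $(Process) are expanded by HTCondor per job.
        f"output = {job_name}.$(Cluster).$(Process).out",
        f"error = {job_name}.$(Cluster).$(Process).err",
        f"log = {job_name}.$(Cluster).log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical training script and arguments.
    print(build_submit_description("train.sh", arguments="--epochs 10"))
```

The resulting text could be saved to a file and passed to `condor_submit` on the head node. A graphical interface can generate the same text from form fields, which is one way to spare users from writing it by hand.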
How do you feel you have benefited from completing this internship, and has it made you consider future career paths?
I've benefitted from this internship by learning a huge amount about different components of computer science and machine learning.
I've gained a lot of experience with HTCondor (which I had never used before), I've learnt about different security measures and their importance, and I've developed hands-on skills working with a large number of physical machines and installing operating systems from scratch, which I would never have been able to do in my own time.
GitHub repository: https://github.com/tomreece4/htcondor-interface (user interface only)