Sheffield NVIDIA GPU Hackathon on Bede

Bede isn't your typical x86 high-performance workhorse. It is based around IBM's POWER9 CPUs and NVIDIA Tesla GPUs. Within a node, both the CPUs and GPUs are connected to an NVIDIA NVLink 2.0 bus; between nodes, connectivity is provided by a dual-rail Mellanox EDR InfiniBand interconnect. This allows direct memory transfers to and from GPU memory, making Bede ideal for large-memory GPU use and multi-node GPU use.

The GPU Hackathon was attended by 8 teams and supported by 21 mentors (and assistants) from NVIDIA (x6), RSE Sheffield (x6), JSC (x2), CSCS (x2), Netherlands eScience Center (x1), Durham University (x1), CINECA (x1), University of Oxford/Southampton (x1) and University of Manchester (x1). In addition, the Hackathon was supported by OCF, with system administration for Bede provided by Durham University.

One of the key objectives was to build the event around the Bede system, giving researchers a clear path to accessing a national GPU system after the event. Being the pilot users of a system presents several challenges, not least a lack of documentation.

As part of the hackathon preparation, members of the RSE team at Sheffield and ARC at Durham put in place an open-source documentation system so that key information on the system and software stack could be easily communicated and quickly updated. This documentation has been instrumental in the commissioning of Bede, and the most up-to-date version will be published on the N8 website before Bede is launched as a production service.

During the Hackathon, participants ran and benchmarked HPC jobs on the Bede system to profile their code. Mentors guided the teams through the profiling results, explaining the information presented by the recorded hardware metrics. Through the mentors’ expert advice and an iterative process of code improvement, the teams reported considerable speedups over their starting code.

By the end of the hackathon, it was clear that the key to maximising the benefit of Bede’s unique architecture will be access to expert mentors and GPU specialists. Bede's institutional commitment to providing expert GPU RSEs is therefore an excellent path to delivering value as the system develops. It also emphasises the importance of the collaborative RSE model.

We will be providing details of RSE support for Bede in the coming days. Links will be provided from our Twitter feed and this page will also be updated.