
Bede Hardware

The system is based on the IBM POWER9 CPU and NVIDIA Tesla GPUs.

Within a node, both the CPUs and GPUs are connected to an NVIDIA NVLink 2.0 bus. Between nodes, a dual-rail Mellanox EDR InfiniBand interconnect allows GPUDirect RDMA communications (direct memory transfers to and from GPU memory).

Together with IBM's software engineering, the POWER9 architecture is uniquely positioned for:

  • Large-memory GPU use, as the GPUs can access main system memory via POWER9's large model feature.
  • Multi-node GPU use, via IBM's Distributed Deep Learning (DDL) software.

There are:

  • 2x "login" nodes, each containing:

    • 2x POWER9 CPUs @ 2.4GHz (40 cores total and 4 hardware threads per core), with NVLink 2.0
    • 512GB DDR4 RAM
    • 4x Tesla V100 32G NVLink 2.0
    • 1x Mellanox EDR (100Gbit/s) InfiniBand port
  • 32x "gpu" nodes, each containing:

    • 2x POWER9 CPUs @ 2.7GHz (32 cores total and 4 hardware threads per core), with NVLink 2.0
    • 512GB DDR4 RAM
    • 4x Tesla V100 32G NVLink 2.0
    • 2x Mellanox EDR (100Gbit/s) InfiniBand ports
  • 4x "infer" nodes, each containing:
    • 2x POWER9 CPUs @ 2.9GHz (40 cores total and 4 hardware threads per core)
    • 256GB DDR4 RAM
    • 4x Tesla T4 16G PCIe
    • 1x Mellanox EDR (100Gbit/s) InfiniBand port
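The per-node figures above can be tallied into system-wide totals. The following is a minimal sketch, not an official inventory tool: the `NodeClass` type and `totals` function are hypothetical names introduced here, and the numbers are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeClass:
    """One class of node, using the per-node figures listed above."""
    name: str
    count: int             # number of nodes of this class
    cores: int              # physical cores per node (both CPUs together)
    threads_per_core: int   # hardware threads per core
    ram_gb: int             # DDR4 RAM per node
    gpus: int               # GPUs per node
    gpu_model: str


NODE_CLASSES = [
    NodeClass("login", 2, 40, 4, 512, 4, "V100"),
    NodeClass("gpu", 32, 32, 4, 512, 4, "V100"),
    NodeClass("infer", 4, 40, 4, 256, 4, "T4"),
]


def totals(classes):
    """Sum nodes, cores, hardware threads, RAM and GPUs over all classes."""
    gpus = {}
    for c in classes:
        gpus[c.gpu_model] = gpus.get(c.gpu_model, 0) + c.count * c.gpus
    return {
        "nodes": sum(c.count for c in classes),
        "cores": sum(c.count * c.cores for c in classes),
        "hw_threads": sum(c.count * c.cores * c.threads_per_core for c in classes),
        "ram_gb": sum(c.count * c.ram_gb for c in classes),
        "gpus": gpus,
    }


if __name__ == "__main__":
    for key, value in totals(NODE_CLASSES).items():
        print(f"{key}: {value}")
```

With these figures the tally comes to 38 nodes carrying 136 Tesla V100 and 16 Tesla T4 GPUs.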

The Mellanox EDR InfiniBand interconnect is organised in a 2:1 block fat tree topology. GPUDirect RDMA transfers are supported on the 32 "gpu" nodes only.
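As a rough illustration of what the interconnect figures imply, the sketch below converts the nominal EDR port rate into per-node bandwidth. It assumes that "2:1" means each leaf switch has half as much uplink as downlink bandwidth, and it ignores protocol overhead, so the results are upper bounds rather than measured values. The function names are hypothetical.

```python
EDR_GBIT_PER_PORT = 100  # nominal EDR rate per port, from the node specs


def injection_gbytes_per_s(ports, gbit_per_port=EDR_GBIT_PER_PORT):
    """Nominal bandwidth a node can inject into the fabric, in GB/s."""
    return ports * gbit_per_port / 8  # 8 bits per byte


def blocked_gbytes_per_s(ports, blocking_ratio=2.0):
    """Per-node share of uplink bandwidth when every node on a leaf switch
    sends off-switch at once (assumed meaning of a 2:1 blocking ratio)."""
    return injection_gbytes_per_s(ports) / blocking_ratio


if __name__ == "__main__":
    for name, ports in [("gpu (dual-rail)", 2), ("login/infer (single port)", 1)]:
        print(name, injection_gbytes_per_s(ports), blocked_gbytes_per_s(ports))
```

Under these assumptions a dual-rail "gpu" node has a nominal 25 GB/s of injection bandwidth, falling to 12.5 GB/s per node when traffic from a fully loaded leaf switch must all cross the upper tier.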

Storage is provided by a 2PB Lustre filesystem capable of reaching 10GB/s read or write performance, supplemented by an NFS service for modest home and project directory needs.
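To put the 10GB/s figure in context, the sketch below estimates the minimum time needed to stream a dataset to or from Lustre. It assumes the quoted rate is a peak for the whole filesystem and uses decimal units (1GB = 10^9 bytes), so real transfers, especially those sharing the filesystem with other jobs, will take longer. The function name is hypothetical.

```python
LUSTRE_PEAK_GB_PER_S = 10  # quoted peak read or write rate


def min_transfer_seconds(size_bytes, rate_gb_per_s=LUSTRE_PEAK_GB_PER_S):
    """Lower bound on transfer time at the quoted peak rate."""
    return size_bytes / (rate_gb_per_s * 10**9)


if __name__ == "__main__":
    one_tb = 10**12
    full_filesystem = 2 * 10**15  # 2PB
    print(f"1TB: at least {min_transfer_seconds(one_tb):.0f} s")
    print(f"2PB: at least {min_transfer_seconds(full_filesystem) / 3600:.1f} h")
```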

Bede runs Red Hat Enterprise Linux 7, and access to its computational resources is mediated by the Slurm batch scheduler.
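Since work reaches the nodes through Slurm, a job is normally described by a batch script. The sketch below renders a minimal one using standard `sbatch` options (`--job-name`, `--partition`, `--nodes`, `--gres`, `--time`). The partition name `gpu` is an assumption borrowed from the node class name above, and the `render_sbatch` helper and the `python train.py` command are hypothetical. The real partition names, and any required account or project options, should be taken from the Bede user documentation.

```python
MAX_GPUS_PER_NODE = 4  # every node class listed above has 4 GPUs


def render_sbatch(job_name, command, partition="gpu", nodes=1,
                  gpus_per_node=4, time_limit="01:00:00"):
    """Return the text of a Slurm batch script requesting GPU resources.

    The default partition name "gpu" is an assumption, not a confirmed
    Bede setting.
    """
    if not 1 <= gpus_per_node <= MAX_GPUS_PER_NODE:
        raise ValueError(f"gpus_per_node must be 1..{MAX_GPUS_PER_NODE}")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",  # GPUs requested per node
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script that could be saved to a file and passed to sbatch.
    print(render_sbatch("train", "python train.py"), end="")
```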

The inclusion of the IC922 system on Bede is one of the first uses of this new hardware anywhere in the world. The use of NVIDIA’s high-bandwidth NVLink enables tensor outputs to be moved to the larger system memory. When loss calculations occur and the algorithm is updated, this data can be moved efficiently back to the GPU. This unique architecture ultimately enables deeper models to be trained using much higher resolution imagery than has previously been possible.

