N8 CIR - Success in the Cloud

Following on from the success of the N8 CIR research software engineers’ meet-up earlier in 2019, this event offers similar opportunities for those working with cloud computing platforms.

The event will follow a similar format with three speakers in the morning followed by a series of lightning talks, all coupled to networking opportunities.

Confirmed speakers include Simon Hood and Christopher Paul from the University of Manchester, who will be speaking about the technical aspects of cloud bursting using HTCondor on AWS. Cliff Addison from the University of Liverpool will be speaking about utilising public cloud in a planned disaster recovery scenario (full abstract below). Finally, Gary Leeming, also from the University of Manchester, will be speaking about the challenges of utilising the cloud for processing sensitive data.

To book a place at the event please visit: https://www.eventbrite.co.uk/e/n8-cir-success-in-the-cloud-tickets-72923629539

The event will take place at Chamberspace on Deansgate in Manchester. The event will be fully catered with tea, coffee and pastries, a buffet lunch and morning and afternoon refreshments.

The event is free to those working or studying at universities within the N8 Research Partnership.

Stepping towards HPC disaster recovery in the cloud

Authors: Cliff Addison and Manhui Wang (Cliff Addison, speaker), University of Liverpool

A new data centre with water-cooled racks for our HPC system, as well as sufficient UPS and generator back-up for the whole data centre, is scheduled to come into service in October 2019. Our HPC system is the first to be moved there. The expectation is that the system will be operational in its new home before this event. (Expectations of data centres is a talk for a separate occasion.)

In order to provide some HPC capability while the local system is being moved, Liverpool have worked with Alces to provide a cloud alternative. A test cloud HPC system was available in early June, when the existing data centre had to be powered off. This talk will look at three key outcomes of utilising public cloud in what is, essentially, a planned Disaster Recovery scenario. These are:

To gather enough detail to determine which parts of our on-premises HPC cluster are mission-critical.
To fully understand and document how a cloud version of our HPC cluster would operate, from how closely our cloud HPC could mirror on-premises systems down to how users would interact.
To take this knowledge and evolve it into a full Disaster Recovery policy, including addressing data storage.

We managed to replicate the entire system environment in our trial cloud system, so the look and feel of user interactions on the cloud system was similar to our on-premises system. The stand-by cloud footprint for an HPC system is fairly small – storage of system images and “some” user data. Depending on the acceptable levels of storage and compute in the cloud solution, some form of readily deployable HPC-DR solution does seem possible for fairly modest costs.

Protecting Sensitive and Personal Research Data in the Cloud

Gary Leeming, Chief Technology Officer, Connected Health Cities, University of Manchester

Researchers are increasingly using sensitive and personal data in their research, which requires new, risk-based approaches to security, especially as the University of Manchester moves to a cloud-first policy. Leeming will discuss the challenges of developing a new service that balances the risk of threats against the need to give researchers access to appropriate services to collect, manage and investigate data that is subject to tight governance controls. The talk will also cover how a cloud-based approach, training, and policies need to be developed to enable this.

Cloud Bursting with HTCondor and AWS

Simon Hood and Chris Paul, University of Manchester

The University of Manchester has a ‘Cloud First’ policy, so we are going ‘Paddling in the Cloud’: testing a variety of scenarios to determine what is likely to work cost-effectively for the research community, and what will not.

Use of the AWS Spot Market with UoM's HTCondor is an obvious place to start. AWS suggested using their Annex interface, so that's what we did. We encountered a number of problems, but worked with AWS and developers at the University of Wisconsin (the home of HTCondor) to overcome these. Our modifications have become part of the updated Annex.

In this talk we will give an overview of the environment and challenges at UoM, then detail what we did, and finally describe the (successful!) outcome.

Condor Cloud: Accelerating material discovery

Dani Reta, University of Manchester

When talking about the next generation of data storage with ultra-high capacities, single magnetic molecules, capable of retaining information at the molecular level, are ideal contenders. Unfortunately, even the best examples can only do that below liquid nitrogen temperatures. We are therefore developing computational methods to understand what leads to information loss, in the hope of improving their performance.

To do that, for each molecule under study, we need to run on the order of 30k independent calculations (each taking ~4 hours on a single core in the CSF), in what constitutes an embarrassingly parallel problem.

This type of high-throughput computing is ideal for the UoM Condor service, but the scale of our problem was too large for on-site solutions. Luckily, the recently developed Condor cloud bursting service proved ideal for this task – by extending the Condor pool into AWS we were able to run ~3k cores simultaneously, resulting in the successful completion of this otherwise intractable problem.
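To give a feel for the scale described above, here is a rough back-of-envelope sketch using only the approximate figures quoted in this abstract (30k calculations, ~4 hours each, ~3k cores). It assumes perfectly even scheduling with no queueing, start-up, or Spot interruption overheads, so the numbers are illustrative rather than measured; the function names are hypothetical.

```python
import math

# Approximate figures quoted in the abstract (all are rough estimates).
CALCULATIONS = 30_000   # "on the order of 30k independent calculations"
HOURS_PER_CALC = 4      # "~4 hours on a single core"
CLOUD_CORES = 3_000     # "~3k cores simultaneously"


def core_hours(n_jobs: int, hours_per_job: float) -> float:
    """Total compute needed if every job runs on one core."""
    return n_jobs * hours_per_job


def ideal_wall_clock_hours(n_jobs: int, hours_per_job: float, cores: int) -> float:
    """Idealised wall-clock time: jobs run in equal-length waves,
    one job per core, with no scheduling overhead (an assumption)."""
    waves = math.ceil(n_jobs / cores)
    return waves * hours_per_job


total = core_hours(CALCULATIONS, HOURS_PER_CALC)
burst = ideal_wall_clock_hours(CALCULATIONS, HOURS_PER_CALC, CLOUD_CORES)

print(f"Total work per molecule: {total:,.0f} core-hours")
print(f"On a single core: about {total / (24 * 365):.1f} years")
print(f"On {CLOUD_CORES:,} cores (idealised): about {burst:.0f} hours")
```

Under these assumptions, the workload for one molecule is about 120,000 core-hours, which an idealised pool of 3,000 cores could clear in under two days.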

Abstracting over providers

Dave Love, University of Manchester

A brief discussion of provider-independent building of development instances, HPC-style clusters, and "virtual datacentres", with indications of where work needs doing.

Useful Event Information

This section of the event page contains useful information for those attending the event. We will add more documents and information closer to the event.

N8 CIR Cloud Walking Routes
2019 10 17 Cloud Schedule
Code of Conduct

17 October 2019