N8 CIR UK Biobank Seminar Series

Alex Casson, University of Manchester

Analysing the 100,000+ accelerometer datasets in the UK Biobank using the University of Manchester computational shared facilities

The UK Biobank contains over 100,000 records from participants who wore a wrist-worn device for a week, allowing investigations into their activity patterns, sleep patterns, and more. The raw accelerometry data is more than 20 TB in size, with a large amount of metadata also available.

This presentation will give an overview of the computational approaches we took to analyse the entire dataset using the University of Manchester computational shared facilities, including optimizations to storage and processing that accelerate the analysis. We will detail these optimizations and our open-source code for processing the data on high-performance clusters and on Windows PCs.

We will also highlight some of the caveats and limitations of the dataset that users should be aware of when designing studies that use it.

Recording of presentation

Emma Drummond, Lancaster University

Extending the Reach of UK Biobank Data into the Mitochondrial Genome

Emma Drummond has developed a novel interpolation route that links microarray data to open-source libraries of full mtDNA sequences. Mitochondria are the organelles that eukaryotic cells rely on to provide ATP, which they must do efficiently and responsively.

The UK Biobank has collected a volume of data that enables exploration of the human mitochondrial genome (mtDNA) for variants influencing mitochondrial performance. However, the limited density of the genetic data requires imputation, that is, interpolating from known data points. Current methods fail to fully exploit what is known about mtDNA and its inheritance pattern, a shortcoming that Drummond's approach addresses.

Recording of presentation

Richard Williams, University of Manchester

Code Set Selection Methods for Primary Care Data

Most primary care health data is ‘coded’: instead of a patient’s record containing the term “type 2 diabetes”, it contains the clinical code “C10F”. To analyse patient data, we must first assemble sets of these clinical codes, called ‘code sets’, that describe what we want to find.
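As a rough illustration of how a code set is used, the sketch below filters coded records against a set of clinical codes. It is a minimal example under stated assumptions, not the method presented in the talk: only the code "C10F" comes from the text above, while the record layout, the patient IDs, the placeholder codes, and the function name `patients_matching` are hypothetical.

```python
# A code set is simply a collection of clinical codes that together
# define the condition of interest. Only "C10F" is taken from the text;
# a real code set would usually contain many related codes.
TYPE_2_DIABETES_CODES = {"C10F"}

# Hypothetical coded records as (patient_id, clinical_code) pairs.
# "ZZZ1" and "ZZZ2" are made-up placeholder codes, not real clinical codes.
records = [
    ("p1", "C10F"),
    ("p2", "ZZZ1"),
    ("p3", "C10F"),
    ("p1", "ZZZ2"),
]


def patients_matching(records, code_set):
    """Return the sorted IDs of patients with at least one code in code_set."""
    return sorted({patient_id for patient_id, code in records if code in code_set})


matched = patients_matching(records, TYPE_2_DIABETES_CODES)
print(matched)
```

Because the result depends entirely on which codes are in the set, a code omitted from or wrongly added to the set silently changes which patients are counted.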

However, caution must be exercised as the process of creating these code sets is non-trivial, and mistakes at this early stage of the analysis pipeline have been shown to lead different research teams to reach different results and conclusions from the same data source. In this talk, Richard will provide an introduction to the various coding systems in use in UK primary care, talk about the right ways to create clinical code sets, and signpost attendees to various online resources that can help.