10 October 2022 to 31 December 2028
Zoom Webinar
Europe/London timezone

Bioinformatics and the Curse of Dimensionality

Date: Tuesday 10 December 2024 – 15:00 (Europe/London)
Speaker: Euan McDonnell, Bioinformatics Data Scientist at the Computational Biology Facility (CBF), University of Liverpool

Abstract

Bioinformatics as a field has expanded rapidly over the past 25 years. Much of this growth has been driven by the increasing scale and frequency of large datasets, predominantly from global biological profiling approaches, the so-called “omics” technologies. These encompass a wide range of applications that quantify the abundance, activity, or presence of various biological entities in a top-down and unbiased manner, producing datasets with hundreds to millions of features. Much of bioinformatics is concerned with ranking and selecting such features with regard to their relationships to external factors, or their co-relationships within or between data types. However, because acquiring and processing biological samples, and generating data from them, is complex, time-consuming, and expensive, the number of data points is frequently far smaller than the number of features. This is termed the “large p, small n” or “p >> n” problem and is a critical, ubiquitous issue in bioinformatics and health data science. Such high dimensionality, in tandem with low degrees of freedom, poses a major analytical and computational challenge because the sampling domain grows explosively with the number of features: the so-called “curse of dimensionality”. This leads to overfitting and high variance in statistical and machine learning models, and compounds the inherently high variability between biological samples. Bioinformatics has therefore adopted a suite of methodologies to tackle this issue, commonly including empirical Bayes pooling of information, dimensionality reduction, and regularisation/sparsification procedures. While these approaches have allowed the field to mostly keep up with the increasing scale of the datasets being generated, further developments will be required in order to keep pace.
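The “p >> n” overfitting problem described above, and regularisation as one common remedy, can be sketched in a few lines of Python with NumPy. The simulated data, sample sizes, and penalty strength below are purely illustrative assumptions, not material from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 500                        # far more features than samples: "p >> n"
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                        # only 5 features carry real signal
y = X @ beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares: when p > n, the minimum-norm solution
# interpolates the training data exactly -- it fits the noise too.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
train_mse_ols = np.mean((X @ b_ols - y) ** 2)

# Ridge regression: an L2 penalty shrinks the coefficients,
# trading a little bias for a large reduction in variance.
lam = 10.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(f"OLS train MSE: {train_mse_ols:.2e}")          # essentially zero
print(f"||b_ols||   = {np.linalg.norm(b_ols):.2f}")
print(f"||b_ridge|| = {np.linalg.norm(b_ridge):.2f}")  # smaller norm
```

The zero training error of the unregularised fit is exactly the high-variance behaviour the abstract describes; the ridge penalty is one of the sparsification/regularisation procedures the talk surveys.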

The talk is now also available on YouTube: https://youtu.be/e9Lnd5hR6Uk

Biography

Euan started his academic career with an integrated Master’s in microbiology at the University of Leeds, where his final-year project introduced him to bioinformatics, using Bash and Python to analyse transcriptome-wide cleavage sites of a bacterial endoribonuclease. He subsequently undertook a DiMeN MRC-funded bioinformatics PhD on transcriptomic networks and their dysregulation by the oncogenic herpesvirus Kaposi’s Sarcoma-associated Herpesvirus. Since January 2023, he has worked as a bioinformatics data scientist with the Computational Biology Facility at the University of Liverpool. His research spans a range of projects, including transcriptome-wide analysis to determine the benefit of arginine intervention in premature neonatal patients, predicting and comparing genotype-epigenome relationships in foetal and osteoarthritic/osteoporotic tissue, and predicting discriminatory protein biomarkers for the diagnosis of non-bacterial osteomyelitis. More generally, he is interested in network approaches to biology, primarily Gaussian graphical models and how prior information and multiple ‘omics data types can be integrated into network models.