Name: STFC School on Data Intensive Science 2024
Start: 2024-07-14T09:00:00+01:00
End: 2024-07-19T16:00:00+01:00
Location: John Lennon Art and Design Building

STFC School on Data Intensive Science 2024

from Sunday, 14 July 2024 (09:00) to Friday, 19 July 2024 (16:00)

Sunday, 14 July 2024

18:00 Welcome Reception
Welcome Reception
18:00 - 19:00
Room: Atrium

Monday, 15 July 2024
09:00 Welcome and Introduction (Carsten Welsch / University of Liverpool, Julie Sheldon / LJMU, Naomi Smith / University of Liverpool) - Naomi Smith (Cockcroft Institute/University of Liverpool) Carsten Welsch (University of Liverpool) Julie Sheldon (LJMU)
Welcome and Introduction (Carsten Welsch / University of Liverpool, Julie Sheldon / LJMU, Naomi Smith / University of Liverpool)
- Naomi Smith (Cockcroft Institute/University of Liverpool)
- Carsten Welsch (University of Liverpool)
- Julie Sheldon (LJMU)
09:00 - 10:00
Room: The Johnson Foundation Auditorium
10:00 Coffee break
Coffee break
10:00 - 10:30
Room: Public Exhibition Space
10:30 Moving towards intelligent data (Louise Butcher / Hartree) - Louise Butcher (Hartree)
Moving towards intelligent data (Louise Butcher / Hartree)
- Louise Butcher (Hartree)
10:30 - 12:30
Room: Ann Walker Seminar Room
Participants will learn about collecting good quality, unbiased data, preparing the data for modelling and exploring some simple machine learning models. There will be a largely practical element to help you work with your data. There will be opportunities to consider how to apply machine learning to participants’ own data problems, using freely available open source tools.
Learning Objectives
- How data science happens in the real world
- What needs to be done to make real data ready for machine learning
- What methods work with real data
- How to handle data legally
- What ethical and social issues surround the use of AI
Pre-requisites
- Students should have a working knowledge of Python, including use of Pandas.
- Suggested viewing: create a free account and watch Beginner's Guide to Data Collection.
Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) - Ajay Rawat (Hartree)
Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree)
- Ajay Rawat (Hartree)
10:30 - 12:30
Room: Archibald Bathgate Seminar Room
A comprehensive learning experience designed to equip participants with the knowledge and practical skills needed for data engineering and big data processing. Participants in the practical session will use Databricks: https://community.cloud.databricks.com/login.html
Learning Objectives
- Understand the key concepts of data engineering and big data processing.
- Describe the architecture and functionalities of Apache Spark.
- Use Spark SQL and DataFrames for data querying and analysis.
- Perform data transformations and aggregations using Spark functions.
Pre-requisites
- Basic understanding of programming concepts (Python)
- Familiarity with relational databases
- Understanding of data warehousing concepts and the ETL process (advantageous but not mandatory)
- Recommended STFC training: enrol for free, then watch the video Practical Guide to Data Engineering
Practical guide to Neural Networks (Michail Smyrnakis / Hartree) - Michail Smyrnakis (Hartree)
Practical guide to Neural Networks (Michail Smyrnakis / Hartree)
- Michail Smyrnakis (Hartree)
10:30 - 12:30
Room: Lecture Room 2
This session will guide participants through some of the practical considerations when deciding how neural networks can be used. You will also complete practical exercises introducing the two main Python libraries used with neural networks, PyTorch and TensorFlow, and gain hands-on experience of applying existing libraries and pretrained neural networks to various small-scale problems.
Learning Objectives
- Understand the concepts of Artificial Neural Networks and Deep Neural Networks.
- Gain familiarity with different types of advanced neural networks and the areas of application for each type.
- Gain hands-on experience with the two main Python libraries for deep neural networks (PyTorch and TensorFlow).
- Learn how to load data using TensorFlow and PyTorch.
- Understand whether your trained models have underfit or overfit the data.
- Learn how to define a Convolutional Neural Network in both PyTorch and TensorFlow.
- Explore the effects of hyperparameters.
Pre-requisites
- Understanding of basic mathematical concepts, e.g. functions, matrices and derivatives.
- Minimal experience with Python
12:30 Lunch
Lunch
12:30 - 13:30
Room: Public Exhibition Space
13:30 Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) - Ajay Rawat (Hartree)
Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree)
- Ajay Rawat (Hartree)
13:30 - 15:30
Room: Archibald Bathgate Seminar Room
Practical Guide to Machine Learning with data collection and preparation (Louise Butcher / Hartree) - Louise Butcher (Hartree)
Practical Guide to Machine Learning with data collection and preparation (Louise Butcher / Hartree)
- Louise Butcher (Hartree)
13:30 - 15:30
Room: Ann Walker Seminar Room
Practical guide to Neural Networks (Michail Smyrnakis / Hartree) - Michail Smyrnakis (Hartree)
Practical guide to Neural Networks (Michail Smyrnakis / Hartree)
- Michail Smyrnakis (Hartree)
13:30 - 15:30
Room: Lecture Room 2
15:30 Coffee break
Coffee break
15:30 - 16:00
Room: GFLEX

16:00

16:00 - 17:00
Room: GFLEX
Tuesday, 16 July 2024
08:55 AgenticAI for Data Analysis (Boris Bolliet / Cambridge University) - Boris Bolliet (Cambridge University)
AgenticAI for Data Analysis (Boris Bolliet / Cambridge University)
- Boris Bolliet (Cambridge University)
08:55 - 09:55
Room: The Johnson Foundation Auditorium
Multi-agent systems consisting of Large Language Model (LLM) and Retrieval Augmented Generation (RAG) powered assistants are game-changing for data analysis tasks. Compared to a standard script that would run a pipeline from A to Z, we can interact dynamically with intelligent agents to ask for modifications, or to request more information on the analysis at hand while the pipeline is being developed and executed. This way, large portions of data analysis pipelines become automated in a fully controlled manner. We will show examples where we set up codes widely used in CMB analyses, such as CAMB and CLASS, to perform calculations and execute them within minutes, where the same task would take a human hours. This illustrates the first small steps towards fully AI-assisted cosmological data analysis.
10:00 Coffee break
Coffee break
10:00 - 10:30
Room: Public Exhibition Space
10:30 Clustering and Data Visualisation Algorithms for Astrophysics (Adam Knowles & Dharmesh Mistry / LJMU) - Dharmesh Mistry (LJMU) Adam Knowles (LJMU)
Clustering and Data Visualisation Algorithms for Astrophysics (Adam Knowles & Dharmesh Mistry / LJMU)
- Dharmesh Mistry (LJMU)
- Adam Knowles (LJMU)
10:30 - 12:30
Room: Ann Walker Seminar Room
Clustering reveals natural groupings within data, while dimensionality reduction enables the easy visualisation of underlying structures in high-dimensional data. They are essential, powerful tools for exploratory data analysis and pattern recognition. Such methods are readily available, easily implemented and commonplace in many disciplines including astrophysics and healthcare. In this session, we will give an overview of a few widely used algorithms, present some practical applications to different datasets and provide some hands-on experience through group activities.
GIT workshop (Joao Bento / LJMU) - Joao Bento (LJMU)
GIT workshop (Joao Bento / LJMU)
- Joao Bento (LJMU)
10:30 - 12:30
Room: Mason Owen Boardroom (3rd Floor)
Git is the fundamental tool for version control and collaborative coding and is easily the most widely used piece of software by developers worldwide. As such, it is a critical skill in the arsenal of anyone who does any kind of development. In this session, we will introduce the foundations of Git, motivate the use of this tool, and proceed with a hands-on workshop in which you will use Git in practice on new and existing code repositories. To participate in this hands-on workshop, you will need to have Git installed on your system, a modern integrated development environment (such as Visual Studio Code) also installed, and a GitHub account already set up to authenticate with your computer.
How to make ultra-fast predictions with neural network emulators (Boris Bolliet / Cambridge University) - Boris Bolliet (Cambridge University)
How to make ultra-fast predictions with neural network emulators (Boris Bolliet / Cambridge University)
- Boris Bolliet (Cambridge University)
10:30 - 12:30
Room: Lecture Room 2
We will explain how to train a neural network which evaluates the Cosmic Microwave Background (CMB) power spectra a thousand times faster than a full Boltzmann solver. We will work with a Google Colab notebook and build a simple neural network from scratch, using training data that will be provided. We will adopt and follow the strategy presented in Alsing et al (2020), Spurio-Mancini et al (2022), and Bolliet et al (2024). Although our example is based on CMB spectra, our approach is general and applicable to a wide range of problems of interpolation in high-dimensional space.
12:30 Lunch
Lunch
12:30 - 13:30
Room: Public Exhibition Space
13:30 (Machine) Learning to create artwork and quantum fields (Pavel Buividovich / University of Liverpool) - Pavel Buividovich (University of Liverpool)
(Machine) Learning to create artwork and quantum fields (Pavel Buividovich / University of Liverpool)
- Pavel Buividovich (University of Liverpool)
13:30 - 14:30
Room: The Johnson Foundation Auditorium
Machine Learning algorithms for image generation are often considered as competitors for human artistic activities. We discuss a somewhat unexpected aspect of this competition, demonstrating how GenAI can help to detect human-made artwork forgeries made in the style of famous painters. We then discuss a less visual, but technically more advanced application of GenAI to the generation of "snapshots" (configurations) of quantum fields. In contrast to image generation, where success is measured by human perception, this application of GenAI imposes much stricter constraints on statistical properties of the output data which motivate deeper mathematical insights in GenAI models. We review some of the state-of-the-art GenAI algorithms suitable for generation of both artistic images and quantum fields.
14:30 Self supervised Learning in Astrophysics (Anna Scaife / University of Manchester)
Self supervised Learning in Astrophysics (Anna Scaife / University of Manchester)
14:30 - 15:30
Room: The Johnson Foundation Auditorium
I will review why self-supervised learning is so important for astronomy, what the current most popular approaches to self-supervised learning are, and how these are being applied to astronomical data. This lecture will recap some of the general principles of deep learning, but reading ahead is advised for those who are less familiar with the topic.

15:30 Coffee break
Coffee break
15:30 - 16:00
Room: GFLEX

16:00

16:00 - 17:00
Room: GFLEX

18:00

18:00 - 19:00
Room: LTA
Wednesday, 17 July 2024
09:00 Open source science and making an impact (James Nightingale / Newcastle University) - James Nightingale (Newcastle University)
Open source science and making an impact (James Nightingale / Newcastle University)
- James Nightingale (Newcastle University)
09:00 - 10:00
Room: The Johnson Foundation Auditorium
Open source science applies the principles of open source software development to scientific research, emphasizing transparency, collaboration, and accessibility to make scientific knowledge and data freely available. I will argue that adopting these practices to the highest standard is crucial for the advancement of science. Through case studies from the “reproducibility crisis”, I will highlight the potentially devastating consequences of not practicing open science. I will then demonstrate how projects like SciPy, NumPy, and Pandas have transformed the research landscape. Open source science requires significant time, effort, and energy, and may not always be rewarded by the current scientific funding landscape. Nevertheless, I will contend that it ultimately enhances your research output and productivity, and is a compelling means to build collaboration with commercial industry partners.
10:00 Coffee break
Coffee break
10:00 - 10:30
Room: Public Exhibition Space
10:30 Big Data Python ecosystem for HEP analysis (Eduardo Rodrigues / University of Liverpool) - Eduardo Rodrigues (University of Liverpool)
Big Data Python ecosystem for HEP analysis (Eduardo Rodrigues / University of Liverpool)
- Eduardo Rodrigues (University of Liverpool)
10:30 - 12:30
Room: Archibald Bathgate Seminar Room
Data analysis in High Energy Physics (HEP) has evolved considerably in recent years, with "Big Data" tools increasingly used. Python is established as a programming language for analysis work, and a HEP-specific ecosystem that connects well with the wider scientific Python ecosystem is both mature at this point and under continuous development. I will discuss HEP data as Big Data, and Python and the analysis ecosystem provided by various community domain-specific projects. I will dwell in particular on the Scikit-HEP project, which I started in late 2016 with a few colleagues from various backgrounds and domains of expertise. It is now part of the official software stack of the experiments ATLAS, Belle II, CMS, KM3NeT and LHCb.
Publishing code (Ed Bennett / Swansea University) - Ed Bennett (Swansea University)
Publishing code (Ed Bennett / Swansea University)
- Ed Bennett (Swansea University)
10:30 - 12:30
Room: Mason Owen Boardroom (3rd Floor)
It’s becoming increasingly important to share the code we use to produce the results that we share in papers, to comply with increasingly strict guidance from our funders, and to ensure that others are able to reproduce our work. In this lesson, we’ll explore how to do this: how to specify the computational environment used to generate a result in a way that others can reproduce it, and how to make code available in a way that can be cited, and will continue to be accessible in the future. This lesson assumes existing basic knowledge of the Python programming language and the Git version control system, and that you have a scientific Python setup and Git installed on your computer.
PyAutofit: Classy Probabilistic Programming (James Nightingale / Newcastle University) - James Nightingale (Newcastle University)
PyAutofit: Classy Probabilistic Programming (James Nightingale / Newcastle University)
- James Nightingale (Newcastle University)
10:30 - 12:30
Room: Lecture Room 2
A major trend in physics, astronomy and healthcare is the rapid adoption of Bayesian statistics for data analysis and modeling. With modern datasets growing by orders of magnitude in size, the focus is now on developing methods capable of applying contemporary inference techniques to extremely large datasets. To this aim, I present PyAutoFit (https://github.com/rhayes777/PyAutoFit), an open-source probabilistic programming language for automated Bayesian inference. In this hands-on demonstration, I will:
1) Give an overview of how to compose a probabilistic model and perform automated Bayesian inference.
2) Demonstrate a simple model-fitting example using a cosmology-based science case.
3) Illustrate the use of Bayesian graphs to perform simultaneous inference on thousands of datasets.
12:30 Lunch
Lunch
12:30 - 13:30
Room: Public Exhibition Space

13:30 Free afternoon
Free afternoon
13:30 - 17:00

21:00 Live astronomy with Liverpool Telescope (Helen Jermak and Chris Copperwheat / LJMU)
Live astronomy with Liverpool Telescope (Helen Jermak and Chris Copperwheat / LJMU)
21:00 - 22:00
Thursday, 18 July 2024
09:00 Graph Neural Nets, Application of AI to Sports (Zhe Wang / DeepMind) - Zhe Wang (DeepMind)
Graph Neural Nets, Application of AI to Sports (Zhe Wang / DeepMind)
- Zhe Wang (DeepMind)
09:00 - 10:00
Room: The Johnson Foundation Auditorium
Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI’s model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.
10:00 Coffee break
Coffee break
10:00 - 10:30
Room: Public Exhibition Space

10:30 Foundational AI: Into the World of Large Language Models and Transformers (Naimuri)
Foundational AI: Into the World of Large Language Models and Transformers (Naimuri)
10:30 - 12:30
Room: Lecture Room 1
In this workshop, participants will delve into the foundational concepts underlying large language models (LLMs). We will begin by exploring tokenisation, including word-based, character-based and subword-based approaches. Next, we will cover word embeddings, with a particular focus on word2vec. This will be followed by an in-depth look at self-attention and the transformer architecture. Attendees will then be divided into groups to experiment hands-on with different LLMs, applying their new knowledge and gaining practical experience.
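As a rough illustration of the three tokenisation approaches named in the abstract (not taken from the workshop material), the sketch below splits text by word, by character, and by subword. The function names and the toy vocabulary are hypothetical, and the subword splitter is a simple greedy longest-match scheme chosen for brevity, not the method of any particular LLM.

```python
import re


def word_tokens(text):
    """Word-based: runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    """Character-based: every character is a token."""
    return list(text)


def subword_tokens(word, vocab):
    """Subword-based: greedily take the longest vocabulary entry at each position.

    Returns ["[UNK]"] if some part of the word cannot be matched at all.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches here
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"token", "is", "ation", "s"}

if __name__ == "__main__":
    print(word_tokens("Hello, world!"))
    print(char_tokens("hi"))
    print(subword_tokens("tokenisation", TOY_VOCAB))
```

Word-based splitting keeps tokens readable but needs a very large vocabulary, character-based splitting needs a tiny vocabulary but produces long sequences, and subword splitting sits between the two.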
How to not make Numpy slow (Ed Bennett / Swansea University) - Ed Bennett (Swansea University)
How to not make Numpy slow (Ed Bennett / Swansea University)
- Ed Bennett (Swansea University)
10:30 - 12:30
Room: Lecture Room 2
NumPy is one of the most well-recognised ways to achieve good performance for numerical computation in Python. However, this performance is not guaranteed: it is possible to write NumPy code that is slower than the equivalent plain Python. In this workshop we’ll explore how to avoid these pitfalls, and in some cases obtain speedups of over 200x, while also reducing the volume of code.
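One common pitfall of the kind the abstract alludes to is iterating over a NumPy array element by element in Python instead of using a whole-array operation. The sketch below is an assumed example, not taken from the workshop material, and the function names are hypothetical. Both functions compute the same sum of squares; the timing code prints measurements without asserting any particular speedup, since that depends on the machine and array size.

```python
import timeit

import numpy as np


def sum_of_squares_loop(values):
    """Element-by-element loop: each step pays Python and NumPy-scalar overhead."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorised(values):
    """Whole-array operation: the loop runs inside NumPy's compiled code."""
    return float(np.dot(values, values))


if __name__ == "__main__":
    data = np.arange(100_000, dtype=np.float64)
    for func in (sum_of_squares_loop, sum_of_squares_vectorised):
        seconds = timeit.timeit(lambda: func(data), number=10)
        print(f"{func.__name__}: {seconds:.4f} s for 10 calls")
```

The vectorised version is also shorter, which matches the abstract's point that avoiding these pitfalls can reduce the volume of code.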
Simulated project scenarios – Real-world challenges in data science (NHS England)
Simulated project scenarios – Real-world challenges in data science (NHS England)
10:30 - 12:30
Room: Ann Walker Seminar Room
This workshop will consider the real-world challenges of applying data science in healthcare by considering the system rather than just the solution. In this pen-and-paper interactive workshop we will consider a fictional patient and how data science could support their care by:
- Setting proposed solutions within the complex data landscape
- Highlighting the range of personas and considerations which need to be balanced when applying data science in healthcare
- Emphasising the need for holistic thinking and linking this to the competencies required to succeed in the health industry in the public sector
The session will be delivered by two esteemed data scientists from the central Data Science Team in NHS England.

12:30 Lunch
Lunch
12:30 - 13:30
Room: Public Exhibition Space

13:30

13:30 - 15:00
Room: The Johnson Foundation Auditorium

15:00 Coffee break
Coffee break
15:00 - 15:30
Room: Public Exhibition Space

15:30

15:30 - 17:00
Room: The Johnson Foundation Auditorium

19:00 Conference dinner
Conference dinner
19:00 - 22:00
Friday, 19 July 2024
09:30 Kaggle competition - David Hutchcroft (University of Liverpool)
Kaggle competition
- David Hutchcroft (University of Liverpool)
09:30 - 10:00
Room: The Johnson Foundation Auditorium
10:00 Coffee break
Coffee break
10:00 - 10:30
Room: Public Exhibition Space

10:30 Kaggle competition
Kaggle competition
10:30 - 12:30
Room: The Johnson Foundation Auditorium

12:30 Lunch
Lunch
12:30 - 13:30
Room: Public Exhibition Space

13:30 Prize ceremony and wrap up
Prize ceremony and wrap up
13:30 - 14:00
Room: The Johnson Foundation Auditorium