STFC School on Data Intensive Science 2024

Europe/London
The Johnson Foundation Auditorium (John Lennon Art and Design Building)

Duckinfield Street Liverpool L3 5RD
Description

This week-long School will provide PhD students who are active in data intensive science with additional skills to support their research, help them make industry placements a success, and offer advice on possible career pathways in industry. The event consists of hands-on workshops, plenary talks, group discussions and evening events.

Registration deadline: 21 June 2024

Payment deadline extended: 28 June 2024

The previous school was hosted by Durham University.

    • 18:00
      Welcome Reception Atrium (Central Teaching Laboratories, University of Liverpool)

    • 1
      Welcome and Introduction (Carsten Welsch / University of Liverpool, Julie Sheldon / LJMU, Naomi Smith / University of Liverpool) The Johnson Foundation Auditorium

      Speakers: Carsten Welsch (University of Liverpool), Julie Sheldon (LJMU), Naomi Smith (Cockcroft Institute/University of Liverpool)
    • 10:00
      Coffee break Public Exhibition Space

    • 2
      Moving towards intelligent data (Louise Butcher / Hartree) Ann Walker Seminar Room

      John Lennon Art and Design Building

      Participants will learn about collecting good-quality, unbiased data, preparing the data for modelling and exploring some simple machine learning models. The session is largely practical, to help you work with your data. There will be opportunities to consider how to apply machine learning to participants’ own data problems, using freely available open-source tools.
      Learning Objectives
      - How data science happens in the real world
      - What needs to be done to make real data ready for machine learning
      - What methods work with real data
      - How to handle data legally
      - What ethical and social issues surround the use of AI
      Pre-requisites
      - Students should have a working knowledge of Python, including use of Pandas.
      - Suggested viewing: create a free account and watch Beginner's Guide to Data Collection

      Speaker: Louise Butcher (Hartree)
    • 3
      Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) Archibald Bathgate seminar room

      John Lennon Art and Design Building

      A comprehensive learning experience tailored to equip participants with the knowledge and practical skills necessary to excel in the dynamic field of data engineering and big data processing. In the practical session, participants will use Databricks: https://community.cloud.databricks.com/login.html
      Learning Objectives
      - Understand the key concepts of data engineering and big data processing.
      - Describe the architecture and functionalities of Apache Spark.
      - Utilize Spark SQL and DataFrames for data querying and analysis.
      - Perform data transformations and aggregations using Spark functions.
      Pre-requisites
      - Basic understanding of programming concepts (Python)
      - Familiarity with relational databases
      - Understanding data warehousing concepts and ETL process (advantageous but not mandatory)
      - Recommended STFC training: enrol for free, then watch the video Practical Guide to Data Engineering

      Speaker: Ajay Rawat (Hartree)
    • 4
      Practical guide to Neural Networks (Michail Smyrnakis / Hartree) Lecture Room 2

      John Lennon Art and Design Building

      This session will guide the participant through some of the practical considerations to make when looking at how neural networks can be used. You will also complete practical exercises in which you will be introduced to the two main Python libraries used with neural networks, namely PyTorch and TensorFlow. You will gain some hands-on experience of applying existing libraries and pretrained neural networks to various small-scale problems.
      Learning Objectives
      - Understand the concept of Artificial Neural Networks and Deep Neural Networks.
      - Gain familiarity with different types of advanced neural networks and areas of application for each type.
      - Gain hands-on experience with the two main Python libraries for deep neural networks (PyTorch and TensorFlow).
      - Learn how to load data using TensorFlow and PyTorch.
      - Understand whether your trained models have underfitted or overfitted the data.
      - Learn how to define a Convolutional Neural Network in both PyTorch and TensorFlow.
      - Explore the effects of hyperparameters.
      Pre-requisites
      - Understanding of basic mathematical concepts e.g. functions, matrices and derivatives. 
      - Minimal experience with Python

      Speaker: Michail Smyrnakis (Hartree)
    • 12:30
      Lunch Public Exhibition Space

    • 5
      Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) Archibald Bathgate Seminar Room

      John Lennon Art and Design Building

      Speaker: Ajay Rawat (Hartree)
    • 6
      Practical Guide to Machine Learning with data collection and preparation (Louise Butcher / Hartree) Ann Walker Seminar Room

      John Lennon Art and Design Building

      Speaker: Louise Butcher (Hartree)
    • 7
      Practical guide to Neural Networks (Michail Smyrnakis / Hartree) Lecture Room 2

      John Lennon Art and Design Building

      Speaker: Michail Smyrnakis (Hartree)
    • 15:30
      Coffee break GFLEX (Central Teaching Laboratories, University of Liverpool)

    • Icebreaker: Communicating Your Research GFLEX (Central Teaching Laboratories, University of Liverpool)

    • 8
      AgenticAI for Data Analysis (Boris Bolliet / Cambridge University) The Johnson Foundation Auditorium

      Multi-agent systems consisting of Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) powered assistants are game-changing for data analysis tasks. Compared to a standard script that would run a pipeline from A to Z, we can interact dynamically with intelligent agents to ask for modifications, or to request more information on the analysis at hand while the pipeline is being developed and executed. This way, large portions of data analysis pipelines become automated in a fully controlled manner. We will show examples where we set up codes widely used in CMB analyses, such as CAMB and CLASS, to perform calculations and execute them within minutes, while the same would take hours for a human. This illustrates first small steps towards fully AI-assisted cosmological data analysis.

      Speaker: Boris Bolliet (Cambridge University)
    • 10:00
      Coffee break Public Exhibition Space

    • 9
      Clustering and Data Visualisation Algorithms for Astrophysics (Adam Knowles & Dharmesh Mistry / LJMU) Ann Walker Seminar Room

      John Lennon Art and Design Building

      Clustering reveals natural groupings within data, while dimensionality reduction enables the easy visualisation of underlying structures in high-dimensional data. Both are essential, powerful tools for exploratory data analysis and pattern recognition. Such methods are readily available, easily implemented and commonplace in many disciplines including astrophysics and healthcare. In this session, we will give an overview of a few widely used algorithms, present some practical applications to different datasets and provide some hands-on experience through group activities.

      Speakers: Adam Knowles (LJMU), Dharmesh Mistry (LJMU)
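
      The two ideas named in the clustering abstract above can be illustrated with a minimal NumPy sketch. It is not the speakers' material: the synthetic two-blob dataset and the helper names `pca_2d` and `kmeans` are assumptions made for illustration, with PCA standing in for dimensionality reduction and k-means (with a simple farthest-point initialisation) standing in for clustering.

```python
import numpy as np


def pca_2d(X):
    """Project X (n_samples, n_features) onto its first two principal components."""
    Xc = X - X.mean(axis=0)  # centre each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialisation.

    Returns (labels, centres).
    """
    rng = np.random.default_rng(seed)
    # Start from one random point, then repeatedly add the point that is
    # farthest from all centres chosen so far.
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[nearest.argmax()])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points (keep it if it has none).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


# Synthetic data (an assumption): two well-separated blobs in five dimensions.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 5)),
])

labels, centres = kmeans(X, k=2)
X_2d = pca_2d(X)  # 2-D coordinates, e.g. for a scatter plot coloured by label
```

      Plotting `X_2d` coloured by `labels` would give the kind of exploratory view the abstract describes; real datasets usually need feature scaling and a considered choice of `k` first.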
    • 10
      Git workshop (Joao Bento / LJMU) Mason Owen Boardroom (3rd Floor)

      John Lennon Art and Design Building

      Git is the fundamental tool for version control and collaborative coding and is easily the most widely used piece of software by developers worldwide. As such, it is a critical skill in the arsenal of anyone who does any kind of development. In this session, we will introduce the foundations of Git, motivate the use of this tool, and proceed with a hands-on workshop in which you will use Git in practice on new and existing code repositories. To participate in this hands-on workshop, you will need to have Git installed on your system, a modern integrated development environment (such as Visual Studio Code) installed, and a GitHub account already set up to authenticate with your computer.

      Speaker: Joao Bento (LJMU)
    • 11
      How to make ultra-fast predictions with neural network emulators (Boris Bolliet / Cambridge University) Lecture Room 2

      John Lennon Art and Design Building

      We will explain how to train a neural network that evaluates the Cosmic Microwave Background (CMB) power spectra a thousand times faster than a full Boltzmann solver. We will work with a Google Colab notebook and build a simple neural network from scratch, using training data that will be provided. We will adopt and follow the strategy presented in Alsing et al. (2020), Spurio-Mancini et al. (2022), and Bolliet et al. (2024). Although our example is based on CMB spectra, our approach is general and applicable to a wide range of interpolation problems in high-dimensional space.

      Speaker: Boris Bolliet (Cambridge University)
    • 12:30
      Lunch Public Exhibition Space

    • 12
      (Machine) Learning to create artwork and quantum fields (Pavel Buividovich / University of Liverpool) The Johnson Foundation Auditorium

      Machine Learning algorithms for image generation are often considered as competitors for human artistic activities. We discuss a somewhat unexpected aspect of this competition, demonstrating how GenAI can help to detect human-made artwork forgeries made in the style of famous painters. We then discuss a less visual, but technically more advanced application of GenAI to the generation of "snapshots" (configurations) of quantum fields. In contrast to image generation, where success is measured by human perception, this application of GenAI imposes much stricter constraints on statistical properties of the output data which motivate deeper mathematical insights in GenAI models. We review some of the state-of-the-art GenAI algorithms suitable for generation of both artistic images and quantum fields.

      Speaker: Pavel Buividovich (University of Liverpool)
    • 13
      Self-supervised Learning in Astrophysics (Anna Scaife / University of Manchester) The Johnson Foundation Auditorium

      I will review why self-supervised learning is so important for astronomy, what the current most popular approaches to self-supervised learning are, and how these are being applied to astronomical data. This lecture will recap some of the general principles of deep learning, but prior reading is advised for those who are less familiar with the topic.

    • 15:30
      Coffee break GFLEX (Central Teaching Laboratories, University of Liverpool)

    • Poster session GFLEX (Central Teaching Laboratories, University of Liverpool)

    • Public talk (Anna Scaife, University of Manchester) LTA (CTH)

      https://artificial-intelligence-and-aliens.eventbrite.co.uk

    • 14
      Open source science and making an impact (James Nightingale / Newcastle University) The Johnson Foundation Auditorium

      Open source science applies the principles of open source software development to scientific research, emphasizing transparency, collaboration, and accessibility to make scientific knowledge and data freely available. I will argue that adopting these practices to the highest standard is crucial for the advancement of science. Through case studies from the “reproducibility crisis”, I will highlight the potentially devastating consequences of not practicing open science. I will then demonstrate how projects like SciPy, NumPy, and Pandas have transformed the research landscape. Open source science requires significant time, effort, and energy, and may not always be rewarded by the current scientific funding landscape. Nevertheless, I will contend that it ultimately enhances your research output and productivity, and is a compelling means to build collaboration with commercial industry partners.

      Speaker: James Nightingale (Newcastle University)
    • 10:00
      Coffee break Public Exhibition Space

    • 15
      Big Data Python ecosystem for HEP analysis (Eduardo Rodrigues / University of Liverpool) Archibald Bathgate Seminar Room (John Lennon Art and Design Building)

      Data analysis in High Energy Physics (HEP) has evolved considerably in recent years, with "Big Data" tools being ever more used. Python as a programming language for analysis work is established and a HEP-specific ecosystem connecting well with the wider scientific Python ecosystem is both mature at this point and under continuous development.
      I will discuss HEP data as Big Data, Python and its analysis ecosystem provided by various community domain-specific projects. I will dwell in particular on the Scikit-HEP project, which I started in late 2016 with a few colleagues from various backgrounds and domains of expertise. It is now part of the official software stack of the experiments ATLAS, Belle II, CMS, KM3NeT and LHCb.

      Speaker: Eduardo Rodrigues (University of Liverpool)
    • 16
      Publishing code (Ed Bennett / Swansea University) Mason Owen Boardroom (3rd Floor) (John Lennon Art and Design Building)

      It’s becoming increasingly important to share the code we use to produce the results that we share in papers, to comply with increasingly strict guidance from our funders, and to ensure that others are able to reproduce our work. In this lesson, we’ll explore how to do this: how to specify the computational environment used to generate a result in a way that others can reproduce it, and how to make code available in a way that can be cited, and will continue to be accessible in the future. This lesson assumes existing basic knowledge of the Python programming language and the Git version control system, and that you have a scientific Python setup and Git installed on your computer.

      Speaker: Ed Bennett (Swansea University)
    • 17
      PyAutoFit: Classy Probabilistic Programming (James Nightingale / Newcastle University) Lecture Room 2

      John Lennon Art and Design Building

      A major trend in physics, astronomy and healthcare is the rapid adoption of Bayesian statistics for data analysis and modeling. With modern datasets growing by orders of magnitude in size, the focus is now on developing methods capable of applying contemporary inference techniques to extremely large datasets. To this end, I present PyAutoFit (https://github.com/rhayes777/PyAutoFit), an open-source probabilistic programming language for automated Bayesian inference.

      In this hands-on demonstration, I will:
      1) Give an overview of how to compose a probabilistic model and perform automated Bayesian inference.
      2) Demonstrate a simple model-fitting example using a cosmology-based science case.
      3) Illustrate the use of Bayesian graphs to perform simultaneous inference on thousands of datasets.

      Speaker: James Nightingale (Newcastle University)
    • 12:30
      Lunch Public Exhibition Space

    • 13:30
      Free afternoon
    • 21:00
      Live astronomy with Liverpool Telescope (Helen Jermak and Chris Copperwheat / LJMU) Online

      Join the team to learn more about the Liverpool Telescope (LT)! During this session we'll connect remotely to the telescope and take some observations in real-time (weather permitting). The LT first started observing in 2003 and became fully autonomous in 2004. Since then it has been observing without humans for 20 years, taking observations of gamma-ray bursts, supernovae and many other objects using a wide range of instrumentation. Join astronomers from the LT and its future companion, the New Robotic Telescope, to talk about observations, data and the exciting science behind autonomous telescopes.

    • 18
      Graph Neural Nets, Application of AI to Sports (Zhe Wang / DeepMind) The Johnson Foundation Auditorium

      Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI’s model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.

      Speaker: Zhe Wang (DeepMind)
    • 10:00
      Coffee break Public Exhibition Space

    • 19
      Foundational AI: Into the World of Large Language Models and Transformers (Naimuri) Lecture room 1

      John Lennon Art and Design Building

      In this workshop, participants will delve into the foundational concepts underlying large language models (LLMs). We will begin by exploring tokenisation, including word-based, character-based and subword-based approaches. Next, we will cover word embeddings, with a particular focus on word2vec. This will be followed by an in-depth look at self-attention and the transformer architecture. Attendees will then be divided into groups to experiment hands-on with different LLMs, applying their new knowledge and gaining practical experience.
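
      As a rough illustration of the self-attention step mentioned above, the following NumPy sketch computes single-head scaled dot-product attention over a handful of token embeddings. It is not the workshop's material: the random embeddings and projection matrices, the dimensions, and the function names `softmax` and `self_attention` are assumptions, and a real transformer would learn the projections and add multiple heads, masking and positional information.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X has shape (n_tokens, d_model); Wq, Wk, Wv have shape (d_model, d_head).
    Returns (output, attention_weights).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Similarity of every query with every key, scaled by sqrt(d_head).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


# Hypothetical inputs: 4 tokens with 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
```

      Row `i` of `weights` shows how strongly token `i` attends to every token in the sequence, and `out` is the corresponding weighted mixture of value vectors.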

    • 20
      How to not make NumPy slow (Ed Bennett / Swansea University) Lecture Room 2

      John Lennon Art and Design Building

      NumPy is one of the most well-recognised ways to achieve good performance for numerical computation in Python. However, this performance is not guaranteed: it is possible to write NumPy code that is slower than the equivalent plain Python. In this workshop we’ll explore how to avoid these pitfalls, and in some cases obtain speedups of over 200x, while also reducing the volume of code.

      Speaker: Ed Bennett (Swansea University)
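
      One common pitfall of the kind this abstract alludes to is looping over a NumPy array element by element in Python. The sketch below is an assumption about what such a pitfall looks like, not an excerpt from the workshop; the function names are hypothetical. Both functions compute the same sum of squares, but the first pays Python-level overhead on every element while the second hands the whole array to compiled code.

```python
import numpy as np


def sum_of_squares_loop(a):
    """Element-by-element loop: each x is boxed into a NumPy scalar object."""
    total = 0.0
    for x in a:
        total += x * x
    return float(total)


def sum_of_squares_vectorised(a):
    """Same result from a single call into compiled code."""
    return float(np.dot(a, a))


a = np.arange(10_000, dtype=np.float64)

slow = sum_of_squares_loop(a)
fast = sum_of_squares_vectorised(a)
```

      Timing the two with `timeit` on your own machine is the simplest way to see the difference; the size of the gap depends on array length and hardware.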
    • 21
      Simulated project scenarios – Real-world challenges in data science (NHS England) Ann Walker Seminar Room (John Lennon Art and Design Building)

      This workshop will consider the real-world challenges of applying data science in healthcare by considering the system rather than just the solution. In this interactive pen-and-paper workshop we will consider a fictional patient and how data science could support their care by:
      - Setting proposed solutions within the complex data landscape
      - Highlighting the range of personas and considerations which need to be balanced when applying data science in healthcare
      - Emphasising the need for holistic thinking and linking this to the competencies required to succeed in the health industry in the public sector

      The session will be delivered by two esteemed data scientists from the central Data Science Team in NHS England.

    • 12:30
      Lunch Public Exhibition Space

    • Industry session The Johnson Foundation Auditorium

      Confirmed speakers:
      Alice Morris (the Guardian), Sarah McDonald (Dogs Trust), Jonny Pearson (NHS England), Selina Dhinsey (Multiverse)

    • 15:00
      Coffee break Public Exhibition Space

    • Industry session The Johnson Foundation Auditorium

      Confirmed speakers:
      Alice Morris (the Guardian), Sarah McDonald (Dogs Trust), Jonny Pearson (NHS England), Selina Dhinsey (Multiverse)

    • 19:00
      Conference dinner Alma de Cuba

    • 22
      Kaggle competition The Johnson Foundation Auditorium

      Speaker: David Hutchcroft (University of Liverpool)
    • 10:00
      Coffee break Public Exhibition Space

    • 23
      Kaggle competition The Johnson Foundation Auditorium

    • 12:30
      Lunch Public Exhibition Space

    • 24
      Prize ceremony and wrap up The Johnson Foundation Auditorium
