STFC School on Data Intensive Science 2024

Europe/London
John Lennon Art and Design Building

Duckinfield Street Liverpool L3 5RD
Description

This week-long School will provide PhD students who are active in data-intensive science with additional skills to support their research, help them make industry placements a success, and offer advice on possible career pathways in industry. The event consists of hands-on workshops, plenary talks, group discussions and evening events.

Registration deadline: 21 June 2024

Payment deadline extended: 28 June 2024

The previous school was hosted by Durham University.

    • 18:00 20:00
      Welcome Reception 2h
    • 09:00 10:00
      Welcome and Introduction (Carsten Welsch / University of Liverpool, Julie Sheldon / LJMU, Adam Ruby / KPMG Liverpool) 1h
      Speakers: Carsten Welsch (University of Liverpool), Adam Ruby (KPMG Liverpool), Julie Sheldon (LJMU)
    • 10:00 10:30
      Coffee break 30m
    • 10:30 12:30
      Moving towards intelligent data (Louise Butcher / Hartree) 2h

      Participants will learn about collecting good quality, unbiased data, preparing the data for modelling and exploring some simple machine learning models. There will be a largely practical element to help you work with your data. There will be opportunities to consider how to apply machine learning to participants’ own data problems, using freely available open source tools.
      Learning Objectives
      - How data science happens in the real world
      - What needs to be done to make real data ready for machine learning
      - What methods work with real data
      - How to handle data legally
      - What ethical and social issues surround the use of AI
      Pre-requisites
      - Students should have a working knowledge of Python, including use of Pandas.
      - Suggested viewing: create a free account and watch Beginner's Guide to Data Collection

      Speaker: Louise Butcher (Hartree)
    • 10:30 12:30
      Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) 2h

      A comprehensive learning experience tailored to equip participants with the knowledge and practical skills necessary to excel in the dynamic field of data engineering and big data processing. Participants in the practical session will use Databricks: https://community.cloud.databricks.com/login.html
      Learning Objectives
      - Understand the key concepts of data engineering and big data processing.
      - Describe the architecture and functionalities of Apache Spark.
      - Utilize Spark SQL and Data Frames for data querying and analysis.
      - Perform data transformations and aggregations using Spark functions.
      Pre-requisites
      - Basic understanding of programming concepts (Python)
      - Familiarity with relational databases
      - Understanding data warehousing concepts and ETL process (advantageous but not mandatory)
      - Recommended STFC training: enrol for free, then watch the video Practical Guide to Data Engineering

      Speaker: Ajay Rawat (Hartree)
    • 10:30 12:30
      Practical guide to Neural Networks (Michail Smyrnakis / Hartree) 2h

      This session will guide the participant through some of the practical considerations to make when looking at how neural networks can be used. You will also complete practical exercises where you will be introduced to the two main Python libraries that are used with neural networks, namely PyTorch and TensorFlow. You will gain some hands-on experience of applying existing libraries and pretrained neural networks to various small-scale problems.
      Learning Objectives
      - Understand the concept of Artificial Neural Networks and Deep Neural Networks.
      - Gain familiarity with different types of advanced neural networks and areas of application for each type.
      - Gain hands-on experience with the two main Python libraries for Deep Neural Nets (PyTorch and TensorFlow).
      - Learn how to load data using TensorFlow and PyTorch.
      - Understand whether your trained models have underfit or overfit the data.
      - Learn how to define a Convolutional Neural Network in both PyTorch and TensorFlow.
      - Explore the effects of hyperparameters.
      Pre-requisites
      - Understanding of basic mathematical concepts e.g. functions, matrices and derivatives. 
      - Minimal experience with Python

      Speaker: Michail Smyrnakis (Hartree)
    • 12:30 13:30
      Lunch 1h
    • 13:30 15:30
      Practical Guide to Data Engineering with Focus on Apache Spark (Ajay Rawat / Hartree) 2h
      Speaker: Ajay Rawat (Hartree)
    • 13:30 15:30
      Practical Guide to Machine Learning with data collection and preparation (Louise Butcher / Hartree) 2h
      Speaker: Louise Butcher (Hartree)
    • 13:30 15:30
      Practical guide to Neural Networks (Michail Smyrnakis / Hartree) 2h
      Speaker: Michail Smyrnakis (Hartree)
    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:00
      Poster session: 1
    • 09:00 10:00
      AgenticAI for Data Analysis (Boris Bolliet / Cambridge University) 1h

      Multi-agent systems consisting of Large Language Model (LLM) and Retrieval Augmented Generation (RAG) powered assistants are game-changing for data analysis tasks. Compared to a standard script that would run a pipeline from A to Z, we can interact dynamically with intelligent agents to ask for modifications, or to request more information on the analysis at hand while the pipeline is being developed and executed. This way, large portions of data analysis pipelines become automated in a fully controlled manner. We will show examples where we set up widely used codes in CMB analyses such as CAMB and CLASS to perform calculations and execute them within minutes, while the same would take hours for a human. This illustrates first small steps towards fully AI-assisted cosmological data analysis.

      Speaker: Boris Bolliet (Cambridge University)
    • 10:00 10:30
      Coffee break 30m
    • 10:30 12:30
      Clustering and Data Visualisation Algorithms for Astrophysics (Adam Knowles & Dharmesh Mistry / LJMU) 2h

      Clustering reveals natural groupings within data, while dimensionality reduction enables the easy visualisation of underlying structures in high-dimensional data. They are essential, powerful tools for exploratory data analysis and pattern recognition. Such methods are readily available, easily implemented and commonplace in many disciplines including astrophysics and healthcare. In this session, we will give an overview of a few widely-used algorithms, present some practical applications to different datasets and provide some hands-on experience through group activities.

      Speakers: Adam Knowles (LJMU), Dharmesh Mistry (LJMU)
    • 10:30 12:30
      Git workshop (Joao Bento / LJMU) 2h

      Git is the fundamental tool for version control and collaborative coding and is easily the most widely used piece of software by developers worldwide. As such, it is a critical skill in the arsenal of anyone who does any kind of development. In this session, we will introduce the foundations of Git, motivate the use of this tool, and proceed with a hands-on workshop in which you will use Git in practice on new and existing code repositories. In order to participate in this hands-on workshop, you will need to have Git installed on your system, a modern integrated development environment also installed (such as Visual Studio Code), and a GitHub account already set up to authenticate with your computer.

      Speaker: Joao Bento (LJMU)
    • 10:30 12:30
      How to make ultra-fast predictions with neural network emulators (Boris Bolliet / Cambridge University) 2h

      We will explain how to train a neural network which evaluates the Cosmic Microwave Background (CMB) power spectra a thousand times faster than a full Boltzmann solver. We will work with a Google Colab notebook and build a simple neural network from scratch, using training data that will be provided. We will adopt and follow the strategy presented in Alsing et al (2020), Spurio-Mancini et al (2022), and Bolliet et al (2024). Although our example is based on CMB spectra, our approach is general and applicable to a wide range of problems of interpolation in high-dimensional space.

      Speaker: Boris Bolliet (Cambridge University)
    • 12:30 13:30
      Lunch 1h
    • 13:30 14:30
      (Machine) Learning to create artwork and quantum fields (Pavel Buividovich / University of Liverpool) 1h

      Machine Learning algorithms for image generation are often considered as competitors for human artistic activities. We discuss a somewhat unexpected aspect of this competition, demonstrating how GenAI can help to detect human-made artwork forgeries made in the style of famous painters. We then discuss a less visual, but technically more advanced application of GenAI to the generation of "snapshots" (configurations) of quantum fields. In contrast to image generation, where success is measured by human perception, this application of GenAI imposes much stricter constraints on statistical properties of the output data which motivate deeper mathematical insights in GenAI models. We review some of the state-of-the-art GenAI algorithms suitable for generation of both artistic images and quantum fields.

      Speaker: Pavel Buividovich (University of Liverpool)
    • 14:30 15:30
      Self-supervised Learning in Astrophysics (Anna Scaife / University of Manchester) 1h

      I will review why self-supervised learning is so important for astronomy, what the current most popular approaches are for self-supervised learning and how these are being applied to astronomical data. This lecture will recap some of the general principles of deep-learning, but forward reading is advised for those who are less familiar with the topic.

    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:00
      Poster session: 2
    • 18:00 19:00
      Public talk (Anna Scaife, University of Manchester) LTA (CTH)

    • 09:00 10:00
      Open source science and making an impact (James Nightingale / Newcastle University) 1h

      Open source science applies the principles of open source software development to scientific research, emphasizing transparency, collaboration, and accessibility to make scientific knowledge and data freely available. I will argue that adopting these practices to the highest standard is crucial for the advancement of science. Through case studies from the “reproducibility crisis”, I will highlight the potentially devastating consequences of not practicing open science. I will then demonstrate how projects like SciPy, NumPy, and Pandas have transformed the research landscape. Open source science requires significant time, effort, and energy, and may not always be rewarded by the current scientific funding landscape. Nevertheless, I will contend that it ultimately enhances your research output and productivity, and is a compelling means to build collaboration with commercial industry partners.

      Speaker: James Nightingale (Newcastle University)
    • 10:00 10:30
      Coffee break 30m
    • 10:30 12:30
      Big Data Python ecosystem for HEP analysis (Eduardo Rodrigues / University of Liverpool) 2h
      Speaker: Eduardo Rodrigues (University of Liverpool)
    • 10:30 12:30
      Publishing code (Ed Bennett / Swansea University) 2h

      It’s becoming increasingly important to share the code we use to produce the results that we share in papers, to comply with increasingly strict guidance from our funders, and to ensure that others are able to reproduce our work. In this lesson, we’ll explore how to do this: how to specify the computational environment used to generate a result in a way that others can reproduce it, and how to make code available in a way that can be cited, and will continue to be accessible in the future. This lesson assumes existing basic knowledge of the Python programming language and the Git version control system, and that you have a scientific Python setup and Git installed on your computer.

      Speaker: Ed Bennett (Swansea University)
    • 10:30 12:30
      PyAutoFit: Classy Probabilistic Programming (James Nightingale / Newcastle University) 2h

      A major trend in physics, astronomy and healthcare is the rapid adoption of Bayesian statistics for data analysis and modeling. With modern datasets growing by orders of magnitude in size, the focus is now on developing methods capable of applying contemporary inference techniques to extremely large datasets. To this end, I present PyAutoFit (https://github.com/rhayes777/PyAutoFit), an open-source probabilistic programming language for automated Bayesian inference.

      In this hands-on demonstration, I will:
      1) Give an overview of how to compose a probabilistic model and perform automated Bayesian inference.
      2) Demonstrate a simple model-fitting example using a cosmology-based science case.
      3) Illustrate the use of Bayesian graphs to perform simultaneous inference on thousands of datasets.

      Speaker: James Nightingale (Newcastle University)
    • 12:30 13:30
      Lunch 1h
    • 13:30 14:30
      Bioinformatics and the Curse of Dimensionality (Euan McDonnell / University of Liverpool) 1h
      Speaker: Euan McDonnell (University of Liverpool)
    • 14:30 17:00
      Free afternoon 2h 30m
    • 20:00 22:00
      Live astronomy with Liverpool Telescope 2h
    • 09:00 10:00
      Graph Neural Nets, Application of AI to Sports (Zhe Wang / DeepMind) 1h

      Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI’s model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.

      Speaker: Zhe Wang (DeepMind)
    • 10:00 10:30
      Coffee break 30m
    • 10:30 12:30
      Foundational AI: Into the World of Large Language Models and Transformers (Naimuri) 2h

      In this workshop, participants will delve into the foundational concepts underlying large language models (LLMs). We will begin by exploring tokenisation, including word-based, character-based and subword-based approaches. Next, we will cover word embeddings, with a particular focus on word2vec. This will be followed by an in-depth look at self-attention and the transformer architecture. Attendees will then be divided into groups to experiment hands-on with different LLMs, applying their new knowledge and gaining practical experience.
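
      The three tokenisation approaches named above can be illustrated with a toy sketch. The function names, the example vocabulary and the greedy longest-match rule used for subwords are assumptions made purely for illustration; production tokenisers learn their vocabularies from data and are not shown here.

```python
def word_tokens(text):
    """Word-based: split on whitespace after lowercasing."""
    return text.lower().split()


def char_tokens(text):
    """Character-based: every character is its own token."""
    return list(text.lower())


def subword_tokens(word, vocab):
    """Subword-based: greedy longest-match split against a given vocabulary.

    `vocab` is a hypothetical, hand-written set of known pieces. If no piece
    matches at some position, the whole word is mapped to "[UNK]".
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"token", "is", "ation"}  # illustrative vocabulary only
print(word_tokens("Large language models"))
print(char_tokens("LLM"))
print(subword_tokens("tokenisation", toy_vocab))
```

      Word-based tokens keep vocabularies large but sequences short, character-based tokens do the opposite, and subword pieces sit in between.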

    • 10:30 12:30
      How to not make NumPy slow (Ed Bennett / Swansea University) 2h

      NumPy is one of the most well-recognised ways to achieve good performance for numerical computation in Python. However, this performance is not guaranteed: it is possible to write NumPy code that is slower than the equivalent plain Python. In this workshop we’ll explore how to avoid these pitfalls, and in some cases obtain speedups of over 200x, while also reducing the volume of code.
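
      As a minimal sketch of the kind of pitfall described above (an illustrative assumption, not taken from the workshop material), the hypothetical functions below compute the same sum of squares in two ways: by indexing an array element by element inside a Python loop, and with a single vectorised call. The first pattern pays Python-level overhead on every element and is typically much slower; actual timings depend on the machine and the array size.

```python
import numpy as np


def sum_squares_loop(values):
    """Pitfall: index the array one element at a time in a Python loop."""
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]
    return total


def sum_squares_vectorised(values):
    """Vectorised: one call, with the loop running inside compiled code."""
    return float(np.dot(values, values))


data = np.arange(1000, dtype=np.float64)

# Both versions agree on the result; only the speed and code volume differ.
assert np.isclose(sum_squares_loop(data), sum_squares_vectorised(data))
```

      Timing both functions, for example with the standard-library `timeit` module, shows the difference on a given machine.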

      Speaker: Ed Bennett (Swansea University)
    • 10:30 12:30
      Simulated project scenarios – Real-world challenges in data science (NHS England) 2h
    • 12:30 13:30
      Lunch 1h
    • 13:30 15:30
      Industry session

      Confirmed speakers:
      Alice Morris (the Guardian), Sarah McDonald (Dogs Trust), Jonny Pearson (NHS England), Selina Dhinsey (Multiverse)

    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:00
      Industry session

      Confirmed speakers:
      Alice Morris (the Guardian), Sarah McDonald (Dogs Trust), Jonny Pearson (NHS England), Selina Dhinsey (Multiverse)

    • 19:00 22:00
      Conference dinner 3h
    • 09:00 10:00
      Computing competition
    • 10:00 10:30
      Coffee break 30m
    • 10:30 12:30
      Computing competition: continued
    • 12:30 13:30
      Lunch 1h
    • 13:30 14:00
      Prize ceremony and wrap up 30m