STFC School on Data Intensive Science 2020

Europe/London
Zoom Meeting

Carsten Welsch (University of Liverpool)
Description

This STFC-funded school will provide PhD students who are active in data-intensive science with additional skills to support their research, help them make industry placements a success, and offer advice on possible career pathways in industry. The event consists of hands-on workshops, plenary talks, group discussions and evening events.

In light of the current situation around COVID-19, LIV.DAT is closely monitoring developments and the requirements for organising events like this one. The School will take place with current guidelines in place, such as social distancing and a strict hygiene regime. However, as the developments around COVID-19 are outside our control, organised sessions may change and/or may be provided online. This information will be communicated to all registered participants as well as on this site.

Registration is now closed.

Due to the current situation around COVID-19, the decision has been made to hold this STFC School as an online event. The main programme will closely follow the one that was scheduled to take place in Liverpool, which includes workshops, an interactive poster session and live astronomy.

All applicants will be contacted with further details. (Updated 1 September 2020)

The previous school in this series was hosted by DISCnet in Sussex in June 2019.

This event is supported by the STFC under agreement No 4070265360.

Poster
Scholarship Application Form
Prof Carsten Welsch
    • 09:00 09:15
      Welcome and Introduction 15m
      Speakers: Carsten Welsch (University of Liverpool), Philip James
    • 09:15 10:00
      No Silver Bullet -- Pitfalls and Limitations in Machine Learning 45m
      Speaker: Kurt Rinnert
    • 10:00 10:30
      Break 30m
    • 10:30 12:30
      Parallel Session ML/AI

      During the parallel sessions participants will break into smaller groups for focused discussions on selected topics.

      • 10:30
        HEP NN training 2h

        This session will introduce aspects of unsupervised (and weakly supervised) learning methods and demonstrate these concepts using concrete problems from particle physics. The availability of high-quality synthetic data from Monte Carlo (MC) simulation is a key ingredient for the success of particle physics. However, the production and storage of these MC simulations occupy a large fraction of the computing resources of big experimental collaborations. We will introduce generative machine learning models such as generative adversarial networks (GANs) and autoencoders, which promise a way to greatly speed up simulation. Furthermore, we will explore the idea of unsupervised searches for anomalies as a novel way of data quality monitoring and potential discovery.
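As a rough illustration of the unsupervised anomaly search idea mentioned above, the sketch below scores events by how badly an autoencoder reconstructs them. It is not taken from the session material: it assumes a linear autoencoder (which is equivalent to PCA) fitted with NumPy, and the toy data, the latent size of 2 and all function names are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(x, n_latent):
    """Fit a linear autoencoder via PCA; return the mean and the top components."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions of the centred data
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_latent]

def reconstruction_error(x, mean, components):
    """Encode to the latent space, decode back, return the squared error per event."""
    latent = (x - mean) @ components.T
    decoded = latent @ components + mean
    return np.sum((x - decoded) ** 2, axis=1)

rng = np.random.default_rng(seed=0)

# Toy "background": 4 features that lie close to a 2D plane (made-up data)
mixing = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
background = rng.normal(size=(1000, 2)) @ mixing + 0.05 * rng.normal(size=(1000, 4))

# Toy "anomaly": a single event that lies far away from that plane
anomaly = np.array([[3.0, 0.0, -3.0, 0.0]])

mean, components = fit_linear_autoencoder(background, n_latent=2)
background_error = reconstruction_error(background, mean, components)
anomaly_error = reconstruction_error(anomaly, mean, components)

print("largest background error:", background_error.max())
print("anomaly error:", anomaly_error[0])
```

Events that the model has not learned to compress get a large reconstruction error, so a threshold on that error can flag candidates for closer inspection. A non-linear autoencoder follows the same pattern with a neural network in place of the two matrix products.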

        Speaker: Gregor Kasieczka
      • 10:30
        Introduction to ML 2h

        This session is an introduction to machine learning. Machine learning is everywhere in modern “big-data” science, so physicists and big-data scientists benefit from knowing the basics. The aim of this module is to explore what it means to build a machine learning model and to expand on concepts in machine learning that are essential to anyone working in big-data science.

        Module: https://hsf-training.github.io/hsf-training-ml-webpage/

        Speaker: Meirin Oan Evans
      • 10:30
        Scikit/Keras 2h

        We will explore practical applications of TensorFlow 2.0 using Keras to build models. The aim of these tutorials is for you to learn how to construct models that work with feature spaces of different shapes (both image data and flat input data), using several different types of neural network, and to explore common issues that can be encountered when training on data.

        Speaker: Adrian Bevan
    • 12:30 13:30
      Break 1h
    • 13:30 15:30
      Parallel Session ML/AI: continued

      • 13:30
        HEP NN training 2h

        NOTE: To use the Jupyter notebooks available on Google Colab, you should have a Google Drive area to copy them to.

        Speakers: Gregor Kasieczka, Lisa Benato
      • 13:30
        Introduction to ML 2h

        Speaker: Meirin Oan Evans
      • 13:30
        Scikit/Keras 2h

        Speaker: Adrian Bevan
    • 15:30 15:45
      Break 15m
    • 15:45 17:00
      Poster Presentations: Session 1 + 2

      All participants are invited to contribute a scientific poster about their own research.

    • 09:00 10:00
      Human-aware AI 1h

      Machine learning is often synonymous with predictive models of exceptional accuracy. In classification they are commonly evaluated with summary measures of predictive performance - but is this enough to validate a complex algorithm? Non-linear models will exploit any artefacts in the data, which can result in high-performing models that are completely spurious. Examples of this will be shown. This leads on to the need for a clear ontology of model interpretability, for model design and usability testing. This reinforces the emerging paradigm of AI not as a stand-alone oracle but as an interactive tool to generate insights by querying the data, sometimes called xAI or AI2.0 – AI with a person in the loop.

      In this talk, Professor Paulo Lisboa will describe how probabilistic machine learning models can be presented as similarity networks and how SVMs and neural networks generate simpler and transparent models including globally accurate representations with nomograms. Perhaps surprisingly, this can buck the accuracy/interpretability trade-off, by producing self-explaining neural networks that outperform black box models and match deep learning. The dependence on the main predictive variables will be made explicit for a range of benchmark data sets commonly used in the machine learning literature.

      Paulo Lisboa is Professor of Applied Mathematics at Liverpool John Moores University, UK, and Project Director for LCR Activate, an ERDF-funded £5m project to accelerate the development of SMEs in the Digital Creative Sectors in the Liverpool City Region. His research focus is advanced data analysis for decision support, in particular with applications to personalised medicine and public health. His research group on data science has developed rigorous methodologies to make machine learning models interpretable by end users.

      Speaker: Paulo Lisboa
    • 10:00 10:30
      Break 30m
    • 10:30 12:30
      Parallel Session Data Analysis
      • 10:30
        Big Data Python ecosystem for HEP 2h

        Data analysis in High Energy Physics (HEP) has evolved considerably in recent years. In particular, the role of Python has been gaining much momentum, and it currently shares the stage with C++ as a language of choice. Several domain-specific community projects have seen the light of day, providing HEP data analysis packages that profit from, and talk well to, the huge Python scientific ecosystem, which revolves around NumPy and friends. In this "Big Data Python ecosystem for HEP" session I will present and discuss a large part of this new HEP ecosystem, which is used more and more by analysts across several experiments, such as the LHC experiments but also Belle II, KM3NeT and others. Ample time will be provided to "play around" with the material in Jupyter notebooks.
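To give a flavour of the array-oriented style that the NumPy-based ecosystem encourages, here is a minimal sketch that computes the invariant mass of particle pairs for many events at once, without a Python loop over events. It uses plain NumPy only and is not taken from the session notebooks; the function name, the array layout (one row per event, one column per particle) and the toy four-momenta are assumptions made for this example.

```python
import numpy as np

def invariant_mass(e, px, py, pz):
    """Invariant mass of the summed four-vectors in each event (particles on axis 1)."""
    e_tot, px_tot, py_tot, pz_tot = (a.sum(axis=1) for a in (e, px, py, pz))
    m2 = e_tot**2 - px_tot**2 - py_tot**2 - pz_tot**2
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.clip(m2, 0.0, None))

# Two toy events with two particles each (natural units, GeV); values are made up
e = np.array([[5.0, 5.0], [10.0, 10.0]])
px = np.array([[3.0, -3.0], [6.0, 6.0]])
py = np.array([[4.0, -4.0], [8.0, 8.0]])
pz = np.zeros((2, 2))

masses = invariant_mass(e, px, py, pz)
print(masses)
```

The first event contains two back-to-back massless particles, so the pair mass equals the total energy; in the second event the two particles are collinear and massless, so the pair mass is zero.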

        Speaker: Eduardo Rodrigues
      • 10:30
        Demystifying "Big Data" 2h

        In a world increasingly dependent on vast data sets and complex processes to analyse and manipulate them, public understanding of them is poor. Combined with secrecy from vested interests, political spin, and conspiracy theories about "mutant algorithms" etc., this leads to mistrust, even fear, and dangerous anti-science attitudes. Demystifying "big data" is an important task which anyone working in the field can help with - it can also be great fun. In this workshop we will explore some approaches to engaging non-specialists, school students and the general public with data science and its role (for good or evil) in the modern world.

        Speaker: Andy Newsam
      • 10:30
        Preparation of large datasets for machine learning 2h

        Machine learning enjoys increasing popularity in many fields of the economy and science. In most courses about machine learning, the data is provided nicely formatted and ready to use out of the box, and the pre-processing of data is barely touched. But in reality, one of the most time-consuming steps when using machine learning is the proper preparation of the training data. Besides that, the way the data is prepared also determines the DNN architecture and has a huge impact on the performance of the DNN. This lecture will give you insights into proper data processing, some tips and tricks, and hopefully enough information for you to train powerful DNNs.
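One common preparation step of the kind described above is feature standardisation. The sketch below is a generic NumPy illustration rather than material from the lecture: it rescales each feature to zero mean and unit variance, computing the statistics on the training split only and reusing them for the test split. The function names, the toy data and the 400/100 split are hypothetical.

```python
import numpy as np

def fit_standardiser(train):
    """Compute per-feature mean and standard deviation on the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Leave constant features unscaled instead of dividing by zero
    std = np.where(std == 0.0, 1.0, std)
    return mean, std

def standardise(x, mean, std):
    """Apply a previously fitted standardisation to any data set."""
    return (x - mean) / std

rng = np.random.default_rng(seed=1)

# Toy data: two features on very different scales (made-up numbers)
raw = rng.normal(loc=[100.0, 0.01], scale=[20.0, 0.002], size=(500, 2))
train, test = raw[:400], raw[400:]

mean, std = fit_standardiser(train)
train_scaled = standardise(train, mean, std)
test_scaled = standardise(test, mean, std)  # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

Fitting the statistics on the training split alone avoids leaking information from the test data into the model inputs.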

        Speaker: Isabell Melzer-Pellmann
    • 12:30 13:30
      Break 1h
    • 13:30 15:30
      Parallel Session Data Analysis: continued
      • 13:30
        Big Data Python ecosystem for HEP 2h

        Speaker: Eduardo Rodrigues
      • 13:30
        Demystifying "Big Data" 2h

        Speaker: Andy Newsam 
      • 13:30
        Preparation of large datasets for machine learning 2h

        Speaker: Isabell Melzer-Pellmann
    • 15:30 15:45
      Break 15m
    • 15:45 17:00
      Poster Presentations: Session 3 + 4

      All participants are invited to contribute a scientific poster about their own research.

    • 19:00 20:00
      Public talk on Options and Opportunities for Health Data Science

      Healthcare is arguably the last major industry to be transformed by the information age. Deployments of information technology have only scratched the surface of possibilities for the potential influence of information and computer science on the quality and cost-effectiveness of healthcare.

      As part of this School on Data Intensive Science we have organised a public evening lecture on Data Science in Healthcare. During this lecture, Professor Andrew Morris, Director of Health Data Research UK (HDR UK) will present the vision, objectives and scientific strategy of HDR UK. The opportunities provided by computer science and “big data” to transform health care delivery models will be discussed. Examples will be given from nationwide research and development programmes that integrate electronic patient records with biologic and health system data.
      Since August 2017 Professor Morris has been the inaugural Director of Health Data Research UK, the multi-funder UK Institute for health and biomedical informatics research that will capitalise on the UK’s renowned data resources and research strengths to transform lives through health data science.

      This lecture on Options and Opportunities for Health Data Science is open to everyone attending the School as well as the wider public. All School participants are automatically registered; however, members of the public will be asked to register via https://indico.ph.liv.ac.uk/event/178/.

    • 09:00 10:00
      Making data work for you 1h

      Explore how data-driven technologies can improve productivity and strengthen competitive advantage, as well as some practical tips on getting data ready to make it useable and useful. This talk will look at established data tools and techniques, and explore how they have been used in practice to solve business problems, including: Brief introduction to the Hartree Centre; The data science process in theory and reality; Data gathering: collection bias and open data; How to make your data work for you; Structuring data and creating value.

      Dr Louise Butcher is a Senior Data Scientist at the STFC Hartree Centre. Although working on all areas of data science and machine learning, Louise has a particular interest in the analysis of geospatial data, including both satellite data and GPS/sensors. Projects have included analysing energy use for South West Water; analysing patient needs on discharge for Liverpool NHS Clinical Commissioning Group; and improving GPS filters for mobile phone tracking for Glow Media. As part of a varied career to date, Louise previously worked at the University of Manchester on computer vision and face recognition, and founded a spin-out company to exploit the technology.

      Speaker: Louise Butcher
    • 10:00 10:30
      Break 30m
    • 10:30 12:30
      Parallel Session: Distributed programming
      • 10:30
        Git Demystified 2h

        Familiar with basic git commands, but not sure where to go next? Have you been copying mysterious git snippets from Stack Overflow? This workshop is a deep dive into the surprisingly elegant underlying mechanisms that git uses to represent your code. We'll use this mental model to understand how git works and what it can do for us. The workshop only assumes basic knowledge of git, but some months of regular usage of basic commands (add/commit) will be an advantage.
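One small piece of that underlying mechanism can be reproduced in a few lines: git stores file contents as "blob" objects and names each one by a hash of a short header plus the content. The sketch below is not part of the workshop material; the function name is hypothetical, and it assumes the default SHA-1 object format (repositories created with the SHA-256 format produce different IDs).

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the object ID git gives a blob with this content (SHA-1 format assumed)."""
    # Git hashes the header "blob <size in bytes>\0" followed by the raw content
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty_id = git_blob_id(b"")
readme_id = git_blob_id(b"# My project\n")

print(empty_id)   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(readme_id)
```

The result can be compared with the output of `git hash-object <file>` for a file with the same bytes. Because the ID depends only on the content, two identical files anywhere in a repository's history share a single blob.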

        Workshop Requirements:

        The workshop assumes familiarity with a basic git workflow for creating commits. You should be comfortable with using git add and git commit to create new commits.

        In order to follow along with this workshop, you'll need an installation of a command line version of git which runs in a bash shell. Whilst the understanding you'll gain will be transferable to your preferred git tools (e.g. GUI tools or editor integration tools), we recommend that you avoid the temptation to follow along with any tools other than the git command line in a bash shell.

        For Windows users, this means you'll need to install and run Git Bash. Instructions for doing so can be found in the Setup on Windows section below.

        Setup on Windows:

        • Download the Git for Windows installer.
        • Run the installer and follow the steps below:
          • Click on "Next" four times (two times if you've previously installed Git). You don't need to change anything in the Information, location, components, and start menu screens.
          • From the dropdown menu select "Use the nano editor by default" (NOTE: you may need to scroll up to find it) and click on "Next".
          • Ensure that "Git from the command line and also from 3rd-party software" is selected and click on "Next". (If you don't do this Git Bash will not work properly, requiring you to remove the Git Bash installation, re-run the installer and select the "Git from the command line and also from 3rd-party software" option.)
          • Ensure that "Use the native Windows Secure Channel library" is selected and click on "Next".
          • Ensure that "Checkout Windows-style, commit Unix-style line endings" is selected and click on "Next".
          • Ensure that "Use Windows' default console window" is selected and click on "Next".
          • Ensure that "Default (fast-forward or merge)" is selected and click on "Next".
          • Ensure that "Enable Git Credential Manager" is selected and click on "Next".
          • Ensure that "Enable file system caching" is selected and click on "Next".
          • Click on "Install".
          • Click on "Finish" or "Next".
        • If your "HOME" environment variable is not set (or you don't know what this is):
          • Open command prompt (Open Start Menu then type cmd and press Enter)
          • Type the following line into the command prompt window exactly as shown:
            setx HOME "%USERPROFILE%"
          • Press Enter. You should see the message SUCCESS: Specified value was saved.
          • Quit command prompt by typing exit then pressing Enter

        This will provide you with both Git and Bash in the Git Bash program.

        There is an excellent tutorial on setup for Windows here:
        https://youtu.be/339AEqk9c-8

        Setup on macOS:

        The default shell in some versions of macOS is Bash, and Bash is available in all versions, so no need to install anything. You access Bash from the Terminal (found in /Applications/Utilities). See the Git installation video tutorial for an example on how to open the Terminal. You may want to keep Terminal in your dock for this workshop.

        To see if your default shell is Bash, type `echo $SHELL` in Terminal and press the Return key. If the message printed does not end with `/bash` then your default is something else, and you can run Bash by typing `bash`. If you want to change your default shell, see this [Apple Support article](https://support.apple.com/en-au/HT208050) and follow the instructions on "How to change your default shell".

        For macOS, install Git for Mac by downloading and running the most recent "mavericks" installer from [this list](http://sourceforge.net/projects/git-osx-installer/files/). Because this installer is not signed by the developer, you may have to right click (control click) on the .pkg file, click Open, and click Open on the pop-up window. After installing Git, there will not be anything in your /Applications folder, as Git is a command line program. For older versions of OS X (10.5-10.8) use the most recent available installer labelled "snow-leopard" [available here](http://sourceforge.net/projects/git-osx-installer/files/).

        Setup on Linux:

        The default shell is usually Bash and there is usually no need to install anything. To see if your default shell is Bash, type `echo $SHELL` in a terminal and press the Enter key. If the message printed does not end with `/bash` then your default is something else, and you can run Bash by typing `bash`.

        If Git is not already available on your machine you can try to install it via your distro's package manager. For Debian/Ubuntu run:

        sudo apt-get install git

        and for Fedora run:

        sudo dnf install git

        Speaker: Mark Dawson
      • 10:30
        Virtual universes vs. the real thing 2h

        In this session we will run a series of exercises aimed at familiarizing the attendees with large, publicly-available data sets derived from both cosmological simulations ("virtual universes") and astrophysical observations from two of ESA's most recent satellite missions: Gaia and Planck. We will show how the data sets can be retrieved in an efficient way and analysed using a variety of publicly-available software tools (e.g., visualisation software). In the case of Gaia, we will show how simple yet informative comparisons can be made with state-of-the-art simulations. For Planck, we will introduce methods for cosmological parameter inference, including MCMC samplers.
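For readers who have not met MCMC samplers before, the following minimal sketch shows the basic Metropolis algorithm on a deliberately simple problem: inferring the mean of Gaussian data with a flat prior. It is unrelated to the Planck analysis tools used in the session; the function names, the toy data, the step size and the burn-in length are all assumptions chosen for illustration.

```python
import math
import random

def log_posterior(mu, data, sigma=1.0):
    """Log-posterior for the mean of Gaussian data with a flat prior (up to a constant)."""
    return -0.5 * sum((x - mu) ** 2 for x in data) / sigma**2

def metropolis(log_prob, start, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional parameter."""
    samples = []
    current, current_lp = start, log_prob(start)
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

rng = random.Random(42)

# Toy data drawn around a "true" mean of 2.0 (made-up example)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]

chain = metropolis(lambda mu: log_posterior(mu, data),
                   start=0.0, n_steps=5000, step_size=0.2, rng=rng)

burned = chain[1000:]  # discard the burn-in while the chain moves away from the start
estimate = sum(burned) / len(burned)
print("posterior mean estimate:", estimate)
```

With a flat prior the posterior mean coincides with the sample mean of the data, which gives a simple check that the chain has converged. Cosmological analyses apply the same accept/reject logic to many parameters at once, usually through dedicated sampling packages.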

        Speakers: Andreea Font, Ian McCarthy
    • 12:30 16:00
      Free afternoon 3h 30m
    • 20:00 22:00
      Live astronomy with the Liverpool Telescope

      During this interactive session, Dr Chris Copperwheat will discuss the Liverpool Telescope 2, which is a new, 4-metre, fully robotic optical/infrared telescope based on the Canary Island of La Palma. Chris will connect remotely to the telescope and will go through all the engineering interfaces. Furthermore, he will be taking suggestions for targets and collecting the data in real time.

      Dr Chris Copperwheat is the Liverpool Telescope Astronomer in Charge and a Reader in Time Domain Astrophysics at the Astrophysics Research Institute of Liverpool John Moores University. He is responsible for giving a science-led perspective to the day-to-day management operations of the telescope. Chris took on the Astronomer in Charge role in September 2015. He joined the ARI in September 2012 as the Liverpool Telescope 2 (LT2) Project Scientist, and led development of the LT2 science case during the first phase of that project.

    • 09:00 10:00
      Careers in data science - case studies

      During this session, three people from industry will each speak about their own career and how they have progressed. Each talk will last around 15 minutes with the opportunity for participants to submit questions via the Zoom Chat function.

      • 09:00
        Case Study 1 15m

        Dr Edward Jones is Director of IT Projects and Programmes at Amey.

        Speaker: Edward Jones
      • 09:15
        Q&A 5m
      • 09:20
        Case Study 2 15m

        Senior Solution Architect HPC - NVIDIA (www.nvidia.com)
        Paul Graham joined NVIDIA in Dec 2018 as a Senior Solutions Architect, where he has responsibility for supporting customers and partners in delivering accelerated solutions to the Higher Education, High Performance Computing and AI communities in the UK. Previously he spent 20 years working at EPCC, the supercomputing centre at the University of Edinburgh, where he worked on a broad range of projects, principally with industrial and commercial partners, including data mining for a national bank, software performance optimisation for Rolls Royce, parallelisation of electro-magnetic modelling code for the oil industry, and many projects with local SMEs. He also was the coordinator of SHAPE (SME HPC Adoption Programme in Europe), supported as part of the PRACE collaboration of supercomputing centres across Europe. At NVIDIA Paul is an advocate for using accelerated computing in HPC, and the emerging use of AI as a powerful new tool for researchers.

        Speaker: Paul Graham
      • 09:35
        Q&A 5m
      • 09:40
        Case Study 3 15m

        Director of AI; High Performance Computing, Big Data & AI - ATOS (www.atos.net)
        Sunil works at the forefront of technological advancement within Artificial Intelligence. He has implemented successful projects within the healthcare, education, financial services, commercial, and aerospace & defence industries, as well as start-up environments across the UK and Europe. Sunil hosts conferences that provide training, insights and direct access to experts on the hottest topics in industry. He also regularly speaks at industry events and on expert panels worldwide.

        Speaker: Sunil Mistry
      • 09:55
        Q&A 5m
    • 10:00 10:30
      Break 30m
    • 10:30 12:30
      Parallel Session: Project Management

      by Fistral Training and Consultancy Ltd

    • 12:30 13:30
      Break 1h
    • 13:30 15:30
      Parallel Session: International Collaboration

      by Fistral Training and Consultancy Ltd

    • 15:30 16:00
      Break 30m
    • 16:00 18:00
      Industry Careers Workshop

      During this dedicated industry session, external speakers will present career pathways and provide careers advice. This interactive session will give participants the opportunity to connect with businesses, learn more about working in industry, develop better commercial awareness, and understand the enormous benefits of research collaboration. This afternoon session will be chaired by Dr Alexis Nolan Webster from the Careers Office at the University of Liverpool. All participants are encouraged to submit questions via the Zoom Chat function throughout the session.

      Speakers:
      Dr Edward Jones
      Amey - Director of IT Projects and Programmes, Group IT

      Dr Blair Edwards
      IBM Research - Data Technologies Group Leader

      Dr Martyn Spink
      IBM Research - Programme Director, IBM Research Europe

      Sunil Mistry
      ATOS - Director of AI, High Performance Computing, Big Data & AI

      Richard McClay
      Ultra Electronics - Chief Technology Officer

      Dr Jonathan Smith
      STFC - Business Development Manager

    • 18:20 20:00
      Online Escape Room 1h 40m
    • 09:00 12:15
      Kaggle competition

      Half day programming challenge in small groups

    • 12:15 12:30
      A student's placement experience 15m
      Speaker: Alexander Hill
    • 12:30 13:00
      Break 30m
    • 13:00 13:30
      Prize ceremony and Close
      Convener: David Hutchcroft (University of Liverpool)