4th collaboration workshop on Reinforcement Learning for Autonomous Accelerators (RL4AA'26)

Europe/London
Theatre 2, Teaching Hub 502, University of Liverpool, Liverpool L69 7ZP, UK
Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
Description

We are pleased to announce RL4AA'26, the fourth workshop of the series organised by the Reinforcement Learning for Autonomous Accelerators (RL4AA) Collaboration. After three successful workshops (KIT in 2023, Salzburg in 2024, and DESY in 2025), RL4AA'26 will be hosted in 2026 as a LIV.INNO Workshop by the University of Liverpool and the Cockcroft Institute in the musically iconic port city of Liverpool.

RL4AA brings together the reinforcement learning and accelerator physics communities to share results, practical insights from the field, and the big questions ahead for real-world RL. Expect engaging keynotes, invited and contributed talks, a lively poster session, and a hands-on coding challenge.

Scope

Submissions are not limited to particle accelerators. We welcome any work on sequential decision-making in real-world settings (e.g., safety, sample efficiency, sim-to-real, partial observability). A set of student talk slots will be reserved.

Coding challenge

Join a team to tackle an accelerator-themed RL task. We will form balanced teams that mix newcomers and experienced contributors. You'll submit results to a live leaderboard as you go, iterate on your model, and we will wrap up with awards for the top two teams.

Who should attend

Whether you are deep into RL research or just starting out and curious, there's something for you, from introductory lectures on the first day to advanced talks.

Registration

The event is free, but places are limited. Please register early to secure your spot.

We can’t wait to welcome you to Liverpool in 2026!

Dr Andrea Santamaria Garcia
    • Registration (Teaching Hub 502, First Floor, University of Liverpool)

      Arrival of participants and registration.

    • Welcome: Welcome and workshop organisation (Theatre 2, Teaching Hub 502, University of Liverpool)
      Conveners: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Carsten Welsch (University of Liverpool)
      • 1
        Welcome and opening
        Speaker: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
      • 2
        Accelerator science in Liverpool
        Speaker: Carsten Welsch (University of Liverpool)
    • Lecture 🎓: Introduction to Reinforcement Learning (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Joel Wulff (CERN)
      • 3
        Introduction to Reinforcement Learning
        Speaker: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
    • 10:30
      Coffee & pastries ☕ 🥐 Teaching Hub 502, First Floor

      Teaching Hub 502, First Floor

      University of Liverpool

    • Lecture 🎓: Introduction to Deep RL agents (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Joel Wulff (CERN)
    • Coding Challenge 💻: Introduction to CLARA (Theatre 2, Teaching Hub 502, University of Liverpool)

      Time allocated to work on the RL challenge

      Challenge information:

      • Final submission deadline: 14:15 on Wednesday 1 April
      • The 1st- and 2nd-place teams will be asked to present their results
      • Template slides are available in the session contributions; they indicate what we would like you to present, but feel free to make your own. Each team will have approximately 5 minutes to speak plus 3 minutes for questions.
      Convener: Joel Wulff (CERN)
      • 5
        Introduction to CLARA
        Speaker: Alexander Brynes (Science & Technology Facilities Council)
    • 12:00
      Lunch break (self-paid) The Courtyard

      The Courtyard

      University of Liverpool Guild of Students 160 Mount Pleasant Liverpool L3 5TR
    • Contributed talks: Student/Junior Talks (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Joseph Wolfenden (University of Liverpool)
      • 6
        Reliable Control via Contextual MPC and Approximate Dynamic Programming: Integrating Bayesian Foresight

        Industrial energy management presents a challenging control problem characterized by strict safety hierarchies, stochastic load fluctuations, and control actions whose effects are often significantly delayed. This work investigates a Reliable Hierarchical Control Architecture composed of probabilistic forecasting paired with a decision-maker.
        The problem considered is characterized by an uncontrollable base load and a controllable variable load, which together make up the total load consumption of the system. Bayesian Structural Time Series (BSTS) estimation is used to infer the base load. On top of this probabilistic context, a "Safety Shield" derived from SCADA logic enforces deterministic constraints, ensuring that high-priority assets (e.g., generators) are switched strictly according to operational hierarchies. Within the control task, we rigorously evaluate three distinct decision-making methodologies on their speed, accuracy, and adaptability, specifically analyzing their ability to handle unknown plant parameters and actuation latency:
        1. A Greedy Heuristic Baseline that employs an adaptive continuous load estimator to dynamically learn power-consumption parameters during operation, enabling perpetual correction of its strategy. While it reacts quickly to unexpected parameter drift, this approach remains myopic, and its constant, time-consuming adaptations risk high latency and therefore fewer corrections.
        2. Mixed-Integer Quadratic Programming (MIQP), which provides globally optimal scheduling over a finite horizon. However, its effectiveness is limited by the requirement for linearized plant models and explicit, pre-defined delay compensation, making it brittle to unmodeled temporal dynamics.
        3. Dynamic Programming (DP), which serves as a theoretical benchmark by naturally incorporating delayed effects into the state space, though its practical application is severely limited by exponential computational scaling in the number of flexible loads.
        Our results highlight a critical trade-off: optimization-based methods (MIQP, DP) offer theoretical guarantees but struggle with the "Reality Gap" of unmodeled delays and non-linearities. The adaptive heuristic offers practical resilience but lacks refined foresight. Consequently, we propose that future work must bridge this gap through learning-based approaches, such as Shielded Reinforcement Learning, which can implicitly learn high-dimensional delay dynamics and non-linear interactions that are computationally intractable for formal optimization models.

        Speaker: Sarah Trausner (University of Salzburg)
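
        As an illustration of the "Safety Shield" idea above: a minimal Python sketch in which proposed switching actions pass through a deterministic priority filter before being applied. All names are hypothetical, not the authors' implementation.

        def shield(proposed_on, priority, max_active):
            """Keep only switching actions consistent with the operational
            hierarchy: assets are admitted in priority order, up to a hard
            limit on simultaneously active assets."""
            safe = []
            for asset in priority:  # walk the hierarchy top-down
                if asset in proposed_on and len(safe) < max_active:
                    safe.append(asset)
            return set(safe)

        # The agent proposes three assets; only two may run, and generators
        # take precedence over flexible loads:
        shield({"pump_2", "generator_1", "chiller_3"},
               ["generator_1", "generator_2", "pump_2", "chiller_3"],
               max_active=2)  # -> {"generator_1", "pump_2"}
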
      • 7
        Dreaming of Schottky Spectra: Building World Models for robust LEIR automation

        Stripper foil degradation in the Low Energy Ion Ring (LEIR) causes beam distribution drift that progressively degrades performance during multi-turn stacking at flat bottom. World models have emerged as a promising approach for sample-efficient and robust agents, enabling them to improve their behavior by rolling out policies in learned environment models between real interactions, thereby reducing the need for expensive online exploration. In this work, a world model-based reinforcement learning approach is presented for autonomous compensation of foil aging effects in LEIR, extending previous reinforcement learning-based intensity optimization efforts. The agent observes the accelerator state through encoded Longitudinal Schottky spectra and Time-of-Flight (TOF) measurements—capturing the coupled dynamics of beam parameters affected by foil-induced distribution drift—and learns a compact latent representation via a trained world model. This learned representation allows the agent to plan actions through internal simulation, improving sample efficiency under the strict data and operation constraints of accelerator control. By controlling the RF ramping and debunching cavities, electron gun voltage, and cooler bump to maintain optimal phase space conditions, the agent adapts to aging-induced beam drift throughout the injection plateau. The stochasticity of the world model can be tuned through temperature scaling, enabling robust imagined rollouts that prepare the agent for varying, unforeseen foil conditions. The resulting policy maintains beam intensity above nominal targets across repeated injection cycles despite progressive stripper foil aging, illustrating the potential of world model-based reinforcement learning for autonomous accelerator operation.

        Speaker: Borja Rodriguez Mateos (CERN)
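
        The core mechanism above, rolling the policy out inside the learned model with temperature-scaled stochasticity, fits in a few lines. A hedged PyTorch sketch in which world_model and policy are hypothetical stand-ins rather than the LEIR code:

        import torch

        @torch.no_grad()
        def imagine(world_model, policy, z0, horizon, temperature=1.0):
            """Roll the policy forward in latent space, never touching the
            machine; higher temperature widens the predictive noise."""
            z, trajectory = z0, []
            for _ in range(horizon):
                action = policy(z)
                mean, std = world_model(z, action)  # predictive distribution
                z = mean + temperature * std * torch.randn_like(std)
                trajectory.append((z, action))
            return trajectory
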
      • 8
        Testing and Improving RL Policies via Rule Learning

        Using domain knowledge to improve deep RL policies is a current challenge. LEGIBLE mines rules from an RL policy, constituting a partially symbolic representation. These rules describe which decisions the RL policy makes and which it avoids making. It then generalizes the mined rules using domain knowledge. Finally, it evaluates generalized rules to determine which generalizations improve performance when enforced. These improvements show weaknesses in the policy, where it has not learned the general rules and thus can be improved by rule guidance. We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses in several RL environments.
        Closing the loop from neural to symbolic and back to neural representation, we show how to integrate symbolic (rule-based) knowledge into neural RL policies by leveraging RL from demonstrations with OFTEN-DeepRL.

        Speaker: Ignacio D. Lopez-Miguel
      • 9
        Resource-Conditioned Reinforcement Learning for Physics Instrument Design

        Designing advanced particle-physics instruments requires navigating a high-dimensional space of discrete and continuous choices while satisfying strict constraints on material, cost, and geometry. In practice, these constraints evolve throughout an experiment’s lifetime, making it insufficient to optimize a single “best” detector configuration. We present a resource-conditioned reinforcement learning (RL) framework for detector design that produces families of optimized configurations matched to different constraint levels. Building on prior RL-based instrument design workflows (arXiv:2412.10237), we train agents that condition their policy on available resources and other problem parameters, enabling a single training run to generate multiple designs spanning a spectrum of feasible budgets and geometries.

        We demonstrate the approach on longitudinal calorimeter design, where the agent learns to adapt sensor placement and layer thickness patterns in a non-linear way as resources change, yielding a set of optimized architectures that directly expose performance–cost trade-offs (e.g., energy resolution versus material usage). We discuss practical aspects of constraint handling and feasibility enforcement, and we outline how the same conditional formulation can be extended to additional design degrees of freedom—such as total detector size, spatial envelopes, or task-specific operating points—supporting interactive studies for decision-makers. This work reframes detector optimization from producing a single configuration to learning a controllable design policy that can respond to shifting requirements during experiment design.

        Speaker: Sara Zoccheddu (University of Zurich)
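
        The conditioning itself is lightweight: the available resource level becomes part of the observation and is resampled per training episode. A sketch under those assumptions; names and the budget range are illustrative, not taken from the paper.

        import numpy as np

        def conditioned_obs(obs, budget, budget_max):
            """Append the normalised budget so a single policy can serve a
            whole family of constraint levels."""
            return np.concatenate([obs, [budget / budget_max]])

        # Budgets are resampled per episode during training, so the agent
        # sees, and learns to exploit, the full range of constraint levels.
        rng = np.random.default_rng(0)
        episode_budget = rng.uniform(0.2, 1.0)
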
    • 14:50
      Coffee ☕ Teaching Hub 502, First Floor

      Teaching Hub 502, First Floor

      University of Liverpool

    • Coding Challenge 💻: Hands-on (Theatre 2, Teaching Hub 502, University of Liverpool)

      Time allocated to work on the RL challenge

      Challenge information:

      • Final submission deadline: 14:15 on Wednesday 1 April
      • The 1st- and 2nd-place teams will be asked to present their results
      • Template slides are available in the session contributions; they indicate what we would like you to present, but feel free to make your own. Each team will have approximately 5 minutes to speak plus 3 minutes for questions.
      Conveners: Amelia Pollard (ASTeC), Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Joel Wulff (CERN)
      • 10
        Challenge Introduction
        Speaker: Joel Wulff (CERN)
      • 11
        Challenge Free Work Time
        Speakers: Amelia Pollard (ASTeC), Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Joel Wulff (CERN)
    • Workshop social: Dinner at Albert's Schloss (18-26 Bold St, Liverpool L1 4DS)
    • Welcome: Day 2 (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
    • Keynote (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
      • 12
        Accelerating Reinforcement Learning with Off-Policy Data: Promises, Pitfalls, and Future Directions

        Reinforcement learning is a promising technique for solving complex control problems in real-world physical systems, such as robotics, plasma stabilization, and particle accelerators. However, RL is often data-hungry, and its classic on-policy formulation is inefficient, as it disallows data reuse, and unsafe, as it requires the agent to interact with the environment from scratch.
        Off-policy reinforcement learning offers a more appealing paradigm by enabling the reuse of historical data and the utilization of safe, external behavior sources (such as human operator logs). However, this flexibility comes at a cost: off-policy learning introduces significant theoretical instabilities. In this talk, we will analyze some fundamental difficulties in off-policy reinforcement learning, both in value and policy learning, explore the algorithmic landscape that tames them, and discuss the future directions in which the field is moving.

        Speaker: Samuele Tosatto (Universität Innsbruck)
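
        As a toy illustration of the data-reuse point: an off-policy learner consumes transitions from any behaviour source, old policies or operator logs alike, via a replay buffer instead of discarding them after each update. A generic Python sketch, not tied to any particular algorithm:

        import random
        from collections import deque

        buffer = deque(maxlen=100_000)  # transitions: (s, a, r, s_next, done)

        def store(transition):
            """Accept experience from any behaviour policy, even logged data."""
            buffer.append(transition)

        def sample_batch(batch_size=64):
            """Sample past experience uniformly, regardless of who produced it."""
            return random.sample(buffer, min(batch_size, len(buffer)))
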
    • 10:15
      Coffee & pastries ☕ 🥐 Teaching Hub 502, First Floor

      Teaching Hub 502, First Floor

      University of Liverpool

    • Invited Talks: RL Applied to Particle Accelerators (Theatre 2, Teaching Hub 502, University of Liverpool)
      Convener: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
      • 13
        Reinforcement Learning for Real-Time Cyclotron Tuning: Results from the PSI Injector 2 Experiment

        Achieving reliable, fast, and reproducible tuning of high-power cyclotrons remains a key operational challenge as accelerators move toward increasingly complex beam configurations and higher intensities. To address this, we conducted a two-week experimental campaign at the PSI Injector 2 cyclotron to evaluate the feasibility of applying reinforcement learning (RL) for real-time beam tuning on a live machine.

        A continuous-control RL agent was integrated with the accelerator control system and operated under strict safety constraints. The agent was trained online at low beam current and evaluated across multiple turn configurations, including nominal and degraded operating regimes. We demonstrate stable convergence within hours, effective phase alignment with reduced beam losses, and robust autonomous operation during extended overnight evaluation runs without triggering interlocks.

        These experiments represent an important step toward automated cyclotron tuning and provide practical insights into safe RL deployment, policy generalization, and operational robustness for future high-current HIPA and ADS-class facilities.

        Speaker: Malek Haj Tahar (Transmutex SA)
      • 14
        Batch spacing optimization at SPS injection by RL

        This contribution will be based on the paper "Batch spacing optimization by reinforcement learning" (DOI: https://doi.org/10.1103/g9wr-197z):
        Beams designated for the LHC are injected into the SPS in multiple batches. Given the tight spacing of 200 ns between these batches, the injection kickers have to be precisely synchronized with the injected beam to minimize injection oscillations. Due to machine drift, the optimal settings for the kickers vary. This paper presents an active controller trained by reinforcement learning that counteracts the machine drifts by adjusting the settings. The agent was exclusively trained in a simulation environment and directly transferred to the accelerator. Although its results are slightly worse than those obtained by an explicit numerical optimizer, the BOBYQA algorithm, the agent attains these results much faster since it requires far less computation.

        Speaker: Matthias Remta (CERN / University of Vienna)
    • 11:45
      Group photo Front of Victoria Building

      Front of Victoria Building

    • Poster session: Poster session & lunch (Teaching Hub 502, First Floor, University of Liverpool)

      • 15
        Agent-Based Simulation of Medical Device Redistribution in Crises with Reinforcement Learning

        Background and Motivation
        In crisis situations, hospitals can face shortages of medical devices required for the proper treatment of patients. More effective coordination of existing medical resources could therefore improve patient care as well as the resilience of the healthcare system.

        The objective of this work is to develop a spatial simulation that models the distribution of medical devices in crisis situations and supports decision-making on how devices can be allocated within a hospital network in an effective way to reduce shortages. A key aspect of this is the integration of geographic information, which allows the analysis of location-based relationships in relation to socio-demographic conditions, influencing planning and transport decisions. In addition, the simulation could be used to quantify the impact of different distribution strategies on the resilience of the regional healthcare system.

        Methodology
        The simulation combines agent-based modeling (ABM) with reinforcement learning. ABM agents represent entities such as hospitals or medical devices, while reinforcement learning is responsible for decision-making in the model. The learning objective is to reduce device shortages while penalizing inefficient actions such as long-distance transport. We develop two simulation models to study the distribution of medical devices in crisis situations.

        The base model represents a scenario in which multiple medical devices become unavailable across all hospitals due to a cybersecurity incident. Some hospitals experience shortages, while others still have enough devices and can potentially share them. The model simulates how medical devices can be redistributed within the hospital network to reduce deficits.

        The advanced model extends the base model by simulating a system overload caused by an infectious disease. Patient inflows are modeled based on geographic and socio-demographic factors as well as hospitalization data, leading to simulated bed occupancy in hospitals. The resulting demand for medical devices is derived from the number of hospitalized patients. The device redistribution logic is then applied under these overload conditions.

        The model is currently under development.

        Expected Results
        The simulation is expected to provide decision support for the distribution of medical devices during crisis situations. It enables the identification of critical load thresholds at which hospitals become overstressed under specific crisis scenarios. Furthermore, the results are expected to demonstrate how device distribution strategies can increase the resilience of the regional healthcare system.

        Speaker: Georg Weinberger (University of Salzburg - Department of Geoinformatics (Z_GIS))
      • 16
        Automated tuning using RL trained on Cheetah simulation at DESY and building Cheetah simulation for ISIS virtual accelerator

        Automated tuning has been an area of great interest in the accelerator community in recent years. Bayesian Optimisation (BO) has been favoured over Reinforcement Learning (RL) due to its short training time and reliability. However, RL has become increasingly viable with access to large training datasets from Cheetah, a fast and differentiable simulation.

        In this work, we develop Cheetah simulations and tune the R-Weg section of the DESY II synchrotron using RL. RL-based tuning is significantly faster than BO during inference, and initial results indicate that it is equally competitive in accuracy metrics. We also evaluate Cheetah for the ISIS Linac and explore its integration as a backend within our Virtual Accelerator for ISIS project.

        Speaker: Raunakk Banerjee (Science and Technology Facilities Council)
      • 17
        Autonomous beam flattening using reinforcement learning at the CLEAR facility at CERN

        Modern particle accelerators operate in highly complex, nonlinear, and time-varying regimes, where optimal performance relies on the coordinated tuning of many coupled parameters under uncertainty and noise. Traditional control and optimization strategies based on physics models, linearization, or manual tuning often struggle to adapt in real time to changing beam conditions, hardware drifts, and incomplete diagnostics.
        These challenges are particularly relevant at the CERN Linear Electron Accelerator for Research (CLEAR) facility, which supports a wide range of experiments requiring diverse beam configurations. Among these, medical irradiation experiments demand dedicated beam settings, including configurations that produce flat and uniform transverse beam profiles at the sample location using a dual-scattering system. Establishing and maintaining such beam conditions requires significant machine time and careful manual tuning, while stable and reproducible beam parameters are essential for experimental reliability.
        Reinforcement Learning (RL) offers a promising framework for autonomous accelerator operation by enabling control agents to learn optimal tuning policies through interaction with the machine or high-fidelity simulations, with the potential to reduce setup time and improve beam stability. To address these challenges, an RL-based beam-flattening algorithm is being developed to autonomously optimize the beam profile by tuning quadrupole and corrector magnets that steer and shape the beam onto the scattering system. The approach has been implemented and validated using a simulation model of the CLEAR beamline and is planned for deployment during the 2026 experimental run.

        Speaker: Giacomo Tangari (Sapienza University of Rome / CERN)
      • 18
        Autonomous Optimization of RF Triple Splitting in the CERN PS

        Reinforcement learning (RL) is a powerful technique for optimizing complex beam manipulations. An RL-based autonomous controller has been developed for the triple splitting RF manipulation in the CERN Proton Synchrotron (PS), essential to establish the bunch spacing for the LHC. The system combined a convolutional neural network for initial phase correction with sequential soft actor-critic agents optimizing the RF parameters. Trained with simulated bunch profile data, the controller demonstrated robust and rapid convergence during early beam tests. This motivated its deployment as an on-demand tool and later as a fully autonomous controller. However, changes to the RF voltage program or operating conditions would require offline simulation, dataset regeneration, and retraining. With the experience gained running the RL controller, its replacement by a PID-based solution requiring only gain tuning while achieving comparable performance has been completed. This case study highlights both the strengths and limitations of RL for autonomous accelerator control and underlines maintainability as a key criterion for operational implementations.

        Speaker: Joel Wulff (CERN)
      • 19
        Batch spacing optimization at SPS injection by RL

        This contribution will be based on the paper "Batch spacing optimization by reinforcement learning" (DOI: https://doi.org/10.1103/g9wr-197z):
        Beams designated for the LHC are injected into the SPS in multiple batches. Given the tight spacing of 200 ns between these batches, the injection kickers have to be precisely synchronized with the injected beam to minimize injection oscillations. Due to machine drift, the optimal settings for the kickers vary. This paper presents an active controller trained by reinforcement learning that counteracts the machine drifts by adjusting the settings. The agent was exclusively trained in a simulation environment and directly transferred to the accelerator. Although its results are slightly worse than those obtained by an explicit numerical optimizer, the BOBYQA algorithm, the agent attains these results much faster since it requires far less computation.

        Speaker: Matthias Remta (CERN / University of Vienna)
      • 20
        Binary Trigger Signals for Deep Reinforcement Learning in Equity Trading

        This study introduces a novel binary trigger-based state representation for deep reinforcement learning (DRL) in stock trading. Unlike conventional approaches using continuous technical indicators (MACD, RSI, CCI, ADX), we encode market state via binary signals: MVX (moving-average crossover) and BOLLX (Bollinger band breakout). We also propose trigger-date filtering, which trains only on dates when triggers fire, reducing training data by 50-70%.

        Evaluating 27 configurations (three algorithms: A2C, PPO, SAC across nine indicator variants) on Dow Jones 30 daily data (Jan-Nov 2025), we discover a strong algorithm-indicator dependency: A2C with MVX yields +30.85% improvement, PPO with BOLLX achieves +16.09%, while SAC remains robust to both. The best configuration (A2C with filtered MVX) achieves 31.90% cumulative return, a Sharpe ratio of 1.41, and outperforms the DJIA baseline by 154%.

        A systematic review of papers (2015-2025) suggests both contributions are novel: no prior work employs binary trigger signals or trigger-date filtering in DRL trading. Results partially validate RL over traditional strategies (37% of models beat DJIA) while showing trigger-date filtering benefits A2C but hurts PPO/SAC. Limitations include the 11-month test period and absence of LSTM temporal modeling, suggesting future work on recurrent architectures and multi-market validation.

        Speakers: Juan Manuel Montoya Bayardo, Dr Simon Hirländer (Uni Salzburg)
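
        A rough reconstruction of the trigger encoding in Python; window lengths and exact definitions are assumptions, not the authors' specification:

        import numpy as np

        def mvx(prices, fast=10, slow=50):
            """1 on days where the fast moving average crosses above the
            slow one, else 0 (aligned with the slow-window tail)."""
            sma = lambda x, w: np.convolve(x, np.ones(w) / w, mode="valid")
            f = sma(prices, fast)[slow - fast:]
            s = sma(prices, slow)
            above = f > s
            return np.concatenate([[0], (above[1:] & ~above[:-1]).astype(int)])

        def bollx(prices, w=20, k=2.0):
            """1 on days where price breaks above the upper Bollinger band."""
            out = np.zeros(len(prices), dtype=int)
            for t in range(w, len(prices)):
                window = prices[t - w:t]
                out[t] = int(prices[t] > window.mean() + k * window.std())
            return out

        # Trigger-date filtering: train only on days where a trigger fires.
        prices = np.cumsum(np.random.default_rng(1).normal(0, 1, 300)) + 100
        train_days = np.flatnonzero(bollx(prices))
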
      • 21
        Causal GP-MPC: Where Structure, Safety, and Online Learning Meet for Robust Accelerator Control

        Robust accelerator control increasingly relies on data-driven optimisation, yet balancing adaptability with safety remains challenging. Simulation-driven, physics-informed reinforcement learning (RL) relies on soft constraints without formal safety guarantees, and classical response-matrix inversion (RMI) becomes suboptimal under noise and hard actuator limits. Using the AWAKE electron beam-steering task at CERN as a high-fidelity benchmark, we formulate beam steering as a stochastic control problem in a linear Markov decision process (MDP) with continuous state/action spaces and realistic constraints, and compare RMI, a nominally optimal linear controller (Kalman Quadratic Programming, KalmanQP), Gaussian-process MPC (GP-MPC), and RL.

        Our main contribution is a causal GP-MPC scheme that embeds the beamline’s causal layout directly into the GP prior and kernel. This structural inductive bias reduces model complexity, improves conditioning, and enables accurate multi-step prediction from limited data. In simulations on the measured response matrix, RMI and KalmanQP perform well in benign conditions, but their nominal optimality is brittle: performance degrades sharply under noise. PPO learns robust policies yet is data-inefficient. Structured GP-MPC bridges these extremes, leveraging the RMI-based physical prior for high sample efficiency and a learned residual to surpass the robustness of standard controllers. Taken together, the results indicate that causally structured learning offers a promising route to data-efficient, interpretable, and deployable control strategies for complex accelerator systems.

        Speaker: Simon Hirlaender (IDA Lab, Paris Lodron University of Salzburg)
      • 22
        Crystal Channelling Optimisation in the LHC Using Reinforcement Learning

        The Large Hadron Collider (LHC) requires a collimation system to ensure safe operation with both proton and heavy-ion beams. As of 2023, a crystal collimation scheme using bent silicon crystals was introduced to improve the collimation efficiency for heavy-ion beams. However, drifts in the crystal angular position led to the loss of cleaning performance during physics fills. These drifts are thought to derive from mechanical deformation of the goniometer due to heating caused by beam impedance effects. A quadratic-fit based optimiser was deployed to compensate for such drifts using feedback from beam loss monitors. This paper details the simulation environment used to train reinforcement learning agents to maintain the optimal channelling position with increased reliability and reduced convergence time, and presents the latest results obtained with lead ion beams.

        Speaker: Andrea Vella (University of Malta)
      • 23
        Designing a Gated Recurrent Unit in the Versal AI Engines

        This poster presents the design and implementation of a Gated Recurrent Unit (GRU) on Xilinx Versal AI Engines. We outline the mapping of GRU computations to the AI Engine architecture, discuss dataflow and parallelization strategies, and highlight performance considerations for efficient recurrent neural network inference. The design supports unquantized models by leveraging 32-bit floating-point datatypes on the AI Engines.

        Speaker: Michail Sapkas (UniPD - INFN Padova)
      • 24
        Designing for Tunability and Feedback in the Muon EDM Experiment

        The muon electric dipole moment (muEDM) experiment at PSI relies on highly sensitive off-axis muon injection into a compact frozen-spin trap. Injection performance depends strongly on magnetic field and material properties that are difficult to characterize with sufficient accuracy prior to commissioning. For a system of this complexity, purely feed-forward optimization of experimental geometry based on simulation alone is limited by unavoidable model uncertainty. We are therefore exploring a design for the next phase of our experiment that explicitly prioritizes tunability: the ability to compensate expected deviations using controllable correction elements and measurement-based feedback. This naturally leads to a control formulation, in which detector outputs can be interpreted as system states, controllable elements (e.g. steering coils) as control actions, and injection efficiency as a performance objective.

        This poster presents an early-stage study aimed at developing such a tunable experimental design. We discuss ongoing and planned studies of the experiment, and outline how feedback-based methods could be applied to bridge the sim-to-real gap in practice. The goal is to form a concrete, experimentally motivated control problem and invite discussion on how modern learning-based methods could be integrated into the development of the experiment.

        Speaker: Johannes Alexander Jaeger (ETH Zurich / Paul Scherrer Institut)
      • 25
        Developing neural network based surrogate models for predicting laser accelerated proton energy spectra

        Reliable and well‑characterised laser‑driven proton beams are essential for advancing laser‑ion acceleration from fundamental research to practical applications such as medical physics [1]. However, shot-to-shot variability and the lack of robust, non‑invasive diagnostics continue to limit progress. Recent advances in machine learning [2] offer a promising route to overcoming these challenges by enabling data‑driven prediction of beam properties directly from experimental inputs.
        Building on our group’s previous work, reported in McQueen et al. [3], which demonstrated a neural‑network synthetic diagnostic capable of predicting proton energy spectra from laser input parameters and back‑reflected light, we now investigate the development of a more flexible surrogate model that removes the requirement for secondary reflected‑light diagnostics. Using the same experimental dataset, we train a neural‑network surrogate [4] that takes only laser and target parameters as inputs, learns underlying laser–plasma interaction dynamics, and predicts proton energy spectra with associated uncertainty quantification. This approach aims to increase model portability across different laser facilities, including those where reflected‑light diagnostics are unavailable or impractical.
        Further work will incorporate data from a dedicated experiment at the ELI-Beamlines facility, enabling systematic studies of parameter‑space diversity and controlled scans. These investigations will assess how experimental variability and structured data collection impact the accuracy and generalisability of the surrogate model, contributing toward the long‑term goal of autonomous, machine‑learning‑assisted accelerator operation.
        [1] Kroll, F. et al. Tumour irradiation in mice with a laser-accelerated proton beam. Nat. Phys. 18, 316–322 (2022)
        [2] Döpp, A. et al. Data-driven science and machine learning methods in laser–plasma physics. High Power Laser Science and Engineering, 11, e55. (2023)
        [3] C. J. McQueen. et al. A neural network-based synthetic diagnostic of laser-accelerated proton energy spectra. Comm. Phys, 8, 66 (2025)
        [4] B. Z. Djordjević. et al. Modeling laser-driven ion acceleration with deep learning. Phys. Plasmas 28, 4 (2021)

        Speaker: Lana Buckleton (University of Strathclyde)
      • 26
        Diagnosis and optimisation of laser pulse shaping for laser-plasma accelerators

        Laser-plasma accelerators (LPAs) still trail conventional accelerators in terms of their ability to generate high-quality electron beams with low shot-to-shot variation. But with higher repetition rates and longer-term operation, the use of machine learning techniques is becoming increasingly viable as a control tool for improving the stability and reliability of LPAs.

        In this context, machine learning techniques have previously been used to identify promising working points [1] and to tune laser-plasma accelerators during operation [2],[3],[4]. A key challenge is that the high-intensity laser pulse used to drive the accelerating wakefield is often subject to unexpected variations. These variations have a significant impact on the properties of the accelerated electron bunches, so correlating, predicting [5] and compensating these deviations using active feedback mechanisms is crucial for improving beam quality and the stability of the LPA, particularly for proposed projects to utilise LPAs as frontends for conventional machines [6]. As part of this, it is necessary to have a fast and verifiable analysis of the laser pulse that informs the operator or machine controls about which optical component to tune.

        Here we present machine learning and simulation techniques implemented to diagnose and optimise laser systems for laser-plasma acceleration at DESY. A neural network which takes a few minutes to train on CPU has been used to identify complex laser mode coefficients with a high degree of accuracy in ~10 ms. This leverages LASY, a simulation tool which models beam propagation through the laser system.

        [1] https://journals.aps.org/prab/abstract/10.1103/PhysRevAccelBeams.26.084601
        [2] https://www.nature.com/articles/s41467-020-20245-6
        [3] https://link.springer.com/book/10.1007/978-3-031-88083-4
        [4] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.126.104801
        [5] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.126.174801
        [6] https://bib-pubdb1.desy.de/record/615183

        Speaker: Emily Archer (DESY)
      • 27
        Dreaming of Schottky Spectra: Building World Models for robust LEIR automation

        Stripper foil degradation in the Low Energy Ion Ring (LEIR) causes beam distribution drift that progressively degrades performance during multi-turn stacking at flat bottom. World models have emerged as a promising approach for sample-efficient and robust agents, enabling them to improve their behavior by rolling out policies in learned environment models between real interactions, thereby reducing the need for expensive online exploration. In this work, a world model-based reinforcement learning approach is presented for autonomous compensation of foil aging effects in LEIR, extending previous reinforcement learning-based intensity optimization efforts. The agent observes the accelerator state through encoded Longitudinal Schottky spectra and Time-of-Flight (TOF) measurements—capturing the coupled dynamics of beam parameters affected by foil-induced distribution drift—and learns a compact latent representation via a trained world model. This learned representation allows the agent to plan actions through internal simulation, improving sample efficiency under the strict data and operation constraints of accelerator control. By controlling the RF ramping and debunching cavities, electron gun voltage, and cooler bump to maintain optimal phase space conditions, the agent adapts to aging-induced beam drift throughout the injection plateau. The stochasticity of the world model can be tuned through temperature scaling, enabling robust imagined rollouts that prepare the agent for varying, unforeseen foil conditions. The resulting policy maintains beam intensity above nominal targets across repeated injection cycles despite progressive stripper foil aging, illustrating the potential of world model-based reinforcement learning for autonomous accelerator operation.

        Speaker: Borja Rodriguez Mateos (CERN)
      • 28
        Extending Reinforcement Learning for Beam Steering with Bayesian Optimization and Online System Identification in the CERN SPS North Area

        Commissioning slow extracted beams from the CERN Super Proton Synchrotron (SPS) to the North Area experimental targets requires trajectory control through multiple transfer lines using corrector magnets—a process that traditionally demands significant expert intervention. Previous work demonstrated using reinforcement learning (RL) for automated trajectory correction based on secondary emission monitor (SEM) split-foil intensity measurements, successfully centering the beam on target under nominal conditions. However, two failure modes require human intervention: complete signal loss when the beam exceeds SEM acceptance, and corrector magnet polarity changes that invalidate learned policies.
        We extend this framework with a hierarchical approach comprising three sequential stages. First, when the beam lies outside SEM detection range, we employ Bayesian optimization with random exploration to recover beam visibility. Second, we perform online system identification to automatically resolve corrector polarity ambiguities. With these prerequisites satisfied, the RL agent maps SEM observations to corrector adjustments, achieving beam centering throughout the transfer line.

        Speaker: Adrián Menor de Oñate (CERN)
      • 29
        Generalised Automatic Harmonic Operation in the CERN Proton Synchrotron Booster

        The Proton Synchrotron Booster (PSB) is equipped with a wideband radio-frequency (RF) system operated at multiple harmonics of the revolution frequency. Beyond acceleration, it stretches the proton bunches to mitigate space-charge effects. To maximize the bunch length throughout the entire acceleration cycle, three RF voltages at different harmonics and their relative phases must be tuned. However, imperfect signal-path compensation, heavy beam loading, and changing operating conditions require tedious adjustments for each beam type and intensity. To automate this optimization, we developed a memory-enhanced reinforcement-learning controller based on long short-term memory (LSTM) cells that learn iterative, profile-based corrections without supervised targets. Trained with simulated data including realistic artifacts and noise, the model was validated with beam. The contribution summarizes the design, training, and validation process.

        Speaker: Joel Wulff (CERN)
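
        In the spirit of the controller described above, a memory-enhanced corrector can be sketched as an LSTM over measured bunch profiles that emits incremental knob corrections; all sizes here are illustrative assumptions, not the PSB implementation.

        import torch
        import torch.nn as nn

        class ProfileCorrector(nn.Module):
            """LSTM over a sequence of bunch profiles; outputs a correction
            for the harmonic voltages and their relative phases."""
            def __init__(self, profile_bins=100, n_knobs=5, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(profile_bins, hidden, batch_first=True)
                self.head = nn.Linear(hidden, n_knobs)

            def forward(self, profiles, state=None):
                # profiles: (batch, steps, bins); keep state between calls
                out, state = self.lstm(profiles, state)
                return self.head(out[:, -1]), state
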
      • 30
        Geoff: Applications & Developments in 2025

        The complexity of the CERN and GSI/FAIR accelerator facilities requires a high degree of automation to maximize beam time and performance for physics experiments. Geoff, the Generic Optimization Framework & Frontend, is an open-source tool developed within the EURO-LABS project by CERN and GSI to streamline access to classical and AI-based optimization methods. It provides standardized interfaces for optimization problems and utility functions to speed up implementation. Plugins are independent packages with their own dependencies, allowing scaling from simple prototypes to complex state machines that communicate with devices in different timing domains. This contribution presents Geoff’s design, features, and current applications.

        At GSI, multi-objective Bayesian optimization was applied to SIS18 multi-turn injection, building a Pareto front from experimental data. At CERN, Geoff and ML/AI contributed to a record ion beam intensity for the LHC in 2024 through LEIR and SPS optimization. In addition, Geoff underwent major updates in 2025, aligning it with the latest developments in Python-based numerical and machine-learning software.

        Speaker: Dr Penny Madysa (GSI)
      • 31
        Improving Trajectory Tracking in Reinforcement Learning by Augmenting States with Future Targets

        Standard Reinforcement Learning (RL) for trajectory tracking typically relies on myopic state representations, providing agents only with the current target. This forces a reactive control paradigm, resulting in lag and overshoot during dynamic transitions. To address this, we propose augmenting the standard RL state space, which traditionally contains only the current reference, with future target information, e.g., a finite-horizon sequence of future targets or target velocities.
        We evaluate this predictive state representation on a real-world industrial testbed (Quanser Aero 2) using continuous S-curve trajectory profiles. Preliminary experiments demonstrate a significant performance improvement: augmenting the state with five future targets at 0.1 s intervals reduced the average tracking error from 2.60° (baseline) to 0.34°. These results suggest that simple state augmentation enables model-free agents to learn sophisticated anticipatory behaviors, i.e., initiating control actions before target changes occur, without explicit model-based planning.

        Speaker: Georg Schäfer
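
        The augmentation itself is small: the raw state is extended with the current target plus the next K reference points (K = 5 at 0.1 s spacing in the experiments above). A minimal sketch with a placeholder reference profile:

        import numpy as np

        K, DT = 5, 0.1  # horizon and spacing quoted in the abstract

        def augmented_state(state, reference, t):
            """Append the current and K future targets to the raw state."""
            targets = [reference(t + i * DT) for i in range(K + 1)]
            return np.concatenate([state, targets])

        ref = lambda t: np.tanh(t - 2.0)  # placeholder S-curve-like profile
        augmented_state(np.zeros(3), ref, t=1.0)  # shape (3 + 6,)
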
      • 32
        Koopman-Stabilised World Models for Offline Reinforcement Learning in Accelerator Control

        Particle accelerators and their design studies generate large amounts of historical data from archived operation logs and high-fidelity simulations, yet most learning-based control strategies still rely on online optimisation, where new data must be collected through direct machine interaction. To make better use of such pre-generated data and avoid additional online exploration, we present a workflow for offline reinforcement learning (offline RL) based on a two-stage modelling approach. First, an Xsuite-based high-fidelity beam dynamics model is used to generate and archive trajectories for steering tasks across a set of representative machine scenarios (e.g., optics variations, alignment errors, and jitter conditions), providing synthetic but realistic expert and non-expert behaviour. Second, a Koopman-inspired hybrid world model is learned from this dataset, yielding fast, stable multi-step prediction together with epistemic uncertainty estimates via ensemble variance. This learned model serves as a surrogate environment for offline RL. We benchmark offline RL policies against a PPO agent trained directly in the original Xsuite physics model, where PPO episodes are terminated once trajectories leave expert-like regions or enter high-epistemic-uncertainty domains, reflecting realistic operational safety limits. Results show that policies trained purely offline on the Koopman world model can match or exceed PPO performance under these constraints, while requiring no additional online exploration. The proposed workflow demonstrates how Xsuite-based simulation, uncertainty-aware surrogate modelling, and offline RL can be combined to turn historical scenario data into a safe and reproducible pathway for learning-based accelerator control.

        Speaker: Simon Hirlaender (PLUS University Salzburg)
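
        A Koopman-inspired world model of this kind can be sketched as an encoder followed by dynamics that are linear in the latent state and the control; dimensions and module names below are assumptions, not the paper's code.

        import torch
        import torch.nn as nn

        class KoopmanModel(nn.Module):
            def __init__(self, n_state, n_action, n_latent=32):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(n_state, 64), nn.Tanh(),
                    nn.Linear(64, n_latent))
                self.A = nn.Linear(n_latent, n_latent, bias=False)  # dynamics
                self.B = nn.Linear(n_action, n_latent, bias=False)  # control

            def rollout(self, x0, actions):
                """Multi-step prediction: z_{t+1} = A z_t + B u_t."""
                z, out = self.encoder(x0), []
                for u in actions:  # actions: (T, n_action)
                    z = self.A(z) + self.B(u)
                    out.append(z)
                return torch.stack(out)
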
      • 33
        Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics

        Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist’s intuition and expertise, as no systematic method exists.

        This work has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to process token sequences that represent particle decays.

        Speaker: Guillermo Hijano Mendizabal (University of Zurich)
      • 34
        MACS: Multi-Agent Reinforcement Learning for Optimization of Crystal Structures

        Geometry optimization of atomic structures is a common and crucial task in computational chemistry and materials design. Following the learning-to-optimize paradigm, we propose a new multi-agent reinforcement learning method called Multi-Agent Crystal Structure optimization (MACS) to address periodic crystal structure optimization. MACS treats geometry optimization as a partially observable Markov game in which atoms are agents that adjust their positions to collectively discover a stable configuration. We train MACS across various compositions of reported crystalline materials to obtain a policy that successfully optimizes structures from the training compositions as well as structures of larger sizes and unseen compositions, confirming its excellent scalability and zero-shot transferability. We benchmark our approach against a broad range of state-of-the-art optimization methods and demonstrate that MACS optimizes periodic crystal structures significantly faster, with fewer energy calculations, and with the lowest failure rate.

        Speaker: Elena Zamaraeva (University of Manchester, Fusion21)
      • 35
        ML-Based Phase Space Reconstruction for Loss Reduction at the PS-to-SPS Transfer

        The beam intensity in the injector chain at CERN has been nearly doubled as part of the upgrades for the High-Luminosity LHC (HL-LHC). This presents multiple operational challenges. A critical bottleneck is the uncaptured beam created during the transfer from the Proton Synchrotron (PS) to the Super Proton Synchrotron (SPS). Tomographic reconstruction of the longitudinal distribution during bunch compression in the PS has been shown to be effective for predicting losses in the SPS and optimizing PS extraction timing. However, its computational demands currently prohibit live multi-bunch analysis. A machine learning approach to reconstruct the longitudinal phase space from bunch profile data during bunch compression in the PS to enable real-time analysis of bunch trains is presented. This supervised neural network provides the basis for a tool that could autonomously optimize RF parameters to minimize losses due to uncaptured beam in the SPS.

        Speaker: Jake Flowerdew (CERN)
      • 36
        Multi-Agent Reinforcement Learning for Resource Allocation in Wireless Network Communication

        Multi-Agent Reinforcement Learning (MARL) is an important subfield of Reinforcement Learning in which multiple agents learn in a shared environment. The simultaneous learning of several players naturally arises in domains like robotics, network communication, and traffic control, where agents affect and influence one another. MARL can thus simulate real-world problems in a reliable way, and consequently, interest in MARL continues to grow.
        In this work, we consider the real-world problem of resource allocation in wireless network communication.
        Due to the fast development of wireless network communication, data traffic is rising, and more devices are communicating, such as mobile users and machines in factories.
        These devices interfere with one another's communication by competing for the same resources needed to guarantee reliable communication. Avoiding the overlap of used frequency bands and controlling the wireless network communication therefore becomes more complicated. We use MARL to solve the problem of overlapping frequency bands, so that the trained algorithm distributes frequency bands properly. This ensures reliable network communication.
        Accordingly, all communicating devices are agents in a specific area where they communicate. As actions, they choose communication channels. Across different scenarios, the set of possible channel selections is variable.
        To enable a reliable solution of the problem, each agent receives the following information in its state: the communication channel used in the previous step, its own Quality of Service (QoS) achieved by the last action, a vector of all neighbouring devices, and the communication channels the neighbouring devices used in their last action.
        After their selection of a communication channel, all agents receive a reward, chosen to be the sum of the achieved QoS of all agents, since a shared reward avoids adversarial behaviour and leads to cooperation between the agents.
        We start the training by using a single-agent Q-learning algorithm. This leads to optimal training results for a small number of agents. However, additional MARL problems arise, such as the non-stationarity of the environment, scalability issues, and non-unique learning goals. To address these problems, we use different MARL algorithms, such as a NashQ algorithm and an IQL algorithm. These algorithms give optimal training results for a small number of agents and outperform the regular Q-learning algorithm in training time. However, the scalability problem still persists, so we want to address this in future work by using a VDN-QMIX algorithm, which uses the global state during training but reverts to a decentralised setting at execution. We thus hope to achieve training that scales well to many agents while remaining faithful to the real-world problem.

        Speaker: Sabrina Pochaba (Salzburg Research)
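
        The cooperative reward channel reduces to one line: every agent receives the sum of all agents' QoS. A Python sketch with a hypothetical agent interface:

        def joint_step(agents, qos_fn):
            """Independent channel choices, one shared cooperative reward."""
            actions = {a.name: a.choose_channel() for a in agents}
            qos = {name: qos_fn(name, actions) for name in actions}
            shared_reward = sum(qos.values())  # identical for all agents
            for a in agents:
                a.update(actions[a.name], shared_reward)
            return actions, shared_reward
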
      • 37
        Online reinforcement learning control of beam collision for BEPCII

        For the Beijing Electron-Positron Collider II (BEPCII), operators need to tune the transverse offsets—including displacement and angular deviation (x, x’, y, y’)—of the two beams at the interaction point (IP) to maintain high luminosity as the beam current decays during normal operation. Given that the optimal offset exhibits a non-linear variation with beam current within a single run and also differs across individual runs, sustaining the optimal beam offset at the IP for consistent high luminosity at all times is laborious. Consequently, operators typically adopt a linear model for automatic offset tuning. In this study, a Deep Q-Network (DQN) agent was trained using historical data to adjust the beam offset at the IP. The DQN agent employs 18 input parameters (including IP offset, beam position monitor (BPM) readings, and beam current) and 8 output parameters (Q-values for action selection). This DQN agent has been successfully deployed in daily offset tuning, essentially replacing both the linear model and manual operator adjustments. Furthermore, it has achieved an increase in integrated luminosity compared to the previous approach.

        Speaker: Jiaqi Fan (Institute of High Energy Physics, Chinese Academy of Sciences (IHEP))
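
        A network matching the quoted dimensions (18 observables in, 8 Q-values out) is small; the hidden sizes below are an assumption, not the BEPCII implementation.

        import torch
        import torch.nn as nn

        class OffsetDQN(nn.Module):
            def __init__(self, n_obs=18, n_actions=8, hidden=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(n_obs, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_actions))  # one Q-value per action

            def forward(self, x):
                return self.net(x)

        q = OffsetDQN()
        greedy = q(torch.randn(1, 18)).argmax(dim=1)  # best offset move
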
      • 38
        PIPELINES: A NODE-BASED EDITOR FOR STREAMLINED OPTIMISATION PROTOTYPING IN THE CONTROL ROOM

        Development shifts on accelerators are usually time-constrained and infrequent. Meanwhile, control room PCs are not designed for scrappy R&D, and maintaining multiple workflows with Python scripts is prone to error. GUI apps have been successfully deployed and used in the past to perform optimisation at accelerator facilities. However, bookkeeping can become difficult in complex tasks. Furthermore, support is missing for pre-optimisation steps such as the response matrix measurements used in Slow Orbit Feedback (SOFB) machine learning algorithms. A PySide node-based visual editor has been developed and tested in the Diamond control room. A logical hierarchy of blocks defines the processes to perform, and an inspector window allows the user to fine-tune blocks to their needs. Separate processes are spawned when compute- or time-intensive blocks are run, keeping the main UI thread responsive. An optimisation problem is tackled using the app to demonstrate its usefulness.

        Speaker: Shaun Preston (University of Oxford)
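
        A minimal sketch of the process-spawning pattern the abstract describes, with the PySide specifics stripped out: a compute-intensive block runs in a separate process while the main (UI) loop stays responsive. All names here are illustrative.

            # Run a heavy block in a separate process; the main loop keeps polling.
            import multiprocessing as mp
            import time

            def heavy_block(conn) -> None:
                # stand-in for e.g. a response-matrix measurement
                time.sleep(2.0)
                conn.send({"status": "done", "result": 42})

            if __name__ == "__main__":
                parent, child = mp.Pipe()
                proc = mp.Process(target=heavy_block, args=(child,))
                proc.start()
                while not parent.poll():        # in a real app this poll lives in a Qt timer
                    print("UI thread still responsive...")
                    time.sleep(0.5)
                print(parent.recv())
                proc.join()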
      • 39
        Reinforcement Learning Beyond Greedy Optimisation for Delayed-Consequence Accelerator Control

        Most accelerator control systems assume that the effect of an action can be evaluated locally and immediately. While greedy approaches work in near-linear regimes and Bayesian Optimisation (BO) is now standard for black-box tuning, both are essentially static optimisers and struggle in dynamic tasks with delayed consequences, where even adaptive BO remains time-myopic and lacks explicit temporal credit assignment for system memory and long-range machine evolution.
        We investigate three relevant forms of delayed consequences: explicit action latency (field settling delays response), magnetic hysteresis (output depends on change history), and ballistic amplification (small upstream kicks grow through nonlinear optics and apertures, causing downstream loss).
        Using a high-fidelity XSuite model of the AWAKE electron line, we benchmark a reinforcement learning controller against an inverse-response greedy optimiser and BO. The learning-based method anticipates delayed effects and avoids the failure regions that trap both baselines, indicating that delayed-consequence regimes are a key class of accelerator control problems in which horizon-aware model-based or learning-based methods clearly outperform current practice.

        Speakers: Kajsa Miho Björkbom (PLUS University Salzburg), Simon Hirlaender (PLUS University Salzburg)
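
        One of the delayed-consequence mechanisms above, explicit action latency, can be captured with a simple environment wrapper; a minimal Gymnasium sketch follows, assuming a buffered delay of a few steps. The wrapped environment is a generic placeholder, not the authors' XSuite model.

            from collections import deque
            import gymnasium as gym

            class ActionLatencyWrapper(gym.Wrapper):
                """Applies each action only after `delay` steps have elapsed."""
                def __init__(self, env: gym.Env, delay: int = 3):
                    super().__init__(env)
                    self.delay = delay
                    self.buffer = deque()

                def reset(self, **kwargs):
                    # pre-fill with zero ("no-op") actions, assuming a continuous space
                    self.buffer = deque(0.0 * self.env.action_space.sample()
                                        for _ in range(self.delay))
                    return self.env.reset(**kwargs)

                def step(self, action):
                    self.buffer.append(action)
                    delayed = self.buffer.popleft()   # the action chosen `delay` steps ago
                    return self.env.step(delayed)

            env = ActionLatencyWrapper(gym.make("Pendulum-v1"), delay=3)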
      • 40
        Reinforcement Learning combined with a surrogate model of the accelerator

        Recent developments at the INFN laboratories in Legnaro have demonstrated the effectiveness of Bayesian optimization in automating the tuning process of particle accelerators, yielding substantial improvements in beam quality, significantly reducing setup times, and shortening recovery times following interruptions. Despite these advances, the high-dimensional parameter space defined by numerous sensors and actuators continues to pose challenges for fast and reliable convergence to optimal configurations. This work proposes a machine learning-based framework that combines surrogate modeling of the accelerator with reinforcement learning strategies for closed-loop optimization, with the goal of further accelerating commissioning procedures and enhancing beam performance.

        Speaker: Daniele Zebele (INFN)
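
        A minimal sketch of the surrogate-in-the-loop idea, assuming a small neural network stands in for the accelerator inside the environment step; the dimensions, names, and reward here are illustrative, not the INFN implementation.

            # A learned forward model replaces the machine in the RL environment step.
            import numpy as np
            import torch
            import torch.nn as nn

            surrogate = nn.Sequential(          # maps actuator settings -> beam readings
                nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2),
            )

            def surrogate_step(settings: np.ndarray) -> tuple[np.ndarray, float]:
                """One 'environment' step: predict diagnostics and score them."""
                with torch.no_grad():
                    reading = surrogate(torch.as_tensor(settings, dtype=torch.float32))
                reading = reading.numpy()
                reward = -float(np.sum(reading ** 2))   # e.g. penalise deviation from target
                return reading, reward

            obs, r = surrogate_step(np.zeros(4))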
      • 41
        Reinforcement Learning–Guided Dynamic Tuning of a THz Linac

        By scaling accelerator operation to THz frequencies, dielectric-lined waveguides (DLWs) can achieve accelerating gradients far higher than conventional RF structures, while supporting modes that couple longitudinal acceleration with transverse focusing. We propose a reinforcement-learning–based dynamic tuner that, using beam distribution information, adjusts the THz phase and amplitude of consecutive DLW stages in real time to maximise electron transmission and energy gain while minimising emittance growth and energy spread under realistic jitter and misalignments.

        Speaker: Filip Peczek (The University of Manchester)
      • 42
        Resource-Conditioned Reinforcement Learning for Physics Instrument Design

        Designing advanced particle-physics instruments requires navigating a high-dimensional space of discrete and continuous choices while satisfying strict constraints on material, cost, and geometry. In practice, these constraints evolve throughout an experiment’s lifetime, making it insufficient to optimize a single “best” detector configuration. We present a resource-conditioned reinforcement learning (RL) framework for detector design that produces families of optimized configurations matched to different constraint levels. Building on prior RL-based instrument design workflows (arXiv:2412.10237), we train agents that condition their policy on available resources and other problem parameters, enabling a single training run to generate multiple designs spanning a spectrum of feasible budgets and geometries.

        We demonstrate the approach on longitudinal calorimeter design, where the agent learns to adapt sensor placement and layer thickness patterns in a non-linear way as resources change, yielding a set of optimized architectures that directly expose performance–cost trade-offs (e.g., energy resolution versus material usage). We discuss practical aspects of constraint handling and feasibility enforcement, and we outline how the same conditional formulation can be extended to additional design degrees of freedom—such as total detector size, spatial envelopes, or task-specific operating points—supporting interactive studies for decision-makers. This work reframes detector optimization from producing a single configuration to learning a controllable design policy that can respond to shifting requirements during experiment design.

        Speaker: Sara Zoccheddu (University of Zurich)
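
        A minimal sketch of resource conditioning, assuming the budget enters the policy as an extra input alongside the observation, so a single set of weights covers a family of constraint levels; the architecture and dimensions are illustrative.

            import torch
            import torch.nn as nn

            class ConditionedPolicy(nn.Module):
                def __init__(self, obs_dim: int = 16, n_actions: int = 6):
                    super().__init__()
                    self.body = nn.Sequential(
                        nn.Linear(obs_dim + 1, 128), nn.ReLU(),   # +1 for the budget input
                        nn.Linear(128, n_actions),
                    )

                def forward(self, obs: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
                    x = torch.cat([obs, budget.unsqueeze(-1)], dim=-1)
                    return torch.softmax(self.body(x), dim=-1)

            policy = ConditionedPolicy()
            obs = torch.randn(8, 16)
            budgets = torch.rand(8)            # sampled per episode during training
            probs = policy(obs, budgets)       # same weights, different budget regimes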
      • 43
        Robust Real-Time Optimization of SIS18 Injection using Gaussian Process MPC

        We present advancements in the data-driven Model Predictive Control (MPC) framework for optimizing multi-turn injection (MTI) into the SIS18 synchrotron. Building on our prior work on safe, sample-efficient optimization, we systematically investigate the impact of current noise and transverse emittance fluctuations. By incorporating into XSuite simulations realistic error models of current and emittance fluctuations, derived from dedicated measurements of the ion source and UNILAC, we demonstrate that the Gaussian Process model effectively filters aleatoric uncertainty, maintaining robust operation where standard numerical optimizers degrade. Furthermore, we report on the successful deployment of the framework during live SIS18 tuning. The controller autonomously adjusted injection parameters, demonstrating reliable convergence, enhanced efficiency, and a substantial reduction in tuning iterations compared to model-free RL methods, which often face challenges in real-world applications. These results establish data-driven MPC as a powerful tool for real-time optimization in noisy, high-stakes accelerator environments, setting the stage for safe learning-based control across FAIR facilities.

        Speaker: Simon Hirlaender (PLUS University Salzburg)
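
        A minimal scikit-learn sketch of why a Gaussian Process surrogate filters aleatoric noise: fitting with an explicit white-noise term lets the predictive mean average over shot-to-shot fluctuations instead of chasing them. The one-knob objective below is synthetic, not SIS18 data.

            import numpy as np
            from sklearn.gaussian_process import GaussianProcessRegressor
            from sklearn.gaussian_process.kernels import RBF, WhiteKernel

            rng = np.random.default_rng(0)
            x = rng.uniform(-1, 1, size=(40, 1))                      # e.g. one injection knob
            y = np.sin(3 * x[:, 0]) + 0.2 * rng.standard_normal(40)   # noisy efficiency

            kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=0.04)
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)

            x_test = np.linspace(-1, 1, 200)[:, None]
            mean, std = gp.predict(x_test, return_std=True)   # smooth mean + uncertainty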
      • 44
        Testing and Improving RL Policies via Rule Learning

        Using domain knowledge to improve deep RL policies is a current challenge. LEGIBLE mines rules from an RL policy, constituting a partially symbolic representation. These rules describe which decisions the RL policy makes and which it avoids. It then generalizes the mined rules using domain knowledge. Finally, it evaluates the generalized rules to determine which generalizations improve performance when enforced. These improvements expose weaknesses in the policy: places where it has not learned the general rules and can therefore be improved by rule guidance. We demonstrate the efficacy of our approach by showing that it effectively finds such weaknesses, together with explanations of them, in several RL environments.
        Closing the loop from neural to symbolic and back to neural representation, we show how to integrate symbolic (rule-based) knowledge into neural RL policies by leveraging RL from demonstrations with OFTEN-DeepRL.

        Speaker: Ignacio D. Lopez-Miguel
      • 45
        Towards a Training-Efficient Reinforcement Learning Based Control Approach for the LUMEN Engine Using Curriculum-Guided PPO

        In space propulsion, a small set of controllable valves regulates propellant mass flows and thereby achieves desired operating targets. Reinforcement learning (RL) is promising here because it can learn feedback policies directly from simulator interaction. Prior work has demonstrated the practical viability of deep RL for related control tasks. However, even in low-dimensional actuator settings, as in our 2x2 configuration, training can be sample-inefficient and unstable due to nonlinear, coupled dynamics and constraint-critical transients. The resulting instability increases the risk of converging to suboptimal solutions or violating safety limits. To address these challenges, this study investigates a curriculum learning approach to improve training efficiency and reliability for RL-based control of the LUMEN (Liquid Upper stage deMonstrator ENgine) simulation environment. Using Proximal Policy Optimization (PPO), training is structured as a progression from simplified operating regimes toward increasingly demanding tasks by expanding the operating envelope. Evaluation focuses on training efficiency, transient-response quality, and tracking of constraint violations, complemented by time-series diagnostics and reward-component analyses to interpret learning progress.

        Speakers: Fabio Matanza (University of Salzburg, B.Sc. AI Thesis (DLR, Institute of Space Propulsion & IDA Lab Salzburg)), Simon Hirlaender (PLUS University Salzburg)
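
        A minimal sketch of the curriculum structure described above, assuming stages that progressively widen the operating envelope; LumenEnv and its envelope parameter are hypothetical placeholders for the authors' simulator, while the stable-baselines3 PPO calls are standard.

            from stable_baselines3 import PPO

            stages = [0.2, 0.5, 1.0]                  # fraction of the full operating envelope
            model = None
            for envelope in stages:
                env = LumenEnv(envelope=envelope)     # hypothetical simulator wrapper, not the authors' API
                if model is None:
                    model = PPO("MlpPolicy", env, verbose=0)
                else:
                    model.set_env(env)                # keep learned weights, harden the task
                model.learn(total_timesteps=100_000)  # per-stage budget (illustrative)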
    • Contributed talks: RL Applied to Particle Accelerators Theatre 2, Teaching Hub 502

      Theatre 2, Teaching Hub 502

      University of Liverpool

      Liverpool L69 7ZP UK
      Convener: Simon Hirlaender (PLUS University Salzburg)
      • 46
        Extending Reinforcement Learning for Beam Steering with Bayesian Optimization and Online System Identification in the CERN SPS North Area

        Commissioning slow extracted beams from the CERN Super Proton Synchrotron (SPS) to the North Area experimental targets requires trajectory control through multiple transfer lines using corrector magnets—a process that traditionally demands significant expert intervention. Previous work demonstrated using reinforcement learning (RL) for automated trajectory correction based on secondary emission monitor (SEM) split-foil intensity measurements, successfully centering the beam on target under nominal conditions. However, two failure modes require human intervention: complete signal loss when the beam exceeds SEM acceptance, and corrector magnet polarity changes that invalidate learned policies.
        We extend this framework with a hierarchical approach comprising three sequential stages. First, when the beam lies outside SEM detection range, we employ Bayesian optimization with random exploration to recover beam visibility. Second, we perform online system identification to automatically resolve corrector polarity ambiguities. With these prerequisites satisfied, the RL agent maps SEM observations to corrector adjustments, achieving beam centering throughout the transfer line.

        Speaker: Adrián Menor de Oñate (CERN)
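
        A minimal sketch of the online polarity-identification stage, assuming it amounts to probing with a small kick and reading the sign of the induced response; apply_kick and read_sem are hypothetical machine-interface stubs.

            import numpy as np

            def identify_polarity(apply_kick, read_sem, probe: float = 1e-3) -> int:
                baseline = read_sem()
                apply_kick(+probe)
                response = read_sem() - baseline
                apply_kick(-probe)                      # undo the probe
                return int(np.sign(response))           # +1: nominal, -1: flipped polarity

            # A flipped sign can then be folded into the agent's action mapping,
            # so the learned policy remains valid after a polarity change.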
      • 47
        Real-Time Ion Beam Orbit Correction in the RAON LEBT Section Using Surrogate Modeling and Reinforcement Learning

        Precision orbit correction is critical for maintaining beam stability and transmission efficiency in low-energy beam transport (LEBT) systems, particularly under the influence of nonlinear field effects and strong space-charge forces. Traditional correction techniques often struggle to cope with such complexities, especially when real-time responsiveness is required.
        In this work, we present a machine learning–based framework that combines surrogate modeling with reinforcement learning to enable accurate and adaptive orbit correction in RAON's LEBT section. The approach demonstrates strong potential for improving correction performance beyond conventional methods, while offering generalizability to varying beam conditions. Moreover, this study highlights how AI-driven control techniques can be integrated into modern accelerator operations, paving the way toward intelligent and autonomous beam tuning systems.

        Speaker: Chong Shik Park (Korea University, Sejong)
      • 48
        Autonomous Optimization of RF Triple Splitting in the CERN PS

        Reinforcement learning (RL) is a powerful technique for optimizing complex beam manipulations. An RL-based autonomous controller has been developed for the triple splitting RF manipulation in the CERN Proton Synchrotron (PS), which is essential to establish the bunch spacing for the LHC. The system combined a convolutional neural network for initial phase correction with sequential soft actor-critic agents optimizing the RF parameters. Trained with simulated bunch profile data, the controller demonstrated robust and rapid convergence during early beam tests. This motivated its deployment as an on-demand tool and later as a fully autonomous controller. However, changes to the RF voltage program or operating conditions would require offline simulation, dataset regeneration, and retraining. Based on the experience gained running the RL controller, it has been replaced by a PID-based solution that requires only gain tuning while achieving comparable performance. This case study highlights both the strengths and limitations of RL for autonomous accelerator control and underlines maintainability as a key criterion for operational implementations.

        Speaker: Joel Wulff (CERN)
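
        A minimal sketch of a PID controller of the kind the abstract says replaced the RL system; the gains and the error signal here are illustrative, not the operational configuration.

            class PID:
                def __init__(self, kp: float, ki: float, kd: float, dt: float = 1.0):
                    self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
                    self.integral = 0.0
                    self.prev_error = 0.0

                def update(self, error: float) -> float:
                    self.integral += error * self.dt
                    derivative = (error - self.prev_error) / self.dt
                    self.prev_error = error
                    return self.kp * error + self.ki * self.integral + self.kd * derivative

            controller = PID(kp=0.5, ki=0.1, kd=0.0)
            correction = controller.update(error=0.2)   # e.g. measured splitting asymmetry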
    • 15:00
      Coffee ☕ Teaching Hub 502, First Floor

      Teaching Hub 502, First Floor

      University of Liverpool

    • Contributed talks: RL Applied to Other Systems Theatre 2, Teaching Hub 502

      Theatre 2, Teaching Hub 502

      University of Liverpool

      Liverpool L69 7ZP UK
      Convener: Alexander Brynes (Science & Technology Facilities Council)
      • 49
        Reinforcement Learning for Control of Polarized Cryogenic Targets at Jefferson Lab

        The operation of solid, cryogenic polarized targets in nuclear physics experiments relies on continuous tuning of the microwave frequency to compensate for radiation damage and evolving material properties, a task that is traditionally performed through manual trial-and-error by expert operators. This work presents a data-driven control framework that combines surrogate modeling with reinforcement learning to optimize the target polarization. Using operational data from the APOLLO cryogenic target system, we train and evaluate multilayer perceptron (MLP) and Gaussian process regression models to predict polarization as a function of microwave frequency, beam current, and accumulated radiation dose. We show that Gaussian process-based models provide well-calibrated uncertainty estimates and reliably identify regions outside the training distribution, while MLPs exhibit limited sensitivity to distributional shift. To enable learning and control across multiple target samples, we introduce a Gaussian process approximation and embed the surrogate model within a standardized simulation environment. A reinforcement learning agent is trained using a lower-confidence-bound reward formulation that balances performance maximization against uncertainty.

        Speaker: Armen Kasparian (Jefferson Lab)
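
        A minimal sketch of a lower-confidence-bound reward, assuming a fitted scikit-learn Gaussian Process surrogate gp: the agent is rewarded with the predictive mean minus a risk-weighted standard deviation, so high predicted polarization only pays off where the model is confident. The kappa value is a tunable risk parameter, not a value from the paper.

            import numpy as np

            def lcb_reward(gp, state: np.ndarray, kappa: float = 1.0) -> float:
                """Reward = mu(state) - kappa * sigma(state) from the GP surrogate."""
                mu, sigma = gp.predict(state.reshape(1, -1), return_std=True)
                return float(mu[0] - kappa * sigma[0])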
      • 50
        Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics

        Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, identifying the most relevant background processes requires a detailed analysis of final-state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist's intuition and expertise, as no systematic method exists.

        This work has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to process token sequences that represent particle decays.

        Speaker: Guillermo Hijano Mendizabal (University of Zurich)
      • 51
        Multi-Agent Reinforcement Learning for Resource Allocation in Wireless Network Communication

        Multi-Agent Reinforcement Learning (MARL) is an important subfield of Reinforcement Learning, in which multiple agents learn in a shared environment. The simultaneous learning of several players naturally arises in domains like robotics, network communication and traffic control, where agents affect and influence one another. Thus, MARL can simulate real-world problems in a reliable way, and consequently, interest in MARL continues to grow.
        In this work, we consider the real-world problem of resource allocation in wireless network communication.
        Due to the rapid development of wireless network communication, data traffic is rising and ever more devices, such as mobile users and machines in factories, are communicating.
        These devices interfere with one another by competing for the same resources needed to guarantee reliable communication. Avoiding overlaps between the frequency bands in use and controlling the wireless network therefore becomes more complicated. We use MARL to solve the problem of overlapping frequency bands, training an algorithm that distributes frequency bands appropriately and thereby ensures reliable network communication.
        Accordingly, all communicating devices are modelled as agents within a specific area in which they communicate. As actions, they choose communication channels; the set of available channels varies across scenarios.
        To enable a reliable solution of the problem, each agent receives the following information in its state: the communication channel used in the previous step, its own Quality of Service (QoS) achieved by the last action, a vector of all neighbouring devices, and the communication channels those neighbours used in their last action.
        After their selection of a communication channel, all agents receive a reward, chosen to be the sum of the achieved QoS over all agents, since a shared reward avoids adversarial behaviour and fosters cooperation between the agents.
        We start the training with a single-agent Q-learning algorithm. This yields optimal training results for a small number of agents. However, additional MARL problems arise, such as the non-stationarity of the environment, scalability issues, and non-unique learning goals. To address these problems, we use different MARL algorithms, such as a NashQ algorithm and an IQL algorithm. These algorithms give optimal training results for a small number of agents and outperform regular Q-learning in training time. However, the scalability problem persists, so in future work we plan to address it with a VDN-QMIX algorithm, which uses the global state during training but returns to a decentralised setting at execution time. In this way, we hope to achieve training that scales well to many agents while retaining a faithful representation of the real-world problem.

        Speaker: Sabrina Pochaba (Salzburg Research)
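
        A minimal sketch of the shared-reward step described above, assuming a toy QoS in which agents sharing a channel degrade each other's service; every agent receives the same team reward, the sum of all QoS values.

            import numpy as np

            def shared_reward_step(actions: np.ndarray, n_channels: int) -> tuple[np.ndarray, float]:
                """actions[i] = channel chosen by agent i."""
                counts = np.bincount(actions, minlength=n_channels)
                qos = 1.0 / counts[actions]          # toy QoS: shared channels degrade service
                team_reward = float(qos.sum())       # identical reward for every agent
                return qos, team_reward

            # Four agents, three channels: agents 0 and 1 collide on channel 0
            qos, r = shared_reward_step(np.array([0, 0, 2, 1]), n_channels=3)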
    • Coding Challenge 💻: Hands-on Theatre 2, Teaching Hub 502

      Theatre 2, Teaching Hub 502

      University of Liverpool

      Liverpool L69 7ZP UK

      Time allocated to work on the RL challenge

      Challenge information:

      • Final submission deadline: 14:15, Wednesday 1 April
      • 1st and 2nd place teams will be asked to present their results
      • Template slides are available in the session contributions: they will guide what we want to see you present. If you prefer to make your own, feel free! Each team will have approximately 5 minutes of speaking time plus 3 minutes for questions.
      Conveners: Amelia Pollard (ASTeC), Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Joel Wulff (CERN)
    • Workshop social: Dinner & Gaming Gravity Max

      Gravity Max

      5 Wall St, Liverpool L1 8JQ
    • Transportation 🚌: Departure to venue for last day Central Teaching Hub

      Central Teaching Hub

      University of Liverpool

    • 09:30
      Coffee & pastries ☕ 🥐 Cockcroft Institute

      Cockcroft Institute

    • Laboratory visit: Tour at Daresbury Laboratory Daresbury Laboratory

      Daresbury Laboratory

      • 52
        Site Overview Merrison Lecture Theatre, Cockcroft Institute

        Merrison Lecture Theatre, Cockcroft Institute

        Daresbury Laboratory

        Speaker: Peter McIntosh (STFC ASTeC)
      • 53
        Tours
    • 12:00
      Lunch break (self-paid) Campus restaurant or Waterside Café (Daresbury Laboratory)

      Campus restaurant or Waterside Café

      Daresbury Laboratory

    • Contributed talks: RL Applied to Other Systems and Others Merrison Lecture Theatre, Cockcroft Institute (Daresbury Laboratory)

      Merrison Lecture Theatre, Cockcroft Institute

      Daresbury Laboratory

      Convener: Amelia Pollard (ASTeC)
      • 54
        Deploying Recurrent Neural Networks on AMD FPGAs: A GRU Implementation on the Versal AI Engine

        This talk presents an overview of neural network deployment on reconfigurable hardware, with a particular focus on modern AMD FPGA platforms and the Versal Adaptive Compute Acceleration Platform (ACAP). The discussion begins with examples from physics applications where reinforcement learning and recurrent neural networks are jointly employed for real-time control and decision-making.

        An introduction to FPGA technology is then provided, covering the fundamental hardware components, common development workflows, and programming approaches using hardware description languages (HDL) and high-level synthesis (HLS). System-on-Chip (SoC) architectures are discussed, leading to a detailed presentation of the AMD Versal platform as a heterogeneous architecture integrating programmable logic, processing systems, and AI Engines.

        The talk subsequently reviews existing frameworks for deploying neural networks on FPGA-based systems. Finally, a case study is presented describing the design and implementation of a Gated Recurrent Unit (GRU) on the Versal AI Engine, highlighting architectural considerations and practical challenges associated with mapping recurrent neural networks to this platform.

        Speaker: Michail Sapkas (UniPD - INFN Padova)
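
        A minimal PyTorch sketch of the kind of small GRU that would be mapped to the AI Engine: one recurrent layer with an explicit hidden state, evaluated over a sequence. The sizes are illustrative, not the case-study configuration.

            import torch
            import torch.nn as nn

            gru = nn.GRU(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
            head = nn.Linear(16, 2)

            x = torch.randn(1, 50, 8)        # (batch, time steps, features)
            h0 = torch.zeros(1, 1, 16)       # (layers, batch, hidden)
            out, hn = gru(x, h0)
            y = head(out[:, -1])             # prediction from the final time step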
      • 55
        Binary Trigger Signals for Deep Reinforcement Learning in Equity Trading

        This study introduces a novel binary trigger-based state representation for deep reinforcement learning (DRL) in stock trading. Unlike conventional approaches using continuous technical indicators (MACD, RSI, CCI, ADX), we encode market state via binary signals: MVX (moving-average crossover) and BOLLX (Bollinger band breakout). We also propose trigger-date filtering, which trains only on dates when triggers fire, reducing training data by 50-70%.

        Evaluating 27 configurations (three algorithms: A2C, PPO, SAC across nine indicator variants) on Dow Jones 30 daily data (Jan-Nov 2025), we discover a strong algorithm-indicator dependency: A2C with MVX yields +30.85% improvement, PPO with BOLLX achieves +16.09%, while SAC remains robust to both. The best configuration (A2C with filtered MVX) achieves 31.90% cumulative return, a Sharpe ratio of 1.41, and outperforms the DJIA baseline by 154%.

        A systematic review of papers (2015-2025) suggests both contributions are novel: no prior work employs binary trigger signals or trigger-date filtering in DRL trading. Results partially validate RL over traditional strategies (37% of models beat DJIA) while showing trigger-date filtering benefits A2C but hurts PPO/SAC. Limitations include the 11-month test period and absence of LSTM temporal modeling, suggesting future work on recurrent architectures and multi-market validation.

        Speaker: Juan Manuel Montoya Bayardo
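
        A minimal sketch of the two binary trigger signals as described: MVX fires on a fast/slow moving-average crossover and BOLLX on an upper Bollinger-band breakout. The window lengths and band width are assumptions, not the paper's settings.

            import numpy as np
            import pandas as pd

            def mvx(close: pd.Series, fast: int = 10, slow: int = 50) -> pd.Series:
                ma_f, ma_s = close.rolling(fast).mean(), close.rolling(slow).mean()
                cross = (ma_f > ma_s) & (ma_f.shift(1) <= ma_s.shift(1))
                return cross.astype(int)                 # 1 on the crossover bar, else 0

            def bollx(close: pd.Series, window: int = 20, k: float = 2.0) -> pd.Series:
                ma = close.rolling(window).mean()
                band = ma + k * close.rolling(window).std()
                return (close > band).astype(int)        # 1 on an upper-band breakout

            prices = pd.Series(np.cumsum(np.random.randn(300)) + 100.0)
            state = pd.DataFrame({"MVX": mvx(prices), "BOLLX": bollx(prices)})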
      • 56
        RL control for AGS Injection in silico

        We present an RL-based approach for optimizing beam injection into the Alternating Gradient Synchrotron (AGS) using a fully differentiable in silico environment. Beam diagnostics from multiwire screens in the BtA transfer line and turn-by-turn beam position monitors throughout the AGS lattice are leveraged to characterize injection quality, quantify beam survival time, and localize beam losses along the ring. These observables are incorporated into the RL reward structure, enabling the agent to directly optimize beam survival and transmission efficiency. The simulation framework models nonlinear beam dynamics, lattice optics, and operational constraints, allowing gradients to be propagated through the environment for efficient policy learning. To ensure robustness and generalization, domain randomization is applied over initial beam distributions and lattice misalignments during training. Results demonstrate that the trained agent learns control policies that improve beam survival and reliably mitigate loss mechanisms across a range of perturbed machine conditions. This work establishes a foundation for robust, data-driven injection optimization and supports future translation of RL-based control strategies to accelerator operations.

        Speaker: Eiad Hamwi (Brookhaven National Laboratory)
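
        A minimal sketch of the domain-randomization step described above, assuming each episode resets with freshly sampled beam parameters and per-element misalignments; the reset interface and the numerical scales are hypothetical placeholders, not the AGS environment's API.

            import numpy as np

            rng = np.random.default_rng()

            def randomized_reset(env):
                beam = {
                    "emittance": rng.uniform(0.8, 1.2),        # relative to nominal (assumed range)
                    "offset_x": rng.normal(0.0, 0.5e-3),       # metres (assumed scale)
                }
                misalign = rng.normal(0.0, 100e-6, size=env.n_elements)  # per-element shifts
                return env.reset(beam=beam, misalignments=misalign)      # hypothetical interface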
    • Coding Challenge 💻: Hands-on & Presentations Merrison Lecture Theatre, Cockcroft Institute (Daresbury Laboratory)

      Merrison Lecture Theatre, Cockcroft Institute

      Daresbury Laboratory

      Time allocated to work on the RL challenge

      Challenge information:

      • Final submission deadline: 14:15, Wednesday 1 April
      • 1st and 2nd place teams will be asked to present their results
      • Template slides are available in the session contributions: they will guide what we want to see you present. If you prefer to make your own, feel free! Each team will have approximately 5 minutes of speaking time plus 3 minutes for questions.
      Convener: Joel Wulff (CERN)
      • 57
        Challenge Free Work Time
        Speakers: Amelia Pollard (ASTeC), Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Joel Wulff (CERN)
      • 58
        First and Second Team Presentations
    • Prize Awards 🏆: challenge, poster, and others Merrison Lecture Theatre, Cockcroft Institute (Daresbury Laboratory)

      Merrison Lecture Theatre, Cockcroft Institute

      Daresbury Laboratory

      Convener: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute)
      • 59
        Prize Awards
        Speakers: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Carsten Welsch (University of Liverpool)
    • 15:30
      Coffee ☕ Cockcroft Institute (Daresbury Laboratory)

      Cockcroft Institute

      Daresbury Laboratory

    • Discussion 🎤: Workshop closing Merrison Lecture Theatre, Cockcroft Institute (Daresbury Laboratory)

      Merrison Lecture Theatre, Cockcroft Institute

      Daresbury Laboratory

      Conveners: Andrea Santamaria Garcia (University of Liverpool and Cockcroft Institute), Borja Rodriguez Mateos (CERN), Joel Wulff (CERN), Simon Hirlaender (PLUS University Salzburg)
      • 60
        Discussion Session
        Speakers: Borja Rodriguez Mateos (CERN), Simon Hirlaender (PLUS University Salzburg)
    • Transportation 🚌: Departure to Liverpool Daresbury Laboratory

      Daresbury Laboratory