<center>
<h1><b><span style="color:blue">Big Data Python ecosystem for HEP analysis</span></b></h1>
<h3>Eduardo Rodrigues<br>University of Liverpool</h3>

<h3><span style="color:gray"><a href="https://indico.ph.liv.ac.uk/event/1639">STFC School on Data Intensive Science 2024</a>, Liverpool, 14-19 July 2024</span></h3>
</center>

### Abstract

Data analysis in High Energy Physics (HEP) has evolved considerably in recent years, with "Big Data" tools in ever wider use.
Python is established as a programming language for analysis work, and a HEP-specific ecosystem that connects well with the wider scientific Python ecosystem
is both mature at this point and under continuous development.
I will discuss HEP data as Big Data, as well as Python and its analysis ecosystem as provided by various domain-specific community projects.
I will dwell in particular on the Scikit-HEP project, which I started in late 2016 with a few colleagues from various backgrounds and domains of expertise.
It is now part of the official software stack of the experiments ATLAS, Belle II, CMS, KM3NeT and LHCb.


---

### **Aside intro - "PyHEP community projects"**

A series of Python projects and software libraries have emerged in recent years; by *projects* I mean endeavours that provide one or more Python libraries *with a community around them*.
Popular such projects are `Coffea`, `ComPWA`, `GooFit`, `Scikit-HEP` and `zfit`.

**Scikit-HEP** is:
- The one I co-founded in late 2016 with a few colleagues, hence the one I am most intimately involved with.
- The oldest of such projects.
- The one providing the most libraries.
- The project on which most other "Big Data projects" depend (they depend on at least one of the Scikit-HEP libraries).

For these reasons I will be presenting some of the Scikit-HEP packages.

---

## **The Scikit-HEP project**

The scientific Python ecosystem can be organised, schematically, as a layered set of ever more specialised libraries and packages, from foundational and key libraries such as NumPy, Pandas and matplotlib, to domain-specific projects. In HEP we now also have our own "ecosystem shell". Seen from a Scikit-HEP-centric perspective, this ecosystem looks as follows:

<center><img src="images/scikit-hep-ecosystem-shells.svg" width="65%"/></center>

Note that several packages or projects are part of the "grand PyHEP ecosystem". This is the case for `Coffea`, `GooFit` and `zfit`.

### **Project topics and packages**

Very many topics are addressed within the project!
- Data manipulation and interoperability
- Data aggregation and histogramming
- Modeling and fitting
- Statistics
- Visualisation
- HEP-specific utilities e.g. to deal with particles and decays
- Simulation
- Interoperability with HEP-specific libraries

Here is an overview of the Scikit-HEP packages that are most popular and/or most actively used and maintained:

<center><img src="images/Scikit-HEP_ecosystem.png" width="60%"></center>

A mini gallery to whet your appetite:
<table>
<tr style="background: white;">
    <td align="center"><img src="images/Scikit-HEP_gallery_uproot.jpg" width="80%"></td>
    <td align="center"><img src="images/Scikit-HEP_gallery_Hist.jpg" width="80%"></td>
</tr>
</table>
<table>
<tr style="background: white;">
    <td align="center"><img src="images/Scikit-HEP_gallery_Particle.jpg" width="50%"></td>
    <td align="center"><img src="images/Scikit-HEP_gallery_DecayLanguage.png" width="80%"></td>
</tr>
</table>

#### **For reference - some useful links collected in one place:**

- Website: https://scikit-hep.org/
- GitHub: https://github.com/scikit-hep/

---

### **How to explore these lectures**

- **The notebooks are topical and self-contained, so you can run through them at your own pace and leisure.
Run whichever topics sound appealing to you ...**

- Did you like these tutorials? Consider dropping a line and/or giving the GitHub repository a star, as that's a trivial way to convey positive feedback.

#### **The scikit-hep metapackage**

The project has a special package, `scikit-hep`, which is a *metapackage*. Unlike all the others, which target specific topics, this metapackage simply provides an easy way to install a compatible set of project packages with a single `conda install scikit-hep` (or `pip install scikit-hep`) command.

The Scikit-HEP packages used in these notebooks are in fact installed via the metapackage. It is trivial to check the installed versions:

```python
import skhep
skhep.show_versions()
```


```text
System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:04:44) [MSC v.1940 64 bit (AMD64)]
executable: C:\home\sw\anaconda3\envs\STFC_DIS_2024\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
setuptools: 70.3.0
       pip: 24.0
     numpy: 1.26.4
     scipy: 1.14.0
    pandas: 2.2.2
matplotlib: 3.9.1

Scikit-HEP package version and dependencies:
        awkward: 2.6.6
boost_histogram: 1.4.1
  decaylanguage: 0.18.3
       hepstats: 0.8.1
       hepunits: 2.3.4
           hist: 2.7.3
     histoprint: 2.4.0
        iminuit: 2.26.0
         mplhep: 0.3.50
       particle: 0.24.0
          pylhe: 0.8.0
       resample: 1.10.0
          skhep: None
         uproot: 5.3.10
         vector: 1.4.1
```
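
If the metapackage itself is not installed, a similar check can be done with the Python standard library alone. The following is a minimal sketch, not part of Scikit-HEP: the helper name `installed_versions` is hypothetical, and the names in `packages` are assumed to be the distribution names as published on PyPI (for example `boost-histogram` rather than the import name `boost_histogram`). Packages that are not installed are reported as `None` rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; it only queries package metadata
    and never imports the packages themselves.
    """
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Assumed PyPI distribution names for a few Scikit-HEP packages.
packages = ["awkward", "boost-histogram", "hist", "iminuit", "particle", "uproot", "vector"]

for name, ver in installed_versions(packages).items():
    print(f"{name:>16}: {ver}")
```

Because only metadata is read, this runs quickly and works even when one of the packages would fail to import.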


<div class="alert alert-success">
<b>THANK YOU</b>

to Hans Dembinski, Henry Schreiner, Jim Pivarski, Jonas Eschle and others for knowingly (or unknowingly) providing material and/or inspiration for these tutorial notebooks!
</div>