Accessing Analysis Productions output with APD and Snakemake
Lesson Objectives
- Learn how to use the APD python package to interface Analysis Production output with your analysis scripts
- Learn how this can also be integrated into the Snakemake workflow manager
- See also the HSF lessons for an introduction to Snakemake
Analysis Production Data (APD) is a Python package written to interface with the Analysis Productions database and determine the file locations for each of your jobs. It is present in the LHCb environment and lb-conda by default, and can be obtained via pip or conda for other environments. APD's main documentation can be found here.
APD Basics
Similar to the Analysis Productions lesson, accessing the database requires an authentication token, which can be obtained by running apd-login
and following the instructions.
You can then access the output of a given job by following this basic example:
$ python
>>> from apd import AnalysisData
>>> datasets = AnalysisData("dpa", "starterkit")
>>> bu2jpsik_24c4_magdown = datasets(name="bu2jpsik_24c4_magdown")
>>> bu2jpsik_24c4_magdown
['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root', ...]
This creates bu2jpsik_24c4_magdown
as a list of all the PFNs for the output of this job.
You can then access these files using your preferred Python ROOT interface (e.g. PyROOT, Uproot, ...).
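For instance, the PFNs can be handed straight to Uproot, which accepts `file:tree` path specifications. A minimal sketch, in which the example PFN is copied from above and the tree name `DecayTree` is an assumption (substitute the tree name your production actually writes):

```python
# Sketch: turning APD PFNs into Uproot file specs.
# The PFN list and the tree name "DecayTree" are illustrative assumptions.
pfns = [
    "root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/"
    "Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root",
]

# Uproot accepts "path:treename" specifications, so append the tree name:
specs = [f"{pfn}:DecayTree" for pfn in pfns]

# Actually reading the files requires grid credentials and network access:
# import uproot
# arrays = uproot.concatenate(specs, ["B_M"], library="np")
print(specs[0])
```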
A typical analysis, however, will have many jobs, each corresponding to different conditions such as data-taking periods, LHCb magnet polarities, and data types. It is therefore more useful to obtain a mapping from each job to its outputs. To obtain a list of all the conditions the jobs can be filtered by, one can do
>>> datasets.summary()
{'tags': {'config': {'lhcb', 'mc'}, 'polarity': {'magup', 'magdown'}, 'eventtype': {'27163002', '94000000'}, 'datatype': {'2016', '2024'}, 'version': {'v0r0p4434304', 'v1r2161'}, 'name': {'bu2jpsik_24c4_magup', '2016_magup_promptmc_d02kk', 'bu2jpsik_24c4_magdown', '2016_magdown_promptmc_d02kk'}, 'state': {'ready'}}, 'analysis': 'starterkit', 'working_group': 'dpa', 'Number_of_files': 62, 'Bytecount': 3794798402}
>>> outputs = {name: datasets(name=name) for name in datasets.summary()["tags"]["name"]}
>>> outputs
{'bu2jpsik_24c4_magup': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254507/0000/00254507_00000003_1.data.root', ...],
'2016_magup_promptmc_d02kk': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/MC/2016/D02KK.ROOT/00166693/0000/00166693_00000001_1.d02kk.root', ...],
'bu2jpsik_24c4_magdown': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root', ...],
'2016_magdown_promptmc_d02kk': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/MC/2016/D02KK.ROOT/00166695/0000/00166695_00000001_1.d02kk.root', ...]}
Alternatively, one can filter on polarity, datatype, eventtype, or other tags to obtain a mapping of the form {eventtype: {datatype: {polarity: [PFNs]}}}
and so target a subset of the jobs in the production.
Multiple matches
To avoid ambiguity an exception will be raised if the tags used to filter jobs correspond to more than one job.
For example, using the dataset from above:
>>> bu2jpsik_24c4 = datasets(datatype="2024")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cvmfs/lhcb.cern.ch/lib/var/lib/LbEnv/3526/stable/linux-64/lib/python3.12/site-packages/apd/analysis_data.py", line 399, in __call__
raise ValueError("Error loading data: " + error_txt)
ValueError: Error loading data: 1 problem(s) found
{'datatype': '2024'}: 2 samples for the same configuration found, this is ambiguous:
{'config': 'lhcb', 'polarity': 'magdown', 'eventtype': '94000000', 'datatype': '2024', 'version': 'v1r2161', 'name': 'bu2jpsik_24c4_magdown', 'state': 'ready'}
{'config': 'lhcb', 'polarity': 'magup', 'eventtype': '94000000', 'datatype': '2024', 'version': 'v1r2161', 'name': 'bu2jpsik_24c4_magup', 'state': 'ready'}
>>> bu2jpsik_24c4_magup = datasets(datatype="2024", polarity="magup")
>>> bu2jpsik_24c4_magdown = datasets(datatype="2024", polarity="magdown")
Snakemake integration
Many analyses in LHCb have their workflows managed by Snakemake. APD provides additional measures for integrating the output of Analysis Productions into a Snakemake workflow.
Firstly, the APD tools for Snakemake can be imported into the Snakefile:
from apd.snakemake import get_analysis_data
The dataset can now be accessed similarly to above:
dataset = get_analysis_data("dpa", "starterkit")
By specifying additional parameters, the dataset can then be monitored and processed automatically by Snakemake. Below is an example of a simple rule using this technique.
rule example_rule:
    input:
        data=lambda w: dataset(datatype=w.datatype, eventtype=w.eventtype, polarity=w.polarity)
    output:
        temp("filename_{config}_{datatype}_{eventtype}_{polarity}.root")
    shell:
        # Some script to run accessing the files
        "process_data.py -o {output} {input}"
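To drive such a rule, a target rule can request the concrete output filenames, with Snakemake resolving the wildcards and pulling the matching PFNs through the dataset object. A minimal sketch, with tag values taken from the summary shown earlier in this lesson:

```python
# Sketch: a target rule requesting outputs for both 2024 polarities.
# The tag values here come from the datasets.summary() output above.
rule all:
    input:
        expand(
            "filename_{config}_{datatype}_{eventtype}_{polarity}.root",
            config="lhcb",
            datatype="2024",
            eventtype="94000000",
            polarity=["magup", "magdown"],
        )
```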