Accessing Analysis Productions output with APD and Snakemake
Lesson Objectives
- Learn how to use the APD python package to interface Analysis Production output with your analysis scripts
- Learn how this can also be integrated into the Snakemake workflow manager
- See also the HSF lessons for an introduction to Snakemake
Analysis Production Data (APD) is a Python package written to interface with the Analysis Productions database and determine the file locations for each of your jobs. It is present in the LHCb environment and lb-conda by default, and can be obtained via pip or conda for other environments. APD's main documentation can be found here.
APD Basics
Similar to the Analysis Productions lesson, accessing the database requires an authentication token, which can be obtained by running apd-login
and following the instructions.
You can then access the output of a given job by following this basic example:
$ python
>>> from apd import AnalysisData
>>> datasets = AnalysisData("dpa", "starterkit")
>>> bu2jpsik_24c4_magdown = datasets(name="bu2jpsik_24c4_magdown")
>>> bu2jpsik_24c4_magdown
['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root', ...]
This creates bu2jpsik_24c4_magdown
as a list of all the PFNs for the output of this job.
You can then access these files using your preferred Python ROOT interface (e.g. PyROOT, Uproot, ...).
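For instance, the PFNs can be handed straight to Uproot, which accepts `file:tree` path specifications. A minimal sketch, in which the example PFN is copied from above and the tree name `DecayTree` is an assumption (substitute the tree name your production actually writes):

```python
# Sketch: turning APD PFNs into Uproot file specs.
# The PFN list and the tree name "DecayTree" are illustrative assumptions.
pfns = [
    "root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/"
    "Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root",
]

# Uproot accepts "path:treename" specifications, so append the tree name:
specs = [f"{pfn}:DecayTree" for pfn in pfns]

# Actually reading the files requires grid credentials and network access:
# import uproot
# arrays = uproot.concatenate(specs, ["B_M"], library="np")
print(specs[0])
```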
A typical analysis, however, will have many jobs, each corresponding to different conditions such as data-taking periods, LHCb magnet polarities, and data types. It is therefore more useful to obtain a mapping from each job to its outputs. To obtain a list of all the conditions the jobs can be filtered by, one can do
>>> datasets.summary()
{'tags': {'config': {'lhcb', 'mc'}, 'polarity': {'magup', 'magdown'}, 'eventtype': {'27163002', '94000000'}, 'datatype': {'2016', '2024'}, 'version': {'v0r0p4434304', 'v1r2161'}, 'name': {'bu2jpsik_24c4_magup', '2016_magup_promptmc_d02kk', 'bu2jpsik_24c4_magdown', '2016_magdown_promptmc_d02kk'}, 'state': {'ready'}}, 'analysis': 'starterkit', 'working_group': 'dpa', 'Number_of_files': 62, 'Bytecount': 3794798402}
>>> outputs = {name: datasets(name=name) for name in datasets.summary()["tags"]["name"]}
>>> outputs
{'bu2jpsik_24c4_magup': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254507/0000/00254507_00000003_1.data.root', ...],
'2016_magup_promptmc_d02kk': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/MC/2016/D02KK.ROOT/00166693/0000/00166693_00000001_1.d02kk.root', ...],
'bu2jpsik_24c4_magdown': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/LHCb/Collision24/DATA.ROOT/00254511/0000/00254511_00000001_1.data.root', ...],
'2016_magdown_promptmc_d02kk': ['root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/anaprod/lhcb/MC/2016/D02KK.ROOT/00166695/0000/00166695_00000001_1.d02kk.root', ...]}
Alternatively, one can filter on polarity, datatype, eventtype, or other tags to obtain a mapping of the form {eventtype: {datatype: {polarity: [PFNs]}}}
and so target a subset of the jobs in the production.
Multiple matches
To avoid ambiguity an exception will be raised if the tags used to filter jobs correspond to more than one job.
For example, using the dataset from above:
>>> bu2jpsik_24c4 = datasets(datatype="2024")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cvmfs/lhcb.cern.ch/lib/var/lib/LbEnv/3526/stable/linux-64/lib/python3.12/site-packages/apd/analysis_data.py", line 399, in __call__
raise ValueError("Error loading data: " + error_txt)
ValueError: Error loading data: 1 problem(s) found
{'datatype': '2024'}: 2 samples for the same configuration found, this is ambiguous:
{'config': 'lhcb', 'polarity': 'magdown', 'eventtype': '94000000', 'datatype': '2024', 'version': 'v1r2161', 'name': 'bu2jpsik_24c4_magdown', 'state': 'ready'}
{'config': 'lhcb', 'polarity': 'magup', 'eventtype': '94000000', 'datatype': '2024', 'version': 'v1r2161', 'name': 'bu2jpsik_24c4_magup', 'state': 'ready'}
>>> bu2jpsik_24c4_magup = datasets(datatype="2024", polarity="magup")
>>> bu2jpsik_24c4_magdown = datasets(datatype="2024", polarity="magdown")
Snakemake integration
Many analyses in LHCb have their workflows managed by Snakemake. APD provides additional measures for integrating the output of Analysis Productions into a Snakemake workflow.
Firstly, the APD tools for Snakemake can be imported into the Snakefile:
from apd.snakemake import get_analysis_data
The dataset can now be accessed similarly to above:
dataset = get_analysis_data("dpa", "starterkit")
By specifying additional parameters, the dataset can then be monitored and processed automatically by Snakemake. Below is an example of a simple rule using this technique.
rule example_rule:
    input:
        data=lambda w: dataset(datatype=w.datatype, eventtype=w.eventtype, polarity=w.polarity)
    output:
        temp("filename_{config}_{datatype}_{eventtype}_{polarity}.root")
    shell:
        # Some script to run accessing the files
        "process_data.py -o {output} {input}"
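To drive such a rule, a target rule can request the concrete output filenames, with Snakemake resolving the wildcards and pulling the matching PFNs through the dataset object. A minimal sketch, with tag values taken from the summary shown earlier in this lesson:

```python
# Sketch: a target rule requesting outputs for both 2024 polarities.
# The tag values here come from the datasets.summary() output above.
rule all:
    input:
        expand(
            "filename_{config}_{datatype}_{eventtype}_{polarity}.root",
            config="lhcb",
            datatype="2024",
            eventtype="94000000",
            polarity=["magup", "magdown"],
        )
```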