Main Concepts
Command Line Tool
If you want to run an analysis with processors and taggers that have already been developed, the suggested way is to use the command line tool run_analysis.py.
The main part that define an analysis are the following:
datasets: path to a JSON file in the form{"dataset_name": [list_of_files]}(like the one dumped by dasgoclient)workflow: the coffea processor you want to use to process you data, can be found in the modules located inside the subpackagehiggs_dna.workflowsmetaconditions: the name (without.jsonextension) of one of the JSON files inherited from FLASHgg and located insidehiggs_dna.metaconditionsyear: the year condition for each sample, let us use2016preVFP,2016postVFP,2017,2018,2022preEE,2022postEE,2023taggers: the set of taggers you want to use, can be found in the modules located inside the subpackagehiggs_dna/workflows/taggerssystematics: the set of systematics you want to use for each samplecorections: the set of corrections you want to use for each sample
These parameters are specified in a JSON file and passed to the command line with the flag --json-analysis
run_analysis.py --json-analysis simple_analysis.json
where simple_analysis.json looks like this:
{
"samplejson": "/work/gallim/devel/HiggsDNA/tmp/DY-data-test.json",
"workflow": "tagandprobe",
"metaconditions": "Era2017_legacy_xgb_v1",
"taggers": [
"DummyTagger1"
],
"year":[
"SampleName1": ["2022preEE"],
"SampleName2": ["2017"]
]
"systematics": {
"SampleName1": [
"SystematicA", "SystematicB"
],
"SampleName2": [
"SystematicC"
]
},
"corrections": {
"SampleName1": [
"CorrectionA", "CorrectionB"
],
"SampleName2": [
"CorrectionC"
]
}
}
where the taggers list and the systematics or corrections dictionaries can be left empty if no taggers or systematics are applied. For most of the systematics, a year condition is necessary.
The next two flags that you will want to specify are dump and executor: the former receives the path to a directory where the parquet output files will be stored, while the latter specifies the Coffea executor used to process the chunks of data. It can be one of the following:
iterativefuturesdask/localdask/condordask/slurmdask/lpcdask/lxplusdask/casaparsl/slurmparsl/condor
There are then a few other options that depend on the backend. When running with Dask, for instance, you may want to use the command line to change the following parameters:
workers: number processes/threads used on the same node (this means that every job will use this amount of cores)scaleout: minimum number of nodes to scale out to (i.e. minimum number of jobs submitted)max-scaleout: maximum number of nodes to adapt your cluster to (i.e. maximum number of jobs submitted)
As usual, a description of all the options is printed when running:
run_analysis.py --help
Processors
Processors are items defined within Coffea where the analysis workflow is described. While a general overview is available in the Coffea documentation, here we will focus on the aspects that are important for HiggsDNA.
Since in Higgs to diphoton analysis there are some operations that are common to every analysis workflow, we wrote a base processor HggBaseProcessor which can be used in many basic analyses. If more complex operations are needed, one can still write a processor that inherits from the base class and redefines the function process. The operations that one can find within HggBaseprocessor.process are the following:
application of filters and triggers
Chained Quantile Regression to correct shower shapes and isolation variables
photon IdMVA
diphoton IdMVA
photon preselection
event tagging
application of systematic uncertainties
Write a New Processor
There are cases in which the workflows implemented in HiggsDNA are not enough for your studies. In these cases you might need to write your own processor. Depending on the scenario, there are different guidelines to do this.
Hgg-like workflow. In this case your analysis is similar to the one implemented in the Hgg basic processor, but you need to perform other operations on top (e.g. additional cuts, application of NNs, etc.). In order to reduce the amount repeated code, what you can do is write a processor that inherits from
HggBaseProcessorand redefine the functionprocess_extra. You can find an example of this in DYStudiesProcessor.Non Hgg-like workflow. This is the case in which the operations you need to perform are different from the ones performed in the
processfunction ofHggBaseProcess. In this kind of scenario you can still inherit fromHggBaseProcessorin order to have access to the same attributes, but you also need to rewrite theprocessfunction. An example of this is the TagAndProbeProcessor. In this case, we cannot use the standard workflow since we manipulate objects in a different way (for instance, we have tag and probe photons instead of lead and sublead and since each item of a pair can be either tag or probe we need to double the number of candidates - this is an operation that we would never do in a standard workflow).