higgs_dna.scripts.samples package

Submodules

higgs_dna.scripts.samples.download_files module

higgs_dna.scripts.samples.download_files.download_file(file_url, destination_folder, key)

Download a single file using xrdcp.

Args:
  • file_url (str): URL of the file to download.

  • destination_folder (str): Base directory where files are downloaded.

  • key (str): The key associated with the file, used to create subdirectories.

Returns:
tuple: (file_url, success, message)
  • file_url (str): The URL of the file that the function attempted to download.

  • success (bool): True if download was successful, False otherwise.

  • message (str): Success or error message.
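As an illustration of the (file_url, success, message) contract described above, here is a minimal sketch, not the module's actual implementation. The function name download_file_sketch and the copy_cmd parameter are hypothetical; copy_cmd exists only so the example can be exercised without xrdcp installed. It assumes the copy tool is invoked as `<tool> <source> <destination>` and that the destination file name is the basename of the URL.

```python
import os
import subprocess


def download_file_sketch(file_url, destination_folder, key, copy_cmd=("xrdcp",)):
    """Copy one file into <destination_folder>/<key>/ and report the outcome."""
    # Create the per-key subdirectory if it does not exist yet.
    target_dir = os.path.join(destination_folder, key)
    os.makedirs(target_dir, exist_ok=True)
    # Assumption: the local file keeps the basename of the remote URL.
    target_path = os.path.join(target_dir, os.path.basename(file_url))

    try:
        # Assumed invocation form: <copy_cmd> <source> <destination>.
        result = subprocess.run(
            [*copy_cmd, file_url, target_path],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The copy tool itself is not installed or not on PATH.
        return file_url, False, f"command not found: {copy_cmd[0]}"

    if result.returncode != 0:
        return file_url, False, f"copy failed: {result.stderr.strip()}"
    return file_url, True, f"downloaded to {target_path}"
```

Returning a tuple instead of raising keeps a single failed transfer from aborting a batch; the caller can collect the failures and report them at the end.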

higgs_dna.scripts.samples.download_files.download_files_in_parallel(file_dict, target_dir, num_files, num_threads)

Download files in parallel using multiple threads.

Args:
  • file_dict (dict): Dictionary where each key is a string representing a category (e.g., dataset name) and the value is a list of file URLs to download.

  • target_dir (str): Base directory where files will be downloaded. Subdirectories for each key will be created within this directory.

  • num_files (int or None): Number of files to download per key. If None, download all files.

  • num_threads (int): Number of concurrent threads to use for downloading.

Returns:

None
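The per-key truncation and the thread pool described above can be sketched as follows. This is an assumption-laden illustration, not the module's code: download_in_parallel_sketch and its fetch parameter are hypothetical, with fetch standing in for any callable that takes (file_url, target_dir, key). Unlike the documented function, which returns None, the sketch returns the collected results so they can be inspected.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_in_parallel_sketch(file_dict, target_dir, num_files, num_threads, fetch):
    """Submit one fetch(url, target_dir, key) task per selected file."""
    tasks = []
    for key, urls in file_dict.items():
        # num_files=None keeps the whole list; otherwise take the first num_files.
        selected = urls if num_files is None else urls[:num_files]
        tasks.extend((url, key) for url in selected)

    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(fetch, url, target_dir, key) for url, key in tasks]
        # Results arrive in completion order, not submission order.
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Threads suit this workload because each task spends its time waiting on network I/O rather than on Python bytecode.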

higgs_dna.scripts.samples.download_files.handle_target_directory(target_dir, force_recreate)

Manage the target directory based on its current state and user preferences.

  • If the directory exists and is empty, proceed with downloading.

  • If the directory exists and is not empty:
    • If force_recreate is True, delete its contents and proceed.

    • Otherwise, exit the script to prevent accidental data loss.

  • If the directory does not exist, create it.

Args:
  • target_dir (str): Path to the target directory.

  • force_recreate (bool): Flag indicating whether to force recreate the directory.

Raises:

SystemExit: If unable to handle the directory based on the above conditions.
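The decision tree above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: handle_target_directory_sketch is a hypothetical name, the exit message is invented, and the sketch assumes target_dir, if it exists, is a directory. "Delete its contents" is read here as emptying the directory while keeping the directory itself.

```python
import os
import shutil
import sys


def handle_target_directory_sketch(target_dir, force_recreate):
    """Prepare target_dir following the documented decision tree."""
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)  # missing: create it
        return
    if not os.listdir(target_dir):
        return  # exists and empty: nothing to do
    if not force_recreate:
        # Non-empty and no permission to wipe: stop before touching any data.
        sys.exit(f"{target_dir} is not empty; refusing to overwrite")
    # Non-empty and force_recreate=True: remove the contents, keep the directory.
    for entry in os.listdir(target_dir):
        path = os.path.join(target_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```

Calling sys.exit with a string raises SystemExit and prints the message to stderr, which matches the documented Raises entry.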

higgs_dna.scripts.samples.download_files.is_directory_empty(directory)

Check if a given directory is empty.

Args:

directory (str): Path to the directory.

Returns:

bool: True if the directory is empty, False otherwise.
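One possible way to implement such a check is shown below; is_directory_empty_sketch is a hypothetical name, and the use of os.scandir is a choice made for this example rather than something the documentation states.

```python
import os


def is_directory_empty_sketch(directory):
    """Return True when the directory has no entries at all."""
    with os.scandir(directory) as entries:
        # any() stops at the first entry, so large directories are not fully listed.
        return not any(entries)
```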

higgs_dna.scripts.samples.download_files.main()

Main function to parse arguments and initiate the file download process.

Workflow:
  1. Parse command-line arguments.

  2. Read and parse the JSON file containing file URLs.

  3. Determine whether to process a specific key or all keys.

  4. Handle the target directory based on its state and user flags.

  5. Download the specified number of files using xrdcp, either sequentially or in parallel.
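Steps 1 to 3 of the workflow above can be sketched as follows. Every flag name here (--key, --target-dir, --num-files, --num-threads, --force) and every function name is hypothetical; the real script's command-line interface is not shown in this documentation. The sketch also assumes the JSON file is a flat mapping from key to list of URLs. Steps 4 and 5 are left out.

```python
import argparse
import json


def build_parser():
    # All flag names below are hypothetical, chosen only for this sketch.
    parser = argparse.ArgumentParser(description="Download files listed in a JSON file")
    parser.add_argument("json_file", help="JSON file mapping keys to lists of file URLs")
    parser.add_argument("--key", default=None, help="process only this key")
    parser.add_argument("--target-dir", default="downloads")
    parser.add_argument("--num-files", type=int, default=None)
    parser.add_argument("--num-threads", type=int, default=1)
    parser.add_argument("--force", action="store_true")
    return parser


def select_entries(file_dict, key):
    """Return the whole mapping, or only the requested key."""
    if key is None:
        return file_dict
    if key not in file_dict:
        raise KeyError(f"key {key!r} not found in input JSON")
    return {key: file_dict[key]}


def plan_downloads(argv):
    """Parse arguments, load the JSON file, and pick the URLs to download."""
    args = build_parser().parse_args(argv)
    with open(args.json_file) as handle:
        file_dict = json.load(handle)
    selected = select_entries(file_dict, args.key)
    # Truncate each list to --num-files (None keeps everything).
    return {k: v[: args.num_files] for k, v in selected.items()}, args
```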

higgs_dna.scripts.samples.fetch_datasets module

higgs_dna.scripts.samples.fetch_datasets.get_dataset_dict_grid(fset: Iterable[Iterable[str]], xrd: str, dbs_instance: str, logger) → Dict[str, List[str]]

Fetch file lists for grid datasets using dasgoclient. This function is parallelised and will restart stuck requests after 10 seconds.

Parameters:
  • fset – Iterable of tuples (dataset-short-name, dataset-path)

  • xrd – xrootd prefix

  • dbs_instance – DBS instance for dasgoclient

  • logger – Logger instance

Returns:

Dictionary mapping dataset names to lists of file paths
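The combination of parallel queries and a 10-second restart can be sketched as follows. This is an illustration under assumptions, not the module's code: the function names, max_tries, max_workers, and the runner parameter are hypothetical (runner exists so the example can be tested without dasgoclient), and prepending xrd by plain string concatenation is a guess at how the prefix is applied.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def query_files(dataset, dbs_instance, timeout=10, max_tries=3, runner=subprocess.run):
    """Ask dasgoclient for the files of one dataset, retrying on timeout."""
    cmd = ["dasgoclient", f"-query=file dataset={dataset} instance={dbs_instance}"]
    for _ in range(max_tries):
        try:
            done = runner(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # request looked stuck: start it again
        return [line for line in done.stdout.splitlines() if line.strip()]
    raise RuntimeError(f"no answer for {dataset} after {max_tries} tries")


def dataset_dict_grid_sketch(fset, xrd, dbs_instance, max_workers=4, runner=subprocess.run):
    """Query all datasets concurrently and prefix each file with xrd."""
    fset = list(fset)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        file_lists = pool.map(
            lambda item: query_files(item[1], dbs_instance, runner=runner), fset
        )
        # Assumption: the xrootd prefix is prepended to every logical file name.
        return {name: [xrd + lfn for lfn in files] for (name, _), files in zip(fset, file_lists)}
```

subprocess.run kills the child process when its timeout expires, so a restarted request does not leave the stuck one running in the background.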

higgs_dna.scripts.samples.fetch_datasets.get_dataset_dict_local(fset: Iterable[Iterable[str]], recursive: bool, extensions: List[str], logger) → Dict[str, List[str]]

Collect file lists for local directories.

Parameters:
  • fset – Iterable of tuples (dataset-short-name, directory-path)

  • recursive – Whether to search directories recursively

  • extensions – List of file extensions to filter (case-insensitive)

  • logger – Logger instance

Returns:

Dictionary mapping dataset names to lists of local file paths
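A case-insensitive extension filter with optional recursion might look like the sketch below. The name dataset_dict_local_sketch is hypothetical, the logger argument is omitted, and the choices to accept extensions with or without a leading dot and to sort the result are assumptions made for this example.

```python
import os


def dataset_dict_local_sketch(fset, recursive, extensions):
    """Map each dataset name to the matching files under its directory."""
    # Normalise once so ".ROOT", "root" and ".root" all compare equal.
    wanted = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    result = {}
    for name, directory in fset:
        found = []
        for root, dirs, files in os.walk(directory):
            found.extend(
                os.path.join(root, f) for f in files if f.lower().endswith(wanted)
            )
            if not recursive:
                dirs.clear()  # stop os.walk from descending further
        result[name] = sorted(found)
    return result
```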

higgs_dna.scripts.samples.fetch_datasets.get_fetcher_args() → Namespace
higgs_dna.scripts.samples.fetch_datasets.main()
higgs_dna.scripts.samples.fetch_datasets.read_input_file(input_txt: str, mode: str, logger) → List[tuple]

Read the input text file and parse dataset names and paths.

Parameters:
  • input_txt – Path to the input text file

  • mode – Mode of operation (‘grid’ or ‘local’) for validation

  • logger – Logger instance

Returns:

List of tuples (dataset-name, dataset-path)
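The input file format is not specified here, so the following sketch rests on assumptions: one whitespace-separated "name path" pair per line, blank lines and lines starting with # ignored, and a grid-mode check that the path starts with "/". The name read_input_file_sketch is hypothetical, and the logger argument is omitted.

```python
def read_input_file_sketch(input_txt, mode):
    """Parse 'name path' lines into a list of (name, path) tuples."""
    if mode not in ("grid", "local"):
        raise ValueError(f"unknown mode: {mode!r}")
    entries = []
    with open(input_txt) as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments (assumed convention)
            parts = line.split()
            if len(parts) != 2:
                raise ValueError(f"line {lineno}: expected 'name path', got {line!r}")
            name, path = parts
            # Assumed grid check: dataset paths start with '/'.
            if mode == "grid" and not path.startswith("/"):
                raise ValueError(f"line {lineno}: {path!r} is not a grid dataset path")
            entries.append((name, path))
    return entries
```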

Module contents