higgs_dna.scripts.samples package
Submodules
higgs_dna.scripts.samples.download_files module
- higgs_dna.scripts.samples.download_files.download_file(file_url, destination_folder, key)
Download a single file using xrdcp.
- Args:
file_url (str): URL of the file to download.
destination_folder (str): Base directory where files are downloaded.
key (str): The key associated with the file, used to create subdirectories.
- Returns:
- tuple: (file_url, success, message)
file_url (str): The URL of the file attempted to download.
success (bool): True if download was successful, False otherwise.
message (str): Success or error message.
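The documented contract can be illustrated with a minimal sketch. It is not the package's implementation: the name `download_file_sketch`, the injectable `runner` argument (added here so the function can be exercised without a real xrdcp call), and the exact message strings are assumptions. Only the argument list and the `(file_url, success, message)` return shape come from the description above.

```python
import os
import subprocess


def download_file_sketch(file_url, destination_folder, key, runner=subprocess.run):
    """Copy one file with xrdcp into <destination_folder>/<key>/ (illustrative only)."""
    target_dir = os.path.join(destination_folder, key)
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, os.path.basename(file_url))
    try:
        # check=True turns a non-zero xrdcp exit code into CalledProcessError.
        runner(["xrdcp", file_url, destination],
               check=True, capture_output=True, text=True)
    except FileNotFoundError:
        # Raised by subprocess when the xrdcp executable is not on PATH.
        return file_url, False, "xrdcp executable not found"
    except subprocess.CalledProcessError as err:
        return file_url, False, f"xrdcp failed: {(err.stderr or '').strip()}"
    return file_url, True, f"Downloaded to {destination}"
```

Returning a tuple instead of raising lets a caller collect failures from many downloads and report them together.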
- higgs_dna.scripts.samples.download_files.download_files_in_parallel(file_dict, target_dir, num_files, num_threads)
Download files in parallel using multiple threads.
- Args:
- file_dict (dict): Dictionary where each key is a string representing a category
(e.g., dataset name) and the value is a list of file URLs to download.
- target_dir (str): Base directory where files will be downloaded. Subdirectories for each key
will be created within this directory.
- num_files (int or None): Number of files to download per key. If None, download all files.
- num_threads (int): Number of concurrent threads to use for downloading.
- Returns:
None
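One way to realise this behaviour is a thread pool from the standard library, sketched below. The name `download_in_parallel_sketch` and the extra `download` callable are assumptions made so the example is self-contained. Unlike the documented function, which returns None, this sketch returns the collected results so they can be inspected.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_in_parallel_sketch(file_dict, target_dir, num_files, num_threads, download):
    """Call download(url, target_dir, key) for the selected URLs on a thread pool."""
    # Slicing with None keeps the whole list, matching "If None, download all files".
    jobs = [(url, key)
            for key, urls in file_dict.items()
            for url in urls[:num_files]]
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(download, url, target_dir, key) for url, key in jobs]
        # as_completed yields each future as soon as its download finishes.
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Threads suit this task because each worker spends most of its time waiting on the network rather than holding the interpreter.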
- higgs_dna.scripts.samples.download_files.handle_target_directory(target_dir, force_recreate)
Manage the target directory based on its current state and user preferences.
- If the directory exists and is empty, proceed with downloading.
- If the directory exists and is not empty:
  - If force_recreate is True, delete its contents and proceed.
  - Otherwise, exit the script to prevent accidental data loss.
- If the directory does not exist, create it.
- Args:
target_dir (str): Path to the target directory.
force_recreate (bool): Flag indicating whether to force recreate the directory.
- Raises:
SystemExit: If unable to handle the directory based on the above conditions.
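The three documented cases can be sketched as follows. The function name `handle_target_directory_sketch` and the exit message are assumptions, and the real function may log or report errors differently.

```python
import os
import shutil
import sys


def handle_target_directory_sketch(target_dir, force_recreate):
    """Prepare target_dir following the three documented cases (illustrative only)."""
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)  # does not exist: create it
        return
    if not os.listdir(target_dir):
        return  # exists and is empty: nothing to do
    if not force_recreate:
        # sys.exit with a string raises SystemExit and prints the message to stderr.
        sys.exit(f"{target_dir} is not empty; refusing to overwrite it")
    # Exists, is not empty, and force_recreate is set: delete the contents only.
    for entry in os.scandir(target_dir):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
```

Deleting the contents rather than the directory itself keeps the directory's own permissions and any mount point intact.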
- higgs_dna.scripts.samples.download_files.is_directory_empty(directory)
Check if a given directory is empty.
- Args:
directory (str): Path to the directory.
- Returns:
bool: True if the directory is empty, False otherwise.
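A check of this kind can be written in a few lines. The sketch below uses `os.scandir`, and the name `is_directory_empty_sketch` is an assumption; the actual function may use a different listing call.

```python
import os


def is_directory_empty_sketch(directory):
    """Return True if the directory has no entries (illustrative only)."""
    with os.scandir(directory) as entries:
        # any() stops at the first entry, so a large directory is not fully listed.
        return not any(entries)
```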
- higgs_dna.scripts.samples.download_files.main()
Main function to parse arguments and initiate the file download process.
- Workflow:
Parse command-line arguments.
Read and parse the JSON file containing file URLs.
Determine whether to process a specific key or all keys.
Handle the target directory based on its state and user flags.
Download the specified number of files using xrdcp, either sequentially or in parallel.
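The first three workflow steps can be sketched as below. Every command-line flag name here is a hypothetical placeholder, since the page does not list the script's real options. The JSON layout (a mapping from key to a list of URLs) is inferred from the `file_dict` description of `download_files_in_parallel`.

```python
import argparse
import json


def parse_args_sketch(argv=None):
    # All option names below are hypothetical, not the script's actual interface.
    parser = argparse.ArgumentParser(description="Download files listed in a JSON file")
    parser.add_argument("json_file", help="JSON file mapping keys to lists of URLs")
    parser.add_argument("--key", default=None, help="process only this key")
    parser.add_argument("--target-dir", default="downloads")
    parser.add_argument("--num-files", type=int, default=None)
    parser.add_argument("--threads", type=int, default=1)
    parser.add_argument("--force", action="store_true")
    return parser.parse_args(argv)


def select_keys_sketch(file_dict, key=None):
    """Return {key: urls} for one key, or the whole mapping when key is None."""
    if key is None:
        return file_dict
    if key not in file_dict:
        raise KeyError(f"{key!r} not found in the JSON file")
    return {key: file_dict[key]}


def load_selection_sketch(argv=None):
    """Parse arguments, read the JSON file, and pick the keys to process."""
    args = parse_args_sketch(argv)
    with open(args.json_file) as handle:
        file_dict = json.load(handle)
    return args, select_keys_sketch(file_dict, args.key)
```

The remaining steps would then pass the selected mapping to the directory handling and download functions documented above.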
higgs_dna.scripts.samples.fetch_datasets module
- higgs_dna.scripts.samples.fetch_datasets.get_dataset_dict_grid(fset: Iterable[Iterable[str]], xrd: str, dbs_instance: str, logger) -> Dict[str, List[str]]
Fetch file lists for grid datasets using dasgoclient. This function is parallelised and will restart stuck requests after 10 seconds.
- Parameters:
fset – Iterable of tuples (dataset-short-name, dataset-path)
xrd – xrootd prefix
dbs_instance – DBS instance for dasgoclient
logger – Logger instance
- Returns:
Dictionary mapping dataset names to list of file paths
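The timeout-and-restart pattern described above can be sketched with `subprocess` and a thread pool. This is an illustration under several assumptions: the function names, the `runner` argument (added so the code can be tried without dasgoclient installed), the retry limit of 3, the pool size of 8, and the plain concatenation of `xrd` with each file path are all hypothetical. The `logger` argument is omitted for brevity.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def query_files_sketch(dataset, dbs_instance, runner=subprocess.run,
                       timeout=10, max_tries=3):
    """Return the file list of one dataset, restarting queries that time out."""
    query = f"file dataset={dataset} instance={dbs_instance}"
    for _ in range(max_tries):
        try:
            done = runner(["dasgoclient", f"--query={query}"],
                          capture_output=True, text=True,
                          timeout=timeout, check=True)
        except subprocess.TimeoutExpired:
            continue  # stuck request: start it again
        return [line for line in done.stdout.splitlines() if line.strip()]
    raise RuntimeError(f"dasgoclient timed out {max_tries} times for {dataset}")


def get_dataset_dict_grid_sketch(fset, xrd, dbs_instance, runner=subprocess.run):
    """Query all datasets concurrently and prefix each file path with xrd."""
    fset = list(fset)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so results line up with fset.
        file_lists = list(pool.map(
            lambda item: query_files_sketch(item[1], dbs_instance, runner), fset))
    return {name: [xrd + path for path in files]
            for (name, _), files in zip(fset, file_lists)}
```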
- higgs_dna.scripts.samples.fetch_datasets.get_dataset_dict_local(fset: Iterable[Iterable[str]], recursive: bool, extensions: List[str], logger) -> Dict[str, List[str]]
Collect file lists for local directories.
- Parameters:
fset – Iterable of tuples (dataset-short-name, directory-path)
recursive – Whether to search directories recursively
extensions – List of file extensions to filter (case-insensitive)
logger – Logger instance
- Returns:
Dictionary mapping dataset names to list of local file paths
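A minimal sketch of the local collection step follows. The name `get_dataset_dict_local_sketch`, the sorting of results, and the normalisation that accepts extensions with or without a leading dot are assumptions; the `logger` argument is omitted.

```python
import os


def get_dataset_dict_local_sketch(fset, recursive, extensions):
    """Map each dataset name to the sorted matching files in its directory."""
    # Lowercase, dot-prefixed suffixes so "root", ".root" and ".ROOT" all match.
    suffixes = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    result = {}
    for name, directory in fset:
        if recursive:
            candidates = [os.path.join(root, fname)
                          for root, _, files in os.walk(directory)
                          for fname in files]
        else:
            candidates = [entry.path for entry in os.scandir(directory)
                          if entry.is_file()]
        # Compare against the lowercased path for a case-insensitive match.
        result[name] = sorted(p for p in candidates if p.lower().endswith(suffixes))
    return result
```

With an empty `extensions` list this sketch matches nothing; the real function may treat that case differently.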
- higgs_dna.scripts.samples.fetch_datasets.read_input_file(input_txt: str, mode: str, logger) -> List[tuple]
Read the input text file and parse dataset names and paths.
- Parameters:
input_txt – Path to the input text file
mode – Mode of operation (‘grid’ or ‘local’) for validation
logger – Logger instance
- Returns:
List of tuples (dataset-name, dataset-path)
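The page does not specify the input file layout, so the sketch below assumes one `name path` pair per line, separated by whitespace, with blank lines and lines starting with `#` skipped. The grid-mode check (the path must start with `/`) is likewise an assumption about what the validation might involve. The name `read_input_file_sketch` is hypothetical and the `logger` argument is omitted.

```python
def read_input_file_sketch(input_txt, mode):
    """Parse 'name path' lines into a list of (name, path) tuples (assumed format)."""
    if mode not in ("grid", "local"):
        raise ValueError(f"mode must be 'grid' or 'local', got {mode!r}")
    entries = []
    with open(input_txt) as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            parts = line.split()
            if len(parts) != 2:
                raise ValueError(f"{input_txt}:{lineno}: expected 'name path'")
            name, path = parts
            # Assumed validation: grid dataset paths are absolute, DAS-style names.
            if mode == "grid" and not path.startswith("/"):
                raise ValueError(f"{input_txt}:{lineno}: grid path must start with '/'")
            entries.append((name, path))
    return entries
```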