schrodinger.active_learning.al_utils module¶
- class schrodinger.active_learning.al_utils.SelectionRule¶
Bases: str, enum.Enum
- RANDOM = 'random'¶
- MOST_UNCERTAIN = 'most_uncertain'¶
- GREEDY = 'greedy'¶
- DIVERSITY = 'diversity'¶
- DISE = 'dise'¶
- DISTINCT_SCAFFOLDS = 'distinct_scaffolds'¶
- schrodinger.active_learning.al_utils.positive_int(s)¶
ArgumentParser type function that checks whether the input can be converted to a positive integer.
- Parameters
s (str) – input string
- Returns
integer value of input string
- Return type
int
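A minimal sketch of how an argparse type checker of this kind is commonly written. The body is an assumption for illustration, not the module's actual implementation; the name positive_int_sketch and the --num-iter flag are hypothetical.

```python
import argparse


def positive_int_sketch(s: str) -> int:
    """Hypothetical stand-in: convert s to int and require it to be > 0."""
    try:
        value = int(s)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{s!r} is not an integer")
    if value <= 0:
        raise argparse.ArgumentTypeError(f"{s!r} is not a positive integer")
    return value


parser = argparse.ArgumentParser()
# "--num-iter" is an illustrative flag name, not taken from the module.
parser.add_argument("--num-iter", type=positive_int_sketch, default=1)
parsed = parser.parse_args(["--num-iter", "3"])
```

Raising argparse.ArgumentTypeError lets argparse report the problem as a normal usage error instead of a traceback.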
- schrodinger.active_learning.al_utils.split_smi_line(line: str) Optional[Tuple[str, str]] ¶
Split a line from a .smi file into a SMILES pattern and a title. Return an empty list if the line is empty.
- Parameters
line – line from .smi file
- Returns
SMILES pattern, title
- Return type
[str, str] or []
- schrodinger.active_learning.al_utils.get_smi_header()¶
Create header for .smi input file. We assume the SMILES is in the first column and title in the second column.
- Returns
header list, header index for reordering SMILES and title
- Return type
list(str), list(int)
- schrodinger.active_learning.al_utils.get_csv_header(filename: str, smi_index: int, name_index: int, delimiter: str = ',', with_header: bool = True) Tuple[List[str], List[int]] ¶
Create header for .csv input file. The reordered index puts SMILES in the first column and title in the second column.
- Parameters
filename – .csv input file
smi_index – column index of molecule SMILES
name_index – column index of molecule name
delimiter – delimiter of input csv files
with_header – Whether the file has header in its first line
- Returns
header list, header index for reordering SMILES and title
- schrodinger.active_learning.al_utils.my_csv_reader(filename: str)¶
Yield a csv reader that skips the first line.
- Parameters
filename – .csv file name
- Returns
csv.reader that skips first line of the file.
- schrodinger.active_learning.al_utils.read_score(score_file: str) dict ¶
Read known scores of ligands from args.score_file.
- Returns
a dictionary that maps ligand title to ligand score.
- schrodinger.active_learning.al_utils.random_filtering(file_list, output_name, probability, random_seed=None, with_header=True)¶
Randomly select lines from entries in the file_list based on probability. The aim is to have a light-weight and ultrafast function to generate a subset for pilot runs.
- Parameters
file_list (list) – paths of input files.
output_name (str) – name of the output file.
probability (float) – probability of randomly selecting a line.
random_seed (int or None) – random seed number for shuffling the ligands
with_header (bool) – Whether input file(s) has header in its first line.
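The per-line selection described above can be sketched as independent Bernoulli draws. This is an illustrative assumption about the approach, not the module's code; random_filtering_sketch and the demo file names are hypothetical.

```python
import os
import random
import tempfile


def random_filtering_sketch(file_list, output_name, probability,
                            random_seed=None, with_header=True):
    """Hypothetical sketch: keep each data line with the given probability."""
    rng = random.Random(random_seed)
    kept = 0
    with open(output_name, "w") as out:
        for path in file_list:
            with open(path) as handle:
                if with_header:
                    next(handle, None)  # skip the header line of each input
                for line in handle:
                    if rng.random() < probability:
                        out.write(line)
                        kept += 1
    return kept


# Demo on a throwaway 1000-ligand file.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "ligands.csv")
with open(source, "w") as handle:
    handle.write("SMILES,Title\n")
    for i in range(1000):
        handle.write(f"C{'C' * (i % 5)},lig{i}\n")
subset = os.path.join(workdir, "subset.csv")
n_kept = random_filtering_sketch([source], subset, 0.1, random_seed=0)
```

Because each line is decided on its own, the input is streamed once and never held in memory, but the output size is only approximately probability times the number of input lines.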
- schrodinger.active_learning.al_utils.get_smiles_from_al_csv_or_smi_line(line)¶
Get the SMILES from a line of a .smi file or of an active learning .csv file with SMILES in the first column.
- schrodinger.active_learning.al_utils.reservoir_sampling(file_list, output_name, excluding_smiles=None, sample_size=100000, random_seed=None, with_header=True)¶
Randomly select sample_size of ligands from entries in the file_list. The aim is to have a light-weight and ultrafast function to generate a subset for pilot runs.
- Parameters
file_list (list) – paths of input files.
output_name (str) – name of the output file.
excluding_smiles (container) – list of smiles to skip in the sampling.
sample_size (int) – number of ligands to sample.
random_seed (int or None) – random seed number for shuffling the ligands
with_header (bool) – Whether input file(s) has header in its first line.
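For readers unfamiliar with the technique, here is a minimal sketch of classic reservoir sampling (Algorithm R) over a stream of lines. It assumes whitespace-separated lines with the SMILES in the first field; reservoir_sample_sketch is a hypothetical name and is not the module's implementation.

```python
import random


def reservoir_sample_sketch(lines, sample_size, excluding=None,
                            random_seed=None):
    """Hypothetical Algorithm R over an iterable of 'SMILES title' lines."""
    rng = random.Random(random_seed)
    excluding = set(excluding or ())
    reservoir = []
    seen = 0  # number of eligible lines encountered so far
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in excluding:
            continue  # skip blank lines and excluded SMILES
        seen += 1
        if len(reservoir) < sample_size:
            reservoir.append(line)
        else:
            # Keep the new line with probability sample_size / seen.
            j = rng.randrange(seen)
            if j < sample_size:
                reservoir[j] = line
    return reservoir


stream = (f"{'C' * (i + 1)} lig{i}" for i in range(10_000))
picked = reservoir_sample_sketch(stream, sample_size=50,
                                 excluding={"C", "CC"}, random_seed=1)
```

Unlike probability-based filtering, this returns exactly sample_size entries (or every eligible entry if there are fewer) in a single pass without knowing the total count in advance.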
- schrodinger.active_learning.al_utils.dise_select_ligands(candidate_ligand_csv_list: list[str], selected_ligands_csv: str, num_requested_ligands: int, scored_ligands_smi: set = (), seed_cpds_set: Optional[set] = None, smi_index: int = 0, sort: bool = False, score_column: str = 'score', ascending: bool = True, similarity_threshold: float = 0.5, seed_cpds_fraction: float = 0.1)¶
This function selects the specified number of ligands from the candidate ligands using directed sphere exclusion (DISE), optimizing for diversity and score. It writes the selected ligands to the specified csv file.
- Parameters
candidate_ligand_csv_list – list of file paths containing candidate ligands.
selected_ligands_csv – path to file with selected ligands.
num_requested_ligands – number of ligands required.
scored_ligands_smi – set of SMILES of already scored ligands that need to be removed from the output set.
seed_cpds_set – set of SMILES of seed ligands for DISE selection. These compounds are not included in the output and are only used to seed the selection.
smi_index – index of the SMILES column in candidate files.
sort – Whether or not the candidate ligands are sorted.
score_column – The column with scores. If the ligands are sorted and we have several files with candidate ligands, this column is used to merge them while maintaining the order.
ascending – If the ligands are sorted, whether the smaller score is better or worse.
similarity_threshold – similarity cutoff to use for DISE selection. The function tries to find enough compounds with requested cutoff and then increases the cutoff in five steps to 1.0 until it finds enough compounds.
seed_cpds_fraction – If no seed compounds are provided, it chooses this fraction of num_requested ligands from the candidate ligands and uses those as seeds. Note that these compounds are included in the output file and come from candidate ligands.
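The selection idea described by these parameters can be sketched as follows. This is a toy illustration under several assumptions: candidates arrive already ranked best first, similarity is a character-bigram Jaccard score standing in for real fingerprints, and compounds kept at a stricter cutoff are retained when the cutoff is relaxed. The names bigram_similarity and dise_sketch are hypothetical.

```python
def bigram_similarity(a: str, b: str) -> float:
    """Toy Jaccard similarity on character bigrams (fingerprint stand-in)."""
    set_a = {a[i:i + 2] for i in range(len(a) - 1)} or {a}
    set_b = {b[i:i + 2] for i in range(len(b) - 1)} or {b}
    return len(set_a & set_b) / len(set_a | set_b)


def dise_sketch(ranked_smiles, num_requested, similarity_threshold=0.5,
                seeds=(), n_relax_steps=5):
    """Walk candidates best first, keeping those not too similar to anything
    already kept; relax the cutoff toward 1.0 if too few were found."""
    step = (1.0 - similarity_threshold) / n_relax_steps
    cutoffs = [similarity_threshold + k * step
               for k in range(n_relax_steps + 1)]
    selected = []
    for cutoff in cutoffs:
        for smi in ranked_smiles:
            if len(selected) >= num_requested:
                return selected
            if smi in selected or smi in seeds:
                continue  # seeds only steer the selection, never join it
            references = list(seeds) + selected
            if all(bigram_similarity(smi, ref) < cutoff
                   for ref in references):
                selected.append(smi)
    return selected


ranked = ["CCO", "CCCO", "c1ccccc1", "CCN", "c1ccccc1O", "CCCN"]
picks = dise_sketch(ranked, num_requested=3)
```

Walking the list in score order is what makes the exclusion "directed": among mutually similar compounds, the best-scoring one is the one that survives.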
- schrodinger.active_learning.al_utils.extract_compressed_files(compressed_files, delete_original=False) list ¶
Extracts compressed gzip files from source directory. Only files that end with .gz (single file compression) will be extracted and placed back in the source directory. If delete_original is set to True, compressed file will also be deleted.
Returns updated filenames after extraction
- schrodinger.active_learning.al_utils.random_split(file_list: List, num_ligands: int, prefix: str = 'splited', block_size: int = 100000, name_index: int = 0, smi_index: int = 1, random_seed: Optional[int] = None, delimiter: str = ',', with_header: bool = True)¶
Combine input files, shuffle lines, and split into files with block_size lines per file. Reorder the columns such that SMILES and name are in the first and second columns, respectively.
- Parameters
file_list (list) – paths of input files.
num_ligands (int) – total number of ligands in all the input files.
prefix (str) – prefix of split files
block_size (int) – number of ligands in each sub .csv file.
name_index (int) – column index of molecule name
smi_index (int) – column index of molecule SMILES
random_seed (int or None) – random seed number for shuffling the ligands
delimiter (str) – delimiter of input csv files
with_header (bool) – Whether input file(s) has header in its first line.
- Returns
list of split files, reordered csv header
- Return type
list, list
- schrodinger.active_learning.al_utils.rename_qzip_model(old_qzip_model, new_qzip_model)¶
Rename the old .qzip model to new .qzip model.
- Parameters
old_qzip_model (str) – old .qzip model file.
new_qzip_model (str) – new .qzip model file.
- schrodinger.active_learning.al_utils.convert_tar_gz_to_qzip_model(tar_gz_model, qzip_model, job_args_json)¶
Convert .tar.gz ligand_ml model to .qzip model.
- Parameters
tar_gz_model (str) – input .tar.gz ligand_ml model file.
qzip_model (str) – output .qzip model file.
job_args_json (str) – the file that includes the arguments needed for deepautoqsar
- schrodinger.active_learning.al_utils.convert_qzip_to_tar_gz_model(qzip_model)¶
Convert .qzip deepautoqsar model to .tar.gz ligand_ml model.
- Parameters
qzip_model (str) – .qzip deepautoqsar model filename.
- Returns
.tar.gz ligand_ml model filename.
- Return type
str
- schrodinger.active_learning.al_utils.get_zip_contents(zip_filename: str) list ¶
Get a list of zip contents from a zip file without extraction
- Parameters
zip_filename – .zip filename
- Returns
list of zip contents
- schrodinger.active_learning.al_utils.get_file_ext(filename: str) str ¶
Get the extension of the file name. Skip ‘gz’ if it is a gz compressed file.
- Parameters
filename – name of the file.
- Returns
‘gz’ excluded extension of the file.
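A minimal sketch of the described behavior, assuming the extension is returned without its leading dot (the source does not say either way). get_file_ext_sketch is a hypothetical name.

```python
import os


def get_file_ext_sketch(filename: str) -> str:
    """Hypothetical sketch: extension without the dot, ignoring a final .gz."""
    root, ext = os.path.splitext(filename)
    if ext.lower() == ".gz":
        # Look one level deeper for the "real" extension.
        root, ext = os.path.splitext(root)
    return ext.lstrip(".")


examples = {name: get_file_ext_sketch(name)
            for name in ("ligands.csv.gz", "ligands.smi", "README")}
```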
- schrodinger.active_learning.al_utils.check_driver_disk_space(active_learning_job)¶
Estimate the driver disk usage of an active learning job with some assumed parameters. Print a warning if the available driver disk space is smaller than the estimated space. The environment variable SCRATCH_MAX_VOLUME_SIZE_GB specifies the maximum amount of scratch space that will be available on a driver instance managed by Schrodinger’s virtual cluster setup. If SCRATCH_MAX_VOLUME_SIZE_GB is set, it is used as the maximum volume size or free disk space in GB. Otherwise, the free disk space is calculated from the driver’s current free disk space, assuming the disk cannot be rescaled to add more space.
- Parameters
active_learning_job (ActiveLearningJob instance.) – current AL driver.
- schrodinger.active_learning.al_utils.node_run_timer(func)¶
Decorator for timing the running time of runNode method in ActiveLearningNode
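As an illustration of the pattern, a timing decorator can be sketched like this. How the real decorator reports the elapsed time is not documented here, so storing it on a last_elapsed attribute is purely an assumption; run_timer_sketch and DemoNode are hypothetical names.

```python
import functools
import time


def run_timer_sketch(func):
    """Hypothetical timing decorator: records elapsed wall-clock seconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises.
            wrapper.last_elapsed = time.perf_counter() - start
    wrapper.last_elapsed = None
    return wrapper


class DemoNode:
    """Illustrative stand-in for a node class with a runNode method."""

    @run_timer_sketch
    def runNode(self):
        time.sleep(0.01)
        return "done"


result = DemoNode().runNode()
```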
- schrodinger.active_learning.al_utils.add_output_file(*output_files, incorporate=False)¶
Add files to jobcontrol output files.
- Parameters
output_files (str) – files to be transferred.
incorporate (bool) – whether to mark the files for incorporation by Maestro.
- schrodinger.active_learning.al_utils.add_input_file(jsb, *input_files)¶
Check the existence of input file(s). Add it as jobcontrol input file if it exists, otherwise exit with error.
- Parameters
jsb (launchapi.JobSpecificationArgsBuilder) – job specification builder
input_files (str) – input file(s) to be added.
- schrodinger.active_learning.al_utils.concatenate_logs(combined_logfile, subjob_logfile_list, logger=None)¶
Combine subjob logfiles into single combined logfile.
- Parameters
combined_logfile (str) – combined log file name
subjob_logfile_list (list(str)) – list of subjob logfile names to be combined.
logger (Logger or None) – logger for receiving the info and error message.
- schrodinger.active_learning.al_utils.get_host_ncpu()¶
Return the host and number of CPUs that should be used to submit subjobs. This function works both with and without job control.
- Return type
tuple[str, int]
- schrodinger.active_learning.al_utils.is_hostname_valid(hostname: str) bool ¶
Check whether hostname is correct in the host file.
- Parameters
hostname – the hostname to check against
- Returns
Whether the hostname is defined in the host file.
- schrodinger.active_learning.al_utils.validate_input_files(input_files: List[str], remote_input_ligands: bool = False, allowed_format: List[str] = None) Optional[str] ¶
Check the existence and format of input files. Return error message if validation failed, otherwise return None.
- Parameters
input_files – paths of input files.
remote_input_ligands – Whether input ligand files are located at remote.
allowed_format – allowed input file formats.
- Returns
error message if validation failed; None if it passed
- schrodinger.active_learning.al_utils.validate_input_mae(input_files: list, max_check: int = 10) Optional[str] ¶
Validate structures in Maestro file(s).
- Parameters
input_files – list of path(s) to Maestro file(s).
max_check – maximum number of structures to validate.
- Returns
error message if validation fails. None if validation passes.
- schrodinger.active_learning.al_utils.yield_chunk(frame, length, func=None)¶
- schrodinger.active_learning.al_utils.validate_csv_header(header_str)¶
- schrodinger.active_learning.al_utils.open_maybe_compressed(filename: str, *a, **d) io.IOBase ¶
Takes fileutils.open_maybe_compressed and includes bzip2 and gzip support.
Open a file, using the gzip module if the filename ends in gz, using bz2 module if the filename ends in bz2, or default builtin open otherwise.
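The suffix-based dispatch described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the module's code: the name open_maybe_compressed_sketch is hypothetical, and the text-mode default is chosen here for convenience.

```python
import bz2
import gzip
import os
import tempfile


def open_maybe_compressed_sketch(filename, mode="rt", **kwargs):
    """Hypothetical sketch: choose the opener from the filename suffix."""
    if filename.endswith("gz"):
        return gzip.open(filename, mode, **kwargs)
    if filename.endswith("bz2"):
        return bz2.open(filename, mode, **kwargs)
    return open(filename, mode, **kwargs)


# Round-trip the same line through all three openers.
workdir = tempfile.mkdtemp()
contents = {}
for name in ("a.smi", "a.smi.gz", "a.smi.bz2"):
    path = os.path.join(workdir, name)
    with open_maybe_compressed_sketch(path, "wt") as handle:
        handle.write("CCO ethanol\n")
    with open_maybe_compressed_sketch(path) as handle:
        contents[name] = handle.read()
```

Matching on "gz" rather than ".gz" means names such as ligands.maegz are also routed through gzip, in line with the wording "ends in gz".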
- schrodinger.active_learning.al_utils.validate_input_smiles(input_files: list, smi_index: int, name_index: int, with_header: bool = True, max_check: int = 10, check_csv_header: bool = False) str ¶
Validate SMILES in input files.
- Parameters
input_files – paths of input files.
smi_index – column index of molecule SMILES
name_index – column index of molecule name
with_header – Whether the file has header in its first line
max_check – maximum number of SMILES to validate
check_csv_header – whether to check the csv headers
- Returns
error message if validation failed; None if it passed
- schrodinger.active_learning.al_utils.validate_input_zipfile(zip_filename: str, remote_input_ligands: bool) str ¶
Quick two-step validation of zipped input file
Checks for valid filename of zip archive
Without extracting files from zip archive, confirm zipped contents are CSV formatted
- schrodinger.active_learning.al_utils.validate_selection_rule(selection_rules, num_iters)¶
Validates the selection rules for active learning iterations.
- Parameters
selection_rules (list) – A list of selection rules.
num_iters (int) – The number of iterations.
- Returns
None if the selection rules are valid; an error message (str) if they are invalid.
- schrodinger.active_learning.al_utils.store_mae_to_db(db_filename, mae_file_list)¶
Store structure in .mae files to a sqlite3 database. pose_id is stored as -1 if property AL_POSE_ID not found in the structure.
- Parameters
db_filename (str) – path of the sqlite3 database
mae_file_list (list(str)) – list of .mae files that contain the structures to be stored to the database
- schrodinger.active_learning.al_utils.write_st_from_db_by_smiles(db_filename, out_mae_file, smi_list, chunk_size=500)¶
Extract the ligands’ structures from the database. Write the structures to the output .mae file.
- Parameters
db_filename (str) – path of the sqlite3 database containing ligands’ structure
out_mae_file (str) – path of the output .mae/.maegz file
smi_list (list(str)) – list of input ligands’ SMILES
chunk_size (int) – number of SMILES in each query
- schrodinger.active_learning.al_utils.add_file_to_aljob_restart_dict(active_learning_job, optional_restart_file, jobname)¶
Add a file to the optional_restart_files_dict of current active learning job. Only register the file with jobcontrol if active_learning_job is None.
- Parameters
active_learning_job (ActiveLearningJob instance or None.) – current AL driver.
optional_restart_file (str) – path of the file to be added
jobname (str) – key of the list that contains the optional_restart_file
- schrodinger.active_learning.al_utils.read_scored_ligands(scored_csv_file_list)¶
Read the ligands that were already scored by ScoreProviderNode.
- Parameters
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
- Returns
set of SMILES of the scored ligands.
- Return type
set(str)
- schrodinger.active_learning.al_utils.count_ligands(file_list: List[str], with_header: bool = True) int ¶
Count the number of ligands in all the files by counting the total number of lines. We assume each line contains a SMILES string.
- Parameters
file_list – list of input file paths.
with_header – Whether the input files have header.
- Returns
Number of ligands in all the input files.
- schrodinger.active_learning.al_utils.convert_csv_to_smi(csv_file: str, smi_file: str, smiles_column: int = 0, title_column: int = 1)¶
Convert a .csv ligand file to a .smi file. The default assumes that SMILES and Title are in the first and second columns of the .csv file respectively.
- Parameters
csv_file – path of the .csv file to be converted
smi_file – path of the output .smi file
smiles_column – column number of the SMILES in the input .csv. zero indexed.
title_column – column number of the Title in the input .csv. zero indexed.
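A minimal sketch of such a conversion, assuming the .csv has a header row that should be dropped and that the .smi output is "SMILES title" separated by a single space. Both points are assumptions, and convert_csv_to_smi_sketch is a hypothetical name.

```python
import csv
import os
import tempfile


def convert_csv_to_smi_sketch(csv_file, smi_file,
                              smiles_column=0, title_column=1):
    """Hypothetical sketch: write 'SMILES title' lines from a .csv file."""
    with open(csv_file, newline="") as src, open(smi_file, "w") as dst:
        reader = csv.reader(src)
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if row:  # ignore blank rows
                dst.write(f"{row[smiles_column]} {row[title_column]}\n")


workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "ligands.csv")
with open(csv_path, "w", newline="") as handle:
    handle.write("Title,SMILES\nethanol,CCO\nethylamine,CCN\n")
smi_path = os.path.join(workdir, "ligands.smi")
# Columns are swapped in this demo file, so pass zero-indexed positions.
convert_csv_to_smi_sketch(csv_path, smi_path, smiles_column=1, title_column=0)
```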
- schrodinger.active_learning.al_utils.yield_chunkable(file_path: str, batch_size: int, shuffle: bool = False)¶
A generator that reads a file in batches of structures.
- Parameters
file_path – Path to the input file.
batch_size – Number of structures per batch.
shuffle – Whether to shuffle the structures in each batch.
- Yield
batch iterable
- schrodinger.active_learning.al_utils.remove_duplicates(input_file: str, title_column: int = 1) str ¶
This function takes a .smi file and removes duplicate entries that share a title but may have different SMILES. One possible use case is when the same ligand appears in different charge states or poses in the input file. The function keeps the first row for every unique title and discards the rest.
- Parameters
input_file – path of the input file.
title_column – column of the title entry in the .csv or .smi file, zero indexed.
- Returns
path of the output file with duplicates removed. Same as the input file name with _unique attached at the end.
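The keep-first-per-title rule can be sketched as below. Assumptions: .csv files are comma-separated, anything else is whitespace-separated, and "_unique" is inserted before the file extension (the source wording leaves the exact placement open). remove_duplicates_sketch is a hypothetical name.

```python
import os
import tempfile


def remove_duplicates_sketch(input_file, title_column=1):
    """Hypothetical sketch: keep only the first line seen for each title."""
    root, ext = os.path.splitext(input_file)
    output_file = f"{root}_unique{ext}"  # assumed naming scheme
    delimiter = "," if ext.lower() == ".csv" else None  # None = whitespace
    seen = set()
    with open(input_file) as src, open(output_file, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) <= title_column:
                continue  # blank or malformed line
            title = fields[title_column]
            if title not in seen:
                seen.add(title)
                dst.write(line)
    return output_file


workdir = tempfile.mkdtemp()
smi_path = os.path.join(workdir, "ligands.smi")
with open(smi_path, "w") as handle:
    handle.write("CCO ethanol\nCC[O-] ethanol\nCCN ethylamine\n")
unique_path = remove_duplicates_sketch(smi_path)
```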
- schrodinger.active_learning.al_utils.split_lig(lig_filename: str, output_prefix: str, nstruct: int) List[str] ¶
Split structures in a .mae file into batches. Uses contextlib to manage a variable number of open files.
- Parameters
lig_filename – path of the .mae file to be split
output_prefix – prefix of the split output ligand files
nstruct – number of ligands per batch
- Returns
list of batched .mae files
- schrodinger.active_learning.al_utils.get_allowed_ncpu(user_specified_ncpu: int) int ¶
Return the number of allowed CPUs for a job.
- Parameters
user_specified_ncpu – user specified maximum number of CPUs
- Returns
number of allowed CPUs
- schrodinger.active_learning.al_utils.generate_mae_file_with_unique_title(input_mae_file: str, output_mae_file: str)¶
Convert the input .mae file to output .mae file that contains unique titles.
- Parameters
input_mae_file – path of input .mae file
output_mae_file – path of output .mae file containing ligands with unique title
- schrodinger.active_learning.al_utils.read_all_st_from_file(st_file: str) List[schrodinger.structure._structure.Structure] ¶
Return all the structures in a file as a list.
- Parameters
st_file – path of the input file
- Returns
list of structures in the file
- schrodinger.active_learning.al_utils.my_file_exists(filename: str)¶
A version of os.path.isfile that returns None, instead of raising an error, if the input is None.
- schrodinger.active_learning.al_utils.generate_iter_values(input_value: str, num_iter: int, data_type=<class 'str'>)¶
Generate a list of values for a specified number of iterations.
This function takes a comma-separated string of values and a number of iterations, and returns a list of values. If only one value is provided, it is repeated for the specified number of iterations.
- Parameters
input_value – A comma-separated string of values
num_iter – The number of iterations
data_type – Input type of the value
- Returns
A list of values for the specified number of iterations
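The expansion rule can be sketched as follows. What the real function does when the number of values is neither 1 nor num_iter is not stated, so raising ValueError here is an assumption; generate_iter_values_sketch is a hypothetical name.

```python
def generate_iter_values_sketch(input_value, num_iter, data_type=str):
    """Hypothetical sketch: one value per iteration from a comma-separated
    string, repeating a single value num_iter times."""
    values = [data_type(v.strip()) for v in input_value.split(",")]
    if len(values) == 1:
        return values * num_iter
    if len(values) != num_iter:
        # Assumption: a count mismatch is treated as an error.
        raise ValueError(
            f"expected 1 or {num_iter} values, got {len(values)}")
    return values


repeated = generate_iter_values_sketch("100", 3, int)
per_iter = generate_iter_values_sketch("greedy, random", 2)
```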
- schrodinger.active_learning.al_utils.get_al_script_name(argv: argparse.Namespace) str ¶
Determine the active learning script name from the provided arguments. This function checks if any of the predefined active learning script names are present in the list of arguments and returns the first match found.
- Parameters
argv – List of command-line arguments
- Returns
the name of the active learning script if found in the arguments, otherwise None.
- schrodinger.active_learning.al_utils.default_args(v: str, script_name: str) dict ¶
Return the default arguments for the given script. If None is passed as script_name, use AL-Glide default arguments
- schrodinger.active_learning.al_utils.configure_no_input_splitting(args: argparse.Namespace) argparse.Namespace ¶
- schrodinger.active_learning.al_utils.configure_mq_run(args: argparse.Namespace) argparse.Namespace ¶
- schrodinger.active_learning.al_utils.configure_dise_args(args: argparse.Namespace, default_sim_threshold: float) argparse.Namespace ¶
Make sure that the similarity threshold for DISE selection is between 0 and 1. If not, set it to the default value.
- schrodinger.active_learning.al_utils.scale_for_diversity_job(num_randomly_selected_ligands: int, max_diversity_sample_size: int, host_ncpus: int) Tuple[int, int] ¶
Scale the number of cpus and ndim for the combinatorial diversity job based on how many ligands were randomly selected. If the number of selected ligands is less than the max sample size, scale down the user provided cpus as they are unnecessary. Note: default/minimum ndim value is 10, and can only increase based on how many random ligands were selected.
- Parameters
num_randomly_selected_ligands – The number of randomly selected ligs.
max_diversity_sample_size – The max diversity sample size input.
host_ncpus – The number of host cpus available.
- Returns
The scaled ncpus and ndim.
- schrodinger.active_learning.al_utils.check_restart_host(restart_file: str, jobname: str)¶
Check whether the original job was launched with -HOST. If so, the
- schrodinger.active_learning.al_utils.check_required_modules()¶
Check if the required modules are installed for active learning workflows.
- schrodinger.active_learning.al_utils.set_env_variable(key, value)¶
A context manager to set an environment variable temporarily.
- Parameters
key – The environment variable key.
value – The environment variable value to set.
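A temporary environment override of this kind is usually built on contextlib.contextmanager. The sketch below is an illustrative assumption, not the module's code; set_env_variable_sketch and the key AL_UTILS_SKETCH_DEMO are hypothetical.

```python
import contextlib
import os


@contextlib.contextmanager
def set_env_variable_sketch(key, value):
    """Hypothetical sketch: set os.environ[key] for the with-block only."""
    previous = os.environ.get(key)
    os.environ[key] = value
    try:
        yield
    finally:
        # Restore the prior state, including "was not set at all".
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


os.environ.pop("AL_UTILS_SKETCH_DEMO", None)
with set_env_variable_sketch("AL_UTILS_SKETCH_DEMO", "1"):
    inside = os.environ["AL_UTILS_SKETCH_DEMO"]
after = os.environ.get("AL_UTILS_SKETCH_DEMO")
```

The try/finally guarantees the original value is restored even if the body of the with-block raises.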
- schrodinger.active_learning.al_utils.get_max_open_files()¶
- schrodinger.active_learning.al_utils.get_extra_features(smi_title_list, csv_file_list, extra_features, smi_col='SMILES', title_col='Title')¶
- schrodinger.active_learning.al_utils.merge_files(file_list, output_file)¶
Simple merge function to merge a list of text or .smi files into a single output file.
- schrodinger.active_learning.al_utils.add_list_args(args)¶
Convert the space-separated string of values to a list of values.
- Parameters
args (object) – An object containing various attributes.
- Returns
None. This function modifies the args object in place.
- schrodinger.active_learning.al_utils.set_default_args(args, script_name)¶
Set default values for specific arguments if they are not provided. The default values are based on the AL script.
- Parameters
args (argparse.Namespace) – The arguments object containing the parameters to be checked and potentially updated.
script_name (str) – The name of the script for which default values are being set.
- Returns
None