schrodinger.active_learning.al_node module¶

schrodinger.active_learning.al_node.estimate_time_cost(num_ligands, num_iter, train_size, train_time, num_score_license, num_autoqsar_license, available_cpu=None, score_per_ligand_cost=20, autoqsar_per_ligand_cost=0.02, num_rescore_ligand=0, multiplier=1.0, application='')¶

Roughly estimate the time cost of an active learning job based on the inputs and the number of available licenses.

Parameters:

num_ligands (int) – total number of ligands in the library.
num_iter (int) – number of active learning iterations.
train_size (int) – Ligand_ML training size per iteration.
train_time (float) – Ligand_ML training time per iteration in hours.
num_score_license (int) – total number of the application licenses
num_autoqsar_license (int) – total number of AutoQSAR licenses
available_cpu (int) – number of available CPUs
score_per_ligand_cost (float) – estimated time cost of scoring a single ligand, in seconds.
autoqsar_per_ligand_cost (float) – estimated Ligand_ML time cost for a single ligand, in seconds.
num_rescore_ligand (int) – number of ligands to be rescored.
multiplier (float) – estimated expansion number per ligand.
application (str) – name of the application that provides score

Returns:

estimated time cost in hours

Return type:

float
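
Example (illustrative sketch only): the formula used by estimate_time_cost is not documented here. The function below, rough_time_estimate_hours, is a hypothetical back-of-the-envelope model that assumes scoring and prediction are spread evenly over the available licenses and that the full library is evaluated once per iteration. All names in it are assumptions.

```python
def rough_time_estimate_hours(num_ligands, num_iter, train_size,
                              train_time_hours, num_score_license,
                              num_ml_license, score_cost_s=20.0,
                              ml_cost_s=0.02, multiplier=1.0):
    """Hypothetical estimate; NOT the formula used by estimate_time_cost."""
    # Scoring: each iteration scores train_size ligands (expanded by
    # multiplier), assumed to run in parallel over the scoring licenses.
    score_s = (num_iter * train_size * multiplier * score_cost_s
               / max(num_score_license, 1))
    # Training: train_time_hours is given per iteration.
    train_s = num_iter * train_time_hours * 3600.0
    # Prediction: assume the whole library is evaluated every iteration,
    # in parallel over the ML licenses.
    eval_s = num_iter * num_ligands * ml_cost_s / max(num_ml_license, 1)
    return (score_s + train_s + eval_s) / 3600.0


hours = rough_time_estimate_hours(
    num_ligands=1_000_000, num_iter=3, train_size=1000,
    train_time_hours=0.5, num_score_license=100, num_ml_license=10)
print(f"{hours:.2f} h")
```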

schrodinger.active_learning.al_node.get_jobdj(host_list: Optional[Union[str, int]] = None, subjob_max_retries: int = 0) → JobDJ¶

Return JobDJ with specified host list

Parameters:

host_list – A list of (<hostname>, <maximum_concurrent_subjobs>) tuples
subjob_max_retries – Maximum number of retries for subjobs

Returns:

JobDJ with specific settings.

schrodinger.active_learning.al_node.get_top_ligands_from_csv_list(csv_list, output_csv, num_ligands)¶

Get the top ligands from a list of .csv files. Write the selected ligands to output csv file.

Parameters:

csv_list (list(str)) – list of .csv files containing the ligands.
output_csv (str) – name of output .csv file.
num_ligands (int) – number of ligands to select.
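
Example (illustrative sketch only): one way to pick the best rows across several .csv files with the standard library. The column name "score", the assumption that lower scores are better, and the helper name top_ligands are hypothetical and not taken from this module.

```python
import csv
import heapq


def top_ligands(csv_paths, output_csv, num_ligands, score_col="score"):
    """Write the num_ligands lowest-scoring rows from csv_paths to output_csv."""
    rows = []
    fieldnames = None
    for path in csv_paths:
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh)
            # Assume every file shares the header of the first one.
            fieldnames = fieldnames or reader.fieldnames
            rows.extend(reader)
    # nsmallest avoids sorting the whole pool when num_ligands is small.
    best = heapq.nsmallest(num_ligands, rows,
                           key=lambda row: float(row[score_col]))
    with open(output_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(best)
    return best
```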

class schrodinger.active_learning.al_node.ActiveLearningNode(iter_num=1, job_name='active_learning', job_dir='.')¶

Bases: object

__init__(iter_num=1, job_name='active_learning', job_dir='.')¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

classmethod getName(iter_num)¶

addOptionalRestartFiles(active_learning_job)¶

Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.

Parameters:

active_learning_job (ActiveLearningJob instance) – current AL driver

needsHistogram()¶

Whether we can generate a histogram plot of calculated target scores.

Returns:

whether the histogram of scores can be plotted

Return type:

bool

class schrodinger.active_learning.al_node.PrepareSmilesNode(args, iter_num, job_name, job_dir)¶

Bases: ActiveLearningNode

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for selecting ligands (SMILES) to be scored by ScoreProviderNode.

update_csv_list(filtered_csv_list: list)¶

Update the csv_list that will be used for ligand selection in AL-FEP workflows.

Parameters:

filtered_csv_list – New list of csv files to use for selection

checkOutcome(smi_file)¶

Validate the generated SMILES file.

Parameters:

smi_file (str) – name of SMILES file to be validated.

getNodeConfigReport()¶

runNode(top_ratio_csv_list, active_learning_job, smi_file_name=None, **kwargs)¶

Select ligands to be scored. The diversity, random, uncertain, and distinct scaffold selection rules use ligands in top_ratio_csv_list. The greedy and dise selection rules use ligands in active_learning_job.most_recent_pred_file.

Parameters:

top_ratio_csv_list (list(str)) – list of csv files that contain the top -uncertainty_sample_ratio ligands by learning target.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
smi_file_name (str) – SMILES file name that contains selected ligands.

uncertaintySelect(smi_file_name, scored_csv_file_list, sample_size, **kwargs)¶

Select random ligands from initial input csv or ligands with largest uncertainty from sorted ligand_ml .csv output.

Parameters:

smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv file.
sample_size (int) – number of ligands to be sampled.

greedySelect(smi_file_name, scored_csv_file_list, sample_size, ascending=True, **kwargs)¶

Select top ligands from sorted ligand_ml .csv output. Ligands in self.csv_list should already be sorted from best to worst.

Parameters:

smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of .csv files containing scored ligands
sample_size (int) – number of ligands to be sampled.
ascending (bool) – ligands with lower scores are better

randomSelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶

Select sample_size random ligands from input csv file(s).

Parameters:

smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv file.
sample_size (int) – number of ligands to be sampled.
sort (bool) – Whether the csv files were sorted or initial inputs.
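
Example (illustrative sketch only): when the input files are too large to hold in memory, a uniform random sample of fixed size can be drawn in a single pass with reservoir sampling. This names a generic technique, not necessarily what randomSelect does internally; reservoir_sample is a hypothetical helper.

```python
import random


def reservoir_sample(items, sample_size, seed=None):
    """Return sample_size items drawn uniformly from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir


smiles = (f"C{'C' * n}O" for n in range(1000))  # stand-in for ligand rows
picked = reservoir_sample(smiles, 5, seed=42)
```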

diversitySelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶

Use combinatorial_diversity to select diverse ligands from input csv or sorted ligand_ml .csv output.

The number of cpus (ncpu) and ndim are scaled in proportion to the number of random ligands selected, where ndim is the dimensionality of the chemical space. When the number of random ligands equals max_diversity_sample_size, ncpu and ndim reach their maximums of 300 and 13, respectively. If 300 cpus are not available, the user-defined ncpus value is used.

Parameters:

smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv file.
sample_size (int) – number of ligands to be sampled.
sort (bool) – Whether the csv files were sorted or initial inputs.
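
Example (illustrative sketch only): the exact scaling rule is not given in this documentation. The hypothetical helper below assumes simple linear scaling with rounding up, a floor of 1, and a cap at the user's cpu count; only the maximums 300 and 13 come from the description above.

```python
import math


def scale_diversity_resources(num_random, max_sample_size, user_ncpu,
                              max_ncpu=300, max_ndim=13):
    """Hypothetical linear scaling of (ncpu, ndim) with the sample size."""
    fraction = min(num_random / max_sample_size, 1.0)
    ncpu = max(1, math.ceil(fraction * max_ncpu))
    ndim = max(1, math.ceil(fraction * max_ndim))
    # Never request more cpus than the user made available.
    return min(ncpu, user_ncpu), ndim


print(scale_diversity_resources(50_000, 100_000, user_ncpu=300))
```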

diseSelect(smi_file_name: str, scored_csv_file_list: list[str], sample_size: int, sort: bool = True, ascending: bool = True)¶

Uses DISE (Directed Sphere Exclusion) to select ligands for the training data. The objective is to select compounds based on both diversity and score. Using any previously scored compounds as seeds, candidate ligands are considered in order of their scores and evaluated based on fingerprint similarity. A candidate ligand is added to the selected list if it is distinct, i.e. its fingerprint differs from those of the seed compounds and the previously added compounds.

Parameters:

smi_file_name – Path of the .smi file which contains selected ligands.
scored_csv_file_list – list of paths to previously scored ligand_ml training csv files.
sample_size – number of ligands required.
sort – Whether the ligands in the input files are sorted.
ascending – Whether lower scores are better (True) or worse (False); used only if sort is True.
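
Example (illustrative sketch only): a minimal directed sphere exclusion loop over toy fingerprints represented as Python sets. The Tanimoto measure, the similarity threshold of 0.7, and the names tanimoto and dise_select are assumptions for illustration; the fingerprint type and cutoff used by diseSelect are not documented here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def dise_select(candidates, seed_fps, sample_size, threshold=0.7,
                ascending=True):
    """candidates: list of (title, score, fingerprint) tuples.

    Walk candidates from best to worst score and keep one only if it is
    less similar than threshold to every seed and every ligand kept so far.
    """
    ordered = sorted(candidates, key=lambda c: c[1], reverse=not ascending)
    kept_fps = list(seed_fps)
    selected = []
    for title, _score, fp in ordered:
        if len(selected) >= sample_size:
            break
        if all(tanimoto(fp, other) < threshold for other in kept_fps):
            selected.append(title)
            kept_fps.append(fp)
    return selected


candidates = [
    ("a", -10.0, {1, 2, 3}),
    ("b", -9.0, {1, 2, 3, 4}),   # 0.75 similar to "a", so excluded
    ("c", -8.0, {7, 8, 9}),
    ("d", -7.0, {1, 2}),
]
print(dise_select(candidates, seed_fps=[{20, 21}], sample_size=3))
```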

scaffoldSelect(smi_file_name: str, scored_csv_file_list: list[str], sample_size: int, sort: bool = True, scaffold_block_size: int = 1000000, **kwargs)¶

Selects ligands based on Bemis-Murcko (BM) scaffolds, aiming to maximize the number of distinct generic BM scaffolds in the training set. It follows these steps:

Divide and Cluster: Splits input ligands into smaller subfiles (size determined by scaffold_block_size). Each subfile is clustered by distinct scaffolds, retaining up to 10 example ligands per scaffold.
Select and Write Ligands: A target number of ligands (samples_per_file) is determined for each subfile, based on the desired overall sample_size. There are two possibilities for each subfile:
- Sufficient Scaffolds: If the subfile has at least the required samples_per_file unique scaffolds, one ligand per scaffold is chosen for the output from randomly selected scaffolds.
- Insufficient Scaffolds: If there are fewer scaffolds than that, all example ligands from all scaffolds are pooled, and ligands are selected from this pool for the final set.
Note: Identical scaffolds from different subfiles are treated as distinct, which can lead to duplicate scaffolds. This level of duplication is acceptable for the purpose of selecting a training set.
Fallback: If there are too few scaffolds and the number of ligands selected by this procedure is less than 40% of the required sample_size, combinatorial diversity is used for ligand selection instead.

Parameters:

smi_file_name – Path to .smi file that contains selected ligands.
scored_csv_file_list – list of ligand_ml training .csv files from previous rounds, these ligands need to be excluded from the current round.
sample_size – number of ligands required in output file.
sort – Whether the input ligands are sorted or not.
ascending – Whether lower scores are better (True) or worse (False).
scaffold_block_size – number of ligands in one scaffold subjob.
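
Example (illustrative sketch only): the per-subfile selection and the 40% fallback test described above, written against precomputed scaffold strings. How scaffolds are computed, which example ligand represents a scaffold (here the first one seen), and how the pool is sampled (here uniformly at random) are assumptions; select_from_block and needs_diversity_fallback are hypothetical names.

```python
import random
from collections import defaultdict

MAX_EXAMPLES_PER_SCAFFOLD = 10  # "up to 10 example ligands per scaffold"


def select_from_block(ligands, samples_per_file, rng):
    """ligands: list of (title, scaffold) pairs for one subfile."""
    by_scaffold = defaultdict(list)
    for title, scaffold in ligands:
        if len(by_scaffold[scaffold]) < MAX_EXAMPLES_PER_SCAFFOLD:
            by_scaffold[scaffold].append(title)
    if len(by_scaffold) >= samples_per_file:
        # Sufficient scaffolds: one ligand from each randomly chosen scaffold.
        chosen = rng.sample(sorted(by_scaffold), samples_per_file)
        return [by_scaffold[scaffold][0] for scaffold in chosen]
    # Insufficient scaffolds: pool all retained examples and draw from the pool.
    pool = [title for titles in by_scaffold.values() for title in titles]
    return rng.sample(pool, min(samples_per_file, len(pool)))


def needs_diversity_fallback(num_selected, sample_size, min_fraction=0.4):
    """True when scaffold selection produced under 40% of the request."""
    return num_selected < min_fraction * sample_size
```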

class schrodinger.active_learning.al_node.ScoreProviderNode(iter_num, job_name, job_dir)¶

Bases: ActiveLearningNode

__init__(iter_num, job_name, job_dir)¶

Initialize node for obtaining the score of each ligand (SMILES).

checkOutcome(score_csv_file)¶

Validate the .csv score file.

Parameters:

score_csv_file (str) – name of generated .csv score file.

writeScoreCsv(title_to_score, output_csv)¶

Write score to .csv file that ligand_ml needs for training

Parameters:

title_to_score (defaultdict(lambda : BAD_SCORE)) – dict that maps ligand title to score
output_csv (str) – ligand_ml training .csv file.

class schrodinger.active_learning.al_node.KnownScoreProviderNode(args, iter_num, job_name, job_dir)¶

Bases: ScoreProviderNode

Class for obtaining the scores from an external .csv file. This class is only used for testing the performance of the active learning workflow.

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for obtaining the score of each ligand (SMILES).

runNode(smi_file_name, active_learning_job, score_csv_file=None)¶

Read scores from active_learning_job.known_title_to_score.

Parameters:

smi_file_name (str) – SMILES file that contains the ligands to be scored.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
score_csv_file (str) – ligand_ml training .csv file.

class schrodinger.active_learning.al_node.TrainNode(args, iter_num, job_name, job_dir)¶

Bases: ActiveLearningNode

Class for training the machine learning model.

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

createRawTrainingCsvFile(discard_cutoff, ascending=True)¶

Generate .csv file for ligand_ml training

Parameters:

discard_cutoff (float) – score cutoff for excluding the ligands in ML training set.
ascending (bool) – lower value means better ligand if ascending is True

Generate training .csv file for ligand_ml model generation.
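
Example (illustrative sketch only): how a score cutoff interacts with the ascending flag. Whether the cutoff itself is inclusive is an assumption, and keep_for_training is a hypothetical helper, not part of TrainNode.

```python
def keep_for_training(score, discard_cutoff, ascending=True):
    """Return True if a ligand's score passes the discard cutoff.

    With ascending=True lower scores are better, so anything above the
    cutoff is discarded; with ascending=False the comparison is reversed.
    """
    if ascending:
        return score <= discard_cutoff
    return score >= discard_cutoff


scores = {"lig1": -9.2, "lig2": -3.1, "lig3": 1.5}
kept = [title for title, score in scores.items()
        if keep_for_training(score, discard_cutoff=0.0)]
```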

class schrodinger.active_learning.al_node.AutoQSARTrainNode(args, iter_num, job_name, job_dir)¶

Bases: TrainNode

Class for AutoQSAR model generation.

AUTOQSAR_CMD = 'utilities/autoqsar'¶

TRAIN_FRACTION = 0.8¶

HOLDOUT_FRACTION = 0.1¶

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

runNode(active_learning_job)¶

Train an AutoQSAR model on 90% of the scored ligands (AutoQSAR will further apply its own internal 80/20 split on this 90%), then evaluate the model on the remaining 10% hold-out set and report metrics computed directly from those predictions.

Parameters:

active_learning_job (ActiveLearningJob instance.) – current active learning job.
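
Example (illustrative sketch only): the split arithmetic described above, using the class constants TRAIN_FRACTION and HOLDOUT_FRACTION. The shuffling, the helper names split_holdout and r_squared, and the choice of R2 as the reported metric are assumptions; the metrics actually reported by runNode are not listed here.

```python
import random

TRAIN_FRACTION = 0.8    # AutoQSAR's internal train fraction
HOLDOUT_FRACTION = 0.1  # held out before AutoQSAR sees the data


def split_holdout(rows, holdout_fraction=HOLDOUT_FRACTION, seed=0):
    """Shuffle rows and return (model_rows, holdout_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def r_squared(observed, predicted):
    """Coefficient of determination computed directly from predictions."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


model_rows, holdout_rows = split_holdout(range(1000))
# Of the 900 rows given to AutoQSAR, 80% are used for its internal training.
internal_train = round(len(model_rows) * TRAIN_FRACTION)
```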

checkOutcome(model_file: str)¶

Check whether the AutoQSAR model was generated successfully.

Parameters:

model_file – path to the AutoQSAR .qzip model file

getNodeConfigReport()¶

class schrodinger.active_learning.al_node.LigandMLTrainNode(args, iter_num, job_name, job_dir)¶

Bases: TrainNode

Class for ligand_ml model generation.

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

property need_feature¶

runDiskDatasetJob(model_archive: str, csv_file_abspath: str) → None¶

Generate disk dataset (requires multiprocessing). Runs on the driver node to ensure that enough cores are available.

Parameters:

model_archive – equivalent to prepare LigandML smasher base_dir
csv_file_abspath – input file to generate disk datasets

trainModel(csv_file_abspath, model_file, features, feature_type, jobdj)¶

Train ligand_ml model with the .csv file.

Parameters:

csv_file_abspath (str) – path of the .csv file that contains the ligands and their scores.
model_file (str) – path of the output ligand_ml model file.
jobdj (queue.JobDJ object) – JobDJ where the training job runs.
features (list(str)) – list of features to be used in the model.
feature_type (no_feature, with_feature or only_feature) – type of features to be used in the model.

getNodeConfigReport()¶

runNode(active_learning_job)¶

Perform ligand_ml training with all the scored ligands. The model file includes the job_args.json file

Parameters:

active_learning_job (ActiveLearningJob instance.) – current active learning job.

checkOutcome(model_file: str)¶

Check whether the LigandML model was generated successfully.

Parameters:

model_file – path to the LigandML .qzip model file

static pickBestModel(model_file, model_dict)¶

Picks the best model from the given model dictionary based on the R2 score of each feature type.

Parameters:

model_file (str) – The file path where the best model will be saved.
model_dict (dict) – A dictionary containing the model information, where the keys are feature types and the values are the qzip model file paths.

Returns:

None
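
Example (illustrative sketch only): choosing the highest-R2 entry from a feature-type dictionary and copying it to the destination path. How pickBestModel obtains each model's R2 is not documented here, so the sketch takes the scores as a separate, hypothetical r2_by_feature_type argument; pick_best_model is likewise a hypothetical name.

```python
import shutil


def pick_best_model(model_file, model_dict, r2_by_feature_type):
    """Copy the model with the highest R2 to model_file; return its feature type."""
    best_type = max(model_dict, key=lambda ftype: r2_by_feature_type[ftype])
    shutil.copyfile(model_dict[best_type], model_file)
    return best_type
```

The feature-type keys would be values such as no_feature, with_feature, or only_feature, as listed for trainModel.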

class schrodinger.active_learning.al_node.EvalNode(args, iter_num, job_name, job_dir)¶

Bases: ActiveLearningNode

Class for performing ligand_ml prediction with generated model.

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

runNodeSetup(model_file: str, active_learning_job)¶

Set up the node for active learning by configuring model file, evaluation CSVs, and output file names.

runNodeFinalize()¶

Finalizes the node execution by handling output and restart files, and preparing input for the next node.

getBestResults(file_list, outfile)¶

Get the best ligands (with the lowest score) predicted by ligand_ml.

Parameters:

file_list (list(str)) – list of ligand_ml .csv output files. Each file is sorted by ligand_ml prediction score.
outfile (str) – .csv file that contains the best ligands.
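
Example (illustrative sketch only): because each input file is already sorted by prediction score, the best rows overall can be taken from a lazy k-way merge without loading every file fully. The column name "pred_score", the num_best argument, and the helper name best_predictions are hypothetical; getBestResults itself takes only file_list and outfile.

```python
import csv
import heapq
from itertools import islice


def best_predictions(file_list, outfile, num_best, score_col="pred_score"):
    """Merge pre-sorted prediction files and write the num_best lowest rows."""
    handles = [open(path, newline="") for path in file_list]
    try:
        readers = [csv.DictReader(fh) for fh in handles]
        fieldnames = readers[0].fieldnames
        # heapq.merge keeps only one pending row per input file in memory.
        merged = heapq.merge(*readers,
                             key=lambda row: float(row[score_col]))
        best = list(islice(merged, num_best))
    finally:
        for fh in handles:
            fh.close()
    with open(outfile, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(best)
    return best
```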

checkOutcome(pred_csv_list, uncertain_csv_list)¶

Check the existence of ml prediction files.

Parameters:

pred_csv_list (list(str)) – list of ml prediction csv file(s)
uncertain_csv_list (list(str)) – list of ml prediction with uncertainty csv file(s).

class schrodinger.active_learning.al_node.AutoQSAREvalNode(args, iter_num, job_name, job_dir)¶

Bases: EvalNode

Class for performing AutoQSAR prediction with generated model.

AutoQSAR outputs columns named r_autoqsar_Pred_<prop> and r_autoqsar_Pred_<prop>_SD, where <prop> is set by the -prop flag (see _PROP_NAME). These are renamed to the standard score and uncertainty columns expected by downstream nodes.
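
Example (illustrative sketch only): the renaming step applied to a .csv header. The target column names "score" and "uncertainty" and the helper name rename_autoqsar_columns are placeholders; the standard column names used by the downstream nodes are not given in this documentation.

```python
def rename_autoqsar_columns(fieldnames, prop, score_col="score",
                            uncertainty_col="uncertainty"):
    """Map AutoQSAR prediction columns for prop onto generic column names."""
    mapping = {
        f"r_autoqsar_Pred_{prop}": score_col,
        f"r_autoqsar_Pred_{prop}_SD": uncertainty_col,
    }
    # Columns that are not AutoQSAR predictions pass through unchanged.
    return [mapping.get(name, name) for name in fieldnames]


header = ["SMILES", "title", "r_autoqsar_Pred_y", "r_autoqsar_Pred_y_SD"]
print(rename_autoqsar_columns(header, prop="y"))
```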

AUTOQSAR_CMD = 'utilities/autoqsar'¶

runNode(model_file: str, active_learning_job)¶

Use the trained AutoQSAR model to evaluate all the ligands.

Parameters:

model_file – AutoQSAR .qzip model file.
active_learning_job – current active learning job.

class schrodinger.active_learning.al_node.LigandMLEvalNode(args, iter_num, job_name, job_dir)¶

Bases: EvalNode

Class for performing ligand_ml prediction with generated model.

__init__(args, iter_num, job_name, job_dir)¶

Initialize node for active learning workflow.

Parameters:

iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory of where the jobs in the node will run.

runNode(model_file: str, active_learning_job)¶

Use the trained model to evaluate all the ligands.

Parameters:

model_file – ligand_ml .qzip model file.
active_learning_job (ActiveLearningJob instance.) – current active learning job.

evalMQ()¶

Evaluate ligands with ligand_ml model using ZeroMQ. Distributes evaluation jobs over a set of workers. Reverts to jobdj if ZeroMQ fails.

evalDJ()¶

Evaluate ligands with ligand_ml model using jobdj.

class schrodinger.active_learning.al_node.ActiveLearningNodeSupplier(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶

Bases: object

__init__(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶

schrodinger.active_learning.al_node.get_ml_node_kwargs(args) → dict¶

Return ml_train_node/ml_eval_node kwargs for the supplier.

Parameters:

args – parsed command-line arguments.

Returns:

keyword arguments for ActiveLearningNodeSupplier.