schrodinger.active_learning.al_node module¶
- schrodinger.active_learning.al_node.estimate_time_cost(num_ligands, num_iter, train_size, train_time, num_score_license, num_autoqsar_license, available_cpu=None, score_per_ligand_cost=20, autoqsar_per_ligand_cost=0.02, num_rescore_ligand=0, multiplier=1.0, application='')¶
Roughly estimate the time cost of an active learning job based on the inputs and the number of available licenses.
- Parameters
num_ligands (int) – total number of ligands in the library.
num_iter (int) – number of active learning iterations.
train_size (int) – Ligand_ML training size per iteration.
train_time (float) – Ligand_ML training time per iteration in hours.
num_score_license (int) – total number of licenses for the scoring application.
num_autoqsar_license (int) – total number of AutoQSAR licenses.
available_cpu (int) – number of available CPUs.
score_per_ligand_cost (float) – estimated time cost of scoring a single ligand, in seconds.
autoqsar_per_ligand_cost (float) – estimated Ligand_ML time cost for a single ligand, in seconds.
num_rescore_ligand (int) – number of ligands to be rescored.
multiplier (float) – estimated expansion number per ligand.
application (str) – name of the application that provides the score.
- Returns
estimated time cost in hours.
- Return type
float
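The source does not give the formula behind this estimate. As a rough illustration of how the parameters could combine, the sketch below assumes that each iteration scores `train_size` ligands, trains once, and evaluates the whole library, with work divided evenly across licenses. The function name `rough_time_cost_hours` and the formula itself are hypothetical and are not the library's implementation.

```python
SECONDS_PER_HOUR = 3600.0


def rough_time_cost_hours(num_ligands, num_iter, train_size, train_time,
                          num_score_license, num_ml_license,
                          score_per_ligand_cost=20.0,
                          ml_per_ligand_cost=0.02,
                          num_rescore_ligand=0, multiplier=1.0):
    """Back-of-envelope estimate in hours (assumed formula, not the real one)."""
    # Scoring: train_size ligands per iteration, each expanded by `multiplier`,
    # spread evenly over the scoring licenses.
    score_hours = (train_size * multiplier * score_per_ligand_cost
                   / num_score_license / SECONDS_PER_HOUR)
    # ML evaluation: the whole library is predicted once per iteration.
    eval_hours = (num_ligands * ml_per_ligand_cost
                  / num_ml_license / SECONDS_PER_HOUR)
    # Final rescoring happens once, after the last iteration.
    rescore_hours = (num_rescore_ligand * multiplier * score_per_ligand_cost
                     / num_score_license / SECONDS_PER_HOUR)
    return num_iter * (score_hours + train_time + eval_hours) + rescore_hours


hours = rough_time_cost_hours(num_ligands=3_600_000, num_iter=2,
                              train_size=1800, train_time=1.0,
                              num_score_license=10, num_ml_license=20)
print(f"{hours:.1f} hours")
```

The point of the sketch is the unit handling: per-ligand costs are in seconds, `train_time` is already in hours, and the result is in hours.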
- schrodinger.active_learning.al_node.get_jobdj(host_list=None)¶
Return a JobDJ with the specified host list.
- Parameters
host_list ([(str, int)] or None) – A list of (<hostname>, <maximum_concurrent_subjobs>)
- Returns
JobDJ with specific settings.
- Return type
queue.JobDJ object
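For illustration, a `host_list` of the documented shape `[(str, int)]` could be built from a `name:count` string as sketched below. The helper `parse_host_list`, the input format, and the default of one subjob per host are assumptions made for this example and are not part of the module.

```python
def parse_host_list(host_arg):
    """Turn 'localhost:4,cluster' into [('localhost', 4), ('cluster', 1)]."""
    hosts = []
    for entry in host_arg.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        name, sep, count = entry.partition(":")
        # Assumed default: one concurrent subjob when no count is given.
        hosts.append((name, int(count) if sep else 1))
    return hosts


host_list = parse_host_list("localhost:4, cluster")
```

The resulting list of `(<hostname>, <maximum_concurrent_subjobs>)` tuples is the kind of value the `host_list` parameter describes.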
- schrodinger.active_learning.al_node.get_top_ligands_from_csv_list(csv_list, output_csv, num_ligands)¶
Get the top ligands from a list of .csv files and write the selected ligands to the output .csv file.
- Parameters
csv_list (list(str)) – list of .csv files containing the ligands.
output_csv (str) – name of output .csv file.
num_ligands (int) – number of ligands to select.
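A minimal sketch of this kind of selection using only the standard library is shown below. It assumes that all files share one header, that the score lives in a column named `score`, and that lower scores are better; none of these details are confirmed by the source, and `top_ligands_from_csvs` is a hypothetical name.

```python
import csv
import heapq


def top_ligands_from_csvs(csv_list, output_csv, num_ligands,
                          score_column="score"):
    """Collect rows from every file and keep the num_ligands lowest scores."""
    rows = []
    fieldnames = None
    for path in csv_list:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first file
            rows.extend(reader)
    # nsmallest returns the rows sorted from best (lowest) to worst.
    best = heapq.nsmallest(num_ligands, rows,
                           key=lambda row: float(row[score_column]))
    with open(output_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames or [])
        writer.writeheader()
        writer.writerows(best)
    return best
```

This version loads every row into memory, which is fine for a sketch but may not suit very large libraries.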
- class schrodinger.active_learning.al_node.ActiveLearningNode(iter_num=1, job_name='active_learning', job_dir='.')¶
Bases:
object
- __init__(iter_num=1, job_name='active_learning', job_dir='.')¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- classmethod getName(iter_num)¶
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.PrepareSmilesNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for selecting ligands (SMILES) to be scored by ScoreProviderNode.
- checkOutcome(smi_file)¶
Validate the generated SMILES file.
- Parameters
smi_file (str) – name of SMILES file to be validated.
- runNode(csv_list, active_learning_job, smi_file_name=None, **kwargs)¶
Select ligands to be scored.
- Parameters
csv_list (list(str)) – list of csv files that contain candidate ligands.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
smi_file_name (str) – SMILES file name that contains selected ligands.
- uncertaintySelect(smi_file_name, scored_csv_file_list, sample_size, **kwargs)¶
Select random ligands from initial input csv or ligands with largest uncertainty from sorted ligand_ml .csv output.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
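The two modes described for uncertaintySelect can be sketched on in-memory records as follows. The function name, the `uncertainty` key, and the `has_predictions` flag are assumptions made for this example; the real method reads its input from .csv files.

```python
import random


def uncertainty_select(rows, sample_size, has_predictions, seed=0):
    """rows: list of dicts; each needs an 'uncertainty' key when predictions exist."""
    if not has_predictions:
        # No model output yet (initial input): fall back to a random sample.
        rng = random.Random(seed)
        return rng.sample(rows, min(sample_size, len(rows)))
    # Model output available: take the ligands the model is least sure about.
    ranked = sorted(rows, key=lambda row: row["uncertainty"], reverse=True)
    return ranked[:sample_size]
```

Sampling the most uncertain ligands spends scoring time where the model has the most to learn, at the cost of not always scoring the predicted best compounds.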
- greedySelect(smi_file_name, scored_csv_file_list, sample_size, ascending=True, **kwargs)¶
Select top ligands from sorted ligand_ml .csv output. Ligands in self.csv_list should already be sorted from best to worst.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of .csv files containing scored ligands
sample_size (int) – number of ligands to be sampled.
ascending (bool) – ligands with lower scores are better
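Because each input is expected to be sorted from best to worst, a greedy pick can merge the inputs lazily instead of re-sorting everything. The sketch below works on in-memory lists of dicts with an assumed `score` key; `greedy_select` here is a hypothetical stand-alone function, not the method itself.

```python
import heapq
from itertools import islice


def greedy_select(sorted_lists, sample_size, ascending=True, score_key="score"):
    """Merge pre-sorted lists and return the first sample_size entries.

    With ascending=True each list must run from lowest to highest score;
    with ascending=False, from highest to lowest.
    """
    merged = heapq.merge(*sorted_lists,
                         key=lambda row: row[score_key],
                         reverse=not ascending)
    return list(islice(merged, sample_size))
```

`heapq.merge` only produces a correct ordering when every input already follows the requested direction, which mirrors the requirement that the ligands be sorted beforehand.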
- randomSelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶
Select sample_size random ligands from input csv file(s).
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
sort (bool) – Whether the csv files were sorted or initial inputs.
- diversitySelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶
Use combinatorial_diversity to select diverse ligands from input csv or sorted ligand_ml .csv output.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
sort (bool) – Whether the csv files were sorted or initial inputs.
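The internals of combinatorial_diversity are not described here. To illustrate what diversity-driven selection means in general, the sketch below swaps in a different, well-known technique: a greedy max-min picker over Tanimoto distances between fingerprint bit sets. All names and the fingerprint representation are assumptions for this example.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - |A & B| / |A | B| for two sets of 'on' bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(bits_a & bits_b) / union


def maxmin_select(fingerprints, sample_size):
    """fingerprints: dict of title -> set of bits. Returns selected titles."""
    titles = list(fingerprints)
    if not titles or sample_size <= 0:
        return []
    selected = [titles[0]]  # seed with the first ligand
    remaining = titles[1:]
    # Distance from each candidate to its nearest already-selected ligand.
    nearest = {t: tanimoto_distance(fingerprints[t], fingerprints[titles[0]])
               for t in remaining}
    while remaining and len(selected) < sample_size:
        pick = max(remaining, key=lambda t: nearest[t])
        selected.append(pick)
        remaining.remove(pick)
        for t in remaining:
            nearest[t] = min(nearest[t],
                             tanimoto_distance(fingerprints[t], fingerprints[pick]))
    return selected
```

Each step adds the candidate that is farthest from everything chosen so far, so near-duplicates of already-selected ligands are picked last.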
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- classmethod getName(iter_num)¶
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.ScoreProviderNode(iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
- __init__(iter_num, job_name, job_dir)¶
Initialize node for obtaining the score of each ligand (SMILES).
- checkOutcome(score_csv_file)¶
Validate the .csv score file.
- Parameters
score_csv_file (str) – name of generated .csv score file.
- writeScoreCsv(title_to_score, output_csv)¶
Write score to .csv file that ligand_ml needs for training
- Parameters
title_to_score (defaultdict(lambda : BAD_SCORE)) – dict that maps ligand title to score
output_csv (str) – ligand_ml training .csv file.
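The `defaultdict(lambda : BAD_SCORE)` type means that any ligand without a computed score still gets a row, filled with the sentinel value. A minimal sketch of that behaviour follows; the value of `BAD_SCORE`, the column names, and the extra `titles` argument are assumptions made for this example and are not taken from the module.

```python
import csv
from collections import defaultdict

BAD_SCORE = 10000.0  # hypothetical sentinel; the real value is not given here


def write_score_csv(title_to_score, titles, output_csv):
    """Write one 'title,score' row per title, using BAD_SCORE when missing."""
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "score"])
        for title in titles:
            # Indexing a defaultdict never raises KeyError: unknown titles
            # fall back to the factory value, BAD_SCORE.
            writer.writerow([title, title_to_score[title]])


scores = defaultdict(lambda: BAD_SCORE, {"lig1": -7.2})
```

Calling `write_score_csv(scores, ["lig1", "lig2"], "train.csv")` would record the real score for `lig1` and the sentinel for `lig2`, so failed ligands remain visible in the training file.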
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- classmethod getName(iter_num)¶
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.KnownScoreProviderNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ScoreProviderNode
Class for obtaining the scores from an external .csv file. This class is only used for testing the performance of the active learning workflow.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for obtaining the score of each ligand (SMILES).
- runNode(smi_file_name, active_learning_job, score_csv_file=None)¶
Read scores from active_learning_job.known_title_to_score.
- Parameters
smi_file_name (str) – SMILES file that contains the ligands to be scored.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
score_csv_file (str) – ligand_ml training .csv file.
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- checkOutcome(score_csv_file)¶
Validate the .csv score file.
- Parameters
score_csv_file (str) – name of generated .csv score file.
- classmethod getName(iter_num)¶
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- writeScoreCsv(title_to_score, output_csv)¶
Write score to .csv file that ligand_ml needs for training
- Parameters
title_to_score (defaultdict(lambda : BAD_SCORE)) – dict that maps ligand title to score
output_csv (str) – ligand_ml training .csv file.
- class schrodinger.active_learning.al_node.LigandMLTrainNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
Class for ligand_ml model generation.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- checkOutcome(model_file)¶
Check whether the ligand_ml model exists.
- Parameters
model_file (str) – name of ligand_ml .qzip model file
- createTrainingCsvFile(discard_cutoff, ascending=True)¶
Generate .csv file for ligand_ml training
- Parameters
discard_cutoff (float) – score cutoff for excluding the ligands in ML training set.
ascending (bool) – lower value means better ligand if ascending is True
Generate training .csv file for ligand_ml model generation.
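How `discard_cutoff` interacts with `ascending` is not spelled out. One plausible reading, sketched below, is that ligands scoring worse than the cutoff are dropped from the training set, where "worse" means higher when `ascending` is True and lower otherwise. The function name, the `score` key, and the inclusive boundary are assumptions.

```python
def filter_training_rows(rows, discard_cutoff, ascending=True, score_key="score"):
    """Keep only rows whose score is at least as good as discard_cutoff."""
    if ascending:
        # Lower is better: discard scores above the cutoff.
        return [row for row in rows if row[score_key] <= discard_cutoff]
    # Higher is better: discard scores below the cutoff.
    return [row for row in rows if row[score_key] >= discard_cutoff]
```

Filtering like this keeps obviously failed or sentinel scores from dominating the regression target.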
- runNode(active_learning_job)¶
Perform ligand_ml training with all the scored ligands. The model file includes the job_args.json file.
- Parameters
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- runLigandMLMerge(merged_model_name, sub_model_list, jobdj=None)¶
Merge a list of .tar.gz ligand_ml models into a single .tar.gz ligand_ml model.
- Parameters
merged_model_name (str) – path of the final merged .tar.gz model.
sub_model_list (list(str)) – list of ligand_ml models to be merged.
jobdj (queue.JobDJ object or None) – JobDJ where the merging job runs.
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- classmethod getName(iter_num)¶
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.LigandMLEvalNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
Class for performing ligand_ml prediction with generated model.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- getBestResults(file_list, outfile, ascending=True)¶
Get the best ligands (with the lowest score) predicted by ligand_ml.
- Parameters
file_list (list(str)) – list of ligand_ml .csv output files. Each file is sorted by ligand_ml prediction score.
outfile (str) – .csv file that contains the best ligands.
ascending (bool) – lower value means better ligand if ascending is True
- checkOutcome(pred_csv_list, uncertain_csv_list)¶
Check the existence of ligand_ml prediction files.
- Parameters
pred_csv_list (list(str)) – list of ligand_ml prediction .csv file(s).
uncertain_csv_list (list(str)) – list of ligand_ml prediction-with-uncertainty .csv file(s).
- runNode(model_file, active_learning_job)¶
Use the trained model to evaluate all the ligands.
- Parameters
model_file (str) – ligand_ml .qzip model file.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- evalMQ(tar_model_file, output_csv, active_learning_job)¶
Evaluate ligands with ligand_ml model using ZMQ. Distributes evaluation jobs over a set of workers. Reverts to jobdj if ZMQ fails.
- Parameters
tar_model_file (str) – trained tar.gz model file
output_csv (str) – output prediction file
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- Returns
True if ZMQ job is successful
- Return type
bool
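The "revert to jobdj if ZMQ fails" behaviour is a primary-with-fallback pattern. The sketch below shows that pattern in isolation with plain callables; `evaluate_with_fallback` and its return values are hypothetical and say nothing about how the real method detects failure.

```python
def evaluate_with_fallback(primary, fallback, *args):
    """Run primary; if it raises or returns a falsy value, run fallback.

    Returns the name of the path that produced the result.
    """
    try:
        if primary(*args):
            return "primary"
    except Exception as err:  # broad on purpose: any failure triggers fallback
        print(f"primary evaluation failed: {err}")
    fallback(*args)
    return "fallback"
```

In this reading, the primary callable plays the role of the ZMQ-based evaluation and the fallback plays the role of the jobdj-based one, each receiving the same model file, output file, and job arguments.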
- evalDJ(tar_model_file, output_csv, active_learning_job)¶
Evaluate ligands with ligand_ml model using jobdj.
- Parameters
tar_model_file (str) – tar.gz trained model file
output_csv (str) – output prediction file
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- classmethod getName(iter_num)¶
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.ActiveLearningNodeSupplier(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ligand_ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ligand_ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶
Bases:
object
- __init__(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ligand_ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ligand_ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶
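The signature suggests that the supplier holds node classes rather than node instances, with defaults for the generic steps and required arguments for the application-specific scoring steps. A stripped-down sketch of that arrangement follows; every class and attribute name in it is hypothetical, and it does not import or reproduce the real module.

```python
class BaseNode:
    """Stand-in for a workflow node; stores only the iteration number."""

    def __init__(self, iter_num):
        self.iter_num = iter_num


class DefaultTrainNode(BaseNode):
    pass


class DefaultEvalNode(BaseNode):
    pass


class NodeSupplier:
    """Holds the node classes a driver should instantiate at each step."""

    def __init__(self, calculate_score_node,
                 train_node=DefaultTrainNode, eval_node=DefaultEvalNode):
        # Classes, not instances: the driver builds one node per iteration.
        self.calculate_score_node = calculate_score_node
        self.train_node = train_node
        self.eval_node = eval_node


class MyDockingNode(BaseNode):
    """Application-specific scoring step supplied by the caller."""


supplier = NodeSupplier(MyDockingNode)
node = supplier.calculate_score_node(iter_num=1)
```

Passing classes this way lets a caller replace a single step, such as the scoring node, while reusing the default training and evaluation steps.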