schrodinger.active_learning.al_node module¶
- schrodinger.active_learning.al_node.estimate_time_cost(num_ligands, num_iter, train_size, train_time, num_score_license, num_autoqsar_license, available_cpu=None, score_per_ligand_cost=20, autoqsar_per_ligand_cost=0.02, num_rescore_ligand=0, multiplier=1.0, application='')¶
Roughly estimate the time cost of an active learning job based on the inputs and the number of available licenses.
- Parameters
num_ligands (int) – total number of ligands in the library.
num_iter (int) – number of active learning iterations.
train_size (int) – Ligand_ML training size per iteration.
train_time (float) – Ligand_ML training time per iteration in hours.
num_score_license (int) – total number of the application licenses
num_autoqsar_license (int) – total number of AutoQSAR licenses
available_cpu (int) – number of available CPUs.
score_per_ligand_cost (float) – estimated time cost of scoring a single ligand, in seconds.
autoqsar_per_ligand_cost (float) – estimated Ligand_ML time cost for a single ligand, in seconds.
num_rescore_ligand – number of ligands to be rescored.
multiplier (float) – estimated expansion number per ligand.
application (str) – name of the application that provides score
- Returns
estimated time cost in hours
- Return type
float
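The module does not document how the estimate is computed. As an illustration of the kind of back-of-envelope arithmetic involved, the sketch below combines the documented inputs into a figure in hours. The function name `rough_time_cost_hours`, the formula, and the way licenses divide the work are assumptions made for this example, not the behavior of `estimate_time_cost`.

```python
def rough_time_cost_hours(num_ligands, num_iter, train_size, train_time,
                          num_score_license, num_autoqsar_license,
                          score_per_ligand_cost=20.0,
                          autoqsar_per_ligand_cost=0.02,
                          num_rescore_ligand=0, multiplier=1.0):
    """Hypothetical estimate; NOT the formula used by estimate_time_cost."""
    seconds_per_hour = 3600.0
    # Assumption: each iteration scores train_size ligands (expanded by
    # multiplier), and the work is split evenly across scoring licenses.
    scoring = (num_iter * train_size * multiplier * score_per_ligand_cost
               / num_score_license)
    # Assumption: the whole library is evaluated once per iteration,
    # split evenly across AutoQSAR licenses.
    evaluation = (num_iter * num_ligands * autoqsar_per_ligand_cost
                  / num_autoqsar_license)
    # Assumption: rescoring costs the same per ligand as scoring.
    rescoring = (num_rescore_ligand * multiplier * score_per_ligand_cost
                 / num_score_license)
    # train_time is already given in hours per iteration.
    training = num_iter * train_time
    return (scoring + evaluation + rescoring) / seconds_per_hour + training
```

Under these assumptions, the per-ligand costs are in seconds and only `train_time` is in hours, which is why the conversion is applied to the first three terms only.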
- schrodinger.active_learning.al_node.get_jobdj(host_list=None)¶
Return a JobDJ with the specified host list.
- Parameters
host_list ([(str, int)] or None) – A list of (<hostname>, <maximum_concurrent_subjobs>)
- Returns
JobDJ with specific settings.
- Return type
queue.JobDJ object
- schrodinger.active_learning.al_node.get_top_ligands_from_csv_list(csv_list, output_csv, num_ligands)¶
Get the top ligands from a list of .csv files and write the selected ligands to the output .csv file.
- Parameters
csv_list (list(str)) – list of .csv files containing the ligands.
output_csv (str) – name of output .csv file.
num_ligands (int) – number of ligands to select.
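A minimal sketch of this kind of selection, using only the standard library. The function name `top_ligands_from_csvs`, the `score` column name, and the rule that a lower score is better are assumptions for illustration; the actual file layout used by the module is not documented here.

```python
import csv
import heapq


def top_ligands_from_csvs(csv_list, output_csv, num_ligands, score_col="score"):
    """Hypothetical stand-in: keep the num_ligands rows with the lowest score."""
    rows = []
    fieldnames = None
    for path in csv_list:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                # Reuse the header of the first file for the output.
                fieldnames = reader.fieldnames
            rows.extend(reader)
    # nsmallest returns the selected rows ordered from lowest to highest score.
    best = heapq.nsmallest(num_ligands, rows, key=lambda r: float(r[score_col]))
    with open(output_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames or [])
        writer.writeheader()
        writer.writerows(best)
    return best
```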
- class schrodinger.active_learning.al_node.ActiveLearningNode(iter_num=1, job_name='active_learning', job_dir='.')¶
Bases:
object
- __init__(iter_num=1, job_name='active_learning', job_dir='.')¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- classmethod getName(iter_num)¶
- addOptionalRestartFiles(active_learning_job)¶
Add node’s optional restart file(s) to driver’s restart dict. Dump the restart dict to the restart .pkl file.
- Parameters
active_learning_job (ActiveLearningJob instance) – current AL driver
- needsHistogram()¶
Whether we can generate a histogram plot of calculated target scores.
- Returns
whether the histogram of score can be plotted
- Return type
bool
- class schrodinger.active_learning.al_node.PrepareSmilesNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for selecting ligands (SMILES) to be scored by ScoreProviderNode.
- checkOutcome(smi_file)¶
Validate the generated SMILES file.
- Parameters
smi_file (str) – name of SMILES file to be validated.
- runNode(csv_list, active_learning_job, smi_file_name=None, **kwargs)¶
Select ligands to be scored.
- Parameters
csv_list (list(str)) – list of csv files that contain candidate ligands.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
smi_file_name (str) – SMILES file name that contains selected ligands.
- uncertaintySelect(smi_file_name, scored_csv_file_list, sample_size, **kwargs)¶
Select random ligands from the initial input csv, or the ligands with the largest uncertainty from the sorted ligand_ml .csv output.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
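The two branches described for uncertaintySelect can be sketched on in-memory records as follows. The function name `select_by_uncertainty`, the `have_model` flag, and the `uncertainty` key are hypothetical; the real method works on .csv files and writes a SMILES file.

```python
import heapq
import random


def select_by_uncertainty(records, sample_size, have_model, seed=0):
    """Hypothetical sketch of random vs. largest-uncertainty selection.

    records: list of dicts; each needs an 'uncertainty' value once a
    model has produced predictions.
    """
    if not have_model:
        # No predictions yet: fall back to a random draw from the inputs.
        rng = random.Random(seed)
        return rng.sample(records, min(sample_size, len(records)))
    # Predictions available: take the ligands the model is least sure about.
    return heapq.nlargest(sample_size, records, key=lambda r: r["uncertainty"])
```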
- greedySelect(smi_file_name, scored_csv_file_list, sample_size, ascending=True, **kwargs)¶
Select top ligands from sorted ligand_ml .csv output. Ligands in self.csv_list should already be sorted from best to worst.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of .csv files containing scored ligands
sample_size (int) – number of ligands to be sampled.
ascending (bool) – ligands with lower scores are better
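Because the inputs are expected to be sorted from best to worst, the top ligands can be taken with a streaming merge instead of a full sort. The sketch below shows that idea on in-memory lists of (title, score) pairs; the function name `greedy_select` and the data layout are assumptions, not the method's implementation.

```python
import heapq
from itertools import islice


def greedy_select(sorted_lists, sample_size, ascending=True):
    """Hypothetical sketch: merge pre-sorted (title, score) lists, keep the top.

    Each input list must already be ordered best to worst: lowest score
    first when ascending is True, highest score first otherwise.
    """
    merged = heapq.merge(*sorted_lists,
                         key=lambda item: item[1],
                         reverse=not ascending)
    # Only the first sample_size items of the merged stream are consumed.
    return list(islice(merged, sample_size))
```

`heapq.merge` is lazy, so only as many entries as are needed get pulled from each list.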
- randomSelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶
Select sample_size random ligands from input csv file(s).
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
sort (bool) – whether the csv files are sorted outputs or initial inputs.
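For large libraries, a uniform random sample can be drawn in a single pass without loading every row. The sketch below uses reservoir sampling (Algorithm R), a standard technique substituted here for illustration; the source does not say how randomSelect draws its sample, and the name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(rows, sample_size, seed=None):
    """Hypothetical sketch: uniform sample of sample_size items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability sample_size / (index + 1).
            j = rng.randint(0, index)  # both endpoints are inclusive
            if j < sample_size:
                reservoir[j] = row
    return reservoir
```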
- diversitySelect(smi_file_name, scored_csv_file_list, sample_size, sort=True, **kwargs)¶
Use combinatorial_diversity to select diverse ligands from input csv or sorted ligand_ml .csv output.
The number of CPUs and ndim (the dimensionality of the chemical space) are scaled in proportion to the number of random ligands selected. When the number of random ligands equals max_diversity_sample_size, ncpu and ndim reach their maximums of 300 and 13, respectively. If 300 CPUs are not available, the user-defined number of CPUs is used.
- Parameters
smi_file_name (str) – SMILES file name that contains selected ligands.
scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.
sample_size (int) – number of ligands to be sampled.
sort (bool) – whether the csv files are sorted outputs or initial inputs.
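The proportional scaling described for diversitySelect might look like the sketch below. The maximums of 300 and 13 come from the text; linear scaling, rounding up, the floor of 1, and capping at the user's CPU count are assumptions, as is the function name `scale_diversity_settings`.

```python
import math

MAX_NCPU = 300  # maximum number of CPUs stated in the description
MAX_NDIM = 13   # maximum chemical-space dimensionality stated in the description


def scale_diversity_settings(num_random, max_diversity_sample_size, user_ncpu):
    """Hypothetical sketch of proportional ncpu/ndim scaling."""
    # Fraction of the maximum sample size, never more than 1.
    fraction = min(1.0, num_random / max_diversity_sample_size)
    # Assumption: scale linearly and round up, with at least 1 of each.
    ncpu = max(1, math.ceil(MAX_NCPU * fraction))
    ndim = max(1, math.ceil(MAX_NDIM * fraction))
    # Assumption: never request more CPUs than the user made available.
    ncpu = min(ncpu, user_ncpu)
    return ncpu, ndim
```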
- class schrodinger.active_learning.al_node.ScoreProviderNode(iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
- __init__(iter_num, job_name, job_dir)¶
Initialize node for obtaining the score of each ligand (SMILES).
- checkOutcome(score_csv_file)¶
Validate the .csv score file.
- Parameters
score_csv_file (str) – name of generated .csv score file.
- writeScoreCsv(title_to_score, output_csv)¶
Write score to .csv file that ligand_ml needs for training
- Parameters
title_to_score (defaultdict(lambda : BAD_SCORE)) – dict that maps ligand title to score
output_csv (str) – ligand_ml training .csv file.
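The `defaultdict(lambda : BAD_SCORE)` type means that a ligand with no recorded score still yields a value when looked up. The sketch below shows that behavior when writing a two-column .csv file. The value of `BAD_SCORE`, the column names, and the extra `titles` argument are hypothetical and chosen only for this example.

```python
import csv
from collections import defaultdict

BAD_SCORE = 10000.0  # hypothetical placeholder; the module's real value is not shown


def write_score_csv(title_to_score, titles, output_csv):
    """Hypothetical sketch: write one (title, score) row per requested title."""
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "score"])
        for title in titles:
            # Missing titles fall back to BAD_SCORE via the defaultdict.
            writer.writerow([title, title_to_score[title]])


scores = defaultdict(lambda: BAD_SCORE)
scores["lig1"] = -7.5
```

Calling `write_score_csv(scores, ["lig1", "lig2"], "train.csv")` would record `lig2` with the placeholder score, because it was never assigned one.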
- class schrodinger.active_learning.al_node.KnownScoreProviderNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ScoreProviderNode
Class for obtaining the scores from an external .csv file. This class is only used for testing the performance of the active learning workflow.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for obtaining the score of each ligand (SMILES).
- runNode(smi_file_name, active_learning_job, score_csv_file=None)¶
Read scores from active_learning_job.known_title_to_score.
- Parameters
smi_file_name (str) – SMILES file that contains the ligands to be scored.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
score_csv_file (str) – ligand_ml training .csv file.
- class schrodinger.active_learning.al_node.LigandMLTrainNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
Class for ligand_ml model generation.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- checkOutcome(model_file)¶
Check whether the ligand_ml model exists.
- Parameters
model_file (str) – name of ligand_ml .qzip model file
- createTrainingCsvFile(discard_cutoff, ascending=True)¶
Generate the training .csv file for ligand_ml model generation.
- Parameters
discard_cutoff (float) – score cutoff for excluding ligands from the ML training set.
ascending (bool) – a lower value means a better ligand if ascending is True.
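How `discard_cutoff` and `ascending` could interact is sketched below on in-memory (title, score) pairs. The assumption is that ligands scoring worse than the cutoff are dropped, where "worse" means higher when `ascending` is True and lower otherwise; the source does not confirm this direction, and the name `filter_training_rows` is hypothetical.

```python
def filter_training_rows(scored, discard_cutoff, ascending=True):
    """Hypothetical sketch: drop ligands worse than discard_cutoff.

    scored: list of (title, score) pairs.
    """
    if ascending:
        # Lower is better, so keep scores at or below the cutoff.
        kept = [(t, s) for t, s in scored if s <= discard_cutoff]
    else:
        # Higher is better, so keep scores at or above the cutoff.
        kept = [(t, s) for t, s in scored if s >= discard_cutoff]
    # Return the kept rows ordered best to worst.
    return sorted(kept, key=lambda pair: pair[1], reverse=not ascending)
```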
- runDiskDatasetJob(model_archive: str, csv_file_abspath: str) → None¶
Generate the disk dataset (requires multiprocessing). Runs on the driver node to ensure that enough cores are available.
- Parameters
tar_model_file – equivalent to prepare LigandML smasher base_dir
csv_file_abspath (str) – input file to generate disk datasets
- runNode(active_learning_job)¶
Perform ligand_ml training with all the scored ligands. The model file includes the job_args.json file.
- Parameters
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- class schrodinger.active_learning.al_node.LigandMLEvalNode(args, iter_num, job_name, job_dir)¶
Bases:
schrodinger.active_learning.al_node.ActiveLearningNode
Class for performing ligand_ml prediction with generated model.
- __init__(args, iter_num, job_name, job_dir)¶
Initialize node for active learning workflow.
- Parameters
iter_num (int) – current active learning iteration number.
job_name (str) – active learning job name.
job_dir (str) – directory where the jobs in the node will run.
- getBestResults(file_list, outfile, ascending=True)¶
Get the best ligands (with the lowest score) predicted by ligand_ml.
- Parameters
file_list (list(str)) – list of ligand_ml .csv output files. Each file is sorted by ligand_ml prediction score.
outfile (str) – .csv file that contains the best ligands.
ascending (bool) – lower value means better ligand if ascending is True
- checkOutcome(pred_csv_list, uncertain_csv_list)¶
Check the existence of ligand_ml prediction files.
- Parameters
pred_csv_list (list(str)) – list of ligand_ml prediction csv file(s).
uncertain_csv_list (list(str)) – list of ligand_ml prediction with uncertainty csv file(s).
- runNode(model_file, active_learning_job)¶
Use the trained model to evaluate all the ligands.
- Parameters
model_file (str) – ligand_ml .qzip model file.
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- evalMQ(tar_model_file, output_csv, active_learning_job)¶
Evaluate ligands with ligand_ml model using ZMQ. Distributes evaluation jobs over a set of workers. Reverts to jobdj if ZMQ fails.
- Parameters
tar_model_file (str) – trained tar.gz model file
output_csv (str) – output prediction file
active_learning_job (ActiveLearningJob instance.) – current active learning job.
- Returns
True if ZMQ job is successful
- Return type
bool
- evalDJ(tar_model_file, output_csv, active_learning_job)¶
Evaluate ligands with ligand_ml model using jobdj.
- Parameters
tar_model_file (str) – tar.gz trained model file
output_csv (str) – output prediction file
active_learning_job (ActiveLearningJob instance.) – current active learning job.
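The relationship between evalMQ and evalDJ is a try-then-fall-back pattern: attempt the ZMQ route, and use jobdj when it does not succeed. The sketch below shows that pattern with plain callables standing in for the two methods. The function name `evaluate_with_fallback` and the decision to treat any exception as a failure are assumptions, not the module's implementation.

```python
def evaluate_with_fallback(eval_mq, eval_dj, *args):
    """Hypothetical sketch: prefer the ZMQ path, fall back to jobdj.

    eval_mq is expected to return True on success; eval_dj is the
    fallback. Returns the name of the backend that completed the work.
    """
    try:
        if eval_mq(*args):
            return "zmq"
    except Exception:
        # Assumption: any error on the ZMQ path counts as a failure.
        pass
    eval_dj(*args)
    return "jobdj"
```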
- class schrodinger.active_learning.al_node.ActiveLearningNodeSupplier(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ligand_ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ligand_ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶
Bases:
object
- __init__(calculate_score_node, pilot_score_node, rescore_node, score_provider_node=<class 'schrodinger.active_learning.al_node.ScoreProviderNode'>, prepare_smi_node=<class 'schrodinger.active_learning.al_node.PrepareSmilesNode'>, known_score_provider_node=<class 'schrodinger.active_learning.al_node.KnownScoreProviderNode'>, ligand_ml_train_node=<class 'schrodinger.active_learning.al_node.LigandMLTrainNode'>, ligand_ml_eval_node=<class 'schrodinger.active_learning.al_node.LigandMLEvalNode'>)¶