schrodinger.application.building_block_exploration.bb_explorer_run_thompson_sampling module
- class schrodinger.application.building_block_exploration.bb_explorer_run_thompson_sampling.ThompsonSampler(args)
Bases: object
Class to orchestrate the Thompson sampling for the building block exploration workflow. It starts with a seed file and first performs a warmup cycle in which compounds are randomly picked from the seed file and docked. These warmup results are used to set the prior means and variances of all the building blocks for all reagent classes. It then performs several cycles of the following four steps: 1) Choose reagents using Thompson sampling. 2) Use combinatorial synthesis to synthesize full compounds from the reagents. 3) Dock those compounds using Glide. 4) Update the score distributions of the reagents using a Bayesian update.
The cycles are run until we hit the specified number of compounds to dock. It returns a csv file with the docking scores of all the compounds docked and information about the route and reagents used to synthesize them. The aim is to find compounds with good docking scores while maintaining a diverse set of chemotypes by exploring multiple routes.
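The workflow described above can be sketched on toy data as follows. This is a minimal illustration and not the actual implementation: the reagent names, the toy_dock stand-in for synthesis plus Glide docking, the Gaussian beliefs per reagent, and the use of the warmup variance as the observation variance are all assumptions made for the example. Lower scores are treated as better.

```python
import random
import statistics

# Hypothetical reagent classes and hidden per-reagent score contributions.
REAGENT_CLASSES = {
    "amines": ["a1", "a2", "a3"],
    "acids": ["b1", "b2", "b3"],
}
HIDDEN_QUALITY = {"a1": -1.0, "a2": -3.0, "a3": -2.0,
                  "b1": -2.5, "b2": -0.5, "b3": -1.5}
NOISE_STD = 0.5


def toy_dock(combo, rng):
    """Stand-in for synthesis plus docking; lower scores are better."""
    return sum(HIDDEN_QUALITY[r] for r in combo) + rng.gauss(0.0, NOISE_STD)


def run_toy_sampler(num_warmup=10, num_cycles=30, per_cycle=4, seed=0):
    rng = random.Random(seed)
    results = {}  # reagent combination -> score

    # Warmup: dock randomly chosen combinations to build the prior.
    for _ in range(num_warmup):
        combo = tuple(rng.choice(rs) for rs in REAGENT_CLASSES.values())
        results[combo] = toy_dock(combo, rng)
    scores = list(results.values())
    prior_mean = statistics.mean(scores)
    prior_var = statistics.pvariance(scores) or 1.0  # guard against zero

    # Every reagent starts from the same (mean, variance) belief.
    belief = {r: (prior_mean, prior_var)
              for rs in REAGENT_CLASSES.values() for r in rs}

    for _ in range(num_cycles):
        cycle = []
        for _ in range(per_cycle):
            # 1) Thompson step: draw one sample per reagent, keep the lowest.
            combo = tuple(
                min(rs, key=lambda r: rng.gauss(belief[r][0],
                                                belief[r][1] ** 0.5))
                for rs in REAGENT_CLASSES.values()
            )
            # 2) and 3) "Synthesize" and "dock" combinations not seen before.
            if combo not in results:
                results[combo] = toy_dock(combo, rng)
                cycle.append(combo)
        # 4) Conjugate normal update for every reagent used in this cycle.
        for combo in cycle:
            for r in combo:
                mean, var = belief[r]
                post_var = 1.0 / (1.0 / var + 1.0 / prior_var)
                post_mean = post_var * (mean / var
                                        + results[combo] / prior_var)
                belief[r] = (post_mean, post_var)
    return results, belief
```

Calling run_toy_sampler() returns the docked combinations with their scores and the final (mean, variance) belief for each reagent. Reagents that appear in more docked compounds end up with a smaller variance.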
- __init__(args)
Initializes the ThompsonSampler object. The following attributes are initialized from the args object:
- glide_grid: Glide grid file to use for docking.
- seed_file: Seed file from a generate_seed_file task. Only the building blocks in the seed file are used for synthesis in this job.
- max_glide_cpu: Maximum number of CPUs to use simultaneously for Glide docking; the user must have this many Glide licenses.
- glide_mq: Whether to use ZeroMQ for running Glide docking jobs.
- extra_docking_config: Extra docking configuration keywords to use for Glide docking.
- ligprep_args: Extra LigPrep arguments to use for preparing the ligands for docking.
- product_property_filter: A JSON file containing property filters for the product compounds.
- product_smarts_filter: A canvasSearch filter (.cflt) file containing SMARTS-based filters for the product compounds.
- bloom_filter_path: Path to the bloom filter for in-library filtering.
- reaction_dict: Dictionary of chemical reactions to use, either user provided or the default one from the mmshare data files.
- route_data_dict: Dictionary of route data objects for each route specified in the routes file.
- max_library_attempts: Maximum number of products to synthesize before we find one that is in library according to the bloom filter.
- done_compounds: Set of InChIKeys of compounds that have already been docked in this job.
- jobname: Name of the job.
- restart: Whether this is a new run or a restart from a previous run.
- restart_state_file: Name of the JSON file containing the state of the previous run to restart from.
- state_dict: Dictionary containing the state of the job.
- state_file: Name of the JSON file containing the state of the job.
- output_scores_file: Name of the output CSV file containing the docking scores of all the compounds docked in this job.
- output_file_fieldnames: List of field names for the output CSV file.
- top_ligands_pool: Name of the glide lib file maintaining a pool of the top 50k ligands by docking score. The pool is updated after every cycle.
- subjob_log_archive: Name of the zip file containing the logs of all the subjobs run in this job.
- output_top_hits_structure_file: Name of the output glide lib file containing the top 5k ligands by docking score.
- output_top_hits_by_route_file: Name of the output glide lib file containing the top 1k ligands by docking score for each of the top 5 routes.
- run()
Registers the job output files, runs all remaining cycles of the Thompson sampling workflow, and moves them to the list of completed cycles. The state file is updated after each cycle finishes.
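A checkpointing loop of this kind could look roughly like the sketch below. The state keys remaining_cycles and completed_cycles, the run_remaining_cycles helper, and the dummy cycle runner are hypothetical names chosen for illustration; only the num_to_dock and temperature keys come from the cycle_details description in this module.

```python
import json
import os
import tempfile


def run_remaining_cycles(state, state_file, run_cycle):
    """Run cycles in order and write the state file after each one."""
    while state["remaining_cycles"]:
        name, details = state["remaining_cycles"][0]
        run_cycle(name, details)
        # Move the cycle only after it has finished without raising.
        state["completed_cycles"].append(state["remaining_cycles"].pop(0))
        with open(state_file, "w") as fh:
            json.dump(state, fh, indent=2)


# Usage with a dummy cycle runner that only records the cycle names.
state = {
    "remaining_cycles": [
        ["cycle_1", {"num_to_dock": 100, "temperature": 1.0}],
        ["cycle_2", {"num_to_dock": 500, "temperature": 0.5}],
    ],
    "completed_cycles": [],
}
calls = []
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "state.json")
    run_remaining_cycles(state, path, lambda name, details: calls.append(name))
    with open(path) as fh:
        saved_state = json.load(fh)
```

Because the state is written after every cycle, a job interrupted partway through can resume from the first entry still listed under remaining_cycles.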
- setUpOutputFiles()
Sets up the job by creating and registering required output files.
- runCycle(cycle_name: str, cycle_details: dict)
Runs a single cycle of the Thompson sampling workflow.
- yieldCompoundsCycle(cycle_name: str, cycle_details: dict)
Yield compounds to dock in the current cycle.
- Parameters:
cycle_name – Name of the cycle to run.
cycle_details – A dictionary with two keys: 1) ‘num_to_dock’: number of compounds to dock in this cycle; 2) ‘temperature’: temperature to use to pick compounds.
- yieldCompoundsRoute(route_name: str, num_compounds: int, temperature: float)
Synthesizes compounds for the given route. For every reagent class of the given route, it selects reagents using roulette wheel selection and then uses these reagents to synthesize products by randomly iterating over the reagents. It then yields up to num_compounds compounds that are selected for docking. It optionally performs in-library filtering using a bloom filter and also filters out compounds that have already been docked in previous cycles.
- Parameters:
route_name – Name of the route to perform synthesis.
num_compounds – Number of compounds to yield for this route.
temperature – Temperature to use for picking reagents.
- Yields:
Products that are selected for docking.
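Roulette wheel selection with a temperature can be sketched as below. How the temperature actually enters the selection is not stated here, so the Boltzmann weighting, the roulette_wheel_select name, and the convention that lower scores are better are assumptions for illustration. The temperature must be positive.

```python
import math
import random


def roulette_wheel_select(scores, num_to_pick, temperature, rng=None):
    """Pick reagent names with probability proportional to a Boltzmann weight.

    scores: dict mapping reagent name -> sampled score (lower is better).
    Selection is done with replacement.
    """
    rng = rng or random.Random()
    names = list(scores)
    best = min(scores.values())
    # Shift by the best score so the largest weight is exp(0) = 1.
    weights = [math.exp(-(scores[n] - best) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=num_to_pick)


sampled = {"r1": -8.0, "r2": -5.0, "r3": -6.0}
# A very low temperature makes the selection effectively greedy.
greedy_picks = roulette_wheel_select(sampled, 5, 0.01, random.Random(0))
# A high temperature makes the selection close to uniform.
broad_picks = roulette_wheel_select(sampled, 5, 100.0, random.Random(0))
```

In this sketch a high temperature favors exploration across reagents, while a low temperature concentrates picks on the reagents with the best sampled scores.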
- yieldCompoundsRouteRescore(route_name: str, num_compounds: int)
Synthesizes and yields compounds for the given route by systematically iterating over the top-scoring reagents for each reagent class. This is an exploit step in which all products, up to num_compounds, are enumerated from the top 20% scoring reagents of each class. It also filters out compounds that have already been docked in previous cycles, and if a bloom filter is provided, it only yields compounds that are in library according to the bloom filter.
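The exploit step can be illustrated with a small enumeration sketch. The helper names, the data layout (a dict of per-class mean scores), and the rounding rule for the 20% cutoff are assumptions; real products would be synthesized structures and not reagent tuples.

```python
import itertools
import math


def top_fraction(mean_scores, fraction=0.2):
    """Return the best-scoring fraction of reagents (lower mean is better)."""
    ranked = sorted(mean_scores, key=mean_scores.get)
    return ranked[:max(1, math.ceil(fraction * len(ranked)))]


def enumerate_top_combinations(class_scores, num_compounds, done=frozenset()):
    """Yield up to num_compounds reagent tuples built from top reagents."""
    tops = [top_fraction(scores) for scores in class_scores.values()]
    # Systematic enumeration, skipping combinations that were docked already.
    fresh = (c for c in itertools.product(*tops) if c not in done)
    yield from itertools.islice(fresh, num_compounds)


class_scores = {
    "amines": {f"a{i}": -float(i) for i in range(1, 11)},
    "acids": {"b1": -3.0, "b2": -7.0, "b3": -1.0, "b4": -2.0, "b5": -4.0},
}
all_top = list(enumerate_top_combinations(class_scores, 10))
not_done = list(enumerate_top_combinations(
    class_scores, 10, done={("a10", "b2")}))
```

Here the top 20% of ten amines is two reagents and the top 20% of five acids is one reagent, so at most two combinations can be enumerated.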
- synthesizeCompoundsFromReagents(route_name: str, route_data, selected_reagents_file: list, num_to_synthesize: int, rescore: bool = False, start=0, stop=None) → str
- filteredProductGenerator(product_file: str, route_name: str, num_to_add: int)
Yields products from the product_file that are not already in self.done_compounds. If a bloom filter is provided, only products that are in library according to the bloom filter are yielded.
- Parameters:
product_file – File containing the products to filter.
route_name – Name of the route used to synthesize the products.
num_to_add – Number of products to yield.
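The filtering logic can be sketched as a plain generator. This is an illustration only: products are represented as hypothetical (key, product) pairs, the library argument is any container supporting the in operator as a stand-in for a bloom filter, and recording yielded keys in done_compounds is a choice made for the sketch so that duplicates within one stream are skipped.

```python
def filtered_products(products, done_compounds, num_to_add, library=None):
    """Yield up to num_to_add (key, product) pairs that pass both filters.

    products: iterable of (key, product) pairs.
    done_compounds: set of keys already docked; updated as pairs are yielded.
    library: optional container supporting `in` (bloom filter stand-in).
    """
    added = 0
    for key, product in products:
        if added >= num_to_add:
            break
        if key in done_compounds:
            continue  # already docked or already yielded
        if library is not None and key not in library:
            continue  # not in library according to the filter
        done_compounds.add(key)
        added += 1
        yield key, product


stream = [("K1", "c1"), ("K2", "c2"), ("K1", "c1"), ("K3", "c3"), ("K4", "c4")]
done = {"K2"}
kept = list(filtered_products(stream, done, 2, library={"K1", "K3", "K4"}))
```

A real bloom filter can report false positives but not false negatives, so a compound that passes the membership test is only probably in the library.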
- setPriorMeansAndStd(warmup_scores: list)
Sets the means and variances of all the reagents to the warmup mean and variance.
- updateMeansAndStd(done_compounds_cycle: dict)
Update the means and variances of the routes and reagents based on the docking scores of the compounds in done_compounds_cycle.
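The exact form of the update is not given here. One common choice for Gaussian beliefs, shown purely as an assumption, is the conjugate normal update with a known observation variance. With prior mean \mu_0 and prior variance \sigma_0^2 for a reagent, an assumed observation variance \sigma^2, and n new docking scores with sample mean \bar{x} for compounds containing that reagent:

```latex
\[
\sigma_{\text{post}}^{2}
  = \left( \frac{1}{\sigma_{0}^{2}} + \frac{n}{\sigma^{2}} \right)^{-1},
\qquad
\mu_{\text{post}}
  = \sigma_{\text{post}}^{2}
    \left( \frac{\mu_{0}}{\sigma_{0}^{2}} + \frac{n\,\bar{x}}{\sigma^{2}} \right)
\]
```

Under this form, each additional score shrinks the posterior variance and pulls the posterior mean from the prior mean toward the observed average.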
- loadFromRestart()
Loads the state of the job from a previous run using the restart state file. It reads the docking scores file from the previous job to update the means and standard deviations of the reagents, and copies the glide lib file containing the top ligands.
- parsePreviousDockingScores(previous_docking_scores_file)
Reads the docking scores file from a previous job and updates the means and standard deviations of the reagents based on the docking scores of the previously docked compounds.