schrodinger.application.building_block_exploration.bb_explorer_generate_seed_file module
- class schrodinger.application.building_block_exploration.bb_explorer_generate_seed_file.SeedFileGenerator(args)
Bases: object
Class to generate a seed file for building block exploration. The seed file contains compounds synthesized using the provided routes and building blocks, and aims to include a roughly equal number of compounds from each route.
If a bloom filter file is provided, the generated seed file only contains compounds that are in the library represented by the bloom filter. In this case, the user may provide a max_library_attempts parameter to control how many attempts are made to generate one in-library compound. The seed file may then contain a different number of compounds from each route, depending on what fraction of the total possible products from each route is in the library. Note that, in general, two-step routes have far more possible products than one-step routes, so their fraction of in-library compounds is likely to be smaller.
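The uneven representation described above can be illustrated with a rough back-of-the-envelope sketch. It assumes, as documented for generate_seed_compounds_per_route, that num_products * max_library_attempts candidates are synthesized per route, and additionally assumes that each candidate is independently in the library with a fixed probability. The function name expected_seed_count and the example fractions are hypothetical and not part of the module.

```python
def expected_seed_count(num_products: int,
                        max_library_attempts: int,
                        in_library_fraction: float) -> float:
    """Rough estimate of how many seed compounds one route contributes.

    Assumption: every candidate is in the library independently with
    probability `in_library_fraction`. This is an illustration only,
    not the module's actual logic.
    """
    candidates = num_products * max_library_attempts
    expected_hits = candidates * in_library_fraction
    # A route never contributes more than the number requested.
    return min(float(num_products), expected_hits)


# Hypothetical fractions: a one-step route with many in-library products
# and a two-step route with very few.
one_step = expected_seed_count(10, 5, 0.5)
two_step = expected_seed_count(10, 5, 0.05)
print(one_step, two_step)
```

Under these assumptions the one-step route fills its quota of 10 compounds, while the two-step route is expected to contribute only about 2.5, which is why raising max_library_attempts can help even out the seed file.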
- __init__(args)
Initialize the SeedFileGenerator with the user-provided arguments. The following attributes of the class are set during initialization:
- num_seed_compounds: Number of seed compounds to generate, user provided. These are split roughly equally among the different routes.
- reaction_dict: Dictionary of chemical reactions to use, either user provided or the default one from the mmshare data files.
- route_dict: Dictionary of synthetic routes to run, either user provided or the default one from the mmshare data files.
- logfiles: A list of subjob log files. These are zipped if the job completes successfully.
- bb_dir: Directory containing all the building blocks required by the routes being run. These are either provided by the user or processed from the raw building blocks file provided by the user. If the user does not provide any building blocks, the default mmshare reagent data directory is used.
- product_property_filter: A JSON file containing property filters for the product compounds.
- product_smarts_filter: A canvasSearch filter (.cflt) file containing SMARTS-based filters for the product compounds.
- bloom_filter_file: A bloom filter for in-library filtering.
- max_library_attempts: Maximum number of products to synthesize in order to find one that is in the library according to the bloom filter.
- seed_file: Output seed file that the job returns.
- already_added: Set of InChIKeys for the compounds that have already been added to the seed file. This ensures that the same compound is not added more than once.
- log_archive: A zip file containing all the subjob logs.
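The documentation says only that num_seed_compounds is split "roughly equally" among routes; the exact policy is not described. One plausible way to do such a split is sketched below. The helper split_evenly and the route names are hypothetical.

```python
from typing import Dict, List


def split_evenly(total: int, route_names: List[str]) -> Dict[str, int]:
    """Divide `total` among routes so that counts differ by at most one.

    Assumption: earlier routes absorb the remainder. The real module may
    distribute the remainder differently.
    """
    if not route_names:
        raise ValueError("at least one route is required")
    base, remainder = divmod(total, len(route_names))
    return {
        name: base + (1 if index < remainder else 0)
        for index, name in enumerate(route_names)
    }


per_route = split_evenly(10, ["route_a", "route_b", "route_c"])
print(per_route)  # {'route_a': 4, 'route_b': 3, 'route_c': 3}
```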
- run()
Main method to set up and run the seed file generation.
- setup()
Set up the seed file generator by:
1. Updating the routes dictionary so that it only contains routes for which all the required building blocks are present in bb_dir.
2. Creating the seed file and writing the header line.
3. Adding the seed file and log archive to the job output files.
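The first setup step, pruning routes whose building blocks are missing, might look roughly like the sketch below. It is not the module's implementation: it assumes a hypothetical layout in which each route lists the reagent classes it needs and bb_dir holds one file per reagent class, named after that class. The function name, reagent class names, and file extension are all made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def filter_routes_by_available_bbs(route_requirements: Dict[str, List[str]],
                                   bb_dir: str) -> Dict[str, List[str]]:
    """Keep only routes whose required reagent classes all exist in bb_dir.

    Assumption: a reagent class is "present" when bb_dir contains a file
    whose stem equals the class name (hypothetical naming scheme).
    """
    available = {path.stem for path in Path(bb_dir).iterdir() if path.is_file()}
    return {
        name: needed
        for name, needed in route_requirements.items()
        if set(needed) <= available
    }


# Demo with a temporary directory standing in for bb_dir.
with tempfile.TemporaryDirectory() as tmp:
    for reagent_class in ("amines", "acids"):
        (Path(tmp) / f"{reagent_class}.csv").touch()
    routes = {
        "amide_coupling": ["amines", "acids"],
        "suzuki": ["boronic_acids", "aryl_halides"],
    }
    runnable = filter_routes_by_available_bbs(routes, tmp)

print(sorted(runnable))  # ['amide_coupling']
```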
- generate_seed_file()
Generate a seed file based on the provided arguments.
- generate_seed_compounds_per_route(route_name: str, route_object: RouteNode, num_products: int)
Generates seed compounds for a specific route. If a bloom filter file is provided, only compounds that are in the library represented by the bloom filter are returned. In that case, num_products * max_library_attempts products are synthesized initially to increase the chances of getting enough in-library compounds. If no bloom filter is provided, num_products * EXTRA_FACTOR_FOR_REPEATED_PRODUCTS products are synthesized to account for repeated products. These products are then filtered through the bloom filter, if one is provided, and up to num_products compounds are added to the seed file.
- Parameters:
route_name – The name of the route.
route_object – The RouteNode object to begin synthesis with.
num_products – The number of products to synthesize.
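The oversample-then-filter flow described for this method can be sketched as follows. This is a simplified stand-in, not the actual implementation: synthesis is replaced by a hypothetical callable, the bloom filter by a plain Python set (which, unlike a real bloom filter, has no false positives), and the value of EXTRA_FACTOR_FOR_REPEATED_PRODUCTS is a placeholder because the real constant is not documented here.

```python
from typing import Callable, List, Optional

# Placeholder value; the module's real constant is not documented here.
EXTRA_FACTOR_FOR_REPEATED_PRODUCTS = 2


def pick_seed_compounds(synthesize: Callable[[int], List[str]],
                        num_products: int,
                        in_library: Optional[Callable[[str], bool]] = None,
                        max_library_attempts: int = 1) -> List[str]:
    """Oversample products, optionally keep in-library ones, cap the result."""
    if in_library is not None:
        factor = max_library_attempts
    else:
        factor = EXTRA_FACTOR_FOR_REPEATED_PRODUCTS
    candidates = synthesize(num_products * factor)
    if in_library is not None:
        candidates = [cpd for cpd in candidates if in_library(cpd)]
    # At most num_products compounds go forward to the seed file.
    return candidates[:num_products]


def fake_synthesize(count: int) -> List[str]:
    """Hypothetical synthesis step that returns placeholder identifiers."""
    return [f"cpd{i}" for i in range(count)]


library = {"cpd1", "cpd4", "cpd7", "cpd30"}  # stand-in for the bloom filter
filtered = pick_seed_compounds(fake_synthesize, 3, library.__contains__,
                               max_library_attempts=5)
unfiltered = pick_seed_compounds(fake_synthesize, 3)
print(filtered)    # ['cpd1', 'cpd4', 'cpd7']
print(unfiltered)  # ['cpd0', 'cpd1', 'cpd2']
```

If fewer than num_products candidates pass the filter, the sketch simply returns what it found, mirroring the "up to num_products" wording above.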
- add_to_seed_file(product_file: str, route_name: str, num_to_add: int)
Adds the compounds from the product file to the seed file.
- Parameters:
product_file – The file containing the products from the synthesis.
route_name – The name of the route for which the products were generated.
num_to_add – Number of compounds to add to the seed file for this route.
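The role of already_added when compounds are appended can be sketched as below. This is an assumption-laden illustration: products are given as (key, SMILES) pairs where the key stands in for an InChIKey (computing real InChIKeys would require a cheminformatics toolkit and is omitted), and the two-column row layout is hypothetical because the actual seed file format is not documented here.

```python
import csv
import io
from typing import Iterable, Set, Tuple


def add_unique(products: Iterable[Tuple[str, str]],
               already_added: Set[str],
               num_to_add: int,
               writer,
               route_name: str) -> int:
    """Write up to num_to_add not-yet-seen products; return how many were written."""
    added = 0
    for key, smiles in products:
        if added >= num_to_add:
            break
        if key in already_added:
            continue  # duplicate within this route or from an earlier route
        already_added.add(key)
        writer.writerow([smiles, route_name])  # hypothetical column layout
        added += 1
    return added


buffer = io.StringIO()  # stands in for the open seed file
seen = {"KEY-A"}        # pretend an earlier route already added this compound
products = [("KEY-A", "CCO"), ("KEY-B", "CCN"), ("KEY-B", "NCC"),
            ("KEY-C", "CCC"), ("KEY-D", "CCCC")]
count = add_unique(products, seen, 2, csv.writer(buffer), "route_1")
print(count, sorted(seen))
```

Sharing one set across all routes is what prevents a compound reachable by two different routes from appearing twice in the seed file.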