schrodinger.application.combinatorial_diversity.driver_utils module
Provides miscellaneous functionality for combinatorial_diversity_driver.py.
Copyright Schrodinger LLC, All Rights Reserved.
- schrodinger.application.combinatorial_diversity.driver_utils.add_property_biasing_options(parser)
Adds property biasing options to the provided parser.
- Parameters
parser (argparse.ArgumentParser) – Argument parser object.
- schrodinger.application.combinatorial_diversity.driver_utils.adjust_min_pop(min_pop, ndiverse, min_diverse_per_chunk, pool_size)
Adjusts the minimum population per chunk, if necessary, to ensure a minimum number of diverse structures per chunk.
- Parameters
min_pop (int) – Requested minimum population per chunk.
ndiverse (int) – Total number of diverse structures to select.
min_diverse_per_chunk (int) – Minimum allowed number of diverse structures per chunk.
pool_size (int) – Total number of structures in the pool.
- Returns
The appropriate minimum population.
- Return type
int
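The adjustment can be reasoned about with simple proportions. The following is a minimal sketch, assuming that a chunk of min_pop structures receives a share of ndiverse proportional to its size; the documentation does not confirm this rule, and `sketch_adjust_min_pop` is a hypothetical stand-in rather than the library function.

```python
def sketch_adjust_min_pop(min_pop, ndiverse, min_diverse_per_chunk, pool_size):
    """Hypothetical illustration only; not the driver_utils implementation.

    Assumes a chunk of min_pop structures is allotted about
    ndiverse * min_pop / pool_size diverse selections.
    """
    if ndiverse <= 0 or pool_size <= 0:
        raise ValueError("ndiverse and pool_size must be positive")
    # Smallest chunk population whose proportional share of ndiverse
    # reaches min_diverse_per_chunk (integer ceiling division).
    required = (min_diverse_per_chunk * pool_size + ndiverse - 1) // ndiverse
    # Only ever raise the requested value, and never exceed the pool size.
    return min(max(min_pop, required), pool_size)
```

Under these assumptions, a pool of 1,000,000 structures with 100 diverse selections and a requested min_pop of 10000 gives about one selection per chunk, so the sketch raises min_pop to 20000 to reach two per chunk.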
- schrodinger.application.combinatorial_diversity.driver_utils.combine_diverse_structures(subjob_names, outfile)
Combines diverse structures from subjobs to the indicated output file.
- Parameters
subjob_names (list(str)) – Subjob names.
outfile (str) – Output Maestro, SD, CSV or SMILES file. Diverse structures from subjobs must be in the same format.
- schrodinger.application.combinatorial_diversity.driver_utils.detect_property_types(infile, max_rows=1000, sticky_missing=False)
Given a .json, .fp, .csv or .smi input file, this function returns a dictionary of property names to property types for all properties, excluding SMILES and title, which are present in the file (.fp, .csv) or automatically calculated (.json, .smi). In the case of .fp and .csv, the first max_rows are examined to deduce property types.
- Parameters
infile (str) – Input file (.json, .fp, .csv or .smi).
max_rows (int) – The maximum number of rows to examine.
sticky_missing (bool) – If True, a property with any missing values will be assigned a type of PropertyType.MISSING. If False, the property type will be deduced from non-missing values.
- Returns
Dictionary of property name to PropertyType.
- Return type
dict{str: diversity_fingerprinter.PropertyType}
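The effect of sticky_missing can be illustrated with a small stand-alone sketch. The enum below is hypothetical: only the MISSING member is named in this documentation, and the NUMERIC and TEXT members, along with both helper functions, are assumptions made for illustration.

```python
from enum import Enum


class SketchPropertyType(Enum):
    """Hypothetical stand-in; only MISSING is named in the documentation."""
    MISSING = 0
    NUMERIC = 1
    TEXT = 2


def sketch_value_type(value):
    """Classify a single raw string value."""
    if value.strip() == "":
        return SketchPropertyType.MISSING
    try:
        float(value)
    except ValueError:
        return SketchPropertyType.TEXT
    return SketchPropertyType.NUMERIC


def sketch_column_type(values, sticky_missing=False):
    """Deduce one type for a column of raw string values."""
    types = [sketch_value_type(v) for v in values]
    # With sticky_missing, a single missing value decides the column type.
    if sticky_missing and SketchPropertyType.MISSING in types:
        return SketchPropertyType.MISSING
    present = [t for t in types if t is not SketchPropertyType.MISSING]
    if not present:
        return SketchPropertyType.MISSING
    # Any non-numeric value makes the whole column text.
    if SketchPropertyType.TEXT in present:
        return SketchPropertyType.TEXT
    return SketchPropertyType.NUMERIC
```

In this sketch the column `["1.5", "", "2"]` is numeric by default, but becomes MISSING when sticky_missing is True.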
- schrodinger.application.combinatorial_diversity.driver_utils.extract_subjob_chunks(subjob_name, infile)
Extracts chunk files from the archive <subjob_name>.zip and returns lists of the fingerprint files, numbers of diverse structures, and fingerprint domains that should be supplied to the DiversitySelector object that will operate on each chunk. One of two behaviors will occur:
If the archive contains .csv files, then each fingerprint file will be infile, and the row numbers in each .csv file will be returned as the fingerprint domains.
If the archive contains .fp files, then those fingerprint file names will be returned, and each fingerprint domain will be None.
- Parameters
subjob_name (str) – The subjob name.
infile (str) – If the archive contains .csv files, this should be the name of the fingerprint file for which in-place splitting was done. Will be either the user-supplied input fingerprint file (-nocopy, -nosplit) or the fingerprint file generated from the input structures (-nosplit). Ignored if the archive contains .fp files.
- Returns
Lists of fingerprint file names, numbers of diverse structures and fingerprint domains.
- Return type
list(str), list(int), list(list(int))
- schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints(infile, outfile, fptype, want_props=False, hba_file=None, hbd_file=None, logger=None)
Generates Canvas fingerprints and, optionally, a default set of physicochemical properties for the structures in a SMILES or CSV file.
- Parameters
infile (str) – Input SMILES or CSV file.
outfile (str) – Output fingerprint file.
fptype (str) – Fingerprint type (see LEGAL_FP_TYPES).
want_props (bool or NoneType) – Whether to generate properties. Should be True only for SMILES input.
hba_file (str or NoneType) – File with customized hydrogen bond acceptor rules. Ignored if want_props is False.
hbd_file (str or NoneType) – File with customized hydrogen bond donor rules. Ignored if want_props is False.
logger (logging.Logger or NoneType) – Logger for warning and info messages.
- Raises
ValueError – If properties are requested for CSV input.
- schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_csv(csv_file, div_fp, fpout, logger)
Generates fingerprints for the SMILES in a .csv file, and writes the fingerprints, titles, properties from columns 2 and beyond, and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.
- Parameters
csv_file (str) – CSV file name.
div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate only fingerprints.
fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.
logger (logging.Logger or NoneType) – Logger for warning and info messages.
- Returns
Tuple of the number of input rows and the number of fingerprints successfully generated and written.
- Return type
int, int
- schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_smi(smi_file, want_props, div_fp, fpout, logger)
Generates fingerprints and properties for the SMILES in a .smi file, and writes the fingerprints, titles, properties and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.
- Parameters
smi_file (str) – SMILES file name.
want_props (bool) – Whether properties are being generated.
div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate fingerprints and, if want_props is True, properties.
fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.
logger (logging.Logger or NoneType) – Logger for warning and info messages.
- Returns
Tuple of the number of input rows and the number of fingerprints successfully generated and written.
- Return type
int, int
- schrodinger.application.combinatorial_diversity.driver_utils.get_available_properties(infile, descriptions=False)
Returns a list of the available properties in the provided input file. If .json or .smi, the properties that are calculated automatically are returned. If .csv, properties in columns 3 and beyond are returned. If .fp, extra data columns other than SMILES are returned.
- Parameters
infile (str) – Input file with source of structures.
descriptions (bool) – Whether to include descriptions for automatically calculated properties.
- Returns
Property names.
- Return type
list(str)
- Raises
KeyError – If any required columns are missing.
- schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_fp_generation_commands(args, nsub)
Returns lists of subjob commands for running distributed fingerprint and property generation.
- Parameters
args (argparse.Namespace) – Command line arguments.
nsub (int) – Number of subjobs.
- Returns
list of subjob commands.
- Return type
list(list(str))
- schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_selection_commands(args, nsub)
Returns lists of subjob commands for running distributed diverse structure selection.
- Parameters
args (argparse.Namespace) – Command line arguments.
nsub (int) – Number of subjobs.
- Returns
list of subjob commands.
- Return type
list(list(str))
- schrodinger.application.combinatorial_diversity.driver_utils.get_generated_fingerprint_filename(args)
Returns the name of the generated fingerprint file supplied to a diversity subjob when -nosplit is in effect and the original source of structures was anything other than fingerprints.
- Parameters
args (argparse.Namespace) – Command line arguments for a diversity subjob
- Returns
Generated fingerprint file name
- Return type
str
- schrodinger.application.combinatorial_diversity.driver_utils.get_infile_type(infile)
Returns the input file type (JSON, FP, CSV, SMI) based on extension, or an empty string if the extension isn’t recognized.
- Parameters
infile (str) – Input file with source of structures.
- Returns
Input file type or empty string.
- Return type
str
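Extension-based detection of this kind can be sketched in a few lines. The exact return strings and the case-insensitive matching below are assumptions; the documentation names the four types but not their literal values, and `sketch_infile_type` is a hypothetical name.

```python
import os

# Hypothetical mapping from file extension to type label.
_SKETCH_TYPES = {".json": "JSON", ".fp": "FP", ".csv": "CSV", ".smi": "SMI"}


def sketch_infile_type(infile):
    """Return a type label for infile, or '' if the extension is unknown."""
    # splitext keeps only the final extension, so 'a.b.csv' yields '.csv'.
    ext = os.path.splitext(infile)[1].lower()
    return _SKETCH_TYPES.get(ext, "")
```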
- schrodinger.application.combinatorial_diversity.driver_utils.get_jobname(args)
Returns an appropriate job name based on args.fsubjob, args.dsubjob, SCHRODINGER_JOBNAME, the job control backend, or the base name of args.infile.
- Parameters
args (argparse.Namespace) – Command line arguments
- Returns
job name
- Return type
str
- schrodinger.application.combinatorial_diversity.driver_utils.get_parser()
Creates argparse.ArgumentParser with supported command line options.
- Returns
Argument parser object
- Return type
argparse.ArgumentParser
- schrodinger.application.combinatorial_diversity.driver_utils.get_property_type(value)
Returns the apparent PropertyType of the supplied value.
- Parameters
value (str) – The value whose type is to be deduced.
- Returns
The apparent type of value.
- Return type
diversity_fingerprinter.PropertyType
- schrodinger.application.combinatorial_diversity.driver_utils.read_properties(infile, max_rows=1000)
Given a .fp or .csv file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.
- Parameters
infile (str) – Input .fp or .csv file.
max_rows (int) – The maximum number of rows to read.
- Returns
list of property names followed by lists of property values
- Return type
list(str), list(list(str))
- Raises
ValueError – If infile is of the wrong type.
RuntimeError – If .csv file has inconsistent numbers of values.
- schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_csv_file(infile, max_rows=1000)
Given a .csv file, this function returns the list of property names from columns 2 and beyond, which excludes SMILES and title, followed by the property values for the first max_rows rows.
- Parameters
infile (str) – Input .csv file.
max_rows (int) – The maximum number of rows to read.
- Returns
list of property names followed by lists of property values
- Return type
list(str), list(list(str))
- Raises
RuntimeError – If .csv file has inconsistent numbers of values.
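The column handling described here can be sketched with the standard csv module. This is an illustration under stated assumptions, not the library code: it assumes a header row, takes any iterable of text lines (such as an open file) instead of a file name, and uses the hypothetical name `sketch_read_csv_properties`.

```python
import csv


def sketch_read_csv_properties(lines, max_rows=1000):
    """Return (property names, property value rows) from CSV text lines.

    Hypothetical illustration: assumes the first line is a header and that
    columns 0 and 1 hold SMILES and title.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return [], []
    names = header[2:]
    values = []
    for lineno, row in enumerate(reader, start=2):
        if len(values) >= max_rows:
            break
        if not row:
            continue  # skipping blank lines is an assumption
        if len(row) != len(header):
            raise RuntimeError(
                f"Line {lineno}: found {len(row)} values, expected {len(header)}"
            )
        values.append(row[2:])
    return names, values
```

Values are returned as strings, matching the documented return type of list(str), list(list(str)).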
- schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_fp_file(infile, max_rows=1000)
Given a .fp file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.
- Parameters
infile (str) – Input .fp file.
max_rows (int) – The maximum number of rows to read.
- Returns
list of property names followed by lists of property values
- Return type
list(str), list(list(str))
- schrodinger.application.combinatorial_diversity.driver_utils.read_property_filters(filter_file)
Reads property filters from the provided CSV file. The format of each line is: prop_name,min_value,max_value
- Parameters
filter_file (str) – CSV file containing property filters.
- Returns
List of property filters.
- Return type
- Raises
RuntimeError – If filter_file is incorrectly formatted.
ValueError – If limits are invalid.
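A parser for the prop_name,min_value,max_value line format might look like the sketch below. Several details are assumptions: the return value is a list of plain tuples (the actual return type is not documented here), "invalid limits" is taken to mean non-numeric values or a minimum above the maximum, and `sketch_read_property_filters` accepts an iterable of text lines instead of a file name.

```python
import csv


def sketch_read_property_filters(lines):
    """Parse prop_name,min_value,max_value records (hypothetical sketch)."""
    filters = []
    for lineno, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # skipping blank lines is an assumption
        if len(row) != 3:
            raise RuntimeError(
                f"Line {lineno}: expected 3 fields, found {len(row)}"
            )
        name, low_text, high_text = (field.strip() for field in row)
        try:
            low, high = float(low_text), float(high_text)
        except ValueError:
            raise ValueError(f"Line {lineno}: limits must be numeric") from None
        if low > high:
            raise ValueError(f"Line {lineno}: min_value exceeds max_value")
        filters.append((name, low, high))
    return filters
```

For example, the lines `MW,200,500` and `AlogP,-1,5` (illustrative property names) parse to `[("MW", 200.0, 500.0), ("AlogP", -1.0, 5.0)]`.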
- schrodinger.application.combinatorial_diversity.driver_utils.split_fingerprints(fp_file, ndiverse, nsub, jobname, inplace=False, min_pop=10000, num_probes=10)
Splits a fingerprint file literally or figuratively into chunks using DiversitySplitter, and places the chunks into a series of zip archives named <jobname>_select_sub_i.zip, where i = 1, 2,…,nsub. Each archive contains one or more chunks to be processed by the associated subjob. Chunk j consists of exactly one of the following two files:
<jobname>_chunk_j.fp - Fingerprints in the chunk (if inplace=False)
<jobname>_chunk_j.csv - Row numbers in the chunk (if inplace=True)
The value of inplace determines whether fp_file is literally split into smaller fingerprint files, or figuratively split by way of reporting the 0-based row numbers in each chunk.
In addition to the chunk files, <jobname>_select_sub_i.zip contains the file <jobname>_select_sub_i_manifest.csv, which contains an ordered list of the chunk file names and the number of diverse structures to select from each chunk.
- Parameters
fp_file (str) – 32-bit Canvas fingerprint file containing SMILES and any properties to be biased.
ndiverse (int) – The total number of diverse structures to select. Must be at least twice as large as the number of chunks.
nsub (int) – The desired number of subjobs. This would normally be the number of CPUs over which the job is to be distributed, since finer grained processing is already achieved by assigning one or more chunks to each subjob. The actual number of subjobs run may end up being smaller than this value.
jobname (str) – Job name. Determines the names of the archives and chunk files that will be created.
inplace (bool) – Controls whether to split fp_file into smaller files (inplace=False), or simply write the row numbers of each chunk (inplace=True).
min_pop (int) – Suggested minimum number of structures in each chunk. An adjustment is made, as necessary, to ensure the number of diverse structures per chunk is at least MIN_DIVERSE_PER_CHUNK.
num_probes (int) – The number of diverse probe structures used to construct the similarity space from which chunks are defined.
- Returns
tuple of the actual number of subjobs and the number of chunks
- Return type
int, int
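The archive layout described above can be explored with the standard zipfile module. The sketch below reads a manifest from an archive; it assumes the manifest is a headerless two-column CSV of chunk file name and number of diverse structures, which the documentation does not specify. The function name and the demo archive contents are hypothetical.

```python
import csv
import io
import zipfile


def sketch_read_manifest(archive, archive_stem):
    """Return [(chunk_file, ndiverse), ...] from <archive_stem>_manifest.csv.

    archive may be a path or a file-like object; archive_stem is the
    archive name without '.zip', e.g. 'myjob_select_sub_1'.
    """
    manifest_name = f"{archive_stem}_manifest.csv"
    with zipfile.ZipFile(archive) as zf:
        text = zf.read(manifest_name).decode("utf-8")
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if row:
            # Assumed layout: chunk file name, then diverse count.
            entries.append((row[0], int(row[1])))
    return entries


# Build a tiny in-memory archive (inplace=True style) to exercise the reader.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo_chunk_1.csv", "0\n3\n7\n")
    zf.writestr("demo_chunk_2.csv", "1\n2\n")
    zf.writestr(
        "demo_select_sub_1_manifest.csv",
        "demo_chunk_1.csv,2\ndemo_chunk_2.csv,2\n",
    )
buffer.seek(0)
manifest = sketch_read_manifest(buffer, "demo_select_sub_1")
```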
- schrodinger.application.combinatorial_diversity.driver_utils.split_structures(struct_file, nsub, jobname)
Splits a SMILES or CSV file into nsub chunks, creating the files <jobname>_fpgen_sub_i.<ext>, where i=1,2,…,nsub and <ext> is “smi” or “csv”. Each chunk will contain a minimum of MIN_FP_PER_SUBJOB structures, so the number of chunks actually created may be less than nsub.
- Parameters
struct_file (str) – SMILES or CSV file to be split.
nsub (int) – The desired number of subjobs.
jobname (str) – Job name. Determines the names of the files that will be created.
- Returns
The actual number of files created. Will be <= nsub.
- Return type
int
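The relationship between nsub, the minimum chunk size, and the number of files created can be sketched as follows. The value of MIN_FP_PER_SUBJOB is not given in this documentation, so it is passed in as the hypothetical parameter `min_per_chunk`; the even distribution of structures across chunks is also an assumption. The file name pattern follows the description above.

```python
def sketch_chunk_sizes(nstruct, nsub, min_per_chunk):
    """Return a list of chunk sizes (hypothetical illustration).

    The chunk count is capped so each chunk holds at least min_per_chunk
    structures; an input smaller than min_per_chunk yields a single chunk.
    """
    if nstruct <= 0:
        return []
    nchunks = max(1, min(nsub, nstruct // min_per_chunk))
    base, extra = divmod(nstruct, nchunks)
    # The first 'extra' chunks take one additional structure each.
    return [base + 1 if i < extra else base for i in range(nchunks)]


def sketch_chunk_names(jobname, ext, nchunks):
    """Return file names of the form <jobname>_fpgen_sub_i.<ext>, i = 1..nchunks."""
    return [f"{jobname}_fpgen_sub_{i}.{ext}" for i in range(1, nchunks + 1)]
```

With 250 structures, nsub=4, and a minimum of 100 per chunk, the sketch produces two chunks of 125 rather than four, which mirrors the documented behavior that fewer than nsub files may be created.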
- schrodinger.application.combinatorial_diversity.driver_utils.summarize_property_filters(filter_file)
Generates a string with a summary of the property filters in the provided file.
- Parameters
filter_file (str) – CSV file with property filters.
- Returns
Summary of property filters.
- Return type
str
- schrodinger.application.combinatorial_diversity.driver_utils.validate_args(args, startup=False)
Checks the validity of command line arguments.
- Parameters
args (argparse.Namespace) – Command line arguments.
startup (bool) – Set to True if validating at startup time.
- Returns
tuple of validity and non-empty error message if not valid
- Return type
bool, str
- schrodinger.application.combinatorial_diversity.driver_utils.validate_properties(infile, filter_file=None)
Validates the input file and the property filter file to ensure that the required properties are present and numeric.
- Parameters
infile (str) – Input file with source of structures.
filter_file (str or NoneType) – Property filter file, if any.
- Returns
tuple of validity and non-empty error message if not valid
- Return type
bool, str
- schrodinger.application.combinatorial_diversity.driver_utils.write_random_smi_subset(infile, outfile, nsub, rand_seed=1)
Selects a random subset of rows from a .smi file and writes them to another .smi file.
- Parameters
infile (str) – Input .smi file.
outfile (str) – Output .smi file.
nsub (int) – Random subset size.
rand_seed (int) – Seed to initialize random number generator.
- Raises
ValueError – If nsub exceeds the number of rows in infile.
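Seeded random selection of rows can be sketched with the standard random module. This is not the library implementation: `sketch_random_subset` is a hypothetical helper that works on an in-memory list of rows, and preserving the original row order in the output is an assumption.

```python
import random


def sketch_random_subset(rows, nsub, rand_seed=1):
    """Return nsub randomly chosen rows, kept in their original order."""
    rows = list(rows)
    if nsub > len(rows):
        raise ValueError(
            f"Subset size {nsub} exceeds the number of rows ({len(rows)})"
        )
    # A private generator keeps the selection reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(rand_seed)
    chosen = sorted(rng.sample(range(len(rows)), nsub))
    return [rows[i] for i in chosen]
```

Calling the helper twice with the same rand_seed returns the same subset, which is the usual reason for exposing a seed parameter.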
- schrodinger.application.combinatorial_diversity.driver_utils.write_subjob_selections(fp_files, diverse_subset_rows, outfile, gen_coords=False, v3000=False, logger=None)
Reads diverse structures and properties from the supplied fingerprint files and writes them to the indicated output Maestro, SD, CSV or SMILES file.
- Parameters
fp_files (list(str)) – Fingerprint file names.
diverse_subset_rows (list(list(int))) – Zero-based lists of row numbers for diverse structures in each fingerprint file.
outfile (str) – Output file for diverse structures and properties.
gen_coords (bool) – Whether to generate 3D coordinates for Maestro or SD output.
v3000 (bool) – Whether to write SD file structures in V3000 format.
logger (logging.Logger or NoneType) – Logger for warning messages.