schrodinger.application.combinatorial_diversity.driver_utils module

Provides miscellaneous functionality for combinatorial_diversity_driver.py.

Copyright Schrodinger LLC, All Rights Reserved.

schrodinger.application.combinatorial_diversity.driver_utils.add_property_biasing_options(parser)

Adds property biasing options to the provided parser.

Parameters

parser (argparse.ArgumentParser) – Argument parser object.

schrodinger.application.combinatorial_diversity.driver_utils.adjust_min_pop(min_pop, ndiverse, min_diverse_per_chunk, pool_size)

Adjusts the minimum population per chunk, if necessary, to ensure a minimum number of diverse structures per chunk.

Parameters
  • min_pop (int) – Requested minimum population per chunk.

  • ndiverse (int) – Total number of diverse structures to select.

  • min_diverse_per_chunk (int) – Minimum allowed number of diverse structures per chunk.

  • pool_size (int) – Total number of structures in the pool.

Returns

The appropriate minimum population.

Return type

int
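
The adjustment can be pictured with a small sketch. It is not the library's implementation. It assumes the pool is divided into roughly pool_size / min_pop chunks and that the ndiverse selections are spread evenly across them, so the chunk population must be at least min_diverse_per_chunk * pool_size / ndiverse. The function name adjust_min_pop_sketch is hypothetical.

```python
def adjust_min_pop_sketch(min_pop, ndiverse, min_diverse_per_chunk, pool_size):
    """Hypothetical stand-in for adjust_min_pop, under the stated assumptions."""
    # Smallest chunk population that yields min_diverse_per_chunk selections
    # per chunk when selections are spread evenly (integer ceiling division).
    needed = (min_diverse_per_chunk * pool_size + ndiverse - 1) // ndiverse
    # Never drop below the request, and never exceed the pool itself.
    return min(pool_size, max(min_pop, needed))


# 1000 selections from 1,000,000 structures: chunks must grow to 50,000.
print(adjust_min_pop_sketch(10000, 1000, 50, 1000000))
# 100,000 selections from the same pool: the requested 10,000 is enough.
print(adjust_min_pop_sketch(10000, 100000, 50, 1000000))
```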

schrodinger.application.combinatorial_diversity.driver_utils.combine_diverse_structures(subjob_names, outfile)

Combines diverse structures from subjobs to the indicated output file.

Parameters
  • subjob_names (list(str)) – Subjob names.

  • outfile (str) – Output Maestro, SD, CSV or SMILES file. Diverse structures from subjobs must be in the same format.

schrodinger.application.combinatorial_diversity.driver_utils.detect_property_types(infile, max_rows=1000, sticky_missing=False)

Given a .json, .fp, .csv or .smi input file, this function returns a dictionary mapping property names to property types. It covers all properties other than SMILES and title that are present in the file (.fp, .csv) or automatically calculated (.json, .smi). In the case of .fp and .csv, the first max_rows rows are examined to deduce property types.

Parameters
  • infile (str) – Input file (.json, .fp, .csv or .smi).

  • max_rows (int) – The maximum number of rows to examine.

  • sticky_missing (bool) – If True, a property with any missing values will be assigned a type of PropertyType.MISSING. If False, the property type will be deduced from non-missing values.

Returns

Dictionary of property name to PropertyType.

Return type

dict{str: diversity_fingerprinter.PropertyType}
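
The effect of sticky_missing can be illustrated with a self-contained sketch. The enum below is a hypothetical stand-in: only PropertyType.MISSING is named in this documentation, so the NUMERIC and TEXT members, and the rule that a column mixing numbers and text is treated as text, are assumptions.

```python
from enum import Enum


class PropertyTypeSketch(Enum):
    # Hypothetical members; only MISSING is confirmed by the documentation.
    MISSING = "missing"
    NUMERIC = "numeric"
    TEXT = "text"


def value_type(value):
    """Deduce the apparent type of a single string value."""
    text = value.strip()
    if not text:
        return PropertyTypeSketch.MISSING
    try:
        float(text)
    except ValueError:
        return PropertyTypeSketch.TEXT
    return PropertyTypeSketch.NUMERIC


def column_type(values, sticky_missing=False):
    """Deduce one type for a column of string values."""
    types = {value_type(v) for v in values}
    if PropertyTypeSketch.MISSING in types and sticky_missing:
        return PropertyTypeSketch.MISSING
    # Otherwise deduce the type from the non-missing values only.
    types.discard(PropertyTypeSketch.MISSING)
    if not types:
        return PropertyTypeSketch.MISSING
    if PropertyTypeSketch.TEXT in types:
        return PropertyTypeSketch.TEXT
    return PropertyTypeSketch.NUMERIC


print(column_type(["1.5", "", "2"]))
print(column_type(["1.5", "", "2"], sticky_missing=True))
```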

schrodinger.application.combinatorial_diversity.driver_utils.extract_subjob_chunks(subjob_name, infile)

Extracts chunk files from the archive <subjob_name>.zip and returns lists of the fingerprint files, numbers of diverse structures, and fingerprint domains that should be supplied to the DiversitySelector object that will operate on each chunk. One of two behaviors will occur:

  1. If the archive contains .csv files, then each fingerprint file will be infile, and the row numbers in each .csv file will be returned as the fingerprint domains.

  2. If the archive contains .fp files, then those fingerprint file names will be returned, and each fingerprint domain will be None.

Parameters
  • subjob_name (str) – The subjob name.

  • infile (str) – If the archive contains .csv files, this should be the name of the fingerprint file for which in-place splitting was done. Will be either the user-supplied input fingerprint file (-nocopy, -nosplit) or the fingerprint file generated from the input structures (-nosplit). Ignored if the archive contains .fp files.

Returns

Lists of fingerprint file names, numbers of diverse structures and fingerprint domains.

Return type

list(str), list(int), list(list(int))
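
The two behaviors can be sketched as follows. This is an illustration only: chunk_inputs_sketch is a hypothetical helper that takes chunk file contents as an in-memory mapping instead of reading the archive, it assumes each chunk .csv file holds one row number per line, and it omits the numbers of diverse structures.

```python
def chunk_inputs_sketch(chunk_files, infile):
    """Map chunk files to (fingerprint files, fingerprint domains).

    chunk_files is a hypothetical dict of chunk file name -> text content.
    """
    fp_files, domains = [], []
    for name, content in chunk_files.items():
        if name.endswith(".csv"):
            # In-place splitting: every chunk reads from infile, restricted
            # to the row numbers listed in the chunk file.
            fp_files.append(infile)
            domains.append([int(token) for token in content.split()])
        else:
            # Literal splitting: the chunk is its own fingerprint file.
            fp_files.append(name)
            domains.append(None)
    return fp_files, domains


print(chunk_inputs_sketch({"job_chunk_1.csv": "0\n3\n7\n"}, "all.fp"))
print(chunk_inputs_sketch({"job_chunk_1.fp": ""}, "all.fp"))
```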

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints(infile, outfile, fptype, want_props=False, hba_file=None, hbd_file=None, logger=None)

Generates Canvas fingerprints and, optionally, a default set of physicochemical properties for the structures in a SMILES or CSV file.

Parameters
  • infile (str) – Input SMILES or CSV file.

  • outfile (str) – Output fingerprint file.

  • fptype (str) – Fingerprint type (see LEGAL_FP_TYPES).

  • want_props (bool or NoneType) – Whether to generate properties. Should be True only for SMILES input.

  • hba_file (str or NoneType) – File with customized hydrogen bond acceptor rules. Ignored if want_props is False.

  • hbd_file (str or NoneType) – File with customized hydrogen bond donor rules. Ignored if want_props is False.

  • logger (logging.Logger or NoneType) – Logger for warning and info messages.

Raises

ValueError – If properties are requested for CSV input.

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_csv(csv_file, div_fp, fpout, logger)

Generates fingerprints for the SMILES in a .csv file, and writes the fingerprints, titles, properties from columns 2 and beyond, and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.

Parameters
  • csv_file (str) – CSV file name.

  • div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate only fingerprints.

  • fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.

  • logger (logging.Logger or NoneType) – Logger for warning and info messages.

Returns

Tuple of the number of input rows and the number of fingerprints successfully generated and written.

Return type

int, int

schrodinger.application.combinatorial_diversity.driver_utils.generate_fingerprints_from_smi(smi_file, want_props, div_fp, fpout, logger)

Generates fingerprints and properties for the SMILES in a .smi file, and writes the fingerprints, titles, properties and SMILES to an open fingerprint file. Returns the total number of input rows and the total number of fingerprint rows written.

Parameters
  • smi_file (str) – SMILES file name.

  • want_props (bool) – Whether properties are being generated.

  • div_fp (diversity_fingerprinter.DiversityFingerprinter) – Diversity fingerprinter configured to generate fingerprints and, if want_props is True, properties.

  • fpout (canvas.ChmCustomOut32) – 32-bit custom fingerprint connection.

  • logger (logging.Logger or NoneType) – Logger for warning and info messages.

Returns

Tuple of the number of input rows and the number of fingerprints successfully generated and written.

Return type

int, int

schrodinger.application.combinatorial_diversity.driver_utils.get_available_properties(infile, descriptions=False)

Returns a list of the available properties in the provided input file. If .json or .smi, the properties that are calculated automatically are returned. If .csv, properties in columns 3 and beyond are returned. If .fp, extra data columns other than SMILES are returned.

Parameters
  • infile (str) – Input file with source of structures.

  • descriptions (bool) – Whether to include descriptions for automatically calculated properties.

Returns

Property names.

Return type

list(str)

Raises

KeyError – If any required columns are missing.

schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_fp_generation_commands(args, nsub)

Returns lists of subjob commands for running distributed fingerprint and property generation.

Parameters
  • args (argparse.Namespace) – Command line arguments.

  • nsub (int) – Number of subjobs.

Returns

list of subjob commands.

Return type

list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.get_distributed_selection_commands(args, nsub)

Returns lists of subjob commands for running distributed diverse structure selection.

Parameters
  • args (argparse.Namespace) – Command line arguments.

  • nsub (int) – Number of subjobs.

Returns

list of subjob commands.

Return type

list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.get_generated_fingerprint_filename(args)

Returns the name of the generated fingerprint file supplied to a diversity subjob when -nosplit is in effect and the original source of structures was anything other than fingerprints.

Parameters

args (argparse.Namespace) – Command line arguments for a diversity subjob

Returns

Generated fingerprint file name

Return type

str

schrodinger.application.combinatorial_diversity.driver_utils.get_infile_type(infile)

Returns the input file type (JSON, FP, CSV, SMI) based on extension, or an empty string if the extension isn’t recognized.

Parameters

infile (str) – Input file with source of structures.

Returns

Input file type or empty string.

Return type

str
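
A minimal sketch of extension-based detection, for illustration. The exact strings returned by the real function and its handling of upper-case extensions are not stated here, so the mapping and the case-insensitive match below are assumptions.

```python
import os

# Assumed mapping from file extension to input file type.
_INFILE_TYPES = {".json": "JSON", ".fp": "FP", ".csv": "CSV", ".smi": "SMI"}


def infile_type_sketch(infile):
    """Return the assumed type string, or '' for an unrecognized extension."""
    ext = os.path.splitext(infile)[1].lower()
    return _INFILE_TYPES.get(ext, "")


print(infile_type_sketch("library.smi"))
print(repr(infile_type_sketch("library.sdf")))
```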

schrodinger.application.combinatorial_diversity.driver_utils.get_jobname(args)

Returns an appropriate job name based on args.fsubjob, args.dsubjob, SCHRODINGER_JOBNAME, the job control backend, or the base name of args.infile.

Parameters

args (argparse.Namespace) – Command line arguments

Returns

job name

Return type

str

schrodinger.application.combinatorial_diversity.driver_utils.get_parser()

Creates argparse.ArgumentParser with supported command line options.

Returns

Argument parser object

Return type

argparse.ArgumentParser

schrodinger.application.combinatorial_diversity.driver_utils.get_property_type(value)

Returns the apparent PropertyType of the supplied value.

Parameters

value (str) – The value whose type is to be deduced.

Returns

The apparent type of value.

Return type

diversity_fingerprinter.PropertyType

schrodinger.application.combinatorial_diversity.driver_utils.read_properties(infile, max_rows=1000)

Given a .fp or .csv file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.

Parameters
  • infile (str) – Input .fp or .csv file.

  • max_rows (int) – The maximum number of rows to read.

Returns

list of property names followed by lists of property values

Return type

list(str), list(list(str))

Raises
  • ValueError – If infile is of the wrong type.

  • RuntimeError – If .csv file has inconsistent numbers of values.

schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_csv_file(infile, max_rows=1000)

Given a .csv file, this function returns the list of property names from columns 2 and beyond, which excludes SMILES and title, followed by the property values for the first max_rows rows.

Parameters
  • infile (str) – Input .csv file.

  • max_rows (int) – The maximum number of rows to read.

Returns

list of property names followed by lists of property values

Return type

list(str), list(list(str))

Raises

RuntimeError – If .csv file has inconsistent numbers of values.
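
The return shape can be sketched with the standard csv module. This is not the library's code. It assumes the first row is a header and that "columns 2 and beyond" is zero-based, with SMILES in column 0 and the title in column 1. The helper name is hypothetical, and it takes an iterable of lines instead of a file name so the example stays self-contained.

```python
import csv


def read_csv_properties_sketch(lines, max_rows=1000):
    """Return (property names, rows of property values) from CSV lines."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return [], []
    names = header[2:]  # Assumed layout: SMILES, title, then properties.
    values = []
    for row in reader:
        if len(values) >= max_rows:
            break
        if len(row) != len(header):
            raise RuntimeError("inconsistent number of values: %r" % (row,))
        values.append(row[2:])
    return names, values


sample = ["SMILES,Name,MW,AlogP", "CCO,ethanol,46.07,-0.14", "C,methane,16.04,0.64"]
print(read_csv_properties_sketch(sample, max_rows=1))
```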

schrodinger.application.combinatorial_diversity.driver_utils.read_properties_from_fp_file(infile, max_rows=1000)

Given a .fp file, this function returns the list of property names, excluding SMILES and title, followed by the property values for the first max_rows rows.

Parameters
  • infile (str) – Input .fp file.

  • max_rows (int) – The maximum number of rows to read.

Returns

list of property names followed by lists of property values

Return type

list(str), list(list(str))

schrodinger.application.combinatorial_diversity.driver_utils.read_property_filters(filter_file)

Reads property filters from the provided CSV file. The format of each line is: prop_name,min_value,max_value

Parameters

filter_file (str) – CSV file containing property filters.

Returns

List of property filters.

Return type

list(diversity_selector.PropertyFilter)

Raises
  • RuntimeError – If filter_file is incorrectly formatted.

  • ValueError – If limits are invalid.
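
A sketch of parsing the prop_name,min_value,max_value format. FilterSketch is a hypothetical stand-in for diversity_selector.PropertyFilter, and the property names in the usage lines are illustrative. Which error each malformed input triggers in the real function is not documented in detail. Here a wrong field count or a non-numeric limit is treated as a formatting error, and min_value greater than max_value as an invalid limit.

```python
import csv
from collections import namedtuple

# Hypothetical stand-in for diversity_selector.PropertyFilter.
FilterSketch = namedtuple("FilterSketch", "name min_value max_value")


def parse_property_filters_sketch(lines):
    """Parse 'prop_name,min_value,max_value' lines into FilterSketch tuples."""
    filters = []
    for line_no, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # Assumption: blank lines are skipped.
        if len(row) != 3:
            raise RuntimeError("line %d: expected 3 fields" % line_no)
        try:
            low, high = float(row[1]), float(row[2])
        except ValueError:
            # Assumption: non-numeric limits count as a formatting error.
            raise RuntimeError("line %d: limits must be numeric" % line_no) from None
        if low > high:
            raise ValueError("line %d: min_value exceeds max_value" % line_no)
        filters.append(FilterSketch(row[0].strip(), low, high))
    return filters


print(parse_property_filters_sketch(["AlogP,-1,5", "MW,150,500"]))
```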

schrodinger.application.combinatorial_diversity.driver_utils.split_fingerprints(fp_file, ndiverse, nsub, jobname, inplace=False, min_pop=10000, num_probes=10)

Splits a fingerprint file literally or figuratively into chunks using DiversitySplitter, and places the chunks into a series of zip archives named <jobname>_select_sub_i.zip, where i = 1, 2,…,nsub. Each archive contains one or more chunks to be processed by the associated subjob. Chunk j consists of exactly one of the following two files:

  • <jobname>_chunk_j.fp - Fingerprints in the chunk (if inplace=False)

  • <jobname>_chunk_j.csv - Row numbers in the chunk (if inplace=True)

The value of inplace determines whether fp_file is literally split into smaller fingerprint files, or figuratively split by way of reporting the 0-based row numbers in each chunk.

In addition to the chunk files, <jobname>_select_sub_i.zip contains the file <jobname>_select_sub_i_manifest.csv, which contains an ordered list of the chunk file names and the number of diverse structures to select from each chunk.

Parameters
  • fp_file (str) – 32-bit Canvas fingerprint file containing SMILES and any properties to be biased.

  • ndiverse (int) – The total number of diverse structures to select. Must be at least twice as large as the number of chunks.

  • nsub (int) – The desired number of subjobs. This would normally be the number of CPUs over which the job is to be distributed, since finer grained processing is already achieved by assigning one or more chunks to each subjob. The actual number of subjobs run may end up being smaller than this value.

  • jobname (str) – Job name. Determines the names of the archives and chunk files that will be created.

  • inplace (bool) – Controls whether to split fp_file into smaller files (inplace=False), or simply write the row numbers of each chunk (inplace=True).

  • min_pop (int) – Suggested minimum number of structures in each chunk. An adjustment is made, as necessary, to ensure the number of diverse structures per chunk is at least MIN_DIVERSE_PER_CHUNK.

  • num_probes (int) – The number of diverse probe structures used to construct the similarity space from which chunks are defined.

Returns

tuple of the actual number of subjobs and the number of chunks

Return type

int, int
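
The archive layout described above can be explored with the standard zipfile module. The sketch below builds a toy archive in memory that follows the documented naming scheme and then reads its manifest. The manifest column layout (chunk file name, number of diverse structures, no header row) and the one-row-number-per-line chunk format are assumptions, and read_manifest_sketch is a hypothetical helper.

```python
import csv
import io
import zipfile


def read_manifest_sketch(archive, jobname, i):
    """Return [(chunk_file_name, ndiverse), ...] from an open ZipFile."""
    manifest_name = "%s_select_sub_%d_manifest.csv" % (jobname, i)
    with archive.open(manifest_name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        # Assumed layout: chunk file name, then number of diverse structures.
        return [(row[0], int(row[1])) for row in csv.reader(text) if row]


# Build a toy <jobname>_select_sub_1.zip in memory, with inplace=True chunks.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo_chunk_1.csv", "0\n3\n7\n")
    archive.writestr("demo_chunk_2.csv", "1\n2\n")
    archive.writestr("demo_select_sub_1_manifest.csv",
                     "demo_chunk_1.csv,2\ndemo_chunk_2.csv,1\n")

with zipfile.ZipFile(buffer) as archive:
    manifest = read_manifest_sketch(archive, "demo", 1)

print(manifest)
```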

schrodinger.application.combinatorial_diversity.driver_utils.split_structures(struct_file, nsub, jobname)

Splits a SMILES or CSV file into nsub chunks, creating the files <jobname>_fpgen_sub_i.<ext>, where i=1,2,…,nsub and <ext> is “smi” or “csv”. Each chunk will contain a minimum of MIN_FP_PER_SUBJOB structures, so the number of chunks actually created may be less than nsub.

Parameters
  • struct_file (str) – SMILES or CSV file to be split.

  • nsub (int) – The desired number of subjobs.

  • jobname (str) – Job name. Determines the names of the files that will be created.

Returns

The actual number of files created. Will be <= nsub.

Return type

int
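
The relationship between nsub and the number of files actually created can be sketched as below. The value of MIN_FP_PER_SUBJOB is not given in this documentation, so the sketch takes the minimum as a parameter, and distributing rows as evenly as possible is an assumption. plan_split_sketch is a hypothetical helper that only plans row counts and writes no files.

```python
def plan_split_sketch(nrows, nsub, min_per_subjob):
    """Return the number of rows planned for each output file."""
    # At most nsub files, reduced so each file can hold at least
    # min_per_subjob rows, and always at least one file.
    nfiles = max(1, min(nsub, nrows // min_per_subjob))
    base, extra = divmod(nrows, nfiles)
    # The first `extra` files take one additional row each.
    return [base + 1 if i < extra else base for i in range(nfiles)]


print(plan_split_sketch(1000, 8, 300))
print(plan_split_sketch(100, 4, 300))
```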

schrodinger.application.combinatorial_diversity.driver_utils.summarize_property_filters(filter_file)

Generates a string with a summary of the property filters in the provided file.

Parameters

filter_file (str) – CSV file with property filters.

Returns

Summary of property filters.

Return type

str

schrodinger.application.combinatorial_diversity.driver_utils.validate_args(args, startup=False)

Checks the validity of command line arguments.

Parameters
  • args (argparse.Namespace) – Command line arguments.

  • startup (bool) – Set to True if validating at startup time.

Returns

tuple of validity and non-empty error message if not valid

Return type

bool, str

schrodinger.application.combinatorial_diversity.driver_utils.validate_properties(infile, filter_file=None)

Validates the input file and the property filter file to ensure that the required properties are present and numeric.

Parameters
  • infile (str) – Input file with source of structures.

  • filter_file (str or NoneType) – Property filter file, if any.

Returns

tuple of validity and non-empty error message if not valid

Return type

bool, str

schrodinger.application.combinatorial_diversity.driver_utils.write_random_smi_subset(infile, outfile, nsub, rand_seed=1)

Selects a random subset of rows from a .smi file and writes them to another .smi file.

Parameters
  • infile (str) – Input .smi file.

  • outfile (str) – Output .smi file.

  • nsub (int) – Random subset size.

  • rand_seed (int) – Seed to initialize random number generator.

Raises

ValueError – If nsub exceeds the number of rows in infile.
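
A seeded random subset can be sketched with the standard random module. This is an illustration under assumptions: the real function's sampling method and output ordering are not documented here, so the sketch samples without replacement and keeps the selected rows in their input order. It operates on a list of rows and not on files.

```python
import random


def random_subset_sketch(rows, nsub, rand_seed=1):
    """Return nsub rows chosen at random, reproducibly for a given seed."""
    if nsub > len(rows):
        raise ValueError("nsub exceeds the number of rows")
    rng = random.Random(rand_seed)
    # Sample row indices without replacement, then restore input order.
    picked = sorted(rng.sample(range(len(rows)), nsub))
    return [rows[i] for i in picked]


rows = ["C methane", "CC ethane", "CCC propane", "CCCC butane"]
print(random_subset_sketch(rows, 2, rand_seed=1))
```

Using a private random.Random instance keeps the selection reproducible without touching the global random state.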

schrodinger.application.combinatorial_diversity.driver_utils.write_subjob_selections(fp_files, diverse_subset_rows, outfile, gen_coords=False, v3000=False, logger=None)

Reads diverse structures and properties from the supplied fingerprint files and writes them to the indicated output Maestro, SD, CSV or SMILES file.

Parameters
  • fp_files (list(str)) – Fingerprint file names.

  • diverse_subset_rows (list(list(int))) – Zero-based lists of row numbers for diverse structures in each fingerprint file.

  • outfile (str) – Output file for diverse structures and properties.

  • gen_coords (bool) – Whether to generate 3D coordinates for Maestro or SD output.

  • v3000 (bool) – Whether to write SD file structures in V3000 format.

  • logger (logging.Logger or NoneType) – Logger for warning messages.