schrodinger.application.combinatorial_diversity.diversity_selector module

This module contains the DiversitySelector class, which combines greedy and stochastic approaches in an optimization algorithm that chooses a diverse subset of compounds from a larger pool, with optional biasing of the selections to satisfy a set of property filters.

The objective function to be minimized is the average nearest neighbor similarity within the subset plus the average fraction of property filters failed. Minimization is achieved by starting with a random subset of compounds and then repeatedly attempting to replace the subset member exhibiting the highest nearest neighbor similarity with a randomly chosen member of the pool. Replacements that decrease both the average nearest neighbor similarity and average filter score are always made, as are replacements that decrease one quantity and leave the other unchanged. Replacements that increase either of the quantities are accepted or rejected in accordance with a Monte Carlo test whose probability of being satisfied decreases from 50% to 1% over the course of the optimization.
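The replacement step described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the module's implementation: fingerprints are modeled as sets of on bits, similarity is taken to be Tanimoto, a swap that leaves both quantities unchanged is accepted, and the Monte Carlo test is reduced to a fixed `accept_prob`. The names `tanimoto`, `subset_scores` and `replacement_pass` are hypothetical.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on bits (assumed metric)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def subset_scores(subset, fps, filter_scores):
    """Return (avg nearest neighbor similarity, avg filter score, per-member NN similarities).

    The subset must hold at least two members.
    """
    nn = [max(tanimoto(fps[i], fps[j]) for j in subset if j != i) for i in subset]
    avg_sim = sum(nn) / len(subset)
    avg_filter = sum(filter_scores[i] for i in subset) / len(subset)
    return avg_sim, avg_filter, nn

def replacement_pass(subset, fps, filter_scores, rng, accept_prob):
    """Try to swap the member with the highest NN similarity for a random pool member."""
    old_sim, old_filter, nn = subset_scores(subset, fps, filter_scores)
    worst = nn.index(max(nn))
    pool = [i for i in range(len(fps)) if i not in subset]
    if not pool:
        return subset
    trial = list(subset)
    trial[worst] = rng.choice(pool)
    new_sim, new_filter, _ = subset_scores(trial, fps, filter_scores)
    # A swap that increases either quantity must pass the Monte Carlo test.
    worsens = new_sim > old_sim or new_filter > old_filter
    if not worsens or rng.random() < accept_prob:
        return trial
    return subset

fps = [{1, 2}, {1, 2}, {3, 4}, {5, 6}]
no_failures = [0.0, 0.0, 0.0, 0.0]
improved = replacement_pass([0, 1], fps, no_failures, random.Random(1), 0.0)
```

In the example, rows 0 and 1 are identical, so any swap lowers the average nearest neighbor similarity without changing the filter score and is always accepted.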

For further details, see: Bioorg. Med. Chem. 2012 20, 5379–5387.

Copyright Schrodinger LLC, All Rights Reserved.

class schrodinger.application.combinatorial_diversity.diversity_selector.PropertyFilter(name, min_value, max_value)

Bases: object

Simple object that holds the name of a property and the minimum and maximum allowed values of that property.

__init__(name, min_value, max_value)

Constructor taking the name of a numeric property and limits on the value of that property.

  • name (str) – Property name

  • min_value (float) – Minimum allowed value

  • max_value (float) – Maximum allowed value

schrodinger.application.combinatorial_diversity.diversity_selector.compute_filter_score(filters, prop_values, filter_columns)

Computes the fraction of property filters failed for a given compound.

  • filters (list(PropertyFilter) or NoneType) – List of property filters.

  • prop_values (list(str)) – Values of all properties for the compound.

  • filter_columns (list(int)) – 0-based indices into prop_values for just the properties in filters.


Fraction of filters failed.

Return type

float

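As a rough illustration of the filter score, the sketch below counts how many filters a compound fails and divides by the number of filters. It is not the module's code: `RangeFilter` and `filter_score_sketch` are hypothetical stand-ins, the limits are assumed to be inclusive, property values are assumed to parse as floats, and an empty or missing filter list is assumed to score 0.0.

```python
class RangeFilter:
    """Hypothetical stand-in holding a property name and its allowed range."""

    def __init__(self, name, min_value, max_value):
        self.name = name
        self.min_value = min_value
        self.max_value = max_value

def filter_score_sketch(filters, prop_values, filter_columns):
    """Fraction of filters failed, assuming inclusive limits."""
    if not filters:
        return 0.0
    failed = 0
    for filt, column in zip(filters, filter_columns):
        value = float(prop_values[column])  # property values arrive as strings
        if not filt.min_value <= value <= filt.max_value:
            failed += 1
    return failed / len(filters)

filters = [RangeFilter("MW", 0.0, 500.0), RangeFilter("AlogP", -1.0, 5.0)]
prop_values = ["CCO", "620.4", "3.1"]  # column 0 is not a filtered property
score = filter_score_sketch(filters, prop_values, [1, 2])
```

Here the first filter fails (620.4 exceeds 500.0) and the second passes, so one of two filters is failed.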
schrodinger.application.combinatorial_diversity.diversity_selector.get_filter_columns(fpin, filters)

Given a 32-bit fingerprint file connection and a list of property filters, this function returns the 0-based indices of the filter properties within the full list of fingerprint properties. Raises a KeyError if any of the properties aren’t found.

  • fpin (canvas.ChmFPIn32) – Fingerprint file connection.

  • filters (list(PropertyFilter)) – Property filters.


Filter property indices.

Return type

list(int)

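The lookup can be pictured with a small sketch that works on plain lists of names rather than on a fingerprint file connection. `filter_columns_sketch` is a hypothetical helper, and how property names are read from `fpin` is not shown because that part of the API is not described here.

```python
def filter_columns_sketch(prop_names, filter_names):
    """Map each filter property name to its 0-based position in prop_names."""
    index_of = {name: i for i, name in enumerate(prop_names)}
    missing = [name for name in filter_names if name not in index_of]
    if missing:
        raise KeyError("Properties not found: " + ", ".join(missing))
    return [index_of[name] for name in filter_names]

columns = filter_columns_sketch(["SMILES", "MW", "AlogP"], ["AlogP", "MW"])
```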
schrodinger.application.combinatorial_diversity.diversity_selector.get_fp_domain(fpin, fp_domain=None)

Given a 32-bit fingerprint file connection and an optional list of 0-based row numbers that define the domain of fingerprints to use, this function returns a 0-based list of all row numbers, or the sorted unique rows from the supplied list. Raises a ValueError if any row numbers in fp_domain are outside the legal range.

  • fpin (canvas.ChmFPIn32) – Fingerprint file connection.

  • fp_domain (list(int) or NoneType) – 0-based fingerprint row numbers.


All row numbers, or the supplied row numbers, after sorting and removing duplicates.

Return type

list(int)

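The domain handling amounts to a default, a de-duplicating sort, and a range check. The sketch below takes a row count instead of a fingerprint file connection, so `fp_domain_sketch` and its `row_count` argument are hypothetical.

```python
def fp_domain_sketch(row_count, fp_domain=None):
    """Return all row numbers, or the sorted unique rows from fp_domain."""
    if fp_domain is None:
        return list(range(row_count))
    rows = sorted(set(fp_domain))
    # After sorting, only the first and last entries need a range check.
    if rows and (rows[0] < 0 or rows[-1] >= row_count):
        raise ValueError("Row numbers must lie between 0 and %d" % (row_count - 1))
    return rows

all_rows = fp_domain_sketch(4)
some_rows = fp_domain_sketch(10, [5, 2, 5, 0])
```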

Given a 32-bit fingerprint file connection, this function returns the 0-based index of the first column whose name contains “SMILES”. Raises a KeyError if no such column is found.


  • fpin (canvas.ChmFPIn32) – Fingerprint file connection.


Zero-based SMILES column index.

Return type

int

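A minimal sketch of the column search, operating on a plain list of column names. `smiles_column_sketch` is a hypothetical name, and the match is assumed to be a case-sensitive substring test.

```python
def smiles_column_sketch(column_names):
    """Return the 0-based index of the first name containing 'SMILES'."""
    for index, name in enumerate(column_names):
        if "SMILES" in name:
            return index
    raise KeyError("No column name contains 'SMILES'")

smiles_index = smiles_column_sketch(["Title", "Canonical SMILES", "SMILES"])
```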
class schrodinger.application.combinatorial_diversity.diversity_selector.DiversitySelector(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, filters=None, fp_domain=None, logger=None)

Bases: object

Given a file of 32-bit fingerprints, this class combines greedy and stochastic approaches to select a diverse subset of structures and, optionally, to bias the selections to favor compounds that satisfy a set of property filters.

__init__(fp_file, opt_cycles=10, convrg_tol=0.001, convrg_cycles=3, mc_tol=0.001, rand_seed=1, filters=None, fp_domain=None, logger=None)

Constructor taking the name of a 32-bit fingerprint file and options for optimizing diversity and, optionally, properties.

  • fp_file (str) – Input file of 32-bit fingerprints and SMILES.

  • opt_cycles (int) – Maximum number of optimization cycles. For a subset of N compounds, an optimization cycle consists of N passes, each of which involves an attempt to replace the compound with the highest nearest neighbor similarity.

  • convrg_tol (float) – Convergence tolerance on the absolute change in the objective function. If the change is less than this value, the convergence tolerance is satisfied.

  • convrg_cycles (int) – Number of consecutive cycles over which the convergence tolerance must be satisfied in order to halt the optimization.

  • mc_tol (float) – Monte Carlo criterion. An increase of mc_tol in the objective function will be accepted with a probability of 50% in the first cycle and 1% in the last cycle.

  • rand_seed (int) – Random seed for initial subset selection and Monte Carlo tests.

  • filters (list(PropertyFilter) or NoneType) – List of property filters that selected compounds should preferentially satisfy. All properties must be present in fp_file. If omitted, properties will not be optimized.

  • fp_domain (list(int) or NoneType) – List of 0-based row numbers in fp_file from which selections should be made. If omitted, all rows will be considered.

  • logger (logging.Logger or NoneType) – Logger for output of INFO level progress messages. Feedback can be helpful when large subsets are selected, as a given optimization cycle may take minutes or longer if the subset is significantly larger than 1000.
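The mc_tol and convergence parameters can be made concrete with a small sketch. Everything beyond the two stated endpoints is an assumption: the acceptance test is taken to have the Metropolis form exp(-increase / T), the logarithm of the probability is interpolated linearly between the first and last cycle, cycles are numbered from 0, and the convergence check compares objective values recorded at the end of successive cycles. The function names are hypothetical.

```python
import math

def acceptance_probability(increase, cycle, opt_cycles=10, mc_tol=0.001):
    """Probability of accepting an increase in the objective function.

    An increase equal to mc_tol is accepted with probability 0.5 in the
    first cycle (cycle == 0) and 0.01 in the last (cycle == opt_cycles - 1).
    """
    if increase <= 0.0:
        return 1.0
    frac = cycle / (opt_cycles - 1) if opt_cycles > 1 else 0.0
    # Assumed schedule: interpolate ln(1/p) from ln 2 to ln 100.
    exponent = (1.0 - frac) * math.log(2.0) + frac * math.log(100.0)
    return math.exp(-exponent * increase / mc_tol)

def converged(history, convrg_tol=0.001, convrg_cycles=3):
    """True if the last convrg_cycles changes in the objective are all below convrg_tol."""
    if len(history) <= convrg_cycles:
        return False
    recent = history[-(convrg_cycles + 1):]
    return all(abs(b - a) < convrg_tol for a, b in zip(recent, recent[1:]))

p_first = acceptance_probability(0.001, cycle=0)
p_last = acceptance_probability(0.001, cycle=9)
```

Under these assumptions, larger increases are accepted less often at every cycle, and the test becomes stricter as the optimization proceeds.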


Selects the indicated number of optimized compounds and stores the subset data in the following member variables:

  • self.subset_rows - 0-based row numbers within fingerprint file

  • self.subset_titles - Compound titles

  • self.subset_smiles - Compound SMILES


  • num_select (int) – Desired number of compounds with optimized diversity and, optionally, optimized properties. Note that computational effort scales quadratically with this number, and values significantly larger than 1000 may lead to very long run times.