schrodinger.application.combinatorial_diversity.diversity_splitter module

Overview and Motivation

This module contains the DiversitySplitter class, which splits a potentially large data set into roughly equal-sized chunks that occupy distinct regions of chemical space. The procedure can make the selection of diverse structures with DiversitySelector more scalable. For example, using DiversitySelector to choose 10,000 diverse structures from a pool of 100,000 may take several hours because of the quadratic nature of the algorithm. However, if the pool of 100,000 is divided into 8 chunks of 12,500, DiversitySelector can select 1,250 structures from each chunk, with each calculation taking only about 1/64 of the time needed to select 10,000 in one shot. While the combined subsets of 1,250 will not be as diverse as a single selection of 10,000 from the full pool, they will be more diverse than a random selection. Furthermore, selections from different chunks can be distributed over different processors for additional speedup. Finally, with this approach, the time required to select larger numbers of diverse structures from larger pools scales linearly in the number of diverse structures, as long as the fraction being selected, such as 10%, remains fixed.
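The 1/64 figure can be checked with a short sketch. It assumes that the cost of one selection is proportional to the pool size times the number of structures selected, which is one reading of "quadratic" here and not a documented cost model. The function relative_cost and its reference values are illustrative names, not part of this module.

```python
def relative_cost(pool_size, num_selected,
                  ref_pool=100_000, ref_selected=10_000):
    """Cost of one selection relative to the one-shot reference run.

    Assumption: cost is proportional to pool_size * num_selected.
    """
    return (pool_size * num_selected) / (ref_pool * ref_selected)


num_chunks = 8
# Each chunk holds 12,500 structures, of which 1,250 are selected.
per_chunk = relative_cost(100_000 // num_chunks, 10_000 // num_chunks)
# All chunks processed one after another on a single processor.
serial_total = num_chunks * per_chunk

print(per_chunk)     # fraction of the one-shot time for a single chunk
print(serial_total)  # fraction of the one-shot time for all chunks in series
```

Under this assumption, a single chunk costs 1/64 of the one-shot run and all 8 chunks run in series cost about 1/8 of it, before any gain from distributing chunks over processors.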

Splitting Chemical Space

Chemical space is divided by first choosing a small number of diverse probe structures from the pool using DiversitySelector. These N probes define an N-dimensional similarity space, obtained by computing the similarity of each structure in the pool to each probe. For a pool of 100,000 structures, this yields a data matrix with 100,000 rows and N columns. Principal components analysis (PCA) can be performed on this data matrix to obtain N orthogonal factors that span directions of maximum variance. If, say, the first 2 factors are retained, this divides the data space into 4 quadrants identified as ‘++’, ‘-+’, ‘--’ and ‘+-’ to indicate the algebraic sign of the score on each of the 2 factors. Below is an illustration of the division of 100,000 actual structures over 4 quadrants derived from PCA using 10 diverse probes. PC1 is the factor with the largest eigenvalue and PC2 is the factor with the 2nd largest eigenvalue:

 20,036  |  32,154
      -+ | ++
      -- | +-
 25,998  |  21,812
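The construction described above can be sketched with NumPy on synthetic data. This is an illustration under stated assumptions, not the module's implementation: the fingerprints are random bits, the probes are simply the first 10 rows instead of a DiversitySelector result, and Tanimoto similarity is assumed because the text does not name the similarity measure. The helper names tanimoto_to_probes and orthant_labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a fingerprint pool: 1,000 structures x 256 bits.
pool = rng.random((1000, 256)) < 0.1
# Stand-in for the diverse probes (assumption: just the first 10 rows).
probes = pool[:10]


def tanimoto_to_probes(fps, probes):
    """Tanimoto similarity of every fingerprint to every probe (rows x N)."""
    fps_i = fps.astype(np.int64)
    probes_i = probes.astype(np.int64)
    common = fps_i @ probes_i.T                # bits on in both
    on_fp = fps_i.sum(axis=1)[:, None]         # bits on in each structure
    on_probe = probes_i.sum(axis=1)[None, :]   # bits on in each probe
    union = on_fp + on_probe - common
    return np.divide(common, union,
                     out=np.zeros(common.shape), where=union > 0)


def orthant_labels(sim, num_factors):
    """Sign pattern of each row on the leading principal components."""
    centered = sim - sim.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:num_factors].T
    signs = np.where(scores >= 0, "+", "-")
    return ["".join(row) for row in signs]


sim = tanimoto_to_probes(pool, probes)
labels = orthant_labels(sim, num_factors=2)
quadrants, counts = np.unique(labels, return_counts=True)
print(dict(zip(quadrants.tolist(), counts.tolist())))
```

The first character of each label is the sign on PC1 and the second is the sign on PC2. The overall sign of a principal axis is arbitrary, so which quadrant is called ‘++’ can differ between PCA implementations while the partition itself stays the same.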

The quadrants do not have equal populations, and as additional factors are retained and we move to octants and then orthants, the disparities in the populations become even greater. For example, in the case of 4 factors and 16 orthants, the populations range from 2,322 (--++) to 9,979 (----). To achieve nearly equal populations, orthants are sorted by population and the most populous orthant is combined with the least populous orthant, the 2nd most populous orthant is combined with the 2nd least populous orthant, etc. In the above case of quadrants, this yields the following orthant pairs and combined populations:

(++, -+) --> 52,190
(+-, --) --> 47,810
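The pairing rule can be written as a short sketch and checked against the quadrant populations given above. The function name pair_orthants is illustrative and is not part of this module.

```python
def pair_orthants(populations):
    """Pair the most populous orthant with the least populous, and so on.

    populations: dict mapping an orthant sign string to its structure count.
    Returns a list of ((larger, smaller), combined_count) tuples.
    Assumes an even number of orthants, which holds for 2**k orthants.
    """
    ranked = sorted(populations, key=populations.get, reverse=True)
    pairs = []
    for i in range(len(ranked) // 2):
        larger, smaller = ranked[i], ranked[-1 - i]
        combined = populations[larger] + populations[smaller]
        pairs.append(((larger, smaller), combined))
    return pairs


quadrants = {"++": 32154, "-+": 20036, "--": 25998, "+-": 21812}
for (larger, smaller), total in pair_orthants(quadrants):
    print(f"({larger}, {smaller}) --> {total:,}")
```

This sketch always lists the more populous orthant first within a pair, so the second pair prints as (--, +-); the combined populations match the values listed above.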

If the goal is to obtain chunks of roughly 12,500 compounds, then we need to go beyond 2 factors so that the combined populations are smaller. Using 4 factors, the following combined populations are obtained:

(--++, ----) --> 12,301
(+++-, -+-+) --> 12,100
(++++, +-++) --> 13,528
(++-+, ++--) --> 12,990
(-+++, -++-) --> 12,356
(-+--, --+-) --> 12,353
(+-+-, ---+) --> 12,523
(+--+, +---) --> 11,849

Increasing to 5 factors results in combined populations that are about half as large as above, so we would use 4 factors to achieve the desired splitting.
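One way to automate the choice of the number of factors is sketched below. It assumes the rule is "keep the largest number of factors for which every orthant pair still holds at least min_pop structures", which is consistent with the example above but is not confirmed to be the class's exact logic. The scores are synthetic normal deviates, not real PCA scores, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

import numpy as np


def combined_populations(scores, num_factors):
    """Combined population of each orthant pair for the leading factors."""
    signs = np.where(scores[:, :num_factors] >= 0, "+", "-")
    counts = Counter("".join(row) for row in signs)
    # Include empty orthants so that all 2**num_factors orthants are paired.
    all_orthants = ("".join(p) for p in product("+-", repeat=num_factors))
    pops = sorted(counts.get(orthant, 0) for orthant in all_orthants)
    # Pair smallest with largest, 2nd smallest with 2nd largest, and so on.
    return [pops[i] + pops[-1 - i] for i in range(len(pops) // 2)]


def choose_num_factors(scores, min_pop):
    """Largest factor count whose smallest orthant pair has >= min_pop rows.

    Returns None if the whole pool is smaller than min_pop.
    """
    best = None
    for k in range(1, scores.shape[1] + 1):
        if min(combined_populations(scores, k)) >= min_pop:
            best = k
    return best


rng = np.random.default_rng(1)
scores = rng.standard_normal((4000, 6))  # synthetic stand-in for PCA scores
print(choose_num_factors(scores, min_pop=400))
```

With one factor there is a single pair that contains the whole pool, so the function returns at least 1 whenever the pool itself is no smaller than min_pop.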

Copyright Schrodinger LLC, All Rights Reserved.

class schrodinger.application.combinatorial_diversity.diversity_splitter.DiversitySplitter(fp_file, min_pop=10000, num_probes=10, rand_seed=1)

Bases: object

Given a file of 32-bit fingerprints, this class selects a diverse set of probe structures and uses the similarities to the probes to divide the structures in the fingerprint file into distinct regions of chemical space of roughly equal populations.

__init__(fp_file, min_pop=10000, num_probes=10, rand_seed=1)

Constructor taking the name of a 32-bit fingerprint file, the minimum desired population of each region of space, the number of diverse probes from which to create the space, and a random seed for the diversity algorithm used to select the probes.

  • fp_file (str) – Input file of 32-bit fingerprints and SMILES.

  • min_pop (int) – Minimum number of structures in each region.

  • num_probes (int) – Number of diverse probe structures.

  • rand_seed (int) – Random seed for underlying diversity algorithm.


getOrthantPairs()

Returns a list of strings that represent the orthant pairs over which the structures are divided. In the case of 4 factors, the returned list might look something like:

['++++|+++-', '++-+|+---', . . . '---+|----']

Orthant pair strings.

Return type



Returns the number of structures associated with each orthant pair. These follow the same order as getOrthantPairs.


Orthant pair populations.

Return type



getOrthantRows()

Returns the 0-based fingerprint row numbers associated with each orthant pair. These follow the same order as getOrthantPairs.


Orthant pair rows.

Return type


splitFingerprints(file_base, create_files=True)

Splits the input fingerprint file into a set of fingerprint files containing the rows associated with each orthant pair. File names will be <file_base>_1.fp, <file_base>_2.fp, etc., and the rows in those files will correspond to the rows returned by getOrthantRows. This function returns the fingerprint file names.

Note that use of this function is not recommended for fingerprint files containing more than 5 million rows due to a buildup of memory within the ChmCustomOut32 objects that create the output fingerprint files. The situation is uniquely problematic because those objects all remain in scope, building up memory, throughout the entire course of this I/O operation.

  • file_base (str) – Base name of the fingerprint files to create.

  • create_files (bool) – If False, function will return fingerprint file names without actually creating the files.


Fingerprint file names.

Return type
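A minimal usage sketch follows, using only the constructor arguments and methods named on this page. It assumes a working Schrodinger Python environment and an existing fingerprint file. The wrapper plan_split and the file names in the usage note are hypothetical.

```python
def plan_split(fp_file, file_base, min_pop=10000, num_probes=10, rand_seed=1):
    """Report how a fingerprint file would be split, without writing files.

    Returns a list of (file_name, orthant_pair, row_count) tuples.
    """
    # Imported inside the function so the sketch can be read or imported
    # without a Schrodinger installation; calling it requires one.
    from schrodinger.application.combinatorial_diversity import \
        diversity_splitter

    splitter = diversity_splitter.DiversitySplitter(
        fp_file, min_pop=min_pop, num_probes=num_probes, rand_seed=rand_seed)
    pairs = splitter.getOrthantPairs()
    rows = splitter.getOrthantRows()
    # create_files=False returns the file names without creating the files.
    names = splitter.splitFingerprints(file_base, create_files=False)
    return [(name, pair, len(r)) for name, pair, r in zip(names, pairs, rows)]
```

A call such as plan_split("pool.fp", "chunk") would list the planned output files with their orthant pairs and row counts. Calling splitFingerprints again with create_files=True would then write the files.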