schrodinger.application.combinatorial_diversity.diversity_splitter module
Overview and Motivation
This module contains the DiversitySplitter class, which splits a potentially large data set into roughly equal-sized chunks that occupy distinct regions of chemical space. The procedure may be used to make the process of choosing diverse structures via DiversitySelector more scalable. For example, using DiversitySelector to choose 10,000 diverse structures from a pool of 100,000 may take several hours due to the quadratic nature of the algorithm. However, if the pool of 100,000 were divided into 8 chunks of 12,500, DiversitySelector could be used to select 1,250 structures from each chunk, with each calculation taking only about 1/64 of the time needed to select 10,000 in a single run, because the problem size shrinks by a factor of 8 and the cost is quadratic. While the combined diverse subsets of 1,250 will not be as diverse as a single selection of 10,000 from the whole pool, the diversity will be higher than that of a random selection. Furthermore, selections made from different chunks can be distributed over different processors for additional speedup. Finally, the time required to select larger numbers of diverse structures from larger pools scales linearly in the number of diverse structures, so long as the fraction being selected, such as 10%, remains fixed.
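The arithmetic behind the 1/64 estimate can be sketched as follows. The cost model (time proportional to pool size times number selected) is an assumption used only to illustrate the quadratic scaling described above, and `relative_cost` is a hypothetical helper, not part of this module.

```python
from fractions import Fraction


def relative_cost(pool_size, num_selected):
    """Assumed cost model: time grows as pool_size * num_selected."""
    return pool_size * num_selected


full = relative_cost(100_000, 10_000)   # one selection over the whole pool
chunk = relative_cost(12_500, 1_250)    # one selection within a single chunk

per_chunk = Fraction(chunk, full)       # cost of one chunk relative to the full run
all_chunks = 8 * per_chunk              # cost of all 8 chunks run one after another

print(per_chunk)   # 1/64
print(all_chunks)  # 1/8
```

Under this model, even a serial run over all 8 chunks would take about 1/8 of the single-run time, before any gain from distributing chunks over processors.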
Splitting Chemical Space
Chemical space is divided by first choosing a small number of diverse probe structures from the pool using DiversitySelector. These N probes are used to define an N-dimensional similarity space by computing the similarity of each structure in the pool to each probe. For a pool of 100,000 structures, this yields a data matrix with 100,000 rows and N columns. Principal components analysis (PCA) can be performed on this data matrix to obtain N orthogonal factors that span directions of maximum variance. If, say, the first 2 factors are retained, this divides the data space into 4 quadrants identified as '++', '-+', '--' and '+-' to indicate the algebraic sign of the score on each of the 2 factors. Below is an illustration of the division of 100,000 actual structures over 4 quadrants derived from PCA using 10 diverse probes. PC1 is the factor with the largest eigenvalue and PC2 is the factor with the 2nd largest eigenvalue:
        PC2
         |
 20,036  |  32,154
         |
      -+ | ++
---------|---------PC1
      -- | +-
         |
 25,998  |  21,812
         |
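The sign-based labeling of quadrants can be sketched in a few lines. This is a minimal illustration, not the module's implementation: the scores are made-up values, `orthant_label` is a hypothetical helper, and treating a score of exactly zero as '+' is an assumption.

```python
from collections import Counter


def orthant_label(scores):
    """Map a tuple of factor scores to a sign string such as '+-'.

    The first character is the sign on PC1, the second the sign on PC2, etc.
    Assumption: a score of exactly zero is labeled '+'.
    """
    return "".join("+" if s >= 0 else "-" for s in scores)


# Hypothetical (PC1, PC2) scores for six structures.
pc_scores = [(0.8, 0.3), (-0.2, 0.5), (-0.7, -0.1),
             (0.4, -0.9), (1.1, 0.2), (-0.3, -0.6)]

labels = [orthant_label(s) for s in pc_scores]
populations = Counter(labels)   # number of structures per quadrant

print(labels)
print(dict(populations))
```

The same function works unchanged for octants or higher orthants, since it emits one sign per retained factor.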
The quadrants do not have equal populations, and as additional factors are retained and we move to octants and then orthants, the disparities in the populations become even greater. For example, in the case of 4 factors and 16 orthants, the populations range from 2,322 (--++) to 9,979 (----). To achieve nearly equal populations, orthants are sorted by population and the most populous orthant is combined with the least populous orthant, the 2nd most populous orthant is combined with the 2nd least populous orthant, etc. In the above case of quadrants, this yields the following orthant pairs and combined populations:
(++, -+) --> 52,190
(+-, --) --> 47,810
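The pairing rule can be sketched as below, using the quadrant populations from the illustration above. `pair_orthants` is a hypothetical helper written for this example; the order of labels within each pair and the handling of ties are choices made here, not documented behavior.

```python
def pair_orthants(populations):
    """Pair the most populous orthant with the least populous, then work inward."""
    ranked = sorted(populations, key=populations.get, reverse=True)
    pairs = []
    for i in range(len(ranked) // 2):
        big, small = ranked[i], ranked[-1 - i]
        pairs.append((big, small, populations[big] + populations[small]))
    return pairs


# Quadrant populations taken from the PC1/PC2 illustration.
quadrants = {"++": 32154, "-+": 20036, "--": 25998, "+-": 21812}
pairs = pair_orthants(quadrants)

for big, small, total in pairs:
    print(f"({big}, {small}) --> {total:,}")
```

The combined populations, 52,190 and 47,810, match the values listed above.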
If the goal is to obtain chunks of roughly 12,500 compounds, then we need to go beyond 2 factors so that the combined populations are smaller. Using 4 factors, the following combined populations are obtained:
(--++, ----) --> 12,301
(+++-, -+-+) --> 12,100
(++++, +-++) --> 13,528
(++-+, ++--) --> 12,990
(-+++, -++-) --> 12,356
(-+--, --+-) --> 12,353
(+-+-, ---+) --> 12,523
(+--+, +---) --> 11,849
Increasing to 5 factors results in combined populations that are about half as large as above, so we would use 4 factors to achieve the desired splitting.
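The choice of 4 factors can be checked with a rough estimate: k factors give 2^k orthants and therefore 2^(k-1) pairs. The sketch below works with average pair populations only. Both helpers are hypothetical, and the selection rule (the largest k whose average stays at or above the target) is an assumption for illustration; the class may apply a different criterion based on the actual populations.

```python
def mean_pair_population(pool_size, num_factors):
    """Average population of an orthant pair: 2**k orthants form 2**(k-1) pairs."""
    return pool_size / 2 ** (num_factors - 1)


def factors_for_target(pool_size, target, max_factors=10):
    """Largest factor count whose mean pair population is still >= target.

    Assumed rule for illustration only.
    """
    best = 1
    for k in range(1, max_factors + 1):
        if mean_pair_population(pool_size, k) >= target:
            best = k
    return best


print(mean_pair_population(100_000, 4))   # 12500.0
print(mean_pair_population(100_000, 5))   # 6250.0
print(factors_for_target(100_000, 12_500))  # 4
```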
Copyright Schrodinger LLC, All Rights Reserved.
- class schrodinger.application.combinatorial_diversity.diversity_splitter.DiversitySplitter(fp_file, min_pop=10000, num_probes=10, rand_seed=1)
- Bases: object
- Given a file of 32-bit fingerprints, this class selects a diverse set of probe structures and uses the similarities to the probes to divide the structures in the fingerprint file into distinct regions of chemical space of roughly equal populations.
 - __init__(fp_file, min_pop=10000, num_probes=10, rand_seed=1)
- Constructor taking the name of a 32-bit fingerprint file, the minimum desired population of each region of space, the number of diverse probes from which to create the space, and a random seed for the diversity algorithm used to select the probes.
- Parameters:
- fp_file (str) – Input file of 32-bit fingerprints and SMILES. 
- min_pop (int) – Minimum number of structures in each region. 
- num_probes (int) – Number of diverse probe structures. 
- rand_seed (int) – Random seed for underlying diversity algorithm. 
 
 
 - getOrthantPairs()
- Returns a list of strings that represent the orthant pairs over which the structures are divided. In the case of 4 factors, the returned list might look something like:
- ['++++|+++-', '++-+|+---', . . . '---+|----']
- Returns:
- Orthant pair strings. 
- Return type:
- list(str) 
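A short sketch of how the returned strings might be unpacked. It uses only the three entries shown in the example above, and `parse_orthant_pair` is a hypothetical helper, not part of the class.

```python
def parse_orthant_pair(pair):
    """Split a string such as '++++|+++-' into its two orthant labels."""
    first, second = pair.split("|")
    return first, second


# The three entries shown in the documented example (elided entries omitted).
pair_strings = ["++++|+++-", "++-+|+---", "---+|----"]

parsed = [parse_orthant_pair(p) for p in pair_strings]
num_factors = len(parsed[0][0])   # one sign character per retained factor

print(parsed)
print(num_factors)
```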
 
 - getOrthantPopulations()
- Returns the number of structures associated with each orthant pair. These follow the same order as getOrthantPairs.
- Returns:
- Orthant pair populations. 
- Return type:
- list(int) 
 
 - getOrthantRows()
- Returns the 0-based fingerprint row numbers associated with each orthant pair. These follow the same order as getOrthantPairs.
- Returns:
- Orthant pair rows. 
- Return type:
- list(list(int)) 
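Since the rows come back grouped by orthant pair, it can be handy to invert them into a lookup from row number to chunk. The sketch below assumes a list(list(int)) shaped like the return value described above; the row values are made up and `row_to_chunk` is a hypothetical helper.

```python
def row_to_chunk(orthant_rows):
    """Invert per-chunk row lists into a {row: chunk_index} lookup (0-based)."""
    mapping = {}
    for chunk_index, rows in enumerate(orthant_rows):
        for row in rows:
            mapping[row] = chunk_index
    return mapping


# Hypothetical result for a 6-row fingerprint file split into 2 orthant pairs.
orthant_rows = [[0, 3, 4], [1, 2, 5]]
chunk_of = row_to_chunk(orthant_rows)

print(chunk_of[3])  # 0
print(chunk_of[5])  # 1
```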
 
 - splitFingerprints(file_base, create_files=True, max_fpfiles_open=256)
- Splits the input fingerprint file into a set of fingerprint files containing the rows associated with each orthant pair. File names will be <file_base>_1.fp, <file_base>_2.fp, etc., and the rows in those files will correspond to the rows returned by getOrthantRows. This function returns the fingerprint file names.
- Parameters:
- file_base (str) – Base name of the fingerprint files to create. 
- create_files (bool) – If False, function will return fingerprint file names without actually creating the files. 
- max_fpfiles_open (int) – Maximum number of output fingerprint files that may be open at any time. If the number of fingerprint files to create exceeds this value, multiple passes are made through the input fingerprint file. A larger value results in faster splitting but greater memory use. 
 
- Returns:
- Fingerprint file names. 
- Return type:
- list(str)
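The naming pattern and the effect of max_fpfiles_open can be sketched without a Schrodinger installation. Both helpers are hypothetical. The file names follow the documented pattern, while the pass count (one pass per batch of at most max_fpfiles_open files) is an assumption about how the multiple passes are organized.

```python
import math


def expected_fp_names(file_base, num_chunks):
    """File names following the documented <file_base>_1.fp, <file_base>_2.fp pattern."""
    return [f"{file_base}_{i}.fp" for i in range(1, num_chunks + 1)]


def estimated_passes(num_chunks, max_fpfiles_open=256):
    """Assumed pass count: one pass per batch of at most max_fpfiles_open files."""
    return math.ceil(num_chunks / max_fpfiles_open)


names = expected_fp_names("pool", 8)
passes_small = estimated_passes(8)       # 8 files fit in a single batch
passes_large = estimated_passes(1024)    # 1024 files need several batches

print(names[0], names[-1])
print(passes_small, passes_large)
```

With the real class, the same list of names should be obtainable without writing any files by calling splitFingerprints(file_base, create_files=False), as described for the create_files parameter.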