schrodinger.application.combinatorial_diversity.splitter_utils module
This module provides functionality for splitting a large data set into smaller chunks for scalable diversity selection via DiversitySelector.
Copyright Schrodinger LLC, All Rights Reserved.
- schrodinger.application.combinatorial_diversity.splitter_utils.compute_factor_scores(xcols, evectors)
Given N columns of autoscaled X variables and the N eigenvectors obtained from PCA of those X variables, this function computes the score on each eigenvector for each row of X values.
- Parameters
evectors (numpy.ndarray) – Eigenvectors from a PCA analysis. The jth eigenvector is stored in evectors[:, j].
xcols (numpy.ndarray) – Columns of autoscaled X variables. The jth column is stored in xcols[j].
- Returns
N PCA scores for each row in xcols. The shape of the returned array is (xcols.shape[1], xcols.shape[0]), i.e., the shape of the transpose of xcols.
- Return type
numpy.ndarray
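The shapes described above can be illustrated with plain NumPy. This is a minimal sketch, assuming that a score is the projection of a row of X values onto an eigenvector; the name `factor_scores_sketch` is hypothetical and this is not the module's own implementation.

```python
import numpy as np


def factor_scores_sketch(xcols, evectors):
    """Project each row of X values onto each eigenvector (illustrative only)."""
    # xcols[j] is the jth column, so xcols.T has one row per data point.
    # evectors[:, j] is the jth eigenvector, so the product has shape
    # (xcols.shape[1], xcols.shape[0]).
    return xcols.T @ evectors


rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 50))  # 3 X variables, 50 rows of values

# Autoscale: zero mean and unit variance for every X variable.
xcols = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

# Correlation matrix of the autoscaled variables and its eigenvectors.
cormat = xcols @ xcols.T / xcols.shape[1]
evalues, evectors = np.linalg.eigh(cormat)

scores = factor_scores_sketch(xcols, evectors)
print(scores.shape)  # (50, 3)
```

Under these assumptions, the mean squared score on each eigenvector equals the corresponding eigenvalue.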
- schrodinger.application.combinatorial_diversity.splitter_utils.compute_sim_to_probes(fp_file, probe_rows)
Given a 32-bit fingerprint file and the 0-based row numbers for N diverse probe structures, this function computes columns of autoscaled Tanimoto similarities between the probes and all fingerprints in the file.
- Parameters
fp_file (str) – Input file of 32-bit fingerprints.
probe_rows (list(int)) – List of 0-based fingerprint row numbers for N diverse probe structures.
- Returns
N columns of autoscaled similarities.
- Return type
numpy.ndarray
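To make "autoscaled Tanimoto similarities" concrete, the sketch below works on in-memory fingerprints rather than a fingerprint file. Several things here are assumptions: fingerprints are represented as Python sets of on-bit indices, autoscaling is taken to mean centering to zero mean and scaling to unit (population) standard deviation, and the names `tanimoto` and `sim_to_probes_sketch` are hypothetical. It does not read the Schrodinger fingerprint format.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sim_to_probes_sketch(fingerprints, probe_rows):
    """Return one autoscaled similarity column per probe (illustrative only)."""
    # cols[k] holds the similarities of every fingerprint to the kth probe.
    cols = np.array(
        [[tanimoto(fingerprints[p], fp) for fp in fingerprints] for p in probe_rows]
    )
    # Autoscale each column. A column of identical values would have zero
    # standard deviation and is not handled in this sketch.
    means = cols.mean(axis=1, keepdims=True)
    stds = cols.std(axis=1, keepdims=True)
    return (cols - means) / stds


fingerprints = [{1, 2, 3}, {2, 3, 4}, {10, 11}, {1, 10}, {5, 6, 7, 8}]
sim_cols = sim_to_probes_sketch(fingerprints, probe_rows=[0, 2])
print(sim_cols.shape)  # (2, 5)
```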
- schrodinger.application.combinatorial_diversity.splitter_utils.create_sim_cormat(sim_cols)
Given N columns of autoscaled similarities, this function creates an NxN matrix of Pearson correlations among those columns.
- Parameters
sim_cols (numpy.ndarray) – N columns of autoscaled similarities.
- Returns
NxN correlation matrix.
- Return type
numpy.ndarray
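A Pearson correlation matrix of this kind can be sketched with `numpy.corrcoef`. The layout is an assumption: the jth column is taken to be stored in `sim_cols[j]`, as documented for `xcols` in `compute_factor_scores`. The name `sim_cormat_sketch` is hypothetical.

```python
import numpy as np


def sim_cormat_sketch(sim_cols):
    """Return the NxN Pearson correlation matrix of N columns (illustrative only)."""
    # np.corrcoef treats each row of its input as one variable, which matches
    # a layout where sim_cols[j] holds the jth column.
    return np.corrcoef(sim_cols)


rng = np.random.default_rng(1)
raw = rng.random(size=(4, 200))  # 4 columns, 200 rows of values
sim_cols = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

cormat = sim_cormat_sketch(sim_cols)
print(cormat.shape)  # (4, 4)
```

For columns autoscaled with the population standard deviation, as in this example, the same matrix is obtained from `sim_cols @ sim_cols.T / sim_cols.shape[1]`.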
- schrodinger.application.combinatorial_diversity.splitter_utils.diagonalize_symmat(symmat)
Diagonalizes a real, symmetric matrix and returns the eigenvalues and eigenvectors sorted by decreasing eigenvalue.
- Parameters
symmat (numpy.ndarray) – Real, symmetric matrix. Not modified.
- Returns
Reverse-sorted eigenvalues, followed by eigenvectors. The jth eigenvector is stored in the column slice [:, j] of the returned numpy.ndarray.
- Return type
numpy.float64, numpy.ndarray
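The described behavior can be approximated with `numpy.linalg.eigh`, which returns eigenvalues in ascending order, followed by a reordering. This is a minimal sketch under that assumption; the name `diagonalize_symmat_sketch` is hypothetical and the module may use a different routine.

```python
import numpy as np


def diagonalize_symmat_sketch(symmat):
    """Eigen-decompose a real symmetric matrix, largest eigenvalue first."""
    # eigh returns ascending eigenvalues and does not modify its input.
    evalues, evectors = np.linalg.eigh(symmat)
    order = np.argsort(evalues)[::-1]
    # Reorder the eigenvector columns to match the reordered eigenvalues.
    return evalues[order], evectors[:, order]


symmat = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
evalues, evectors = diagonalize_symmat_sketch(symmat)

# Each column j satisfies symmat @ evectors[:, j] == evalues[j] * evectors[:, j].
print(np.allclose(symmat @ evectors, evectors * evalues))  # True
```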
- schrodinger.application.combinatorial_diversity.splitter_utils.get_all_orthant_strings(ndim)
Yields all possible orthant strings for the given number of dimensions. For example, if ndim = 2, this function would yield the 2-dimensional orthant strings ‘++’, ‘+-’, ‘-+’, ‘--’. These correspond to the usual 4 quadrants in xy space.
- Parameters
ndim (int) – Number of dimensions.
- Yield
All possible orthant strings of length ndim.
- Ytype
str
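One way to enumerate the 2**ndim sign combinations is `itertools.product`. The sketch below reproduces the ordering of the ndim = 2 example above; the name `all_orthant_strings_sketch` is hypothetical, and the ordering produced by the module for larger ndim is not confirmed here.

```python
from itertools import product


def all_orthant_strings_sketch(ndim):
    """Yield every string of length ndim over the characters '+' and '-'."""
    for signs in product("+-", repeat=ndim):
        yield "".join(signs)


quadrants = list(all_orthant_strings_sketch(2))
print(quadrants)  # ['++', '+-', '-+', '--']
```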
- schrodinger.application.combinatorial_diversity.splitter_utils.get_orthant_strings(scores, ndim)
Given PCA factor scores over the full set of eigenvectors and a desired number of dimensions in that factor space, this function yields strings of ‘+’ and ‘-’ characters that indicate the orthant in which each row of scores resides. A ‘+’ is assigned if the score is >= 0 and a ‘-’ if the score is < 0.
For example, if a given row consists of the following scores on 8 factors:
[1.3289, -0.2439, -2.1774, 0.8391, 1.4632, -0.6268, 1.2238, -1.7802]
and ndim = 4, the orthant string would be ‘+--+’.
- Parameters
scores (numpy.ndarray) – PCA factor scores (see compute_factor_scores).
ndim (int) – Number of factors to consider. This determines the number of characters in each orthant string.
- Yield
Orthant string for each row in scores.
- Ytype
str
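The sign rule can be sketched in a few lines. This assumes that the first ndim factors of each row are the ones considered, which is consistent with the 8-factor example above; the name `orthant_strings_sketch` is hypothetical.

```python
import numpy as np


def orthant_strings_sketch(scores, ndim):
    """Yield one '+'/'-' string per row, using the first ndim factor scores."""
    for row in np.asarray(scores)[:, :ndim]:
        # '+' for score >= 0, '-' for score < 0.
        yield "".join("+" if s >= 0 else "-" for s in row)


scores = np.array(
    [[1.3289, -0.2439, -2.1774, 0.8391, 1.4632, -0.6268, 1.2238, -1.7802]]
)
orthants = list(orthant_strings_sketch(scores, 4))
print(orthants)  # ['+--+']
```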
- schrodinger.application.combinatorial_diversity.splitter_utils.partition_scores(scores, min_pop)
Given PCA factor scores over the full set of eigenvectors and a minimum required population, this function partitions the scores into distinct orthant pairs of nearly equal population, where the smallest population is guaranteed to be at least min_pop. This is achieved by making a series of calls to get_orthant_strings with progressively larger values of ndim, grouping the scores by orthant string, sorting by population size and then combining the highest and lowest populations, the 2nd highest and 2nd lowest populations, etc. These combined populations decrease as ndim is increased, and the largest value of ndim which allows min_pop to be satisfied is used.
For example:
1. Suppose ndim=4 is the largest dimension that satisfies min_pop.
2. Suppose a given row of scores yields the orthant string ‘-+-+’.
3. Suppose orthant ‘-+-+’ is combined with orthant ‘+--+’.
4. That row of scores would be assigned to orthant pair ‘+--+|-+-+’.
- Parameters
scores (numpy.ndarray) – PCA factor scores (see compute_factor_scores).
min_pop (int) – Minimum required population of any orthant pair.
- Returns
Dictionary of orthant pair --> list of 0-based row numbers.
- Return type
dict{str: list(int)}
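The pairing procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: empty orthants are counted with a population of zero, the higher-population orthant is written first in each pair key, an empty dictionary is returned when no ndim satisfies min_pop, and every ndim up to the number of factors is tried. All function names are hypothetical.

```python
from itertools import product

import numpy as np


def orthant_string(row, ndim):
    """Return the '+'/'-' string for the first ndim scores of one row."""
    return "".join("+" if s >= 0 else "-" for s in row[:ndim])


def pair_orthants(scores, ndim):
    """Group rows by orthant, then pair the largest with the smallest populations."""
    rows_by_orthant = {"".join(p): [] for p in product("+-", repeat=ndim)}
    for i, row in enumerate(scores):
        rows_by_orthant[orthant_string(row, ndim)].append(i)

    # Rank orthants from highest to lowest population.
    ranked = sorted(
        rows_by_orthant, key=lambda o: len(rows_by_orthant[o]), reverse=True
    )

    pairs = {}
    for k in range(len(ranked) // 2):
        high, low = ranked[k], ranked[-1 - k]
        pairs[f"{high}|{low}"] = sorted(rows_by_orthant[high] + rows_by_orthant[low])
    return pairs


def partition_scores_sketch(scores, min_pop):
    """Return the orthant pairs for the largest ndim that satisfies min_pop."""
    best = {}
    for ndim in range(1, scores.shape[1] + 1):
        pairs = pair_orthants(scores, ndim)
        if min(len(rows) for rows in pairs.values()) >= min_pop:
            best = pairs  # a later, larger ndim overwrites an earlier one
    return best


rng = np.random.default_rng(2)
scores = rng.normal(size=(200, 4))  # 200 rows of scores on 4 factors
partition = partition_scores_sketch(scores, min_pop=40)

for pair, rows in partition.items():
    print(pair, len(rows))
```

Every row lands in exactly one orthant pair, so the returned lists together cover all row numbers without overlap.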
- schrodinger.application.combinatorial_diversity.splitter_utils.select_probes(fp_file, num_probes, rand_seed)
Selects the requested number of diverse probe structures from the provided 32-bit fingerprint file and returns the corresponding 0-based fingerprint row numbers.
- Parameters
fp_file (str) – Input file of 32-bit fingerprints and SMILES.
num_probes (int) – Number of diverse probe structures.
rand_seed (int) – Random seed for underlying diversity algorithm.
- Returns
List of 0-based row numbers for diverse probe structures.
- Return type
list(int)