schrodinger.application.matsci.mlearn.descriptor module¶

Create polymer descriptors and molecule features using RDKit.

exception schrodinger.application.matsci.mlearn.descriptor.FeaturizeError¶: Bases: RuntimeError

schrodinger.application.matsci.mlearn.descriptor.modify_df_for_testing(dataframe)¶: Used by unit tests to customize the features dataframe

class schrodinger.application.matsci.mlearn.descriptor.Featurize(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)¶

Bases: object

Create features from the structures

__init__(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)¶

Initialize the Featurize class

Parameters

dataframe (pandas.DataFrame) – Dataframe containing the structures
smile_col (str) – Name of the column containing the SMILES strings
feature_types (list or tuple) – Feature types to calculate for the given structures
return_float64 (bool) –
warn_func (callable) – The function to call to log/show warnings
use_reference_cols (bool) – If True, use reference columns from the saved json file so the driver keeps working when new descriptor columns are added or the column orders change. If False (when called by the test_feature_existence_and_order unittest), it will create the original feature lists and deduplicate them, and a new reference file can be created from self.feature_cols.

getFeatures()¶

Create features from the SMILES strings in the dataset

Return type: pandas.DataFrame
Returns: Dataframe containing structure, SMILES and full feature vector

removeDuplicates()¶: Remove duplicates from the features and dataframe columns

clearDataFrame()¶: Clear data

createMols()¶: Create RDKit molecules from SMILES

createRdkitDescriptors()¶: Create RDKit descriptors for the molecules

findBondGroups(mol)¶

Find the size of the largest group of contiguous rotatable bonds in the molecule

Parameters: mol (rdkit.Chem.rdchem.Mol) – RDKit molecule
Return type: int
Returns: Size of the largest group of contiguous rotatable bonds present
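
As an illustration of what "contiguous rotatable bonds" can mean, the sketch below treats two rotatable bonds as contiguous when they share an atom and returns the number of bonds in the largest such group. That definition is an assumption and may not match the one findBondGroups uses. The helper name and the input format (pairs of atom indices instead of an RDKit molecule) are hypothetical.

```python
from collections import defaultdict


def largest_contiguous_group(rotatable_bonds):
    """Return the bond count of the largest group of bonds linked by shared atoms.

    rotatable_bonds is a list of (atom_index, atom_index) pairs. Assumption:
    two bonds are contiguous when they share an atom.
    """
    # Map each atom to the indices of the rotatable bonds that touch it
    bonds_at_atom = defaultdict(list)
    for bond_idx, (atom_a, atom_b) in enumerate(rotatable_bonds):
        bonds_at_atom[atom_a].append(bond_idx)
        bonds_at_atom[atom_b].append(bond_idx)

    seen = set()
    largest = 0
    for start in range(len(rotatable_bonds)):
        if start in seen:
            continue
        # Depth-first walk over bonds reachable through shared atoms
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            bond_idx = stack.pop()
            size += 1
            for atom in rotatable_bonds[bond_idx]:
                for neighbor in bonds_at_atom[atom]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        largest = max(largest, size)
    return largest
```

For example, bonds 0-1, 1-2 and 2-3 form one group of three, while a separate bond 5-6 forms a group of one.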

createMatMinerDescriptors()¶: Create matminer descriptors from the RDKit molecules

createMorganDescriptors()¶: Create Morgan descriptors from the RDKit molecules

getMorganCount(mol, morgan_fp_generator)¶

Generate the Morgan fingerprint for one molecule

Parameters

mol (rdkit.Chem.rdchem.Mol) – RDKit molecule
morgan_fp_generator (class object) – Morgan fingerprint generator object

Returns

Dictionary mapping each fingerprint name to its fingerprint value

Return type

dict

createPolymerDescriptors()¶: Create polymer descriptors in the pandas dataframe.

exception schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptorError¶: Bases: RuntimeError

class schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptor(smiles)¶

Bases: object

Class to create polymer descriptors

FLOAT_BASE = 'r_matsci'¶

DESCRIPTORS = {'r_matsci_Backbone_Atoms_Fraction': 'backboneAtomsFraction', 'r_matsci_Double_Ring_Atoms_Fraction': 'doubleRingAtomsFraction', 'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction', 'r_matsci_Rotatable_Bonds_Fraction': 'rotatableBondsFraction', 'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction', 'r_matsci_Triple_Ring_Atoms_Fraction': 'tripleRingAtomsFraction'}¶

__init__(smiles)¶

Initialize polymer descriptor class

Parameters: mol (rdkit.Chem.rdchem.Mol) – rdkit mol object of polymer

getFeatures()¶

Get the descriptors for the given polymer

Return type: dict
Returns: Dictionary mapping each descriptor name to its value
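
The DESCRIPTORS mapping pairs each property name with the name of a method on the class. One way such a mapping can be turned into a feature dictionary is to look up each method by name and call it. The sketch below uses a hypothetical stand-in class with made-up atom counts. It is not PolymerDescriptor and does not claim to show how getFeatures is actually implemented.

```python
class ToyDescriptor:
    """Hypothetical stand-in; not the real PolymerDescriptor."""

    # Same shape as PolymerDescriptor.DESCRIPTORS: property name -> method name
    DESCRIPTORS = {
        'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction',
        'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction',
    }

    def __init__(self, ring_atoms, sp3_atoms, total_atoms):
        # Illustrative atom counts supplied directly instead of parsed from SMILES
        self.ring_atoms = ring_atoms
        self.sp3_atoms = sp3_atoms
        self.total_atoms = total_atoms

    def ringAtomsFraction(self):
        return self.ring_atoms / self.total_atoms

    def sp3AtomsFraction(self):
        return self.sp3_atoms / self.total_atoms

    def getFeatures(self):
        # Resolve each method by name and call it to fill the dictionary
        return {
            prop: getattr(self, method_name)()
            for prop, method_name in self.DESCRIPTORS.items()
        }


features = ToyDescriptor(ring_atoms=6, sp3_atoms=2, total_atoms=8).getFeatures()
```

Keeping the names in one mapping means a new descriptor only needs a new method and a new dictionary entry.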

run()¶: Generate polymer descriptors for the structure

getDoubleAndTripleRingAtoms()¶: Calculate the number of atoms in double and triple fused ring systems

ringAtomsFraction()¶

Get the fraction of atoms in rings in the polymer

Return type: float
Returns: The fraction of atoms in rings

doubleRingAtomsFraction()¶

Get the fraction of atoms in double fused ring systems in the polymer

Return type: float
Returns: The fraction of atoms in double fused ring systems

tripleRingAtomsFraction()¶

Get the fraction of atoms in triple fused ring systems in the polymer

Return type: float
Returns: The fraction of atoms in triple fused ring systems

backboneAtomsFraction()¶

Get the fraction of backbone atoms in the polymer

Return type: float
Returns: The fraction of backbone atoms

rotatableBondsFraction()¶

Get the fraction of rotatable bonds in the polymer

Return type: float
Returns: The fraction of rotatable bonds

sp3AtomsFraction()¶

Get the fraction of sp3 atoms in the polymer

Return type: float
Returns: The fraction of sp3 atoms

schrodinger.application.matsci.mlearn.descriptor.create_oligomer(smiles, monomers)¶

Create an oligomer given the monomer SMILES and the number of monomer repetitions. The head and tail of the monomer should be denoted by the atom [At].

Parameters

smiles (str) – The SMILES string of the monomer
monomers (int) – The number of monomers to repeat

Return type

Chem.rdchem.Mol

Returns

The oligomer as Chem.rdchem.Mol
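
To show the [At] head/tail convention, the sketch below builds an oligomer SMILES by plain string concatenation. It assumes the monomer SMILES starts with the head [At] and ends with the tail [At], which is stricter than the convention described above, and it returns a SMILES string, not a Chem.rdchem.Mol. The function name is hypothetical, and this is not necessarily how create_oligomer works internally.

```python
MARKER = '[At]'


def repeat_monomer_smiles(smiles, monomers):
    """Return the SMILES of `monomers` repeat units capped by [At] at both ends.

    Assumes `smiles` looks like '[At]<repeat unit>[At]' with exactly two markers.
    """
    if monomers < 1:
        raise ValueError('monomers must be at least 1')
    if (smiles.count(MARKER) != 2 or not smiles.startswith(MARKER) or
            not smiles.endswith(MARKER)):
        raise ValueError('expected [At] at the start and end of the SMILES')
    # Strip the two markers to get the repeat unit, then chain the copies
    repeat_unit = smiles[len(MARKER):-len(MARKER)]
    return MARKER + repeat_unit * monomers + MARKER
```

With these assumptions, a propylene-like monomer '[At]CC(C)[At]' repeated twice gives '[At]CC(C)CC(C)[At]'.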

schrodinger.application.matsci.mlearn.descriptor.ml_predict(feature_array, model)¶

Predict with machine learning model

Parameters

feature_array (numpy.ndarray) – Input feature array
model (sklearn.pipeline.Pipeline or BaggingRegressor) – Model object

Return type

numpy.ndarray, numpy.ndarray

Returns

The first array contains the model predictions; the second contains the corresponding errors

schrodinger.application.matsci.mlearn.descriptor.log_info(logger, msg, verbose)¶

Log info to logger and also print it if verbose is True

Parameters

logger (logging.Logger) – The logger object
msg (str) – The message to log and print
verbose (bool) – Whether to print the message too
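
A minimal sketch of the behavior described above, using only the standard logging module. The function name is hypothetical and the real log_info may differ in detail.

```python
import logging


def log_and_print(logger, msg, verbose):
    """Log msg at INFO level and also print it when verbose is True."""
    logger.info(msg)
    if verbose:
        print(msg)
```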

schrodinger.application.matsci.mlearn.descriptor.predict_with_bagged_models(feature_array, bag_model, error_type='ci_90', verbose=False, logger=None)¶

Generate predictions and prediction uncertainties for bagged models. The sklearn BaggingRegressor does not provide an error estimate by default, so this function iterates over the individual models, generates predictions from each one, and then estimates the error as a standard deviation or a confidence interval.

Parameters

feature_array (numpy.ndarray) – Input feature array.
bag_model (sklearn.ensemble._bagging.BaggingRegressor) – Bagged regression model.
error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a float. For example, ci_90 requests a 90% confidence interval.
verbose (bool) – If True, print verbose output
logger (logging.Logger) – The logger object

Return type

numpy.ndarray, numpy.ndarray

Returns

The first array contains the model predictions; the second contains the corresponding errors
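
The idea of iterating over the individual models can be sketched without sklearn. The example below uses hypothetical stand-in estimators that expose only a predict method, and plain lists in place of numpy arrays. With a fitted sklearn BaggingRegressor, the individual models are available in its estimators_ attribute. The sketch also assumes every estimator sees all features, which does not hold when the bagging model subsamples features.

```python
import statistics


class ConstantShiftModel:
    """Hypothetical stand-in for one fitted base estimator."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, feature_rows):
        # Toy rule: sum of the features plus a per-model shift
        return [sum(row) + self.shift for row in feature_rows]


def predict_mean_and_std(feature_rows, estimators):
    """Return per-sample mean prediction and standard deviation across estimators."""
    # One list of predictions per estimator
    per_model = [model.predict(feature_rows) for model in estimators]
    # Transpose so each tuple holds all estimators' predictions for one sample
    per_sample = list(zip(*per_model))
    means = [statistics.fmean(values) for values in per_sample]
    stds = [statistics.stdev(values) for values in per_sample]
    return means, stds


estimators = [ConstantShiftModel(shift) for shift in (-1.0, 0.0, 1.0)]
means, stds = predict_mean_and_std([[1.0, 2.0], [3.0, 4.0]], estimators)
```

The spread of the per-estimator predictions for each sample is what serves as the uncertainty estimate here.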

schrodinger.application.matsci.mlearn.descriptor.get_confidence_interval_error(pred_y_array, n_estimators, error_type='ci_90', logger=None, verbose=False)¶

Get the error values from a confidence interval

Parameters

pred_y_array (numpy.ndarray) – Array containing predictions from the model
n_estimators (int) – Number of estimators
error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a float. For example, ci_90 requests a 90% confidence interval.
logger (logging.Logger) – The logger object
verbose (bool) – If True, print verbose output

Return type

numpy.ndarray

Returns

Array containing predicted errors
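
The documentation above does not give the formula behind the confidence interval, so the following is only one plausible reading. It parses the 'ci_X' string and returns the half-width of a two-sided interval around the mean of one sample's per-estimator predictions, using a normal approximation. The function name and the choice of the standard error of the mean are assumptions.

```python
import math
import statistics


def confidence_interval_error(sample_predictions, error_type='ci_90'):
    """Return the CI half-width of the mean of one sample's predictions.

    Assumes a normal approximation; the module's formula may differ.
    """
    prefix, _, level_text = error_type.partition('_')
    if prefix != 'ci' or not level_text:
        raise ValueError(f'unsupported error type: {error_type}')
    level = float(level_text) / 100.0
    if not 0.0 < level < 1.0:
        raise ValueError('confidence level must be between 0 and 100')
    # Two-sided critical value, e.g. about 1.645 for a 90% interval
    z_value = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    n_estimators = len(sample_predictions)
    std = statistics.stdev(sample_predictions)
    return z_value * std / math.sqrt(n_estimators)
```

A higher confidence level gives a wider interval for the same predictions, so 'ci_95' returns a larger error than 'ci_90'.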