schrodinger.application.matsci.mlearn.descriptor module

Create polymer descriptors and molecule features using RDKit.

Copyright Schrodinger, LLC. All rights reserved.

exception schrodinger.application.matsci.mlearn.descriptor.FeaturizeError

Bases: RuntimeError

schrodinger.application.matsci.mlearn.descriptor.modify_df_for_testing(dataframe)

Used for unittests to customize the features dataframe

class schrodinger.application.matsci.mlearn.descriptor.Featurize(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)

Bases: object

Create features from the structures

__init__(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)

Initialize the Featurize class

Parameters:
  • dataframe (pandas.DataFrame) – pandas dataframe containing structures

  • smile_col (str) – Name of smiles column.

  • feature_types (list) – Feature types to calculate for the given structures. The default is ('rdkit_descriptor', 'Morgan').

  • return_float64 (bool) –

  • warn_func (callable) – The function to call to log/show warnings

  • use_reference_cols (bool) – If True, use reference columns from the saved json file so the driver keeps working when new descriptor columns are added or the column orders change. If False (when called by the test_feature_existence_and_order unittest), it will create the original feature lists and deduplicate them, and a new reference file can be created from self.feature_cols.

getFeatures()

Create features from the SMILES strings in the dataset

Return type:

pandas.DataFrame

Returns:

Dataframe containing structure, SMILES and full feature vector

removeDuplicates()

Remove duplicates from the features and the dataframe columns

clearDataFrame()

Clear data

createMols()

Create rdkit molecules from SMILES

createRdkitDescriptors()

Create RDKit descriptors for the molecules

findBondGroups(mol)

Find the size of the largest group of contiguous rotatable bonds in the molecule

Parameters:

mol (rdkit.Chem.mol) – rdkit molecule

Return type:

int

Returns:

Size of the largest group of contiguous rotatable bonds
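As an illustration of the idea only, the sketch below finds the largest group of connected bonds in pure Python. It assumes that the rotatable bonds are already available as pairs of atom indices and that two bonds count as contiguous when they share an atom; both are assumptions, and the name largest_contiguous_group is hypothetical, not part of this module.

```python
from collections import defaultdict


def largest_contiguous_group(rotatable_bonds):
    """Return the bond count of the largest group of connected bonds.

    rotatable_bonds: iterable of (atom_index, atom_index) pairs. Two bonds
    are treated as contiguous when they share an atom (an assumption).
    """
    bonds = [tuple(bond) for bond in rotatable_bonds]
    # Map each atom to the indices of the rotatable bonds that touch it.
    bonds_at_atom = defaultdict(list)
    for idx, (atom_a, atom_b) in enumerate(bonds):
        bonds_at_atom[atom_a].append(idx)
        bonds_at_atom[atom_b].append(idx)

    seen = set()
    largest = 0
    for start in range(len(bonds)):
        if start in seen:
            continue
        # Depth-first walk over bonds that share an atom with the group.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            current = stack.pop()
            size += 1
            for atom in bonds[current]:
                for neighbor in bonds_at_atom[atom]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        largest = max(largest, size)
    return largest
```

For example, the bonds (0, 1), (1, 2), (2, 3) form one chain of three, so adding an isolated bond (5, 6) leaves the result at 3.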

createMatMinerDescriptors()

Create Mat-Miner descriptors from the RDKit molecules

createMorganDescriptors()

Create Morgan descriptors from the RDKit molecules

getMorganCount(mol, morgan_fp_generator)

Generate the Morgan fingerprint for one molecule

Parameters:
  • mol (rdkit.Chem.mol) – rdkit molecule

  • morgan_fp_generator (class object) – Morgan fingerprint generator object

Returns:

Dictionary mapping each fingerprint name to its fingerprint value

Return type:

Dict

createPolymerDescriptors()

Create polymer descriptors in the pandas dataframe.

exception schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptorError

Bases: RuntimeError

class schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptor(smiles)

Bases: object

Class to create the polymer descriptor

FLOAT_BASE = 'r_matsci'
DESCRIPTORS = {'r_matsci_Backbone_Atoms_Fraction': 'backboneAtomsFraction', 'r_matsci_Double_Ring_Atoms_Fraction': 'doubleRingAtomsFraction', 'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction', 'r_matsci_Rotatable_Bonds_Fraction': 'rotatableBondsFraction', 'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction', 'r_matsci_Triple_Ring_Atoms_Fraction': 'tripleRingAtomsFraction'}
__init__(smiles)

Initialize polymer descriptor class

Parameters:

mol (rdkit.Chem.rdchem.Mol) – rdkit mol object of polymer

getFeatures()

Get the descriptors for the given polymer

Return type:

dict

Returns:

Dictionary mapping each descriptor name to its value
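The DESCRIPTORS attribute maps property names to method names, which suggests a name-based dispatch. The sketch below shows one way such a mapping can be turned into a feature dictionary. ToyDescriptor, its constructor arguments, and its arithmetic are hypothetical stand-ins, not the PolymerDescriptor implementation.

```python
class ToyDescriptor:
    """Hypothetical stand-in; not the PolymerDescriptor implementation."""

    # Property name -> name of the method that computes it.
    DESCRIPTORS = {
        'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction',
        'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction',
    }

    def __init__(self, total_atoms, ring_atoms, sp3_atoms):
        # Pre-counted atoms replace real structure analysis in this sketch.
        self.total_atoms = total_atoms
        self.ring_atoms = ring_atoms
        self.sp3_atoms = sp3_atoms

    def ringAtomsFraction(self):
        return self.ring_atoms / self.total_atoms

    def sp3AtomsFraction(self):
        return self.sp3_atoms / self.total_atoms

    def getFeatures(self):
        # Look up each method by name and call it.
        return {
            name: getattr(self, method)()
            for name, method in self.DESCRIPTORS.items()
        }


features = ToyDescriptor(total_atoms=8, ring_atoms=6, sp3_atoms=2).getFeatures()
```

Keeping the names in one mapping means a new descriptor needs only a new method and a new dictionary entry.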

run()

Generate polymer descriptors for the structure

getDoubleAndTripleRingAtoms()

Calculate the number of atoms in double and triple fused rings

ringAtomsFraction()

Get the fraction of atoms in rings in the polymer

Return type:

float

Returns:

The fraction of atoms in rings

doubleRingAtomsFraction()

Get the fraction of atoms in double fused ring systems in the polymer

Return type:

float

Returns:

The fraction of atoms in double fused ring systems

tripleRingAtomsFraction()

Get the fraction of atoms in triple fused ring systems in the polymer

Return type:

float

Returns:

The fraction of atoms in triple fused ring systems

backboneAtomsFraction()

Get the fraction of backbone atoms in the polymer

Return type:

float

Returns:

The fraction of backbone atoms

rotatableBondsFraction()

Get the fraction of rotatable bonds in the polymer

Return type:

float

Returns:

The fraction of rotatable bonds

sp3AtomsFraction()

Get the fraction of sp3 atoms in the polymer

Return type:

float

Returns:

The fraction of sp3 atoms

schrodinger.application.matsci.mlearn.descriptor.create_oligomer(smiles, monomers)

Create an oligomer given the monomer SMILES and the number of monomer repetitions. The head and tail of the monomer should be denoted by the atom [At].

Parameters:
  • smiles (str) – The SMILES string of the monomer

  • monomers (int) – The number of monomers to repeat

Return type:

Chem.rdchem.Mol

Returns:

The oligomer as Chem.rdchem.Mol
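To make the [At] head/tail convention concrete, here is a string-level sketch. It is not how create_oligomer works internally, and it returns a SMILES string rather than a Chem.rdchem.Mol. It assumes a linear monomer whose SMILES string starts and ends with [At]; the helper name oligomer_smiles is hypothetical.

```python
MARKER = '[At]'


def oligomer_smiles(smiles, monomers):
    """Repeat a linear monomer SMILES written as '[At]<body>[At]'."""
    if monomers < 1:
        raise ValueError('monomers must be at least 1')
    if not (smiles.startswith(MARKER) and smiles.endswith(MARKER)):
        raise ValueError('expected [At] at both ends of the SMILES string')
    body = smiles[len(MARKER):-len(MARKER)]
    if not body or MARKER in body:
        raise ValueError('expected exactly two [At] markers around a body')
    # Adjacent atoms in a SMILES string are bonded, so joining the bodies
    # links the tail of one repeat unit to the head of the next.
    return MARKER + body * monomers + MARKER
```

With this sketch, oligomer_smiles('[At]CC[At]', 3) gives '[At]CCCCCC[At]'. Monomers whose markers sit inside a branch or in the middle of the string need a graph-based approach instead.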

schrodinger.application.matsci.mlearn.descriptor.ml_predict(feature_array, model)

Predict with machine learning model

Parameters:
  • feature_array (numpy.ndarray) – Input feature array

  • model (sklearn.pipeline.Pipeline or BaggingRegressor) – Model object

Return type:

numpy.ndarray, numpy.ndarray

Returns:

First array contains predictions from the model, second one contains the errors

schrodinger.application.matsci.mlearn.descriptor.log_info(logger, msg, verbose)

Log info to logger and also print it if verbose is True

Parameters:
  • logger (logging.Logger) – The logger object

  • msg (str) – The message to log and print

  • verbose (bool) – Whether to print the message too

schrodinger.application.matsci.mlearn.descriptor.predict_with_bagged_models(feature_array, bag_model, error_type='ci_90', verbose=False, logger=None)

Generate prediction uncertainties for bagged models. The BaggingRegressor from sklearn does not provide an error estimate by default, so this function iterates over the individual models in the ensemble and generates predictions from each one. The error is then estimated from those predictions as a standard deviation or a confidence interval.

Parameters:
  • feature_array (numpy.ndarray) – Input feature array.

  • bag_model (sklearn.ensemble._bagging.BaggingRegressor) – Bagged regression model.

  • error_type (str) – Error type for the prediction, in the format ‘ci_X’, where X is the desired confidence level in percent, given as a number. For example, ci_90 requests a 90% confidence interval.

  • verbose (bool) – True if you want to print verbosely

  • logger (logging.Logger) – The logger object

Return type:

numpy.ndarray, numpy.ndarray

Returns:

First array contains predictions from the model, second one contains the errors
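The per-model iteration described above can be sketched without sklearn. The snippet assumes each ensemble member is a plain callable that maps one feature row to a float; with a fitted BaggingRegressor the members are available as estimators_. The name bagged_mean_and_std and the toy models are hypothetical, and the real function may aggregate differently.

```python
import statistics


def bagged_mean_and_std(feature_rows, estimators):
    """Return (predictions, errors) from per-estimator predictions.

    estimators: callables mapping one feature row to a float; hypothetical
    stand-ins for the fitted members of a bagged model.
    """
    predictions, errors = [], []
    for row in feature_rows:
        per_model = [estimate(row) for estimate in estimators]
        # The ensemble prediction is the mean over the members.
        predictions.append(statistics.fmean(per_model))
        # Sample standard deviation across the ensemble members.
        errors.append(statistics.stdev(per_model))
    return predictions, errors


# Three toy "models" that scale the sum of the features differently.
toy_estimators = [lambda row, k=k: k * sum(row) for k in (0.9, 1.0, 1.1)]
preds, errs = bagged_mean_and_std([[1.0, 2.0], [0.5, 0.5]], toy_estimators)
```

The spread between members is what carries the uncertainty: rows on which the models disagree more receive a larger error.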

schrodinger.application.matsci.mlearn.descriptor.get_confidence_interval_error(pred_y_array, n_estimators, error_type='ci_90', logger=None, verbose=False)

Get the error value using confidence interval

Parameters:
  • pred_y_array (numpy.ndarray) – Array containing predictions from the model

  • n_estimators (int) – Number of estimators

  • error_type (str) – Error type for the prediction, in the format ‘ci_X’, where X is the desired confidence level in percent, given as a number. For example, ci_90 requests a 90% confidence interval.

  • logger (logging.Logger) – The logger object

  • verbose (bool) – True if you want to print verbosely

Return type:

numpy.ndarray

Returns:

Array containing predicted errors
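One common way to turn per-model predictions into a confidence-interval error is the normal approximation for the ensemble mean, sketched below for a single sample. Whether this module uses that formula is not stated here, so treat the formula and the helper names parse_confidence_level and confidence_interval_error as assumptions.

```python
import math
import statistics


def parse_confidence_level(error_type):
    """Turn 'ci_90' into 0.90; raise ValueError for anything else."""
    prefix, _, value = error_type.partition('_')
    if prefix != 'ci' or not value:
        raise ValueError(f'unsupported error type: {error_type!r}')
    level = float(value) / 100.0
    if not 0.0 < level < 1.0:
        raise ValueError(f'confidence level out of range: {error_type!r}')
    return level


def confidence_interval_error(per_model_predictions, error_type='ci_90'):
    """Half-width of a normal-approximation CI for the ensemble mean."""
    level = parse_confidence_level(error_type)
    # Two-sided interval: put (1 - level) / 2 in each tail.
    z_score = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    n_estimators = len(per_model_predictions)
    spread = statistics.stdev(per_model_predictions)
    return z_score * spread / math.sqrt(n_estimators)
```

Under this assumption a higher confidence level widens the interval, and adding estimators narrows it by a factor of the square root of their number.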