schrodinger.application.matsci.mlearn.descriptor module

Create polymer descriptors and molecule features using RDKit.

Copyright Schrodinger, LLC. All rights reserved.

exception schrodinger.application.matsci.mlearn.descriptor.FeaturizeError

Bases: RuntimeError

schrodinger.application.matsci.mlearn.descriptor.modify_df_for_testing(dataframe)

Used by unit tests to customize the features dataframe.

class schrodinger.application.matsci.mlearn.descriptor.Featurize(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)

Bases: object

Create features from the structures

__init__(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)

Initialize the Featurize class

Parameters
  • dataframe (pandas.DataFrame) – pandas dataframe containing structures

  • smile_col (str) – Name of smiles column.

  • feature_types (list) – Feature types to calculate for the given structures. The default is ('rdkit_descriptor', 'Morgan').

  • return_float64 (bool) –

  • warn_func (callable) – The function to call to log/show warnings

  • use_reference_cols (bool) – If True, use reference columns from the saved json file so the driver keeps working when new descriptor columns are added or the column orders change. If False (when called by the test_feature_existence_and_order unittest), it will create the original feature lists and deduplicate them, and a new reference file can be created from self.feature_cols.

getFeatures()

Create features from the SMILES strings in the dataset

Return type

pandas.DataFrame

Returns

Dataframe containing structure, SMILES and full feature vector

removeDuplicates()

Remove duplicates from the features and dataframe columns

clearDataFrame()

Clear data

createMols()

Create rdkit molecules from SMILES

createRdkitDescriptors()

Create RDKit descriptors for the molecules

findBondGroups(mol)

Find the size of the largest group of contiguous rotatable bonds in the molecule

Parameters

mol (rdkit.Chem.mol) – rdkit molecule

Return type

int

Returns

Size of the largest group of contiguous rotatable bonds
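As a rough illustration of what "largest group of contiguous rotatable bonds" can mean, the sketch below groups bonds that share an atom and returns the size of the biggest group. It is not the module's implementation: the function name `largest_contiguous_group`, the input format (pairs of atom indices for bonds already known to be rotatable), and the "shares an atom" definition of contiguity are all assumptions made for this example.

```python
from collections import defaultdict


def largest_contiguous_group(rotatable_bonds):
    """Return the size of the largest set of bonds linked through shared atoms.

    rotatable_bonds is a hypothetical input: an iterable of
    (atom_index, atom_index) pairs, one pair per rotatable bond.
    """
    bonds = [tuple(bond) for bond in rotatable_bonds]
    parent = list(range(len(bonds)))  # union-find forest over bond indices

    def find(i):
        # Follow parent links to the group root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every bond with the first bond seen on each of its atoms.
    first_bond_for_atom = {}
    for idx, atoms in enumerate(bonds):
        for atom in atoms:
            if atom in first_bond_for_atom:
                parent[find(idx)] = find(first_bond_for_atom[atom])
            else:
                first_bond_for_atom[atom] = idx

    # Count the bonds in each group and report the largest count.
    sizes = defaultdict(int)
    for idx in range(len(bonds)):
        sizes[find(idx)] += 1
    return max(sizes.values(), default=0)
```

For example, the bonds (0, 1), (1, 2), (2, 3) form one chain of three, while (5, 6) is isolated, so the largest group has size 3.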

createMatMinerDescriptors()

Create Mat-Miner descriptor from the rdkit molecule

createMorganDescriptors()

Create Morgan descriptor from the rdkit molecule

getMorganCount(mol, morgan_fp_generator)

Generate the Morgan fingerprint for one molecule

Parameters
  • mol (rdkit.Chem.mol) – rdkit molecule

  • morgan_fp_generator (class object) – Morgan fingerprint generator object

Returns

Dictionary with key as fingerprint name and value as fingerprint

Return type

Dict

createPolymerDescriptors()

Create polymer descriptors in the pandas dataframe.

exception schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptorError

Bases: RuntimeError

class schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptor(smiles)

Bases: object

Class to create the polymer descriptor

FLOAT_BASE = 'r_matsci'
DESCRIPTORS = {'r_matsci_Backbone_Atoms_Fraction': 'backboneAtomsFraction', 'r_matsci_Double_Ring_Atoms_Fraction': 'doubleRingAtomsFraction', 'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction', 'r_matsci_Rotatable_Bonds_Fraction': 'rotatableBondsFraction', 'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction', 'r_matsci_Triple_Ring_Atoms_Fraction': 'tripleRingAtomsFraction'}
__init__(smiles)

Initialize polymer descriptor class

Parameters

mol (rdkit.Chem.rdchem.Mol) – rdkit mol object of polymer

getFeatures()

Get descriptor for given polymer

Return type

dict

Returns

Dictionary with key as name of the descriptor and value as descriptor
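The DESCRIPTORS attribute maps each property name to the name of a method on the class. One plausible way such a mapping can be turned into the returned dictionary is name-based dispatch with `getattr`, sketched below. `ToyDescriptor`, its constructor arguments, and its precomputed atom counts are hypothetical stand-ins; the real class takes a SMILES string and its internals are not documented here.

```python
class ToyDescriptor:
    """Hypothetical stand-in used only to illustrate the dispatch pattern."""

    # Property name -> method name, mirroring the shape of DESCRIPTORS.
    DESCRIPTORS = {
        'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction',
        'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction',
    }

    def __init__(self, n_atoms, n_ring_atoms, n_sp3_atoms):
        # Assumed inputs: atom counts that a real class would derive itself.
        self.n_atoms = n_atoms
        self.n_ring_atoms = n_ring_atoms
        self.n_sp3_atoms = n_sp3_atoms

    def ringAtomsFraction(self):
        return self.n_ring_atoms / self.n_atoms

    def sp3AtomsFraction(self):
        return self.n_sp3_atoms / self.n_atoms

    def getFeatures(self):
        # Look up each method by name and call it to fill the dictionary.
        return {
            name: getattr(self, method_name)()
            for name, method_name in self.DESCRIPTORS.items()
        }


features = ToyDescriptor(n_atoms=10, n_ring_atoms=6, n_sp3_atoms=4).getFeatures()
```

Keeping the mapping in one class attribute means a new descriptor only needs a new method plus one dictionary entry.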

run()

Generate polymer descriptors for the structure

getDoubleAndTripleRingAtoms()

Calculate the number of atoms in double and triple fused rings

ringAtomsFraction()

Get the fraction of atoms in rings in the polymer

Return type

float

Returns

The fraction of atoms in rings

doubleRingAtomsFraction()

Get the fraction of atoms in double fused ring systems in the polymer

Return type

float

Returns

The fraction of atoms in double fused ring systems

tripleRingAtomsFraction()

Get the fraction of atoms in triple fused ring systems in the polymer

Return type

float

Returns

The fraction of atoms in triple fused ring systems

backboneAtomsFraction()

Get the fraction of backbone atoms in the polymer

Return type

float

Returns

The fraction of backbone atoms

rotatableBondsFraction()

Get the fraction of rotatable bonds in the polymer

Return type

float

Returns

The fraction of rotatable bonds

sp3AtomsFraction()

Get the fraction of sp3 atoms in the polymer

Return type

float

Returns

The fraction of sp3 atoms

schrodinger.application.matsci.mlearn.descriptor.create_oligomer(smiles, monomers)

Create an oligomer given the monomer SMILES and the number of monomer repetitions. The head and tail of the monomer should be denoted by the atom [At].

Parameters
  • smiles (str) – The SMILES string of the monomer

  • monomers (int) – The number of monomers to repeat

Return type

Chem.rdchem.Mol

Returns

The oligomer as Chem.rdchem.Mol
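To make the [At] head/tail convention concrete, here is a simplified, string-only sketch. It assumes a linear monomer whose SMILES literally starts and ends with `[At]`, and it returns a SMILES string rather than a Chem.rdchem.Mol. The name `repeat_linear_monomer` is hypothetical, and the real function presumably builds the molecule with RDKit and may handle cases this sketch rejects.

```python
CAP = '[At]'  # head/tail marker described in the documentation above


def repeat_linear_monomer(smiles, monomers):
    """Repeat the part between the two [At] caps `monomers` times.

    Only valid under the stated assumption that the [At] markers are the
    first and last tokens of the SMILES string.
    """
    if monomers < 1:
        raise ValueError('monomers must be at least 1')
    if len(smiles) < 2 * len(CAP) or not (
            smiles.startswith(CAP) and smiles.endswith(CAP)):
        raise ValueError('SMILES must start and end with [At]')
    core = smiles[len(CAP):-len(CAP)]
    if not core or CAP in core:
        raise ValueError('expected exactly two [At] markers around a monomer')
    # Concatenating SMILES fragments bonds the last atom of one repeat
    # unit to the first atom of the next.
    return CAP + core * monomers + CAP


trimer = repeat_linear_monomer('[At]CC[At]', 3)
```

Here `trimer` is `'[At]CCCCCC[At]'`: three ethylene units with the caps kept at both ends.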

schrodinger.application.matsci.mlearn.descriptor.ml_predict(feature_array, model)

Predict with machine learning model

Parameters
  • feature_array (numpy.ndarray) – Input feature array

  • model (sklearn.pipeline.Pipeline or BaggingRegressor) – Model object

Return type

numpy.ndarray, numpy.ndarray

Returns

First array contains predictions from the model, second one contains the errors

schrodinger.application.matsci.mlearn.descriptor.log_info(logger, msg, verbose)

Log info to logger and also print it if verbose is True

Parameters
  • logger (logging.Logger) – The logger object

  • msg (str) – The message to log and print

  • verbose (bool) – Whether to print the message too
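The described behavior, logging the message and optionally printing it, can be sketched with the standard library alone. `log_info_sketch` is a hypothetical name, and this is only an approximation of the documented behavior, not the module's code.

```python
import logging


def log_info_sketch(logger, msg, verbose):
    """Send msg to the logger at INFO level; also print it when verbose."""
    logger.info(msg)
    if verbose:
        print(msg)


# Usage: the message always reaches the logger, and with verbose=True it
# is also echoed to standard output.
example_logger = logging.getLogger('descriptor_example')
log_info_sketch(example_logger, 'Featurization finished', verbose=True)
```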

schrodinger.application.matsci.mlearn.descriptor.predict_with_bagged_models(feature_array, bag_model, error_type='ci_90', verbose=False, logger=None)

Generate predictions and prediction uncertainties for bagged models. The BaggingRegressor from sklearn does not provide an error estimate by default, so this function iterates over the individual models, generates predictions from each one, and then estimates the error as a standard deviation or a confidence interval.

Parameters
  • feature_array (numpy.ndarray) – Input feature array.

  • bag_model (sklearn.ensemble._bagging.BaggingRegressor) – Bagged regression model.

  • error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a number; e.g. ci_90 requests a 90% confidence interval.

  • verbose (bool) – Whether to print messages verbosely

  • logger (logging.Logger) – The logger object

Return type

numpy.ndarray, numpy.ndarray

Returns

First array contains predictions from the model, second one contains the errors
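The per-model approach described above can be sketched without any third-party packages. In the sketch, `estimators` is a list of plain callables standing in for the fitted sub-models (scikit-learn exposes those on a fitted BaggingRegressor as `estimators_`), and the error is the sample standard deviation across models. The function name and input format are assumptions for illustration, not the module's implementation.

```python
from statistics import mean, stdev


def predict_with_ensemble(feature_rows, estimators):
    """Return (predictions, errors), one entry per feature row.

    estimators: hypothetical stand-ins for fitted sub-models; each is a
    callable that maps one feature row to a single number.
    """
    predictions, errors = [], []
    for row in feature_rows:
        per_model = [estimator(row) for estimator in estimators]
        # The ensemble prediction is the mean over the sub-models.
        predictions.append(mean(per_model))
        # The spread across sub-models serves as the uncertainty estimate.
        errors.append(stdev(per_model) if len(per_model) > 1 else 0.0)
    return predictions, errors


# Three toy "models" that disagree by a constant offset.
toy_models = [lambda r: sum(r), lambda r: sum(r) + 2, lambda r: sum(r) + 4]
preds, errs = predict_with_ensemble([[1, 2]], toy_models)
```

For the single row `[1, 2]` the toy models predict 3, 5 and 7, so the ensemble prediction is 5 and the standard deviation is 2.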

schrodinger.application.matsci.mlearn.descriptor.get_confidence_interval_error(pred_y_array, n_estimators, error_type='ci_90', logger=None, verbose=False)

Get the error value using confidence interval

Parameters
  • pred_y_array (numpy.ndarray) – Array containing predictions from the model

  • n_estimators (int) – Number of estimators

  • error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a number; e.g. ci_90 requests a 90% confidence interval.

  • logger (logging.Logger) – The logger object

  • verbose (bool) – Whether to print messages verbosely

Return type

numpy.ndarray

Returns

Array containing predicted errors
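One way to turn a 'ci_X' string and a set of per-estimator predictions into an error value is the normal-approximation half-width z * s / sqrt(n), sketched below for a single sample. The formula the module actually uses is not documented here (it could, for instance, use a Student's t quantile instead), so treat both function names and the formula as assumptions.

```python
from math import sqrt
from statistics import NormalDist, stdev


def parse_confidence_level(error_type):
    """Convert a string such as 'ci_90' into a fraction such as 0.9."""
    prefix, _, value = error_type.partition('_')
    if prefix != 'ci' or not value:
        raise ValueError(f'expected the form ci_X, got {error_type!r}')
    level = float(value)  # raises ValueError if X is not a number
    if not 0 < level < 100:
        raise ValueError('confidence level must be between 0 and 100')
    return level / 100.0


def confidence_interval_error(per_model_predictions, error_type='ci_90'):
    """Half-width of a normal-approximation confidence interval of the mean.

    per_model_predictions: predictions for one sample, one value per
    estimator (hypothetical input format).
    """
    n_estimators = len(per_model_predictions)
    if n_estimators < 2:
        raise ValueError('need at least two estimators')
    level = parse_confidence_level(error_type)
    # Two-sided interval: put (1 - level) / 2 of the probability in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * stdev(per_model_predictions) / sqrt(n_estimators)


half_width = confidence_interval_error([3, 5, 7], 'ci_95')
```

A wider requested interval (larger X) gives a larger z and therefore a larger reported error, while more estimators shrink it through the sqrt(n) term.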