schrodinger.application.matsci.mlearn.descriptor module
Create polymer descriptors and molecule features using RDKit.
Copyright Schrodinger, LLC. All rights reserved.
- exception schrodinger.application.matsci.mlearn.descriptor.FeaturizeError
Bases:
RuntimeError
- schrodinger.application.matsci.mlearn.descriptor.modify_df_for_testing(dataframe)
Used for unittests to customize the features dataframe
- class schrodinger.application.matsci.mlearn.descriptor.Featurize(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)
Bases:
object
Create features from the structures
- __init__(dataframe, smile_col='smiles', feature_types=('rdkit_descriptor', 'Morgan'), return_float64=False, warn_func=None, use_reference_cols=True)
Initialize the Featurize class
- Parameters
dataframe (pandas.DataFrame) – pandas dataframe containing structures
smile_col (str) – Name of smiles column.
feature_types (list) – list of features to be calculated for given structures
return_float64 (bool) –
warn_func (callable) – The function to call to log/show warnings
use_reference_cols (bool) – If True, use reference columns from the saved json file so the driver keeps working when new descriptor columns are added or the column orders change. If False (when called by the test_feature_existence_and_order unittest), it will create the original feature lists and deduplicate them, and a new reference file can be created from self.feature_cols.
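The reference-column behavior described for use_reference_cols can be illustrated with a minimal sketch. The JSON content, the column names, and the helper align_to_reference below are hypothetical; this is not the module's code, only one way a saved column list could keep a feature vector stable when new columns appear or the order changes.

```python
import json

# Hypothetical reference file content: an ordered list of feature column names.
REFERENCE_JSON = '["MolWt", "NumHDonors", "Morgan_12"]'


def align_to_reference(row, reference_cols):
    """Return the row's values in reference order, ignoring extra columns."""
    missing = [col for col in reference_cols if col not in row]
    if missing:
        raise KeyError(f"Missing reference features: {missing}")
    return [row[col] for col in reference_cols]


reference_cols = json.loads(REFERENCE_JSON)
# The computed row has one extra column and a different column order.
row = {"Morgan_12": 2, "NewDescriptor": 0.5, "MolWt": 46.07, "NumHDonors": 1}
aligned = align_to_reference(row, reference_cols)
```

Selecting by name rather than by position is what makes the result independent of how the freshly computed columns happen to be ordered.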
- getFeatures()
Create features from the SMILES strings in the dataset
- Return type
pandas.DataFrame
- Returns
Dataframe containing structure, SMILES and full feature vector
- removeDuplicates()
Remove duplicates from the features and the dataframe columns
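As an illustration of order-preserving deduplication of feature names, here is a small sketch. The helper name and the column names are hypothetical, and the module's actual method may work differently (for example, directly on the pandas dataframe).

```python
def dedupe_preserving_order(names):
    """Drop repeated names while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(names))


feature_cols = ["MolWt", "RingCount", "MolWt", "Morgan_7", "RingCount"]
unique_cols = dedupe_preserving_order(feature_cols)
```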
- clearDataFrame()
Clear data
- createMols()
Create rdkit molecules from SMILES
- createRdkitDescriptors()
Create RDKit descriptors for the molecules
- findBondGroups(mol)
Find the size of the largest group of contiguous rotatable bonds in the molecule
- Parameters
mol (rdkit.Chem.mol) – rdkit molecule
- Return type
int
- Returns
Size of the largest group of contiguous rotatable bonds
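One way to read "largest group of contiguous rotatable bonds" is as the largest set of rotatable bonds connected to each other through shared atoms. The sketch below implements that reading on plain atom-index pairs. It is an assumption about the intent, not the module's implementation, which takes an RDKit molecule and identifies the rotatable bonds itself.

```python
from collections import defaultdict


def largest_contiguous_group(rotatable_bonds):
    """Size of the largest set of bonds linked through shared atoms.

    rotatable_bonds: iterable of (atom_i, atom_j) index pairs.
    """
    bonds = [tuple(bond) for bond in rotatable_bonds]
    # Map each atom to the rotatable bonds that touch it.
    atom_to_bonds = defaultdict(list)
    for idx, (atom_i, atom_j) in enumerate(bonds):
        atom_to_bonds[atom_i].append(idx)
        atom_to_bonds[atom_j].append(idx)

    seen = set()
    largest = 0
    for start in range(len(bonds)):
        if start in seen:
            continue
        # Depth-first search over bonds connected through shared atoms.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            current = stack.pop()
            size += 1
            for atom in bonds[current]:
                for neighbor in atom_to_bonds[atom]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        largest = max(largest, size)
    return largest


# Bonds 1-2, 2-3 and 3-4 form one chain of three; bond 7-8 is isolated.
example_bonds = [(1, 2), (2, 3), (7, 8), (3, 4)]
group_size = largest_contiguous_group(example_bonds)
```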
- createMatMinerDescriptors()
Create Mat-Miner descriptor from the rdkit molecule
- createMorganDescriptors()
Create Morgan descriptor from the rdkit molecule
- getMorganCount(mol, morgan_fp_generator)
Generate morgan fingerprint for one molecule
- Parameters
mol (rdkit.Chem.mol) – rdkit molecule
morgan_fp_generator (class object) – Morgan fingerprint generator object
- Returns
Dictionary with key as fingerprint name and value as fingerprint
- Return type
Dict
- createPolymerDescriptors()
Create polymer descriptors in the pandas dataframe.
- exception schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptorError
Bases:
RuntimeError
- class schrodinger.application.matsci.mlearn.descriptor.PolymerDescriptor(smiles)
Bases:
object
Class to create the polymer descriptor
- FLOAT_BASE = 'r_matsci'
- DESCRIPTORS = {'r_matsci_Backbone_Atoms_Fraction': 'backboneAtomsFraction', 'r_matsci_Double_Ring_Atoms_Fraction': 'doubleRingAtomsFraction', 'r_matsci_Ring_Atoms_Fraction': 'ringAtomsFraction', 'r_matsci_Rotatable_Bonds_Fraction': 'rotatableBondsFraction', 'r_matsci_Sp3_Atoms_Fraction': 'sp3AtomsFraction', 'r_matsci_Triple_Ring_Atoms_Fraction': 'tripleRingAtomsFraction'}
- __init__(smiles)
Initialize polymer descriptor class
- Parameters
mol (rdkit.Chem.rdchem.Mol) – rdkit mol object of polymer
- getFeatures()
Get the descriptors for the given polymer
- Return type
dict
- Returns
Dictionary with the descriptor name as key and the descriptor value as value
- run()
Generate polymer descriptors for the structure
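The DESCRIPTORS attribute maps each property name to the name of the method that computes it. A plausible way to use such a mapping is name-based dispatch with getattr, sketched below. ToyDescriptor, its constructor arguments, and the bodies of its methods are hypothetical stand-ins; whether PolymerDescriptor.run() works this way is an assumption.

```python
class ToyDescriptor:
    """Hypothetical stand-in; not the PolymerDescriptor implementation."""

    # Property name -> name of the method that computes it.
    DESCRIPTORS = {
        "r_matsci_Ring_Atoms_Fraction": "ringAtomsFraction",
        "r_matsci_Sp3_Atoms_Fraction": "sp3AtomsFraction",
    }

    def __init__(self, n_atoms, n_ring_atoms, n_sp3_atoms):
        self.n_atoms = n_atoms
        self.n_ring_atoms = n_ring_atoms
        self.n_sp3_atoms = n_sp3_atoms
        self.features = {}

    def ringAtomsFraction(self):
        return self.n_ring_atoms / self.n_atoms

    def sp3AtomsFraction(self):
        return self.n_sp3_atoms / self.n_atoms

    def run(self):
        # Look up each method by name and store its result under the
        # corresponding property name.
        for prop_name, method_name in self.DESCRIPTORS.items():
            self.features[prop_name] = getattr(self, method_name)()

    def getFeatures(self):
        return dict(self.features)


toy = ToyDescriptor(n_atoms=8, n_ring_atoms=6, n_sp3_atoms=2)
toy.run()
toy_features = toy.getFeatures()
```

With this layout, adding a descriptor only requires a new method and one new entry in the mapping.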
- getDoubleAndTripleRingAtoms()
Calculate the number of atoms in double and triple fused rings
- ringAtomsFraction()
Get the fraction of atoms in rings in the polymer
- Return type
float
- Returns
The fraction of atoms in rings
- doubleRingAtomsFraction()
Get the fraction of atoms in double fused ring systems in the polymer
- Return type
float
- Returns
The fraction of atoms in double fused ring systems
- tripleRingAtomsFraction()
Get the fraction of atoms in triple fused ring systems in the polymer
- Return type
float
- Returns
The fraction of atoms in triple fused ring systems
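The three ring fractions can be illustrated on plain lists of ring atom indices, such as those RDKit's RingInfo.AtomRings() returns. The sketch below makes several assumptions that the source does not confirm: two rings count as fused when they share at least two atoms, a "double" or "triple" system is a fused system of exactly two or three rings, and the overall ring fraction counts every ring atom once. The function names are hypothetical.

```python
def fused_ring_systems(rings):
    """Group rings (collections of atom indices) that share >= 2 atoms."""
    ring_sets = [set(ring) for ring in rings]
    unvisited = set(range(len(ring_sets)))
    systems = []
    while unvisited:
        start = unvisited.pop()
        members = [start]
        stack = [start]
        while stack:
            current = stack.pop()
            # Rings sharing at least two atoms (one bond) are treated as fused.
            linked = [i for i in unvisited
                      if len(ring_sets[current] & ring_sets[i]) >= 2]
            for i in linked:
                unvisited.remove(i)
                members.append(i)
                stack.append(i)
        systems.append([ring_sets[i] for i in members])
    return systems


def ring_fractions(rings, n_atoms):
    """Return (ring, double-ring, triple-ring) atom fractions."""
    all_ring_atoms = set()
    double_atoms = 0
    triple_atoms = 0
    for system in fused_ring_systems(rings):
        atoms = set().union(*system)
        all_ring_atoms |= atoms
        if len(system) == 2:
            double_atoms += len(atoms)
        elif len(system) == 3:
            triple_atoms += len(atoms)
    return (len(all_ring_atoms) / n_atoms,
            double_atoms / n_atoms,
            triple_atoms / n_atoms)


# A naphthalene-like pair sharing atoms 4 and 5, plus one isolated ring,
# in a hypothetical structure with 20 atoms in total.
example_rings = [
    (0, 1, 2, 3, 4, 5),
    (4, 5, 6, 7, 8, 9),
    (12, 13, 14, 15, 16, 17),
]
fractions = ring_fractions(example_rings, n_atoms=20)
```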
- backboneAtomsFraction()
Get the fraction of backbone atoms in the polymer
- Return type
float
- Returns
The fraction of backbone atoms
- rotatableBondsFraction()
Get the fraction of rotatable bonds in the polymer
- Return type
float
- Returns
The fraction of rotatable bonds
- sp3AtomsFraction()
Get the fraction of sp3 atoms in the polymer
- Return type
float
- Returns
The fraction of sp3 atoms
- schrodinger.application.matsci.mlearn.descriptor.create_oligomer(smiles, monomers)
Create an oligomer given the monomer SMILES and the number of monomer repetitions. The head and tail of the monomer should be denoted by the atom [At].
- Parameters
smiles (str) – The SMILES string of the monomer
monomers (int) – The number of monomers to repeat
- Return type
Chem.rdchem.Mol
- Returns
The oligomer as Chem.rdchem.Mol
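To make the [At] head/tail convention concrete, here is a string-level sketch that repeats a monomer. It only handles the simple case where [At] is both the first and the last token of the SMILES, and it returns a SMILES string rather than a Chem.rdchem.Mol. The function name is hypothetical, and whether the real create_oligomer keeps the [At] end caps is not stated; this sketch keeps them.

```python
CAP = "[At]"


def repeat_monomer_smiles(smiles, monomers):
    """Repeat a monomer whose SMILES starts and ends with the [At] marker."""
    if monomers < 1:
        raise ValueError("monomers must be at least 1")
    if (len(smiles) <= 2 * len(CAP) or not smiles.startswith(CAP)
            or not smiles.endswith(CAP)):
        raise ValueError("SMILES must start and end with [At]")
    # Strip the head and tail markers to get the repeating unit.
    body = smiles[len(CAP):-len(CAP)]
    if CAP in body:
        raise ValueError("expected exactly two [At] markers")
    # Concatenation bonds the tail atom of one unit to the head atom of the
    # next; the end caps are kept on the finished chain.
    return CAP + body * monomers + CAP


# Propylene-like monomer repeated three times.
trimer = repeat_monomer_smiles("[At]CC(C)[At]", 3)
```

The resulting string could then be parsed with RDKit's Chem.MolFromSmiles to obtain a molecule object. Monomers whose markers sit elsewhere in the SMILES need graph-level editing instead of string concatenation.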
- schrodinger.application.matsci.mlearn.descriptor.ml_predict(feature_array, model)
Predict with machine learning model
- Parameters
feature_array (numpy.ndarray) – Input feature array
model (sklearn.pipeline.Pipeline or BaggingRegressor) – Model object
- Return type
numpy.ndarray, numpy.ndarray
- Returns
First array contains predictions from the model, second one contains the errors
- schrodinger.application.matsci.mlearn.descriptor.log_info(logger, msg, verbose)
Log info to logger and also print it if verbose is True
- Parameters
logger (logging.Logger) – The logger object
msg (str) – The message to log and print
verbose (bool) – Whether to print the message too
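The documented behavior (always log, print only when verbose is True) can be sketched with the standard logging module. The function below is a minimal stand-in written from the description above, not the module's source; the logger name and the ListHandler class are hypothetical and exist only to make the example observable.

```python
import logging


def log_info_sketch(logger, msg, verbose):
    """Log msg at INFO level and also print it when verbose is True."""
    logger.info(msg)
    if verbose:
        print(msg)


class ListHandler(logging.Handler):
    """Collect logged messages in a list so they can be inspected."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def emit(self, record):
        self.sink.append(record.getMessage())


logged_messages = []
demo_logger = logging.getLogger("descriptor_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(ListHandler(logged_messages))

log_info_sketch(demo_logger, "Featurizing 10 structures", verbose=True)
log_info_sketch(demo_logger, "Done", verbose=False)
```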
- schrodinger.application.matsci.mlearn.descriptor.predict_with_bagged_models(feature_array, bag_model, error_type='ci_90', verbose=False, logger=None)
Generate predictions and prediction uncertainties for a bagged model. The BaggingRegressor from sklearn has no built-in error estimate, so this function iterates through the individual models, generates predictions from each of them, and then estimates the error in terms of standard deviations or confidence intervals.
- Parameters
feature_array (numpy.ndarray) – Input feature array.
bag_model (sklearn.ensemble._bagging.BaggingRegressor) – Bagged regression model.
error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a number; e.g. ci_90 requests a 90% confidence interval.
verbose (bool) – Whether to print messages in addition to logging them
logger (logging.Logger) – The logger object
- Return type
numpy.ndarray, numpy.ndarray
- Returns
First array contains predictions from the model, second one contains the errors
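The per-estimator approach described for this function can be sketched without sklearn by using stand-in objects. A fitted BaggingRegressor exposes its fitted models through the estimators_ attribute; ToyBag mimics only that attribute, and ConstantShiftModel is a hypothetical estimator. The spread measure used here (sample standard deviation across estimators) is one of the options the description mentions; how the real function combines the predictions is not shown in this reference.

```python
import statistics


class ConstantShiftModel:
    """Hypothetical stand-in for one fitted estimator."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, rows):
        return [sum(row) + self.shift for row in rows]


class ToyBag:
    """Mimics only the estimators_ attribute of a fitted BaggingRegressor."""

    def __init__(self, estimators):
        self.estimators_ = estimators


def predict_per_estimator(feature_rows, bag_model):
    """Return one list of predictions per estimator.

    With feature subsampling, a real BaggingRegressor also records the
    columns each estimator was trained on in estimators_features_; this
    sketch assumes every estimator sees all columns.
    """
    return [est.predict(feature_rows) for est in bag_model.estimators_]


def mean_and_std(per_estimator):
    """Mean and sample standard deviation of the predictions per sample."""
    # zip(*...) regroups the predictions so each tuple belongs to one sample.
    by_sample = list(zip(*per_estimator))
    means = [statistics.fmean(values) for values in by_sample]
    stds = [statistics.stdev(values) for values in by_sample]
    return means, stds


bag = ToyBag([ConstantShiftModel(shift) for shift in (-1.0, 0.0, 1.0)])
feature_rows = [[1.0, 2.0], [3.0, 4.0]]
means, stds = mean_and_std(predict_per_estimator(feature_rows, bag))
```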
- schrodinger.application.matsci.mlearn.descriptor.get_confidence_interval_error(pred_y_array, n_estimators, error_type='ci_90', logger=None, verbose=False)
Get the error value using confidence interval
- Parameters
pred_y_array (numpy.ndarray) – Array containing predictions from the model
n_estimators (int) – Number of estimators
error_type (str) – Error type for the prediction. The format is 'ci_X', where X is the desired confidence level in percent, given as a number; e.g. ci_90 requests a 90% confidence interval.
logger (logging.Logger) – The logger object
verbose (bool) – Whether to print messages in addition to logging them
- Return type
numpy.ndarray
- Returns
Array containing predicted errors
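As a worked illustration of the 'ci_X' format, the sketch below parses the confidence level and turns the spread of per-estimator predictions into an interval half-width. The formula is an assumption: a normal approximation applied to the standard error of the mean, z * s / sqrt(n_estimators). The real function may use a different distribution or formula, and the function names here are hypothetical.

```python
import math
import statistics


def parse_confidence_level(error_type):
    """Turn a string such as 'ci_90' into the fraction 0.90."""
    prefix, _, value = error_type.partition("_")
    if prefix != "ci" or not value:
        raise ValueError(f"Unsupported error type: {error_type!r}")
    level = float(value)
    if not 0.0 < level < 100.0:
        raise ValueError("Confidence level must be between 0 and 100")
    return level / 100.0


def confidence_interval_error(per_estimator, error_type="ci_90"):
    """Half-width of the confidence interval of the mean for each sample.

    per_estimator: one list of predictions per estimator.
    """
    level = parse_confidence_level(error_type)
    # Two-sided interval: (1 - level) / 2 of the probability in each tail.
    z_value = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    n_estimators = len(per_estimator)
    errors = []
    for values in zip(*per_estimator):
        std_error = statistics.stdev(values) / math.sqrt(n_estimators)
        errors.append(z_value * std_error)
    return errors


# Three estimators, two samples each.
per_estimator_predictions = [[2.0, 6.0], [3.0, 7.0], [4.0, 8.0]]
ci_errors = confidence_interval_error(per_estimator_predictions, "ci_90")
```

For a small number of estimators, a Student t quantile would give wider and arguably more appropriate intervals than the normal quantile used in this sketch.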