schrodinger.application.matsci.ml_formulations_utils module

Classes and functions to help with ML formulation-based workflows.

Copyright Schrodinger, LLC. All rights reserved.

schrodinger.application.matsci.ml_formulations_utils.validate_smiles(smiles)

A cached function that wraps the adapter.validate_smarts function to validate the SMILES string. Note: the ML formulations backend uses a dummy molecule ‘[1O]’ as a placeholder for empty components, so that SMILES is rejected even though RDKit considers it valid.

Parameters:

smiles (str) – The smiles string to validate

Returns:

True if the smiles is valid, False otherwise

Return type:

bool
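
The caching and placeholder rejection described above can be sketched as follows. This is only an illustration under stated assumptions: `toolkit_accepts` is a hypothetical stand-in for the real toolkit call and does not parse SMILES, and `validate_smiles_sketch` is not the module's implementation.

```python
from functools import lru_cache

# Placeholder the backend uses for empty components (from the note above).
EMPTY_COMPONENT_SMILES = '[1O]'


def toolkit_accepts(smiles):
    """Hypothetical stand-in for the real toolkit check.

    Only checks for a non-empty string without whitespace and with balanced
    brackets. It is NOT a real SMILES parser.
    """
    if not smiles or any(ch.isspace() for ch in smiles):
        return False
    closers = {')': '(', ']': '['}
    stack = []
    for ch in smiles:
        if ch in '([':
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


@lru_cache(maxsize=None)
def validate_smiles_sketch(smiles):
    """Reject the empty-component placeholder, else defer to the toolkit."""
    if smiles == EMPTY_COMPONENT_SMILES:
        return False
    return toolkit_accepts(smiles)
```

Because the result is cached per string, repeated components in a large formulation table are only checked once.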

schrodinger.application.matsci.ml_formulations_utils.get_top_level_dir(tar)

Get the top-level directory of the tar archive

Parameters:

tar (tarfile.TarFile) – The tarfile object to inspect

Returns:

The name of the top level directory, or a set of top-level entries

Return type:

str or set
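
A minimal sketch of how the top-level entries of an archive can be collected with the standard tarfile module. It assumes a single name is returned when the archive has exactly one top-level entry and a set otherwise; `top_level_entries` and `build_tar` are hypothetical names, not part of this module.

```python
import io
import tarfile


def top_level_entries(tar):
    """Return the single top-level name, or the set of names if several."""
    roots = set()
    for member in tar.getmembers():
        name = member.name
        if name.startswith('./'):
            name = name[2:]
        if name and name != '.':
            # Member names inside a tar archive use forward slashes.
            roots.add(name.split('/')[0])
    if len(roots) == 1:
        return next(iter(roots))
    return roots


def build_tar(names):
    """Build an in-memory archive holding one small file per name."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w') as tar:
        for name in names:
            data = b'x'
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode='r')
```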

schrodinger.application.matsci.ml_formulations_utils.add_tracking_index_to_csv(csv_file)

Add a tracking index to the CSV file. The tracking index is the row number of each row in the CSV file.

Parameters:

csv_file (str) – The CSV file to add the tracking index to

Returns:

The CSV file with the tracking index added

Return type:

str
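
As an illustration of the idea, the sketch below writes a copy of a CSV file with a row-number column prepended. The column name `tracking_index`, the 1-based numbering, and the `_indexed` output name are all assumptions made for the example; the real function may differ on each point.

```python
import csv
import os
import tempfile

TRACKING_COLUMN = 'tracking_index'  # hypothetical column name


def add_tracking_index(csv_file):
    """Write a copy of csv_file with a 1-based row-number column prepended."""
    root, ext = os.path.splitext(csv_file)
    out_file = f'{root}_indexed{ext}'  # hypothetical output naming
    with open(csv_file, newline='') as src, \
            open(out_file, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        fieldnames = [TRACKING_COLUMN] + list(reader.fieldnames or [])
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row_number, row in enumerate(reader, start=1):
            row[TRACKING_COLUMN] = row_number
            writer.writerow(row)
    return out_file


def demo():
    """Index a two-row CSV in a temporary directory and return the rows."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'formulations.csv')
        with open(path, 'w', newline='') as handle:
            handle.write('smiles,wt_pct\nCCO,40\nO,60\n')
        with open(add_tracking_index(path), newline='') as handle:
            return list(csv.DictReader(handle))
```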

schrodinger.application.matsci.ml_formulations_utils.remove_group_info_rows(input_df)

Remove the group (mixture of mixtures) info rows from the input dataframe if they are present.

Parameters:

input_df (pandas.DataFrame) – The input dataframe to remove the group info rows from

Return type:

pandas.DataFrame

Returns:

The input dataframe with the group info rows removed

schrodinger.application.matsci.ml_formulations_utils.merge_input_data_to_predicted(input_csv_file, predicted_csv_file)

Merge the input data into the predicted data, matching rows on the tracking index

Parameters:
  • input_csv_file (str) – The input CSV file to merge

  • predicted_csv_file (str) – The predicted CSV file to merge
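
A merge keyed on a tracking index might look like the sketch below. It works on open file-like objects rather than file paths to stay self-contained, and it assumes a column named `tracking_index` and that predicted values win when a column name appears in both files. Neither assumption comes from this module.

```python
import csv
import io

TRACKING_COLUMN = 'tracking_index'  # hypothetical column name


def merge_on_tracking_index(input_csv, predicted_csv, key=TRACKING_COLUMN):
    """Attach the input columns to each predicted row with the same key."""
    inputs = {row[key]: row for row in csv.DictReader(input_csv)}
    merged = []
    for predicted in csv.DictReader(predicted_csv):
        combined = dict(inputs.get(predicted[key], {}))
        combined.update(predicted)  # predicted values win on a name clash
        merged.append(combined)
    return merged


# The predicted rows arrive in a different order than the input rows.
input_csv = io.StringIO(
    'tracking_index,smiles,note\n1,CCO,batch A\n2,O,batch B\n')
predicted_csv = io.StringIO(
    'tracking_index,viscosity_pred\n2,0.89\n1,1.07\n')
merged_rows = merge_on_tracking_index(input_csv, predicted_csv)
```

Keying on the index rather than on row position keeps the merge correct even if the prediction step reorders or drops rows.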

schrodinger.application.matsci.ml_formulations_utils.check_model_version(model)

Get the release version stored in the release_version.txt file in the model and check if it matches the current release version.

Parameters:

model (str) – Path to the model file

Returns:

Whether the release versions match

Return type:

bool
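
The check can be pictured as reading release_version.txt out of the tarred model and comparing it with the running release. In this sketch the current version is passed in explicitly, which is an assumption; the helper names and the version strings in `demo` are hypothetical.

```python
import io
import os
import tarfile
import tempfile

VERSION_FILE = 'release_version.txt'  # file name from the description above


def stored_release_version(model):
    """Return the stripped contents of the version file, or None if absent."""
    with tarfile.open(model) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.split('/')[-1] == VERSION_FILE:
                return tar.extractfile(member).read().decode('utf-8').strip()
    return None


def versions_match(model, current_version):
    """Compare the stored version with the caller-supplied current one."""
    return stored_release_version(model) == current_version


def write_model(path, version_text):
    """Create a tiny gzip-compressed model holding only the version file."""
    data = version_text.encode('utf-8')
    info = tarfile.TarInfo(name=f'model/{VERSION_FILE}')
    info.size = len(data)
    with tarfile.open(path, 'w:gz') as tar:
        tar.addfile(info, io.BytesIO(data))


def demo():
    """Return the match results for a matching and a non-matching version."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = os.path.join(tmp_dir, 'model.tar.gz')
        write_model(model, 'release-A\n')
        return (versions_match(model, 'release-A'),
                versions_match(model, 'release-B'))
```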

schrodinger.application.matsci.ml_formulations_utils.get_and_setup_logger()

Get the global logger, and move its handler to the root logger. Needed for correct logging when calling ligand_ml.

Return type:

logging.Logger

Returns:

The global logger
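
Moving handlers from one logger to the root logger can be done with the standard logging module, roughly as in this sketch. It is a generic illustration, not this function's code, and `move_handlers_to_root` is a hypothetical name. A plausible motivation is that records emitted by other packages propagate to the root logger and so reach the same handler.

```python
import logging


def move_handlers_to_root(logger):
    """Detach every handler from logger and attach it to the root logger."""
    root = logging.getLogger()
    # Iterate over a copy because removeHandler mutates logger.handlers.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        root.addHandler(handler)
    return logger
```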

schrodinger.application.matsci.ml_formulations_utils.set_mpo_criteria(model_path, ml_data)

Set the MPO criteria for the given model data. If the MPO criteria file is not found in the model path, use defaults and guess the MPO ranges.

Parameters:
  • model_path (str) – Path to the model file

  • ml_data (ml_formulations_gui_utils.MLModelData) – The corresponding model object

class schrodinger.application.matsci.ml_formulations_utils.BaseCSVReader

Bases: object

Base class for reading formulation CSV files

__init__()

Create an instance

setRequiredProps(required_props)

Set the required properties for reading the CSV.

Parameters:

required_props (list) – The list of required properties that must be present in the CSV file. If None, no properties are required

setGrpRequiredProps(required_props)

Set the required properties for reading the mixtures CSV.

Parameters:

required_props (list) – The list of required properties that must be present in the mixtures CSV file. If None, no properties are required

static validateComponent(component, is_mixture=False)

Validate that an input component is valid for the formulation. If the component is for a simple formulation, the component name must have a valid SMILES string. If the component is for a mixture, the component name must NOT be a valid SMILES string.

Parameters:
  • component (ml_gutils.Ingredient) – The component string

  • is_mixture (bool) – Whether the component is from a mixture (complex formulation)

Return type:

str

Returns:

The component string

Raises:

ValueError – If the component is invalid

validateHeader(header)

Validate the header of the csv file

Parameters:

header (list) – The list of header values in the csv file

Raises:

ValueError – If any of the required headers are not found
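
The header check amounts to comparing the CSV header against the required properties. The standalone function below is a sketch of that comparison, assuming that None (or an empty list) means nothing is required; the error message wording is made up for the example.

```python
def validate_header(header, required_props=None):
    """Raise ValueError listing every required column missing from header."""
    if not required_props:
        return  # nothing is required
    missing = [prop for prop in required_props if prop not in header]
    if missing:
        raise ValueError('Missing required columns: ' + ', '.join(missing))
```

Collecting all missing names before raising lets the user fix the file in one pass instead of one column at a time.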

static validateGrpCSVHeader(header, required_props)

Validate the header of the groups CSV file

Parameters:
  • header (list) – The list of header values in the csv file

  • required_props (list) – The list of required properties

Raises:

ValueError – If any of the required headers are not found

getFormulationsFromCSV(csv_reader, skip_props=None)

Get the formulations from the CSV reader

Parameters:
  • csv_reader (csv.DictReader) – The csv reader object

  • skip_props (list) – The list of properties to skip

static getCSVEncoding(filename)

Get the encoding of the CSV file

Parameters:

filename (str) – The filename of the CSV file

Returns:

The encoding of the CSV file

Return type:

str
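
How the encoding is detected is not documented here. One simple approach, shown only as an assumption-laden sketch, is to look for a byte-order mark and otherwise fall back from UTF-8 to Latin-1. The function names are hypothetical, and a real implementation may use a detection library instead.

```python
import codecs


def guess_encoding(raw):
    """Guess a text encoding for raw bytes from the BOM or a UTF-8 decode."""
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 can decode any byte sequence, so it is the last resort.
        return 'latin-1'
    return 'utf-8'


def guess_csv_encoding(filename):
    """Read the file as bytes and guess its encoding."""
    with open(filename, 'rb') as handle:
        return guess_encoding(handle.read())
```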

readFormulationsFromCSV(filename, detect_encoding=False)

Read the formulations from the CSV file. Detect the encoding if requested.

Parameters:
  • filename (str) – The filename of the CSV file

  • detect_encoding (bool) – Whether to detect the encoding of the CSV file

Returns:

The list of formulations

Return type:

list(FormulationData)

readCSVData(filename)

Read the data from the CSV file

Parameters:

filename (str) – The filename of the CSV file

Returns:

The list of formulations

Return type:

list(FormulationData)

readCSVIOData(csv_io, skip_props=None)

Read the data from the CSV StringIO object

Parameters:
  • csv_io (io.StringIO) – The StringIO object of the CSV file

  • skip_props (list) – The list of properties to skip

Returns:

The list of formulations

Return type:

list(FormulationData)

schrodinger.application.matsci.ml_formulations_utils.clear_model_cache()

Clear all cached model extractions.

schrodinger.application.matsci.ml_formulations_utils.read_file_from_model_nocache(model, filename, match_basename=True)

Get the contents of a file that is inside the model, without using the model cache.

Returns:

The file contents as a StringIO object

Return type:

StringIO

schrodinger.application.matsci.ml_formulations_utils.read_file_from_model(model, filename, match_basename=True)

Get the contents of a file that is inside the model

Parameters:
  • model (str) – The path to the model

  • filename (str) – The name of the file to get from the model

  • match_basename (bool) – If True, match filename against the basename of each member, which is useful when searching for a file in roots. If False, match filename as part of the member path, which is useful when searching for a file in a specific directory. In most testing the member names always use forward slashes, so do not use os.sep when passing a full path

Returns:

The file contents as a StringIO object

Return type:

StringIO
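
The two matching modes can be illustrated with the standard tarfile module. This sketch skips the model cache entirely, returns None when nothing matches, and assumes UTF-8 text; all three points are assumptions, and `read_member_text` is a hypothetical name.

```python
import io
import os
import tarfile
import tempfile


def read_member_text(model, filename, match_basename=True):
    """Return the first matching member as a StringIO, or None."""
    with tarfile.open(model) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            if match_basename:
                # Member names use forward slashes regardless of platform.
                found = member.name.split('/')[-1] == filename
            else:
                found = filename in member.name
            if found:
                text = tar.extractfile(member).read().decode('utf-8')
                return io.StringIO(text)
    return None


def demo():
    """Read the same basename from two different directories."""
    files = {
        'model/options.json': '{"top": true}',
        'model/sub/options.json': '{"top": false}',
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = os.path.join(tmp_dir, 'model.tar.gz')
        with tarfile.open(model, 'w:gz') as tar:
            for name, text in files.items():
                data = text.encode('utf-8')
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        by_basename = read_member_text(model, 'options.json')
        by_path = read_member_text(
            model, 'sub/options.json', match_basename=False)
        return by_basename.getvalue(), by_path.getvalue()
```

With basename matching, the first member named options.json wins; path matching is needed to reach the copy in the subdirectory.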

schrodinger.application.matsci.ml_formulations_utils.read_json_from_model(*args, **kwargs)

Get the Python objects from a json file that is inside the model

Parameters:
  • model (str) – The path to the model

  • filename (str) – The name of the file to get from the model

  • match_basename (bool) – If True, match filename against the basename of each member, which is useful when searching for a file in roots. If False, match filename as part of the member path, which is useful when searching for a file in a specific directory. In most testing the member names always use forward slashes, so do not use os.sep when passing a full path

Return type:

object or None

Returns:

The contents of the json file converted to a Python object, or None if the file was empty or couldn’t be read

schrodinger.application.matsci.ml_formulations_utils.read_csv_from_model(*args, **kwargs)

A generator for the contents of a csv file that is inside the model

Parameters:
  • model (str) – The path to the model

  • filename (str) – The name of the file to get from the model

  • match_basename (bool) – If True, match filename against the basename of each member, which is useful when searching for a file in roots. If False, match filename as part of the member path, which is useful when searching for a file in a specific directory. In most testing the member names always use forward slashes, so do not use os.sep when passing a full path

schrodinger.application.matsci.ml_formulations_utils.find_filename_in_model(model, file_ending)

Find the name of a file in the model based on the file ending

Parameters:
  • model (str) – The path to the model

  • file_ending – The ending of the file name to look for

Return type:

str or None

Returns:

The name of the first file found with that ending, or None if no such filename was found

schrodinger.application.matsci.ml_formulations_utils.is_column_prediction(column)

Check if the column name is a prediction column

Parameters:

column (str) – The column name

Returns:

Whether the column name is a predicted value

Return type:

bool

schrodinger.application.matsci.ml_formulations_utils.get_additional_and_stacked_features(model)

Get the additional and stacked features from the model.

Parameters:

model (str) – The path to the trained model

Returns:

Tuple of the additional features and the stacked features

Return type:

tuple(list(str), list(str))

schrodinger.application.matsci.ml_formulations_utils.get_weight_pct_and_thickness_cols(model)

Extract weight percentage and thickness columns from model’s additional features. These columns are needed for proper feature alignment during prediction.

Parameters:

model (str) – The model path

Returns:

Set of wt% and thickness column names from the model

Return type:

set[str]

schrodinger.application.matsci.ml_formulations_utils.add_file_to_model(model, file_to_add, model_name=None, new_file_name=None)

Add a file to the tarred model. The model_name should be the name of the top-level directory when extracting the model. If it is not given, it will be found using get_top_level_dir. Specify new_file_name if the file should be re-named when moved into the model.

Parameters:
  • model (str) – The path to the tarred model

  • file_to_add (str) – The path to the file to add

  • model_name (str) – The name of the model (top-level directory when extracted)

  • new_file_name (str) – The new name for the file within the model
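
Compressed tar archives cannot be appended to in place, so one way to add a file is to rewrite the archive. The sketch below assumes a non-empty, gzip-compressed model and takes the top-level directory from the first member when model_name is not given. It is an illustration with hypothetical names, not this function's implementation.

```python
import io
import os
import tarfile
import tempfile


def add_file_to_tar(model, file_to_add, model_name=None, new_file_name=None):
    """Rewrite the gzip archive with file_to_add placed under model_name."""
    arc_base = new_file_name or os.path.basename(file_to_add)
    tmp_path = model + '.tmp'
    with tarfile.open(model) as src, tarfile.open(tmp_path, 'w:gz') as dst:
        members = src.getmembers()
        if model_name is None:
            model_name = members[0].name.split('/')[0]
        for member in members:
            # Only regular files carry data; directories are copied as-is.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
        dst.add(file_to_add, arcname=f'{model_name}/{arc_base}')
    os.replace(tmp_path, model)


def demo():
    """Add a renamed file to a one-member model and list the result."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = os.path.join(tmp_dir, 'model.tar.gz')
        data = b'weights'
        info = tarfile.TarInfo(name='my_model/weights.bin')
        info.size = len(data)
        with tarfile.open(model, 'w:gz') as tar:
            tar.addfile(info, io.BytesIO(data))
        extra = os.path.join(tmp_dir, 'notes.txt')
        with open(extra, 'w') as handle:
            handle.write('hello')
        add_file_to_tar(model, extra, new_file_name='README.txt')
        with tarfile.open(model) as tar:
            return sorted(tar.getnames())
```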

class schrodinger.application.matsci.ml_formulations_utils.FormulationDriverMixin

Bases: object

Mixin class for the ML formulations driver

TRAINING_RELEASE_VERSION_MSG = 'Model trained with version:'

temporary_jobname(new_jobname)

Context manager to temporarily change the jobname for model operations.

Parameters:

new_jobname (str) – The temporary jobname to use
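
A context manager of this kind is commonly written with contextlib, as in the sketch below. The `jobname` attribute and the `DriverSketch` class are assumptions made for the example; the mixin may store the job name differently.

```python
from contextlib import contextmanager


class DriverSketch:
    """Hypothetical driver holding a job name."""

    def __init__(self, jobname):
        self.jobname = jobname

    @contextmanager
    def temporary_jobname(self, new_jobname):
        """Swap in new_jobname and restore the original on exit."""
        original = self.jobname
        self.jobname = new_jobname
        try:
            yield
        finally:
            # Restore even if the body raises.
            self.jobname = original
```

Used as `with driver.temporary_jobname('predict_job'): ...`, the original name comes back once the block exits, including when the block raises.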

addInputJsonToModel(model, options_override=None)

Add the input options to the tarred model

Parameters:
  • model (str) – The path to the model

  • options_override (dict) – Optional dict to override specific options

addFileToModel(model, file_to_add)

Add a file to the tarred model

Parameters:
  • model (str) – The path to the model

  • file_to_add (str) – The path to the file to add to the model

addReleaseVersionToModel(model)

Call get_release_name and add the release version to the model

Parameters:

model (str) – The path to the model

run()

Run the formulation machine learning workflow

schrodinger.application.matsci.ml_formulations_utils.validate_basic_options(options, parser)

Validate the basic flag parser options for the ML formulations

Parameters:
  • options (argparse.Namespace) – The options from the parser

  • parser (argparse.ArgumentParser) – The parser object

Raises:

argparse.ArgumentError – If any of the required options are not provided

schrodinger.application.matsci.ml_formulations_utils.add_basic_model_to_job_spec(job_builder, options, parser)

Basic job spec for the ML formulations and related workflows

Parameters:
  • job_builder (schrodinger.application.matsci.jobutils.JobBuilder) – The job builder object

  • options (argparse.Namespace) – The options from the parser

  • parser (argparse.ArgumentParser) – The parser object

schrodinger.application.matsci.ml_formulations_utils.is_hidden_feature(feature_name)

Check if a feature is hidden based on its name for OLED formulations.

Parameters:

feature_name (str) – The name of the feature

Returns:

True if the feature is hidden, False otherwise

Return type:

bool