schrodinger.active_learning.al_utils module

schrodinger.active_learning.al_utils.positive_int(s)

ArgumentParser type function that checks whether the input can be converted to a positive integer.

Parameters

s (str) – input string

Returns

integer value of input string

Return type

int
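
A type function of this kind is normally passed to argparse through the type= argument. The sketch below is an assumption written for illustration, not the module's implementation: positive_int_sketch and the --num_iter flag are hypothetical names.

```python
import argparse


def positive_int_sketch(s: str) -> int:
    """Illustrative stand-in for positive_int; not the al_utils code."""
    try:
        value = int(s)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{s!r} is not an integer")
    if value <= 0:
        raise argparse.ArgumentTypeError(f"{s!r} is not a positive integer")
    return value


parser = argparse.ArgumentParser()
# "--num_iter" is a hypothetical flag used only for this example.
parser.add_argument("--num_iter", type=positive_int_sketch, default=1)
args = parser.parse_args(["--num_iter", "3"])
```

Raising argparse.ArgumentTypeError lets argparse report the problem as a normal usage error instead of a traceback.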

schrodinger.active_learning.al_utils.split_smi_line(line: str) Optional[Tuple[str, str]]

Split a line from a .smi file into its SMILES pattern and title. Return an empty list if the line is empty.

Parameters

line – line from .smi file

Returns

SMILES pattern, title

Return type

[str, str] or []
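
As a rough illustration of the splitting step, the sketch below assumes the SMILES and title are separated by whitespace and that the title may itself contain spaces. It is not the module's code. It returns None for an empty line, following the annotated return type; the text above describes an empty list instead.

```python
from typing import Optional, Tuple


def split_smi_line_sketch(line: str) -> Optional[Tuple[str, str]]:
    """Illustrative stand-in for split_smi_line; not the al_utils code."""
    stripped = line.strip()
    if not stripped:
        return None
    # Split once on the first run of whitespace so the title keeps its spaces.
    parts = stripped.split(None, 1)
    smiles = parts[0]
    title = parts[1] if len(parts) > 1 else ""
    return smiles, title
```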

schrodinger.active_learning.al_utils.get_smi_header()

Create the header for a .smi input file. The SMILES is assumed to be in the first column and the title in the second column.

Returns

header list, header index for reordering SMILES and title

Return type

list(str), list(int)

schrodinger.active_learning.al_utils.get_csv_header(filename: str, smi_index: int, name_index: int, delimiter: str = ',', with_header: bool = True) Tuple[List[str], List[int]]

Create the header for a .csv input file. The reordered index puts SMILES in the first column and title in the second column.

Parameters
  • filename – .csv input file

  • smi_index – column index of molecule SMILES

  • name_index – column index of molecule name

  • delimiter – delimiter of input csv files

  • with_header – Whether the file has header in its first line

Returns

header list, header index for reordering SMILES and title

schrodinger.active_learning.al_utils.my_csv_reader(filename: str)

Yield a csv reader that skips the first line.

Parameters

filename – .csv file name

Returns

csv.reader that skips first line of the file.

schrodinger.active_learning.al_utils.read_score(score_file: str) dict

Read known ligand scores from score_file.

Returns

a dictionary that maps ligand title to ligand score.

schrodinger.active_learning.al_utils.random_filtering(file_list, output_name, probability, random_seed=None, with_header=True)

Randomly select lines from the files in file_list, keeping each line with the given probability. The aim is a lightweight, fast function for generating a subset for pilot runs.

Parameters
  • file_list (list) – paths of input files.

  • output_name (str) – name of the output file.

  • probability (float) – probability of selecting each line.

  • random_seed (int or None) – random seed number for shuffling the ligands

  • with_header (bool) – Whether input file(s) has header in its first line.
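
Selecting each line independently with a fixed probability can be sketched as follows. This is a minimal illustration of the idea, not the module's code: bernoulli_filter is a hypothetical name, and it works on any iterable of lines rather than on files.

```python
import random
from typing import Iterable, Iterator, Optional


def bernoulli_filter(lines: Iterable[str], probability: float,
                     random_seed: Optional[int] = None) -> Iterator[str]:
    """Yield each line independently with the given probability."""
    rng = random.Random(random_seed)
    for line in lines:
        # rng.random() is uniform on [0, 1), so 1.0 keeps every line
        # and 0.0 keeps none.
        if rng.random() < probability:
            yield line
```

The output size is only approximately probability times the number of input lines. In exchange, the input is read in a single pass and nothing is held in memory.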

schrodinger.active_learning.al_utils.get_smiles_from_al_csv_or_smi_line(line)

Get the SMILES from a line of a .smi file or of an active learning .csv file with SMILES in the first column.

schrodinger.active_learning.al_utils.reservoir_sampling(file_list, output_name, excluding_smiles=None, sample_size=100000, random_seed=None, with_header=True)

Randomly select sample_size ligands from the files in file_list. The aim is a lightweight, fast function for generating a subset for pilot runs.

Parameters
  • file_list (list) – paths of input files.

  • output_name (str) – name of the output file.

  • excluding_smiles (container) – list of smiles to skip in the sampling.

  • sample_size (int) – number of ligands to sample.

  • random_seed (int or None) – random seed number for shuffling the ligands

  • with_header (bool) – Whether input file(s) has header in its first line.
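
For readers unfamiliar with the technique named by this function, the sketch below shows classic reservoir sampling (Algorithm R) over an arbitrary iterable. It is an assumption about the general approach, not the module's implementation: reservoir_sample is a hypothetical name, and header handling and excluding_smiles are left out.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(items: Iterable[str], sample_size: int,
                     random_seed: Optional[int] = None) -> List[str]:
    """Uniformly sample up to sample_size items in one pass (Algorithm R)."""
    rng = random.Random(random_seed)
    reservoir: List[str] = []
    for index, item in enumerate(items):
        if index < sample_size:
            # Fill the reservoir with the first sample_size items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability sample_size / (index + 1).
            j = rng.randint(0, index)  # both endpoints inclusive
            if j < sample_size:
                reservoir[j] = item
    return reservoir
```

Unlike per-line filtering, this returns exactly sample_size items whenever the input has at least that many, without needing the total count in advance.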

schrodinger.active_learning.al_utils.extract_compressed_files(compressed_files, delete_original=False) list

Extract gzip-compressed files in the source directory. Only files that end with .gz (single-file compression) are extracted, and the extracted file is placed back in the source directory. If delete_original is True, the compressed file is also deleted.

Returns updated filenames after extraction
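
The behavior described above can be approximated with the standard library. The sketch below is an illustration under stated assumptions, not the module's code: gunzip_in_place is a hypothetical name, non-.gz paths are passed through unchanged, and tar archives such as .tar.gz are not treated specially.

```python
import gzip
import os
import shutil
import tempfile
from typing import List


def gunzip_in_place(compressed_files: List[str],
                    delete_original: bool = False) -> List[str]:
    """Illustrative stand-in; not the al_utils implementation."""
    updated = []
    for path in compressed_files:
        if not path.endswith(".gz"):
            # Anything that is not gzip-compressed is passed through untouched.
            updated.append(path)
            continue
        target = path[:-len(".gz")]
        # Stream the decompressed bytes so large files never sit in memory.
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if delete_original:
            os.remove(path)
        updated.append(target)
    return updated


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "ligands.csv.gz")
    with gzip.open(gz_path, "wt") as fh:
        fh.write("SMILES,Title\nCCO,ethanol\n")
    extracted = gunzip_in_place([gz_path], delete_original=True)
    extracted_names = [os.path.basename(p) for p in extracted]
    with open(extracted[0]) as fh:
        extracted_text = fh.read()
    original_still_exists = os.path.exists(gz_path)
```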

schrodinger.active_learning.al_utils.random_split(file_list: List, num_ligands: int, prefix: str = 'splited', block_size: int = 100000, name_index: int = 0, smi_index: int = 1, random_seed: Optional[int] = None, delimiter: str = ',', with_header: bool = True)

Combine the input files, shuffle the lines, and split them into files with block_size lines per file. Reorder the columns so that SMILES and name are in the first and second columns, respectively.

Parameters
  • file_list (list) – paths of input files.

  • num_ligands (int) – total number of ligands in all the input files.

  • prefix (str) – prefix of split files

  • block_size (int) – number of ligands in each sub .csv file.

  • name_index (int) – column index of molecule name

  • smi_index (int) – column index of molecule SMILES

  • random_seed (int or None) – random seed number for shuffling the ligands

  • delimiter (str) – delimiter of input csv files

  • with_header (bool) – Whether input file(s) has header in its first line.

Returns

list of split files, reordered csv header

Return type

list, list

schrodinger.active_learning.al_utils.convert_tar_gz_to_qzip_model(tar_gz_model, qzip_model, job_args_json)

Convert .tar.gz ligand_ml model to .qzip model.

Parameters
  • tar_gz_model (str) – input .tar.gz ligand_ml model file.

  • qzip_model – output .qzip model file.

  • job_args_json (str) – file containing the arguments needed for deepautoqsar.

schrodinger.active_learning.al_utils.convert_qzip_to_tar_gz_model(qzip_model)

Convert .qzip deepautoqsar model to .tar.gz ligand_ml model.

Parameters

qzip_model (str) – .qzip deepautoqsar model filename.

Returns

.tar.gz ligand_ml model filename.

Return type

str

schrodinger.active_learning.al_utils.get_zip_contents(zip_filename: str) list

Get a list of zip contents from a zip file without extraction

Parameters

zip_filename – .zip filename

Returns

list of zip contents
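
Listing an archive without extracting it is available in the standard library through zipfile.ZipFile.namelist. The sketch below shows that call; list_zip_contents and the demo file names are hypothetical, and whether al_utils uses the same call is an assumption.

```python
import os
import tempfile
import zipfile
from typing import List


def list_zip_contents(zip_filename: str) -> List[str]:
    """Return member names from the archive's directory; nothing is extracted."""
    with zipfile.ZipFile(zip_filename) as archive:
        return archive.namelist()


# Build a small archive in a temporary directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "ligands.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("part_1.csv", "SMILES,Title\nCCO,ethanol\n")
        archive.writestr("part_2.csv", "SMILES,Title\nCCN,ethylamine\n")
    demo_names = list_zip_contents(demo_zip)
```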

schrodinger.active_learning.al_utils.get_file_ext(filename: str) str

Get the extension of the file name. Skip ‘gz’ if it is a gz compressed file.

Parameters

filename – name of the file.

Returns

‘gz’ excluded extension of the file.
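
One way to skip a trailing gz suffix is to apply os.path.splitext twice. The sketch below is illustrative only: get_file_ext_sketch is a hypothetical name, and the text above does not say whether the real function returns the leading dot, so dropping it here is an assumption.

```python
import os


def get_file_ext_sketch(filename: str) -> str:
    """Illustrative stand-in for get_file_ext; not the al_utils code."""
    root, ext = os.path.splitext(filename)
    if ext.lower() == ".gz":
        # Look one level deeper: "ligands.csv.gz" -> ".csv".
        ext = os.path.splitext(root)[1]
    return ext.lstrip(".")
```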

schrodinger.active_learning.al_utils.check_driver_disk_space(active_learning_job)

Estimate the driver disk usage of an active learning job with some assumed parameters. Print a warning if the available driver disk space is smaller than the estimated usage. The environment variable SCRATCH_MAX_VOLUME_SIZE_GB specifies the maximum amount of scratch space that will be available on a driver instance managed by Schrodinger’s virtual cluster setup. If SCRATCH_MAX_VOLUME_SIZE_GB is set, it is used as the maximum volume size, or free disk space, in GB. Otherwise, the free disk space is calculated from the driver’s current free disk space, assuming the disk cannot be rescaled to add more space.

Parameters

active_learning_job (ActiveLearningJob instance.) – current AL driver.

schrodinger.active_learning.al_utils.node_run_timer(func)

Decorator that times the runNode method of ActiveLearningNode.
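
A timing decorator of this kind usually looks like the sketch below. It is a generic pattern, not the module's implementation: run_timer and DemoNode are hypothetical names, and printing the elapsed time is an assumption about how the result is reported.

```python
import functools
import time


def run_timer(func):
    """Report how long the wrapped callable took, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} finished in {elapsed:.2f} s")
    return wrapper


class DemoNode:
    """Hypothetical stand-in for ActiveLearningNode."""

    @run_timer
    def runNode(self):
        return "done"
```

functools.wraps keeps the original method name and docstring, which matters when the name appears in logs.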

schrodinger.active_learning.al_utils.add_output_file(*output_files, incorporate=False)

Add files to jobcontrol output files.

Parameters
  • output_files (str) – files to be transferred.

  • incorporate (bool) – mark files for incorporation by Maestro.

schrodinger.active_learning.al_utils.add_input_file(jsb, *input_files)

Check that each input file exists. Add it as a jobcontrol input file if it does; otherwise exit with an error.

Parameters

schrodinger.active_learning.al_utils.concatenate_logs(combined_logfile, subjob_logfile_list, logger=None)

Combine subjob logfiles into single combined logfile.

Parameters
  • combined_logfile (str) – combined log file name

  • subjob_logfile_list (list(str)) – list of subjob logfile names to be combined.

  • logger (Logger or None) – logger for receiving the info and error message.

schrodinger.active_learning.al_utils.get_host_ncpu()

Return the host and the number of CPUs that should be used to submit subjobs. This function works both under job control and outside it.

Return type

tuple[str, int]

schrodinger.active_learning.al_utils.is_hostname_valid(hostname: str) bool

Check whether the hostname is defined in the host file.

Parameters

hostname – the hostname to check against

Returns

Whether the hostname is defined in the host file.

schrodinger.active_learning.al_utils.validate_input_files(input_files: List[str], remote_input_ligands: bool = False, allowed_format: List[str] = None) Optional[str]

Check the existence and format of input files. Return error message if validation failed, otherwise return None.

Parameters
  • input_files – paths of input files.

  • remote_input_ligands – Whether input ligand files are located at remote.

  • allowed_format – allowed input file formats.

Returns

error message if validation failed; None if it passed

schrodinger.active_learning.al_utils.validate_input_mae(input_files: list, max_check: int = 10) Optional[str]

Validate structures in Maestro file(s).

Parameters
  • input_files – list of path(s) to Maestro file(s).

  • max_check – maximum number of structures to validate.

Returns

error message if validation fails. None if validation passes.

schrodinger.active_learning.al_utils.yield_chunk(frame, length, func=None)

schrodinger.active_learning.al_utils.validate_csv_header(header_str)

schrodinger.active_learning.al_utils.open_maybe_compressed(filename: str, *a, **d) io.IOBase

Extends fileutils.open_maybe_compressed with bzip2 and gzip support.

Open a file, using the gzip module if the filename ends in gz, using bz2 module if the filename ends in bz2, or default builtin open otherwise.
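
The suffix-based dispatch described above can be sketched with the standard gzip and bz2 modules. This is an illustration, not the module's code: open_by_suffix is a hypothetical name, and defaulting to text mode ("rt") is an assumption made for the example.

```python
import bz2
import gzip
import os
import tempfile


def open_by_suffix(filename: str, mode: str = "rt", **kwargs):
    """Illustrative stand-in; not the al_utils implementation."""
    if filename.endswith("gz"):
        return gzip.open(filename, mode, **kwargs)
    if filename.endswith("bz2"):
        return bz2.open(filename, mode, **kwargs)
    return open(filename, mode, **kwargs)


# Write and read back the same text through each of the three code paths.
round_trips = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "a.csv.gz", "a.csv.bz2"):
        path = os.path.join(tmp, name)
        with open_by_suffix(path, "wt") as fh:
            fh.write("CCO,ethanol\n")
        with open_by_suffix(path, "rt") as fh:
            round_trips[name] = fh.read()
```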

schrodinger.active_learning.al_utils.validate_input_smiles(input_files: list, smi_index: int, name_index: int, with_header: bool = True, max_check: int = 10, check_csv_header: bool = False) str

Validate SMILES in input files.

Parameters
  • input_files – paths of input files.

  • smi_index – column index of molecule SMILES

  • name_index – column index of molecule name

  • with_header – Whether the file has header in its first line

  • max_check – maximum number of SMILES to validate

  • check_csv_header – whether to check the csv headers

Returns

error message if validation failed; None if it passed

schrodinger.active_learning.al_utils.validate_input_zipfile(zip_filename: str, remote_input_ligands: bool) str

Quick two-step validation of zipped input file

  1. Checks for valid filename of zip archive

  2. Without extracting files from zip archive, confirm zipped contents are CSV formatted

schrodinger.active_learning.al_utils.store_mae_to_db(db_filename, mae_file_list)

Store structures from .mae files in a sqlite3 database.

Parameters
  • db_filename (str) – path of the sqlite3 database

  • mae_file_list (list(str)) – list of .mae files that contain the structures to be stored to the database

schrodinger.active_learning.al_utils.write_st_from_db_by_smiles(db_filename, out_mae_file, smi_list, chunk_size=500)

Extract the ligands’ structures from the database. Write the structures to the output .mae file.

Parameters
  • db_filename (str) – path of the sqlite3 database containing ligands’ structure

  • out_mae_file (str) – path of the output .mae/.maegz file

  • smi_list (list(str)) – list of input ligands’ SMILES

  • chunk_size (int) – number of SMILES in each query
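
The chunk_size parameter suggests that the SMILES list is queried in batches. The sketch below shows that pattern with the standard sqlite3 module. The table name ligands, the columns smiles and structure_text, and the function fetch_by_smiles are all hypothetical, since the actual database schema is not described here.

```python
import sqlite3
from typing import Iterator, List, Tuple


def fetch_by_smiles(conn: sqlite3.Connection, smi_list: List[str],
                    chunk_size: int = 500) -> Iterator[Tuple[str, str]]:
    """Yield (smiles, structure_text) rows, querying chunk_size SMILES at a time."""
    for start in range(0, len(smi_list), chunk_size):
        chunk = smi_list[start:start + chunk_size]
        placeholders = ",".join("?" for _ in chunk)
        # Hypothetical table and column names.
        sql = ("SELECT smiles, structure_text FROM ligands "
               f"WHERE smiles IN ({placeholders})")
        yield from conn.execute(sql, chunk)


# In-memory demonstration database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ligands (smiles TEXT PRIMARY KEY, structure_text TEXT)")
conn.executemany("INSERT INTO ligands VALUES (?, ?)",
                 [("CCO", "mae-1"), ("CCN", "mae-2"), ("CCC", "mae-3")])
found = dict(fetch_by_smiles(conn, ["CCO", "CCC", "c1ccccc1"], chunk_size=2))
conn.close()
```

One common reason for batching is that SQLite caps the number of bound parameters in a single statement, so a very long IN list has to be split.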

schrodinger.active_learning.al_utils.add_file_to_aljob_restart_dict(active_learning_job, optional_restart_file, jobname)

Add a file to the optional_restart_files_dict of the current active learning job. Only register the file with jobcontrol if active_learning_job is None.

Parameters
  • active_learning_job (ActiveLearningJob instance or None.) – current AL driver.

  • optional_restart_file (str) – path of the file to be added

  • jobname (str) – key of the list that contains the optional_restart_file

schrodinger.active_learning.al_utils.read_scored_ligands(scored_csv_file_list)

Read the ligands that were already scored by ScoreProviderNode.

Parameters

scored_csv_file_list (list(str)) – list of ligand_ml training .csv files.

Returns

set of SMILES of the scored ligands.

Return type

set(str)

schrodinger.active_learning.al_utils.count_ligands(file_list: List[str], with_header: bool = True) int

Count the number of ligands in all the files by counting the total number of lines. We assume each line contains a SMILES string.

Parameters
  • file_list – list of input file paths.

  • with_header – Whether the input files have header.

Returns

Number of ligands in all the input files.

schrodinger.active_learning.al_utils.convert_csv_to_smi(csv_file: str, smi_file: str, smiles_column: int = 0, title_column: int = 1)

Convert a .csv ligand file to a .smi file. The default assumes that SMILES and Title are in the first and second columns of the .csv file respectively.

Parameters
  • csv_file – path of the .csv file to be converted

  • smi_file – path of the output .smi file

  • smiles_column – column number of the SMILES in the input .csv. zero indexed.

  • title_column – column number of the Title in the input .csv. zero indexed.

schrodinger.active_learning.al_utils.yield_chunkable(file_path: str, batch_size: int, shuffle: bool = False)

A generator that reads a file in batches of structures.

Parameters
  • file_path – Path to the input file.

  • batch_size – Number of structures per batch.

  • shuffle – Whether to shuffle the structures in each batch.

Yield

batch iterable
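
Batching with an optional per-batch shuffle can be sketched as a plain generator. The example below works on any iterable rather than on a structure file, so it illustrates the pattern only: yield_batches and its random_seed argument are hypothetical and not part of the signature above.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def yield_batches(items: Iterable[T], batch_size: int, shuffle: bool = False,
                  random_seed: Optional[int] = None) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items; the last batch may be shorter."""
    rng = random.Random(random_seed)
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            if shuffle:
                rng.shuffle(batch)  # shuffles within the batch only
            yield batch
            batch = []
    if batch:
        if shuffle:
            rng.shuffle(batch)
        yield batch
```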

schrodinger.active_learning.al_utils.remove_duplicates(input_file: str, title_column: int = 1) str

Take a .smi file and remove duplicate entries that share a title but may have different SMILES. One possible use case is an input file in which the same ligand appears in different charge states or poses. The function keeps the first row for each unique title and discards the rest.

Parameters
  • input_file – path of the input file.

  • title_column – column of the title entry in the .csv or .smi file, zero indexed.

Returns

path of the output file with duplicates removed. Same as the input file name with _unique attached at the end.
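
The keep-the-first-row rule can be sketched on rows already split into columns. This is an illustration of the rule only: first_row_per_title is a hypothetical name, and file reading and the _unique output naming are left out.

```python
from typing import Iterable, Iterator, List


def first_row_per_title(rows: Iterable[List[str]],
                        title_column: int = 1) -> Iterator[List[str]]:
    """Yield only the first row seen for each distinct title."""
    seen = set()
    for row in rows:
        title = row[title_column]
        if title in seen:
            continue
        seen.add(title)
        yield row


rows = [
    ["CC(=O)O", "acetate"],
    ["CC(=O)[O-]", "acetate"],  # same title, different charge state
    ["CCO", "ethanol"],
]
unique_rows = list(first_row_per_title(rows))
```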

schrodinger.active_learning.al_utils.split_lig(lig_filename: str, output_prefix: str, nstruct: int) List[str]

Split structures in a .mae file into batches. Uses contextlib to manage a variable number of open files.

Parameters
  • lig_filename – path of the .mae file to be split

  • output_prefix – prefix of the split output ligand files

  • nstruct – number of ligands per batch

Returns

list of batched .mae files

schrodinger.active_learning.al_utils.get_allowed_ncpu(user_specified_ncpu)

Return the number of allowed CPUs for a job.

Parameters

user_specified_ncpu (int or None) – user specified maximum number of CPUs

Returns

number of allowed CPUs

Return type

int

schrodinger.active_learning.al_utils.generate_mae_file_with_unique_title(input_mae_file, output_mae_file)

Convert the input .mae file to output .mae file that contains unique titles.

Parameters
  • input_mae_file (str) – path of input .mae file

  • output_mae_file (str) – path of output .mae file containing ligands with unique title

schrodinger.active_learning.al_utils.read_all_st_from_file(st_file)

Return all the structures in a file as a list.

Parameters

st_file (str) – path of the input file

Returns

list of structures in the file

Return type

list(structure.Structure)

schrodinger.active_learning.al_utils.my_file_exists(filename)

A version of os.path.isfile that returns None if the input is None instead of raising an error.

schrodinger.active_learning.al_utils.default_args(v: str, script_name: str) dict

Return the default arguments for the given script. If None is passed as script_name, use the AL-Glide default arguments.

schrodinger.active_learning.al_utils.configure_no_input_splitting(args: argparse.Namespace) argparse.Namespace

schrodinger.active_learning.al_utils.configure_mq_run(args: argparse.Namespace) argparse.Namespace

schrodinger.active_learning.al_utils.scale_for_diversity_job(num_randomly_selected_ligands: int, max_diversity_sample_size: int, host_ncpus: int) Tuple[int, int]

Scale the number of cpus and ndim for the combinatorial diversity job based on how many ligands were randomly selected. If the number of selected ligands is less than the max sample size, scale down the user provided cpus as they are unnecessary. Note: default/minimum ndim value is 10, and can only increase based on how many random ligands were selected.

Parameters
  • num_randomly_selected_ligands – The number of randomly selected ligands.

  • max_diversity_sample_size – The max diversity sample size input.

  • host_ncpus – The number of host cpus available.

Returns

The scaled ncpus and ndim.

schrodinger.active_learning.al_utils.check_restart_host(restart_file: str, jobname: str)

Check whether the original job was launched with -HOST. If so, the

schrodinger.active_learning.al_utils.check_required_modules()

Check if the required modules are installed for active learning workflows.

schrodinger.active_learning.al_utils.set_env_variable(key, value)

A context manager to set an environment variable temporarily.

Parameters
  • key – The environment variable key.

  • value – The environment variable value to set.
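
A temporary environment override is commonly written with contextlib.contextmanager. The sketch below shows that pattern and restores the previous state on exit, including the case where the variable was not set before. temporary_env and AL_DEMO_VAR are hypothetical names, and the module's own implementation may differ.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(key: str, value: str):
    """Set os.environ[key] inside the with-block, then restore the old state."""
    previous = os.environ.get(key)
    os.environ[key] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable did not exist before, so remove it again.
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


# "AL_DEMO_VAR" is a made-up variable name for this example.
with temporary_env("AL_DEMO_VAR", "1"):
    inside_value = os.environ["AL_DEMO_VAR"]
```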