schrodinger.application.phase.packages.oned_task_utils module

Performs task-based work for the 1D similarity driver.

Copyright Schrodinger LLC, All Rights Reserved.

class schrodinger.application.phase.packages.oned_task_utils.BasicStatsAccumulator(property_name)

Bases: object

Accumulates basic statistics for a set of property values.

__init__(property_name)

Constructor taking the name of the property.

addValue(value)

Adds a numeric value to the series, updating the statistics.

property std_dev

Returns the standard deviation.
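
To illustrate the kind of running statistics such an accumulator can maintain, here is a minimal sketch using Welford's online algorithm. The class name RunningStats, its attributes and the property name in the usage lines are hypothetical and not part of this module. This reference does not say whether the real class reports a sample or a population standard deviation, so the sample form is assumed.

```python
import math


class RunningStats:
    """Hypothetical stand-in; not the Schrodinger BasicStatsAccumulator."""

    def __init__(self, property_name):
        self.property_name = property_name
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def addValue(self, value):
        # Welford's update: numerically stable, values are not stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std_dev(self):
        # Assumption: sample standard deviation (n - 1 denominator).
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


stats = RunningStats("r_user_example")  # made-up property name
for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.addValue(v)
```

Updating the mean and the squared deviations incrementally avoids holding every property value in memory, which matters when the series is long.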

schrodinger.application.phase.packages.oned_task_utils.base64_decode_fd(s)

Decodes feature definitions from a Base64 string. Feature definitions will be empty if string is empty.

Parameters

s (str) – Base64-encoded feature definition string

Returns

Feature definitions

Return type

list(phase.PhpFeatureDefinition)

schrodinger.application.phase.packages.oned_task_utils.base64_encode_fd(fd)

Encodes feature definitions to a Base64 string. String will be empty if fd is empty or None.

Parameters

fd (list(phase.PhpFeatureDefinition)) – Feature definitions

Returns

Base64-encoded feature definitions string

Return type

str

schrodinger.application.phase.packages.oned_task_utils.combine_oned_hits(hits_files_in, hits_file_out, query_row=None, sort=True, max_hits=1000, max_rows=1000000)

Combines a set of 1D hits files with or without sorting and capping.

Parameters
  • hits_files_in (list(str)) – List of compressed CSV hits files to combine

  • hits_file_out (str) – Output compressed CSV hits file

  • query_row (list(str)) – If supplied, this row is written before any hits

  • sort (bool) – Whether to write a sorted hits file

  • max_hits (int) – Cap on the number of sorted hits to output. Must not exceed MAX_CAPPED_HITS.

  • max_rows (int) – Maximum number of sorted rows to hold in memory

Returns

Number of hits written

Return type

int

Raises

ValueError if max_hits exceeds MAX_CAPPED_HITS
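
The sort-and-cap step can be pictured with the following sketch, which works on in-memory rows rather than compressed CSV files. It assumes that the similarity is the last element of each row and that hits are ranked by decreasing similarity. The helper name top_hits and the sample rows are hypothetical, and the max_rows memory limit is not modeled.

```python
import heapq


def top_hits(hit_lists, max_hits):
    """Return the max_hits rows with the highest similarity (last column)."""
    all_hits = (row for hits in hit_lists for row in hits)
    # nlargest returns the rows ordered from highest to lowest key.
    return heapq.nlargest(max_hits, all_hits, key=lambda row: float(row[-1]))


# Made-up hits from two files: SMILES, name, similarity.
hits_a = [["CCO", "ethanol", "0.91"], ["CCC", "propane", "0.40"]]
hits_b = [["CCN", "ethylamine", "0.75"], ["CO", "methanol", "0.88"]]
best = top_hits([hits_a, hits_b], max_hits=3)
best_names = [row[1] for row in best]
```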

schrodinger.application.phase.packages.oned_task_utils.create_oned_data_file(structure_file, oned_data_file, treatment=0, fd=None, props=None, logger=None, progress_interval=10000)

Creates a 1D data file from the structures in a SMILES, SMILES-CSV, Maestro or SD file.

Parameters
  • structure_file (str) – Input file of structures

  • oned_data_file (str) – Destination 1D data file (.1dbin)

  • treatment (phase.OneDTreatment) – Structure treatment for 1D representations

  • fd (list(phase.PhpFeatureDefinition) or NoneType) – Overrides default feature definitions. Relevant only when treatment is in ONED_PHARM_TREATMENTS.

  • props (list(str) or NoneType) – m2io-style properties to include in the 1D data file, other than SMILES and title. Not used when a SMILES file is supplied.

  • logger (logging.Logger or NoneType) – Logger for info level progress messages

  • progress_interval (int) – Interval between progress messages

Returns

Number of rows written to the 1D data file

Return type

int

schrodinger.application.phase.packages.oned_task_utils.create_structure_from_hypothesis(hypo_file: str) → schrodinger.structure._structure.Structure

Creates a structure composed of dummy atom fragments that will give rise to just the pharmacophore features in the supplied hypothesis. This allows the hypothesis to be used as a query when the structure treatment is ONED_TREATMENT_PHARM3D.

schrodinger.application.phase.packages.oned_task_utils.describe_oned_data_file(oned_data_file, stats=False)

Returns a string containing a description of the supplied 1D data file.

Parameters
  • oned_data_file (str) – Name of the 1D data file (.1dbin)

  • stats (bool) – Whether to report basic statistics for any numeric properties in the 1D data file

Returns

String containing the description

Return type

str

schrodinger.application.phase.packages.oned_task_utils.export_oned_data_file(oned_data_file, output_file, subset=None)

Exports rows from a 1D data file to a compressed CSV file. A subset of rows may be specified as a string of comma-separated row ranges (e.g., '1:100,200:300') or via a text file with a property name on the first line (e.g., 's_m_title' or 's_sd_Vendor_ID') and the values of that property on subsequent lines. If supplying comma-separated row ranges, the maximum row number must not exceed MAX_BITSET.

Parameters
  • oned_data_file (str) – Name of the 1D data file (.1dbin)

  • output_file (str) – Output compressed CSV file

  • subset (str) – A comma-separated list of row ranges or text file name

Returns

Number of rows exported

Return type

int

schrodinger.application.phase.packages.oned_task_utils.get_hits_file_names(jobname, nqueries)

Returns the names of the hits files the job will produce based on the number of query structures.

Parameters
  • jobname (str) – Job name

  • nqueries (int) – Number of query structures

Returns

Hits file names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_jobname(args)

Assigns the job name from SCHRODINGER_JOBNAME or from the base name of the appropriate input file.

Parameters

args (argparse.Namespace) – argparse.Namespace with command line arguments

Returns

Job name

Return type

str

schrodinger.application.phase.packages.oned_task_utils.get_master_oned_properties(filenames)

Returns the union of all properties in the provided 1D data files or compressed CSV hits files. The first three properties are always SMILES, NAME and ONED_REP_PROPERTY. If CSV hits files are supplied, the last property will always be ONED_SIM_PROPERTY. Any additional properties will appear after the first three properties. If multiple files are supplied, those additional properties will be sorted alphabetically.

Parameters

filenames (list(str)) – List of files to consider

Returns

Union of additional properties

Return type

list(str)
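
The ordering rules described above can be sketched as follows. The function name union_properties and the strings "ONED_REP" and "ONED_SIM" are placeholders, since the actual values of ONED_REP_PROPERTY and ONED_SIM_PROPERTY are not given in this reference. The sketch works on lists of property names instead of reading files, and it assumes that the additional properties keep their file order when only one file is supplied.

```python
def union_properties(per_file, fixed, sim_property=None):
    """Combine property lists: fixed first, extras next, similarity last."""
    reserved = set(fixed)
    if sim_property is not None:
        reserved.add(sim_property)
    extras = []
    for props in per_file:
        for name in props:
            if name not in reserved and name not in extras:
                extras.append(name)
    if len(per_file) > 1:
        extras.sort()  # alphabetical only when several files are combined
    tail = [sim_property] if sim_property is not None else []
    return list(fixed) + extras + tail


fixed = ["SMILES", "NAME", "ONED_REP"]  # placeholder names
hits_1 = fixed + ["r_user_b", "ONED_SIM"]
hits_2 = fixed + ["r_user_a", "ONED_SIM"]
master = union_properties([hits_1, hits_2], fixed, sim_property="ONED_SIM")
```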

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_attributes(oned_data_file)

Returns the version, structure treatment and feature definitions of the supplied 1D data file.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)

Returns

tuple of version, structure treatment and feature definitions

Return type

str, phase.OneDTreatment, list(phase.PhpFeatureDefinition)

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_names(source, prefer_cwd=False)

Returns the name of the 1D data file(s) specified in source, which may be the name of a 1D data file or a list file containing the names of one or more 1D data files.

Parameters
  • source (str) – 1D data file source specification

  • prefer_cwd (bool) – If source is a list file, setting this to True forces the use of 1D data files in the CWD if they exist, even if they also exist at the locations specified in the list file. This addresses the situation where the list file contains absolute paths that exist on the job host, but the corresponding files have been copied to the job directory. In that case, we want to be accessing only the files in the job directory.

Returns

1D data file names

Return type

list(str)
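
The effect of prefer_cwd can be sketched as below. This is only an illustration of the lookup rule described above, not the module's implementation: the helper resolve_names, its cwd argument and all paths are hypothetical, and the list file is replaced by an in-memory list of paths.

```python
import os
import tempfile


def resolve_names(listed_paths, prefer_cwd=False, cwd=None):
    """Prefer a same-named file in cwd over the listed path when requested."""
    cwd = cwd or os.getcwd()
    resolved = []
    for path in listed_paths:
        local = os.path.join(cwd, os.path.basename(path))
        if prefer_cwd and os.path.isfile(local):
            resolved.append(local)
        else:
            resolved.append(path)
    return resolved


with tempfile.TemporaryDirectory() as job_dir:
    # Simulate one file having been copied to the job directory.
    open(os.path.join(job_dir, "lib1.1dbin"), "wb").close()
    listed = ["/remote/data/lib1.1dbin", "/remote/data/lib2.1dbin"]
    expected_local = os.path.join(job_dir, "lib1.1dbin")
    names = resolve_names(listed, prefer_cwd=True, cwd=job_dir)
    unchanged = resolve_names(listed, prefer_cwd=False, cwd=job_dir)
```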

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_properties(oned_data_file)

Returns a list of the names of any additional properties stored in the supplied 1D data file. These are properties other than SMILES, NAME and the ONED_REP_PROPERTY. The list will be empty if no additional properties are stored.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)

Returns

list of additional property names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_distribution(oned_data_files, nsub)

Given a list of 1D data files to screen and the number of subjobs over which the screen is to be distributed, this function determines how to divide the 1D data files over the subjobs. A list with nsub elements is returned, where a given element holds one or more 1D data file names and the (start, stop) row limits to screen in that file. For example, if 1D data files file1.1dbin and file2.1dbin are supplied, with 1200 and 1800 rows, respectively, and nsub is 3, this function would return the following:

    [[['file1.1dbin', (0, 1000)]],                                # subjob 1
     [['file1.1dbin', (1000, 1200)], ['file2.1dbin', (0, 800)]],  # subjob 2
     [['file2.1dbin', (800, 1800)]]]                              # subjob 3

Parameters
  • oned_data_files (list(str)) – List of 1D data files

  • nsub (int) – Number of subjobs

Returns

List of lists of file name, (start, stop) limits

Return type

list(list(str, (int, int)))
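
A minimal sketch of one way to compute such a distribution is shown below. The real function takes file names and reads the row counts from the files, whereas this hypothetical helper, distribute_rows, takes (file name, row count) pairs directly. It assumes each subjob receives ceil(total / nsub) rows, with the final subjob taking whatever remains. The actual balancing rule is not documented here.

```python
def distribute_rows(row_counts, nsub):
    """row_counts: list of (file_name, n_rows) pairs; nsub: subjob count."""
    total = sum(n_rows for _, n_rows in row_counts)
    per_sub = -(-total // nsub)  # ceiling division (assumed balancing rule)
    distribution = [[] for _ in range(nsub)]
    sub = 0
    room = per_sub  # rows the current subjob can still accept
    for name, n_rows in row_counts:
        start = 0
        while start < n_rows:
            take = min(room, n_rows - start)
            distribution[sub].append([name, (start, start + take)])
            start += take
            room -= take
            if room == 0 and sub < nsub - 1:
                sub += 1
                room = per_sub
    return distribution


result = distribute_rows([("file1.1dbin", 1200), ("file2.1dbin", 1800)], 3)
```

With these inputs the sketch yields the same three-subjob layout as the example above.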

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_splits(oned_data_file, prefix_out, nfiles)

Returns a list of output file names and (start, stop) limits for physically splitting a 1D data file into a number of smaller, equal-sized files. A given element of the returned list will be of the form:

<prefix_out>_<n>.1dbin, (start, stop)

where <prefix_out>_<n>.1dbin is the nth output file to create and start, stop are the corresponding row limits in oned_data_file, with stop being non-inclusive.

Parameters
  • oned_data_file (str) – The name of the 1D data file to be split

  • prefix_out (str) – Prefix for all output 1D data files

  • nfiles (int) – The number of output 1D data files to create

Returns

List of file names and (start, stop) limits

Return type

list(str, (int, int))
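
The row limits can be sketched as follows for a file whose row count is already known. The helper name split_limits is hypothetical, and the handling of a row count that does not divide evenly (here the first few files get one extra row) is an assumption, not documented behavior.

```python
def split_limits(n_rows, prefix_out, nfiles):
    """Return [(file_name, (start, stop)), ...] with stop non-inclusive."""
    base, extra = divmod(n_rows, nfiles)
    splits = []
    start = 0
    for n in range(1, nfiles + 1):
        # Assumption: the first `extra` files receive one additional row.
        stop = start + base + (1 if n <= extra else 0)
        splits.append((f"{prefix_out}_{n}.1dbin", (start, stop)))
        start = stop
    return splits


limits = split_limits(10, "part", 3)
```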

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_row_count(oned_data_file)

Returns the number of rows in the supplied 1D data file.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)

Returns

Number of rows in the file

Return type

int

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_rows(oned_data_file, start=0, stop=None)

Generator that yields rows from the supplied 1D data file. Each row is a list of strings, where the first 3 elements are SMILES, name and 1D encoding, and any subsequent elements hold the values of additional properties stored in the 1D data file.

Parameters
  • oned_data_file (str) – Name of the 1D data file (.1dbin)

  • start (int) – 0-based starting row position

  • stop (int) – Upper limit on the rows to read. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be read. Reading proceeds until the end of the file by default.

Yields

The next row in the file

Yield type

list(str)
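
The start and stop semantics match those of Python slicing, as the sketch below illustrates with an in-memory list standing in for the 1D data file. The generator iter_rows and the sample rows are hypothetical.

```python
from itertools import islice


def iter_rows(rows, start=0, stop=None):
    """Yield rows[start:stop] lazily; stop=None reads to the end."""
    yield from islice(rows, start, stop)


# Made-up rows: SMILES, name, 1D encoding.
rows = [[f"smiles{i}", f"name{i}", f"rep{i}"] for i in range(12)]
picked = [row[1] for row in iter_rows(rows, start=5, stop=10)]
to_end = [row[1] for row in iter_rows(rows, start=10)]
```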

schrodinger.application.phase.packages.oned_task_utils.get_oned_queries(query_file)

Reads queries from the provided SMILES, SMILESCSV, Maestro, SD or Phase hypothesis file. Coordinates are not set in the case of SMILES or SMILESCSV.

Parameters

query_file (str) – Structure file containing queries

Returns

List of query structures or a list containing a single Phase hypothesis

Return type

list[structure.Structure] or list[phase.PhpHypoAdaptor]

schrodinger.application.phase.packages.oned_task_utils.get_oned_query(query, oned_data_file)

Returns a 1D representation for the provided query structure or hypothesis according to the attributes in the supplied 1D data file.

Parameters
  • query (structure.Structure or phase.PhpHypoAdaptor) – The query structure or hypothesis

  • oned_data_file (str) – Name of the input 1D data file (.1dbin)

Returns

1D representation of the query

Return type

phase.OneDRep

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_count(query_file)

Returns the number of queries in the provided file.

Parameters

query_file (str) – Name of the query file

Returns

Number of queries

Return type

int

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_row(query, oned_query_base64, master_properties)

Returns a row for the query that can be written to the top of a hits file.

Parameters
  • query (structure.Structure or phase.PhpHypoAdaptor) – The query

  • oned_query_base64 (str) – Base64-encoded 1D representation of the query

  • master_properties (list(str)) – The full list of properties being written to the hits file

Returns

Query structure row

Return type

list[str]

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_rows(queries, oned_data_files)

Given a list of queries and the 1D data files that were screened, this function returns a row for each query that can be supplied to combine_oned_hits to ensure that the query appears at the top of its associated hits file.

Parameters
  • queries (list[structure.Structure] or list[phase.PhpHypoAdaptor]) – List of query structures or a list containing a single Phase hypothesis

  • oned_data_files (list[str]) – Names of 1D data files that were screened

Returns

A row for each query

Return type

list[list[str]]

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_title(query)

Returns the title for the supplied query, which may be a structure or hypothesis.

Parameters

query (structure.Structure or phase.PhpHypoAdaptor) – Query structure or hypothesis

Returns

Title

Return type

str

schrodinger.application.phase.packages.oned_task_utils.get_property_positions(master_properties, file_properties)

Returns the position of each master property in a potentially smaller list of properties from a particular file. If a master property is not found in file_properties, the position of that property will be len(file_properties). Thus if file_pos contains the positions returned by this function, and file_row contains the property values for some row in that file, the following code can be used to construct a master row of property values that contains an empty string ('') for each missing value:

    file_row.append('')
    master_row = [file_row[pos] for pos in file_pos]

Parameters
  • master_properties (list(str)) – Master list of properties

  • file_properties (list(str)) – The list of properties from the file

Returns

Positions of master_properties within file_properties

Return type

list of int
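
The position scheme described above can be reproduced with a few lines of plain Python. The sketch below is not the module's implementation. The helper name property_positions and the property names are made up for illustration.

```python
def property_positions(master_properties, file_properties):
    """Position of each master property, or len(file_properties) if absent."""
    lookup = {name: i for i, name in enumerate(file_properties)}
    missing = len(file_properties)
    return [lookup.get(name, missing) for name in master_properties]


master = ["SMILES", "NAME", "r_user_logP", "s_user_vendor"]
file_props = ["SMILES", "NAME", "s_user_vendor"]
file_pos = property_positions(master, file_props)

file_row = ["CCO", "ethanol", "Acme"]
padded = file_row + [""]  # the extra slot supplies '' for missing values
master_row = [padded[pos] for pos in file_pos]
```

Pointing every missing property at a single trailing empty string lets the master row be built with one list comprehension and no per-value membership checks.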

schrodinger.application.phase.packages.oned_task_utils.get_rows_to_export(row_ranges)

Constructs a canvas.ChmBitset from a comma-separated list of row ranges (e.g., ‘1:100,200:300’). Input row numbers are 1-based and upper limits are inclusive. The returned bitset will have a logical size equal to the highest row number supplied, and the on positions will be 0-based. Note that the maximum logical size for a ChmBitset is MAX_BITSET, so this function assumes that users will not attempt to create individual 1D data files that would exceed the ChmBitset limit.

Parameters
  • row_ranges (str) – Comma-separated list of 1-based row ranges

Returns

Bitset with 0-based rows as the on positions

Return type

canvas.ChmBitset

Raises

ValueError if an illegal string of row ranges is supplied
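
The conversion from 1-based inclusive ranges to 0-based positions can be sketched as below. A Python set stands in for canvas.ChmBitset, and the helper name parse_row_ranges is hypothetical. The sketch accepts only tokens of the form 'first:last'; which other forms the real function treats as legal is not stated here.

```python
def parse_row_ranges(row_ranges):
    """Return (set of 0-based rows, logical size) for e.g. '1:100,200:300'."""
    on = set()
    for token in row_ranges.split(","):
        first, sep, last = token.strip().partition(":")
        if not sep or not first.isdigit() or not last.isdigit():
            raise ValueError(f"Illegal row range: '{token}'")
        lo, hi = int(first), int(last)
        if lo < 1 or hi < lo:
            raise ValueError(f"Illegal row range: '{token}'")
        # 1-based inclusive [lo, hi] becomes 0-based positions lo-1 .. hi-1.
        on.update(range(lo - 1, hi))
    return on, max(on) + 1  # logical size equals the highest row number


rows_on, logical_size = parse_row_ranges("1:3,5:6")
```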

schrodinger.application.phase.packages.oned_task_utils.get_split_file_names(prefix, n)

Returns the names of the 1D data files that will be created in the ‘split’ task.

Parameters
  • prefix (str) – Prefix of output 1D data files

  • n (int) – Number of files to create

Returns

1D data file names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_structure_file_reader(structure_file)

Returns the appropriate reader for the supplied structure file, which is expected to be SMILES, SMILESCSV, MAESTRO or SD.

Parameters

structure_file (str) – Input file of structures

Returns

Structure file reader

Return type

structure.SmilesReader, structure.SmilesCsvReader or structure.StructureReader

Raises

ValueError if the file format is illegal

schrodinger.application.phase.packages.oned_task_utils.get_values_to_match(oned_data_file, filename)

Given a 1D data file and a text file containing a property name followed by property values to match, this function returns the 0-based position of the specified property in the 1D data file and a set of the values to match.

Parameters
  • oned_data_file (str) – The name of the 1D data file (.1dbin)

  • filename (str) – The name of the text file with the property name and the values to match

Returns

0-based property position, followed by values to match

Return type

int, set(str)

Raises

ValueError if the property is not found in oned_data_file
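
The text file layout (property name on the first line, values on the following lines) can be parsed along these lines. The helper read_values_to_match, the property list and the values are hypothetical. Skipping blank lines is an assumption, and the property names of the 1D data file are passed in directly instead of being read from a .1dbin file.

```python
import io


def read_values_to_match(lines, file_properties):
    """Return (0-based property position, set of values to match)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        raise ValueError("No property name found")
    prop_name, values = cleaned[0], set(cleaned[1:])
    if prop_name not in file_properties:
        raise ValueError(f"Property '{prop_name}' not found")
    return file_properties.index(prop_name), values


# Made-up property list and match file contents.
props = ["SMILES", "NAME", "ONED_REP", "s_sd_Vendor_ID"]
text = io.StringIO("s_sd_Vendor_ID\nV-0007\nV-0042\nV-0007\n")
position, wanted = read_values_to_match(text, props)
```

Holding the values in a set keeps each row's membership test cheap when the match list is long.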

schrodinger.application.phase.packages.oned_task_utils.is_oned_data_file(filename)

Returns True if the supplied file name corresponds to a 1D data file.

Parameters

filename (str) – The name of the file

Returns

Whether the name corresponds to a 1D data file

Return type

bool

schrodinger.application.phase.packages.oned_task_utils.merge_oned_data_files(oned_data_files_in, oned_data_file_out, remove=False)

Merges a list of 1D data files, creating an output file with a master set of properties.

Parameters
  • oned_data_files_in (list(str)) – List of 1D data files to merge

  • oned_data_file_out (str) – Output 1D data file

  • remove (bool) – If True, the input 1D data files are removed after the merge

Returns

Total number of rows merged

Return type

int

schrodinger.application.phase.packages.oned_task_utils.merge_oned_hits_files(hits_files_in, hits_file_out, query_row=None)

Merges a list of compressed CSV hits files, creating an output file with a master set of properties.

Parameters
  • hits_files_in (list(str)) – List of hits files to merge

  • hits_file_out (str) – Output hits file

  • query_row (list(str)) – If supplied, this row is written before any hits. It should be obtained by calling get_oned_query_rows.

Returns

Number of merged hits written

Return type

int

schrodinger.application.phase.packages.oned_task_utils.run_oned_screen(query, oned_data_file, hits_file, start=0, stop=None, write_query_row=False, sort=True, max_hits=1000, max_rows=1000000, min_sim=0.0, norm_scheme=0, logger=None, progress_interval=100000)

Performs a 1D similarity screen with a single query and writes hits to a compressed CSV file.

Parameters
  • query (structure.Structure or phase.PhpHypoAdaptor) – The query

  • oned_data_file (str) – Name of the input 1D data file (.1dbin)

  • hits_file (str) – Name of the output compressed CSV file (.csv.gz)

  • start (int) – 0-based starting row in oned_data_file

  • stop (int) – Upper limit on the rows to screen. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be screened. Screening proceeds until the end of the file by default.

  • write_query_row (bool) – Whether to write the 1D query as the first row

  • sort (bool) – Whether to write a sorted hits file

  • max_hits (int) – Cap on the number of sorted hits to write. Must not exceed MAX_CAPPED_HITS.

  • min_sim (float) – Write only hits whose similarity to the query is greater than or equal to this value

  • norm_scheme (int) – Similarity normalization scheme as defined in the enum phase.OneDNormScheme

  • logger (logging.Logger or NoneType) – Logger for info level progress messages

  • progress_interval (int) – Interval between progress messages

Returns

Number of hits written

Return type

int

Raises

ValueError if max_hits exceeds MAX_CAPPED_HITS

schrodinger.application.phase.packages.oned_task_utils.split_structure_file(structure_file, prefix_out, nfiles)

Splits a structure file into a number of smaller, equal-sized files named <prefix_out>_1.<ext>, <prefix_out>_2.<ext>, etc., where <ext> will be 'smi.gz', 'csv.gz', 'maegz' or 'sdfgz', depending on the type of file supplied. This function is not appropriate for a Maestro or SD file containing more than 2**31 - 1 structures.

Parameters
  • structure_file (str) – The name of the structure file to be split

  • prefix_out (str) – Prefix for all output structure files

  • nfiles (int) – The number of output structure files to create

Returns

The names of the files created

Return type

list(str)

Raises

ValueError if the file format is illegal

schrodinger.application.phase.packages.oned_task_utils.split_oned_data_file(oned_data_file, prefix_out, nfiles)

Splits a 1D data file into a number of smaller, equal-sized files named <prefix_out>_1.1dbin, <prefix_out>_2.1dbin, etc.

Parameters
  • oned_data_file (str) – The name of the 1D data file to be split

  • prefix_out (str) – Prefix for all output 1D data files

  • nfiles (int) – The number of output 1D data files to create

Returns

The names of the files created

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.write_oned_data_file_row_count(oned_data_file, row_count)

Appends the total number of rows to a 1D data file.

Parameters
  • oned_data_file (str) – Name of the 1D data file (.1dbin)

  • row_count (int) – Total number of rows in file