schrodinger.application.phase.packages.oned_task_utils module¶
Performs task-based work for the 1D similarity driver.
Copyright Schrodinger LLC, All Rights Reserved.
- class schrodinger.application.phase.packages.oned_task_utils.BasicStatsAccumulator(property_name)¶
Bases:
object
Accumulates basic statistics for a set of property values.
- __init__(property_name)¶
Constructor taking the name of the property.
- addValue(value)¶
Adds a numeric value to the series, updating the statistics.
- property std_dev¶
Returns the standard deviation.
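The accumulator pattern described above can be sketched in plain Python. This is a minimal illustration, not the Schrodinger implementation: the class name `RunningStats`, the `count` and `mean` attributes, the property name `r_user_score`, and the choice of the sample (n - 1) standard deviation are all assumptions.

```python
import math


class RunningStats:
    """Hypothetical stand-in for BasicStatsAccumulator (not the real class)."""

    def __init__(self, property_name):
        self.property_name = property_name
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add_value(self, value):
        # Welford's update: adjust the mean first, then the deviation sum.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std_dev(self):
        # Sample standard deviation (n - 1 denominator); this is an assumption.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


stats = RunningStats("r_user_score")  # hypothetical property name
for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.add_value(v)
```

Updating the statistics on each call avoids holding the whole series in memory, which matters when a 1D data file has millions of rows.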
- schrodinger.application.phase.packages.oned_task_utils.base64_decode_fd(s)¶
Decodes feature definitions from a Base64 string. Feature definitions will be empty if string is empty.
- Parameters
s (str) – Base64-encoded feature definition string
- Returns
Feature definitions
- Return type
list(phase.PhpFeatureDefinition)
- schrodinger.application.phase.packages.oned_task_utils.base64_encode_fd(fd)¶
Encodes feature definitions to a Base64 string. String will be empty if fd is empty or None.
- Parameters
fd (list(phase.PhpFeatureDefinition)) – Feature definitions
- Returns
Base64-encoded feature definitions string
- Return type
str
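The empty-string contract shared by base64_decode_fd and base64_encode_fd can be illustrated with the standard library alone. In this sketch the feature definitions are represented as plain text lines, which is an assumption; how phase.PhpFeatureDefinition objects are actually serialized is not described here, and the helper names `encode_lines` and `decode_lines` are hypothetical.

```python
import base64


def encode_lines(lines):
    """Base64-encode a list of text lines; an empty list or None gives ''."""
    if not lines:
        return ""
    return base64.b64encode("\n".join(lines).encode("utf-8")).decode("ascii")


def decode_lines(s):
    """Inverse of encode_lines; an empty string gives an empty list."""
    if not s:
        return []
    return base64.b64decode(s).decode("utf-8").split("\n")


# Placeholder lines standing in for feature definitions.
definitions = ["feature A pattern", "feature D pattern"]
round_trip = decode_lines(encode_lines(definitions))
```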
- schrodinger.application.phase.packages.oned_task_utils.combine_oned_hits(hits_files_in, hits_file_out, query_row=None, sort=True, max_hits=1000, max_rows=1000000)¶
Combines a set of 1D hits files with or without sorting and capping.
- Parameters
hits_files_in (list(str)) – List of compressed CSV hits files to combine
hits_file_out (str) – Output compressed CSV hits file
query_row (list(str)) – If supplied, this row is written before any hits
sort (bool) – Whether to write a sorted hits file
max_hits (int) – Cap on the number of sorted hits to output. Must not exceed MAX_CAPPED_HITS.
max_rows (int) – Maximum number of sorted rows to hold in memory
- Returns
Number of hits written
- Return type
int
- Raises
ValueError if max_hits exceeds MAX_CAPPED_HITS
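The sort-and-cap behavior described for combine_oned_hits can be sketched on in-memory rows. The following is an illustration under stated assumptions, not the library code: rows are lists of strings whose last element is the similarity, the cap applies only when sorting, `MAX_CAPPED_HITS` is given a placeholder value, and the `max_rows` memory limit and compressed CSV I/O are left out. The function name `combine_hits` is hypothetical.

```python
import heapq

MAX_CAPPED_HITS = 10000  # placeholder; the real constant's value is not given here


def combine_hits(hit_lists, query_row=None, sort=True, max_hits=1000):
    """Combine lists of hit rows; returns (rows_to_write, number_of_hits)."""
    if max_hits > MAX_CAPPED_HITS:
        raise ValueError("max_hits exceeds MAX_CAPPED_HITS")
    rows = [row for hits in hit_lists for row in hits]
    if sort:
        # Keep only the max_hits most similar rows, most similar first.
        rows = heapq.nlargest(max_hits, rows, key=lambda r: float(r[-1]))
    # The query row, if any, goes before the hits and is not counted as a hit.
    combined = ([query_row] if query_row is not None else []) + rows
    return combined, len(rows)


hits_a = [["CCO", "m1", "0.90"], ["CCC", "m2", "0.40"]]
hits_b = [["CCN", "m3", "0.75"]]
combined, n_hits = combine_hits([hits_a, hits_b],
                                query_row=["C", "query", "1.0"],
                                max_hits=2)
```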
- schrodinger.application.phase.packages.oned_task_utils.create_oned_data_file(structure_file, oned_data_file, treatment=0, fd=None, props=None, logger=None, progress_interval=10000)¶
Creates a 1D data file from the structures in a SMILES, SMILES-CSV, Maestro or SD file.
- Parameters
structure_file (str) – Input file of structures
oned_data_file (str) – Destination 1D data file (.1dbin)
treatment (phase.OneDTreatment) – Structure treatment for 1D representations
fd (list(phase.PhpFeatureDefinition) or NoneType) – Overrides default feature definitions. Relevant only when treatment is in ONED_PHARM_TREATMENTS.
props (list(str) or NoneType) – m2io-style properties to include in the 1D data file, other than SMILES and title. Not used when a SMILES file is supplied.
logger (logging.Logger or NoneType) – Logger for info level progress messages
progress_interval (int) – Interval between progress messages
- Returns
Number of rows written to the 1D data file
- Return type
int
- schrodinger.application.phase.packages.oned_task_utils.create_structure_from_hypothesis(hypo_file: str) → schrodinger.structure._structure.Structure¶
Creates a structure composed of dummy atom fragments that will give rise to just the pharmacophore features in the supplied hypothesis. This allows the hypothesis to be used as a query when the structure treatment is ONED_TREATMENT_PHARM3D.
- schrodinger.application.phase.packages.oned_task_utils.describe_oned_data_file(oned_data_file, stats=False)¶
Returns a string containing a description of the supplied 1D data file.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
stats (bool) – Whether to report basic statistics for any numeric properties in the 1D data file
- Returns
String containing the description
- Return type
str
- schrodinger.application.phase.packages.oned_task_utils.export_oned_data_file(oned_data_file, output_file, subset=None)¶
Exports rows from a 1D data file to a compressed CSV file. A subset of rows may be specified either as a string of comma-separated row ranges (e.g., ‘1:100,200:300’) or via a text file with a property name on the first line (e.g., ‘s_m_title’ or ‘s_sd_Vendor_ID’) and the values of that property on subsequent lines. If supplying comma-separated row ranges, the maximum row number must not exceed MAX_BITSET.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
output_file (str) – Output compressed CSV file
subset (str) – A comma-separated list of row ranges or text file name
- Returns
Number of rows exported
- Return type
int
- schrodinger.application.phase.packages.oned_task_utils.get_hits_file_names(jobname, nqueries)¶
Returns the names of the hits files the job will produce based on the number of query structures.
- Parameters
jobname (str) – Job name
nqueries (int) – Number of query structures
- Returns
Hits file names
- Return type
list(str)
- schrodinger.application.phase.packages.oned_task_utils.get_jobname(args)¶
Assigns the job name from SCHRODINGER_JOBNAME or from the base name of the appropriate input file.
- Parameters
args (argparse.Namespace) – Namespace holding the command line arguments
- Returns
Job name
- Return type
str
- schrodinger.application.phase.packages.oned_task_utils.get_master_oned_properties(filenames)¶
Returns the union of all properties in the provided 1D data files or compressed CSV hits files. The first three properties are always SMILES, NAME and ONED_REP_PROPERTY. If CSV hits files are supplied, the last property will always be ONED_SIM_PROPERTY. Any additional properties will appear after the first three properties. If multiple files are supplied, those additional properties will be sorted alphabetically.
- Parameters
filenames (list(str)) – List of files to consider
- Returns
Union of additional properties
- Return type
list(str)
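The ordering rules above can be expressed as a short sketch. It works on property lists that have already been read from each file, and the strings used for the leading and similarity columns are placeholders, since the actual values of ONED_REP_PROPERTY and ONED_SIM_PROPERTY are not shown here. The function name `master_properties` is hypothetical.

```python
# Placeholder column names; the real constant values are assumptions.
LEADING = ["SMILES", "NAME", "ONED_REP_PROPERTY"]
SIM = "ONED_SIM_PROPERTY"


def master_properties(per_file_properties, hits_files=False):
    """Union of properties: leading three, then extras, then similarity."""
    extras = []
    for props in per_file_properties:
        for name in props:
            if name not in LEADING and name != SIM and name not in extras:
                extras.append(name)
    if len(per_file_properties) > 1:
        extras.sort()  # alphabetical only when several files are combined
    return LEADING + extras + ([SIM] if hits_files else [])


file1 = LEADING + ["s_sd_Vendor_ID"]
file2 = LEADING + ["r_user_logP", "s_sd_Vendor_ID"]
merged = master_properties([file1, file2])
hits = master_properties([file1 + [SIM], file2 + [SIM]], hits_files=True)
```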
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_attributes(oned_data_file)¶
Returns the version, structure treatment and feature definitions of the supplied 1D data file.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
- Returns
tuple of version, structure treatment and feature definitions
- Return type
str, phase.OneDTreatment, list(phase.PhpFeatureDefinition)
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_names(source, prefer_cwd=False)¶
Returns the name of the 1D data file(s) specified in source, which may be the name of a 1D data file or a list file containing the names of one or more 1D data files.
- Parameters
source (str) – 1D data file source specification
prefer_cwd (bool) – If source is a list file, setting this to True forces the use of 1D data files in the CWD if they exist, even if they also exist at the locations specified in the list file. This addresses the situation where the list file contains absolute paths that exist on the job host, but the corresponding files have been copied to the job directory. In that case, we want to be accessing only the files in the job directory.
- Returns
1D data file names
- Return type
list(str)
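The prefer_cwd rule can be sketched with os.path. This is an illustration only: it starts from paths already read out of a list file, the helper name `resolve_names` and the `cwd` argument are hypothetical, and whether the real function returns the local path or just the base name is not stated here.

```python
import os
import tempfile


def resolve_names(list_file_entries, prefer_cwd=False, cwd=None):
    """Swap in a same-named file from cwd when prefer_cwd is set and it exists."""
    cwd = cwd or os.getcwd()
    names = []
    for path in list_file_entries:
        local = os.path.join(cwd, os.path.basename(path))
        if prefer_cwd and os.path.isfile(local):
            names.append(local)
        else:
            names.append(path)
    return names


# Simulate a job directory that holds a copy of only the first file.
with tempfile.TemporaryDirectory() as job_dir:
    copied = os.path.join(job_dir, "lib1.1dbin")
    open(copied, "w").close()
    entries = ["/host/data/lib1.1dbin", "/host/data/lib2.1dbin"]
    resolved = resolve_names(entries, prefer_cwd=True, cwd=job_dir)
    used_local_copy = resolved[0] == copied
    untouched = resolve_names(entries, prefer_cwd=False, cwd=job_dir)
```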
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_properties(oned_data_file)¶
Returns a list of the names of any additional properties stored in the supplied 1D data file. These are properties other than SMILES, NAME and the ONED_REP_PROPERTY. The list will be empty if no additional properties are stored.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
- Returns
list of additional property names
- Return type
list(str)
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_distribution(oned_data_files, nsub)¶
Given a list of 1D data files to screen and the number of subjobs over which the screen is to be distributed, this function determines how to divide the 1D data files over the subjobs. A list with nsub elements is returned, where a given element holds one or more 1D data file names and the (start, stop) row limits to screen in that file. For example, if 1D data files file1.1dbin and file2.1dbin are supplied, with 1200 and 1800 rows, respectively, and nsub is 3, this function would return the following:
[[['file1.1dbin', (0, 1000)]],                                 # subjob 1
 [['file1.1dbin', (1000, 1200)], ['file2.1dbin', (0, 800)]],   # subjob 2
 [['file2.1dbin', (800, 1800)]]]                               # subjob 3
- Parameters
oned_data_files (list(str)) – List of 1D data files
nsub (int) – Number of subjobs
- Returns
List of lists of file name, (start, stop) limits
- Return type
list(list(str, (int, int)))
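One way to arrive at the distribution in the example above is to give each subjob a row quota and walk the files in order. The sketch below takes (file name, row count) pairs rather than reading the files, and it hands any remainder rows to the first subjobs; both choices, and the name `distribute`, are assumptions rather than the documented behavior.

```python
def distribute(row_counts, nsub):
    """row_counts: list of (file_name, number_of_rows) pairs."""
    total = sum(n for _, n in row_counts)
    base, extra = divmod(total, nsub)
    # The first `extra` subjobs take one additional row (an assumption).
    quotas = [base + (1 if i < extra else 0) for i in range(nsub)]
    subjobs = [[] for _ in range(nsub)]
    i = 0
    for name, nrows in row_counts:
        start = 0
        while start < nrows:
            while quotas[i] == 0:  # move on to the next subjob with room
                i += 1
            take = min(quotas[i], nrows - start)
            subjobs[i].append([name, (start, start + take)])
            quotas[i] -= take
            start += take
    return subjobs


plan = distribute([("file1.1dbin", 1200), ("file2.1dbin", 1800)], 3)
```

With 3000 rows and 3 subjobs, each quota is 1000 rows, so `plan` matches the nested list shown in the example.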
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_splits(oned_data_file, prefix_out, nfiles)¶
Returns a list of output file names and (start, stop) limits for physically splitting a 1D data file into a number of smaller, equal-sized files. A given element of the returned list will be of the form:
<prefix_out>_<n>.1dbin, (start, stop)
where <prefix_out>_<n>.1dbin is the nth output file to create and start, stop are the corresponding row limits in oned_data_file, with stop being non-inclusive.
- Parameters
oned_data_file (str) – The name of the 1D data file to be split
prefix_out (str) – Prefix for all output 1D data files
nfiles (int) – The number of output 1D data files to create
- Returns
List of file names and (start, stop) limits
- Return type
list(str, (int, int))
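The shape of the returned list can be illustrated with a small sketch that works from a known row count. How the real function treats a row count that does not divide evenly is not stated; here the first files receive one extra row each, which is an assumption, as is the name `split_limits`.

```python
def split_limits(nrows, prefix_out, nfiles):
    """Return [(file_name, (start, stop)), ...] with stop non-inclusive."""
    base, extra = divmod(nrows, nfiles)
    splits, start = [], 0
    for n in range(1, nfiles + 1):  # output files are numbered from 1
        stop = start + base + (1 if n <= extra else 0)
        splits.append((f"{prefix_out}_{n}.1dbin", (start, stop)))
        start = stop
    return splits


splits = split_limits(10, "part", 3)
```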
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_row_count(oned_data_file)¶
Returns the number of rows in the supplied 1D data file.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
- Returns
Number of rows in the file
- Return type
int
- schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_rows(oned_data_file, start=0, stop=None)¶
Generator that yields rows from the supplied 1D data file. Each row is a list of strings, where the first 3 elements are SMILES, name and 1D encoding, and any subsequent elements hold the values of additional properties stored in the 1D data file.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
start (int) – 0-based starting row position
stop (int) – Upper limit on the rows to read. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be read. Reading proceeds until the end of the file by default.
- Yield
The next row in the file
- Ytype
list(str)
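The start and stop semantics match those of a Python slice. A minimal sketch over an in-memory list, with `iter_rows` and the placeholder row contents being hypothetical:

```python
from itertools import islice


def iter_rows(rows, start=0, stop=None):
    """Lazily yield rows[start:stop]; stop=None reads to the end."""
    yield from islice(rows, start, stop)


# Placeholder rows of the form [SMILES, name, 1D encoding].
all_rows = [[f"smiles{i}", f"mol{i}", f"rep{i}"] for i in range(12)]
names = [row[1] for row in iter_rows(all_rows, start=5, stop=10)]
tail = [row[1] for row in iter_rows(all_rows, start=10)]
```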
- schrodinger.application.phase.packages.oned_task_utils.get_oned_query(st_query, oned_data_file)¶
Returns a 1D representation for the provided query structure that’s created according to the attributes in the supplied 1D data file.
- Parameters
st_query (structure.Structure) – The query structure
oned_data_file (str) – Name of the input 1D data file (.1dbin)
- Returns
1D representation of the query structure
- Return type
phase.OneDRep
- schrodinger.application.phase.packages.oned_task_utils.get_oned_query_row(st_query, oned_query_base64, master_properties)¶
Returns a row for the query structure that can be written to the top of a hits file.
- Parameters
st_query (structure.Structure) – The query structure
oned_query_base64 (str) – Base64-encoded 1D representation of the query structure
master_properties (list(str)) – The full list of properties being written to the hits file
- Returns
Query structure row
- Return type
list(str)
- schrodinger.application.phase.packages.oned_task_utils.get_oned_query_rows(st_queries, oned_data_files)¶
Given a list of query structures and the 1D data files that were screened, this function returns a row for each query that can be supplied to combine_oned_hits to ensure that the query appears at the top of its associated hits file.
- Parameters
st_queries (list(structure.Structure)) – Query structures
oned_data_files (list(str)) – Names of 1D data files that were screened
- Returns
A row for each query
- Return type
list(list(str))
- schrodinger.application.phase.packages.oned_task_utils.get_oned_query_structures(query_file)¶
Reads query structures from the provided SMILES, SMILESCSV, Maestro, SD or Phase hypothesis file. Coordinates are not set in the case of SMILES or SMILESCSV.
- Parameters
query_file (str) – Structure file containing queries
- Returns
Query structures
- Return type
list(structure.Structure)
- schrodinger.application.phase.packages.oned_task_utils.get_property_positions(master_properties, file_properties)¶
Returns the position of each master property in a potentially smaller list of properties from a particular file. If a master property is not found in file_properties, the position of that property will be len(file_properties). Thus if file_pos contains the positions returned by this function, and file_row contains the property values for some row in that file, the following code can be used to construct a master row of property values that contains '' for each missing value:
file_row.append('')
master_row = [file_row[pos] for pos in file_pos]
- Parameters
master_properties (list(str)) – Master list of properties
file_properties (list(str)) – The list of properties from the file
- Returns
Positions of master_properties within file_properties
- Return type
list of int
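The lookup described above can be sketched in a few lines. This is an illustrative reimplementation, not the library code; `property_positions` and the property names and values in the example are hypothetical.

```python
def property_positions(master_properties, file_properties):
    """Position of each master property in file_properties, or len() if absent."""
    index = {name: i for i, name in enumerate(file_properties)}
    missing = len(file_properties)
    return [index.get(name, missing) for name in master_properties]


master = ["SMILES", "NAME", "REP", "r_user_logP", "s_sd_Vendor_ID"]
in_file = ["SMILES", "NAME", "REP", "s_sd_Vendor_ID"]
file_pos = property_positions(master, in_file)

file_row = ["CCO", "ethanol", "<1D encoding>", "V-001"]
padded = file_row + [""]  # same idea as file_row.append(''), without mutating
master_row = [padded[pos] for pos in file_pos]
```

Pointing every missing property at the one extra slot means a master row can be built with a single list comprehension and no per-value conditionals.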
- schrodinger.application.phase.packages.oned_task_utils.get_rows_to_export(row_ranges)¶
Constructs a canvas.ChmBitset from a comma-separated list of row ranges (e.g., ‘1:100,200:300’). Input row numbers are 1-based and upper limits are inclusive. The returned bitset will have a logical size equal to the highest row number supplied, and the on positions will be 0-based. Note that the maximum logical size for a ChmBitset is MAX_BITSET, so this function assumes that users will not attempt to create individual 1D data files that would exceed the ChmBitset limit.
- Parameters
row_ranges (str) – Comma-separated list of 1-based row ranges
- Returns
Bitset with 0-based rows as the on positions
- Return type
canvas.ChmBitset
- Raise
ValueError if an illegal string of row ranges is supplied
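The conversion from 1-based inclusive ranges to 0-based positions can be sketched without canvas.ChmBitset. In this illustration a sorted list of positions plus a logical size stands in for the bitset, bare row numbers without a colon are rejected, and the MAX_BITSET check is left out; all three are assumptions, as is the name `parse_row_ranges`.

```python
def parse_row_ranges(row_ranges):
    """Return (logical_size, sorted 0-based positions) for '1:100,200:300'."""
    positions = set()
    for token in row_ranges.split(","):
        parts = token.split(":")
        if len(parts) != 2:
            raise ValueError(f"Illegal row range: {token!r}")
        # int() itself raises ValueError for non-numeric limits.
        first, last = int(parts[0]), int(parts[1])
        if first < 1 or last < first:
            raise ValueError(f"Illegal row range: {token!r}")
        # 1-based inclusive limits become 0-based positions first-1 .. last-1.
        positions.update(range(first - 1, last))
    return max(positions) + 1, sorted(positions)


size, on_positions = parse_row_ranges("1:3,5:6")
```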
- schrodinger.application.phase.packages.oned_task_utils.get_split_file_names(prefix, n)¶
Returns the names of the 1D data files that will be created in the ‘split’ task.
- Parameters
prefix (str) – Prefix of output 1D data files
n (int) – Number of files to create
- Returns
1D data file names
- Return type
list(str)
- schrodinger.application.phase.packages.oned_task_utils.get_structure_file_reader(structure_file)¶
Returns the appropriate reader for the supplied structure file, which is expected to be SMILES, SMILESCSV, MAESTRO or SD.
- Parameters
structure_file (str) – Input file of structures
- Returns
Structure file reader
- Return type
structure.SmilesReader, structure.SmilesCsvReader or structure.StructureReader
- Raise
ValueError if the file format is illegal
- schrodinger.application.phase.packages.oned_task_utils.get_values_to_match(oned_data_file, filename)¶
Given a 1D data file and a text file containing a property name followed by property values to match, this function returns the 0-based position of the specified property in the 1D data file and a set of the values to match.
- Parameters
oned_data_file (str) – The name of the 1D data file (.1dbin)
filename (str) – The name of the text file with the property name and the values to match
- Returns
0-based property position, followed by values to match
- Return type
int, set(str)
- Raise
ValueError if the property is not found in oned_data_file
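The text file format and the returned pair can be illustrated as follows. The sketch takes the file's lines and the list of properties in the 1D data file as arguments instead of opening files, and it ignores blank lines; these simplifications and the name `read_values_to_match` are assumptions.

```python
def read_values_to_match(file_properties, lines):
    """First non-blank line is the property name; the rest are values to match."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    prop_name, values = cleaned[0], set(cleaned[1:])
    if prop_name not in file_properties:
        raise ValueError(f"Property {prop_name!r} not found in 1D data file")
    return file_properties.index(prop_name), values


# Placeholder property list and subset file contents.
props = ["SMILES", "NAME", "REP", "s_sd_Vendor_ID"]
subset_lines = ["s_sd_Vendor_ID\n", "V-001\n", "V-007\n"]
position, wanted = read_values_to_match(props, subset_lines)

rows = [["CCO", "m1", "rep1", "V-001"], ["CCC", "m2", "rep2", "V-002"]]
selected = [row for row in rows if row[position] in wanted]
```

Holding the values in a set keeps the per-row membership test constant-time, which helps when the subset file lists many values.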
- schrodinger.application.phase.packages.oned_task_utils.is_oned_data_file(filename)¶
Returns True if the supplied file name corresponds to a 1D data file.
- Parameters
filename (str) – The name of the file
- Returns
Whether the name corresponds to a 1D data file
- Return type
bool
- schrodinger.application.phase.packages.oned_task_utils.merge_oned_data_files(oned_data_files_in, oned_data_file_out, remove=False)¶
Merges a list of 1D data files, creating an output file with a master set of properties.
- Parameters
oned_data_files_in (list(str)) – List of 1D data files to merge
oned_data_file_out (str) – Output 1D data file
remove (bool) – If True, the input 1D data files are removed after the merge
- Returns
Total number of rows merged
- Return type
int
- schrodinger.application.phase.packages.oned_task_utils.merge_oned_hits_files(hits_files_in, hits_file_out, query_row=None)¶
Merges a list of compressed CSV hits files, creating an output file with a master set of properties.
- Parameters
hits_files_in (list(str)) – List of hits files to merge
hits_file_out (str) – Output hits file
query_row (list(str)) – If supplied, this row is written before any hits. It should be obtained by calling get_oned_query_rows.
- Returns
Number of merged hits written
- Return type
int
- schrodinger.application.phase.packages.oned_task_utils.run_oned_screen(st_query, oned_data_file, hits_file, start=0, stop=None, write_query_row=False, sort=True, max_hits=1000, max_rows=1000000, min_sim=0.0, logger=None, progress_interval=100000)¶
Performs a 1D similarity screen with a single structure query and writes hits to a compressed CSV file.
- Parameters
st_query (structure.Structure) – The query structure
oned_data_file (str) – Name of the input 1D data file (.1dbin)
hits_file (str) – Name of the output compressed CSV file (.csv.gz)
start (int) – 0-based starting row in oned_data_file
stop (int) – Upper limit on the rows to screen. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be screened. Screening proceeds until the end of the file by default.
write_query_row (bool) – Whether to write the 1D query as the first row
sort (bool) – Whether to write a sorted hits file
max_hits (int) – Cap on the number of sorted hits to write. Must not exceed MAX_CAPPED_HITS.
max_rows (int) – Maximum number of sorted rows to hold in memory
min_sim (float) – Write only hits whose similarity to the query is greater than or equal to this value
logger (logging.Logger or NoneType) – Logger for info level progress messages
progress_interval (int) – Interval between progress messages
- Returns
Number of hits written
- Return type
int
- Raises
ValueError if max_hits exceeds MAX_CAPPED_HITS
- schrodinger.application.phase.packages.oned_task_utils.split_structure_file(structure_file, prefix_out, nfiles)¶
Splits a structure file into a number of smaller, equal-sized files named <prefix_out>_1.<ext>, <prefix_out>_2.<ext>, etc., where <ext> will be ‘smi.gz’, ‘csv.gz’, ‘maegz’ or ‘sdfgz’, depending on the type of file supplied. This function is not appropriate for a Maestro or SD file containing more than 2**31 - 1 structures.
- Parameters
structure_file (str) – The name of the structure file to be split
prefix_out (str) – Prefix for all output structure files
nfiles (int) – The number of output structure files to create
- Returns
The names of the files created
- Return type
list(str)
- Raise
ValueError if the file format is illegal
- schrodinger.application.phase.packages.oned_task_utils.split_oned_data_file(oned_data_file, prefix_out, nfiles)¶
Splits a 1D data file into a number of smaller, equal-sized files named <prefix_out>_1.1dbin, <prefix_out>_2.1dbin, etc.
- Parameters
oned_data_file (str) – The name of the 1D data file to be split
prefix_out (str) – Prefix for all output 1D data files
nfiles (int) – The number of output 1D data files to create
- Returns
The names of the files created
- Return type
list(str)
- schrodinger.application.phase.packages.oned_task_utils.write_oned_data_file_row_count(oned_data_file, row_count)¶
Appends the total number of rows to a 1D data file.
- Parameters
oned_data_file (str) – Name of the 1D data file (.1dbin)
row_count (int) – Total number of rows in file