schrodinger.rdkit.molio module¶

PathFinder helper functions for reading and writing files using RDKit Mol objects.

class schrodinger.rdkit.molio.MolWriter(filename, generate_coordinates=True, require_stereo=False)¶

Bases: StructureWriter

Write Mol objects to a file using a StructureWriter-like API, optionally generating 3D coordinates.

__init__(filename, generate_coordinates=True, require_stereo=False)¶

Create a structure writer class based on the format.

Parameters:

filename (str or pathlib.Path) – The filename to write to.
generate_coordinates (bool) – generate 3D coordinates (as described for get_mol_writer).
require_stereo (bool) – when generating coordinates, fail when there’s unspecified stereochemistry, instead of producing an arbitrary isomer (as described for get_mol_writer).

The following parameter descriptions are inherited from StructureWriter and do not appear in this class’s signature:

overwrite (bool) – If False, append to an existing file instead of overwriting it.
format (str) – The format of the file. Values should be specified by one of the module-level constants MAESTRO, MOL2, SD, SMILES, or SMILESCSV. If the format is not explicitly specified it will be determined from the suffix of the filename. Multi-structure PDB files are not supported.
stereo (enum) – Use of the stereo option in the constructor is pending deprecation. Please use the setOption method instead. See the class docstring for documentation on the stereo options.
allow_empty_file (bool) – whether to create a file with no structures if no structures are appended. Only a valid option for Maestro files.

append(mol)¶: Append the provided structure to the open file.

class schrodinger.rdkit.molio.StructureReaderAdapter(reader, implicitH=True)¶

Bases: object

A wrapper for a Structure reader, which, when iterated through, yields RDKit Mol objects, and can also be used as a context manager that closes the reader on exit.

__init__(reader, implicitH=True)¶

Parameters:

reader (iterable of Structure) – source of structures to convert
implicitH (bool) – use implicit hydrogens

class schrodinger.rdkit.molio.BaseCsvMolReader(file, name_field: str = None)¶

Bases: object

Parent class for CsvMolReader and CsvMolIterator.

__init__(file, name_field: str = None)¶

Parameters:

file (str or file-like object) – CSV filename (file may be compressed) or file-like object.
name_field (str) – name of the field to use as the molecule name, will take precedence over the default NAME_FIELDS

NAME_FIELDS = ('NAME', 's_m_title', 'Name')¶
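
The precedence between name_field and NAME_FIELDS can be pictured with a minimal sketch. The helper pick_name below is hypothetical and not part of this module; it assumes that an explicit name_field is tried first and that the default fields are then tried in the order listed.

```python
NAME_FIELDS = ('NAME', 's_m_title', 'Name')  # defaults listed above


def pick_name(row, name_field=None):
    """Return the molecule name from a CSV row (a dict), or None.

    Hypothetical helper: an explicit name_field is tried first, then
    each default field in order (an assumption about the lookup order).
    """
    candidates = ((name_field,) if name_field else ()) + NAME_FIELDS
    for field in candidates:
        value = row.get(field)
        if value:
            return value
    return None


row = {'SMILES': 'CCO', 'Name': 'ethanol', 'ID': 'mol-1'}
print(pick_name(row))                   # falls back to the 'Name' column
print(pick_name(row, name_field='ID'))  # explicit field takes precedence
```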

close()¶

class schrodinger.rdkit.molio.CsvMolReader(file, *args, **kwargs)¶

Bases: BaseCsvMolReader

Read a SMILES CSV file, returning Mol objects.

This is similar to RDKit’s SmilesMolSupplier with delimiter=',', except that it uses the csv module instead of naively splitting on commas. This makes it possible to have field values containing commas, as long as they are quoted following the CSV convention. Note, however, that multi-line records are still not supported for efficiency reasons.

Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.

A CsvMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a CSV file having only SMILES and an ID, this takes about 100 MB per million entries.
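
The difference between csv-based parsing and naive comma splitting can be seen with the standard library alone. This is only an illustration of the parsing behavior described above, using made-up data; it does not use CsvMolReader or create Mol objects.

```python
import csv
import io

text = 'SMILES,NAME\nCCO,"ethanol, absolute"\nc1ccccc1,benzene\n'

# Naive splitting breaks the quoted name into two fields.
naive = text.splitlines()[1].split(',')

# csv.DictReader honours the quoting and keeps the comma inside the value.
rows = list(csv.DictReader(io.StringIO(text)))

print(naive)                         # three fields instead of two
print(rows[0]['NAME'])               # the full quoted value
print(len(rows), rows[1]['SMILES'])  # records held in memory can be indexed
```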

__init__(file, *args, **kwargs)¶

Parameters:

file (str or file-like object) – CSV filename (file may be compressed) or file-like object.
name_field (str) – name of the field to use as the molecule name, will take precedence over the default NAME_FIELDS

__len__()¶

class schrodinger.rdkit.molio.CsvMolIterator(file, name_field: str = None)¶

Bases: BaseCsvMolReader

Read a SMILES CSV file, returning Mol objects.

Unlike CsvMolReader, CsvMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.
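
The memory trade-off can be sketched with a plain generator over csv records. The function iter_rows is a hypothetical stand-in that yields dictionaries rather than Mol objects; it only illustrates why an iterator holds one record at a time and offers no indexing.

```python
import csv
import io


def iter_rows(fileobj):
    """Yield one CSV record at a time instead of loading the whole file."""
    for row in csv.DictReader(fileobj):
        yield row


source = io.StringIO('SMILES,NAME\nCCO,ethanol\nCC,ethane\n')
rows = iter_rows(source)
first = next(rows)      # only this record has been parsed so far
remaining = list(rows)  # consuming the iterator reads the rest
```

Once the iterator is exhausted, the records cannot be revisited without reopening the source, which is the price paid for the small memory footprint.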

class schrodinger.rdkit.molio.CsvMolWriter(filename, properties=None, cxsmiles=False)¶

Bases: object

Write a CSV file given Mol objects, using a StructureWriter-like API. The first two columns are the SMILES and title, and the rest are the properties of the molecule.

structure.SmilesCsvWriter is not used because its conversions make it too slow: the overall job takes 4 times as long, so writing the output file becomes the bottleneck.
Chem.SmilesWriter is not used either: although it can use a comma as a delimiter, it does not know how to escape the delimiter and therefore does not write proper CSV files.

Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.
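
The two points above (escaping the delimiter and choosing gzip from the filename) can be sketched with the standard library. The helper open_text is hypothetical, and the column names are made up; this is not the class's implementation.

```python
import csv
import gzip
import os
import tempfile


def open_text(filename, mode):
    """Open a text file, using gzip if the name ends in 'gz'."""
    opener = gzip.open if filename.endswith('gz') else open
    return opener(filename, mode + 't', newline='')


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'products.csv.gz')
    with open_text(path, 'w') as fh:
        writer = csv.writer(fh)
        writer.writerow(['SMILES', 'NAME'])
        writer.writerow(['CCO', 'ethanol, absolute'])  # comma gets quoted
    with open_text(path, 'r') as fh:
        rows = list(csv.reader(fh))
```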

__init__(filename, properties=None, cxsmiles=False)¶

Parameters:

filename (str or file-like object) – file to write
properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties are written. (CAVEAT: if filename is a file object rather than an actual filename, only the properties present in the first molecule are written.)
cxsmiles (bool) – when writing SMILES, use CXSMILES extensions

append(mol)¶

Write a molecule to the file. The first time this is called, the header row is written based on mol’s properties or the properties passed to __init__, if any.

Parameters: mol (rdkit.Chem.rdchem.Mol) – molecule
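
A minimal sketch of the "header on first append" behavior, assuming dictionaries stand in for a molecule's properties. LazyHeaderCsvWriter and the SMILES/NAME column names are hypothetical; the sketch mirrors the file-object caveat in which only the first record's properties determine the columns.

```python
import csv
import io


class LazyHeaderCsvWriter:
    """Toy writer: dicts stand in for a molecule's properties."""

    def __init__(self, fileobj, properties=None):
        self._writer = csv.writer(fileobj)
        self._properties = properties
        self._header_written = False

    def append(self, smiles, title, props):
        if not self._header_written:
            # Fall back to the first record's properties if none were given.
            if self._properties is None:
                self._properties = list(props)
            self._writer.writerow(['SMILES', 'NAME'] + self._properties)
            self._header_written = True
        values = [props.get(name, '') for name in self._properties]
        self._writer.writerow([smiles, title] + values)


out = io.StringIO()
writer = LazyHeaderCsvWriter(out)
writer.append('CCO', 'ethanol', {'r_mw': 46.07})
writer.append('CC', 'ethane', {'r_mw': 30.07, 'extra': 1})  # 'extra' is dropped
lines = out.getvalue().splitlines()
```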

toSmiles(mol)¶

close()¶

class schrodinger.rdkit.molio.BasePfxMolReader(filename)¶

Bases: object

Parent class for PfxMolReader and PfxMolIterator.

__init__(filename)¶

close()¶

class schrodinger.rdkit.molio.PfxMolReader(filename)¶

Bases: BasePfxMolReader

Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

Like CsvMolReader, PfxMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a file having only SMILES and an ID, this takes about 100 MB per million entries.
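
The "zip archive containing a CSV file and a metadata JSON file" layout can be illustrated with zipfile. The member names and the size key below are assumptions made for this sketch; the actual names used inside a .pfx file are not documented here.

```python
import csv
import io
import json
import zipfile

# Hypothetical member names; the real layout of a .pfx archive may differ.
CSV_MEMBER = 'structures.csv'
META_MEMBER = 'metadata.json'

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    table = io.StringIO()
    csv.writer(table).writerows(
        [['SMILES', 'NAME'], ['CCO', 'ethanol'], ['CC', 'ethane']])
    archive.writestr(CSV_MEMBER, table.getvalue())
    archive.writestr(META_MEMBER, json.dumps({'size': 2}))  # assumed key

with zipfile.ZipFile(buffer) as archive:
    metadata = json.loads(archive.read(META_MEMBER))
    with archive.open(CSV_MEMBER) as raw:
        text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
        rows = list(csv.DictReader(text))
```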

csv_mol_reader_class¶: alias of CsvMolReader

__len__()¶

class schrodinger.rdkit.molio.PfxMolIterator(filename)¶

Bases: BasePfxMolReader

Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

Unlike PfxMolReader, PfxMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.

csv_mol_reader_class¶: alias of CsvMolIterator

class schrodinger.rdkit.molio.PfxMolWriter(filename, properties=None)¶

Bases: object

Writer for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

__init__(filename, properties=None)¶

Parameters:

filename (str) – file to write
properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties present on the first structure will be written (the assumption is that all molecules will have the same properties, or at least that the first molecule has all the properties that we care about).

append(mol)¶

Write a molecule to the file.

Parameters: mol (rdkit.Chem.rdchem.Mol) – molecule

property written_count¶

close()¶

class schrodinger.rdkit.molio.RdkitMolWriter(filename, v3000=False)¶

Bases: object

Write Mol objects to a file using the RDKit file-writing classes, but with a StructureWriter-like API. Supports SMILES and SDF.

__init__(filename, v3000=False)¶

Parameters:

filename (str) – filename to write
v3000 (bool) – when writing SD, force the use of the V3000 format

property written_count¶

append(mol)¶

close()¶

class schrodinger.rdkit.molio.NoneSkipper(supplier)¶

Bases: object

A wrapper for a mol supplier, which, when iterated through, skips the None mols, and can also be used as a context manager.

__init__(supplier)¶

Parameters: supplier (iterable of Mol) – supplier of molecules
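
The wrapping pattern can be sketched as follows. SkipNone is an illustrative stand-in, not the module's code, and closing the underlying supplier on exit is an assumption of this sketch.

```python
class SkipNone:
    """Illustrative stand-in for NoneSkipper, not the module's code."""

    def __init__(self, supplier):
        self._supplier = supplier

    def __iter__(self):
        return (item for item in self._supplier if item is not None)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the underlying supplier if it supports closing (assumption).
        close = getattr(self._supplier, 'close', None)
        if close is not None:
            close()
        return False


# Strings stand in for Mol objects; None marks an unparsable entry.
with SkipNone(['CCO', None, 'CC', None]) as mols:
    kept = list(mols)
```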

__len__()¶

class schrodinger.rdkit.molio.GzippedSDMolSupplier(filename, *a, **kw)¶

Bases: ForwardSDMolSupplier

Subclass of ForwardSDMolSupplier to read gzip-compressed files. Use as a context manager to ensure that the file gets closed.

__init__(filename, *a, **kw)¶

Parameters:

filename (str) – gzip-compressed file
a – positional arguments to pass through to parent
kw – keyword arguments to pass through to parent

schrodinger.rdkit.molio.get_mol_writer(filename, generate_coordinates=True, require_stereo=False, v3000=False, cxsmiles=False)¶

Return a StructureWriter-like object based on the command-line arguments. RDKit is used for non-Maestro formats.

Parameters:

filename (str) – filename to write
generate_coordinates (bool) – generate 3D coordinates (non-SMILES formats)
require_stereo (bool) – when generating coordinates, fail when there’s unspecified stereochemistry, instead of producing an arbitrary isomer
v3000 (bool) – when writing SD, force the use of the V3000 format
cxsmiles (bool) – when writing SMILES, use CXSMILES extensions

schrodinger.rdkit.molio.supported_output_format(filename)¶

Check whether we know how to write a file with a given name, but without actually opening a file. Used for argument validation.

Return type: bool

schrodinger.rdkit.molio.get_mol_reader(filename, skip_bad=True, implicitH=True, random_access=True)¶

Return a Mol reader given a filename or a SMILES string. For .smi and .csv files, use the RDKit SmilesMolSupplier; for other formats, use StructureReader but convert Structure to Mol before yielding each molecule.

Whenever possible, the reader will be a Sequence. This is currently the case for .smi and .csv files when skip_bad is False, and for a SMILES string, which returns a list of size 1.

Parameters:

skip_bad (bool) – if True, bad structures are skipped implicitly, instead of being yielded as None (only applies to SMILES and CSV formats).
implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)
random_access (bool) – if False, the reader object can only be used as an iterator, and the file is not read in memory all at once. (Only applies to CSV and PFX and is ignored for other formats, which provide no random access except for uncompressed SD.)

Return type:

Generator or Sequence of Mol

schrodinger.rdkit.molio.get_mol(target, implicitH=True)¶

Read a Mol from a file or a SMILES string.

Parameters:

target (str) – filename or SMILES
implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)

Return type:

rdkit.Chem.Mol

schrodinger.rdkit.molio.combine_output_files(outfiles, out, dedup=True, sort=False, union_csv_columns=False, rdkit=False, v3000=False)¶

Write the final output file.

Parameters:

outfiles (list[str]) – subjob output filenames
out (str) – output filename
dedup (bool) – skip duplicate products
sort (bool) – sort output (implies the subjob output is sorted)
union_csv_columns (bool) – if csv, union infile columns.
rdkit (bool) – Use an RDKit writer for SD files.
v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

schrodinger.rdkit.molio.get_format_handler(infiles, outfile, union_csv_columns=False, rdkit=False, v3000=False)¶

Return the appropriate format handler for a specified output file type.

Parameters:

infiles (list[str]) – subjob output filenames, used as input for merging
outfile (str) – output filename
union_csv_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)
rdkit (bool) – Use an RDKit writer for SD files.
v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

Returns:

instance of a subclass of BaseMergeHandler

Return type:

CsvMergeHandler, StructureMergeHandler, or SmiMergeHandler

schrodinger.rdkit.molio.merge_files_as_streams(infiles, outfile, file_handler, dedup)¶

Copies structures from infiles into outfile. Rejects duplicates using file_handler.getCompareKey. Assumes infiles are sorted.

Parameters:

infiles (iterable over str) – names of the structure files to be joined
outfile (str) – output file name
file_handler (instance of subclass of BaseMergeHandler) – object to handle open, read and write operations for the file.
dedup (bool) – flag to indicate if duplicate products should be removed from merged output file

Returns:

number of structures written

Return type:

int
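
A streaming merge of sorted inputs can be sketched with heapq.merge. This is not the function's implementation: merge_sorted_streams is hypothetical, and it assumes the inputs are sorted by the same key that is used for comparison, so that duplicates end up adjacent.

```python
import heapq


def merge_sorted_streams(streams, key, dedup=True):
    """Yield items from several sorted iterables in sorted order.

    With dedup=True, an item is skipped when its key equals the key of
    the previously yielded item, which suffices because equal keys are
    adjacent in sorted input.
    """
    previous = object()  # sentinel that never equals a real key
    for item in heapq.merge(*streams, key=key):
        current = key(item)
        if dedup and current == previous:
            continue
        previous = current
        yield item


def smiles_key(line):
    """Use the first whitespace-separated token as the comparison key."""
    return line.split()[0]


a = ['C methane', 'CC ethane', 'CCO ethanol']
b = ['CC ethane-copy', 'CCC propane']
merged = list(merge_sorted_streams([a, b], key=smiles_key))
```

Only one pending item per input is held at a time, which is why this approach suits outputs too large to load in full.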

schrodinger.rdkit.molio.merge_files_in_memory(infiles, outfile, filetype_handler, dedup)¶

Copies structures from infiles into outfile. Rejects duplicates using filetype_handler.getCompareKey.

Parameters:

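infiles (iterable over str) – names of the structure files to be joined
outfile (str) – output file name
filetype_handler (instance of subclass of BaseMergeHandler) – object to handle open, read and write operations for the file.
dedup (bool) – flag to indicate if duplicate products should be removed from merged output file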

Returns:

number of structures written

Return type:

int
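
When the inputs are not sorted, duplicates can be rejected by remembering the keys already seen. The sketch below is a hypothetical illustration of that idea with plain strings; keeping the first occurrence of each key is an assumption, not documented behavior.

```python
def merge_in_memory(sources, key, dedup=True):
    """Collect items from all sources, keeping the first item per key."""
    seen = set()
    merged = []
    for source in sources:
        for item in source:
            item_key = key(item)
            if dedup and item_key in seen:
                continue
            seen.add(item_key)
            merged.append(item)
    return merged


def first_token(line):
    """Use the first whitespace-separated token as the comparison key."""
    return line.split()[0]


sources = [['CCO ethanol', 'C methane'], ['CC ethane', 'CCO duplicate']]
unique = merge_in_memory(sources, key=first_token)
```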

class schrodinger.rdkit.molio.BaseMergeHandler¶

Bases: object

Base class for filetype handlers for subjob output deduplication and merging.

getProductReader(file)¶

Given a file name, create and return an iterable file handle to iterate over all products.

Parameters: file (str) – file name
Returns: iterable context manager over filetype-specific product format
Return type: iterable

getProductAppender(file)¶

Given a file name, create and return a file-writing object that writes when its “append” method is called.

Parameters: file (str) – file name
Returns: a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
Return type: file-like object

getCompareKey(product)¶

Given a product (formatted according to the filetype), return the computed comparison key (SMILES string) for the product.

Parameters: product (filetype-specific product (type varies)) – filetype-specific product
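
The three-method interface can be illustrated with a toy handler for text files holding one "SMILES name" record per line. LineMergeHandler and LineAppender are hypothetical and do not subclass the real base class; the comparison key here is the raw first token, with no canonicalization.

```python
import os
import tempfile


class LineAppender:
    """Minimal writer exposing append() and context management."""

    def __init__(self, filename):
        self._fh = open(filename, 'w')

    def append(self, line):
        self._fh.write(line.rstrip('\n') + '\n')

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._fh.close()


class LineMergeHandler:
    """Hypothetical handler with the same three methods as the base class."""

    def getProductReader(self, file):
        return open(file)  # file objects are iterable context managers

    def getProductAppender(self, file):
        return LineAppender(file)

    def getCompareKey(self, product):
        return product.split()[0]  # raw SMILES token, not canonicalized


handler = LineMergeHandler()
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'out.smi')
    with handler.getProductAppender(path) as appender:
        appender.append('CCO ethanol')
        appender.append('CC ethane')
    with handler.getProductReader(path) as reader:
        keys = [handler.getCompareKey(line) for line in reader]
```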

class schrodinger.rdkit.molio.CsvMergeHandler(infiles, outfile, union_columns=True, dedup_field=None)¶

Bases: BaseMergeHandler

Class to bundle csv read/write operations

__init__(infiles, outfile, union_columns=True, dedup_field=None)¶

Parameters:

infiles (list(str)) – list of subjob output files whose columns are combined, if necessary.
outfile (str) – output file
union_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)
dedup_field (str) – csv column to use to check for duplicates during deduplication

getProductReader(file)¶

Open a csv file, skip the first (header) line if necessary, and return a context-managing iterable over all remaining lines.

Parameters: file (str) – file name
Returns: iterable context manager over csv lines
Return type: _CsvReadWrapper (iter(str) or iter(dict))

getProductAppender(file)¶

Open a csv file, write the first (header) line, and return a line writer that supports the getProductAppender.append calls.

Parameters: file (str) – file name
Returns: a file handle that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
Return type: file-like object

getHeader()¶

Returns the header for ProductAppenders to reference.

Returns: Header line for the input csv files.
Return type: str

getCompareKey(prod)¶

Compute SMILES from a given csv-formatted product.

Parameters: prod (dict or list) – product in question
Returns: SMILES string
Return type: str

class schrodinger.rdkit.molio.StructureMergeHandler¶

Bases: BaseMergeHandler

Helper class to bundle structure.Structure IO operations.

getProductReader(file)¶

Create and return a structure reader

Parameters: file (str) – structure file name
Returns: structure reader for file
Return type: structure.StructureReader

getProductAppender(file)¶

Create and return a structure writer

Parameters: file (str) – structure file name
Returns: structure writer for file
Return type: structure.StructureWriter

getCompareKey(prod)¶

Compute SMILES from a given Schrodinger structure to compare against other structures.

Parameters: prod (structure.Structure) – product in question
Returns: SMILES string
Return type: str

class schrodinger.rdkit.molio.RdkitMergeHandler(v3000=False)¶

Bases: BaseMergeHandler

__init__(v3000=False)¶

Parameters: v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

getProductReader(file)¶

Given a file name, create and return an iterable file handle to iterate over all products.

Parameters: file (str) – file name
Returns: iterable context manager over filetype-specific product format
Return type: iterable

getProductAppender(file)¶

Given a file name, create and return a file-writing object that writes when its “append” method is called.

Parameters: file (str) – file name
Returns: a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
Return type: file-like object

getCompareKey(prod)¶

Parameters: prod (rdkit.Chem.Mol) – product in question
Returns: SMILES string
Return type: str

class schrodinger.rdkit.molio.SmiMergeHandler¶

Bases: BaseMergeHandler

Helper class to bundle SMILES (.smi) IO operations.

getProductReader(file)¶

Create and return a SMILES line reader

Parameters: file (str) – SMILES file name
Returns: SMILES line reader for file
Return type: file-like object (__enter__, __exit__, __iter__)

getProductAppender(file)¶

Create and return a SMILES line writer

Parameters: file (str) – SMILES file name
Returns: SMILES line writer for file
Return type: _SmilesAppender

getCompareKey(prod)¶

Compute SMILES from a given SMILES line for comparison to other SMILES lines.

Parameters: prod (str) – product in question
Returns: SMILES string
Return type: str

schrodinger.rdkit.molio.get_fieldnames(filenames)¶

Return a list with the union of the field names from all the given CSV files. The field names are listed in the order in which they were first seen. (First all the fields from file #1, then the “new” field names from file #2, etc.)

Parameters: filenames ([str]) – list of CSV files
Returns: list of field names
Return type: [str]
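
The first-seen ordering of the union can be sketched as follows. The function union_fieldnames is a hypothetical illustration that takes open file-like objects rather than filenames, so that it runs without touching the disk.

```python
import csv
import io


def union_fieldnames(csv_sources):
    """Return the header fields of all sources, in first-seen order."""
    fieldnames = []
    for source in csv_sources:
        header = next(csv.reader(source), [])
        for name in header:
            if name not in fieldnames:
                fieldnames.append(name)
    return fieldnames


first = io.StringIO('SMILES,NAME,r_mw\nCCO,ethanol,46.07\n')
second = io.StringIO('SMILES,NAME,i_rank\nCC,ethane,1\n')
fields = union_fieldnames([first, second])
```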

schrodinger.rdkit.molio.is_csvgz(filename)¶

schrodinger.rdkit.molio.is_pfx(filename)¶

schrodinger.rdkit.molio.get_pfx_size(filename)¶: Return the size from the metadata header of a .pfx file.

schrodinger.rdkit.molio.extract_structures(filename, dest_file)¶: Extract structures from .pfx file into a given file.

schrodinger.rdkit.molio.remove_react_atom_props(mol)¶

Return a copy of mol where atom properties added by the RDKit reaction module have been stripped out.

Parameters: mol (rdkit.Chem.Mol) – input molecule; not modified
Returns: modified molecule
Return type: rdkit.Chem.Mol

schrodinger.rdkit.molio.cat_csv_files(source_filenames, dest_filename)¶

Quick and dirty csv concatenation strategy. Assumes all csv files have the same columns and does not deduplicate.

Parameters:

source_filenames – input files
dest_filename – destination file
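
A simple concatenation of this kind might look like the sketch below. It is a hypothetical illustration on file-like objects, and it assumes that every input starts with the same header line, which is written only once.

```python
import io


def cat_csv_streams(sources, dest):
    """Copy all sources to dest, keeping only the first header line."""
    for index, source in enumerate(sources):
        header = source.readline()
        if index == 0:
            dest.write(header)
        for line in source:
            dest.write(line)


parts = [
    io.StringIO('SMILES,NAME\nCCO,ethanol\n'),
    io.StringIO('SMILES,NAME\nCC,ethane\nCCO,ethanol\n'),
]
combined = io.StringIO()
cat_csv_streams(parts, combined)
```

The repeated CCO row survives in the output, which reflects the absence of deduplication.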

schrodinger.rdkit.molio.copy_csv_file(input_file, output_file)¶

Copy compressed or uncompressed input .csv file to another .csv file. Output file can also be compressed or uncompressed.

Parameters:

input_file (str) – input file name
output_file (str) – output file name
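
Copying between compressed and uncompressed files can be sketched by picking the opener from the filename on each side. The helper names are hypothetical, and detecting compression from a name ending in "gz" is an assumption borrowed from the reader and writer classes in this module.

```python
import gzip
import os
import shutil
import tempfile


def _open_binary(filename, mode):
    """Open with gzip when the name ends in 'gz', else as a plain file."""
    opener = gzip.open if filename.endswith('gz') else open
    return opener(filename, mode)


def copy_maybe_compressed(input_file, output_file):
    """Copy bytes, compressing or decompressing as the names require."""
    with _open_binary(input_file, 'rb') as src, \
            _open_binary(output_file, 'wb') as dst:
        shutil.copyfileobj(src, dst)


with tempfile.TemporaryDirectory() as tmpdir:
    plain = os.path.join(tmpdir, 'in.csv')
    packed = os.path.join(tmpdir, 'out.csv.gz')
    with open(plain, 'w', newline='') as fh:
        fh.write('SMILES,NAME\nCCO,ethanol\n')
    copy_maybe_compressed(plain, packed)
    with gzip.open(packed, 'rt', newline='') as fh:
        restored = fh.read()
```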