schrodinger.rdkit.molio module

PathFinder helper functions for reading and writing files using RDKit Mol objects.

class schrodinger.rdkit.molio.MolWriter(filename, generate_coordinates=True, require_stereo=False)

Bases: schrodinger.structure._io.StructureWriter

Write Mol objects to a file using a StructureWriter-like API, optionally generating 3D coordinates.

__init__(filename, generate_coordinates=True, require_stereo=False)

Create a structure writer class based on the format.

Parameters
  • filename (str or pathlib.Path) – The filename to write to.

  • overwrite (bool) – If False, append to an existing file instead of overwriting it.

  • format (str) – The format of the file. Values should be specified by one of the module-level constants MAESTRO, MOL2, SD, SMILES, or SMILESCSV. If the format is not explicitly specified it will be determined from the suffix of the filename. Multi-structure PDB files are not supported.

  • stereo (enum) –

    Use of the stereo option in the constructor is pending deprecation. Please use the setOption method instead.

    See the class docstring for documentation on the stereo options.

  • allow_empty_file (bool) – Whether to create a file with no structures if no structures are appended. Only valid for Maestro files.

append(mol)

Append the provided structure to the open file.

class schrodinger.rdkit.molio.StructureReaderAdapter(reader, implicitH=True)

Bases: object

A wrapper for a Structure reader, which, when iterated through, yields RDKit Mol objects, and can also be used as a context manager that closes the reader on exit.

__init__(reader, implicitH=True)
Parameters
  • reader (iterable of Structure) – source of structures to convert

  • implicitH (bool) – use implicit hydrogens

class schrodinger.rdkit.molio.BaseCsvMolReader(file)

Bases: object

Parent class for CsvMolReader and CsvMolIterator.

NAME_FIELDS = ('NAME', 's_m_title', 'Name')

__init__(file)
Parameters

file – CSV filename (file may be compressed) or file-like object.

close()

class schrodinger.rdkit.molio.CsvMolReader(file)

Bases: schrodinger.rdkit.molio.BaseCsvMolReader

Read a SMILES CSV file, returning Mol objects.

This is similar to RDKit’s SmilesMolSupplier with delimiter=’,’, except that it uses the csv module instead of naively splitting on commas. This makes it possible to have field values containing commas, as long as they are quoted following the CSV convention. Note, however, that multi-line records are still not supported for efficiency reasons.
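
The difference between naive comma splitting and CSV-aware parsing can be seen with the standard library alone. The snippet below is only an illustration of the quoting convention; the sample data is made up and it does not use CsvMolReader itself.

```python
import csv
import io

# Hypothetical SMILES CSV content: the title of the last record contains a comma.
text = 'SMILES,NAME\nCCO,ethanol\nc1ccccc1,"benzene, pure"\n'

# Naive splitting breaks the quoted field into two pieces.
naive_rows = [line.split(",") for line in text.splitlines()]

# csv.reader honors the quoting convention and keeps the field intact.
csv_rows = list(csv.reader(io.StringIO(text)))

print(naive_rows[2])  # three fragments, with stray quote characters
print(csv_rows[2])    # two fields, as intended
```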

Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.

A CsvMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a CSV file having only SMILES and an ID, this takes about 100 MB per million entries.

__init__(file)
Parameters

file – CSV filename (file may be compressed) or file-like object.

__len__()

class schrodinger.rdkit.molio.CsvMolIterator(file)

Bases: schrodinger.rdkit.molio.BaseCsvMolReader

Read a SMILES CSV file, returning Mol objects.

Unlike CsvMolReader, CsvMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.
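
The trade-off between the two readers can be sketched with plain CSV rows. The helper names `load_rows` and `iter_rows` are hypothetical and stand in for the random-access and streaming styles; dictionaries stand in for Mol objects.

```python
import csv
import io

CSV_TEXT = "SMILES,NAME\nCCO,ethanol\nCC(=O)O,acetic acid\nCCN,ethylamine\n"


def iter_rows(fh):
    """Yield one record at a time; only the current row is held in memory."""
    yield from csv.DictReader(fh)


def load_rows(fh):
    """Read every record up front so rows can be indexed and counted."""
    return list(csv.DictReader(fh))


# Random access: the whole file is parsed once, then indexed like a list.
rows = load_rows(io.StringIO(CSV_TEXT))
last_name = rows[-1]["NAME"]
count = len(rows)

# Streaming: forward-only iteration, with no len() and no indexing.
streamed_names = [row["NAME"] for row in iter_rows(io.StringIO(CSV_TEXT))]
```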

class schrodinger.rdkit.molio.CsvMolWriter(filename, properties=None, cxsmiles=False)

Bases: object

Write a CSV file given Mol objects, using a StructureWriter-like API. The first two columns are the SMILES and title, and the rest are the properties of the molecule.

  • We don’t use structure.SmilesCsvWriter because it is too slow due to all the conversions (the overall job takes 4 times as long, so the bottleneck clearly becomes the writing of the output file!).

  • We don’t use Chem.SmilesWriter because even though it can use comma as a delimiter, it doesn’t write proper CSV files because it doesn’t know how to escape the delimiter.

Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.

__init__(filename, properties=None, cxsmiles=False)
Parameters
  • filename (str or file-like object) – file to write

  • properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties are written. (CAVEAT: if filename is a file object rather than an actual filename, only the properties present in the first molecule are written.)

  • cxsmiles (bool) – when writing SMILES, use CXSMILES extensions

append(mol)

Write a molecule to the file. The first time this is called, the header row is written based on mol’s properties or the properties passed to __init__, if any.

Parameters

mol (rdkit.Chem.rdchem.Mol) – molecule
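
A minimal sketch of the "header on first append" behavior, written against the standard csv module. The class name, the `SMILES` and `NAME` column labels, and the `(smiles, title, props)` call signature are assumptions made for illustration; the real append() takes a Mol.

```python
import csv
import io


class LazyHeaderCsvWriter:
    """Hypothetical stand-in: writes the header on the first append() call."""

    def __init__(self, fh, properties=None):
        self._writer = csv.writer(fh, lineterminator="\n")
        self._properties = properties  # None means "use the first record's keys"
        self._header_written = False

    def append(self, smiles, title, props):
        if not self._header_written:
            if self._properties is None:
                self._properties = list(props)
            self._writer.writerow(["SMILES", "NAME"] + self._properties)
            self._header_written = True
        # Properties missing from this record are written as empty fields.
        values = [props.get(name, "") for name in self._properties]
        self._writer.writerow([smiles, title] + values)


buf = io.StringIO()
writer = LazyHeaderCsvWriter(buf)
writer.append("CCO", "ethanol, absolute", {"r_user_mw": 46.07})
writer.append("C", "methane", {"r_user_mw": 16.04})
output = buf.getvalue()
```

Because csv.writer quotes any field containing the delimiter, the title with a comma survives a round trip, which is the escaping that a plain delimiter-joined writer lacks.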

toSmiles(mol)

close()

class schrodinger.rdkit.molio.BasePfxMolReader(filename)

Bases: object

Parent class for PfxMolReader and PfxMolIterator.

__init__(filename)

close()

class schrodinger.rdkit.molio.PfxMolReader(filename)

Bases: schrodinger.rdkit.molio.BasePfxMolReader

Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

Like CsvMolReader, PfxMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a file having only SMILES and an ID, this takes about 100 MB per million entries.
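
The general shape of "a zip archive containing a CSV file and a metadata JSON file" can be sketched with zipfile. The member names and the `size` metadata key below are assumptions for illustration only; the actual PFX layout is not documented here.

```python
import csv
import io
import json
import os
import tempfile
import zipfile

# Hypothetical member names and metadata key, chosen for this sketch.
CSV_MEMBER = "structures.csv"
META_MEMBER = "metadata.json"

path = os.path.join(tempfile.mkdtemp(), "reactants.zip")

# Build a small archive holding a CSV payload and a JSON metadata file.
with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(CSV_MEMBER, "SMILES,NAME\nCCO,ethanol\nCCN,ethylamine\n")
    zf.writestr(META_MEMBER, json.dumps({"size": 2}))

# Read the metadata without parsing the CSV, then read the CSV rows.
with zipfile.ZipFile(path) as zf:
    meta = json.loads(zf.read(META_MEMBER))
    with zf.open(CSV_MEMBER) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text))
```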

csv_mol_reader_class

alias of schrodinger.rdkit.molio.CsvMolReader

__len__()

class schrodinger.rdkit.molio.PfxMolIterator(filename)

Bases: schrodinger.rdkit.molio.BasePfxMolReader

Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

Unlike PfxMolReader, PfxMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.

csv_mol_reader_class

alias of schrodinger.rdkit.molio.CsvMolIterator

class schrodinger.rdkit.molio.PfxMolWriter(filename, properties=None)

Bases: object

Writer for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.

__init__(filename, properties=None)
Parameters
  • filename (str) – file to write

  • properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties present on the first structure will be written (the assumption is that all molecules will have the same properties, or at least that the first molecule has all the properties that we care about).

append(mol)

Write a molecule to the file.

Parameters

mol (rdkit.Chem.rdchem.Mol) – molecule

property written_count

close()

class schrodinger.rdkit.molio.RdkitMolWriter(filename, v3000=False)

Bases: object

Write Mol objects to a file using the RDKit file-writing classes, but with a StructureWriter-like API. Supports SMILES and SDF.

__init__(filename, v3000=False)
Parameters
  • filename (str) – filename to write

  • v3000 (bool) – when writing SD, force the use of the V3000 format

property written_count

append(mol)

close()

class schrodinger.rdkit.molio.NoneSkipper(supplier)

Bases: object

A wrapper for a mol supplier, which, when iterated through, skips the None mols, and can also be used as a context manager.

__init__(supplier)
Parameters

supplier (iterable of Mol) – supplier of molecules

__len__()
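
The wrapping idea can be sketched in a few lines. `SkipNone` below is a hypothetical stand-in, not the NoneSkipper implementation, and closing the wrapped supplier on exit is an assumption of this sketch.

```python
class SkipNone:
    """Hypothetical wrapper: iterate over a supplier, dropping None entries."""

    def __init__(self, supplier):
        self._supplier = supplier

    def __iter__(self):
        return (item for item in self._supplier if item is not None)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the wrapped supplier if it offers a close() method.
        close = getattr(self._supplier, "close", None)
        if callable(close):
            close()
        return False


# Strings stand in for Mol objects; None marks a record that failed to parse.
with SkipNone(["CCO", None, "c1ccccc1", None]) as mols:
    good = list(mols)
```
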
class schrodinger.rdkit.molio.GzippedSDMolSupplier(filename, *a, **kw)

Bases: rdkit.Chem.rdmolfiles.ForwardSDMolSupplier

Subclass of ForwardSDMolSupplier to read gzip-compressed files. Use as a context manager to ensure that the file gets closed.

__init__(filename, *a, **kw)
Parameters
  • filename (str) – gzip-compressed file

  • a – positional arguments to pass through to parent

  • kw – keyword arguments to pass through to parent

schrodinger.rdkit.molio.get_mol_writer(filename, generate_coordinates=True, require_stereo=False, v3000=False, cxsmiles=False)

Return a StructureWriter-like object based on the command-line arguments. RDKit is used for non-Maestro formats.

Parameters
  • filename (str) – filename to write

  • generate_coordinates (bool) – generate 3D coordinates (non-SMILES formats)

  • require_stereo (bool) – when generating coordinates, fail when there’s unspecified stereochemistry, instead of producing an arbitrary isomer

  • v3000 (bool) – when writing SD, force the use of the V3000 format

  • cxsmiles (bool) – when writing SMILES, use CXSMILES extensions

schrodinger.rdkit.molio.supported_output_format(filename)

Check whether we know how to write a file with a given name, but without actually opening a file. Used for argument validation.

Return type

bool

schrodinger.rdkit.molio.get_mol_reader(filename, skip_bad=True, implicitH=True, random_access=True)

Return a Mol reader given a filename or a SMILES string. For .smi and .csv files, use the RDKit SmilesMolSupplier; for other formats, use StructureReader but convert Structure to Mol before yielding each molecule.

Whenever possible, the reader will be a Sequence. This is currently the case for .smi and .csv files when skip_bad is False, and for a SMILES string, which returns a list of size 1.

Parameters
  • skip_bad (bool) – if True, bad structures are skipped implicitly instead of being yielded as None (applies only to SMILES and CSV formats).

  • implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)

  • random_access (bool) – if False, the reader object can only be used as an iterator, and the file is not read in memory all at once. (Only applies to CSV and PFX and is ignored for other formats, which provide no random access except for uncompressed SD.)

Return type

Generator or Sequence of Mol

schrodinger.rdkit.molio.get_mol(target, implicitH=True)

Read a Mol from a file or a SMILES string.

Parameters
  • target (str) – filename or SMILES

  • implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)

Return type

rdkit.Chem.Mol

schrodinger.rdkit.molio.combine_output_files(outfiles, out, dedup=True, sort=False, union_csv_columns=False, rdkit=False, v3000=False)

Write the final output file.

Parameters
  • outfiles (list[str]) – subjob output filenames

  • out (str) – output filename

  • dedup (bool) – skip duplicate products

  • sort (bool) – sort output (implies the subjob output is sorted)

  • union_csv_columns (bool) – for CSV output, write the union of the input files' columns.

  • rdkit (bool) – Use an RDKit writer for SD files.

  • v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

schrodinger.rdkit.molio.get_format_handler(infiles, outfile, union_csv_columns=False, rdkit=False, v3000=False)

Return the appropriate format handler for a specified output file type.

Parameters
  • infiles (list[str]) – subjob output filenames, used as input for merging

  • outfile (str) – output filename

  • union_csv_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)

  • rdkit (bool) – Use an RDKit writer for SD files.

  • v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

Returns

instance of a subclass of BaseMergeHandler

Return type

CsvMergeHandler, StructureMergeHandler, or SmiMergeHandler

schrodinger.rdkit.molio.merge_files_as_streams(infiles, outfile, file_handler, dedup)

Copies structures from infiles into outfile. Rejects duplicates using file_handler.getCompareKey. Assumes infiles are sorted.

Parameters
  • infiles (iterable over str) – names of the structure files to be joined

  • outfile (str) – output file name

  • file_handler (instance of subclass of BaseMergeHandler) – object to handle open, read and write operations for the file.

  • dedup (bool) – flag to indicate if duplicate products should be removed from merged output file

Returns

number of structures written

Return type

int
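
One way a streaming merge over sorted inputs can drop duplicates is to compare each item's key with the previous one. The sketch below uses heapq.merge and assumes the inputs are sorted by the same key that is used for duplicate detection; that assumption, the function name, and the sample lines are illustrative and not taken from the module.

```python
import heapq


def merge_sorted_unique(streams, key):
    """Merge already-sorted iterables, yielding each key only once."""
    last_key = object()  # sentinel that never equals a real key
    for item in heapq.merge(*streams, key=key):
        k = key(item)
        if k != last_key:
            yield item
            last_key = k


# Hypothetical sorted "SMILES name" lines from two subjob outputs.
file_a = ["C methane", "CC ethane", "CCO ethanol"]
file_b = ["CC ethane-dup", "CCC propane"]

merged = list(
    merge_sorted_unique([file_a, file_b], key=lambda line: line.split()[0])
)
```

Only one item per input is buffered at a time, so memory use stays flat regardless of file size; the price is that unsorted inputs would let duplicates through.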

schrodinger.rdkit.molio.merge_files_in_memory(infiles, outfile, filetype_handler, dedup)

Copies structures from infiles into outfile. Rejects duplicates using filetype_handler.getCompareKey.

Parameters
  • infiles (iterable over str) – names of the structure files to be joined

  • outfile (str) – output file name

Returns

number of structures written

Return type

int

class schrodinger.rdkit.molio.BaseMergeHandler

Bases: object

Base class for filetype handlers for subjob output deduplication and merging.

getProductReader(file)

Given a file name, create and return an iterable file handle to iterate over all products.

Parameters

file (str) – file name

Returns

iterable context manager over filetype-specific product format

Return type

iterable

getProductAppender(file)

Given a file name, create and return a file-writing object that writes a product when its “append” method is called.

Parameters

file (str) – file name

Returns

a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.

Return type

file-like object

getCompareKey(product)

Given a product (formatted according to the filetype), return the computed comparison key (SMILES string) for the product.

Parameters

product (filetype-specific product (type varies)) – filetype-specific product
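
To make the three-method interface concrete, here is a hypothetical handler for plain text files with one "SMILES title" record per line, together with an in-memory style deduplication loop that uses it. The class names and the record format are assumptions; the sketch does not subclass the real BaseMergeHandler.

```python
import os
import tempfile


class LineAppender:
    """Hypothetical writer exposing append() and context management."""

    def __init__(self, filename):
        self._fh = open(filename, "w")

    def append(self, line):
        self._fh.write(line.rstrip("\n") + "\n")

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._fh.close()
        return False


class LineMergeHandler:
    """Hypothetical handler for files with one 'SMILES title' record per line."""

    def getProductReader(self, file):
        # A text file object is already an iterable context manager over lines.
        return open(file)

    def getProductAppender(self, file):
        return LineAppender(file)

    def getCompareKey(self, product):
        # Assumption: the first whitespace-separated token is the SMILES.
        return product.split()[0]


tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.smi")
dst = os.path.join(tmpdir, "out.smi")
with open(src, "w") as fh:
    fh.write("CCO ethanol\nCCO ethanol-again\nCC ethane\n")

handler = LineMergeHandler()
seen = set()
with handler.getProductReader(src) as reader, handler.getProductAppender(dst) as out:
    for product in reader:
        key = handler.getCompareKey(product)
        if key not in seen:  # keep only the first product for each key
            seen.add(key)
            out.append(product)

with open(dst) as fh:
    result = fh.read()
```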

class schrodinger.rdkit.molio.CsvMergeHandler(infiles, outfile, union_columns=True, dedup_field=None)

Bases: schrodinger.rdkit.molio.BaseMergeHandler

Class to bundle csv read/write operations

__init__(infiles, outfile, union_columns=True, dedup_field=None)
Parameters
  • infiles (list(str)) – list of subjob output files whose columns are combined, if necessary.

  • outfile (str) – output file

  • union_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)

  • dedup_field (str) – csv column to use to check for duplicates during deduplication

getProductReader(file)

Open a csv file, skip the first (header) line if necessary, and return a context-managing iterable over all remaining lines.

Parameters

file (str) – file name

Returns

iterable context manager over csv lines

Return type

_CsvReadWrapper (iter(str) or iter(dict))

getProductAppender(file)

Open a csv file, write the first (header) line, and return a line writer that supports append() calls.

Parameters

file (str) – file name

Returns

a file handle that supports the append() call used in merge_files_in_memory and merge_files_as_streams.

Return type

file-like object

getHeader()

Returns the header for ProductAppenders to reference.

Returns

Header line for the input csv files.

Return type

str

getCompareKey(prod)

Compute SMILES from a given csv-formatted product.

Parameters

prod (dict or list) – product in question

Returns

SMILES string

Return type

str

class schrodinger.rdkit.molio.StructureMergeHandler

Bases: schrodinger.rdkit.molio.BaseMergeHandler

Helper class to bundle structure.Structure IO operations.

__init__()
getProductReader(file)

Create and return a structure reader

Parameters

file (str) – structure file name

Returns

structure reader for file

Return type

structure.StructureReader

getProductAppender(file)

Create and return a structure writer

Parameters

file (str) – structure file name

Returns

structure writer for file

Return type

structure.StructureWriter

getCompareKey(prod)

Compute SMILES from a given Schrodinger structure to compare against other structures.

Parameters

prod (structure.Structure) – product in question

Returns

SMILES string

Return type

str

class schrodinger.rdkit.molio.RdkitMergeHandler(v3000=False)

Bases: schrodinger.rdkit.molio.BaseMergeHandler

__init__(v3000=False)
Parameters

v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.

getProductReader(file)

Given a file name, create and return an iterable file handle to iterate over all products.

Parameters

file (str) – file name

Returns

iterable context manager over filetype-specific product format

Return type

iterable

getProductAppender(file)

Given a file name, create and return a file-writing object that writes a product when its “append” method is called.

Parameters

file (str) – file name

Returns

a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.

Return type

file-like object

getCompareKey(prod)
Parameters

prod (rdkit.Chem.Mol) – product in question

Returns

SMILES string

Return type

str

class schrodinger.rdkit.molio.SmiMergeHandler

Bases: schrodinger.rdkit.molio.BaseMergeHandler

Helper class to bundle SMILES (.smi) IO operations.

getProductReader(file)

Create and return a SMILES line reader

Parameters

file (str) – SMILES file name

Returns

SMILES line reader for file

Return type

file-like object (__enter__, __exit__, __iter__)

getProductAppender(file)

Create and return a SMILES line writer

Parameters

file (str) – SMILES file name

Returns

SMILES line writer for file

Return type

_SmilesAppender

getCompareKey(prod)

Compute SMILES from a given SMILES line for comparison to other SMILES lines.

Parameters

prod (str) – product in question

Returns

SMILES string

Return type

str

schrodinger.rdkit.molio.get_fieldnames(filenames)

Return a list with the union of the field names from all the given CSV files. The field names are listed in the order in which they were first seen. (First all the fields from file #1, then the “new” field names from file #2, etc.)

Parameters

filenames ([str]) – list of CSV files

Returns

list of field names

Return type

[str]
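
The first-seen ordering described above can be sketched as follows. The function name `union_fieldnames` is hypothetical, and in-memory file objects are used in place of filenames to keep the example self-contained.

```python
import csv
import io


def union_fieldnames(sources):
    """Return the union of CSV header fields in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for source in sources:
        header = next(csv.reader(source), [])
        for name in header:
            seen.setdefault(name, None)
    return list(seen)


# In-memory stand-ins for two CSV files with partly different columns.
file1 = io.StringIO("SMILES,NAME,r_user_score\nCCO,ethanol,1.0\n")
file2 = io.StringIO("SMILES,NAME,s_user_source\nCC,ethane,vendor\n")

fields = union_fieldnames([file1, file2])
```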

schrodinger.rdkit.molio.is_csvgz(filename)

schrodinger.rdkit.molio.is_pfx(filename)

schrodinger.rdkit.molio.get_pfx_size(filename)

Return the size from the metadata header of a .pfx file.

schrodinger.rdkit.molio.extract_structures(filename, dest_file)

Extract structures from .pfx file into a given file.

schrodinger.rdkit.molio.remove_react_atom_props(mol)

Return a copy of mol where atom properties added by the RDKit reaction module have been stripped out.

Parameters

mol (rdkit.Chem.Mol) – input molecule; not modified

Returns

modified molecule

Return type

rdkit.Chem.Mol

schrodinger.rdkit.molio.cat_csv_files(source_filenames, dest_filename)

Quick and dirty csv concatenation strategy. Assumes all csv files have the same columns and does not deduplicate.

Parameters
  • source_filenames – input files

  • dest_filename – destination file
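
A concatenation of this kind might look like the sketch below. It assumes that each input starts with the same header line, that the header is kept only once, and that every file ends with a newline; none of these details are confirmed by the description above, and `cat_csv` is a hypothetical name.

```python
import os
import tempfile


def cat_csv(source_filenames, dest_filename):
    """Concatenate CSV files, keeping only the first file's header line."""
    with open(dest_filename, "w", newline="") as dest:
        for index, name in enumerate(source_filenames):
            with open(name, newline="") as src:
                header = src.readline()
                if index == 0:
                    dest.write(header)
                # Copy the remaining lines unchanged: no parsing, no dedup.
                for line in src:
                    dest.write(line)


# Create two small input files with identical columns (made-up data).
tmpdir = tempfile.mkdtemp()
parts = []
for i, body in enumerate(["CCO,ethanol\n", "CC,ethane\n"]):
    path = os.path.join(tmpdir, f"part{i}.csv")
    with open(path, "w", newline="") as fh:
        fh.write("SMILES,NAME\n" + body)
    parts.append(path)

combined_path = os.path.join(tmpdir, "combined.csv")
cat_csv(parts, combined_path)
with open(combined_path, newline="") as fh:
    combined = fh.read()
```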

schrodinger.rdkit.molio.copy_csv_file(input_file, output_file)

Copy compressed or uncompressed input .csv file to another .csv file. Output file can also be compressed or uncompressed.

Parameters
  • input_file (str) – input file name

  • output_file (str) – output file name
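
Copying between compressed and uncompressed files can be sketched with gzip and shutil. The sketch assumes compression is detected from a filename ending in "gz", as the reader and writer classes above do; the helper names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def open_csv(filename, mode):
    """Open a file in binary mode, using gzip when the name ends in 'gz'."""
    opener = gzip.open if filename.endswith("gz") else open
    return opener(filename, mode)


def copy_csv(input_file, output_file):
    """Copy bytes from input to output, (de)compressing as the names require."""
    with open_csv(input_file, "rb") as src, open_csv(output_file, "wb") as dest:
        shutil.copyfileobj(src, dest)


tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "in.csv")
packed = os.path.join(tmpdir, "copy.csv.gz")
unpacked = os.path.join(tmpdir, "roundtrip.csv")
with open(plain, "wb") as fh:
    fh.write(b"SMILES,NAME\nCCO,ethanol\n")

copy_csv(plain, packed)     # uncompressed -> compressed
copy_csv(packed, unpacked)  # compressed -> uncompressed
with open(unpacked, "rb") as fh:
    roundtrip = fh.read()
```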