schrodinger.rdkit.molio module¶
PathFinder helper functions for reading and writing files using RDKit Mol objects.
- class schrodinger.rdkit.molio.MolWriter(filename, generate_coordinates=True, require_stereo=False)¶
Bases:
StructureWriter
Write Mol objects to a file using a StructureWriter-like API, optionally generating 3D coordinates.
- __init__(filename, generate_coordinates=True, require_stereo=False)¶
Create a structure writer class based on the format.
- Parameters:
filename (str or pathlib.Path) – The filename to write to.
overwrite (bool) – If False, append to an existing file instead of overwriting it.
format (str) – The format of the file. Values should be specified by one of the module-level constants MAESTRO, MOL2, SD, SMILES, or SMILESCSV. If the format is not explicitly specified it will be determined from the suffix of the filename. Multi-structure PDB files are not supported.
stereo (enum) –
Use of the stereo option in the constructor is pending deprecation. Please use the setOption method instead.
See the class docstring for documentation on the stereo options.
allow_empty_file (bool) – whether we should create a file with no structures if we don’t append any structures. Only a valid option for Maestro files.
- append(mol)¶
Append the provided structure to the open file.
- class schrodinger.rdkit.molio.StructureReaderAdapter(reader, implicitH=True)¶
Bases:
object
A wrapper for a Structure reader, which, when iterated through, yields RDKit Mol objects, and can also be used as a context manager that closes the reader on exit.
- __init__(reader, implicitH=True)¶
- Parameters:
reader (iterable of Structure) – source of structures to convert
implicitH (bool) – use implicit hydrogens
- class schrodinger.rdkit.molio.BaseCsvMolReader(file, name_field: str = None)¶
Bases:
object
Parent class for CsvMolReader and CsvMolIterator.
- __init__(file, name_field: str = None)¶
- Parameters:
file (str or file-like object) – CSV filename (file may be compressed) or file-like object.
name_field (str) – name of the field to use as the molecule name, will take precedence over the default NAME_FIELDS
- NAME_FIELDS = ('NAME', 's_m_title', 'Name')¶
- close()¶
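The name-field precedence described above can be illustrated with a minimal sketch. The helper `pick_name` is hypothetical and not part of the module; it also assumes that the defaults are tried in the order listed in NAME_FIELDS and that they remain a fallback when the explicit field is absent, neither of which is stated in the documentation.

```python
# Default candidates, copied from the documented class attribute.
NAME_FIELDS = ('NAME', 's_m_title', 'Name')


def pick_name(row, name_field=None):
    """Return the molecule name from a CSV row (a dict), or '' if none is found.

    Hypothetical helper: an explicit name_field is tried first, then the
    defaults in their listed order (an assumption, not documented behavior).
    """
    candidates = ((name_field,) if name_field else ()) + NAME_FIELDS
    for field in candidates:
        if row.get(field):
            return row[field]
    return ''


row = {'SMILES': 'CCO', 'NAME': 'ethanol', 'ID': 'cmpd-1'}
default_name = pick_name(row)
explicit_name = pick_name(row, name_field='ID')
```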
- class schrodinger.rdkit.molio.CsvMolReader(file, *args, **kwargs)¶
Bases:
BaseCsvMolReader
Read a SMILES CSV file, returning Mol objects.
This is similar to RDKit’s SmilesMolSupplier with delimiter=’,’, except that it uses the csv module instead of naively splitting on commas. This makes it possible to have field values containing commas, as long as they are quoted following the CSV convention. Note, however, that multi-line records are still not supported for efficiency reasons.
Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.
A CsvMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a CSV file having only SMILES and an ID, this takes about 100 MB per million entries.
- __init__(file, *args, **kwargs)¶
- Parameters:
file (str or file-like object) – CSV filename (file may be compressed) or file-like object.
name_field (str) – name of the field to use as the molecule name, will take precedence over the default NAME_FIELDS
- __len__()¶
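The difference between naive comma splitting and the csv module, mentioned in the CsvMolReader description, can be seen in a small standard-library sketch. The sample data is made up for illustration, and no Schrödinger or RDKit code is involved.

```python
import csv
import io

# Hypothetical two-column SMILES CSV; the last record has a quoted name
# that itself contains a comma.
text = 'SMILES,NAME\nCCO,ethanol\nc1ccccc1,"benzene, aromatic"\n'

# Naive splitting breaks the quoted field into two pieces.
naive_rows = [line.split(',') for line in text.splitlines()]

# The csv module honours the CSV quoting convention.
csv_rows = list(csv.reader(io.StringIO(text)))
```

Here the naive split yields three fields for the last record, while csv.reader yields two.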
- class schrodinger.rdkit.molio.CsvMolIterator(file, name_field: str = None)¶
Bases:
BaseCsvMolReader
Read a SMILES CSV file, returning Mol objects.
Unlike CsvMolReader, CsvMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.
- class schrodinger.rdkit.molio.CsvMolWriter(filename, properties=None, cxsmiles=False)¶
Bases:
object
Write a CSV file given Mol objects, using a StructureWriter-like API. The first two columns are the SMILES and title, and the rest are the properties of the molecule.
We don’t use structure.SmilesCsvWriter because it is too slow due to all the conversions (the overall job takes 4 times as long, so the bottleneck clearly becomes the writing of the output file!).
We don’t use Chem.SmilesWriter because even though it can use comma as a delimiter, it doesn’t write proper CSV files because it doesn’t know how to escape the delimiter.
Also, gzip-compressed files (identified by the filename ending in “gz”) are supported.
- __init__(filename, properties=None, cxsmiles=False)¶
- Parameters:
filename (str or file-like object) – file to write
properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties are written. (CAVEAT: if filename is a file object rather than an actual filename, only the properties present in the first molecule are written.)
cxsmiles (bool) – when writing SMILES, use CXSMILES extensions
- append(mol)¶
Write a molecule to the file. The first time this is called, the header row is written based on mol’s properties or the properties passed to __init__, if any.
- Parameters:
mol (rdkit.Chem.rdchem.Mol) – molecule
- toSmiles(mol)¶
- close()¶
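As a rough illustration of the two points made for CsvMolWriter (proper escaping of the delimiter, and gzip selection by a filename ending in "gz"), the following standard-library sketch writes a small file and reads it back. The helper `open_text_for_write`, the sample records, and the column name `r_user_mw` are assumptions made for the example, not the module's implementation.

```python
import csv
import gzip
import os
import tempfile


def open_text_for_write(filename):
    """Open a text file for writing, gzip-compressed if the name ends in 'gz'."""
    if filename.endswith('gz'):
        return gzip.open(filename, 'wt', newline='')
    return open(filename, 'w', newline='')


# Hypothetical records: (SMILES, title, one extra property).
records = [('CCO', 'ethanol', '46.07'),
           ('c1ccccc1', 'benzene, aromatic', '78.11')]

path = os.path.join(tempfile.mkdtemp(), 'out.csv.gz')
with open_text_for_write(path) as fh:
    writer = csv.writer(fh)
    writer.writerow(['SMILES', 'NAME', 'r_user_mw'])  # header row first
    writer.writerows(records)  # the title containing a comma gets quoted

# Read the compressed file back to check that the comma survived.
with gzip.open(path, 'rt', newline='') as fh:
    rows = list(csv.reader(fh))
```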
- class schrodinger.rdkit.molio.BasePfxMolReader(filename)¶
Bases:
object
Parent class for PfxMolReader and PfxMolIterator.
- __init__(filename)¶
- close()¶
- class schrodinger.rdkit.molio.PfxMolReader(filename)¶
Bases:
BasePfxMolReader
Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.
Like CsvMolReader, PfxMolReader supports random access, like a list. Upon instantiation, the file is read in full and kept in memory. For a file having only SMILES and an ID, this takes about 100 MB per million entries.
- csv_mol_reader_class¶
alias of
CsvMolReader
- __len__()¶
- class schrodinger.rdkit.molio.PfxMolIterator(filename)¶
Bases:
BasePfxMolReader
Reader for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.
Unlike PfxMolReader, PfxMolIterator does not support random access, but since it only keeps one line in memory at a time, memory use is minimal.
- csv_mol_reader_class¶
alias of
CsvMolIterator
- class schrodinger.rdkit.molio.PfxMolWriter(filename, properties=None)¶
Bases:
object
Writer for PFX (PathFinder reactants) files. These are really zip archives containing a CSV file and a metadata JSON file.
- __init__(filename, properties=None)¶
- Parameters:
filename (str) – file to write
properties (list of str or None) – optional, list of names of properties to write to output file. If None, all the properties present on the first structure will be written (the assumption is that all molecules will have the same properties, or at least that the first molecule has all the properties that we care about).
- append(mol)¶
Write a molecule to the file.
- Parameters:
mol (rdkit.Chem.rdchem.Mol) – molecule
- property written_count¶
- close()¶
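The general shape of a PFX-style file (a zip archive holding a CSV file and a metadata JSON file) can be sketched with the standard library. The member names `structures.csv` and `metadata.json` and the `size` key are hypothetical; the actual PFX layout is not documented on this page.

```python
import csv
import io
import json
import zipfile

# Hypothetical member names, chosen only for this illustration.
CSV_MEMBER = 'structures.csv'
META_MEMBER = 'metadata.json'


def write_archive(fileobj, header, rows):
    """Write a zip archive with one CSV member and one JSON metadata member."""
    buf = io.StringIO(newline='')
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    with zipfile.ZipFile(fileobj, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(CSV_MEMBER, buf.getvalue())
        zf.writestr(META_MEMBER, json.dumps({'size': len(rows)}))


def read_size(fileobj):
    """Read the record count from the metadata without parsing the CSV."""
    with zipfile.ZipFile(fileobj) as zf:
        return json.loads(zf.read(META_MEMBER))['size']


archive = io.BytesIO()
write_archive(archive, ['SMILES', 'NAME'], [('CCO', 'ethanol'), ('CC', 'ethane')])
size = read_size(archive)
```

Keeping a count in the metadata member means a reader can report the number of records without decompressing and parsing the CSV member.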
- class schrodinger.rdkit.molio.RdkitMolWriter(filename, v3000=False)¶
Bases:
object
Write Mol objects to a file using the RDKit file-writing classes, but with a StructureWriter-like API. Supports SMILES and SDF.
- __init__(filename, v3000=False)¶
- Parameters:
filename (str) – filename to write
v3000 (bool) – when writing SD, force the use of the V3000 format
- property written_count¶
- append(mol)¶
- close()¶
- class schrodinger.rdkit.molio.NoneSkipper(supplier)¶
Bases:
object
A wrapper for a mol supplier, which, when iterated through, skips the None mols, and can also be used as a context manager.
- __init__(supplier)¶
- Parameters:
supplier (iterable of Mol) – supplier of molecules
- __len__()¶
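The wrapper pattern described for NoneSkipper can be sketched in a few lines. The class `SkipNone` below is a hypothetical stand-in, not the module's code; in particular, closing the wrapped supplier on exit is an assumption.

```python
class SkipNone:
    """Hypothetical stand-in: iterate over a supplier, dropping None entries."""

    def __init__(self, supplier):
        self.supplier = supplier

    def __iter__(self):
        return (item for item in self.supplier if item is not None)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Close the underlying supplier if it provides a close() method.
        close = getattr(self.supplier, 'close', None)
        if close is not None:
            close()
        return False  # never suppress exceptions


# Strings stand in for Mol objects; None marks an unparsable record.
with SkipNone(['mol-a', None, 'mol-b', None]) as supplier:
    kept = list(supplier)
```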
- class schrodinger.rdkit.molio.GzippedSDMolSupplier(filename, *a, **kw)¶
Bases:
ForwardSDMolSupplier
Subclass of ForwardSDMolSupplier to read gzip-compressed files. Use as a context manager to ensure that the file gets closed.
- __init__(filename, *a, **kw)¶
- Parameters:
filename (str) – gzip-compressed file
a – positional arguments to pass through to parent
kw – keyword arguments to pass through to parent
- schrodinger.rdkit.molio.get_mol_writer(filename, generate_coordinates=True, require_stereo=False, v3000=False, cxsmiles=False)¶
Return a StructureWriter-like object based on the command-line arguments. RDKit is used for non-Maestro formats.
- Parameters:
filename (str) – filename to write
generate_coordinates (bool) – generate 3D coordinates (non-SMILES formats)
require_stereo (bool) – when generating coordinates, fail when there’s unspecified stereochemistry, instead of producing an arbitrary isomer
v3000 (bool) – when writing SD, force the use of the V3000 format
cxsmiles (bool) – when writing SMILES, use CXSMILES extensions
- schrodinger.rdkit.molio.supported_output_format(filename)¶
Check whether we know how to write a file with a given name, but without actually opening a file. Used for argument validation.
- Return type:
bool
- schrodinger.rdkit.molio.get_mol_reader(filename, skip_bad=True, implicitH=True, random_access=True)¶
Return a Mol reader given a filename or a SMILES string. For .smi and .csv files, use the RDKit SmilesMolSupplier; for other formats, use StructureReader but convert Structure to Mol before yielding each molecule.
Whenever possible, the reader will be a Sequence. This is currently the case for .smi and .csv files when skip_bad is False (and for a SMILES string, which returns a list of size 1).
- Parameters:
skip_bad (bool) – if True, bad structures are skipped implicitly, instead of being yielded as None (only applies to SMILES and CSV formats.)
implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)
random_access (bool) – if False, the reader object can only be used as an iterator, and the file is not read in memory all at once. (Only applies to CSV and PFX and is ignored for other formats, which provide no random access except for uncompressed SD.)
- Return type:
Generator or Sequence of Mol
- schrodinger.rdkit.molio.get_mol(target, implicitH=True)¶
Read a Mol from a file or a SMILES string.
- Parameters:
target (str) – filename or SMILES
implicitH (bool) – use implicit hydrogens (only has an effect when reading Maestro files)
- Return type:
rdkit.Chem.Mol
- schrodinger.rdkit.molio.combine_output_files(outfiles, out, dedup=True, sort=False, union_csv_columns=False, rdkit=False, v3000=False)¶
Write the final output file.
- Parameters:
outfiles (list[str]) – subjob output filenames
out (str) – output filename
dedup (bool) – skip duplicate products
sort (bool) – sort output (implies the subjob output is sorted)
union_csv_columns (bool) – if the output is CSV, write the union of the input files’ columns.
rdkit (bool) – Use an RDKit writer for SD files.
v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.
- schrodinger.rdkit.molio.get_format_handler(infiles, outfile, union_csv_columns=False, rdkit=False, v3000=False)¶
Return the appropriate format handler for a specified output file type.
- Parameters:
infiles (list[str]) – subjob output filenames, used as input for merging
outfile (str) – output filename
union_csv_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)
rdkit (bool) – Use an RDKit writer for SD files.
v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.
- Returns:
instance of a subclass of BaseMergeHandler
- Return type:
- schrodinger.rdkit.molio.merge_files_as_streams(infiles, outfile, file_handler, dedup)¶
Copies structures from infiles into outfile. Rejects duplicates using file_handler.getCompareKey. Assumes infiles are sorted.
- Parameters:
infiles (iterable over str) – names of the structure files to be joined
outfile (str) – output file name
file_handler (instance of subclass of BaseMergeHandler) – object to handle open, read and write operations for the file.
dedup (bool) – flag to indicate if duplicate products should be removed from merged output file
- Returns:
number of structures written
- Return type:
int
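One way to merge already-sorted inputs as streams while rejecting duplicates by a comparison key is sketched below with heapq.merge. This illustrates the general technique only, under the assumption that duplicates end up adjacent in key order; it is not the implementation of merge_files_as_streams, and `merge_sorted` is a hypothetical name.

```python
import heapq


def merge_sorted(streams, key, dedup=True):
    """Yield items from already-sorted streams in key order.

    With dedup=True, an item whose key equals the previous key is dropped,
    so only one item per key is kept.
    """
    previous = object()  # sentinel that never equals a real key
    for item in heapq.merge(*streams, key=key):
        current = key(item)
        if dedup and current == previous:
            continue
        previous = current
        yield item


# Hypothetical "SMILES name" lines; the comparison key is the SMILES token.
file_a = ['C methane', 'CC ethane', 'CCO ethanol']
file_b = ['CC ethane-dup', 'CCC propane']
merged = list(merge_sorted([file_a, file_b], key=lambda line: line.split()[0]))
```

Only one item per input stream is held at a time, which is why this approach requires sorted inputs.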
- schrodinger.rdkit.molio.merge_files_in_memory(infiles, outfile, filetype_handler, dedup)¶
Copies structures from infiles into outfile. Rejects duplicates using filetype_handler.getCompareKey.
- Parameters:
infiles (iterable over str) – names of the structure files to be joined
outfile (str) – output file name
- Returns:
number of structures written
- Return type:
int
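When inputs are not sorted, duplicates can be rejected by remembering every key seen so far, at the cost of memory proportional to the number of distinct keys. The sketch below shows that general idea; `merge_in_memory` is a hypothetical helper that returns the merged list rather than a count, and it is not the module's implementation.

```python
def merge_in_memory(streams, key, dedup=True):
    """Concatenate streams, keeping only the first item seen for each key."""
    seen = set()
    merged = []
    for stream in streams:
        for item in stream:
            current = key(item)
            if dedup:
                if current in seen:
                    continue
                seen.add(current)
            merged.append(item)
    return merged


# Hypothetical unsorted "SMILES name" lines; the key is the SMILES token.
unsorted_a = ['CCO ethanol', 'C methane']
unsorted_b = ['C methane-dup', 'CCC propane']
result = merge_in_memory([unsorted_a, unsorted_b],
                         key=lambda line: line.split()[0])
```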
- class schrodinger.rdkit.molio.BaseMergeHandler¶
Bases:
object
Base class for filetype handlers for subjob output deduplication and merging.
- getProductReader(file)¶
Given a file name, create and return an iterable file handle to iterate over all products.
- Parameters:
file (str) – file name
- Returns:
iterable context manager over filetype-specific product format
- Return type:
iterable
- getProductAppender(file)¶
Given a file name, create and return a file-writing object that writes when its “append” method is called.
- Parameters:
file (str) – file name
- Returns:
a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
- Return type:
file-like object
- getCompareKey(product)¶
Given a product (formatted according to the filetype), return the computed comparison key (SMILES string) for the product.
- Parameters:
product (filetype-specific product (type varies)) – filetype-specific product
- class schrodinger.rdkit.molio.CsvMergeHandler(infiles, outfile, union_columns=True, dedup_field=None)¶
Bases:
BaseMergeHandler
Class to bundle csv read/write operations
- __init__(infiles, outfile, union_columns=True, dedup_field=None)¶
- Parameters:
infiles (list(str)) – list of subjob output files whose columns are joined, if necessary.
outfile (str) – output file
union_columns (bool) – flag to write out the union of infile csv columns (if infile columns differ)
dedup_field (str) – csv column to use to check for duplicates during deduplication
- getProductReader(file)¶
Open a csv file, skip the first (header) line if necessary, and return a context-managing iterable over all remaining lines.
- Parameters:
file (str) – file name
- Returns:
iterable context manager over csv lines
- Return type:
_CsvReadWrapper (iter(str) or iter(dict))
- getProductAppender(file)¶
Open a csv file, write the first (header) line, and return a line writer that supports the getProductAppender.append calls.
- Parameters:
file (str) – file name
- Returns:
a file handle that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
- Return type:
file-like object
- getHeader()¶
Returns the header for ProductAppenders to reference.
- Returns:
Header line for the input csv files.
- Return type:
str
- getCompareKey(prod)¶
Compute SMILES from a given csv-formatted product.
- Parameters:
prod (dict or list) – product in question
- Returns:
SMILES string
- Return type:
str
- class schrodinger.rdkit.molio.StructureMergeHandler¶
Bases:
BaseMergeHandler
Helper class to bundle structure.Structure IO operations.
- getProductReader(file)¶
Create and return a structure reader
- Parameters:
file (str) – structure file name
- Returns:
structure reader for file
- Return type:
- getProductAppender(file)¶
Create and return a structure writer
- Parameters:
file (str) – structure file name
- Returns:
structure writer for file
- Return type:
- getCompareKey(prod)¶
Compute SMILES from a given Schrodinger structure to compare against other structures.
- Parameters:
prod (structure.Structure) – product in question
- Returns:
SMILES string
- Return type:
str
- class schrodinger.rdkit.molio.RdkitMergeHandler(v3000=False)¶
Bases:
BaseMergeHandler
- __init__(v3000=False)¶
- Parameters:
v3000 (bool) – If using an RDKit writer and writing an SD file, force V3000 format.
- getProductReader(file)¶
Given a file name, create and return an iterable file handle to iterate over all products.
- Parameters:
file (str) – file name
- Returns:
iterable context manager over filetype-specific product format
- Return type:
iterable
- getProductAppender(file)¶
Given a file name, create and return a file-writing object that writes when its “append” method is called.
- Parameters:
file (str) – file name
- Returns:
a file handle with context management that supports the append() call used in merge_files_in_memory and merge_files_as_streams.
- Return type:
file-like object
- getCompareKey(prod)¶
- Parameters:
prod (rdkit.Chem.Mol) – product in question
- Returns:
SMILES string
- Return type:
str
- class schrodinger.rdkit.molio.SmiMergeHandler¶
Bases:
BaseMergeHandler
Helper class to bundle SMILES (.smi) IO operations.
- getProductReader(file)¶
Create and return a SMILES line reader
- Parameters:
file (str) – SMILES file name
- Returns:
SMILES line reader for file
- Return type:
file-like object (__enter__, __exit__, __iter__)
- getProductAppender(file)¶
Create and return a SMILES line writer
- Parameters:
file (str) – SMILES file name
- Returns:
SMILES line writer for file
- Return type:
_SmilesAppender
- getCompareKey(prod)¶
Compute SMILES from a given SMILES line for comparison to other SMILES lines.
- Parameters:
prod (str) – product in question
- Returns:
SMILES string
- Return type:
str
- schrodinger.rdkit.molio.get_fieldnames(filenames)¶
Return a list with the union of the field names from all the given CSV files. The field names are listed in the order in which they were first seen. (First all the fields from file #1, then the “new” field names from file #2, etc.)
- Parameters:
filenames ([str]) – list of CSV files
- Returns:
list of field names
- Return type:
[str]
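The first-seen ordering described for get_fieldnames can be sketched as follows. For the sake of a self-contained example, the hypothetical helper `union_fieldnames` takes open file-like objects rather than filenames, and the column names are made up.

```python
import csv
import io


def union_fieldnames(files):
    """Return the union of CSV header fields, in first-seen order."""
    fields = []
    seen = set()
    for fh in files:
        header = next(csv.reader(fh), [])  # an empty file contributes nothing
        for name in header:
            if name not in seen:
                seen.add(name)
                fields.append(name)
    return fields


first = io.StringIO('SMILES,NAME,r_user_score\nCCO,ethanol,1.5\n')
second = io.StringIO('SMILES,NAME,s_user_source\nCC,ethane,vendor\n')
fields = union_fieldnames([first, second])
```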
- schrodinger.rdkit.molio.is_csvgz(filename)¶
- schrodinger.rdkit.molio.is_pfx(filename)¶
- schrodinger.rdkit.molio.get_pfx_size(filename)¶
Return the size from the metadata header of a .pfx file.
- schrodinger.rdkit.molio.extract_structures(filename, dest_file)¶
Extract structures from .pfx file into a given file.
- schrodinger.rdkit.molio.remove_react_atom_props(mol)¶
Return a copy of mol where atom properties added by the RDKit reaction module have been stripped out.
- Parameters:
mol (rdkit.Chem.Mol) – input molecule; not modified
- Returns:
modified molecule
- Return type:
rdkit.Chem.Mol
- schrodinger.rdkit.molio.cat_csv_files(source_filenames, dest_filename)¶
Quick and dirty csv concatenation strategy. Assumes all csv files have the same columns and does not deduplicate.
- Parameters:
source_filenames – input files
dest_filename – destination file
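A simple concatenation of this kind might look like the sketch below, which works on file-like objects and assumes that the header line is written once, from the first source. Whether cat_csv_files handles headers this way is not stated here, so treat that as an assumption; `cat_csv` is a hypothetical name.

```python
import io


def cat_csv(sources, dest):
    """Write the header from the first source only, then every data line."""
    for index, src in enumerate(sources):
        header = src.readline()
        if index == 0:
            dest.write(header)
        for line in src:
            dest.write(line)  # no parsing and no deduplication


part1 = io.StringIO('SMILES,NAME\nCCO,ethanol\n')
part2 = io.StringIO('SMILES,NAME\nCC,ethane\nCC,ethane\n')
combined = io.StringIO()
cat_csv([part1, part2], combined)
```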
- schrodinger.rdkit.molio.copy_csv_file(input_file, output_file)¶
Copy compressed or uncompressed input .csv file to another .csv file. Output file can also be compressed or uncompressed.
- Parameters:
input_file (str) – input file name
output_file (str) – output file name
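Copying between compressed and uncompressed files can be done by choosing the opener from the filename on each side and streaming bytes across. The sketch below assumes compression is identified by a filename ending in "gz", as for the readers and writers above; `open_maybe_gz` and `copy_file` are hypothetical helpers.

```python
import gzip
import os
import shutil
import tempfile


def open_maybe_gz(filename, mode):
    """Open in binary mode, using gzip when the name ends in 'gz'."""
    opener = gzip.open if filename.endswith('gz') else open
    return opener(filename, mode)


def copy_file(input_file, output_file):
    """Stream bytes from input to output, (de)compressing as needed."""
    with open_maybe_gz(input_file, 'rb') as src, \
            open_maybe_gz(output_file, 'wb') as dst:
        shutil.copyfileobj(src, dst)


workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, 'in.csv')
packed = os.path.join(workdir, 'out.csv.gz')
with open(plain, 'w', newline='') as fh:
    fh.write('SMILES,NAME\nCCO,ethanol\n')

copy_file(plain, packed)  # uncompressed input, compressed output

with gzip.open(packed, 'rt') as fh:
    round_trip = fh.read()
```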