schrodinger.protein.align module

class schrodinger.protein.align.ASLResult(ref_ok, other_ok, other_skips)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

other_ok

Alias for field number 1

other_skips

Alias for field number 2

ref_ok

Alias for field number 0
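
Because ASLResult is a named tuple, its fields can be read by name or by position. The sketch below rebuilds an equivalent tuple with collections.namedtuple purely for illustration. It is a stand-in, not the class from schrodinger.protein.align, and the field values are assumptions about what the fields might hold.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented ASLResult.
ASLResult = namedtuple("ASLResult", ["ref_ok", "other_ok", "other_skips"])

# Assumed example values: two flags and a list of skipped sequence names.
result = ASLResult(ref_ok=True, other_ok=False, other_skips=["seq_B"])

# Access by field name or by index; both refer to the same item.
by_name = result.ref_ok
by_index = result[0]

# Tuple unpacking follows the documented field order.
ref_ok, other_ok, other_skips = result
```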

exception schrodinger.protein.align.CantAlignException

Bases: Exception

Exception raised when an aligner cannot start, e.g. because there are not enough sequences.

__init__(*args, **kwargs)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class schrodinger.protein.align.AbstractAligner

Bases: object

Base class of objects that can perform an alignment

abstract run(aln)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align
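
The override contract can be sketched with Python's abc module. Everything here is hypothetical: AlignerSketch and LeftJustifyAligner are illustration-only names, and the alignment is modeled as a plain list of lists of characters rather than a BaseAlignment.

```python
from abc import ABC, abstractmethod

GAP = "~"  # gap character chosen for this sketch only


class AlignerSketch(ABC):
    """Hypothetical stand-in for an aligner base class."""

    @abstractmethod
    def run(self, aln):
        """Align the sequences in ``aln`` in place."""


class LeftJustifyAligner(AlignerSketch):
    """Toy subclass: drop all gaps, then pad rows to a common length."""

    def run(self, aln):
        stripped = [[res for res in row if res != GAP] for row in aln]
        width = max((len(row) for row in stripped), default=0)
        for row, new_row in zip(aln, stripped):
            # Mutate each row in place so callers see the new alignment.
            row[:] = new_row + [GAP] * (width - len(new_row))


toy_aln = [list("~AC~D"), list("A~C")]
LeftJustifyAligner().run(toy_aln)
```

Instantiating AlignerSketch directly raises TypeError, which is how the abc module enforces that subclasses supply run.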

class schrodinger.protein.align.RescodeAligner

Bases: schrodinger.protein.align.AbstractAligner

Aligns sequences by rescode

run(aln)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align

class schrodinger.protein.align.AbstractPairwiseAligner(preserve_reference_gaps=False)

Bases: schrodinger.protein.align.AbstractAligner

Abstract class for pairwise alignment where gaps can be merged into the entire alignment to preserve relative alignment of all non-reference sequences to the reference sequence.

Subclasses must implement _getPairwiseGaps to align one sequence to the reference sequence. Subclasses may override _run to customize the alignment (e.g. validation or setup of additional data needed by _getPairwiseGaps).

__init__(preserve_reference_gaps=False)
Parameters

preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence.

run(aln, seqs_to_align=None, **kwargs)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.
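
The defaulting and validation of seqs_to_align described above might look roughly like the following. This is a sketch under assumptions: the alignment is modeled as a plain list whose first item is the reference, CantAlignError is a local stand-in for CantAlignException, resolve_seqs_to_align is a hypothetical helper name, and raising when fewer than two sequences are present is a guess, not documented behavior.

```python
class CantAlignError(Exception):
    """Local stand-in for CantAlignException."""


def resolve_seqs_to_align(aln, seqs_to_align=None):
    """Return the sequences to align against the reference, aln[0]."""
    if seqs_to_align is None:
        # Assumed behavior: without a second sequence there is nothing to align.
        if len(aln) < 2:
            raise CantAlignError("need a reference and one other sequence")
        # Documented default: the first non-reference sequence.
        return [aln[1]]
    for seq in seqs_to_align:
        # Membership is checked by identity so equal-looking copies are rejected.
        if not any(seq is member for member in aln):
            raise CantAlignError(f"{seq!r} is not in the alignment")
    return list(seqs_to_align)


toy_aln = [list("ACDE"), list("ADE"), list("ACE")]
default_choice = resolve_seqs_to_align(toy_aln)
explicit_choice = resolve_seqs_to_align(toy_aln, [toy_aln[2]])
```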

class schrodinger.protein.align.AbstractNWPairwiseAligner(preserve_reference_gaps=False, gap_open_penalty=1, gap_extend_penalty=0, sub_matrix=None, direct_scores=False, ss_constraints=False, penalize_end_gaps=True)

Bases: schrodinger.protein.align.AbstractPairwiseAligner

Abstract class for the Needleman-Wunsch global alignment algorithm for pairwise sequence alignment with affine gap penalties.

Variables
  • CONSTRAINT_SCORE (float) – Reward for keeping constrained residues aligned.

  • RES_MATCH_BONUS (float) – Reward for aligning matching residues. Used by default if a substitution matrix is not specified.

  • RES_MISMATCH_PENALTY (float) – Penalty for aligning differing residues. Used by default if a substitution matrix is not specified.

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
__init__(preserve_reference_gaps=False, gap_open_penalty=1, gap_extend_penalty=0, sub_matrix=None, direct_scores=False, ss_constraints=False, penalize_end_gaps=True)
Parameters
  • preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence

  • gap_open_penalty (float) – Penalty for opening a gap. Should be >=0.

  • gap_extend_penalty (float) – Penalty for extending a gap. Should be >=0.

  • sub_matrix (2D float array or dict mapping (char, char) to float) – Scoring matrix to be used for the alignment. If no matrix is specified, this method uses residue identity measure.

  • direct_scores (bool) – Use scoring matrix directly as (NxM) where N, M are lengths of both sequences rather than default 20x20 substitution matrix.

  • ss_constraints (bool) – Whether to constrain the alignment so no gaps appear in middle of a secondary structure.

  • penalize_end_gaps (bool) – Whether to penalize start/end gaps
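
To make the roles of the gap penalties concrete, here is a minimal score-only sketch of Needleman-Wunsch with affine gaps (the three-matrix formulation often attributed to Gotoh). It is not the module's code. The function name, the identity scoring, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are assumptions, and end gaps are always penalized here.

```python
NEG_INF = float("-inf")


def nw_affine_score(ref, other, match=1.0, mismatch=1.0,
                    gap_open=1.0, gap_extend=0.0):
    """Return the best global alignment score of two strings."""
    n, m = len(ref), len(other)
    # M: last column aligns two residues.
    # X: last column aligns a residue of ref with a gap in other.
    # Y: last column aligns a residue of other with a gap in ref.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if ref[i - 1] == other[j - 1] else -mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1],
                                 Y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


identical = nw_affine_score("ACDE", "ACDE")
one_gap = nw_affine_score("ACDE", "ADE")
long_gap = nw_affine_score("AAAA", "A")
```

With the default gap_extend_penalty of 0, a long gap costs the same as a short one under this convention, which is why the third call scores the same as a single match minus one gap opening.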

run(aln, seqs_to_align=None, **kwargs)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.

class schrodinger.protein.align.SchrodingerPairwiseAligner(**kwargs)

Bases: schrodinger.protein.align.AbstractNWPairwiseAligner

Implementation of the Needleman-Wunsch global alignment algorithm for pairwise sequence alignment with affine gap penalties. It provides:

  1. ability to merge new sequence with existing alignment,

  2. ability to penalize gaps in secondary structure elements,

  3. ability to use custom substitution matrix generated from a family of proteins or provided by the user.

Note:

Any residues with variant residue types will have their short codes uppercased. This means they will be treated identically to their standard variant. If a nonstandard residue type has a lowercase short code that doesn’t match its standard variant, or if we need special treatment for variant residues, _getMatrixValue will have to be changed.

__init__(**kwargs)
Parameters
  • preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence

  • gap_open_penalty (float) – Penalty for opening a gap. Should be >=0.

  • gap_extend_penalty (float) – Penalty for extending a gap. Should be >=0.

  • sub_matrix (2D float array or dict mapping (char, char) to float) – Scoring matrix to be used for the alignment. If no matrix is specified, this method uses residue identity measure.

  • direct_scores (bool) – Use scoring matrix directly as (NxM) where N, M are lengths of both sequences rather than default 20x20 substitution matrix.

  • ss_constraints (bool) – Whether to constrain the alignment so no gaps appear in middle of a secondary structure.

  • penalize_end_gaps (bool) – Whether to penalize start/end gaps

getAlignmentScore()

Get the score of the alignment. Found by taking the highest value in the scoring matrix.

Returns

Score of the pairwise alignment.

Return type

float

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
run(aln, seqs_to_align=None, **kwargs)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.

class schrodinger.protein.align.BiopythonPairwiseAligner(*args, **kwargs)

Bases: schrodinger.protein.align.AbstractNWPairwiseAligner

Pairwise alignment using Biopython.

Note:

Any residues with variant residue types will have their short codes uppercased. This means they will be treated identically to their standard variant. If a nonstandard residue type has a lowercase short code that doesn’t match its standard variant, or if we need special treatment for variant residues, _getMatrixValue will have to be changed.

__init__(*args, **kwargs)
Parameters
  • preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence

  • gap_open_penalty (float) – Penalty for opening a gap. Should be >=0.

  • gap_extend_penalty (float) – Penalty for extending a gap. Should be >=0.

  • sub_matrix (2D float array or dict mapping (char, char) to float) – Scoring matrix to be used for the alignment. If no matrix is specified, this method uses residue identity measure.

  • direct_scores (bool) – Use scoring matrix directly as (NxM) where N, M are lengths of both sequences rather than default 20x20 substitution matrix.

  • ss_constraints (bool) – Whether to constrain the alignment so no gaps appear in middle of a secondary structure.

  • penalize_end_gaps (bool) – Whether to penalize start/end gaps

generateSubMatrix()

Generate the identity substitution matrix if not provided.

generateIdentitySubMatrix(res_keys)

Generate the basic identity sub matrix based on existing values.

Parameters

res_keys – list of values to be included in the sub matrix
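
One plausible shape for such a matrix is a dict keyed by residue pairs, matching the "dict mapping (char, char) to float" form accepted by sub_matrix. The sketch below is an assumption-laden illustration: the function name is hypothetical, and storing the mismatch penalty as a negative score is a guess based on RES_MATCH_BONUS and RES_MISMATCH_PENALTY.

```python
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0


def identity_sub_matrix(res_keys):
    """Build a {(a, b): score} dict over every ordered pair of keys."""
    return {
        (a, b): RES_MATCH_BONUS if a == b else -RES_MISMATCH_PENALTY
        for a in res_keys
        for b in res_keys
    }


matrix = identity_sub_matrix(["A", "C", "D"])
```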

getAlignmentScore()

Get the score of the alignment. Found by taking the highest value in the scoring matrix.

Returns

Score of the pairwise alignment.

Return type

float

getMatrixValue(res1, res2)

Returns the score for aligning res1 and res2. These can be either characters or residue.Residue objects. If they are residue.Residue objects, we check whether they are matching anchor residues and return a large score if they are. Otherwise, we use their short codes by calling str(res).upper().

Warning:

This is called very frequently by Biopython’s aligner, so if any changes need to be made, make sure that performance is still reasonable.

Parameters
  • res1 (str or residue.Residue) – The first residue to score.

  • res2 (str or residue.Residue) – The second residue to score.

Returns

alignment score

Return type

float
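
The lookup order described for getMatrixValue can be sketched as follows. Res is a hypothetical stand-in for residue.Residue, anchor_id is an invented attribute used only to model "matching anchor residues", and the function name is illustrative. The real method may detect anchors differently.

```python
from dataclasses import dataclass
from typing import Optional

CONSTRAINT_SCORE = 10000.0


@dataclass(frozen=True)
class Res:
    """Hypothetical residue: a short code plus an optional anchor tag."""
    code: str
    anchor_id: Optional[int] = None

    def __str__(self):
        return self.code


def matrix_value(res1, res2, sub_matrix):
    """Score a pair, giving anchored pairs priority over the matrix."""
    if isinstance(res1, Res) and isinstance(res2, Res):
        if res1.anchor_id is not None and res1.anchor_id == res2.anchor_id:
            return CONSTRAINT_SCORE
    # Fall back to upper-cased short codes, as the docstring describes.
    return sub_matrix[(str(res1).upper(), str(res2).upper())]


sub_matrix = {("A", "A"): 1.0, ("A", "C"): -1.0,
              ("C", "A"): -1.0, ("C", "C"): 1.0}
plain = matrix_value("a", "A", sub_matrix)
anchored = matrix_value(Res("A", anchor_id=7), Res("C", anchor_id=7), sub_matrix)
unanchored = matrix_value(Res("A"), Res("C"), sub_matrix)
```

Keeping the function body to one cheap check and one dict lookup reflects the performance warning above, since a scoring callback runs once per cell of the alignment matrix.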

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
run(aln, seqs_to_align=None, **kwargs)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.

class schrodinger.protein.align.FamilyPairwiseAligner(anno_type: schrodinger.infra.util.ANNOTATION_TYPES, cdr_scheme: Optional[schrodinger.infra.util.AntibodyCDRScheme] = None, custom_annotation: Optional[schrodinger.models.jsonable.CustomAnnotation] = None, *args, **kwargs)

Bases: schrodinger.protein.align.BiopythonPairwiseAligner

Pairwise alignment for family features using Biopython.

__init__(anno_type: schrodinger.infra.util.ANNOTATION_TYPES, cdr_scheme: Optional[schrodinger.infra.util.AntibodyCDRScheme] = None, custom_annotation: Optional[schrodinger.models.jsonable.CustomAnnotation] = None, *args, **kwargs)
Parameters
  • anno_type – Annotation type - one of: antibody_cdr, gpcr_segment, kinase_features

  • cdr_scheme – Antibody CDR scheme, only required if annotation type is antibody_cdr

  • custom_annotation – Custom annotation with descriptions, only required if annotation type is custom_annotation

run(aln, seqs_to_align=None, **kwargs)

Aligns the sequences and removes redundant aligned gaps.

generateSubMatrix()

Generate the identity substitution matrix for the annotation type.

getMatrixValue(res1, res2) → float

@overrides: BiopythonPairwiseAligner

Return the score for aligning residues based on annotation values.

It searches the substitution matrix for a match, then checks the pair against the constrained pairs. Constrained values always have priority but are checked second for performance.

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
generateIdentitySubMatrix(res_keys)

Generate the basic identity sub matrix based on existing values.

Parameters

res_keys – list of values to be included in the sub matrix

getAlignmentScore()

Get the score of the alignment. Found by taking the highest value in the scoring matrix.

Returns

Score of the pairwise alignment.

Return type

float

class schrodinger.protein.align.PrimeSTAAligner(protein_family=None)

Bases: schrodinger.protein.align.AbstractAligner

Sequence alignment using $SCHRODINGER/sta

__init__(protein_family=None)
Parameters

protein_family (str or NoneType) – ‘GPCR’ for specialized alignment or None for default templates.

run(aln, structured_seq=None, constraints=None)
Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • structured_seq (sequence.ProteinSequence or NoneType) – Structured sequence to use as reference. If None, the first non-reference seq will be aligned.

  • constraints (list(tuple(residue.Residue, residue.Residue)) or NoneType) – Pairs of (reference_seq, structured_seq) residues to constrain

class schrodinger.protein.align.ClustalAligner

Bases: schrodinger.protein.align.AbstractAligner

Aligns sequences using the Clustal alignment algorithm.

run(aln)

Aligns the sequences in an alignment

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align

class schrodinger.protein.align.SuperpositionAligner(gap_open_penalty=None, gap_extend_penalty=None)

Bases: schrodinger.protein.align.BiopythonPairwiseAligner

Align structured sequences based on their superposition.

__init__(gap_open_penalty=None, gap_extend_penalty=None)
Parameters
  • preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence

  • gap_open_penalty (float) – Penalty for opening a gap. Should be >=0.

  • gap_extend_penalty (float) – Penalty for extending a gap. Should be >=0.

  • sub_matrix (2D float array or dict mapping (char, char) to float) – Scoring matrix to be used for the alignment. If no matrix is specified, this method uses residue identity measure.

  • direct_scores (bool) – Use scoring matrix directly as (NxM) where N, M are lengths of both sequences rather than default 20x20 substitution matrix.

  • ss_constraints (bool) – Whether to constrain the alignment so no gaps appear in middle of a secondary structure.

  • penalize_end_gaps (bool) – Whether to penalize start/end gaps

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
generateIdentitySubMatrix(res_keys)

Generate the basic identity sub matrix based on existing values.

Parameters

res_keys – list of values to be included in the sub matrix

generateSubMatrix()

Generate the identity substitution matrix if not provided.

getAlignmentScore()

Get the score of the alignment. Found by taking the highest value in the scoring matrix.

Returns

Score of the pairwise alignment.

Return type

float

getMatrixValue(res1, res2)

Returns the score for aligning res1 and res2. These can be either characters or residue.Residue objects. If they are residue.Residue objects, we check whether they are matching anchor residues and return a large score if they are. Otherwise, we use their short codes by calling str(res).upper().

Warning:

This is called very frequently by Biopython’s aligner, so if any changes need to be made, make sure that performance is still reasonable.

Parameters
  • res1 (str or residue.Residue) – The first residue to score.

  • res2 (str or residue.Residue) – The second residue to score.

Returns

alignment score

Return type

float

run(aln, seqs_to_align=None, **kwargs)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.

class schrodinger.protein.align.AbstractStructureAligner(keywords=None, **kwargs)

Bases: schrodinger.protein.align.AbstractAligner

Subclasses must reimplement run:
  • Call _setUpSeqs to set up instance attributes for the current alignment

  • Call _setASLs to validate and store ASLs

  • Call _getUniqueEidSeqs to get the sequences to align

  • Call _runStructureAlignment to call the backend

class Result(ref_seq, other_seq, psd, rmsd)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

other_seq

Alias for field number 1

psd

Alias for field number 2

ref_seq

Alias for field number 0

rmsd

Alias for field number 3

__init__(keywords=None, **kwargs)
Parameters

keywords (dict) – Keywords to pass to the ska backend

getResultSeqs()
abstract run(aln)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align

class schrodinger.protein.align.StructureAligner(keywords=None, **kwargs)

Bases: schrodinger.protein.align.AbstractStructureAligner

Run structure alignment using the specified sequences to create chain ASLs

run(aln, seqs_to_align, **kwargs)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align

class Result(ref_seq, other_seq, psd, rmsd)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

other_seq

Alias for field number 1

psd

Alias for field number 2

ref_seq

Alias for field number 0

rmsd

Alias for field number 3

__init__(keywords=None, **kwargs)
Parameters

keywords (dict) – Keywords to pass to the ska backend

getResultSeqs()
class schrodinger.protein.align.CustomASLStructureAligner(keywords=None, ref_asl=None, other_asl=None)

Bases: schrodinger.protein.align.AbstractStructureAligner

Run structure alignment using specified ASLs

SENTINEL = <object object>
__init__(keywords=None, ref_asl=None, other_asl=None)
Parameters

keywords (dict) – Keywords to pass to the ska backend

evaluateASLs(aln, seqs_to_align)

Determine whether the ASLs match any atoms in the sequences’ structures

Parameters
  • aln – Alignment

  • seqs_to_align – Sequences to align

Return type

ASLResult

run(aln, seqs_to_align, **kwargs)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align

class Result(ref_seq, other_seq, psd, rmsd)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

other_seq

Alias for field number 1

psd

Alias for field number 2

ref_seq

Alias for field number 0

rmsd

Alias for field number 3

getResultSeqs()
class schrodinger.protein.align.MaxIdentityAligner

Bases: schrodinger.protein.align.BiopythonPairwiseAligner

Pairwise aligner that maximizes the number of matching residues between two sequences. There are no penalties for mismatches or gaps.
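
When matches are the only thing scored, the best achievable number of matching residues equals the length of the longest common subsequence of the two sequences. The sketch below computes that count with a standard dynamic program. It illustrates the objective only and is not how this class is implemented (it runs through Biopython per its base class). The function name is hypothetical.

```python
def max_identity_count(ref, other):
    """Return the largest number of residues that can be matched in order."""
    prev = [0] * (len(other) + 1)
    for a in ref:
        cur = [0]
        for j, b in enumerate(other, start=1):
            if a == b:
                # Extend the best alignment of the two shorter prefixes.
                cur.append(prev[j - 1] + 1)
            else:
                # Mismatches and gaps are free, so carry the better neighbor.
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


with_deletion = max_identity_count("ACDE", "ADE")
no_overlap = max_identity_count("AAA", "CCC")
```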

__init__()
Parameters
  • preserve_reference_gaps (bool) – Whether to preserve the gaps in the reference sequence

  • gap_open_penalty (float) – Penalty for opening a gap. Should be >=0.

  • gap_extend_penalty (float) – Penalty for extending a gap. Should be >=0.

  • sub_matrix (2D float array or dict mapping (char, char) to float) – Scoring matrix to be used for the alignment. If no matrix is specified, this method uses residue identity measure.

  • direct_scores (bool) – Use scoring matrix directly as (NxM) where N, M are lengths of both sequences rather than default 20x20 substitution matrix.

  • ss_constraints (bool) – Whether to constrain the alignment so no gaps appear in middle of a secondary structure.

  • penalize_end_gaps (bool) – Whether to penalize start/end gaps

run(aln)

kwargs are additional arguments that will be passed to _run.

Parameters
  • aln (alignment.Alignment) – The alignment containing sequences to align.

  • seqs_to_align (list(sequence.Sequence)) – The sequences in aln to align against the reference sequence of aln. If None, defaults to the first non-reference sequence in aln (i.e. aln[1]).

Raises

CantAlignException – If seqs_to_align contains a sequence not found in aln.

CONSTRAINT_SCORE = 10000
RES_MATCH_BONUS = 1.0
RES_MISMATCH_PENALTY = 1.0
generateIdentitySubMatrix(res_keys)

Generate the basic identity sub matrix based on existing values.

Parameters

res_keys – list of values to be included in the sub matrix

generateSubMatrix()

Generate the identity substitution matrix if not provided.

getAlignmentScore()

Get the score of the alignment. Found by taking the highest value in the scoring matrix.

Returns

Score of the pairwise alignment.

Return type

float

getMatrixValue(res1, res2)

Returns the score for aligning res1 and res2. These can be either characters or residue.Residue objects. If they are residue.Residue objects, we check whether they are matching anchor residues and return a large score if they are. Otherwise, we use their short codes by calling str(res).upper().

Warning:

This is called very frequently by Biopython’s aligner, so if any changes need to be made, make sure that performance is still reasonable.

Parameters
  • res1 (str or residue.Residue) – The first residue to score.

  • res2 (str or residue.Residue) – The second residue to score.

Returns

alignment score

Return type

float

class schrodinger.protein.align.StructurelessGapAligner

Bases: schrodinger.protein.align.AbstractAligner

Align all structureless residues with gaps

For example, given the following alignment (where circled letters are structureless residues):

    Resnum: 0 1 2 3 4 5
    Seq1:   Ⓐ Ⓡ Ⓒ A D E
    Seq2:   Ⓒ Ⓐ Ⓝ A D A

The result will be:

    Resnum: 0 1 2 3 4 5 6 7 8
    Seq1:   ~ ~ ~ Ⓐ Ⓡ Ⓒ A D E
    Seq2:   Ⓒ Ⓐ Ⓝ ~ ~ ~ A D A
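
The example above can be reproduced for two sequences with a short sketch. It rests on several assumptions: each residue is modeled as a (code, has_structure) pair, structureless residues are written in lowercase instead of circled letters, and the second sequence's structureless run is placed before the first sequence's run only because that ordering matches the example. The function name is hypothetical, and the real aligner may order runs or handle more sequences differently.

```python
GAP = "~"


def gap_structureless(seq1, seq2):
    """Give every structureless residue a column shared only with gaps."""
    out1, out2 = [], []
    i = j = 0
    while i < len(seq1) or j < len(seq2):
        # Structureless run in seq2: pair each residue with a gap in seq1.
        while j < len(seq2) and not seq2[j][1]:
            out1.append(GAP)
            out2.append(seq2[j][0])
            j += 1
        # Structureless run in seq1: pair each residue with a gap in seq2.
        while i < len(seq1) and not seq1[i][1]:
            out1.append(seq1[i][0])
            out2.append(GAP)
            i += 1
        # Any residues left at i and j are structured; emit one column.
        has1, has2 = i < len(seq1), j < len(seq2)
        if has1 or has2:
            out1.append(seq1[i][0] if has1 else GAP)
            out2.append(seq2[j][0] if has2 else GAP)
            i += 1 if has1 else 0
            j += 1 if has2 else 0
    return "".join(out1), "".join(out2)


def tag(text):
    """Lowercase letters stand for structureless residues in this sketch."""
    return [(ch, ch.isupper()) for ch in text]


row1, row2 = gap_structureless(tag("arcADE"), tag("canADA"))
```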

run(aln, seqs_to_align=None)

Aligns the sequences in an alignment using the parameters supplied on init

Subclasses need to override this default implementation.

Parameters

aln (schrodinger.protein.alignment.BaseAlignment) – The alignment to align