schrodinger.seam.pretrained_coder module¶
A compression coder that uses a pre-trained zstd dictionary for improved compression ratios on structurally similar data.
This is particularly beneficial for disk-intensive Seam pipelines (e.g. PackingSearch) where serialized Structure objects share common patterns like MAE format boilerplate, property keys, and atom type definitions.
Enable via environment variable:
export SCHRODINGER_SEAM_PRETRAINED_COMPRESSION=1
- schrodinger.seam.pretrained_coder.is_pretrained_compression_enabled() bool¶
Check if pretrained dictionary compression is enabled.
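The module does not document how the check is performed; a plausible sketch, assuming the function simply reads the documented environment variable (the accepted values are an assumption):

```python
import os

def is_pretrained_compression_enabled() -> bool:
    # Assumption: the feature is on only when the documented variable is
    # set to "1"; the real implementation may accept other truthy values.
    return os.environ.get("SCHRODINGER_SEAM_PRETRAINED_COMPRESSION", "0") == "1"
```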
- schrodinger.seam.pretrained_coder.train_zstd_dictionary(samples: List[bytes], dict_size: int = 32768) ZstdCompressionDict¶
Train a zstd compression dictionary from sample data.
The dictionary captures common patterns across the samples and can significantly improve compression ratios for similar data, especially for elements that are individually small but structurally similar (e.g. serialized Structure objects from the same molecular system).
- Parameters:
samples – encoded byte strings to train the dictionary on. At least 100 samples of representative data are recommended.
dict_size – maximum dictionary size in bytes.
- Returns:
a trained ZstdCompressionDict ready for use with MaybePretrainedCompressedCoder.
- class schrodinger.seam.pretrained_coder.MaybePretrainedCompressedCoder(coder: Coder, dict_data: Optional[ZstdCompressionDict] = None)¶
Bases:
FastCoder
A wrapper coder that compresses elements using a pre-trained zstd dictionary for improved compression ratios.
This coder is similar to MaybeCompressedCoder but uses a pre-trained dictionary that captures common patterns in the data. This provides ~40% better compression than standard zstd on Structure-heavy pipelines.
The coder is backward compatible: it can decode data encoded by MaybeCompressedCoder (without a dictionary).
Enable in the SeamRunner via environment variable:
export SCHRODINGER_SEAM_PRETRAINED_COMPRESSION=1
- URN = 'seam:coders:MaybePretrainedCompressedCoder'¶
- __init__(coder: Coder, dict_data: Optional[ZstdCompressionDict] = None)¶
- is_deterministic()¶
Whether this coder is guaranteed to encode values deterministically.
A deterministic coder is required for key coders in GroupByKey operations to produce consistent results.
For example, the default coder, the PickleCoder, is not deterministic: the ordering of pickled entries in maps may vary across executions since there is no defined order, so such a coder is generally not suitable for use as a key coder in GroupByKey operations, since each instance of the same key may be encoded differently.
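The map-ordering caveat is easy to demonstrate in plain Python: two equal dicts with different insertion orders pickle to different byte strings, so the "same" key would be encoded differently:

```python
import pickle

# Equal dicts built in different insertion orders.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

assert a == b                              # equal values...
assert pickle.dumps(a) != pickle.dumps(b)  # ...but different encodings
```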
- Returns:
Whether coder is deterministic.
- to_type_hint() type¶
- estimate_size(element: bytes) int¶
Estimates the encoded size of the given value, in bytes.
Dataflow estimates the encoded size of a PCollection processed in a pipeline step by using the estimated size of a random sample of elements in that PCollection.
The default implementation encodes the given value and returns its byte size. If a coder can provide a fast estimate of the encoded size of a value (e.g., if the encoding has a fixed size), it can provide its estimate here to improve performance.
- Arguments:
element: the value whose encoded size is to be estimated.
- Returns:
The estimated encoded size of the given value.
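The default strategy described above can be sketched with a toy coder (a hypothetical class, for illustration only):

```python
import pickle

class ToyCoder:
    # Hypothetical coder illustrating the default estimate_size behavior.
    def encode(self, value) -> bytes:
        return pickle.dumps(value)

    def estimate_size(self, value) -> int:
        # Default: actually encode the value and measure the result. A coder
        # with a fixed-width encoding could instead return a constant here.
        return len(self.encode(value))
```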
- to_runner_api_parameter(unused_context)¶
- static from_runner_api_parameter(payload, components, unused_context)¶