schrodinger.seam.transforms.samplers module

class schrodinger.seam.transforms.samplers.RandomSample(n: int, seed: Optional[int] = None, distinct=False)

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that returns approximately n random elements.

For n less than or equal to 100,000 (N_CUTOFF), exactly n elements are sampled. For larger n, the number of elements sampled will, on average, be at most 0.3% off from n.

The seed value is only used if n is larger than 100,000.
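The split between exact and approximate sampling can be pictured with a plain-Python sketch. It is not the Seam implementation, which is not shown in this reference. It only assumes that small n is served by an exact fixed-size draw and that large n is served by keeping each element independently with probability n / total. The name sketch_sample, the list-based input, and the handling of n >= len(items) are all hypothetical. Under that assumption the sample size is binomially distributed with a standard deviation of at most sqrt(n), which is roughly 0.3% of n at n = 100,000 and would be consistent with the figure quoted above.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")

N_CUTOFF = 100_000  # same value as RandomSample.N_CUTOFF


def sketch_sample(items: List[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Illustrative only: exact draw for small n, per-element coin flip for large n."""
    if n >= len(items):
        # Assumption: asking for more than is available returns everything.
        return list(items)
    if n <= N_CUTOFF:
        # Exact branch: always n elements. The seed is ignored here,
        # mirroring the documented rule that it only matters for large n.
        return random.sample(items, n)
    # Approximate branch: keep each element with probability n / total,
    # so the result size is close to n but not guaranteed to equal it.
    rng = random.Random(seed)
    keep_probability = n / len(items)
    return [x for x in items if rng.random() < keep_probability]


small = sketch_sample(list(range(1_000)), 10)
large = sketch_sample(list(range(400_000)), 200_000, seed=0)
print(len(small))                           # always 10
print(abs(len(large) - 200_000) / 200_000)  # relative deviation from n
```

In this sketch, repeating the large call with the same seed reproduces the same sample, while the small call differs from run to run.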

Example usage:

>>> import apache_beam as beam
>>> from schrodinger.seam.transforms.samplers import RandomSample
>>> with beam.Pipeline() as p:
...     sample = (p | beam.Create(range(10))
...                 | RandomSample(3))
...     # sample will contain three randomly selected elements

If distinct is True, the input PCollection is deduplicated before sampling.

N_CUTOFF = 100000
__init__(n: int, seed: Optional[int] = None, distinct=False)
expand(inputs)
WithCount()

Returns a tuple of the sampled PCollection and a PCollection containing the number of elements that were sampled from. When distinct is True, this count is taken after deduplication.

Example usage:

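>>> import apache_beam as beam
>>> from schrodinger.seam.transforms.samplers import RandomSample
>>> with beam.Pipeline() as p:
...     sample, count = (p | beam.Create([1, 1, 2, 2])
...                         | RandomSample(3, distinct=True).WithCount())
...     # sample will contain up to three randomly selected elements,
...     # drawn from the deduplicated input
...     # count will contain the number of distinct elements in the input PCollection (2)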
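
To make the relationship between distinct and the reported count concrete, here is a minimal in-memory sketch. It is a stand-in built on assumptions, not the Seam code: sample_with_count is a hypothetical helper, it works on a list rather than a PCollection, and it assumes that a request for more elements than are available simply returns all of them.

```python
import random
from typing import Hashable, List, Tuple


def sample_with_count(items: List[Hashable], n: int,
                      distinct: bool = False) -> Tuple[List[Hashable], int]:
    """Hypothetical helper: return (sample, size of the pool sampled from)."""
    # With distinct=True, drop duplicates first (order-preserving).
    pool = list(dict.fromkeys(items)) if distinct else list(items)
    count = len(pool)
    # Assumption: never try to draw more elements than the pool holds.
    sample = random.sample(pool, min(n, count))
    return sample, count


sample, count = sample_with_count([1, 1, 2, 2], 3, distinct=True)
print(sorted(sample), count)  # [1, 2] 2
```

Without distinct, the same call in this sketch would report a count of 4 and return three elements, possibly including repeated values.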
default_label()