schrodinger.seam.transforms.samplers module

class schrodinger.seam.transforms.samplers.RandomSample(n: int, seed: Optional[int] = None, distinct=False)

Bases: PTransform

A PTransform that returns approximately n random elements.

On average, the number of elements sampled differs from n by at most 0.3%. For small n (less than or equal to 100,000), exactly n elements are sampled.

The seed value is only used if n is larger than 100,000.
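The two regimes described above can be illustrated with a plain-Python sketch that does not use Beam. It is not this module's implementation: `sketch_sample` and `relative_std` are hypothetical helpers, and independent per-element selection for large n is an assumption, chosen because its relative spread at n = 100,000 is roughly 0.3%, which is consistent with the figure quoted above.

```python
import math
import random
from typing import List, Optional, Sequence

N_CUTOFF = 100_000  # same value as RandomSample.N_CUTOFF


def sketch_sample(items: Sequence[int], n: int,
                  seed: Optional[int] = None) -> List[int]:
    """Hypothetical illustration of exact vs. approximate sampling."""
    if n >= len(items):
        # Assumption: asking for more than is available returns everything.
        return list(items)
    if n <= N_CUTOFF:
        # Small n: draw exactly n elements without replacement.
        # The seed is ignored here, mirroring the note above.
        return random.sample(list(items), n)
    # Large n (assumed strategy): keep each element independently with
    # probability n / len(items), so the result size is only close to n.
    rng = random.Random(seed)
    keep_probability = n / len(items)
    return [x for x in items if rng.random() < keep_probability]


def relative_std(n: int, total: int) -> float:
    """Relative standard deviation of the sample size under that strategy."""
    p = n / total
    return math.sqrt(total * p * (1 - p)) / n
```

Under this assumed strategy the relative standard deviation of the sample size is sqrt((1 - p) / n), which never exceeds 1 / sqrt(n) and therefore stays below about 0.32% once n is above the cutoff.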

Example usage:

>>> with beam.Pipeline() as p:
...     sample = (p | beam.Create(range(10))
...                 | RandomSample(3))
...     # sample will contain three randomly selected elements

If distinct is True, the input PCollection is deduplicated before sampling.
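The effect of deduplicating before sampling can be sketched in plain Python. `sample_distinct` is a hypothetical helper, not part of this module, and capping the sample at the number of distinct elements is an assumption about what happens when n exceeds that number.

```python
import random
from typing import Hashable, Iterable, List


def sample_distinct(items: Iterable[Hashable], n: int) -> List[Hashable]:
    """Hypothetical sketch: deduplicate first, then sample up to n elements."""
    # dict.fromkeys drops duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(items))
    # Assumption: the sample cannot be larger than the deduplicated input.
    return random.sample(unique, min(n, len(unique)))
```

With the input [1, 1, 2, 2] and n = 3, only two distinct values exist, so this sketch returns both of them in random order.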

N_CUTOFF = 100000
__init__(n: int, seed: Optional[int] = None, distinct=False)
display_data() → dict

Returns the display data associated with a pipeline component.

It should be reimplemented in pipeline components that wish to have static display data.

Returns:

Dict[str, Any]: A dictionary containing key:value pairs. The value might be an integer, float or string value; a DisplayDataItem for values that have more data (e.g. short value, label, url); or a HasDisplayData instance that has more display data that should be picked up. For example:

{
  'key1': 'string_value',
  'key2': 1234,
  'key3': 3.14159265,
  'key4': DisplayDataItem('apache.org', url='http://apache.org'),
  'key5': subComponent
}
WithCount()

Returns a tuple of the sampled PCollection and a PCollection containing the number of input elements that were sampled from.

Example usage:

>>> with beam.Pipeline() as p:
...     sample, count = (p | beam.Create([1, 1, 2, 2])
...                         | RandomSample(3, distinct=True).WithCount())
...     # sample will contain at most three randomly selected distinct elements
...     # count will contain the number of distinct elements in the input pcollection (2)