schrodinger.seam.transforms.partitioners module

class schrodinger.seam.transforms.partitioners.FixedSample(n: int)

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that partitions a PCollection into two PCollections. The first is a random sample of n elements from the input PCollection, and the second contains the remaining elements.

This is useful for creating holdout / test sets in machine learning.

Example usage:

>>> import apache_beam as beam
>>> from schrodinger.seam.transforms.partitioners import FixedSample
>>> with beam.Pipeline() as p:
...     sample, remaining = (p
...         | beam.Create(list(range(10)))
...         | FixedSample(3))
...     # sample will contain three randomly selected elements from the
...     # input PCollection
...     # remaining will contain the remaining seven elements
__init__(n: int)
expand(pcoll)
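
The split described above can be illustrated outside of Beam with a minimal plain-Python sketch. It is an analogy for the sample/remaining semantics only, not the FixedSample implementation: the function name fixed_sample_split, the seed parameter, and the choice to cap the sample at the input size when n is larger are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def fixed_sample_split(
    items: Iterable[T], n: int, seed: Optional[int] = None
) -> Tuple[List[T], List[T]]:
    """Split items into (sample, remaining) lists.

    Hypothetical helper: mirrors the documented semantics on an
    in-memory list, not on a PCollection.
    """
    pool = list(items)
    rng = random.Random(seed)
    # Sample positions rather than values, so duplicate values in the
    # input are split correctly. Capping at len(pool) is an assumption.
    chosen = set(rng.sample(range(len(pool)), min(n, len(pool))))
    sample = [x for i, x in enumerate(pool) if i in chosen]
    remaining = [x for i, x in enumerate(pool) if i not in chosen]
    return sample, remaining


sample, remaining = fixed_sample_split(range(10), 3, seed=0)
# sample holds three of the ten inputs; remaining holds the other seven.
```

Every input element lands in exactly one of the two outputs, which is the property that makes this kind of split suitable for holdout sets.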
class schrodinger.seam.transforms.partitioners.Top(n: int, key: Optional[Callable[[Any], Any]] = None, reverse=False)

Bases: apache_beam.transforms.ptransform.PTransform

A PTransform that partitions a PCollection into two PCollections. The first contains the n largest elements of the input PCollection (or the n smallest if reverse is True), and the second contains the remaining elements.

Parameters:
  • n: The number of elements to take from the input PCollection.

  • key: A function that takes an element of the input PCollection and returns a value to compare for the purpose of determining the top n elements, like the key argument of Python’s built-in sorted function.

  • reverse: If True, the top n elements will be the n smallest elements of the input PCollection.

Example usage:

>>> import apache_beam as beam
>>> from schrodinger.seam.transforms.partitioners import Top
>>> with beam.Pipeline() as p:
...     top, remaining = (p
...         | beam.Create(list(range(10)))
...         | Top(3))
...     # top will contain [7, 8, 9]
...     # remaining will contain [0, 1, 2, 3, 4, 5, 6]
__init__(n: int, key: Optional[Callable[[Any], Any]] = None, reverse=False)
expand(pcoll)
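
How n, key, and reverse interact can be sketched in plain Python. This is an illustration of the documented semantics on an in-memory list, not the Top implementation: the function name top_split is hypothetical, n is assumed to be non-negative, and keeping the outputs in input order is a choice of this sketch (a PCollection carries no ordering guarantee).

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def top_split(
    items: Iterable[T],
    n: int,
    key: Optional[Callable[[Any], Any]] = None,
    reverse: bool = False,
) -> Tuple[List[T], List[T]]:
    """Split items into (top, remaining) lists.

    Hypothetical helper: by default the n largest elements go to top;
    with reverse=True the n smallest do, as the parameter list describes.
    """
    pool = list(items)

    def rank(i: int) -> Any:
        return pool[i] if key is None else key(pool[i])

    # Rank positions rather than values so ties and duplicates split
    # cleanly. Largest first unless reverse is True.
    order = sorted(range(len(pool)), key=rank, reverse=not reverse)
    chosen = set(order[:n])
    top = [x for i, x in enumerate(pool) if i in chosen]
    remaining = [x for i, x in enumerate(pool) if i not in chosen]
    return top, remaining


top, remaining = top_split(range(10), 3)
# top == [7, 8, 9]; remaining == [0, 1, 2, 3, 4, 5, 6]

smallest, rest = top_split(range(10), 3, reverse=True)
# smallest == [0, 1, 2]; rest == [3, 4, 5, 6, 7, 8, 9]

longest, others = top_split(["a", "ccc", "bb"], 1, key=len)
# longest == ["ccc"]; others == ["a", "bb"]
```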