schrodinger.seam.transforms.partitioners module
- class schrodinger.seam.transforms.partitioners.FixedSample(n: int)
Bases: apache_beam.transforms.ptransform.PTransform
A PTransform that takes a PCollection and partitions it into two PCollections. The first PCollection is a random sample of n elements from the input PCollection, and the second PCollection contains the remaining elements of the input PCollection.
This is useful for creating holdout / test sets in machine learning.
Example usage:
>>> with beam.Pipeline() as p:
...     sample, remaining = (p
...         | beam.Create(list(range(10)))
...         | FixedSample(3))
...     # sample will contain three randomly selected elements from the
...     # input PCollection
...     # remaining will contain the remaining seven elements
- __init__(n: int)
- expand(pcoll)
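The split that FixedSample describes can be illustrated on an ordinary Python list. This is only a sketch of the documented semantics, not the Seam implementation: `fixed_sample_split` and its `seed` parameter are hypothetical names introduced here, and capping n at the input size is an assumption, since the page does not say how FixedSample behaves when n exceeds the number of elements.

```python
import random
from typing import Any, List, Optional, Sequence, Tuple


def fixed_sample_split(
    elements: Sequence[Any], n: int, seed: Optional[int] = None
) -> Tuple[List[Any], List[Any]]:
    """Hypothetical helper: split elements into (sample, remaining)."""
    rng = random.Random(seed)
    # Assumption: cap n at the input size instead of raising an error.
    k = min(n, len(elements))
    # Sample positions rather than values so duplicate elements are
    # assigned to exactly one of the two outputs.
    chosen = set(rng.sample(range(len(elements)), k))
    sample = [x for i, x in enumerate(elements) if i in chosen]
    remaining = [x for i, x in enumerate(elements) if i not in chosen]
    return sample, remaining


sample, remaining = fixed_sample_split(list(range(10)), 3, seed=0)
```

Every input element lands in exactly one of the two outputs, which is the property that makes this kind of split suitable for holdout sets.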
- class schrodinger.seam.transforms.partitioners.Top(n: int, key: Optional[Callable[[Any], Any]] = None, reverse=False)
Bases: apache_beam.transforms.ptransform.PTransform
A PTransform that takes a PCollection and partitions it into two PCollections. The first PCollection contains the largest n elements of the input PCollection, and the second PCollection contains the remaining elements of the input PCollection.
- Parameters:
  - n: The number of elements to take from the input PCollection.
  - key: A function that takes an element of the input PCollection and returns a value to compare for the purpose of determining the top n elements, similar to Python’s built-in sorted function.
  - reverse: If True, the top n elements will be the n smallest elements of the input PCollection.
Example usage:
>>> with beam.Pipeline() as p:
...     top, remaining = (p
...         | beam.Create(list(range(10)))
...         | Top(3))
...     # top will contain [7, 8, 9]
...     # remaining will contain [0, 1, 2, 3, 4, 5, 6]
- __init__(n: int, key: Optional[Callable[[Any], Any]] = None, reverse=False)
- expand(pcoll)
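How n, key, and reverse interact in Top may be easier to see on a plain Python list. The following is a sketch of the documented behavior only, not how the transform is implemented: `top_split` is a hypothetical helper, and the order of elements within each output is an artifact of this sketch rather than something the page specifies.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def top_split(
    elements: Sequence[Any],
    n: int,
    key: Optional[Callable[[Any], Any]] = None,
    reverse: bool = False,
) -> Tuple[List[Any], List[Any]]:
    """Hypothetical helper: split elements into (top, remaining)."""

    def sort_key(i: int) -> Any:
        return key(elements[i]) if key is not None else elements[i]

    # By default the "top" elements are the largest, so sort in descending
    # order unless reverse=True asks for the smallest instead.
    order = sorted(range(len(elements)), key=sort_key, reverse=not reverse)
    chosen = set(order[:n])
    top = [elements[i] for i in order[:n]]
    remaining = [x for i, x in enumerate(elements) if i not in chosen]
    return top, remaining


top, remaining = top_split(list(range(10)), 3)
smallest, rest = top_split(list(range(10)), 3, reverse=True)
longest, others = top_split(["a", "ccc", "bb", "dddd"], 2, key=len)
```

As with sorted, key only determines the value used for comparison; the outputs still contain the original elements.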