Collections API¶

Collections¶

| | |
|---|---|
| Streaming | Superclass for streaming collections |
| Streaming.map_partitions | Map a function across all batch elements of this stream |
| Streaming.accumulate_partitions | Accumulate a function with state across batch elements |
| Streaming.verify | Verify elements that pass through this stream |

Batch¶

| | |
|---|---|
| Batch | A Stream of tuples or lists |
| Batch.filter | Filter elements by a predicate |
| Batch.map | Map a function across all elements |
| Batch.pluck | Pick a field out of all elements |
| Batch.to_dataframe | Convert to a streaming dataframe |
| Batch.to_stream | Concatenate batches and return base Stream |

Dataframes¶

| | |
|---|---|
| DataFrame | A Streaming dataframe |
| DataFrame.groupby | Groupby aggregations |
| DataFrame.rolling | Compute rolling aggregations |
| DataFrame.assign | Assign new columns to this dataframe |
| DataFrame.sum | Sum frame |
| DataFrame.mean | Average |
| DataFrame.cumsum | Cumulative sum |
| DataFrame.cumprod | Cumulative product |
| DataFrame.cummin | Cumulative minimum |
| DataFrame.cummax | Cumulative maximum |
| GroupBy | Groupby aggregations on streaming dataframes |
| GroupBy.count | Groupby-count |
| GroupBy.mean | Groupby-mean |
| GroupBy.size | Groupby-size |
| GroupBy.std | Groupby-std |
| GroupBy.sum | Groupby-sum |
| GroupBy.var | Groupby-variance |
| Rolling | Rolling aggregations |
| Rolling.aggregate | Rolling aggregation |
| Rolling.count | Rolling count |
| Rolling.max | Rolling maximum |
| Rolling.mean | Rolling mean |
| Rolling.median | Rolling median |
| Rolling.min | Rolling minimum |
| Rolling.quantile | Rolling quantile |
| Rolling.std | Rolling standard deviation |
| Rolling.sum | Rolling sum |
| Rolling.var | Rolling variance |
| Window | Sliding window operations |
| Window.apply | Apply an arbitrary function over each window of data |
| Window.count | Count elements within window |
| Window.groupby | Groupby-aggregations within window |
| Window.sum | Sum elements within window |
| Window.size | Number of elements within window |
| Window.std | Compute standard deviation of elements within window |
| Window.var | Compute variance of elements within window |
| Random | A streaming dataframe of random data |
Details¶
- 
class rapidz.collection.Streaming(stream=None, example=None, stream_type=None)¶
- Superclass for streaming collections
- Do not create this class directly; use one of the subclasses instead.
- Parameters
  - stream: rapidz.Stream
  - example: object
    An object to represent an example element of this stream
- See also
  - rapidz.dataframe.StreamingDataFrame, rapidz.dataframe.StreamingBatch
- Methods

| | |
|---|---|
| accumulate_partitions(func, *args, **kwargs) | Accumulate a function with state across batch elements |
| map_partitions(func, *args, **kwargs) | Map a function across all batch elements of this stream |
| verify(x) | Verify elements that pass through this stream |
| emit | |
accumulate_partitions(func, *args, **kwargs)¶
- Accumulate a function with state across batch elements
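Accumulator functions of this kind carry running state across batches; a plain-Python sketch of that contract, where `func(state, batch)` returns the updated state and a value to emit (the `running_sum` helper is illustrative, not part of rapidz):

```python
# Sketch of a stateful accumulator: func(state, batch) -> (new_state, emitted).
# Here the state is a running total over batches of numbers.
def running_sum(state, batch):
    new_state = state + sum(batch)
    return new_state, new_state  # emit the updated total downstream

state = 0
emitted = []
for batch in [[1, 2, 3], [4, 5], [6]]:
    state, out = running_sum(state, batch)
    emitted.append(out)
# emitted == [6, 15, 21]
```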
 - 
static map_partitions(func, *args, **kwargs)¶
- Map a function across all batch elements of this stream
- The output stream type will be determined by the action of that function on the example
 - 
verify(x)¶
- Verify elements that pass through this stream 
 
- 
class rapidz.batch.Batch(stream=None, example=None)¶
- A Stream of tuples or lists
- This streaming collection manages batches of Python objects such as lists of text or dictionaries. By batching many elements together we reduce overhead from Python.
- This library is typically used at the early stages of data ingestion before handing off to streaming dataframes
- Examples

>>> text = Streaming.from_file(myfile)  # doctest: +SKIP
>>> b = text.partition(100).map(json.loads)  # doctest: +SKIP

- Methods

| | |
|---|---|
| accumulate_partitions(func, *args, **kwargs) | Accumulate a function with state across batch elements |
| filter(predicate) | Filter elements by a predicate |
| map(func, **kwargs) | Map a function across all elements |
| map_partitions(func, *args, **kwargs) | Map a function across all batch elements of this stream |
| pluck(ind) | Pick a field out of all elements |
| sum() | Sum elements |
| to_dataframe() | Convert to a streaming dataframe |
| to_stream() | Concatenate batches and return base Stream |
| verify(x) | Verify elements that pass through this stream |
| emit | |
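The partition-then-parse pattern in the example above can be mimicked in plain Python to see what batching buys: raw lines are grouped into fixed-size batches and each batch is parsed in one pass (the `partition` helper and the batch size of 3 are illustrative):

```python
import json

# Mimic text.partition(n).map(json.loads): group raw lines into fixed-size
# batches, then parse each batch of JSON strings together, so per-element
# Python overhead is paid once per batch rather than once per element.
def partition(seq, n):
    return [seq[i:i + n] for i in range(0, len(seq), n)]

lines = ['{"x": %d}' % i for i in range(6)]
batches = partition(lines, 3)  # two batches of three lines each
parsed = [[json.loads(s) for s in batch] for batch in batches]
# parsed[0] == [{'x': 0}, {'x': 1}, {'x': 2}]
```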
accumulate_partitions(func, *args, **kwargs)¶
- Accumulate a function with state across batch elements
- See also: Streaming.map_partitions
 - 
filter(predicate)¶
- Filter elements by a predicate 
 - 
map(func, **kwargs)¶
- Map a function across all elements 
 - 
static map_partitions(func, *args, **kwargs)¶
- Map a function across all batch elements of this stream
- The output stream type will be determined by the action of that function on the example
- See also: Streaming.accumulate_partitions
 - 
pluck(ind)¶
- Pick a field out of all elements 
 - 
sum()¶
- Sum elements 
 - 
to_dataframe()¶
- Convert to a streaming dataframe
- This calls pd.DataFrame on all list-elements of this stream
 - 
to_stream()¶
- Concatenate batches and return base Stream
- Returned stream will be composed of single elements
 - 
verify(x)¶
- Verify elements that pass through this stream 
 
- 
- 
class rapidz.dataframe.DataFrame(*args, **kwargs)¶
- A Streaming dataframe
- This is a logical collection over a stream of Pandas dataframes. Operations on this object will translate to the appropriate operations on the underlying Pandas dataframes.
- See also
  - Series
- Attributes
  - columns
  - dtypes
  - index
  - size
    size of frame
- Methods

| | |
|---|---|
| accumulate_partitions(func, *args, **kwargs) | Accumulate a function with state across batch elements |
| assign(**kwargs) | Assign new columns to this dataframe |
| count() | Count of frame |
| cummax() | Cumulative maximum |
| cummin() | Cumulative minimum |
| cumprod() | Cumulative product |
| cumsum() | Cumulative sum |
| groupby(other) | Groupby aggregations |
| map_partitions(func, *args, **kwargs) | Map a function across all batch elements of this stream |
| mean() | Average |
| reset_index() | Reset Index |
| rolling(window[, min_periods]) | Compute rolling aggregations |
| round([decimals]) | Round elements in frame |
| set_index(index, **kwargs) | Set Index |
| sum() | Sum frame |
| tail([n]) | Last n rows of frame |
| to_frame() | Convert to a streaming dataframe |
| verify(x) | Verify consistency of elements that pass through this stream |
| window([n, value]) | Sliding window operations |
| aggregate | |
| astype | |
| emit | |
| map | |
| query | |
accumulate_partitions(func, *args, **kwargs)¶
- Accumulate a function with state across batch elements
- See also: Streaming.map_partitions
 - 
assign(**kwargs)¶
- Assign new columns to this dataframe
- Alternatively use setitem syntax
- Examples

>>> sdf = sdf.assign(z=sdf.x + sdf.y)  # doctest: +SKIP
>>> sdf['z'] = sdf.x + sdf.y  # doctest: +SKIP
 - 
count()¶
- Count of frame 
 - 
cummax()¶
- Cumulative maximum 
 - 
cummin()¶
- Cumulative minimum 
 - 
cumprod()¶
- Cumulative product 
 - 
cumsum()¶
- Cumulative sum 
 - 
groupby(other)¶
- Groupby aggregations
 - 
static map_partitions(func, *args, **kwargs)¶
- Map a function across all batch elements of this stream
- The output stream type will be determined by the action of that function on the example
- See also: Streaming.accumulate_partitions
 - 
mean()¶
- Average 
 - 
reset_index()¶
- Reset Index 
 - 
rolling(window, min_periods=1)¶
- Compute rolling aggregations
- When followed by an aggregation method like sum, mean, or std this produces a new Streaming dataframe whose values are aggregated over that window.
- The window parameter can be either a number of rows or a timedelta like "2 minutes", in which case the index should be a datetime index.
- This operates by keeping enough of a backlog of records to maintain an accurate stream. It performs a copy at every added dataframe. Because of this it may be slow if the rolling window is much larger than the average stream element.
- Parameters
  - window: int or timedelta
    Window over which to roll
- Returns
  - Rolling object
- See also
  - DataFrame.window
    more generic window operations
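The backlog mechanism described above can be sketched in plain Python: keep only enough recent rows to cover the window and re-aggregate as each element arrives (a row-count window of 3 here; this sketch stands in for the Pandas-backed implementation):

```python
from collections import deque

# Rolling sum over the last 3 rows. The deque is the "backlog" of records
# kept so each new element can be aggregated with its window; rows beyond
# the window fall off automatically via maxlen.
window = deque(maxlen=3)
rolling_sums = []
for value in [1, 2, 3, 4, 5]:
    window.append(value)
    rolling_sums.append(sum(window))
# rolling_sums == [1, 3, 6, 9, 12]
```

Note that, like the default min_periods=1, partial windows at the start still produce output.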
 
 - 
round(decimals=0)¶
- Round elements in frame 
 - 
set_index(index, **kwargs)¶
- Set Index 
 - 
size¶
- size of frame 
 - 
sum()¶
- Sum frame 
 - 
tail(n=5)¶
- Last n rows of frame
 - 
to_frame()¶
- Convert to a streaming dataframe 
 - 
verify(x)¶
- Verify consistency of elements that pass through this stream 
 - 
window(n=None, value=None)¶
- Sliding window operations
- Windowed operations are defined over a sliding window of data, either with a fixed number of elements:

>>> df.window(n=10).sum()  # sum of the last ten elements

- or over an index value range (index must be monotonic):

>>> df.window(value='2h').mean()  # average over the last two hours

- Windowed dataframes support all normal arithmetic, aggregations, and groupby-aggregations.
- See also
  - DataFrame.rolling
    mimics Pandas rolling aggregations
- Examples

>>> df.window(n=10).std()
>>> df.window(value='2h').count()

>>> w = df.window(n=100)
>>> w.groupby(w.name).amount.sum()
>>> w.groupby(w.x % 10).y.var()
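A groupby-aggregation within a window, as in w.groupby(w.name).amount.sum() above, re-aggregates only the records currently inside the window; a plain-Python sketch with n=4 and hypothetical (name, amount) records:

```python
from collections import deque

# Groupby-sum over a sliding window of the last 4 records. Records that
# have fallen out of the window no longer contribute; the field names and
# data here are illustrative.
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
window = deque(maxlen=4)
for name, amount in records:
    window.append((name, amount))

sums = {}
for name, amount in window:  # re-aggregate the current window contents
    sums[name] = sums.get(name, 0) + amount
# sums == {"b": 6, "a": 8}  -- the first ("a", 1) record has expired
```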
 
- 
class rapidz.dataframe.Rolling(sdf, window, min_periods)¶
- Rolling aggregations
- This intermediate class enables rolling aggregations across either a fixed number of rows or a time window.
- Examples

>>> sdf.rolling(10).x.mean()  # doctest: +SKIP
>>> sdf.rolling('100ms').x.mean()  # doctest: +SKIP

- Methods

| | |
|---|---|
| aggregate(*args, **kwargs) | Rolling aggregation |
| count(*args, **kwargs) | Rolling count |
| max() | Rolling maximum |
| mean() | Rolling mean |
| median() | Rolling median |
| min() | Rolling minimum |
| quantile(*args, **kwargs) | Rolling quantile |
| std(*args, **kwargs) | Rolling standard deviation |
| sum() | Rolling sum |
| var(*args, **kwargs) | Rolling variance |
aggregate(*args, **kwargs)¶
- Rolling aggregation 
 - 
count(*args, **kwargs)¶
- Rolling count 
 - 
max()¶
- Rolling maximum 
 - 
mean()¶
- Rolling mean 
 - 
median()¶
- Rolling median 
 - 
min()¶
- Rolling minimum 
 - 
quantile(*args, **kwargs)¶
- Rolling quantile 
 - 
std(*args, **kwargs)¶
- Rolling standard deviation 
 - 
sum()¶
- Rolling sum 
 - 
var(*args, **kwargs)¶
- Rolling variance 
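For a time window like '100ms' in the examples above, the backlog idea is the same but eviction is by timestamp rather than row count; a sketch with illustrative (timestamp, value) pairs, timestamps as plain seconds:

```python
from collections import deque

# Time-based rolling sum: before aggregating, evict backlog records whose
# timestamp has fallen out of the 100 ms window ending at the newest record.
window_seconds = 0.100
backlog = deque()
rolling_sums = []
for ts, value in [(0.00, 1), (0.05, 2), (0.12, 3), (0.30, 4)]:
    backlog.append((ts, value))
    while backlog and backlog[0][0] <= ts - window_seconds:
        backlog.popleft()  # drop records older than the window
    rolling_sums.append(sum(v for _, v in backlog))
# rolling_sums == [1, 3, 5, 4]
```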
 
- 
- 
class rapidz.dataframe.Window(sdf, n=None, value=None)¶
- Windowed aggregations
- This provides a set of aggregations that can be applied over a sliding window of data.
- See also
  - DataFrame.window
    contains full docstring
- Attributes
  - columns
  - dtypes
  - example
  - index
  - size
    Number of elements within window
- Methods

| | |
|---|---|
| apply(func) | Apply an arbitrary function over each window of data |
| count() | Count elements within window |
| groupby(other) | Groupby-aggregations within window |
| mean() | Average elements within window |
| std([ddof]) | Compute standard deviation of elements within window |
| sum() | Sum elements within window |
| value_counts() | Count groups of elements within window |
| var([ddof]) | Compute variance of elements within window |
| aggregate | |
| full | |
| map_partitions | |
| reset_index | |
apply(func)¶
- Apply an arbitrary function over each window of data 
 - 
count()¶
- Count elements within window 
 - 
groupby(other)¶
- Groupby-aggregations within window 
 - 
mean()¶
- Average elements within window 
 - 
size¶
- Number of elements within window 
 - 
std(ddof=1)¶
- Compute standard deviation of elements within window 
 - 
sum()¶
- Sum elements within window 
 - 
value_counts()¶
- Count groups of elements within window 
 - 
var(ddof=1)¶
- Compute variance of elements within window 
 
- 
class rapidz.dataframe.GroupBy(root, grouper, index=None)¶
- Groupby aggregations on streaming dataframes
- Methods

| | |
|---|---|
| count() | Groupby-count |
| mean() | Groupby-mean |
| size() | Groupby-size |
| std([ddof]) | Groupby-std |
| sum() | Groupby-sum |
| var([ddof]) | Groupby-variance |
count()¶
- Groupby-count 
 - 
mean()¶
- Groupby-mean 
 - 
size()¶
- Groupby-size 
 - 
std(ddof=1)¶
- Groupby-std 
 - 
sum()¶
- Groupby-sum 
 - 
var(ddof=1)¶
- Groupby-variance 
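Unlike the windowed variants, these groupby aggregations run over the whole stream, so they can be maintained incrementally: a running (sum, count) per key is enough to answer mean() after any number of chunks. A plain-Python sketch (illustrative, not the rapidz internals):

```python
# Streaming groupby-mean: keep per-key running (sum, count) state and fold
# each incoming chunk of (key, value) records into it.
state = {}  # key -> (running_sum, running_count)
chunks = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)], [("b", 4.0), ("a", 5.0)]]
for chunk in chunks:
    for key, value in chunk:
        total, count = state.get(key, (0.0, 0))
        state[key] = (total + value, count + 1)

means = {key: total / count for key, (total, count) in state.items()}
# means == {"a": 3.0, "b": 3.0}
```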
 
- 
- 
class rapidz.dataframe.Random(freq='100ms', interval='500ms', dask=False)¶
- A streaming dataframe of random data
- The x column is uniformly distributed. The y column is Poisson distributed. The z column is normally distributed.
- This class is experimental and will likely be removed in the future
- Parameters
  - freq: timedelta
    The time interval between records
  - interval: timedelta
    The time interval between new dataframes; should be significantly larger than freq
- Attributes
  - columns
  - dtypes
  - index
  - size
    size of frame
- Methods

| | |
|---|---|
| accumulate_partitions(func, *args, **kwargs) | Accumulate a function with state across batch elements |
| assign(**kwargs) | Assign new columns to this dataframe |
| count() | Count of frame |
| cummax() | Cumulative maximum |
| cummin() | Cumulative minimum |
| cumprod() | Cumulative product |
| cumsum() | Cumulative sum |
| groupby(other) | Groupby aggregations |
| map_partitions(func, *args, **kwargs) | Map a function across all batch elements of this stream |
| mean() | Average |
| reset_index() | Reset Index |
| rolling(window[, min_periods]) | Compute rolling aggregations |
| round([decimals]) | Round elements in frame |
| set_index(index, **kwargs) | Set Index |
| sum() | Sum frame |
| tail([n]) | Last n rows of frame |
| to_frame() | Convert to a streaming dataframe |
| verify(x) | Verify consistency of elements that pass through this stream |
| window([n, value]) | Sliding window operations |
| aggregate | |
| astype | |
| emit | |
| map | |
| query | |
| stop | |