Collections API

Collections

    Streaming
        Superclass for streaming collections
    Streaming.map_partitions
        Map a function across all batch elements of this stream
    Streaming.accumulate_partitions
        Accumulate a function with state across batch elements
    Streaming.verify
        Verify elements that pass through this stream

Batch

    Batch
        A Stream of tuples or lists
    Batch.filter
        Filter elements by a predicate
    Batch.map
        Map a function across all elements
    Batch.pluck
        Pick a field out of all elements
    Batch.to_dataframe
        Convert to a streaming dataframe
    Batch.to_stream
        Concatenate batches and return base Stream

Dataframes

    DataFrame
        A Streaming dataframe
    DataFrame.groupby
        Groupby aggregations
    DataFrame.rolling
        Compute rolling aggregations
    DataFrame.assign
        Assign new columns to this dataframe
    DataFrame.sum
        Sum frame
    DataFrame.mean
        Average
    DataFrame.cumsum
        Cumulative sum
    DataFrame.cumprod
        Cumulative product
    DataFrame.cummin
        Cumulative minimum
    DataFrame.cummax
        Cumulative maximum

    GroupBy
        Groupby aggregations on streaming dataframes
    GroupBy.count
        Groupby-count
    GroupBy.mean
        Groupby-mean
    GroupBy.size
        Groupby-size
    GroupBy.std
        Groupby-std
    GroupBy.sum
        Groupby-sum
    GroupBy.var
        Groupby-variance

    Rolling
        Rolling aggregations
    Rolling.aggregate
        Rolling aggregation
    Rolling.count
        Rolling count
    Rolling.max
        Rolling maximum
    Rolling.mean
        Rolling mean
    Rolling.median
        Rolling median
    Rolling.min
        Rolling minimum
    Rolling.quantile
        Rolling quantile
    Rolling.std
        Rolling standard deviation
    Rolling.sum
        Rolling sum
    Rolling.var
        Rolling variance

    DataFrame.window
        Sliding window operations
    Window.apply
        Apply an arbitrary function over each window of data
    Window.count
        Count elements within window
    Window.groupby
        Groupby-aggregations within window
    Window.sum
        Sum elements within window
    Window.size
        Number of elements within window
    Window.std
        Compute standard deviation of elements within window
    Window.var
        Compute variance of elements within window
    Random
        A streaming dataframe of random data
Details
class rapidz.collection.Streaming(stream=None, example=None, stream_type=None)

    Superclass for streaming collections.

    Do not create this class directly; use one of the subclasses instead.

    Parameters
        stream: rapidz.Stream
        example: object
            An object to represent an example element of this stream

    See also: rapidz.dataframe.StreamingDataFrame, rapidz.dataframe.StreamingBatch

    Methods
        accumulate_partitions(func, *args, **kwargs)
            Accumulate a function with state across batch elements
        map_partitions(func, *args, **kwargs)
            Map a function across all batch elements of this stream
        verify(x)
            Verify elements that pass through this stream
        emit

    accumulate_partitions(func, *args, **kwargs)
        Accumulate a function with state across batch elements.

        See also: Streaming.map_partitions

    static map_partitions(func, *args, **kwargs)
        Map a function across all batch elements of this stream.

        The output stream type will be determined by the action of that
        function on the example.

        See also: Streaming.accumulate_partitions

    verify(x)
        Verify elements that pass through this stream
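For orientation, here is a minimal sketch of both partition-level operations on
a concrete subclass. This is a hedged example: the Batch constructor keywords
and the start keyword of accumulate_partitions mirror the streamz API from
which rapidz is derived, and names like source are illustrative.

>>> from rapidz import Stream
>>> from rapidz.batch import Batch
>>> source = Stream()
>>> b = Batch(stream=source, example=[0])  # each element is a list of ints
>>> doubled = b.map_partitions(lambda part: [2 * x for x in part])
>>> totals = b.accumulate_partitions(lambda acc, part: acc + sum(part), start=0)
>>> source.emit([1, 2, 3])  # doctest: +SKIP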
class rapidz.batch.Batch(stream=None, example=None)

    A Stream of tuples or lists.

    This streaming collection manages batches of Python objects, such as lists
    of text or dictionaries. By batching many elements together we reduce
    per-element overhead from Python.

    This collection is typically used at the early stages of data ingestion,
    before handing off to streaming dataframes.

    Examples

    >>> text = Streaming.from_file(myfile)       # doctest: +SKIP
    >>> b = text.partition(100).map(json.loads)  # doctest: +SKIP
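Continuing that example, a hedged sketch of handing a batch stream off to a
streaming dataframe; the 'amount' field is illustrative, and the .stream.sink
idiom is assumed from the streamz API that rapidz mirrors.

>>> amounts = b.pluck('amount')          # doctest: +SKIP
>>> sdf = b.to_dataframe()               # doctest: +SKIP
>>> sdf.amount.sum().stream.sink(print)  # doctest: +SKIP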
    Methods
        accumulate_partitions(func, *args, **kwargs)
            Accumulate a function with state across batch elements
        filter(predicate)
            Filter elements by a predicate
        map(func, **kwargs)
            Map a function across all elements
        map_partitions(func, *args, **kwargs)
            Map a function across all batch elements of this stream
        pluck(ind)
            Pick a field out of all elements
        sum()
            Sum elements
        to_dataframe()
            Convert to a streaming dataframe
        to_stream()
            Concatenate batches and return base Stream
        verify(x)
            Verify elements that pass through this stream
        emit

    accumulate_partitions(func, *args, **kwargs)
        Accumulate a function with state across batch elements.

        See also: Streaming.map_partitions

    filter(predicate)
        Filter elements by a predicate

    map(func, **kwargs)
        Map a function across all elements

    static map_partitions(func, *args, **kwargs)
        Map a function across all batch elements of this stream.

        The output stream type will be determined by the action of that
        function on the example.

        See also: Streaming.accumulate_partitions

    pluck(ind)
        Pick a field out of all elements

    sum()
        Sum elements

    to_dataframe()
        Convert to a streaming dataframe.

        This calls pd.DataFrame on all list-elements of this stream.

    to_stream()
        Concatenate batches and return base Stream.

        The returned stream will be composed of single elements.

    verify(x)
        Verify elements that pass through this stream
class rapidz.dataframe.DataFrame(*args, **kwargs)

    A Streaming dataframe.

    This is a logical collection over a stream of Pandas dataframes.
    Operations on this object translate to the appropriate operations on the
    underlying Pandas dataframes.

    See also: Series

    Attributes
        columns
        dtypes
        index
        size
            Size of frame

    Methods
        accumulate_partitions(func, *args, **kwargs)
            Accumulate a function with state across batch elements
        assign(**kwargs)
            Assign new columns to this dataframe
        count()
            Count of frame
        cummax()
            Cumulative maximum
        cummin()
            Cumulative minimum
        cumprod()
            Cumulative product
        cumsum()
            Cumulative sum
        groupby(other)
            Groupby aggregations
        map_partitions(func, *args, **kwargs)
            Map a function across all batch elements of this stream
        mean()
            Average
        reset_index()
            Reset index
        rolling(window[, min_periods])
            Compute rolling aggregations
        round([decimals])
            Round elements in frame
        set_index(index, **kwargs)
            Set index
        sum()
            Sum frame
        tail([n])
            Last n rows of the frame
        to_frame()
            Convert to a streaming dataframe
        verify(x)
            Verify consistency of elements that pass through this stream
        window([n, value])
            Sliding window operations
        aggregate
        astype
        emit
        map
        query
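Before the per-method details, a minimal construction sketch may help. This is
hedged: the stream and example keywords mirror the streamz constructor that
rapidz derives from, and the column names are illustrative.

>>> import pandas as pd
>>> from rapidz import Stream
>>> from rapidz.dataframe import DataFrame
>>> source = Stream()
>>> example = pd.DataFrame({'name': [], 'amount': []})  # fixes columns/dtypes
>>> sdf = DataFrame(stream=source, example=example)
>>> sdf.amount.sum().stream.sink(print)                        # doctest: +SKIP
>>> source.emit(pd.DataFrame({'name': ['a'], 'amount': [1]}))  # doctest: +SKIP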
    accumulate_partitions(func, *args, **kwargs)
        Accumulate a function with state across batch elements.

        See also: Streaming.map_partitions

    assign(**kwargs)
        Assign new columns to this dataframe.

        Alternatively, use setitem syntax.

        Examples

        >>> sdf = sdf.assign(z=sdf.x + sdf.y)  # doctest: +SKIP
        >>> sdf['z'] = sdf.x + sdf.y           # doctest: +SKIP
    count()
        Count of frame

    cummax()
        Cumulative maximum

    cummin()
        Cumulative minimum

    cumprod()
        Cumulative product

    cumsum()
        Cumulative sum

    groupby(other)
        Groupby aggregations
    static map_partitions(func, *args, **kwargs)
        Map a function across all batch elements of this stream.

        The output stream type will be determined by the action of that
        function on the example.

        See also: Streaming.accumulate_partitions

    mean()
        Average

    reset_index()
        Reset index
    rolling(window, min_periods=1)
        Compute rolling aggregations.

        When followed by an aggregation method like sum, mean, or std, this
        produces a new streaming dataframe whose values are aggregated over
        that window.

        The window parameter can be either a number of rows or a timedelta
        like "2 minutes", in which case the index should be a datetime index.

        This operates by keeping enough of a backlog of records to maintain an
        accurate stream. It performs a copy on every added dataframe, so it
        may be slow if the rolling window is much larger than the average
        stream element.

        Parameters
            window: int or timedelta
                Window over which to roll

        Returns
            Rolling object

        See also: DataFrame.window (more generic window operations)
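A short hedged sketch of both window styles, assuming an sdf constructed as in
the example above, with a datetime index for the timedelta form:

>>> sdf.rolling(10).amount.mean()          # over the last ten rows     # doctest: +SKIP
>>> sdf.rolling('2 minutes').amount.sum()  # over the last two minutes  # doctest: +SKIP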
    round(decimals=0)
        Round elements in frame

    set_index(index, **kwargs)
        Set index

    size
        Size of frame

    sum()
        Sum frame

    tail(n=5)
        Last n rows of the frame
    to_frame()
        Convert to a streaming dataframe

    verify(x)
        Verify consistency of elements that pass through this stream
    window(n=None, value=None)
        Sliding window operations.

        Windowed operations are defined over a sliding window of data, either
        with a fixed number of elements:

        >>> df.window(n=10).sum()  # sum of the last ten elements

        or over an index value range (the index must be monotonic):

        >>> df.window(value='2h').mean()  # average over the last two hours

        Windowed dataframes support all normal arithmetic, aggregations, and
        groupby-aggregations.

        See also: DataFrame.rolling (mimics Pandas rolling aggregations)

        Examples

        >>> df.window(n=10).std()
        >>> df.window(value='2h').count()

        >>> w = df.window(n=100)
        >>> w.groupby(w.name).amount.sum()
        >>> w.groupby(w.x % 10).y.var()
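To make the contrast with rolling concrete: a window aggregation emits one
updated aggregate over the trailing window as each batch arrives, while
rolling mimics Pandas' per-row rolling output. A hedged sketch, reusing the
sdf and source from the construction example above:

>>> totals = sdf.window(n=100).sum()   # total over the last 100 rows
>>> totals.stream.sink(print)          # one updated total per batch   # doctest: +SKIP
>>> source.emit(pd.DataFrame({'name': ['b'], 'amount': [2]}))          # doctest: +SKIP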
class rapidz.dataframe.Rolling(sdf, window, min_periods)

    Rolling aggregations.

    This intermediate class enables rolling aggregations across either a fixed
    number of rows or a time window.

    Examples

    >>> sdf.rolling(10).x.mean()       # doctest: +SKIP
    >>> sdf.rolling('100ms').x.mean()  # doctest: +SKIP
    Methods
        aggregate(*args, **kwargs)
            Rolling aggregation
        count(*args, **kwargs)
            Rolling count
        max()
            Rolling maximum
        mean()
            Rolling mean
        median()
            Rolling median
        min()
            Rolling minimum
        quantile(*args, **kwargs)
            Rolling quantile
        std(*args, **kwargs)
            Rolling standard deviation
        sum()
            Rolling sum
        var(*args, **kwargs)
            Rolling variance
    aggregate(*args, **kwargs)
        Rolling aggregation

    count(*args, **kwargs)
        Rolling count

    max()
        Rolling maximum

    mean()
        Rolling mean

    median()
        Rolling median

    min()
        Rolling minimum

    quantile(*args, **kwargs)
        Rolling quantile

    std(*args, **kwargs)
        Rolling standard deviation

    sum()
        Rolling sum

    var(*args, **kwargs)
        Rolling variance
class rapidz.dataframe.Window(sdf, n=None, value=None)

    Windowed aggregations.

    This provides a set of aggregations that can be applied over a sliding
    window of data.

    See also: DataFrame.window (contains the full docstring)

    Attributes
        columns
        dtypes
        example
        index
        size
            Number of elements within window

    Methods
        apply(func)
            Apply an arbitrary function over each window of data
        count()
            Count elements within window
        groupby(other)
            Groupby-aggregations within window
        mean()
            Average elements within window
        std([ddof])
            Compute standard deviation of elements within window
        sum()
            Sum elements within window
        value_counts()
            Count groups of elements within window
        var([ddof])
            Compute variance of elements within window
        aggregate
        full
        map_partitions
        reset_index
    apply(func)
        Apply an arbitrary function over each window of data

    count()
        Count elements within window

    groupby(other)
        Groupby-aggregations within window

    mean()
        Average elements within window

    size
        Number of elements within window

    std(ddof=1)
        Compute standard deviation of elements within window

    sum()
        Sum elements within window

    value_counts()
        Count groups of elements within window

    var(ddof=1)
        Compute variance of elements within window
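Where the built-in aggregations do not fit, apply runs a user-supplied
function over each window of data. A minimal hedged sketch; the spread
function is illustrative, and sdf is assumed to carry a numeric amount column:

>>> def spread(df):
...     return df.amount.max() - df.amount.min()
>>> sdf.window(n=50).apply(spread).stream.sink(print)  # doctest: +SKIP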
class rapidz.dataframe.GroupBy(root, grouper, index=None)

    Groupby aggregations on streaming dataframes.

    Methods
        count()
            Groupby-count
        mean()
            Groupby-mean
        size()
            Groupby-size
        std([ddof])
            Groupby-std
        sum()
            Groupby-sum
        var([ddof])
            Groupby-variance
    count()
        Groupby-count

    mean()
        Groupby-mean

    size()
        Groupby-size

    std(ddof=1)
        Groupby-std

    sum()
        Groupby-sum

    var(ddof=1)
        Groupby-variance
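A brief hedged sketch of a streaming groupby-mean, mirroring the groupby
examples under DataFrame.window above; column names are illustrative and the
result updates as new frames arrive:

>>> means = sdf.groupby(sdf.name).amount.mean()  # doctest: +SKIP
>>> means.stream.sink(print)                     # doctest: +SKIP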
class rapidz.dataframe.Random(freq='100ms', interval='500ms', dask=False)

    A streaming dataframe of random data.

    The x column is uniformly distributed, the y column is Poisson
    distributed, and the z column is normally distributed.

    This class is experimental and will likely be removed in the future.

    Parameters
        freq: timedelta
            The time interval between records
        interval: timedelta
            The time interval between new dataframes; this should be
            significantly larger than freq

    Attributes
        columns
        dtypes
        index
        size
            Size of frame
    Methods
        accumulate_partitions(func, *args, **kwargs)
            Accumulate a function with state across batch elements
        assign(**kwargs)
            Assign new columns to this dataframe
        count()
            Count of frame
        cummax()
            Cumulative maximum
        cummin()
            Cumulative minimum
        cumprod()
            Cumulative product
        cumsum()
            Cumulative sum
        groupby(other)
            Groupby aggregations
        map_partitions(func, *args, **kwargs)
            Map a function across all batch elements of this stream
        mean()
            Average
        reset_index()
            Reset index
        rolling(window[, min_periods])
            Compute rolling aggregations
        round([decimals])
            Round elements in frame
        set_index(index, **kwargs)
            Set index
        sum()
            Sum frame
        tail([n])
            Last n rows of the frame
        to_frame()
            Convert to a streaming dataframe
        verify(x)
            Verify consistency of elements that pass through this stream
        window([n, value])
            Sliding window operations
        aggregate
        astype
        emit
        map
        query
        stop
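Finally, a hedged sketch of spinning up a Random frame for experimentation.
The import path is assumed from the class location above; remember that the
class is experimental.

>>> from rapidz.dataframe import Random          # doctest: +SKIP
>>> df = Random(freq='100ms', interval='500ms')  # doctest: +SKIP
>>> df.x.sum().stream.sink(print)                # doctest: +SKIP
>>> df.stop()  # stop generating new data        # doctest: +SKIP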