These functions define custom data aggregators for processing data extracted from raw files. An aggregator runs on each imported file, pulls together the data users are interested in, and casts it to the correct data types so that the aggregated data can be merged across several imported files for fast downstream processing.
Usage
orbi_start_aggregator(
  dataset,
  uid_source,
  cast = "as.factor",
  regexp = FALSE,
  func = NULL,
  args = NULL
)

orbi_add_aggregator(
  aggregator,
  dataset,
  column,
  source = column,
  default = NA,
  cast = "as.character",
  regexp = FALSE,
  func = NULL,
  args = NULL
)
Arguments
- dataset
name of the dataset in which to aggregate data, e.g. "file_info"
- uid_source
name of the column (inside the dataset) where the aggregator's "uid" (unique ID) column should come from
- cast
what to cast the values of the resulting column to, most commonly "as.character", "as.integer", "as.numeric", or "as.factor". This is required to ensure all aggregated values have the correct data type.
- regexp
whether source column names should be interpreted as regular expressions for the purpose of finding the relevant column(s). Note that if regexp = TRUE, the search for the source column always becomes case-insensitive, so this can also be used for a direct match of a source column whose upper/lower casing can be unreliable.
- func
name of a processing function to apply before casting the value with the cast function. This is optional and can be used to conduct more elaborate preprocessing of the data or to combine data from multiple source columns in the correct way (e.g. pasting together values from multiple columns).
- args
an optional list of arguments to pass to func in addition to the values coming from the source column(s)
- aggregator
the aggregator table generated by orbi_start_aggregator() or passed from a previous call to orbi_add_aggregator() for constructing the entire aggregator by piping
- column
the name of the column in which data should be stored
- source
single character column name or vector of column names (if alternatives could be the source) where in the dataset to find data for the column. If a vector of multiple column names is provided (e.g. source = c("a1", "a2")), the first column name that's found during processing of a dataset will be used and passed to the function defined in func (if any) and then the one defined in cast. To provide multiple parameters from the data to func, define a list instead of a vector, e.g. source = list("a", "b", "c"), or, if multiple alternative columns can be the source for any of the arguments, define it as source = list(c("a1", "a2"), "b", c("c1", "c2", "c3"))
- default
the default value if no source columns can be found or another error is encountered during aggregation. Note that the default value will also be processed with the function in cast to make sure it has the correct data type.
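
The lookup rules described for source, regexp, func, cast, and default can be sketched in base R as follows. This is only an illustration of the documented behavior, not the package's implementation: find_column() and aggregate_value() are hypothetical helpers, the file_info columns are made up, and the regexp branch matches partial names, which may differ from what the package does.

```r
# Hypothetical helper (not part of the package): return the first column
# name that matches any of the alternatives given for one func argument.
find_column <- function(data, alternatives, regexp = FALSE) {
  for (alt in alternatives) {
    hits <- if (regexp) {
      # regexp = TRUE: pattern match that ignores upper/lower casing
      grep(alt, names(data), ignore.case = TRUE, value = TRUE)
    } else {
      # regexp = FALSE: exact, case-sensitive name match
      intersect(alt, names(data))
    }
    if (length(hits) > 0) return(hits[1])
  }
  NULL
}

# Hypothetical helper: resolve the source column(s), apply func (if any),
# then cast; fall back to the cast default if a column is missing or an
# error occurs along the way.
aggregate_value <- function(data, source, default = NA, cast = "as.character",
                            regexp = FALSE, func = NULL, args = NULL) {
  tryCatch({
    # a plain vector lists alternatives for a single argument
    if (!is.list(source)) source <- list(source)
    cols <- lapply(source, find_column, data = data, regexp = regexp)
    if (any(vapply(cols, is.null, logical(1)))) stop("source column not found")
    values <- unname(lapply(cols, function(col) data[[col]]))
    if (!is.null(func)) {
      # source values first, then the extra arguments from args
      values <- list(do.call(func, c(values, args)))
    }
    do.call(cast, list(values[[1]]))
  }, error = function(e) do.call(cast, list(default)))
}

# Made-up single-file dataset for illustration
file_info <- data.frame(
  FileName = "run1.raw", Scans = "120", Operator = "ab",
  stringsAsFactors = FALSE
)

# alternatives + case-insensitive match, cast to integer
scans <- aggregate_value(file_info, c("scans", "n_scans"),
                         cast = "as.integer", regexp = TRUE)

# two func arguments, the second with alternative column names
label <- aggregate_value(file_info, list("FileName", c("operator", "Operator")),
                         func = "paste", args = list(sep = "_"))

# no source column found: the default is cast to the requested type
fallback <- aggregate_value(file_info, "missing", default = NA,
                            cast = "as.numeric")
```

In this sketch, scans is the integer 120, label is "run1.raw_ab", and fallback is a numeric NA, so all three results keep a predictable type even when a source column is absent.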