Dynamic data aggregator — orbi_start

These functions allow definition of custom data aggregators for processing data extracted from raw files. An aggregator is run on each imported file and pulls together the relevant data users are interested in while making sure data formats are correct so that the aggregated data can be merged across several imported files for fast downstream processing.

Usage

orbi_start_aggregator(
  dataset,
  uid_source,
  cast = "as.factor",
  regexp = FALSE,
  func = NULL,
  args = NULL
)

orbi_add_aggregator(
  aggregator,
  dataset,
  column,
  source = column,
  default = NA,
  cast = "as.character",
  regexp = FALSE,
  func = NULL,
  args = NULL
)

Arguments

dataset: name of the dataset in which to aggregate data, e.g. "file_info"
uid_source: name of the column (inside the dataset) where the aggregator's "uid" (unique ID) column should come from
cast: what to cast the values of the resulting column to, most commonly "as.character", "as.integer", "as.numeric", or "as.factor". This is required to ensure all aggregated values have the correct data type.
regexp: whether source column names should be interpreted as regular expressions for the purpose of finding the relevant column(s). Note that if regexp = TRUE, the search for the source column always becomes case-insensitive, so this can also be used for a direct match of a source column whose upper/lower casing is unreliable.
func: name of a processing function to apply before casting the value with the cast function. This is optional and can be used to conduct more elaborate preprocessing of the data or to combine data from multiple source columns in the correct way (e.g. pasting together values from multiple columns).
args: an optional list of arguments to pass to the func in addition to the values coming from the source column(s)
aggregator: the aggregator table generated by orbi_start_aggregator() or passed from a previous call to orbi_add_aggregator() for constructing the entire aggregator by piping
column: the name of the column in which data should be stored
source: a single column name or a character vector of column names (if alternatives could be the source) indicating where in the dataset to find data for the column. If a vector of multiple column names is provided (e.g. source = c("a1", "a2")), the first column name that is found during processing of a dataset is used and passed to the function defined in func (if any) and then to the one defined in cast. To provide multiple parameters from the data to func, define a list instead of a vector, e.g. source = list("a", "b", "c"). If multiple alternative columns can be the source for any of the arguments, define it as source = list(c("a1", "a2"), "b", c("c1", "c2", "c3")).
default: the default value if no source columns can be found or another error is encountered during aggregation. Note that the default value will also be processed with the function in cast to make sure it has the correct data type.
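
The interplay of source, regexp, func, args, default, and cast described above can be sketched in base R. This is only an illustration of the documented behaviour, not the package's implementation: the helpers resolve_source() and aggregate_column() and the example columns a2 and b are hypothetical names introduced here.

```r
# Illustrative sketch only: NOT the isoorbi implementation.
# resolve_source() and aggregate_column() are hypothetical helpers.

# For each func argument, return the first alternative column found in data.
resolve_source <- function(data, source, regexp = FALSE) {
  # a plain vector means one argument with alternatives;
  # a list means one entry per argument passed to func
  if (!is.list(source)) source <- list(source)
  lapply(source, function(alternatives) {
    for (candidate in alternatives) {
      hit <- if (regexp) {
        # regexp matching is case-insensitive, as described for regexp = TRUE
        grep(candidate, names(data), ignore.case = TRUE, value = TRUE)
      } else {
        intersect(candidate, names(data))
      }
      if (length(hit) > 0) return(data[[hit[1]]])
    }
    stop("no source column found")
  })
}

# Resolve the source(s), apply func (if any), fall back to default on error,
# and always finish by casting so the result has a predictable type.
aggregate_column <- function(data, source, default = NA, cast = "as.character",
                             regexp = FALSE, func = NULL, args = NULL) {
  value <- tryCatch({
    values <- resolve_source(data, source, regexp)
    if (is.null(func)) values[[1]] else do.call(func, c(values, args))
  }, error = function(e) default)
  do.call(cast, list(value))
}

# Hypothetical dataset with columns "a2" and "b"
file_info <- data.frame(a2 = c("x", "y"), b = c("1", "2"), stringsAsFactors = FALSE)

# "a1" is missing, so the alternative "a2" is used
first_found <- aggregate_column(file_info, source = c("a1", "a2"))

# two arguments for func: the first with alternatives, the second fixed
combined <- aggregate_column(
  file_info,
  source = list(c("a1", "a2"), "b"),
  func = "paste", args = list(sep = "_")
)

# case-insensitive match of column "b" via a regular expression
matched <- aggregate_column(file_info, source = "^B$", regexp = TRUE, cast = "as.numeric")

# no source column found: the default is used and is still cast
fallback <- aggregate_column(file_info, source = "missing", default = "0", cast = "as.integer")
```

In this sketch the default passes through the same cast as regular values, which is why the fallback has the same type as a successfully aggregated column would.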

Value

a tibble holding all the columns that the aggregator will generate when run against data from a file

Functions

orbi_start_aggregator(): start the aggregator, requires definition of where the unique ID of a dataset comes from
orbi_add_aggregator(): add an additional column to aggregate data for
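
Examples

A minimal sketch of building an aggregator by piping, using only the arguments listed under Usage. The dataset name "scans" and all column names ("FileName", "Instrument", "instrument_name", "date", "time", "scan", "tic") are assumptions for illustration and may not match the data in an actual raw file.

```r
library(isoorbi)

# Start the aggregator: the unique ID comes from a (hypothetical) "FileName"
# column in the "file_info" dataset and is cast with the default "as.factor".
agg <-
  orbi_start_aggregator("file_info", uid_source = "FileName") |>
  # take the instrument from whichever of the two alternative columns exists
  orbi_add_aggregator(
    "file_info", "instrument",
    source = c("Instrument", "instrument_name")
  ) |>
  # combine two source columns with paste() before casting to character
  orbi_add_aggregator(
    "file_info", "timestamp",
    source = list("date", "time"),
    func = "paste", args = list(sep = " ")
  ) |>
  # columns from a second (assumed) dataset, cast to numeric types
  orbi_add_aggregator("scans", "scan", cast = "as.integer") |>
  orbi_add_aggregator("scans", "tic", cast = "as.numeric", default = 0)

# the aggregator is a tibble describing the columns it will generate
agg
```

Because each call returns the aggregator table, further columns can be appended to an existing aggregator with additional orbi_add_aggregator() calls.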