These functions allow the definition of custom data aggregators for processing data extracted from raw files. An aggregator is run on each imported file. It pulls together the data users are interested in and makes sure the data types are correct, so that the aggregated data can be merged across several imported files for fast downstream processing.

Usage

orbi_start_aggregator(
  dataset,
  uid_source,
  cast = "as.factor",
  regexp = FALSE,
  func = NULL,
  args = NULL
)

orbi_add_aggregator(
  aggregator,
  dataset,
  column,
  source = column,
  default = NA,
  cast = "as.character",
  regexp = FALSE,
  func = NULL,
  args = NULL
)

Arguments

dataset

name of the dataset in which to aggregate data, e.g. "file_info"

uid_source

name of the column (inside the dataset) where the aggregator's "uid" (unique ID) column should come from

cast

what to cast the values of the resulting column to, most commonly "as.character", "as.integer", "as.numeric", or "as.factor". This is required to ensure all aggregated values have the correct data type.

regexp

whether source column names should be interpreted as regular expressions for the purpose of finding the relevant column(s). Note that if regexp = TRUE, the search for the source column always becomes case-insensitive, so this can also be used for a direct match of a source column whose upper/lower casing is unreliable.
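
The matching rule for regexp can be sketched in base R. This is an illustration only, not the package's implementation: the helper find_source_columns() and the example column names are hypothetical.

```r
# Hypothetical helper: find the columns in `data` that match `source`.
# With regexp = FALSE the name must match exactly; with regexp = TRUE the
# name is treated as a case-insensitive regular expression.
find_source_columns <- function(data, source, regexp = FALSE) {
  cols <- names(data)
  if (regexp) {
    cols[grepl(source, cols, ignore.case = TRUE)]
  } else {
    cols[cols == source]
  }
}

# made-up example data with inconsistent casing in the column names
file_info <- data.frame(
  FileName = "a.raw",
  `Scan Count` = 10L,
  check.names = FALSE
)

# an exact match fails because of the casing
exact_match <- find_source_columns(file_info, "filename")

# an anchored regular expression acts as a case-insensitive direct match
regex_match <- find_source_columns(file_info, "^filename$", regexp = TRUE)

# an unanchored pattern matches any column containing "scan"
scan_match <- find_source_columns(file_info, "scan", regexp = TRUE)
```

Anchoring the pattern with ^ and $ keeps the match direct while still ignoring case.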

func

name of a processing function to apply before casting the value with the cast function. This is optional and can be used to conduct more elaborate preprocessing of the data or to combine data from multiple source columns in the correct way (e.g. pasting together values from multiple columns).

args

an optional list of arguments to pass to func in addition to the values coming from the source column(s)

aggregator

the aggregator table generated by orbi_start_aggregator() or passed from a previous call to orbi_add_aggregator() for constructing the entire aggregator by piping

column

the name of the column in which data should be stored

source

single character column name or vector of column names (if alternatives could be the source) indicating where in the dataset to find data for the column. If a vector of multiple column names is provided (e.g. source = c("a1", "a2")), the first column name that is found during processing of a dataset will be used and passed to the function defined in func (if any) and then to the one defined in cast. To provide multiple parameters from the data to func, define a list instead of a vector, e.g. source = list("a", "b", "c"), or, if multiple alternative columns can be the source for any of the arguments, define it as source = list(c("a1", "a2"), "b", c("c1", "c2", "c3")).
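
The difference between a vector of alternatives and a list of arguments can be sketched in base R. This only illustrates the lookup rule described above and is not the package's implementation: the helper resolve_sources() and the example data are hypothetical.

```r
# Hypothetical helper: for each argument slot in `source`, return the first
# alternative column name that exists in `data` (NA if none is found).
resolve_sources <- function(data, source) {
  # a plain vector is a single slot with alternatives,
  # a list has one slot per argument passed on to func
  slots <- if (is.list(source)) source else list(source)
  vapply(
    slots,
    function(alternatives) {
      found <- alternatives[alternatives %in% names(data)]
      if (length(found) > 0) found[1] else NA_character_
    },
    character(1)
  )
}

# made-up example data: only a2, b, and c3 exist
data <- data.frame(a2 = 1, b = 2, c3 = 3)

# vector: one value, taken from the first alternative that exists
single <- resolve_sources(data, c("a1", "a2"))

# list: three values, each with its own alternatives
multiple <- resolve_sources(data, list(c("a1", "a2"), "b", c("c1", "c2", "c3")))
```

In this sketch, single resolves to the column a2 and multiple to the columns a2, b, and c3, one per argument.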

default

the default value if no source columns can be found or another error is encountered during aggregation. Note that the default value will also be processed with the function in cast to make sure it has the correct data type.
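
The order of operations described for func, args, cast, and default can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the helper process_value() is hypothetical, and it only treats errors (not warnings) as a reason to fall back to the default.

```r
# Hypothetical helper: apply the optional `func` (with extra `args`) to the
# source values, then `cast` the result. If an error occurs, fall back to
# `default`, which is cast as well so the data type stays consistent.
process_value <- function(values, cast = "as.character", func = NULL,
                          args = NULL, default = NA) {
  cast_fn <- match.fun(cast)
  tryCatch(
    {
      if (!is.null(func)) {
        # source values come first, additional arguments after
        values <- list(do.call(func, c(values, args)))
      }
      cast_fn(values[[1]])
    },
    error = function(e) cast_fn(default)
  )
}

# combine two source values with paste() before casting to character
combined <- process_value(list("run", 7), func = "paste", args = list(sep = "_"))

# no func: the single source value is cast directly
count <- process_value(list("12"), cast = "as.integer")

# func raises an error: the default (NA) is returned, cast to integer
fallback <- process_value(list("x"), cast = "as.integer", func = "stop")
```

Casting the default as well means that a column aggregated from files with missing data still has the same type in every file.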

Value

a tibble holding all the columns that the aggregator will generate when run against data from a file

Functions

  • orbi_start_aggregator(): start the aggregator, requires definition of where the unique ID of a dataset comes from

  • orbi_add_aggregator(): add an additional column to aggregate data for