As the introductory vignette shows, writing R code with eider simply consists of a call to run_pipeline(). Most of the time spent using this library will go into defining the features themselves in JSON.
Features are JSON objects, which are an association of keys and values tied together within curly braces. Keys are always strings, and values can be strings, numbers, booleans, arrays, or objects themselves, as shown in the example object below. Conceptually, JSON objects are similar to R lists.
{
  "key_1": "a string",
  "key_2": 1,
  "key_3": true,
  "key_4": [1, 2, 3],
  "key_5": {
    "nested_key_1": "a string",
    "nested_key_2": 1
  }
}
To be correctly parsed by eider, each feature must contain a specific set of keys. The keys that are shared across all features are:
(string) source_table
The name of the table to be read in. Note that this is not a filename: it is a unique identifier which is passed as part of the data_sources argument to run_pipeline(). Please see the introductory vignette for an explanation of this.
(string) output_feature_name
This determines the name of the column in the final dataframe that eider produces.
It can be any string you like, as long as there are no clashes between multiple features; the feature name "id" is also reserved and cannot be used.
(string) grouping_column
This is the name of the column in the input dataframe that the feature will be calculated over. The feature column will contain the result of the calculation for each unique value in this column.
In the case of medical data, this would typically be the name of the column containing the patient ID. In the remainder of this vignette, we will refer to the values within this column as “IDs”.
(number; optional) absent_default_value
This is the value that will be used for IDs that are not present in the input dataframe. Because eider only calculates numeric features, this has to be a number. If omitted, eider will insert NA values for missing IDs.
(string) transformation_type
This defines the type of calculation performed for the feature. Each transformation type may require an extra set of keys to be specified for the feature to be correctly calculated.
The available transformation types can be split into a few groups:
Transformation types: "count", "present"
The two simplest features are "count", which counts the number of occurrences of each ID in the dataset, and "present", which outputs 1 if the ID was found in the dataset and 0 if not.
Examples of the "count" feature type are provided in A&E features 1 and 2, as well as SMR04 feature 1.
The "present" feature type is showcased in A&E feature 3, as well as LTC features 2 and 3.
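As a rough illustration of how the shared keys fit together, a "count" feature might look like the sketch below. The table identifier "example_table", the column "example_id_column", and the output name are hypothetical placeholders chosen for this example, not names taken from the package or its datasets.

```json
{
  "source_table": "example_table",
  "output_feature_name": "example_row_count",
  "grouping_column": "example_id_column",
  "absent_default_value": 0,
  "transformation_type": "count"
}
```

Here absent_default_value is set to 0 on the assumption that an ID with no rows should be counted as zero rather than NA. A "present" feature would use the same set of keys, with only transformation_type changed.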
Transformation types: "sum", "nunique", "mean", "median", "sd", "first", "last", "min", "max"
As these features act with respect to the values in a specific column, they require a single extra key to be specified:
(string) aggregation_column
The name of the column containing the values to be aggregated.
The feature will be calculated for each unique ID by aggregating the values in this column.
Example features with summary functions include all the PIS features, and also SMR04 features 2, 3, and 4. These cover the transformation types "nunique", "sum", and "max".
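A summary feature differs from a "count" feature only by the extra aggregation_column key. The following is a minimal sketch, assuming a hypothetical table registered as "example_table" with a numeric column "example_value_column"; all names are placeholders for illustration.

```json
{
  "source_table": "example_table",
  "output_feature_name": "example_total_value",
  "grouping_column": "example_id_column",
  "absent_default_value": 0,
  "transformation_type": "sum",
  "aggregation_column": "example_value_column"
}
```

Replacing "sum" with any of the other summary transformation types listed above should leave the rest of the object unchanged, since they all require the same single extra key.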
Transformation types: "time_since"
The time_since transformation type calculates the period of time between a given date and the first (or last) date in the dataset for each ID. This feature requires a few more keys:
(string) date_column
The name of the column containing the dates to be used in the calculation.
(string) cutoff_date
The date to be used as the reference point for the calculation. This should be in the format YYYY-MM-DD.
(boolean) from_first
If true, the feature will calculate the time between the cutoff date and the first date in the dataset for each ID. If false, it will calculate the time between the cutoff date and the last date.
(string) time_units
The unit of time to be used in the calculation. This can be either "days" or "years": a year is defined as 365.25 days.
Examples of "time_since" features are given in A&E feature 4 and LTC feature 1.
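Putting the four extra keys together, a "time_since" feature could be sketched as follows. The table, column, and output names, as well as the cutoff date itself, are hypothetical values picked for this example.

```json
{
  "source_table": "example_table",
  "output_feature_name": "example_days_since_last_event",
  "grouping_column": "example_id_column",
  "transformation_type": "time_since",
  "date_column": "example_date_column",
  "cutoff_date": "2021-03-25",
  "from_first": false,
  "time_units": "days"
}
```

With from_first set to false, the calculation uses the last date recorded for each ID. Because absent_default_value is omitted in this sketch, IDs that do not appear in the table would receive NA, as described for that key above.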
Transformation types: "combine_linear", "combine_min", "combine_max"
Combination features are a way of combining the results of multiple features into a single feature. They have a slightly different structure from the rest: broadly speaking, these transformation types require a subfeature key, which is itself an object containing the features to be combined.
Combination features are covered in a separate vignette.
While the above may seem like a large number of possible calculations, on their own they offer no way of controlling which parts of the input data are to be considered.
In addition to the keys shown above, (non-combination) features may also contain the preprocess and filter keys, which perform transformations on the input table before the features are calculated from it. Preprocessing refers to the modification of values within a table, whereas filtering does not modify the values, but only allows rows that pass a set of criteria to be considered when calculating the feature.
Preprocessing is performed prior to filtering: thus, if both are specified, filtering is performed on the already-preprocessed values.
Both the preprocess and filter keys are themselves JSON objects, and are detailed respectively in the preprocessing and filtering vignettes.