An overview of features

As the introductory vignette shows, the R code needed to use eider consists of little more than a call to run_pipeline(). Most of the effort in using this library goes into defining the features themselves in JSON.

Features as JSON

Features are JSON objects: collections of key-value pairs enclosed in curly braces. Keys are always strings, and values can be strings, numbers, booleans, arrays, or other objects, as shown in the example object below. Conceptually, JSON objects are similar to named R lists.

{
  "key_1": "a string",
  "key_2": 1,
  "key_3": true,
  "key_4": [1, 2, 3],
  "key_5": {
    "nested_key_1": "a string",
    "nested_key_2": 1
  }
}

To be correctly parsed by eider, each feature must contain a specific set of keys. The keys that are shared across all features are:

  • (string) source_table

    The name of the table to be read in. Note that this is not a filename: it is a unique identifier which is passed as part of the data_sources argument to run_pipeline(). Please see the introductory vignette for an explanation of this.

  • (string) output_feature_name

    This determines the name of the column in the final dataframe that eider produces.

    It can be any string you like, as long as there are no clashes between multiple features; the feature name "id" is also reserved and cannot be used.

  • (string) grouping_column

    This is the name of the column in the input dataframe that the feature will be calculated over. The feature column will contain the result of the calculation for each unique value in this column.

    In the case of medical data, this would typically be the name of the column containing the patient ID. In the remainder of this vignette, we will refer to the values within this column as “IDs”.

  • (number; optional) absent_default_value

    This is the value that will be used for IDs that are not present in the input dataframe. Because eider only calculates numeric features, this has to be a number. If omitted, eider will insert NA values for missing IDs.

  • (string) transformation_type

    This defines the type of calculation performed for the feature. Each transformation type may require an extra set of keys to be specified for the feature to be correctly calculated.
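Putting the shared keys together, a complete feature definition might look like the sketch below. The table identifier "example_table", the column name "patient_id", and the output name "example_row_count" are hypothetical placeholders rather than names from the eider example data. The "count" transformation type is used on the assumption that it needs no keys beyond the shared ones.

```json
{
  "source_table": "example_table",
  "grouping_column": "patient_id",
  "transformation_type": "count",
  "absent_default_value": 0,
  "output_feature_name": "example_row_count"
}
```

Here absent_default_value is set to 0 so that IDs with no rows in the table are reported as having a count of zero rather than NA.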

Transformation types

The available transformation types can be split into a few groups:

Counting

Transformation types: "count", "present"

The two simplest features are "count", which counts the number of occurrences of each ID in the dataset, and "present", which outputs 1 if the ID was found in the dataset and 0 if not.

Examples of the "count" feature type are provided in A&E features 1 and 2, as well as SMR04 feature 1.

The "present" feature type is showcased in A&E feature 3, as well as LTC features 2 and 3.

Summaries

Transformation types: "sum", "nunique", "mean", "median", "sd", "first", "last", "min", "max"

As these features act with respect to the values in a specific column, they require a single extra key to be specified:

  • (string) aggregation_column

    The name of the column containing the values to be aggregated.

The feature will be calculated for each unique ID by aggregating the values in this column.

Example features with summary functions include all the PIS features, and also SMR04 features 2, 3, and 4. These cover the transformation types "nunique", "sum", and "max".
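As a minimal sketch, a "sum" feature only adds aggregation_column to the shared keys. The table identifier "example_prescriptions" and the column names "patient_id" and "num_items" are hypothetical and would need to match your own data.

```json
{
  "source_table": "example_prescriptions",
  "grouping_column": "patient_id",
  "transformation_type": "sum",
  "aggregation_column": "num_items",
  "absent_default_value": 0,
  "output_feature_name": "example_total_items"
}
```

Swapping "sum" for another summary transformation type, such as "max" or "nunique", should leave the rest of the object unchanged.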

Time-based

Transformation types: "time_since"

The time_since transformation type calculates the period of time between a given date and the first (or last) date in the dataset for each ID. This feature requires a few more keys:

  • (string) date_column

    The name of the column containing the dates to be used in the calculation.

  • (string) cutoff_date

    The date to be used as the reference point for the calculation. This should be in the format YYYY-MM-DD.

  • (boolean) from_first

    If true, the feature will calculate the time between the cutoff date and the first date in the dataset for each ID. If false, it will calculate the time between the cutoff date and the last date.

  • (string) time_units

    The unit of time to be used in the calculation. This can be either "days" or "years": a year is defined as 365.25 days.

Examples of "time_since" features are given in A&E feature 4 and LTC feature 1.
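The four extra keys can be combined as in the sketch below, which asks for the number of days between the cutoff date and the most recent date recorded for each ID. The table identifier "example_attendances", the column names "patient_id" and "attendance_date", and the cutoff date itself are hypothetical values chosen for illustration.

```json
{
  "source_table": "example_attendances",
  "grouping_column": "patient_id",
  "transformation_type": "time_since",
  "date_column": "attendance_date",
  "cutoff_date": "2021-01-01",
  "from_first": false,
  "time_units": "days",
  "output_feature_name": "example_days_since_last_attendance"
}
```

Because absent_default_value is omitted here, IDs with no rows in the table would receive NA. Assuming the value is calculated as the cutoff date minus the recorded date, an ID whose most recent date is 2020-12-22 would be expected to receive a value of 10.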

Combination features

Transformation types: "combine_linear", "combine_min", "combine_max"

Combination features combine the results of multiple features into a single feature. Their structure differs slightly from the rest: broadly speaking, these transformation types require a subfeature key, whose value is itself an object containing the features to be combined.

Combination features are covered in a separate vignette.

Preprocessing and filtering

While the above may seem like a large number of possible calculations, on their own they offer no way of controlling which parts of the input data are to be considered.

In addition to the keys shown above, (non-combination) features may also contain the preprocess and filter keys, which transform the input table before the feature is calculated from it. Preprocessing modifies values within the table, whereas filtering leaves the values unchanged and only restricts which rows are considered when calculating the feature: rows must pass a set of criteria to be included.

Preprocessing is performed prior to filtering: thus, if both are specified, filtering is performed on the already-preprocessed values.

Both the preprocess and filter keys are themselves JSON objects, and are detailed respectively in the preprocessing and filtering vignettes.