Preprocessing

library(eider)
library(magrittr)

Each feature JSON may include an optional preprocess object, which modifies the input table before the feature is calculated.

This is primarily useful for data where each row represents some subdivision of a larger entity, and the user wants to calculate features based on information about the larger entity. In particular, this is useful for episodic data, where each row represents an episode within a continuous hospital stay.

Motivation

We begin by making the case for why preprocessing can be required for certain features.

Consider the following data frame. (This is a heavily simplified version of the example SMR04 data bundled with the package, which you can obtain using eider_example('random_smr04_data.csv').)

input_table <- data.frame(
  id = c(1, 1, 1, 1),
  admission_date = as.Date(c(
    "2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
  )),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3),
  episode_within_cis = c(1, 1, 2, 1),
  diagnosis = c("A", "B", "C", "B")
)

input_table
#>   id admission_date discharge_date cis_marker episode_within_cis diagnosis
#> 1  1     2015-01-01     2015-01-05          1                  1         A
#> 2  1     2016-01-01     2016-01-04          2                  1         B
#> 3  1     2016-01-04     2016-01-08          2                  2         C
#> 4  1     2017-01-01     2017-01-08          3                  1         B

Here, each row is an episode; multiple episodes make up a continuous inpatient stay (hence the abbreviation “cis”). The cis_marker field labels stays, and can thus be used to identify episodes belonging to the same stay. In this case, the episode_within_cis field tells us the order of the episodes within a stay; such information is not always present, though.

In this table snippet, there is only one patient: they have had 3 distinct stays; the second of these comprises 2 episodes.

Such data can be tricky to filter, because admission_date and discharge_date pertain to each episode, but we are often interested in stay-level data: for example, when the patient was first admitted to hospital.

Consider the following question: how many stays has a patient had since 5 January 2016 in which they had a diagnosis of “B”? For the patient in this table, the answer is 2: both the 2016 and 2017 stays had a diagnosis of “B”, and both stays ended after 5 January 2016.

If we were to naively perform this calculation without accounting for the fact that the dates are recorded per episode rather than per stay, we could write something like json_examples/preprocessing1.json:

{
  "output_feature_name": "naive",
  "transformation_type": "nunique",
  "source_table": "input_table",
  "aggregation_column": "cis_marker",
  "grouping_column": "id",
  "absent_default_value": 0,
  "filter": {
    "type": "and",
    "subfilter": {
      "date_filter": {
        "column": "discharge_date",
        "type": "date_gt_eq",
        "value": "2016-01-05"
      },
      "diagnosis_filter": {
        "column": "diagnosis",
        "type": "in",
        "value": [
          "B"
        ]
      }
    }
  }
}

Running this would give:

results <- run_pipeline(
  data_sources = list(input_table = input_table),
  feature_filenames = "json_examples/preprocessing1.json"
)

results$features
#>   id naive
#> 1  1     1

We got a value of 1, which is incorrect! What gives? As it happens, the filter was applied to each episode separately. The first episode of the 2016 stay ended on 4 January, before the cutoff of 5 January, so it was filtered out. The second episode of the 2016 stay was also removed, because its diagnosis was not “B”. So only the third stay, in 2017, was counted.
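
The episode-level behaviour described above can be reproduced by hand. The following is a minimal base R sketch, not eider's own implementation: it rebuilds the toy table, evaluates the two filter conditions on each row, and counts the distinct cis_marker values that remain. The variable names (passes_date, passes_diagnosis, kept, naive_count) are illustrative only.

```r
# Same toy table as above, rebuilt so that this snippet runs on its own.
input_table <- data.frame(
  id = c(1, 1, 1, 1),
  admission_date = as.Date(c(
    "2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
  )),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3),
  episode_within_cis = c(1, 1, 2, 1),
  diagnosis = c("A", "B", "C", "B")
)

# Evaluate each half of the "and" filter row by row, i.e. per episode.
passes_date <- input_table$discharge_date >= as.Date("2016-01-05")
passes_diagnosis <- input_table$diagnosis == "B"

# Only episodes satisfying both conditions survive.
kept <- input_table[passes_date & passes_diagnosis, ]

# Count the distinct stays among the surviving episodes.
naive_count <- length(unique(kept$cis_marker))
naive_count
#> [1] 1
```

No single episode of the 2016 stay satisfies both conditions at once, even though the stay as a whole does.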

Preprocessing specification

The way eider approaches this issue is to allow users to preprocess their data. This is accomplished by specifying a preprocess object in the feature JSON. In our case, to merge episode dates into stays, we can say that we would like:

  • for each unique pair of id and cis_marker,
  • replace the value of the admission date with the earliest of all episodes,
  • and replace the discharge date with the latest of all episodes.

In dplyr terms, one would write a pipeline like this:

processed_table <- input_table %>%
  dplyr::group_by(id, cis_marker) %>%
  dplyr::mutate(
    admission_date = min(admission_date),
    discharge_date = max(discharge_date)
  ) %>%
  dplyr::ungroup()

processed_table
#> # A tibble: 4 × 6
#>      id admission_date discharge_date cis_marker episode_within_cis diagnosis
#>   <dbl> <date>         <date>              <dbl>              <dbl> <chr>    
#> 1     1 2015-01-01     2015-01-05              1                  1 A        
#> 2     1 2016-01-01     2016-01-08              2                  1 B        
#> 3     1 2016-01-01     2016-01-08              2                  2 C        
#> 4     1 2017-01-01     2017-01-08              3                  1 B

Notice how the dates for both episodes in stay 2 are now the same, and reflect the overall dates for the stay.

Returning to the eider library, this information is (unsurprisingly) specified in JSON. Including a preprocess object in the feature will cause the input table to be modified as above:

{
  "preprocess": {
    "on": ["id", "cis_marker"],
    "retain_min": ["admission_date"],
    "retain_max": ["discharge_date"]
  }
}

The preprocess object contains one mandatory key:

  • (array of strings) "on": the names of the columns by which the data should be grouped for preprocessing

and several optional keys can be provided, corresponding to the operations which should be performed. All of these keys refer to column names:

  • (array of strings) "retain_min": retain the minimum value within each group
  • (array of strings) "retain_max": retain the maximum value within each group
  • (array of strings) "replace_with_sum": sum the values within each group and replace the original values with the sum

Columns may not be specified in more than one of the above keys (i.e., you cannot preprocess the same column twice).
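
To make the restriction concrete, here is a small base R sketch that checks a specification for repeated columns. It is only an illustration under stated assumptions: the specification is written as a plain R list rather than read from JSON, and find_repeated_columns is a hypothetical helper, not a function exported by eider (which may perform this check differently).

```r
# A preprocess specification as a plain R list (hypothetical example).
# "admission_date" is deliberately listed under two operations.
spec <- list(
  on = c("id", "cis_marker"),
  retain_min = c("admission_date"),
  retain_max = c("discharge_date", "admission_date")
)

# Hypothetical helper: return the columns that appear more than once
# across the operation keys (this also catches a column repeated within
# a single key).
find_repeated_columns <- function(spec) {
  operation_keys <- c("retain_min", "retain_max", "replace_with_sum")
  present_keys <- intersect(operation_keys, names(spec))
  columns <- as.character(unlist(spec[present_keys], use.names = FALSE))
  unique(columns[duplicated(columns)])
}

repeated <- find_repeated_columns(spec)
repeated
#> [1] "admission_date"
```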

Returning to the example

We can now rewrite the feature JSON to include the preprocessing step (json_examples/preprocessing2.json):

{
  "output_feature_name": "correct",
  "transformation_type": "nunique",
  "source_table": "input_table",
  "aggregation_column": "cis_marker",
  "grouping_column": "id",
  "absent_default_value": 0,
  "filter": {
    "type": "and",
    "subfilter": {
      "date_filter": {
        "column": "discharge_date",
        "type": "date_gt_eq",
        "value": "2016-01-05"
      },
      "diagnosis_filter": {
        "column": "diagnosis",
        "type": "in",
        "value": [
          "B"
        ]
      }
    }
  },
  "preprocess": {
    "on": [
      "id",
      "cis_marker"
    ],
    "retain_min": [
      "admission_date"
    ],
    "retain_max": [
      "discharge_date"
    ]
  }
}

and rerunning the pipeline gives us the correct value of 2. Note that although the preprocess object is placed after the filter object in the JSON, the preprocessing is always done prior to filtering. The order of the keys in the JSON has no effect whatsoever on the result.

results <- run_pipeline(
  data_sources = list(input_table = input_table),
  feature_filenames = "json_examples/preprocessing2.json"
)

results$features
#>   id correct
#> 1  1       2
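
For readers who want to trace the order of operations, the same result can be reached by hand. This is a rough base R sketch under the assumption that the pipeline amounts to "preprocess, then filter, then count distinct values"; it is not eider's internal code, and the names episodes, stay_key and correct_count are illustrative.

```r
# Toy episode table (only the columns needed here).
episodes <- data.frame(
  id = c(1, 1, 1, 1),
  admission_date = as.Date(c(
    "2015-01-01", "2016-01-01", "2016-01-04", "2017-01-01"
  )),
  discharge_date = as.Date(c(
    "2015-01-05", "2016-01-04", "2016-01-08", "2017-01-08"
  )),
  cis_marker = c(1, 2, 2, 3),
  diagnosis = c("A", "B", "C", "B")
)

# Step 1 (preprocess): one key per (id, cis_marker) pair, i.e. per stay.
stay_key <- paste(episodes$id, episodes$cis_marker)

# Replace each episode's dates with the stay-level minimum and maximum.
# Dates are converted to day counts so that ave() works on plain numbers.
earliest <- ave(as.numeric(episodes$admission_date), stay_key, FUN = min)
latest <- ave(as.numeric(episodes$discharge_date), stay_key, FUN = max)
episodes$admission_date <- as.Date(earliest, origin = "1970-01-01")
episodes$discharge_date <- as.Date(latest, origin = "1970-01-01")

# Step 2 (filter): the conditions now see stay-level dates.
kept <- episodes[
  episodes$discharge_date >= as.Date("2016-01-05") & episodes$diagnosis == "B",
]

# Step 3 (nunique): distinct stays among the surviving rows.
correct_count <- length(unique(kept$cis_marker))
correct_count
#> [1] 2
```

Because the first 2016 episode now carries the stay's discharge date of 8 January 2016, it passes the date condition while still carrying its own diagnosis of “B”.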

An example for replace_with_sum

To motivate the use of replace_with_sum, we can add a column to our previous data frame to denote the length of each episode:

input_table_with_sum <- input_table %>%
  dplyr::mutate(days = as.numeric(discharge_date - admission_date))

input_table_with_sum
#>   id admission_date discharge_date cis_marker episode_within_cis diagnosis days
#> 1  1     2015-01-01     2015-01-05          1                  1         A    4
#> 2  1     2016-01-01     2016-01-04          2                  1         B    3
#> 3  1     2016-01-04     2016-01-08          2                  2         C    4
#> 4  1     2017-01-01     2017-01-08          3                  1         B    7

Now consider a different question: how many stays has a patient had that lasted a week or more? To answer this, we first need to sum the days for each stay, and can then filter on this sum. This is accomplished with json_examples/preprocessing3.json:

{
  "output_feature_name": "using_sum",
  "transformation_type": "nunique",
  "source_table": "input_table",
  "aggregation_column": "cis_marker",
  "grouping_column": "id",
  "absent_default_value": 0,
  "filter": {
    "column": "days",
    "type": "gt_eq",
    "value": 7
  },
  "preprocess": {
    "on": [
      "id",
      "cis_marker"
    ],
    "replace_with_sum": [
      "days"
    ]
  }
}

results <- run_pipeline(
  data_sources = list(input_table = input_table_with_sum),
  feature_filenames = "json_examples/preprocessing3.json"
)

results$features
#>   id using_sum
#> 1  1         2
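
As a cross-check, the effect of replace_with_sum can be sketched in base R. This is an illustration of the idea rather than eider's implementation; the table is reduced to the three columns involved, and the names episode_days, n_long_naive and n_long are made up for this example.

```r
# Episode lengths from the table above, reduced to the relevant columns.
episode_days <- data.frame(
  id = c(1, 1, 1, 1),
  cis_marker = c(1, 2, 2, 3),
  days = c(4, 3, 4, 7)
)

# Without preprocessing, only individual episodes of 7+ days are found.
n_long_naive <- length(unique(
  episode_days$cis_marker[episode_days$days >= 7]
))
n_long_naive
#> [1] 1

# With preprocessing: replace each episode's days with the stay total.
stay_key <- paste(episode_days$id, episode_days$cis_marker)
episode_days$days <- ave(episode_days$days, stay_key, FUN = sum)
episode_days$days
#> [1] 4 7 7 7

# Filter on the stay totals, then count the distinct stays.
n_long <- length(unique(
  episode_days$cis_marker[episode_days$days >= 7]
))
n_long
#> [1] 2
```

The 2016 stay consists of a 3-day and a 4-day episode, so it only reaches the 7-day threshold once the two are summed.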

See also

The Gallery section contains two examples of preprocessing in action: both PIS feature 4 and SMR04 feature 4 use the replace_with_sum preprocessing function.