This series of vignettes in the Gallery section aims to demonstrate the
functionality of eider through examples that resemble real-life usage.
To this end, we have created a series of randomly generated datasets
that are stored with the package. You can access these datasets using
the eider_example() function, which returns the path to where the
dataset is stored in your installation of R.
ltc_data_filepath <- eider_example("random_ltc_data.csv")
ltc_data_filepath
#> [1] "/tmp/RtmpqoMDH9/Rinstb692df9905c/eider/extdata/random_ltc_data.csv"
In this specific vignette, we are using simulated long-term condition
(LTC) data. Our dataset does not contain every column found in real LTC
data, but it serves as a useful example of how real-life data may be
treated using eider.
ltc_data <- utils::read.csv(ltc_data_filepath) %>%
  dplyr::mutate(
    asthma = lubridate::ymd(asthma),
    diabetes = lubridate::ymd(diabetes),
    parkinsons = lubridate::ymd(parkinsons)
  )
dplyr::glimpse(ltc_data)
#> Rows: 20
#> Columns: 4
#> $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ asthma <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2017-07-22, NA…
#> $ diabetes <date> NA, NA, NA, NA, NA, NA, 2015-06-27, 2017-01-19, NA, NA, 20…
#> $ parkinsons <date> NA, 2017-05-22, NA, NA, NA, 2015-09-22, NA, NA, 2016-12-08…
(Note that when the data is loaded by eider, the date columns are
automatically converted to the date type for you: you do not need to do
the manual processing above.)
This simplified table has 4 columns: id, which is a numeric patient ID;
and asthma, diabetes, and parkinsons, which contain the date on which a
patient was first diagnosed with the corresponding condition. If the
patient has never been diagnosed with that condition, the value is NA.

In this example, we will calculate the number of years since each
patient was first diagnosed with asthma.
years_asthma_filepath <- eider_example("years_with_asthma.json")
writeLines(readLines(years_asthma_filepath))
#> {
#> "source_table": "ltc",
#> "grouping_column": "id",
#> "transformation_type": "time_since",
#> "time_units": "years",
#> "from_first": true,
#> "output_feature_name": "years_with_asthma",
#> "date_column": "asthma",
#> "cutoff_date": "2021-03-25",
#> "absent_default_value": 40
#> }
This is very similar to one of the examples in the A&E data vignette.
Here, we use a "time_since" transformation type, and additionally
specify "time_units" as "years" to obtain the result as a number of
years (formally, the number of days divided by 365.25).

In this particular example, the "from_first" parameter is set to true.
Because each patient has only one row in the table, the ‘first’ date is
also the only date, and so this parameter could equally well be set to
false. (However, it cannot be omitted, as it is a required parameter
for the "time_since" transformation type.)
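Before running the pipeline, it may help to see the "years" conversion
written out by hand. The following sketch is purely an illustration of
the formula above (days divided by 365.25), using patient 11's asthma
date from the table and the cutoff date from the JSON; it is not how
eider computes the feature internally, and eider may round or truncate
the value it reports.

# Illustration only: the "days divided by 365.25" conversion by hand.
cutoff    <- as.Date("2021-03-25")           # "cutoff_date" from the JSON above
diagnosis <- as.Date("2017-07-22")           # patient 11's asthma date
days      <- as.numeric(cutoff - diagnosis)  # number of days elapsed
days / 365.25                                # roughly 3.67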
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = years_asthma_filepath
)
dplyr::glimpse(res$features)
#> Rows: 20
#> Columns: 2
#> $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ years_with_asthma <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 3, 40, 4…
This example is slightly more interesting because it involves a more
ingenious filter operation. We would like a binary feature which has
the value 1 if the patient has asthma, and 0 otherwise. However, we
cannot simply use a "present" or "count" transformation type without
filtering, because every patient appears in this table.

We first need to filter the table so that all the rows with an NA value
in the asthma column are removed. However, eider’s filter operation
does not support filtering based on NA values! To work around this, we
can filter based on the dates instead: if we keep only the rows where
the date is greater than some sentinel value a long time in the past,
any genuine date in the table will pass this test, but NA values will
not.

Thus, what we need is a "date_gt" filter with a value that is suitably
far in the past that any real date in the table will come after it.
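To see why this trick works, here is a rough dplyr equivalent of the
filter (a sketch of the idea only, not eider’s actual implementation):
comparing an NA date against the sentinel yields NA, and
dplyr::filter() keeps only rows whose condition is TRUE, so the rows
with missing asthma dates are dropped.

# Sketch of the sentinel-date idea in plain dplyr (not eider's internals):
# comparing NA against the sentinel gives NA, and filter() drops such rows,
# so only patients with a genuine asthma date remain.
ltc_data %>%
  dplyr::filter(asthma > as.Date("1800-01-01"))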
has_asthma_filepath <- eider_example("has_asthma.json")
writeLines(readLines(has_asthma_filepath))
#> {
#> "source_table": "ltc",
#> "transformation_type": "present",
#> "grouping_column": "id",
#> "output_feature_name": "has_asthma",
#> "filter": {
#> "column": "asthma",
#> "type": "date_gt",
#> "value": "1800-01-01"
#> }
#> }
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = has_asthma_filepath
)
dplyr::glimpse(res$features)
#> Rows: 20
#> Columns: 2
#> $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ has_asthma <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
As a final example, we will calculate the number of long-term
conditions each patient has a diagnosis for. This essentially involves
calculating one binary 0/1 feature per condition (much like the
has_asthma feature above), and then summing them up. Thus, we need to
use a "combine_linear" transformation type, with the weight of each
individual feature set to 1 (see the combination feature vignette for
more information).

The full JSON looks like this:
num_ltcs_filepath <- eider_example("number_of_ltcs.json")
writeLines(readLines(num_ltcs_filepath))
#> {
#> "transformation_type": "combine_linear",
#> "output_feature_name": "number_of_ltcs",
#> "subfeature": {
#> "asthma": {
#> "weight": 1,
#> "source_table": "ltc",
#> "transformation_type": "present",
#> "grouping_column": "id",
#> "filter": {
#> "column": "asthma",
#> "type": "date_gt",
#> "value": "1800-01-01"
#> }
#> },
#> "diabetes": {
#> "weight": 1,
#> "source_table": "ltc",
#> "transformation_type": "present",
#> "grouping_column": "id",
#> "filter": {
#> "column": "diabetes",
#> "type": "date_gt",
#> "value": "1800-01-01"
#> }
#> },
#> "parkinsons": {
#> "weight": 1,
#> "source_table": "ltc",
#> "transformation_type": "present",
#> "grouping_column": "id",
#> "filter": {
#> "column": "parkinsons",
#> "type": "date_gt",
#> "value": "1800-01-01"
#> }
#> }
#> }
#> }
The subfeature object contains a named list of the individual features
that we want to combine: each of these has exactly the same structure
as before, except that the filtering is performed on a different column
each time. Each individual subfeature is also given a "weight" of 1, as
described previously. Finally, the "output_feature_name" field is
specified at the top level of the JSON rather than in each individual
subfeature.
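Conceptually, a "combine_linear" feature multiplies each subfeature by
its weight and sums the results; with every weight equal to 1, this is
just a plain sum of the three indicators. The sketch below reproduces
that calculation by hand from ltc_data, using is.na() directly rather
than the sentinel-date filter for brevity; it is an illustration of the
idea, not eider’s implementation.

# Hand-rolled equivalent of the linear combination above (illustration only):
# one 0/1 indicator per condition, each weighted by 1, then summed per patient.
ltc_data %>%
  dplyr::mutate(
    has_asthma     = as.integer(!is.na(asthma)),
    has_diabetes   = as.integer(!is.na(diabetes)),
    has_parkinsons = as.integer(!is.na(parkinsons)),
    number_of_ltcs = 1 * has_asthma + 1 * has_diabetes + 1 * has_parkinsons
  ) %>%
  dplyr::select(id, number_of_ltcs)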
res <- run_pipeline(
  data_sources = list(ltc = ltc_data_filepath),
  feature_filenames = num_ltcs_filepath
)
dplyr::glimpse(res$features)
#> Rows: 20
#> Columns: 2
#> $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ number_of_ltcs <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0…