Filtering

Filtering relies on a filter key in the feature JSON, for which the corresponding value is an object:

{
    "output_feature_name": "..."
    ...,
    "filter": {
       ...
    },
    ...
}

This vignette will describe the types of filters available and how they can be expressed in this JSON object.

Basic filters

The most basic filter would be to accept all rows where the value in a given column meets a certain criterion. For example, we may only want to look at rows for which the value in the age column is greater than 18. There are three pieces of information here which must be encoded:

  • (string) column: the name of the column to filter on
  • (string) type: the type of filter to apply
  • value: the value to compare against. The type of this must match the type of the column specified.

The type key can take one of the following values:

  • "in"

    In this case, the value key can either be a single value or an array of multiple values. This filers for rows where the value in the column matches one of the provided values.

  • "gt" / "lt" / "gt_eq" / "lt_eq"

    The value in the column must be greater than, less than, greater than or equal to, or less than or equal to the value respectively. For these filters, only one value can be passed.

  • "date_gt" / "date_lt" / "date_gt_eq" / "date_lt_eq"

    The same as above, but to be used for dates. This distinction is mandatory because JSON does not have a native date type, so the value must be passed as a string in the format YYYY-MM-DD.

Thus, for example, if we assume that the column age contains integers, then

{
    "filter": {
        "column": "age",
        "type": "in",
        "value": [18, 19, 20]
    }
}

selects for people who are 18, 19, or 20 years old; and likewise if the date_of_birth column contains dates, then

{
    "filter": {
        "column": "date_of_birth",
        "type": "date_gt_eq",
        "value": "2000-01-01"
    }
}

selects for people born on or after January 1, 2000.

As a reminder, the table to be filtered is passed as the source_table key in the feature JSON. This is described in the introductory vignette.

Compound filters

Filters may also be combined using the Boolean logic operators AND, OR, and NOT. Instead of providing the three keys column, type, and value, only type is specified, together with subfilter, which is itself an object which maps filter names to filter objects.

An example is probably the easiest way to explain this. If we wanted to select for people who are not 18, 19, or 20 years old, we could write:

{
    "filter": {
        "type": "not",
        "subfilter": {
            "age_filter": {
                "column": "age",
                "type": "in",
                "value": [18, 19, 20]
            }
        }
    }
}

The filter name, age_filter, can be any string you want: it is used only for logging purposes. Notice how the object that age_filter refers to, i.e.,

{
    "column": "age",
    "type": "in",
    "value": [18, 19, 20]
}

is itself just a basic filter as has already been described above.

or and and filters work in exactly the same way, with the sole difference being that to accomplish any meaningful results you would generally want the subfilter object to contain two or more filters that work in tandem. For example, the below filters for 18-year-olds who have a height of over 180 (presumably centimetres; although this depends on the input table!):

{
    "filter": {
        "type": "and",
        "subfilter": {
            "age_filter": {
                "column": "age",
                "type": "in",
                "value": [18]
            },
            "height_filter": {
                "column": "height",
                "type": "gt",
                "value": 180
            }
        }
    }
}

Again, age_filter and height_filter are just names for logging purposes, and themselves refer to basic filters.

Finally, note that the subfilters need not necessarily be basic filters, though: they can be compound filters themselves. So it is possible to nest filters as deeply as you like to accomplish the desired result.

Examples

library(eider)

Consider the following input table with five rows:

input_table <- data.frame(
  id = c(1, 1, 1, 2, 3),
  date = as.Date(c(
    "2024-01-01", "2024-01-02", "2024-01-03",
    "2023-01-01", "2023-01-01"
  )),
  diagnosis = c("A", "B", "C", "B", "C")
)
input_table
#>   id       date diagnosis
#> 1  1 2024-01-01         A
#> 2  1 2024-01-02         B
#> 3  1 2024-01-03         C
#> 4  2 2023-01-01         B
#> 5  3 2023-01-01         C

and suppose we would like to count the number of times each person has had a diagnosis of “B”. This is a transformation_type of "count"; but if we do not perform any filtering, "count" will simply return the number of times each person appears in this dataframe.

The following JSON is provided as json_examples/filter1.json:

{
  "output_feature_name": "unfiltered_count",
  "transformation_type": "count",
  "source_table": "input_table",
  "grouping_column": "id",
  "absent_default_value": 0
}

Passing this into run_pipeline() will give us the following result:

results <- run_pipeline(
  data_sources = list(input_table = input_table),
  feature_filenames = "json_examples/filter1.json"
)

results$features
#>   id unfiltered_count
#> 1  1                3
#> 2  2                1
#> 3  3                1

Instead of this, if we specify a filter for the diagnosis column (in json_examples/filter2.json), we can achieve the desired result:

{
  "output_feature_name": "filtered_count",
  "transformation_type": "count",
  "source_table": "input_table",
  "grouping_column": "id",
  "absent_default_value": 0,
  "filter": {
    "column": "diagnosis",
    "type": "in",
    "value": [
      "B"
    ]
  }
}
results <- run_pipeline(
  data_sources = list(input_table = input_table),
  feature_filenames = "json_examples/filter2.json"
)

results$features
#>   id filtered_count
#> 1  1              1
#> 2  2              1
#> 3  3              0

Note how patient 3 has a count of 0, because they have never been diagnosed with “B”. This is specified by the absent_default_value key in the JSON.

If we would like to find diagnoses of “B” only in 2024, then we can add a date filter to the mix, and combine it with the diagnosis filter with an AND filter (json_examples/filter3.json):

{
  "output_feature_name": "doubly_filtered_count",
  "transformation_type": "count",
  "source_table": "input_table",
  "grouping_column": "id",
  "absent_default_value": 0,
  "filter": {
    "type": "and",
    "subfilter": {
      "diagnosis_filter": {
        "column": "diagnosis",
        "type": "in",
        "value": [
          "B"
        ]
      },
      "date_filter": {
        "column": "date",
        "type": "date_gt_eq",
        "value": "2024-01-01"
      }
    }
  }
}
results <- run_pipeline(
  data_sources = list(input_table = input_table),
  feature_filenames = "json_examples/filter3.json"
)

results$features
#>   id doubly_filtered_count
#> 1  1                     1
#> 2  2                     0
#> 3  3                     0

Patient 2’s diagnosis has been excluded this time because the corresponding entry came in 2023.

See also

Filtering is used widely in the Gallery examples. See, for example: