Filtering datasources
Most data tables are returned as regular tibble
objects,
but some datasources contain large tables and require filtering at the
source. For example, OECD database usually requires filtering on the
data before anything can be retrieved. For convenience, Robonomist
provides an easy-to-use workflow to fetch data.
The OECD SDMX-ML api, on which the
oecd
dataset is based, has become unstable recently. We recommened to using the newoecd3
database, which uses the SDMX-json api which is more reliable.
The initial data request (e.g data("oecd3/QNA")
) returns
a data structure object that acts very much like a regular tibble, but
it does not yet contain the actual data. This object is printed like a
tibble, and it can be handled with common dplyr
verbs such
as filter
and distinct
in a
magrittr
pipe using %>%
.
data("oecd3/QNA")
#> # Title: Quarterly National Accounts
#> # OECD_v3: Uncollected data structure
#> # A tibble: 3,646,536 × 6
#> Country Subject Measure Frequency time value
#> <chr> <chr> <chr> <chr> <date> <collect>
#> 1 Finland Net purchases abroad National currenc… Quarterly 1990-01-01 ??
#> 2 Finland Net purchases abroad National currenc… Quarterly 1990-04-01 ??
#> 3 Finland Net purchases abroad National currenc… Quarterly 1990-07-01 ??
#> 4 Finland Net purchases abroad National currenc… Quarterly 1990-10-01 ??
#> 5 Finland Net purchases abroad National currenc… Quarterly 1991-01-01 ??
#> 6 Finland Net purchases abroad National currenc… Quarterly 1991-04-01 ??
#> 7 Finland Net purchases abroad National currenc… Quarterly 1991-07-01 ??
#> 8 Finland Net purchases abroad National currenc… Quarterly 1991-10-01 ??
#> 9 Finland Net purchases abroad National currenc… Quarterly 1992-01-01 ??
#> 10 Finland Net purchases abroad National currenc… Quarterly 1992-04-01 ??
#> # ℹ 3,646,526 more rows
#> # This data has not yet been collected from OECD api, and all rows might not be available. Please use `filter()` to limit the number of rows under a million, and use `collect()` to retrieve actual data.
x <-
data("oecd3/QNA") %>%
filter(Country == "Finland",
str_detect(Subject, "Gross domestic prod"),
Frequency=="Quarterly") |>
filter(lubridate::year(time) > 2019L)
x
#> # Title: Quarterly National Accounts
#> # OECD_v3: Uncollected data structure
#> # A tibble: 390 × 6
#> Country Subject Measure Frequency time value
#> <chr> <chr> <chr> <chr> <date> <collect>
#> 1 Finland Gross domestic product at mar… Nation… Quarterly 2020-01-01 ??
#> 2 Finland Gross domestic product at mar… Nation… Quarterly 2020-04-01 ??
#> 3 Finland Gross domestic product at mar… Nation… Quarterly 2020-07-01 ??
#> 4 Finland Gross domestic product at mar… Nation… Quarterly 2020-10-01 ??
#> 5 Finland Gross domestic product at mar… Nation… Quarterly 2021-01-01 ??
#> 6 Finland Gross domestic product at mar… Nation… Quarterly 2021-04-01 ??
#> 7 Finland Gross domestic product at mar… Nation… Quarterly 2021-07-01 ??
#> 8 Finland Gross domestic product at mar… Nation… Quarterly 2021-10-01 ??
#> 9 Finland Gross domestic product at mar… Nation… Quarterly 2022-01-01 ??
#> 10 Finland Gross domestic product at mar… Nation… Quarterly 2022-04-01 ??
#> # ℹ 380 more rows
#> # This data has not yet been collected from OECD api, and all rows might not be available. Please use `filter()` to limit the number of rows under a million, and use `collect()` to retrieve actual data.
distinct(x, Subject)
#> # A tibble: 4 × 1
#> Subject
#> <chr>
#> 1 Gross domestic product at market prices - output approach
#> 2 Gross domestic product - expenditure approach
#> 3 Gross domestic product - income approach
#> 4 Gross domestic product
After data structure object has been filtered, the actual data can be
collected with collect()
.
x |> collect()
#> # Robonomist id: oecd3/QNA
#> # Title: Quarterly National Accounts
#> # Vintage: 2024-06-13 05:31:05.142084
#> # A tibble: 390 × 6
#> Country Subject Measure Frequency time value
#> <chr> <chr> <chr> <chr> <date> <dbl>
#> 1 Finland Gross domestic product at market … Nation… Quarterly 2020-01-01 58424
#> 2 Finland Gross domestic product at market … Nation… Quarterly 2020-04-01 57751
#> 3 Finland Gross domestic product at market … Nation… Quarterly 2020-07-01 59082
#> 4 Finland Gross domestic product at market … Nation… Quarterly 2020-10-01 62781
#> 5 Finland Gross domestic product at market … Nation… Quarterly 2021-01-01 57665
#> 6 Finland Gross domestic product at market … Nation… Quarterly 2021-04-01 62729
#> 7 Finland Gross domestic product at market … Nation… Quarterly 2021-07-01 62551
#> 8 Finland Gross domestic product at market … Nation… Quarterly 2021-10-01 67719
#> 9 Finland Gross domestic product at market … Nation… Quarterly 2022-01-01 62705
#> 10 Finland Gross domestic product at market … Nation… Quarterly 2022-04-01 67312
#> # ℹ 380 more rows
Also common dplyr
verbs that require the actual data
will trigger the collect
function automatically.
data("oecd3/QNA") %>%
filter(Subject == "Gross domestic product - expenditure approach",
str_detect(Measure, "Growth rate.*year"),
Frequency == "Quarterly") %>%
filter(lubridate::year(time) >= 2015) %>%
group_by(Country) %>%
summarize(`Average Q/Q growth rate` = mean(value))
#> # A tibble: 58 × 2
#> Country `Average Q/Q growth rate`
#> <chr> <dbl>
#> 1 Argentina 0.503
#> 2 Australia 2.41
#> 3 Austria 1.26
#> 4 Belgium 1.70
#> 5 Brazil 0.542
#> 6 Bulgaria 2.84
#> 7 Canada 1.64
#> 8 Chile 2.11
#> 9 China (People's Republic of) 5.86
#> 10 Colombia 2.80
#> # ℹ 48 more rows