Filtering large data tables • robonomistClient

Filtering datasources

Most data tables are returned as regular tibble objects, but some datasources contain large tables and require filtering at the source. For example, OECD database usually requires filtering on the data before anything can be retrieved. For convenience, Robonomist provides an easy-to-use workflow to fetch data.

The OECD SDMX-ML api, on which the oecd dataset is based, has become unstable recently. We recommened to using the new oecd3 database, which uses the SDMX-json api which is more reliable.

The initial data request (e.g data("oecd3/QNA")) returns a data structure object that acts very much like a regular tibble, but it does not yet contain the actual data. This object is printed like a tibble, and it can be handled with common dplyr verbs such as filter and distinct in a magrittr pipe using %>%.

data("oecd3/QNA")
#> # Title:    Quarterly National Accounts
#> # OECD_v3:  Uncollected data structure
#> # A tibble: 3,646,536 × 6
#>    Country Subject              Measure           Frequency time           value
#>    <chr>   <chr>                <chr>             <chr>     <date>     <collect>
#>  1 Finland Net purchases abroad National currenc… Quarterly 1990-01-01        ??
#>  2 Finland Net purchases abroad National currenc… Quarterly 1990-04-01        ??
#>  3 Finland Net purchases abroad National currenc… Quarterly 1990-07-01        ??
#>  4 Finland Net purchases abroad National currenc… Quarterly 1990-10-01        ??
#>  5 Finland Net purchases abroad National currenc… Quarterly 1991-01-01        ??
#>  6 Finland Net purchases abroad National currenc… Quarterly 1991-04-01        ??
#>  7 Finland Net purchases abroad National currenc… Quarterly 1991-07-01        ??
#>  8 Finland Net purchases abroad National currenc… Quarterly 1991-10-01        ??
#>  9 Finland Net purchases abroad National currenc… Quarterly 1992-01-01        ??
#> 10 Finland Net purchases abroad National currenc… Quarterly 1992-04-01        ??
#> # ℹ 3,646,526 more rows
#> # This data has not yet been collected from OECD api, and all rows might not be available. Please use `filter()` to limit the number of rows under a million, and use `collect()` to retrieve actual data.

x <-
  data("oecd3/QNA") %>%
  filter(Country == "Finland",
         str_detect(Subject, "Gross domestic prod"),
         Frequency=="Quarterly") |>
  filter(lubridate::year(time) > 2019L)
x
#> # Title:    Quarterly National Accounts
#> # OECD_v3:  Uncollected data structure
#> # A tibble: 390 × 6
#>    Country Subject                        Measure Frequency time           value
#>    <chr>   <chr>                          <chr>   <chr>     <date>     <collect>
#>  1 Finland Gross domestic product at mar… Nation… Quarterly 2020-01-01        ??
#>  2 Finland Gross domestic product at mar… Nation… Quarterly 2020-04-01        ??
#>  3 Finland Gross domestic product at mar… Nation… Quarterly 2020-07-01        ??
#>  4 Finland Gross domestic product at mar… Nation… Quarterly 2020-10-01        ??
#>  5 Finland Gross domestic product at mar… Nation… Quarterly 2021-01-01        ??
#>  6 Finland Gross domestic product at mar… Nation… Quarterly 2021-04-01        ??
#>  7 Finland Gross domestic product at mar… Nation… Quarterly 2021-07-01        ??
#>  8 Finland Gross domestic product at mar… Nation… Quarterly 2021-10-01        ??
#>  9 Finland Gross domestic product at mar… Nation… Quarterly 2022-01-01        ??
#> 10 Finland Gross domestic product at mar… Nation… Quarterly 2022-04-01        ??
#> # ℹ 380 more rows
#> # This data has not yet been collected from OECD api, and all rows might not be available. Please use `filter()` to limit the number of rows under a million, and use `collect()` to retrieve actual data.

distinct(x, Subject)
#> # A tibble: 4 × 1
#>   Subject                                                  
#>   <chr>                                                    
#> 1 Gross domestic product at market prices - output approach
#> 2 Gross domestic product - expenditure approach            
#> 3 Gross domestic product - income approach                 
#> 4 Gross domestic product

After data structure object has been filtered, the actual data can be collected with collect().

x |> collect()
#> # Robonomist id: oecd3/QNA
#> # Title:         Quarterly National Accounts
#> # Vintage:       2024-06-13 05:31:05.142084
#> # A tibble:      390 × 6
#>    Country Subject                            Measure Frequency time       value
#>    <chr>   <chr>                              <chr>   <chr>     <date>     <dbl>
#>  1 Finland Gross domestic product at market … Nation… Quarterly 2020-01-01 58424
#>  2 Finland Gross domestic product at market … Nation… Quarterly 2020-04-01 57751
#>  3 Finland Gross domestic product at market … Nation… Quarterly 2020-07-01 59082
#>  4 Finland Gross domestic product at market … Nation… Quarterly 2020-10-01 62781
#>  5 Finland Gross domestic product at market … Nation… Quarterly 2021-01-01 57665
#>  6 Finland Gross domestic product at market … Nation… Quarterly 2021-04-01 62729
#>  7 Finland Gross domestic product at market … Nation… Quarterly 2021-07-01 62551
#>  8 Finland Gross domestic product at market … Nation… Quarterly 2021-10-01 67719
#>  9 Finland Gross domestic product at market … Nation… Quarterly 2022-01-01 62705
#> 10 Finland Gross domestic product at market … Nation… Quarterly 2022-04-01 67312
#> # ℹ 380 more rows

Also common dplyr verbs that require the actual data will trigger the collect function automatically.

data("oecd3/QNA") %>%
  filter(Subject == "Gross domestic product - expenditure approach",
         str_detect(Measure, "Growth rate.*year"),
         Frequency == "Quarterly") %>%
  filter(lubridate::year(time) >= 2015) %>%
  group_by(Country) %>%
  summarize(`Average Q/Q growth rate` = mean(value))
#> # A tibble: 58 × 2
#>    Country                      `Average Q/Q growth rate`
#>    <chr>                                            <dbl>
#>  1 Argentina                                        0.503
#>  2 Australia                                        2.41 
#>  3 Austria                                          1.26 
#>  4 Belgium                                          1.70 
#>  5 Brazil                                           0.542
#>  6 Bulgaria                                         2.84 
#>  7 Canada                                           1.64 
#>  8 Chile                                            2.11 
#>  9 China (People's Republic of)                     5.86 
#> 10 Colombia                                         2.80 
#> # ℹ 48 more rows