Machine_readable.Rmd
# packages used in this walkthrough
library(sysrevdata)
library(tidyverse)
library(leaflet)
# prefer dplyr::filter over stats::filter whenever the two conflict
conflicted::conflict_prefer("filter", "dplyr")
Along with a narrative synthesis, systematic reviews and maps typically require complex visualisations. In addition, it is common for review authors to upgrade a systematic map into one or more focused systematic reviews, which may involve . Before this can happen, review authors need to convert their database of studies from a condensed or wide format into a format that is ready for analysis. The easiest way to do this is to produce a long or tidy format, one variable per study per row.
This is the ideal way to share data for future syntheses, as it is machine readable for multiple analyses. It also explicitly separates connected data, rather than compressing multiple methods and outcomes onto a single line, losing links between columns. In this long version, each linkage between populations, interventions, comparators, outcomes, study methods, etc. is preserved explicitly on a separate row - a row of independent data.
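To make the three layouts concrete, here is a minimal sketch using two made-up studies. The study names, the `outcome` values, and the `toy_*` object names are hypothetical and are not part of the bufferstrips data; the column names `category_type` and `subcategory_value` are the ones we use later in this walkthrough.

```r
library(tibble)

# Wide: one column per level of each variable, one row per study
toy_wide <- tibble(
  short_title      = c("Study A", "Study B"),
  outcome_nitrogen = c("Nitrogen", NA),
  outcome_sediment = c("Sediment", "Sediment")
)

# Condensed: one column per variable, with the levels pasted together
toy_condensed <- tibble(
  short_title = c("Study A", "Study B"),
  outcome     = c("Nitrogen; Sediment", "Sediment")
)

# Long: one row per study per variable level
toy_long <- tibble(
  short_title       = c("Study A", "Study A", "Study B"),
  category_type     = "outcome",
  subcategory_value = c("Nitrogen", "Sediment", "Sediment")
)
```

All three tables hold the same three study-outcome links; only the long table stores each link on its own row.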
In this walkthrough, we’ll consider the bufferstrips dataset, which starts off in wide format (each level of each variable is presented as a separate column).
# spatial variables
buffer_example %>%
select(short_title, contains("spatial"))
#> # A tibble: 5 x 7
#> short_title spatialscale_pl~ spatialscale_fi~ spatialscale_fa~
#> <chr> <chr> <chr> <chr>
#> 1 Aaron (200~ <NA> <NA> <NA>
#> 2 Aavik (200~ <NA> <NA> <NA>
#> 3 Aavik (201~ <NA> <NA> <NA>
#> 4 Abu-Zreig ~ Plot scale <NA> <NA>
#> 5 Abu-Zreig ~ Plot scale <NA> <NA>
#> # ... with 3 more variables: spatialscale_catchment <chr>,
#> # spatialscale_regional <chr>, spatialscale_notdescribed <chr>
We want to convert these wide data to long-form data, dropping the NA elements and producing a table where each row holds one value of one variable for one study.
We’ll use the collection of variables we obtained in the “Creating a narrative synthesis table” vignette.
buffer_variables
#> [1] "es" "farmingproductionsystem"
#> [3] "farmingsystem" "intervention"
#> [5] "measurementquarter" "outcome"
#> [7] "spatialscale" "striplocation"
#> [9] "stripmanagement" "studydesign"
#> [11] "vegetationtype"
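These names are matched against column names with `contains()`, which looks for the string anywhere in a column name and ignores case by default. Short names such as `"es"` can therefore pick up unrelated columns. The sketch below illustrates this on a hypothetical one-row table (the values are made up) and shows one possible stricter alternative, an anchored regular expression passed to `matches()`. Whether the stricter match is appropriate depends on how your own columns are named.

```r
library(dplyr)
library(tibble)

# hypothetical toy table; the values are made up
toy <- tibble(
  short_title                 = "Study A",
  es_provisioning             = "Provisioning",
  vegetated_strip_description = "Riparian buffer",
  spatialscale_plot           = "Plot scale"
)

# contains() matches "es" anywhere in the name, so "description" also qualifies
loose <- toy %>% select(contains("es")) %>% names()

# matches() with an anchored pattern keeps only names that start with "es"
# followed by an underscore or the end of the name
strict <- toy %>% select(matches("^es(_|$)")) %>% names()

loose
strict
```

Here `loose` includes `vegetated_strip_description` as well as `es_provisioning`, while `strict` contains only `es_provisioning`.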
Here’s one way of transforming the data to long format.
buffer_example_long <-
  buffer_example %>%
  # this function pivots longer
  pivot_longer(
    # see the narrative vignette for how this vector was created
    cols = contains(buffer_variables),
    # name of the column that will hold the wide data's column names
    names_to = "category_type",
    # name of the column that will hold the values of those columns
    values_to = "subcategory_value",
    # drop the NA values
    values_drop_na = TRUE
  ) %>%
  mutate(
    # extract the part of the column name after the first _ (NA if there is no _)
    subcategory_type = map_chr(
      category_type,
      .f = function(x) {
        ifelse(
          str_detect(x, "_"),
          str_match(x, "_(\\w+)") %>% pluck(2),
          NA_character_
        )
      }
    ),
    # extract the prefix of the column names with _
    category_type = if_else(
      str_detect(category_type, "_"),
      str_extract(category_type, "[a-z]*"),
      category_type
    )
  )
# newly created columns
buffer_example_long %>%
select(short_title, category_type, subcategory_type, subcategory_value)
#> # A tibble: 97 x 4
#> short_title category_type subcategory_type subcategory_value
#> <chr> <chr> <chr> <chr>
#> 1 Aaron (2005) vegetated strip_description Riparian buffer
#> 2 Aaron (2005) studydesign observational Observational
#> 3 Aaron (2005) farmingsystem notdescribed Not described
#> 4 Aaron (2005) farmingproductionsystem notdescribed Not described
#> 5 Aaron (2005) vegetationtype notdescribed Not described
#> 6 Aaron (2005) stripmanagement notdescribed Not described
#> 7 Aaron (2005) intervention presence Strip presence
#> 8 Aaron (2005) intervention presenceinfo Percentage riparian~
#> 9 Aaron (2005) es supporting_biodive~ Biodiversity
#> 10 Aaron (2005) Time since interventio~ <NA> Not stated
#> # ... with 87 more rows
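The prefix and suffix split done with `mutate()` above can also be expressed inside `pivot_longer()` itself, by giving `names_to` two names and a `names_pattern` with two capture groups. This is a minimal sketch on a hypothetical toy table, assuming every selected column name contains at least one underscore; a column without one would not match the pattern and would end up with NA in both name columns.

```r
library(tidyr)
library(tibble)

# hypothetical toy table; the values are made up
toy_wide <- tibble(
  short_title                = c("Study A", "Study B"),
  spatialscale_plot          = c("Plot scale", NA),
  es_supporting_biodiversity = c("Biodiversity", "Biodiversity"),
  studydesign_observational  = c(NA, "Observational")
)

toy_long <- pivot_longer(
  toy_wide,
  cols = -short_title,
  # first group: everything before the first _; second group: everything after
  names_to = c("category_type", "subcategory_type"),
  names_pattern = "^([^_]+)_(.+)$",
  values_to = "subcategory_value",
  values_drop_na = TRUE
)

toy_long
```

Splitting at the first underscore keeps suffixes such as `supporting_biodiversity` intact, which `names_sep = "_"` would not.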
Suppose, however, that we began with condensed data, like those we created in the “Creating a narrative synthesis table” vignette.
condensed_buffer_example
#> # A tibble: 5 x 24
#> item_id short_title title year period google_scholar_~ nation study_country
#> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 2.06e7 Aaron (200~ Inve~ 2005 2005-~ http://scholar.~ USA Maryland, USA
#> 2 2.06e7 Aavik (200~ What~ 2008 2005-~ http://scholar.~ Eston~ Estonia
#> 3 2.06e7 Aavik (201~ Quan~ 2010 2010-~ http://scholar.~ Eston~ Estonia
#> 4 2.06e7 Abu-Zreig ~ Expe~ 2004 2000-~ http://scholar.~ Not s~ Not stated
#> 5 2.06e7 Abu-Zreig ~ Phos~ 2003 2000-~ http://scholar.~ Canada Ontario, Can~
#> # ... with 16 more variables: study_location <chr>, latitute <chr>,
#> # longitude <chr>, `Study length (years)` <chr>,
#> # itervention_structureinfo <chr>, es <chr>, farmingproductionsystem <chr>,
#> # farmingsystem <chr>, intervention <chr>, measurementquarter <chr>,
#> # outcome <chr>, spatialscale <chr>, striplocation <chr>,
#> # stripmanagement <chr>, studydesign <chr>, vegetationtype <chr>
We want to take these data and create the same long format we have above.
condensed_buffer_example %>%
  pivot_longer(
    contains(buffer_variables),
    names_to = "category_type",
    values_to = "subcategory_value"
  ) %>%
  separate(
    subcategory_value,
    # there are a maximum of 8 different subcategories
    into = letters[1:8],
    sep = "; ",
    # pad entries with fewer than 8 subcategories with NA, without a warning
    fill = "right"
  ) %>%
  # all_of() selects the columns named in an external character vector
  pivot_longer(
    all_of(letters[1:8]),
    values_to = "subcategory_value"
  ) %>%
  # get rid of NAs
  filter(!is.na(subcategory_value)) %>%
  # drop redundant column
  select(-name) %>%
  # from here is just for display
  select(short_title, category_type, subcategory_value)
#> # A tibble: 134 x 3
#> short_title category_type subcategory_value
#> <chr> <chr> <chr>
#> 1 Aaron (2005) es Riparian buffer
#> 2 Aaron (2005) es Observational
#> 3 Aaron (2005) es Not described
#> 4 Aaron (2005) es Not described
#> 5 Aaron (2005) es Not described
#> 6 Aaron (2005) es Not described
#> 7 Aaron (2005) es Strip presence
#> 8 Aaron (2005) es Percentage riparian cover
#> 9 Aaron (2005) studydesign Observational
#> 10 Aaron (2005) farmingproductionsystem Not described
#> # ... with 124 more rows
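The approach above needs to know the maximum number of subcategories in advance. As an alternative sketch, `tidyr::separate_rows()` splits a delimited column directly into one row per piece, however many pieces there are. The toy table below is hypothetical; the `"; "` separator is the one used in the condensed data above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# hypothetical condensed toy table; the values are made up
toy_condensed <- tibble(
  short_title = c("Study A", "Study B"),
  outcome     = c("Nitrogen; Phosphorus", "Sediment"),
  studydesign = c("Observational", NA)
)

toy_long <- toy_condensed %>%
  pivot_longer(
    -short_title,
    names_to = "category_type",
    values_to = "subcategory_value",
    values_drop_na = TRUE
  ) %>%
  # one row per "; "-delimited piece
  separate_rows(subcategory_value, sep = "; ")

toy_long
```

Recent versions of tidyr also offer `separate_longer_delim()` for the same job; `separate_rows()` is used here because it is available in older versions too.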
Now that our data are in long form, we can perform various analyses, including (if we wanted to) a meta-analysis of the full quantitative data extracted from each study.
We might be interested in the number of studies per country in the systematic review. For this we can use the original data, where each row is a study: each study was conducted in a specific country, so studies are the independent units here and long data aren’t necessary yet.
bufferstrips %>%
count(study_country)
#> # A tibble: 113 x 2
#> study_country n
#> <chr> <int>
#> 1 Alberta, Canada 1
#> 2 Argentina 5
#> 3 Arkansas, Kentucky and Mississippi, USA 1
#> 4 Arkansas, USA 5
#> 5 Austria 3
#> 6 Belgium 15
#> 7 British Columbia, Canada 4
#> 8 California, USA 6
#> 9 Central and eastern USA 1
#> 10 Central district, Russia 1
#> # ... with 103 more rows
But to count, say, the observations for each farming production system, we use our long data.
buffer_example_long %>%
filter(category_type == "farmingproductionsystem") %>%
count(subcategory_value)
#> # A tibble: 5 x 2
#> subcategory_value n
#> <chr> <int>
#> 1 Cropped fields (arable) 1
#> 2 Livestock 1
#> 3 Mixed conventional and organic, multiple farms (not described) 1
#> 4 Not described 3
#> 5 Other (please specify) 1
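Because the long table may hold several rows per study for one variable, it can help to report the number of distinct studies alongside the raw row count. The following is a small sketch on hypothetical toy data; `n_rows` and `n_studies` are names chosen here for illustration.

```r
library(dplyr)
library(tibble)

# hypothetical long-format toy table; the values are made up
toy_long <- tibble(
  short_title       = c("Study A", "Study A", "Study B", "Study C"),
  category_type     = c("outcome", "outcome", "outcome", "studydesign"),
  subcategory_value = c("Nitrogen", "Sediment", "Sediment", "Observational")
)

summary_counts <- toy_long %>%
  group_by(category_type, subcategory_value) %>%
  summarise(
    n_rows    = n(),                      # rows in the long table
    n_studies = n_distinct(short_title),  # distinct studies behind those rows
    .groups   = "drop"
  )

summary_counts
```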
For creating
we need the data in wide format, with one row per study. Our bufferstrips dataset is already in wide format, so let’s try to reconstruct a wide database from the long-format data that we created above.
back_to_wide <-
  buffer_example_long %>%
  pivot_wider(
    # every column except the category columns identifies a study
    id_cols = -contains("category"),
    names_from = c(category_type, subcategory_type),
    names_sep = "_",
    values_from = subcategory_value
  )
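One thing to be aware of: a round trip through long format with `values_drop_na = TRUE` does not necessarily give back the table we started with. Any column that was entirely NA contributes no rows to the long table, so `pivot_wider()` has nothing from which to recreate it. The sketch below demonstrates this on a hypothetical toy table.

```r
library(tidyr)
library(tibble)

# hypothetical toy table; spatialscale_regional is NA for every study
toy_wide <- tibble(
  short_title           = c("Study A", "Study B"),
  spatialscale_plot     = c("Plot scale", NA),
  spatialscale_field    = c(NA, "Field scale"),
  spatialscale_regional = c(NA_character_, NA_character_)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = -short_title,
  names_to = c("category_type", "subcategory_type"),
  names_sep = "_",
  values_to = "subcategory_value",
  values_drop_na = TRUE
)

toy_back <- pivot_wider(
  toy_long,
  names_from = c(category_type, subcategory_type),
  names_sep = "_",
  values_from = subcategory_value
)

# columns present in the original but missing after the round trip
lost_columns <- setdiff(names(toy_wide), names(toy_back))
lost_columns
```

If the empty columns matter, one option is to keep the NA rows when pivoting longer and filter them out only for analysis.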
For example, we can use a wide-format dataset to plot a map of the study locations.
back_to_wide %>%
  # "latitute" is the column name as spelled in the dataset
  select(short_title, latitute, longitude, google_scholar_link) %>%
  mutate(
    lat = as.numeric(latitute),
    lng = as.numeric(longitude),
    # build the HTML shown in each marker's popup, quoting the href value
    tag = paste0("Scholar_link: <a href=\"", google_scholar_link, "\">", google_scholar_link, "</a>")
  ) %>%
  leaflet(width = "100%") %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addMarkers(lng = ~lng, lat = ~lat, popup = ~tag, clusterOptions = markerClusterOptions())
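The coordinates are stored as text, so `as.numeric()` returns NA (with a warning) for any entry that is not a number. A possible preparation step, sketched here on hypothetical toy data, is to convert the columns and keep only the studies with usable coordinates before passing the data to `leaflet()`. The coordinate values and the "Not stated" entries below are made up; only the column names follow the dataset.

```r
library(dplyr)
library(tibble)

# hypothetical toy table; "latitute" follows the dataset's spelling
toy_locations <- tibble(
  short_title = c("Study A", "Study B", "Study C"),
  latitute    = c("39.05", "Not stated", "58.38"),
  longitude   = c("-76.64", "Not stated", "26.72")
)

mappable <- toy_locations %>%
  mutate(
    # non-numeric text becomes NA; the coercion warning is expected here
    lat = suppressWarnings(as.numeric(latitute)),
    lng = suppressWarnings(as.numeric(longitude))
  ) %>%
  # keep only rows with both coordinates
  filter(!is.na(lat), !is.na(lng))

mappable
```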
The code above is applied only to a subset of the data, but we can apply the same code to the full bufferstrips data frame.
map <- sysrevdata::bufferstrips %>%
  select(short_title, latitute, longitude, google_scholar_link) %>%
  mutate(
    lat = as.numeric(latitute),
    lng = as.numeric(longitude),
    # build the HTML shown in each marker's popup, quoting the href value
    tag = paste0("Scholar_link: <a href=\"", google_scholar_link, "\">", google_scholar_link, "</a>")
  )
# you might need to tidy up the encoding in the data frame to get it to work with leaflet
Encoding(map$tag) <- "UTF-8"
# drop any bytes that are not valid UTF-8 (sub = "" removes them)
map$tag <- iconv(
  x = map$tag,
  from = "UTF-8",
  to = "UTF-8",
  sub = ""
)
map %>%
  leaflet(width = "100%") %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addMarkers(lng = ~lng, lat = ~lat, popup = ~tag, clusterOptions = markerClusterOptions())