data-schema
data-schema.Rmd
Getting the data
Sources
There are several sites where players upload Super Metroid speed running data.
site | api | description (why we’re interested; these assumptions should be checked) |
---|---|---|
speedrun.com | python (how to get historical?) & curl | most complete list of speed runs |
splits.io | python | by-segment times for each run, but for a smaller number of runners than speedrun.com |
deertier | there is an api, but I think we can get everything we need with rvest | Super Metroid game-specific speed running site |
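As a rough illustration of the rvest route mentioned for deertier, the sketch below parses an inline HTML table rather than the live site. The markup, the `leaderboard` id, and the column names are all hypothetical stand-ins; the real page structure has not been checked and will likely differ.

```r
library(rvest)

# Hypothetical stand-in for a leaderboard page. The real deertier markup
# is an assumption here and needs to be inspected before scraping.
page <- minimal_html('
  <table id="leaderboard">
    <tr><th>rank</th><th>player</th><th>time</th></tr>
    <tr><td>1</td><td>runner_a</td><td>0:41:35</td></tr>
    <tr><td>2</td><td>runner_b</td><td>0:42:10</td></tr>
  </table>
')

# Select the table node with a CSS selector, then convert it to a data frame.
tbl_node <- html_element(page, "table#leaderboard")
deertier_raw <- html_table(tbl_node)
```

For the live site, `minimal_html()` would be swapped for `read_html()` on the leaderboard URL, and the selector adjusted to whatever the page actually uses.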
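library(tibble)  # tribble()
library(gt)      # gt(), also re-exports %>%

# src_run_df is assumed to have been loaded earlier (speedrun.com runs).
# Counts that have not been computed yet are left as NA.
tribble(
~source, ~runs, ~players,
"speedrun.com", nrow(src_run_df), NA,
"splits.io", NA, NA,
"deertier", NA, NA
) %>%
gt()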
source | runs | players |
---|---|---|
speedrun.com | 582 | NA |
splits.io | NA | NA |
deertier | NA | NA |
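The `players` column could be filled in the same way as `runs` once each source has a player identifier. A minimal sketch, assuming each source data frame carries a `player_id` column as in the schema described in this document; `toy_run_df` and `count_source()` are hypothetical names with made-up values:

```r
# Toy stand-in for a source-specific run table; values are invented.
toy_run_df <- data.frame(
  run_id    = c("r1", "r2", "r3"),
  player_id = c("p1", "p1", "p2"),
  stringsAsFactors = FALSE
)

# One summary row per source: number of runs and number of distinct players.
count_source <- function(source, df) {
  data.frame(
    source  = source,
    runs    = nrow(df),
    players = length(unique(df$player_id)),
    stringsAsFactors = FALSE
  )
}

summary_df <- count_source("speedrun.com", toy_run_df)
```

Rows for the other sources can be appended with `rbind()` as their data frames become available.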
Desired output: ggplot-friendly/tidy data
Dataframes that describe runs, runners, and, eventually, categories. Ideally, aggregated across the datasets.
However, this will result in missing data, so specific analyses need to take that into account, or use source-specific subsets of the aggregated data, with caveats. Linking ids are in italics.
Either way, we need to know what we have in each dataset, and have a universal schema across datasets.
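One way to get a universal schema is to coerce every source into the same set of columns, filling anything a source lacks with `NA`. The sketch below is only an illustration: `schema_cols`, `conform()`, and the toy data frames are hypothetical, and the choice of shared columns is an assumption.

```r
# Columns every source is coerced into (an assumed, minimal shared schema).
schema_cols <- c("source", "run_id", "player_id", "t_s")

# Tag a source-specific data frame with its origin and add any schema
# columns it is missing as NA, then return the columns in schema order.
conform <- function(df, source) {
  df$source <- rep(source, nrow(df))
  for (col in setdiff(schema_cols, names(df))) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df[, schema_cols, drop = FALSE]
}

# Toy inputs with invented values; the second source has no player_id.
speedrun_toy <- data.frame(
  run_id = c("a1", "a2"), player_id = c("p1", "p2"), t_s = c(2495, 2530),
  stringsAsFactors = FALSE
)
splits_toy <- data.frame(
  run_id = "b1", t_s = 2601.25,
  stringsAsFactors = FALSE
)

all_runs <- rbind(
  conform(speedrun_toy, "speedrun.com"),
  conform(splits_toy, "splits.io")
)
```

Analyses on `all_runs` then have to handle the `NA` values explicitly, or filter on `source` first.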
Rank of runs from speedrun.com
Each row describes one speed run; a player may have multiple runs in a database.
run_df | description |
---|---|
player_id | unique identifier of player |
rank | rank of player |
location | geographic location of player |
t_hr | human readable total time |
date | timestamp of run upload |
run_id | unique identifier of run |
t_s | total time of run in s |
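To make the schema concrete, here is a small sketch of a `run_df` with these columns, where `t_hr` is derived from `t_s`. All values are invented, `fmt_hms()` is a hypothetical helper, and the `h:mm:ss` format for `t_hr` is an assumption.

```r
# Format a time in seconds as h:mm:ss (rounded to whole seconds).
fmt_hms <- function(t_s) {
  t_s <- as.integer(round(t_s))
  sprintf("%d:%02d:%02d", t_s %/% 3600L, (t_s %% 3600L) %/% 60L, t_s %% 60L)
}

# Toy run table: one row per run, a player may appear more than once.
run_df <- data.frame(
  run_id    = c("r1", "r2", "r3"),
  player_id = c("p1", "p1", "p2"),
  rank      = c(1L, 3L, 2L),
  location  = c("US", "US", NA),
  date      = as.POSIXct(
    c("2020-01-05 12:00:00", "2019-11-20 08:30:00", "2020-02-14 21:15:00"),
    tz = "UTC"
  ),
  t_s       = c(2495, 2601, 2530),
  stringsAsFactors = FALSE
)
run_df$t_hr <- fmt_hms(run_df$t_s)

# run_id is meant to be unique; fail early if it is not.
stopifnot(anyDuplicated(run_df$run_id) == 0)
```

Keeping `t_s` numeric and deriving `t_hr` from it means plots and summaries work on one canonical time column.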
Segments for each run from splits.io
For splits data, in addition to the fields above, there is one row per split recorded on a run, so a run has as many rows as it has splits.
segment_df | description |
---|---|
run_id | unique identifier of run |
segment_id | unique identifier of segment |
t_s | time in seconds, measured to millisecond precision |
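A quick consistency check between the two tables is to sum the segment times per run and compare against the run total, linking on `run_id`. The sketch below assumes that `t_s` in `segment_df` is a per-segment duration rather than a cumulative split time, which should be verified against the splits.io data; all values and the `run_totals` table are invented for illustration.

```r
# Toy segment table: one row per split, keyed by run_id and segment_id.
segment_df <- data.frame(
  run_id     = c("r1", "r1", "r1", "r2", "r2"),
  segment_id = c("s1", "s2", "s3", "s1", "s2"),
  t_s        = c(600.125, 900.250, 994.625, 1200.5, 1329.5),
  stringsAsFactors = FALSE
)

# Toy run totals to compare against (stand-in for run_df).
run_totals <- data.frame(
  run_id = c("r1", "r2"),
  t_s    = c(2495, 2530),
  stringsAsFactors = FALSE
)

# Sum segment durations per run, then join to the run totals on run_id.
seg_totals <- aggregate(t_s ~ run_id, data = segment_df, FUN = sum)
names(seg_totals)[names(seg_totals) == "t_s"] <- "t_s_segments"

check <- merge(run_totals, seg_totals, by = "run_id")
check$diff <- abs(check$t_s - check$t_s_segments)
```

Runs with a `diff` above the millisecond precision of the segment times would be worth inspecting by hand.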