
Getting the data

Sources

There are several sites where players upload Super Metroid speed running data.

site | api | description (why we're interested; these assumptions should be checked)
speedrun.com | python (how to get historical runs?) & curl | most complete list of speed runs (a fetch sketch follows this table)
splits.io | python | by-segment times for each run, but for a smaller number of runners than speedrun.com
deertier | there is an API, but I think we can get everything we need with rvest | Super Metroid game-specific speed running site
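As a rough sketch of the speedrun.com pull: the endpoints below are from the public REST API, but the game-id lookup, the pagination parameter, and the JSON field names are assumptions to verify against the API docs before relying on them.

library(jsonlite)
library(tibble)

# Look up the game id for Super Metroid (assumes the first hit is the right game).
games <- fromJSON("https://www.speedrun.com/api/v1/games?name=Super%20Metroid")
game_id <- games$data$id[[1]]

# Fetch one page of runs; the API paginates, so historical runs would need
# repeated requests with an offset parameter.
runs <- fromJSON(paste0("https://www.speedrun.com/api/v1/runs?game=", game_id, "&max=200"))

# Column names below are assumptions about the JSON layout.
src_run_df <- tibble(
  run_id = runs$data$id,
  date   = runs$data$date,
  t_s    = runs$data$times$primary_t
)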

library(tibble)
library(dplyr)
library(gt)

# Run counts per source; so far only the speedrun.com pull (src_run_df) is in hand.
tribble(
  ~source, ~runs, ~players,
  "speedrun.com", nrow(src_run_df), NA,
  "splits.io", NA, NA,
  "deertier", NA, NA
) %>%
  gt()
source | runs | players
speedrun.com | 582 | NA
splits.io | NA | NA
deertier | NA | NA

Desired output: ggplot-friendly/tidy data

Data frames that describe runs, runners, and, eventually, categories, ideally aggregated across the datasets.

However, aggregating will result in missing data, so specific analyses need to take that into account, or use source-specific subsets of the aggregated data with the relevant caveats. Linking ids are in italics.

Either way, we need to know what we have in each dataset, and have a universal schema across datasets.
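As a hypothetical illustration of that universal schema, per-source run tables could be stacked with a source column; splitsio_run_df and deertier_run_df are assumed names for pulls that don't exist yet.

library(dplyr)

# Sketch only: stack per-source run tables into one tidy data frame.
# Columns missing from a source (e.g. rank on splits.io) come out as NA.
all_runs_df <- bind_rows(
  speedrun_com = src_run_df,      # from the speedrun.com API pull
  splits_io    = splitsio_run_df, # assumed name for the splits.io pull
  deertier     = deertier_run_df, # assumed name for the deertier scrape
  .id = "source"
)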

Rank of runs from speedrun.com

Each row describes one speed run; a player may have multiple runs in the database. (A sketch for deriving t_s from t_hr follows the table.)

run_df | description
player_id | unique identifier of player
rank | rank of player
location | geographic location of player
t_hr | human-readable total time
date | timestamp of run upload
run_id | unique identifier of run
t_s | total time of run in seconds
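As an illustration (not the original pipeline), t_s can be recomputed from t_hr with lubridate, assuming t_hr is formatted like "1:23:45".

library(dplyr)
library(lubridate)

# Hypothetical check: recompute total seconds from the human-readable time.
# Adjust the parser if some runs omit the hours component.
run_df <- run_df %>%
  mutate(t_s_check = period_to_seconds(hms(t_hr)))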

Segments for each run from splits.io

In addition to the run-level fields above, the splits data give one row per split recorded on a run, so a run has as many rows as it has splits. (A fetch-and-tidy sketch follows the table.)

segment_df | description
run_id | unique identifier of run
segment_id | unique identifier of segment
t_s | time in seconds, measured to millisecond precision
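One way segment_df might be built: fetch each run from the splits.io v4 API and flatten its segments. The endpoint, the segment field names (name, realtime_duration_ms), and the run_ids vector are all assumptions to check against the splits.io docs.

library(jsonlite)
library(tibble)
library(purrr)

# Hypothetical sketch: endpoint and field names are unverified assumptions.
fetch_segments <- function(run_id) {
  resp <- fromJSON(paste0("https://splits.io/api/v4/runs/", run_id))
  tibble(
    run_id     = run_id,
    segment_id = resp$run$segments$name,
    t_s        = resp$run$segments$realtime_duration_ms / 1000
  )
}

# run_ids is an assumed vector of splits.io run ids gathered elsewhere.
segment_df <- map_dfr(run_ids, fetch_segments)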

deertier

  • port code
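If we do scrape deertier with rvest, a minimal sketch might look like the following; the leaderboard URL path and the position of the table on the page are assumptions to check in a browser.

library(rvest)
library(dplyr)

# Hypothetical: URL path and table index are unverified assumptions.
page <- read_html("https://deertier.com/Leaderboard/AnyPercentRealTime")
deertier_df <- page %>%
  html_table() %>%
  .[[1]] %>%                 # assume the leaderboard is the first table on the page
  as_tibble() %>%
  rename_with(tolower)       # normalize column names toward the universal schema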

speedrunslive

We have enough from splits.io and speedrun.com for the analyses, so this source is not needed.