data-schema

library(supermetroid)
library(gt)
library(tidyverse)

Getting the data

Sources

There are several sites where players upload Super Metroid speed running data.

site	api	description (why we’re interested; these assumptions should be checked)
speedrun.com	python (how to get historical?) & curl	most complete list of speed runs
splits.io	python	by-segment times for each run, but for a smaller number of runners than speedrun.com
deertier	there is an API, but I think we can get everything we need with `rvest`	Super Metroid game-specific speed running site


tribble(
  ~source, ~runs, ~players,
  "speedrun.com", nrow(src_run_df), NA,
  "splits.io", NA, NA,
  "deertier", NA, NA
) %>% 
  gt()

source	runs	players
speedrun.com	582	NA
splits.io	NA	NA
deertier	NA	NA

Desired output: `ggplot`-friendly/tidy data

Dataframes that describe runs, runners, and, eventually, categories. Ideally, aggregated across the datasets.

However, this will result in missing data, so specific analyses need to take that into account, or use source-specific subsets of the aggregated set, with caveats. Linking ids are in italics.

Either way, we need to know what we have in each dataset, and have a universal schema across datasets.
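As a minimal sketch of what a universal schema could look like in practice, the example below stacks two source-specific tables with `dplyr::bind_rows()`. The table names `src_runs` and `splitsio_runs`, the `n_segments` column, and all values are hypothetical and only illustrate the idea; they are not the package's actual objects.

```r
library(dplyr)
library(tibble)

# Hypothetical per-source tables; names and values are invented for illustration.
src_runs <- tribble(
  ~run_id, ~player_id, ~t_s,
  "r1",    "p1",       4375,
  "r2",    "p2",       4502
)

splitsio_runs <- tribble(
  ~run_id, ~player_id, ~t_s, ~n_segments,
  "s1",    "p1",       4390, 12
)

# bind_rows() matches columns by name and fills columns missing from a
# source with NA; .id records which source each row came from.
all_runs <- bind_rows(
  list("speedrun.com" = src_runs, "splits.io" = splitsio_runs),
  .id = "source"
)

all_runs
```

Columns that only one source provides (here `n_segments`) are `NA` for the other sources, which is the missing data that source-specific analyses would need to account for.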

Rank of runs from speedrun.com

Each row describes one speed run; a player may have multiple runs in a database.

run_df	description
player_id	unique identifier of player
rank	rank of player
location	geographic location of player
t_hr	human readable total time
date	timestamp of run upload
run_id	unique identifier of run
t_s	total time of run in s
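
A small sketch of a table with this shape may help make the schema concrete. It assumes that `t_hr` is stored as an `H:MM:SS` string and that `date` can be parsed as a calendar date; both are assumptions to check against the real data, and every value below is invented.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows following the run_df schema above.
run_df <- tribble(
  ~run_id, ~player_id, ~rank, ~location, ~t_hr,     ~date,
  "r1",    "p1",       1L,    "US",      "1:12:55", "2021-03-01",
  "r2",    "p2",       2L,    NA,        "1:15:02", "2021-04-17"
) %>%
  mutate(
    date = as_date(date),
    # Assumes H:MM:SS; hms() parses to a Period, which is then
    # converted to a plain number of seconds.
    t_s = period_to_seconds(hms(t_hr))
  )

run_df
```

Keeping both `t_hr` and `t_s` means plots can use the numeric column for axes and the human-readable column for labels.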

Segments for each run from splits.io

For splits data, in addition to the fields above, we have one row per split recorded on a run, so each run has as many rows as it has splits.

segment_df	description
run_id	unique identifier of run
segment_id	unique identifier of segment
t_s	time in seconds, measured to millisecond precision
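
The link between the two tables can be sketched as a join on `run_id`. This example assumes that `t_s` in `segment_df` is the duration of each segment rather than a cumulative split time, which should be verified against the splits.io data; the values are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical run-level and segment-level tables sharing run_id.
run_df <- tribble(
  ~run_id, ~player_id,
  "r1",    "p1",
  "r2",    "p2"
)

segment_df <- tribble(
  ~run_id, ~segment_id, ~t_s,
  "r1",    "seg1",      100.250,
  "r1",    "seg2",      120.500,
  "r1",    "seg3",       80.125,
  "r2",    "seg1",      110.000,
  "r2",    "seg2",      130.750
)

# One row per run: number of segments and their summed time, with
# run-level fields attached through the shared run_id.
run_summary <- segment_df %>%
  group_by(run_id) %>%
  summarise(n_segments = n(), t_s_segments = sum(t_s)) %>%
  left_join(run_df, by = "run_id")

run_summary
```

If the real `run_df` also carries a `t_s` column, comparing it with `t_s_segments` would be one way to check whether the segments cover the whole run.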

deertier

port code

speedrunslive

We have enough from splits.io and speedrun.com for analyses.