Retention analytics

These analyses are recreations of the player retention analyses I created at a video game studio where I worked. User retention is widely used as a KPI in software development.

The game producers and developers were interested in retention by build, but the CTO and CEO were more interested in retention over time.

Challenges in retention analytics

Release lengths vary

There is no assurance, of course, that each build will have the same number of users. Nor, crucially, is there a standard length of time between releases, whether they are patches, minor changes, or major changes. The length of time a release is available to download from an app store affects the number of users who access it. When using statistical estimators such as proportions, it’s better to account for this by providing confidence intervals, which reflect the variability in sample sizes.
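To illustrate why sample size matters, here is a minimal sketch of a confidence interval for a single retention proportion. It uses the Wilson score interval, which is one common choice and not necessarily the method used at the studio. The function name `wilson_interval` and the cohort numbers are made up for illustration.

```python
import math


def wilson_interval(retained, cohort_size, z=1.96):
    """Wilson score interval for a retention proportion (z=1.96 is roughly 95%)."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    p_hat = retained / cohort_size
    denom = 1 + z**2 / cohort_size
    centre = (p_hat + z**2 / (2 * cohort_size)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / cohort_size + z**2 / (4 * cohort_size**2)
    )
    return centre - half_width, centre + half_width


# Hypothetical builds with the same observed 40% retention
# but very different numbers of users.
short_release = wilson_interval(40, 100)
long_release = wilson_interval(4000, 10000)
```

Both intervals contain 0.4, but the interval for the short-lived release is much wider, which is the information a bare percentage hides.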

Minor changes instead of patches

At the video game studio, it was generally better to provide analytics on minor changes, as patches could be out for only a matter of days. Semantic versioning is common in software development: version numbers reflect iterations of the software build in the form [major change].[minor change].[patch], indexed from 0. Version 3.14.1 is therefore the third major change of the software and the fourteenth minor change, with one patch after release 3.14.0. As data analysts, we necessarily had some string-splitting to do. For this reason, we had a versions dataset in our DBT pipeline that could be joined into any analysis on the full build string [major change].[minor change].[patch], providing a [major change].[minor change] column.
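The string-splitting step can be sketched as follows. The original lived in a DBT pipeline, so this Python version is only an illustration of the idea; `major_minor` and the build list are hypothetical, and the sketch assumes every build string has exactly three numeric parts.

```python
def major_minor(build: str) -> str:
    """Reduce '[major].[minor].[patch]' to '[major].[minor]'."""
    parts = build.strip().split(".")
    # Assumption: builds always have exactly three numeric components.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected [major].[minor].[patch], got {build!r}")
    major, minor, _patch = parts
    return f"{major}.{minor}"


# Hypothetical builds; the mapping plays the role of the versions dataset.
builds = ["3.14.0", "3.14.1", "3.15.0"]
release_by_build = {build: major_minor(build) for build in builds}
```

Here 3.14.0 and 3.14.1 both map to 3.14, so their users are pooled into one minor release for reporting.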

Daily retention over time

Providing confidence intervals over time presented another challenge: stakeholders wanted time aggregations so that they could interpret changes in Day d retention over time. For example, they wanted to discuss how Day 3 retention had fared by month, quarter, and year. However, these were now aggregates of proportions in which Day 0 was no longer tied to the build. Instead, each date being aggregated served as its own Day 0, so each date had a different sample size. In this case, it’s necessary not only to provide confidence intervals, but also to weight both the intervals and the estimated proportion for the time period by those sample sizes.
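One way to sketch this weighting: treat each date's cohort as an independent sample, weight its proportion by its share of the period's users, and combine the per-date variances using the same weights. This uses a normal approximation and is an assumption about how such an aggregate could be built, not a record of the studio's calculation. The name `weighted_retention` and the cohort figures are hypothetical.

```python
import math


def weighted_retention(cohorts, z=1.96):
    """Aggregate Day d retention over a period.

    cohorts: list of (cohort_size, retained) pairs, one per Day 0 date.
    Returns (estimate, lower, upper) using a normal approximation.
    """
    total = sum(n for n, _ in cohorts)
    if total <= 0:
        raise ValueError("the period has no users")
    estimate = 0.0
    variance = 0.0
    for n, retained in cohorts:
        if n == 0:
            continue  # a date with no users contributes nothing
        weight = n / total
        p = retained / n
        estimate += weight * p
        # Variance of a weighted sum of independent binomial proportions.
        variance += weight**2 * p * (1 - p) / n
    half_width = z * math.sqrt(variance)
    return estimate, max(0.0, estimate - half_width), min(1.0, estimate + half_width)


# Hypothetical cohorts for one period: (users on Day 0, users retained on Day d).
period = [(1200, 300), (80, 30), (450, 99)]
estimate, lower, upper = weighted_retention(period)

# For comparison: a plain average of the daily proportions,
# which lets the 80-user date count as much as the 1200-user date.
unweighted = sum(r / n for n, r in period) / len(period)
```

With these numbers the unweighted average is pulled upward by the small, high-retention cohort, while the weighted estimate equals total retained users divided by total users.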

Next, we look at some retention analytics by build and over time, along with the math and calculations required to provide confidence intervals for each. Finally, we cover how these analyses were developed.