Hard is Easy, and Easy is Hard

The development paradox of data science

Author: Dr Charles T. Gray, Datapunk
Affiliation: Good Enough Data & Systems Lab
Published: July 9, 2025

Statistical modelling is hard

Consulting on a question about real-world data—choosing the correct statistical approach for a problem—is unquestionably a challenging conceptual task.

This begins with the assumption that we have some inputs, let’s call this table X, and that we will get some outputs of interest via an algorithmic transformation, let’s call this table Y.

Hard is Easy

However, once the desired shapes of X and Y have been carefully chosen with sound mathematical reasoning, the implementation of the algorithm that transforms X into Y has become so trivial that it is often but a handful of lines of code.

Easy is Hard

The dirty secret at the heart of data science that leadership are struggling to grasp is this: getting table X is never easy; it’s always hard.

Instead, we have a set of tables, let’s call them [Z], wherein each table may arrive not only in a different shape, but often in a different format.

The transformation pipeline that shapes [Z] into X represents the vast majority of the development work required. This process can be conceptualised as one enormous directed acyclic graph.
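A sketch of what one small corner of that graph looks like, assuming two hypothetical source tables in [Z] that arrive in different shapes: one flat and wide, one nested, in different units. Each normalising function is one node in the transformation DAG, and X is the table where their outputs converge.

```python
# Two hypothetical sources from [Z]: same information, different shapes.
z1 = [{"id": 1, "height_cm": 170}, {"id": 2, "height_cm": 182}]        # flat, cm
z2 = [{"subject": {"id": 3}, "measurements": {"height": 1.65}}]        # nested, metres

def from_z1(row):
    # One DAG node: the flat source already matches the schema of X.
    return {"id": row["id"], "height_cm": float(row["height_cm"])}

def from_z2(row):
    # Another DAG node: unnest the record and convert metres to cm.
    return {"id": row["subject"]["id"],
            "height_cm": row["measurements"]["height"] * 100}

# The node where the branches converge into table X.
x_table = [from_z1(r) for r in z1] + [from_z2(r) for r in z2]
```

Multiply this by dozens of sources and hundreds of intermediate nodes, and the scale of the governance problem becomes clear.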

This singularity grows, always with a finite number of nodes and edges but ever increasing in complexity and scale, until it becomes impossible for humans to govern the logic of the transformations by eyeballing the code.

Worse still, the entire team need to understand the desired X, and that X is subject to change as decisions are made about the desired Y during development. Increasingly, it’s rare for the people getting X and computing Y to be the same people.

Confusions abound.


Without tests and validation to communicate shared assumptions, a butterfly of shit data can flap its wings and cause a tornado of broken analyses, trauma, and workplace toxicity.
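What do such shared assumptions look like in practice? A minimal sketch: each check below encodes one assumption about the shape of X that the whole team can agree on and a pipeline can enforce. The column names and plausibility bounds are hypothetical.

```python
# A minimal validation sketch: shared assumptions about table X,
# written down as checks rather than held in individual heads.
# Column names and bounds here are hypothetical.
EXPECTED_COLUMNS = {"id", "height_cm"}

def validate_x(rows):
    """Return a list of problems; an empty list means X meets the shared assumptions."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
        elif not (50 <= row["height_cm"] <= 250):
            problems.append(f"row {i}: implausible height {row['height_cm']}")
    return problems
```

Run at the boundary between the people producing X and the people computing Y, checks like these catch the butterfly before it flaps its wings.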

A butterfly of shit data