Hard is Easy, and Easy is Hard
The development paradox of data science
Statistical modelling is hard
Consulting on a question about real-world data—choosing the correct statistical approach for a problem—is unquestionably a challenging conceptual task.
This begins with an assumption: that we have some inputs, let's call this table X, and that we will get some outputs of interest via algorithmic transformation, let's call this table Y.
Hard is Easy
However, once the desired shapes of X and Y have been carefully chosen with sound mathematical reasoning, implementing the algorithm that transforms X into Y is often trivial: little more than a handful of lines of code.
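To illustrate just how small that handful can be, here is a hypothetical sketch in which Y is simply the per-group mean of a column of X (the column names, and the assumption that X arrives as a list of records, are invented for illustration):

```python
from collections import defaultdict

def transform(X):
    """Hypothetical X -> Y: the mean of `value` per `group`."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in X:
        s = sums[row["group"]]
        s[0] += row["value"]
        s[1] += 1
    # Y maps each group to its mean value.
    return {g: total / n for g, (total, n) in sums.items()}

X = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 10.0},
]
Y = transform(X)  # {"a": 2.0, "b": 10.0}
```

Once X has the right shape, the statistically hard decision collapses into code this short.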
Easy is Hard
The dirty secret at the heart of data science, which leadership are struggling to grasp, is this: getting table X is never easy; it's always hard.
Instead, we have a set of source tables, each of which may arrive not only in a different shape, but often in a different format.
The transformation pipeline that shapes those sources into X represents the vast majority of the development work required. This process can be conceptualised as one enormous directed acyclic graph.
This graph grows, always with a finite number of nodes and edges but ever increasing in complexity and scale, until it becomes impossible for humans to govern the logic of the transformations by eyeballing the code.
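One way to make that graph explicit, rather than leaving it implicit in the code, is to declare each transformation step's dependencies and let a scheduler order them. A minimal sketch using Python's standard-library graphlib; the step names are hypothetical:

```python
from graphlib import TopologicalSorter

# Each key is a pipeline step; its value is the set of steps it
# depends on. All step names here are invented for illustration.
pipeline = {
    "clean_sales": {"load_sales"},
    "clean_customers": {"load_customers"},
    "join": {"clean_sales", "clean_customers"},
    "Y": {"join"},
}

# A valid execution order for the DAG. graphlib raises CycleError
# if the "acyclic" property is ever violated.
order = list(TopologicalSorter(pipeline).static_order())
```

Declaring the graph up front is what lets tooling, rather than eyeballs, answer questions like "what breaks if this source changes?".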
Worse still, the entire team needs to understand the desired Y, and that Y is subject to change as decisions are made about it during development. Increasingly, it's rare for the people gathering X and the people computing Y to be the same people.
Confusions abound.
Without tests and validation to communicate shared assumptions, a butterfly of shit data can flap its wings and cause a tornado of broken analyses, trauma, and workplace toxicity.
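A lightweight defence is to encode those shared assumptions as executable checks at the boundary where data arrives, so the butterfly is caught before it flaps. A hypothetical sketch; the schema and the rules it enforces are illustrative:

```python
def validate(rows, schema):
    """Fail loudly if incoming data breaks the team's shared assumptions."""
    errors = []
    for i, row in enumerate(rows):
        for col, col_type in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: {col!r} is not {col_type.__name__}")
    if errors:
        # One loud, early failure beats a quiet tornado downstream.
        raise ValueError("; ".join(errors))
    return rows

schema = {"group": str, "value": float}
validate([{"group": "a", "value": 1.0}], schema)  # passes silently
```

The point is less this particular check than that the assumptions live in code the whole team can read, run, and version, rather than in any one person's head.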