Hard is Easy, and Easy is Hard
The development paradox of data science
Statistical modelling is hard
Consulting on a question about real-world data—choosing the correct statistical approach for a problem—is unquestionably a challenging conceptual task.
This begins with an assumption: that we have some inputs, let's call this table X, and that we will get some outputs of interest via algorithmic transformation, let's call this table Y.
Hard is Easy
However, once the desired shapes of X and Y have been carefully chosen with sound mathematical reasoning, implementing the algorithm that transforms X into Y is often trivial: little more than a handful of lines of code.
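To illustrate just how small that handful can be, here is a hypothetical sketch in which Y is simply the per-group mean of a column of X (the column names, and the assumption that X arrives as a list of records, are invented for illustration):

```python
from collections import defaultdict

def transform(X):
    """Hypothetical X -> Y: the mean of `value` per `group`."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in X:
        s = sums[row["group"]]
        s[0] += row["value"]
        s[1] += 1
    # Y maps each group to its mean value.
    return {g: total / n for g, (total, n) in sums.items()}

X = [
    {"group": "a", "value": 1.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 10.0},
]
Y = transform(X)  # {"a": 2.0, "b": 10.0}
```

Once X has the right shape, the statistically hard decision collapses into code this short.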
Easy is Hard
The dirty secret at the heart of data science, which leadership are struggling to grasp, is this: getting table X is never easy; it's always hard.
Instead, we have a set of source tables, each of which may arrive not only in a different shape, but often in a different format.
The transformation pipeline that shapes those sources into X represents the vast majority of the development work required. This process can be conceptualised as one enormous directed acyclic graph.
This graph grows, always with a finite number of nodes and edges but ever increasing in complexity and scale, until it becomes impossible for humans to govern the logic of the transformations by eyeballing the code.
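One way to make that graph explicit, rather than leaving it implicit in the code, is to declare each transformation step's dependencies and let a scheduler order them. A minimal sketch using Python's standard-library graphlib; the step names are hypothetical:

```python
from graphlib import TopologicalSorter

# Each key is a pipeline step; its value is the set of steps it
# depends on. All step names here are invented for illustration.
pipeline = {
    "clean_sales": {"load_sales"},
    "clean_customers": {"load_customers"},
    "join": {"clean_sales", "clean_customers"},
    "Y": {"join"},
}

# A valid execution order for the DAG. graphlib raises CycleError
# if the "acyclic" property is ever violated.
order = list(TopologicalSorter(pipeline).static_order())
```

Declaring the graph up front is what lets tooling, rather than eyeballs, answer questions like "what breaks if this source changes?".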
Worse still, the entire team needs to understand the desired Y, and that Y is subject to change as decisions are made about it during development. Increasingly, it's rare for the people gathering X and the people computing Y to be the same people.
Confusions abound.
Without tests and validation to communicate shared assumptions, a butterfly of shit data can flap its wings and cause a tornado of broken analyses, trauma, and workplace toxicity.
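A lightweight defence is to encode those shared assumptions as executable checks at the boundary where data arrives, so the butterfly is caught before it flaps. A hypothetical sketch; the schema and the rules it enforces are illustrative:

```python
def validate(rows, schema):
    """Fail loudly if incoming data breaks the team's shared assumptions."""
    errors = []
    for i, row in enumerate(rows):
        for col, col_type in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: {col!r} is not {col_type.__name__}")
    if errors:
        # One loud, early failure beats a quiet tornado downstream.
        raise ValueError("; ".join(errors))
    return rows

schema = {"group": str, "value": float}
validate([{"group": "a", "value": 1.0}], schema)  # passes silently
```

The point is less this particular check than that the assumptions live in code the whole team can read, run, and version, rather than in any one person's head.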