Questionable Analytical Observations

And what to do about them


Dr Charles T. Gray, Datapunk

Good Enough Data & Systems Lab


1 Ceci n’est pas un datum

Any data engineer will tell you there’s a relentlessly comedic side to the industry, wherein confusion over what an observation is abounds, leading to unexpectedly large deviations from vision (Figure 1).

Figure 1: Nigel gave me a drawing that said 18 inches. Now, whether or not he knows the difference between feet and inches is not my problem. I do what I’m told. (This Is Spinal Tap, 1984 [1])

Those in the trenches of the modern data stack (toolchains for working with data at scale [2]) know this is why the community have developed principles, such as FAIR [3], and open-source tools, such as DBT-core [4], that opinionate FAIR.

These engineers are also the first to tell you just how far industry and research are from realising even a fraction of those aspirations [5] [6] [7]. This manuscript is a roadmap for the first step in developing a data product: it’s a lot more challenging than you might think.

Something that tools such as DBT are beginning to intuit is that the governance of data must be an iterative workflow that interoperates with the recursive nature of inquiry, not a policy checkbox at the end. Treating governance as an end-stage checkbox is (as argued in a sister paper) chaotically impossible, given how development plans morph and change as questions are refined.

We must correct for epistemic drift1 (deviation from intention) in the chaos induced by people and machines interoperating to answer questions with observations at scale. Otherwise, despite our best efforts, we end up at the frontier psychiatry (Figure 2) of data science.

Figure 2: Each scientist or engineer is skilled in their own right, but epistemic drift creates an absurdist orchestra of bad data and broken dashboards [8].

We begin declaratively, as engineers are busy and it’s good to give them a table of instructions on the second page so they can stop reading and go code something useful. This manuscript is intended as a tool for engineers to obtain rarely-granted scope from leadership to build reusable data architecture. We then explain what motivates this framework, and provide examples in R, SQL, and Python in appendices.

2 Clopen FAIR Data Entity - User Access Data Test

This test defines the first living analysis development team goal, usually instantiated in an agile tool (agile in the sense of democratising development [9]), such as an epic in JIRA [10]. This test draws from the V-model of systems development lifecycle that privileges end-user access [11].

2.1 First Team Goal: Instantiate the FAIR data entity test

We consider the FAIR data entity test instantiated when (a sketch of such a test follows the list):

  1. the data analyst validates minimum assumptions about a single data point in the analyst’s chosen tool,
  2. notably, where there is no requirement the validation passes the test,
  3. and the test is clopen2: closed according to security requirements, open to analysts and stakeholders.
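
To make this concrete, here is a minimal sketch of such a test in the style of a dbt singular test: a SELECT that fails by returning rows. The table, columns, and chosen datum (invoice_items, invoice_id, line_no, INV-001) are hypothetical stand-ins for the analyst’s data entity.

-- Hypothetical single-datum validation: the analyst asserts one known
-- invoice line exists exactly once in the data entity's table.
-- Per point 2 above, a failing test still counts as instantiated;
-- what matters is that the assumption is now written down and executable.
select 'datum INV-001/1 missing or duplicated' as failure
from (
    select count(*) as n
    from invoice_items
    where invoice_id = 'INV-001'
      and line_no = 1
) counted
where counted.n <> 1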

2.2 Framing FAIR

The FAIR framing shown in Figure 3 is a way to communicate the black box of data to decision makers. In particular, this framing highlights the distinction between what they conceptualise as data (Can’t you just email me a spreadsheet?) and a data product: reusable data architecture at scale that will save untold wages, time, and consternation.

Validation of Data Entity meets Minimum FAIR Criteria

Findable: Data documentation wherein columns and tables have descriptions for the data entity lineage. Clearly states access protocols, i.e., data security. Ensures subject matter experts can advise on governance, as well as contribute to validation.

Accessible: Understandable in terms of a data entity, a semantic abstraction in stakeholder terms (e.g., a number on an invoice). This helps ensure data governance is applied as subject matter experts advise.

Interoperable: Workflows to ensure data governance, security, and access function as intended when analysing the data entity. Facilitates subject matter experts contributing to data validation.

Reusable: Requested data transformations provided in the published layer of a dedicated database, with standard tests applied to the data entity (see below for details) to confirm validity of documented assumptions.

Figure 3: Validation Requirements for a Clopen FAIR data entity.

2.3 Minimal assumptions to test

The minimal set of tests shown in Figure 4, for

  1. uniqueness
  2. missingness
  3. duplicates

allows for data platform development whilst demonstrating a good-faith roadmap to compliance with, say, the EU Commission’s Environmental, Social, and Governance (ESG) reporting [12]. Ideally, the analyst provides further assumptions and a previously calculated datum; however, these are notoriously difficult for engineers to obtain in practice, so we provide a set of defaults that lets engineers begin work. A sketch of these defaults follows Figure 4.

Tests Applied on Data Product Layers by Observability, Descending

Analytical Observation
  output unique key: data entity

Semantic Transformation
  output unique key: data entity joined across raw sources and tested

Source
  output unique key: data entity defined and tested
  input unique key: combination of columns that define a unique row
  input freshness: incrementation or snapshot field
  input not empty: table-level test

We say a unique key has been tested when the same combination of columns has not-null and unique tests applied.

Freshness test configurations: daily ingestion (warn > 1 day, error > 1 week); weekly ingestion (warn > 1 week, error > 2 weeks).
Figure 4: FAIR data entity tests3.
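
As a sketch of those defaults, each test in Figure 4 can be expressed as a query that fails by returning rows, in the style of a dbt singular test. The tables (invoice_items, raw_invoice_items), their columns, and the Postgres-style interval syntax are hypothetical stand-ins:

-- 1. uniqueness: the assumed unique key identifies at most one row
select invoice_id, line_no, count(*) as n_rows
from invoice_items
group by invoice_id, line_no
having count(*) > 1;

-- 2. missingness: the unique-key columns are never null
select *
from invoice_items
where invoice_id is null
   or line_no is null;

-- 3. duplicates: no record is fully repeated across all columns
select invoice_id, line_no, amount, count(*) as n_rows
from invoice_items
group by invoice_id, line_no, amount
having count(*) > 1;

-- freshness, daily ingestion: warn > 1 day, error > 1 week
select
    max(loaded_at) as last_loaded_at,
    case
        when max(loaded_at) < current_timestamp - interval '7 days' then 'error'
        when max(loaded_at) < current_timestamp - interval '1 day' then 'warn'
        else 'pass'
    end as freshness_status
from raw_invoice_items;

In dbt itself, the unique and not-null pair would usually be declared as generic tests, and the thresholds as source freshness configuration; the SQL above is simply the language-agnostic shape of those checks.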

It can be a challenge to communicate the obstacles in scaling legacy data to living analyses to leadership with little experience of the trenches of data development. By setting these minimal goals, problem solving is democratised from engineer to analyst.

We now turn to motivating the solution we just provided.

3 Expectation vs reality

Humans, bless ’em, are not great at conceptualising the black boxes they work with. A critical error we make is expecting a living analysis development lifecycle to reflect the workflow of an individual analyst producing a one-off report. Thus, living analysis development tends to have an expectation backbone (Figure 5), whether it be to produce a spreadsheet, some business intelligence dashboard, or deploy a machine learning algorithm by bespoke tooling.

Figure 5: The anal beads of data development.

Now, Spivak rightly notes the nature of human inquiry is cyclical [13], where result → question is looped through; however, reality is a good deal messier. Despite our best-laid agile plans, there’s an inevitable drift that occurs during the predictably unpredictable recursiveness of data development (Figure 6).

Figure 6: The tangled anal beads of data development.

4 Mind your Ps and Qs

The problem of getting tangled in the anal beads of data development (Figure 6) lies in the relationship between the assumptions of the question (posed by, say, a machine learning model or business KPI) and the answer reported in a dashboard or report. This is where logic and practice begin to diverge.

Let’s gesture at the structure of logic, and show where it begins to break down in living analysis development.

In formal terms, any scientific statement from data takes the form:

Given we have these observations, we assume this result.

Consider a KPI:

Given we have averaged the positive invoice items over years, we assume this is the company’s annual revenue.

Or a statistical model:

Given we have these observations, we assume this model provides evidence for the result.

Formally, logic phrases this as:

p → q.

We say this as: if p, then q. In data science, we take p, the observations, and assume q, the result. However, crack open any logic text [say, 14], and we find that the truth of this implication may be vacuous (Figure 7).

p q p → q
T T T
T F F
F T T
F F T
Figure 7: Standard propositional logic tells us that any result that follows a false assumption is vacuously true.

The practice of data science is a living instantiation of this truth table. Consider the KPI example of annual revenue, taken by summing all different revenue streams p₁ to pₙ for each year to get q, the desired KPI.

Year   p₁   p₂   …   pₙ   q (KPI)
2022   ✓    ✓    …   ✓    $1.2M
2023   ✓    ✓    …   ✓    $1.4M
Figure 8: Expected structure: if all assumptions p₁ to pₙ hold, then result q is valid.
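
Concretely, q might be assembled by a query like the following sketch (Postgres-style SQL; the table and columns are hypothetical), where every clause silently encodes one of the pᵢ:

-- q: the annual revenue KPI
select
    date_part('year', invoiced_at) as year, -- p: invoiced_at is always populated
    sum(amount) as annual_revenue           -- p: items are complete, not duplicated
from invoice_items
where amount > 0                            -- p: only positive items count as revenue
group by 1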

However, what happens when some of the observations are spurious? Perhaps there are missing data points or duplicates.

Year   p₁   p₂   …   pₙ   q (KPI)
2022   ✓    🗑️   …   ✓    🗑️
2023   🗑️   ✓    …   🗑️   🗑️
Figure 9: Observed structure: some assumptions are clean, some drifted, but all results are trash.

This is commonly described as the garbage in, garbage out problem of data science.

4.1 Epistemic drift over time

Worse still, living analysis development is recursive. Suppose a stakeholder asks for revenue excluding a soon-to-be-discontinued product. The request is mentioned briefly to the analyst, but never reaches the data engineer. The system remains unchanged.

Consider game analytics. Player events are commonly instantiated by developers in a nested, tree-like structure, but analysts need flat data. At project start, the analyst verifies a key assumption: the minigame of interest appears only once per level. A dashboard is deployed using a modern stack—for example, Unity Analytics → Redshift → DBT → Redshift → Tableau.

Later, developers add a second minigame to one of the levels and encode it differently in the tree, unaware that game designers are relying on the dashboard to track average minigame completion time. The analyst, unaware of the schema change, had averaged a nested field assuming uniqueness. That assumption no longer holds. One level now duplicates data, skewing the result. The data engineer doesn’t know it matters. The dashboard updates. It looks fine.

But the numbers are now meaningless: Unity Analytics ↛ Tableau. The data is still valid and the dashboard still functions, but the relationship between them is no longer valid.

No one notices. No one can. The assumption has silently broken.
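
A living validation encoding the analyst’s original assumption would have broken loudly instead; a sketch, with the event table and its columns as hypothetical names:

-- Assumption at project start: the minigame appears at most once per level.
-- This returns rows, and so fails, the moment a second minigame
-- is encoded into any level.
select level_id, player_session_id, count(*) as n_minigame_events
from minigame_events
group by level_id, player_session_id
having count(*) > 1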

4.2 Epistemic drift at scale

Now consider a research software engineer assisting a professor with a simulation study. The professor provides synthetic data and a model script for the engineer to fit thousands of models for parameter tuning. At first, the pipeline runs smoothly.

As the experiment scales, the professor supplies more and more data—terabytes of it. The dataset now spans billions of rows, far beyond what the engineer can feasibly inspect. The model being fit is a log-based model requiring strictly positive numeric input—an assumption that would be clear in real-world biological data, but is easy to violate in simulation.

Suppose a single mistake is introduced into the simulated data—negative values that breach the model’s assumptions. That error propagates silently, replicated thousands of times across model fits. Eventually, the pipeline breaks. The engineer, working downstream and unaware of the log constraint embedded in the professor’s script, is left to debug a system that no longer holds, without the contextual understanding of the model the professor has.

It can take weeks, even months, to trace such a failure to its source.
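
Here too, a single living validation at the handover between professor and engineer would trace the failure to its source immediately; a sketch, assuming the simulated data lands somewhere queryable (all names hypothetical):

-- The log-based model's assumption: strictly positive numeric input.
-- Any returned row pinpoints offending simulated values before
-- thousands of model fits replicate the error.
select simulation_id, row_id, value
from simulated_observations
where value is null
   or value <= 0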

5 Democratising data development

But if we take a category theoretic approach to data development (espoused in texts such as Category Theory for the Sciences [13]) and privilege verifying not data points, but the relationships between data development roles and processes, we have a roadmap to mitigating garbage data. By ensuring integrity is retained between those who generate the data, those who wrangle the data, and those who analyse and report, we find a way of preventing the vacuous truth of F → F.

If the game developers and the professor described above had had living validations revealing their assumptions from the start, those who understand the data would have been able to diagnose the garbage far more efficiently.

Figure 10: Platform! Engineer! Analyst! Stakeholder! By your powers combined… the modern data stack is yours! (Captain Planet and the Planeteers, 1990-1996 [15])

The modern data stack is a powerful beast.

Figure 11: Establishing a datum via FAIR data entity test allows the team to scope for specific challenges to do with different aspects of the stack.

Problem is, if you show the engineers FAIR criteria [3], you will hear, sure, sure, next week.

“I’ve found it best not to do any documentation at all, otherwise people point out what’s missing.” – Real things I hear engineers say.

The reality is that while the data analytic question is being reworked, there is little that data engineers can know to be true about the data. Schemas are often non-existent.

6 Appendices

6.1 DBT FDE test on invoice items
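
A sketch of how this could look: in dbt, a singular test is a SELECT saved under the project’s tests/ directory that fails when it returns rows. The model invoice_items and its columns are hypothetical.

-- tests/fde_invoice_items.sql
-- Clopen FAIR data entity test on invoice items: the unique key
-- (invoice_id, line_no) must be non-null and unique.
select 'null key' as failure, invoice_id, line_no
from {{ ref('invoice_items') }}
where invoice_id is null
   or line_no is null

union all

select 'duplicate key' as failure, invoice_id, line_no
from {{ ref('invoice_items') }}
group by invoice_id, line_no
having count(*) > 1

Running dbt test executes it; per the clopen framing, a failure here is information for the team, not a blocker.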

6.2 Python FDE test example on image classification

6.3 R FDE tests for this singularity

In all tests, we ask: what is the minimum thing we can do? Then structure exists and we can expand.

6.3.1 Generator fn → display

Testing the morphism between graph visualisation generator and display.

How do I call code reusably across both this manuscript, a .qmd file, and a slide deck that embeds the same .qmd file?

# test where I am
getwd()
#> [1] "/home/cantabile/Documents/repos/good-enough"

# need to source
source("R/test-mooncake.R")

# test calling from R
mooncake_test
#> [1] "hi Mooncake"

7 References

[1]
Reiner R. This Is Spinal Tap 1984.
[2]
[3]
Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016;3:160018. https://doi.org/10.1038/sdata.2016.18.
[4]
dbt Labs. dbt-labs/dbt-core 2025.
[5]
Landi A, Thompson M, Giannuzzi V, Bonifazi F, Labastida I, Silva Santos LOB da, et al. The A of FAIR: As Open as Possible, as Closed as Necessary. Data Intelligence 2020;2:47–55. https://doi.org/10.1162/dint_a_00027.
[6]
Feuerriegel S, Dolata M, Schwabe G. Fair AI. Business & Information Systems Engineering 2020;62:379–84. https://doi.org/10.1007/s12599-020-00650-3.
[7]
Boeckhout M, Zielhuis GA, Bredenoord AL. The FAIR guiding principles for data stewardship: Fair enough? European Journal of Human Genetics 2018;26:931–6. https://doi.org/10.1038/s41431-018-0160-0.
[8]
[9]
Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, et al. Manifesto for Agile Software Development 2001.
[10]
[11]
Forsberg K, Mooz H. System Engineering for Faster, Cheaper, Better. INCOSE International Symposium 1998;8:917–27. https://doi.org/10.1002/j.2334-5837.1998.tb00130.x.
[12]
[13]
Spivak DI. Category Theory for the Sciences. MIT Press; 2014.
[14]
Smith P. An Introduction to Formal Logic. Cambridge University Press; 2003.
[15]
Captain Planet and the Planeteers 1990-1996.

Footnotes

  1. Look to the sister paper for a deeper discussion of epistemic drift.↩︎

  2. In a nod to the topologically dynamic space data development lives in, we appropriate the mathematical term clopen, which describes a set that is both open and closed. If you think something being both mathematically closed and open is weird and confusing, you’re not alone. ↩︎

  3. Contributor acknowledgement: Many thanks to the Data & AI team at TDCnet, Denmark, for contributions to test development and advisory discussions.↩︎