Validation of Data Entity meets Minimum FAIR Criteria

| FAIR principle | Minimum requirements |
| --- | --- |
| Findable | Data documentation wherein columns and tables have descriptions for the data entity lineage. Clearly states access protocols, i.e., data security. Ensures subject matter experts can advise on governance, as well as contribute to validation. |
| Accessible | Understandable in terms of a data entity, a semantic abstraction in stakeholder terms (e.g., a number on an invoice). This helps ensure data governance is applied as subject matter experts advise. |
| Interoperable | Workflows to ensure data governance, security, and access function as intended when analysing the data entity. Facilitates subject matter experts contributing to data validation. |
| Reusable | Requested data transformations provided in the published layer of a dedicated database, with standard tests applied to the data entity (see below for details) to confirm the validity of documented assumptions. |
Questionable Analytical Observations
And what to do about them
1 Ceci n’est pas un datum
Any data engineer will tell you there’s a relentlessly comedic side to the industry, wherein confusion over what an observation is abounds, leading to unexpectedly large deviations from vision (Figure 1).
Those in the trenches of the modern data stack (toolchains for working with data at scale [2]) know this is why the community has developed principles, such as FAIR [3], and open-source tools, such as DBT-core [4], that opinionate FAIR.
These engineers are also the first to tell you just how far industry and research are from realising even a fraction of those aspirations [5], [6], [7]. This manuscript is a roadmap for the first step in developing a data product–it’s a lot more challenging than you might think.
Something that tools such as DBT are beginning to intuit is that the governance of data must be an iterative workflow that interoperates with the recursive nature of inquiry, not a policy checkbox at the end–the checkbox approach is (argued in a sister paper) chaotically impossible given how development plans morph and change as questions are refined.
We must correct for epistemic drift¹ (deviation from intention) in the chaos induced by people and machines interoperating to answer questions with observations at scale. Or, despite our best efforts, we end at the frontier psychiatry (Figure 2) of data science.
We begin declaratively, as engineers are busy and it’s good to give them a table of instructions on the second page so they can stop reading and go code something useful. This manuscript is intended as a tool for engineers to obtain rarely-granted scope from leadership to build reusable data architecture. We then explain what motivates this framework, and provide examples in R, SQL, and Python in appendices.
2 Clopen FAIR Data Entity - User Access Data Test
This test defines the first living analysis development team goal, usually instantiated in an agile (for democratising development [9]) tool, such as an epic in JIRA [10]. This test draws from the V-model of the systems development lifecycle, which privileges end-user access [11].
2.1 First Team Goal: Instantiate the FAIR data entity test
We consider the FAIR data entity test instantiated when (a minimal sketch in R follows the list):
- the data analyst validates minimum assumptions about a single data point in the analyst’s chosen tool,
- notably, where there is no requirement the validation passes the test,
- and the test is clopen²–closed according to security requirements, open to analysts and stakeholders.
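To make the criteria concrete, here is a minimal sketch in R of what instantiating the test might look like. The entity name, key, and previously calculated datum are hypothetical; the point is only the shape of the check.

```r
# A previously calculated datum supplied by a stakeholder (hypothetical value).
previously_calculated_datum <- 350

# Hypothetical extract of the data entity from the published layer.
invoice_items <- data.frame(
  invoice_id = c(1042, 1042, 1043),
  amount     = c(100, 250, 80)
)

# Validate one minimal assumption about a single data point in the analyst's
# chosen tool; there is no requirement that this returns TRUE.
sum(invoice_items$amount[invoice_items$invoice_id == 1042]) ==
  previously_calculated_datum
```

Whether the comparison passes is a separate question; the value of the exercise is that the assumption is now written down, runnable, and visible to analysts and stakeholders alike.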
2.2 Framing FAIR
The FAIR framing shown in Figure 3 is a way to communicate the black box of data to decision makers. In particular, this framing highlights the distinction between what they conceptualise as data (Can’t you just email me a spreadsheet?) and a data product: reusable data architecture at scale that will save untold wages, time, and consternation.
2.3 Minimal assumptions to test
The minimal set of tests shown in Figure 4 for
- uniqueness
- missingness
- duplicates
allows for data platform development whilst demonstrating a good-faith roadmap to compliance for, say, the EU Commission’s Environmental, Social, and Governance (ESG) reporting [12]. Ideally, the analyst provides further assumptions and a previously calculated datum; however, these are notoriously difficult for engineers to obtain in practice, so we provide a set of defaults from which engineers can begin work (a sketch in base R follows the table below).
Tests Applied on Data Product Layers by Observability, Descending

| | Test | Tested |
| --- | --- | --- |
| Analytical Observation | | |
| output | unique key | data entity |
| Semantic Transformation | | |
| output | unique key | data entity joined across raw sources and tested |
| Source | | |
| output | unique key | data entity defined and tested |
| input | unique key | combination of columns that define a unique row |
| input | freshness | incrementation or snapshot field |
| input | not empty | table-level test |

We say a unique key has been tested when the same combination of columns has not-null and unique tests applied.
Freshness test configurations: daily ingestion (warn > 1 day, error > 1 week); weekly ingestion (warn > 1 week, error > 2 weeks).
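As promised above, here is a minimal sketch in base R of the default tests (not empty, missingness, uniqueness, duplicates). The entity `invoice_items` and its assumed unique key of `invoice_id` plus `line_number` are illustrative only; a production run would point the same checks at the published layer.

```r
fde_minimal_tests <- function(df, key_cols) {
  key <- df[, key_cols, drop = FALSE]
  list(
    not_empty      = nrow(df) > 0,            # table-level test
    no_missing_key = !anyNA(key),             # missingness on the unique key
    unique_key     = anyDuplicated(key) == 0, # key identifies each row once
    no_duplicates  = anyDuplicated(df) == 0   # no fully duplicated rows
  )
}

# Made-up data standing in for the data entity.
invoice_items <- data.frame(
  invoice_id  = c(1, 1, 2),
  line_number = c(1, 2, 1),
  amount      = c(100, 250, 80)
)

fde_minimal_tests(invoice_items, c("invoice_id", "line_number"))
```

The same structure translates directly to DBT or SQL tests; base R is used here only to keep the sketch self-contained.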
It can be a challenge to communicate the obstacles in scaling legacy data to living analyses to leadership who have little experience of the trenches of data development. By setting these minimal goals, the problem solving is democratised from engineer to analyst.
We now turn to motivating the solution we just provided.
3 Expectation vs reality
Humans, bless ’em, are not great at conceptualising the black boxes they work with. A critical error we make is expecting a living analysis development lifecycle to reflect the workflow of an individual analyst producing a one-off report. Thus, living analysis development tends to have an expectation backbone (Figure 5), whether that be producing a spreadsheet, building a business intelligence dashboard, or deploying a machine learning algorithm with bespoke tooling.
Now, Spivak rightly notes the nature of human inquiry is cyclical [13], where result → question is looped through; however, reality is a good deal messier. Despite our best-laid agile plans, there’s an inevitable drift that occurs during the predictably unpredictable recursiveness of data development (Figure 6).
4 Mind your Ps and Qs
The problem of getting tangled in the anal beads of data development (Figure 6) lies in the relationship between the assumptions of the question–posed by, say, a machine learning model or business KPI–and the answer reported in a dashboard or report. This is where logic and practice begin to diverge.
Let’s gesture at the structure of logic, and show where it begins to break down in living analysis development.
In formal terms, any scientific statement from data takes the form:
Given we have these observations, we assume this result.
Consider a KPI:
Given we have averaged the positive invoice items over years, we assume this is the company’s annual revenue.
Or a statistical model:
Given we have these observations, we assume this model provides evidence for the result.
Formally, logic phrases this as p → q.
We say this as: if p, then q. In data science, we take observations, p, and assume result, q. However, crack open any logic text [say, 14], and we find that the truth of this implication may be vacuous (Figure 7).
p | q | p → q |
---|---|---|
T | T | T |
T | F | F |
F | T | T |
F | F | T |
The practice of data science is a living instantiation of this truth table. Consider the KPI example of annual revenue, taken by summing all the different revenue streams p₁, p₂, ⋯, pₙ for each year to get to q, the desired KPI.
Year | p₁ | p₂ | ⋯ | pₙ | q (KPI) |
---|---|---|---|---|---|
2022 | ✓ | ✓ | ⋯ | ✓ | $1.2M |
2023 | ✓ | ✓ | ⋯ | ✓ | $1.4M |
However, what happens when some of the observations are spurious? Perhaps there are missing data points or duplicates.
Year | p₁ | p₂ | ⋯ | pₙ | q (KPI) |
---|---|---|---|---|---|
2022 | ✓ | 🗑️ | ⋯ | ✓ | 🗑️ |
2023 | 🗑️ | ✓ | ⋯ | 🗑️ | 🗑️ |
This is commonly described as the garbage in, garbage out problem of data science.
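A minimal sketch in base R of the problem, with made-up invoice data; the duplicate row and the missing value stand in for the garbage observations above.

```r
invoice_items <- data.frame(
  year   = c(2022, 2022, 2022, 2023, 2023),
  amount = c(6e5, 6e5, 6e5, 7e5, NA)  # row 3 duplicates row 2; row 5 is missing
)

# q, the annual revenue KPI, computed naively per year
tapply(invoice_items$amount, invoice_items$year, sum)
# 2022 is silently overstated by the duplicate; 2023 collapses to NA
```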
4.1 Epistemic drift over time
Worse still, living analysis development is recursive. Suppose a stakeholder asks for revenue excluding a soon-to-be-discontinued product. The request is mentioned briefly to the analyst, but never reaches the data engineer. The system remains unchanged.
Consider game analytics. Player events are commonly instantiated by developers in a nested, tree-like structure, but analysts need flat data. At project start, the analyst verifies a key assumption: the minigame of interest appears only once per level. A dashboard is deployed using a modern stack—for example, Unity Analytics → Redshift → DBT → Redshift → Tableau.
Later, developers add a second minigame to one of the levels and encode it differently in the tree, unaware that game designers are relying on the dashboard to track average minigame completion time. The analyst, unaware of the schema change, had averaged a nested field assuming uniqueness. That assumption no longer holds. One level now duplicates data, skewing the result. The data engineer doesn’t know it matters. The dashboard updates. It looks fine.
But the numbers are now meaningless: the data is still valid, the dashboard still functions, but the relationship Unity Analytics → Tableau is no longer valid.
No one notices. No one can. The assumption has silently broken.
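A minimal sketch, with hypothetical flattened event data, of the uniqueness test that would have surfaced the broken assumption that the minigame appears only once per level per session.

```r
events <- data.frame(
  session_id       = c("s1", "s1", "s2", "s2", "s2"),
  level            = c(1, 2, 1, 2, 2),  # level 2 now fires twice in session s2
  minigame_seconds = c(30, 45, 28, 41, 12)
)

key <- events[, c("session_id", "level")]

# Wired into the pipeline, this fails loudly when the schema changes, instead
# of letting the duplicated level silently skew the averaged completion time.
stopifnot("minigame must appear once per level per session" =
            anyDuplicated(key) == 0)
```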
4.2 Epistemic drift at scale
Now consider a research software engineer assisting a professor with a simulation study. The professor provides synthetic data and a model script for the engineer to fit thousands of models for parameter tuning. At first, the pipeline runs smoothly.
As the experiment scales, the professor supplies more and more data—terabytes of it. The dataset now spans billions of rows, far beyond what the engineer can feasibly inspect. The model being fit is a log-based model requiring strictly positive numeric input—an assumption that would be clear in real-world biological data, but is easy to violate in simulation.
Suppose a single mistake is introduced into the simulated data—negative values that breach the model’s assumptions. That error propagates silently, replicated thousands of times across model fits. Eventually, the pipeline breaks. The engineer, working downstream and unaware of the log constraint embedded in the professor’s script, is left to debug a system that no longer holds, without the contextual understanding of the model the professor has.
It can take weeks, even months, to trace such a failure to its source.
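A minimal sketch of the kind of upstream assumption check that localises such a failure: assert strictly positive input once, before the thousands of model fits, rather than letting log() fail deep inside the pipeline. The function and column names here are hypothetical.

```r
fit_log_model <- function(sim_data) {
  # the assumption embedded in the professor's script, made explicit and loud
  stopifnot("response must be strictly positive for a log-based model" =
              all(sim_data$y > 0))
  lm(log(y) ~ x, data = sim_data)
}

sim_data <- data.frame(x = 1:5, y = c(2.3, 4.1, -0.7, 5.9, 3.2))
# fit_log_model(sim_data)  # stops immediately and names the broken assumption
```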
5 Democratising data development
But if we take a category theoretic approach–espoused in texts such as Category Theory for the Sciences [13]–to data development and privilege verifying not data points, but the relationships between data development roles and processes, we have a roadmap to mitigating garbage data. By ensuring integrity is retained between those who generate the data, those who wrangle the data, and those who analyse and report, we find a way of preventing the vacuous truth of p → q.
If the game developers and the professor described above had had living validations that revealed their assumptions from the start, those who understand the data would have been able to diagnose the garbage data far more efficiently.
The modern data stack is a powerful beast.
Problem is, if you show the engineers FAIR criteria [3], you will hear: sure, sure, next week.
“I’ve found it best not to do any documentation at all, otherwise people point out what’s missing.” – Real things I hear engineers say.
The reality is that while the data analytic question is reworked, there is little data engineers can know to be true about the data. Schemas are often non-existent. Because the
6 Appendices
6.1 DBT FDE test on invoice items
6.2 Python FDE test example on image classification
6.3 R FDE tests for this singularity
In all tests, we ask: what is the minimum thing we can do? Then structure exists, and we can expand.
6.3.1 Generator fn display
Testing the morphism between graph visualisation generator and display.
How do I call code reusably across both this manuscript (a .qmd file) and a slide deck that embeds the same .qmd file?
```r
# test where I am
getwd()
#> [1] "/home/cantabile/Documents/repos/good-enough"

# need to source
source("R/test-mooncake.R")

# test calling from R
mooncake_test
#> [1] "hi Mooncake"
```
7 References
Footnotes
1. Look to the sister paper for a deeper discussion of epistemic drift.
2. In a nod to the topologically dynamic space data development lives in, we appropriate the mathematical term clopen, which defines a set that is both open and closed. If you think something being both mathematically closed and open is weird and confusing, you’re not alone.
3. Contributor acknowledgement: Many thanks to the Data & AI team at TDCnet, Denmark, for contributions to test development and advisory discussions.