Governing Data for the People, by the People

ESG Data Governance & Living Analysis Development Lifescycle

Charles T. Gray, PhD – Datapunk

DBT Copenhagen Meetup

2025-03-26

Expectation

Expected Living Analysis Development Lifecycle

Reality

Ceci n’est pas un datum

This is Spinal Tap (Reiner 1984)

Reality of Living Analysis Development Lifecycle

Absurdist Orchestra of Ungoverned Data Analytics

Reality $\to$ expectation

Realistic deliverables for data developers

A realistic first epic
docs
Data Validation
Date of Validation
Results
How to use this validation
Limitations of this validation
enddocs

Validation of a FAIR data entity.

Analyst records the outcome of a validation of an analytical observation.

{% docs validation_2025_01 %}

See ticket for more context [DAA-665]. First, naive validation on ESG kpis. Compare 1 data point with [previously reported number] from 2023.

January 10, 2025.

E1-6_07 for Gross scope 1 greenhouse gas emissions for Diesel in 2023 is approximately 7 times the reported measure.

KPI_PREVIOUS	KPI_MEASURE	PROPORTION_OF_REPORTED_MEASURE	UNIT_OF_MEASUREMENT	EMISSION_FACTOR_USED
3,111,543.42	21,814,714.592…	7.011…	liter	239.080…

This dbt analysis will not be written to snowflake. This is a script that can be run to compare previously reported figures.

Test-driven development can help narrow the scope of quality objectives, so checking to see if the propotion of the reported measure is getting closer to 1 is potentially a useful optimisation technique to speed debugging.

This validation was written while calculated kpis are still dynamic, it required further calculation & joins, and will likely require refactoring into a new validation if it is to be reused.

{% enddocs %}

Roadmap to FAIR Compliance

Is the roadmap to developer wellbeing.

Validation of Data Entity meets Minimum FAIR Criteria
FAIR principle	Minimum requirements
Findable	Data documentation wherein columns & tables have descriptions for the data entity lineage. Clearly states access protocols; i.e., data security. Ensures subject matter experts can advise on governance, as well as contribute to validation.
Accessible	Understandable in terms of a data entity, a semantic abstraction in stakeholder terms (e.g., a number on an invoice). The helps ensure data governance is applied as subject matter experts advise.
Interoperable	Workflows to ensure data governance, security, and access function as intended when analysing data entity. Facilitates subject matter experts contributing to data validation.
Reusable	Requested data transformations provided in published layer of dedicated database, standard tests applied to data entity (see below for details) to confirm validity of documentated assumptions.

Go FAIR ((noauthor_fair_nodate?)).

Minimum tests on the data entity

Tests Applied on Data Product Layers by Observability, Descending
	Test	Tested
Analytical Observation
output	unique key	data entity
Semantic Transformation
output	unique key	data entity joined across raw sources and tested
Source
output	unique key	data entity defined and tested
input	unique key	combination of columns that define a unique row
input	freshness	incrementation or snapshot field
input	not empty	table-level test
We say a unique key has been tested when the same combination of columns have not null and unique tests applied.
Freshness tests configurations: daily ingestion (warn > 1 day, error > 1 week); weekly ingestion (warn > 1 week, error > 2 weeks)

- name: wf_ap
        description: '{{ doc("invoice_item_id") }}'
        data_tests:
          - tdc__table_contains_data
          - unique:
              column_name: "type_no || '-' || doc_id"
          - not_null:
              column_name: "type_no || '-' || doc_id"
       freshness: 
          warn_after: {count: 1, period: day}
          error_after: {count: 7, period: day}
        loaded_at_field: opdateret_dato

{% test tdc__table_contains_data(model) %}
  SELECT 
    CASE 
      WHEN (SELECT COUNT(*) FROM {{model}} limit 1) > 0 THEN 'Table is not empty'
      ELSE 'Table is empty'
    END AS result
    having result='Table is empty'
{% endtest %}

{% docs invoice_item_id %}

  ### Data entity

  Each row in these data is used to identify an item on an invoice; some 
  expenditure. Each row is uniquely identified by `type_no` and `doc_id`, 
  this is concatenated in published by `invoice_item_id`. 

  ### Tests

  [Data product standard tests applied].
  

{% enddocs %}

Datapunks… assemble!

Governing Data

Is for the people.

And you’re the people to make that happen.

Test Appendix (on the presheaf of visualisations)

Test: generator $\to$ display

Test the functor between visualisation generator and slide display.

Where am I?

Code

getwd()

[1] "/home/cantabile/Documents/repos/good-enough"

What am I testing?

Code

cat R/test-mooncake.R

# test this file loads

mooncake_test = "hi Mooncake"

Source & Print it

Code

source("R/test-mooncake.R")

# Print it 
mooncake_test

[1] "hi Mooncake"

Identity tests

Can we validate a single datum?

I expect there to be a node ingest that goes to a node transform representing data engineering tasks in the expected instance of living analysis development lifecycle.

Code

# source generator
source("R/ButtonCategory.R")

Test edges

AnalBeadsEdges <- ButtonEdgeDesignCategory$new(preset = "anal_beads")
AnalBeadsEdges$testEdges()

Test if edge names contain required fields:

TRUE

Test if edges are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(AnalBeadsEdges$edges, 3)

# A tibble: 3 × 3
  from      to       line_type
  <chr>     <chr>    <chr>    
1 transform validate intended 
2 validate  document intended 
3 question  source   intended

HairyAnalBeadsEdges <- ButtonEdgeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsEdges$testEdges()

Test if edge names contain required fields:

TRUE

Test if edges are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(HairyAnalBeadsEdges$edges, 3)

# A tibble: 3 × 6
  from      to        reason               line_type reason_na project_integrity
  <chr>     <chr>     <chr>                <chr>     <lgl>     <chr>            
1 interpret decision  ""                   intended  TRUE      backlog          
2 question  source    ""                   intended  TRUE      backlog          
3 interpret transform "measure\nmisunders… unintend… FALSE     actioned

Nodes: Identity test on nodes

AnalBeadsNodes <- ButtonNodeDesignCategory$new(preset = "anal_beads")
AnalBeadsNodes$testNodes()

Test if node names contain required fields:

TRUE

Test if nodes are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(AnalBeadsNodes$nodes, 3)

# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 transform data engineering
2 question  decision making 
3 validate  data engineering

HairyAnalBeadsNodes <- ButtonNodeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsNodes$testNodes()

Test if node names contain required fields:

TRUE

Test if nodes are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(HairyAnalBeadsNodes$nodes, 3)

# A tibble: 3 × 2
  node     node_colour     
  <chr>    <chr>           
1 source   project planning
2 validate data engineering
3 ingest   data engineering

AnalBeads <- ButtonCategory$new(preset = "anal_beads")
AnalBeads$testNodes()

Test if node names contain required fields:

TRUE

Test if nodes are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(AnalBeads$nodes, 3)

# A tibble: 3 × 2
  node    node_colour     
  <chr>   <chr>           
1 ingest  data engineering
2 source  project planning
3 analyse data analysis

HairyAnalBeads <- ButtonCategory$new(preset = "hairy_anal_beads")
HairyAnalBeads$testNodes()

Test if node names contain required fields:

TRUE

Test if nodes are non-empty:

TRUE

Edge test passing status:

[1] TRUE

dplyr::sample_n(HairyAnalBeads$nodes, 3)

# A tibble: 3 × 2
  node     node_colour     
  <chr>    <chr>           
1 source   project planning
2 question decision making 
3 document data engineering

Earlier tests

Identity test on edge object.

Can we validate a single datum?

I expect there to be a node ingest that goes to a node transform representing data engineering tasks in the expected instance of living analysis lifecycle.

Test: failed FAIR data entity test

Expectations

Code

# Create the expected datum for testing
# Expecting an edge from "ingest" to "transform" as part of the "data engineering" process

(
  expected_datum <- data.frame(
    from = "ingest",
    to = "transform",
    node_type = "data engineering"
  )
)

    from        to        node_type
1 ingest transform data engineering

Test: failed FAIR data entity test

Check the Nodes are non-empty

Code

# source generator
source("R/ButtonCategory.R")

# generate button graph
AnalBeads <- 
  ButtonCategory$new()

# Get this test to pass on Nodes

# non-empty intersection
nrow(
  AnalBeads$nodes |>
  dplyr::inner_join(expected_datum)
) > 0

Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Test: failed FAIR data entity test

Check the column names are as expected

Role is a node attribute!

Code

# show expected
expected_datum

    from        to        node_type
1 ingest transform data engineering

Code

# exact columns
all(
  colnames(expected_datum) %in% colnames(AnalBeads$nodes)
)

[1] FALSE

Code

# show Nodes
head(AnalBeads$nodes, 3)

# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering

Test: FAIR data entity test

Expectations

Expected Nodes.

Code

# Create the expected datum for testing
# Expecting an edge from "ingest" to "transform" as part of the "data engineering" process

(
  expected_edge <- data.frame(
    from = "ingest",
    to = "transform"
  )
)

    from        to
1 ingest transform

Expected nodes.

Code

# create expected nodes

(
  expected_nodes <- data.frame(
    name = c("ingest", "transform")
  ) |>
    dplyr::mutate(role = "data engineering")
)

       name             role
1    ingest data engineering
2 transform data engineering

Test: FAIR data entity test

Check the Nodes are non-empty

Code

# source generator
source("R/ButtonCategory.R")

# generate button graph
AnalBeads <- 
  ButtonCategory$new()

# Get this test to pass on Nodes

# non-empty intersection
nrow(
  AnalBeads$nodes |>
  dplyr::inner_join(expected_datum)
) > 0

Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Test: FAIR data entity test

Check edge column names are as expected

Code

# show expected
expected_edge

    from        to
1 ingest transform

Code

# exact columns
all(
  colnames(expected_edge) %in% colnames(AnalBeads$nodes)
)

[1] FALSE

Code

# show Nodes
head(AnalBeads$nodes, 3)

# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering

Test: FAIR data entity test

Check node attributes are as expected

Code

# show expected
expected_nodes

       name             role
1    ingest data engineering
2 transform data engineering

Code

# exact columns
all(
  colnames(expected_nodes) %in% colnames(AnalBeads$nodes)
)

[1] FALSE

Code

# show Nodes
head(AnalBeads$nodes, 3)

# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering

References

Reiner, Rob. 1984. “This Is Spinal Tap.” Comedy, {Music}. Spinal Tap Prod., Goldcrest Films International.

Governing Data for the People, by the People

Expectation

Expected Living Analysis Development Lifecycle

Reality

Ceci n’est pas un datum

Reality of Living Analysis Development Lifecycle

Absurdist Orchestra of Ungoverned Data Analytics

Reality →\to expectation

Realistic deliverables for data developers

Roadmap to FAIR Compliance

Minimum tests on the data entity

Minimum test examples

Development Chaos & Social Entropy

Datapunks… assemble!

Governing Data

Test Appendix (on the presheaf of visualisations)

Test: generator →\to display

Identity tests

Test edges

Nodes: Identity test on nodes

Earlier tests

Test: failed FAIR data entity test

Expectations

Test: failed FAIR data entity test

Check the Nodes are non-empty

Test: failed FAIR data entity test

Check the column names are as expected

Test: FAIR data entity test

Expectations

Test: FAIR data entity test

Check the Nodes are non-empty

Test: FAIR data entity test

Test: FAIR data entity test

References

Reality $\to$ expectation

Test: generator $\to$ display