Governing Data for the People, by the People

ESG Data Governance & Living Analysis Development Lifecycle

Charles T. Gray, PhD – Datapunk

DBT Copenhagen Meetup

2025-03-26

Expectation

Expected Living Analysis Development Lifecycle

Reality

Ceci n’est pas un datum

This is Spinal Tap (Reiner 1984)

Reality of Living Analysis Development Lifecycle

Absurdist Orchestra of Ungoverned Data Analytics

Reality → expectation

Realistic deliverables for data developers

Validation of a FAIR data entity.

Analyst records the outcome of a validation of an analytical observation.

{% docs validation_2025_01 %}

See ticket for more context [DAA-665]. A first, naive validation of ESG KPIs: compare one data point with the [previously reported number] from 2023.

January 10, 2025.

E1-6_07 for Gross scope 1 greenhouse gas emissions for Diesel in 2023 is approximately 7 times the reported measure.

| KPI_PREVIOUS | KPI_MEASURE | PROPORTION_OF_REPORTED_MEASURE | UNIT_OF_MEASUREMENT | EMISSION_FACTOR_USED |
|---|---|---|---|---|
| 3,111,543.42 | 21,814,714.592… | 7.011… | liter | 239.080… |

This dbt analysis will not be written to Snowflake. It is a script that can be run to compare calculated figures with previously reported figures.

Test-driven development can help narrow the scope of quality objectives: checking whether the proportion of the reported measure is getting closer to 1 is a potentially useful way to speed up debugging.

This validation was written while the calculated KPIs were still dynamic. It required further calculation and joins, and will likely need refactoring into a new validation if it is to be reused.

{% enddocs %}
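As a rough illustration of the check described in the docs block above, the proportion of the reported measure can be computed and compared against a tolerance around 1. This is a minimal Python sketch, not the dbt analysis itself: the function names and the 5% tolerance are assumptions, and the inputs are the truncated figures shown in the validation table.

```python
def proportion_of_reported(kpi_measure: float, kpi_previous: float) -> float:
    """Ratio of the newly calculated KPI to the previously reported KPI."""
    if kpi_previous == 0:
        raise ValueError("previously reported KPI must be non-zero")
    return kpi_measure / kpi_previous


def is_validated(proportion: float, tolerance: float = 0.05) -> bool:
    """True when the proportion is within `tolerance` of 1 (assumed threshold)."""
    return abs(proportion - 1.0) <= tolerance


# Truncated figures as shown in the validation table.
KPI_PREVIOUS = 3_111_543.42
KPI_MEASURE = 21_814_714.592

proportion = proportion_of_reported(KPI_MEASURE, KPI_PREVIOUS)
print(f"proportion of reported measure: {proportion:.3f}")
print(f"validated: {is_validated(proportion)}")
```

Re-running a check like this after each change to the calculation shows whether the proportion is moving towards 1.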

Roadmap to FAIR Compliance

Is the roadmap to developer wellbeing.

Validation of Data Entity meets Minimum FAIR Criteria

| FAIR principle | Minimum requirements |
|---|---|
| Findable | Data documentation wherein columns & tables have descriptions for the data entity lineage. Clearly states access protocols; i.e., data security. Ensures subject matter experts can advise on governance, as well as contribute to validation. |
| Accessible | Understandable in terms of a data entity, a semantic abstraction in stakeholder terms (e.g., a number on an invoice). This helps ensure data governance is applied as subject matter experts advise. |
| Interoperable | Workflows to ensure data governance, security, and access function as intended when analysing the data entity. Facilitates subject matter experts contributing to data validation. |
| Reusable | Requested data transformations provided in the published layer of a dedicated database; standard tests applied to the data entity (see below for details) to confirm the validity of documented assumptions. |

GO FAIR (“FAIR Principles. GO FAIR.” n.d.).

Minimum tests on the data entity

Tests Applied on Data Product Layers by Observability, Descending

| Layer | Test | Tested |
|---|---|---|
| Analytical Observation | output unique key | data entity |
| Semantic Transformation | output unique key | data entity joined across raw sources and tested |
| Source | output unique key | data entity defined and tested |
| Source | input unique key | combination of columns that define a unique row |
| Source | input freshness | incrementation or snapshot field |
| Source | input not empty | table-level test |

We say a unique key has been tested when the same combination of columns has both not null and unique tests applied.

Freshness test configurations: daily ingestion (warn > 1 day, error > 1 week); weekly ingestion (warn > 1 week, error > 2 weeks).
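The freshness thresholds above can be read as a small classification rule. The following is a minimal Python sketch of that rule, not dbt's own implementation: the `FRESHNESS_RULES` layout, the function name, and the example timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Thresholds taken from the freshness note above; the dictionary layout is an assumption.
FRESHNESS_RULES = {
    "daily": {"warn": timedelta(days=1), "error": timedelta(weeks=1)},
    "weekly": {"warn": timedelta(weeks=1), "error": timedelta(weeks=2)},
}


def freshness_status(loaded_at: datetime, now: datetime, cadence: str) -> str:
    """Classify a source as 'pass', 'warn' or 'error' from the age of its latest load."""
    rules = FRESHNESS_RULES[cadence]
    age = now - loaded_at
    if age > rules["error"]:
        return "error"
    if age > rules["warn"]:
        return "warn"
    return "pass"


# Hypothetical load times, measured back from a fixed "now".
now = datetime(2025, 3, 26, 12, 0)
print(freshness_status(now - timedelta(hours=6), now, "daily"))
print(freshness_status(now - timedelta(days=3), now, "daily"))
print(freshness_status(now - timedelta(days=15), now, "weekly"))
```

Checking the error threshold first means a badly stale source is never reported as a mere warning.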

Minimum test examples

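      - name: wf_ap
        description: '{{ doc("invoice_item_id") }}'
        data_tests:
          # input not empty: table-level test
          - tdc__table_contains_data
          # unique key: the same column combination gets unique and not_null tests
          - unique:
              column_name: "type_no || '-' || doc_id"
          - not_null:
              column_name: "type_no || '-' || doc_id"
        # daily ingestion: warn after 1 day, error after 1 week
        freshness:
          warn_after: {count: 1, period: day}
          error_after: {count: 7, period: day}
        loaded_at_field: opdateret_dato
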
{% test tdc__table_contains_data(model) %}
  -- A dbt test fails when its query returns rows,
  -- so return one row only when the model is empty.
  SELECT 'Table is empty' AS result
  FROM (SELECT COUNT(*) AS row_count FROM {{ model }}) AS counts
  WHERE row_count = 0
{% endtest %}

{% docs invoice_item_id %}

  ### Data entity

  Each row in these data identifies an item on an invoice, i.e. some
  expenditure. Each row is uniquely identified by `type_no` and `doc_id`,
  which are concatenated in the published layer as `invoice_item_id`.

  ### Tests

  [Data product standard tests applied].
  

{% enddocs %}
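The minimum tests in the examples above (not empty, and a unique key that is both unique and not null) can also be sketched outside dbt. The following Python sketch only illustrates the logic: the function names and the invoice rows are hypothetical, and it is not how dbt executes these tests.

```python
from collections import Counter


def table_contains_data(rows):
    """Table-level test: passes when there is at least one row."""
    return len(rows) > 0


def unique_key_failures(rows, key_columns):
    """Collect rows with a null key part, and key values that occur more than once."""
    null_rows = [
        row for row in rows
        if any(row.get(col) is None for col in key_columns)
    ]
    key_counts = Counter(
        tuple(row[col] for col in key_columns)
        for row in rows
        if all(row.get(col) is not None for col in key_columns)
    )
    duplicates = [key for key, count in key_counts.items() if count > 1]
    return {"null_rows": null_rows, "duplicates": duplicates}


# Hypothetical invoice rows keyed by type_no and doc_id.
rows = [
    {"type_no": "A", "doc_id": 1},
    {"type_no": "A", "doc_id": 2},
    {"type_no": "A", "doc_id": 2},   # duplicate key
    {"type_no": None, "doc_id": 3},  # null key part
]

failures = unique_key_failures(rows, ["type_no", "doc_id"])
print("contains data:", table_contains_data(rows))
print("duplicate keys:", failures["duplicates"])
print("rows with null key parts:", len(failures["null_rows"]))
```

A unique key counts as tested only when both failure lists are empty, mirroring the pairing of `unique` and `not_null` on the same column combination.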

Development Chaos & Social Entropy

Datapunks… assemble!

Governing Data

Is for the people.

And you’re the people to make that happen.

Test Appendix (on the presheaf of visualisations)

Test: generator → display

Test the functor between visualisation generator and slide display.

Where am I?

Code
getwd()
[1] "/home/cantabile/Documents/repos/good-enough"

What am I testing?

Code
cat R/test-mooncake.R
# test this file loads

mooncake_test = "hi Mooncake"

Source & Print it

Code
source("R/test-mooncake.R")

# Print it 
mooncake_test
[1] "hi Mooncake"

Identity tests

Can we validate a single datum?

I expect a node `ingest` with an edge to a node `transform`, both representing data engineering tasks, in the expected instance of the living analysis development lifecycle.

Code
# source generator
source("R/ButtonCategory.R")

Test edges

AnalBeadsEdges <- ButtonEdgeDesignCategory$new(preset = "anal_beads")
AnalBeadsEdges$testEdges()
Test if edge names contain required fields:
TRUE
Test if edges are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(AnalBeadsEdges$edges, 3)
# A tibble: 3 × 3
  from      to       line_type
  <chr>     <chr>    <chr>    
1 transform validate intended 
2 validate  document intended 
3 question  source   intended 
HairyAnalBeadsEdges <- ButtonEdgeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsEdges$testEdges()
Test if edge names contain required fields:
TRUE
Test if edges are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(HairyAnalBeadsEdges$edges, 3)
# A tibble: 3 × 6
  from      to        reason               line_type reason_na project_integrity
  <chr>     <chr>     <chr>                <chr>     <lgl>     <chr>            
1 interpret decision  ""                   intended  TRUE      backlog          
2 question  source    ""                   intended  TRUE      backlog          
3 interpret transform "measure\nmisunders… unintend… FALSE     actioned         
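The two edge checks printed above (required fields present, edges non-empty) can be sketched independently of the R classes. This Python sketch is an assumption about what such checks might look like, not the implementation behind `testEdges()`. The required field set (`from`, `to`) is assumed, and the rows are taken from the sampled output above.

```python
# Assumed required fields; the R class may require others.
REQUIRED_EDGE_FIELDS = {"from", "to"}


def check_edges(edges):
    """Return True when every edge has the required fields and the edge list is non-empty."""
    has_required_fields = all(REQUIRED_EDGE_FIELDS <= set(edge) for edge in edges)
    is_non_empty = len(edges) > 0
    return has_required_fields and is_non_empty


# Rows as shown in the sampled edges output.
edges = [
    {"from": "transform", "to": "validate", "line_type": "intended"},
    {"from": "validate", "to": "document", "line_type": "intended"},
    {"from": "question", "to": "source", "line_type": "intended"},
]

print(check_edges(edges))
print(check_edges([]))
print(check_edges([{"from": "ingest"}]))
```

The non-empty check is kept separate because `all()` over an empty list is true, so a field check alone would pass on an empty edge table.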

Nodes: Identity test on nodes

AnalBeadsNodes <- ButtonNodeDesignCategory$new(preset = "anal_beads")
AnalBeadsNodes$testNodes()
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(AnalBeadsNodes$nodes, 3)
# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 transform data engineering
2 question  decision making 
3 validate  data engineering
HairyAnalBeadsNodes <- ButtonNodeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsNodes$testNodes()
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(HairyAnalBeadsNodes$nodes, 3)
# A tibble: 3 × 2
  node     node_colour     
  <chr>    <chr>           
1 source   project planning
2 validate data engineering
3 ingest   data engineering
AnalBeads <- ButtonCategory$new(preset = "anal_beads")
AnalBeads$testNodes()
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(AnalBeads$nodes, 3)
# A tibble: 3 × 2
  node    node_colour     
  <chr>   <chr>           
1 ingest  data engineering
2 source  project planning
3 analyse data analysis   
HairyAnalBeads <- ButtonCategory$new(preset = "hairy_anal_beads")
HairyAnalBeads$testNodes()
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
dplyr::sample_n(HairyAnalBeads$nodes, 3)
# A tibble: 3 × 2
  node     node_colour     
  <chr>    <chr>           
1 source   project planning
2 question decision making 
3 document data engineering

Earlier tests

Identity test on edge object.

Can we validate a single datum?

I expect a node `ingest` with an edge to a node `transform`, both representing data engineering tasks, in the expected instance of the living analysis development lifecycle.

Test: failed FAIR data entity test

Expectations

Code
# Create the expected datum for testing
# Expecting an edge from "ingest" to "transform" as part of the "data engineering" process

(
  expected_datum <- data.frame(
    from = "ingest",
    to = "transform",
    node_type = "data engineering"
  )
)
    from        to        node_type
1 ingest transform data engineering

Test: failed FAIR data entity test

Check the Nodes are non-empty

Code
# source generator
source("R/ButtonCategory.R")

# generate button graph
AnalBeads <- 
  ButtonCategory$new()

# Get this test to pass on Nodes

# non-empty intersection
nrow(
  AnalBeads$nodes |>
  dplyr::inner_join(expected_datum)
) > 0
Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Test: failed FAIR data entity test

Check the column names are as expected

Role is a node attribute!

Code
# show expected
expected_datum
    from        to        node_type
1 ingest transform data engineering
Code
# exact columns
all(
  colnames(expected_datum) %in% colnames(AnalBeads$nodes)
)
[1] FALSE
Code
# show Nodes
head(AnalBeads$nodes, 3)
# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering
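The join error and the `FALSE` above have the same cause: the expected datum and the node table share no column names. A minimal Python sketch of that pre-join check follows. The helper name is hypothetical, and the column names are copied from the printed outputs.

```python
def shared_columns(left_columns, right_columns):
    """Columns present in both tables, in the order of the left table."""
    return [column for column in left_columns if column in right_columns]


# Column names as printed for AnalBeads$nodes and expected_datum.
node_columns = ["node", "node_colour"]
expected_datum_columns = ["from", "to", "node_type"]

shared = shared_columns(node_columns, expected_datum_columns)
if shared:
    print("join on:", shared)
else:
    print("no common columns: a natural join has nothing to match on")
```

Here `from` and `to` describe an edge while `node_type` describes a node, so the expected datum mixes two kinds of entity in one row.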

Test: FAIR data entity test

Expectations

Expected edge.

Code
# Create the expected datum for testing
# Expecting an edge from "ingest" to "transform" as part of the "data engineering" process

(
  expected_edge <- data.frame(
    from = "ingest",
    to = "transform"
  )
)
    from        to
1 ingest transform

Expected nodes.

Code
# create expected nodes

(
  expected_nodes <- data.frame(
    name = c("ingest", "transform")
  ) |>
    dplyr::mutate(role = "data engineering")
)
       name             role
1    ingest data engineering
2 transform data engineering

Test: FAIR data entity test

Check the Nodes are non-empty

Code
# source generator
source("R/ButtonCategory.R")

# generate button graph
AnalBeads <- 
  ButtonCategory$new()

# Get this test to pass on Nodes

# non-empty intersection
nrow(
  AnalBeads$nodes |>
  dplyr::inner_join(expected_datum)
) > 0
Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Test: FAIR data entity test

Check edge column names are as expected

Code
# show expected
expected_edge
    from        to
1 ingest transform
Code
# exact columns
all(
  colnames(expected_edge) %in% colnames(AnalBeads$nodes)
)
[1] FALSE
Code
# show Nodes
head(AnalBeads$nodes, 3)
# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering

Test: FAIR data entity test

Check node attributes are as expected

Code
# show expected
expected_nodes
       name             role
1    ingest data engineering
2 transform data engineering
Code
# exact columns
all(
  colnames(expected_nodes) %in% colnames(AnalBeads$nodes)
)
[1] FALSE
Code
# show Nodes
head(AnalBeads$nodes, 3)
# A tibble: 3 × 2
  node      node_colour     
  <chr>     <chr>           
1 source    project planning
2 ingest    data engineering
3 transform data engineering
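One way to get a non-empty intersection is to map the expected node columns onto the node table's columns before joining. The Python sketch below assumes that `name` corresponds to `node` and `role` to `node_colour`. That mapping is an assumption for illustration, not something the generator defines. The rows are those printed above.

```python
# Rows as displayed by head(AnalBeads$nodes, 3).
nodes = [
    {"node": "source", "node_colour": "project planning"},
    {"node": "ingest", "node_colour": "data engineering"},
    {"node": "transform", "node_colour": "data engineering"},
]

# Rows as displayed for expected_nodes.
expected_nodes = [
    {"name": "ingest", "role": "data engineering"},
    {"name": "transform", "role": "data engineering"},
]

# Assumed correspondence between expected and actual column names.
COLUMN_MAP = {"name": "node", "role": "node_colour"}


def rename(row, mapping):
    """Rename the keys of one row according to `mapping`."""
    return {mapping.get(key, key): value for key, value in row.items()}


def natural_inner_join(left, right):
    """Inner join on all columns the two tables share."""
    if not left or not right:
        return []
    shared = [column for column in left[0] if column in right[0]]
    if not shared:
        raise ValueError("no common columns to join on")
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if all(l_row[c] == r_row[c] for c in shared)
    ]


renamed = [rename(row, COLUMN_MAP) for row in expected_nodes]
matched = natural_inner_join(nodes, renamed)
print("non-empty intersection:", len(matched) > 0)
```

Without the rename, the same join raises an error, as the earlier attempts did.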

References

“FAIR Principles. GO FAIR.” n.d. Accessed March 8, 2025. https://www.go-fair.org/fair-principles/.
Reiner, Rob. 1984. This Is Spinal Tap. Spinal Tap Prod., Goldcrest Films International.