ESG Data Governance & Living Analysis Development Lifecycle
DBT Copenhagen Meetup
2025-03-26
This is Spinal Tap (Reiner 1984)
Validation of a FAIR data entity.
The analyst records the outcome of validating an analytical observation.
{% docs validation_2025_01 %}
See ticket for more context [DAA-665]. First, a naive validation on ESG KPIs: compare one data point with the [previously reported number] from 2023.
January 10, 2025.
E1-6_07 for Gross scope 1 greenhouse gas emissions for Diesel
in 2023 is approximately 7 times the reported measure.
| KPI_PREVIOUS | KPI_MEASURE | PROPORTION_OF_REPORTED_MEASURE | UNIT_OF_MEASUREMENT | EMISSION_FACTOR_USED |
|---|---|---|---|---|
| 3,111,543.42 | 21,814,714.592… | 7.011… | liter | 239.080… |
This dbt analysis will not be written to Snowflake; it is a script that can be run to compare previously reported figures.
Test-driven development can help narrow the scope of quality objectives, so checking whether the proportion of the reported measure is getting closer to 1 is potentially a useful optimisation technique to speed debugging.
This validation was written while the calculated KPIs were still dynamic; it required further calculation & joins, and will likely require refactoring into a new validation if it is to be reused.
{% enddocs %}
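The naive validation above can be sketched in a few lines of Python. The figures are copied from the table in the docs block (DAA-665); the 10% tolerance band is an assumption for illustration, not the talk's actual threshold.

```python
# Compare the calculated KPI with the previously reported figure and
# flag when the proportion of the reported measure drifts away from 1.
kpi_previous = 3_111_543.42     # previously reported 2023 figure
kpi_measure = 21_814_714.592    # calculated gross Scope 1 GHG emissions (Diesel)

proportion = kpi_measure / kpi_previous
print(f"proportion of reported measure: {proportion:.3f}")  # ~7.011

def is_valid(proportion: float, tolerance: float = 0.1) -> bool:
    """Pass validation when the calculated KPI is within tolerance of the reported figure."""
    return abs(proportion - 1) <= tolerance

print(is_valid(proportion))  # False: the ~7x discrepancy fails validation
```

As the prose notes, watching this proportion converge toward 1 across debugging iterations gives a cheap, test-driven signal of progress.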
Is the roadmap to developer wellbeing.
Validation of Data Entity Meets Minimum FAIR Criteria

| FAIR principle | Minimum requirements |
|---|---|
| Findable | Data documentation wherein columns & tables have descriptions for the data entity lineage. Clearly states access protocols, i.e., data security. Ensures subject matter experts can advise on governance, as well as contribute to validation. |
| Accessible | Understandable in terms of a data entity, a semantic abstraction in stakeholder terms (e.g., a number on an invoice). This helps ensure data governance is applied as subject matter experts advise. |
| Interoperable | Workflows to ensure data governance, security, and access function as intended when analysing the data entity. Facilitates subject matter experts contributing to data validation. |
| Reusable | Requested data transformations provided in the published layer of a dedicated database; standard tests applied to the data entity (see below for details) to confirm validity of documented assumptions. |
Go FAIR ("FAIR Principles", GO FAIR, n.d.).
Tests Applied on Data Product Layers by Observability, Descending

| | Test | Tested |
|---|---|---|
| Analytical Observation | | |
| output | unique key | data entity |
| Semantic Transformation | | |
| output | unique key | data entity joined across raw sources and tested |
| Source | | |
| output | unique key | data entity defined and tested |
| input | unique key | combination of columns that define a unique row |
| input | freshness | incrementation or snapshot field |
| input | not empty | table-level test |

We say a unique key has been tested when the same combination of columns has not-null and unique tests applied.
Freshness test configurations: daily ingestion (warn > 1 day, error > 1 week); weekly ingestion (warn > 1 week, error > 2 weeks).
- name: wf_ap
  description: '{{ doc("invoice_item_id") }}'
  data_tests:
    - tdc__table_contains_data
    - unique:
        column_name: "type_no || '-' || doc_id"
    - not_null:
        column_name: "type_no || '-' || doc_id"
  freshness:
    warn_after: {count: 1, period: day}
    error_after: {count: 7, period: day}
  loaded_at_field: opdateret_dato
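The warn/error freshness thresholds in the config above reduce to a simple comparison on the age of the latest `loaded_at_field` value. A minimal Python sketch, assuming only the thresholds stated in the table (the function name is illustrative, not dbt's API):

```python
from datetime import timedelta

def freshness_status(age: timedelta,
                     warn_after: timedelta,
                     error_after: timedelta) -> str:
    """Map the time since the last load to a dbt-style freshness status."""
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

# Daily ingestion policy: warn > 1 day, error > 1 week.
print(freshness_status(timedelta(hours=30), timedelta(days=1), timedelta(days=7)))  # warn
print(freshness_status(timedelta(days=8), timedelta(days=1), timedelta(days=7)))    # error
```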
{% docs invoice_item_id %}
### Data entity
Each row in these data identifies an item on an invoice; some
expenditure. Each row is uniquely identified by `type_no` and `doc_id`;
these are concatenated in the published layer as `invoice_item_id`.
### Tests
[Data product standard tests applied].
{% enddocs %}
Is for the people.
And you’re the people to make that happen.
Test the functor between visualisation generator and slide display.
Where am I?
What am I testing?
Source & Print it
Can we validate a single datum?
I expect there to be a node `ingest` that goes to a node `transform` representing `data engineering` tasks in the `expected` instance of living analysis development lifecycle.
Test if edge names contain required fields:
TRUE
Test if edges are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 3
from to line_type
<chr> <chr> <chr>
1 transform validate intended
2 validate document intended
3 question source intended
HairyAnalBeadsEdges <- ButtonEdgeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsEdges$testEdges()
Test if edge names contain required fields:
TRUE
Test if edges are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 6
from to reason line_type reason_na project_integrity
<chr> <chr> <chr> <chr> <lgl> <chr>
1 interpret decision "" intended TRUE backlog
2 question source "" intended TRUE backlog
3 interpret transform "measure\nmisunders… unintend… FALSE actioned
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 transform data engineering
2 question decision making
3 validate data engineering
HairyAnalBeadsNodes <- ButtonNodeDesignCategory$new(
  preset = "hairy_anal_beads")
HairyAnalBeadsNodes$testNodes()
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 source project planning
2 validate data engineering
3 ingest data engineering
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 ingest data engineering
2 source project planning
3 analyse data analysis
Test if node names contain required fields:
TRUE
Test if nodes are non-empty:
TRUE
Edge test passing status:
[1] TRUE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 source project planning
2 question decision making
3 document data engineering
Identity test on edge object.
Can we validate a single datum?
I expect there to be a node `ingest` that goes to a node `transform` representing `data engineering` tasks in the `expected` instance of living analysis lifecycle.
Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
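The `inner_join()` error above fires because the expected-edge frame and the observed frame share no column names, so dplyr cannot infer the join keys. A pandas analog (assumed frames, not the talk's actual objects) shows the same failure mode and the fix of naming the keys explicitly:

```python
import pandas as pd

expected = pd.DataFrame({"from": ["ingest"], "to": ["transform"]})
observed = pd.DataFrame({"source": ["ingest"], "target": ["transform"]})

# No common columns: pd.merge raises MergeError,
# mirroring dplyr's "`by` must be supplied" error.
try:
    expected.merge(observed, how="inner")
except pd.errors.MergeError as e:
    print(e)

# Fix: name the join keys explicitly
# (dplyr equivalent: by = c(from = "source", to = "target")).
matched = expected.merge(observed, how="inner",
                         left_on=["from", "to"], right_on=["source", "target"])
print(len(matched))  # 1: the expected edge is present in the observed edges
```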
Role is a node attribute!
from to node_type
1 ingest transform data engineering
[1] FALSE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 source project planning
2 ingest data engineering
3 transform data engineering
Expected Nodes.
from to
1 ingest transform
Expected nodes.
Error in `dplyr::inner_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Check edge column names are as expected
from to
1 ingest transform
[1] FALSE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 source project planning
2 ingest data engineering
3 transform data engineering
Check node attributes are as expected
name role
1 ingest data engineering
2 transform data engineering
[1] FALSE
# A tibble: 3 × 2
node node_colour
<chr> <chr>
1 source project planning
2 ingest data engineering
3 transform data engineering