In a dbt mesh architecture, multiple teams maintain independent dbt projects that can reference each other’s models. Without shared governance standards, each team develops its own conventions—leading to the coordination challenges explored below.
As organizations scale their data platforms with data mesh architectures, a practical challenge emerges: how do you maintain consistent data governance when multiple teams independently create and manage their own data models? Cross-domain data often lacks clear ownership, and upstream changes can break downstream systems without warning. Without a governance structure that matches this scale, inconsistency becomes the default.
The approach described here uses centralized policies with decentralized execution. This article covers why decentralized governance becomes difficult at scale, how centralized standards address this, and walks through an implementation using the dbt-data-governance-standards package as a reference.
Every organization is different. The goal is to illustrate one approach that you can adapt for your own governance strategy.
Credit to Darren Haken’s talk at Coalesce which influenced many of the ideas here.
The Problem with Decentralized Governance
When Every Team Invents Their Own Standards
Imagine a typical enterprise data platform with three analytics teams: the Customer Analytics team, the Finance Analytics team, and the Product Analytics team. Each team maintains their own dbt project within a dbt mesh architecture. Without coordination, here’s what their model metadata might look like:
# Team A (Customer Analytics) - uses "owner" with names
models:
- name: dim_customers
description: "Customer dimension table"
meta:
owner: "John Smith"
classification: "sensitive"
pii: "yes"
# Team B (Finance Analytics) - uses "data_owner" with emails
models:
- name: fct_revenue
description: "Revenue fact table"
meta:
data_owner: "jane.doe@company.com"
data_classification: "CONFIDENTIAL"
contains_pii: false
# Team C (Product Analytics) - minimal or no metadata
models:
- name: fct_user_events
description: "User events fact table"
# No governance metadata at all
At a glance, this might seem fine—each team is documenting their models. But look closer:
- Inconsistent field names: owner vs data_owner vs nothing
- Inconsistent values: "sensitive" vs "CONFIDENTIAL", "yes" vs false
- Missing fields: Team C has no governance metadata whatsoever
- No schema enforcement: Any value can be put in any field
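To see the cost concretely, consider what any script that consumes this metadata has to do. A minimal sketch (the function and its fallback chain are illustrative, not from any package):

```python
# Illustrative only: what a metadata consumer has to do when every team
# invents its own ownership field. Each new convention adds another branch.
def find_owner(meta):
    meta = meta or {}
    # Team A uses "owner", Team B uses "data_owner", Team C has nothing.
    return meta.get("owner") or meta.get("data_owner")

team_a = {"owner": "John Smith", "classification": "sensitive"}
team_b = {"data_owner": "jane.doe@company.com", "data_classification": "CONFIDENTIAL"}
team_c = {}  # no governance metadata at all

print(find_owner(team_a))  # John Smith
print(find_owner(team_b))  # jane.doe@company.com
print(find_owner(team_c))  # None -- nobody to contact
```

Every tool that reads this metadata (catalogs, audit scripts, alerting) needs its own copy of this fallback logic, and it silently returns nothing for Team C.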
The Consequences at Scale
These inconsistencies compound as the organization grows. Consider a common scenario:
A data scientist discovers an issue in fct_revenue. The numbers don’t match the finance report. Who owns this model?
- Team A’s model says owner: "John" — but John left the company 6 months ago
- Team B’s model says data_owner: "finance-team@company.com" — but that distribution list may be outdated
- Team C’s model has no ownership metadata at all
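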
Result: Messages to various channels, emails that may not reach anyone, time spent on discovery rather than resolution.
This scenario repeats across every governance dimension:
Compliance audits become manual discovery exercises. When Legal asks “show me all models containing PII and their retention policies,” someone must review every YAML file, interpret each team’s conventions, and manually compile results.
Cross-team collaboration breaks down. Without shared vocabulary, teams can’t answer basic questions: “Is this model stable? Can we depend on it? Will the schema change without warning?”
Data catalogs show incomplete information. Tools like Atlan, DataHub, or Monte Carlo expect consistent schemas. When every team uses different conventions, these tools require custom mapping logic for each team.
Standards drift over time. Even if teams start with similar conventions, without enforcement, they diverge. New team members bring their own preferences. Six months later, consistency is difficult to recover.
The Centralized Standards Pattern
Define Once, Distribute Everywhere, Enforce Automatically
Rather than relying on documentation alone, this approach uses tooling:
- Define the standard in a central, versioned package
- Distribute the standard as a dbt package that teams install
- Provide tools (macros) that make compliance the path of least resistance
- Enforce automatically in CI/CD so non-compliant models cannot merge
The central package contains everything needed for consistent governance:
- Schema definitions specifying exactly which fields are required and their types
- Retention policies codifying compliance requirements (GDPR, CCPA, HIPAA)
- Validation rules that check models against the schema and policies
- dbt macros that generate compliant metadata with minimal effort
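Assembling the file paths that appear throughout this article, the package layout looks roughly like this (the exact repository structure may differ):

```
dbt-data-governance-standards/
├── dbt_data_governance_standards/
│   └── schemas/
│       └── metadata_schema.py      # required/optional fields, enums
├── macros/
│   └── governance/
│       └── get_standard_metadata.sql
├── policies/
│   └── retention_policies.yml      # retention bounds (GDPR, CCPA, HIPAA)
└── rules_config.yml                # enable/disable rules, set severities
```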
The Schema: Your Single Source of Truth
At the core of centralized governance is a well-defined metadata schema. Here’s how it’s defined in Python:
# dbt_data_governance_standards/schemas/metadata_schema.py
from enum import Enum

class DataClassification(str, Enum):
"""Data classification levels."""
PUBLIC = "PUBLIC"
INTERNAL = "INTERNAL"
CONFIDENTIAL = "CONFIDENTIAL"
RESTRICTED = "RESTRICTED"
class DataLifecycle(str, Enum):
"""Data lifecycle stages."""
DEVELOPMENT = "DEVELOPMENT"
STAGING = "STAGING"
PRODUCTION = "PRODUCTION"
DEPRECATED = "DEPRECATED"
class StandardMetadata:
"""Standard metadata structure for dbt models."""
REQUIRED_FIELDS = {
"data_governance.data_owner": str,
"data_governance.data_steward": str,
"data_governance.data_classification": str,
"data_governance.data_lifecycle": str,
}
OPTIONAL_FIELDS = {
"data_governance.has_pii": bool,
"data_governance.pii_retention_days": int,
"data_governance.can_be_referenced": bool,
"data_governance.update_schedule": str,
"data_governance.sla": str,
}
This schema is explicit:
- Four required fields that every model must have: data owner, data steward, classification, and lifecycle stage
- Five optional fields for additional governance metadata like PII flags and retention periods
- Enums for controlled vocabularies, ensuring data_classification can only be one of four valid values
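The schema above lends itself to mechanical checking. Here is a minimal sketch of how a validator might walk the dotted required-field paths against a model's meta block (the helper is illustrative, not the package's actual code):

```python
# Sketch: checking dotted required-field paths against a model's meta block.
REQUIRED_FIELDS = {
    "data_governance.data_owner": str,
    "data_governance.data_steward": str,
    "data_governance.data_classification": str,
    "data_governance.data_lifecycle": str,
}

def missing_required_fields(meta):
    """Return the dotted paths that are absent or have the wrong type."""
    missing = []
    for path, expected_type in REQUIRED_FIELDS.items():
        node = meta
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        if not isinstance(node, expected_type):
            missing.append(path)
    return missing

meta = {"data_governance": {"data_owner": "team@company.com",
                            "data_classification": "CONFIDENTIAL"}}
print(missing_required_fields(meta))
# ['data_governance.data_steward', 'data_governance.data_lifecycle']
```

Because the field names and types live in one dict, adding a required field to the standard is a one-line change that every consumer picks up automatically.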
The Macro: Making Compliance Easy
Knowing the schema exists is one thing. Getting engineers to use it correctly is another. That’s where the dbt macro comes in:
-- macros/governance/get_standard_metadata.sql
{% macro get_standard_metadata(
data_owner,
data_steward,
data_classification,
data_lifecycle,
has_pii=false,
pii_retention_days=none,
can_be_referenced=true,
update_schedule=none,
sla=none
) %}
{% set metadata = {
"data_governance": {
"data_owner": data_owner,
"data_steward": data_steward,
"data_classification": data_classification,
"data_lifecycle": data_lifecycle,
"has_pii": has_pii,
"can_be_referenced": can_be_referenced
}
} %}
{% if pii_retention_days is not none %}
{% set _ = metadata["data_governance"].update({"pii_retention_days": pii_retention_days}) %}
{% endif %}
{% if update_schedule is not none %}
{% set _ = metadata["data_governance"].update({"update_schedule": update_schedule}) %}
{% endif %}
{% if sla is not none %}
{% set _ = metadata["data_governance"].update({"sla": sla}) %}
{% endif %}
{{ return(metadata) }}
{% endmacro %}
The macro is optional but provides several benefits:
- Named parameters make it clear what values are needed
- Default values for optional fields reduce boilerplate
- Consistent structure is guaranteed—no typos in field names
- IDE autocomplete helps engineers discover available options
Alternatively, teams can write the same structure directly in YAML without using the macro—the validator checks the structure regardless of how it was authored.
Dual Installation: Macros for Authors, Validation for CI
The dbt-data-governance-standards package serves two complementary purposes through two installation methods:
1. dbt Package: For Model Authors
Teams add the package to their packages.yml:
# packages.yml
packages:
- git: "https://github.com/ekoepplin/dbt-data-governance-standards.git"
revision: main
After running dbt deps, teams can add governance metadata to their models. There are two equivalent approaches:
Option A: Using the macro (provides autocomplete and compile-time validation)
# models/marts/schema.yml
version: 2
models:
- name: dim_customers
description: "Customer dimension table with profile and contact information"
meta: {{ dbt_data_governance_standards.get_standard_metadata(
data_owner="customer-analytics@company.com",
data_steward="data-governance@company.com",
data_classification="CONFIDENTIAL",
data_lifecycle="PRODUCTION",
has_pii=true,
pii_retention_days=365
) }}
Option B: Writing YAML directly (no macro needed)
# models/marts/schema.yml
version: 2
models:
- name: dim_customers
description: "Customer dimension table with profile and contact information"
meta:
data_governance:
data_owner: "customer-analytics@company.com"
data_steward: "data-governance@company.com"
data_classification: "CONFIDENTIAL"
data_lifecycle: "PRODUCTION"
has_pii: true
pii_retention_days: 365
Both approaches produce the same structure. The macro provides autocomplete and catches missing required fields at compile time, while direct YAML is simpler for teams that prefer not to use the macro. Choose whichever fits your workflow.
Column-level metadata is always written as plain YAML:
columns:
- name: customer_id
description: "Unique customer identifier (surrogate key)"
- name: email
description: "Customer email address"
meta:
data_governance:
is_pii: true
anonymization_method: "hash"
- name: full_name
description: "Customer full name"
meta:
data_governance:
is_pii: true
anonymization_method: "redact"
- name: created_at
description: "Account creation timestamp"
meta:
data_governance:
retention_date_field: true
2. Python Package: For CI/CD Validation
The same package can be installed as a Python package for running validation:
# Install using pip
pip install git+https://github.com/ekoepplin/dbt-data-governance-standards.git@main
# Or using uv (faster)
uv pip install git+https://github.com/ekoepplin/dbt-data-governance-standards.git@main
Once installed, validate any dbt project:
dbt-governance-validate --dbt-project /path/to/your/dbt/project
What this approach means:
- Authors get helpful macros that make writing compliant metadata easy
- CI/CD gets a validation tool that ensures compliance is mandatory
Validation Rules: Governance as Code
Validation rules are the heart of automated enforcement. The validator parses your dbt model YAML files (e.g., models/schema.yml), extracts the meta block from each model, and checks it against the governance rules.
Rules are defined as Python functions and configured via YAML for flexibility.
Rule Configuration
Rules are configured in rules_config.yml, allowing teams to enable, disable, or adjust severity without modifying code:
rules:
- name: "PII models without retention_days"
func: "detect_pii_without_retention_days"
severity: "error"
enabled: true
# ... additional rules for metadata, classification, etc.
Each rule maps to a validation function. Teams can set severity: "warning" during migration, then tighten to "error" once compliant.
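A sketch of how such a config-driven dispatcher might work once the YAML is parsed into a dict (the registry, runner, and simplified model shape are illustrative, not the package's internals):

```python
# Sketch: dispatching rules by name from a parsed rules_config.yml.
# Assume the YAML has already been loaded into `config` (e.g. with PyYAML).
config = {
    "rules": [
        {"name": "PII models without retention_days",
         "func": "detect_pii_without_retention_days",
         "severity": "error",
         "enabled": True},
    ]
}

def detect_pii_without_retention_days(models):
    # Simplified model shape for the sketch: flat dicts instead of parsed YAML.
    errors = [f"Model '{m['name']}': missing retention_days"
              for m in models
              if m.get("has_pii") and "retention_days" not in m]
    return len(errors) == 0, errors

RULE_REGISTRY = {"detect_pii_without_retention_days": detect_pii_without_retention_days}

def run_rules(models, config):
    """Run every enabled rule; only severity == 'error' fails the build."""
    build_failed = False
    for rule in config["rules"]:
        if not rule["enabled"]:
            continue
        ok, errors = RULE_REGISTRY[rule["func"]](models)
        for message in errors:
            print(f"[{rule['severity']}] {message}")
        if not ok and rule["severity"] == "error":
            build_failed = True
    return build_failed

print(run_rules([{"name": "dim_customers", "has_pii": True}], config))  # True
```

Because severity is data rather than code, flipping a rule from "warning" to "error" is a config edit, not a redeployment.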
Rule Implementation
Each rule receives the parsed models from your YAML files and follows a simple pattern: iterate models, check conditions, collect errors.
def detect_pii_without_retention_days(models, validator):
"""
models: list of model dictionaries parsed from your schema.yml files
Each model dict contains 'name', 'meta', 'columns', etc.
"""
errors = []
for model in models:
if model_has_pii(model) and not has_retention_days(model):
errors.append(f"Model '{model['name']}': missing retention_days")
return len(errors) == 0, errors
The pattern is the same for any rule—check a condition, report violations. This consistency makes it easy to add new rules as governance requirements evolve.
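For example, a hypothetical rule enforcing the classification vocabulary follows the exact same shape (this rule name and implementation are illustrative; the vocabulary mirrors the DataClassification enum shown earlier):

```python
# Hypothetical rule: classification values must come from the
# controlled vocabulary defined by the DataClassification enum.
VALID_CLASSIFICATIONS = {"PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"}

def detect_invalid_classification(models, validator=None):
    errors = []
    for model in models:
        governance = model.get("meta", {}).get("data_governance", {})
        value = governance.get("data_classification")
        if value is not None and value not in VALID_CLASSIFICATIONS:
            errors.append(
                f"Model '{model['name']}': invalid data_classification '{value}'")
    return len(errors) == 0, errors

ok, errors = detect_invalid_classification(
    [{"name": "fct_revenue",
      "meta": {"data_governance": {"data_classification": "sensitive"}}}])
print(ok)      # False
print(errors)  # ["Model 'fct_revenue': invalid data_classification 'sensitive'"]
```

Note how Team A's lowercase "sensitive" from the opening example would be caught here immediately.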
Example: Complete Validation Run
Here’s what a validation run looks like against a project with mixed compliance:
$ dbt-governance-validate --dbt-project ./analytics
Scanning 47 models in ./analytics/models...
✓ dim_products: all checks passed
✓ dim_dates: all checks passed
✗ dim_customers: 2 errors
- has_pii is True but pii_retention_days is not specified
- Column 'email' missing anonymization_method
✓ fct_orders: all checks passed
✗ fct_user_sessions: 3 errors
- Missing required field: data_governance.data_owner
- Missing required field: data_governance.data_classification
- Missing required field: data_governance.data_lifecycle
✓ stg_stripe__payments: all checks passed
...
Summary:
Models scanned: 47
Passed: 42
Failed: 5
Errors: 11
Failed models:
- dim_customers (2 errors)
- fct_user_sessions (3 errors)
- int_customer_orders (2 errors)
- mart_revenue_daily (3 errors)
- mart_churn_analysis (1 error)
Each error is specific and actionable. The engineer knows exactly which model, which field, and what’s missing.
PII and Compliance: Policy-Driven Validation
Handling PII correctly is a practical requirement under regulations like GDPR, CCPA, and HIPAA, which impose specific requirements on how personal data must be managed, retained, and eventually deleted.
Defining Retention Policies
Policies are defined in policies/retention_policies.yml, separate from validation rules. This makes them easy to update as regulations change:
# policies/retention_policies.yml
retention_policies:
pii_retention:
minimum_days: 90 # Can't delete too quickly
maximum_days: 2555 # Must delete eventually (GDPR)
deletion:
require_anonymization_method: true
When policy requirements change, you update this file. The next CI run checks all models against the new values.
For example, if legal requires reducing the maximum retention from 7 years to 3 years:
# Before
pii_retention:
maximum_days: 2555 # ~7 years
# After
pii_retention:
maximum_days: 1095 # ~3 years
The next CI run across all team projects will flag any model with pii_retention_days > 1095.
Validating Against Policies
Validation rules load the retention_policies.yml file and check each model’s metadata against those policy values. The pattern is straightforward:
def detect_invalid_retention_periods(models, validator):
    # Load policies from retention_policies.yml
    policies = load_policies()  # reads policies/retention_policies.yml
    min_days = policies["pii_retention"]["minimum_days"]
    max_days = policies["pii_retention"]["maximum_days"]
    errors = []
    for model in models:
        if model_has_pii(model):
            retention_days = model["meta"]["data_governance"].get("pii_retention_days")
            if retention_days is None:
                continue  # handled by the separate missing-retention rule
            if retention_days < min_days or retention_days > max_days:
                errors.append(f"Model '{model['name']}': retention out of range")
    return len(errors) == 0, errors
The key insight: policies are data, not code. When regulations change, update retention_policies.yml—no code deployment needed.
Column-Level PII Metadata
Beyond model-level metadata, PII columns specify how they should be handled during deletion:
columns:
- name: email
meta:
data_governance:
is_pii: true
anonymization_method: "hash" # or "redact", "generalize"
This column-level metadata enables automated deletion workflows. Here’s how a GDPR deletion request can be processed:
# Example: Generate deletion SQL from governance metadata
def generate_deletion_sql(model_name: str, customer_id: str) -> str:
"""Generate anonymization SQL based on column metadata."""
model = load_model_metadata(model_name)
updates = []
for column in model["columns"]:
meta = column.get("meta", {}).get("data_governance", {})
if meta.get("is_pii"):
method = meta.get("anonymization_method", "redact")
if method == "hash":
updates.append(f"{column['name']} = SHA256({column['name']})")
elif method == "redact":
updates.append(f"{column['name']} = '[REDACTED]'")
elif method == "generalize":
updates.append(f"{column['name']} = NULL")
return f"""
UPDATE {model_name}
SET {', '.join(updates)}
WHERE customer_id = '{customer_id}'
"""
# For dim_customers with email (hash) and full_name (redact):
# UPDATE dim_customers
# SET email = SHA256(email), full_name = '[REDACTED]'
# WHERE customer_id = 'cust_12345'
The deletion logic derives directly from the metadata. No manual mapping required.
CI/CD Integration: The Enforcement Layer
Having standards and macros is not enough. Compliance must be enforced automatically, or it will erode over time.
Why CI Matters for Governance at Scale
Documentation alone is insufficient for consistent enforcement. CI/CD integration provides automated validation.
When governance validation runs in CI/CD:
- Non-compliant models cannot merge. The gate is automatic.
- Feedback is immediate. Engineers see what needs fixing before code review.
- Standards evolve together. Update the central package, and every team’s next CI run validates against the new rules.
- Audits are straightforward. If it passed CI, it meets the defined standards.
What Enforcement Looks Like
When a model doesn’t meet governance standards, CI fails with clear, actionable messages:
❌ Model 'dim_customers': has_pii is True but pii_retention_days is not specified
❌ Column 'email': missing anonymization_method
The engineer knows exactly what to fix. No ambiguity, no interpretation.
Comparison: Manual vs. Automated Enforcement
| Manual Enforcement | Automated Enforcement |
|---|---|
| "Please follow the governance guide" | "PR blocked until compliant" |
| Standards drift over months | Standards checked on every commit |
| Audit requires manual file review | Audit relies on CI validation history |
Once you commit to automated enforcement, the mechanics—GitHub Actions, pre-commit hooks, the choice of validator—are straightforward.
Here’s a complete GitHub Actions workflow example:
# .github/workflows/governance-check.yml
name: Governance Validation
on:
pull_request:
paths:
- 'models/**/*.yml'
- 'models/**/*.sql'
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install validator
run: |
pip install git+https://github.com/ekoepplin/dbt-data-governance-standards.git@main
- name: Validate governance metadata
run: |
dbt-governance-validate --dbt-project . --output-format github
- name: Generate compliance report
if: always()
run: |
dbt-governance-validate --dbt-project . --output-format markdown > compliance-report.md
- name: Upload report
if: always()
uses: actions/upload-artifact@v4
with:
name: compliance-report
path: compliance-report.md
This workflow runs on every PR that modifies model files, blocks non-compliant changes, and generates an audit-ready compliance report.
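For feedback even earlier than CI, the same CLI can run as a local pre-commit hook. This hook definition is a sketch using pre-commit's repo: local mechanism; the package itself may not ship an official hook:

```yaml
# .pre-commit-config.yaml -- illustrative local hook, not an official one
repos:
  - repo: local
    hooks:
      - id: dbt-governance-validate
        name: Validate governance metadata
        entry: dbt-governance-validate --dbt-project .
        language: system
        files: ^models/.*\.ya?ml$
        pass_filenames: false
```

With this in place, engineers see the same validation errors at commit time that CI would report on the pull request.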
Benefits Summary: Why Centralize?
| Aspect | Decentralized Approach | Centralized Approach |
|---|---|---|
| Schema | Each team invents their own | Single schema, many consumers |
| Compliance | Manual, audit-driven | Automated, continuous |
| Standards Evolution | Drift over time, inconsistent updates | Evolve in one place, version-controlled |
| Enforcement | Hope and documentation | CI blocks non-compliant models |
| PII Handling | Varies by team interpretation | Uniform policies enforced everywhere |
| Onboarding | Learn each team’s conventions | Learn one standard, use everywhere |
| Tooling Integration | Custom mappings per team | Consistent metadata for all tools |
| Audit Readiness | Manual discovery and compilation | Automated reports from metadata |
The centralized approach doesn’t remove team autonomy—teams still own their models and make decisions about data classification and ownership. What it does remove is the burden of inventing and maintaining governance conventions. That responsibility moves to a central team that can:
- Keep up with regulatory changes
- Evolve the schema as needs change
- Ensure all teams benefit from improvements
- Provide a single point of truth for compliance
Conclusion
Data governance at scale requires more than good intentions and documentation. It requires:
- A single source of truth for what compliant metadata looks like
- Tools that make compliance easy for the engineers writing models
- Automated enforcement that makes non-compliance impossible to merge
The dbt-data-governance-standards package provides one reference implementation. By centralizing the definition of governance standards while decentralizing their application, organizations can scale their data mesh architectures while maintaining consistency.
Practical outcomes of this approach:
- Compliance audits can query metadata directly rather than reviewing files manually
- New team members learn one standard that applies across all projects
- GDPR deletion requests can be automated against documented PII fields
- Data catalogs receive consistent metadata across all teams
Start with one team. Validate the approach. Expand from there.