Modern data platforms are growing rapidly in complexity, and ensuring that Personally Identifiable Information (PII) is properly tagged, metadata is maintained, and SQL lineage is clearly tracked is essential for security, governance, and regulatory compliance. Fortunately, by integrating GPT-4, OpenMetadata, dbt, Trino, and Python, organizations can create an automated, intelligent data management pipeline.

In this article, we’ll walk through how to automate PII tagging, metadata enrichment, and lineage tracking using this toolchain.

Overview of Tools and Architecture

  • GPT-4: Used for semantic analysis of column names and descriptions to infer PII and enrich metadata.

  • OpenMetadata: The central metadata platform that stores and orchestrates descriptions, lineage, classification, and tagging.

  • dbt: For data transformation and model documentation.

  • Trino: Distributed SQL query engine used for data discovery and analysis.

  • Python: Glue language to stitch together the entire workflow via APIs and SDKs.

Setting Up OpenMetadata

To begin, install and launch OpenMetadata using Docker:

bash
git clone https://github.com/open-metadata/OpenMetadata
cd OpenMetadata/docker
docker-compose up -d
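
To confirm the server came up cleanly, you can query its version endpoint (assuming the default port):

bash
curl http://localhost:8585/api/v1/system/version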

Once OpenMetadata is up (default: http://localhost:8585), configure ingestion pipelines for Trino and dbt.

Ingestion Setup Example (YAML):

yaml
# Simplified for readability; the full Trino connector schema nests the
# connection under serviceConnection (see the OpenMetadata docs)
source:
  type: trino
  serviceName: trino_service
  config:
    hostPort: localhost:8080
    database: my_database
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api

Run ingestion using the CLI:

bash
metadata ingest -c trino_ingest.yaml

Connecting dbt for Transformation Metadata

Ensure your dbt project is documented: model and column descriptions live in your model YAML files (e.g., models/schema.yml):

yaml

version: 2

models:
  - name: customer
    description: "Customer data with PII like name and email"
    columns:
      - name: email
        description: "Customer's email address"
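
The dbt ingestion connector reads dbt's build artifacts rather than your source files, so generate them first:

bash
dbt run            # writes target/run_results.json
dbt docs generate  # writes target/manifest.json and target/catalog.json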

Then use the dbt ingestion connector to pull this into OpenMetadata:

yaml
source:
  type: dbt
  serviceName: dbt_service
  config:
    dbtManifestPath: /path/to/manifest.json
    dbtCatalogPath: /path/to/catalog.json
    dbtRunResultsPath: /path/to/run_results.json
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api

Ingest with:

bash
metadata ingest -c dbt_ingest.yaml

Using GPT-4 for Smart PII Detection and Metadata Enrichment

GPT-4 can be used via the OpenAI API to infer column sensitivity and generate rich descriptions.

Sample Python Script to Tag Columns with PII Using GPT-4:

python
import openai

from metadata.generated.schema.entity.data.table import Table
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection,
)
from metadata.ingestion.ometa.ometa_api import OpenMetadata

openai.api_key = "your-openai-api-key"

# Production servers also need auth settings (e.g., a JWT token) on the connection
metadata = OpenMetadata(OpenMetadataConnection(hostPort="http://localhost:8585/api"))


def is_pii(column_name, description):
    prompt = f"""
    Does the following column contain PII? Column name:
    {column_name}, Description: {description}.
    Respond with only "YES" or "NO".
    """
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return "YES" in response.choices[0].message["content"].strip()


def tag_columns_with_pii(fqn: str):
    table: Table = metadata.get_by_name(entity=Table, fqn=fqn)
    for column in table.columns:
        if is_pii(column.name, column.description or ""):
            # Helper name varies across SDK versions; newer releases expose
            # patch_column_tag for the same operation
            metadata.add_tag_to_column(table, column.name, tag_fqn="PII.Sensitive")


tag_columns_with_pii("trino_service.my_database.customer")
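
The same pattern extends to metadata enrichment: GPT-4 can draft a column description that a steward reviews before it is persisted. A minimal sketch (suggest_description is a hypothetical helper built on the same client; persisting the text would go through the SDK's patch APIs):

python
def suggest_description(table_name: str, column_name: str) -> str:
    # Ask GPT-4 for a one-sentence catalog description of the column
    prompt = (
        f"Write a one-sentence description of column '{column_name}' "
        f"in table '{table_name}' for a data catalog."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message["content"].strip()

print(suggest_description("customer", "email"))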

You can automate this across all tables and run it daily using Airflow or cron.
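For example, a minimal Airflow DAG could wrap the tagging helper (the DAG id and schedule are illustrative, and metadata, Table, and tag_columns_with_pii are assumed importable from the script above):

python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def tag_all_tables():
    # Page through every table registered in OpenMetadata and tag PII columns
    for table in metadata.list_all_entities(entity=Table):
        tag_columns_with_pii(table.fullyQualifiedName)

with DAG(
    "pii_tagging",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="tag_pii_columns", python_callable=tag_all_tables)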

Tracking SQL Lineage Automatically

Lineage from dbt and Trino can be auto-tracked by OpenMetadata during ingestion. For deeper analysis (e.g., user queries or ad hoc SQL), GPT-4 can parse raw SQL and infer lineage.

Example: Using GPT-4 for SQL Lineage Extraction

python
def extract_lineage_from_sql(sql: str):
    prompt = f"""
    Given the following SQL query, identify all source and target tables involved.
    SQL:
    {sql}
    Return JSON: {{ "source_tables": [...], "target_tables": [...] }}
    """
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message["content"]


sql = "INSERT INTO analytics.daily_summary SELECT * FROM staging.raw_events WHERE event_type = 'click'"
print(extract_lineage_from_sql(sql))
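
GPT-4 returns the JSON as a string, so parse it before use; production code should also validate the shape, since the model can occasionally wrap the JSON in extra text:

python
import json

def parse_lineage(raw: str) -> dict:
    # Expects: { "source_tables": [...], "target_tables": [...] }
    return json.loads(raw)

lineage = parse_lineage(extract_lineage_from_sql(sql))
print(lineage["source_tables"], lineage["target_tables"])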

The parsed source and target tables can then be pushed to OpenMetadata through its Python SDK (module paths may vary across SDK versions):

python

from metadata.generated.schema.api.lineage.addLineage import AddLineageRequest
from metadata.generated.schema.type.entityLineage import EntitiesEdge
from metadata.generated.schema.type.entityReference import EntityReference

# Resolve both tables so the lineage edge can reference them by id
source = metadata.get_by_name(entity=Table, fqn="trino_service.my_database.staging.raw_events")
target = metadata.get_by_name(entity=Table, fqn="trino_service.my_database.analytics.daily_summary")

metadata.add_lineage(
    AddLineageRequest(
        edge=EntitiesEdge(
            fromEntity=EntityReference(id=source.id, type="table"),
            toEntity=EntityReference(id=target.id, type="table"),
        )
    )
)

Building a Unified Python Pipeline

Combine all components into one automation script:

python
import os

def automate_data_management():
    # Ingest metadata from Trino and dbt
    os.system("metadata ingest -c trino_ingest.yaml")
    os.system("metadata ingest -c dbt_ingest.yaml")

    # Tag PII across every table registered in OpenMetadata
    for table in metadata.list_all_entities(entity=Table):
        tag_columns_with_pii(table.fullyQualifiedName)

    # Optionally parse recent SQL and push lineage.
    # post_lineage_to_openmetadata is a placeholder for the add_lineage logic
    # shown earlier; a sketch of get_recent_queries_from_trino follows below.
    for sql_query in get_recent_queries_from_trino():
        lineage = extract_lineage_from_sql(sql_query)
        post_lineage_to_openmetadata(lineage)

automate_data_management()
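
For reference, here is one possible implementation of get_recent_queries_from_trino using the trino Python client and Trino's system.runtime.queries table (the host, port, user, and state filter are assumptions to adapt to your deployment):

python
import trino

def get_recent_queries_from_trino(host="localhost", port=8080, user="admin"):
    # system.runtime.queries exposes recently executed statements
    conn = trino.dbapi.connect(host=host, port=port, user=user)
    cursor = conn.cursor()
    cursor.execute("SELECT query FROM system.runtime.queries WHERE state = 'FINISHED'")
    return [row[0] for row in cursor.fetchall()]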

Visualizing and Validating in OpenMetadata

Once tagging and lineage are complete, you can use the OpenMetadata UI to:

  • See PII tags on columns

  • Search for “PII” tagged fields (see the API example after this list)

  • View lineage DAGs for data pipelines

  • Validate dbt models with full metadata context
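
Tagged fields can also be retrieved programmatically through the search API. An illustrative call (the query-string syntax and index name are assumptions based on OpenMetadata's Elasticsearch-backed search):

python
import requests

resp = requests.get(
    "http://localhost:8585/api/v1/search/query",
    params={"q": 'tags.tagFQN:"PII.Sensitive"', "index": "table_search_index"},
)
print(resp.json()["hits"]["hits"])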

Adding Custom Tags and Glossary Terms

You can also enhance semantic metadata with business-domain tags (and, similarly, glossary terms):

python

from metadata.generated.schema.api.classification.createTag import CreateTagRequest

# SDK class names vary by version; recent releases create tags with
# create_or_update and a Create*Request, assuming a "Finance"
# classification already exists
metadata.create_or_update(
    CreateTagRequest(classification="Finance", name="Sensitive", description="Finance-related data")
)
metadata.add_tag_to_column(table, "salary", tag_fqn="Finance.Sensitive")

This helps data stewards and governance officers quickly classify datasets based on business domains.

Conclusion

Automating PII tagging, metadata enrichment, and SQL lineage tracking is no longer an aspirational goal—it’s an operational necessity. By integrating GPT-4’s AI capabilities with OpenMetadata’s governance platform, dbt’s model documentation, Trino’s powerful querying engine, and Python’s flexibility, data teams can achieve:

  • Improved compliance with regulations like GDPR, HIPAA, and CCPA.

  • Faster data discovery through smart tagging and descriptions.

  • Accurate lineage tracing for impact analysis and auditability.

  • Governance at scale without manual bottlenecks.

As data continues to scale across lakes, warehouses, and clouds, automation through intelligent systems like GPT-4 and OpenMetadata will be the cornerstone of trustworthy data ecosystems. With minimal effort and maximum adaptability, this architecture empowers both technical and non-technical users to collaborate over cleaner, safer, and smarter data.