Twenty-five years in data management has taught me that technology problems are rarely technology problems.
They usually masquerade as technology problems for a while. The whole debate stays comfortably technical — until somewhere along the way, somebody asks a deceptively simple question, and the entire conversation changes.
That happened to me during a large financial services merger.
At the time, the focus was on integration. Two organizations were becoming one. Systems needed to be consolidated, applications rationalized, platforms aligned, and operating models standardized. It was the kind of transformation that happens every day across large enterprises, albeit with a few extra zeroes attached to the budget.
Like most mergers, the initial conversations centered on technology. What would the future-state architecture look like? Which platforms would survive? Which tools would be retired? How quickly could we complete the integration?
Then somebody asked a question that should have been easy to answer.
"Which datasets are we actually migrating?"
The room became noticeably quieter.
Not because the question was hard to understand — but because nobody had ever had to answer it before.
In day-to-day operations, you never decide, dataset by dataset, what deserves to survive. A merger forces exactly that question, across the entire ecosystem, all at once. And in that moment, it was clear that none of us actually knew.
When "Move Everything" Stops Being a Strategy
If you've ever moved houses after living in the same place for twenty years, you know the feeling. At first, the plan sounds simple. Pack everything, load the truck, move into the new place.
Then reality arrives.
You discover boxes in the attic that haven't been opened in a decade. There are things in the garage that nobody remembers buying. Some items are unquestionably valuable. Others should have been discarded years ago. Most fall into a gray area where nobody is entirely sure whether they're important, sentimental, useful, or simply taking up space.
Enterprise data behaves exactly the same way.
The company being acquired operated a sizeable data ecosystem: roughly 15 petabytes in all — approximately 14 petabytes in Snowflake, another petabyte in Teradata — and roughly 1.3 million datasets spread across raw zones, staging environments, curated layers, analytical marts, and sandboxes accumulated over years of business operations.
When people hear numbers like that, they naturally focus on scale. Fifteen petabytes. More than a million datasets. Massive infrastructure.
But scale wasn't the problem. The problem was deciding what deserved to survive.
Because every dataset migrated carries a cost. Storage costs. Engineering costs. Governance costs. Operational support costs. Risk management costs. Every unnecessary dataset that survives a merger becomes part of the acquiring company's future burden.
Technical debt, I've learned over the years, always insists on traveling first class.
The Questions Nobody Could Answer
As we started evaluating the data estate, we discovered something uncomfortable. For a surprising number of datasets, we couldn't confidently answer even the most basic questions.
Who owns this dataset?
Who uses it?
What business process depends on it?
Is it actively consumed, or has it simply been sitting untouched for years?
Does it contain regulated information?
Would anybody notice if it disappeared tomorrow?
These weren't academic questions. They were multi-million-dollar questions.
Every week we delayed rationalization, infrastructure costs continued to accumulate. Every dataset migrated unnecessarily increased future operational overhead. Every dataset retired incorrectly created the potential for business disruption, regulatory exposure, or broken downstream processes.
The challenge wasn't simply moving data. The challenge was making thousands of business decisions at scale.
And business decisions require context. Unfortunately, context was in short supply.
The Most Important Data Is Often the Least Documented
One lesson seems to repeat itself in every large organization I've worked with. The datasets everyone worries about are rarely the datasets with the best documentation. In fact, the opposite is often true.
Some of the most critical assets in an organization were created years ago, embedded into operational processes, inherited across multiple technology generations, and quietly became indispensable. Over time, people changed roles, teams reorganized, systems evolved, and institutional knowledge slowly evaporated.
The data remained. The explanations did not.
Meanwhile, beautifully documented datasets with polished descriptions and clean metadata often turned out to have very little actual business relevance.
If you've spent enough time in this industry, you've probably encountered some version of a table named:
CUSTOMER_PROFILE_V7_FINAL
Nobody remembers who created it. Nobody wants to claim ownership of it. To the Data Engineering team, it's just one of thousands of tables they load data into. They own the pipeline, not the dataset or its contents.
And absolutely nobody wants to be the person who deletes it. Because somewhere deep down, everyone suspects it might be important.
Following the Footprints
At this point, we realized opinions were not going to solve the problem. Every stakeholder had a view of what mattered. Every business unit had assets they considered critical. Every team had historical reasons why something should be preserved.
What we needed was evidence.
So we began collecting metadata. Lots of it: physical metadata, logical metadata, usage metadata, ownership metadata. Anything that could help us understand not only what the data was, but whether it actually mattered.
One particular category became incredibly valuable: usage metadata.
People often tell you what they believe is important. Their behavior tells you what is actually important.
Some datasets that were described as mission critical hadn't been queried in months. Others with minimal documentation were being accessed hundreds of times each week by multiple business teams.
Usage patterns became our equivalent of footprints at a crime scene. They revealed where people had actually been, not where they claimed to have been.
The data, as it turns out, was surprisingly honest.
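To make the footprint idea concrete, here is a minimal sketch in Python of how usage signals might be set against stated importance. The field names (`reads_90d`, `writes_90d`, `access_grants`, `claimed_critical`), the 90-day window, the `min_reads` threshold, the category labels, and the sample dataset names are all illustrative assumptions, not the rules we actually applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical per-dataset usage summary; every field is illustrative."""
    name: str
    reads_90d: int          # queries that read the dataset in the last 90 days
    writes_90d: int         # loads or updates in the last 90 days
    access_grants: int      # users or roles with access provisioned
    claimed_critical: bool  # what stakeholders said about the dataset


def triage(rec: UsageRecord, min_reads: int = 10) -> str:
    """Compare observed behavior with stated importance (assumed thresholds)."""
    actively_read = rec.reads_90d >= min_reads
    if actively_read and not rec.claimed_critical:
        return "hidden-dependency"    # heavily used, but nobody vouched for it
    if actively_read:
        return "confirmed"            # belief and behavior agree
    if rec.claimed_critical:
        return "verify-claim"         # said to matter, but no footprints
    if rec.writes_90d > 0:
        return "loaded-but-unread"    # the pipeline still runs, nobody consumes it
    # No reads and no writes: standing access is the only remaining signal.
    return "retire-candidate" if rec.access_grants == 0 else "dormant-with-access"


# Made-up sample records, one per outcome.
records = [
    UsageRecord("undocumented_risk_feed", 2600, 90, 14, False),
    UsageRecord("polished_exec_mart", 0, 0, 3, True),
    UsageRecord("nightly_staging_copy", 1, 90, 2, False),
    UsageRecord("old_sandbox_extract", 0, 0, 0, False),
]

for rec in records:
    print(f"{rec.name}: {triage(rec)}")
```

The useful categories are the two where belief and behavior disagree. A dataset that is heavily read but unclaimed needs an owner found quickly, and one that is claimed as critical but never queried needs a conversation before anyone pays to migrate it.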
The Plot Twist Nobody Expected
For years, we had worked diligently to build modern, well-governed data products — exactly the kind of assets you'd expect to carry forward into a future state. So the assumption felt safe: the legacy systems would be retired, and our newer data products would survive. After all, newer technology should win.
Except mergers don't work that way.
The acquiring company had already selected its future-state architecture, governance framework, operating model, and technology standards. The decision had effectively been made before many of the individual datasets were even evaluated.
The question wasn't whether the data products of Company B, the company being acquired, were good. Many were. The question was whether they aligned with the future state that Company A, the acquirer, had already chosen.
Those are entirely different conversations.
As a result, many newer data products were ultimately not prioritized for migration. Instead, attention shifted toward legacy datasets that supported operational processes, risk functions, analytics, and regulatory obligations. Ironically, those legacy assets often carried the weakest metadata and the fewest available subject matter experts (SMEs).
The Clock Was Working Against Us
Mergers don't just consolidate technology. They consolidate organizations.
As staffing models evolved, many experienced employees departed. Subject matter experts retired, changed roles, accepted severance packages, or moved on to new opportunities.
The knowledge was walking out the door while the data remained behind.
At one point, I remember thinking about the absurdity of the situation. We had less than three months to evaluate more than a million datasets while simultaneously losing access to many of the people who understood them.
Imagine being handed a library with no catalog, no index, and no librarians, then being told to identify every book worth keeping before the building closes.
That was essentially our challenge.
From 1.3 Million to 25,000
Eventually, the metadata began telling a story. Ownership patterns emerged. Usage patterns emerged. Business dependencies became visible. Datasets that appeared important were later eliminated. Datasets that initially seemed insignificant suddenly surfaced as critical assets. The usage metadata (writes, reads, access provisioned) served as a vital signal in distinguishing belief from reality.
Gradually, the universe became smaller. The migration candidate pool shrank from approximately 1.3 million datasets to roughly 25,000.
It was a remarkable achievement. For a brief moment, we felt like we had solved the problem.
Then we looked at what would be required to properly govern those 25,000 datasets per Company A's data management policy and standards. And that's when the real problem finally revealed itself.
Each dataset needed dozens of new metadata attributes to be curated: definitions, ownership (data stewards), privacy classifications, risk classifications, business criticality, and regulatory impact.
Then there were the columns. Hundreds of thousands of them.
By the time the calculations were complete, the projected effort exceeded 20,000 hours of manual metadata curation and several million dollars in labor and program management costs.
The timeline, unfortunately, remained unchanged.
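A rough back-of-the-envelope check shows why those numbers and that timeline could not coexist. The dataset count and the hours figure come from the estimates above. The 13-week window and the 40-hour week are my assumptions for illustration, and since the projection exceeded 20,000 hours, treat the results as a floor.

```python
import math

DATASETS = 25_000         # migration candidates (from the text)
CURATION_HOURS = 20_000   # projected manual effort, a lower bound (from the text)

WEEKS_AVAILABLE = 13      # assumption: a full three-month window
HOURS_PER_WEEK = 40       # assumption: one full-time curator

# Average curation time per dataset, column-level work included.
minutes_per_dataset = CURATION_HOURS * 60 / DATASETS

# Hours a single curator can contribute within the window.
hours_per_curator = WEEKS_AVAILABLE * HOURS_PER_WEEK

# Whole people needed to finish inside the window.
curators_needed = math.ceil(CURATION_HOURS / hours_per_curator)

print(f"{minutes_per_dataset:.0f} minutes per dataset on average")
print(f"{curators_needed} full-time curators for {WEEKS_AVAILABLE} weeks")
```

Under those assumptions, the average works out to 48 minutes per dataset, and the work would need roughly 39 people curating full time for the entire window. Every one of them would also need to understand the data, at exactly the moment the people who understood it were leaving.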
The Realization
Looking back, this was the moment the entire initiative changed direction.
We had spent months treating the challenge as a migration problem. It wasn't.
We had spent months treating it as a governance problem. It wasn't that either.
What we were really facing was a knowledge problem.
The organization possessed the data. The organization possessed the technology. The organization possessed governance frameworks, policies, standards, and controls. What it lacked was a scalable way to convert institutional knowledge into usable metadata before that knowledge disappeared.
For the first time in my career, I found myself asking a different question. We had spent decades teaching people how to govern data.
What if we could teach data how to help govern itself?
That question would eventually lead us toward a concept we began calling the Data Governance Agent.
And that is where the story really begins.
In Part 2, I'll walk through the discovery that 20,000 hours of metadata curation was never going to work — and how that realization pushed us toward AI-assisted governance.