How We Cleaned 50,000+ Records in Less Than a Day with AI

A case study on how AI turned a messy dataset into a $10K win

I recently got involved in a data cleanup project that started with a spreadsheet from hell. 50,000+ records where the same company appeared as 'SIEMENS', 'Siemens AG', 'Siemens Deutschland', and fourteen other variations.

Half the records had missing company names entirely. The other half looked like they'd been entered by different people on different planets. Customers typed their company names by hand as part of a registration form, creating a beautiful mess of inconsistency that made any meaningful analysis impossible.

The team's initial estimate: cleaning this dataset up manually would take a whole month and incur about $10K in labor costs. The alternative was abandoning the project entirely.

Most consultants would have walked away. I saw the perfect AI opportunity.

Let’s dive in!

Why Messy Data Is Your Best AI Starting Point

Here's a screenshot of the data – the kind that quickly turned into an internal meme, and the kind that would kill most data analysis projects in the classical world:

However, thanks to AI, a dataset like this can quickly turn into a lucrative opportunity. The messy reality that makes traditional approaches expensive is exactly what makes simple AI approaches so valuable. In other words: perfect data problems tend to have small returns. Chaotic data problems can have dramatic ones – if done right.

A Case Study on AI Data Cleaning

Let's take a look at how we solved this challenge practically. First, let me give you some context:

  • Mid-sized organization (~500+ employees)

  • Professional events industry

  • Data came from registration forms where people could insert their company name as plain text

  • Challenge: Derive company-level insights (simple questions like: How many people from each company attended the event? Which companies came back? What trends do we observe?)

  • Complication: Whether the analysis would actually yield useful insights wasn't clear until we did it

  • Approach: Use AI to dramatically cut the cost and effort for performing this analysis.

Here’s how we did it:

Stage 1: External Reference Matching

We started by finding clean reference data. Using domain names and websites where available, we pulled company information from external providers like Apollo. This immediately solved 40% of the dataset - all the records where we could match a business domain from the registration to a proper company name.
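
To make that concrete, here's a minimal sketch of domain-based matching in Python. The field names (`email`, `company_clean`) and the `reference` lookup table are illustrative assumptions; in our case the clean names came from an external enrichment provider, queried ahead of time by domain.

```python
# Minimal sketch: match registrations to clean company names via email domain.
PERSONAL_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com", "gmx.de"}  # can't identify a company

def extract_domain(email: str) -> str:
    """Pull the lowercased domain out of a registration email address."""
    return email.split("@")[-1].strip().lower() if "@" in email else ""

def match_by_domain(records: list[dict], reference: dict[str, str]) -> int:
    """Fill in a canonical company name wherever the business domain is known.

    `reference` maps domain -> clean company name (e.g. pre-fetched from an
    enrichment provider). Returns the number of records matched.
    """
    matched = 0
    for rec in records:
        domain = extract_domain(rec.get("email", ""))
        if domain and domain not in PERSONAL_DOMAINS and domain in reference:
            rec["company_clean"] = reference[domain]
            rec["match_source"] = "external_reference"
            matched += 1
    return matched

# Illustrative usage:
# records = [{"email": "jane.doe@siemens.com", "company": "SIEMENS"}]
# match_by_domain(records, reference={"siemens.com": "Siemens AG"})  # -> 1
```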

Stage 2: Similarity Detection

For the remaining records, we ran similarity measures like Levenshtein distance on the unique company name values. This flagged obvious near-matches: "SIEMENS" and "Siemens AG" clearly belonged together, but the algorithm caught subtler ones too. No AI needed yet - just smart pattern recognition.
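
Here's a minimal sketch of that step, using a plain Levenshtein distance normalized by string length. It's pure Python for clarity (a library like rapidfuzz would be much faster on a large set of unique names), and the threshold is an assumption you'd tune on your own data.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def find_near_matches(names, threshold: float = 0.35):
    """Flag pairs of unique company names that look like the same entity."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        d = normalized_distance(a.lower().strip(), b.lower().strip())
        if d <= threshold:
            pairs.append((a, b, round(d, 2)))
    return sorted(pairs, key=lambda p: p[2])

# Illustrative usage: flags ("SIEMENS", "Siemens AG") as a candidate pair.
# find_near_matches(["SIEMENS", "Siemens AG", "Siemens Deutschland", "Bosch GmbH"])
```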

Stage 3: AI Harmonization

For records with very high similarity (low Levenshtein distance), we used AI - in this case o3-mini - to compare pairs along with their context: user-provided industry, country, and company name. The AI decided whether these were the same company with different spelling or two distinct companies that just sound similar. For confirmed duplicates, we kept the external reference name when available, or the most complete version when not. This process analyzed 10,000+ records in an iterative loop in less than 1 hour.
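
Here's a hedged sketch of what that judgment call can look like with the OpenAI Python SDK. The prompt wording, field names, and JSON output contract are illustrative assumptions rather than the exact production setup.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt: the exact wording and JSON contract are illustrative.
PROMPT = """You are deduplicating company names from event registrations.
Do these two entries refer to the same company? Answer with JSON only:
{{"same_company": true, "canonical_name": "<best name>"}} or {{"same_company": false}}

Entry A: name={name_a}, industry={industry_a}, country={country_a}
Entry B: name={name_b}, industry={industry_b}, country={country_b}"""

def judge_pair(a: dict, b: dict) -> dict:
    """Ask the model whether two candidate records are the same company."""
    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[{
            "role": "user",
            "content": PROMPT.format(
                name_a=a["company"], industry_a=a.get("industry", "unknown"),
                country_a=a.get("country", "unknown"),
                name_b=b["company"], industry_b=b.get("industry", "unknown"),
                country_b=b.get("country", "unknown"),
            ),
        }],
    )
    # In production you'd validate the JSON and retry on malformed output.
    return json.loads(resp.choices[0].message.content)

# Illustrative usage:
# verdict = judge_pair(
#     {"company": "SIEMENS", "industry": "Engineering", "country": "DE"},
#     {"company": "Siemens AG", "industry": "Engineering", "country": "DE"},
# )
# If verdict["same_company"], merge the pair and keep the external reference
# name when available, or the most complete spelling otherwise.
```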

Each stage made the next one more effective. External matching reduced the problem size. Similarity detection created manageable clusters. AI handled the final judgment calls with high accuracy.

The key takeaway here is that AI wasn't really doing the heavy lifting - it was helping us complete the last mile that would have been impossible otherwise: making nuanced judgment calls on pre-filtered, contextual pairs.

Results

We finished in 6.5 hours what would have taken a month manually. The team saved $10K in external labor costs and discovered patterns they couldn't see before: which companies sent the most attendees, return customer trends, and industry clustering that informed their event strategy.

We even found some surprising insights. Like a relatively small but high-profile industry player that had the team going, "Wait, they were here with 5 people? How did we not know this?!" Now they can highlight that attendance in promotional material for their next show.

The data went from unusable mess to strategic asset in less than a day.

Conclusions

Messy data doesn't have to be a roadblock for AI.

Your ugliest dataset could be your biggest AI opportunity. While teams procrastinate with 'clean data first, AI second' strategies, smart organizations are using AI right now to unlock clean data, which unlocks useful insights and builds the foundation for even more advanced AI initiatives.

Using AI to get better AI? You're in good company. All the big AI labs are using AI to structure, enhance, and expand their training data.

Stop waiting for perfect data to start with AI. Start with whatever makes your team groan the most - that's where the clearest wins live.

Happy data crunching!

See you next Friday,
Tobias
