What 30,000 Messy Wine Entries Actually Look Like
We wrote about helping a wine ERP stop losing weekends to data cleanup. This post goes deeper — into the data itself.
Feb 24, 2026
You know the story: a wine catalog built over years from dozens of suppliers, and nobody trusts the numbers anymore. But what does the mess actually look like when you open the spreadsheet?
Here's a sample from a real import batch. Five entries, supposedly five different wines:
| Type | Name | Winery | Region | Vintage |
|---|---|---|---|---|
| Vin Effervescent | Prestige Initiale Grand Cru | François Girard | Champagne | 2010 |
| Vin Effervescent | Prestige Initiale | Domaine François Girard | Champagne | 2010 |
| Vin Rosé | Odyssée Archiver | Jean & Matthieu Compeyrot | Côtes De Provence | 2022 |
| Vin Rosé | Odyssée Archiver | Jean & Matthieu Compeyrot | Côtes De Provence | 2022 |
| Vin Effervescent | Flamboyante | Juliette Pétret | Champagne | — |
Five entries. Three actual wines. This pattern repeated thousands of times across the catalog.
Why Traditional Deduplication Fails on Wine Data
Exact string matching catches rows 3 and 4 — they're identical. But it completely misses rows 1 and 2, which are the same wine expressed differently.
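A minimal sketch of that behavior, using the five sample rows above. The function name `exact_duplicate_groups` and the tuple layout are illustrative assumptions, not part of any real import pipeline.

```python
# The five sample rows from the table above: (type, name, winery, region, vintage).
rows = [
    ("Vin Effervescent", "Prestige Initiale Grand Cru", "François Girard", "Champagne", "2010"),
    ("Vin Effervescent", "Prestige Initiale", "Domaine François Girard", "Champagne", "2010"),
    ("Vin Rosé", "Odyssée Archiver", "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022"),
    ("Vin Rosé", "Odyssée Archiver", "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022"),
    ("Vin Effervescent", "Flamboyante", "Juliette Pétret", "Champagne", None),
]


def exact_duplicate_groups(rows):
    """Return groups of 1-based row numbers whose fields are all identical."""
    groups = {}
    for number, row in enumerate(rows, start=1):
        groups.setdefault(row, []).append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


duplicate_groups = exact_duplicate_groups(rows)
print(duplicate_groups)  # [[3, 4]]
```

Rows 1 and 2 never land in the same group, because a single differing character in any field makes the tuples unequal.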
The differences are subtle and domain-specific:
"Prestige Initiale Grand Cru" vs "Prestige Initiale": The first includes the vineyard classification. Both refer to the same cuvée. An exact string match sees two different values. A sommelier sees the same bottle.
"François Girard" vs "Domaine François Girard" — One supplier includes "Domaine," the other doesn't. Standard practice in French wine. Completely breaks exact matching.
Fuzzy matching gets closer, but creates false positives. "Réserve Blanc De Blancs Grand Cru" by François Girard is not the same wine as "Prestige Initiale Grand Cru" by François Girard, even though they share a winery and classification terms. A character-level similarity score can rate them as likely duplicates. They shouldn't be flagged.
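To see the effect, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library as a stand-in for "fuzzy matching". The metric, the `record_text` helper, and the `LOOSE_THRESHOLD` value are all assumptions for illustration; other similarity measures will give different numbers.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def record_text(name: str, winery: str) -> str:
    """Flatten a record into one string, as a naive fuzzy matcher might."""
    return f"{name} {winery}"


# Same wine, written two ways (rows 1 and 2 of the sample).
same_wine = char_similarity(
    record_text("Prestige Initiale Grand Cru", "François Girard"),
    record_text("Prestige Initiale", "Domaine François Girard"),
)

# Two different cuvées from the same producer.
different_wine = char_similarity(
    record_text("Réserve Blanc De Blancs Grand Cru", "François Girard"),
    record_text("Prestige Initiale Grand Cru", "François Girard"),
)

LOOSE_THRESHOLD = 0.5  # hypothetical cutoff, chosen only for this demo
flagged = [score >= LOOSE_THRESHOLD for score in (same_wine, different_wine)]
print(f"same wine: {same_wine:.2f}, different wines: {different_wine:.2f}, flagged: {flagged}")
```

The shared "Grand Cru François Girard" tail is enough to give the two different cuvées a substantial score. A stricter threshold may separate this particular pair, but any single cutoff on character similarity trades missed near-duplicates against false positives somewhere else in the catalog.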
Wine data needs something that understands context, not just characters.
Three Scenarios From the Real Cleanup
Scenario 1: Near Duplicate — caught by context, not strings
| | Entry in Catalog | Existing Match |
|---|---|---|
| Name | Prestige Initiale Grand Cru | Prestige Initiale |
| Winery | François Girard | Domaine François Girard |
| Region | Champagne | Champagne |
| Vintage | 2010 | 2010 |
| Type | Vin Effervescent | Vin Effervescent |
| Match | 91% | |
| Decision | Duplicate — merged | |
The cuvée name includes an extra classification ("Grand Cru") and the winery prefix differs ("Domaine"). But the producer, region, and vintage confirm it's the same wine.
This is the one that hurts most. Without context-aware detection, it stays in your catalog forever — creating ghost inventory, pricing conflicts, and confused clients.
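One way to approximate this kind of reasoning with plain rules is to normalize each field using domain knowledge before comparing. The sketch below is an illustration under stated assumptions, not the classification pipeline described in this post: the `WINERY_PREFIXES` and `CLASSIFICATION_TERMS` lists are hypothetical and far from complete, and the vintage on the third record is assumed for the example.

```python
import re
import unicodedata

# Hypothetical, deliberately short lists; a real catalog would need many more terms.
WINERY_PREFIXES = ("domaine", "château", "maison")
CLASSIFICATION_TERMS = ("grand cru", "premier cru")


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def winery_key(winery: str) -> str:
    """Drop a leading producer prefix such as 'Domaine'."""
    key = normalize(winery)
    for prefix in WINERY_PREFIXES:
        marker = normalize(prefix) + " "
        if key.startswith(marker):
            return key[len(marker):]
    return key


def cuvee_key(name: str) -> str:
    """Drop classification terms so only the cuvée name itself remains."""
    key = normalize(name)
    for term in CLASSIFICATION_TERMS:
        key = re.sub(rf"\b{re.escape(term)}\b", " ", key)
    return " ".join(key.split())


def same_wine(a: dict, b: dict) -> bool:
    """True when producer, cuvée, region, and vintage agree after normalization."""
    return (
        winery_key(a["winery"]) == winery_key(b["winery"])
        and cuvee_key(a["name"]) == cuvee_key(b["name"])
        and normalize(a["region"]) == normalize(b["region"])
        and a.get("vintage") == b.get("vintage")
    )


prestige_gc = {"name": "Prestige Initiale Grand Cru", "winery": "François Girard",
               "region": "Champagne", "vintage": "2010"}
prestige = {"name": "Prestige Initiale", "winery": "Domaine François Girard",
            "region": "Champagne", "vintage": "2010"}
# Vintage assumed for illustration; the post does not state it.
reserve = {"name": "Réserve Blanc De Blancs Grand Cru", "winery": "François Girard",
           "region": "Champagne", "vintage": "2010"}

print(same_wine(prestige_gc, prestige))  # True
print(same_wine(prestige_gc, reserve))   # False
```

Stripping classification terms is not always safe, since a producer can sell distinct wines that differ only by classification. That is a reason to treat a rule like this as a proposal for review rather than an automatic merge.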
Scenario 2: Exact Duplicate — hidden in plain sight
| | Entry in Catalog | Existing Match |
|---|---|---|
| Name | Odyssée Archiver | Odyssée Archiver |
| Winery | Jean & Matthieu Compeyrot | Jean & Matthieu Compeyrot |
| Region | Côtes De Provence | Côtes De Provence |
| Vintage | 2022 | 2022 |
| Type | Vin Rosé | Vin Rosé |
| Match | 100% | |
| Decision | Duplicate — removed | |
Identical entry from a different import source. The easy case — except when you have 30,000 entries and thousands of these hiding across years of imports from different suppliers. Nobody catches them manually because nobody's looking at the whole catalog at once.
Scenario 3: Unique Entry — confirmed, not just assumed
| | Entry in Catalog | Closest Match 1 | Closest Match 2 | Closest Match 3 |
|---|---|---|---|---|
| Name | Flamboyante | 745 | Brut Nature Fleur De L'europe | Réserve Blanc De Blancs Grand Cru |
| Winery | Juliette Pétret | Jacquesson | Fleury | François Girard |
| Region | Champagne | Champagne | Champagne | Champagne |
| Match | No match found | |||
| Decision | Unique — added to catalog | |||
All four are Champagnes. A naive system would hesitate. Context-aware classification understands that shared appellation alone doesn't make a duplicate. Different producer, different cuvée — confirmed as a new entry.
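The idea that a shared appellation is weak evidence can be sketched as a weighted field score, where region contributes little and producer and cuvée contribute most. The weights, the threshold, and the `classify` function below are hypothetical choices for illustration, not values taken from the actual cleanup.

```python
WEIGHTS = {"name": 0.45, "winery": 0.45, "region": 0.10}  # hypothetical weights
DUPLICATE_THRESHOLD = 0.80  # hypothetical cutoff


def field_score(entry: dict, candidate: dict) -> float:
    """Sum the weights of the fields that match, ignoring case."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if entry[field].casefold() == candidate[field].casefold()
    )


def classify(entry: dict, catalog: list) -> tuple:
    """Return a decision plus the three closest candidates as (score, name)."""
    scored = sorted(
        ((field_score(entry, candidate), candidate["name"]) for candidate in catalog),
        reverse=True,
    )
    best_score = scored[0][0] if scored else 0.0
    decision = "duplicate" if best_score >= DUPLICATE_THRESHOLD else "unique"
    return decision, scored[:3]


catalog = [
    {"name": "745", "winery": "Jacquesson", "region": "Champagne"},
    {"name": "Brut Nature Fleur De L'europe", "winery": "Fleury", "region": "Champagne"},
    {"name": "Réserve Blanc De Blancs Grand Cru", "winery": "François Girard", "region": "Champagne"},
]
entry = {"name": "Flamboyante", "winery": "Juliette Pétret", "region": "Champagne"}

decision, closest = classify(entry, catalog)
print(decision)  # unique
```

Every candidate shares the region, so each scores only the region weight, well below the cutoff. Returning the closest candidates alongside the decision keeps the "confirmed, not just assumed" part visible to a reviewer.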
The Part Nobody Talks About: Auditability
Cleaning data is one thing. Trusting the cleanup is another.
Every decision in the pipeline — duplicate, near-duplicate, or unique — comes with an explanation: what it was compared against, why the call was made, and what the confidence level was. The team can review any classification, override it if needed, and know exactly why a given entry was kept or removed.
This is the difference between a black-box cleanup tool and something a team actually adopts. AI proposes. Humans approve.
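As a sketch of what such an auditable record could look like, the dataclass below stores the comparison targets, the reason, the confidence, and any human override. The class name, field names, and identifiers are assumptions for illustration; the 0.91 confidence mirrors the 91% match from Scenario 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    """One classification, with enough context to review or override it."""
    entry_id: str
    label: str                      # "duplicate", "near-duplicate", or "unique"
    compared_against: List[str]     # ids of the candidates that were considered
    reason: str                     # human-readable explanation of the call
    confidence: float               # 0.0 to 1.0
    overridden_by: Optional[str] = None
    override_label: Optional[str] = None

    def override(self, reviewer: str, new_label: str) -> None:
        """Record a human override without erasing the original proposal."""
        self.overridden_by = reviewer
        self.override_label = new_label

    @property
    def final_label(self) -> str:
        """The human's label if one exists, otherwise the proposed label."""
        return self.override_label or self.label


# Hypothetical ids and reviewer name, for illustration only.
decision_record = Decision(
    entry_id="import-0001",
    label="near-duplicate",
    compared_against=["catalog-0042"],
    reason="Same producer, region, and vintage; name differs only by 'Grand Cru'.",
    confidence=0.91,
)
print(decision_record.final_label)  # near-duplicate

decision_record.override(reviewer="cellar-team", new_label="unique")
print(decision_record.final_label)  # unique
print(decision_record.label)        # near-duplicate (original proposal is kept)
```

Keeping the proposed label next to the override means the team can later audit both what the system suggested and what a person decided.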
It's Not Just a Wine Problem
The patterns — inconsistent naming, supplier abbreviations, duplicate entries from multiple sources, variant formats across import batches — exist in any industry with complex product catalogs.
Spirits distributors deal with the same thing. So do cosmetics aggregators managing shade variants across 500 brands. Auto parts catalogs where one fitting has twelve names. Specialty food importers with origin, grade, and certification data that never matches across suppliers.
The domain vocabulary changes. The data problem is identical.
This is the second post in our series on wine catalog data. Read the first: How We Helped a Wine ERP Stop Losing Weekends to Data Cleanup.
Reflekt Lab builds AI-powered data cleanup for product catalogs. Based in Bordeaux.