Reflekt Lab - What 30,000 Messy Wine Entries Actually Look Like

You know the story: a wine catalog built over years from dozens of suppliers, and nobody trusts the numbers anymore. But what does the mess actually look like when you open the spreadsheet?

Here's a sample from a real import batch. Five entries, supposedly five different wines:

Type	Name	Winery	Region	Vintage
Vin Effervescent	Prestige Initiale Grand Cru	François Girard	Champagne	2010
Vin Effervescent	Prestige Initiale	Domaine François Girard	Champagne	2010
Vin Rosé	Odyssée Archiver	Jean & Matthieu Compeyrot	Côtes De Provence	2022
Vin Rosé	Odyssée Archiver	Jean & Matthieu Compeyrot	Côtes De Provence	2022
Vin Effervescent	Flamboyante	Juliette Pétret	Champagne	—

Five entries. Three actual wines. This pattern repeated thousands of times across the catalog.

Why Traditional Deduplication Fails on Wine Data

Exact string matching catches rows 3 and 4 — they're identical. But it completely misses rows 1 and 2, which are the same wine expressed differently.
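To make that concrete, here is a minimal sketch of exact matching over the five sample rows. The function name and the tuple layout are illustrative choices, not part of any real pipeline.

```python
from collections import defaultdict

# The five sample rows from the import batch: (type, name, winery, region, vintage).
# The missing vintage on row 5 is represented as None.
ROWS = [
    ("Vin Effervescent", "Prestige Initiale Grand Cru", "François Girard", "Champagne", "2010"),
    ("Vin Effervescent", "Prestige Initiale", "Domaine François Girard", "Champagne", "2010"),
    ("Vin Rosé", "Odyssée Archiver", "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022"),
    ("Vin Rosé", "Odyssée Archiver", "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022"),
    ("Vin Effervescent", "Flamboyante", "Juliette Pétret", "Champagne", None),
]

def exact_duplicate_groups(rows):
    """Return lists of 1-based row numbers whose fields are all identical."""
    groups = defaultdict(list)
    for number, row in enumerate(rows, start=1):
        groups[row].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]

print(exact_duplicate_groups(ROWS))  # [[3, 4]]
```

Rows 1 and 2 never land in the same group, because a single differing character in any field produces a different key.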

The differences are subtle and domain-specific:

"Prestige Initiale Grand Cru" vs "Prestige Initiale": the first includes the "Grand Cru" classification. Both refer to the same cuvée. An exact string comparison sees two different names. A sommelier sees the same bottle.

"François Girard" vs "Domaine François Girard" — One supplier includes "Domaine," the other doesn't. Standard practice in French wine. Completely breaks exact matching.

Fuzzy matching gets closer, but creates false positives. "Réserve Blanc De Blancs Grand Cru" by François Girard is not the same wine as "Prestige Initiale Grand Cru" by François Girard, even though they share a winery and classification terms. A similarity threshold loose enough to catch messier duplicates would flag them. They shouldn't be flagged.
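A rough way to see the problem is to score both pairs with a generic character-level similarity. The sketch below uses Python's standard `difflib.SequenceMatcher`. Concatenating name, winery, and region into one string is an assumption made for illustration, and `record_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def record_similarity(a, b):
    """Character-level similarity in [0, 1] over name + winery + region."""
    text_a = " ".join(a).casefold()
    text_b = " ".join(b).casefold()
    return SequenceMatcher(None, text_a, text_b).ratio()

# (name, winery, region) for the three entries discussed above.
prestige_a = ("Prestige Initiale Grand Cru", "François Girard", "Champagne")
prestige_b = ("Prestige Initiale", "Domaine François Girard", "Champagne")
reserve = ("Réserve Blanc De Blancs Grand Cru", "François Girard", "Champagne")

same_wine_score = record_similarity(prestige_a, prestige_b)
different_wine_score = record_similarity(reserve, prestige_a)

print(f"same wine, written differently: {same_wine_score:.2f}")
print(f"different wine, same producer:  {different_wine_score:.2f}")
```

The different wine still scores above 0.6 here, because the shared "Grand Cru", winery, and region make up most of the string. The score measures shared characters, not which characters matter.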

Wine data needs something that understands context, not just characters.

Three Scenarios From the Real Cleanup

Scenario 1: Near Duplicate — caught by context, not strings

	Entry in Catalog	Existing Match
Name	Prestige Initiale Grand Cru	Prestige Initiale
Winery	François Girard	Domaine François Girard
Region	Champagne	Champagne
Vintage	2010	2010
Type	Vin Effervescent	Vin Effervescent
Match	91%
Decision	Duplicate — merged

The cuvée name includes an extra classification ("Grand Cru") and the winery prefix differs ("Domaine"). But the producer, region, and vintage confirm it's the same wine.
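One way to approximate this kind of reasoning with plain rules is to strip the parts of each field that carry no identity before comparing. The sketch below is a simplified stand-in, not the classification method used in the cleanup. The prefix and classification lists are hypothetical and far too short for a real catalog, and it returns a yes/no answer rather than a score like the 91% above.

```python
import unicodedata

# Hypothetical vocabularies for illustration; a real catalog needs curated lists.
WINERY_PREFIXES = {"domaine", "chateau", "maison"}
CLASSIFICATION_TERMS = ("grand cru", "premier cru")

def fold(text):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())

def core_winery(winery):
    """Drop a leading generic prefix such as 'Domaine'."""
    words = fold(winery).split()
    if len(words) > 1 and words[0] in WINERY_PREFIXES:
        words = words[1:]
    return " ".join(words)

def core_cuvee(name):
    """Drop classification terms that suppliers include inconsistently."""
    folded = fold(name)
    for term in CLASSIFICATION_TERMS:
        folded = folded.replace(term, " ")
    return " ".join(folded.split())

def same_wine(a, b):
    """True when cuvée, producer, region, and vintage all agree after cleanup."""
    return (
        core_cuvee(a["name"]) == core_cuvee(b["name"])
        and core_winery(a["winery"]) == core_winery(b["winery"])
        and fold(a["region"]) == fold(b["region"])
        and a["vintage"] == b["vintage"]
    )

entry = {"name": "Prestige Initiale Grand Cru", "winery": "François Girard",
         "region": "Champagne", "vintage": "2010"}
existing = {"name": "Prestige Initiale", "winery": "Domaine François Girard",
            "region": "Champagne", "vintage": "2010"}
# Vintage is an assumption here, set equal to make the comparison as hard as possible.
other_cuvee = {"name": "Réserve Blanc De Blancs Grand Cru", "winery": "François Girard",
               "region": "Champagne", "vintage": "2010"}

print(same_wine(entry, existing))     # True
print(same_wine(entry, other_cuvee))  # False
```

Hand-written rules like these break down quickly as suppliers and naming habits multiply, which is why the field lists are the weak point of this sketch.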

This is the one that hurts most. Without context-aware detection, it stays in your catalog forever — creating ghost inventory, pricing conflicts, and confused clients.

Scenario 2: Exact Duplicate — hidden in plain sight

	Entry in Catalog	Existing Match
Name	Odyssée Archiver	Odyssée Archiver
Winery	Jean & Matthieu Compeyrot	Jean & Matthieu Compeyrot
Region	Côtes De Provence	Côtes De Provence
Vintage	2022	2022
Type	Vin Rosé	Vin Rosé
Match	100%
Decision	Duplicate — removed

Identical entry from a different import source. The easy case — except when you have 30,000 entries and thousands of these hiding across years of imports from different suppliers. Nobody catches them manually because nobody's looking at the whole catalog at once.

Scenario 3: Unique Entry — confirmed, not just assumed

	Entry in Catalog	Closest Match 1	Closest Match 2	Closest Match 3
Name	Flamboyante	745	Brut Nature Fleur De L'europe	Réserve Blanc De Blancs Grand Cru
Winery	Juliette Pétret	Jacquesson	Fleury	François Girard
Region	Champagne	Champagne	Champagne	Champagne
Match	No match found
Decision	Unique — added to catalog

All four are Champagnes. A system that keys on region alone would hesitate. Context-aware classification understands that a shared appellation doesn't make a duplicate. Different producer, different cuvée: confirmed as a new entry.
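The logic of this scenario can be sketched as two steps: use the region to narrow down what to compare against, then require the producer and cuvée to agree before calling anything a duplicate. The `classify` function and the exact-equality test are simplifying assumptions. The three catalog rows are the closest matches from the table above.

```python
CATALOG = [
    {"name": "745", "winery": "Jacquesson", "region": "Champagne"},
    {"name": "Brut Nature Fleur De L'europe", "winery": "Fleury", "region": "Champagne"},
    {"name": "Réserve Blanc De Blancs Grand Cru", "winery": "François Girard", "region": "Champagne"},
]

def classify(entry, catalog):
    """Return (verdict, compared), where compared lists every row that was checked."""
    # Step 1: the region only selects candidates; it never decides the verdict.
    compared = [row for row in catalog
                if row["region"].casefold() == entry["region"].casefold()]
    # Step 2: a duplicate needs the same producer and the same cuvée.
    for row in compared:
        same_winery = row["winery"].casefold() == entry["winery"].casefold()
        same_name = row["name"].casefold() == entry["name"].casefold()
        if same_winery and same_name:
            return "duplicate", compared
    return "unique", compared

new_entry = {"name": "Flamboyante", "winery": "Juliette Pétret", "region": "Champagne"}
verdict, compared = classify(new_entry, CATALOG)
print(verdict, "after checking", len(compared), "candidates")
```

Returning the compared rows alongside the verdict means a "unique" result shows what was ruled out, not just that nothing matched.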

The Part Nobody Talks About: Auditability

Cleaning data is one thing. Trusting the cleanup is another.

Every decision in the pipeline — duplicate, near-duplicate, or unique — comes with an explanation: what it was compared against, why the call was made, and what the confidence level was. The team can review any classification, override it if needed, and know exactly why a given entry was kept or removed.
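As an illustration of what such a record could hold, here is a minimal sketch of a decision with its explanation and a human override. The `Decision` class, its field names, and the entry identifiers are hypothetical, not the actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    entry_id: str
    verdict: str                  # "duplicate", "near-duplicate", or "unique"
    compared_against: list[str]   # identifiers of the entries that were checked
    reason: str                   # why the call was made, in plain language
    confidence: float             # 0.0 to 1.0
    overridden_by: Optional[str] = None
    override_verdict: Optional[str] = None

    def override(self, reviewer, verdict):
        """Record a human correction without erasing the original call."""
        self.overridden_by = reviewer
        self.override_verdict = verdict

    @property
    def final_verdict(self):
        """The human verdict wins when one exists."""
        return self.override_verdict or self.verdict

decision = Decision(
    entry_id="import-0001",              # hypothetical identifier
    verdict="near-duplicate",
    compared_against=["catalog-0042"],   # hypothetical identifier
    reason="Same producer, region, and vintage; name differs only by 'Grand Cru'.",
    confidence=0.91,
)
decision.override(reviewer="cellar-team", verdict="unique")
print(decision.verdict, "->", decision.final_verdict)  # near-duplicate -> unique
```

Keeping the original verdict next to the override is a deliberate choice in this sketch: the team can see both what was proposed and what was approved.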

This is the difference between a black-box cleanup tool and something a team actually adopts. AI proposes. Humans approve.

It's Not Just a Wine Problem

The patterns — inconsistent naming, supplier abbreviations, duplicate entries from multiple sources, variant formats across import batches — exist in any industry with complex product catalogs.

Spirits distributors deal with the same thing. So do cosmetics aggregators managing shade variants across 500 brands. Auto parts catalogs where one fitting has twelve names. Specialty food importers with origin, grade, and certification data that never matches across suppliers.

The domain vocabulary changes. The data problem is identical.

This is the second post in our series on wine catalog data. Read the first: How We Helped a Wine ERP Stop Losing Weekends to Data Cleanup.

Reflekt Lab builds AI-powered data cleanup for product catalogs. Based in Bordeaux. Let's talk.