What 30,000 Messy Wine Entries Actually Look Like

We wrote about helping a wine ERP stop losing weekends to data cleanup. This post goes deeper — into the data itself.

Feb 24, 2026

You know the story: a wine catalog built over years from dozens of suppliers, and nobody trusts the numbers anymore. But what does the mess actually look like when you open the spreadsheet?

Here's a sample from a real import batch. Five entries, supposedly five different wines:

| Type | Name | Winery | Region | Vintage |
|---|---|---|---|---|
| Vin Effervescent | Prestige Initiale Grand Cru | François Girard | Champagne | 2010 |
| Vin Effervescent | Prestige Initiale | Domaine François Girard | Champagne | 2010 |
| Vin Rosé | Odyssée Archiver | Jean & Matthieu Compeyrot | Côtes De Provence | 2022 |
| Vin Rosé | Odyssée Archiver | Jean & Matthieu Compeyrot | Côtes De Provence | 2022 |
| Vin Effervescent | Flamboyante | Juliette Pétret | Champagne | - |

Five entries. Three actual wines. This pattern repeated thousands of times across the catalog.


Why Traditional Deduplication Fails on Wine Data

Exact string matching catches rows 3 and 4 — they're identical. But it completely misses rows 1 and 2, which are the same wine expressed differently.

The differences are subtle and domain-specific:

"Prestige Initiale Grand Cru" vs "Prestige Initiale": The first includes the "Grand Cru" classification. Both refer to the same cuvée. Exact matching sees two different strings. A sommelier sees the same bottle.

"François Girard" vs "Domaine François Girard": One supplier includes "Domaine," the other doesn't. Both forms are common in French wine listings, and the extra word is enough to break exact matching.

Fuzzy matching gets closer, but creates false positives. "Réserve Blanc De Blancs Grand Cru" by François Girard is not the same wine as "Prestige Initiale Grand Cru" by François Girard, even though they share a winery and classification terms. A similarity threshold loose enough to catch rows 1 and 2 risks flagging this pair too. They shouldn't be flagged.

Wine data needs something that understands context, not just characters.
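To make the fuzzy-matching trade-off concrete, here is a minimal sketch using Python's standard-library `difflib`. It is an illustration only, not the matcher used in the cleanup: concatenating name and winery into one string is an assumption about how a naive fuzzy matcher might compare entries.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Name + winery joined into one string (an assumed, naive comparison).
same_wine = (
    "Prestige Initiale Grand Cru François Girard",
    "Prestige Initiale Domaine François Girard",
)
different_wines = (
    "Réserve Blanc De Blancs Grand Cru François Girard",
    "Prestige Initiale Grand Cru François Girard",
)

for label, (a, b) in [("same wine", same_wine), ("different wines", different_wines)]:
    print(f"{label}: {similarity(a, b):.2f}")
```

Both pairs share long runs of characters, so both get a substantial score. Whether a single threshold separates them depends on the metric and on which fields are compared, which is exactly the fragility described above.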


Three Scenarios From the Real Cleanup

Scenario 1: Near Duplicate — caught by context, not strings

| | Entry in Catalog | Existing Match |
|---|---|---|
| Name | Prestige Initiale Grand Cru | Prestige Initiale |
| Winery | François Girard | Domaine François Girard |
| Region | Champagne | Champagne |
| Vintage | 2010 | 2010 |
| Type | Vin Effervescent | Vin Effervescent |

Match: 91%
Decision: Duplicate (merged)

The cuvée name includes an extra classification ("Grand Cru") and the winery prefix differs ("Domaine"). But the producer, region, and vintage confirm it's the same wine.

This is the one that hurts most. Without context-aware detection, it stays in your catalog forever — creating ghost inventory, pricing conflicts, and confused clients.
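One way to picture what "context" means in Scenario 1 is a rule-based comparison that normalizes each field before comparing. This is a simplified sketch, not the detection method used in the cleanup: the prefix and classification lists, the function names, and the all-fields-must-agree rule are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed, illustrative vocabularies (already lowercased and accent-free).
WINERY_PREFIXES = ("domaine", "chateau", "maison")
CLASSIFICATION_TERMS = ("grand cru", "premier cru")


def fold(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().lower()


def normalize_winery(winery: str) -> str:
    """Drop a leading prefix such as 'Domaine' if present."""
    folded = fold(winery)
    for prefix in WINERY_PREFIXES:
        if folded.startswith(prefix + " "):
            return folded[len(prefix) + 1:]
    return folded


def normalize_cuvee(name: str) -> str:
    """Remove classification terms so only the cuvée name remains."""
    folded = fold(name)
    for term in CLASSIFICATION_TERMS:
        folded = folded.replace(term, " ")
    return re.sub(r"\s+", " ", folded).strip()


def same_wine(a: dict, b: dict) -> bool:
    """Producer, cuvée, region, and vintage must all agree after normalization."""
    return (
        normalize_winery(a["winery"]) == normalize_winery(b["winery"])
        and normalize_cuvee(a["name"]) == normalize_cuvee(b["name"])
        and fold(a["region"]) == fold(b["region"])
        and a["vintage"] == b["vintage"]
    )


entry = {"name": "Prestige Initiale Grand Cru", "winery": "François Girard",
         "region": "Champagne", "vintage": "2010"}
existing = {"name": "Prestige Initiale", "winery": "Domaine François Girard",
            "region": "Champagne", "vintage": "2010"}
# Vintage assumed equal here purely to isolate the name difference.
other = {"name": "Réserve Blanc De Blancs Grand Cru", "winery": "François Girard",
         "region": "Champagne", "vintage": "2010"}

print(same_wine(entry, existing))  # True
print(same_wine(entry, other))     # False
```

Hand-written rules like these are brittle. Stripping "Grand Cru" would wrongly merge a producer's Grand Cru and non-Grand Cru bottlings if they shared a cuvée name, which is one reason such calls benefit from a confidence score and human review.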

Scenario 2: Exact Duplicate — hidden in plain sight

| | Entry in Catalog | Existing Match |
|---|---|---|
| Name | Odyssée Archiver | Odyssée Archiver |
| Winery | Jean & Matthieu Compeyrot | Jean & Matthieu Compeyrot |
| Region | Côtes De Provence | Côtes De Provence |
| Vintage | 2022 | 2022 |
| Type | Vin Rosé | Vin Rosé |

Match: 100%
Decision: Duplicate (removed)

Identical entry from a different import source. The easy case — except when you have 30,000 entries and thousands of these hiding across years of imports from different suppliers. Nobody catches them manually because nobody's looking at the whole catalog at once.
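Looking at the whole catalog at once is cheap for a machine. The sketch below scans every entry in a single pass and reports exact repeats along with where each copy came from. The source file names are hypothetical, and the rows are the five sample entries from the top of this post.

```python
# (source, (type, name, winery, region, vintage)); source names are hypothetical.
IMPORTS = [
    ("supplier_a.csv", ("Vin Effervescent", "Prestige Initiale Grand Cru",
                        "François Girard", "Champagne", "2010")),
    ("supplier_b.csv", ("Vin Effervescent", "Prestige Initiale",
                        "Domaine François Girard", "Champagne", "2010")),
    ("supplier_a.csv", ("Vin Rosé", "Odyssée Archiver",
                        "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022")),
    ("supplier_c.csv", ("Vin Rosé", "Odyssée Archiver",
                        "Jean & Matthieu Compeyrot", "Côtes De Provence", "2022")),
    ("supplier_c.csv", ("Vin Effervescent", "Flamboyante",
                        "Juliette Pétret", "Champagne", None)),
]


def find_exact_duplicates(imports):
    """Return (row, source, first_row, first_source) for every repeated entry."""
    first_seen = {}  # entry tuple -> (row number, source) of its first appearance
    duplicates = []
    for row_number, (source, entry) in enumerate(imports, start=1):
        if entry in first_seen:
            first_row, first_source = first_seen[entry]
            duplicates.append((row_number, source, first_row, first_source))
        else:
            first_seen[entry] = (row_number, source)
    return duplicates


for row, source, first_row, first_source in find_exact_duplicates(IMPORTS):
    print(f"row {row} ({source}) repeats row {first_row} ({first_source})")
# prints: row 4 (supplier_c.csv) repeats row 3 (supplier_a.csv)
```

Rows 1 and 2 are not reported: a dictionary lookup only finds entries that match character for character.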

Scenario 3: Unique Entry — confirmed, not just assumed

| | Entry in Catalog | Closest Match 1 | Closest Match 2 | Closest Match 3 |
|---|---|---|---|---|
| Name | Flamboyante | 745 | Brut Nature Fleur De L'europe | Réserve Blanc De Blancs Grand Cru |
| Winery | Juliette Pétret | Jacquesson | Fleury | François Girard |
| Region | Champagne | Champagne | Champagne | Champagne |

Match: No match found
Decision: Unique (added to catalog)

All four are Champagnes. A naive system would hesitate. Context-aware classification understands that shared appellation alone doesn't make a duplicate. Different producer, different cuvée — confirmed as a new entry.


The Part Nobody Talks About: Auditability

Cleaning data is one thing. Trusting the cleanup is another.

Every decision in the pipeline — duplicate, near-duplicate, or unique — comes with an explanation: what it was compared against, why the call was made, and what the confidence level was. The team can review any classification, override it if needed, and know exactly why a given entry was kept or removed.

This is the difference between a black-box cleanup tool and something a team actually adopts. AI proposes. Humans approve.
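As a rough picture of what such an explanation could look like in practice, here is a minimal sketch of a decision record that keeps the original call and any human override side by side. The class, its field names, and the reviewer name are assumptions for illustration, not the pipeline's actual schema. The 0.91 confidence reuses the 91% match shown in Scenario 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """One classification plus everything needed to review it later."""
    entry: str
    label: str                      # "duplicate", "near-duplicate", or "unique"
    compared_against: list          # entries the candidate was checked against
    reason: str
    confidence: Optional[float]     # None when no match was scored
    reviewer: Optional[str] = None
    override_label: Optional[str] = None

    def override(self, reviewer: str, new_label: str) -> None:
        """Record a human correction without erasing the original call."""
        self.reviewer = reviewer
        self.override_label = new_label

    @property
    def final_label(self) -> str:
        """The human override wins when one exists."""
        return self.override_label or self.label


decision = Decision(
    entry="Prestige Initiale Grand Cru / François Girard / 2010",
    label="near-duplicate",
    compared_against=["Prestige Initiale / Domaine François Girard / 2010"],
    reason="Same producer, region, and vintage; name differs only by 'Grand Cru'.",
    confidence=0.91,
)
print(decision.final_label)  # near-duplicate

decision.override("hypothetical_reviewer", "unique")
print(decision.final_label)  # unique
print(decision.label)        # near-duplicate (original call is preserved)
```

Keeping the original label after an override means the team can later see both what was proposed and what was approved.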


It's Not Just a Wine Problem

The patterns — inconsistent naming, supplier abbreviations, duplicate entries from multiple sources, variant formats across import batches — exist in any industry with complex product catalogs.

Spirits distributors deal with the same thing. So do cosmetics aggregators managing shade variants across 500 brands. Auto parts catalogs where one fitting has twelve names. Specialty food importers with origin, grade, and certification data that never matches across suppliers.

The domain vocabulary changes. The data problem is identical.


This is the second post in our series on wine catalog data. Read the first: How We Helped a Wine ERP Stop Losing Weekends to Data Cleanup.

Reflekt Lab builds AI-powered data cleanup for product catalogs. Based in Bordeaux. Let's talk.