K-Means Customer Segmentation on US Retail Data

A walkthrough on public US retail data of the same customer-segmentation methodology I run with enterprise retail clients. Methodology is real. Numbers are real model output from this dataset; in production deployments the absolute lift is larger because the customer base is bigger and the action loops are tighter.

The Challenge

Customer acquisition keeps getting more expensive in every retail category. The natural response is "invest more in retention" — but retention budgets only compound when they're aimed at the right customers. In practice, most marketing organizations have a customer database, a built-in 3-way segmentation (Consumer / Corporate / Home Office in this case), and a marketing calendar that treats those three buckets as if everyone inside them is roughly the same.

They aren't. Inside the 793-customer base in this dataset, annual revenue per customer spans $4 at the small end and $25,000+ at the top. A flat marketing strategy spends the same dollars per head on both, which means it overpays the bottom and underpays the top. The job: turn a noisy 4-year transaction log into actionable customer tiers, with a numeric basis for who gets premium treatment, who gets reactivation campaigns, and who gets — honestly — minimum viable spend.

Classic RFM (Recency, Frequency, Monetary) is the standard starting point. But it has a known weakness: Monetary alone hides important variation. A customer who places 8 small orders and one who places 1 huge order can both look identical on M, despite needing radically different marketing. The fix is the LRFMC model — an extension of RFM originally developed for airline customer-value analysis — which adds two dimensions: relationship Length (tenure) and a per-order value Coefficient (avg basket size).

The Approach

I ran the data through five steps before drawing any conclusions about tiers. Each step is about earning the right to act on the next.

1. Audit and clean. 9,994 line items across 5,009 unique orders and 793 customers, spanning 2015-01-03 to 2018-12-30. Drop the empty rows. Parse the dates. No exotic data quality issues here — this dataset is clean by design, but in production this is where most of the project time actually goes.

2. Engineer LRFMC features per customer. Aggregate transaction-level data to the customer level. For each Customer ID: L = days from first order, R = days since last order, F = distinct invoices, M = total Sales, C = avg basket size (M / F). The C dimension is the one most teams miss — it captures per-order size, which is independent of frequency and reveals whales hiding inside otherwise-average customers.

3. Standardize. StandardScaler on all five features. K-Means is a distance-based algorithm; without standardization, Monetary (thousands of dollars) would dominate Frequency (single-digit orders).

4. Pick k with Calinski-Harabasz, not intuition. Sweep K-Means with k from 3 to 8. Score each model on the Calinski-Harabasz index — the ratio of between-cluster dispersion to within-cluster dispersion. Higher is better. The textbook playbook says k=5; the data argued for k=6.

The extra cluster turns out to be operationally meaningful: it cleanly separates a long-tenured-but-dormant cohort from the lower-spending recent joiners, which would otherwise be averaged together and given the same generic offer.

5. Read the clusters as value pentagons. Cluster centroids in standardized LRFMC space are hard to read as a table of numbers. The pentagon visualization makes them readable at a glance — a spike on M and C with low R is your gold tier; a flat shape with high R is a churned-but-not-yet-archived customer.

The Outcome

Six tiers emerged, each with a sharply different value pentagon — mapped to operational marketing actions:

VIP Whales (15 customers, 1.9%) — $12,387 avg revenue, $2,695 avg basket. Named-account retention, personal account managers, premium customer programs. Loss = catastrophic to the P&L.
Loyal Heavy Buyers (107 customers, 13.5%) — $6,907 avg revenue, 8.1 orders, recent activity. Premium retention budget. Cross-sell into Technology and Furniture from their Office Supplies base.
Steady Mid-Tier (221 customers, 27.9%) — $3,012 avg revenue, 8.9 orders. High frequency, low basket. Volume backbone. Targeted upsell to grow basket size, not order count.
Recent Joiners (109 customers, 13.7%) — Short tenure, still building history. Structured onboarding. Goal: graduate them to Mid-Tier or Loyal Heavy within 6 months.
Light Engagement (249 customers, 31.4%) — The modal customer. Tenured but low everything. Minimum-viable marketing — email only. Don't allocate CAC dollars here.
Dormant (92 customers, 11.6%) — 561+ days since last order. One reactivation attempt. If silent after that, archive from active marketing roster.

The headline: the top 15.4% of customers (VIP Whales + Loyal Heavy Buyers) generate 40.3% of revenue, while the bottom 43% (Light Engagement + Dormant) generate only 22.6%. A flat marketing budget treats them the same; an LRFMC-segmented budget routes 2× more spend per head to the top tier — and matches what the data actually argues.

Business Impact

In production, this segmentation gets rerun monthly. A customer's tier is a function of recent behaviour, not a permanent label. A Recent Joiner who graduates to Loyal Heavy within 6 months is a marketing success; a Loyal Heavy who drifts toward Dormant is a retention emergency. The model earns its keep when it surfaces those transitions — not when it freezes a snapshot.

The pattern this analysis crystallizes: cheap unsupervised models on the right business question move the P&L further than expensive supervised models on the wrong one. K-Means is a 60-year-old algorithm; LRFMC is a 20-year-old framework. Applied to real US retail transactional data, they re-prioritise a four-year marketing spend in an afternoon.

Key Takeaways

Always engineer C alongside M. Avg basket size separates whales from frequent small-ticket buyers — the two have identical Monetary totals but radically different marketing economics.
Don't pick k from intuition. Calinski-Harabasz takes one extra pass and tells you whether the data actually has the structure you assumed.
Tiers are perishable. Recompute monthly. The transitions between tiers carry more signal than the tiers themselves.
Read clusters in raw units, not z-scores, when briefing the business. "$12K avg basket" is actionable; "z = 3.6 on M" is not.