Product Data Science Case Study

Merchant Growth Intelligence

Which merchants are silently failing, and what should the platform build to save them?

Olist Brazilian E-Commerce dataset  |  ~100,000 orders  |  ~3,000 sellers  |  2016 to 2018

3k Sellers analyzed
24.2% Silently failing
81% GMV from Champions
4 segments Actionable tiers

Most merchants look fine on the surface.

The intuitive way to measure merchant health is to look at review scores. When most merchants cluster near 4 to 5 stars, the platform appears healthy. But the data suggests this picture misses a significant portion of the problem.

High-rated merchants (average review score above 4.0)
24.2%

of merchants with review scores above 4.0 have declining order volume. Their reviews look fine. Their businesses are quietly shrinking.

Based on 641 merchants with enough order history to compute a trajectory (first half vs. second half of the dataset). One limitation: this uses order volume, not true churn, since the dataset does not record account deactivations.
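As a sketch, the trajectory flag described above could be computed roughly like this. Column names follow the Olist schema for orders joined to sellers; the 10-order minimum and the simple midpoint split are assumptions for illustration, not the exact thresholds used in the analysis.

```python
import pandas as pd

def trajectory_flags(orders: pd.DataFrame, min_orders: int = 10) -> pd.DataFrame:
    """Flag sellers whose order volume fell from the first half of the
    dataset to the second half (a proxy for decline, not true churn)."""
    midpoint = orders["order_purchase_timestamp"].median()
    orders = orders.assign(half=orders["order_purchase_timestamp"] > midpoint)
    counts = (orders.groupby(["seller_id", "half"]).size()
                    .unstack(fill_value=0)
                    .rename(columns={False: "first_half", True: "second_half"}))
    # Keep only sellers with enough order history to compute a trajectory.
    counts = counts[counts.sum(axis=1) >= min_orders]
    counts["declining"] = counts["second_half"] < counts["first_half"]
    return counts
```

A seller with 8 orders before the midpoint and 2 after would be flagged as declining; a seller with fewer than 10 total orders is excluded rather than guessed at.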

Delivery reliability matters.
Review scores lag behind.

To see whether this is worth acting on, I ran a simple statistical test: which early signals best predict whether a merchant ends up declining? The answer reframed the whole analysis.

Significant predictor: early-period late delivery rate
p = 0.028 (two-sample t-test, declining vs. growing)
Declining merchants: 6.0% late rate
Growing merchants: 3.9% late rate
Point-biserial r = 0.071

Not significant: early-period average review score
p = 0.267 (two-sample t-test, declining vs. growing)
Declining merchants: 4.22 avg score
Growing merchants: 4.28 avg score
The difference is not statistically detectable.
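This comparison can be reproduced in outline with scipy. The arrays below are simulated stand-ins for the per-merchant early-period late rates (group means taken from the figures above; group sizes, spread, and the Welch variant of the t-test are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-merchant late-delivery rates; the real analysis would use
# each seller's early-period late rate computed from the Olist order data.
declining_late = rng.normal(0.060, 0.04, 180).clip(0, 1)
growing_late = rng.normal(0.039, 0.04, 461).clip(0, 1)

# Two-sample t-test: does the mean late rate differ between the groups?
t, p = stats.ttest_ind(declining_late, growing_late, equal_var=False)

# Point-biserial correlation: the same question framed as a correlation
# between a binary outcome (declining = 1) and the continuous late rate.
outcome = np.r_[np.ones_like(declining_late), np.zeros_like(growing_late)]
late = np.r_[declining_late, growing_late]
r, p_r = stats.pointbiserialr(outcome, late)
```

The t-test answers "do the group means differ"; the point-biserial r adds a sense of effect size, which is why a significant p can coexist with a small r.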

The data suggests that delivery reliability is a leading indicator of merchant decline, while review scores are a lagging one. By the time a merchant's review score drops noticeably, the underlying fulfillment problems have usually been building for a while.

The regression analysis (Part 4) reinforces this: each 10-percentage-point increase in late delivery rate costs 0.138 review points (p < 0.001, 95% CI: -0.173 to -0.103), roughly 4x the per-unit effect of raw shipping speed. Buyers care about whether the delivery promise was kept, not just how long it took. The freight-to-price ratio is a secondary lever: each unit increase in avg_freight_ratio costs an additional 0.122 review points (p = 0.003, 95% CI: -0.203 to -0.041), which points to pricing tooling as a second area worth investigating after fulfillment.
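As an illustrative sketch (not the original model), the regression could be fit with statsmodels. The data here is simulated, with coefficients seeded to roughly echo the reported effects; variable names like avg_freight_ratio mirror the ones above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000

# Simulated merchant-level features standing in for the Olist data.
df = pd.DataFrame({
    "late_rate": rng.uniform(0, 0.3, n),
    "avg_freight_ratio": rng.uniform(0, 1, n),
})
# True coefficients chosen to echo the reported effects:
# -1.38 per unit of late_rate is -0.138 per 10 percentage points.
df["avg_review"] = (4.5
                    - 1.38 * df["late_rate"]
                    - 0.122 * df["avg_freight_ratio"]
                    + rng.normal(0, 0.2, n))

model = smf.ols("avg_review ~ late_rate + avg_freight_ratio", data=df).fit()
ci = model.conf_int().loc["late_rate"]  # 95% CI for the late-rate coefficient
```

Reading the fitted `model.params["late_rate"]` in per-10pp units (divide by 10) is what makes the "-0.138 review points per 10pp" framing comparable to the other coefficients.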

I'd want to verify this pattern on a larger dataset before making strong product decisions around it. The Olist dataset covers two years of one marketplace. The directional finding seems solid, but the specific numbers would need validation on Shopify's own data.

Four segments. Four different interventions.

With delivery reliability established as the key signal, the next question is which merchants need attention most. Using RFM (recency, frequency, monetary) features and K-means clustering (k=4, validated with silhouette analysis), the ~3,000 sellers fall into four actionable tiers, each paired with a product recommendation.
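A minimal sketch of the segmentation step, assuming an RFM table with one row per seller. The data below is simulated (the real features come from the Olist orders), and the log transform before scaling is an assumption for handling skew:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical RFM table: recency (days since last order), frequency
# (order count), monetary (total GMV), one row per seller.
rfm = pd.DataFrame({
    "recency": rng.exponential(120, 500),
    "frequency": rng.lognormal(1.5, 1.0, 500),
    "monetary": rng.lognormal(5.0, 1.2, 500),
})

# Log-scale the skewed features, then standardize before clustering.
X = StandardScaler().fit_transform(np.log1p(rfm))

# Silhouette analysis: score candidate k values before settling on k=4.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, candidate)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
rfm["segment"] = labels
```

In practice the cluster indices are arbitrary; mapping them to names like "Champions" requires inspecting each cluster's median RFM profile.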

Champions — 645 merchants · 80.9% of GMV · 6.9% late rate · 65 median orders
Product recommendation: Champions drive 81% of GMV and have the highest late rate of any active segment (6.9%). A fulfillment performance dashboard showing their late rate trend over time, benchmarked against their category median, could help them spot problems before reviews drop. Losing even a few Champions has outsized GMV impact.
Top categories: health_beauty · sports_leisure · housewares

Rising Stars — 999 merchants · 15% of GMV · 29.3% declining · 10 median orders
Product recommendation: Nearly 1 in 3 Rising Stars is stalling despite healthy reviews (4.29 median). A weekly notification showing order volume trend vs. similar merchants in their category, paired with one specific suggestion, would reach them while there is still time to act. This is the highest-leverage intervention in the funnel.
Top categories: health_beauty · bed_bath_table · computers_accessories

At Risk — 748 merchants · 1.2% of GMV · 4.67 avg review · 2 median orders
Product recommendation: At Risk merchants have the highest review scores (4.67) but only 2 median orders. Buyers are happy when they do purchase; the barrier appears to be discoverability, not quality. The platform could test a "first 30 days" program: boosted visibility in search results or category pages for new merchants until they hit an order threshold.
Top categories: housewares · baby · fashion_accessories

Dormant — 568 merchants · 406 days inactive · 52.8% were declining · 100% churned
Product recommendation: Over half of Dormant merchants were already on a declining trajectory before going inactive, so reactivation campaigns are low-yield. The more valuable use of this segment is as a diagnostic signal: their early patterns can inform when to trigger alerts for current At Risk and Rising Star merchants.
Top categories: books_general_interest · housewares · toys_games

A leaky funnel.

These four segments describe stages in a merchant lifecycle, one that has clear leakage points the platform can address.

Funnel: At Risk (748 merchants · 2 median orders · 4.67 avg review) → Rising Stars (999 · 10 median orders · 29.3% stalling) → Champions (645 · 65 median orders · 80.9% of GMV: the platform goal), with leakage at each stage into Dormant (568 · 52.8% were declining).

At Risk merchants join the platform but rarely reach traction (2 median orders). The ones who break through become Rising Stars, but 29.3% of those are stalling. The Rising Stars who sustain growth become Champions, the 22% of sellers driving 81% of GMV. Dormant merchants show what happens when the platform misses the signals at each earlier stage: more than half were already on a declining trajectory before they went inactive.

The interventions map to stages of this funnel. Onboarding programs for At Risk, proactive alerts for Rising Stars, scale-aware fulfillment tools for Champions. Each is a different problem requiring a different solution.

Build proactive merchant health alerts.

The decision memo synthesizes the analysis into a single recommendation. Start with Rising Stars, where the data shows the clearest combination of urgency and opportunity.

What to build

A weekly notification showing each merchant's on-time delivery rate trend over the last 30 days, benchmarked against the category median. When a merchant's rate drops below the category median, the alert escalates with a specific, actionable suggestion.

Example alert
"Your on-time rate dropped to 82% this month. Merchants in health_beauty with rates above 90% fulfill 2 days faster on average. Consider adjusting your estimated delivery windows."
1. Start with Rising Stars

29.3% are declining, but they are still active (28-day median recency) and their fundamentals are solid (4.29 reviews, 5.0% late rate). They are at the inflection point between growth and decline, which is where there is still time to intervene.

2. Extend to Champions over time

Champions have the highest late rate of any active segment (6.9%), because fulfillment strain increases with volume. The same alert infrastructure can extend to Champions with tuned thresholds appropriate for their scale.

3. Use Dormant patterns as training signal

52.8% of Dormant merchants were already declining before going inactive. Their early trajectories can inform the pattern-matching for when to trigger alerts for current merchants, without needing to build new data collection.

How to test it.

Before shipping the feature, I designed a randomized experiment using variance from the actual Olist data. The power analysis surfaced something important about how the guardrail metric should be measured.

Parameter | Value | Notes
Primary metric | On-time delivery rate | Measured at the seller level
Randomization unit | Seller | Avoids contamination within a single merchant
MDE | 5 percentage points | Grounded in the observed gap between declining and growing merchants
Required N per group | 34 merchants (68 total) | 5.5% of the 1,226 eligible sellers; not sample-limited
Eligible pool | 1,226 merchants | Sellers with 10+ orders in the dataset
Runtime | 6 weeks | 4-week observation + 2-week buffer; observation-limited, not sample-limited
GMV guardrail (merchant level) | 6.4% power at N=34 | Effectively random; would need 2,101 per group to detect a 10% decrease
GMV guardrail (order level) | 99.9% power | ~3,812 orders/month from eligible sellers; resolves in under 4 weeks
6.4%

The guardrail problem

This was the finding I want to highlight most from the experiment design. At the planned sample size of 34 merchants per group, the GMV guardrail has only 6.4% statistical power. That means if the alerts were secretly harming merchants' revenue by 10%, we would fail to detect it 93.6% of the time. The guardrail would be decorative.

The fix is to measure GMV at the order level rather than the merchant level. Individual order values have lower variance than merchant-level aggregates, and there are many more of them. The eligible sellers generate roughly 3,812 orders per month, which gets us to the required 1,233 orders per group in under 4 weeks. Same observation window, dramatically more power (99.9%).
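The power calculations can be reproduced in outline with statsmodels. The standardized effect sizes below are illustrative placeholders, since the exact values depend on the observed Olist GMV variance at each level of aggregation:

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Merchant-level guardrail: high merchant-to-merchant GMV variance makes
# a 10% drop a tiny standardized effect (d = 0.06 is a placeholder).
# With only 34 merchants per group, power barely exceeds the alpha level.
merchant_power = power.power(effect_size=0.06, nobs1=34, alpha=0.05)

# Order-level measurement: lower variance means a larger standardized
# effect (d = 0.1 is a placeholder), and the required sample accrues in
# orders per week rather than merchants.
order_n = power.solve_power(effect_size=0.1, alpha=0.05, power=0.8)
```

This is the quantitative form of the "decorative guardrail" point: the same hypothesis test goes from near-useless to well-powered purely by changing the unit of analysis.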

I initially expected all metrics to use the same unit of analysis. Running the power calculation showed me why that assumption breaks down when metric variance differs dramatically across levels. This is something I'd want to discuss with more experienced team members before finalizing the design.