Which merchants are silently failing, and what should the platform build to save them?
The intuitive way to measure merchant health is to look at review scores. When most merchants cluster near 4 to 5 stars, the platform appears healthy. But the data suggests this picture misses a significant portion of the problem.
A substantial share of merchants with review scores above 4.0 have declining order volume. Their reviews look fine; their businesses are quietly shrinking.
Based on 641 merchants with enough order history to compute a trajectory (first half vs. second half of the dataset). One limitation: this uses order volume, not true churn, since the dataset does not record account deactivations.
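The trajectory label can be sketched in pandas. The column names (`seller_id`, `order_date`), the 50/50 time split, and the 10-order threshold are assumptions for illustration, not the exact pipeline:

```python
import pandas as pd

def label_trajectories(orders: pd.DataFrame, min_orders: int = 10) -> pd.Series:
    """Flag a seller as declining when their order count in the second half
    of the dataset's time span is lower than in the first half.
    Hypothetical columns: seller_id, order_date."""
    start, end = orders["order_date"].min(), orders["order_date"].max()
    midpoint = start + (end - start) / 2
    halves = orders.assign(second=orders["order_date"] > midpoint)
    counts = halves.groupby(["seller_id", "second"]).size().unstack(fill_value=0)
    counts.columns = ["first_half", "second_half"]
    # Keep only sellers with enough history for the comparison to mean anything
    eligible = counts[counts.sum(axis=1) >= min_orders]
    return eligible["second_half"] < eligible["first_half"]
```

Comparing fixed calendar halves is crude (it ignores seasonality), which is one reason the "declining" label is a proxy rather than true churn.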
To see whether this is worth acting on, I ran a simple statistical test: which early signals best predict whether a merchant ends up declining? The answer reframed the whole analysis.
The data suggests that delivery reliability is a leading indicator of merchant decline, while review scores are a lagging one. By the time a merchant's review score drops noticeably, the underlying fulfillment problems have usually been building for a while.
The regression analysis (Part 4) reinforces this: each 10-percentage-point increase in late delivery rate costs 0.138 review points (p < 0.001, 95% CI: -0.173 to -0.103), roughly 4x the per-unit effect of raw shipping speed. Buyers care about whether the delivery promise was kept, not just how long it took. The freight-to-price ratio is a secondary lever: each unit increase in avg_freight_ratio costs an additional 0.122 review points (p = 0.003, 95% CI: -0.203 to -0.041), which points to pricing tooling as a second area worth investigating after fulfillment.
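The shape of that regression can be reproduced with plain numpy least squares. The column layout below (intercept, late rate, delivery days, freight ratio) is an assumption about the Part 4 model, not its exact specification:

```python
import numpy as np

def fit_ols(X: np.ndarray, y: np.ndarray):
    """OLS coefficients and standard errors.

    X should include an intercept column. With late_rate scaled 0-1, a
    coefficient of -1.38 corresponds to -0.138 review points per 10pp.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = (resid @ resid) / dof           # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se
```

A 95% CI is then `beta ± 1.96 * se`, the same form as the intervals quoted above (though those values come from the real data, not this sketch).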
I'd want to verify this pattern on a larger dataset before making strong product decisions around it. The Olist dataset covers two years of one marketplace. The directional finding seems solid, but the specific numbers would need validation on Shopify's own data.
With delivery reliability established as the key signal, the next question is which merchants need attention most. Using RFM (recency, frequency, monetary) features and K-means clustering (k=4, validated with silhouette analysis), the ~3,000 sellers fall into four actionable tiers, each with its own product recommendation.
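A minimal version of the clustering step, using a hand-rolled Lloyd's iteration to keep the sketch dependency-light (in practice scikit-learn's `KMeans` plus `silhouette_score` is the natural choice). The feature layout and standardization are assumptions:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 100):
    """Minimal Lloyd's algorithm. X: standardized RFM features, one row per
    seller (e.g. z-scored recency_days, order_count, total_gmv)."""
    # Simple deterministic init: evenly spaced rows of X
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # Assign each seller to its nearest center (squared Euclidean distance)
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        new = np.array([
            X[labels == j].mean(0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Standardizing the RFM features first matters: monetary value spans orders of magnitude more than recency in days, and unscaled K-means would cluster on GMV alone.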
These four segments describe stages in a merchant lifecycle, one that has clear leakage points the platform can address.
At Risk merchants join the platform but rarely reach traction (a median of 2 orders). The ones who break through become Rising Stars, but 29.3% of those are stalling. The Rising Stars who sustain growth become Champions, the 22% of sellers driving 81% of GMV. Dormant merchants show what happens when the platform misses the signals at each earlier stage: more than half were already on a declining trajectory before they went inactive.
The interventions map to stages of this funnel. Onboarding programs for At Risk, proactive alerts for Rising Stars, scale-aware fulfillment tools for Champions. Each is a different problem requiring a different solution.
The decision memo synthesizes the analysis into a single recommendation. Start with Rising Stars, where the data shows the clearest combination of urgency and opportunity.
The feature: a weekly notification showing each merchant's on-time delivery rate trend over the last 30 days, benchmarked against the category median. When a merchant's rate drops below the category median, the alert escalates with a specific, actionable suggestion.
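The trigger logic is straightforward to sketch in pandas. The column names and the below-median escalation rule are assumptions about how the feature would be wired:

```python
import pandas as pd

def weekly_alerts(orders: pd.DataFrame, asof: pd.Timestamp) -> pd.DataFrame:
    """30-day on-time rate per seller, benchmarked against the median across
    sellers in the same category. Hypothetical columns: seller_id, category,
    order_date, on_time (1 if delivered by the promised date)."""
    window = orders[orders["order_date"] > asof - pd.Timedelta(days=30)]
    rates = (
        window.groupby(["category", "seller_id"])["on_time"]
        .mean()
        .rename("on_time_rate")
        .reset_index()
    )
    rates["category_median"] = (
        rates.groupby("category")["on_time_rate"].transform("median")
    )
    # Escalate (with a specific suggestion) when a seller falls below the benchmark
    rates["escalate"] = rates["on_time_rate"] < rates["category_median"]
    return rates
```

Benchmarking within category rather than platform-wide matters because baseline fulfillment difficulty differs by product type (furniture vs. phone cases).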
29.3% of Rising Stars are declining, but they are still active (28-day median recency) and their fundamentals are solid (4.29 average review score, 5.0% late rate). They are at the inflection point between growth and decline, which is where there is still time to intervene.
Champions have the highest late rate of any active segment (6.9%), because fulfillment strain increases with volume. The same alert infrastructure can extend to Champions with tuned thresholds appropriate for their scale.
52.8% of Dormant merchants were already declining before going inactive. Their early trajectories can inform the pattern-matching for when to trigger alerts for current merchants, without needing to build new data collection.
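One way to operationalize that: truncate each dormant seller's history some lead time before their final sale, then compute the same features the live alert uses, yielding labeled examples for tuning the trigger thresholds. The column names and the 90-day lead are assumptions:

```python
import pandas as pd

def dormant_training_rows(orders: pd.DataFrame, lead_days: int = 90) -> pd.DataFrame:
    """Features from each dormant seller's history as it looked `lead_days`
    before their final order, so thresholds are tuned on what was knowable
    in advance. Hypothetical columns: seller_id, order_date, on_time."""
    last = orders.groupby("seller_id")["order_date"].transform("max")
    early = orders[orders["order_date"] <= last - pd.Timedelta(days=lead_days)]
    return early.groupby("seller_id").agg(
        n_orders=("order_date", "size"),
        on_time_rate=("on_time", "mean"),
    )
```

Truncating before the final order is the key step: it prevents the label from leaking the outcome into the features.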
Before shipping the feature, I designed a randomized experiment using variance from the actual Olist data. The power analysis surfaced something important about how the guardrail metric should be measured.
| Parameter | Value | Notes |
|---|---|---|
| Primary metric | On-time delivery rate | Measured at seller level |
| Randomization unit | Seller-level | Avoids contamination within a single merchant |
| MDE | 5 percentage points | Grounded in observed gap between declining and growing merchants |
| Required N per group | 34 merchants (68 total) | 5.5% of the 1,226 eligible sellers. Not sample-limited. |
| Eligible pool | 1,226 merchants | Sellers with 10+ orders in the dataset |
| Runtime | 6 weeks | 4-week observation + 2-week buffer. Observation-limited, not sample-limited. |
| GMV guardrail (merchant level) | 6.4% power at N=34 | Effectively random. Would need 2,101 per group to detect 10% decrease. |
| GMV guardrail (order level) | 99.9% power | ~3,812 orders/month from eligible sellers. Resolves in under 4 weeks. |
This is the finding I most want to highlight from the experiment design. At the planned sample size of 34 merchants per group, the GMV guardrail has only 6.4% statistical power. That means if the alerts were secretly harming merchants' revenue by 10%, we would fail to detect it 93.6% of the time. The guardrail would be decorative.
The fix is to measure GMV at the order level rather than the merchant level. Individual order values have lower variance than merchant-level aggregates, and there are many more of them. The eligible sellers generate roughly 3,812 orders per month, which gets us to the required 1,233 orders per group in under 4 weeks. Same observation window, dramatically more power (99.9%).
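The power arithmetic behind these numbers can be reproduced with a stdlib normal approximation. The effect-to-noise ratio in the usage below is a hypothetical stand-in for the real GMV variance, chosen only to illustrate the merchant-level vs. order-level gap:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect: float, sd: float,
                     n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample z-test to detect a
    difference in means of `effect`, given per-unit standard deviation `sd`."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standardized effect size for the difference of two group means
    ncp = effect / (sd * sqrt(2 / n_per_group))
    return 1 - NormalDist().cdf(z_crit - ncp)
```

With a noisy merchant-level metric (sd on the order of 10x the target effect), power at n=34 sits barely above the 5% false-positive floor, while thousands of per-order observations push it toward 1, the same shape as the 6.4% vs. 99.9% contrast in the table.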
I initially expected all metrics to use the same unit of analysis. Running the power calculation showed me why that assumption breaks down when metric variance differs dramatically across levels. This is something I'd want to discuss with more experienced team members before finalizing the design.