The complete analytical pipeline: data cleaning decisions, regression outputs with CIs, segmentation validation, power analysis, and the Ollama/Llama 3 AI summary implementation.
The Olist dataset has nine tables. The goal was to collapse everything into a single merchant-level feature table (one row per seller) with aggregated metrics suitable for regression and segmentation.
The analysis started with a skeptical question: does a high review score actually mean a merchant is healthy? The answer shapes everything that follows.
Merchant segmentation used RFM (recency, frequency, monetary) features, log-transformed and standardized, with K-means clustering. The k selection involved a genuine trade-off between compactness and interpretability.
The silhouette score peaks at k=3 (0.418) and drops at k=4 (0.372). I chose k=4 because the four resulting clusters map cleanly to four actionable product tiers (Champions, Rising Stars, At Risk, Dormant), while the k=3 solution collapsed two meaningful groups into one. The 0.046-point drop in silhouette felt like a reasonable trade-off for the added interpretability, but this is a judgment call worth discussing.
| Segment | Count | GMV Share | Median Orders | Recency (d) | Avg Review | Late Rate | % Declining |
|---|---|---|---|---|---|---|---|
| Champions | 645 | 80.9% | 65 | 12 | 4.20 | 6.9% | 15.4% |
| Rising Stars | 999 | 15.0% | 10 | 28 | 4.29 | 5.0% | 29.3% |
| At Risk | 748 | 1.2% | 2 | 44 | 4.67 | 0.0% | 30.0% |
| Dormant | 568 | 2.9% | 2 | 406 | 4.50 | 0.0% | 52.8% |
One thing worth noting: At Risk and Rising Stars are separated primarily by recency and frequency, not by review score. At Risk merchants actually have higher review scores (4.67 vs. 4.29), which reinforces the finding that review scores do not distinguish struggling merchants from growing ones within the active seller population.
The regression used statsmodels OLS (not sklearn) to get interpretable p-values and confidence intervals. The dependent variable is average review score per merchant. Filtered to sellers with 10+ orders (n=1,226, 90.7% of GMV).
# OLS Regression Results — Dependent variable: avg_review_score
# n=1,226 merchants (sellers with 10+ orders)
═══════════════════════════════════════════════════════════════════
R-squared: 0.343 Adj. R-squared: 0.334
F-statistic: 39.48 Prob (F-stat): 3.17e-94
Observations: 1,226
───────────────────────────────────────────────────────────────────
coef p-value [95% CI lo 95% CI hi]
pct_orders_late -1.3800 1.25e-14 -1.7267 -1.0333
avg_freight_ratio -0.1221 3.05e-03 -0.2027 -0.0414
avg_shipping_days -0.0351 8.96e-33 -0.0407 -0.0295
num_categories -0.0208 4.02e-02 -0.0407 -0.0009
avg_price 0.0004 7.71e-03 0.0001 0.0007
avg_delivery_delay_d -0.0003 0.930 not significant when late rate controlled
order_count -0.0002 0.217 not significant
───────────────────────────────────────────────────────────────────
# Shapiro-Wilk on residuals: W=0.930, p=0.0000 → OLS still unbiased (n=1,226 » 30)
SHAP values were computed on a parallel sklearn model and show consistent ranking. The feature importance ordering matches the OLS coefficient picture, which builds confidence in the result.
The power analysis used observed variance from the 1,226 eligible merchants. It produced one surprising finding that changed the experiment design: the GMV guardrail is severely underpowered at the planned sample size unless we change the unit of analysis.
| Metric | Unit | MDE | Sigma | N per group | Power |
|---|---|---|---|---|---|
| On-time delivery rate | Merchant | 5 pp | 7.17 pp | 34 | 80.0% |
| GMV guardrail | Merchant | 10% decrease | log-GMV: 1.22 | 2,101 (needed) | 6.4% at N=34 |
| GMV guardrail | Order | 10% decrease | log-order: 0.92 | 1,233 orders | 99.9% |
Structured merchant metrics are converted into plain-language health briefs using a locally-running LLM. No API key, no data leaves the machine. The goal is to show that AI tools can make analytical outputs accessible to non-technical stakeholders (account managers, merchant success teams) without new data collection.
SYSTEM: You are a merchant success analyst at an e-commerce marketplace. Your job is to write a brief, plain-language health summary for a specific merchant, based only on the structured metrics provided. Do not invent numbers, percentages, or facts that are not in the input. Write exactly 3 to 4 sentences. Structure your response as follows: 1. State the merchant's segment and top product category, and characterise their overall business trajectory in one sentence. 2. Highlight their single strongest metric (what they are doing well). 3. Identify their single most important risk or weakness, using the data. 4. Recommend one specific, actionable intervention the platform should take. Keep the language direct and concrete. A merchant success manager should be able to read this brief in 20 seconds and know exactly what to do. # Temperature 0.3 — reduces hallucination, keeps output grounded in input # num_predict 250 — caps length to enforce brief format
Merchant profile: - Segment: At Risk - Top category: Housewares - Total GMV: R$532 - Total orders: 3 - Average review score: 5.00 / 5.0 - On-time delivery rate: 100.0% - Average shipping days: 7.1 days - Recency: 114 days since last order - Order volume trajectory: Declining (-50% order volume change H1 vs H2)
data/processed/merchant_briefs.json.
Each notebook runs end-to-end and corresponds to one analytical module. Links go to the GitHub repository.