Technical Deep Dive

Methodology,
models, and
diagnostics.

The complete analytical pipeline: data cleaning decisions, regression outputs with CIs, segmentation validation, power analysis, and the Ollama/Llama 3 AI summary implementation.

From nine raw CSVs to
one merchant feature table.

The Olist dataset has nine tables. The goal was to collapse everything into a single merchant-level feature table (one row per seller) with aggregated metrics suitable for regression and segmentation.

[Pipeline diagram: 9 raw Olist CSVs (100k orders) → clean and merge (handle nulls, translate categories; join on order_id, seller_id, product_id) → feature engineering (RFM + OLS features, aggregated to merchant level, ~3,000 sellers) → merchant_features table (2,960 rows, 30+ cols).]
Category translation Applied Portuguese-to-English mapping from the product_category_name_translation table early in the pipeline, before any analysis. Categories like "informatica_acessorios" become "computers_accessories".
Delivery date filtering Filtered to orders with status "delivered" that had valid purchase and delivery timestamps. Orders with missing estimated delivery dates were excluded from late delivery rate calculations (~3% of orders).
Merchant-level aggregation All order-level data was aggregated to seller level: sum for GMV, mean for review scores and shipping days, count for orders, max for recency calculation. Weighted averages were not used because order weighting would favor high-volume merchants.
10-order filter for regression Merchants with fewer than 10 orders were excluded from regression analysis. Below 10 orders, per-merchant metric estimates (especially on-time rate and average review score) are too noisy to be meaningful. This retained 1,226 of 2,960 sellers (41%) covering 90.7% of GMV.
Trajectory split The dataset midpoint (September 7, 2017) split the date range roughly in half. Merchants with orders in both halves received a trajectory label based on order volume change (H1 to H2): growing, stable, or declining.
Geographic consolidation Brazilian states with fewer than 15 sellers were grouped into an "Other" category for the regression. The reference state was Distrito Federal (DF). All state coefficients are relative to that baseline.
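The aggregation and trajectory rules above can be sketched in pandas. This is a minimal illustration on toy data: the column names and the simple "order count went up or down" trajectory rule are stand-ins for the pipeline's exact schema and thresholds.

```python
import pandas as pd

# Toy order-level frame standing in for the merged Olist tables
# (columns are illustrative, not the exact pipeline schema).
orders = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2", "s2", "s2"],
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "price": [100.0, 50.0, 20.0, 30.0, 25.0],
    "review_score": [5, 4, 3, 5, 4],
    "shipping_days": [7, 9, 12, 10, 11],
    "purchase_ts": pd.to_datetime(
        ["2017-03-01", "2018-06-01", "2017-01-15", "2017-12-01", "2018-02-01"]),
})

# Sum for GMV, mean for reviews/shipping, count for orders,
# max for recency — the aggregation rules described above.
merchant_features = orders.groupby("seller_id").agg(
    gmv=("price", "sum"),
    avg_review_score=("review_score", "mean"),
    avg_shipping_days=("shipping_days", "mean"),
    order_count=("order_id", "count"),
    last_order=("purchase_ts", "max"),
)

# Trajectory label: compare order volume before and after the dataset midpoint.
midpoint = pd.Timestamp("2017-09-07")
h1 = orders[orders.purchase_ts <= midpoint].groupby("seller_id").size()
h2 = orders[orders.purchase_ts > midpoint].groupby("seller_id").size()
change = (h2.reindex(merchant_features.index, fill_value=0)
          - h1.reindex(merchant_features.index, fill_value=0))
merchant_features["trajectory"] = change.map(
    lambda d: "growing" if d > 0 else ("declining" if d < 0 else "stable"))
print(merchant_features)
```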

The review score is
a lagging indicator.

The analysis started with a skeptical question: does a high review score actually mean a merchant is healthy? The answer shapes everything that follows.

Review Score Distribution vs. Declining Order Trajectory (merchants with 4.0+ avg score, n=641)
[Schematic scatter: x-axis = average review score in H1 (4.0 to 4.8+); y-axis = order volume change, H1 to H2 (-50% to +50%); high-score declining merchants annotated "silently failing".]
Takeaway: Red points (declining merchants) are scattered across the full review score range, including the 4.0 to 5.0 band. A monitoring system that watches review scores would not distinguish these merchants from their growing neighbors. (Schematic representation of the pattern; see notebook 02 for the exact scatter plot.)

Statistical tests: H1 metrics vs. trajectory

H1 Late Delivery Rate
p = 0.028
t = 2.208  |  r = 0.071
Declining merchants: 6.0% late rate. Growing merchants: 3.9% late rate. The 2.1pp gap is statistically significant. Significant
H1 Average Review Score
p = 0.267
t = -1.111  |  r = -0.023
Declining merchants: 4.22. Growing merchants: 4.28. The difference is not detectable statistically. Not significant
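The comparisons above are two-sample t-tests with a point-biserial effect size. A minimal sketch on simulated data (the exact test variant is not stated in the notebook; Welch's unequal-variance version is assumed here, and the group sizes and spreads are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic H1 late-rate samples (illustrative, not the real data):
declining = rng.normal(0.060, 0.07, 300)   # mean 6.0% late rate
growing = rng.normal(0.039, 0.07, 700)     # mean 3.9% late rate

# Welch's t-test: does the late-rate gap separate the trajectories?
t_stat, p_val = stats.ttest_ind(declining, growing, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")

# Point-biserial r as the effect size reported above
labels = np.concatenate([np.ones(300), np.zeros(700)])
values = np.concatenate([declining, growing])
r, _ = stats.pointbiserialr(labels, values)
print(f"r = {r:.3f}")
```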

K-means on RFM features,
validated with silhouette analysis.

Merchant segmentation used RFM (recency, frequency, monetary) features, log-transformed and standardized, with K-means clustering. The k selection involved a genuine trade-off between compactness and interpretability.
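The preprocessing and k-selection loop can be sketched as follows. The RFM matrix here is synthetic and the silhouette values it produces will not match the real ones; the point is the shape of the procedure (log-transform, standardize, sweep k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic RFM matrix (recency_days, frequency, monetary) — stand-in data
rfm = np.column_stack([
    rng.lognormal(4, 1, 500),    # recency, days
    rng.lognormal(2, 1, 500),    # frequency, orders
    rng.lognormal(6, 1.5, 500),  # monetary, GMV
])

# Log-transform (RFM is heavy-tailed) then standardize, as in the pipeline
X = StandardScaler().fit_transform(np.log1p(rfm))

# Silhouette across candidate k; the final k is a judgment call, not argmax
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```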

Choosing k: silhouette peaked at k=3, but k=4 was selected

The silhouette score peaks at k=3 (0.418) and drops at k=4 (0.372). I chose k=4 because the four resulting clusters map cleanly to four actionable product tiers (Champions, Rising Stars, At Risk, Dormant), while the k=3 solution collapsed two meaningful groups into one. The 0.046-point drop in silhouette felt like a reasonable trade-off for the added interpretability, but this is a judgment call worth discussing.

k=2
Silhouette: 0.401
Too coarse
k=3
Silhouette: 0.418
Best score
k=4
Silhouette: 0.372
Selected (interpretable)
k=5
Silhouette: 0.376
Diminishing returns
k=6+
Silhouette: < 0.34
Over-segmented

Cluster profile table

Segment Count GMV Share Median Orders Recency (d) Avg Review Late Rate % Declining
Champions 645 80.9% 65 12 4.20 6.9% 15.4%
Rising Stars 999 15.0% 10 28 4.29 5.0% 29.3%
At Risk 748 1.2% 2 44 4.67 0.0% 30.0%
Dormant 568 2.9% 2 406 4.50 0.0% 52.8%

One thing worth noting: At Risk and Rising Stars are separated primarily by recency and frequency, not by review score. At Risk merchants actually have higher review scores (4.67 vs. 4.29), which reinforces the finding that review scores do not distinguish struggling merchants from growing ones within the active seller population.

OLS on review score:
what predicts buyer satisfaction.

The regression used statsmodels OLS (not sklearn) to get interpretable p-values and confidence intervals. The dependent variable is average review score per merchant. Filtered to sellers with 10+ orders (n=1,226, 90.7% of GMV).

# OLS Regression Results — Dependent variable: avg_review_score
# n=1,226 merchants (sellers with 10+ orders)
═══════════════════════════════════════════════════════════════════
R-squared:        0.343       Adj. R-squared:   0.334
F-statistic:      39.48       Prob (F-stat):    3.17e-94
Observations:     1,226
───────────────────────────────────────────────────────────────────
                       coef      p-value     [95% CI lo    95% CI hi]
pct_orders_late      -1.3800   1.25e-14     -1.7267       -1.0333
avg_freight_ratio    -0.1221   3.05e-03     -0.2027       -0.0414
avg_shipping_days    -0.0351   8.96e-33     -0.0407       -0.0295
num_categories       -0.0208   4.02e-02     -0.0407       -0.0009
avg_price             0.0004   7.71e-03      0.0001        0.0007
avg_delivery_delay_d -0.0003   0.930        not significant when late rate controlled
order_count          -0.0002   0.217        not significant
───────────────────────────────────────────────────────────────────
# Shapiro-Wilk on residuals: W=0.930, p=0.0000 → non-normal, but CIs remain valid via CLT (n=1,226 ≫ 30)
      

Key coefficients

pct_orders_late (per 10pp)
-0.138
95% CI: [-0.173, -0.103]   p < 0.001
Each 10 percentage point increase in late delivery rate costs 0.138 review points. This is roughly 4x the per-unit effect of raw shipping days. p < 0.001
avg_shipping_days (per day)
-0.035
95% CI: [-0.041, -0.030]   p < 0.001
Each additional day of average shipping time reduces review score by 0.035 points. Significant, but smaller than the late rate effect. p < 0.001
avg_delivery_delay_days
Not significant
p = 0.930 when late rate is controlled
Once we control for whether a promise was kept (late rate), how late the delivery was becomes irrelevant. The binary 'was it kept?' signal is what matters. p = 0.93
avg_freight_ratio
-0.122
95% CI: [-0.203, -0.041]   p = 0.003
Higher freight-to-price ratio reduces review scores. Buyers may perceive high freight as poor value. Suggests pricing strategy matters beyond just quality signals. p = 0.003

SHAP feature importance

SHAP values were computed on a parallel sklearn model and show consistent ranking. The feature importance ordering matches the OLS coefficient picture, which builds confidence in the result.

Mean |SHAP| by feature (higher = more important for predicting avg_review_score)
avg_shipping_days
0.104
pct_orders_late
0.076
state_SP
0.034
avg_freight_ratio
0.019
num_categories
0.017
avg_price
0.011
num_skus
0.003
Takeaway: avg_shipping_days is the single most impactful feature by mean SHAP value. pct_orders_late is second. State effects (especially SP, Sao Paulo) contribute about one-third of the shipping_days effect. Volume and catalogue features (order_count, num_skus) have near-zero SHAP importance for review score prediction, though they are strong predictors of GMV.
Diagnostics note: The Shapiro-Wilk test flagged non-normal residuals (W=0.930, p<0.001). OLS is still reliable at n=1,226, which is well above the asymptotic normality threshold. Residual plots showed mild heteroscedasticity. VIF values for state variables were elevated (state_SP: 18.7) but this is expected given collinearity among geographic dummies. No action was taken on the VIF for states because dropping state variables could introduce bias since delivery logistics clearly differ by region. Cook's distance identified no influential outliers beyond what you would expect in a real-world commerce dataset.

The guardrail metric
was effectively decorative.

The power analysis used observed variance from the 1,226 eligible merchants. It produced one surprising finding that changed the experiment design: the GMV guardrail is severely underpowered at the planned sample size unless we change the unit of analysis.

34
merchants per group needed to detect a 5pp improvement in on-time delivery rate (80% power, alpha=0.05). Only 5.5% of the 1,226 eligible pool. The experiment is not sample-limited.
6.4%
statistical power for the merchant-level GMV guardrail at N=34/group. If alerts were secretly harming revenue by 10%, we would miss it 93.6% of the time.
99.9%
power for the GMV guardrail when measured at the order level instead of the merchant level. ~3,812 orders/month from eligible sellers, reaches required N in under 4 weeks.
6 wks
total experiment runtime: 4-week observation window plus 2-week buffer. Observation-limited, not sample-limited. Full enrollment could happen on day one.
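Both headline numbers above can be reproduced (approximately) with statsmodels' power solver, using the MDEs and sigmas from the summary table below. The effect-size construction for the log-GMV guardrail is my reading of "10% decrease on log-GMV"; treat it as a sketch:

```python
import math
from statsmodels.stats.power import TTestIndPower

# On-time delivery rate: MDE = 5pp, sigma = 7.17pp (observed merchant-level SD)
d = 0.05 / 0.0717                     # Cohen's d ≈ 0.70
n_delivery = TTestIndPower().solve_power(
    effect_size=d, alpha=0.05, power=0.80, alternative="two-sided")
print(f"On-time rate: N per group = {math.ceil(n_delivery)}")

# Merchant-level GMV guardrail: 10% drop on log-GMV with sigma = 1.22
d_gmv = math.log(1 / 0.9) / 1.22      # ≈ 0.086 — a tiny effect size
power_at_34 = TTestIndPower().solve_power(
    effect_size=d_gmv, alpha=0.05, nobs1=34, alternative="two-sided")
print(f"GMV guardrail power at N=34/group: {power_at_34:.3f}")
```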

Power analysis summary table

Metric Unit MDE Sigma N per group Power
On-time delivery rate Merchant 5 pp 7.17 pp 34 80.0%
GMV guardrail Merchant 10% decrease log-GMV: 1.22 2,101 (needed) 6.4% at N=34
GMV guardrail Order 10% decrease log-order: 0.92 1,233 orders 99.9%
Power curve: required N per group vs. minimum detectable effect (alpha=0.05, power=0.80)
[Power curve: x-axis = MDE in percentage points (1 to 6); y-axis = N per group (0 to 800); the chosen design point at 5pp, N=34 is marked.]
Takeaway: The curve shows that requiring detection of a 1pp effect would need 809 merchants per group (1.3x the entire eligible pool). At 5pp, we need only 34, well within reach. The chosen design point sits in a comfortable range: even dropping the MDE to 3pp would require only 91 per group, still feasible.

Ollama + Llama 3 8B,
running entirely locally.

Structured merchant metrics are converted into plain-language health briefs using a locally-running LLM. No API key, no data leaves the machine. The goal is to show that AI tools can make analytical outputs accessible to non-technical stakeholders (account managers, merchant success teams) without new data collection.

Input
Merchant metrics (structured JSON)
Model
Llama 3 8B via Ollama (local)
Config
temp=0.3, max_tokens=250
Output
3 to 4 sentence health brief

System prompt

SYSTEM: You are a merchant success analyst at an e-commerce marketplace.
Your job is to write a brief, plain-language health summary for a specific
merchant, based only on the structured metrics provided. Do not invent
numbers, percentages, or facts that are not in the input.

Write exactly 3 to 4 sentences. Structure your response as follows:
1. State the merchant's segment and top product category, and characterise
   their overall business trajectory in one sentence.
2. Highlight their single strongest metric (what they are doing well).
3. Identify their single most important risk or weakness, using the data.
4. Recommend one specific, actionable intervention the platform should take.

Keep the language direct and concrete. A merchant success manager should be
able to read this brief in 20 seconds and know exactly what to do.

# Temperature 0.3 — reduces hallucination, keeps output grounded in input
# num_predict 250 — caps length to enforce brief format
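Wiring this up is mostly payload assembly. A minimal sketch: `build_messages` is an illustrative helper (not the notebook's actual function name), and the `ollama.chat` call at the end is from the Ollama Python client, which needs a local Ollama server with llama3 pulled:

```python
import json

def build_messages(profile: dict, system_prompt: str) -> list:
    """Assemble the chat payload: system prompt + structured metrics as the user turn."""
    lines = [f"- {key}: {value}" for key, value in profile.items()]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Merchant profile:\n" + "\n".join(lines)},
    ]

SYSTEM_PROMPT = "You are a merchant success analyst..."  # full prompt shown above

profile = {
    "Segment": "At Risk",
    "Top category": "Housewares",
    "Average review score": "5.00 / 5.0",
    "Recency": "114 days since last order",
}
messages = build_messages(profile, SYSTEM_PROMPT)
print(json.dumps(messages, indent=2))

# The actual generation step (requires a running local Ollama with llama3):
# import ollama
# response = ollama.chat(model="llama3", messages=messages,
#                        options={"temperature": 0.3, "num_predict": 250})
# print(response["message"]["content"])
```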
      

Example: input and output

Input (structured user message)
Merchant profile:
- Segment: At Risk
- Top category: Housewares
- Total GMV: R$532
- Total orders: 3
- Average review score: 5.00 / 5.0
- On-time delivery rate: 100.0%
- Average shipping days: 7.1 days
- Recency: 114 days since last order
- Order volume trajectory: Declining
  (-50% order volume change H1 vs H2)
Output (Llama 3 8B, live generation)
This At-Risk merchant in the Housewares category has seen a decline in their business, with a -50% drop in order volume between H1 and H2. Despite this, they are excelling in customer satisfaction, boasting an impressive 5.00/5.0 average review score and a 100% on-time delivery rate. However, the merchant's recency is concerning, with a 114-day gap since their last order, indicating a need for re-engagement efforts. To address this, I recommend that the platform sends targeted promotional outreach to this merchant's inactive customer base to encourage repeat business and stimulate growth.
Design notes: The system prompt constrains output to exactly 3 to 4 sentences and forbids the model from inventing numbers. This was intentional: early testing showed the model would produce plausible-sounding but fabricated statistics when given free rein. The low temperature (0.3) keeps outputs grounded in the input without producing identical briefs across merchants. A total of 10 briefs were generated across 4 segments (2 to 3 per segment). Mean brief length was 101 words. Outputs are saved to data/processed/merchant_briefs.json.

Full analysis on GitHub.

Each notebook runs end-to-end and corresponds to one analytical module. Links go to the GitHub repository.