Technical Deep Dive

Methodology,
models, and
diagnostics.

The complete analytical pipeline: data cleaning decisions, regression outputs with CIs, segmentation validation, power analysis, and the Ollama/Llama 3 AI summary implementation.

From nine raw CSVs to
one merchant feature table.

The Olist dataset has nine tables. The goal was to collapse everything into a single merchant-level feature table (one row per seller) with aggregated metrics suitable for regression and segmentation.

[Pipeline diagram: 9 raw Olist CSVs (100k orders) → clean and merge (handle nulls, translate categories; join on order_id, seller_id, product_id) → feature engineering (RFM + OLS features, aggregated to merchant level, ~3,000 sellers) → merchant_features table (2,960 rows, 30+ cols).]
Category translation Applied Portuguese-to-English mapping from the product_category_name_translation table early in the pipeline, before any analysis. Categories like "informatica_acessorios" become "computers_accessories".
Delivery date filtering Filtered to orders with status "delivered" that had valid purchase and delivery timestamps. Orders with missing estimated delivery dates were excluded from late delivery rate calculations (~3% of orders).
Merchant-level aggregation All order-level data was aggregated to seller level: sum for GMV, mean for review scores and shipping days, count for orders, max for recency calculation. Weighted averages were not used because order weighting would favor high-volume merchants.
10-order filter for regression Merchants with fewer than 10 orders were excluded from regression analysis. Below 10 orders, per-merchant metric estimates (especially on-time rate and average review score) are too noisy to be meaningful. This retained 1,226 of 2,960 sellers (41%) covering 90.7% of GMV.
Trajectory split The dataset midpoint (September 7, 2017) split the date range roughly in half. Merchants with orders in both halves received a trajectory label based on order volume change (H1 to H2): growing, stable, or declining.
Geographic consolidation Brazilian states with fewer than 15 sellers were grouped into an "Other" category for the regression. The reference state was Distrito Federal (DF). All state coefficients are relative to that baseline.
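The aggregation and trajectory rules above can be sketched in pandas. This is a minimal illustration on toy data: the column names and the simple "order count went up or down" trajectory rule are stand-ins for the pipeline's exact schema and thresholds.

```python
import pandas as pd

# Toy order-level frame standing in for the merged Olist tables
# (columns are illustrative, not the exact pipeline schema).
orders = pd.DataFrame({
    "seller_id": ["s1", "s1", "s2", "s2", "s2"],
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "price": [100.0, 50.0, 20.0, 30.0, 25.0],
    "review_score": [5, 4, 3, 5, 4],
    "shipping_days": [7, 9, 12, 10, 11],
    "purchase_ts": pd.to_datetime(
        ["2017-03-01", "2018-06-01", "2017-01-15", "2017-12-01", "2018-02-01"]),
})

# Sum for GMV, mean for reviews/shipping, count for orders,
# max for recency — the aggregation rules described above.
merchant_features = orders.groupby("seller_id").agg(
    gmv=("price", "sum"),
    avg_review_score=("review_score", "mean"),
    avg_shipping_days=("shipping_days", "mean"),
    order_count=("order_id", "count"),
    last_order=("purchase_ts", "max"),
)

# Trajectory label: compare order volume before and after the dataset midpoint.
midpoint = pd.Timestamp("2017-09-07")
h1 = orders[orders.purchase_ts <= midpoint].groupby("seller_id").size()
h2 = orders[orders.purchase_ts > midpoint].groupby("seller_id").size()
change = (h2.reindex(merchant_features.index, fill_value=0)
          - h1.reindex(merchant_features.index, fill_value=0))
merchant_features["trajectory"] = change.map(
    lambda d: "growing" if d > 0 else ("declining" if d < 0 else "stable"))
print(merchant_features)
```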

The review score is
a lagging indicator.

The analysis started with a skeptical question: does a high review score actually mean a merchant is healthy? The answer shapes everything that follows.

Review Score Distribution vs. Declining Order Trajectory (merchants with 4.0+ avg score, n=641)
[Schematic scatter: x-axis = average review score in H1 (4.0 to 4.8+); y-axis = order volume change, H1 to H2 (-50% to +50%); high-score declining merchants annotated "silently failing".]
Takeaway: Red points (declining merchants) are scattered across the full review score range, including the 4.0 to 5.0 band. A monitoring system that watches review scores would not distinguish these merchants from their growing neighbors. (Schematic representation of the pattern; see notebook 02 for the exact scatter plot.)

Statistical tests: H1 metrics vs. trajectory

H1 Late Delivery Rate
p = 0.028
t = 2.208  |  r = 0.071
Declining merchants: 6.0% late rate. Growing merchants: 3.9% late rate. The 2.1pp gap is statistically significant. Significant
H1 Average Review Score
p = 0.267
t = -1.111  |  r = -0.023
Declining merchants: 4.22. Growing merchants: 4.28. The difference is not detectable statistically. Not significant
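The comparisons above are two-sample t-tests with a point-biserial effect size. A minimal sketch on simulated data (the exact test variant is not stated in the notebook; Welch's unequal-variance version is assumed here, and the group sizes and spreads are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic H1 late-rate samples (illustrative, not the real data):
declining = rng.normal(0.060, 0.07, 300)   # mean 6.0% late rate
growing = rng.normal(0.039, 0.07, 700)     # mean 3.9% late rate

# Welch's t-test: does the late-rate gap separate the trajectories?
t_stat, p_val = stats.ttest_ind(declining, growing, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")

# Point-biserial r as the effect size reported above
labels = np.concatenate([np.ones(300), np.zeros(700)])
values = np.concatenate([declining, growing])
r, _ = stats.pointbiserialr(labels, values)
print(f"r = {r:.3f}")
```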

K-means on RFM features,
validated with silhouette analysis.

Merchant segmentation used RFM (recency, frequency, monetary) features, log-transformed and standardized, with K-means clustering. The k selection involved a genuine trade-off between compactness and interpretability.
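The preprocessing and k-selection loop can be sketched as follows. The RFM matrix here is synthetic and the silhouette values it produces will not match the real ones; the point is the shape of the procedure (log-transform, standardize, sweep k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic RFM matrix (recency_days, frequency, monetary) — stand-in data
rfm = np.column_stack([
    rng.lognormal(4, 1, 500),    # recency, days
    rng.lognormal(2, 1, 500),    # frequency, orders
    rng.lognormal(6, 1.5, 500),  # monetary, GMV
])

# Log-transform (RFM is heavy-tailed) then standardize, as in the pipeline
X = StandardScaler().fit_transform(np.log1p(rfm))

# Silhouette across candidate k; the final k is a judgment call, not argmax
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```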

Choosing k: silhouette peaked at k=3, but k=4 was selected

The silhouette score peaks at k=3 (0.418) and drops at k=4 (0.372). I chose k=4 because the four resulting clusters map cleanly to four actionable product tiers (Champions, Rising Stars, At Risk, Dormant), while the k=3 solution collapsed two meaningful groups into one. The 0.046-point drop in silhouette felt like a reasonable trade-off for the added interpretability, but this is a judgment call worth discussing.

k=2
Silhouette: 0.401
Too coarse
k=3
Silhouette: 0.418
Best score
k=4
Silhouette: 0.372
Selected (interpretable)
k=5
Silhouette: 0.376
Diminishing returns
k=6+
Silhouette: < 0.34
Over-segmented

Cluster profile table

Segment Count GMV Share Median Orders Recency (d) Avg Review Late Rate % Declining
Champions 645 80.9% 65 12 4.20 6.9% 15.4%
Rising Stars 999 15.0% 10 28 4.29 5.0% 29.3%
At Risk 748 1.2% 2 44 4.67 0.0% 30.0%
Dormant 568 2.9% 2 406 4.50 0.0% 52.8%

One thing worth noting: At Risk and Rising Stars are separated primarily by recency and frequency, not by review score. At Risk merchants actually have higher review scores (4.67 vs. 4.29), which reinforces the finding that review scores do not distinguish struggling merchants from growing ones within the active seller population.

OLS on review score:
what predicts buyer satisfaction.

The regression used statsmodels OLS (not sklearn) to get interpretable p-values and confidence intervals. The dependent variable is average review score per merchant. Filtered to sellers with 10+ orders (n=1,226, 90.7% of GMV).

# OLS Regression Results — Dependent variable: avg_review_score
# n=1,226 merchants (sellers with 10+ orders)
═══════════════════════════════════════════════════════════════════
R-squared:        0.343       Adj. R-squared:   0.334
F-statistic:      39.48       Prob (F-stat):    3.17e-94
Observations:     1,226
───────────────────────────────────────────────────────────────────
                       coef      p-value     [95% CI lo    95% CI hi]
pct_orders_late      -1.3800   1.25e-14     -1.7267       -1.0333
avg_freight_ratio    -0.1221   3.05e-03     -0.2027       -0.0414
avg_shipping_days    -0.0351   8.96e-33     -0.0407       -0.0295
num_categories       -0.0208   4.02e-02     -0.0407       -0.0009
avg_price             0.0004   7.71e-03      0.0001        0.0007
avg_delivery_delay_d -0.0003   0.930        not significant when late rate controlled
order_count          -0.0002   0.217        not significant
───────────────────────────────────────────────────────────────────
# Shapiro-Wilk on residuals: W=0.930, p=0.0000 → non-normal, but CIs remain valid via CLT (n=1,226 ≫ 30)
      

Key coefficients

pct_orders_late (per 10pp)
-0.138
95% CI: [-0.173, -0.103]   p < 0.001
Each 10 percentage point increase in late delivery rate costs 0.138 review points. This is roughly 4x the per-unit effect of raw shipping days. p < 0.001
avg_shipping_days (per day)
-0.035
95% CI: [-0.041, -0.030]   p < 0.001
Each additional day of average shipping time reduces review score by 0.035 points. Significant, but smaller than the late rate effect. p < 0.001
avg_delivery_delay_days
Not significant
p = 0.930 when late rate is controlled
Once we control for whether a promise was kept (late rate), how late the delivery was becomes irrelevant. The binary 'was it kept?' signal is what matters. p = 0.93
avg_freight_ratio
-0.122
95% CI: [-0.203, -0.041]   p = 0.003
Higher freight-to-price ratio reduces review scores. Buyers may perceive high freight as poor value. Suggests pricing strategy matters beyond just quality signals. p = 0.003

SHAP feature importance

SHAP values were computed on a parallel sklearn model and show consistent ranking. The feature importance ordering matches the OLS coefficient picture, which builds confidence in the result.

Mean |SHAP| by feature (higher = more important for predicting avg_review_score)
avg_shipping_days
0.104
pct_orders_late
0.076
state_SP
0.034
avg_freight_ratio
0.019
num_categories
0.017
avg_price
0.011
num_skus
0.003
Takeaway: avg_shipping_days is the single most impactful feature by mean SHAP value. pct_orders_late is second. State effects (especially SP, Sao Paulo) contribute about one-third of the shipping_days effect. Volume and catalogue features (order_count, num_skus) have near-zero SHAP importance for review score prediction, though they are strong predictors of GMV.
Diagnostics note: The Shapiro-Wilk test flagged non-normal residuals (W=0.930, p<0.001). OLS is still reliable at n=1,226, which is well above the asymptotic normality threshold. Residual plots showed mild heteroscedasticity. VIF values for state variables were elevated (state_SP: 18.7) but this is expected given collinearity among geographic dummies. No action was taken on the VIF for states because dropping state variables could introduce bias since delivery logistics clearly differ by region. Cook's distance identified no influential outliers beyond what you would expect in a real-world commerce dataset.

The guardrail metric
was effectively decorative.

The power analysis used observed variance from the 1,226 eligible merchants. It produced one surprising finding that changed the experiment design: the GMV guardrail is severely underpowered at the planned sample size unless we change the unit of analysis.

34
merchants per group needed to detect a 5pp improvement in on-time delivery rate (80% power, alpha=0.05). Only 5.5% of the 1,226 eligible pool. The experiment is not sample-limited.
6.4%
statistical power for the merchant-level GMV guardrail at N=34/group. If alerts were secretly harming revenue by 10%, we would miss it 93.6% of the time.
99.9%
power for the GMV guardrail when measured at the order level instead of the merchant level. ~3,812 orders/month from eligible sellers, reaches required N in under 4 weeks.
6 wks
total experiment runtime: 4-week observation window plus 2-week buffer. Observation-limited, not sample-limited. Full enrollment could happen on day one.
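Both headline numbers above can be reproduced (approximately) with statsmodels' power solver, using the MDEs and sigmas from the summary table below. The effect-size construction for the log-GMV guardrail is my reading of "10% decrease on log-GMV"; treat it as a sketch:

```python
import math
from statsmodels.stats.power import TTestIndPower

# On-time delivery rate: MDE = 5pp, sigma = 7.17pp (observed merchant-level SD)
d = 0.05 / 0.0717                     # Cohen's d ≈ 0.70
n_delivery = TTestIndPower().solve_power(
    effect_size=d, alpha=0.05, power=0.80, alternative="two-sided")
print(f"On-time rate: N per group = {math.ceil(n_delivery)}")

# Merchant-level GMV guardrail: 10% drop on log-GMV with sigma = 1.22
d_gmv = math.log(1 / 0.9) / 1.22      # ≈ 0.086 — a tiny effect size
power_at_34 = TTestIndPower().solve_power(
    effect_size=d_gmv, alpha=0.05, nobs1=34, alternative="two-sided")
print(f"GMV guardrail power at N=34/group: {power_at_34:.3f}")
```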

Power analysis summary table

Metric Unit MDE Sigma N per group Power
On-time delivery rate Merchant 5 pp 7.17 pp 34 80.0%
GMV guardrail Merchant 10% decrease log-GMV: 1.22 2,101 (needed) 6.4% at N=34
GMV guardrail Order 10% decrease log-order: 0.92 1,233 orders 99.9%
Power curve: required N per group vs. minimum detectable effect (alpha=0.05, power=0.80)
[Power curve: x-axis = MDE in percentage points (1 to 6); y-axis = N per group (0 to 800); the chosen design point at 5pp, N=34 is marked.]
Takeaway: The curve shows that requiring detection of a 1pp effect would need 809 merchants per group (1.3x the entire eligible pool). At 5pp, we need only 34, well within reach. The chosen design point sits in a comfortable range: even dropping the MDE to 3pp would require only 91 per group, still feasible.

Ollama + Llama 3 8B,
running entirely locally.

Structured merchant metrics are converted into plain-language health briefs using a locally-running LLM. No API key, no data leaves the machine. The goal is to show that AI tools can make analytical outputs accessible to non-technical stakeholders (account managers, merchant success teams) without new data collection.

Input
Merchant metrics (structured JSON)
Model
Llama 3 8B via Ollama (local)
Config
temp=0.3, max_tokens=250
Output
3 to 4 sentence health brief

System prompt

SYSTEM: You are a merchant success analyst at an e-commerce marketplace.
Your job is to write a brief, plain-language health summary for a specific
merchant, based only on the structured metrics provided. Do not invent
numbers, percentages, or facts that are not in the input.

Write exactly 3 to 4 sentences. Structure your response as follows:
1. State the merchant's segment and top product category, and characterise
   their overall business trajectory in one sentence.
2. Highlight their single strongest metric (what they are doing well).
3. Identify their single most important risk or weakness, using the data.
4. Recommend one specific, actionable intervention the platform should take.

Keep the language direct and concrete. A merchant success manager should be
able to read this brief in 20 seconds and know exactly what to do.

# Temperature 0.3 — reduces hallucination, keeps output grounded in input
# num_predict 250 — caps length to enforce brief format
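Wiring this up is mostly payload assembly. A minimal sketch: `build_messages` is an illustrative helper (not the notebook's actual function name), and the `ollama.chat` call at the end is from the Ollama Python client, which needs a local Ollama server with llama3 pulled:

```python
import json

def build_messages(profile: dict, system_prompt: str) -> list:
    """Assemble the chat payload: system prompt + structured metrics as the user turn."""
    lines = [f"- {key}: {value}" for key, value in profile.items()]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Merchant profile:\n" + "\n".join(lines)},
    ]

SYSTEM_PROMPT = "You are a merchant success analyst..."  # full prompt shown above

profile = {
    "Segment": "At Risk",
    "Top category": "Housewares",
    "Average review score": "5.00 / 5.0",
    "Recency": "114 days since last order",
}
messages = build_messages(profile, SYSTEM_PROMPT)
print(json.dumps(messages, indent=2))

# The actual generation step (requires a running local Ollama with llama3):
# import ollama
# response = ollama.chat(model="llama3", messages=messages,
#                        options={"temperature": 0.3, "num_predict": 250})
# print(response["message"]["content"])
```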
      

Example: input and output

Input (structured user message)
Merchant profile:
- Segment: At Risk
- Top category: Housewares
- Total GMV: R$532
- Total orders: 3
- Average review score: 5.00 / 5.0
- On-time delivery rate: 100.0%
- Average shipping days: 7.1 days
- Recency: 114 days since last order
- Order volume trajectory: Declining
  (-50% order volume change H1 vs H2)
Output (Llama 3 8B, live generation)
This At-Risk merchant in the Housewares category has seen a decline in their business, with a -50% drop in order volume between H1 and H2. Despite this, they are excelling in customer satisfaction, boasting an impressive 5.00/5.0 average review score and a 100% on-time delivery rate. However, the merchant's recency is concerning, with a 114-day gap since their last order, indicating a need for re-engagement efforts. To address this, I recommend that the platform sends targeted promotional outreach to this merchant's inactive customer base to encourage repeat business and stimulate growth.
Design notes: The system prompt constrains output to exactly 3 to 4 sentences and forbids the model from inventing numbers. This was intentional: early testing showed the model would produce plausible-sounding but fabricated statistics when given free rein. The low temperature (0.3) keeps outputs grounded in the input without producing identical briefs across merchants. A total of 10 briefs were generated across 4 segments (2 to 3 per segment). Mean brief length was 101 words. Outputs are saved to data/processed/merchant_briefs.json.

Full analysis on GitHub.

Each notebook runs end-to-end and corresponds to one analytical module. Links go to the GitHub repository.