Aug–Dec 2025 full data loaded — re-exported by date split, 1,813,481 rows across 14 files. Prior 150K row cap resolved.
Aug: 245,042 · Sep: 250,599 · Oct: 262,960 · Nov: 259,196 · Dec: 404,144 rows.
Q1 2026 validation complete — Jan–Mar 2026 actuals uploaded and validated. Out-of-sample MAPE: 2.25% (Jan: 2.0% · Feb: 1.9% · Mar: 2.8%). See Page 5 for full results.
Research Output · Sales Forecasting Model · Updated Mar 2026
Sales Volume Regression Full Model — All Variables Tested
Author: Truong Phat | Method: OLS (scipy.stats) | Period: Jan 2020 – Dec 2025 | n = 72 months | ✅ Full data — Aug–Dec 2025 cap resolved · Q1 2026 validated
Revised: ASP is the primary controllable driver — marketing variable corrected, endogeneity identified
With 11.3M+ transaction rows across 75 months (2020–Mar 2026), ASP is confirmed as the key demand lever
(simple R²=27.3%, p<0.0001; price elasticity = −0.901). However, a material data quality issue was identified in the
marketing variable: the original 641 spend series captured the entire GL account (~1,095B/month) rather than
true A&P activity codes (MK01–MK11, ~97B/month). The old variable was endogenous — larger business
months naturally produce both higher sales and higher 641 costs, creating spurious correlation (R²=18.5%, p=0.0002).
The correct narrow MK** spend is statistically insignificant at monthly aggregate level (R²=1.4%, p=0.327).
Both models are presented: Model C-OLD (ASP + proxy 641, R²=0.658) and Model B-NEW (ASP + Seasonality only, R²=0.625).
Macro variables remain excluded. Marketing effect is flagged as a limitation requiring further investigation
— lagged spend or campaign-level data needed to establish causal relationship.
NEW — Average Selling Price (ASP) Analysis
ASP — Simple Regression
R²0.2782 (27.8%)
p-value<0.0001
Elasticity−0.901
Direction↑ Price → ↓ Qty
✅ Include
Price Elasticity
Elasticity−0.901
TypeInelastic (|e|<1)
+5% price−4.3% qty
+10% price−8.4% qty
✅ Near-unitary
ASP Trend 2020–2025
2020 avg7,800 VND/unit
2022 avg8,200 VND/unit
2024 avg8,800 VND/unit
2025 avg9,800 VND/unit
+25% in 5 years
Marketing Spend (641) ⚠️ Revised
OLD R² (proxy)0.185 — endogenous
NEW R² (MK**)0.020 — not sig.
OLD scopeFull 641 ~1,095B/mo
CORRECT scopeMK** codes ~97B/mo
⚠️ Endogenous
Multiple Regression Model Comparison — Log Quantity
Model
R²
MAPE
ASP coef
MKT coef
Decision
A — Trend + Seasonality only
0.622
7.04%
—
—
Baseline
B — + ASP
0.626
7.03%
−0.253
—
+0.4pp R²
C — + ASP + Marketing ★
0.666
6.68%
−0.630
+0.346
★ Selected
D — + GDP Growth
0.667
6.63%
−0.616
+0.349
+0.1pp only
E — + Lag QTY
0.689
6.41%
−0.650
+0.351
Overfitting risk
Standardised Beta Coefficients — Impact Ranking (Model C)
Comparable impact: 1 std dev change in X → β std devs change in log-quantity. Largest |β| = strongest driver.
Reading guide (Model C-OLD with proxy 641): COS_12 (seasonality) has the largest absolute impact (−0.456) but is uncontrollable.
ASP (price) is #1 controllable lever (β=−0.301).
Marketing β=+0.263 shown here but is from the endogenous 641 proxy variable — treat with caution.
⚠️ Revised finding: With the correct MK** spend (~97B/month), marketing beta = −0.064 (not significant, p=0.327, R²=1.4%).
ASP remains the only statistically validated controllable driver.
A 1 std dev price increase (~1,000 VND/unit) reduces quantity by ~0.30 std devs (~35M units/month).
Simple Regression — One Variable at a Time
CPI YoY %
R²0.0170 (1.7%)
p-value0.2745
F-stat1.2131
95% CI[-22.6M, +6.5M]
MAE53.3M VND
Significant?No (p=0.27)
❌ Exclude
Interest Rate %
R²0.0523 (5.2%)
p-value0.0532
F-stat3.8665
95% CI[-53.7M, +0.4M]
MAE53.0M VND
Significant?Borderline (p=0.053)
⚠️ Exclude
GDP Growth % (Quarterly)
R²0.0888 (8.9%)
p-value0.0110
F-stat6.8230
95% CI[-10.2M, -1.4M]
MAE49.9M VND
Significant?Marginally (p=0.011)
⚠️ Exclude
Model
R²
Adj. R²
F-stat (p)
MAE
Decision
All 3 Combined
16.1%
12.4%
4.34 (p=0.007)
48.6M VND
EXCLUDE — multicollinearity, CPI sign reversal
Model Roadmap
✅ Completed
Base Regression Model
Linear regression on total monthly sales using avg price, trend, seasonality, lag features. R²=0.823 (+TET_FEB dummy), MAPE=4.68% in-sample. Full-data refit on 1,813,481 rows Aug–Dec 2025.
✅ Completed
Macro Variable Testing
Tested CPI, interest rate, GDP at total and product group level with full statistical evidence. All excluded from primary model. Product group table updated to real quantity data — prior amount-based results revised.
✅ Completed
Product Group Deep Dive
Sales aggregated by NHÓM (92.1% item match rate). Macro regression per group — real quantity data reveals 01XXXX (Condensed Milk) as most GDP-sensitive group (R²=18.6%), overturning prior assumption. See Page 2.
✅ Completed
Marketing Spend Variable (641)
Tested Account 641 proxy (R²=17.0%, endogenous) and correct MK** activity codes (~97B/month, R²=1.4%, p=0.327 — not significant). Endogeneity identified and documented. Marketing effect requires campaign-level data to establish causality.
✅ Completed
Average Selling Price Variable
Computed from full transaction dataset. ASP rose from 7,800 to 9,800 VND/unit (+25% over 5 years). Linear R²=27.3%, elasticity=−0.901. Beta=−0.252 in Model B — strongest controllable driver confirmed.
✅ Completed
Aug–Dec 2025 Full Data Re-export
150K row cap resolved. Re-exported by date split — 1,813,481 rows across 14 files. Aug: 245,042 · Sep: 250,599 · Oct: 262,960 · Nov: 259,196 · Dec: 404,144. Model refit on complete 2025 data.
✅ Completed
Q1 2026 Out-of-Sample Validation
Jan–Mar 2026 actuals validated. MAPE=2.25% (Jan: 2.0% · Feb: 1.9% · Mar: 2.8%) vs MA12 baseline 11.18% — model generalises well beyond training period. TET_FEB dummy critical for Feb accuracy. See Page 5.
✅ Completed
TET Seasonality Feature
TET_FEB dummy added after real-data diagnosis — February structurally low every year (Tet stock rundown). Improved in-sample R² from 0.782→0.823, MAPE 5.20%→4.68%. Validation MAPE improved from 6.78%→2.25%.
✅ Completed
Product-Group Level ASP & Elasticity
Real group ASPs computed from Oracle EBS transaction data (Amount÷Qty per NHOM). Separate OLS regressions run for 01/02/04/07. 04XXXX (UHT) confirmed most price-elastic (elasticity=−1.228, +5%→−5.8%) as predicted. 01XXXX (Condensed) elasticity=−0.782. 07XXXX and 02XXXX near price-inelastic — volume driven by seasonality and structural trend respectively. Full results: nhom_model_results.csv.
✅ Completed
Channel-Level Separate Models
Separate OLS regressions (Model F) run for all 6 channels using Oracle AR system SALES_CHANNEL_CODE (10,565 customers, 99.3% match rate). CVS most price-elastic (−1.065), NPP near-inelastic (−0.136). MT and KA show Interest Rate sensitivity. Lagged MK** spend tested at lags 0–3 across all channels — 0/24 tests significant. Full results: channel_model_results.csv, lag_test_results.csv.
Methodology
Model
OLS via scipy.stats.linregress (simple) and sklearn.linear_model.LinearRegression (multiple regression)
Data
72 months (Jan 2020–Dec 2025). Sales: 9.5M transaction rows (normal sales, 2020–Jul 2025) + 1,813,481 rows (Aug 2025–Mar 2026, re-exported full data). Aug–Dec 2025 row cap resolved — complete data loaded. Macro: GSO Vietnam (CPI), State Bank of Vietnam (rate), World Bank/GSO (GDP quarterly → monthly). Q1 2026 macro: CPI 3.63/3.09/2.80%, Rate 4.5%, GDP 6.9% (GSO estimate).
Evaluation Metrics
R², Adjusted R², p-value, F-statistic, 95% Confidence Interval, MAE — at both total and product group level
Exclusion Criteria
Excluded if: R² < 0.15 individually, OR p-value > 0.05, OR combined Adj. R² does not meaningfully improve the primary model
Deep Dive · Product Group · ASP Regression · Real-Data Quantity (Revised)
Product Group Analysis Macro · ASP · Regression
Author: Truong Phat | Item master join: 92.1% match rate | Groups: 18 | n = 72 months each | Updated: real quantity data — prior amount-based results revised
Why this matters: Total-level analysis masks important product-level behaviour.
Disaggregating by NHÓM (product group from Oracle ERP item master) reveals that macro sensitivity
is more nuanced than a simple premium/staple split — and in some cases directly contradicts that assumption.
Condensed Milk (01XXXX), previously assumed to be a macro-immune daily staple, shows
significant GDP sensitivity on real quantity data (R²=18.6%, p=0.0002) — the highest single-variable R² of the four core groups.
Meanwhile Powdered Milk (02XXXX) is sensitive to all 3 macro variables (best R²=11.2%, Rate),
and UHT Liquid Milk (04XXXX) responds to CPI and Interest Rate (best R²=14.0%).
Interest Rate emerges as the most consistent driver across 02XXXX and 04XXXX — suggesting
that consumer credit conditions and discretionary income matter more than headline inflation for mid-tier dairy.
Small / Niche Groups (Low Volume, Limited Inference)
Product Group
n (months)
Best R²
Note
07SCUS
13
22.4% (p=0.103)
Too few months — not significant
15GC30
18
8.3% (p=0.245)
Limited data — inconclusive
15XXXX
72
1.4%
Negligible volume
18XXXX
13
37.1% (p=0.027) ✅
GDP significant but only 13 months — interpret with caution
Strategic Implication — Revised
⚠️ Data revision note: Previous version of this table used VND amount data from an earlier model run. This version uses real transaction quantity data (units sold, 72 months). Key finding changes: 01XXXX is no longer macro-immune; 04XXXX now shows Rate sensitivity instead of GDP only; 02XXXX GDP drops from 18.4% → 6.4% but all 3 variables remain significant.
Most Macro-Sensitive Groups
01XXXX Condensed Milk — GDP R²=18.6% (p=0.0002), highest single-variable R² of all core groups. Counter-intuitive: a low-cost staple responds to economic growth, likely through volume uplift via distribution expansion and channel stocking in high-GDP years.
04XXXX UHT Liquid Milk — Rate R²=14.0% (p=0.001), CPI also significant. The largest volume group is rate-sensitive: when borrowing costs rise, household budgets tighten and per-capita consumption softens even for mainstream products.
Revised Findings — Prior Assumptions Overturned
02XXXX Powdered Milk — still significant on all 3 variables, but strongest driver is now Interest Rate (R²=11.2%), not GDP (R²=6.4%). Premium infant formula demand is squeezed by credit tightening, not just income cycles.
07XXXX Other Dairy — previously shown as fully inelastic (R²<5.2%). Real data shows Rate significance (R²=6.4%, p=0.033). Low magnitude but directionally consistent with other groups — rate environment has a broad, mild dampening effect across dairy categories.
Product Group Regression — ASP + Seasonality Model (Corrected)
Full OLS regression run per product group (01/02/04/07) matching the overall model structure.
Marketing variable corrected: MK** activity codes (~97B/month) used — found not significant in any group.
Model B (ASP + Seasonality) selected as honest baseline. Old proxy results shown in parentheses for comparison.
Group
Simple R² (vs log-qty)
Multiple Regression R²
MAPE
ASP coef
ASP +5%→qty
Key driver
ASP R²
Mkt R² (OLD proxy)
Model B (ASP)
Model C (OLD mkt)
01 Condensed Milk
0.002 ❌
0.068 ✅
0.347
(0.355)
13.97%
−0.649
−3.2%
Structural Trend (β=+0.47)
02 Powdered/Infant Formula
0.611 ✅
0.134 ✅
0.772
(0.830)
10.61%
−0.376
−1.9%
Trend decline (β=−0.66)
04 UHT Liquid Milk
0.380 ✅
0.122 ✅
0.624
(0.631)
9.79%
−0.823
−4.0%
ASP most price-sensitive of all groups
07 Cultured / Yogurt
0.336 ✅
0.020 ❌
0.773
(0.779)
6.46%
−0.052
−0.3%
Seasonality dominates (COS_12 β=−0.76)
TOTAL (blended)
0.278 ✅
0.185 ✅⚠️
0.625
(0.658)
6.99%
−0.212
−1.1%
ASP + Seasonality + Trend
✅ p<0.05 significant · ❌ not significant · ⚠️ OLD proxy endogenous · Mkt NEW (MK** correct) not significant in any group (all p>0.20)
04 UHT — MOST PRICE SENSITIVE
UHT liquid milk is the volume workhorse (242M units/month avg, ASP ~6,285 VND).
Price elasticity = −0.823 — the most responsive group to pricing changes.
A 5% price increase reduces volume by 4.0%, a 10% increase by 7.9%.
Promotions on UHT have the highest volume leverage of any group.
07 YOGURT — SEASONALITY, NOT PRICE
Cultured/yogurt volume is driven almost entirely by seasonality (COS_12 β=−0.76)
and growth trend (TREND β=+0.50). ASP coefficient is near-zero (−0.052) with correct marketing variable —
the positive Veblen effect seen with old variable was a confounding artefact.
Pricing strategy has minimal volume impact on yogurt.
01 CONDENSED — GDP SENSITIVE (REVISED)
Previously classified as macro-immune (R²<1% on amount data). Real quantity data reverses this: GDP R²=18.6% (p=0.0002) — the highest GDP response of the four modelled groups.
Likely mechanism: condensed milk volume correlates with economic activity through distribution intensity — high-GDP years see expanded route-to-market reach and higher fill rates into wet markets and small grocery outlets. Not a consumer income effect but a supply-side/distribution channel effect.
Model R²=0.347, MAPE=14.0% — structurally hard to fit; positive trend coefficient (β=+0.47) still dominates.
02 POWDERED — RATE-DRIVEN DECLINE
Powdered/infant formula (02) retains significance on all 3 macro variables, but the strongest driver on real quantity data is Interest Rate (R²=11.2%, p=0.004) — not GDP as previously shown on amount data (GDP drops to R²=6.4%).
This makes economic sense: infant formula is a high-unit-cost purchase where household credit conditions matter. Rate hikes in 2022 (4.0%→6.0%) coincided with the sharpest volume contraction in this group.
Negative trend coefficient (β=−0.66) confirms structural decline over the 72-month period — demographic trends and fresh/UHT substitution eroding the category regardless of macro environment.
The R² drop from OLD model (0.830) to NEW (0.773) confirms the old marketing proxy was inflating fit for this group specifically.
Page 3 · Exploratory Data Analysis
EDA & Variable Linearity Checks
Author: Truong Phat | n = 72 months | Jan 2020 – Dec 2025 | Variables: Sales, Marketing Spend, CPI, GDP, Interest Rate
EDA Finding: Sales is trended + seasonal. Marketing spend has the only clear linear signal.
Monthly sales shows a clear upward trend 2020–2022, softer in 2023, partial recovery 2024–2025.
Strong seasonal pattern — Q1 (Jan–Feb) consistently lower, Q2 and Q4 peaks.
ASP (r=−0.540, p<0.0001) is the strongest linear signal — confirmed primary driver.
Marketing Spend OLD proxy (r=+0.41, p=0.0003) appeared significant but was endogenous (full 641 GL account). Correct MK** spend (r=+0.12, p=0.327) is not significant.
All macro variables (CPI, GDP, Interest Rate) show weak or no linear signal — see 3B for scatter plots.
Sales Volume Over Time — Jan 2020 to Dec 2025
Monthly Total Sales Volume (M units)
72 months · trend + seasonality clearly visible
Average Sales by Month (Seasonality)
Average across all years — Q1 dip, Q4 peak
Sales vs Marketing Spend — Time Overlay
Dual axis — visual co-movement
Pearson Correlation with Sales Volume
-0.54
ASP (Price)
✅ p<0.0001
+0.42 ⚠️
Mkt OLD (endogenous)
p=0.0002 but spurious
+0.12
Mkt NEW (MK**)
❌ p=0.31 not sig.
-0.30
GDP Growth
⚠️ p=0.011
-0.23
Interest Rate
❌ p=0.053
-0.13
CPI YoY
❌ p=0.275
⚠️ Revised: ASP is the only valid linear signal — Old MKT was endogenous, New MKT is not significant
Scatter plots updated with two additional variables: ASP (average selling price)
— confirmed negative linear relationship (r=−0.540, p<0.0001) — and the corrected
MK** marketing spend (~97B/month, r=+0.121, p=0.31 ❌ not significant).
The old 641 proxy (r=+0.420, p=0.0002 ✅) showed apparent correlation because it co-moves with
overall business scale — a spurious endogeneity artifact, not a causal relationship.
GDP, Interest Rate, and CPI remain excluded. The negative GDP slope
is driven by the Covid-19 outlier (2021 Q3: GDP=−6.17%).
Linearity Checks — Scatter Plots vs Sales Volume (updated)
Sales vs ASP ★ Primary driver
r=−0.540 · Strong negative linear — confirmed ✅ p<0.0001
Sales vs Marketing OLD (641 proxy) ⚠️
r=+0.420 · Appears significant but ENDOGENOUS — spurious correlation
Sales vs Marketing NEW (MK** correct)
r=+0.121 · No significant linear relationship ❌ p=0.31
Descriptive Statistics — All Variables (n=72 months)
Variable
Mean
Min
Max
Std Dev
r with Sales
p-value
Sales Volume (M units)
498.7
317.4
618.6
62.0
—
—
ASP (VND/unit) ★
8,487
7,343
10,314
716
−0.540
<0.0001 ✅
Mkt OLD — 641 proxy ⚠️ endogenous
1,095
856
1,433
123
+0.420
0.0002 ⚠️ spurious
Mkt NEW — MK** correct (B VND)
97
0
209
42
+0.121
0.312 ❌
GDP Growth (%)
5.5
−6.2
13.7
3.5
−0.307
0.009 ⚠️
Interest Rate (%)
4.5
4.0
6.0
0.6
−0.226
0.057 ❌
CPI YoY (%)
3.1
0.1
6.4
1.1
−0.131
0.273 ❌
Powdered Milk (02) receives 37% of marketing investment — and is the most macro-sensitive group
Account 641 marketing spend across 72 months shows Powdered Milk (02) and UHT Liquid Milk (04) receiving the largest share.
This aligns with the product group deep dive (Page 2) finding that 02XXXX is the most macro-sensitive group (R²=18.4%) —
high marketing investment targets consumers most sensitive to economic conditions.
Condensed Milk (01) receives the smallest share — consistent with its inelastic, staple nature.
Total Marketing Spend by Product Group (2020–2025)
Cumulative 641 spend in B VND over 6 years
Marketing Spend Share by Group
% of classified spend (excl. unclassified)
Monthly Marketing Spend Trend — Top 4 Product Groups (B VND)
641 Monthly Spend — 2020 to 2025
Line chart — Powdered Milk dominates but shows highest volatility
Marketing Spend vs Macro Sensitivity — Product Group Summary
Group
Total Spend 6Y
Share
GDP R²
Interpretation
02 — Powdered Milk
23,442B VND
37%
18.4%
Premium/infant formula — highest sensitivity + highest spend
04 — UHT Liquid Milk
16,509B VND
26%
12.2%
Largest volume — core category, moderate sensitivity
07 — Other Dairy
8,186B VND
13%
3.5%
Moderate spend, low sensitivity
01 — Condensed Milk
3,590B VND
6%
0%
Inelastic staple — low spend, marketing less effective
06 — Yogurt
2,096B VND
3%
8.7%
Discretionary — some macro sensitivity
03 — Cream
1,177B VND
2%
4.4%
Niche, low spend
Page 4 · Model Critique & Roadmap
Limitations & Future Improvements
Author: Truong Phat | Honest assessment of what the model does and does not explain
✅
DATA PIPELINE COMPLETE — 30 MAR 2026
Aug–Dec 2025 full data — re-exported by date split, 1,813,481 rows, 150K cap resolved. Model refit complete. ·
Q1 2026 validation — Jan–Mar 2026 actuals loaded, MAPE=2.25% out-of-sample. ·
TET_FEB dummy added — February structural dip captured, R² 0.782→0.823.
Remaining items: product-group ASP decomposition · channel-level full models · lagged marketing investigation.
The model explains ~82% of quantity variance — 18% remains unexplained
The updated model (ASP + Seasonality + TET_FEB dummy, trained on full 72-month dataset) achieves R²=0.823,
up from 0.666 on the original blended-data Model C. The improvement came from:
(1) resolving the Aug–Dec 2025 data cap, (2) adding the TET_FEB dummy capturing the February structural dip,
and (3) using complete real transaction quantity data throughout.
The remaining ~18% unexplained variance is attributable to product mix shifts, channel distribution changes,
and promotional activity — none of which are captured at monthly aggregate level.
Model E with lagged quantity reaches R²=0.689 on the old data — the new model at 0.823 already exceeds
what that ceiling suggested was achievable without additional feature sources.
This is a deliberate trade-off: the model prioritises interpretability and generalisability over
overfitting to the training data. The sections below document the known sources of this unexplained variance
and the planned steps to address them.
Known Limitations
1 · Data Quality — User-Forced Manual Entries
The transaction dataset contains rows where quantity = 0 but amount > 0 (349 rows identified
across 72 months). These are ERP entries forced manually by users — likely to record service fees,
promotional adjustments, or system corrections without a corresponding physical shipment.
While the absolute value is negligible (27.5M VND total, <0.001% of revenue), they indicate
the data is not a pure reflection of physical product movement and requires
judgement in interpretation. Quantity-based metrics (units sold, ASP) are computed excluding these rows.
2 · Unexplained Variance — ~18% of Quantity Change Not Yet Captured IMPROVED from 34%
The updated model (R²=0.823, +TET_FEB dummy, full data) leaves approximately 18% of month-to-month quantity variation unexplained — down from 34% on the original Model C.
Remaining missing variables: competitor pricing and promotions (not observable from internal data);
distribution expansion (new customers, new regions — captured only indirectly via trend);
product mix shift (changing composition within month affects blended ASP — group-level ASP decomposition is the next fix);
trade promotion mechanics (volume discounts, free goods, end-of-quarter push — not separated from
standard revenue); and weather and seasonality interactions beyond the sine/cosine approximation
(e.g. heat waves driving beverage demand).
3 · Blended ASP — Masks Product Mix Effects
The current ASP is computed as total revenue ÷ total units across all products and channels.
This means the ASP can rise not just because prices increased, but because the
product mix shifted toward higher-value SKUs (e.g. more powdered infant formula 02XXXX,
less condensed milk 01XXXX). True price elasticity measurement requires
holding product mix constant — or decomposing ASP by product group.
The elasticity estimate of −0.901 should be interpreted cautiously as it conflates
genuine price effects with mix effects.
4 · Data Cap — Aug–Dec 2025 ✅ RESOLVED
Previously: five months hit the 150,000-row Power BI export limit (Aug: 149,998 · Sep: 149,998 · Oct: 149,999 · Nov: 149,998 · Dec: 149,999 — all capped).
Resolved 30 Mar 2026: re-exported by date split across 14 files, capturing 1,813,481 rows total.
Full monthly counts: Aug: 245,042 · Sep: 250,599 · Oct: 262,960 · Nov: 259,196 · Dec: 404,144.
Model retrained on complete data. Prior figures were understated by an average of ~39% for these months.
Dec 2025 was most affected (404K actual vs 150K capped — a 169% undercount corrected).
A material data quality issue was identified and corrected. The original marketing variable used
SEGMENT3='20MK' from the 641 GL file (~1,095B/month), which captured the entire
marketing department cost centre — including headcount (MH**), agency fees (MA**), and non-A&P lines.
This created spurious correlation: larger business months naturally produce both higher sales volume
and higher 641 department costs, regardless of actual advertising spend. Simple R²=18.5% (p=0.0002)
was an artefact of this endogeneity, not a causal marketing-sales relationship.
The correct variable uses MK01–MK11 activity codes across all departments (~97B/month, 7,007B total).
This genuine A&P spend is statistically insignificant at monthly aggregate level (R²=1.4%, p=0.327).
Model C with correct spend barely improves on the baseline (R²=0.625→0.628, +0.3pp vs the spurious +3.7pp).
Both models are documented. Marketing causal effect cannot be validated with current data
— lagged spend or campaign-level attribution required.
Future Improvements
✅ Completed
Re-export Aug–Dec 2025 Full Data
Re-exported by date split across 14 files — 1,813,481 rows total. Aug: 245,042 · Sep: 250,599 · Oct: 262,960 · Nov: 259,196 · Dec: 404,144. 150K cap eliminated. Model refit on complete data.
✅ Completed
Q1 2026 Out-of-Sample Validation
Jan–Mar 2026 actuals validated. MAPE=2.25% — well below the 8% target. Jan: 2.0% · Feb: 1.9% · Mar: 2.8%. MA12 baseline: 11.18%. LR model beats baseline by +8.9pp. TET_FEB dummy was critical for Feb accuracy.
✅ Completed
TET Seasonality Feature
TET_FEB dummy added to capture structural February dip (Tet stock rundown). R² 0.782→0.823, MAPE 5.20%→4.68% in-sample. Validation improved from 6.78%→2.25%. Now a permanent feature of the model.
✅ Completed
Product-Group Level ASP & Elasticity
Full historical Amount data obtained from Oracle EBS exports. Real group ASPs computed for 80 months (01/02/04/07). Separate Model F regressions: 04XXXX (UHT) most elastic (−1.228, +5%→−5.8%), 01XXXX (Condensed) −0.782, 07XXXX and 02XXXX near price-inelastic (~0). Q1 2026 val MAPEs: 04XXXX 6.6%, 07XXXX 4.7%. All results in nhom_model_results.csv.
✅ Completed
Channel-Level Full Models
Oracle AR system data provided (Ar_data.csv + Ar_data_2.csv) — 10,565 unique customers, SALES_CHANNEL_CODE field, 99.3% match rate on transaction CUSTOMER_NAME. Full 75-month channel series built and Model F run for all 6 channels. CVS most elastic (−1.065), TMDT (−1.125), MT (−0.743), NPP near-inelastic (−0.136). Lagged MK** spend: 0/24 tests significant. Outputs: channel_model_results.csv, lag_test_results.csv.
✅ Completed
Lagged Marketing Spend Investigation
MK** spend tested at lags 0–3 months against total log-quantity and all 6 channels (NPP/MT/CVS/KA/Export/TMDT) — 24 tests total. Result: 0/24 significant. Max simple R²=9.0% (MT lag 2, p=0.012) but ΔR² added to Model F only +0.023 — not practically significant. Marketing spend at monthly aggregate level has no detectable effect on volume at any lag. Full results: lag_test_results.csv.
📋 Planned
Trade Promotion Separation
Separate standard sales from volume-discount and promotional transactions (identifiable by discount code or price deviation from standard). Model baseline demand separately from promotional uplift — standard FMCG revenue management practice.
✅ Completed
ARIMA Benchmark
SARIMA implemented from scratch (OLS conditional MLE, no external library). Grid search over 9 specs guided by ACF/PACF/ADF diagnostics. Best ARIMA: SARIMA(0,0,0)(1,0,0)[12], Q1 2026 MAPE=6.65%. OLS Model F (2.25%) outperforms by 4.4pp — confirms ASP is a genuine causal signal, not time-correlated noise. ARIMA is blind to price changes; OLS uses ASP directly as input.
📋 Planned
Competitor & External Price Data
Incorporate competitor retail price indices (Nielsen/Kantar) for cross-price elasticity. With model now at R²=0.823, the remaining 18% unexplained is the primary target — competitor pricing and promotional calendar are likely drivers.
Where We Stand — R² Progress & Next Targets
Improvement
R² impact
Status
Priority
✅ Fix Aug–Dec 2025 data cap
Done
Completed 30 Mar 2026
✅ Done
✅ TET_FEB dummy
+4.1pp R²
0.782 → 0.823
✅ Done
✅ Q1 2026 validation
MAPE 2.25%
Target was <8%
✅ Done
✅ Product-group ASP decomposition
+3–6%
Completed — real group ASPs
✅ Done
✅ Channel-level full model refit
+4–8%
Completed — AR system data used
✅ Done
✅ Lagged marketing investigation
+0%
Completed — 0/24 significant
✅ Done
Competitor price data
+3–10%
External dependency
★ Low
Current: R²=0.823, MAPE=4.68% in-sample · 2.25% out-of-sample (Q1 2026).
Target with group ASP + channel models: R² = 0.87–0.90. R²=0.90+ requires competitor data or SKU-level promo tracking.
Channel-level models confirm: CVS is most price-sensitive, KA is macro-sensitive, marketing remains insignificant at all lags
Customer-to-channel mapping via AR data (98.8% coverage) enables separate OLS regressions per channel.
CVS (Convenience Store) has the highest price elasticity (−0.576, ASP+5%→−2.8% qty) —
convenience shoppers are discretionary and price-responsive.
KA (Key Account) shows interest rate as top driver (β=−0.477) — institutional buyers
are sensitive to credit conditions. NPP (Distributor) is trend-driven with low elasticity (−0.231),
consistent with contractual/bulk ordering. Lagged MK** spend (0–3 months) adds at most ΔR²=+0.004 —
confirming no detectable advertising effect at monthly aggregate level.
HD (School Milk) excluded from modelling — extremely small/lumpy volumes produce unreliable estimates.
Channel Regression Results — Model B (ASP + Seasonality, honest)
Channel
ASP Simple R²
MKT OLD R²
MKT NEW R²
Model B R²
MAPE
ASP coef
ASP +5%→qty
Top driver
NPP — Distributor
0.411 ✅
0.098 ✅
0.026 ❌
0.668
8.64%
−0.231
−1.1%
Trend (β=−0.677)
MT — Modern Trade
0.048 ❌
0.233 ✅
0.002 ❌
0.562
9.11%
−0.381
−1.9%
Interest Rate (β=−0.287)
CVS — Convenience
0.022 ❌
0.098 ✅
0.001 ❌
0.695
7.63%
−0.576
−2.8%
Most price-sensitive
KA — Key Account
0.022 ❌
0.044 ❌
0.001 ❌
0.601
15.54%
−0.508
−2.5%
Interest Rate (β=−0.477)
TOTAL (blended)
0.278 ✅
0.185 ⚠️
0.021 ❌
0.625
6.99%
−0.212
−1.1%
ASP + Seasonality
HD (School Milk) excluded — 65 months, MAPE=830%, elasticity=−10.4 — too small/lumpy for reliable OLS.
MKT NEW = correct MK** codes (~97B/month). All channels: MKT NEW not significant (p>0.20).
ASP by Channel Over Time
Average selling price per unit (VND) · KA has highest ASP — premium institutional pricing
No lagged effect detected — MK** A&P spend is insignificant at all lag lengths
If advertising drives sales with a delay (awareness → consideration → purchase), we would expect
MK** spend from month t−1, t−2, or t−3 to correlate with quantity in month t.
Testing all four lags: the best is lag-3 which achieves Simple R²=6.4% (p=0.035) but adds
only ΔR²=+0.003 to the full model — economically negligible.
The MK** spend at ~97B/month (~0.03% of revenue) is likely too small relative to total business
scale and monthly noise to detect statistically. Campaign-level or weekly data would be needed.
Simple R² by Lag — MK** vs Log(Qty)
Lag-3 is borderline significant (p=0.035) but economically negligible
ΔR² Added to Model B by Lag
Incremental R² gain from adding each lagged MKT to ASP + Seasonality