Key Account customers are split into two tiers — KAC (Key Account Chain) and
KAM (Key Account Management) — each with different pricing policies, discount structures,
and contract terms. Classification currently relies on manual assignment via the Mã LH field
(KA01 = KAC, KA00 = KAM) inside F1 pricing approval forms. This project builds a
supervised ML model that predicts the classification from customer-profile features,
and extends to K-Means clustering to discover natural customer segments
beyond the binary KAC/KAM split.
| Field Extracted | Source in PDF | Coverage (% of forms) | Used As Feature |
|---|---|---|---|
| Mã LH (KA00/KA01) | Section I — Customer Info | 67% | TARGET LABEL |
| Client type code (XXXX) | Mã LH suffix | 85% | ✅ One-hot encoded |
| Number of ship-to locations | Số lượng Ship-to | 67% | ✅ Numeric |
| Revenue target / year | Section II — Policy table | 95% | ✅ Log-transformed |
| Revenue target / month | Section II — Policy table | 95% | ✅ Numeric |
| Total discount + support % | Tổng tỷ lệ hỗ trợ | 97% | ✅ Numeric |
| Contract duration (months) | Thời hạn | 98% | ✅ Numeric |
| Historical sales 12M | Section III — History | 81% | ✅ Log-transformed |
| Number of product categories | Policy table rows | 97% | ✅ Numeric |
| Geographic coverage | Phân bố tại | 100% | ✅ Binary flags |
| Order frequency / month | Tần suất mua hàng | 45% | ✅ Numeric (median-filled) |
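As a sketch of how those fields might become model inputs: log transforms for the skewed VND amounts, median fill for the low-coverage numerics, and one-hot encoding for the client type code. The column names and toy values below are illustrative, not the actual extraction schema.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for parsed F1 forms; names/values are illustrative
df = pd.DataFrame({
    "NUM_SHIP_TO": [1, 12, 3, None],
    "REVENUE_TARGET_YEAR": [1.2e9, 5.0e10, 8.0e8, 2.4e9],
    "HIST_SALES_12M": [9.0e8, 4.1e10, 7.5e8, 1.9e9],
    "ORDER_FREQ_MONTH": [4.0, None, 2.0, None],
    "CLIENT_TYPE": ["BVXX", "STXX", "BVXX", "KSXX"],
})

# Log-transform the heavily skewed VND amounts (log1p is zero-safe)
for col in ["REVENUE_TARGET_YEAR", "HIST_SALES_12M"]:
    df[f"LOG_{col}"] = np.log1p(df[col])

# Median-fill the partially covered numerics (order frequency is only 45% covered)
for col in ["NUM_SHIP_TO", "ORDER_FREQ_MONTH"]:
    df[col] = df[col].fillna(df[col].median())

# One-hot encode the client type code (e.g. BVXX = hospital)
df = pd.get_dummies(df, columns=["CLIENT_TYPE"], prefix="CG")
```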
Gradient Boosting achieved the highest test accuracy (90%), and both tree ensembles outperform Logistic Regression on raw AUC. Logistic Regression was nevertheless chosen, for three reasons:

1. **Interpretability.** A KA manager needs to understand *why* a customer is classified as KAC or KAM; LR coefficients provide direct, explainable answers.
2. **Sample-size risk.** With only 402 labelled records, tree-based models risk overfitting patterns that won't generalise to new customers; LR with L2 regularisation is more robust.
3. **Deployment simplicity.** LR outputs a calibrated probability score per customer that can be thresholded and explained in a business context.

Random Forest is retained as a sanity check: when both models agree, confidence is high; when they disagree, the case is flagged for manual review.
| Model | CV AUC | CV Std | Test AUC | Test Accuracy | Chosen? |
|---|---|---|---|---|---|
| Logistic Regression ★ | 0.910 | ±0.046 | 0.868 | 79% | ✅ Chosen |
| Random Forest | 0.930 | ±0.041 | 0.931 | 86% | Sanity check |
| Gradient Boosting | 0.939 | ±0.042 | 0.938 | 90% | Overfit risk |
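The evaluation protocol and the LR-vs-RF agreement rule can be sketched as follows. Synthetic data stands in for the 402 labelled records, so the printed scores will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 402 labelled records
X, y = make_classification(n_samples=402, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(random_state=42)

# Cross-validated AUC, as reported in the comparison table
for name, model in [("LR", lr), ("RF", rf)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: CV AUC {scores.mean():.3f} (std {scores.std():.3f})")

# Sanity-check rule: flag disagreements between the two models for manual review
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
disagree = lr.predict(X_test) != rf.predict(X_test)
print(f"{disagree.sum()} of {len(X_test)} test cases flagged for review")
```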
| Parameter | Value | Rationale |
|---|---|---|
| C | 1.0 | Moderate regularisation |
| penalty | l2 | Ridge — keeps all features |
| solver | lbfgs | Efficient for small n |
| max_iter | 1000 | Ensures convergence |
| class_weight | None | 60/40 split — balanced |
| scaler | StandardScaler | Required for LR stability |
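The configuration in the table maps directly onto a scikit-learn pipeline; every value below is from the table, spelled out explicitly even where it matches the library default. The smoke-test data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(
    StandardScaler(),                # LR needs comparable feature scales
    LogisticRegression(
        C=1.0,                       # moderate L2 regularisation
        penalty="l2",                # ridge: shrinks but keeps all features
        solver="lbfgs",              # efficient for small n
        max_iter=1000,               # headroom to ensure convergence
        class_weight=None,           # 60/40 label split is close to balanced
    ),
)

# Smoke test on synthetic data of the same size as the labelled set
X, y = make_classification(n_samples=402, n_features=10, random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # probability score per customer
```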
The most interpretable findings from the LR coefficients: KAC customers have more ship-to locations (+2.00); chain customers by definition have multiple outlets. Hospital clients (BVXX) strongly predict KAM (-0.83); medical-channel customers are managed differently regardless of size. HCM-based customers lean KAC (+1.12); chain headquarters are concentrated in Ho Chi Minh City. A higher monthly revenue target also predicts KAC (+1.06); chains commit to larger monthly volumes than institutional KAM accounts.
| Feature | Engineering | LR Coef | Direction |
|---|---|---|---|
| NUM_SHIP_TO | Raw count | +2.001 | → KAC |
| IS_HCM | Binary flag from geo coverage | +1.121 | → KAC |
| REVENUE_TARGET_MONTH | Raw VND value | +1.062 | → KAC |
| NUM_PRODUCT_GROUPS | Count of product lines in policy | +0.992 | → KAC |
| NUM_CATEGORIES | Count of product categories (A/B/C/D) | -0.840 | → KAM |
| IS_NATIONAL | Binary flag — nationwide delivery | +0.832 | → KAC |
| CG_HOSPITAL | One-hot from client type code | -0.829 | → KAM |
| CG_SERVICE | One-hot — hotels, airports, transport | +0.818 | → KAC |
| LOG_REVENUE_TARGET_YEAR | Log transform — skewed distribution | -0.714 | → KAM |
| LOG_HIST_SALES_12M | Log transform — 12M historical sales | -0.523 | → KAM |
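A coefficient table like the one above can be read straight off a fitted model: after scaling, the signed coefficients are directly comparable, with positive values pushing toward KAC and negative toward KAM. The features and data below are toy stand-ins (three of the real feature names, synthetic values), so the numbers will differ from the table.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["NUM_SHIP_TO", "IS_HCM", "REVENUE_TARGET_MONTH"]
rng = np.random.default_rng(0)
X = rng.normal(size=(402, 3))
# Synthetic label: 1 = KAC, driven mostly by the first two features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=402) > 0).astype(int)

scaler = StandardScaler().fit(X)
lr = LogisticRegression(C=1.0, max_iter=1000).fit(scaler.transform(X), y)

# Signed coefficients: positive pushes toward KAC, negative toward KAM
coefs = pd.Series(lr.coef_[0], index=feature_names).sort_values(ascending=False)
print(coefs)
```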
K-Means clustering on the full AR customer dataset (2,219 KA accounts — ship-to locations, billing geography, credit limits, payment terms, profile class) reveals that customer behaviour does not split cleanly into just KAC and KAM. Instead, 5 natural segments emerge: a pure hospital KAM cluster, two KAC-leaning commercial clusters differentiated by geographic scale, and two KAM-leaning institutional clusters (schools/hotels vs industrial). K=5 achieves the highest silhouette score (0.168), making it the best-scoring choice for the number of segments. The National Chain cluster (76 customers, avg 90 ship-to locations, 100% national coverage) is the clearest KAC signal in the entire dataset.
| Segment | N | KAC% | KAM% | Median Ship-to | Avg Cities | HCM% | National% | Top Profile |
|---|---|---|---|---|---|---|---|---|
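The k-selection step behind the K=5 choice can be sketched as a silhouette scan. The blobs below are synthetic and deliberately well separated, so the best-scoring k comes out cleanly at 5 with a far higher silhouette than the 0.168 seen on the real, overlapping AR data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Five well-separated synthetic blobs stand in for the AR feature matrix
rng = np.random.default_rng(7)
centers = 10 * np.eye(5, 4)          # 5 cluster centres in a 4-D feature space
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(120, 4)) for c in centers])
X = StandardScaler().fit_transform(X)

# Scan k and keep the silhouette-maximising value
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(f"best k = {best_k} (silhouette {scores[best_k]:.3f})")
```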
The clearest separators between KAC and KAM are the number of ship-to locations and geographic spread, not just the type of customer. The National Chain cluster (76 customers, avg 90 ship-to, 100% national) is 83% KAC. The Commercial cluster (531 customers, local scale) is 70% KAC but single-city. Meanwhile, hospitals, schools, and industrial accounts are predominantly KAM regardless of size: they are managed through the medical/institutional channel whatever their geographic footprint. This suggests the KAC/KAM boundary is as much about distribution model as it is about customer type.
The supervised model answers "is this customer KAC or KAM?" But a more interesting business question is: "are there natural sub-segments within KAC and KAM that behave differently?" K-Means clustering on the same feature matrix will reveal whether, for example, high-revenue hospital KAM accounts are actually more similar to chain KAC accounts than to small hospital KAM accounts. This unsupervised layer will feed into a richer customer segmentation strategy beyond the current binary classification.