KA Customer Classification

Problem: 676 pricing approval PDFs, no structured database

Key Account customers are split into two tiers — KAC (Key Account Chain) and KAM (Key Account Management) — each with different pricing policies, discount structures, and contract terms. Classification currently relies on manual assignment via the Mã LH field (KA01 = KAC, KA00 = KAM) inside F1 pricing approval forms. This project builds a supervised ML model to predict classification from customer profile features, and will expand to K-Means clustering to discover natural customer segments beyond the binary KAC/KAM split.

ETL Pipeline — From PDFs to Feature Matrix

676 PDFs (raw F1 forms) → pdftotext (layout extraction) → Regex Parser (15 fields per doc) → Feature Matrix (402 × 21)
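The extraction and parsing steps above can be sketched as follows. This is a minimal illustration, not the project's actual parser: the field layouts matched by the regular expressions (for example `Mã LH: KA01 - GKCF`), the function names, and the returned keys are all assumptions. Only the use of `pdftotext` in layout mode and the KA01 = KAC / KA00 = KAM mapping come from the text.

```python
import re
import subprocess

# Assumed field layouts; the real F1 forms may be laid out differently.
MA_LH_RE = re.compile(r"Mã\s*LH[ \t]*:?[ \t]*(KA0[01])(?:[ \t]*[-/]?[ \t]*([A-Z]{4}))?")
SHIP_TO_RE = re.compile(r"Số lượng Ship-to[ \t]*:?[ \t]*(\d+)")
LABELS = {"KA01": "KAC", "KA00": "KAM"}


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext in layout mode; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def parse_f1_text(text: str) -> dict:
    """Pull a few fields out of the extracted text; missing fields become None."""
    ma_lh = MA_LH_RE.search(text)
    ship_to = SHIP_TO_RE.search(text)
    code = ma_lh.group(1) if ma_lh else None
    return {
        "ma_lh": code,
        "label": LABELS.get(code),  # None for older forms without Mã LH
        "client_type": ma_lh.group(2) if ma_lh else None,
        "num_ship_to": int(ship_to.group(1)) if ship_to else None,
    }


sample = "Mã LH: KA01 - GKCF\nSố lượng Ship-to: 12\n"
print(parse_f1_text(sample))
```

Records where `label` comes back as `None` correspond to the forms that need manual review.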

Field Extracted	Source in PDF	Coverage	Used As Feature
Mã LH (KA00/KA01)	Section I — Customer Info	67%	TARGET LABEL
Client type code (XXXX)	Mã LH suffix	85%	✅ One-hot encoded
Number of ship-to locations	Số lượng Ship-to	67%	✅ Numeric
Revenue target / year	Section II — Policy table	95%	✅ Log-transformed
Revenue target / month	Section II — Policy table	95%	✅ Numeric
Total discount + support %	Tổng tỷ lệ hỗ trợ	97%	✅ Numeric
Contract duration (months)	Thời hạn	98%	✅ Numeric
Historical sales 12M	Section III — History	81%	✅ Log-transformed
Number of product categories	Policy table rows	97%	✅ Numeric
Geographic coverage	Phân bố tại	100%	✅ Binary flags
Order frequency / month	Tần suất mua hàng	45%	✅ Numeric (median-filled)
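The transformations listed in the table (log transform, median fill, one-hot encoding, binary flags) might look roughly like this. It is a sketch under assumptions: the input record keys, the `ORDER_FREQ` column name, and the choice of `log1p` rather than a plain logarithm are hypothetical. The other feature names and the BVXX = Hospital mapping are taken from elsewhere on this page.

```python
import math
from statistics import median


def build_features(records):
    """Turn parsed records into model features (illustrative subset only)."""
    known_freq = [r["order_freq"] for r in records if r["order_freq"] is not None]
    # Median fill for the sparsely covered order-frequency field.
    fill = median(known_freq) if known_freq else 0.0
    rows = []
    for r in records:
        rows.append({
            # log1p is an assumption; it keeps zero revenue targets finite.
            "LOG_REVENUE_TARGET_YEAR": math.log1p(r["revenue_target_year"]),
            "ORDER_FREQ": r["order_freq"] if r["order_freq"] is not None else fill,
            # One-hot flag derived from the client type code.
            "CG_HOSPITAL": int(r["client_type"] == "BVXX"),
            # Binary flag derived from geographic coverage.
            "IS_HCM": int("HCM" in r["geo"]),
        })
    return rows


records = [
    {"revenue_target_year": 0, "order_freq": 4, "client_type": "BVXX", "geo": ["HCM"]},
    {"revenue_target_year": 1e9, "order_freq": None, "client_type": "GKCF", "geo": ["HCM", "HN"]},
    {"revenue_target_year": 5e8, "order_freq": 8, "client_type": "CQXX", "geo": ["DN"]},
]
for row in build_features(records):
    print(row)
```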

Label Distribution — KAC vs KAM

From 402 labelled records (191 pending manual review)

Client Type Code Distribution

BVXX = Hospital · CQXX = Corp Office · GKCF = F&B Chain

Logistic Regression — Chosen for interpretability over raw performance

Gradient Boosting achieved the highest test accuracy (90%), and both Gradient Boosting and Random Forest outperform Logistic Regression on raw AUC. Logistic Regression was nevertheless chosen for three reasons: (1) Interpretability: a KA manager needs to understand why a customer is classified as KAC or KAM, and LR coefficients give direct, explainable answers. (2) Sample size risk: with only 402 labelled records, tree-based models risk overfitting patterns that won't generalise to new customers, whereas a regularised linear model is less prone to this. (3) Deployment simplicity: LR outputs a calibrated probability score per customer that can be thresholded and explained in a business context. RF is retained as a sanity check: when both models agree, confidence is high; when they disagree, the case is flagged for manual review.
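The agreement rule between the two models can be sketched in a few lines. The function name, the 0.5 threshold, and the output keys are assumptions made for illustration; the text only states that disagreement triggers manual review.

```python
def review_decision(p_lr: float, p_rf: float, threshold: float = 0.5) -> dict:
    """Compare the KAC probabilities from LR and RF for one customer.

    p_lr and p_rf are each model's predicted probability of KAC.
    The 0.5 threshold is an assumed default, not a documented setting.
    """
    lr_label = "KAC" if p_lr >= threshold else "KAM"
    rf_label = "KAC" if p_rf >= threshold else "KAM"
    agree = lr_label == rf_label
    return {
        "lr_label": lr_label,
        "rf_label": rf_label,
        # Only commit to a label when both models agree.
        "label": lr_label if agree else None,
        "needs_review": not agree,
    }


print(review_decision(0.81, 0.74))  # both say KAC
print(review_decision(0.56, 0.31))  # models disagree, goes to manual review
```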

Model Comparison — Cross-Validation (5-fold, stratified)

CV AUC Comparison — All 3 Models

Mean ± std across 5 folds · higher = better · chosen model highlighted

Model	CV AUC	CV Std	Test AUC	Test Accuracy	Chosen?
Logistic Regression ★	0.910	±0.046	0.868	79%	✅ Chosen
Random Forest	0.930	±0.041	0.931	86%	Sanity check
Gradient Boosting	0.939	±0.042	0.938	90%	Overfit risk
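A comparison of this kind could be run with scikit-learn roughly as below. The sketch uses synthetic data from `make_classification` as a stand-in for the real 402 × 21 feature matrix, so the printed numbers will not match the table. The shuffling, the random seeds, and the default tree-model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (402 rows, 21 features, ~60/40 classes).
X, y = make_classification(
    n_samples=402, n_features=21, n_informative=8, weights=[0.4, 0.6], random_state=0
)

models = {
    # LogisticRegression defaults are L2 penalty with the lbfgs solver.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# 5-fold stratified CV keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: CV AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into the scaling.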

Confusion Matrix — Logistic Regression (Test Set)

81 test records · 49 KAC · 32 KAM

Cells: True KAM · False KAC · False KAM · True KAC

Precision KAC: 82% · Recall KAC: 84% · Precision KAM: 74% · Recall KAM: 72% · Overall Acc: 79%
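The cell counts of the matrix are not present in the text, but they can be recovered from the reported figures. The sketch below searches for counts consistent with the class sizes (49 KAC, 32 KAM) and the five percentages, assuming each percentage was rounded to the nearest whole number. The variable names are illustrative.

```python
N_KAC, N_KAM = 49, 32  # test-set class sizes from the caption

reported = {
    "precision_kac": 82,
    "recall_kac": 84,
    "precision_kam": 74,
    "recall_kam": 72,
    "accuracy": 79,
}


def metrics(true_kac: int, true_kam: int) -> dict:
    """Rounded percentages implied by a given pair of correct-prediction counts."""
    false_kam = N_KAC - true_kac  # actual KAC predicted as KAM
    false_kac = N_KAM - true_kam  # actual KAM predicted as KAC
    return {
        "precision_kac": round(100 * true_kac / (true_kac + false_kac)),
        "recall_kac": round(100 * true_kac / N_KAC),
        "precision_kam": round(100 * true_kam / (true_kam + false_kam)),
        "recall_kam": round(100 * true_kam / N_KAM),
        "accuracy": round(100 * (true_kac + true_kam) / (N_KAC + N_KAM)),
    }


# Brute-force every possible (True KAC, True KAM) pair.
matches = [
    (tk, tm)
    for tk in range(1, N_KAC + 1)
    for tm in range(1, N_KAM + 1)
    if metrics(tk, tm) == reported
]
print(matches)  # -> [(41, 23)]
```

Under that rounding assumption the only consistent matrix is True KAC = 41, False KAM = 8, False KAC = 9, True KAM = 23, which gives 64 correct out of 81.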

Hyperparameters — Logistic Regression

Current settings · to be tuned on full dataset

Parameter	Value	Rationale
C	1.0	Moderate regularisation
penalty	l2	Ridge — keeps all features
solver	lbfgs	Efficient for small n
max_iter	1000	Ensures convergence
class_weight	None	60/40 split, only mildly imbalanced
scaler	StandardScaler	Required for LR stability

Pending: GridSearchCV over C=[0.01, 0.1, 0.5, 1, 5, 10] + L1 vs L2 on full dataset

Key finding: Number of ship-to locations is the strongest KAC predictor

The most interpretable finding from the LR coefficients: KAC customers have more ship-to locations (coefficient +2.00) — chain customers by definition have multiple outlets. Hospital clients (BVXX) strongly predict KAM (-0.83) — medical channel customers are managed differently regardless of size. HCM-based customers lean KAC (+1.12) — chain headquarters are concentrated in Ho Chi Minh City. Higher monthly revenue target also predicts KAC (+1.06) — chains commit to larger monthly volumes than institutional KAM accounts.

LR Coefficients — Top 10 Features

Positive = predicts KAC · Negative = predicts KAM

Random Forest Feature Importance

Sanity check — consistent top features across both models

Feature Engineering Decisions

Feature	Engineering	LR Coef	Direction
NUM_SHIP_TO	Raw count	+2.001	→ KAC
IS_HCM	Binary flag from geo coverage	+1.121	→ KAC
REVENUE_TARGET_MONTH	Raw VND value	+1.062	→ KAC
NUM_PRODUCT_GROUPS	Count of product lines in policy	+0.992	→ KAC
NUM_CATEGORIES	Count of product categories (A/B/C/D)	-0.840	→ KAM
IS_NATIONAL	Binary flag — nationwide delivery	+0.832	→ KAC
CG_HOSPITAL	One-hot from client type code	-0.829	→ KAM
CG_SERVICE	One-hot — hotels, airports, transport	+0.818	→ KAC
LOG_REVENUE_TARGET_YEAR	Log transform — skewed distribution	-0.714	→ KAM
LOG_HIST_SALES_12M	Log transform — 12M historical sales	-0.523	→ KAM
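To show how these coefficients turn into a prediction, the sketch below applies the logistic function to a weighted sum of features. Several things are assumed: the intercept is not reported, so it is set to 0; feature values are taken to be in standardised units (because a StandardScaler is used), so a value of +1 means one standard deviation above the mean; and only four of the coefficients are included, with all other features treated as sitting at their mean.

```python
import math

# Coefficients copied from the table above (subset only).
COEFS = {
    "NUM_SHIP_TO": 2.001,
    "IS_HCM": 1.121,
    "REVENUE_TARGET_MONTH": 1.062,
    "CG_HOSPITAL": -0.829,
}
INTERCEPT = 0.0  # assumption: the fitted intercept is not given on this page


def kac_probability(z_features: dict, intercept: float = INTERCEPT) -> float:
    """P(KAC) from standardised feature values via the logistic function."""
    logit = intercept + sum(COEFS[name] * value for name, value in z_features.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Odds ratio per one-standard-deviation increase in each feature.
odds_ratios = {name: math.exp(coef) for name, coef in COEFS.items()}

# A customer one standard deviation above average on ship-to count only.
p = kac_probability({"NUM_SHIP_TO": 1.0})
print(f"P(KAC) = {p:.2f}")
print({name: round(value, 2) for name, value in odds_ratios.items()})
```

Read this way, a coefficient of +2.001 means that each additional standard deviation of ship-to locations multiplies the odds of KAC by roughly 7.4, holding the other features fixed.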

K=5 wins — 5 natural segments emerge from 2,219 KA customers, each with distinct KAC/KAM composition

K-Means clustering on the full AR customer dataset (2,219 KA accounts, described by ship-to locations, billing geography, credit limits, payment terms, and profile class) shows that customer behaviour does not split cleanly into just KAC and KAM. Instead, 5 segments emerge: a pure hospital KAM cluster, two KAC-leaning commercial clusters differentiated by geographic scale, and two KAM-leaning institutional clusters (schools/hotels vs industrial). Of the values tested (K=3, 4, 5), K=5 achieves the highest silhouette score (0.168), so 5 was selected as the number of segments. The National Chain cluster (76 customers, avg 90 ship-to locations, 100% national coverage) is the clearest KAC signal in the entire dataset.
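Selecting K by silhouette score can be sketched with scikit-learn as follows. The data here is synthetic (`make_blobs`), standing in for the AR customer features, so the scores will differ from the 0.168 reported above. The scaling step, `n_init`, and the random seeds are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature table.
X, _ = make_blobs(n_samples=500, centers=5, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)  # K-Means is distance-based, so scale first

scores = {}
for k in (3, 4, 5):  # the candidate values mentioned in the roadmap
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best K among candidates:", best_k)
```

The silhouette score ranges from -1 to 1, with values near 0 indicating overlapping clusters, so it is worth reading the score itself as well as the ranking of K values.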

Silhouette Score by K

Higher = better defined clusters · K=5 optimal

Cluster Size Distribution — Current K

Customer count per segment

KAC vs KAM Composition — Per Cluster (Current K)

% split of labelled customers within each segment

Segment Profiles — K=5 (Best)

Segment	N	KAC%	KAM%	Median Ship-to	Avg Cities	HCM%	National%	Top Profile

Key finding: Geography and scale — not industry — define KAC vs KAM

The clearest separator between KAC and KAM is the number of ship-to locations and geographic spread, not just the type of customer. The National Chain cluster (76 customers, avg 90 ship-to, 100% national) is 83% KAC. The Commercial cluster (531 customers, local scale, single-city) is 70% KAC. Meanwhile, hospitals, schools, and industrial accounts are predominantly KAM regardless of size or geographic footprint, because they are managed through the medical/institutional channel. This suggests the KAC/KAM boundary is as much about distribution model as it is about customer type.

Next: Expand to K-Means clustering — go beyond the binary KAC/KAM split

The supervised model answers "is this customer KAC or KAM?" But a more interesting business question is: "are there natural sub-segments within KAC and KAM that behave differently?" K-Means clustering on the same feature matrix will reveal whether, for example, high-revenue hospital KAM accounts are actually more similar to chain KAC accounts than to small hospital KAM accounts. This unsupervised layer will feed into a richer customer segmentation strategy beyond the current binary classification.

Project Roadmap

✅

PDF ETL Pipeline Completed

Extracted 15 structured fields from 676 F1 pricing approval PDFs using pdftotext + regex parser. 593/594 PDFs successfully processed. 402 records labelled from Mã LH field.

✅

Baseline Classification Model Completed

Logistic Regression trained on 402 labelled records. CV AUC = 0.910. Compared against Random Forest (0.930) and Gradient Boosting (0.939). LR chosen for interpretability and generalisation robustness on small dataset.

🔄

Manual Label Completion In Progress

191 unlabelled PDFs identified (older F1 format without Mã LH field). Manual review Excel exported — labels being filled. Expected to grow dataset to ~590 records (+47%).

📋

Hyperparameter Tuning Planned

GridSearchCV over C=[0.01, 0.1, 0.5, 1.0, 5.0, 10.0] and L1 vs L2 penalty. Re-evaluate class_weight='balanced' with expanded dataset. Expected AUC improvement to 0.92+.
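The planned search might be set up as sketched below. One detail to be aware of: scikit-learn's `lbfgs` solver does not support the L1 penalty, so this sketch switches to `liblinear`, which handles both L1 and L2. The synthetic data, the seeds, and the inclusion of `class_weight` in the same grid are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expanded dataset (~590 records).
X, y = make_classification(
    n_samples=590, n_features=21, n_informative=8, weights=[0.4, 0.6], random_state=0
)

# liblinear supports both L1 and L2; lbfgs would fail on the L1 candidates.
pipe = make_pipeline(
    StandardScaler(), LogisticRegression(solver="liblinear", max_iter=1000)
)

param_grid = {
    "logisticregression__C": [0.01, 0.1, 0.5, 1.0, 5.0, 10.0],
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV AUC: {search.best_score_:.3f}")
```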

✅

K-Means Customer Segmentation Completed

K-Means clustering (K=3,4,5) on 2,219 KA customers using AR ship-to/bill-to data — geographic spread, credit limits, payment terms, profile class. K=5 optimal (silhouette=0.168). 5 natural segments identified: Hospital KAM, Commercial KAC, School & Hotel KAM, Industrial KAM, National Chain KAC. Key finding: geographic scale (ship-to count, city spread) is the primary KAC differentiator, not industry alone.

📋

Segment Profiling & Business Recommendations Planned

Profile each K-Means cluster by revenue, discount rate, product mix, and geography. Map clusters to existing KAC/KAM labels to identify mis-classifications and pricing policy opportunities.
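One possible way to map clusters to existing labels is to flag customers whose label disagrees with a strong majority in their cluster. The sketch below is hypothetical throughout: the record layout, the function name, the example customers, and the 80% majority threshold are illustrative choices, not part of the project.

```python
from collections import Counter, defaultdict


def flag_label_outliers(customers, min_share=0.8):
    """Return ids of customers whose label differs from a strong cluster majority."""
    labels_by_cluster = defaultdict(list)
    for c in customers:
        if c["label"] is not None:
            labels_by_cluster[c["cluster"]].append(c["label"])

    flagged = []
    for c in customers:
        labels = labels_by_cluster.get(c["cluster"])
        if c["label"] is None or not labels:
            continue  # unlabelled customers cannot be mis-classified
        majority, count = Counter(labels).most_common(1)[0]
        # Only flag when the cluster has a clear majority label.
        if count / len(labels) >= min_share and c["label"] != majority:
            flagged.append(c["id"])
    return flagged


customers = (
    [{"id": f"N{i}", "cluster": "National Chain", "label": "KAC"} for i in range(5)]
    + [{"id": "N5", "cluster": "National Chain", "label": "KAM"}]
    + [{"id": "C0", "cluster": "Commercial", "label": "KAC"},
       {"id": "C1", "cluster": "Commercial", "label": "KAC"},
       {"id": "C2", "cluster": "Commercial", "label": "KAM"},
       {"id": "C3", "cluster": "Commercial", "label": None}]
)
print(flag_label_outliers(customers))  # -> ['N5']
```

Flagged accounts are candidates for review, not confirmed errors, since a cluster majority is only a heuristic.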