Commercial Analytics · Key Account Management · ML Classification

KA Customer Classification
KAC vs KAM

Source: 676 F1 Pricing Approval PDFs
Labelled: 402 records
Model: Logistic Regression
CV AUC: 0.910
PDF ETL · NLP Extraction · Logistic Regression · scikit-learn · Feature Engineering · ✅ K-Means K=3,4,5 · 🔄 Expanding Dataset
2,219 KA Customers · 402 Labelled (F1) · 0.910 CV AUC (LR) · 0.168 Silhouette (K=5) · 5 Segments Found

Problem: 676 pricing approval PDFs, no structured database

Key Account customers are split into two tiers — KAC (Key Account Chain) and KAM (Key Account Management) — each with different pricing policies, discount structures, and contract terms. Classification currently relies on manual assignment via the Mã LH field (KA01 = KAC, KA00 = KAM) inside F1 pricing approval forms. This project builds a supervised ML model to predict classification from customer profile features, and will expand to K-Means clustering to discover natural customer segments beyond the binary KAC/KAM split.

ETL Pipeline — From PDFs to Feature Matrix
📄 676 PDFs (raw F1 forms) → ⚙️ pdftotext (layout extraction) → 🔍 Regex Parser (15 fields per doc) → 📊 Feature Matrix (402 × 21)
Field Extracted | Source in PDF | Coverage | Used As Feature
Mã LH (KA00/KA01) | Section I: Customer Info | 67% | TARGET LABEL
Client type code (XXXX) | Mã LH suffix | 85% | ✅ One-hot encoded
Number of ship-to locations | Số lượng Ship-to | 67% | ✅ Numeric
Revenue target / year | Section II: Policy table | 95% | ✅ Log-transformed
Revenue target / month | Section II: Policy table | 95% | ✅ Numeric
Total discount + support % | Tổng tỷ lệ hỗ trợ | 97% | ✅ Numeric
Contract duration (months) | Thời hạn | 98% | ✅ Numeric
Historical sales 12M | Section III: History | 81% | ✅ Log-transformed
Number of product categories | Policy table rows | 97% | ✅ Numeric
Geographic coverage | Phân bố tại | 100% | ✅ Binary flags
Order frequency / month | Tần suất mua hàng | 45% | ✅ Numeric (median-filled)
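The regex parsing step could look roughly like the sketch below. Only the KA01 = KAC / KA00 = KAM mapping and the Vietnamese field labels come from this page. The patterns, the `KA01-GKCF` code layout, the sample text, and the function name `parse_f1_text` are assumptions for illustration, since the real F1 form layout is not shown here.

```python
import re

# Mapping stated in the project description: KA01 = KAC, KA00 = KAM.
LABEL_BY_CODE = {"KA01": "KAC", "KA00": "KAM"}

# Hypothetical patterns: the real F1 layout may differ.
MA_LH_RE = re.compile(r"Mã LH\s*:?\s*(KA0[01])(?:[-\s]?([A-Z]{4}))?")
SHIP_TO_RE = re.compile(r"Số lượng Ship-to\s*:?\s*(\d+)")


def parse_f1_text(text):
    """Extract a few fields from pdftotext output; missing fields stay None."""
    record = {"label": None, "client_type": None, "num_ship_to": None}
    match = MA_LH_RE.search(text)
    if match:
        record["label"] = LABEL_BY_CODE[match.group(1)]
        record["client_type"] = match.group(2)  # None when no suffix is present
    match = SHIP_TO_RE.search(text)
    if match:
        record["num_ship_to"] = int(match.group(1))
    return record


sample = "Mã LH: KA01-GKCF\nSố lượng Ship-to: 12\n"  # made-up sample text
print(parse_f1_text(sample))
```

Returning None for missing fields keeps older forms without a Mã LH field in the dataset as unlabelled rows, rather than dropping them at parse time.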
Label Distribution — KAC vs KAM
From 402 labelled records (191 pending manual review)
Client Type Code Distribution
BVXX = Hospital · CQXX = Corp Office · GKCF = F&B Chain

Logistic Regression — Chosen for interpretability over raw performance

Gradient Boosting achieved the highest test accuracy (90%), and both Gradient Boosting and Random Forest outperform Logistic Regression on AUC (CV AUC 0.939 and 0.930 vs 0.910). Logistic Regression was still chosen, for three reasons: (1) Interpretability: a KA manager needs to understand why a customer is classified as KAC or KAM, and LR coefficients give a direct, explainable answer. (2) Sample size risk: with only 402 labelled records, tree-based models risk overfitting patterns that won't generalise to new customers, while a regularised linear model is less exposed to this. (3) Deployment simplicity: LR outputs a calibrated probability score per customer that can be thresholded and explained in a business context. RF is retained as a sanity check: when both models agree, confidence is high; when they disagree, the case is flagged for manual review.
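The LR/RF agreement rule described above can be sketched in a few lines. The function name `flag_for_review`, the 0.5 threshold, and the example probabilities are assumptions, not values from the project.

```python
def flag_for_review(p_lr, p_rf, threshold=0.5):
    """Pair each LR label with a flag that is True when RF disagrees.

    p_lr and p_rf are predicted probabilities of KAC from each model.
    """
    results = []
    for lr_prob, rf_prob in zip(p_lr, p_rf):
        lr_label = "KAC" if lr_prob >= threshold else "KAM"
        rf_label = "KAC" if rf_prob >= threshold else "KAM"
        results.append((lr_label, lr_label != rf_label))
    return results


# Made-up probabilities for three customers.
decisions = flag_for_review([0.91, 0.48, 0.20], [0.88, 0.63, 0.15])
print(decisions)  # [('KAC', False), ('KAM', True), ('KAM', False)]
```

The LR label is always the one reported; RF only decides whether the case goes to manual review.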

Model Comparison — Cross-Validation (5-fold, stratified)
CV AUC Comparison — All 3 Models
Mean ± std across 5 folds · higher = better · chosen model highlighted
Model | CV AUC | CV Std | Test AUC | Test Accuracy | Chosen?
Logistic Regression ★ | 0.910 | ±0.046 | 0.868 | 79% | ✅ Chosen
Random Forest | 0.930 | ±0.041 | 0.931 | 86% | Sanity check
Gradient Boosting | 0.939 | ±0.042 | 0.938 | 90% | Overfit risk
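A comparison like the one in this table could be produced with a stratified 5-fold loop in scikit-learn. The sketch below runs on synthetic data from `make_classification` as a stand-in for the 402 × 21 feature matrix, so its scores will not match the table. The tree-model settings and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 402 x 21 feature matrix (not the project data).
X, y = make_classification(n_samples=402, n_features=21, n_informative=8,
                           weights=[0.4, 0.6], random_state=0)

models = {
    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```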
Confusion Matrix — Logistic Regression (Test Set)
81 test records · 49 KAC · 32 KAM
 | Predicted KAM | Predicted KAC
Actual KAM | 23 (True KAM) | 9 (False KAC)
Actual KAC | 8 (False KAM) | 41 (True KAC)
Precision KAC: 82% · Recall KAC: 84% · Precision KAM: 74% · Recall KAM: 72% · Overall Acc: 79%
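As a quick check, the reported precision, recall, and accuracy follow directly from the four confusion-matrix counts above. The variable names are illustrative.

```python
# Counts from the test-set confusion matrix (81 records: 49 KAC, 32 KAM).
true_kac, false_kam = 41, 8   # actual KAC: predicted KAC / predicted KAM
true_kam, false_kac = 23, 9   # actual KAM: predicted KAM / predicted KAC

precision_kac = true_kac / (true_kac + false_kac)   # 41 / 50
recall_kac = true_kac / (true_kac + false_kam)      # 41 / 49
precision_kam = true_kam / (true_kam + false_kam)   # 23 / 31
recall_kam = true_kam / (true_kam + false_kac)      # 23 / 32
accuracy = (true_kac + true_kam) / (true_kac + false_kam + true_kam + false_kac)  # 64 / 81

for name, value in [("Precision KAC", precision_kac), ("Recall KAC", recall_kac),
                    ("Precision KAM", precision_kam), ("Recall KAM", recall_kam),
                    ("Overall Acc", accuracy)]:
    print(f"{name}: {value:.0%}")
```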
Hyperparameters — Logistic Regression
Current settings · to be tuned on full dataset
Parameter | Value | Rationale
C | 1.0 | Moderate regularisation
penalty | l2 | Ridge, keeps all features
solver | lbfgs | Efficient for small n
max_iter | 1000 | Ensures convergence
class_weight | None | 60/40 split, balanced
scaler | StandardScaler | Required for LR stability
Pending: GridSearchCV over C=[0.01,0.1,0.5,1,5,10] + L1 vs L2 on full dataset
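The pending search might be set up as below. One practical detail: scikit-learn's `lbfgs` solver does not support the L1 penalty, so comparing L1 against L2 needs a solver such as `liblinear` or `saga`. The sketch uses `liblinear` as an assumption and runs on synthetic data, not the project's feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labelled feature matrix (not the project data).
X, y = make_classification(n_samples=402, n_features=21, n_informative=8,
                           weights=[0.4, 0.6], random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # liblinear supports both l1 and l2; lbfgs would fail on l1.
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 0.5, 1, 5, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline means each fold is standardised using its own training portion only, which avoids leaking test-fold statistics into the search.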

Key finding: Number of ship-to locations is the strongest KAC predictor

The most interpretable finding from the LR coefficients: KAC customers have more ship-to locations (coefficient +2.00) — chain customers by definition have multiple outlets. Hospital clients (BVXX) strongly predict KAM (-0.83) — medical channel customers are managed differently regardless of size. HCM-based customers lean KAC (+1.12) — chain headquarters are concentrated in Ho Chi Minh City. Higher monthly revenue target also predicts KAC (+1.06) — chains commit to larger monthly volumes than institutional KAM accounts.

LR Coefficients — Top 10 Features
Positive = predicts KAC · Negative = predicts KAM
Random Forest Feature Importance
Sanity check — consistent top features across both models
Feature Engineering Decisions
Feature | Engineering | LR Coef | Direction
NUM_SHIP_TO | Raw count | +2.001 | → KAC
IS_HCM | Binary flag from geo coverage | +1.121 | → KAC
REVENUE_TARGET_MONTH | Raw VND value | +1.062 | → KAC
NUM_PRODUCT_GROUPS | Count of product lines in policy | +0.992 | → KAC
NUM_CATEGORIES | Count of product categories (A/B/C/D) | -0.840 | → KAM
IS_NATIONAL | Binary flag, nationwide delivery | +0.832 | → KAC
CG_HOSPITAL | One-hot from client type code | -0.829 | → KAM
CG_SERVICE | One-hot: hotels, airports, transport | +0.818 | → KAC
LOG_REVENUE_TARGET_YEAR | Log transform, skewed distribution | -0.714 | → KAM
LOG_HIST_SALES_12M | Log transform, 12M historical sales | -0.523 | → KAM
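The transforms listed in this table can be illustrated with pandas on a few made-up rows. The raw column names, the sample values, the `;`-separated coverage format, the use of `log1p` for the log transform, and the `CG_FNB_CHAIN` / `CG_CORP_OFFICE` column names are assumptions. The feature names `NUM_SHIP_TO`, `IS_HCM`, `LOG_REVENUE_TARGET_YEAR`, and `CG_HOSPITAL` follow the table.

```python
import numpy as np
import pandas as pd

# Made-up rows; the raw column names are hypothetical.
raw = pd.DataFrame({
    "client_type": ["BVXX", "GKCF", "CQXX"],
    "num_ship_to": [1, 40, 3],
    "revenue_target_year": [1.2e9, 3.6e10, 4.8e9],
    "order_freq_month": [2.0, np.nan, 4.0],
    "coverage": ["HCM", "HCM; Ha Noi; Da Nang", "Ha Noi"],
})

# Code meanings are from the page; the group names are illustrative.
GROUP_BY_CODE = {"BVXX": "HOSPITAL", "GKCF": "FNB_CHAIN", "CQXX": "CORP_OFFICE"}

features = pd.DataFrame({
    "NUM_SHIP_TO": raw["num_ship_to"],
    # log1p is one common choice of log transform for skewed values.
    "LOG_REVENUE_TARGET_YEAR": np.log1p(raw["revenue_target_year"]),
    "IS_HCM": raw["coverage"].str.contains("HCM").astype(int),
    # Median fill for the sparsely populated order-frequency field.
    "ORDER_FREQ_MONTH": raw["order_freq_month"].fillna(raw["order_freq_month"].median()),
})
one_hot = pd.get_dummies(raw["client_type"].map(GROUP_BY_CODE), prefix="CG", dtype=int)
features = pd.concat([features, one_hot], axis=1)
print(features)
```

In a train/test workflow the median would normally be computed on the training split only and then reused for the test split.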

K=5 wins — 5 natural segments emerge from 2,219 KA customers, each with distinct KAC/KAM composition

K-Means clustering on the full AR customer dataset (2,219 KA accounts; features: ship-to locations, billing geography, credit limits, payment terms, profile class) shows that customers do not split cleanly into just KAC and KAM. Instead, 5 segments emerge: a pure hospital KAM cluster, two KAC-leaning commercial clusters differentiated by geographic scale, and two KAM-leaning institutional clusters (schools/hotels vs industrial). K=5 achieves the highest silhouette score (0.168) of the three values tested (K=3, 4, 5), so it is taken as the best of those. A score this close to 0 indicates overlapping clusters, so the segment boundaries are soft rather than sharp. The National Chain cluster (76 customers, avg 90 ship-to locations, 100% national coverage) is the clearest KAC signal in the entire dataset.
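Choosing K by silhouette score, as described above, can be sketched with scikit-learn. The data here is synthetic (`make_blobs`), and standardising the features before clustering is an assumption, so the scores say nothing about the real AR dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the AR customer features (not the project data).
X, _ = make_blobs(n_samples=500, centers=5, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    # Mean silhouette over all points: near 1 = well separated, near 0 = overlapping.
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("best K among tested values:", best_k)
```

The loop only ranks the candidate values it is given, so the result is "best of 3, 4, 5" rather than a global optimum.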

Silhouette Score by K
Higher = better defined clusters · K=5 optimal
Cluster Size Distribution — Current K
Customer count per segment
KAC vs KAM Composition — Per Cluster (Current K)
% split of labelled customers within each segment
Segment Profiles — K=5 (Best)
Segment | N | KAC% | KAM% | Median Ship-to | Avg Cities | HCM% | National% | Top Profile

Key finding: Geography and scale — not industry — define KAC vs KAM

The clearest separator between KAC and KAM is the number of ship-to locations and geographic spread, not customer type alone. The National Chain cluster (76 customers, avg 90 ship-to, 100% national) is 83% KAC. The Commercial cluster (531 customers) is 70% KAC despite operating at local, single-city scale. Meanwhile, hospitals, schools, and industrial accounts are predominantly KAM regardless of size or geographic footprint, because they are managed through the medical/institutional channel. This suggests the KAC/KAM boundary is as much about distribution model as it is about customer type.

Next: Expand to K-Means clustering — go beyond the binary KAC/KAM split

The supervised model answers "is this customer KAC or KAM?" But a more interesting business question is: "are there natural sub-segments within KAC and KAM that behave differently?" K-Means clustering on the same feature matrix will reveal whether, for example, high-revenue hospital KAM accounts are actually more similar to chain KAC accounts than to small hospital KAM accounts. This unsupervised layer will feed into a richer customer segmentation strategy beyond the current binary classification.

Project Roadmap
PDF ETL Pipeline · Completed
Extracted 15 structured fields from 676 F1 pricing approval PDFs using pdftotext + regex parser. 593/594 PDFs successfully processed. 402 records labelled from Mã LH field.
Baseline Classification Model · Completed
Logistic Regression trained on 402 labelled records. CV AUC = 0.910. Compared against Random Forest (0.930) and Gradient Boosting (0.939). LR chosen for interpretability and generalisation robustness on small dataset.
🔄 Manual Label Completion · In Progress
191 unlabelled PDFs identified (older F1 format without Mã LH field). Manual review Excel exported; labels being filled. Expected to grow dataset to ~590 records (+47%).
📋 Hyperparameter Tuning · Planned
GridSearchCV over C=[0.01, 0.1, 0.5, 1.0, 5.0, 10.0] and L1 vs L2 penalty. Re-evaluate class_weight='balanced' with expanded dataset. Expected AUC improvement to 0.92+.
K-Means Customer Segmentation · Completed
K-Means clustering (K=3,4,5) on 2,219 KA customers using AR ship-to/bill-to data: geographic spread, credit limits, payment terms, profile class. K=5 optimal (silhouette=0.168). 5 natural segments identified: Hospital KAM, Commercial KAC, School & Hotel KAM, Industrial KAM, National Chain KAC. Key finding: geographic scale (ship-to count, city spread) is the primary KAC differentiator, not industry alone.
📋 Segment Profiling & Business Recommendations · Planned
Profile each K-Means cluster by revenue, discount rate, product mix, and geography. Map clusters to existing KAC/KAM labels to identify mis-classifications and pricing policy opportunities.
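One possible way to map clusters to existing labels is a per-cluster crosstab, flagging customers whose label differs from their cluster's majority label. This is a simple heuristic sketched on made-up rows, not the project's method. The column names `cluster` and `label` are hypothetical.

```python
import pandas as pd

# Made-up assignments; None marks a customer without an F1 label.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 2, 2],
    "label": ["KAC", "KAC", "KAM", "KAM", "KAM", "KAM", "KAC", "KAC", None],
})

labelled = df.dropna(subset=["label"])

# Share of KAC vs KAM among labelled customers in each cluster (rows sum to 1).
composition = pd.crosstab(labelled["cluster"], labelled["label"], normalize="index")

# Majority label per cluster, then customers whose own label disagrees with it.
majority = composition.idxmax(axis=1)
labelled = labelled.assign(majority=labelled["cluster"].map(majority))
candidates = labelled[labelled["label"] != labelled["majority"]]

print(composition)
print(candidates)
```

Flagged rows are candidates for review rather than confirmed mis-classifications, since a KAC-leaning cluster can legitimately contain KAM accounts.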