The traditional credit scorecard—a simple table that adds points to a base score—remains one of the most widely deployed credit risk models in retail banking. Its survival is not nostalgia: scorecards are interpretable, auditable, and regulatorily palatable in a way that gradient-boosted ensembles are not. Understanding the statistics behind WoE and IV is the foundation for building models that satisfy both the risk quant and the credit committee.
The Target Variable
A scorecard predicts probability of default (PD), where default is typically defined as 90+ days past due within a 12-month performance window. The binary target is Y = 1 (default) and Y = 0 (non-default). Class imbalance is severe: in prime retail portfolios, default rates run 1–5%; in subprime, 10–20%. The rarity of defaults directly affects all downstream statistics.
Weight of Evidence Transformation
For a predictor variable X binned into k groups, the Weight of Evidence for bin i is:

$$\mathrm{WoE}_i = \ln\!\left(\frac{n_{i,0}/N_0}{n_{i,1}/N_1}\right)$$

where n_{i,1} is the count of defaults in bin i, N_1 is total defaults, n_{i,0} is non-defaults in bin i, and N_0 is total non-defaults. Bins where WoE is strongly negative concentrate defaults; bins with strongly positive WoE concentrate good borrowers. The WoE transformation maps any variable, continuous or categorical, onto a single numerical scale that is linear in the log-odds of the outcome.
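As a numeric illustration with made-up bin counts: a bin that holds 20% of all defaults but only 5% of all non-defaults gets a strongly negative WoE, as the prose above predicts.

```python
import numpy as np

# Hypothetical bin: 40 of N1 = 200 total defaults land here,
# but only 100 of N0 = 2000 total non-defaults.
n_i1, N1 = 40, 200      # defaults in bin, total defaults
n_i0, N0 = 100, 2000    # non-defaults in bin, total non-defaults

# WoE = ln(share of non-defaults / share of defaults) = ln(0.05 / 0.20)
woe = np.log((n_i0 / N0) / (n_i1 / N1))
print(round(woe, 3))    # → -1.386
```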
Information Value
The Information Value (IV) of a variable summarises its total predictive power across all bins:

$$\mathrm{IV} = \sum_{i=1}^{k}\left(\frac{n_{i,0}}{N_0} - \frac{n_{i,1}}{N_1}\right)\mathrm{WoE}_i$$

IV is always non-negative, since each term pairs a difference with a logarithm of the same sign. Conventional thresholds:
| IV Range | Predictive Power | Scorecard Use |
|---|---|---|
| < 0.02 | Not useful | Exclude |
| 0.02 – 0.10 | Weak | Consider only with domain rationale |
| 0.10 – 0.30 | Medium | Strong candidate |
| 0.30 – 0.50 | Strong | High-value feature |
| > 0.50 | Suspicious | Check for data leakage |
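A quick check of where a variable lands in the table above, using a hypothetical three-bin distribution of defaults and non-defaults:

```python
import numpy as np

# Invented shares: how defaults (p1) and non-defaults (p0) fall across 3 bins.
p1 = np.array([0.40, 0.30, 0.30])   # distribution of defaults
p0 = np.array([0.25, 0.30, 0.45])   # distribution of non-defaults

woe = np.log(p0 / p1)               # per-bin WoE
iv = np.sum((p0 - p1) * woe)        # total Information Value
print(round(iv, 3))                 # → 0.131, a "medium" predictor
```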
Logistic Regression and the Log-Odds Relationship
Once all predictors are WoE-transformed, logistic regression produces a model in log-odds space that is additive and interpretable:

$$\ln\frac{P(Y=1\mid \mathbf{x})}{P(Y=0\mid \mathbf{x})} = \beta_0 + \sum_{j}\beta_j\,\mathrm{WoE}_j(x_j)$$
The final step converts log-odds to integer scorecard points using a linear scaling that maps a reference log-odds value (the base score) to a desired point total, with a fixed number of points per doubling of the odds (PDO, points to double the odds). A common industry convention is 600 points at 50:1 good:bad odds, with odds doubling every 20 points.
```python
import numpy as np
import pandas as pd


def compute_woe_iv(df: pd.DataFrame, feature: str, target: str,
                   bins: int = 10) -> pd.DataFrame:
    """Compute WoE and IV for a continuous feature using quantile binning."""
    temp = df[[feature, target]].copy()
    temp['bin'] = pd.qcut(temp[feature], bins, duplicates='drop')
    stats = temp.groupby('bin', observed=True)[target].agg(
        events='sum', total='count'
    ).reset_index()
    stats['non_events'] = stats['total'] - stats['events']
    N1 = stats['events'].sum()          # total defaults
    N0 = stats['non_events'].sum()      # total non-defaults
    stats['p_event'] = stats['events'] / N1
    stats['p_non_event'] = stats['non_events'] / N0
    # Small epsilon guards against log(0) in degenerate bins.
    eps = 1e-6
    # WoE = ln(% non-defaults / % defaults): negative bins concentrate defaults.
    stats['woe'] = np.log((stats['p_non_event'] + eps) / (stats['p_event'] + eps))
    stats['iv_bin'] = (stats['p_non_event'] - stats['p_event']) * stats['woe']
    stats['iv'] = stats['iv_bin'].sum()
    return stats


def scale_to_scorecard(log_odds: np.ndarray, base_score: int = 600,
                       base_odds: float = 50.0, pdo: int = 20) -> np.ndarray:
    """Convert good:bad log-odds to integer scorecard points."""
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return np.round(offset + factor * log_odds).astype(int)
```
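A quick sanity check of the points scaling (600 points at 50:1 odds, PDO = 20): the base odds map to the base score, and doubling the odds adds exactly PDO points.

```python
import numpy as np

base_score, base_odds, pdo = 600, 50.0, 20
factor = pdo / np.log(2)
offset = base_score - factor * np.log(base_odds)

# Score as a function of the good:bad odds.
score_at = lambda odds: offset + factor * np.log(odds)
print(round(score_at(50.0)))    # → 600 (base odds give the base score)
print(round(score_at(100.0)))   # → 620 (doubled odds add PDO = 20 points)
```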
Model Validation
Three metrics are standard for scorecard validation:
- **Gini coefficient** (= 2 × AUC − 1): measures rank-ordering power. A Gini above 0.40 is generally acceptable for a retail consumer scorecard; above 0.55 is considered strong.
- **Kolmogorov-Smirnov (KS) statistic**: the maximum difference between the cumulative score distributions of defaults and non-defaults. KS above 0.30 is typically a pass; below 0.20 is a concern.
- **Population Stability Index (PSI)**: measures whether the score distribution at monitoring time has shifted from the development population. PSI below 0.10 indicates stability; 0.10–0.25 warrants investigation; above 0.25 signals a significant shift.
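All three metrics can be sketched in plain NumPy; the score distributions below are invented for illustration (goods scoring higher than bads on average).

```python
import numpy as np

rng = np.random.default_rng(0)
good = rng.normal(640, 40, 5000)    # hypothetical non-default scores
bad = rng.normal(590, 40, 300)      # hypothetical default scores

# Gini = 2*AUC - 1, with AUC via the Mann-Whitney rank-sum statistic.
scores = np.concatenate([good, bad])
ranks = scores.argsort().argsort() + 1              # ranks 1..n (ties negligible)
u = ranks[:len(good)].sum() - len(good) * (len(good) + 1) / 2
auc = u / (len(good) * len(bad))                    # P(good score > bad score)
gini = 2 * auc - 1

# KS: maximum gap between the two empirical CDFs of the score.
grid = np.sort(scores)
cdf_bad = np.searchsorted(np.sort(bad), grid, side='right') / len(bad)
cdf_good = np.searchsorted(np.sort(good), grid, side='right') / len(good)
ks = np.abs(cdf_bad - cdf_good).max()

def psi(expected, actual, bins=10):
    """PSI between a development-time and a monitoring-time score sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range scores
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6
    return np.sum((a - e) * np.log((a + eps) / (e + eps)))

print(f"Gini={gini:.3f}  KS={ks:.3f}  PSI(shifted)={psi(good, bad):.3f}")
```

Because the bad population here is shifted down by more than a full standard deviation, the PSI of `bad` against `good` lands well above the 0.25 alarm threshold, while PSI of a sample against itself is zero.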
> "The Gini tells you how well the model ranks borrowers. The KS tells you where it separates them. Neither tells you whether the probabilities are calibrated—for that, you need a reliability diagram."
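A minimal reliability check in the spirit of that remark, using invented predicted PDs whose outcomes are drawn from those same PDs (a perfectly calibrated model by construction, so predicted and observed rates should track each other in every bin):

```python
import numpy as np

rng = np.random.default_rng(1)
p_hat = rng.uniform(0.01, 0.20, 20000)   # hypothetical predicted PDs
y = rng.binomial(1, p_hat)               # outcomes drawn from those PDs

# Reliability diagram data: decile-bin the predictions, then compare
# mean predicted PD with the observed default rate in each bin.
edges = np.quantile(p_hat, np.linspace(0, 1, 11))
idx = np.clip(np.searchsorted(edges, p_hat, side='right') - 1, 0, 9)
for b in range(10):
    m = idx == b
    print(f"decile {b}: predicted {p_hat[m].mean():.3f}  observed {y[m].mean():.3f}")
```

A miscalibrated model would show a systematic gap between the two columns even while its Gini and KS look healthy.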
Regulatory Considerations under IRB
Scorecards used for regulatory capital under the Internal Ratings-Based (IRB) approach are subject to strict EBA guidelines. The PD estimate must represent a long-run average through the cycle, not just point-in-time performance on the development sample. Banks must hold out at least one year of data for out-of-time validation, document the model in a Model Risk Management framework, and subject it to annual backtesting against actual default rates.