
The traditional credit scorecard—a simple table that adds points to a base score—remains one of the most widely deployed credit risk models in retail banking. Its survival is not nostalgia: scorecards are interpretable, auditable, and regulatorily palatable in a way that gradient-boosted ensembles are not. Understanding the statistics behind WoE and IV is the foundation for building models that satisfy both the risk quant and the credit committee.

The Target Variable

A scorecard predicts probability of default (PD), where default is typically defined as 90+ days past due within a 12-month performance window. The binary target is Y = 1 (default) and Y = 0 (non-default). Class imbalance is severe: in prime retail portfolios, default rates run 1–5%; in subprime, 10–20%. The rarity of defaults directly affects every downstream statistic.
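A minimal sketch of target construction on a toy snapshot; the column name max_dpd_12m (worst days-past-due within the window) is illustrative, not from any real schema:

```python
import pandas as pd

# Hypothetical performance snapshot; max_dpd_12m = worst days-past-due
# observed for the account within the 12-month performance window.
accounts = pd.DataFrame({
    'account_id':  [1, 2, 3, 4, 5],
    'max_dpd_12m': [0, 15, 95, 120, 30],
})

# Default flag: 90+ days past due within the window
accounts['default'] = (accounts['max_dpd_12m'] >= 90).astype(int)

default_rate = accounts['default'].mean()  # 0.4 on this toy sample
```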

Weight of Evidence Transformation

For a predictor variable X binned into k groups, the Weight of Evidence for bin i is:

Weight of Evidence
WoE_i = ln( p_i,events / p_i,non-events ) = ln( (n_i,1 / N1) / (n_i,0 / N0) )

where n_i,1 is the count of defaults in bin i, N1 is total defaults, n_i,0 is non-defaults in bin i, and N0 is total non-defaults. Under this convention, bins with strongly positive WoE concentrate defaults, while bins with strongly negative WoE concentrate good borrowers (many references use the reversed ratio, non-events over events, which simply flips the signs). The WoE transformation maps any variable, continuous or categorical, to a single numerical scale that is linear in the log-odds of the outcome.
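A quick worked example with made-up counts shows the sign behaviour under the events-over-non-events convention: a bin holding a disproportionate share of defaults gets a positive WoE.

```python
import math

# Made-up counts: this bin holds 40 of 200 total defaults
# but only 960 of 9,800 total non-defaults.
n_i1, N1 = 40, 200
n_i0, N0 = 960, 9_800

woe = math.log((n_i1 / N1) / (n_i0 / N0))
# woe ≈ 0.714: positive, so the bin concentrates defaults
```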

Information Value

The Information Value (IV) of a variable summarises its total predictive power across all bins:

Information Value
IV = Σ_i ( p_i,events − p_i,non-events ) × WoE_i
IV Range       Predictive Power   Scorecard Use
< 0.02         Not useful         Exclude
0.02 – 0.10    Weak               Consider only with domain rationale
0.10 – 0.30    Medium             Strong candidate
0.30 – 0.50    Strong             High-value feature
> 0.50         Suspicious         Check for data leakage
[Figure] Feature Information Value — Consumer Credit Portfolio. IV scores for candidate variables. Consumer loan application dataset, 24-month performance window, 85,000 accounts. Variables with IV < 0.02 excluded.
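Extending the same idea to a full (two-bin) variable gives its IV, which can be read against the table above. All counts are illustrative.

```python
import math

# Toy two-bin variable: 200 defaults and 9,800 non-defaults in total.
bins = [
    (120, 3_500),  # (defaults, non-defaults) in bin 1
    (80,  6_300),  # (defaults, non-defaults) in bin 2
]
N1 = sum(n1 for n1, _ in bins)
N0 = sum(n0 for _, n0 in bins)

iv = 0.0
for n1, n0 in bins:
    p_event, p_non_event = n1 / N1, n0 / N0
    woe = math.log(p_event / p_non_event)
    iv += (p_event - p_non_event) * woe
# iv ≈ 0.24: a "medium" predictor in the 0.10–0.30 band
```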

Logistic Regression and the Log-Odds Relationship

Once all predictors are WoE-transformed, logistic regression produces a model in log-odds space that is additive and interpretable:

Logistic Regression Score
log( p / (1−p) ) = β0 + β1·WoE1 + β2·WoE2 + ... + βk·WoEk

The final step converts log-odds to integer scorecard points with a linear scaling that fixes a base score at a reference odds level, plus a defined number of points per doubling of odds (PDO, "points to double the odds"). A common convention is 600 points at 50:1 good:bad odds, with the odds doubling every 20 points.

Score to Points Conversion
Score = Offset − Factor × ( β0 + Σj βj·WoEj )
Factor = PDO / ln(2),  Offset = Base − Factor × ln(OddsBase)

where OddsBase is the good:bad odds at the base score (50 in the convention above). The minus sign is there because the regression outputs the log-odds of default: higher risk must map to a lower score, and at the reference odds the score equals the base.
Python — WoE binning and scorecard scaling
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def compute_woe_iv(df: pd.DataFrame, feature: str,
                   target: str, bins: int = 10) -> pd.DataFrame:
    """Compute WoE and IV for a continuous feature using quantile binning."""
    temp = df[[feature, target]].copy()
    temp['bin'] = pd.qcut(temp[feature], bins, duplicates='drop')

    stats = temp.groupby('bin', observed=True)[target].agg(
        events='sum',
        total='count'
    ).reset_index()
    stats['non_events'] = stats['total'] - stats['events']

    N1 = stats['events'].sum()
    N0 = stats['non_events'].sum()

    stats['p_event']     = stats['events'] / N1
    stats['p_non_event'] = stats['non_events'] / N0

    # Add small epsilon to avoid log(0)
    eps = 1e-6
    stats['woe'] = np.log((stats['p_event'] + eps) /
                          (stats['p_non_event'] + eps))
    stats['iv_bin'] = (stats['p_event'] - stats['p_non_event']) * stats['woe']
    stats['iv'] = stats['iv_bin'].sum()

    return stats

def scale_to_scorecard(log_odds: np.ndarray,
                       base_score: int = 600,
                       base_odds: float = 50.0,
                       pdo: int = 20) -> np.ndarray:
    """Convert model log-odds of default to integer scorecard points.

    base_odds is the good:bad odds at base_score (e.g. 50:1), so
    higher default risk maps to a lower score.
    """
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return np.round(offset - factor * log_odds).astype(int)
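The base-score and PDO properties of the scaling can be checked numerically. This standalone sketch repeats the scaling arithmetic (the function name scale_points is illustrative; the input is log-odds of default, so higher risk gives a lower score):

```python
import numpy as np

def scale_points(log_odds_default, base_score=600,
                 base_odds=50.0, pdo=20):
    """Scaling check: base_odds is the good:bad odds at base_score."""
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return int(np.round(offset - factor * log_odds_default))

# At 50:1 good:bad odds (log-odds of default = -ln 50), score = base:
print(scale_points(-np.log(50)))   # 600
# Doubling the odds to 100:1 adds exactly PDO points:
print(scale_points(-np.log(100)))  # 620
```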

Model Validation

Three metrics are standard for scorecard validation:

Gini coefficient (= 2 × AUC − 1): measures rank-ordering power. A Gini above 0.40 is generally acceptable for a retail consumer scorecard; above 0.55 is considered strong.

Kolmogorov-Smirnov (KS) statistic: the maximum difference between the cumulative distribution of scores for defaults and non-defaults. KS above 0.30 is typically a pass; below 0.20 is a concern.

Population Stability Index (PSI): measures whether the score distribution at monitoring time has shifted from the development population. PSI below 0.10 = stable; 0.10–0.25 = warrants investigation; above 0.25 = significant shift.
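All three metrics can be computed directly with numpy and scikit-learn. A sketch, assuming higher score means lower risk; function names are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(y_true, score):
    # Higher score = lower risk, so rank-order on the negated score.
    return 2 * roc_auc_score(y_true, -np.asarray(score)) - 1

def ks_statistic(y_true, score):
    # Max gap between the score CDFs of defaults and non-defaults.
    y_true, score = np.asarray(y_true), np.asarray(score)
    bad = np.sort(score[y_true == 1])
    good = np.sort(score[y_true == 0])
    grid = np.unique(score)
    cdf_bad = np.searchsorted(bad, grid, side='right') / len(bad)
    cdf_good = np.searchsorted(good, grid, side='right') / len(good)
    return float(np.max(np.abs(cdf_bad - cdf_good)))

def psi(expected, actual, n_bins=10):
    # Bin edges from the development (expected) score quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6  # guard against empty bins
    return float(np.sum((a - e) * np.log((a + eps) / (e + eps))))
```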

[Figure] ROC Curve — Train, Validation and Out-of-Time Test. Area under curve shown. Development: 60,000 accounts (2019–2021). Validation: 15,000 (2021). Out-of-time: 10,000 (Jan–Jun 2022). Target event = 90+ DPD within 12 months.

[Figure] Score Distribution — Population vs Defaults. Normalised frequency density of scorecard points. Separation between the two curves (the KS statistic) is the key discriminatory-power metric. KS = 0.412.

"The Gini tells you how well the model ranks borrowers. The KS tells you where it separates them. Neither tells you whether the probabilities are calibrated—for that, you need a reliability diagram."
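One way to draw that reliability diagram is scikit-learn's calibration_curve, shown here on synthetic PDs (all data illustrative):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic portfolio: true PDs and slightly over-stated model PDs.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.20, 20_000)
y = rng.binomial(1, p_true)
p_pred = np.clip(p_true * 1.1, 0.0, 1.0)

# Observed default rate vs mean predicted PD per quantile bin;
# points near the 45-degree line indicate good calibration.
frac_pos, mean_pred = calibration_curve(y, p_pred, n_bins=10,
                                        strategy='quantile')
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted {mp:.3f}  observed {fp:.3f}")
```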

Regulatory Considerations under IRB

Scorecards used for regulatory capital under the Internal Ratings-Based (IRB) approach are subject to strict EBA guidelines. The PD estimate must represent a long-run average through the cycle, not just point-in-time performance on the development sample. Banks must hold out at least one year of data for out-of-time validation, document the model in a Model Risk Management framework, and subject it to annual backtesting against actual default rates.