Data Preprocessing Techniques

Expert-defined terms from the Certificate in Credit Risk Analytics in Python course at HealthCareStudies (An LSPM brand). Free to read, free to share, paired with a professional course.

Amplitude Scaling – Concept #

Adjusting the magnitude of numeric variables to a common range without altering distribution shape. Related terms: min‑max scaling, standardization. Explanation: Each value x is multiplied by a factor k so that the new maximum equals a predefined bound (often 1). Example: Scaling credit‑score values from 300‑850 to 0‑1 by dividing by 850. Practical application: Enables gradient‑based algorithms to converge faster when modeling default probability. Challenges: Choice of scale influences interpretability; extreme outliers can compress the bulk of data into a narrow band.

Binary Binning – Concept #

Converting a continuous variable into two categories based on a threshold. Related terms: discretization, thresholding. Explanation: Values above the cut‑point receive label 1, below receive 0. Example: Flagging loan‑to‑value ratios > 80 % as high‑risk (1) and ≤ 80 % as low‑risk (0). Practical application: Simplifies logistic regression and decision‑tree splits. Challenges: Information loss; selection of an optimal cut‑point may require domain expertise or ROC analysis.

Box‑Cox Transformation – Concept #

Power transformation that stabilizes variance and makes data more normally distributed. Related terms: Yeo‑Johnson, log transformation. Explanation: For each positive x, compute (x^λ − 1)/λ if λ ≠ 0; otherwise use log(x). The parameter λ is estimated via maximum likelihood. Example: Transforming skewed debt‑to‑income ratios before feeding them to a linear discriminant model. Practical application: Improves linear model assumptions in credit risk scoring. Challenges: Requires strictly positive data; selecting λ may be computationally intensive for large datasets.
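Illustrative sketch of the Box‑Cox formula above, using only the standard library. The function name box_cox and the sample ratios are hypothetical, and λ is fixed by hand here rather than estimated by maximum likelihood (scipy.stats.boxcox can estimate it when SciPy is available).

```python
import math

def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of one strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)          # limiting case of the power formula
    return (x ** lam - 1.0) / lam

# Hypothetical right-skewed debt-to-income ratios.
dti = [0.05, 0.10, 0.20, 0.40, 1.60]
log_case = [box_cox(v, 0.0) for v in dti]    # lam = 0: plain log
sqrt_case = [box_cox(v, 0.5) for v in dti]   # lam = 0.5: square-root-like
```

Both transforms keep the original ordering of the values while pulling the large ratio closer to the rest.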

Bucket Encoding – Concept #

Grouping numeric values into equal‑width intervals (buckets) and representing each bucket with a categorical label. Related terms: equal‑frequency binning, ordinal encoding. Explanation: Divide the range of a variable into N buckets and assign bucket IDs. Example: Creating five buckets for annual income: 0‑20k, 20‑40k, …, 80‑100k. Practical application: Reduces sensitivity to noise in tree‑based models. Challenges: Arbitrary bucket boundaries can create artificial discontinuities; may hide important trends within a bucket.

Chi‑Square Feature Selection – Concept #

Statistical test measuring independence between categorical features and the target variable. Related terms: ANOVA F‑test, mutual information. Explanation: Compute χ² statistic for each feature; higher values indicate stronger association with default/non‑default. Example: Selecting macro‑economic indicators that significantly differ between delinquent and current accounts. Practical application: Filters irrelevant predictors before fitting high‑dimensional models. Challenges: Assumes feature independence; not applicable to continuous variables without prior discretization.

Clipping – Concept #

Restricting variable values to a pre‑defined minimum and/or maximum. Related terms: winsorizing, thresholding. Explanation: Values below lower bound are set to the bound; values above upper bound are set to the bound. Example: Capping interest‑rate values at 25 % to prevent outlier influence. Practical application: Controls extreme values that could destabilize gradient descent in neural networks. Challenges: May distort true distribution; selecting bounds requires domain knowledge.

Correlation Analysis – Concept #

Quantifying linear relationships between pairs of variables. Related terms: Pearson coefficient, Spearman rank, multicollinearity. Explanation: Compute correlation matrix; values near ±1 indicate strong linear association. Example: Detecting high correlation (r = 0.92) between total debt and loan amount, prompting removal of one feature. Practical application: Guides feature reduction to avoid redundancy in regression models. Challenges: Correlation does not capture nonlinear dependencies; multicollinearity can inflate variance of coefficient estimates.

Data Augmentation – Concept #

Generating synthetic observations to increase training data diversity. Related terms: SMOTE, bootstrapping. Explanation: Apply transformations (e.g., adding Gaussian noise) to existing records while preserving label integrity. Example: Creating additional high‑risk borrower profiles by perturbing existing default cases. Practical application: Mitigates class imbalance for deep‑learning credit scoring. Challenges: Synthetic data may not reflect real‑world risk dynamics; risk of overfitting to artificial patterns.

Data Imputation – Concept #

Replacing missing values with estimated substitutes. Related terms: mean imputation, KNN imputation, multiple imputation. Explanation: Choose a strategy (statistical or model‑based) to fill NaNs before model training. Example: Imputing missing credit‑history length with the median of the non‑missing group. Practical application: Enables use of complete‑case algorithms such as logistic regression. Challenges: Imputation can introduce bias if missingness is not random; complex methods increase computational load.

Data Normalization – Concept #

Transforming variables to a common scale, typically 0‑1, preserving shape. Related terms: min‑max scaling, unit vector scaling. Explanation: For each feature, compute (x − min)/(max − min). Example: Normalizing the number of past delinquencies to the interval [0,1] before feeding them to a neural network. Practical application: Improves convergence speed for distance‑based algorithms like k‑means clustering. Challenges: Sensitive to outliers; new data points outside original min‑max range require re‑scaling.
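Illustrative sketch of the (x − min)/(max − min) formula, fitted on training data and then reused on new data. The helper names and the delinquency counts are hypothetical; clipping out‑of‑range values to [0, 1] is one possible policy for new data, not the only one.

```python
def fit_min_max(train):
    """Learn min and max from the training values only."""
    lo, hi = min(train), max(train)
    if hi == lo:
        raise ValueError("constant feature: max equals min")
    return lo, hi

def min_max_transform(values, lo, hi, clip=True):
    """Apply (x - min) / (max - min); optionally clip to [0, 1]."""
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled

train_delinq = [0, 1, 2, 4]                      # past delinquencies
lo, hi = fit_min_max(train_delinq)
train_scaled = min_max_transform(train_delinq, lo, hi)
new_scaled = min_max_transform([6], lo, hi)      # 6 exceeds the training max
```

Without clipping, the new value 6 would map to 1.5, which is the out‑of‑range situation the Challenges note describes.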

Data Standardization – Concept #

Centering variables around zero mean and scaling to unit variance. Related terms: Z‑score scaling, standard scaler. Explanation: Compute (x − μ)/σ for each feature, where μ and σ are the training‑set mean and standard deviation. Example: Standardizing annual income before applying a support‑vector machine classifier. Practical application: Required by algorithms that assume Gaussian‑like input, such as linear discriminant analysis. Challenges: Distributional shifts in production data may invalidate the original μ and σ, necessitating periodic recalibration.
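Illustrative sketch of (x − μ)/σ with μ and σ taken from the training set and reused for new data. The function names and income figures are hypothetical; the population standard deviation is used here, which matches the convention of scikit‑learn's StandardScaler.

```python
import statistics

def fit_standardizer(train):
    """Return the training-set mean and population standard deviation."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    if sigma == 0:
        raise ValueError("zero-variance feature cannot be standardized")
    return mu, sigma

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train_income = [30_000, 40_000, 50_000, 60_000, 70_000]
mu, sigma = fit_standardizer(train_income)
z_train = standardize(train_income, mu, sigma)
z_new = standardize([90_000], mu, sigma)   # production value, same mu and sigma
```

Storing mu and sigma alongside the model is what makes the periodic recalibration mentioned under Challenges possible.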

Data Type Conversion – Concept #

Changing the underlying representation of a variable (e.g., from string to numeric). Related terms: casting, parsing. Explanation: Apply Python functions like int(), float(), or pandas .astype() to coerce types. Example: Converting the “account_open_date” column from object to datetime64[ns] for time‑series feature engineering. Practical application: Enables arithmetic operations on previously non‑numeric fields. Challenges: Inconsistent formatting (e.g., “12/31/2020” vs “2020‑12‑31”) can cause conversion errors; requires robust cleaning pipelines.
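Illustrative sketch of type coercion with pandas. The DataFrame contents and the "balance" and "num_accounts" column names are hypothetical; unparseable entries are turned into missing values with errors="coerce" rather than raising, which is one reasonable choice among several.

```python
import pandas as pd

raw = pd.DataFrame({
    "account_open_date": ["2020-12-31", "2021-01-15", "not available"],
    "balance": ["1200", "850.5", "n/a"],
    "num_accounts": ["3", "1", "2"],
})

clean = raw.copy()
# Strings that do not match the stated format become NaT instead of failing.
clean["account_open_date"] = pd.to_datetime(
    raw["account_open_date"], format="%Y-%m-%d", errors="coerce"
)
# Non-numeric strings become NaN.
clean["balance"] = pd.to_numeric(raw["balance"], errors="coerce")
# astype is strict: it raises if any value cannot be converted.
clean["num_accounts"] = raw["num_accounts"].astype("int64")
```

Counting the NaT and NaN values after conversion gives a quick measure of how much inconsistent formatting the column contained.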

Decomposition – Concept #

Breaking a time‑series into constituent components (trend, seasonality, residual). Related terms: STL, seasonal decomposition. Explanation: Use additive or multiplicative models to isolate systematic patterns. Example: Decomposing monthly default rates to separate long‑term credit‑cycle trends from seasonal spikes. Practical application: Improves forecasting accuracy for portfolio risk projections. Challenges: Requires sufficient historical depth; model misspecification can misattribute variance.

Dimensionality Reduction – Concept #

Reducing the number of variables while preserving essential information. Related terms: PCA, t‑SNE, autoencoder. Explanation: Transform original features into a lower‑dimensional space using linear or nonlinear techniques. Example: Applying Principal Component Analysis to 150 macro‑economic indicators and retaining the first 10 components that explain 95 % of variance. Practical application: Speeds up model training and mitigates the curse of dimensionality in credit‑risk clustering. Challenges: Loss of interpretability; selection of the optimal number of components can be subjective.

Discretization – Concept #

Converting continuous variables into a finite set of intervals (bins). Related terms: bucket encoding, quantile binning. Explanation: Determine bin edges using methods such as equal‑width, equal‑frequency, or entropy‑based algorithms. Example: Transforming credit‑score into five risk categories (A‑E) based on quantiles. Practical application: Enables use of algorithms that require categorical input, like Naïve Bayes. Challenges: Arbitrary binning may introduce bias; too many bins re‑introduce noise.

Duplicate Removal – Concept #

Identifying and eliminating redundant records that can skew model training. Related terms: deduplication, record linkage. Explanation: Use pandas .duplicated() or fuzzy matching on key fields to flag repeats. Example: Removing two entries that share identical borrower ID and loan number. Practical application: Prevents over‑representation of certain borrowers, preserving sample independence. Challenges: Near‑duplicates (e.g., misspelled names) may evade simple exact‑match detection and require sophisticated similarity measures.
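Illustrative sketch of exact‑match deduplication on key fields with pandas. The loan records are hypothetical, and keeping the first occurrence is an assumption; fuzzy matching of near‑duplicates is not covered.

```python
import pandas as pd

loans = pd.DataFrame({
    "borrower_id": [101, 102, 101, 103],
    "loan_number": ["A1", "B7", "A1", "C3"],
    "amount": [5000, 12000, 5000, 7500],
})

keys = ["borrower_id", "loan_number"]
# True for every repeat after the first occurrence of a key combination.
is_repeat = loans.duplicated(subset=keys, keep="first")
deduped = loans.drop_duplicates(subset=keys, keep="first")
```

Inspecting the rows flagged by is_repeat before dropping them helps confirm that the chosen keys really identify a unique loan.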

Feature Engineering – Concept #

Creating new variables that capture domain‑specific insights. Related terms: derived features, interaction terms. Explanation: Combine, transform, or aggregate raw data to expose hidden patterns. Example: Constructing “debt‑to‑income ratio” from total debt and annual income fields. Practical application: Enhances predictive power of credit‑risk models, often more than algorithmic complexity alone. Challenges: Requires deep domain knowledge; risk of leakage if future information is inadvertently incorporated.

Feature Selection – Concept #

Choosing a subset of predictors that contribute most to model performance. Related terms: wrapper methods, embedded methods, filter methods. Explanation: Apply techniques such as recursive feature elimination, L1 regularization, or mutual information ranking. Example: Selecting the top 20 variables from an initial set of 200 based on their contribution to AUC. Practical application: Reduces overfitting and computational cost in high‑dimensional credit scoring. Challenges: Interaction effects may be missed; selection may be unstable across different training splits.

Gaussian Imputation – Concept #

Replacing missing values by draws from a Gaussian distribution fitted to observed data. Related terms: multiple imputation, Monte Carlo imputation. Explanation: Estimate mean μ and variance σ² from non‑missing entries; generate random values ~ N(μ,σ²) for each missing slot. Example: Imputing missing credit‑utilization percentages with draws from the observed utilization distribution. Practical application: Preserves variability for downstream uncertainty quantification. Challenges: Assumes normality; may generate impossible values (e.g., negative percentages) without truncation.

Hashing Trick – Concept #

Mapping high‑cardinality categorical variables to a fixed‑size vector via a hash function. Related terms: feature hashing, dimensionality reduction. Explanation: Each category string is hashed to an index; collisions are allowed, resulting in a sparse representation. Example: Encoding thousands of merchant‑category codes into a 1,024‑dimensional vector for a gradient‑boosted model. Practical application: Keeps memory footprint low when dealing with large categorical vocabularies in credit‑card transaction data. Challenges: Collision‑induced noise can degrade model accuracy; hash function must be deterministic across training and inference.
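Illustrative sketch of feature hashing into a fixed‑size count vector. The function names are hypothetical, and MD5 is used only as a convenient deterministic hash: Python's built‑in hash() is randomized per process for strings, so it would not give the same index at training and inference time.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 1024) -> int:
    """Deterministically map a category string to an index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(categories, n_buckets: int = 1024):
    """Count vector of length n_buckets; colliding categories share a slot."""
    vec = [0] * n_buckets
    for c in categories:
        vec[hash_bucket(c, n_buckets)] += 1
    return vec

# Hypothetical merchant-category codes seen on one account.
codes = ["5411", "5812", "5411"]
vector = hash_encode(codes)
```

A larger n_buckets lowers the collision rate at the cost of a wider (though still sparse) vector.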

Imbalanced Data Handling – Concept #

Techniques to address disproportionate class distributions, common in default prediction where non‑default dominates. Related terms: SMOTE, cost‑sensitive learning. Explanation: Apply resampling (over‑/under‑sampling), synthetic generation, or modify loss functions to penalize misclassification of the minority class. Example: Using random undersampling to reduce non‑default cases from 95 % to 70 % before training a logistic regression. Practical application: Improves recall for default detection, crucial for regulatory reporting. Challenges: Undersampling discards valuable information; oversampling may cause overfitting to synthetic minority examples.

Interaction Terms – Concept #

Variables created by multiplying two or more base features to capture synergistic effects. Related terms: polynomial features, feature crossing. Explanation: For features x₁ and x₂, construct x₁·x₂ as a new predictor. Example: Interaction between loan amount and unemployment rate to model heightened risk during economic downturns. Practical application: Allows linear models to approximate nonlinear relationships without resorting to complex algorithms. Challenges: Exponential growth of possible interactions; risk of multicollinearity if base features are highly correlated.

Jittering – Concept #

Adding small random noise to numeric values to break ties and improve algorithm stability. Related terms: noise injection, perturbation. Explanation: For each observation, add ε ~ Uniform(−δ, δ) where δ is a tiny fraction of the variable’s range. Example: Adding jitter to identical loan‑to‑value ratios before fitting a k‑nearest‑neighbors classifier. Practical application: Prevents deterministic tie‑breaking that could bias model outcomes. Challenges: Must keep δ sufficiently small to avoid distorting true signal.

K‑Nearest Neighbors Imputation – Concept #

Estimating missing values based on the values of the k most similar records. Related terms: distance‑based imputation, hot‑deck imputation. Explanation: Compute distance (e.g., Euclidean) between the incomplete record and complete records; average the values of the missing feature among the k nearest neighbors. Example: Imputing missing credit‑score using the average score of the 5 nearest borrowers based on income and employment length. Practical application: Retains local structure of the data, useful for heterogeneous credit portfolios. Challenges: Computationally expensive for large datasets; distance metric choice heavily influences results.
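Illustrative sketch of k‑nearest‑neighbors imputation with Euclidean distance. The function name, the donor records, and k = 3 are hypothetical; the features are assumed to be already on comparable scales, since unscaled features would let one variable dominate the distance.

```python
import math

def knn_impute(target_features, donors, k=5):
    """Average the observed value over the k donors closest to the target.

    donors is a list of (features, value) pairs where value is observed.
    """
    ranked = sorted(donors, key=lambda d: math.dist(target_features, d[0]))
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Features: (income in units of 10k, employment length in years); value: credit score.
donors = [
    ((5.0, 2.0), 640),
    ((5.2, 3.0), 660),
    ((9.0, 10.0), 760),
    ((4.8, 2.5), 650),
    ((9.5, 12.0), 780),
]
imputed_score = knn_impute((5.1, 2.4), donors, k=3)
```

The brute‑force sort over all donors is what makes this approach expensive on large datasets.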

Label Encoding – Concept #

Converting categorical labels to integer codes. Related terms: ordinal encoding, integer mapping. Explanation: Assign each distinct category a unique integer (e.g., “low” = 0, “medium” = 1, “high” = 2). Example: Encoding repayment status (current, delinquent, default) for a survival‑analysis model. Practical application: Required for algorithms that accept only numeric input, such as tree‑based ensembles. Challenges: Implicit ordering may mislead models that assume ordinal relationships when none exist.

Log Transformation – Concept #

Applying natural logarithm to reduce right‑skewness and compress large values. Related terms: log‑1p, Box‑Cox. Explanation: Compute log(x + c) where c is a constant (often 1) to handle zeros. Example: Transforming total outstanding balance before fitting a linear regression to predict loss‑given‑default. Practical application: Stabilizes variance, improves linear model fit. Challenges: Zero or negative values require offset; interpretation shifts from absolute to relative changes.

Missing Value Imputation – Concept #

Broad term covering all strategies to fill gaps in datasets. Related terms: single imputation, multiple imputation. Explanation: Choose an approach (mean, median, mode, model‑based) and apply it consistently across features. Example: Using median household income per ZIP code to impute missing income fields. Practical application: Enables use of complete‑case algorithms without discarding records. Challenges: Imputation bias if missingness mechanism is not random; may underestimate variability.

One‑Hot Encoding – Concept #

Representing each category of a nominal variable as a binary vector. Related terms: dummy variables, binary encoding. Explanation: For a variable with K categories, create K binary columns where only the column corresponding to the observed category is 1. Example: Encoding “employment type” (salaried, self‑employed, unemployed) into three separate columns. Practical application: Prevents unintended ordinal relationships in linear models and neural networks. Challenges: Increases dimensionality, especially with high‑cardinality features; may lead to sparse matrices.
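Illustrative sketch of one‑hot encoding with K binary columns. The function name and the employment values are hypothetical; the category list is fixed up front so that training and inference produce the same column order.

```python
def one_hot(values, categories):
    """Return one row of K zeros/ones per value, where K = len(categories)."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1   # raises KeyError for a category not in the list
        rows.append(row)
    return rows

employment_levels = ["salaried", "self-employed", "unemployed"]
encoded = one_hot(["salaried", "unemployed", "salaried"], employment_levels)
```

Each row sums to 1, which is why linear models with an intercept often drop one of the K columns.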

Outlier Detection – Concept #

Identifying observations that deviate markedly from the majority of the data. Related terms: anomaly detection, robust statistics. Explanation: Use statistical rules (e.g., 1.5 × IQR), distance‑based methods (e.g., Mahalanobis), or model‑based scores (e.g., isolation forest). Example: Flagging a borrower with a debt‑to‑income ratio of 3.5 as an outlier compared to the typical range of 0‑0.8. Practical application: Prevents distortion of model parameters, especially in linear and logistic regressions. Challenges: Defining “outlier” is context‑dependent; aggressive removal may discard rare but legitimate high‑risk cases.
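Illustrative sketch of the 1.5 × IQR rule using the standard library. The function name and the ratios are hypothetical; the "inclusive" quantile method is an assumption, and other quantile conventions can move the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical debt-to-income ratios with one extreme borrower.
dti = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.50, 0.60, 3.50]
flagged = iqr_outliers(dti)
```

The function only flags values; whether to remove, cap, or keep them is the context‑dependent decision described under Challenges.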

Principal Component Analysis (PCA) – Concept #

Linear dimensionality‑reduction technique that projects data onto orthogonal axes of maximal variance. Related terms: eigen‑decomposition, singular value decomposition. Explanation: Compute covariance matrix, extract eigenvectors (principal components), and retain the top N components that capture desired variance. Example: Reducing 100 financial ratios to 15 principal components that explain 92 % of total variance before feeding them to a support‑vector machine. Practical application: Mitigates multicollinearity and speeds up training of high‑dimensional models. Challenges: Components are linear combinations, making interpretation difficult; sensitive to scaling, so features measured in different units are normally standardized first.

Quantile Transformation – Concept #

Mapping data to a uniform or normal distribution based on empirical quantiles. Related terms: rank‑gauss scaling, power transform. Explanation: Sort values, compute percentile rank, then apply inverse CDF of target distribution. Example: Transforming skewed loan‑age variable to follow a standard normal distribution for a Gaussian Naïve Bayes classifier. Practical application: Improves performance of algorithms assuming Gaussian inputs, such as linear discriminant analysis. Challenges: Requires sufficient data to estimate quantiles reliably; may introduce discontinuities at quantile boundaries.

Rare Category Grouping – Concept #

Consolidating infrequent categorical levels into an “Other” bucket. Related terms: frequency encoding, category merging. Explanation: Identify categories whose frequency falls below a threshold (e.g., 1 %) and replace them with a common label. Example: Grouping merchant codes that appear in fewer than 0.5 % of transactions into “Other”. Practical application: Reduces dimensionality and prevents overfitting to noise in tree‑based models. Challenges: May mask informative niche patterns; threshold selection is subjective.
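Illustrative sketch of rare‑category grouping by frequency share. The function name, the merchant codes, and the 5 % threshold are hypothetical; in practice the set of kept categories would be learned on training data and reused at inference.

```python
from collections import Counter

def group_rare(categories, min_share=0.01, other_label="Other"):
    """Replace categories whose share of rows is below min_share."""
    counts = Counter(categories)
    total = len(categories)
    keep = {c for c, n in counts.items() if n / total >= min_share}
    return [c if c in keep else other_label for c in categories]

# 100 hypothetical transactions: two common codes and two rare ones.
codes = ["5411"] * 60 + ["5812"] * 38 + ["7995"] + ["4829"]
grouped = group_rare(codes, min_share=0.05)
```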

Robust Scaling – Concept #

Scaling features using statistics that are resistant to outliers (median and interquartile range). Related terms: median scaling, IQR scaling. Explanation: Compute (x − median)/IQR for each feature. Example: Scaling credit‑limit amounts with robust scaling before training a ridge regression model. Practical application: Provides stable scaling when data contain extreme values, common in credit portfolios. Challenges: Less efficient than standard scaling if data are already well‑behaved; may reduce variance too much for models that benefit from larger spread.

SMOTE (Synthetic Minority Over‑sampling Technique) – Concept #

Generating synthetic minority‑class samples by interpolating between nearest neighbors. Related terms: oversampling, ADASYN. Explanation: For each minority instance, select k nearest minority neighbors, create new points along the line segments joining them. Example: Producing 1,000 synthetic default cases to balance a dataset with 9,000 non‑default observations before fitting a random forest. Practical application: Improves classifier sensitivity to rare default events, aiding regulatory compliance. Challenges: May create overlapping regions with the majority class; synthetic points assume linearity between neighbors, which may not hold in complex credit features.
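Illustrative sketch of the interpolation step only, not a full SMOTE implementation. The function name, the two‑feature default cases, and k = 2 are hypothetical; the imbalanced‑learn package provides a maintained SMOTE class for real use.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new points on segments between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()   # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical default cases: (credit utilization, debt-to-income).
defaults = [(0.9, 0.8), (1.0, 0.7), (0.8, 0.9), (1.1, 0.85)]
new_points = smote_sketch(defaults, n_new=5)
```

Because every synthetic point lies on a straight segment between two real minority points, the output never leaves the region spanned by the original cases, which is the linearity assumption noted under Challenges.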

StandardScaler – Concept #

Scikit‑learn class implementing Z‑score standardization. Related terms: data standardization, preprocessing transformer. Explanation: Fit computes mean μ and standard deviation σ on training data; transform applies (x − μ)/σ. Example: Applying StandardScaler to the “age” feature before feeding it to a logistic regression model. Practical application: Puts features on a comparable scale so that no single feature dominates distance‑based algorithms. Challenges: Requires storage of μ and σ for future inference; unseen values that lie far outside the training range can produce extreme standardized scores.

Target Encoding – Concept #

Replacing categorical levels with the mean of the target variable for that level. Related terms: mean encoding, likelihood encoding. Explanation: Compute conditional expectation E(y|category) and substitute it for the category. Example: Encoding “state” by the average default rate observed in each state. Practical application: Captures predictive power of high‑cardinality categories with fewer dimensions than one‑hot encoding. Challenges: Prone to overfitting; requires regularization techniques such as smoothing or cross‑validation to mitigate leakage.
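Illustrative sketch of target encoding with additive smoothing toward the global mean. The function name, the state data, and the smoothing weight m are hypothetical; m is a tuning choice, and the encoding should be computed on training folds only to limit leakage.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to (sum_y + m * prior) / (count + m)."""
    prior = sum(targets) / len(targets)          # global default rate
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}

states = ["NY", "NY", "NY", "NY", "TX", "TX", "WY"]
defaulted = [0, 0, 1, 0, 1, 0, 1]
encoding = smoothed_target_encoding(states, defaulted, m=2.0)
```

The single WY record has a raw default rate of 1.0, but smoothing pulls its encoded value toward the global rate, which is how this regularization counters overfitting on small categories.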

Time Series Resampling – Concept #

Aggregating or interpolating time‑indexed data to a different frequency (e.g., daily to monthly). Related terms: downsampling, upsampling. Explanation: Use pandas .resample() with aggregation functions (sum, mean) or forward/backward fill for missing periods. Example: Converting daily default counts into monthly totals to align with macro‑economic indicators. Practical application: Aligns disparate data sources for joint modeling of credit risk over time. Challenges: Choice of aggregation method can hide important intra‑period variation; missing timestamps require careful imputation.
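Illustrative sketch of downsampling daily counts to monthly totals with pandas. The dates and counts are hypothetical; the "MS" alias labels each bucket by the first day of its month, which is an assumption chosen because it behaves the same across recent pandas versions.

```python
import pandas as pd

daily = pd.Series(
    [2, 0, 1, 3, 1, 4],
    index=pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-01-31",
        "2024-02-01", "2024-02-14", "2024-03-10",
    ]),
    name="defaults",
)

# Sum all daily counts that fall in the same calendar month.
monthly = daily.resample("MS").sum()
```

Using sum is appropriate for counts; a rate or balance would usually call for mean or last instead.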

Winsorizing – Concept #

Limiting extreme values by replacing them with specified percentile values. Related terms: clipping, truncation. Explanation: Values below the lower percentile p₁ are set to the p₁‑th value; values above the upper percentile p₂ are set to the p₂‑th value. Example: Winsorizing loan‑to‑value ratios at the 1 % and 99 % percentiles to reduce influence of outliers. Practical application: Stabilizes variance for regression models without discarding records. Challenges: Alters original data distribution; selection of percentiles is arbitrary and may impact model bias.
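Illustrative sketch of percentile‑based winsorizing with NumPy. The function name and the ratios are hypothetical, and the 10th/90th percentiles are used only because the sample has ten values; percentiles follow NumPy's default linear interpolation.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Set values below/above the given percentiles to those percentile values."""
    arr = np.asarray(values, dtype=float)
    lo, hi = np.percentile(arr, [lower_pct, upper_pct])
    return np.clip(arr, lo, hi)

# Hypothetical loan-to-value ratios with one low and one high extreme.
ltv = np.array([0.05, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 2.50])
capped = winsorize(ltv, lower_pct=10, upper_pct=90)
```

Unlike clipping at fixed bounds, the limits here are derived from the data, so they should be computed on the training set and reused for new records.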

Zero‑Variance Feature Removal – Concept #

Dropping predictors that exhibit no variation across observations. Related terms: constant feature elimination, near‑zero variance. Explanation: Identify columns where standard deviation equals zero and remove them. Example: Removing a “currency” column that is uniformly “USD” in the dataset. Practical application: Prevents singular matrix errors in linear models and reduces computational waste. Challenges: Near‑zero variance features may still carry information when combined with other variables, requiring careful threshold setting.

z‑Score Normalization – Concept #

Synonym for standardization; scaling data to have mean 0 and standard deviation 1. Related terms: standard scaling, unit variance. Explanation: Compute (x − μ)/σ for each observation. Example: Normalizing the “number of credit inquiries” feature before feeding it to a gradient‑boosted tree. Practical application: Facilitates interpretation of coefficients in linear models as effect per standard deviation. Challenges: Sensitive to outliers; requires consistent μ and σ across training and production.

Alpha‑Beta Smoothing – Concept #

Exponential smoothing technique that forecasts a series using level (α) and trend (β) components. Related terms: Holt’s linear trend, time‑series forecasting. Explanation: Update level Lₜ = α·yₜ + (1‑α)·(Lₜ₋₁ + Tₜ₋₁) and trend Tₜ = β·(Lₜ − Lₜ₋₁) + (1‑β)·Tₜ₋₁. Example: Forecasting next month’s aggregate default amount using past monthly totals. Practical application: Provides smooth projections for stress‑testing credit portfolios. Challenges: Requires careful tuning of α and β; assumes linear trend, which may not hold in volatile economic periods.
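Illustrative sketch of the level and trend updates given above. The function name, the monthly totals, and the values of α and β are hypothetical; initializing the level with the first observation and the trend with the first difference is one common choice, assumed here.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Holt's linear trend: return the forecast `horizon` steps ahead."""
    level = series[0]                 # assumed initialization
    trend = series[1] - series[0]     # assumed initialization
    for y in series[1:]:
        prev_level = level
        # L_t = alpha * y_t + (1 - alpha) * (L_{t-1} + T_{t-1})
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # T_t = beta * (L_t - L_{t-1}) + (1 - beta) * T_{t-1}
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

monthly_defaults = [100.0, 110.0, 120.0, 130.0, 140.0]
next_month = holt_forecast(monthly_defaults)
```

On a perfectly linear series the method reproduces the line, so the next‑month forecast continues the 10‑unit monthly increase.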

Beta Distribution Scaling – Concept #

Transforming variables to follow a beta distribution confined to the interval [0, 1]. Related terms: bounded scaling, probability integral transform. Explanation: Fit shape parameters α and β to the data, then apply the beta CDF to map values onto [0, 1]. Example: Scaling probability of default estimates to a beta‑scaled score for integration into a risk‑adjusted pricing model. Practical application: Ensures outputs respect natural bounds (e.g., probabilities). Challenges: Requires data that already lie in, or are first rescaled to, the interval (0, 1); fitting may be unstable for small sample sizes.

Chi‑Square Binning – Concept #

Discretizing a continuous variable by merging intervals that have similar target distributions, guided by χ² statistics. Related terms: supervised binning, optimal binning. Explanation: Start with many small bins, iteratively combine adjacent bins that minimize the χ² statistic until a target number of bins is reached. Example: Binning “age” into risk groups that best separate default and non‑default outcomes. Practical application: Improves interpretability of scorecards while preserving predictive power. Challenges: Computationally intensive for large datasets; may over‑fit if too many bins are retained.

Data Drift Detection – Concept #

Monitoring changes in the statistical properties of input features over time. Related terms: concept drift, covariate shift. Explanation: Compare distributions (e.g., using the KS test) between training and recent data windows. Example: Detecting a shift in average credit‑score after a regulatory change in lending standards. Practical application: Triggers model retraining alerts to maintain predictive accuracy in credit risk systems. Challenges: Requires continuous data collection; distinguishing benign drift from harmful shift can be non‑trivial.
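Illustrative sketch of the two‑sample Kolmogorov‑Smirnov statistic, the largest gap between two empirical CDFs. The function name and the score samples are hypothetical; this brute‑force version returns only the statistic, whereas scipy.stats.ks_2samp also returns a p‑value when SciPy is available.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    d = 0.0
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

train_scores = [620, 640, 660, 680, 700, 720]     # training window
recent_scores = [680, 700, 720, 740, 760, 780]    # recent window, shifted upward
drift = ks_statistic(train_scores, recent_scores)
```

A statistic near 0 means the two windows look alike; what value should trigger a retraining alert is a policy decision and is not specified here.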

Elastic Net Regularization – Concept #

Combines L1 (lasso) and L2 (ridge) penalties to shrink coefficients while performing variable selection. Related terms: regularized regression, penalized likelihood. Explanation: Minimize loss + λ₁·Σ|βⱼ| + λ₂·Σβⱼ². Example: Fitting a logistic regression for default prediction with elastic net to handle correlated financial ratios. Practical application: Balances sparsity and stability, useful when many predictors are correlated. Challenges: Requires tuning two hyperparameters (λ₁ and λ₂, often reparameterized as an overall penalty strength and an L1/L2 mixing ratio); cross‑validation can be computationally expensive.

Feature Crossing – Concept #

Creating interaction features by concatenating categorical variables into a single combined category. Related terms: polynomial features, crossed columns. Explanation: For categories A₁, A₂ and B₁, B₂, generate new categories A₁_B₁, A₁_B₂, etc. Example: Crossing “employment type” with “housing status” to capture joint risk patterns. Practical application: Boosts performance of linear models on sparse categorical data, which cannot represent such interactions on their own; gradient‑boosted trees can learn some interactions directly but may still benefit. Challenges: Explosion of categories can lead to sparsity; requires careful pruning of low‑frequency crossed categories.

Gaussian Mixture Modeling (GMM) – Concept #

Probabilistic model representing data as a mixture of multiple Gaussian distributions. Related terms: expectation‑maximization, soft clustering. Explanation: Fit parameters (means, covariances, mixing weights) using EM algorithm. Example: Segmenting borrowers into risk clusters based on income and debt‑to‑income ratio using a 3‑component GMM. Practical application: Provides soft assignments for risk tiering, useful for portfolio allocation. Challenges: Sensitive to initialization; determining the correct number of components often requires information criteria (AIC/BIC).

Histogram Equalization – Concept #

Adjusting the distribution of a numeric variable to achieve a uniform histogram. Related terms: contrast stretching, rank transformation. Explanation: Compute rank of each value, divide by total count, and map to uniform [0,1] interval. Example: Equalizing the distribution of loan‑amount values before training a neural network to avoid bias toward common ranges. Practical application: Enhances representation of rare high‑value loans in model training. Challenges: May amplify noise in low‑density regions; not suitable when preserving original scale is important.

Iterative Imputer – Concept #

Multivariate imputation method that models each feature with missing values as a function of other features, iterating until convergence. Related terms: MICE (Multiple Imputation by Chained Equations), regression imputation. Explanation: Initialize missing entries (e.g., with the mean), then sequentially regress each feature on others, updating imputed values each cycle. Example: Using an iterative imputer to fill missing credit‑history length based on income, age, and employment status. Practical application: Captures relationships among variables, leading to more realistic imputations for credit risk datasets. Challenges: Computationally intensive; convergence not guaranteed for highly collinear data.

Kernel Density Estimation (KDE) Scaling – Concept #

Smoothing a variable’s empirical distribution using kernel functions to generate a continuous probability density estimate. Related terms: non‑parametric density estimation, bandwidth selection. Explanation: Estimate f(x) = (1/(n·h))·∑K((x‑xᵢ)/h), where K is a kernel, h is the bandwidth, and n is the number of observations. Example: Estimating the distribution of credit‑utilization rates to inform a custom quantile‑based scaling. Practical application: Provides a data‑driven transformation that aligns with underlying distribution, useful for probabilistic modeling. Challenges: Choice of bandwidth critically affects smoothness; high‑dimensional KDE suffers from curse of dimensionality.
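Illustrative sketch of a one‑dimensional kernel density estimate with a Gaussian kernel, including the 1/(n·h) normalization so that the estimate integrates to 1. The function name, the utilization rates, and the bandwidth h = 0.1 are hypothetical; the bandwidth is set by hand rather than selected from the data.

```python
import math

def gaussian_kde(x, data, h):
    """Density estimate at x: (1 / (n*h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    # Standard normal kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi).
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (n * h * math.sqrt(2 * math.pi))

# Hypothetical credit-utilization rates.
utilization = [0.10, 0.30, 0.35, 0.80]
density_at_03 = gaussian_kde(0.30, utilization, h=0.1)
density_at_06 = gaussian_kde(0.60, utilization, h=0.1)
```

The estimate is higher near 0.30, where two observations cluster, than at 0.60, where none fall; a larger h would blur that difference.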

L1 Normalization – Concept #

Scaling vectors so that the sum of absolute values equals 1. Related terms: Manhattan norm, probability simplex. Explanation: For a vector x, compute x / ∑|xᵢ|. Example: Normalizing the weightings of macro‑economic indicators before feeding them into a linear combination model for PD estimation. Practical application: Ensures that feature contributions are directly comparable as proportionate shares. Challenges: Undefined for an all‑zero vector, where the denominator is zero; does not address variance differences across features.

Lag Feature Creation – Concept #

Generating time‑shifted versions of a variable to capture temporal dependencies. Related terms: autoregressive features, time‑lagged variables. Explanation: For a series yₜ, create yₜ₋₁, yₜ₋₂, … as separate columns. Example: Adding a 3‑month lag of default rate to predict the current month’s risk. Practical application: Captures momentum effects in credit‑risk time series, improving forecast accuracy. Challenges: Increases dimensionality; missing values appear at the beginning of the series and must be handled.
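Illustrative sketch of lag features with pandas shift. The default rates are hypothetical, and the rows are assumed to be sorted chronologically with one row per month; dropping the incomplete leading rows is one simple way to handle the missing values.

```python
import pandas as pd

rates = pd.DataFrame({"default_rate": [0.020, 0.022, 0.021, 0.025, 0.027]})

# shift(k) moves values down k rows, so each row sees the value from k periods earlier.
rates["lag_1"] = rates["default_rate"].shift(1)
rates["lag_3"] = rates["default_rate"].shift(3)

# The first three rows have no 3-month history and are dropped here.
model_rows = rates.dropna()
```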

Mahalanobis Distance – Concept #

Multivariate distance metric that accounts for covariance among variables. Related terms: elliptical distance, outlier detection. Explanation: D = √((x‑μ)ᵀΣ⁻¹(x‑μ)), where μ is the mean vector and Σ is the covariance matrix. Example: Measuring how far a borrower’s profile lies from the average risk cluster, flagging potential outliers. Practical application: Detects multivariate anomalies in credit‑risk dashboards. Challenges: Requires invertible covariance matrix; unstable when variables are highly collinear or when sample size is small.
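Illustrative sketch of the distance formula above with NumPy. The function name and the borrower data are hypothetical; the pseudo‑inverse is used in place of a plain inverse as a hedge against a near‑singular covariance matrix, which is an assumption rather than part of the definition.

```python
import numpy as np

def mahalanobis(x, data):
    """Distance of point x from the mean of data, scaled by data's covariance."""
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)          # sample covariance of the columns
    diff = np.asarray(x, dtype=float) - mu
    d2 = diff @ np.linalg.pinv(cov) @ diff    # (x - mu)^T Sigma^-1 (x - mu)
    return float(np.sqrt(max(d2, 0.0)))       # guard against tiny negative rounding

# Hypothetical borrowers: (debt-to-income ratio, income in thousands).
borrowers = np.array([
    [0.20, 40.0],
    [0.30, 50.0],
    [0.25, 45.0],
    [0.35, 60.0],
    [0.40, 55.0],
])
typical = mahalanobis([0.30, 50.0], borrowers)
unusual = mahalanobis([0.90, 20.0], borrowers)
```

The second profile combines a high ratio with a low income, against the positive correlation in the data, so its distance is much larger than that of the typical profile.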

Min‑Max Scaling – Concept #

Rescaling features to a fixed range, typically [0, 1]. Related terms: normalization, amplitude scaling. Explanation: Compute (x − min)/(max − min) for each feature. Example: Scaling the “number of open credit lines” to 0‑1 before feeding it to a logistic regression. Practical application: Required for neural networks and distance‑based methods that are sensitive to absolute magnitude. Challenges: Outliers compress the scale of the majority of data; new values outside the original min‑max range need re‑scaling.

Neural Network Embedding – Concept #

Learning dense vector representations for high‑cardinality categorical variables within a neural network. Related terms: entity embedding, deep learning feature extraction. Explanation: Map each category to a low‑dimensional trainable vector; embeddings are updated during backpropagation. Example: Embedding “merchant category code” into an 8‑dimensional vector for a credit‑card fraud detection model. Practical application: Captures semantic similarity among categories, improving model performance on sparse categorical data. Challenges: Requires sufficient training data; embeddings may not be interpretable without post‑hoc analysis.

Ordinal Encoding – Concept #

Assigning integer values preserving the natural order of categories. Related terms: label encoding, rank encoding. Explanation: Map ordered categories (e.g., “low”, “medium”, “high”) to 0, 1, 2 respectively. Example: Encoding credit‑rating grades (AAA, AA, A, BBB…) into increasing integers for a linear regression. Practical application: Allows models to exploit ordering information when it is meaningful. Challenges: Implicitly assumes equal spacing between levels; inappropriate for nominal categories.

Partial Least Squares (PLS) – Concept #

Regression technique that projects predictors and response onto latent structures maximizing covariance. Related terms: dimension reduction, multivariate regression.
