In this talk presented to the Melbourne Users of R Network (MelbURN), Ross Gayler describes using R to do some bespoke reject inference modelling.

# Population Stability Index (PSI)

The **Population Stability Index** (**PSI**) is one method of indicating whether a scorecard is likely to have degraded over time. It tells us how much the population has changed over time. The PSI can be applied at a score level, by binning the scores. This will tell us whether the population as a whole has shifted over time. Alternatively, the PSI can be applied at an individual variable level.

## Population Stability Index formula (PSI formula)

Assume we have a development population (population 1) and wish to compare it to a more recent population (population 2), a couple of years after the scorecard was developed. The PSI formula is given by:

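$$\mathrm{PSI} = \sum_{i} \left( \frac{n_{1i}}{N_1} - \frac{n_{2i}}{N_2} \right) \ln\left( \frac{n_{1i}/N_1}{n_{2i}/N_2} \right)$$

where:

- $n_{1i}$, $n_{2i}$ are the number of observations in bin $i$ for populations 1 and 2
- $N_1$, $N_2$ are the total number of observations for populations 1 and 2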

As a rule of thumb, a PSI < 0.1 indicates minimal change in the population, a PSI between 0.1 and 0.2 indicates changes that might warrant further investigation, and a PSI > 0.2 indicates a significant change in the population.
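As a rough illustration, the PSI at score level can be sketched in a few lines of R. The function name `psi()`, the choice of decile bins taken from the development sample, and the simulated scores are all assumptions made for this example, not part of any standard package.

```r
# Hypothetical helper: PSI between two score vectors, binned on quantiles of
# the development sample. The name psi() and the binning are assumptions.
psi <- function(dev, recent, bins = 10) {
  # Cutpoints come from the development population; the outer bins are left
  # open-ended so recent scores outside the development range are still counted
  cutpoints <- quantile(dev, (0:bins) / bins)
  cutpoints[1] <- -Inf
  cutpoints[bins + 1] <- Inf

  # Proportion of each population falling in each bin
  p1 <- as.numeric(table(cut(dev, cutpoints))) / length(dev)
  p2 <- as.numeric(table(cut(recent, cutpoints))) / length(recent)

  # An empty bin makes the log undefined; this sketch stops rather than patch it
  stopifnot(all(p1 > 0), all(p2 > 0))

  sum((p1 - p2) * log(p1 / p2))
}

set.seed(1)
dev     <- rnorm(5000, mean = 600, sd = 50)  # simulated development scores
same    <- rnorm(5000, mean = 600, sd = 50)  # recent scores, no shift
shifted <- rnorm(5000, mean = 630, sd = 50)  # recent scores, mean shifted up

psi_same    <- psi(dev, same)
psi_shifted <- psi(dev, shifted)
```

With these simulated scores, `psi_same` should fall well inside the "minimal change" band and `psi_shifted` should land in the "significant change" band. The same function can be pointed at a single input variable instead of the score.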

## Notes on the use of the PSI

Note that the PSI does not tell us anything about the relationships between the variables and the outcome, but merely that the distribution of the population has changed. If a high PSI is detected, it may be that the scorecard is performing well but the population has changed due to new origination strategies or changes in economic conditions. As such, the PSI should not be viewed in isolation, and the reasons for population changes should be investigated.

Population shifts can also be tested with a Chi-squared test for binned data, or by comparing the score distributions with the Kolmogorov-Smirnov (KS) test, shifts in means, or similar.
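These alternative checks are all available in base R. The sketch below uses simulated scores (an assumption for illustration) and applies `ks.test()` to the raw scores, `chisq.test()` to the binned counts, and a simple difference in means.

```r
set.seed(2)
dev    <- rnorm(2000, mean = 600, sd = 50)  # simulated development scores
recent <- rnorm(2000, mean = 620, sd = 50)  # simulated recent scores, shifted up

# Kolmogorov-Smirnov test on the raw score distributions
ks <- ks.test(dev, recent)

# Chi-squared test on binned counts: rows are populations, columns are bins
cutpoints <- c(-Inf, quantile(dev, (1:9) / 10), Inf)
counts <- rbind(
  dev    = as.numeric(table(cut(dev, cutpoints))),
  recent = as.numeric(table(cut(recent, cutpoints)))
)
chi <- chisq.test(counts)

# Shift in means
mean_shift <- mean(recent) - mean(dev)
```

With large samples these tests will flag even small shifts as statistically significant, so the size of the shift still needs to be judged alongside the p-value.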

# Basel 2 RWA formula

Credit scoring models often become inputs into regulatory and economic capital calculations such as the Basel 2 RWA formula. Probability of Default (PD), Exposure at Default (EAD) and Loss Given Default (LGD) models are all used for this purpose. One measure of loss, *Expected Loss*, is simply given by EL = PD × EAD × LGD. But this is not sufficient to work out how much capital a bank should hold, as it does not account for the uncertainty of losses from year to year.

*Unexpected Loss* addresses this problem by calculating the losses that might occur with some specified probability. Under Basel 2, banks are required to hold sufficient capital to withstand a 1 in 1000 year loss event. The mathematics required to do this is more difficult than the Expected Loss equation. Luckily for us, the BIS specifies the formulae and parameters to use to calculate capital requirements and RWA. The Basel 2 RWA formula uses an extension of the Vasicek formula, named after its originator. It is a single factor model which assumes that defaults of assets are correlated rather than independent. The higher the asset correlation, the riskier an asset is.

Note: Risk Weighted Assets are simply as their name implies. The bank's assets (loans) are weighted according to their riskiness. The amount of capital the bank is required to hold is a fixed proportion of the total risk weighted assets (the factor of 12.5 used below is the reciprocal of the 8% minimum capital ratio).

## Calculating Capital Requirement and Risk Weighted Assets for retail

The capital requirement, K, is calculated as

$$K = LGD \cdot N\left[ \frac{G(PD) + \sqrt{R}\, G(0.999)}{\sqrt{1 - R}} \right] - PD \cdot LGD$$

from which we can then derive the Risk Weighted Assets as RWA = K × 12.5 × EAD

where R = 0.04 for Credit Cards, which are also known as Qualifying Revolving Retail Exposures (QRRE), and R = 0.15 for Residential Mortgages. N() is the standard normal cumulative distribution function and G() is its inverse.
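A minimal sketch of the retail calculation in R follows, using `pnorm()` for N() and `qnorm()` for G(). The helper name `retail_k()` and the input values are assumptions chosen for illustration.

```r
# Hypothetical helper (name is an assumption): retail capital requirement K
retail_k <- function(pd, lgd, r, cf = 0.999) {
  # pnorm() plays the role of N() and qnorm() the role of G()
  lgd * pnorm((qnorm(pd) + sqrt(r) * qnorm(cf)) / sqrt(1 - r)) - pd * lgd
}

pd  <- 0.01
lgd <- 0.2
ead <- 100

k_mortgage <- retail_k(pd, lgd, r = 0.15)  # residential mortgage correlation
k_qrre     <- retail_k(pd, lgd, r = 0.04)  # credit card (QRRE) correlation

rwa_mortgage <- k_mortgage * 12.5 * ead
rwa_qrre     <- k_qrre * 12.5 * ead
```

For the same PD and LGD, the mortgage gives the larger K, which shows directly how a higher asset correlation translates into a higher capital requirement.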

## Calculating Capital Requirement and Risk Weighted Assets for Corporate Exposures

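First the asset correlation is calculated,

$$R = 0.12 \cdot \frac{1 - e^{-50 \cdot PD}}{1 - e^{-50}} + 0.24 \cdot \left( 1 - \frac{1 - e^{-50 \cdot PD}}{1 - e^{-50}} \right)$$

Then a maturity adjustment parameter is calculated

$$b = \left( 0.11852 - 0.05478 \cdot \ln(PD) \right)^2$$

The capital requirement K is calculated as

$$K = \left( LGD \cdot N\left[ \frac{G(PD) + \sqrt{R}\, G(0.999)}{\sqrt{1 - R}} \right] - PD \cdot LGD \right) \cdot \frac{1 + (M - 2.5)\, b}{1 - 1.5\, b}$$

where M is the maturity in years. From this we can calculate the Risk Weighted Assets as before: RWA = K × 12.5 × EAD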

Note that the BIS requires a calibration factor of 1.06 to be multiplied by the RWA on top of the results of the above formulae.
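The corporate calculation can be sketched the same way. The helper name `corporate_k()` and the inputs are assumptions for illustration, and the sketch leaves out any firm-size adjustment to the correlation.

```r
# Hypothetical helper (name is an assumption): corporate capital requirement K
corporate_k <- function(pd, lgd, m, cf = 0.999) {
  # Asset correlation, falling from 0.24 towards 0.12 as PD rises
  w <- (1 - exp(-50 * pd)) / (1 - exp(-50))
  r <- 0.12 * w + 0.24 * (1 - w)

  # Maturity adjustment
  b <- (0.11852 - 0.05478 * log(pd))^2
  m_adj <- (1 + (m - 2.5) * b) / (1 - 1.5 * b)

  (lgd * pnorm((qnorm(pd) + sqrt(r) * qnorm(cf)) / sqrt(1 - r)) - pd * lgd) * m_adj
}

pd  <- 0.02
lgd <- 0.4
ead <- 100

k_2y <- corporate_k(pd, lgd, m = 2)
k_5y <- corporate_k(pd, lgd, m = 5)

rwa_2y        <- k_2y * 12.5 * ead  # before the calibration factor
rwa_2y_scaled <- 1.06 * rwa_2y      # with the 1.06 calibration factor applied
```

Longer maturities increase K through the maturity adjustment, so `k_5y` comes out larger than `k_2y` for the same PD and LGD.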

## Further information on RWA and Regulatory Capital

Discussion of the various formulae can be found at the following websites

I have created an R function to produce the various outputs of the Basel II RWA formulae.

## R function to calculate RWA and K

**Calculate Basel 2 Risk Weighted Assets (RWA) and other related measures**

**Description:** Calculate Risk Weighted Assets given PD, EAD, LGD and maturity according to the Basel 2 formula

**Usage:** rwa(pd, ead, lgd, m, s, product, cf, corr)

**Arguments:**

- pd - Long-run Probability of Default for the customer or account
- ead - Downturn Exposure at Default for the customer or account
- lgd - Downturn Loss Given Default for the customer or account
- m - Maturity (in years)
- s - Exposure adjustment
- product - One of "Corporate", "Mortgage", "QRRE", "Retail Other"
- cf - Confidence for the prediction. Default is set to 0.999, which corresponds to a 1 in 1000 year event
- corr - Asset correlation, if you wish to override the Basel 2 calculated value

**Details:** Calculate Risk Weighted Assets given PD, EAD, LGD and maturity according to the Basel 2 formula

**Value:** A list is output containing the following:

- el - Expected Loss
- ul - Unexpected Loss
- pd.implied - The PD implied by the Vasicek formula for the specified confidence
- k - Capital requirement factor
- reg.cap - Capital requirement in dollars
- rwa - Risk weighted assets in dollars
- risk.weight - Final risk weight

**Author(s):** R Credit Scoring http://www.rcreditscoring.com

**References:**

- http://www.rcreditscoring.com/basel-2-rwa-formula/
- http://www.bis.org/bcbs/irbriskweight.pdf
- http://en.wikipedia.org/wiki/Advanced_IRB

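    rwa <- function(pd=0.02, ead=100, lgd=0.4, m=2, s=50, product="Corporate",
                    cf=0.999, corr=NULL){
      # The right-hand sides below are reconstructed from the Basel 2 formulae,
      # so treat them as a sketch of the calculation
      m.adj <- 1
      if (product=="Corporate"){
        # Exposure adjustment, with s limited to the range 5 to 50
        s[s>50] <- 50
        s[s<5] <- 5
        exp.adj <- 0.04*(1-(s-5)/45)
        # Correlation
        w <- (1-exp(-50*pd))/(1-exp(-50))
        r <- 0.12*w + 0.24*(1-w) - exp.adj
        # Maturity adjustment
        b <- (0.11852-0.05478*log(pd))^2
        m.adj <- (1+(m-2.5)*b)/(1-1.5*b)
      } else if (product=="Mortgage"){
        r <- 0.15
      } else if (product=="QRRE"){
        r <- 0.04
      } else if (product=="Retail Other"){
        w <- (1-exp(-35*pd))/(1-exp(-35))
        r <- 0.03*w + 0.16*(1-w)
      }
      # Override the correlation if specified by the user
      if (!is.null(corr)) {
        r <- corr
      }
      el <- pd*ead*lgd
      pd.implied <- pnorm((qnorm(pd)+sqrt(r)*qnorm(cf))/sqrt(1-r))
      ul <- (pd.implied-pd)*ead*lgd
      # Calculate k
      k <- (pd.implied-pd)*lgd*m.adj
      reg.cap <- k*ead
      rwa <- k*12.5*ead
      risk.weight <- rwa/ead
      # Returned as a data frame (a list of columns) so that results for a
      # vector of PDs can be indexed by column name
      res <- data.frame(el, ul, pd.implied, k, reg.cap, rwa, risk.weight)
      return(res)
    }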

## Examples

    # Calculate RWA for various products
    rwa(pd=0.01, ead=100, lgd=0.2, product="Mortgage")
    rwa(pd=0.01, ead=100, lgd=0.8, product="QRRE")
    rwa(pd=0.01, ead=100, lgd=0.4, product="Retail Other")
    rwa(pd=0.02, ead=100, lgd=0.4, m=2, s=50, product="Corporate")

    # Plot rwa versus PD for mortgages
    # (ead and lgd here are assumed to match the mortgage example above)
    y <- rwa(pd=seq(0.001, to=0.1, by=0.001), ead=100, lgd=0.2, product="Mortgage")
    plot(x = seq(0.001, to=0.1, by=0.001), y = y[ , "rwa"], type="l", xlab="PD", ylab="RWA")

# Credit Scoring Conference

The biennial **Credit Scoring and Credit Control XIII** conference is on again August 28-30, 2013 in Edinburgh. This is the premier conference for both academics and practitioners of credit modelling.

Details can be found at the conference website.

There is also an excellent archive of previous conference presentations and papers.

# Binning continuous variables in R - the basics

In credit scoring models, continuous variables are often transformed into categorical variables by a process known as binning. While this can reduce the power of the final model, as some of the granularity of the variable is lost, when done judiciously there is little loss of power.

Some advantages to binning include:

- The scorecard format using binned variables is easily read and interpreted
- Non-linear dependencies can be modeled using a linear relationship (such as simple logistic regression)
- Some banking systems cannot create scorecards with continuous variables

Disadvantages of binning (or discretizing continuous distributions):

- Frank Harrell lists 13 good reasons not to bin continuous variables
- The most important disadvantages in my mind relate to the loss of information when you bin the variable
- This is compounded especially if there are too few (or too many) bins. Two analysts will rarely come to the same conclusion as to the correct number of bins and their cutpoints

Considering that most retail credit scoring models are binned, getting a good handle on basic binning in R is a handy skill. In this post I describe how to bin some (well behaved) continuous data.

The problem: Bin a continuous variable into n equally sized (by number of observations) bins. For this case I am assuming there are no special or missing values and that there are no large concentrations of observations on a particular number.

Firstly we need to create a dataset to test our binning code. For this occasion, I am creating a vector of 1000 normally distributed points with a mean of zero and standard deviation of 50.

    > x<-rnorm(1000, mean=0, sd=50)
    > summary(x)
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    -157.1000  -32.9200   -0.5643    0.7896   34.0000  148.3000

Next, let's say we want to create ten bins with an equal number of observations in each bin:

    > bins<-10
    > cutpoints<-quantile(x,(0:bins)/bins)

The cutpoints variable holds a vector of the cutpoints used to bin the data. Finally we perform the binning itself to form the discretized variable

    > binned <-cut(x,cutpoints,include.lowest=TRUE)
    > summary(binned)

And there you have it. The binned vector holds our new categorical variable which can then be used for further analysis.
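Putting the steps together, the whole exercise can be run as one script. The seed and the two checks at the end (`bin_counts` and `bin_means`) are additions for illustration only.

```r
set.seed(42)  # arbitrary seed so the example is repeatable
x <- rnorm(1000, mean = 0, sd = 50)

# Cutpoints at the deciles of x, then bin on those cutpoints
bins <- 10
cutpoints <- quantile(x, (0:bins) / bins)
binned <- cut(x, cutpoints, include.lowest = TRUE)

# Number of observations per bin
bin_counts <- table(binned)

# Mean of x within each bin, a quick check that the bins are ordered
bin_means <- tapply(x, binned, mean)
```

Note that `include.lowest=TRUE` matters here: without it the minimum of `x` sits exactly on the lowest cutpoint and would be left out of the first bin as a missing value.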

In the next instalment of binning articles I will discuss how to best handle missing, special and highly concentrated observations.

# The Signal and the Noise by Nate Silver: Review

This is a review of Nate Silver's *The Signal and the Noise: Why So Many Predictions Fail-but Some Don't*.

Nate Silver is something of a prediction phenomenon at the moment. During the 2012 US presidential election, Nate's blog fivethirtyeight at the New York Times was one of the most heavily trafficked sites.

Prior to predicting election outcomes, Nate studied baseball and was a successful semi-pro poker player. As you can see, he has spent many years evaluating predictions in a variety of fields. This book presents lessons from his observations.

The style is not as chatty as one would expect from an author like Malcolm Gladwell, yet Nate keeps the readability high and doesn't get bogged down in statistics.

Examples are given from fields as diverse as weather forecasting, breast cancer detection, baseball and poker.

Some of the messages I got out of the book include:

- Beware of absolutes
- Think probabilistically
- Complicated models are not necessarily better
- Don't mistake noise for the signal
- Bayesian is better

## Table of Contents for *The Signal and the Noise* (with a short explanation of the content of each chapter)

- A catastrophic failure of prediction - explains why so few got it right about the GFC
- Are you smarter than a television pundit - examines the type of people who make good predictions
- All I care about is W's and L's - prediction in baseball
- For years you've been telling us that rain is green - difficulties in forecasting weather
- Desperately seeking signal - earthquakes!
- How to drown in three feet of water - economic forecasting
- Role Models - modeling disease transmission
- Less and less and less wrong - Sports betting (and Bayes' Theorem)
- Rage against the machines - Chess computers versus Kasparov
- The Poker bubble - Poker and the poker bubble
- If you can't beat them... - Group forecasts, herd behavior
- A climate of healthy skepticism - Climate change
- What you don't know can hurt you - Terrorism and 9/11

Although this book is not directly concerned with building credit prediction models, I think it is essential reading for anyone who makes predictions for a living.

# New version of RStudio ready to download (v0.97)

My current favorite R editor/IDE has just released a new version. Features include:

- Enhanced package development tools
- Vim editing mode!
- Heaps more enhancements and bugfixes

# Useful R Libraries for Credit Scoring

Base R will get you only so far. Here are some of the packages that I load most often for my credit scoring projects:

- plyr - For splitting data into manageable pieces
- RODBC - Pulling data directly from databases is preferable to exporting from the database to a csv and importing to R
- ROCR - ROC curves! (and related)
- ggplot2 - Base R graphics can be a bit terse. ggplot2 provides some prettier charts and many more options

While there are more that I use, these are the libraries that I use in practically every project.

# Credit Scoring Datasets

As you would expect, credit scoring data is hard to come by because banks don't like handing out sensitive information to the public.

However, there are a few credit scoring datasets that are available for download and are useful for benchmarking or trying out new algorithms.

A couple can be found at the UCI Machine Learning Repository:

- The first is the German Credit Data Set, which contains 1000 observations and 20 attributes, both continuous and categorical
- The second is the Credit Approval Data Set which contains 690 observations and 15 attributes.
- The book Credit Scoring and Its Applications contains a CD with a credit scoring data set with 1000 observations
- A larger dataset is the one used in the Kaggle competition Give Me Some Credit. I talk a bit about the Kaggle Credit Scoring competition in an earlier post.

Note that each of these data files has been processed to remove any customer identifiable features.

Do you know of any more free datasets out there? If so let me know and I will add them to the list.

# Gini Coefficient in R

Rightly or wrongly, the Gini coefficient is the main measure of model discrimination (or rank ordering) used by credit scoring professionals. The easiest way to calculate Gini in R is to use the rcorr.cens function from Frank Harrell's Hmisc library.

So to calculate Gini in R, all you need is the following code given that you have generated scores for your population.

    > library(Hmisc)
    > rcorr.cens(x, S)

where x is the vector of scores for each account and S is a vector of responses (default/non-default).

The Gini is output as the value under the heading Dxy.

Note that while we call this measure Gini in a credit scoring context, it goes under a variety of names: accuracy ratio, Somers' D, and Powerstat. It is also related to C, the area under the ROC curve, by the relationship *Gini = 2 × (C - 0.5)*.
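To make the relationship with C concrete, here is a sketch that computes C from ranks in base R and converts it to Gini. The helper name `gini()`, the 0/1 coding of the outcome and the simulated data are assumptions for illustration. It is not how `rcorr.cens` is implemented, although the result should be comparable to its Dxy when higher scores go with an outcome of 1.

```r
# Hypothetical helper: Gini from the rank-based (Mann-Whitney) estimate of C
gini <- function(score, outcome) {
  # outcome is assumed to be coded 1 for the event of interest and 0 otherwise
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  r <- rank(score)  # average ranks handle tied scores

  # C: chance that a randomly chosen 1 outranks a randomly chosen 0
  auc <- (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  2 * (auc - 0.5)
}

set.seed(3)
outcome <- rbinom(2000, size = 1, prob = 0.2)  # simulated 0/1 responses
score   <- rnorm(2000, mean = outcome)         # simulated informative score

g_score    <- gini(score, outcome)
g_reversed <- gini(-score, outcome)   # same score with the ordering flipped
g_perfect  <- gini(outcome, outcome)  # a score that separates perfectly
```

Flipping the direction of the score flips the sign of the Gini, which is worth remembering when a scorecard is built so that high scores mean low risk.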