Using the Gini coefficient to evaluate the performance of credit score models (2024)

The mechanism behind the Gini coefficient, the methods to derive it, a common pitfall, and its major drawback.

Published in Towards Data Science, Jan 4, 2020

When a new credit score model is born, usually the first question that comes up is: “What is its Gini?” To an outsider, it must sound like an odd reference to Disney’s film Aladdin. But Gini, or the Gini coefficient, is one of the most popular metrics used by the financial industry for evaluating the performance of credit score models.

The Gini coefficient is a metric that indicates the model’s discriminatory power, namely, the effectiveness of the model in differentiating between “bad” borrowers, who will default in the future, and “good” borrowers, who won’t default in the future. This metric is often used to compare the quality of different models and evaluate their prediction power.

Despite its popularity, some practitioners are not genuinely familiar with the mechanism behind this Gini coefficient and confuse it with a different metric of the same name. While many practitioners associate the Gini coefficient with the summary of the Lorenz curve, Corrado Gini’s measure of inequality[1], the Gini coefficient they are using is most of the time Somers’ D, which is the summary of the CAP (Cumulative Accuracy Profile) curve.

Somers’ D is named after Robert H. Somers, who proposed it in 1962[2]. It is a measure of the ordinal relationship between two variables. In the context of credit score models, it measures the ordinal relationship between the model’s predictions, in terms of PD (Probability of Default) or score, and the actual outcome (default or no default). If the model is useful, low scores (high PD) should be more associated with defaults than high scores (low PD).

Somers’ D takes a value between -1 and 1, where -1 indicates a perfect negative ordinal relationship and 1 a perfect positive ordinal relationship. In practice, a credit score model with a Somers’ D of 0.4 is deemed to be good. (Henceforth, I will refer to Somers’ D as the Gini coefficient.)
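As a rough illustration of what an ordinal relationship means here, the sketch below counts pairs of one "bad" and one "good" borrower and checks whether the bad borrower received the lower score. The function name `somers_d` and the eight-borrower data set are hypothetical, invented for this example, and the normalization used (all bad/good pairs) is the form that matches the Gini coefficient as it is typically used in credit scoring.

```r
# Pair-counting sketch of Somers' D for a score against a binary default flag.
# Assumption: a LOW score means HIGH risk, as in the article.
somers_d <- function(score, default) {
  bad  <- score[default == 1]
  good <- score[default == 0]
  # Score difference for every (bad, good) pair of borrowers.
  diff <- outer(bad, good, "-")
  concordant <- sum(diff < 0)  # bad borrower scored lower than the good one
  discordant <- sum(diff > 0)  # bad borrower scored higher than the good one
  (concordant - discordant) / (length(bad) * length(good))
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

somers_d(score, default)  # 14 concordant and 1 discordant pair out of 15
```

With this data the result is 13/15, or about 0.87. Tied pairs count toward neither total, so they pull the value toward zero.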

For the sake of minimalism, I won’t describe the math involved in calculating the Gini coefficient. Instead, I will show three different ways to derive it.

To demonstrate each of these methods, I will be using a sample credit score model, which was developed using logistic regression and data on 10,000 borrowers from LendingClub.

model <- glm(default ~ fico + loan_amnt + annual_inc + home_ownership, family = "binomial", data = data_set)

Based on the model’s predictions (the estimated Probability of Default, PD), I scored each of the borrowers from 1 to 1000; 1 represents the highest PD and 1000 the lowest PD.

Extract the Gini coefficient from the CAP curve

The CAP curve, in our context, is designed to capture the ordinal relation between score (PD) and default rate. If our model does a good job of discriminating between good and bad borrowers, we would expect to find more defaults among low-scored borrowers than among high-scored borrowers. The CAP curve captures this notion by accumulating the share of all defaults captured (the cumulative default rate) as borrowers are sampled from the lowest score to the highest score.

To construct the CAP curve, all the model’s population needs to be ordered by the predicted likelihood of default. Namely, the observation with the lowest score is first, and the observation with the highest score is last. Then, we sample the population from first to last, and after each sampling, we calculate the cumulative default rate. The x-axis of the CAP curve represents the portion of the population sampled, and the y-axis represents the corresponding cumulative default rate.

If our model has perfect discriminatory power, we would expect to reach 100% of the cumulative default rate after sampling a portion of observations that is equal to the default rate in our data (the green line in the chart below). E.g., if the default rate in our data is 16%, after sampling 16% of the observations, we would capture all the defaults in our data. In contrast, if we use a random model, i.e., a model which randomly assigns scores in equal distribution, the cumulative default rate will always equal the portion of observations sampled (the red line in the chart below).

[Figure: CAP curve showing the model curve, the perfect model line (green), the random model line (red), and the areas A and B]

The Gini coefficient is defined as the ratio between the area between the model curve and the random model line (A) and the area between the perfect model curve and the random model line (A+B). Put differently, the Gini coefficient is a ratio that represents how close our model is to being a “perfect model” and how far it is from being a “random model.” Thus, a “perfect model” would get a Gini coefficient of 1, and a “random model” would get a Gini coefficient of 0.
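The area ratio above can be sketched in a few lines of base R. This is a minimal illustration rather than the code behind the article's charts: the function name `cap_gini` and the toy data are hypothetical, areas are computed with the trapezoidal rule, and the area A+B uses the fact that the perfect model's curve rises to 100% at x equal to the default rate.

```r
# Sketch: Gini coefficient as A / (A + B) from the CAP curve.
# Assumption: a LOW score means HIGH risk, so sampling starts at the lowest score.
cap_gini <- function(score, default) {
  ord <- order(score)              # ties keep their input order
  d   <- default[ord]
  n   <- length(d)
  x   <- c(0, seq_len(n) / n)      # share of the population sampled
  y   <- c(0, cumsum(d) / sum(d))  # share of all defaults captured so far
  # Trapezoidal area under the model's CAP curve.
  area_model <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  a        <- area_model - 0.5     # A: model curve minus the random diagonal
  a_plus_b <- (1 - mean(d)) / 2    # A + B: perfect curve minus the diagonal
  a / a_plus_b
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

cap_gini(score, default)
```

For this toy data the ratio is 13/15, about 0.87. Because `order()` leaves tied scores in their input order, the result can depend on how the input is arranged when scores repeat, which is the pitfall discussed later in the article.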

My model gained a low Gini coefficient of 0.26:

[Figure: CAP curve of the sample model, Gini coefficient of 0.26]

Construct the Lorenz curve, extract Corrado Gini’s measure, then derive the Gini coefficient

The Lorenz curve is the inverse of the CAP curve; it is constructed using the same mechanism of sampling observations and aggregating the cumulative default rate, but the sampling is done in reverse order (from highest to lowest score). The Lorenz curve also has a diagonal line, which is equivalent to the “CAP random model” line and is called “the Line of Equality” (the red line in the chart below).

Another difference between the two curves is the “perfect model” line. Since the Lorenz curve was designed to capture the distribution of wealth, the most discriminative outcome is a case where all the wealth of the population is concentrated in one observation. Accordingly, the Lorenz curve’s equivalent of the “CAP perfect model” line is made up of two perpendicular lines: the x-axis and a vertical line rising from the end of the x-axis, at the value of 100% (the green line in the chart below). This line represents the case in which the entire cumulative outcome belongs to the last sampled observation. The value of Corrado Gini’s measure is defined as the ratio of the area between the model’s curve and the “Line of Equality” to the area between the “Line of Equality” and the x-axis.

However, when using the Lorenz curve to evaluate the discriminatory power of a credit score model and assigning its y-axis to be the cumulative default rate, a problem emerges. Since the y-axis describes the aggregation of a binary outcome (1 or 0), a case where all the cumulative default rate concentrates in one observation doesn’t exist. Put differently, when evaluating a credit score model using the Lorenz curve, it is impossible to reach the “perfect model” line. Consequently, an appropriate “perfect model” line for this kind of evaluation should be adjusted to the default rate in the population, as in the CAP curve.

[Figure: Lorenz curve showing the model curve, the Line of Equality (red), and the “perfect model” line (green)]

Hence, to adjust Corrado Gini’s measure for credit score model evaluation, we need to deduct the unreachable area from its denominator.

Finally, to derive the Gini coefficient from Corrado Gini’s measure, we can use the following formula:

Gini coefficient = Corrado Gini’s measure / (1 − default rate)

My model gained Corrado Gini’s measure of 0.22:

[Figure: Lorenz curve of the sample model, Corrado Gini’s measure of 0.22]

The default rate in my sample is 16%, so the Gini coefficient of my model can be calculated as follows:

Gini coefficient = 0.22 / (1 − 0.16) ≈ 0.26
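The Lorenz-based route can be sketched the same way. The function name `lorenz_gini` and the toy data are hypothetical. The adjustment step divides Corrado Gini's measure by (1 − default rate), which is how I read "deduct the unreachable area from its denominator" and which agrees with the 0.22 to 0.26 conversion for a 16% default rate.

```r
# Sketch: Corrado Gini's measure from the Lorenz curve, then the adjusted Gini.
# Assumption: a LOW score means HIGH risk, so the Lorenz curve samples
# borrowers from the highest score down to the lowest.
lorenz_gini <- function(score, default) {
  ord <- order(score, decreasing = TRUE)
  d   <- default[ord]
  n   <- length(d)
  x   <- c(0, seq_len(n) / n)      # share of the population sampled
  y   <- c(0, cumsum(d) / sum(d))  # share of all defaults captured so far
  area_curve <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  # Area between the Line of Equality and the curve, over the area under the line.
  corrado <- (0.5 - area_curve) / 0.5
  # Remove the unreachable area from the denominator.
  c(corrado = corrado, gini = corrado / (1 - mean(d)))
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

lorenz_gini(score, default)
```

For this toy data, Corrado Gini's measure is 13/24 (about 0.54) and the adjusted Gini coefficient is 13/15 (about 0.87).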

Construct the ROC curve, extract the AUC, then derive the Gini coefficient

The third method of calculating the Gini coefficient is through another popular curve: the ROC curve. The area under the ROC curve, which is usually called the AUC, is also a popular metric for evaluating and comparing the performance of credit score models. The ROC curve summarizes two ratios from the confusion matrix: the True Positive Ratio (TPR or Recall) and the False Positive Ratio (FPR).

The confusion matrix summarizes, for a given threshold, the number of cases in which:

The model predicted a default and the borrower defaulted (True Positive).
The model predicted a default and the borrower didn’t default (False Positive).
The model predicted no default and the borrower didn’t default (True Negative).
The model predicted no default and the borrower defaulted (False Negative).

For example, let’s use the score of 850 as our threshold, i.e., borrowers with a score below 850 are predicted to default, and borrowers with a score above 850 are predicted not to default.

[Figure: confusion matrix for a threshold score of 850]

The True Positive Ratio (TPR) is defined as the number of defaulted borrowers that our model caught divided by the total number of defaulted borrowers in our data. The False Positive Ratio (FPR) is calculated as the number of cases in which the model incorrectly predicted a default divided by the total number of non-default instances.
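To make the two ratios concrete, here is a small sketch that builds the four confusion-matrix counts for one threshold. The function name `confusion_rates` and the data are hypothetical. Since the article does not say how a score exactly equal to the threshold is treated, the sketch assumes it is predicted not to default.

```r
# Sketch: TPR and FPR for a single threshold.
# Assumption: scores strictly below the threshold are predicted to default.
confusion_rates <- function(score, default, threshold) {
  predicted_default <- score < threshold
  tp <- sum(predicted_default  & default == 1)  # predicted default, defaulted
  fp <- sum(predicted_default  & default == 0)  # predicted default, did not default
  tn <- sum(!predicted_default & default == 0)  # predicted no default, did not default
  fn <- sum(!predicted_default & default == 1)  # predicted no default, defaulted
  c(TPR = tp / (tp + fn), FPR = fp / (fp + tn))
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

confusion_rates(score, default, threshold = 4)
```

With a threshold of 4, two of the three defaulters are caught (TPR = 2/3) and one of the five good borrowers is wrongly flagged (FPR = 0.2).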

TPR = True Positives / (True Positives + False Negatives)
FPR = False Positives / (False Positives + True Negatives)

The ROC curve is constructed by building a confusion matrix for each threshold from 1 to 1000 and deriving its TPR and FPR. The y-axis of the ROC curve represents the TPR values, and the x-axis represents the FPR values. The AUC is the area between the curve and the x-axis.
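The construction just described can be sketched by sweeping the threshold over the distinct scores and integrating the resulting points with the trapezoidal rule. The function name `roc_auc` and the data are hypothetical, and as before a score strictly below the threshold is assumed to mean "predicted to default".

```r
# Sketch: ROC points from a threshold sweep, then the AUC by the trapezoidal rule.
# Assumption: scores strictly below the threshold are predicted to default.
roc_auc <- function(score, default) {
  # One threshold per distinct score, plus one above the maximum so that the
  # curve runs from (0, 0) to (1, 1).
  thresholds <- c(sort(unique(score)), max(score) + 1)
  tpr <- sapply(thresholds, function(t) sum(score < t & default == 1) / sum(default == 1))
  fpr <- sapply(thresholds, function(t) sum(score < t & default == 0) / sum(default == 0))
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

auc <- roc_auc(score, default)
auc
```

For this toy data the AUC is 14/15, about 0.93.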

[Figure: ROC curve of the sample model]

Surprisingly, as shown by Schechtman & Schechtman (2016)[3], there is a linear relationship between the AUC and the Gini coefficient. So, to derive the Gini coefficient from the AUC, all you need to do is use the following formula:

Gini coefficient = 2 × AUC − 1

Practitioners tend to disclose the AUC in addition to the Gini coefficient in their model validation report. However, since these metrics have a linear relation, the disclosure of these metrics together doesn’t add any value to the evaluation of the model’s quality.

The AUC of my model is 0.63, hence the Gini coefficient is calculated like this:

Gini coefficient = 2 × 0.63 − 1 = 0.26

Drawbacks and pitfalls of the Gini coefficient

Despite its popularity, the Gini coefficient has some drawbacks and pitfalls you should consider when using it to evaluate and compare credit score models. For the sake of minimalism, in this section, I’ll describe a common pitfall when trying to derive the Gini coefficient and its main drawback.

To illustrate these concepts, I’ll use a toy example: 15 borrowers, 2 “bad” and 13 “good”, and scores that go from 1 (highest PD) to 10 (lowest PD).

A common pitfall when trying to derive the Gini coefficient is identical scores. In most cases (especially when using large datasets), the credit score model will estimate the same score for different observations. This score duplication raises an issue when trying to derive the Gini coefficient using the CAP curve method. As mentioned above, the first step to derive the CAP curve is to sort the observations by their score, as in the following two examples:

[Table: the 15 borrowers sorted by score, with the tied observations ordered in two different ways (Example 1 and Example 2)]

These two examples use the same borrowers with the same scores. The only difference between these two tables is the secondary level of ordering. In Example 1 when there is a case of identical scores (in score 5), the observations which defaulted come first. Example 2 is the other way around. This minor change can have a major effect on the value of the Gini coefficient, e.g. in this case, Example 1 has a Gini coefficient of 0.67, and Example 2 has a Gini coefficient of 0.38.

To avoid this pitfall, I recommend either applying a secondary sort as in Example 1 or simply deriving the Gini coefficient using the AUC method mentioned above.
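The effect of tied scores can be reproduced with a small sketch. The data below is hypothetical and does not reproduce the article's tables, so the numbers differ from 0.67 and 0.38. The function names `cap_gini_sorted` and `auc_gini` are also invented for this example. The AUC-based version uses average ranks (the Mann-Whitney form of the AUC), so a tied bad/good pair counts as half a win and the result does not depend on row order.

```r
# Gini from a CAP curve, given default flags already in sampling order.
cap_gini_sorted <- function(d) {
  n <- length(d)
  x <- c(0, seq_len(n) / n)
  y <- c(0, cumsum(d) / sum(d))
  area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  (area - 0.5) / ((1 - mean(d)) / 2)
}

# Gini from the AUC, computed from average ranks (low score = high risk).
auc_gini <- function(score, default) {
  n_bad  <- sum(default == 1)
  n_good <- sum(default == 0)
  ranks  <- rank(score)  # tied scores share their average rank
  u_good <- sum(ranks[default == 0]) - n_good * (n_good + 1) / 2
  2 * u_good / (n_bad * n_good) - 1
}

# Hypothetical data: 8 borrowers, 2 defaults, three borrowers tied at score 2.
score   <- c(1, 2, 2, 2, 3, 4, 5, 6)
default <- c(1, 1, 0, 0, 0, 0, 0, 0)

defaults_first <- default[order(score, -default)]  # tie-break as in Example 1
defaults_last  <- default[order(score, default)]   # tie-break as in Example 2

cap_gini_sorted(defaults_first)
cap_gini_sorted(defaults_last)
auc_gini(score, default)
```

On this data the CAP method returns 1 with defaults first and about 0.67 with defaults last, while the AUC method returns about 0.83 regardless of how the rows are arranged.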

The major disadvantage of the Gini coefficient comes from the fact that it is an ordinal metric, i.e., it captures the order of values while ignoring the distance between them. This characteristic of the Gini coefficient can sometimes mask poor model performance.
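A quick way to see the ordinal nature of the metric is to compute it for two sets of scores that rank the borrowers identically but space them very differently. The data and the function name `auc_gini` are hypothetical, and the Gini coefficient is obtained here from rank sums (the Mann-Whitney form of the AUC).

```r
# AUC-based Gini from average ranks; a low score is assumed to mean high risk.
auc_gini <- function(score, default) {
  n_bad  <- sum(default == 1)
  n_good <- sum(default == 0)
  ranks  <- rank(score)  # only the ordering of the scores enters the result
  u_good <- sum(ranks[default == 0]) - n_good * (n_good + 1) / 2
  2 * u_good / (n_bad * n_good) - 1
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
default       <- c(1, 1, 0, 1, 0, 0, 0, 0)
score_even    <- c(1, 2, 3, 4, 5, 6, 7, 8)          # evenly spaced scores
score_squeeze <- c(1, 2, 3, 4, 996, 997, 998, 999)  # same order, different gaps

auc_gini(score_even, default)
auc_gini(score_squeeze, default)
```

Both calls return the same value (13/15, about 0.87), even though the second set of scores places half of the borrowers almost a thousand points away from the rest.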

[Table: the 15 borrowers with the scores assigned by Model A and by Model B]

When ordering the borrowers based on the score predicted by both of the models, we get the same “Default” column. This indicates that these two models have the same Gini coefficient (0.85).

When comparing the distributions of scores predicted by these two models we get the following charts:

[Figure: distributions of the scores predicted by Model A and Model B]

Both models have the same Gini coefficient. But Model B was only able to separate the borrowers into three types: 1, 2 and 3, while Model A was able to capture all 10 levels of risk. This suggests that Model A is more sensitive than Model B to the different characteristics of the borrowers, and can differentiate better between different risk levels.

To understand the importance of this property, assume that the owner of Model A and the owner of Model B each decide to set a score that will be the threshold for loan approval, i.e., loan requests from borrowers with a score below or equal to the threshold will be denied, and loan requests from borrowers with a score above the threshold will be approved. Both model owners examine their model outcomes on the test sample and decide to use, as their threshold, the score at which 100% of the defaults are captured. Consequently, the owner of Model A sets the threshold at 6 and the owner of Model B sets her threshold at 2. By choosing these scores, the model owners get very different results:

[Table: approved and rejected borrowers under the thresholds chosen for Model A and Model B]

The owner of Model A rejected 8 “good” borrowers and approved 5, while the owner of Model B rejected 12 “good” borrowers and approved only 1. The threshold set by the owner of Model A yields an FPR of 62%, and the threshold set by the owner of Model B yields an FPR of 92%.
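The two FPR figures follow directly from the rejection counts, since every rejected “good” borrower is a false positive and the toy sample contains 13 “good” borrowers. The variable names below are only for illustration.

```r
# FPR = rejected "good" borrowers / all "good" borrowers (13 in the toy sample).
good_total <- 13
rejected_good <- c(model_a = 8, model_b = 12)

fpr <- rejected_good / good_total
round(100 * fpr)  # percentages: 62 for Model A, 92 for Model B
```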

Hence, the Gini coefficient’s inability to capture the model’s effectiveness in differentiating between different levels of risk is a major drawback.

To overcome this drawback, I recommend eyeballing the distribution of the model predictions, as in Gambacorta, Huang, Qiu, & Wang (2019)[4], and using the Precision-Recall curve to evaluate the model’s trade-off between capturing “bad” borrowers and mistakenly predicting that “good” borrowers will default.
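As a rough sketch of the second recommendation, the points of a Precision-Recall curve can be computed by sweeping the threshold over the distinct scores. The function name `pr_points` and the data are hypothetical. The sketch follows the denial rule used above, in which a score below or equal to the threshold is treated as a predicted default.

```r
# Sketch: precision and recall at each candidate threshold.
# Assumption: scores at or below the threshold are predicted to default.
pr_points <- function(score, default) {
  thresholds <- sort(unique(score))
  t(sapply(thresholds, function(t) {
    flagged <- score <= t
    tp <- sum(flagged & default == 1)
    c(threshold = t,
      precision = tp / sum(flagged),        # share of flagged borrowers who defaulted
      recall    = tp / sum(default == 1))   # share of defaulters who were flagged
  }))
}

# Hypothetical data: 8 borrowers, 3 of whom defaulted.
score   <- c(1, 2, 3, 4, 5, 6, 7, 8)
default <- c(1, 1, 0, 1, 0, 0, 0, 0)

pr <- pr_points(score, default)
pr
```

In this toy data, a threshold of 4 is the lowest one that captures every defaulter (recall of 1), and at that point three of the four flagged borrowers are actual defaulters (precision of 0.75).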

Summary

The Gini coefficient which is used in the financial industry to evaluate the quality of a credit score model is actually Somers’ D and not Corrado Gini’s measure of inequality.
There are three common methods to derive the Gini coefficient:

Extract the Gini coefficient from the CAP curve.
Construct the Lorenz curve, extract Corrado Gini’s measure, then derive the Gini coefficient.
Construct the ROC curve, extract the AUC, then derive the Gini coefficient.

A common pitfall when deriving the Gini coefficient is identical scores.
The major drawback of the Gini coefficient is that it doesn’t capture the model’s sensitivity to different risk levels.

References

[1] Gini, C. (1914). Reprinted: On the measurement of concentration and variability of characters (2005). Metron, LXIII(1), 3–38.

[2] Somers, R. (1962). A New Asymmetric Measure of Association for Ordinal Variables. American Sociological Review, 27(6), 799–811. Retrieved from www.jstor.org/stable/2090408

[3] Schechtman, E., & Schechtman, G. (2016). The Relationship between Gini Methodology and the ROC curve (SSRN Scholarly Paper No. ID 2739245). Rochester, NY: Social Science Research Network.

[4] Gambacorta, L., Huang, Y., Qiu, H., & Wang, J. (2019). How do machine learning and non-traditional data affect credit scoring? New evidence from a Chinese fintech firm. Retrieved from https://www.bis.org/publ/work834.htm