Applied Data Scientist Exercise for ThetaRay
1. Introduction
The dataset appears to be drawn from the Lending Club Loan Data, with a subset of the original features and a time span within 2014. In this report we refer to some articles from the Lending Club website like this to build an understanding of the business model that generates the data; in practice, this is more commonly done by engaging the customers directly. The technical analysis also draws on other discussions in the public domain, e.g., from Kaggle.
This exercise consists of two main parts:
Exploratory Data Analysis focuses on the big picture of the dataset, including understanding the main factors, sanity checks, descriptive statistics and some potential business-related questions that can be asked about the data.
Feature Engineering & Modelling dives into the details of one of the business problems discovered in EDA, i.e., how to predict the loan grade (and thus the interest rate) and which features are important.
The report is generated using Jupyter Lab running Python scripts. An HTML version with the code hidden is also exported as a business report.
2. EDA
2.1 Factors in Data
Lending Club (LC) is an online platform that enables peer-to-peer personal loans between borrowers and investors. Intuitively, the three main entities involved are borrowers, loans and investors, with the first two directly observable in the provided dataset. Given the dataset, the interactions between these entities can be modelled as two processes:
Loan application: among other things, how the risk of a loan is captured in the loan rate, based on information about the borrower and the loan.
Loan repayment: the time series of loan status changes over the course of repayment.
These factors will be used as the foundation for certain assumptions about the data, which can be used for further analysis such as predictions and anomaly detections.
Based on the LC website, the key concepts related to the data can be summarized as follows.
A borrower submits a loan application online with details, in particular the amount and term.
LC decides whether the borrower is qualified, based on information about the borrower, the loan, etc.
If the application meets the requirements, LC calculates the loan rate based on a grade (a ranking system). See how grades relate to rates.
A grade is calculated from
borrower information, such as FICO score, credit attributes, etc.
loan term
loan amount
etc.
After the loan rate is decided, the loan is opened to investors for funding. If the funding is successful, the borrower receives the money and starts to pay installments (usually on a monthly basis).
In this EDA, we will walk through these concepts, identify assumptions behind their relations, and validate them. Looking at these assumptions helps us decide the directions to explore in the modelling part.
2.2 Data Summarization
First let's take a look at the big picture of the dataset.
The summary shows important characteristics of the data, e.g.,
% of missing values shows the general usefulness of a feature. For example, the 90.6% of missing values in the desc column makes it hard to use in any modelling practice. The same applies to the columns mths_since_last_record and mths_since_last_major_derog. It is also always useful to dig into how the missing values are generated, e.g., whether they come from a bug in an upstream pipeline or are produced in some systematic way.
Along with the feature types, the # of unique values in a feature determines how it will be modelled, e.g., as numeric, ordinal or categorical. Sometimes the reported feature types are misleading, so they need to be combined with other information to be useful. For example, the column issue_d (the month in which the loan was funded) should clearly be converted to DateTime, and even though the dtype of int_rate (interest rate on the loan) is object (string values), a closer look reveals that it should be modelled as numerical. The same int_rate column is instructive in another way: it looks numerical at first glance, but its 69 unique values indicate that it also has an ordinal/categorical nature. In fact we know that the interest rate is highly correlated with grade/sub_grade, which are both ordinal.
The # of unique values also shows the granularity of a column, which is usually useful in modelling practice. E.g., id and member_id are not directly useful when looking for similarities in the data, except when looking for anomalies. On the other side, policy_code and pymnt_plan are not useful either, because they take the same value for all examples.
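To make this summary concrete, here is a minimal pandas sketch of how such a table can be generated; the file name lc_loans_2014.csv is a placeholder for the actual dataset.

```python
import pandas as pd

# Hypothetical file name; adjust to the actual dataset location.
df = pd.read_csv("lc_loans_2014.csv", low_memory=False)

# One row per feature: dtype, % of missing values, and # of unique values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_%": (df.isna().mean() * 100).round(1),
    "n_unique": df.nunique(),
}).sort_values("missing_%", ascending=False)

print(summary)
```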
Since the number of features in the dataset is not too big, we can squeeze them all into one summary table. If the number grows, however, some automation should be introduced to select the features to explore, e.g., based on missing-value rate and granularity.
2.3 Sanity Check on Assumptions
Some quick checks of the following assumptions are conducted to confirm our understanding of the business, as introduced in 2.1.
*** Assumption 1: Interest rates are highly related to loan grades/sub-grades. ***
We focus on the correlation between the ranking given by LC (sub_grade) and the loan rate (int_rate). As discovered in the previous step, we first need to convert int_rate to numerical and sub_grade to categorical. Since most transformations in EDA are temporary, we create views of the dataset and leave the original unchanged; a sketch of these views follows.
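A minimal sketch of the views, assuming int_rate arrives as strings like "13.5%" and issue_d as "Dec-2014"-style strings (the exact raw formats are assumptions):

```python
# Build an EDA view; df itself stays unchanged.
view = df.assign(
    # "13.5%" (object/string) -> 13.5 (float)
    int_rate=df["int_rate"].str.strip().str.rstrip("%").astype(float),
    # ordered categorical so that A1 < A2 < ... < G5 in plots and groupbys
    sub_grade=pd.Categorical(
        df["sub_grade"],
        categories=sorted(df["sub_grade"].dropna().unique()),
        ordered=True,
    ),
    # month the loan was funded; the "%b-%Y" format is an assumption
    issue_d=pd.to_datetime(df["issue_d"], format="%b-%Y"),
)
```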
Summary
The general trend between loan grade and loan rate aligns with what is described on LC's website; the relation appears to be roughly linear.
However, there are clear outliers in subgrades B2, C5, D2 and E1. A closer check reveals that all of these loans have a fixed rate as low as 6%, although there is no obvious relation among the borrowers or loans.
There is much less data beyond E5, which also aligns with the information from LC.
Within each subgrade, the distribution of rates seems to be right-skewed, with a heavy tail towards the higher end. This might be explained by unseen factors, e.g., changes of the rates over time or changes in the calculation method used by LC. As discussed, the interest rate can be modelled as ordinal because of its limited number of unique values, so we don't expect a Gaussian distribution here. A closer look shows that this variation of interest rates within a subgrade does correlate with time.
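The fixed-rate outliers mentioned above can also be flagged programmatically. A rough sketch, where the 1.5×IQR rule is my assumption rather than the rule actually used in the analysis:

```python
# Flag loans whose rate falls far below the bulk of their subgrade.
grp = view.groupby("sub_grade", observed=True)["int_rate"]
q1 = grp.transform(lambda s: s.quantile(0.25))
q3 = grp.transform(lambda s: s.quantile(0.75))
low_outliers = view[view["int_rate"] < q1 - 1.5 * (q3 - q1)]
print(low_outliers[["sub_grade", "int_rate", "issue_d"]].head())
```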
*** Assumption 2: Grades/subgrades are correlated with borrower and loan information ***
First we will look at the distribution of subgrades.
One important piece of borrower information, i.e., the FICO score, is missing from our data. So we will use "emp_length", "home_ownership", "is_inc_v", "dti" and "inq_last_6mths" to represent borrowers.
The most important features about loans are their terms and amounts. We will also use the loan purpose as well as initial_list_status (the initial listing status of the loan).
Note that the selection of features is not meant to be exhaustive; they are only used to gain understanding and insights during EDA. The features are selected based on their missing-value % and granularity.
Summary
The distribution of annual income in the general population is known to be lognormal. This is also observed among the borrowers.
Grades are related to the number of times a rare event (e.g., a delinquency) occurs in a lifetime, so the distribution of grades closely resembles a Poisson distribution.
To see how borrower and loan information influences the grades (a sketch of the heatmap follows this list),
we first cluster the borrowers into a "high_credit" and a "low_credit" group based on information such as "emp_length", "dti", "home_ownership", etc.
we then cluster loans based on their terms and loan amounts.
From the heatmap we can clearly see that the distribution of grades is influenced first by the loan term and then by the amount; specifically,
36-month loans tend to generate 'A'-'B' grades, whereas 60-month loans move them towards 'C'-'D'.
for 36-month loans, the higher the loan amount, the higher the grade; this is less significant for 60-month loans in our setting.
the influence of borrower credit is less significant, mainly because the FICO score is not available and our calculated borrower scores may not be accurate.
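A sketch of how such a heatmap can be produced, with quartile buckets of the loan amount standing in for the loan clustering step (the four-bucket choice is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Quartile buckets of the loan amount stand in for the loan clusters.
amt_bucket = pd.qcut(view["loan_amnt"], q=4,
                     labels=["low", "mid-low", "mid-high", "high"])

# Share of each grade within every (term, amount-bucket) cell.
ct = pd.crosstab([view["term"], amt_bucket], view["grade"], normalize="index")

sns.heatmap(ct, annot=True, fmt=".2f", cmap="Blues")
plt.title("Grade distribution by loan term and amount bucket")
plt.tight_layout()
plt.show()
```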
We will build a machine learning model in Section 3 to verify the correlations between the grades and the information of borrowers and loans considered all together.
*** Some other quick checks ***
These are some other quick checks that focus on numerical values. They mainly serve as sanity checks for possible errors in the data, e.g.,
total_pymnt (total paid to date) == total_rec_int (interest paid) + total_rec_prncp (principal paid) + total_rec_late_fee (late fees) + recoveries
total_rec_prncp + out_prncp == funded_amnt
Both identities hold in the dataset. The full code is excluded for simplicity; a minimal sketch is shown below.
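A minimal sketch of the check, where the cent-level tolerance is an assumption:

```python
import numpy as np

# Identity 1: total payment = interest + principal + late fees + recoveries.
lhs = df["total_pymnt"]
rhs = (df["total_rec_int"] + df["total_rec_prncp"]
       + df["total_rec_late_fee"] + df["recoveries"])
print("identity 1 holds:", np.isclose(lhs, rhs, atol=0.01).all())

# Identity 2: principal repaid + outstanding principal = funded amount.
print("identity 2 holds:",
      np.isclose(df["total_rec_prncp"] + df["out_prncp"],
                 df["funded_amnt"], atol=0.01).all())
```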
3. Feature Engineering & Modelling
3.1 Problem Definition
There are many interesting problems that could be solved with machine learning models. Here I focus on finding the correlation between the interest rate and other information about borrowers and loans. I am interested in this because,
it shows how a predictive machine learning model can be used for other tasks such as feature selection.
it continues the story we left off in the previous section, where we tried to validate our assumptions.
it has business value: revealing the correlations may help investors better understand how the LC grading system works and make better decisions.
I leave other interesting problems to the 3.5 Next Step section for a brief discussion.
3.2 Feature Engineering
Unlike the ad-hoc creation of views in EDA, machine learning practice usually benefits from implementing the transformations and feature engineering as pipelines: first for better reusability, and also to support the parameter-tuning and validation processes.
For the purpose of this exercise, I choose the same set of features discussed in EDA, after having checked their missing-value rates and granularity. Through EDA, we have learned that
"int_rate" is highly predictable by "grade/sub_grade" and "issue_d", with some outliers.
"grade/sub_grade" is correlated to the information of
borrowers: "emp_length", "home_ownership", "is_inc_v", "annual_inc","dti", "inq_last_6mths"
loans: "term", "loan_amnt", "purpose", "initial_list_status"
So we will build a predictive model with:
inputs: "emp_length", "home_ownership", "is_inc_v", "annual_inc","dti", "inq_last_6mths", "term", "loan_amnt", "purpose", "initial_list_status" (we need to make sure "int_rate" is excluded to avoid target leakage)
output: "grade" (this is easier than predicting subgrade)
There might be other interesting features, but this set serves as a baseline on which improvements can be built progressively in the future. We will use a RandomForest model for the prediction task as well as for measuring the importance of features.
In detail, the following transformations will be conducted, with the rationale for each.
"grade": we want to capture the general pattern, so we will filter out grades with small sample sizes. We will also exclude outliers discovered in EDA. Even though it has an ordinal nature, we will model it as categorical because a classification problem is generally easier to solve for most machine learning models. The class distribution is not balanced, but it is not too bad either.
"emp_length": transformed to numercial. It can also be modelled as ordinal for tree models, but it helps reduce the input dimensionality for other models.
"home_ownership": categorical
"is_inc_v": categorical
"annual_inc": transformed to log-scale. It's generally easier for most models to deal with balanced data, either inputs/outputs. Using log-scale helps to remove the heavy tail.
"dti": numerical (it is bell shaped)
"inq_last_6mths": numercial
"term": categorical
"loan_amnt", numerical, transfered to log scale
"purpose": categorical
"initial_list_status": categorical
Other setups (a sketch of the resulting pipeline follows this list):
Normalization by standard scaling is performed, even though the RandomForest model is generally not as sensitive to value ranges as other models.
I dropped rows with missing values for simplicity.
We need to split the data into train and test sets; I use test_size = 0.3.
I don't introduce hyper-parameters in the feature engineering step, although this might apply to future choices, e.g., when quantizing numerical values such as the loan amount.
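Under the choices above, the feature engineering can be sketched as a scikit-learn pipeline. The emp_length parsing regex is a rough assumption, sparse_output requires scikit-learn >= 1.2, and the filtering of small-sample grades and outliers is omitted for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

categorical = ["home_ownership", "is_inc_v", "term", "purpose", "initial_list_status"]
log_numeric = ["annual_inc", "loan_amnt"]            # heavy-tailed -> log scale
numeric = ["emp_length", "dti", "inq_last_6mths"]

# Drop rows with missing values (the simplification chosen above); int_rate is
# deliberately left out of the inputs to avoid target leakage.
data = df.dropna(subset=categorical + log_numeric + numeric + ["grade"]).copy()
# Rough parse of "10+ years" / "< 1 year" into a number; a simplifying assumption.
data["emp_length"] = data["emp_length"].str.extract(r"(\d+)", expand=False).astype(float)
data = data.dropna(subset=["emp_length"])

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
    ("log", Pipeline([("log1p", FunctionTransformer(np.log1p,
                                                    feature_names_out="one-to-one")),
                      ("scale", StandardScaler())]), log_numeric),
    ("num", StandardScaler(), numeric),
])

X = data[categorical + log_numeric + numeric]
y = data["grade"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```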
3.3 Modelling
Before we fit models to capture more complicated relations between inputs and outputs, let's first take a look at the linear correlation between the target output "grade" and the different generated features. Just as we observed in EDA, the important features are,
the loan term; specifically, changing from 36 to 60 months tends to increase the grade.
whether the income is verified
the loan purpose; specifically, credit_card seems to bring the grade down whereas debt_consolidation increases it.
inq_last_6mths
dti
annual_inc
Note that a non-zero linear correlation is a sufficient but not a necessary condition for two variables to be dependent. Using a machine learning model such as a Random Forest helps evaluate the feature importances in a more sophisticated way.
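A sketch of this correlation screen, encoding the grade ordinally (A=0 ... G=6, an assumption) and reusing the preprocessing pipeline from the previous section:

```python
import pandas as pd

# Ordinal encoding of the target for a Pearson-correlation screen.
grade_num = y_train.map({g: i for i, g in enumerate("ABCDEFG")})

X_enc = pd.DataFrame(
    preprocess.fit_transform(X_train),
    columns=preprocess.get_feature_names_out(),
    index=X_train.index,
)
corr = X_enc.corrwith(grade_num).sort_values(key=abs, ascending=False)
print(corr.head(10))
```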
To achieve good generalization performance, we use random search to tune the hyper-parameters of the Random Forest model, i.e., the # of estimators in the ensemble, the complexity of the individual trees, and whether to use bootstrap sampling. In the search, we use 3-fold cross-validation to avoid overfitting. The final results show that the accuracy on a separate test set is 41.5%, which is very close to the cross-validation estimate of 41.2%; this indicates that our model is not overfitted. On the other hand, since our hyper-parameter space is big enough to contain very complicated models, the likelihood of an underfitted model is also very low.
Footnote: in the above search, the parameters "n_estimators" and "max_leaf_nodes" may deserve further exploration, because their optimal values lie on the boundary of the search space (=500).
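A sketch of the random search described above; the concrete distributions and n_iter are assumptions, with the upper bounds of 500 matching the footnote:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

rf = Pipeline([("prep", preprocess),
               ("clf", RandomForestClassifier(random_state=0))])

param_dist = {
    "clf__n_estimators": randint(50, 501),     # size of the ensemble
    "clf__max_leaf_nodes": randint(10, 501),   # complexity of each tree
    "clf__bootstrap": [True, False],
}
search = RandomizedSearchCV(rf, param_dist, n_iter=20, cv=3,
                            scoring="accuracy", random_state=0, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```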
Now let's take a look at the final results in terms of the confusion matrix and the feature importances.
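A sketch of how both can be computed from the fitted search object; the use of impurity-based importances is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay

best = search.best_estimator_
ConfusionMatrixDisplay.from_estimator(best, X_test, y_test)
plt.show()

# Impurity-based importances, mapped back to the generated feature names.
importances = pd.Series(
    best.named_steps["clf"].feature_importances_,
    index=best.named_steps["prep"].get_feature_names_out(),
).sort_values(ascending=False)
print(importances.head(10))
```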
3.4 Conclusion
From the confusion matrix, we can see that the model does a relatively better job of predicting grade_B and grade_C. Its performance on grade_A and grade_D is also better than random guessing, but it does not do well at recognizing grade_E. This coincides with the observation made in EDA that grade_B and grade_C are the majority classes, and the features essentially only help distinguish these two.
In terms of feature importances, the result given by the trained random forest is similar to that given by the Pearson correlation. The most important features remain the same: the loan term, the credit_card purpose, the loan amount, income, inq_last_6mths, and whether the income source is verified. Intuitively this says that the relations between these features and the target are fairly simple and could even be captured by a linear model. This makes sense, because LC likely prefers such an explainable model over a complicated black box.
A classification accuracy of 41% may not be very impressive as a prediction result; it suggests that some important features, such as the FICO scores, may be missing from our dataset. On the other hand, the model is useful in our analysis because it clearly identifies the important factors related to a loan grade and thus its interest rate. This helps verify our understanding of the data, and the model also serves as a baseline for further exploration.
3.5 Next Step
the above EDA is not exhaustive, so other checks should be conducted in the future, e.g.,
geo-distribution of loan applications
time series of loan applications
distribution of loan status
demography of borrowers
for predictive analysis,
it is useful to predict the status of loans, especially the likelihood of default. There has been a lot of work on this in the literature, such as this article.
it is also interesting to predict when a loan will default.
for anomaly detection,
although there are no duplicate member_id values in the dataset, it would be interesting to find "similar" borrowers, e.g., borrowers with similar titles, employment info, income, addresses, credit lines, etc. A possible next step is to identify whether they are actually the same person.
there is high interest in detecting "fraud" from both the investors' and LC's perspectives. In one case, it is interesting to find loans that default within a short term, e.g., < 3 months. This can be done by filtering on (last_pymnt_d - issue_d) < thr and status.isin(['Charged Off', 'Default']).
on the other side, it is also interesting to find people/loans where a "fully paid" happened within a short period. This may be useful in applications such as money-laundering detection. Both filters are sketched below.
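A sketch of both filters, assuming the status column is named loan_status and that the date columns parse as "Dec-2014"-style strings; the 3-month threshold comes from the discussion above:

```python
import pandas as pd

issue = pd.to_datetime(df["issue_d"], format="%b-%Y")
last_pay = pd.to_datetime(df["last_pymnt_d"], format="%b-%Y")
months_open = (last_pay - issue).dt.days / 30  # rough month count

# Loans that went bad shortly after issuance.
early_default = df[(months_open < 3)
                   & df["loan_status"].isin(["Charged Off", "Default"])]

# Mirror check: loans fully paid unusually quickly.
early_payoff = df[(months_open < 3) & (df["loan_status"] == "Fully Paid")]
print(len(early_default), len(early_payoff))
```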