Getting started with Exploratory Data Analysis

Manish Poddar
17 min read · Apr 20, 2021


In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing and handling missing values, and making transformations of variables as needed. EDA encompasses IDA. (from Wikipedia)

In this blog post, we take the Lending Club case study data from Kaggle and perform EDA on it.

In this notebook, we assume that we work for a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risks are associated with the company’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

The data given below contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns that indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

In this case study, we will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
%matplotlib inline
sns.set_style("whitegrid")

1 Data Understanding

1.1 Read the data into a pandas data frame and describe it

# Loading Pandas data frame
df = pd.read_csv("loan.csv")
df.head() # looking at the first few rows
DtypeWarning: Columns (47) have mixed types. Specify dtype option on import or set low_memory=False.

[Output: first five rows of the data frame]
df.describe(percentiles=[0.05,0.15,0.25,0.50,0.75,0.90,0.95,0.99]) # describing the data frame and computing summary stats
[Output: summary statistics with the requested percentiles]

1.2 Categorizing different attributes of the data frame

df.columns # extracting columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
...
'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens',
'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit',
'total_il_high_credit_limit'],
dtype='object', length=111)

From the above data frame, the variables can be divided into three categories:

1) Applicant attributes
e.g. emp_title, emp_length, open_acc, addr_state

2) Loan characteristics
e.g. desc, dti, funded_amnt, grade

3) Customer behaviour variables
e.g. delinq_2yrs, last_pymnt_amnt, revol_bal, acc_now_delinq, bc_open_to_buy, chargeoff_within_12_mths, collection_recovery_fee, collections_12_mths_ex_med, delinq_amnt, installment, last_pymnt_d, max_bal_bc, mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl, mths_since_last_delinq, mths_since_last_major_derog, next_pymnt_d, pct_tl_nvr_dlq, recoveries, revol_util, tot_coll_amt, total_rec_late_fee, total_rec_int, total_rec_prncp, total_rev_hi_lim, out_prncp, out_prncp_inv, percent_bc_gt_75

Since customer behaviour variables are generated only after a loan is approved, they need not be considered for this case study.

1.3 check the datatype of various columns

df.dtypes # finding the datatypes for all columns
id int64
member_id int64
loan_amnt int64
funded_amnt int64
funded_amnt_inv float64
term object
int_rate object
installment float64
grade object
sub_grade object
emp_title object
emp_length object
home_ownership object
annual_inc float64
verification_status object
issue_d object
loan_status object
pymnt_plan object
url object
desc object
purpose object
title object
zip_code object
addr_state object
dti float64
delinq_2yrs int64
earliest_cr_line object
inq_last_6mths int64
mths_since_last_delinq float64
mths_since_last_record float64
open_acc int64
pub_rec int64
revol_bal int64
revol_util object
total_acc int64
initial_list_status object
out_prncp float64
out_prncp_inv float64
total_pymnt float64
total_pymnt_inv float64
total_rec_prncp float64
total_rec_int float64
total_rec_late_fee float64
recoveries float64
collection_recovery_fee float64
last_pymnt_d object
last_pymnt_amnt float64
next_pymnt_d object
last_credit_pull_d object
collections_12_mths_ex_med float64
mths_since_last_major_derog float64
policy_code int64
application_type object
annual_inc_joint float64
dti_joint float64
verification_status_joint float64
acc_now_delinq int64
tot_coll_amt float64
tot_cur_bal float64
open_acc_6m float64
open_il_6m float64
open_il_12m float64
open_il_24m float64
mths_since_rcnt_il float64
total_bal_il float64
il_util float64
open_rv_12m float64
open_rv_24m float64
max_bal_bc float64
all_util float64
total_rev_hi_lim float64
inq_fi float64
total_cu_tl float64
inq_last_12m float64
acc_open_past_24mths float64
avg_cur_bal float64
bc_open_to_buy float64
bc_util float64
chargeoff_within_12_mths float64
delinq_amnt int64
mo_sin_old_il_acct float64
mo_sin_old_rev_tl_op float64
mo_sin_rcnt_rev_tl_op float64
mo_sin_rcnt_tl float64
mort_acc float64
mths_since_recent_bc float64
mths_since_recent_bc_dlq float64
mths_since_recent_inq float64
mths_since_recent_revol_delinq float64
num_accts_ever_120_pd float64
num_actv_bc_tl float64
num_actv_rev_tl float64
num_bc_sats float64
num_bc_tl float64
num_il_tl float64
num_op_rev_tl float64
num_rev_accts float64
num_rev_tl_bal_gt_0 float64
num_sats float64
num_tl_120dpd_2m float64
num_tl_30dpd float64
num_tl_90g_dpd_24m float64
num_tl_op_past_12m float64
pct_tl_nvr_dlq float64
percent_bc_gt_75 float64
pub_rec_bankruptcies float64
tax_liens float64
tot_hi_cred_lim float64
total_bal_ex_mort float64
total_bc_limit float64
total_il_high_credit_limit float64
dtype: object

Conclusion: Interest rate (int_rate) and employment length (emp_length) are stored as objects but should be numeric, so we fix these data type issues.

1.4 Correct the data types of these columns

#employment length(emp_length)
print("Unique Values before imputing: ",df.emp_length.unique())
df.emp_length.fillna(0,inplace=True)
df.emp_length.replace(['10+ years'],10,inplace=True) # map the values
df.emp_length.replace(['< 1 year'],0,inplace=True)
df.emp_length.replace(['1 year'],1,inplace=True)
df.emp_length.replace(['2 years'],2,inplace=True)
df.emp_length.replace(['3 years'],3,inplace=True)
df.emp_length.replace(['4 years'],4,inplace=True)
df.emp_length.replace(['5 years'],5,inplace=True)
df.emp_length.replace(['6 years'],6,inplace=True)
df.emp_length.replace(['7 years'],7,inplace=True)
df.emp_length.replace(['8 years'],8,inplace=True)
df.emp_length.replace(['9 years'],9,inplace=True)
print("Unique Values after imputing: ",df.emp_length.unique())
Unique Values before imputing: ['10+ years' '< 1 year' '1 year' '3 years' '8 years' '9 years' '4 years'
'5 years' '6 years' '2 years' '7 years' nan]
Unique Values after imputing: [10 0 1 3 8 9 4 5 6 2 7]
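
Note: the chain of replace calls above can be written more concisely with a single mapping dictionary. A minimal sketch, assuming the raw string column and the value set shown in the output above:

# A more concise equivalent of the replace chain, applied to the raw string column
emp_map = {'10+ years': 10, '9 years': 9, '8 years': 8, '7 years': 7,
           '6 years': 6, '5 years': 5, '4 years': 4, '3 years': 3,
           '2 years': 2, '1 year': 1, '< 1 year': 0}
df['emp_length'] = df['emp_length'].map(emp_map).fillna(0).astype(int) # NaN also maps to 0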
# fixing interest rate (int_rate)
df.loc[:,'int_rate'] = df.loc[:,'int_rate'].apply(lambda x : float(x.strip('%'))) # converting percentage to numeric
print("After fixing data types : \n",df.int_rate.head())
After fixing data types :
0 10.65
1 15.27
2 15.96
3 13.49
4 12.69
Name: int_rate, dtype: float64
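
As an aside, a vectorized one-liner would do the same job. A sketch, assuming int_rate is still the raw string column (e.g. '10.65%'):

# Vectorized alternative to the apply/lambda above
df['int_rate'] = df['int_rate'].str.rstrip('%').astype(float)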

1.5 The target variable, which we want to compare across the other variables, is loan status. The strategy is to compare the average default rates across various other variables and identify the ones that affect default rate the most.

2. DATA CLEANING

2.1 Fix Columns

Removing columns from the data frame where more than 90 percent of the values are null

print("Dataframe Shape before removing NULL : ",df.shape)
count_of_null = df.isnull().sum(axis=0) #Finding count of null in each column
list_of_count_null_above_90_per = list(count_of_null.iloc[count_of_null.values >(0.9*len(df))].index) #Finding list of columns with null above 90 percent of count
df.drop(columns=list_of_count_null_above_90_per,axis=1,inplace=True) # dropping columns with null values
print("No of column dropped : ",len(list_of_count_null_above_90_per))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 111)
No of column dropped : 56
Dataframe Shape after removing NULL : (39717, 55)
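
A roughly equivalent result can be obtained with dropna and a threshold. A sketch: keep only columns that have at least 10 percent non-null values:

# Keep columns with at least 10% non-null values (roughly equivalent to the above)
df = df.dropna(axis=1, thresh=int(0.1 * len(df)))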

2.2 Fix rows

Removing rows from the data frame where more than 90 percent of the values are null

print("Dataframe Shape before removing NULL : ",df.shape)
count_of_null = df.isnull().sum(axis=1)
list_of_count_null_above_90_per = list(count_of_null.iloc[count_of_null.values >(0.9*len(df.columns))].index)
df.drop(df.index[list_of_count_null_above_90_per],axis=0,inplace=True)# dropping rows with null values
print("No of rows dropped : ",len(list_of_count_null_above_90_per))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 55)
No of rows dropped : 0
Dataframe Shape after removing NULL : (39717, 55)

There are no rows with more than 90 percent null values.

2.3 Fix columns with one unique value

Removing columns that contain only one unique value; dropping them will not affect our results.

print("Dataframe Shape before removing NULL : ",df.shape)
unique_df = df.nunique()
one_unquiue_value_list = list(unique_df.iloc[unique_df.values==1].index)
df.drop(columns=one_unquiue_value_list,axis=1,inplace=True) # dropping columns with one unique values
print("No of columns dropped : ",len(one_unquiue_value_list))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 55)
No of columns dropped : 9
Dataframe Shape after removing NULL : (39717, 46)
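
For reference, the same cleanup fits in one line. A sketch: keep only columns with more than one unique value:

# One-line equivalent: retain columns with more than one unique value
df = df.loc[:, df.nunique() > 1]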

2.4 Removing irrelevant columns

Now let's look at columns from a business point of view and remove columns that are not needed.

irrelevant_columns = ['id','member_id','url','zip_code','delinq_2yrs','revol_bal','collection_recovery_fee','installment','last_pymnt_d','mths_since_last_delinq','recoveries','revol_util','total_rec_late_fee','out_prncp_inv','out_prncp','desc']
print("Dataframe Shape before removing irrelevant columns : ",df.shape)
df.drop(columns=irrelevant_columns,axis=1,inplace=True)
print("No of column dropped : ",len(irrelevant_columns))
print("Dataframe Shape after removing irrelevant columns : ",df.shape)
Dataframe Shape before removing irrelevant columns : (39717, 46)
No of column dropped : 16
Dataframe Shape after removing irrelevant columns : (39717, 30)

2.5 Finding the lists of columns with continuous and categorical values

df.columns.to_series().groupby(df.dtypes).groups
{dtype('int64'): Index(['loan_amnt', 'funded_amnt', 'emp_length', 'inq_last_6mths', 'open_acc',
'pub_rec', 'total_acc'],
dtype='object'),
dtype('float64'): Index(['funded_amnt_inv', 'int_rate', 'annual_inc', 'dti', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'last_pymnt_amnt', 'pub_rec_bankruptcies'],
dtype='object'),
dtype('O'): Index(['term', 'grade', 'sub_grade', 'emp_title', 'home_ownership',
'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',
'addr_state', 'earliest_cr_line', 'last_credit_pull_d'],
dtype='object')}

Conclusion
List of continuous variables : [‘loan_amnt’, ‘funded_amnt’, ‘emp_length’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘total_acc’,’funded_amnt_inv’, ‘int_rate’, ‘annual_inc’, ‘dti’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’, ‘total_rec_int’, ‘last_pymnt_amnt’, ‘pub_rec_bankruptcies’ ]

List of non-continuous variables : [ ‘term’, ‘grade’, ‘sub_grade’, ‘emp_title’, ‘home_ownership’, ‘verification_status’, ‘issue_d’, ‘loan_status’, ‘purpose’, ‘title’, ‘addr_state’, ‘earliest_cr_line’, ‘last_credit_pull_d’]
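
These two lists can also be derived programmatically with select_dtypes. A minimal sketch:

# Derive the continuous/categorical split from the dtypes directly
continuous_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()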

2.6 Standardise continuous values to numeric values

Finding a list of columns that have to be transformed into numeric values.
List : [‘loan_amnt’, ‘funded_amnt’, ‘emp_length’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘total_acc’,’funded_amnt_inv’, ‘int_rate’, ‘annual_inc’, ‘dti’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’, ‘total_rec_int’, ‘last_pymnt_amnt’, ‘pub_rec_bankruptcies’]

list_continuous = ['loan_amnt', 'funded_amnt', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc','funded_amnt_inv', 'int_rate', 'annual_inc', 'dti', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_amnt', 'pub_rec_bankruptcies' ]
df.loc[:,list_continuous] = df.loc[:,list_continuous].apply(pd.to_numeric)
list_categorical = [ 'term', 'grade', 'sub_grade', 'emp_length','emp_title', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'title', 'addr_state', 'earliest_cr_line', 'last_credit_pull_d']

2.7 Target column category check: removing rows for categories that are not needed

# filtering only fully paid or charged-off
df = df[df['loan_status'] != 'Current']
df['loan_status'] = df['loan_status'].apply(lambda x: 0 if x=='Fully Paid' else 1)

# converting loan_status to integer type
df['loan_status'] = df['loan_status'].apply(lambda x: pd.to_numeric(x))

# summarising the values
df['loan_status'].value_counts()
0 32950
1 5627
Name: loan_status, dtype: int64
df.shape
(38577, 30)
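
With loan_status now encoded as 0/1, the strategy from section 1.5 reduces to a one-line groupby: the mean of the target within a group is that group's default rate. A sketch, using grade as an example:

# Average default rate per grade: the mean of the 0/1 target within each group
default_rate_by_grade = df.groupby('grade')['loan_status'].mean()
print(default_rate_by_grade.sort_values(ascending=False))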

3. DATA ANALYSIS

3.1 Univariate Analysis for various continuous variables

Using distribution plots, violin plots, and box plots to show the distribution of the various continuous variables

def plot_univariate_analysis_numeric(column_name):
    """Function to plot distribution for a continuous column using a distribution plot, violin plot and box plot"""
    figure, axis = plt.subplots(nrows=1, ncols=3, figsize=(25,10))
    axis[0].set_title("Distribution Plot for " + column_name)
    sns.distplot(df[column_name], ax=axis[0]) # Distribution plot
    axis[1].set_title("Violin Plot for " + column_name)
    sns.violinplot(data=df[column_name], ax=axis[1], inner="quartile") # Violin plot
    axis[2].set_title("Box Plot for " + column_name)
    sns.boxplot(df[column_name], ax=axis[2], orient='v') # Box plot
    plt.show()

for i in list_continuous:
    plot_univariate_analysis_numeric(i) # Plotting for each continuous variable
[Plots: distribution plot, violin plot, and box plot for each of the continuous variables]

Insight: analysis of the above plots:
1) Loan amounts mostly lie between 6000 and 15000
2) Funded amounts range from 5500 to 15000
3) Funded amount (investor) ranges from 5000 to 13000
4) Employment length of the loan seekers is mostly between 2 and 9 years
5) The number of derogatory public records is very small, mainly concentrated at the centre
6) Interest rates mostly lie between 8.8 and 14 percent
7) The dti ratio mostly lies between 8 and 18

3.2 Univariate Analysis for various categorical variables

Using bar plots to show the counts of the various categorical variables

emp_title column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.emp_title.describe()
count 36191
unique 28027
top US Army
freq 131
Name: emp_title, dtype: object
result = []
df_temp = df.groupby('emp_title') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['emp_title'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('emp_title')
g = sns.barplot(x='emp_title', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 emp_title values by count]

Insight: The job title with the highest number of loans is US Army.
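
As an aside, the groupby loop above can be replaced with value_counts. A sketch of an equivalent top-50 plot:

# Equivalent top-50 count plot using value_counts
top50 = df['emp_title'].value_counts().nlargest(50) # Top 50 job titles by count
plt.figure(figsize=(20, 15))
plt.title('emp_title')
sns.barplot(x=top50.index, y=top50.values)
plt.xticks(rotation=90)
plt.show()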

list_categorical.remove('emp_title')

title column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.title.describe()
count 38566
unique 19297
top Debt Consolidation
freq 2090
Name: title, dtype: object
result = []
df_temp = df.groupby('title') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['title'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('title')
g = sns.barplot(x='title', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 title values by count]

Insight: The loan title with the highest loan count is Debt Consolidation.

list_categorical.remove('title')

earliest_cr_line column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.earliest_cr_line.describe()
count 38577
unique 524
top Oct-99
freq 360
Name: earliest_cr_line, dtype: object
result = []
df_temp = df.groupby('earliest_cr_line') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['earliest_cr_line'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('earliest_cr_line')
g = sns.barplot(x='earliest_cr_line', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 earliest_cr_line values by count]

Insight: The most common month in which borrowers’ earliest reported credit line was opened is Oct-99.

list_categorical.remove('earliest_cr_line') # Removing from list

last_credit_pull_d column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.last_credit_pull_d.describe()
count 38575
unique 106
top May-16
freq 9333
Name: last_credit_pull_d, dtype: object
result = []
df_temp = df.groupby('last_credit_pull_d') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['last_credit_pull_d'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('last_credit_pull_d')
g = sns.barplot(x='last_credit_pull_d', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 last_credit_pull_d values by count]

Insight: The most common month in which LC last pulled credit for a loan is May-16.

list_categorical.remove('last_credit_pull_d')

Plotting graphs for the remaining categorical variables

def plot_univariate_analysis_categorical(column_name):
    """Function to plot the frequency of a categorical variable using a bar plot"""
    result = []
    df_temp = df.groupby(column_name) # Group by column name
    for key,item in df_temp:
        temp = {}
        temp[column_name] = key
        temp['Count'] = item.emp_length.count() # Count rows in the group
        result.append(temp)
    df_plot = pd.DataFrame(result) # Create data frame
    plt.figure(figsize=(20, 15))
    plt.title(column_name)
    g = sns.barplot(x=column_name, y='Count', data=df_plot) # Plot using barplot
    for index, row in df_plot.iterrows():
        g.text(row.name, row.Count, round(row.Count,2), color='black', ha="center") # Annotate bars with counts
    plt.xticks(rotation=90) # Rotate x labels vertically
    plt.show()

for i in list_categorical:
    plot_univariate_analysis_categorical(i) # Drawing a bar plot for each categorical variable
[Plots: count bar plots for each of the remaining categorical variables]

Insight: analysis of the above plots
1) The 36-month term has more loans than the 60-month term
2) Grade B has the most loans, with a count of 11675
3) Within grade B, sub-grade B3 has the highest loan count
4) Borrowers who rent take the most loans
5) There are 16694 loans that are not verified
6) The most loans were issued in Nov-11
7) Loan status: 32950 loans are fully paid and 5627 are charged off
8) Most loans are taken for the debt_consolidation purpose
9) The state where most loans are taken is CA

3.3 Binning various continuous variables into segments to draw more insights from the distributions

Drawing a count plot for loan_year, derived from issue_d

df['loan_year'] = df.issue_d.apply(lambda x : '20'+x.split('-')[1]) # Extracting loan issued year
plt.figure(figsize = (15,5))
plt.title("loan_year")
df.loan_year.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_year')
[Plot: loan counts by year]

Insight: The year with the most loans is 2011.

Drawing a count plot for loan_month, derived from issue_d

df['loan_month'] = df.issue_d.apply(lambda x : x.split('-')[0])
plt.figure(figsize = (15,5))
plt.title("loan_month")
df.loan_month.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_month')
[Plot: loan counts by month]

Insight: The most loans are issued in December.
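
Both derivations above rely on string splitting. Parsing issue_d as a date is sturdier; a sketch, assuming the 'Mon-YY' format seen in the data (e.g. 'Dec-11'):

# Parse issue_d as a date instead of splitting strings
issue_dates = pd.to_datetime(df['issue_d'], format='%b-%y')
df['loan_year'] = issue_dates.dt.year
df['loan_month'] = issue_dates.dt.strftime('%b') # e.g. 'Dec'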

Plotting bins for loan amount

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['loan_amount_bins'] = pd.cut(df['loan_amnt'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("loan_amount_bins")
df.loan_amount_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_amount_bins')
[Plot: loan amount bin counts]

Insight: The loan amount range with the highest loan count is 9000–10000.
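
The same binning code is repeated below for funded_amnt and funded_amnt_inv, so a small helper avoids the duplication. A sketch (the helper name plot_amount_bins is hypothetical, not part of the original notebook):

def plot_amount_bins(column_name, max_amount=40000, step=1000):
    """Bin a currency column at fixed intervals and plot the bin counts (a sketch)"""
    bins = np.linspace(0, max_amount, max_amount // step + 1)
    range_label = [str(i) + '-' + str(i + step) for i in range(0, max_amount, step)]
    df[column_name + '_bins'] = pd.cut(df[column_name], bins, labels=range_label)
    plt.figure(figsize=(15, 5))
    plt.title(column_name + '_bins')
    df[column_name + '_bins'].value_counts().plot(kind='bar')
    plt.xticks(rotation=90)
    plt.show()
    list_categorical.append(column_name + '_bins')

# e.g. plot_amount_bins('funded_amnt')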

Plotting bins for funded amount

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['funded_amnt_bins'] = pd.cut(df['funded_amnt'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("funded_amnt_bins")
df.funded_amnt_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('funded_amnt_bins')
[Plot: funded amount bin counts]

Insight: The funded amount range with the highest count is 9000–10000.

Plotting bins for funded_amnt_inv

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['funded_amnt_inv_bins'] = pd.cut(df['funded_amnt_inv'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("funded_amnt_inv_bins")
df.funded_amnt_inv_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('funded_amnt_inv_bins')
[Plot: funded_amnt_inv bin counts]

Insight: The investor-funded amount range with the highest count is 4000–5000.

Drawing plot of annual income bins

bins = np.append(np.linspace(0, 100000, 5),1000000)
range_label = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
df['annual_inc_bins'] = pd.cut(df['annual_inc'], bins, labels=range_label)
plt.figure(figsize = (15,5))
plt.title("annual_inc_bins")
df.annual_inc_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('annual_inc_bins')
[Plot: annual income bin counts]

Insight: Borrowers with annual income between 25000 and 50000 apply for the most loans.

Plotting bins for interest rate

bins = np.linspace(0,25,11)
range_label = ['0-2.5','2.5-5','5-7.5','7.5-10','10-12.5','12.5-15','15-17.5','17.5-20','20-22.5','22.5-25']
df['int_rate_bins'] = pd.cut(df['int_rate'], bins, labels=range_label,right=False)
plt.figure(figsize = (15,5))
plt.title("Interest Rate Bins ")
df.int_rate_bins.value_counts().plot(kind = 'bar')
plt.xticks(rotation=90)
plt.show()
list_categorical.append('int_rate_bins')
[Plot: interest rate bin counts]

Insight: The most common interest rate range on loans is 10–12.5 percent.

3.4 Bivariate Analysis drawing correlation between various attributes

Finding correlations between the continuous variables by drawing a heat map

cor = round(df.corr(),3)
# figure size
plt.figure(figsize=(25,15))
# heatmap
ax = sns.heatmap(cor, cmap = "rainbow", annot=True)
top, bottom = ax.get_ylim()
ax.set_ylim(top+0.5, bottom-0.5)
plt.show()
[Plot: correlation heat map of the continuous variables]

Insight: The heat map above shows that loan_amnt, funded_amnt, and funded_amnt_inv are highly correlated, implying that investors fund approximately 94 percent of the loan amount requested by the borrower.
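
To check the specific numbers behind this insight, the pairwise correlations can be printed directly. A sketch:

# Pairwise correlations behind the insight above
print(df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv']].corr().round(3))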

3.5 Bivariate analysis for categorical variables with the rate of charge off

The charge-off rate is obtained by dividing the number of charged-off loans by the total number of loans.
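
Because loan_status is encoded as 0/1, the same rate falls out of a simple groupby mean. A compact sketch equivalent to the function below:

# Compact equivalent of the function below: the group mean of the 0/1 target
def charged_off_rate_plot(column_name):
    """Charge-off rate per category (a sketch)"""
    rate = df.groupby(column_name)['loan_status'].mean().sort_values(ascending=False)
    plt.figure(figsize=(12, 8))
    sns.barplot(x=rate.index, y=rate.values)
    plt.xticks(rotation=90)
    plt.show()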

def bivariate_analysis_categorical(column_name):
    """Function to plot the charge-off rate for each category of a column"""
    tmp_df = df.groupby([column_name,'loan_status'])
    result = []
    for item in set(df[column_name]):
        tmp = {}
        tmp[column_name] = item
        tmp['charged_off_count'] = 0
        tmp['fully_paid_count'] = 0
        result.append(tmp) # Adding initial values to the result list
    for key,item in tmp_df:
        for value in result:
            if(value[column_name]==key[0] and key[1]==1):
                value['charged_off_count'] = len(item) # 1 i.e. charged-off count
            elif(value[column_name]==key[0] and key[1]==0):
                value['fully_paid_count'] = len(item) # 0 i.e. fully paid count
    result_df = pd.DataFrame(result)
    result_df['total_loan_count'] = (result_df.charged_off_count+result_df.fully_paid_count)
    result_df['charged_off_rate'] = result_df.charged_off_count/result_df.total_loan_count
    result_df.sort_values(by=['charged_off_rate'] , inplace=True , ascending=False)
    # set figure size for a larger figure
    plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
    sns.barplot(x=column_name, y='charged_off_rate', data=result_df)
    plt.xticks(rotation=90)
    plt.show()

list_categorical.remove('loan_status') # Not needed, as we are comparing against this column
list_categorical.remove('issue_d') # Not needed, as we already have year and month columns
for i in list_categorical:
    bivariate_analysis_categorical(i) # Plotting the charge-off rate for each categorical variable
[Plots: charge-off rate bar plots for each categorical variable]

Insight: analysis of the above plots
1) The default rate for the 60-month term is higher than for the 36-month term
2) Loans with grade G have the highest default rate
3) Sub-grade F5 has the highest default rate
4) Home ownership ‘other’ has the highest default rate
5) Verified loans have a higher chance of default
6) Loans taken for the small_business purpose have a higher default rate
7) The state NE has a higher default rate than other states
8) The year 2007 has the highest default rate
9) December has a higher default rate than other months
10) Loan amounts and funded_amnt_inv between 30000 and 31000 have the highest default rate
11) Borrowers with annual income between 0 and 25000 have the highest charge-off rate
12) Loans with high interest rates, between 22.5 and 25 percent, are more likely to default

def plot_bivariate_two_variables(column_x,column_hue):
    '''Function to plot a graph of two attributes against the charge-off rate'''
    tmp_df = df.groupby([column_hue,column_x,'loan_status'])
    result = []
    for item in set(df[column_hue]):
        for i in set(df[column_x]):
            tmp = {}
            tmp[column_hue] = item
            tmp[column_x] = i
            tmp['charged_off_count'] = 0
            tmp['fully_paid_count'] = 0
            result.append(tmp)
    for key,item in tmp_df:
        for value in result:
            if(value[column_hue]==key[0] and value[column_x]==key[1] and key[2]==1):
                value['charged_off_count'] = len(item)
            if(value[column_hue]==key[0] and value[column_x]==key[1] and key[2]==0):
                value['fully_paid_count'] = len(item)
    result_df = pd.DataFrame(result)
    result_df['total_loan_count'] = (result_df.charged_off_count+result_df.fully_paid_count)
    result_df['charged_off_rate'] = result_df.charged_off_count/result_df.total_loan_count
    result_df.sort_values(by=['charged_off_rate'] , inplace=True , ascending=False)
    # set figure size for a larger figure
    plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
    # specify hue="categorical_variable"
    sns.barplot(x=column_x, y='charged_off_rate', hue=column_hue, data=result_df)
    plt.xticks(rotation=90)
    plt.show()

3.6 Bivariate Analysis for grade and term with charge-off probability

plot_bivariate_two_variables('grade','term') # Calling plot function
[Plot: charge-off rate by grade, split by term]

Insight: Grade G with a 36-month loan term has the highest charge-off rate.

3.7 Bivariate Analysis for purpose and term with charge-off probability

plot_bivariate_two_variables('purpose','term') # Calling plot function
[Plot: charge-off rate by purpose, split by term]

Insight: Loans for educational purposes with a 60-month term have the highest default rate.

3.8 Bivariate Analysis for int_rate_bins and purpose with charge-off probability

plot_bivariate_two_variables('int_rate_bins','purpose') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by purpose]

Insight: Loans for the car, house, and renewable energy purposes with interest rates in the 22.5–25 range have the highest default rates.

3.9 Bivariate Analysis for loan_year and purpose with charge-off probability

plot_bivariate_two_variables('loan_year','purpose') # Calling plot function
[Plot: charge-off rate by loan year, split by purpose]

Insight: Loans issued in 2007 for the moving purpose have the highest default rate.

3.10 Bivariate Analysis for int_rate_bins and loan_year with charge-off probability

plot_bivariate_two_variables('int_rate_bins','loan_year') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by loan year]

Insight: Interest rates between 17.5 and 20 in loan year 2008 have the highest default rate.

3.11 Bivariate Analysis for int_rate_bins and home_ownership with charge-off probability

plot_bivariate_two_variables('int_rate_bins','home_ownership') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by home ownership]

Insight: Interest rates between 17.5 and 20 for home ownership ‘other’ have the highest default rate.

3.12 Bivariate Analysis for int_rate_bins and term with charge-off probability

plot_bivariate_two_variables('int_rate_bins','term') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by term]

Insight: Interest rates between 22.5 and 25 with a 60-month term have the highest default rate.
