Getting started with Exploratory Data Analysis

Manish Poddar
17 min read · Apr 20, 2021


In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing and handling missing values, and making transformations of variables as needed. EDA encompasses IDA. (from Wikipedia)

In this blog post, we take the Lending Club case study data from Kaggle and perform EDA on it.

In this notebook, we assume that we work for a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risks are associated with the company’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

The data given below contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns that indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

In this case study, we will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
%matplotlib inline
sns.set_style("whitegrid")

1 Data Understanding

1.1 Read the data into a pandas data frame and describe it

# Loading Pandas data frame
df = pd.read_csv("loan.csv")
df.head() # looking at the first few rows
DtypeWarning: Columns (47) have mixed types. Specify dtype option on import or set low_memory=False.

[Output: first five rows of the data frame]
df.describe(percentiles=[0.05,0.15,0.25,0.50,0.75,0.90,0.95,0.99]) # describing the data frame and computing summary stats
[Output: summary statistics with the requested percentiles]

1.2 Categorizing different attributes of the data frame

df.columns # extracting columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
...
'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens',
'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit',
'total_il_high_credit_limit'],
dtype='object', length=111)

From the above data frame, the variables can be divided into three categories:

1) Applicant attributes
e.g. emp_title, emp_length, open_acc, addr_state

2) Loan characteristics
e.g. desc, dti, funded_amnt, grade

3) Customer behaviour variables
e.g. delinq_2yrs, last_pymnt_amnt, revol_bal, acc_now_delinq, bc_open_to_buy, chargeoff_within_12_mths, collection_recovery_fee, collections_12_mths_ex_med, delinq_amnt, installment, last_pymnt_d, max_bal_bc, mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl, mths_since_last_delinq, mths_since_last_major_derog, next_pymnt_d, pct_tl_nvr_dlq, recoveries, revol_util, tot_coll_amt, total_rec_late_fee, total_rec_int, total_rec_prncp, total_rev_hi_lim, out_prncp, out_prncp_inv, percent_bc_gt_75

Since customer behaviour variables are generated only after a loan is approved, they need not be considered for this case study.

1.3 check the datatype of various columns

df.dtypes # finding the datatypes for all columns
id int64
member_id int64
loan_amnt int64
funded_amnt int64
funded_amnt_inv float64
term object
int_rate object
installment float64
grade object
sub_grade object
emp_title object
emp_length object
home_ownership object
annual_inc float64
verification_status object
issue_d object
loan_status object
pymnt_plan object
url object
desc object
purpose object
title object
zip_code object
addr_state object
dti float64
delinq_2yrs int64
earliest_cr_line object
inq_last_6mths int64
mths_since_last_delinq float64
mths_since_last_record float64
open_acc int64
pub_rec int64
revol_bal int64
revol_util object
total_acc int64
initial_list_status object
out_prncp float64
out_prncp_inv float64
total_pymnt float64
total_pymnt_inv float64
total_rec_prncp float64
total_rec_int float64
total_rec_late_fee float64
recoveries float64
collection_recovery_fee float64
last_pymnt_d object
last_pymnt_amnt float64
next_pymnt_d object
last_credit_pull_d object
collections_12_mths_ex_med float64
mths_since_last_major_derog float64
policy_code int64
application_type object
annual_inc_joint float64
dti_joint float64
verification_status_joint float64
acc_now_delinq int64
tot_coll_amt float64
tot_cur_bal float64
open_acc_6m float64
open_il_6m float64
open_il_12m float64
open_il_24m float64
mths_since_rcnt_il float64
total_bal_il float64
il_util float64
open_rv_12m float64
open_rv_24m float64
max_bal_bc float64
all_util float64
total_rev_hi_lim float64
inq_fi float64
total_cu_tl float64
inq_last_12m float64
acc_open_past_24mths float64
avg_cur_bal float64
bc_open_to_buy float64
bc_util float64
chargeoff_within_12_mths float64
delinq_amnt int64
mo_sin_old_il_acct float64
mo_sin_old_rev_tl_op float64
mo_sin_rcnt_rev_tl_op float64
mo_sin_rcnt_tl float64
mort_acc float64
mths_since_recent_bc float64
mths_since_recent_bc_dlq float64
mths_since_recent_inq float64
mths_since_recent_revol_delinq float64
num_accts_ever_120_pd float64
num_actv_bc_tl float64
num_actv_rev_tl float64
num_bc_sats float64
num_bc_tl float64
num_il_tl float64
num_op_rev_tl float64
num_rev_accts float64
num_rev_tl_bal_gt_0 float64
num_sats float64
num_tl_120dpd_2m float64
num_tl_30dpd float64
num_tl_90g_dpd_24m float64
num_tl_op_past_12m float64
pct_tl_nvr_dlq float64
percent_bc_gt_75 float64
pub_rec_bankruptcies float64
tax_liens float64
tot_hi_cred_lim float64
total_bal_ex_mort float64
total_bc_limit float64
total_il_high_credit_limit float64
dtype: object

Conclusion: Interest rate (int_rate) and employment length (emp_length) are stored as objects but should be numeric, so we fix these data type issues.

1.4 Correct the data types of these columns

#employment length(emp_length)
print("Unique Values before imputing: ",df.emp_length.unique())
df.emp_length.fillna(0,inplace=True)
df.emp_length.replace(['10+ years'],10,inplace=True) # map the values
df.emp_length.replace(['< 1 year'],0,inplace=True)
df.emp_length.replace(['1 year'],1,inplace=True)
df.emp_length.replace(['2 years'],2,inplace=True)
df.emp_length.replace(['3 years'],3,inplace=True)
df.emp_length.replace(['4 years'],4,inplace=True)
df.emp_length.replace(['5 years'],5,inplace=True)
df.emp_length.replace(['6 years'],6,inplace=True)
df.emp_length.replace(['7 years'],7,inplace=True)
df.emp_length.replace(['8 years'],8,inplace=True)
df.emp_length.replace(['9 years'],9,inplace=True)
print("Unique Values after imputing: ",df.emp_length.unique())
Unique Values before imputing: ['10+ years' '< 1 year' '1 year' '3 years' '8 years' '9 years' '4 years'
'5 years' '6 years' '2 years' '7 years' nan]
Unique Values after imputing: [10 0 1 3 8 9 4 5 6 2 7]
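
Note: the chain of replace calls above can be written more concisely with a single mapping dictionary. A minimal sketch, assuming the raw string column and the value set shown in the output above:

# A more concise equivalent of the replace chain, applied to the raw string column
emp_map = {'10+ years': 10, '9 years': 9, '8 years': 8, '7 years': 7,
           '6 years': 6, '5 years': 5, '4 years': 4, '3 years': 3,
           '2 years': 2, '1 year': 1, '< 1 year': 0}
df['emp_length'] = df['emp_length'].map(emp_map).fillna(0).astype(int) # NaN also maps to 0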
# fixing interest rate (int_rate)
df.loc[:,'int_rate'] = df.loc[:,'int_rate'].apply(lambda x : float(x.strip('%'))) # converting percentage to numeric
print("After fixing data types : \n",df.int_rate.head())
After fixing data types :
0 10.65
1 15.27
2 15.96
3 13.49
4 12.69
Name: int_rate, dtype: float64
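
As an aside, a vectorized one-liner would do the same job. A sketch, assuming int_rate is still the raw string column (e.g. '10.65%'):

# Vectorized alternative to the apply/lambda above
df['int_rate'] = df['int_rate'].str.rstrip('%').astype(float)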

1.5 The target variable, which we want to compare across the other variables, is loan status. The strategy is to compare the average default rates across various other variables and identify the ones that affect default rate the most.

2. DATA CLEANING

2.1 Fix Columns

Removing columns from the data frame where more than 90 percent of the values are null

print("Dataframe Shape before removing NULL : ",df.shape)
count_of_null = df.isnull().sum(axis=0) #Finding count of null in each column
list_of_count_null_above_90_per = list(count_of_null.iloc[count_of_null.values >(0.9*len(df))].index) #Finding list of columns with null above 90 percent of count
df.drop(columns=list_of_count_null_above_90_per,axis=1,inplace=True) # dropping columns with null values
print("No of column dropped : ",len(list_of_count_null_above_90_per))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 111)
No of column dropped : 56
Dataframe Shape after removing NULL : (39717, 55)
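
A roughly equivalent result can be obtained with dropna and a threshold. A sketch: keep only columns that have at least 10 percent non-null values:

# Keep columns with at least 10% non-null values (roughly equivalent to the above)
df = df.dropna(axis=1, thresh=int(0.1 * len(df)))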

2.2 Fix rows

Removing rows from the data frame where more than 90 percent of the values are null

print("Dataframe Shape before removing NULL : ",df.shape)
count_of_null = df.isnull().sum(axis=1)
list_of_count_null_above_90_per = list(count_of_null.iloc[count_of_null.values >(0.9*len(df.columns))].index)
df.drop(df.index[list_of_count_null_above_90_per],axis=0,inplace=True)# dropping rows with null values
print("No of rows dropped : ",len(list_of_count_null_above_90_per))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 55)
No of rows dropped : 0
Dataframe Shape after removing NULL : (39717, 55)

There are no rows with more than 90 percent null values.

2.3 Fix columns with one unique value

Removing columns that contain only one unique value; dropping them will not affect our results.

print("Dataframe Shape before removing NULL : ",df.shape)
unique_df = df.nunique()
one_unquiue_value_list = list(unique_df.iloc[unique_df.values==1].index)
df.drop(columns=one_unquiue_value_list,axis=1,inplace=True) # dropping columns with one unique values
print("No of columns dropped : ",len(one_unquiue_value_list))
print("Dataframe Shape after removing NULL : ",df.shape)
Dataframe Shape before removing NULL : (39717, 55)
No of columns dropped : 9
Dataframe Shape after removing NULL : (39717, 46)
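
For reference, the same cleanup fits in one line. A sketch: keep only columns with more than one unique value:

# One-line equivalent: retain columns with more than one unique value
df = df.loc[:, df.nunique() > 1]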

2.4 Removing irrelevant columns

Now let's look at columns from a business point of view and remove columns that are not needed.

irrelevant_columns = ['id','member_id','url','zip_code','delinq_2yrs','revol_bal','collection_recovery_fee','installment','last_pymnt_d','mths_since_last_delinq','recoveries','revol_util','total_rec_late_fee','out_prncp_inv','out_prncp','desc']
print("Dataframe Shape before removing irrelevant columns : ",df.shape)
df.drop(columns=irrelevant_columns,axis=1,inplace=True)
print("No of column dropped : ",len(irrelevant_columns))
print("Dataframe Shape after removing irrelevant columns : ",df.shape)
Dataframe Shape before removing irrelevant columns : (39717, 46)
No of column dropped : 16
Dataframe Shape after removing irrelevant columns : (39717, 30)

2.5 Finding the lists of columns with continuous and categorical values

df.columns.to_series().groupby(df.dtypes).groups
{dtype('int64'): Index(['loan_amnt', 'funded_amnt', 'emp_length', 'inq_last_6mths', 'open_acc',
'pub_rec', 'total_acc'],
dtype='object'),
dtype('float64'): Index(['funded_amnt_inv', 'int_rate', 'annual_inc', 'dti', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'last_pymnt_amnt', 'pub_rec_bankruptcies'],
dtype='object'),
dtype('O'): Index(['term', 'grade', 'sub_grade', 'emp_title', 'home_ownership',
'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',
'addr_state', 'earliest_cr_line', 'last_credit_pull_d'],
dtype='object')}

Conclusion
List of continuous variables : [‘loan_amnt’, ‘funded_amnt’, ‘emp_length’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘total_acc’,’funded_amnt_inv’, ‘int_rate’, ‘annual_inc’, ‘dti’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’, ‘total_rec_int’, ‘last_pymnt_amnt’, ‘pub_rec_bankruptcies’ ]

List of non-continuous variables : [ ‘term’, ‘grade’, ‘sub_grade’, ‘emp_title’, ‘home_ownership’, ‘verification_status’, ‘issue_d’, ‘loan_status’, ‘purpose’, ‘title’, ‘addr_state’, ‘earliest_cr_line’, ‘last_credit_pull_d’]
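
These two lists can also be derived programmatically with select_dtypes. A minimal sketch:

# Derive the continuous/categorical split from the dtypes directly
continuous_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()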

2.6 Standardise continuous values to numeric values

Finding a list of columns that have to be transformed into numeric values.
List : [‘loan_amnt’, ‘funded_amnt’, ‘emp_length’, ‘inq_last_6mths’, ‘open_acc’, ‘pub_rec’, ‘total_acc’,’funded_amnt_inv’, ‘int_rate’, ‘annual_inc’, ‘dti’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’, ‘total_rec_int’, ‘last_pymnt_amnt’, ‘pub_rec_bankruptcies’]

list_continuous = ['loan_amnt', 'funded_amnt', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc','funded_amnt_inv', 'int_rate', 'annual_inc', 'dti', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_amnt', 'pub_rec_bankruptcies' ]
df.loc[:,list_continuous] = df.loc[:,list_continuous].apply(pd.to_numeric)
list_categorical = [ 'term', 'grade', 'sub_grade', 'emp_length','emp_title', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'title', 'addr_state', 'earliest_cr_line', 'last_credit_pull_d']

2.7 Target column category check: removing rows for categories that are not needed

# filtering only fully paid or charged-off
df = df[df['loan_status'] != 'Current']
df['loan_status'] = df['loan_status'].apply(lambda x: 0 if x=='Fully Paid' else 1)

# converting loan_status to integer type
df['loan_status'] = df['loan_status'].apply(lambda x: pd.to_numeric(x))

# summarising the values
df['loan_status'].value_counts()
0 32950
1 5627
Name: loan_status, dtype: int64
df.shape
(38577, 30)
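
With loan_status now encoded as 0/1, the strategy from section 1.5 reduces to a one-line groupby: the mean of the target within a group is that group's default rate. A sketch, using grade as an example:

# Average default rate per grade: the mean of the 0/1 target within each group
default_rate_by_grade = df.groupby('grade')['loan_status'].mean()
print(default_rate_by_grade.sort_values(ascending=False))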

3. DATA ANALYSIS

3.1 Univariate Analysis for various continuous variables

Using distribution plots, violin plots, and box plots to show the distribution of the various continuous variables

def plot_univariate_analysis_numeric(column_name):
    """Function to plot distribution for a continuous column using a distribution plot, violin plot and box plot"""
    figure, axis = plt.subplots(nrows=1, ncols=3, figsize=(25,10))
    axis[0].set_title("Distribution Plot for " + column_name)
    sns.distplot(df[column_name], ax=axis[0]) # Distribution plot
    axis[1].set_title("Violin Plot for " + column_name)
    sns.violinplot(data=df[column_name], ax=axis[1], inner="quartile") # Violin plot
    axis[2].set_title("Box Plot for " + column_name)
    sns.boxplot(df[column_name], ax=axis[2], orient='v') # Box plot
    plt.show()

for i in list_continuous:
    plot_univariate_analysis_numeric(i) # Plotting for each continuous variable
[Plots: distribution plot, violin plot, and box plot for each of the continuous variables]

Insight: analysis of the above plots:
1) Loan amounts mostly lie between 6000 and 15000
2) Funded amounts range from 5500 to 15000
3) Funded amount (investor) ranges from 5000 to 13000
4) Employment length of the loan seekers is mostly between 2 and 9 years
5) The number of derogatory public records is very small, mainly concentrated at the centre
6) Interest rates mostly lie between 8.8 and 14 percent
7) The dti ratio mostly lies between 8 and 18

3.2 Univariate Analysis for various categorical variables

Using bar plots to show the counts of the various categorical variables

emp_title column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.emp_title.describe()
count 36191
unique 28027
top US Army
freq 131
Name: emp_title, dtype: object
result = []
df_temp = df.groupby('emp_title') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['emp_title'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('emp_title')
g = sns.barplot(x='emp_title', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 emp_title values by count]

Insight: The job title with the highest number of loans is US Army.
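
As an aside, the groupby loop above can be replaced with value_counts. A sketch of an equivalent top-50 plot:

# Equivalent top-50 count plot using value_counts
top50 = df['emp_title'].value_counts().nlargest(50) # Top 50 job titles by count
plt.figure(figsize=(20, 15))
plt.title('emp_title')
sns.barplot(x=top50.index, y=top50.values)
plt.xticks(rotation=90)
plt.show()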

list_categorical.remove('emp_title')

title column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.title.describe()
count 38566
unique 19297
top Debt Consolidation
freq 2090
Name: title, dtype: object
result = []
df_temp = df.groupby('title') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['title'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('title')
g = sns.barplot(x='title', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 title values by count]

Insight: The loan title with the highest loan count is Debt Consolidation.

list_categorical.remove('title')

earliest_cr_line column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.earliest_cr_line.describe()
count 38577
unique 524
top Oct-99
freq 360
Name: earliest_cr_line, dtype: object
result = []
df_temp = df.groupby('earliest_cr_line') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['earliest_cr_line'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('earliest_cr_line')
g = sns.barplot(x='earliest_cr_line', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 earliest_cr_line values by count]

Insight: The most common month in which borrowers’ earliest reported credit line was opened is Oct-99.

list_categorical.remove('earliest_cr_line') # Removing from list

last_credit_pull_d column :

As we can see below, the number of unique values is very large, so it would be difficult to show every value's count on a bar plot; instead we take the top 50 values and display them.

df.last_credit_pull_d.describe()
count 38575
unique 106
top May-16
freq 9333
Name: last_credit_pull_d, dtype: object
result = []
df_temp = df.groupby('last_credit_pull_d') # Group by column name
for key,item in df_temp:
    temp = {}
    temp['last_credit_pull_d'] = key
    temp['Count'] = item.emp_length.count() # Count rows in the group
    result.append(temp)
df_result = pd.DataFrame(result) # Create data frame
df_plot = df_result.nlargest(50,['Count']) # Take the top 50
plt.figure(figsize=(20, 15))
plt.title('last_credit_pull_d')
g = sns.barplot(x='last_credit_pull_d', y='Count', data=df_plot) # Plot using barplot
plt.xticks(rotation=90) # Rotate x labels vertically
plt.show()
[Plot: bar plot of the top 50 last_credit_pull_d values by count]

Insight: The most common month in which LC last pulled credit for a loan is May-16.

list_categorical.remove('last_credit_pull_d')

Plotting graphs for the remaining categorical variables

def plot_univariate_analysis_categorical(column_name):
    """Function to plot the frequency of a categorical variable using a bar plot"""
    result = []
    df_temp = df.groupby(column_name) # Group by column name
    for key,item in df_temp:
        temp = {}
        temp[column_name] = key
        temp['Count'] = item.emp_length.count() # Count rows in the group
        result.append(temp)
    df_plot = pd.DataFrame(result) # Create data frame
    plt.figure(figsize=(20, 15))
    plt.title(column_name)
    g = sns.barplot(x=column_name, y='Count', data=df_plot) # Plot using barplot
    for index, row in df_plot.iterrows():
        g.text(row.name, row.Count, round(row.Count,2), color='black', ha="center") # Annotate bars with counts
    plt.xticks(rotation=90) # Rotate x labels vertically
    plt.show()

for i in list_categorical:
    plot_univariate_analysis_categorical(i) # Drawing a bar plot for each categorical variable
[Plots: count bar plots for each of the remaining categorical variables]

Insight: analysis of the above plots
1) The 36-month term has more loans than the 60-month term
2) Grade B has the most loans, with a count of 11675
3) Within grade B, sub-grade B3 has the highest loan count
4) Borrowers who rent take the most loans
5) There are 16694 loans that are not verified
6) The most loans were issued in Nov-11
7) Loan status: 32950 loans are fully paid and 5627 are charged off
8) Most loans are taken for the debt_consolidation purpose
9) The state where most loans are taken is CA

3.3 Binning various continuous variables into segments to draw more insights from the distributions

Drawing a count plot for loan_year, derived from issue_d

df['loan_year'] = df.issue_d.apply(lambda x : '20'+x.split('-')[1]) # Extracting loan issued year
plt.figure(figsize = (15,5))
plt.title("loan_year")
df.loan_year.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_year')
[Plot: loan counts by year]

Insight: The year with the most loans is 2011.

Drawing a count plot for loan_month, derived from issue_d

df['loan_month'] = df.issue_d.apply(lambda x : x.split('-')[0])
plt.figure(figsize = (15,5))
plt.title("loan_month")
df.loan_month.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_month')
[Plot: loan counts by month]

Insight: The most loans are issued in December.
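
Both derivations above rely on string splitting. Parsing issue_d as a date is sturdier; a sketch, assuming the 'Mon-YY' format seen in the data (e.g. 'Dec-11'):

# Parse issue_d as a date instead of splitting strings
issue_dates = pd.to_datetime(df['issue_d'], format='%b-%y')
df['loan_year'] = issue_dates.dt.year
df['loan_month'] = issue_dates.dt.strftime('%b') # e.g. 'Dec'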

Plotting bins for loan amount

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['loan_amount_bins'] = pd.cut(df['loan_amnt'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("loan_amount_bins")
df.loan_amount_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('loan_amount_bins')
[Plot: loan amount bin counts]

Insight: The loan amount range with the highest loan count is 9000–10000.
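
The same binning code is repeated below for funded_amnt and funded_amnt_inv, so a small helper avoids the duplication. A sketch (the helper name plot_amount_bins is hypothetical, not part of the original notebook):

def plot_amount_bins(column_name, max_amount=40000, step=1000):
    """Bin a currency column at fixed intervals and plot the bin counts (a sketch)"""
    bins = np.linspace(0, max_amount, max_amount // step + 1)
    range_label = [str(i) + '-' + str(i + step) for i in range(0, max_amount, step)]
    df[column_name + '_bins'] = pd.cut(df[column_name], bins, labels=range_label)
    plt.figure(figsize=(15, 5))
    plt.title(column_name + '_bins')
    df[column_name + '_bins'].value_counts().plot(kind='bar')
    plt.xticks(rotation=90)
    plt.show()
    list_categorical.append(column_name + '_bins')

# e.g. plot_amount_bins('funded_amnt')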

Plotting bins for funded amount

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['funded_amnt_bins'] = pd.cut(df['funded_amnt'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("funded_amnt_bins")
df.funded_amnt_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('funded_amnt_bins')
[Plot: funded amount bin counts]

Insight: The funded amount range with the highest count is 9000–10000.

Plotting bins for funded_amnt_inv

bins = np.linspace(0, 40000, 41)
range_label = []
for i in range(0,40000,1000): # Bins at 1000 intervals
    range_label.append(str(str(i)+'-'+str(i+1000)))
df['funded_amnt_inv_bins'] = pd.cut(df['funded_amnt_inv'], bins ,labels=range_label)
plt.figure(figsize = (15,5))
plt.title("funded_amnt_inv_bins")
df.funded_amnt_inv_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('funded_amnt_inv_bins')
[Plot: funded_amnt_inv bin counts]

Insight: The investor-funded amount range with the highest count is 4000–5000.

Drawing plot of annual income bins

bins = np.append(np.linspace(0, 100000, 5),1000000)
range_label = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
df['annual_inc_bins'] = pd.cut(df['annual_inc'], bins, labels=range_label)
plt.figure(figsize = (15,5))
plt.title("annual_inc_bins")
df.annual_inc_bins.value_counts().plot(kind = 'bar') # Plotting barplot
plt.xticks(rotation=90)
plt.show()
list_categorical.append('annual_inc_bins')
[Plot: annual income bin counts]

Insight: Borrowers with annual income between 25000 and 50000 apply for the most loans.

Plotting bins for interest rate

bins = np.linspace(0,25,11)
range_label = ['0-2.5','2.5-5','5-7.5','7.5-10','10-12.5','12.5-15','15-17.5','17.5-20','20-22.5','22.5-25']
df['int_rate_bins'] = pd.cut(df['int_rate'], bins, labels=range_label,right=False)
plt.figure(figsize = (15,5))
plt.title("Interest Rate Bins ")
df.int_rate_bins.value_counts().plot(kind = 'bar')
plt.xticks(rotation=90)
plt.show()
list_categorical.append('int_rate_bins')
[Plot: interest rate bin counts]

Insight: The most common interest rate range on loans is 10–12.5 percent.

3.4 Bivariate Analysis drawing correlation between various attributes

Finding correlations between the continuous variables by drawing a heat map

cor = round(df.corr(),3)
# figure size
plt.figure(figsize=(25,15))
# heatmap
ax = sns.heatmap(cor, cmap = "rainbow", annot=True)
top, bottom = ax.get_ylim()
ax.set_ylim(top+0.5, bottom-0.5)
plt.show()
[Plot: correlation heat map of the continuous variables]

Insight: The heat map above shows that loan_amnt, funded_amnt, and funded_amnt_inv are highly correlated, implying that investors fund approximately 94 percent of the loan amount requested by the borrower.
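
To check the specific numbers behind this insight, the pairwise correlations can be printed directly. A sketch:

# Pairwise correlations behind the insight above
print(df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv']].corr().round(3))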

3.5 Bivariate analysis for categorical variables with the rate of charge off

The charge-off rate is obtained by dividing the number of charged-off loans by the total number of loans.
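
Because loan_status is encoded as 0/1, the same rate falls out of a simple groupby mean. A compact sketch equivalent to the function below:

# Compact equivalent of the function below: the group mean of the 0/1 target
def charged_off_rate_plot(column_name):
    """Charge-off rate per category (a sketch)"""
    rate = df.groupby(column_name)['loan_status'].mean().sort_values(ascending=False)
    plt.figure(figsize=(12, 8))
    sns.barplot(x=rate.index, y=rate.values)
    plt.xticks(rotation=90)
    plt.show()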

def bivariate_analysis_categorical(column_name):
    """Function to plot the charge-off rate for each category of a column"""
    tmp_df = df.groupby([column_name,'loan_status'])
    result = []
    for item in set(df[column_name]):
        tmp = {}
        tmp[column_name] = item
        tmp['charged_off_count'] = 0
        tmp['fully_paid_count'] = 0
        result.append(tmp) # Adding initial values to the result list
    for key,item in tmp_df:
        for value in result:
            if(value[column_name]==key[0] and key[1]==1):
                value['charged_off_count'] = len(item) # 1 i.e. charged-off count
            elif(value[column_name]==key[0] and key[1]==0):
                value['fully_paid_count'] = len(item) # 0 i.e. fully paid count
    result_df = pd.DataFrame(result)
    result_df['total_loan_count'] = (result_df.charged_off_count+result_df.fully_paid_count)
    result_df['charged_off_rate'] = result_df.charged_off_count/result_df.total_loan_count
    result_df.sort_values(by=['charged_off_rate'] , inplace=True , ascending=False)
    # set figure size for a larger figure
    plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
    sns.barplot(x=column_name, y='charged_off_rate', data=result_df)
    plt.xticks(rotation=90)
    plt.show()

list_categorical.remove('loan_status') # Not needed, as we are comparing against this column
list_categorical.remove('issue_d') # Not needed, as we already have year and month columns
for i in list_categorical:
    bivariate_analysis_categorical(i) # Plotting the charge-off rate for each categorical variable
[Plots: charge-off rate bar plots for each categorical variable]

Insight: analysis of the above plots
1) The default rate for the 60-month term is higher than for the 36-month term
2) Loans with grade G have the highest default rate
3) Sub-grade F5 has the highest default rate
4) Home ownership ‘other’ has the highest default rate
5) Verified loans have a higher chance of default
6) Loans taken for the small_business purpose have a higher default rate
7) The state NE has a higher default rate than other states
8) The year 2007 has the highest default rate
9) December has a higher default rate than other months
10) Loan amounts and funded_amnt_inv between 30000 and 31000 have the highest default rate
11) Borrowers with annual income between 0 and 25000 have the highest charge-off rate
12) Loans with high interest rates, between 22.5 and 25 percent, are more likely to default

def plot_bivariate_two_variables(column_x,column_hue):
    '''Function to plot a graph of two attributes against the charge-off rate'''
    tmp_df = df.groupby([column_hue,column_x,'loan_status'])
    result = []
    for item in set(df[column_hue]):
        for i in set(df[column_x]):
            tmp = {}
            tmp[column_hue] = item
            tmp[column_x] = i
            tmp['charged_off_count'] = 0
            tmp['fully_paid_count'] = 0
            result.append(tmp)
    for key,item in tmp_df:
        for value in result:
            if(value[column_hue]==key[0] and value[column_x]==key[1] and key[2]==1):
                value['charged_off_count'] = len(item)
            if(value[column_hue]==key[0] and value[column_x]==key[1] and key[2]==0):
                value['fully_paid_count'] = len(item)
    result_df = pd.DataFrame(result)
    result_df['total_loan_count'] = (result_df.charged_off_count+result_df.fully_paid_count)
    result_df['charged_off_rate'] = result_df.charged_off_count/result_df.total_loan_count
    result_df.sort_values(by=['charged_off_rate'] , inplace=True , ascending=False)
    # set figure size for a larger figure
    plt.figure(num=None, figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')
    # specify hue="categorical_variable"
    sns.barplot(x=column_x, y='charged_off_rate', hue=column_hue, data=result_df)
    plt.xticks(rotation=90)
    plt.show()

3.6 Bivariate Analysis for grade and term with charge-off probability

plot_bivariate_two_variables('grade','term') # Calling plot function
[Plot: charge-off rate by grade, split by term]

Insight: Grade G with a 36-month loan term has the highest charge-off rate.

3.7 Bivariate Analysis for purpose and term with charge-off probability

plot_bivariate_two_variables('purpose','term') # Calling plot function
[Plot: charge-off rate by purpose, split by term]

Insight: Loans for educational purposes with a 60-month term have the highest default rate.

3.8 Bivariate Analysis for int_rate_bins and purpose with charge-off probability

plot_bivariate_two_variables('int_rate_bins','purpose') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by purpose]

Insight: Loans for the car, house, and renewable energy purposes with interest rates in the 22.5–25 range have the highest default rates.

3.9 Bivariate Analysis for loan_year and purpose with charge-off probability

plot_bivariate_two_variables('loan_year','purpose') # Calling plot function
[Plot: charge-off rate by loan year, split by purpose]

Insight: Loans issued in 2007 for the moving purpose have the highest default rate.

3.10 Bivariate Analysis for int_rate_bins and loan_year with charge-off probability

plot_bivariate_two_variables('int_rate_bins','loan_year') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by loan year]

Insight: Interest rates between 17.5 and 20 in loan year 2008 have the highest default rate.

3.11 Bivariate Analysis for int_rate_bins and home_ownership with charge-off probability

plot_bivariate_two_variables('int_rate_bins','home_ownership') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by home ownership]

Insight: Interest rates between 17.5 and 20 for home ownership ‘other’ have the highest default rate.

3.12 Bivariate Analysis for int_rate_bins and term with charge-off probability

plot_bivariate_two_variables('int_rate_bins','term') # Calling plot function
[Plot: charge-off rate by interest rate bin, split by term]

Insight: Interest rates between 22.5 and 25 with a 60-month term have the highest default rate.
