Introduction: Manual Feature Engineering

In this notebook, we will explore making features by hand for the Home Credit Default Risk competition. In an earlier notebook, we used only the application data to build a model. The best model we made from this data achieved a leaderboard score of around 0.74. To improve on this score, we will have to include more information from the other dataframes. Here, we will look at using information from the bureau and bureau_balance data. The definitions of these data files are:

  • bureau: information about a client's previous loans with other financial institutions reported to Home Credit. Each previous loan has its own row.
  • bureau_balance: monthly information about the previous loans. Each month has its own row.

Manual feature engineering can be a tedious process (which is why we use automated feature engineering with featuretools!) and often relies on domain expertise. Since I have limited domain knowledge of loans and what makes a person likely to default, I will instead concentrate on getting as much information as possible into the final training dataframe. The idea is that the model will then pick up on which features are important rather than us having to decide that. Basically, our approach is to make as many features as possible and then give them all to the model to use! Later, we can perform feature reduction using the feature importances from the model or other techniques such as PCA.

The process of manual feature engineering will involve plenty of Pandas code, a little patience, and a lot of great practice manipulating data. Even though automated feature engineering tools are starting to become available, feature engineering will still require plenty of data wrangling for a while longer.

In [1]:
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

Example: Counts of a client's previous loans

To illustrate the general process of manual feature engineering, we will first simply get the count of a client's previous loans at other financial institutions. This requires a number of Pandas operations we will make heavy use of throughout the notebook:

  • groupby: group a dataframe by a column. In this case we will group by the unique client, the SK_ID_CURR column
  • agg: perform a calculation on the grouped data, such as taking the mean of the columns. We can either call the function directly (grouped_df.mean()) or use the agg function together with a list of transforms (grouped_df.agg(['mean', 'max', 'min', 'sum']))
  • merge: match the aggregated statistics to the appropriate client. We need to merge the original training data with the calculated stats on the SK_ID_CURR column which will insert NaN in any cell for which the client does not have the corresponding statistic

We also make frequent use of the rename function, passing in a dictionary that maps old column names to new ones. This is useful for keeping track of the new variables we create.
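
As a minimal toy sketch of these three operations (the dataframe and values below are made up purely for illustration and are not part of the competition data):

# Hypothetical toy data: clients 1 and 2 have previous loans, client 3 has none
toy_bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                           'AMT_CREDIT_SUM': [1000.0, 3000.0, 500.0]})
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# groupby + agg: one row per client with summary statistics of the loan amounts
toy_stats = toy_bureau.groupby('SK_ID_CURR').agg({'AMT_CREDIT_SUM': ['mean', 'max', 'min', 'sum']}).reset_index()

# Flatten and rename the columns so we can keep track of where they came from
toy_stats.columns = ['SK_ID_CURR', 'bureau_AMT_CREDIT_SUM_mean', 'bureau_AMT_CREDIT_SUM_max',
                     'bureau_AMT_CREDIT_SUM_min', 'bureau_AMT_CREDIT_SUM_sum']

# merge: client 3 has no previous loans, so its new columns are filled with NaN
toy_train.merge(toy_stats, on = 'SK_ID_CURR', how = 'left')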

This might seem like a lot, which is why we'll eventually write a function to do this process for us. Let's take a look at implementing this by hand first.

In [2]:
# Read in bureau
bureau = pd.read_csv('./input/bureau.csv')
bureau.head()
Out[2]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
In [3]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
Out[3]:
SK_ID_CURR previous_loan_counts
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
In [4]:
# Join to the training dataframe
train = pd.read_csv('./input/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')

# Fill the missing values with 0 
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
Out[4]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR previous_loan_counts
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 8.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 0.0
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

Scroll all the way to the right to see the new column.
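
Alternatively, instead of scrolling, we can pull out just the id and the new feature (a small convenience sketch):

# Look at only the client id and the new feature
train[['SK_ID_CURR', 'previous_loan_counts']].head()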

Assessing Usefulness of New Variable with r value

To determine if the new variable is useful, we can calculate the Pearson Correlation Coefficient (r-value) between this variable and the target. This measures the strength of a linear relationship between two variables and ranges from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship). The r-value is not the best measure of the "usefulness" of a new variable, but it can give a first approximation of whether a variable will be helpful to a machine learning model. The larger the r-value of a variable with respect to the target, the more a change in this variable is likely to affect the value of the target. Therefore, we look for the variables with the greatest absolute value of the r-value relative to the target.

We can also visually inspect a relationship with the target using the Kernel Density Estimate (KDE) plot.

Kernel Density Estimate Plots

The kernel density estimate plot shows the distribution of a single variable (think of it as a smoothed histogram). To see the difference in distributions depending on the value of a categorical variable, we can color the distributions according to the category. For example, we can show the kernel density estimate of previous_loan_counts colored by whether TARGET is 1 or 0. The resulting KDE will show any significant differences in the distribution of the variable between people who did not repay their loan (TARGET == 1) and the people who did (TARGET == 0). This can serve as an indicator of whether a variable will be 'relevant' to a machine learning model.

We will put this plotting functionality in a function to re-use for any variable.

In [5]:
# Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df):
    
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    
    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()
    
    plt.figure(figsize = (12, 6))
    
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    
    # label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend();
    
    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out average values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
    

We can test this function using the EXT_SOURCE_3 variable which we found to be one of the most important variables according to a Random Forest and Gradient Boosting Machine.

In [6]:
kde_target('EXT_SOURCE_3', train)
The correlation between EXT_SOURCE_3 and the TARGET is -0.1789
Median value for loan that was not repaid = 0.3791
Median value for loan that was repaid =     0.5460

Now for the new variable we just made, the number of previous loans at other institutions.

In [7]:
kde_target('previous_loan_counts', train)
The correlation between previous_loan_counts and the TARGET is -0.0100
Median value for loan that was not repaid = 3.0000
Median value for loan that was repaid =     4.0000

From this it's difficult to tell if this variable will be important. The correlation coefficient is extremely weak and there is almost no noticeable difference in the distributions.

Let's move on to make a few more variables from the bureau dataframe. We will take the mean, min, and max of every numeric column in the bureau dataframe.

Aggregating Numeric Columns

To account for the numeric information in the bureau dataframe, we can compute statistics for all the numeric columns. To do so, we groupby the client id, agg the grouped dataframe, and merge the result back into the training data. The agg function will only calculate the values for the numeric columns where the operation is considered valid. We will stick to using 'count', 'mean', 'max', 'min', and 'sum', but any function can be passed in here. We can even write our own function and use it in an agg call.
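
As a quick sketch of that last point, a custom aggregation can be mixed with the built-in string aggregations (the helper name range_ below is ours, purely for illustration):

# A custom aggregation: the range (max - min) of each numeric column per client
def range_(x):
    return x.max() - x.min()

# Restrict to numeric columns, then mix a built-in string agg with our own function
bureau_numeric = bureau.select_dtypes('number').drop(columns = ['SK_ID_BUREAU'])
bureau_numeric.groupby('SK_ID_CURR').agg(['mean', range_]).head()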

In [8]:
# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()
Out[8]:
SK_ID_CURR DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE AMT_ANNUITY
count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum count mean max min sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 0 7 82.428571 1778.0 -1329.0 577.0 4 -825.500000 -544.0 -1328.0 -3302.0 0 NaN NaN NaN 0.000 7 0.0 0 0 0 7 207623.571429 378000.0 85500.0 1453365.000 7 85240.928571 373239.0 0.0 596686.5 6 0.00000 0.000 0.0 0.000 7 0.0 0.0 0.0 0.0 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 0 6 -349.000000 780.0 -1072.0 -2094.0 6 -697.500000 -36.0 -1185.0 -4185.0 5 1681.029 5043.645 0.0 8405.145 8 0.0 0 0 0 8 108131.945625 450000.0 0.0 865055.565 5 49156.200000 245781.0 0.0 245781.0 4 7997.14125 31988.565 0.0 31988.565 8 0.0 0.0 0.0 0.0 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 0 4 -544.500000 1216.0 -2434.0 -2178.0 3 -1097.333333 -540.0 -2131.0 -3292.0 4 0.000 0.000 0.0 0.000 4 0.0 0 0 0 4 254350.125000 810000.0 22248.0 1017400.500 4 0.000000 0.0 0.0 0.0 4 202500.00000 810000.000 0.0 810000.000 4 0.0 0.0 0.0 0.0 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 0 2 -488.500000 -382.0 -595.0 -977.0 2 -532.500000 -382.0 -683.0 -1065.0 1 0.000 0.000 0.0 0.000 2 0.0 0 0 0 2 94518.900000 94537.8 94500.0 189037.800 2 0.000000 0.0 0.0 0.0 2 0.00000 0.000 0.0 0.000 2 0.0 0.0 0.0 0.0 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 0 3 439.333333 1324.0 -128.0 1318.0 1 -123.000000 -123.0 -123.0 -123.0 1 0.000 0.000 0.0 0.000 3 0.0 0 0 0 3 219042.000000 568800.0 29826.0 657126.000 3 189469.500000 543087.0 0.0 568408.5 3 0.00000 0.000 0.0 0.000 3 0.0 0.0 0.0 0.0 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

We need to create new names for each of these columns. The following code makes new names by appending the stat to the variable name. Here we have to deal with the fact that the dataframe has a multi-level column index. I find these confusing and hard to work with, so I try to reduce the columns to a single level index as quickly as possible.
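
As an aside, an equivalent and more compact sketch builds the same names by iterating directly over the column tuples rather than the index levels; this is only an alternative to the explicit loop below, not the approach used in this notebook:

# Alternative sketch: build the flat names straight from the (variable, stat) tuples.
# After reset_index() the first column is ('SK_ID_CURR', ''), so it is kept as-is.
flat_columns = ['SK_ID_CURR'] + ['bureau_%s_%s' % (var, stat)
                                 for var, stat in bureau_agg.columns[1:]]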

In [9]:
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variables names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))
In [10]:
# Assign the list of columns names as the dataframe column names
bureau_agg.columns = columns
bureau_agg.head()
Out[10]:
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 0 7 82.428571 1778.0 -1329.0 577.0 4 -825.500000 -544.0 -1328.0 -3302.0 0 NaN NaN NaN 0.000 7 0.0 0 0 0 7 207623.571429 378000.0 85500.0 1453365.000 7 85240.928571 373239.0 0.0 596686.5 6 0.00000 0.000 0.0 0.000 7 0.0 0.0 0.0 0.0 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 0 6 -349.000000 780.0 -1072.0 -2094.0 6 -697.500000 -36.0 -1185.0 -4185.0 5 1681.029 5043.645 0.0 8405.145 8 0.0 0 0 0 8 108131.945625 450000.0 0.0 865055.565 5 49156.200000 245781.0 0.0 245781.0 4 7997.14125 31988.565 0.0 31988.565 8 0.0 0.0 0.0 0.0 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 0 4 -544.500000 1216.0 -2434.0 -2178.0 3 -1097.333333 -540.0 -2131.0 -3292.0 4 0.000 0.000 0.0 0.000 4 0.0 0 0 0 4 254350.125000 810000.0 22248.0 1017400.500 4 0.000000 0.0 0.0 0.0 4 202500.00000 810000.000 0.0 810000.000 4 0.0 0.0 0.0 0.0 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 0 2 -488.500000 -382.0 -595.0 -977.0 2 -532.500000 -382.0 -683.0 -1065.0 1 0.000 0.000 0.0 0.000 2 0.0 0 0 0 2 94518.900000 94537.8 94500.0 189037.800 2 0.000000 0.0 0.0 0.0 2 0.00000 0.000 0.0 0.000 2 0.0 0.0 0.0 0.0 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 0 3 439.333333 1324.0 -128.0 1318.0 1 -123.000000 -123.0 -123.0 -123.0 1 0.000 0.000 0.0 0.000 3 0.0 0 0 0 3 219042.000000 568800.0 29826.0 657126.000 3 189469.500000 543087.0 0.0 568408.5 3 0.00000 0.000 0.0 0.000 3 0.0 0.0 0.0 0.0 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

Now we simply merge with the training data as we did before.

In [11]:
# Merge with the training data
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()
Out[11]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE ... DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR previous_loan_counts bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean 
bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 ... 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 8.0 8.0 -874.00 -103.0 -1437.0 -6992.0 8.0 0.0 0.0 0.0 0.0 6.0 -349.0 780.0 -1072.0 -2094.0 6.0 -697.500000 -36.0 -1185.0 -4185.0 5.0 1681.029 5043.645 0.0 8405.145 8.0 0.0 0.0 0.0 0.0 8.0 108131.945625 450000.0 0.0 865055.565 5.0 49156.2 245781.0 0.0 245781.0 4.0 7997.14125 31988.565 0.0 31988.565 8.0 0.0 0.0 0.0 0.0 8.0 -499.875 -7.0 -1185.0 -3999.0 7.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 ... 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 4.0 -1400.75 -606.0 -2586.0 -5603.0 4.0 0.0 0.0 0.0 0.0 4.0 -544.5 1216.0 -2434.0 -2178.0 3.0 -1097.333333 -540.0 -2131.0 -3292.0 4.0 0.000 0.000 0.0 0.000 4.0 0.0 0.0 0.0 0.0 4.0 254350.125000 810000.0 22248.0 1017400.500 4.0 0.0 0.0 0.0 0.0 4.0 202500.00000 810000.000 0.0 810000.000 4.0 0.0 0.0 0.0 0.0 4.0 -816.000 -43.0 -2131.0 -3264.0 0.0 NaN NaN NaN 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 2.0 -867.00 -408.0 -1326.0 -1734.0 2.0 0.0 0.0 0.0 0.0 2.0 -488.5 -382.0 -595.0 -977.0 2.0 -532.500000 -382.0 -683.0 -1065.0 1.0 0.000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 94518.900000 94537.8 94500.0 189037.800 2.0 0.0 0.0 0.0 0.0 2.0 0.00000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 -532.000 -382.0 -682.0 -1064.0 0.0 NaN NaN NaN 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 -1149.00 -1149.0 -1149.0 -1149.0 1.0 0.0 0.0 0.0 0.0 1.0 -783.0 -783.0 -783.0 -783.0 1.0 -783.000000 -783.0 -783.0 -783.0 1.0 0.000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 146250.000000 146250.0 146250.0 146250.000 1.0 0.0 0.0 0.0 0.0 1.0 0.00000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 -783.000 -783.0 -783.0 -783.0 0.0 NaN NaN NaN 0.0

Correlations of Aggregated Values with Target

We can calculate the correlation of all new values with the target. Again, we can use these as an approximation of the variables which may be important for modeling.

In [12]:
# List of new correlations
new_corrs = []

# Iterate through the columns 
for col in columns:
    # Calculate correlation with the target
    corr = train['TARGET'].corr(train[col])
    
    # Append the list as a tuple
    new_corrs.append((col, corr))

In the code below, we sort the correlations by the magnitude (absolute value) using the sorted Python function. We also make use of an anonymous lambda function, another important Python operation that is good to know.

In [13]:
# Sort the correlations by the absolute value
# Make sure to reverse to put the largest values at the front of list
new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True)
new_corrs[:15]
Out[13]:
[('bureau_DAYS_CREDIT_mean', 0.08972896721998114),
 ('bureau_DAYS_CREDIT_min', 0.0752482510301036),
 ('bureau_DAYS_CREDIT_UPDATE_mean', 0.06892735266968673),
 ('bureau_DAYS_ENDDATE_FACT_min', 0.05588737984392077),
 ('bureau_DAYS_CREDIT_ENDDATE_sum', 0.0537348956010205),
 ('bureau_DAYS_ENDDATE_FACT_mean', 0.05319962585758616),
 ('bureau_DAYS_CREDIT_max', 0.04978205463997299),
 ('bureau_DAYS_ENDDATE_FACT_sum', 0.048853502611115894),
 ('bureau_DAYS_CREDIT_ENDDATE_mean', 0.046982754334835494),
 ('bureau_DAYS_CREDIT_UPDATE_min', 0.042863922470730155),
 ('bureau_DAYS_CREDIT_sum', 0.041999824814846716),
 ('bureau_DAYS_CREDIT_UPDATE_sum', 0.04140363535306002),
 ('bureau_DAYS_CREDIT_ENDDATE_max', 0.036589634696329094),
 ('bureau_DAYS_CREDIT_ENDDATE_min', 0.034281109921616024),
 ('bureau_DAYS_ENDDATE_FACT_count', -0.030492306653325495)]

None of the new variables have a significant correlation with the TARGET. We can look at the KDE plot of the variable with the largest correlation in terms of absolute magnitude, bureau_DAYS_CREDIT_mean.

In [14]:
kde_target('bureau_DAYS_CREDIT_mean', train)
The correlation between bureau_DAYS_CREDIT_mean and the TARGET is 0.0897
Median value for loan that was not repaid = -835.3333
Median value for loan that was repaid =     -1067.0000

The definition of this column is: "How many days before current application did client apply for Credit Bureau credit". My interpretation is that this records, as a negative number of days, how long before the current Home Credit application the previous loan was applied for. Therefore, a more negative number indicates the previous loan was applied for further before the current loan application. We see an extremely weak positive relationship between the average of this variable and the target, meaning that clients whose previous loans were applied for further in the past are potentially slightly more likely to repay loans at Home Credit. With a correlation this weak though, it is just as likely to be noise as signal.

The Multiple Comparisons Problem

When we have lots of variables, we expect some of them to be correlated just by pure chance, a problem known as multiple comparisons. We can make hundreds of features, and some will turn out to be correlated with the target simply because of random noise in the data. Then, when our model trains, it may overfit to these variables because it thinks they have a relationship with the target in the training set, even though this does not necessarily generalize to the test set. There are many considerations that we have to take into account when making features!
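
To make this concrete, here is a small hedged sketch: features made of pure random noise, which by construction carry no information about the target, can still show small non-zero correlations just by chance (the noise_ column names are arbitrary):

# Sketch: correlations of pure-noise features with the target.
# With enough random columns, a few will look (weakly) "correlated" by chance alone.
np.random.seed(42)
random_features = pd.DataFrame(np.random.randn(len(train), 100),
                               columns = ['noise_%d' % i for i in range(100)])
random_features.corrwith(train['TARGET']).abs().sort_values(ascending = False).head()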

Function for Numeric Aggregations

Let's encapsulate all of the previous work into a function. This will allow us to compute aggregate stats for numeric columns across any dataframe. We will re-use this function when we want to apply the same operations for other dataframes.

In [15]:
def agg_numeric(df, group_var, df_name):
    """Aggregates the numeric values in a dataframe. This can
    be used to create features for each instance of the grouping variable.
    
    Parameters
    --------
        df (dataframe): 
            the dataframe to calculate the statistics on
        group_var (string): 
            the variable by which to group df
        df_name (string): 
            the variable used to rename the columns
        
    Return
    --------
        agg (dataframe): 
            a dataframe with the statistics aggregated for 
            all numeric columns. Each instance of the grouping variable will have 
            the statistics (count, mean, max, min, sum; currently supported) calculated. 
            The columns are also renamed to keep track of features created.
    
    """
    # Remove id variables other than grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)
            
    group_ids = df[group_var]
    numeric_df = df.select_dtypes('number')
    numeric_df[group_var] = group_ids

    # Group by the specified variable and calculate the statistics
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()

    # Need to create new column names
    columns = [group_var]

    # Iterate through the variables names
    for var in agg.columns.levels[0]:
        # Skip the grouping variable
        if var != group_var:
            # Iterate through the stat names
            for stat in agg.columns.levels[1][:-1]:
                # Make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))

    agg.columns = columns
    return agg
In [16]:
bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg_new.head()
Out[16]:
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 0 7 82.428571 1778.0 -1329.0 577.0 4 -825.500000 -544.0 -1328.0 -3302.0 0 NaN NaN NaN 0.000 7 0.0 0 0 0 7 207623.571429 378000.0 85500.0 1453365.000 7 85240.928571 373239.0 0.0 596686.5 6 0.00000 0.000 0.0 0.000 7 0.0 0.0 0.0 0.0 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 0 6 -349.000000 780.0 -1072.0 -2094.0 6 -697.500000 -36.0 -1185.0 -4185.0 5 1681.029 5043.645 0.0 8405.145 8 0.0 0 0 0 8 108131.945625 450000.0 0.0 865055.565 5 49156.200000 245781.0 0.0 245781.0 4 7997.14125 31988.565 0.0 31988.565 8 0.0 0.0 0.0 0.0 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 0 4 -544.500000 1216.0 -2434.0 -2178.0 3 -1097.333333 -540.0 -2131.0 -3292.0 4 0.000 0.000 0.0 0.000 4 0.0 0 0 0 4 254350.125000 810000.0 22248.0 1017400.500 4 0.000000 0.0 0.0 0.0 4 202500.00000 810000.000 0.0 810000.000 4 0.0 0.0 0.0 0.0 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 0 2 -488.500000 -382.0 -595.0 -977.0 2 -532.500000 -382.0 -683.0 -1065.0 1 0.000 0.000 0.0 0.000 2 0.0 0 0 0 2 94518.900000 94537.8 94500.0 189037.800 2 0.000000 0.0 0.0 0.0 2 0.00000 0.000 0.0 0.000 2 0.0 0.0 0.0 0.0 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 0 3 439.333333 1324.0 -128.0 1318.0 1 -123.000000 -123.0 -123.0 -123.0 1 0.000 0.000 0.0 0.000 3 0.0 0 0 0 3 219042.000000 568800.0 29826.0 657126.000 3 189469.500000 543087.0 0.0 568408.5 3 0.00000 0.000 0.0 0.000 3 0.0 0.0 0.0 0.0 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

To make sure the function worked as intended, we should compare with the aggregated dataframe we constructed by hand.

In [17]:
bureau_agg.head()
Out[17]:
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 0 7 82.428571 1778.0 -1329.0 577.0 4 -825.500000 -544.0 -1328.0 -3302.0 0 NaN NaN NaN 0.000 7 0.0 0 0 0 7 207623.571429 378000.0 85500.0 1453365.000 7 85240.928571 373239.0 0.0 596686.5 6 0.00000 0.000 0.0 0.000 7 0.0 0.0 0.0 0.0 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 0 6 -349.000000 780.0 -1072.0 -2094.0 6 -697.500000 -36.0 -1185.0 -4185.0 5 1681.029 5043.645 0.0 8405.145 8 0.0 0 0 0 8 108131.945625 450000.0 0.0 865055.565 5 49156.200000 245781.0 0.0 245781.0 4 7997.14125 31988.565 0.0 31988.565 8 0.0 0.0 0.0 0.0 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 0 4 -544.500000 1216.0 -2434.0 -2178.0 3 -1097.333333 -540.0 -2131.0 -3292.0 4 0.000 0.000 0.0 0.000 4 0.0 0 0 0 4 254350.125000 810000.0 22248.0 1017400.500 4 0.000000 0.0 0.0 0.0 4 202500.00000 810000.000 0.0 810000.000 4 0.0 0.0 0.0 0.0 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 0 2 -488.500000 -382.0 -595.0 -977.0 2 -532.500000 -382.0 -683.0 -1065.0 1 0.000 0.000 0.0 0.000 2 0.0 0 0 0 2 94518.900000 94537.8 94500.0 189037.800 2 0.000000 0.0 0.0 0.0 2 0.00000 0.000 0.0 0.000 2 0.0 0.0 0.0 0.0 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 0 3 439.333333 1324.0 -128.0 1318.0 1 -123.000000 -123.0 -123.0 -123.0 1 0.000 0.000 0.0 0.000 3 0.0 0 0 0 3 219042.000000 568800.0 29826.0 657126.000 3 189469.500000 543087.0 0.0 568408.5 3 0.00000 0.000 0.0 0.000 3 0.0 0.0 0.0 0.0 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

If we go through and inspect the values, we do find that they are equivalent. We will be able to reuse this function for calculating numeric stats for other dataframes. Using functions allows for consistent results and decreases the amount of work we have to do in the future!
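
Rather than eyeballing the values, a quick programmatic sanity check is possible (a sketch; if the column order or dtypes differ slightly, the last comparison can be False even when the numbers agree):

# Sanity check: the hand-built and function-built aggregations should match
print('Same shape:   ', bureau_agg.shape == bureau_agg_new.shape)
print('Same columns: ', list(bureau_agg.columns) == list(bureau_agg_new.columns))
print('Same contents:', bureau_agg.equals(bureau_agg_new))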

Correlation Function

Before we move on, we can also make the code to calculate correlations with the target into a function.

In [18]:
# Function to calculate correlations with the target for a dataframe
def target_corrs(df):

    # List of correlations
    corrs = []

    # Iterate through the columns 
    for col in df.columns:
        print(col)
        # Skip the target column
        if col != 'TARGET':
            # Calculate correlation with the target
            corr = df['TARGET'].corr(df[col])

            # Append the list as a tuple
            corrs.append((col, corr))
            
    # Sort by absolute magnitude of correlations
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)
    
    return corrs
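
As a quick usage sketch, we could apply this function to a small slice of the training data rather than all of the columns at once (the column subset here is arbitrary):

# Correlations of a couple of columns with the target
target_corrs(train[['TARGET', 'EXT_SOURCE_3', 'previous_loan_counts']])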

Categorical Variables

Now we move from the numeric columns to the categorical columns. These are discrete string variables, so we cannot just calculate statistics such as mean and max which only work with numeric variables. Instead, we will rely on calculating value counts of each category within each categorical variable. As an example, if we have the following dataframe:

SK_ID_CURR Loan type
1 home
1 home
1 home
1 credit
2 credit
3 credit
3 cash
3 cash
4 credit
4 home
4 home

we will use this information by counting the number of loans in each category for each client.

SK_ID_CURR credit count cash count home count total count
1 1 0 3 4
2 1 0 0 1
3 1 2 0 3
4 1 0 2 3

Then we can normalize these value counts by the total number of occurrences of that categorical variable for that observation (meaning that the normalized counts must sum to 1.0 for each observation).

SK_ID_CURR credit count cash count home count total count credit count norm cash count norm home count norm
1 1 0 3 4 0.25 0 0.75
2 1 0 0 1 1.00 0 0
3 1 2 0 3 0.33 0.66 0
4 1 0 2 3 0.33 0 0.66

Hopefully, encoding the categorical variables this way will allow us to capture the information they contain. We will now go through this process step-by-step. At the end, we will wrap up all the code into one function to be re-used for many dataframes.
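
Before working with the real data, here is a hedged toy sketch that reproduces the small example tables above: after one-hot encoding, a groupby sum gives the counts and a groupby mean gives the normalized counts (the toy dataframe below is made up for illustration):

# Toy data mirroring the example tables above
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4],
                    'Loan type': ['home', 'home', 'home', 'credit', 'credit',
                                  'credit', 'cash', 'cash', 'credit', 'home', 'home']})

# One-hot encode the loan type, then aggregate: sum = counts, mean = normalized counts
pd.get_dummies(toy, columns = ['Loan type']).groupby('SK_ID_CURR').agg(['sum', 'mean'])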

First we one-hot encode a dataframe with only the categorical columns (dtype == 'object').

In [19]:
categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()
Out[19]:
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan CREDIT_TYPE_Cash loan (non-earmarked) CREDIT_TYPE_Consumer credit CREDIT_TYPE_Credit card CREDIT_TYPE_Interbank credit CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan SK_ID_CURR
0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 215354
1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 215354
2 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 215354
3 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 215354
4 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 215354
In [20]:
categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()
Out[20]:
CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_CURRENCY_currency 1 CREDIT_CURRENCY_currency 2 CREDIT_CURRENCY_currency 3 CREDIT_CURRENCY_currency 4 CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan CREDIT_TYPE_Cash loan (non-earmarked) CREDIT_TYPE_Consumer credit CREDIT_TYPE_Credit card CREDIT_TYPE_Interbank credit CREDIT_TYPE_Loan for business development CREDIT_TYPE_Loan for purchase of shares (margin lending) CREDIT_TYPE_Loan for the purchase of equipment CREDIT_TYPE_Loan for working capital replenishment CREDIT_TYPE_Microloan CREDIT_TYPE_Mobile operator loan CREDIT_TYPE_Mortgage CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan
sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean sum mean
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 7 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 4 0.500000 4 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.500000 2 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.666667 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

The sum columns represent the count of that category for the associated client and the mean represents the normalized count. One-hot encoding makes the process of calculating these figures very easy!

We can use a similar approach as before to rename the columns. Again, we have to deal with the multi-level index for the columns. We iterate through the first level (level 0), which is the name of the categorical variable appended with the value of the category (from one-hot encoding). Then we iterate through the stats we calculated for each client. We will rename the column with the level 0 name appended with the stat. As an example, the column with CREDIT_ACTIVE_Active as level 0 and sum as level 1 will become CREDIT_ACTIVE_Active_count.

In [21]:
categorical_grouped.columns.levels[0][:10]
Out[21]:
Index(['CREDIT_ACTIVE_Active', 'CREDIT_ACTIVE_Bad debt',
       'CREDIT_ACTIVE_Closed', 'CREDIT_ACTIVE_Sold',
       'CREDIT_CURRENCY_currency 1', 'CREDIT_CURRENCY_currency 2',
       'CREDIT_CURRENCY_currency 3', 'CREDIT_CURRENCY_currency 4',
       'CREDIT_TYPE_Another type of loan', 'CREDIT_TYPE_Car loan'],
      dtype='object')
In [22]:
categorical_grouped.columns.levels[1]
Out[22]:
Index(['sum', 'mean'], dtype='object')
In [23]:
group_var = 'SK_ID_CURR'

# Need to create new column names
columns = []

# Iterate through the variables names
for var in categorical_grouped.columns.levels[0]:
    # Skip the grouping variable
    if var != group_var:
        # Iterate through the stat names
        for stat in ['count', 'count_norm']:
            # Make a new column name for the variable and stat
            columns.append('%s_%s' % (var, stat))

#  Rename the columns
categorical_grouped.columns = columns

categorical_grouped.head()
Out[23]:
CREDIT_ACTIVE_Active_count CREDIT_ACTIVE_Active_count_norm CREDIT_ACTIVE_Bad debt_count CREDIT_ACTIVE_Bad debt_count_norm CREDIT_ACTIVE_Closed_count CREDIT_ACTIVE_Closed_count_norm CREDIT_ACTIVE_Sold_count CREDIT_ACTIVE_Sold_count_norm CREDIT_CURRENCY_currency 1_count CREDIT_CURRENCY_currency 1_count_norm CREDIT_CURRENCY_currency 2_count CREDIT_CURRENCY_currency 2_count_norm CREDIT_CURRENCY_currency 3_count CREDIT_CURRENCY_currency 3_count_norm CREDIT_CURRENCY_currency 4_count CREDIT_CURRENCY_currency 4_count_norm CREDIT_TYPE_Another type of loan_count CREDIT_TYPE_Another type of loan_count_norm CREDIT_TYPE_Car loan_count CREDIT_TYPE_Car loan_count_norm CREDIT_TYPE_Cash loan (non-earmarked)_count CREDIT_TYPE_Cash loan (non-earmarked)_count_norm CREDIT_TYPE_Consumer credit_count CREDIT_TYPE_Consumer credit_count_norm CREDIT_TYPE_Credit card_count CREDIT_TYPE_Credit card_count_norm CREDIT_TYPE_Interbank credit_count CREDIT_TYPE_Interbank credit_count_norm CREDIT_TYPE_Loan for business development_count CREDIT_TYPE_Loan for business development_count_norm CREDIT_TYPE_Loan for purchase of shares (margin lending)_count CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm CREDIT_TYPE_Loan for the purchase of equipment_count CREDIT_TYPE_Loan for the purchase of equipment_count_norm CREDIT_TYPE_Loan for working capital replenishment_count CREDIT_TYPE_Loan for working capital replenishment_count_norm CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 7 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 4 0.500000 4 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.500000 2 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.666667 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

After renaming, the _count columns record the counts and the _count_norm columns record the normalized counts.

We can merge this dataframe into the training data.

In [24]:
train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR', right_index = True, how = 'left')
train.head()
Out[24]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE ... bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum CREDIT_ACTIVE_Active_count CREDIT_ACTIVE_Active_count_norm CREDIT_ACTIVE_Bad debt_count CREDIT_ACTIVE_Bad debt_count_norm CREDIT_ACTIVE_Closed_count CREDIT_ACTIVE_Closed_count_norm CREDIT_ACTIVE_Sold_count CREDIT_ACTIVE_Sold_count_norm CREDIT_CURRENCY_currency 1_count CREDIT_CURRENCY_currency 1_count_norm CREDIT_CURRENCY_currency 2_count CREDIT_CURRENCY_currency 2_count_norm CREDIT_CURRENCY_currency 3_count CREDIT_CURRENCY_currency 3_count_norm CREDIT_CURRENCY_currency 4_count CREDIT_CURRENCY_currency 4_count_norm CREDIT_TYPE_Another type of loan_count CREDIT_TYPE_Another type of loan_count_norm CREDIT_TYPE_Car loan_count CREDIT_TYPE_Car loan_count_norm CREDIT_TYPE_Cash loan 
(non-earmarked)_count CREDIT_TYPE_Cash loan (non-earmarked)_count_norm CREDIT_TYPE_Consumer credit_count CREDIT_TYPE_Consumer credit_count_norm CREDIT_TYPE_Credit card_count CREDIT_TYPE_Credit card_count_norm CREDIT_TYPE_Interbank credit_count CREDIT_TYPE_Interbank credit_count_norm CREDIT_TYPE_Loan for business development_count CREDIT_TYPE_Loan for business development_count_norm CREDIT_TYPE_Loan for purchase of shares (margin lending)_count CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm CREDIT_TYPE_Loan for the purchase of equipment_count CREDIT_TYPE_Loan for the purchase of equipment_count_norm CREDIT_TYPE_Loan for working capital replenishment_count CREDIT_TYPE_Loan for working capital replenishment_count_norm CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count CREDIT_TYPE_Unknown type of loan_count_norm
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 ... -36.0 -1185.0 -4185.0 5.0 1681.029 5043.645 0.0 8405.145 8.0 0.0 0.0 0.0 0.0 8.0 108131.945625 450000.0 0.0 865055.565 5.0 49156.2 245781.0 0.0 245781.0 4.0 7997.14125 31988.565 0.0 31988.565 8.0 0.0 0.0 0.0 0.0 8.0 -499.875 -7.0 -1185.0 -3999.0 7.0 0.0 0.0 0.0 0.0 2.0 0.25 0.0 0.0 6.0 0.75 0.0 0.0 8.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.5 4.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 ... -540.0 -2131.0 -3292.0 4.0 0.000 0.000 0.0 0.000 4.0 0.0 0.0 0.0 0.0 4.0 254350.125000 810000.0 22248.0 1017400.500 4.0 0.0 0.0 0.0 0.0 4.0 202500.00000 810000.000 0.0 810000.000 4.0 0.0 0.0 0.0 0.0 4.0 -816.000 -43.0 -2131.0 -3264.0 0.0 NaN NaN NaN 0.0 1.0 0.25 0.0 0.0 3.0 0.75 0.0 0.0 4.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.5 2.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -382.0 -683.0 -1065.0 1.0 0.000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 94518.900000 94537.8 94500.0 189037.800 2.0 0.0 0.0 0.0 0.0 2.0 0.00000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 -532.000 -382.0 -682.0 -1064.0 0.0 NaN NaN NaN 0.0 0.0 0.00 0.0 0.0 2.0 1.00 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -783.0 -783.0 -783.0 1.0 0.000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 146250.000000 146250.0 146250.0 146250.000 1.0 0.0 0.0 0.0 0.0 1.0 0.00000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 -783.000 -783.0 -783.0 -783.0 0.0 NaN NaN NaN 0.0 0.0 0.00 0.0 0.0 1.0 1.00 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [25]:
train.shape
Out[25]:
(307511, 229)
In [26]:
train.iloc[:10, 123:]
Out[26]:
bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum CREDIT_ACTIVE_Active_count CREDIT_ACTIVE_Active_count_norm CREDIT_ACTIVE_Bad debt_count CREDIT_ACTIVE_Bad debt_count_norm CREDIT_ACTIVE_Closed_count CREDIT_ACTIVE_Closed_count_norm CREDIT_ACTIVE_Sold_count CREDIT_ACTIVE_Sold_count_norm CREDIT_CURRENCY_currency 1_count CREDIT_CURRENCY_currency 1_count_norm CREDIT_CURRENCY_currency 2_count CREDIT_CURRENCY_currency 2_count_norm CREDIT_CURRENCY_currency 3_count CREDIT_CURRENCY_currency 3_count_norm CREDIT_CURRENCY_currency 4_count CREDIT_CURRENCY_currency 4_count_norm CREDIT_TYPE_Another type of loan_count CREDIT_TYPE_Another type of loan_count_norm CREDIT_TYPE_Car loan_count CREDIT_TYPE_Car loan_count_norm CREDIT_TYPE_Cash loan (non-earmarked)_count CREDIT_TYPE_Cash loan (non-earmarked)_count_norm CREDIT_TYPE_Consumer credit_count CREDIT_TYPE_Consumer credit_count_norm CREDIT_TYPE_Credit card_count CREDIT_TYPE_Credit card_count_norm CREDIT_TYPE_Interbank credit_count CREDIT_TYPE_Interbank credit_count_norm CREDIT_TYPE_Loan for business development_count CREDIT_TYPE_Loan for business development_count_norm CREDIT_TYPE_Loan for purchase of shares (margin lending)_count CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm CREDIT_TYPE_Loan for the purchase of equipment_count CREDIT_TYPE_Loan for the purchase of equipment_count_norm CREDIT_TYPE_Loan for working capital replenishment_count CREDIT_TYPE_Loan for working capital replenishment_count_norm CREDIT_TYPE_Microloan_count CREDIT_TYPE_Microloan_count_norm CREDIT_TYPE_Mobile operator loan_count CREDIT_TYPE_Mobile operator loan_count_norm CREDIT_TYPE_Mortgage_count CREDIT_TYPE_Mortgage_count_norm CREDIT_TYPE_Real estate loan_count CREDIT_TYPE_Real estate loan_count_norm CREDIT_TYPE_Unknown type of loan_count 
CREDIT_TYPE_Unknown type of loan_count_norm
0 8.0 -874.000000 -103.0 -1437.0 -6992.0 8.0 0.0 0.0 0.0 0.0 6.0 -349.000000 780.0 -1072.0 -2094.0 6.0 -697.500000 -36.0 -1185.0 -4185.0 5.0 1681.029 5043.645 0.0 8405.145 8.0 0.0 0.0 0.0 0.0 8.0 108131.945625 450000.0 0.0 865055.565 5.0 49156.200000 245781.0 0.0 245781.0 4.0 7997.14125 31988.565 0.0 31988.565 8.0 0.0 0.0 0.0 0.0 8.0 -499.875000 -7.0 -1185.0 -3999.0 7.0 0.0 0.0 0.0 0.0 2.0 0.250000 0.0 0.0 6.0 0.750000 0.0 0.0 8.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0 0.500000 4.0 0.500000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 4.0 -1400.750000 -606.0 -2586.0 -5603.0 4.0 0.0 0.0 0.0 0.0 4.0 -544.500000 1216.0 -2434.0 -2178.0 3.0 -1097.333333 -540.0 -2131.0 -3292.0 4.0 0.000 0.000 0.0 0.000 4.0 0.0 0.0 0.0 0.0 4.0 254350.125000 810000.0 22248.0 1017400.500 4.0 0.000000 0.0 0.0 0.0 4.0 202500.00000 810000.000 0.0 810000.000 4.0 0.0 0.0 0.0 0.0 4.0 -816.000000 -43.0 -2131.0 -3264.0 0.0 NaN NaN NaN 0.0 1.0 0.250000 0.0 0.0 3.0 0.750000 0.0 0.0 4.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.500000 2.0 0.500000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2.0 -867.000000 -408.0 -1326.0 -1734.0 2.0 0.0 0.0 0.0 0.0 2.0 -488.500000 -382.0 -595.0 -977.0 2.0 -532.500000 -382.0 -683.0 -1065.0 1.0 0.000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 94518.900000 94537.8 94500.0 189037.800 2.0 0.000000 0.0 0.0 0.0 2.0 0.00000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 -532.000000 -382.0 -682.0 -1064.0 0.0 NaN NaN NaN 0.0 0.0 0.000000 0.0 0.0 2.0 1.000000 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 -1149.000000 -1149.0 -1149.0 -1149.0 1.0 0.0 0.0 0.0 0.0 1.0 -783.000000 -783.0 -783.0 -783.0 1.0 -783.000000 -783.0 -783.0 -783.0 1.0 0.000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 146250.000000 146250.0 146250.0 146250.000 1.0 0.000000 0.0 0.0 0.0 1.0 0.00000 0.000 0.0 0.000 1.0 0.0 0.0 0.0 0.0 1.0 -783.000000 -783.0 -783.0 -783.0 0.0 NaN NaN NaN 0.0 0.0 0.000000 0.0 0.0 1.0 1.000000 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 3.0 -757.333333 -78.0 -1097.0 -2272.0 3.0 0.0 0.0 0.0 0.0 3.0 -391.333333 471.0 -853.0 -1174.0 2.0 -909.000000 -790.0 -1028.0 -1818.0 1.0 0.000 0.000 0.0 0.000 3.0 0.0 0.0 0.0 0.0 3.0 156148.500000 267606.0 95134.5 468445.500 3.0 80019.000000 240057.0 0.0 240057.0 3.0 0.00000 0.000 0.0 0.000 3.0 0.0 0.0 0.0 0.0 3.0 -611.000000 -16.0 -1027.0 -1833.0 0.0 NaN NaN NaN 0.0 1.0 0.333333 0.0 0.0 2.0 0.666667 0.0 0.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 18.0 -1271.500000 -239.0 -2882.0 -22887.0 18.0 0.0 0.0 0.0 0.0 16.0 -794.937500 1402.0 -2152.0 -12719.0 14.0 -1108.500000 -313.0 -2152.0 -15519.0 4.0 0.000 0.000 0.0 0.000 18.0 0.0 0.0 0.0 0.0 18.0 266711.750000 1777500.0 35770.5 4800811.500 14.0 76953.535714 557959.5 0.0 1077349.5 11.0 0.00000 0.000 0.0 0.000 18.0 0.0 0.0 0.0 0.0 18.0 -851.611111 -23.0 -2152.0 -15329.0 0.0 NaN NaN NaN 0.0 4.0 0.222222 0.0 0.0 14.0 0.777778 0.0 0.0 18.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.0 0.888889 2.0 0.111111 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 2.0 -1939.500000 -1138.0 -2741.0 -3879.0 2.0 0.0 0.0 0.0 0.0 2.0 -119.500000 689.0 -928.0 -239.0 1.0 -1138.000000 -1138.0 -1138.0 -1138.0 0.0 NaN NaN NaN 0.000 2.0 0.0 0.0 0.0 0.0 2.0 495000.000000 675000.0 315000.0 990000.000 2.0 174003.750000 348007.5 0.0 348007.5 1.0 0.00000 0.000 0.0 0.000 2.0 0.0 0.0 0.0 0.0 2.0 -578.000000 -18.0 -1138.0 -1156.0 0.0 NaN NaN NaN 0.0 1.0 0.500000 0.0 0.0 1.0 0.500000 0.0 0.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.500000 0.0 0.000000 0.0 0.0 1.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 4.0 -1773.000000 -1309.0 -2508.0 -7092.0 4.0 0.0 0.0 0.0 0.0 4.0 -1293.250000 -860.0 -2173.0 -5173.0 4.0 -1463.250000 -968.0 -2197.0 -5853.0 2.0 5073.615 10147.230 0.0 10147.230 4.0 0.0 0.0 0.0 0.0 4.0 108807.075000 145242.0 54000.0 435228.300 3.0 0.000000 0.0 0.0 0.0 3.0 0.00000 0.000 0.0 0.000 4.0 0.0 0.0 0.0 0.0 4.0 -1454.750000 -965.0 -2172.0 -5819.0 0.0 NaN NaN NaN 0.0 0.0 0.000000 0.0 0.0 4.0 1.000000 0.0 0.0 4.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.750000 1.0 0.250000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Function to Handle Categorical Variables

To make the code more reusable, we can now write a function that handles the categorical variables for us. It takes the same form as the agg_numeric function in that it accepts a dataframe and a grouping variable. It then calculates the counts and normalized counts of each category for every categorical variable in the dataframe.

In [27]:
def count_categorical(df, group_var, df_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable
    
    Parameters
    --------
    df : dataframe 
        The dataframe to calculate the value counts for.
        
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row
        
    df_name : string
        Variable added to the front of column names to keep track of columns

    
    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category in every categorical variable
        with one row for every unique value of the `group_var`.
        
    """
    
    # Select the categorical columns
    categorical = pd.get_dummies(df.select_dtypes('object'))

    # Make sure to put the identifying id on the column
    categorical[group_var] = df[group_var]

    # Groupby the group var and calculate the sum and mean
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])
    
    column_names = []
    
    # Iterate through the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate through the stats in level 1
        for stat in ['count', 'count_norm']:
            # Make a new column name
            column_names.append('%s_%s_%s' % (df_name, var, stat))
    
    categorical.columns = column_names
    
    return categorical
In [28]:
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
Out[28]:
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm bureau_CREDIT_CURRENCY_currency 2_count bureau_CREDIT_CURRENCY_currency 2_count_norm bureau_CREDIT_CURRENCY_currency 3_count bureau_CREDIT_CURRENCY_currency 3_count_norm bureau_CREDIT_CURRENCY_currency 4_count bureau_CREDIT_CURRENCY_currency 4_count_norm bureau_CREDIT_TYPE_Another type of loan_count bureau_CREDIT_TYPE_Another type of loan_count_norm bureau_CREDIT_TYPE_Car loan_count bureau_CREDIT_TYPE_Car loan_count_norm bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm bureau_CREDIT_TYPE_Consumer credit_count bureau_CREDIT_TYPE_Consumer credit_count_norm bureau_CREDIT_TYPE_Credit card_count bureau_CREDIT_TYPE_Credit card_count_norm bureau_CREDIT_TYPE_Interbank credit_count bureau_CREDIT_TYPE_Interbank credit_count_norm bureau_CREDIT_TYPE_Loan for business development_count bureau_CREDIT_TYPE_Loan for business development_count_norm bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm bureau_CREDIT_TYPE_Loan for the purchase of equipment_count bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm bureau_CREDIT_TYPE_Loan for working capital replenishment_count bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 7 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 4 0.500000 4 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.500000 2 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.666667 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

Applying Operations to Another Dataframe

We will now turn to the bureau balance dataframe. This dataframe has monthly information about each client's previous loan(s) with other financial institutions. Instead of grouping this dataframe by the SK_ID_CURR which is the client id, we will first group the dataframe by the SK_ID_BUREAU which is the id of the previous loan. This will give us one row of the dataframe for each loan. Then, we can group by the SK_ID_CURR and calculate the aggregations across the loans of each client. The final result will be a dataframe with one row for each client, with stats calculated for their loans.

In [29]:
# Read in bureau balance
bureau_balance = pd.read_csv('./input/bureau_balance.csv')
bureau_balance.head()
Out[29]:
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C

First, we can calculate the value counts of each status for each loan. Fortunately, we already have a function that does this for us!

In [30]:
# Counts of each type of status for each previous loan
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()
Out[30]:
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

Now we can handle the one numeric column. The MONTHS_BALANCE column records the "months of balance relative to application date." This might not necessarily be that important as a numeric variable, and in future work we might want to treat it as a time variable. For now, we can just calculate the same aggregation statistics as previously.
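Since agg_numeric was written earlier in the notebook, its definition is not repeated here. As a minimal sketch of roughly what it does (the name agg_numeric_sketch and the body below are illustrative rather than the exact earlier implementation, and rely on the pandas import from the top of the notebook):

def agg_numeric_sketch(df, group_var, df_name):
    # Group the numeric columns by `group_var` and compute five summary statistics
    agg = df.select_dtypes('number').groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum'])

    # Flatten the multi-level columns into `{df_name}_{column}_{stat}` names
    agg.columns = ['%s_%s_%s' % (df_name, var, stat) for var, stat in agg.columns]

    return agg.reset_index()

Called with group_var = 'SK_ID_BUREAU' and df_name = 'bureau_balance', something along these lines produces one row per previous loan with columns such as bureau_balance_MONTHS_BALANCE_mean, matching the output below.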

In [31]:
# Calculate aggregation statistics for each previous loan (`SK_ID_BUREAU`)
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()
Out[31]:
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231

The above dataframes have the calculations done on each loan. Now we need to aggregate these for each client. We can do this by first merging the dataframes together. Then, since all of the variables are numeric, we just need to aggregate the statistics again, this time grouping by SK_ID_CURR.

In [32]:
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], on = 'SK_ID_BUREAU', how = 'left')

bureau_by_loan.head()
Out[32]:
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm SK_ID_CURR
0 5001709 97 -48.0 0 -96 -4656 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402 NaN
1 5001710 83 -41.0 0 -82 -3403 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446 162368.0
2 5001711 4 -1.5 0 -3 -6 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000 162368.0
3 5001712 19 -9.0 0 -18 -171 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000 162368.0
4 5001713 22 -10.5 0 -21 -231 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000 150635.0
In [33]:
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()
Out[33]:
SK_ID_CURR client_bureau_balance_MONTHS_BALANCE_count_count client_bureau_balance_MONTHS_BALANCE_count_mean client_bureau_balance_MONTHS_BALANCE_count_max client_bureau_balance_MONTHS_BALANCE_count_min client_bureau_balance_MONTHS_BALANCE_count_sum client_bureau_balance_MONTHS_BALANCE_mean_count client_bureau_balance_MONTHS_BALANCE_mean_mean client_bureau_balance_MONTHS_BALANCE_mean_max client_bureau_balance_MONTHS_BALANCE_mean_min client_bureau_balance_MONTHS_BALANCE_mean_sum client_bureau_balance_MONTHS_BALANCE_max_count client_bureau_balance_MONTHS_BALANCE_max_mean client_bureau_balance_MONTHS_BALANCE_max_max client_bureau_balance_MONTHS_BALANCE_max_min client_bureau_balance_MONTHS_BALANCE_max_sum client_bureau_balance_MONTHS_BALANCE_min_count client_bureau_balance_MONTHS_BALANCE_min_mean client_bureau_balance_MONTHS_BALANCE_min_max client_bureau_balance_MONTHS_BALANCE_min_min client_bureau_balance_MONTHS_BALANCE_min_sum client_bureau_balance_MONTHS_BALANCE_sum_count client_bureau_balance_MONTHS_BALANCE_sum_mean client_bureau_balance_MONTHS_BALANCE_sum_max client_bureau_balance_MONTHS_BALANCE_sum_min client_bureau_balance_MONTHS_BALANCE_sum_sum client_bureau_balance_STATUS_0_count_count client_bureau_balance_STATUS_0_count_mean client_bureau_balance_STATUS_0_count_max client_bureau_balance_STATUS_0_count_min client_bureau_balance_STATUS_0_count_sum client_bureau_balance_STATUS_0_count_norm_count client_bureau_balance_STATUS_0_count_norm_mean client_bureau_balance_STATUS_0_count_norm_max client_bureau_balance_STATUS_0_count_norm_min client_bureau_balance_STATUS_0_count_norm_sum client_bureau_balance_STATUS_1_count_count client_bureau_balance_STATUS_1_count_mean client_bureau_balance_STATUS_1_count_max client_bureau_balance_STATUS_1_count_min client_bureau_balance_STATUS_1_count_sum client_bureau_balance_STATUS_1_count_norm_count client_bureau_balance_STATUS_1_count_norm_mean client_bureau_balance_STATUS_1_count_norm_max client_bureau_balance_STATUS_1_count_norm_min client_bureau_balance_STATUS_1_count_norm_sum client_bureau_balance_STATUS_2_count_count client_bureau_balance_STATUS_2_count_mean client_bureau_balance_STATUS_2_count_max client_bureau_balance_STATUS_2_count_min client_bureau_balance_STATUS_2_count_sum client_bureau_balance_STATUS_2_count_norm_count client_bureau_balance_STATUS_2_count_norm_mean client_bureau_balance_STATUS_2_count_norm_max client_bureau_balance_STATUS_2_count_norm_min client_bureau_balance_STATUS_2_count_norm_sum client_bureau_balance_STATUS_3_count_count client_bureau_balance_STATUS_3_count_mean client_bureau_balance_STATUS_3_count_max client_bureau_balance_STATUS_3_count_min client_bureau_balance_STATUS_3_count_sum client_bureau_balance_STATUS_3_count_norm_count client_bureau_balance_STATUS_3_count_norm_mean client_bureau_balance_STATUS_3_count_norm_max client_bureau_balance_STATUS_3_count_norm_min client_bureau_balance_STATUS_3_count_norm_sum client_bureau_balance_STATUS_4_count_count client_bureau_balance_STATUS_4_count_mean client_bureau_balance_STATUS_4_count_max client_bureau_balance_STATUS_4_count_min client_bureau_balance_STATUS_4_count_sum client_bureau_balance_STATUS_4_count_norm_count client_bureau_balance_STATUS_4_count_norm_mean client_bureau_balance_STATUS_4_count_norm_max client_bureau_balance_STATUS_4_count_norm_min client_bureau_balance_STATUS_4_count_norm_sum client_bureau_balance_STATUS_5_count_count client_bureau_balance_STATUS_5_count_mean client_bureau_balance_STATUS_5_count_max client_bureau_balance_STATUS_5_count_min 
client_bureau_balance_STATUS_5_count_sum client_bureau_balance_STATUS_5_count_norm_count client_bureau_balance_STATUS_5_count_norm_mean client_bureau_balance_STATUS_5_count_norm_max client_bureau_balance_STATUS_5_count_norm_min client_bureau_balance_STATUS_5_count_norm_sum client_bureau_balance_STATUS_C_count_count client_bureau_balance_STATUS_C_count_mean client_bureau_balance_STATUS_C_count_max client_bureau_balance_STATUS_C_count_min client_bureau_balance_STATUS_C_count_sum client_bureau_balance_STATUS_C_count_norm_count client_bureau_balance_STATUS_C_count_norm_mean client_bureau_balance_STATUS_C_count_norm_max client_bureau_balance_STATUS_C_count_norm_min client_bureau_balance_STATUS_C_count_norm_sum client_bureau_balance_STATUS_X_count_count client_bureau_balance_STATUS_X_count_mean client_bureau_balance_STATUS_X_count_max client_bureau_balance_STATUS_X_count_min client_bureau_balance_STATUS_X_count_sum client_bureau_balance_STATUS_X_count_norm_count client_bureau_balance_STATUS_X_count_norm_mean client_bureau_balance_STATUS_X_count_norm_max client_bureau_balance_STATUS_X_count_norm_min client_bureau_balance_STATUS_X_count_norm_sum
0 100001.0 7 24.571429 52 2 172 7 -11.785714 -0.5 -25.5 -82.5 7 0.0 0 0 0 7 -23.571429 -1 -51 -165 7 -400.000000 -1 -1326 -2800 7 4.428571 12 1 31.0 7 0.336651 1.000000 0.019231 2.356557 7 0.142857 1 0 1 7 0.007519 0.052632 0.0 0.052632 7 0.0 0 0 0 7 0.0 0.0 0.0 0.0 7 0.0 0 0 0 7 0.0 0.0 0.0 0.0 7 0.0 0 0 0 7 0.0 0.0 0.0 0.0 7 0.0 0 0 0.0 7 0.0 0.0 0.0 0.0 7 15.714286 44 0 110.0 7 0.441240 0.966667 0.000000 3.088683 7 4.285714 9 0 30.0 7 0.214590 0.500000 0.0 1.502129
1 100002.0 8 13.750000 22 4 110 8 -21.875000 -1.5 -39.5 -175.0 8 -15.5 0 -32 -124 8 -28.250000 -3 -47 -226 8 -337.625000 -6 -632 -2701 8 5.625000 18 2 45.0 8 0.406960 0.818182 0.187500 3.255682 8 3.375000 6 0 27 8 0.255682 0.500000 0.0 2.045455 8 0.0 0 0 0 8 0.0 0.0 0.0 0.0 8 0.0 0 0 0 8 0.0 0.0 0.0 0.0 8 0.0 0 0 0 8 0.0 0.0 0.0 0.0 8 0.0 0 0 0.0 8 0.0 0.0 0.0 0.0 8 2.875000 13 0 23.0 8 0.175426 0.812500 0.000000 1.403409 8 1.875000 3 0 15.0 8 0.161932 0.500000 0.0 1.295455
2 100005.0 3 7.000000 13 3 21 3 -3.000000 -1.0 -6.0 -9.0 3 0.0 0 0 0 3 -6.000000 -2 -12 -18 3 -30.333333 -3 -78 -91 3 4.666667 7 2 14.0 3 0.735043 1.000000 0.538462 2.205128 3 0.000000 0 0 0 3 0.000000 0.000000 0.0 0.000000 3 0.0 0 0 0 3 0.0 0.0 0.0 0.0 3 0.0 0 0 0 3 0.0 0.0 0.0 0.0 3 0.0 0 0 0 3 0.0 0.0 0.0 0.0 3 0.0 0 0 0.0 3 0.0 0.0 0.0 0.0 3 1.666667 5 0 5.0 3 0.128205 0.384615 0.000000 0.384615 3 0.666667 1 0 2.0 3 0.136752 0.333333 0.0 0.410256
3 100010.0 2 36.000000 36 36 72 2 -46.000000 -19.5 -72.5 -92.0 2 -28.5 -2 -55 -57 2 -63.500000 -37 -90 -127 2 -1656.000000 -702 -2610 -3312 2 10.000000 10 10 20.0 2 0.277778 0.277778 0.277778 0.555556 2 0.000000 0 0 0 2 0.000000 0.000000 0.0 0.000000 2 0.0 0 0 0 2 0.0 0.0 0.0 0.0 2 0.0 0 0 0 2 0.0 0.0 0.0 0.0 2 0.0 0 0 0 2 0.0 0.0 0.0 0.0 2 0.0 0 0 0.0 2 0.0 0.0 0.0 0.0 2 26.000000 26 26 52.0 2 0.722222 0.722222 0.722222 1.444444 2 0.000000 0 0 0.0 2 0.000000 0.000000 0.0 0.000000
4 100013.0 4 57.500000 69 40 230 4 -28.250000 -19.5 -34.0 -113.0 4 0.0 0 0 0 4 -56.500000 -39 -68 -226 4 -1689.000000 -780 -2346 -6756 4 19.750000 34 0 79.0 4 0.320718 0.618182 0.000000 1.282872 4 1.750000 3 0 7 4 0.027701 0.045455 0.0 0.110804 4 0.0 0 0 0 4 0.0 0.0 0.0 0.0 4 0.0 0 0 0 4 0.0 0.0 0.0 0.0 4 0.0 0 0 0 4 0.0 0.0 0.0 0.0 4 0.0 0 0 0.0 4 0.0 0.0 0.0 0.0 4 25.750000 44 0 103.0 4 0.397036 0.666667 0.000000 1.588142 4 10.250000 40 0 41.0 4 0.254545 1.000000 0.0 1.018182

To recap, for the bureau_balance dataframe we:

  1. Calculated numeric stats grouping by each loan
  2. Made value counts of each categorical variable grouping by loan
  3. Merged the stats and the value counts on the loans
  4. Calculated numeric stats for the resulting dataframe grouping by the client id

The final resulting dataframe has one row for each client, with statistics calculated for all of their loans with monthly balance information.

Some of these variables are a little confusing, so let's try to explain a few:

  • client_bureau_balance_MONTHS_BALANCE_mean_mean: For each loan calculate the mean value of MONTHS_BALANCE. Then for each client, calculate the mean of this value for all of their loans.
  • client_bureau_balance_STATUS_X_count_norm_sum: For each loan, calculate the number of occurrences of STATUS == X divided by the total number of STATUS values for the loan. Then, for each client, add up the values for each loan. (A small toy example of this two-level naming scheme is shown below.)
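As a toy illustration of how a doubly aggregated name such as client_bureau_balance_MONTHS_BALANCE_mean_mean arises, here is a minimal sketch with hypothetical numbers (not taken from the real data):

# Hypothetical client with two previous loans (SK_ID_BUREAU 1 and 2)
toy = pd.DataFrame({'SK_ID_BUREAU': [1, 1, 2, 2, 2],
                    'MONTHS_BALANCE': [0, -1, 0, -1, -2]})

# First level: mean of MONTHS_BALANCE within each loan -> loan 1: -0.5, loan 2: -1.0
per_loan_mean = toy.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].mean()

# Second level: mean of the per-loan means across this client's loans -> -0.75
client_level_value = per_loan_mean.mean()
print(client_level_value)  # the client's ..._MONTHS_BALANCE_mean_mean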

We will hold off on calculating the correlations until we have all the variables together in one dataframe.

Putting the Functions Together

We now have all the pieces in place to take the information from the previous loans at other institutions, along with the monthly payment information about these loans, and put it into the main training dataframe. Let's reset all the variables and then use the functions we built to do this from the ground up. This demonstrates the benefit of using functions for repeatable workflows!

In [34]:
# Free up memory by deleting old objects
import gc
gc.enable()
del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()
Out[34]:
364
In [35]:
# Read in new copies of all the dataframes
train = pd.read_csv('./input/application_train.csv')
bureau = pd.read_csv('./input/bureau.csv')
bureau_balance = pd.read_csv('./input/bureau_balance.csv')

Counts of Bureau Dataframe

In [36]:
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
Out[36]:
bureau_CREDIT_ACTIVE_Active_count bureau_CREDIT_ACTIVE_Active_count_norm bureau_CREDIT_ACTIVE_Bad debt_count bureau_CREDIT_ACTIVE_Bad debt_count_norm bureau_CREDIT_ACTIVE_Closed_count bureau_CREDIT_ACTIVE_Closed_count_norm bureau_CREDIT_ACTIVE_Sold_count bureau_CREDIT_ACTIVE_Sold_count_norm bureau_CREDIT_CURRENCY_currency 1_count bureau_CREDIT_CURRENCY_currency 1_count_norm bureau_CREDIT_CURRENCY_currency 2_count bureau_CREDIT_CURRENCY_currency 2_count_norm bureau_CREDIT_CURRENCY_currency 3_count bureau_CREDIT_CURRENCY_currency 3_count_norm bureau_CREDIT_CURRENCY_currency 4_count bureau_CREDIT_CURRENCY_currency 4_count_norm bureau_CREDIT_TYPE_Another type of loan_count bureau_CREDIT_TYPE_Another type of loan_count_norm bureau_CREDIT_TYPE_Car loan_count bureau_CREDIT_TYPE_Car loan_count_norm bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm bureau_CREDIT_TYPE_Consumer credit_count bureau_CREDIT_TYPE_Consumer credit_count_norm bureau_CREDIT_TYPE_Credit card_count bureau_CREDIT_TYPE_Credit card_count_norm bureau_CREDIT_TYPE_Interbank credit_count bureau_CREDIT_TYPE_Interbank credit_count_norm bureau_CREDIT_TYPE_Loan for business development_count bureau_CREDIT_TYPE_Loan for business development_count_norm bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm bureau_CREDIT_TYPE_Loan for the purchase of equipment_count bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm bureau_CREDIT_TYPE_Loan for working capital replenishment_count bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm bureau_CREDIT_TYPE_Microloan_count bureau_CREDIT_TYPE_Microloan_count_norm bureau_CREDIT_TYPE_Mobile operator loan_count bureau_CREDIT_TYPE_Mobile operator loan_count_norm bureau_CREDIT_TYPE_Mortgage_count bureau_CREDIT_TYPE_Mortgage_count_norm bureau_CREDIT_TYPE_Real estate loan_count bureau_CREDIT_TYPE_Real estate loan_count_norm bureau_CREDIT_TYPE_Unknown type of loan_count bureau_CREDIT_TYPE_Unknown type of loan_count_norm
SK_ID_CURR
100001 3 0.428571 0 0.0 4 0.571429 0 0.0 7 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 7 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100002 2 0.250000 0 0.0 6 0.750000 0 0.0 8 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 4 0.500000 4 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100003 1 0.250000 0 0.0 3 0.750000 0 0.0 4 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.500000 2 0.500000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100004 0 0.000000 0 0.0 2 1.000000 0 0.0 2 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 1.000000 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0
100005 2 0.666667 0 0.0 1 0.333333 0 0.0 3 1.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 2 0.666667 1 0.333333 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0

Aggregated Stats of Bureau Dataframe

In [37]:
bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()
Out[37]:
SK_ID_CURR bureau_DAYS_CREDIT_count bureau_DAYS_CREDIT_mean bureau_DAYS_CREDIT_max bureau_DAYS_CREDIT_min bureau_DAYS_CREDIT_sum bureau_CREDIT_DAY_OVERDUE_count bureau_CREDIT_DAY_OVERDUE_mean bureau_CREDIT_DAY_OVERDUE_max bureau_CREDIT_DAY_OVERDUE_min bureau_CREDIT_DAY_OVERDUE_sum bureau_DAYS_CREDIT_ENDDATE_count bureau_DAYS_CREDIT_ENDDATE_mean bureau_DAYS_CREDIT_ENDDATE_max bureau_DAYS_CREDIT_ENDDATE_min bureau_DAYS_CREDIT_ENDDATE_sum bureau_DAYS_ENDDATE_FACT_count bureau_DAYS_ENDDATE_FACT_mean bureau_DAYS_ENDDATE_FACT_max bureau_DAYS_ENDDATE_FACT_min bureau_DAYS_ENDDATE_FACT_sum bureau_AMT_CREDIT_MAX_OVERDUE_count bureau_AMT_CREDIT_MAX_OVERDUE_mean bureau_AMT_CREDIT_MAX_OVERDUE_max bureau_AMT_CREDIT_MAX_OVERDUE_min bureau_AMT_CREDIT_MAX_OVERDUE_sum bureau_CNT_CREDIT_PROLONG_count bureau_CNT_CREDIT_PROLONG_mean bureau_CNT_CREDIT_PROLONG_max bureau_CNT_CREDIT_PROLONG_min bureau_CNT_CREDIT_PROLONG_sum bureau_AMT_CREDIT_SUM_count bureau_AMT_CREDIT_SUM_mean bureau_AMT_CREDIT_SUM_max bureau_AMT_CREDIT_SUM_min bureau_AMT_CREDIT_SUM_sum bureau_AMT_CREDIT_SUM_DEBT_count bureau_AMT_CREDIT_SUM_DEBT_mean bureau_AMT_CREDIT_SUM_DEBT_max bureau_AMT_CREDIT_SUM_DEBT_min bureau_AMT_CREDIT_SUM_DEBT_sum bureau_AMT_CREDIT_SUM_LIMIT_count bureau_AMT_CREDIT_SUM_LIMIT_mean bureau_AMT_CREDIT_SUM_LIMIT_max bureau_AMT_CREDIT_SUM_LIMIT_min bureau_AMT_CREDIT_SUM_LIMIT_sum bureau_AMT_CREDIT_SUM_OVERDUE_count bureau_AMT_CREDIT_SUM_OVERDUE_mean bureau_AMT_CREDIT_SUM_OVERDUE_max bureau_AMT_CREDIT_SUM_OVERDUE_min bureau_AMT_CREDIT_SUM_OVERDUE_sum bureau_DAYS_CREDIT_UPDATE_count bureau_DAYS_CREDIT_UPDATE_mean bureau_DAYS_CREDIT_UPDATE_max bureau_DAYS_CREDIT_UPDATE_min bureau_DAYS_CREDIT_UPDATE_sum bureau_AMT_ANNUITY_count bureau_AMT_ANNUITY_mean bureau_AMT_ANNUITY_max bureau_AMT_ANNUITY_min bureau_AMT_ANNUITY_sum
0 100001 7 -735.000000 -49 -1572 -5145 7 0.0 0 0 0 7 82.428571 1778.0 -1329.0 577.0 4 -825.500000 -544.0 -1328.0 -3302.0 0 NaN NaN NaN 0.000 7 0.0 0 0 0 7 207623.571429 378000.0 85500.0 1453365.000 7 85240.928571 373239.0 0.0 596686.5 6 0.00000 0.000 0.0 0.000 7 0.0 0.0 0.0 0.0 7 -93.142857 -6 -155 -652 7 3545.357143 10822.5 0.0 24817.5
1 100002 8 -874.000000 -103 -1437 -6992 8 0.0 0 0 0 6 -349.000000 780.0 -1072.0 -2094.0 6 -697.500000 -36.0 -1185.0 -4185.0 5 1681.029 5043.645 0.0 8405.145 8 0.0 0 0 0 8 108131.945625 450000.0 0.0 865055.565 5 49156.200000 245781.0 0.0 245781.0 4 7997.14125 31988.565 0.0 31988.565 8 0.0 0.0 0.0 0.0 8 -499.875000 -7 -1185 -3999 7 0.000000 0.0 0.0 0.0
2 100003 4 -1400.750000 -606 -2586 -5603 4 0.0 0 0 0 4 -544.500000 1216.0 -2434.0 -2178.0 3 -1097.333333 -540.0 -2131.0 -3292.0 4 0.000 0.000 0.0 0.000 4 0.0 0 0 0 4 254350.125000 810000.0 22248.0 1017400.500 4 0.000000 0.0 0.0 0.0 4 202500.00000 810000.000 0.0 810000.000 4 0.0 0.0 0.0 0.0 4 -816.000000 -43 -2131 -3264 0 NaN NaN NaN 0.0
3 100004 2 -867.000000 -408 -1326 -1734 2 0.0 0 0 0 2 -488.500000 -382.0 -595.0 -977.0 2 -532.500000 -382.0 -683.0 -1065.0 1 0.000 0.000 0.0 0.000 2 0.0 0 0 0 2 94518.900000 94537.8 94500.0 189037.800 2 0.000000 0.0 0.0 0.0 2 0.00000 0.000 0.0 0.000 2 0.0 0.0 0.0 0.0 2 -532.000000 -382 -682 -1064 0 NaN NaN NaN 0.0
4 100005 3 -190.666667 -62 -373 -572 3 0.0 0 0 0 3 439.333333 1324.0 -128.0 1318.0 1 -123.000000 -123.0 -123.0 -123.0 1 0.000 0.000 0.0 0.000 3 0.0 0 0 0 3 219042.000000 568800.0 29826.0 657126.000 3 189469.500000 543087.0 0.0 568408.5 3 0.00000 0.000 0.0 0.000 3 0.0 0.0 0.0 0.0 3 -54.333333 -11 -121 -163 3 1420.500000 4261.5 0.0 4261.5

Value Counts of Bureau Balance Dataframe by Loan

In [38]:
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()
Out[38]:
bureau_balance_STATUS_0_count bureau_balance_STATUS_0_count_norm bureau_balance_STATUS_1_count bureau_balance_STATUS_1_count_norm bureau_balance_STATUS_2_count bureau_balance_STATUS_2_count_norm bureau_balance_STATUS_3_count bureau_balance_STATUS_3_count_norm bureau_balance_STATUS_4_count bureau_balance_STATUS_4_count_norm bureau_balance_STATUS_5_count bureau_balance_STATUS_5_count_norm bureau_balance_STATUS_C_count bureau_balance_STATUS_C_count_norm bureau_balance_STATUS_X_count bureau_balance_STATUS_X_count_norm
SK_ID_BUREAU
5001709 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 86 0.886598 11 0.113402
5001710 5 0.060241 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 48 0.578313 30 0.361446
5001711 3 0.750000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 1 0.250000
5001712 10 0.526316 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 9 0.473684 0 0.000000
5001713 0 0.000000 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.000000 22 1.000000

Aggregated Stats of Bureau Balance Dataframe by Loan

In [39]:
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()
Out[39]:
SK_ID_BUREAU bureau_balance_MONTHS_BALANCE_count bureau_balance_MONTHS_BALANCE_mean bureau_balance_MONTHS_BALANCE_max bureau_balance_MONTHS_BALANCE_min bureau_balance_MONTHS_BALANCE_sum
0 5001709 97 -48.0 0 -96 -4656
1 5001710 83 -41.0 0 -82 -3403
2 5001711 4 -1.5 0 -3 -6
3 5001712 19 -9.0 0 -18 -171
4 5001713 22 -10.5 0 -21 -231

Aggregated Stats of Bureau Balance by Client

In [40]:
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')

Insert Computed Features into Training Data

In [41]:
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))
Original Number of Features:  122
In [42]:
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
In [43]:
new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
Number of features using previous loans from other institutions data:  333

Feature Engineering Outcomes

After all that work, we now want to take a look at the variables we have created. We can look at the percentage of missing values, the correlations of variables with the target, and also the correlation of variables with other variables. The correlations between variables can show if we have collinear variables, that is, variables that are highly correlated with one another. Often, we want to remove one in a pair of collinear variables because having both would be redundant. We can also use the percentage of missing values to remove features with a substantial majority of values that are not present. Feature selection will be an important focus going forward, because reducing the number of features can help the model learn during training and also generalize better to the testing data. The "curse of dimensionality" is the name given to the issues caused by having too many features (too high of a dimension). As the number of variables increases, the number of datapoints needed to learn the relationship between these variables and the target increases exponentially.

Feature selection is the process of removing variables to help our model to learn and generalize better to the testing set. The objective is to remove useless/redundant variables while preserving those that are useful. There are a number of tools we can use for this process, but in this notebook we will stick to removing columns with a high percentage of missing values and variables that have a high correlation with one another. Later we can look at using the feature importances returned from models such as the Gradient Boosting Machine or Random Forest to perform feature selection.

Missing Values

An important consideration is the missing values in the dataframe. Columns with too many missing values might have to be dropped.

In [44]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
In [45]:
missing_train = missing_values_table(train)
missing_train.head(10)
Your selected dataframe has 333 columns.
There are 278 columns that have missing values.
Out[45]:
Missing Values % of Total Values
bureau_AMT_ANNUITY_min 227502 74.0
bureau_AMT_ANNUITY_max 227502 74.0
bureau_AMT_ANNUITY_mean 227502 74.0
client_bureau_balance_STATUS_4_count_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_mean 215280 70.0
client_bureau_balance_MONTHS_BALANCE_count_min 215280 70.0
client_bureau_balance_STATUS_4_count_max 215280 70.0
client_bureau_balance_STATUS_4_count_mean 215280 70.0
client_bureau_balance_STATUS_3_count_norm_min 215280 70.0
client_bureau_balance_STATUS_3_count_norm_max 215280 70.0

We see there are a number of columns with a high percentage of missing values. There is no well-established threshold for removing missing values, and the best course of action depends on the problem. Here, to reduce the number of features, we will remove any columns in either the training or the testing data that have greater than 90% missing values.

In [46]:
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
Out[46]:
0

Before we remove any columns, we will first find the missing value percentages in the testing data, and then remove any columns with greater than 90% missing values in either the training or testing data. Let's read in the testing data, perform the same merges, and look at its missing values. We have already calculated all the counts and aggregation statistics, so we only need to merge the testing data with the appropriate dataframes.

Calculate Information for Testing Data

In [47]:
# Read in the test dataframe
test = pd.read_csv('./input/application_test.csv')

# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
In [48]:
print('Shape of Testing Data: ', test.shape)
Shape of Testing Data:  (48744, 332)

We need to align the testing and training dataframes so that they have exactly the same columns. This shouldn't be an issue here, but whenever we one-hot encode variables, we need to align the dataframes to make sure they end up with the same columns.

In [49]:
train_labels = train['TARGET']

# Align the dataframes, this will remove the 'TARGET' column
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels
In [50]:
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
Training Data Shape:  (307511, 333)
Testing Data Shape:  (48744, 332)

The dataframes now have the same columns (with the exception of the TARGET column in the training data). This means we can use them in a machine learning model which needs to see the same columns in both the training and testing dataframes.

Let's now look at the percentage of missing values in the testing data so we can figure out the columns that should be dropped.

In [51]:
missing_test = missing_values_table(test)
missing_test.head(10)
Your selected dataframe has 332 columns.
There are 275 columns that have missing values.
Out[51]:
Missing Values % of Total Values
COMMONAREA_MEDI 33495 68.7
COMMONAREA_MODE 33495 68.7
COMMONAREA_AVG 33495 68.7
NONLIVINGAPARTMENTS_MEDI 33347 68.4
NONLIVINGAPARTMENTS_AVG 33347 68.4
NONLIVINGAPARTMENTS_MODE 33347 68.4
FONDKAPREMONT_MODE 32797 67.3
LIVINGAPARTMENTS_MEDI 32780 67.2
LIVINGAPARTMENTS_MODE 32780 67.2
LIVINGAPARTMENTS_AVG 32780 67.2
In [52]:
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)
Out[52]:
0
In [53]:
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))
There are 0 columns with more than 90% missing in either the training or testing data.
In [54]:
# Drop the missing columns
train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)

We ended up removing no columns in this round because there are no columns with more than 90% missing values. We might have to apply another feature selection method to reduce the dimensionality.

At this point we will save both the training and testing data. I encourage anyone to try different percentages for dropping the missing columns and compare the outcomes.

In [55]:
train.to_csv('train_bureau_raw.csv', index = False)
test.to_csv('test_bureau_raw.csv', index = False)
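For those experiments, a small helper keeps the cutoff in one place. This is a hedged sketch with a hypothetical function name, not code from the original workflow; the default of 90 matches the threshold used above:

def drop_missing_columns(train_df, test_df, threshold=90):
    # Percentage of missing values in each column of both dataframes
    train_pct = 100 * train_df.isnull().sum() / len(train_df)
    test_pct = 100 * test_df.isnull().sum() / len(test_df)

    # Columns above the threshold in either dataframe
    to_drop = sorted(set(train_pct.index[train_pct > threshold]) |
                     set(test_pct.index[test_pct > threshold]))
    print('Dropping %d columns with more than %d%% missing values.' % (len(to_drop), threshold))

    # errors='ignore' because a column such as TARGET exists only in the training data
    return (train_df.drop(columns=to_drop, errors='ignore'),
            test_df.drop(columns=to_drop, errors='ignore'))

# Example: try a stricter 75% cutoff instead of 90%
# train_75, test_75 = drop_missing_columns(train, test, threshold=75)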

Correlations

First let's look at the correlations of the variables with the target. We can see if any of the variables we created have a greater correlation with the target than those already present in the training data (from application).

In [56]:
# Calculate all correlations in dataframe
corrs = train.corr()
In [57]:
corrs = corrs.sort_values('TARGET', ascending = False)

# Ten most positive correlations
pd.DataFrame(corrs['TARGET'].head(10))
Out[57]:
TARGET
TARGET 1.000000
bureau_DAYS_CREDIT_mean 0.089729
client_bureau_balance_MONTHS_BALANCE_min_mean 0.089038
DAYS_BIRTH 0.078239
bureau_CREDIT_ACTIVE_Active_count_norm 0.077356
client_bureau_balance_MONTHS_BALANCE_mean_mean 0.076424
bureau_DAYS_CREDIT_min 0.075248
client_bureau_balance_MONTHS_BALANCE_min_min 0.073225
client_bureau_balance_MONTHS_BALANCE_sum_mean 0.072606
bureau_DAYS_CREDIT_UPDATE_mean 0.068927
In [58]:
# Most negative correlations (bottom 100 shown)
pd.DataFrame(corrs['TARGET'].dropna().tail(100))
Out[58]:
TARGET
AMT_INCOME_TOTAL -0.003982
bureau_DAYS_CREDIT_ENDDATE_count -0.004002
FLAG_DOCUMENT_11 -0.004229
FLAG_DOCUMENT_9 -0.004352
bureau_AMT_CREDIT_SUM_LIMIT_min -0.004779
bureau_CREDIT_CURRENCY_currency 2_count -0.006003
FLAG_DOCUMENT_15 -0.006536
bureau_CREDIT_CURRENCY_currency 2_count_norm -0.006821
FLAG_DOCUMENT_18 -0.007952
FLAG_DOCUMENT_8 -0.008040
client_bureau_balance_STATUS_X_count_sum -0.008445
client_bureau_balance_STATUS_C_count_norm_sum -0.008680
YEARS_BEGINEXPLUATATION_MODE -0.009036
bureau_AMT_CREDIT_SUM_LIMIT_sum -0.009419
FLAG_DOCUMENT_14 -0.009464
YEARS_BEGINEXPLUATATION_AVG -0.009728
YEARS_BEGINEXPLUATATION_MEDI -0.009993
LANDAREA_MODE -0.010174
bureau_AMT_CREDIT_SUM_LIMIT_max -0.010633
bureau_CREDIT_TYPE_Consumer credit_count -0.010707
bureau_AMT_CREDIT_SUM_min -0.010764
LANDAREA_AVG -0.010885
LANDAREA_MEDI -0.011256
bureau_AMT_CREDIT_SUM_LIMIT_mean -0.011446
FLAG_DOCUMENT_13 -0.011583
FLAG_DOCUMENT_16 -0.011615
AMT_REQ_CREDIT_BUREAU_MON -0.012462
NONLIVINGAREA_MODE -0.012711
AMT_ANNUITY -0.012817
NONLIVINGAREA_MEDI -0.013337
... ...
client_bureau_balance_STATUS_X_count_max -0.030919
client_bureau_balance_STATUS_0_count_max -0.031158
ELEVATORS_MODE -0.032131
TOTALAREA_MODE -0.032596
FLOORSMIN_MODE -0.032698
LIVINGAREA_MEDI -0.032739
LIVINGAREA_AVG -0.032997
client_bureau_balance_STATUS_X_count_mean -0.033292
FLOORSMIN_MEDI -0.033394
FLOORSMIN_AVG -0.033614
ELEVATORS_MEDI -0.033863
ELEVATORS_AVG -0.034199
client_bureau_balance_STATUS_0_count_mean -0.035868
REGION_POPULATION_RELATIVE -0.037227
AMT_GOODS_PRICE -0.039645
FLOORSMAX_MODE -0.043226
FLOORSMAX_MEDI -0.043768
FLOORSMAX_AVG -0.044003
client_bureau_balance_STATUS_C_count_norm_max -0.044119
DAYS_EMPLOYED -0.044932
client_bureau_balance_MONTHS_BALANCE_count_min -0.048224
client_bureau_balance_STATUS_C_count_norm_mean -0.055936
client_bureau_balance_STATUS_C_count_max -0.061083
client_bureau_balance_STATUS_C_count_mean -0.062954
client_bureau_balance_MONTHS_BALANCE_count_max -0.068792
bureau_CREDIT_ACTIVE_Closed_count_norm -0.079369
client_bureau_balance_MONTHS_BALANCE_count_mean -0.080193
EXT_SOURCE_1 -0.155317
EXT_SOURCE_2 -0.160472
EXT_SOURCE_3 -0.178919

100 rows × 1 columns

The variable most highly correlated with the target (other than TARGET itself, which of course has a correlation of 1) is a variable we created. However, just because a variable is correlated does not mean that it will be useful, and we have to remember that if we generate hundreds of new variables, some are going to be correlated with the target simply because of random noise.

Viewing the correlations skeptically, it does appear that several of the newly created variables may be useful. To assess the "usefulness" of variables, we will look at the feature importances returned by the model. For curiosity's sake (and because we already wrote the function) we can make a KDE plot of two of the newly created variables.
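The kde_target function itself was defined earlier in the notebook and is not repeated here. As a rough, hedged sketch of what it does (the name kde_target_sketch and the body are illustrative, relying on the pandas, matplotlib, and seaborn imports at the top of the notebook; the exact earlier implementation may differ):

def kde_target_sketch(var_name, df):
    # Correlation of the variable with the target
    corr = df['TARGET'].corr(df[var_name])

    # Medians for repaid (TARGET == 0) and not repaid (TARGET == 1) loans
    median_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    median_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    # KDE of the variable, colored by the value of TARGET
    plt.figure(figsize = (12, 6))
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name].dropna(), label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name].dropna(), label = 'TARGET == 1')
    plt.xlabel(var_name); plt.ylabel('Density')
    plt.title('Distribution of %s by Target Value' % var_name)
    plt.legend()

    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    print('Median value for loan that was not repaid = %0.4f' % median_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % median_repaid)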

In [59]:
kde_target(var_name='client_bureau_balance_MONTHS_BALANCE_count_mean', df=train)
The correlation between client_bureau_balance_MONTHS_BALANCE_count_mean and the TARGET is -0.0802
Median value for loan that was not repaid = 19.3333
Median value for loan that was repaid =     25.1429

This variable represents the average number of monthly records per loan for each client. For example, if a client had three previous loans with 3, 4, and 5 records in the monthly data, the value of this variable for them would be 4. Based on the distribution, clients with a higher average number of monthly records per loan were more likely to repay their loans with Home Credit. Let's not read too much into this value, but it could indicate that clients who have a longer previous credit history are generally more likely to repay a loan.

In [60]:
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)
The correlation between bureau_CREDIT_ACTIVE_Active_count_norm and the TARGET is 0.0774
Median value for loan that was not repaid = 0.5000
Median value for loan that was repaid =     0.3636

Well this distribution is all over the place. This variable represents the number of previous loans with a CREDIT_ACTIVE value of Active divided by the total number of previous loans for a client. The correlation here is so weak that I do not think we should draw any conclusions!

Collinear Variables

We can calculate not only the correlations of the variables with the target, but also the correlation of each variable with every other variable. This will allow us to see if there are highly collinear variables that should perhaps be removed from the data.

Let's look for any variables that have a greater than 0.8 correlation with other variables.

In [61]:
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

For each of these pairs of highly correlated variables, we only want to remove one of the variables. The following code creates a set of variables to remove by only adding one of each pair.

In [62]:
# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)
            
cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
Number of columns to remove:  134

We can remove these columns from both the training and the testing datasets. We will then have to compare performance after removing these variables with performance when keeping them (the raw csv files we saved earlier).

In [63]:
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
Training Corrs Removed Shape:  (307511, 199)
Testing Corrs Removed Shape:  (48744, 198)
In [64]:
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index = False)

Modeling

To actually test the performance of these new datasets, we will try using them for machine learning! Here we will use a function I developed in another notebook to compare the feature sets (the raw version and the version with the highly correlated variables removed). We can run this as a kind of experiment, and the control will be the performance of the model trained on just the application data when submitted to the competition. I've already recorded that performance, so we can list out our control and our two test conditions:

For all three datasets, we use the exact same model shown below (with identical hyperparameters).

  • control: only the data in the application files.
  • test one: the data in the application files with all of the data recorded from the bureau and bureau_balance files
  • test two: the data in the application files with all of the data recorded from the bureau and bureau_balance files with highly correlated variables removed.
In [65]:
import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt
In [66]:
def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for
            one-hot encoding or 'le' for integer label encoding.
        n_folds (int, default = 5): 
            number of folds to use for cross validation.
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = False, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        # Train the model
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  early_stopping_rounds = 100, verbose = 200)
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics
In [67]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`.
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest) 
        with a column for normalized importance
        """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

Control

The first step in any experiment is establishing a control. For this we will use the function defined above (that implements a Gradient Boosting Machine model) and the single main data source (application).

In [68]:
train_control = pd.read_csv('./input/application_train.csv')
test_control = pd.read_csv('./input/application_test.csv')

Fortunately, once we have taken the time to write a function, using it is simple. The function above returns a submission dataframe we can upload to the competition, a fi dataframe of feature importances, and a metrics dataframe with training and validation performance.

In [69]:
submission, fi, metrics = model(train_control, test_control)
Training Data Shape:  (307511, 241)
Testing Data Shape:  (48744, 241)
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.760007	train's auc: 0.798103
Early stopping, best iteration is:
[269]	valid's auc: 0.760273	train's auc: 0.809199
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.76114	train's auc: 0.798328
Early stopping, best iteration is:
[289]	valid's auc: 0.761398	train's auc: 0.812654
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.750232	train's auc: 0.79964
Early stopping, best iteration is:
[265]	valid's auc: 0.750451	train's auc: 0.809734
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.759831	train's auc: 0.797797
Early stopping, best iteration is:
[282]	valid's auc: 0.760245	train's auc: 0.811121
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.76071	train's auc: 0.798106
Early stopping, best iteration is:
[226]	valid's auc: 0.760972	train's auc: 0.802236
In [70]:
metrics
Out[70]:
fold train valid
0 0 0.809199 0.760273
1 1 0.812654 0.761398
2 2 0.809734 0.750451
3 3 0.811121 0.760245
4 4 0.802236 0.760972
5 overall 0.808989 0.758635

The control slightly overfits because the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using reg_lambda and reg_alpha as well as early stopping).

We can visualize the feature importance with another function, plot_feature_importances. The feature importances may be useful when it's time for feature selection.

In [71]:
fi_sorted = plot_feature_importances(fi)
In [72]:
submission.to_csv('control.csv', index = False)

The control scores 0.745 when submitted to the competition.

Test One

Let's conduct the first test. We will just need to pass in the data to the function, which does most of the work for us.

In [73]:
submission_raw, fi_raw, metrics_raw = model(train, test)
Training Data Shape:  (307511, 452)
Testing Data Shape:  (48744, 452)
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.766471	train's auc: 0.810385
Early stopping, best iteration is:
[293]	valid's auc: 0.767203	train's auc: 0.827287
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.767523	train's auc: 0.810288
Early stopping, best iteration is:
[265]	valid's auc: 0.768109	train's auc: 0.822264
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.761029	train's auc: 0.811653
Early stopping, best iteration is:
[267]	valid's auc: 0.761625	train's auc: 0.824147
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.767837	train's auc: 0.809671
Early stopping, best iteration is:
[236]	valid's auc: 0.76815	train's auc: 0.816322
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.767516	train's auc: 0.809788
[400]	valid's auc: 0.767583	train's auc: 0.842953
Early stopping, best iteration is:
[306]	valid's auc: 0.767902	train's auc: 0.828518
In [74]:
metrics_raw
Out[74]:
fold train valid
0 0 0.827287 0.767203
1 1 0.822264 0.768109
2 2 0.824147 0.761625
3 3 0.816322 0.768150
4 4 0.828518 0.767902
5 overall 0.823708 0.766597

Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.

In [75]:
fi_raw_sorted = plot_feature_importances(fi_raw)

Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let's find the percentage of the top 100 most important features that we made in this notebook. However, rather than just comparing to the original features, we need to compare to the one-hot encoded original features. These are already recorded for us in fi (from the run on the original data).

In [76]:
top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))
% of Top 100 Features created from the bureau data = 53.00

Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.

In [77]:
submission_raw.to_csv('test_one.csv', index = False)

Test one scores 0.759 when submitted to the competition.

Test Two

That was easy, so let's do another run! Same as before but with the highly collinear variables removed.

In [78]:
submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)
Training Data Shape:  (307511, 318)
Testing Data Shape:  (48744, 318)
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.764122	train's auc: 0.8063
Early stopping, best iteration is:
[264]	valid's auc: 0.764565	train's auc: 0.817897
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.7652	train's auc: 0.806611
[400]	valid's auc: 0.765625	train's auc: 0.839251
Early stopping, best iteration is:
[379]	valid's auc: 0.765829	train's auc: 0.836334
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.757092	train's auc: 0.808063
Early stopping, best iteration is:
[198]	valid's auc: 0.757149	train's auc: 0.807627
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.764131	train's auc: 0.806097
Early stopping, best iteration is:
[287]	valid's auc: 0.76481	train's auc: 0.821551
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.763964	train's auc: 0.806256
[400]	valid's auc: 0.763928	train's auc: 0.838649
Early stopping, best iteration is:
[312]	valid's auc: 0.764449	train's auc: 0.825298
In [79]:
metrics_corr
Out[79]:
fold train valid
0 0 0.817897 0.764565
1 1 0.836334 0.765829
2 2 0.807627 0.757149
3 3 0.821551 0.764810
4 4 0.825298 0.764449
5 overall 0.821741 0.763352

These results are better than the control, but slightly worse than those from the raw features.

In [80]:
fi_corrs_sorted = plot_feature_importances(fi_corrs)
In [81]:
submission_corrs.to_csv('test_two.csv', index = False)

Test Two scores 0.753 when submitted to the competition.

Results

After all that work, we can say that including the extra information did improve performance! The model is definitely not optimized to our data, but we still had a noticeable improvement over the original dataset when using the calculated features. Let's officially summarize the performances:

Experiment   Train AUC   Validation AUC   Test AUC
Control        0.815          0.760          0.745
Test One       0.837          0.767          0.759
Test Two       0.826          0.765          0.753

(Note that these scores may change from run to run of the notebook. I have not observed that the general ordering changes however.)

All of our hard work translates to a small improvement of 0.014 ROC AUC on the test set compared to the control. Removing the highly collinear variables slightly decreases performance, so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.
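One simple alternative, sketched below, is to use the normalized importances returned by plot_feature_importances and keep only the features that account for the bulk of the total importance; the 0.95 cutoff here is purely illustrative.

# Keep the smallest set of features accounting for 95% of total importance
# (fi_raw_sorted comes from plot_feature_importances above; 0.95 is an illustrative threshold)
fi_raw_sorted['importance_cumulative'] = fi_raw_sorted['importance_normalized'].cumsum()
features_to_keep = list(fi_raw_sorted.loc[fi_raw_sorted['importance_cumulative'] <= 0.95, 'feature'])

print('Keeping %d of %d features' % (len(features_to_keep), fi_raw_sorted.shape[0]))

Because the model one-hot encodes categorical variables internally, these names refer to the encoded feature space, so the actual subsetting would be done after pd.get_dummies rather than directly on the raw train and test dataframes.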

In a competition such as this, even an improvement of this size can be enough to move us up hundreds of spots on the leaderboard. By making numerous small improvements like those in this notebook, we can gradually achieve better and better performance. I encourage others to use the results here to make their own improvements, and I will continue to document the steps I take to help others.

Next Steps

Going forward, we can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model!
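As a rough sketch of what that looks like for one of them, the same groupby / agg / rename / merge pattern carries over. previous_application.csv is one of the remaining files; the column names produced below are illustrative rather than the output of any particular helper defined earlier.

# Read in one of the remaining data files
previous = pd.read_csv('./input/previous_application.csv')

# Keep the client id plus the numeric columns, dropping the previous-loan id
numeric_previous = previous.select_dtypes('number').drop(columns = ['SK_ID_PREV'])

# Aggregate the numeric columns for each client
previous_agg = numeric_previous.groupby('SK_ID_CURR').agg(['count', 'mean', 'max', 'min', 'sum'])

# Flatten the multi-level column index into single names
previous_agg.columns = ['previous_%s_%s' % (var, stat) for var, stat in previous_agg.columns]

# Merge the aggregated statistics onto the training data by client id
train_with_previous = train.merge(previous_agg.reset_index(), on = 'SK_ID_CURR', how = 'left')

The categorical columns of each file can be handled the same way we handled CREDIT_ACTIVE above, by counting and normalizing the category occurrences per client before merging.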
