In this notebook, we will explore making features by hand for the Home Credit Default Risk competition. In an earlier notebook, we used only the application data to build a model. The best model we made from that data achieved a leaderboard score of around 0.74. To better this score, we will have to include more information from the other dataframes. Here, we will look at using information from the bureau and bureau_balance data. The definitions of these data files are:

- bureau: information about each client's previous loans with other financial institutions, with one row per previous loan.
- bureau_balance: monthly information about each of those previous loans.
Manual feature engineering can be a tedious process (which is why we use automated feature engineering with featuretools!) and often relies on domain expertise. Since I have limited domain knowledge of loans and what makes a person likely to default, I will instead concentrate on getting as much information as possible into the final training dataframe. The idea is that the model will then pick up on which features are important rather than us having to decide that. Basically, our approach is to make as many features as possible and then give them all to the model to use! Later, we can perform feature reduction using the feature importances from the model or other techniques such as PCA.
The process of manual feature engineering will involve plenty of Pandas code, a little patience, and a lot of great practice manipulating data. Even though automated feature engineering tools are starting to become available, feature engineering will still have to be done with plenty of data wrangling for a little while longer.
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
To illustrate the general process of manual feature engineering, we will first simply get the count of a client's previous loans at other financial institutions. This requires a number of Pandas operations we will make heavy use of throughout the notebook:
- groupby: group a dataframe by a column. In this case we will group by the unique client id, the SK_ID_CURR column.
- agg: perform a calculation on the grouped data, such as taking the mean of columns. We can either call the function directly (grouped_df.mean()) or use the agg function together with a list of transforms (grouped_df.agg(['mean', 'max', 'min', 'sum'])).
- merge: match the aggregated statistics to the appropriate client. We need to merge the original training data with the calculated stats on the SK_ID_CURR column, which will insert NaN in any cell for which the client does not have the corresponding statistic.

We also use the rename function quite a bit, specifying the columns to be renamed as a dictionary. This is useful in order to keep track of the new variables we create.
This might seem like a lot, which is why we'll eventually write a function to do this process for us. Let's take a look at implementing this by hand first.
# Read in bureau
bureau = pd.read_csv('./input/bureau.csv')
bureau.head()
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()
# Join to the training dataframe
train = pd.read_csv('./input/application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')
# Fill the missing values with 0
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
Scroll all the way to the right to see the new column.
To determine if the new variable is useful, we can calculate the Pearson Correlation Coefficient (r-value) between this variable and the target. This measures the strength of a linear relationship between two variables and ranges from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship). The r-value is not the best measure of the "usefulness" of a new variable, but it can give a first approximation of whether a variable will be helpful to a machine learning model. The larger the r-value of a variable with respect to the target, the more a change in this variable is likely to affect the value of the target. Therefore, we look for the variables with the largest absolute r-value relative to the target.
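As a quick sketch, the r-value is a one-liner in pandas (here for the previous_loan_counts column we just created):

# Pearson correlation between the new feature and the target
r = train['TARGET'].corr(train['previous_loan_counts'])
print('r = %0.4f' % r)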
We can also visually inspect a relationship with the target using the Kernel Density Estimate (KDE) plot.
The kernel density estimate plot shows the distribution of a single variable (think of it as a smoothed histogram). To see the difference in distributions depending on the value of a categorical variable, we can color the distributions according to the category. For example, we can show the kernel density estimate of previous_loan_counts colored by whether TARGET is 1 or 0. The resulting KDE will show any significant differences in the distribution of the variable between people who did not repay their loan (TARGET == 1) and the people who did (TARGET == 0). This can serve as an indicator of whether a variable will be 'relevant' to a machine learning model.
We will put this plotting functionality in a function to re-use for any variable.
# Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df):

    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    plt.figure(figsize = (12, 6))

    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # Label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend();

    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
We can test this function using the EXT_SOURCE_3 variable, which we found to be one of the most important variables according to a Random Forest and a Gradient Boosting Machine.
kde_target('EXT_SOURCE_3', train)
Now for the new variable we just made, the number of previous loans at other institutions.
kde_target('previous_loan_counts', train)
From this it's difficult to tell if this variable will be important. The correlation coefficient is extremely weak and there is almost no noticeable difference in the distributions.
Let's move on to make a few more variables from the bureau dataframe. We will take the mean, min, and max of every numeric column in the bureau dataframe.
To account for the numeric information in the bureau dataframe, we can compute statistics for all of its numeric columns. To do so, we groupby the client id, agg the grouped dataframe, and merge the result back into the training data. The agg function will only calculate the values for the numeric columns where the operation is considered valid. We will stick to using 'mean', 'max', 'min', and 'sum', but any function can be passed in here. We can even write our own function and use it in an agg call, as in the sketch below.
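As an illustration, here is a minimal sketch of a user-defined aggregation (not used later; AMT_CREDIT_SUM is one of the numeric bureau columns):

# Sketch: a custom function passed to agg alongside a built-in statistic
def value_range(x):
    # Spread between the largest and smallest value in each group
    return x.max() - x.min()

bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].agg(['mean', value_range]).head()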
# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()
bureau_agg.head()
We need to create new names for each of these columns. The following code makes new names by appending the stat to the variable name. Here we have to deal with the fact that the dataframe has multi-level columns. I find these confusing and hard to work with, so I try to reduce them to a single level as quickly as possible (a compact alternative is sketched after the output below).
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variables names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

# Assign the list of column names as the dataframe column names
bureau_agg.columns = columns
bureau_agg.head()
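For reference, the renaming above can be written more compactly by iterating over the MultiIndex tuples directly. A sketch that should produce the same columns:

# Sketch: compact flattening of the multi-level column names
grouped = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR').agg(['count', 'mean', 'max', 'min', 'sum'])
grouped.columns = ['bureau_%s_%s' % (var, stat) for var, stat in grouped.columns]
grouped = grouped.reset_index()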
Now we simply merge with the training data as we did before.
# Merge with the training data
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
train.head()
We can calculate the correlation of all new values with the target. Again, we can use these as an approximation of the variables which may be important for modeling.
# List of new correlations
new_corrs = []

# Iterate through the columns
for col in columns:
    # Calculate correlation with the target
    corr = train['TARGET'].corr(train[col])
    # Append the column name and correlation as a tuple
    new_corrs.append((col, corr))
In the code below, we sort the correlations by magnitude (absolute value) using the sorted Python function. We also make use of an anonymous lambda function, another important Python operation that is good to know.
# Sort the correlations by the absolute value
# Make sure to reverse to put the largest values at the front of list
new_corrs = sorted(new_corrs, key = lambda x: abs(x[1]), reverse = True)
new_corrs[:15]
None of the new variables have a significant correlation with the TARGET. We can look at the KDE plot of the variable with the highest correlation with the target in terms of absolute magnitude, bureau_DAYS_CREDIT_mean.
kde_target('bureau_DAYS_CREDIT_mean', train)
The definition of this column is: "How many days before current application did client apply for Credit Bureau credit". My interpretation: this is the number of days before the current application at Home Credit that the previous loan was applied for. Therefore, a more negative number indicates the previous loan was applied for further in the past. We see an extremely weak positive relationship between the average of this variable and the target, meaning that clients who applied for loans further in the past are potentially more likely to repay loans at Home Credit. With a correlation this weak though, it is just as likely to be noise as signal.
When we have lots of variables, we expect some of them to be correlated just by pure chance, a problem known as multiple comparisons. We can make hundreds of features, and some will turn out to be correlated with the target simply because of random noise in the data. Then, when our model trains, it may overfit to these variables because it thinks they have a relationship with the target in the training set, but this does not necessarily generalize to the test set. There are many considerations that we have to take into account when making features!
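To see the multiple comparisons problem in action, here is a small sketch: purely random features still show nonzero correlations with the target, and with hundreds of candidates a few will look deceptively good.

# Sketch: correlations of 100 purely random features with the target
rng = np.random.RandomState(50)
random_features = pd.DataFrame(rng.randn(len(train), 100))
spurious = random_features.apply(lambda col: col.corr(train['TARGET']))
print('Largest absolute correlation of the random features: %0.4f' % spurious.abs().max())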
Let's encapsulate all of the previous work into a function. This will allow us to compute aggregate stats for numeric columns across any dataframe. We will re-use this function when we want to apply the same operations for other dataframes.
def agg_numeric(df, group_var, df_name):
    """Aggregates the numeric values in a dataframe. This can
    be used to create features for each instance of the grouping variable.

    Parameters
    --------
        df (dataframe):
            the dataframe to calculate the statistics on
        group_var (string):
            the variable by which to group df
        df_name (string):
            the variable used to rename the columns

    Return
    --------
        agg (dataframe):
            a dataframe with the statistics aggregated for
            all numeric columns. Each instance of the grouping variable will have
            the statistics (mean, min, max, sum; currently supported) calculated.
            The columns are also renamed to keep track of features created.

    """
    # Remove id variables other than grouping variable
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns = col)

    group_ids = df[group_var]
    numeric_df = df.select_dtypes('number')
    numeric_df[group_var] = group_ids

    # Group by the specified variable and calculate the statistics
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()

    # Need to create new column names
    columns = [group_var]

    # Iterate through the variables names
    for var in agg.columns.levels[0]:
        # Skip the grouping variable
        if var != group_var:
            # Iterate through the stat names
            for stat in agg.columns.levels[1][:-1]:
                # Make a new column name for the variable and stat
                columns.append('%s_%s_%s' % (df_name, var, stat))

    agg.columns = columns
    return agg
bureau_agg_new = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg_new.head()
To make sure the function worked as intended, we should compare with the aggregated dataframe we constructed by hand.
bureau_agg.head()
If we go through and inspect the values, we do find that they are equivalent. We will be able to reuse this function for calculating numeric stats for other dataframes. Using functions allows for consistent results and decreases the amount of work we have to do in the future!
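We can also verify the match programmatically instead of by eye. A sketch (this relies on the two dataframes sharing the same column names and ordering, which the construction above guarantees):

# Verify the function output matches the hand-built aggregation
from pandas.testing import assert_frame_equal
assert_frame_equal(bureau_agg, bureau_agg_new, check_dtype = False)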
Before we move on, we can also make the code to calculate correlations with the target into a function.
# Function to calculate correlations with the target for a dataframe
def target_corrs(df):

    # List of correlations
    corrs = []

    # Iterate through the columns
    for col in df.columns:
        print(col)
        # Skip the target column
        if col != 'TARGET':
            # Calculate correlation with the target
            corr = df['TARGET'].corr(df[col])
            # Append the column name and correlation as a tuple
            corrs.append((col, corr))

    # Sort by absolute magnitude of correlations
    corrs = sorted(corrs, key = lambda x: abs(x[1]), reverse = True)

    return corrs
Now we move from the numeric columns to the categorical columns. These are discrete string variables, so we cannot just calculate statistics such as mean and max which only work with numeric variables. Instead, we will rely on calculating value counts of each category within each categorical variable. As an example, if we have the following dataframe:
SK_ID_CURR | Loan type |
---|---|
1 | home |
1 | home |
1 | home |
1 | credit |
2 | credit |
3 | credit |
3 | cash |
3 | cash |
4 | credit |
4 | home |
4 | home |
we will use this information by counting the number of loans in each category for each client:
SK_ID_CURR | credit count | cash count | home count | total count |
---|---|---|---|---|
1 | 1 | 0 | 3 | 4 |
2 | 1 | 0 | 0 | 1 |
3 | 1 | 2 | 0 | 3 |
4 | 1 | 0 | 2 | 3 |
Then we can normalize these value counts by the total number of occurrences of that categorical variable for that observation (meaning that the normalized counts must sum to 1.0 for each observation).
SK_ID_CURR | credit count | cash count | home count | total count | credit count norm | cash count norm | home count norm |
---|---|---|---|---|---|---|---|
1 | 1 | 0 | 3 | 4 | 0.25 | 0 | 0.75 |
2 | 1 | 0 | 0 | 1 | 1.00 | 0 | 0 |
3 | 1 | 2 | 0 | 3 | 0.33 | 0.67 | 0 |
4 | 1 | 0 | 2 | 3 | 0.33 | 0 | 0.67 |
Hopefully, encoding the categorical variables this way will allow us to capture the information they contain. We will now go through this process step-by-step. At the end, we will wrap up all the code into one function to be re-used for many dataframes.
First we one-hot encode a dataframe with only the categorical columns (dtype == 'object').
categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['SK_ID_CURR'] = bureau['SK_ID_CURR']
categorical.head()
categorical_grouped = categorical.groupby('SK_ID_CURR').agg(['sum', 'mean'])
categorical_grouped.head()
The sum columns represent the count of that category for the associated client and the mean columns represent the normalized count. One-hot encoding makes the process of calculating these figures very easy!
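To see why this works, here is a toy sketch with made-up data: after one-hot encoding, the groupby sum gives the count of each category per client and the groupby mean gives the normalized count.

# Toy sketch: groupby sum = category counts, groupby mean = normalized counts
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'Loan type': ['home', 'home', 'credit', 'credit']})
pd.get_dummies(toy).groupby('SK_ID_CURR').agg(['sum', 'mean'])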
We can use a similar function as before to rename the columns. Again, we have to deal with the multi-level column index. We iterate through the first level (level 0), which is the name of the categorical variable appended with the value of the category (from one-hot encoding). Then we iterate through the stats we calculated for each client. We will rename the column with the level 0 name appended with the stat. As an example, the column with CREDIT_ACTIVE_Active as level 0 and sum as level 1 will become CREDIT_ACTIVE_Active_count.
categorical_grouped.columns.levels[0][:10]
categorical_grouped.columns.levels[1]
group_var = 'SK_ID_CURR'

# Need to create new column names
columns = []

# Iterate through the variables names
for var in categorical_grouped.columns.levels[0]:
    # Skip the grouping variable
    if var != group_var:
        # Iterate through the stat names
        for stat in ['count', 'count_norm']:
            # Make a new column name for the variable and stat
            columns.append('%s_%s' % (var, stat))

# Rename the columns
categorical_grouped.columns = columns
categorical_grouped.head()
The count columns record the raw counts and the count_norm columns record the normalized counts.
We can merge this dataframe into the training data.
train = train.merge(categorical_grouped, left_on = 'SK_ID_CURR', right_index = True, how = 'left')
train.head()
train.shape
train.iloc[:10, 123:]
To make the code more efficient, we can now write a function to handle the categorical variables for us. This will take the same form as the agg_numeric function: it accepts a dataframe and a grouping variable, then calculates the counts and normalized counts of each category for all categorical variables in the dataframe.
def count_categorical(df, group_var, df_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.

    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row

    df_name : string
        Variable added to the front of column names to keep track of columns

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category in every categorical variable
        with one row for every unique value of the `group_var`.

    """

    # Select the categorical columns
    categorical = pd.get_dummies(df.select_dtypes('object'))

    # Make sure to put the identifying id on the column
    categorical[group_var] = df[group_var]

    # Groupby the group var and calculate the sum and mean
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    column_names = []

    # Iterate through the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate through the stats in level 1
        for stat in ['count', 'count_norm']:
            # Make a new column name
            column_names.append('%s_%s_%s' % (df_name, var, stat))

    categorical.columns = column_names

    return categorical
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
We will now turn to the bureau_balance dataframe. This dataframe has monthly information about each client's previous loan(s) with other financial institutions. Instead of grouping this dataframe by SK_ID_CURR, the client id, we will first group by SK_ID_BUREAU, the id of the previous loan. This will give us one row of the dataframe for each loan. Then we can group by SK_ID_CURR and calculate the aggregations across each client's loans. The final result will be a dataframe with one row for each client, with stats calculated for their loans.
# Read in bureau balance
bureau_balance = pd.read_csv('./input/bureau_balance.csv')
bureau_balance.head()
First, we can calculate the value counts of each status for each loan. Fortunately, we already have a function that does this for us!
# Counts of each type of status for each previous loan
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()
Now we can handle the one numeric column. The MONTHS_BALANCE column records the "months of balance relative to application date." This might not necessarily be that important as a numeric variable, and in future work we might want to consider treating it as a time variable. For now, we can just calculate the same aggregation statistics as previously.
# Calculate numeric aggregation statistics for each `SK_ID_BUREAU`
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()
The above dataframes have the calculations done on each loan. Now we need to aggregate these for each client. We can do this by merging the dataframes together first. Then, since all the variables are numeric, we just need to aggregate the statistics again, this time grouping by SK_ID_CURR.
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')
# Merge to include the SK_ID_CURR
bureau_by_loan = bureau_by_loan.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], on = 'SK_ID_BUREAU', how = 'left')
bureau_by_loan.head()
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')
bureau_balance_by_client.head()
To recap, for the bureau_balance dataframe we:

- calculated numeric statistics grouping by each loan (SK_ID_BUREAU)
- made value counts of each categorical variable grouping by each loan
- merged the statistics and value counts on the loans
- calculated numeric statistics for the resulting dataframe grouping by the client id (SK_ID_CURR)

The final resulting dataframe has one row for each client, with statistics calculated for all of their loans with monthly balance information.
Some of these variables are a little confusing, so let's try to explain a few:
- client_bureau_balance_MONTHS_BALANCE_mean_mean: for each loan, calculate the mean value of MONTHS_BALANCE. Then, for each client, calculate the mean of this value across all of their loans.
- client_bureau_balance_STATUS_X_count_norm_sum: for each loan, calculate the number of occurrences of STATUS == X divided by the total number of STATUS values for the loan. Then, for each client, add up these values over their loans.

We will hold off on calculating the correlations until we have all the variables together in one dataframe.
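Here is a tiny worked sketch of the first variable with made-up numbers: a client with two loans whose per-loan means of MONTHS_BALANCE are -2 and -4 gets a client-level value of -3.

# Sketch: mean of MONTHS_BALANCE per loan, then mean of the loan means per client
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 1],
                    'SK_ID_BUREAU': [10, 10, 11, 11],
                    'MONTHS_BALANCE': [-1, -3, -2, -6]})
per_loan = toy.groupby(['SK_ID_CURR', 'SK_ID_BUREAU'])['MONTHS_BALANCE'].mean()
per_client = per_loan.groupby(level = 'SK_ID_CURR').mean()
print(per_client)   # client 1: (-2 + -4) / 2 = -3.0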
We now have all the pieces in place to take the information from the previous loans at other institutions and the monthly payment information about these loans and put them into the main training dataframe. Let's do a reset of all the variables and then use the functions we built to do this from the ground up. This demonstrates the benefit of using functions for repeatable workflows!
# Free up memory by deleting old objects
import gc
gc.enable()
del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()
# Read in new copies of all the dataframes
train = pd.read_csv('./input/application_train.csv')
bureau = pd.read_csv('./input/bureau.csv')
bureau_balance = pd.read_csv('./input/bureau_balance.csv')
bureau_counts = count_categorical(bureau, group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_counts.head()
bureau_agg = agg_numeric(bureau.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'bureau')
bureau_agg.head()
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_counts.head()
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', df_name = 'bureau_balance')
bureau_balance_agg.head()
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')
# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')
# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', df_name = 'client')
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
After all that work, now we want to take a look at the variables we have created. We can look at the percentage of missing values, the correlations of variables with the target, and the correlation of variables with other variables. The correlations between variables can show if we have collinear variables, that is, variables that are highly correlated with one another. Often, we want to remove one in a pair of collinear variables because having both would be redundant. We can also use the percentage of missing values to remove features with a substantial majority of values that are not present. Feature selection will be an important focus going forward, because reducing the number of features can help the model learn during training and also generalize better to the testing data. The "curse of dimensionality" is the name given to the issues caused by having too many features (too high of a dimension). As the number of variables increases, the number of datapoints needed to learn the relationship between these variables and the target value increases exponentially.
Feature selection is the process of removing variables to help our model learn and generalize better to the testing set. The objective is to remove useless/redundant variables while preserving the useful ones. There are a number of tools we can use for this process, but in this notebook we will stick to removing columns with a high percentage of missing values and variables that have a high correlation with one another. Later we can look at using the feature importances returned from models such as the Gradient Boosting Machine or Random Forest to perform feature selection.
An important consideration is the missing values in the dataframe. Columns with too many missing values might have to be dropped.
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_table_ren_columns.shape[0]) +
           " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_train = missing_values_table(train)
missing_train.head(10)
We see there are a number of columns with a high percentage of missing values. There is no well-established threshold for removing missing values, and the best course of action depends on the problem. Here, to reduce the number of features, we will remove any columns in either the training or the testing data that have greater than 90% missing values.
missing_train_vars = list(missing_train.index[missing_train['% of Total Values'] > 90])
len(missing_train_vars)
Before we remove the missing values, we will find the missing value percentages in the testing data. We'll then remove any columns with greater than 90% missing values in either the training or testing data. Let's now read in the testing data, perform the same operations, and look at the missing values in the testing data. We already have calculated all the counts and aggregation statistics, so we only need to merge the testing data with the appropriate data.
# Read in the test dataframe
test = pd.read_csv('./input/application_test.csv')
# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the monthly information grouped by client
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
print('Shape of Testing Data: ', test.shape)
We need to align the testing and training dataframes, which means making sure they have exactly the same columns. This shouldn't be an issue here, but when we one-hot encode variables, we need to align the dataframes to make sure they have the same columns.
train_labels = train['TARGET']
# Align the dataframes, this will remove the 'TARGET' column
train, test = train.align(test, join = 'inner', axis = 1)
train['TARGET'] = train_labels
print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
The dataframes now have the same columns (with the exception of the TARGET column in the training data). This means we can use them in a machine learning model, which needs to see the same columns in both the training and testing dataframes.
Let's now look at the percentage of missing values in the testing data so we can figure out the columns that should be dropped.
missing_test = missing_values_table(test)
missing_test.head(10)
missing_test_vars = list(missing_test.index[missing_test['% of Total Values'] > 90])
len(missing_test_vars)
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the training or testing data.' % len(missing_columns))
# Drop the missing columns
train = train.drop(columns = missing_columns)
test = test.drop(columns = missing_columns)
We ended up removing no columns in this round because there are no columns with more than 90% missing values. We might have to apply another feature selection method to reduce the dimensionality.
At this point we will save both the training and testing data. I encourage anyone to try different percentages for dropping the missing columns and compare the outcomes.
train.to_csv('train_bureau_raw.csv', index = False)
test.to_csv('test_bureau_raw.csv', index = False)
First let's look at the correlations of the variables with the target. We can see if any of the variables we created have a greater correlation with the target than those already present in the training data (from application).
# Calculate all correlations in dataframe
corrs = train.corr()
corrs = corrs.sort_values('TARGET', ascending = False)
# Ten most positive correlations
pd.DataFrame(corrs['TARGET'].head(10))
# Ten most negative correlations
pd.DataFrame(corrs['TARGET'].dropna().tail(10))
The variable most highly correlated with the target (other than TARGET itself, which of course has a correlation of 1) is a variable we created. However, just because a variable is correlated does not mean that it will be useful, and we have to remember that if we generate hundreds of new variables, some are going to be correlated with the target simply because of random noise.
Viewing the correlations skeptically, it does appear that several of the newly created variables may be useful. To assess the "usefulness" of variables, we will look at the feature importances returned by the model. For curiosity's sake (and because we already wrote the function) we can make a KDE plot of two of the newly created variables.
kde_target(var_name='client_bureau_balance_MONTHS_BALANCE_count_mean', df=train)
This variable represents the average number of monthly records per loan for each client. For example, if a client had three previous loans with 3, 4, and 5 records in the monthly data, the value of this variable for them would be 4. Based on the distribution, clients with a greater number of average monthly records per loan were more likely to repay their loans with Home Credit. Let's not read too much into this value, but it could indicate that clients who have had more previous credit history are generally more likely to repay a loan.
kde_target(var_name='bureau_CREDIT_ACTIVE_Active_count_norm', df=train)
Well, this distribution is all over the place. This variable represents the number of previous loans with a CREDIT_ACTIVE value of Active divided by the total number of previous loans for a client. The correlation here is so weak that I do not think we should draw any conclusions!
We can calculate not only the correlations of the variables with the target, but also the correlation of each variable with every other variable. This will allow us to see if there are highly collinear variables that should perhaps be removed from the data.
Let's look for any variables that have a greater than 0.8 correlation with other variables.
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corrs:
    above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])
For each of these pairs of highly correlated variables, we only want to remove one of the variables. The following code creates a set of variables to remove by only adding one of each pair.
# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        # Skip the column's correlation with itself
        if x == key:
            continue
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
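For reference, a common idiom accomplishes the same goal with the upper triangle of the correlation matrix. A sketch (note it flags columns by absolute correlation, a slight variation on the loop above, and it first restores the symmetric row ordering of corrs, since we sorted the rows earlier):

# Sketch: flag one column of each highly correlated pair via the upper triangle
corr_matrix = corrs.loc[corrs.columns, corrs.columns].abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype = bool), k = 1))
to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
print('Number of columns flagged: ', len(to_drop))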
We can remove these columns from both the training and the testing datasets. We will have to compare performance after removing these variables with performance keeping these variables (the raw csv files we saved earlier).
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)
print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index = False)
To actually test the performance of these new datasets, we will try using them for machine learning! Here we will use a function I developed in another notebook to compare the features (the raw version versus the version with the highly correlated variables removed). We can run this kind of like an experiment, and the control will be the performance of just the application data in this function when submitted to the competition. I've already recorded that performance, so we can list out our control and our two test conditions. For all datasets, we use the model shown below with the exact same hyperparameters:

- Control: only the data in the application files.
- Test One: the data in the application files joined with all of the data recorded from the bureau and bureau_balance files.
- Test Two: the data in the application files joined with all of the data recorded from the bureau and bureau_balance files, with the highly correlated variables removed.

import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import gc
import matplotlib.pyplot as plt
def model(features, test_features, encoding = 'ohe', n_folds = 5):

    """Train and test a light gradient boosting model using
    cross validation.

    Parameters
    --------
        features (pd.DataFrame):
            dataframe of training features to use
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame):
            dataframe of testing features to use
            for making predictions with the model.
        encoding (str, default = 'ohe'):
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): number of folds to use for cross validation

    Return
    --------
        submission (pd.DataFrame):
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame):
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame):
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.

    """

    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']

    # Extract the labels for training
    labels = features['TARGET']

    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])

    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)

        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)

        # No categorical indices to record
        cat_indices = 'auto'

    # Integer label encoding
    elif encoding == 'le':

        # Create a label encoder
        label_encoder = LabelEncoder()

        # List for storing categorical indices
        cat_indices = []

        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)

    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")

    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)

    # Extract feature names
    feature_names = list(features.columns)

    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)

    # Create the kfold object (without shuffling, a random_state has no effect, so it is omitted)
    k_fold = KFold(n_splits = n_folds, shuffle = False)

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))

    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])

    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])

    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []

    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):

        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]

        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1,
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

        # Train the model
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  early_stopping_rounds = 100, verbose = 200)

        # Record the best iteration
        best_iteration = model.best_iteration_

        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits

        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits

        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]

        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']

        valid_scores.append(valid_score)
        train_scores.append(train_score)

        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()

    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})

    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})

    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)

    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))

    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')

    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores})

    return submission, feature_importances, metrics
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better.

    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`

    Returns:
        shows a plot of the 15 most important features

        df (dataframe): feature importances sorted by importance (highest to lowest)
        with a column for normalized importance
    """

    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()

    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()

    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))),
            df['importance_normalized'].head(15),
            align = 'center', edgecolor = 'k')

    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))

    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()

    return df
The first step in any experiment is establishing a control. For this we will use the function defined above (which implements a Gradient Boosting Machine model) and the single main data source (application).
train_control = pd.read_csv('./input/application_train.csv')
test_control = pd.read_csv('./input/application_test.csv')
Fortunately, once we have taken the time to write a function, using it is simple. The function above returns a submission dataframe we can upload to the competition, a fi dataframe of feature importances, and a metrics dataframe with validation and test performance.
submission, fi, metrics = model(train_control, test_control)
metrics
The control slightly overfits because the training score is higher than the validation score. We can address this in later notebooks when we look at regularization (we already perform some regularization in this model by using reg_lambda and reg_alpha as well as early stopping).
We can visualize the feature importances with another function, plot_feature_importances. The feature importances may be useful when it's time for feature selection.
fi_sorted = plot_feature_importances(fi)
submission.to_csv('control.csv', index = False)
The control scores 0.745 when submitted to the competition.
Let's conduct the first test. We will just need to pass in the data to the function, which does most of the work for us.
submission_raw, fi_raw, metrics_raw = model(train, test)
metrics_raw
Based on these numbers, the engineered features perform better than the control case. However, we will have to submit the predictions to the leaderboard before we can say if this better validation performance transfers to the testing data.
fi_raw_sorted = plot_feature_importances(fi_raw)
Examining the feature importances, it looks as if a few of the features we constructed are among the most important. Let's find the percentage of the top 100 most important features that we made in this notebook. Rather than comparing to the original features alone, we need to compare to the one-hot encoded original features, which are already recorded for us in fi (from the original data).
top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]
print('%% of Top 100 Features created from the bureau data = %d.00' % len(new_features))
Over half of the top 100 features were made by us! That should give us confidence that all the hard work we did was worthwhile.
submission_raw.to_csv('test_one.csv', index = False)
Test one scores 0.759 when submitted to the competition.
That was easy, so let's do another run! Same as before but with the highly collinear variables removed.
submission_corrs, fi_corrs, metrics_corr = model(train_corrs_removed, test_corrs_removed)
metrics_corr
These results are better than the control, but slightly lower than the raw features.
fi_corrs_sorted = plot_feature_importances(fi_corrs)
submission_corrs.to_csv('test_two.csv', index = False)
Test Two scores 0.753 when submitted to the competition.
After all that work, we can say that including the extra information did improve performance! The model is definitely not optimized to our data, but we still had a noticeable improvement over the original dataset when using the calculated features. Let's officially summarize the performances:
Experiment | Train AUC | Validation AUC | Test AUC |
---|---|---|---|
Control | 0.815 | 0.760 | 0.745 |
Test One | 0.837 | 0.767 | 0.759 |
Test Two | 0.826 | 0.765 | 0.753 |
(Note that these scores may change from run to run of the notebook. I have not observed that the general ordering changes however.)
All of our hard work translates to a small improvement of 0.014 ROC AUC on the test set over the control. Removing the highly collinear variables slightly decreased performance, so we will want to consider a different method for feature selection. Moreover, we can say that some of the features we built are among the most important as judged by the model.
In a competition such as this, even an improvement of this size is enough to move us up hundreds of spots on the leaderboard. By making numerous small improvements like those in this notebook, we can gradually achieve better and better performance. I encourage others to use the results here to make their own improvements, and I will continue to document the steps I take to help others.
Going forward, we can now use the functions we developed in this notebook on the other datasets. There are still 4 other data files to use in our model!