Hello and Welcome to Kaggle, the online Data Science Community to learn, share, and compete. Most beginners get lost in the field, because they fall into the black box approach, using libraries and algorithms they don't understand. This tutorial will give you a 1-2-year head start over your peers, by providing a framework that teaches you how-to think like a data scientist vs what to think/code. Not only will you be able to submit to your first competition, but you'll be able to solve any problem thrown your way. I provide clear explanations, clean code, and plenty of links to resources. Please Note: This Kernel is still being improved. So check the Change Logs below for updates. Also, please be sure to upvote, fork, and comment, and I'll continue to develop it. Thanks, and may you have "statistically significant" luck!

Table of Contents

  1. Chapter 1 - How a Data Scientist Beat the Odds
  2. Chapter 2 - A Data Science Framework
  3. Chapter 3 - Step 1: Define the Problem and Step 2: Gather the Data
  4. Chapter 4 - Step 3: Prepare Data for Consumption
  5. Chapter 5 - The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting
  6. Chapter 6 - Step 4: Perform Exploratory Analysis with Statistics
  7. Chapter 7 - Step 5: Model Data
  8. Chapter 8 - Evaluate Model Performance
  9. Chapter 9 - Tune Model with Hyper-Parameters
  10. Chapter 10 - Tune Model with Feature Selection
  11. Chapter 11 - Step 6: Validate and Implement
  12. Chapter 12 - Conclusion and Step 7: Optimize and Strategize
  13. Change Log
  14. Credits

How-to Use this Tutorial: Read the explanations provided in this Kernel and the links to developer documentation. The goal is to not just learn the whats, but the whys. If you don't understand something in the code, the print() function is your best friend. In coding, it's okay to try, fail, and try again. If you do run into problems, Google is your second best friend, because 99.99% of the time, someone else has had the same question/problem and already asked the coding community. If you've exhausted all your resources, the Kaggle Community, via forums and comments, can help too.

How a Data Scientist Beat the Odds

It's the classic problem: predict the outcome of a binary event. In layman's terms, this means it either occurred or did not occur. For example, you won or did not win, you passed the test or did not pass the test, you were accepted or not accepted, and you get the point. A common business application is churn or customer retention. Another popular use case is healthcare's mortality rate or survival analysis. Binary events create an interesting dynamic, because we know statistically, a random guess should achieve a 50% accuracy rate, without creating one single algorithm or writing one single line of code. However, just like autocorrect spellcheck technology, sometimes we humans can be too smart for our own good and actually underperform a coin flip. In this kernel, I use Kaggle's Getting Started Competition, Titanic: Machine Learning from Disaster, to walk the reader through how-to use the data science framework to beat the odds.

What happens when technology is too smart for its own good? (Think funny autocorrect fails.)

A Data Science Framework

  1. Define the Problem: If data science, big data, machine learning, predictive analytics, business intelligence, or any other buzzword is the solution, then what is the problem? As the saying goes, don't put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology. Too often we are quick to jump on the new shiny technology, tool, or algorithm before determining the actual problem we are trying to solve.
  2. Gather the Data: John Naisbitt wrote in his 1984 (yes, 1984) book Megatrends that we are "drowning in data, yet starving for knowledge." So, chances are, the dataset(s) already exist somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, etc. As the saying goes, you don't have to reinvent the wheel, you just have to know where to find it. In the next step, we will worry about transforming "dirty data" into "clean data."
  3. Prepare Data for Consumption: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
  4. Perform Exploratory Analysis: Anybody who has ever worked with data knows, garbage-in, garbage-out (GIGO). Therefore, it is important to deploy descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs quantitative) is also important to understand and select the correct hypothesis test or data model.
  5. Model Data: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It's important to remember, algorithms are tools and not magical wands or silver bullets. You must still be the master craft(wo)man who knows how-to select the right tool for the job. An analogy would be asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modelling. The wrong model can lead to poor performance at best and the wrong conclusion (that's used as actionable intelligence) at worst.
  6. Validate and Implement Data Model: After you've trained your model based on a subset of your data, it's time to test your model. This helps ensure you haven't overfit your model or made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine if our model overfits, generalizes, or underfits our dataset.
  7. Optimize and Strategize: This is the "bionic man" step, where you iterate back through the process to make it better...stronger...faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. Once you're able to package your ideas, this becomes your “currency exchange" rate.

Step 1: Define the Problem

For this project, the problem statement is given to us on a golden platter: develop an algorithm to predict the survival outcome of passengers on the Titanic.

......

Project Summary: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

  • Binary classification
  • Python and R basics

Step 2: Gather the Data

The dataset is also given to us on a golden platter, with test and train data, at Kaggle's Titanic: Machine Learning from Disaster.

Step 3: Prepare Data for Consumption

Since step 2 was handed to us on a golden platter, much of step 3 is as well. Therefore, normal processes in data wrangling, such as data architecture, governance, and extraction, are out of scope. Thus, only data cleaning is in scope.

3.1 Import Libraries

The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks. The idea is: why write ten lines of code when you can write one?

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

#misc libraries
import random
import time


#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)



# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
Python version: 3.6.3 |Anaconda custom (64-bit)| (default, Nov 20 2017, 20:41:42) 
[GCC 7.2.0]
pandas version: 0.20.3
matplotlib version: 2.1.1
NumPy version: 1.13.0
SciPy version: 1.0.0
IPython version: 5.3.0
scikit-learn version: 0.19.1
-------------------------
gender_submission.csv
test.csv
train.csv

3.11 Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn libraries. Below are common classes to load.

In [2]:
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.tools.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

3.2 Meet and Greet Data

This is the meet and greet step. Get to know your data by first name and learn a little bit about it. What does it look like (datatype and values), what makes it tick (independent/feature variable(s)), and what are its goals in life (dependent/target variable(s))? Think of it like a first date, before you jump in and start poking it in the bedroom.

To begin this step, we first import our data. Next, we use the info() and sample() functions to get a quick and dirty overview of variable datatypes (i.e. qualitative vs quantitative). The source data dictionary is available on the competition's Data page.

  1. The Survived variable is our outcome or dependent variable. It is a binary nominal datatype: 1 for survived and 0 for did not survive. All other variables are potential predictor or independent variables. It's important to note that more predictor variables do not make a better model; the right variables do.
  2. The PassengerId and Ticket variables are assumed to be random unique identifiers that have no impact on the outcome variable. Thus, they will be excluded from analysis.
  3. The Pclass variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.
  4. The Name variable is a nominal datatype. It could be used in feature engineering to derive gender from title, family size from surname, and SES from titles like doctor or master. Since this information already exists within the name, we'll make use of it to see if title, like master, makes a difference.
  5. The Sex and Embarked variables are nominal datatypes. They will be converted to dummy variables for mathematical calculations.
  6. The Age and Fare variables are continuous quantitative datatypes.
  7. The SibSp variable represents the number of related siblings/spouses aboard and Parch represents the number of related parents/children aboard. Both are discrete quantitative datatypes. They can be used for feature engineering to create a family size and an is-alone variable.
  8. The Cabin variable is a nominal datatype that can be used in feature engineering for the approximate position on the ship when the incident occurred and SES from deck level. However, since there are many null values, it does not add value and thus is excluded from analysis.
In [3]:
#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data_raw = pd.read_csv('../input/train.csv')


#a dataset should be broken into 3 splits: train, test, and (final) validation
#the test file provided is the validation file for competition submission
#we will split the train set into train and test data in future sections
data_val  = pd.read_csv('../input/test.csv')


#to play with our data we'll create a copy
#remember python assignment or equal passes by reference vs values, so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs
data1 = data_raw.copy(deep = True)

#however passing by reference is convenient, because we can clean both datasets at once
data_cleaner = [data1, data_val]


#preview data
print (data_raw.info()) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
#data_raw.head() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
#data_raw.tail() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html
data_raw.sample(10) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
855 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 392091 9.3500 NaN S
701 702 1 1 Silverthorne, Mr. Spencer Victor male 35.0 0 0 PC 17475 26.2875 E24 S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.5500 NaN S
409 410 0 3 Lefebre, Miss. Ida female NaN 3 1 4133 25.4667 NaN S
776 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
404 405 0 3 Oreskovic, Miss. Marija female 20.0 0 0 315096 8.6625 NaN S
49 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 NaN S
836 837 0 3 Pasic, Mr. Jakob male 21.0 0 0 315097 8.6625 NaN S
619 620 0 2 Gavey, Mr. Lawrence male 26.0 0 0 31028 10.5000 NaN S
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

3.21 The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

In this stage, we will clean our data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.

  1. Correcting: Reviewing the data, there do not appear to be any aberrant or non-acceptable data inputs. In addition, we see we may have potential outliers in age and fare. However, since they are reasonable values, we will wait until after we complete our exploratory analysis to determine if we should include or exclude them from the dataset. It should be noted that if they were unreasonable values, for example age = 800 instead of 80, then it's probably a safe decision to fix them now. However, we want to use caution when we modify data from its original value, because the original values may be necessary to create an accurate model.
  2. Completing: There are null values or missing data in the age, cabin, and embarked fields. Missing values can be bad, because some algorithms don't know how-to handle them and will fail, while others, like decision trees, can handle null values. Either way, it's important to fix this before we start modeling, because we will compare and contrast several models. There are two common methods: either delete the record or populate the missing value using a reasonable input. It is not recommended to delete the record, especially a large percentage of records, unless it truly represents an incomplete record. Instead, it's best to impute missing values. A basic methodology for qualitative data is to impute using the mode. A basic methodology for quantitative data is to impute using the mean, median, or mean + randomized standard deviation. An intermediate methodology is to apply the basic methodology within groups defined by specific criteria, like the average age by class or the embark port by fare and SES (a sketch of this group-based approach appears after this list). There are more complex methodologies; however, before deploying them, they should be compared to the base model to determine if the added complexity truly adds value. For this dataset, age will be imputed with the median, the cabin attribute will be dropped, and embarked will be imputed with the mode. Subsequent model iterations may modify this decision to determine if it improves the model's accuracy.
  3. Creating: Feature engineering is when we use existing features to create new features, to determine if they provide new signals to predict our outcome. For this dataset, we will create a title feature to determine if it played a role in survival.
  4. Converting: Last, but certainly not least, we'll deal with formatting. There are no date or currency formats to deal with, but there are datatype formats. Our categorical data imported as objects, which makes it difficult to use in mathematical calculations. For this dataset, we will convert object datatypes to categorical dummy variables.
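
For illustration, the intermediate (group-based) imputation mentioned above could look like the following minimal sketch. It is not the path we take below (the next cell uses the simpler overall median), and the choice of grouping by Pclass is just an assumption for the example.

#a minimal sketch of group-based imputation (illustrative only; the cell below uses the overall median instead)
#fill missing ages with the median age of each passenger class
age_by_class = data_raw.groupby('Pclass')['Age'].transform('median')
age_grouped_fill = data_raw['Age'].fillna(age_by_class)
print(age_grouped_fill.isnull().sum()) #no missing ages remain after the group-based fill
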
In [4]:
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)

print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)

data_raw.describe(include = 'all')
Train columns with null values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------
Test/Validation columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
----------
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Lindahl, Miss. Agda Thorilda Viktoria male NaN NaN NaN 1601 NaN C23 C25 C27 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN
In [5]:
###COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:    
    #complete missing age with median
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)

    #complete embarked with mode
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)

    #complete missing fare with median
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
    
#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

print(data1.isnull().sum())
print("-"*10)
print(data_val.isnull().sum())
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
----------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64
In [6]:
###CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:    
    #Discrete variables
    dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1

    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset['IsAlone'].loc[dataset['FamilySize'] > 1] = 0 # now update to no/0 if family size is greater than 1

    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]


    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)


    
#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index

#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*10)


#preview data again
data1.info()
data_val.info()
data1.sample(10)
Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Name          891 non-null object
Sex           891 non-null object
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null object
FamilySize    891 non-null int64
IsAlone       891 non-null int64
Title         891 non-null object
FareBin       891 non-null category
AgeBin        891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
FamilySize     418 non-null int64
IsAlone        418 non-null int64
Title          418 non-null object
FareBin        418 non-null category
AgeBin         418 non-null category
dtypes: category(2), float64(2), int64(6), object(6)
memory usage: 46.8+ KB
Out[6]:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareBin AgeBin
651 1 2 Doling, Miss. Elsie female 18.0 0 1 23.0000 S 2 0 Miss (14.454, 31.0] (16.0, 32.0]
561 0 3 Sivic, Mr. Husein male 40.0 0 0 7.8958 S 1 1 Mr (-0.001, 7.91] (32.0, 48.0]
737 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 512.3292 C 1 1 Mr (31.0, 512.329] (32.0, 48.0]
368 1 3 Jermyn, Miss. Annie female 28.0 0 0 7.7500 Q 1 1 Miss (-0.001, 7.91] (16.0, 32.0]
837 0 3 Sirota, Mr. Maurice male 28.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (16.0, 32.0]
299 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0 1 247.5208 C 2 0 Mrs (31.0, 512.329] (48.0, 64.0]
273 0 1 Natsch, Mr. Charles H male 37.0 0 1 29.7000 C 2 0 Mr (14.454, 31.0] (32.0, 48.0]
117 0 2 Turpin, Mr. William John Robert male 29.0 1 0 21.0000 S 2 0 Mr (14.454, 31.0] (16.0, 32.0]
810 0 3 Alexander, Mr. William male 26.0 0 0 7.8875 S 1 1 Mr (-0.001, 7.91] (16.0, 32.0]
234 0 2 Leyson, Mr. Robert William Norman male 24.0 0 0 10.5000 S 1 1 Mr (7.91, 14.454] (16.0, 32.0]

3.23 Convert Formats

We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions.

In this step, we will also define our x (independent/features/explanatory/predictor/etc.) and y (dependent/target/outcome/response/etc.) variables for data modeling.

Developer Documentation:

In [7]:
#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset

#code categorical data
label = LabelEncoder()
for dataset in data_cleaner:    
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])


#define y variable aka target/outcome
Target = ['Survived']

#define x variables for original features aka feature selection
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy =  Target + data1_x
print('Original X Y: ', data1_xy, '\n')


#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')


#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')



data1_dummy.head()
Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] 

Bin X Y:  ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code'] 

Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs'] 

Out[7]:
Pclass SibSp Parch Age Fare FamilySize IsAlone Sex_female Sex_male Embarked_C Embarked_Q Embarked_S Title_Master Title_Misc Title_Miss Title_Mr Title_Mrs
0 3 1 0 22.0 7.2500 2 0 0 1 0 0 1 0 0 0 1 0
1 1 1 0 38.0 71.2833 2 0 1 0 1 0 0 0 0 0 0 1
2 3 0 0 26.0 7.9250 1 1 1 0 0 0 1 0 0 1 0 0
3 1 1 0 35.0 53.1000 2 0 1 0 0 0 1 0 0 0 0 1
4 3 0 0 35.0 8.0500 1 1 0 1 0 0 1 0 0 0 1 0

3.24 Da-Double Check Cleaned Data

Now that we've cleaned our data, let's do a discount da-double check!

In [8]:
print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)

print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)

data_raw.describe(include = 'all')
Train columns with null values: 
 Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Embarked         0
FamilySize       0
IsAlone          0
Title            0
FareBin          0
AgeBin           0
Sex_Code         0
Embarked_Code    0
Title_Code       0
AgeBin_Code      0
FareBin_Code     0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
Survived         891 non-null int64
Pclass           891 non-null int64
Name             891 non-null object
Sex              891 non-null object
Age              891 non-null float64
SibSp            891 non-null int64
Parch            891 non-null int64
Fare             891 non-null float64
Embarked         891 non-null object
FamilySize       891 non-null int64
IsAlone          891 non-null int64
Title            891 non-null object
FareBin          891 non-null category
AgeBin           891 non-null category
Sex_Code         891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
AgeBin_Code      891 non-null int64
FareBin_Code     891 non-null int64
dtypes: category(2), float64(2), int64(11), object(4)
memory usage: 120.3+ KB
None
----------
Test/Validation columns with null values: 
 PassengerId        0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            327
Embarked           0
FamilySize         0
IsAlone            0
Title              0
FareBin            0
AgeBin             0
Sex_Code           0
Embarked_Code      0
Title_Code         0
AgeBin_Code        0
FareBin_Code       0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Lindahl, Miss. Agda Thorilda Viktoria male NaN NaN NaN 1601 NaN C23 C25 C27 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

3.25 Split Training and Testing Data

As mentioned previously, the test file provided is really validation data for competition submission. So, we will use sklearn's train_test_split function to split the training data into two datasets with a 75/25 split. This is important so we don't overfit our model, meaning the algorithm becomes so specific to a given subset that it cannot accurately generalize to another subset from the same dataset. It's important our algorithm has not seen the subset we will use to test, so it doesn't "cheat" by memorizing the answers. In later sections we will also use sklearn's cross validation functions, which split our dataset into train and test for data modeling comparison.

In [9]:
#split train and test data with function defaults
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)


print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))

train1_x_bin.head()
Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)
Out[9]:
Sex_Code Pclass Embarked_Code Title_Code FamilySize AgeBin_Code FareBin_Code
105 1 3 2 3 1 1 0
68 0 3 2 2 7 1 1
253 1 3 2 3 2 1 2
320 1 3 2 3 1 1 0
706 0 2 2 4 1 2 1

Step 4: Perform Exploratory Analysis with Statistics

Now that our data is cleaned, we will explore our data with descriptive and graphical statistics to describe and summarize our variables. In this stage, you will find yourself classifying features and determining their correlation with the target variable and each other.

In [10]:
#Discrete Variable Correlation by Survival using
#group by aka pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
for x in data1_x:
    if data1[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')
        

#using crosstabs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
print(pd.crosstab(data1['Title'],data1[Target[0]]))
Survival Correlation by: Sex
      Sex  Survived
0  female  0.742038
1    male  0.188908
---------- 

Survival Correlation by: Pclass
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
---------- 

Survival Correlation by: Embarked
  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
---------- 

Survival Correlation by: Title
    Title  Survived
0  Master  0.575000
1    Misc  0.444444
2    Miss  0.697802
3      Mr  0.156673
4     Mrs  0.792000
---------- 

Survival Correlation by: SibSp
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
---------- 

Survival Correlation by: Parch
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
---------- 

Survival Correlation by: FamilySize
   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000
---------- 

Survival Correlation by: IsAlone
   IsAlone  Survived
0        0  0.505650
1        1  0.303538
---------- 

Survived    0    1
Title             
Master     17   23
Misc       15   12
Miss       55  127
Mr        436   81
Mrs        26   99
In [11]:
#IMPORTANT: Intentionally plotted different ways for learning purposes only. 

#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html

#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html

#to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
#subplot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot
#and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots

#graph distribution of quantitative data
plt.figure(figsize=[16,12])

plt.subplot(231)
plt.boxplot(x=data1['Fare'], showmeans = True, meanline = True)
plt.title('Fare Boxplot')
plt.ylabel('Fare ($)')

plt.subplot(232)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')

plt.subplot(233)
plt.boxplot(data1['FamilySize'], showmeans = True, meanline = True)
plt.title('Family Size Boxplot')
plt.ylabel('Family Size (#)')

plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(235)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

plt.subplot(236)
plt.hist(x = [data1[data1['Survived']==1]['FamilySize'], data1[data1['Survived']==0]['FamilySize']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Family Size Histogram by Survival')
plt.xlabel('Family Size (#)')
plt.ylabel('# of Passengers')
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x7f57f3087860>
In [12]:
#we will use seaborn graphics for multi-variable comparison: https://seaborn.pydata.org/api.html

#graph individual features by survival
fig, saxis = plt.subplots(2, 3,figsize=(16,12))

sns.barplot(x = 'Embarked', y = 'Survived', data=data1, ax = saxis[0,0])
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=data1, ax = saxis[0,1])
sns.barplot(x = 'IsAlone', y = 'Survived', order=[1,0], data=data1, ax = saxis[0,2])

sns.pointplot(x = 'FareBin', y = 'Survived',  data=data1, ax = saxis[1,0])
sns.pointplot(x = 'AgeBin', y = 'Survived',  data=data1, ax = saxis[1,1])
sns.pointplot(x = 'FamilySize', y = 'Survived', data=data1, ax = saxis[1,2])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f57f2f11080>
In [13]:
#graph distribution of qualitative data: Pclass
#we know class mattered in survival, now let's compare class and a 2nd feature
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))

sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = data1, ax = axis1)
axis1.set_title('Pclass vs Fare Survival Comparison')

sns.violinplot(x = 'Pclass', y = 'Age', hue = 'Survived', data = data1, split = True, ax = axis2)
axis2.set_title('Pclass vs Age Survival Comparison')

sns.boxplot(x = 'Pclass', y ='FamilySize', hue = 'Survived', data = data1, ax = axis3)
axis3.set_title('Pclass vs Family Size Survival Comparison')
Out[13]:
Text(0.5,1,'Pclass vs Family Size Survival Comparison')
In [14]:
#graph distribution of qualitative data: Sex
#we know sex mattered in survival, now let's compare sex and a 2nd feature
fig, qaxis = plt.subplots(1,3,figsize=(14,12))

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Embarked', data=data1, ax = qaxis[0])
axis1.set_title('Sex vs Embarked Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'Pclass', data=data1, ax  = qaxis[1])
axis1.set_title('Sex vs Pclass Survival Comparison')

sns.barplot(x = 'Sex', y = 'Survived', hue = 'IsAlone', data=data1, ax  = qaxis[2])
axis1.set_title('Sex vs IsAlone Survival Comparison')
Out[14]:
Text(0.5,1,'Sex vs IsAlone Survival Comparison')
In [15]:
#more side-by-side comparisons
fig, (maxis1, maxis2) = plt.subplots(1, 2,figsize=(14,12))

#how does family size factor with sex & survival compare
sns.pointplot(x="FamilySize", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis1)

#how does class factor with sex & survival compare
sns.pointplot(x="Pclass", y="Survived", hue="Sex", data=data1,
              palette={"male": "blue", "female": "pink"},
              markers=["*", "o"], linestyles=["-", "--"], ax = maxis2)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f57f2df7978>
In [16]:
#how does embark port factor with class, sex, and survival compare
#facetgrid: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
e = sns.FacetGrid(data1, col = 'Embarked')
e.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', ci=95.0, palette = 'deep')
e.add_legend()
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x7f57f660f2b0>
In [17]:
#plot distributions of age of passengers who survived or did not survive
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , data1['Age'].max()))
a.add_legend()
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x7f57f254b048>
In [18]:
#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)
h.add_legend()
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x7f57f246bdd8>
In [19]:
#pair plots of entire dataset
pp = sns.pairplot(data1, hue = 'Survived', palette = 'deep', size=1.2, diag_kind = 'kde', diag_kws=dict(shade=True), plot_kws=dict(s=10) )
pp.set(xticklabels=[])
Out[19]:
<seaborn.axisgrid.PairGrid at 0x7f57f22bd2b0>
In [20]:
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(data1)

Step 5: Model Data

Data Science is a multi-disciplinary field between mathematics (i.e. statistics, linear algebra, etc.), computer science (i.e. programming languages, computer systems, etc.), and business management (i.e. communication, subject-matter knowledge, etc.). Most data scientists come from one of these three fields, so they tend to lean towards that discipline. However, data science is like a three-legged stool, with no one leg being more important than another. So, this step will require advanced knowledge in mathematics. But don't worry, we only need a high-level overview, which we'll cover in this Kernel. Also, thanks to computer science, a lot of the heavy lifting is done for you. So, problems that once required graduate degrees in mathematics or statistics now only take a few lines of code. Last, we'll need some business acumen to think through the problem. After all, like training a seeing-eye dog, it's learning from us and not the other way around.

Machine Learning (ML), as the name suggests, is teaching the machine how-to think and not what to think. While this topic and big data have been around for decades, it is becoming more popular than ever because the barrier to entry is lower, for businesses and professionals alike. This is both good and bad. It's good because these algorithms are now accessible to more people who can solve more problems in the real world. It's bad because a lower barrier to entry means more people will not know the tools they are using and can come to incorrect conclusions. That's why I focus on teaching you not just what to do, but why you're doing it. Previously, I used the analogy of asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible; or even worse, it implements incorrect actionable intelligence. So now that I've hammered (no pun intended) my point home, I'll show you what to do and, most importantly, WHY you do it.

First, you must understand that the purpose of machine learning is to solve human problems. Machine learning can be categorized as: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is where you train the model by presenting it a training dataset that includes the correct answer. Unsupervised learning is where you train the model using a training dataset that does not include the correct answer. And reinforcement learning is a hybrid of the previous two, where the model is not given the correct answer immediately, but only later, after a sequence of events, to reinforce learning. We are doing supervised machine learning, because we are training our algorithm by presenting it with a set of features and their corresponding target. We then hope to present it a new subset from the same dataset and have similar results in prediction accuracy.

There are many machine learning algorithms; however, they can be reduced to four categories: classification, regression, clustering, or dimensionality reduction, depending on your target variable and data modeling goals. We'll save clustering and dimensionality reduction for another day and focus on classification and regression. We can generalize that a continuous target variable requires a regression algorithm and a discrete target variable requires a classification algorithm. One side note: logistic regression, while it has regression in the name, is really a classification algorithm. Since our problem is predicting whether a passenger survived or did not survive, it has a discrete target variable. We will use a classification algorithm from the sklearn library to begin our analysis. We will use cross validation and scoring metrics, discussed in later sections, to rank and compare our algorithms' performance.
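
To make that side note concrete, here is a minimal illustration (my own addition, not part of the modeling flow below) showing that logistic regression behaves as a classifier: predict() returns discrete class labels, while predict_proba() exposes the underlying probabilities.

#logistic regression is a classifier despite its name (illustrative only, not part of the modeling flow)
from sklearn.linear_model import LogisticRegression #also reachable as linear_model.LogisticRegression

logreg = LogisticRegression()
logreg.fit(data1[data1_x_bin], data1[Target[0]])

print(logreg.predict(data1[data1_x_bin])[:10])      #discrete 0/1 class labels
print(logreg.predict_proba(data1[data1_x_bin])[:3]) #estimated probabilities for class 0 and class 1
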

Machine Learning Selection:

Now that we have identified our solution as a supervised learning classification algorithm, we can narrow our list of choices.

Machine Learning Classification Algorithms:

Data Science 101: How to Choose a Machine Learning Algorithm (MLA)

IMPORTANT: When it comes to data modeling, the beginner's question is always, "what is the best machine learning algorithm?" To this the beginner must learn the No Free Lunch Theorem (NFLT) of Machine Learning. In short, NFLT states there is no super algorithm that works best in all situations, for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario. With that being said, some good research has been done comparing algorithms, such as Caruana & Niculescu-Mizil 2006 (an empirical comparison of MLAs, with a video lecture available), Ogutu et al. 2011 (done by the NIH for genomic selection), Fernandez-Delgado et al. 2014 (comparing 179 classifiers from 17 families), and Thoma 2016 (an sklearn comparison). There is also a school of thought that says more data beats a better algorithm.

So with all this information, where is a beginner to start? I recommend starting with Trees, Bagging, Random Forests, and Boosting. They are basically different implementations of a decision tree, which is the easiest concept to learn and understand. They are also easier to tune (discussed in the next section) than something like SVC. Below, I'll give an overview of how-to run and compare several MLAs, but the rest of this Kernel will focus on learning data modeling via decision trees and their derivatives.

In [21]:
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()    
    ]



#split dataset in cross-validation with this splitter class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
#note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = data1[Target]

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    #score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv  = cv_split)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3   #let's know the worst that can happen!
    

    #save MLA predictions - see section 6 for usage
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
    
    row_index+=1

    
#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
#MLA_predict
Out[21]:
MLA Name MLA Parameters MLA Train Accuracy Mean MLA Test Accuracy Mean MLA Test Accuracy 3*STD MLA Time
21 XGBClassifier {'base_score': 0.5, 'booster': 'gbtree', 'cols... 0.856367 0.829478 0.0527546 0.0338062
4 RandomForestClassifier {'bootstrap': True, 'class_weight': None, 'cri... 0.892322 0.826493 0.0679525 0.0147755
14 SVC {'C': 1.0, 'cache_size': 200, 'class_weight': ... 0.837266 0.826119 0.0453876 0.0445107
3 GradientBoostingClassifier {'criterion': 'friedman_mse', 'init': None, 'l... 0.866667 0.822761 0.0498731 0.0715864
15 NuSVC {'cache_size': 200, 'class_weight': None, 'coe... 0.835768 0.822761 0.0493681 0.0524707
2 ExtraTreesClassifier {'bootstrap': False, 'class_weight': None, 'cr... 0.895131 0.821269 0.0690863 0.0144257
17 DecisionTreeClassifier {'class_weight': None, 'criterion': 'gini', 'm... 0.895131 0.81903 0.0575704 0.00189724
1 BaggingClassifier {'base_estimator': None, 'bootstrap': True, 'b... 0.890449 0.813806 0.0614041 0.0157245
13 KNeighborsClassifier {'algorithm': 'auto', 'leaf_size': 30, 'metric... 0.850375 0.813806 0.0690863 0.00233798
18 ExtraTreeClassifier {'class_weight': None, 'criterion': 'gini', 'm... 0.895131 0.812687 0.0634811 0.00160697
0 AdaBoostClassifier {'algorithm': 'SAMME.R', 'base_estimator': Non... 0.820412 0.81194 0.0498606 0.072931
5 GaussianProcessClassifier {'copy_X_train': True, 'kernel': None, 'max_it... 0.871723 0.810448 0.0492537 0.350273
20 QuadraticDiscriminantAnalysis {'priors': None, 'reg_param': 0.0, 'store_cova... 0.821536 0.80709 0.0810389 0.0176577
8 RidgeClassifierCV {'alphas': (0.1, 1.0, 10.0), 'class_weight': N... 0.796629 0.79403 0.0360302 0.0105472
19 LinearDiscriminantAnalysis {'n_components': None, 'priors': None, 'shrink... 0.796816 0.79403 0.0360302 0.00550387
16 LinearSVC {'C': 1.0, 'class_weight': None, 'dual': True,... 0.79794 0.793657 0.0400646 0.0274618
6 LogisticRegressionCV {'Cs': 10, 'class_weight': None, 'cv': None, '... 0.797004 0.790672 0.0653582 0.129134
12 GaussianNB {'priors': None} 0.794757 0.781343 0.0874568 0.00183613
11 BernoulliNB {'alpha': 1.0, 'binarize': 0.0, 'class_prior':... 0.785768 0.775373 0.0570347 0.00200269
7 PassiveAggressiveClassifier {'C': 1.0, 'average': False, 'class_weight': N... 0.734457 0.730597 0.148826 0.00238907
10 Perceptron {'alpha': 0.0001, 'class_weight': None, 'eta0'... 0.740075 0.728731 0.162221 0.00185683
9 SGDClassifier {'alpha': 0.0001, 'average': False, 'class_wei... 0.737079 0.726119 0.17372 0.00182471
In [22]:
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
Out[22]:
Text(0,0.5,'Algorithm')

5.1 Evaluate Model Performance

Let's recap: with some basic data cleaning, analysis, and machine learning algorithms (MLA), we are able to predict passenger survival with ~82% accuracy. Not bad for a few lines of code. But the question we always ask is, can we do better and, more importantly, get an ROI (return on investment) for our time invested? For example, if we're only going to increase our accuracy by 1/10th of a percent, is it really worth 3 months of development? If you work in research, maybe the answer is yes, but if you work in business, the answer is mostly no. So, keep that in mind when improving your model.

Data Science 101: Determine a Baseline Accuracy

Before we decide how-to make our model better, let's determine if our model is even worth keeping. To do that, we have to go back to the basics of data science 101. We know this is a binary problem, because there are only two possible outcomes: passengers survived or died. So, think of it like a coin flip problem. If you have a fair coin and you guess heads or tails, then you have a 50-50 chance of guessing correctly. So, let's set 50% as the worst model performance, because if your model is any lower than that, why do I need you when I can just flip a coin?

Okay, so with no information about the dataset, we can always get 50% with a binary problem. But we have information about the dataset, so we should be able to do better. We know that 1,502/2,224 or 67.5% of people died. Therefore, if we just predict the most frequent occurrence, that 100% of people died, then we would be right 67.5% of the time. So, let's set 68% as bad model performance, because again, anything lower than that, and why do I need you when I can just predict using the most frequent occurrence?
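To make this concrete, below is a minimal sketch of both baselines using sklearn's DummyClassifier. It assumes the data1 DataFrame and the data1_x_bin feature list defined earlier in this kernel.

#minimal baseline sketch (assumes data1 and data1_x_bin from earlier in this kernel)
#http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
from sklearn.dummy import DummyClassifier

#worst case: random guessing should land near 50%
coin_flip = DummyClassifier(strategy = 'uniform', random_state = 0)
coin_flip.fit(data1[data1_x_bin], data1['Survived'])
print('Coin Flip Baseline Accuracy: {:.2f}%'.format(coin_flip.score(data1[data1_x_bin], data1['Survived'])*100))

#bad case: always predict the most frequent outcome (died/0), roughly 62% on the train sample
most_frequent = DummyClassifier(strategy = 'most_frequent')
most_frequent.fit(data1[data1_x_bin], data1['Survived'])
print('Most Frequent Baseline Accuracy: {:.2f}%'.format(most_frequent.score(data1[data1_x_bin], data1['Survived'])*100))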

Data Science 101: How-to Create Your Own Model

Our accuracy is increasing, but can we do better? Are there any signals in our data? To illustrate this, we're going to build our own decision tree model, because it is the easiest to conceptualize and requires only simple addition and multiplication calculations. When creating a decision tree, you want to ask questions that segment your target response, placing the survived/1 and dead/0 into homogeneous subgroups. This is part science and part art, so let's just play the 21-question game to show you how it works. If you want to follow along on your own, download the train dataset and import it into Excel. Create a pivot table with survival in the columns, count and % of row count in the values, and the features described below in the rows (a pandas equivalent is sketched below).
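If you would rather stay in Python than open Excel, here is a minimal pandas stand-in for that pivot table, assuming the data1 DataFrame from earlier in this kernel: pd.crosstab gives the counts and normalize = 'index' gives the % of row count.

#pandas equivalent of the Excel pivot described above (assumes data1 from earlier in this kernel)
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
print(pd.crosstab(data1['Sex'], data1['Survived'], margins = True))      #counts
print(pd.crosstab(data1['Sex'], data1['Survived'], normalize = 'index')) #% of row count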

Remember, the name of the game is to create subgroups using a decision tree model to get survived/1 in one bucket and dead/0 in another bucket. Our rule of thumb will be majority rules. Meaning, if 50% or more of a subgroup survived, then we predict everybody in that subgroup survived/1, but if less than 50% survived, then we predict everybody in that subgroup died/0. Also, we will stop if the subgroup is less than 10 and/or our model accuracy plateaus or decreases. Got it? Let's go!

Question 1: Were you on the Titanic? If yes, then the majority (62%) died. Note our sample death rate is different from the population death rate of 68%. Nonetheless, if we assumed everybody died, our sample accuracy is 62%.

Question 2: Are you male or female? Male, majority (81%) died. Female, majority (74%) survived. Giving us an accuracy of 79%.

Question 3A (going down the female branch with count = 314): Are you in class 1, 2, or 3? Class 1, majority (97%) survived, and Class 2, majority (92%) survived. Since the dead subgroup is less than 10, we will stop going down this branch. Class 3 is an even 50-50 split, so no new information to improve our model is gained.

Question 4A (going down the female class 3 branch with count = 144): Did you embark from port C, Q, or S? We gain a little information. C and Q, the majority still survived, so no change. Also, the dead subgroup is less than 10, so we will stop. S, the majority (63%) died. So, we will change females, class 3, embarked S from assuming they survived, to assuming they died. Our model accuracy increases to 81%.

Question 5A (going down the female class 3 embarked S branch with count = 88): So far, it looks like we made good decisions. Adding another level does not seem to gain much more information. In this subgroup, 55 died and 33 survived; since the majority died, we need to find a signal that identifies the 33, or a subgroup of them, so we can change them from dead to survived and improve our model accuracy. We can play with our features. One I found was fare 0-8, where the majority survived. It's a small sample size (11 to 9), but one often used in statistics. We slightly improve our accuracy, but not enough to move us past 82%. So, we'll stop here.

Question 3B (going down the male branch with count = 577): Going back to Question 2, we know the majority of males died. So, we are looking for a feature that identifies a subgroup where the majority survived. Surprisingly, class or even embarked didn't matter like it did for females, but title does and gets us to 82%. Guessing and checking other features, none seem to push us past 82%. So, we'll stop here for now.

You did it: with very little information, we get to 82% accuracy. On a worst, bad, good, better, and best scale, we'll set 82% to good, since it's a simple model that yields decent results. But the question still remains, can we do better than our handmade model?

Before we do, let's code what we just wrote above. Please note, this is a manual process created by "hand." You won't have to do this, but it's important to understand it before you start working with MLA. Think of MLA like a TI-89 calculator on a Calculus Exam. It's very powerful and helps you with a lot of the grunt work. But if you don't know what you're doing on the exam, a calculator, even a TI-89, is not going to help you pass. So, study the next section wisely.

Reference: Cross-Validation and Decision Tree Tutorial

In [23]:
#IMPORTANT: This is a handmade model for learning purposes only.
#However, it is possible to create your own predictive model without a fancy algorithm :)

#coin flip model with random 1/survived 0/died

#iterate over dataFrame rows as (index, Series) pairs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
for index, row in data1.iterrows(): 
    #random number generator: https://docs.python.org/3/library/random.html
    if random.random() > .5:     # Random float x, 0.0 <= x < 1.0    
        data1.loc[index, 'Random_Predict'] = 1 #predict survived/1
    else: 
        data1.loc[index, 'Random_Predict'] = 0 #predict died/0
    

#score random guess of survival. Use shortcut 1 = Right Guess and 0 = Wrong Guess
#the mean of the column will then equal the accuracy
data1['Random_Score'] = 0 #assume prediction wrong
data1.loc[(data1['Survived'] == data1['Random_Predict']), 'Random_Score'] = 1 #set to 1 for correct prediction
print('Coin Flip Model Accuracy: {:.2f}%'.format(data1['Random_Score'].mean()*100))

#we can also use scikit's accuracy_score function to save us a few lines of code
#http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print('Coin Flip Model Accuracy w/SciKit: {:.2f}%'.format(metrics.accuracy_score(data1['Survived'], data1['Random_Predict'])*100))
Coin Flip Model Accuracy: 47.81%
Coin Flip Model Accuracy w/SciKit: 47.81%
In [24]:
#group by or pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
pivot_female = data1[data1.Sex=='female'].groupby(['Sex','Pclass', 'Embarked','FareBin'])['Survived'].mean()
print('Survival Decision Tree w/Female Node: \n',pivot_female)

pivot_male = data1[data1.Sex=='male'].groupby(['Sex','Title'])['Survived'].mean()
print('\n\nSurvival Decision Tree w/Male Node: \n',pivot_male)
Survival Decision Tree w/Female Node: 
 Sex     Pclass  Embarked  FareBin        
female  1       C         (14.454, 31.0]     0.666667
                          (31.0, 512.329]    1.000000
                Q         (31.0, 512.329]    1.000000
                S         (14.454, 31.0]     1.000000
                          (31.0, 512.329]    0.955556
        2       C         (7.91, 14.454]     1.000000
                          (14.454, 31.0]     1.000000
                          (31.0, 512.329]    1.000000
                Q         (7.91, 14.454]     1.000000
                S         (7.91, 14.454]     0.875000
                          (14.454, 31.0]     0.916667
                          (31.0, 512.329]    1.000000
        3       C         (-0.001, 7.91]     1.000000
                          (7.91, 14.454]     0.428571
                          (14.454, 31.0]     0.666667
                Q         (-0.001, 7.91]     0.750000
                          (7.91, 14.454]     0.500000
                          (14.454, 31.0]     0.714286
                S         (-0.001, 7.91]     0.533333
                          (7.91, 14.454]     0.448276
                          (14.454, 31.0]     0.357143
                          (31.0, 512.329]    0.125000
Name: Survived, dtype: float64


Survival Decision Tree w/Male Node: 
 Sex   Title 
male  Master    0.575000
      Misc      0.250000
      Mr        0.156673
Name: Survived, dtype: float64
In [25]:
#handmade data model using brain power (and Microsoft Excel Pivot Tables for quick calculations)
def mytree(df):
    
    #initialize table to store predictions
    Model = pd.DataFrame(data = {'Predict':[]})
    male_title = ['Master'] #survived titles

    for index, row in df.iterrows():

        #Question 1: Were you on the Titanic; majority died
        Model.loc[index, 'Predict'] = 0

        #Question 2: Are you female; majority survived
        if (df.loc[index, 'Sex'] == 'female'):
                  Model.loc[index, 'Predict'] = 1

        #Question 3A Female - Pclass and Question 4A Embarked gain minimal information

        #Question 5A Female - Fare; set any female/class 3/embarked S subgroup with survival below .5 in the decision tree back to 0
        if ((df.loc[index, 'Sex'] == 'female') & 
            (df.loc[index, 'Pclass'] == 3) & 
            (df.loc[index, 'Embarked'] == 'S')  &
            (df.loc[index, 'Fare'] > 8)

           ):
                  Model.loc[index, 'Predict'] = 0

        #Question 3B Male: Title; set anything greater than .5 to 1 for majority survived
        if ((df.loc[index, 'Sex'] == 'male') &
            (df.loc[index, 'Title'] in male_title)
            ):
            Model.loc[index, 'Predict'] = 1
        
        
    return Model


#model data
Tree_Predict = mytree(data1)
print('Decision Tree Model Accuracy/Precision Score: {:.2f}%\n'.format(metrics.accuracy_score(data1['Survived'], Tree_Predict)*100))


#Accuracy Summary Report with http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
#Where recall score = (true positives)/(true positive + false negative) w/1 being best:http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
#And F1 score = weighted average of precision and recall w/1 being best: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
print(metrics.classification_report(data1['Survived'], Tree_Predict))
Decision Tree Model Accuracy/Precision Score: 82.04%

             precision    recall  f1-score   support

          0       0.82      0.91      0.86       549
          1       0.82      0.68      0.75       342

avg / total       0.82      0.82      0.82       891

In [26]:
#Plot Accuracy Summary
#Credit: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = metrics.confusion_matrix(data1['Survived'], Tree_Predict)
np.set_printoptions(precision=2)

class_names = ['Dead', 'Survived']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, 
                      title='Normalized confusion matrix')
Confusion matrix, without normalization
[[497  52]
 [108 234]]
Normalized confusion matrix
[[ 0.91  0.09]
 [ 0.32  0.68]]
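As a quick sanity check, the classification report above follows directly from this confusion matrix. Here is a short worked calculation for the survived/1 class, using the un-normalized counts printed above.

#quick check: precision = TP/(TP+FP) and recall = TP/(TP+FN) for the survived/1 class
TN, FP, FN, TP = 497, 52, 108, 234 #values from the un-normalized confusion matrix above
precision_1 = TP/(TP+FP) #234/286 ~ 0.82
recall_1 = TP/(TP+FN)    #234/342 ~ 0.68
f1_1 = 2*precision_1*recall_1/(precision_1+recall_1) #~ 0.75
print('Survived/1 precision: {:.2f}, recall: {:.2f}, f1-score: {:.2f}'.format(precision_1, recall_1, f1_1))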

5.11 Model Performance with Cross-Validation (CV)

In step 5.0, we used the sklearn cross_validate function to train, test, and score our model performance.

Remember, it's important we use a different subset of the data to train (build) our model and to test (evaluate) our model. Otherwise, our model will be overfitted, meaning it's great at "predicting" data it's already seen but terrible at predicting data it has not seen, which is not prediction at all. It's like cheating on a school quiz to get 100%, but then when you go to take the exam, you fail because you never truly learned anything. The same is true with machine learning.

CV is basically a shortcut to split and score our model multiple times, so we can get an idea of how well it will perform on unseen data. It’s a little more expensive in computer processing, but it's important so we don't gain false confidence. This is helpful in a Kaggle Competition or any use case where consistency matters and surprises should be avoided.

In addition to CV, we used a customized sklearn train-test splitter to allow a little more randomness in our test scoring. Below is an image of the default CV split.

CV
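As a concrete illustration of such a splitter, below is a minimal sketch that scores a decision tree with cross_validate and a ShuffleSplit. The settings shown (n_splits, test_size, train_size) are assumptions for this example and not necessarily identical to the cv_split defined earlier in this kernel.

#illustrative sketch only; these ShuffleSplit settings are assumptions, not necessarily the kernel's exact splitter
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
example_cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
example_results = model_selection.cross_validate(tree.DecisionTreeClassifier(random_state = 0), data1[data1_x_bin], data1[Target], cv = example_cv_split, return_train_score = True)
print('Example CV train accuracy: {:.2f}%'.format(example_results['train_score'].mean()*100))
print('Example CV test accuracy: {:.2f}%'.format(example_results['test_score'].mean()*100))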

5.12 Tune Model with Hyper-Parameters

When we used sklearn Decision Tree (DT) Classifier, we accepted all the function defaults. This leaves opportunity to see how various hyper-parameter settings will change the model accuracy. (Click here to learn more about parameters vs hyper-parameters.)

However, in order to tune a model, we need to actually understand it. That's why I took the time in the previous sections to show you how predictions work. Now let's learn a little bit more about our DT algorithm.

Credit: sklearn

Some advantages of decision trees are:

  • Simple to understand and to interpret. Trees can be visualized.
  • Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
  • Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.
  • Able to handle multi-output problems.
  • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by Boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

  • Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Below are the available hyper-parameters and their definitions:

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

We will tune our model using ParameterGrid, GridSearchCV, and customized sklearn scoring; click here to learn more about ROC_AUC scores. We will then visualize our tree with graphviz.

In [27]:
#base model
dtree = tree.DecisionTreeClassifier(random_state = 0)
base_results = model_selection.cross_validate(dtree, data1[data1_x_bin], data1[Target], cv  = cv_split)
dtree.fit(data1[data1_x_bin], data1[Target])

print('BEFORE DT Parameters: ', dtree.get_params())
print("BEFORE DT Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
#print("BEFORE DT Test w/bin set score min: {:.2f}". format(base_results['test_score'].min()*100))
print('-'*10)


#tune hyper-parameters: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
param_grid = {'criterion': ['gini', 'entropy'],  #scoring methodology; two supported formulas for calculating information gain - default is gini
              #'splitter': ['best', 'random'], #splitting methodology; two supported strategies - default is best
              'max_depth': [2,4,6,8,10,None], #max depth tree can grow; default is none
              #'min_samples_split': [2,5,10,.03,.05], #minimum subset size BEFORE new split (fraction is % of total); default is 2
              #'min_samples_leaf': [1,5,10,.03,.05], #minimum subset size AFTER new split (fraction is % of total); default is 1
              #'max_features': [None, 'auto'], #max features to consider when performing split; default none or all
              'random_state': [0] #seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
             }

#print(list(model_selection.ParameterGrid(param_grid)))

#choose best model with grid_search: #http://scikit-learn.org/stable/modules/grid_search.html#grid-search
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])

#print(tune_model.cv_results_.keys())
#print(tune_model.cv_results_['params'])
print('AFTER DT Parameters: ', tune_model.best_params_)
#print(tune_model.cv_results_['mean_train_score'])
print("AFTER DT Training w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100)) 
#print(tune_model.cv_results_['mean_test_score'])
print("AFTER DT Test w/bin score mean: {:.2f}". format(tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT Test w/bin score 3*std: +/- {:.2f}". format(tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)


#duplicates gridsearchcv
#tune_results = model_selection.cross_validate(tune_model, data1[data1_x_bin], data1[Target], cv  = cv_split)

#print('AFTER DT Parameters: ', tune_model.best_params_)
#print("AFTER DT Training w/bin set score mean: {:.2f}". format(tune_results['train_score'].mean()*100)) 
#print("AFTER DT Test w/bin set score mean: {:.2f}". format(tune_results['test_score'].mean()*100))
#print("AFTER DT Test w/bin set score min: {:.2f}". format(tune_results['test_score'].min()*100))
#print('-'*10)
BEFORE DT Parameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 0, 'splitter': 'best'}
BEFORE DT Training w/bin score mean: 89.51
BEFORE DT Test w/bin score mean: 82.09
BEFORE DT Test w/bin score 3*std: +/- 5.57
----------
AFTER DT Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT Training w/bin score mean: 89.35
AFTER DT Test w/bin score mean: 87.40
AFTER DT Test w/bin score 3*std: +/- 5.00
----------

5.13 Tune Model with Feature Selection

As stated in the beginning, more predictor variables do not make a better model; the right predictors do. So, another step in data modeling is feature selection. Sklearn has several options; we will use recursive feature elimination (RFE) with cross-validation (CV).

In [28]:
#base model
print('BEFORE DT RFE Training Shape Old: ', data1[data1_x_bin].shape) 
print('BEFORE DT RFE Training Columns Old: ', data1[data1_x_bin].columns.values)

print("BEFORE DT RFE Training w/bin score mean: {:.2f}". format(base_results['train_score'].mean()*100)) 
print("BEFORE DT RFE Test w/bin score mean: {:.2f}". format(base_results['test_score'].mean()*100))
print("BEFORE DT RFE Test w/bin score 3*std: +/- {:.2f}". format(base_results['test_score'].std()*100*3))
print('-'*10)



#feature selection
dtree_rfe = feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv = cv_split)
dtree_rfe.fit(data1[data1_x_bin], data1[Target])

#transform x&y to reduced features and fit new model
#alternative: can use pipeline to reduce fit and transform steps: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
X_rfe = data1[data1_x_bin].columns.values[dtree_rfe.get_support()]
rfe_results = model_selection.cross_validate(dtree, data1[X_rfe], data1[Target], cv  = cv_split)

#print(dtree_rfe.grid_scores_)
print('AFTER DT RFE Training Shape New: ', data1[X_rfe].shape) 
print('AFTER DT RFE Training Columns New: ', X_rfe)

print("AFTER DT RFE Training w/bin score mean: {:.2f}". format(rfe_results['train_score'].mean()*100)) 
print("AFTER DT RFE Test w/bin score mean: {:.2f}". format(rfe_results['test_score'].mean()*100))
print("AFTER DT RFE Test w/bin score 3*std: +/- {:.2f}". format(rfe_results['test_score'].std()*100*3))
print('-'*10)


#tune rfe model
rfe_tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
rfe_tune_model.fit(data1[X_rfe], data1[Target])

#print(rfe_tune_model.cv_results_.keys())
#print(rfe_tune_model.cv_results_['params'])
print('AFTER DT RFE Tuned Parameters: ', rfe_tune_model.best_params_)
#print(rfe_tune_model.cv_results_['mean_train_score'])
print("AFTER DT RFE Tuned Training w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_train_score'][tune_model.best_index_]*100)) 
#print(rfe_tune_model.cv_results_['mean_test_score'])
print("AFTER DT RFE Tuned Test w/bin score mean: {:.2f}". format(rfe_tune_model.cv_results_['mean_test_score'][tune_model.best_index_]*100))
print("AFTER DT RFE Tuned Test w/bin score 3*std: +/- {:.2f}". format(rfe_tune_model.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
print('-'*10)
BEFORE DT RFE Training Shape Old:  (891, 7)
BEFORE DT RFE Training Columns Old:  ['Sex_Code' 'Pclass' 'Embarked_Code' 'Title_Code' 'FamilySize'
 'AgeBin_Code' 'FareBin_Code']
BEFORE DT RFE Training w/bin score mean: 89.51
BEFORE DT RFE Test w/bin score mean: 82.09
BEFORE DT RFE Test w/bin score 3*std: +/- 5.57
----------
AFTER DT RFE Training Shape New:  (891, 6)
AFTER DT RFE Training Columns New:  ['Sex_Code' 'Pclass' 'Title_Code' 'FamilySize' 'AgeBin_Code' 'FareBin_Code']
AFTER DT RFE Training w/bin score mean: 88.16
AFTER DT RFE Test w/bin score mean: 83.06
AFTER DT RFE Test w/bin score 3*std: +/- 6.22
----------
AFTER DT RFE Tuned Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
AFTER DT RFE Tuned Training w/bin score mean: 89.39
AFTER DT RFE Tuned Test w/bin score mean: 87.34
AFTER DT RFE Tuned Test w/bin score 3*std: +/- 6.21
----------
In [29]:
#Graph MLA version of Decision Tree: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
import graphviz 
dot_data = tree.export_graphviz(dtree, out_file=None, 
                                feature_names = data1_x_bin, class_names = True,
                                filled = True, rounded = True)
graph = graphviz.Source(dot_data) 
graph
Out[29]:
[Graphviz rendering of the unpruned decision tree (dtree): the root splits on Sex_Code; the Sex_Code <= 0.5 branch continues with splits on Pclass, FareBin_Code, Embarked_Code, and FamilySize, while the other branch splits on Title_Code, FamilySize, Pclass, and FareBin_Code, growing to nearly 300 nodes.]

Step 6: Validate and Implement

The next step is to prepare for submission using the validation data.

In [30]:
#compare algorithm predictions with each other, where 1 = exactly similar and -1 = exactly opposite
#there are some 1's, but enough blues and light reds to create a "super algorithm" by combining them
correlation_heatmap(MLA_predict)
In [31]:
#why choose one model, when you can pick them all with voting classifier
#http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
#removed models w/o attribute 'predict_proba' required for vote classifier and models with a 1.0 correlation to another model
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc',ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes: http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc
    ('gpc', gaussian_process.GaussianProcessClassifier()),
    
    #GLM: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
    ('lr', linear_model.LogisticRegressionCV()),
    
    #Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
    ('bnb', naive_bayes.BernoulliNB()),
    ('gnb', naive_bayes.GaussianNB()),
    
    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', neighbors.KNeighborsClassifier()),
    
    #SVM: http://scikit-learn.org/stable/modules/svm.html
    ('svc', svm.SVC(probability=True)),
    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
   ('xgb', XGBClassifier())

]


#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, data1[data1_x_bin], data1[Target], cv  = cv_split)
vote_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100)) 
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)


#Soft Vote or weighted probabilities
vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, data1[data1_x_bin], data1[Target], cv  = cv_split)
vote_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting Training w/bin score mean: {:.2f}". format(vote_soft_cv['train_score'].mean()*100)) 
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('-'*10)
Hard Voting Training w/bin score mean: 86.59
Hard Voting Test w/bin score mean: 82.39
Hard Voting Test w/bin score 3*std: +/- 4.95
----------
Soft Voting Training w/bin score mean: 87.15
Soft Voting Test w/bin score mean: 82.35
Soft Voting Test w/bin score 3*std: +/- 4.85
----------
In [32]:
#IMPORTANT: THIS SECTION IS UNDER CONSTRUCTION!!!! 12.24.17
#UPDATE: This section was scrapped for the next section, as it's more computationally friendly.

#WARNING: Running this is very computationally intensive and time expensive
#code is written for experimental/developmental purposes and not production ready


#tune each estimator before creating a super model
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [50,100,300]
grid_ratio = [.1,.25,.5,.75,1.0]
grid_learn = [.01,.03,.05,.1,.25]
grid_max_depth = [2,4,6,None]
grid_min_samples = [5,10,.03,.05,.10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]

vote_param = [{
#            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
            'ada__n_estimators': grid_n_estimator,
            'ada__learning_rate': grid_ratio,
            'ada__algorithm': ['SAMME', 'SAMME.R'],
            'ada__random_state': grid_seed,
    
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
            'bc__n_estimators': grid_n_estimator,
            'bc__max_samples': grid_ratio,
            'bc__oob_score': grid_bool, 
            'bc__random_state': grid_seed,
            
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
            'etc__n_estimators': grid_n_estimator,
            'etc__criterion': grid_criterion,
            'etc__max_depth': grid_max_depth,
            'etc__random_state': grid_seed,


            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
            'gbc__loss': ['deviance', 'exponential'],
            'gbc__learning_rate': grid_ratio,
            'gbc__n_estimators': grid_n_estimator,
            'gbc__criterion': ['friedman_mse', 'mse', 'mae'],
            'gbc__max_depth': grid_max_depth,
            'gbc__min_samples_split': grid_min_samples,
            'gbc__min_samples_leaf': grid_min_samples,      
            'gbc__random_state': grid_seed,
    
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
            'rfc__n_estimators': grid_n_estimator,
            'rfc__criterion': grid_criterion,
            'rfc__max_depth': grid_max_depth,
            'rfc__min_samples_split': grid_min_samples,
            'rfc__min_samples_leaf': grid_min_samples,   
            'rfc__bootstrap': grid_bool,
            'rfc__oob_score': grid_bool, 
            'rfc__random_state': grid_seed,
        
            #http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
            'lr__fit_intercept': grid_bool,
            'lr__penalty': ['l1','l2'],
            'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'lr__random_state': grid_seed,
            
            #http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
            'bnb__alpha': grid_ratio,
            'bnb__fit_prior': grid_bool, #BernoulliNB has no 'prior' or 'random_state' parameter
    
            #http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
            'knn__n_neighbors': [1,2,3,4,5,6,7],
            'knn__weights': ['uniform', 'distance'],
            'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            #note: KNeighborsClassifier has no 'random_state' parameter
            
            #http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
            #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
            'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'svc__C': grid_max_depth,
            'svc__gamma': grid_ratio,
            'svc__decision_function_shape': ['ovo', 'ovr'],
            'svc__probability': [True],
            'svc__random_state': grid_seed,
    
    
            #http://xgboost.readthedocs.io/en/latest/parameter.html
            'xgb__learning_rate': grid_ratio,
            'xgb__max_depth': [2,4,6,8,10],
            'xgb__tree_method': ['exact', 'approx', 'hist'],
            'xgb__objective': ['reg:linear', 'reg:logistic', 'binary:logistic'],
            'xgb__seed': grid_seed    

        }]




#Soft Vote with tuned models
#grid_soft = model_selection.GridSearchCV(estimator = vote_soft, param_grid = vote_param, cv = 2, scoring = 'roc_auc')
#grid_soft.fit(data1[data1_x_bin], data1[Target])

#print(grid_soft.cv_results_.keys())
#print(grid_soft.cv_results_['params'])
#print('Soft Vote Tuned Parameters: ', grid_soft.best_params_)
#print(grid_soft.cv_results_['mean_train_score'])
#print("Soft Vote Tuned Training w/bin set score mean: {:.2f}". format(grid_soft.cv_results_['mean_train_score'][tune_model.best_index_]*100)) 
#print(grid_soft.cv_results_['mean_test_score'])
#print("Soft Vote Tuned Test w/bin set score mean: {:.2f}". format(grid_soft.cv_results_['mean_test_score'][tune_model.best_index_]*100))
#print("Soft Vote Tuned Test w/bin score 3*std: +/- {:.2f}". format(grid_soft.cv_results_['std_test_score'][tune_model.best_index_]*100*3))
#print('-'*10)


#credit: https://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
#cv_keys = ('mean_test_score', 'std_test_score', 'params')
#for r, _ in enumerate(grid_soft.cv_results_['mean_test_score']):
#    print("%0.3f +/- %0.2f %r"
#          % (grid_soft.cv_results_[cv_keys[0]][r],
#             grid_soft.cv_results_[cv_keys[1]][r] / 2.0,
#             grid_soft.cv_results_[cv_keys[2]][r]))


#print('-'*10)
In [33]:
#WARNING: Running this is very computationally intensive and time expensive.
#Code is written for experimental/developmental purposes and not production ready!


#Hyperparameter Tune with GridSearchCV: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]


grid_param = [
            [{
            #AdaBoostClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
            'n_estimators': grid_n_estimator, #default=50
            'learning_rate': grid_learn, #default=1
            #'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R
            'random_state': grid_seed
            }],
       
    
            [{
            #BaggingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
            'n_estimators': grid_n_estimator, #default=10
            'max_samples': grid_ratio, #default=1.0
            'random_state': grid_seed
             }],

    
            [{
            #ExtraTreesClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
            'n_estimators': grid_n_estimator, #default=10
            'criterion': grid_criterion, #default='gini'
            'max_depth': grid_max_depth, #default=None
            'random_state': grid_seed
             }],


            [{
            #GradientBoostingClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
            #'loss': ['deviance', 'exponential'], #default=’deviance’
            'learning_rate': [.05], #default=0.1 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            'n_estimators': [300], #default=100 -- 12/31/17 set to reduce runtime -- The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 264.45 seconds.
            #'criterion': ['friedman_mse', 'mse', 'mae'], #default='friedman_mse'
            'max_depth': grid_max_depth, #default=3   
            'random_state': grid_seed
             }],

    
            [{
            #RandomForestClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
            'n_estimators': grid_n_estimator, #default=10
            'criterion': grid_criterion, #default='gini'
            'max_depth': grid_max_depth, #default=None
            'oob_score': [True], #default=False -- 12/31/17 set to reduce runtime -- The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 146.35 seconds.
            'random_state': grid_seed
             }],
    
            [{    
            #GaussianProcessClassifier
            'max_iter_predict': grid_n_estimator, #default: 100
            'random_state': grid_seed
            }],
        
    
            [{
            #LogisticRegressionCV - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
            'fit_intercept': grid_bool, #default: True
            #'penalty': ['l1','l2'],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
            'random_state': grid_seed
             }],
            
    
            [{
            #BernoulliNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB
            'alpha': grid_ratio, #default: 1.0
             }],
    
    
            #GaussianNB - 
            [{}],
    
            [{
            #KNeighborsClassifier - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
            'n_neighbors': [1,2,3,4,5,6,7], #default: 5
            'weights': ['uniform', 'distance'], #default='uniform'
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
            }],
            
    
            [{
            #SVC - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
            #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
            #'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': [1,2,3,4,5], #default=1.0
            'gamma': grid_ratio, #default: auto
            'decision_function_shape': ['ovo', 'ovr'], #default:ovr
            'probability': [True],
            'random_state': grid_seed
             }],

    
            [{
            #XGBClassifier - http://xgboost.readthedocs.io/en/latest/parameter.html
            'learning_rate': grid_learn, #default: .3
            'max_depth': [1,2,4,6,8,10], #default=3
            'n_estimators': grid_n_estimator, 
            'seed': grid_seed  
             }]   
        ]
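
#NOTE: grid_param must stay in the same order as vote_est, because the loop below
#pairs them element-by-element with zip(); a mismatch would silently tune each
#estimator against the wrong grid. A cheap guard:
#assert len(grid_param) == len(vote_est)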



start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip (vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip

    #print(clf[1]) #vote_est is a list of tuples, index 0 is the name and index 1 is the algorithm
    #print(param)
    
    
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(data1[data1_x_bin], data1[Target])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param) 


run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))

print('-'*10)
The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 37.28 seconds.
The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.04 seconds.
The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 68.93 seconds.
The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 38.77 seconds.
The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 84.14 seconds.
The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.19 seconds.
The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 9.40 seconds.
The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.24 seconds.
The best parameter for GaussianNB is {} with a runtime of 0.05 seconds.
The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 5.56 seconds.
The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 30.49 seconds.
The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 43.57 seconds.
Total optimization time was 5.96 minutes.
----------
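
As the warning at the top of this cell says, exhaustive grid search is expensive. When runtime matters more than trying every combination, RandomizedSearchCV is a drop-in alternative that samples a fixed number of candidates from the same grid. A minimal sketch for a single estimator, reusing the grid lists and cv_split defined earlier; the choice of RandomForest and n_iter = 20 here is purely illustrative, not what this kernel submitted:

from sklearn import ensemble, model_selection

#randomized search: sample 20 of the 4*2*6 = 48 possible RandomForest combinations
rs = model_selection.RandomizedSearchCV(
        estimator = ensemble.RandomForestClassifier(),
        param_distributions = {'n_estimators': grid_n_estimator,
                               'criterion': grid_criterion,
                               'max_depth': grid_max_depth,
                               'random_state': grid_seed},
        n_iter = 20,
        cv = cv_split,
        scoring = 'roc_auc',
        random_state = 0)
rs.fit(data1[data1_x_bin], data1[Target])
print('RandomizedSearchCV best:', rs.best_params_, '{:.4f}'.format(rs.best_score_))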
In [34]:
#Hard Vote or majority rules w/Tuned Hyperparameters
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, data1[data1_x_bin], data1[Target], cv  = cv_split)
grid_hard.fit(data1[data1_x_bin], data1[Target])

print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100)) 
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)

#Soft Vote or weighted probabilities w/Tuned Hyperparameters
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, data1[data1_x_bin], data1[Target], cv  = cv_split)
grid_soft.fit(data1[data1_x_bin], data1[Target])

print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
print('-'*10)


#12/31/17 tuned with data1_x_bin
#The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.39 seconds.
#The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 30.28 seconds.
#The best parameter for ExtraTreesClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0} with a runtime of 64.76 seconds.
#The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 2, 'n_estimators': 300, 'random_state': 0} with a runtime of 34.35 seconds.
#The best parameter for RandomForestClassifier is {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'oob_score': True, 'random_state': 0} with a runtime of 76.32 seconds.
#The best parameter for GaussianProcessClassifier is {'max_iter_predict': 10, 'random_state': 0} with a runtime of 6.01 seconds.
#The best parameter for LogisticRegressionCV is {'fit_intercept': True, 'random_state': 0, 'solver': 'liblinear'} with a runtime of 8.04 seconds.
#The best parameter for BernoulliNB is {'alpha': 0.1} with a runtime of 0.19 seconds.
#The best parameter for GaussianNB is {} with a runtime of 0.04 seconds.
#The best parameter for KNeighborsClassifier is {'algorithm': 'brute', 'n_neighbors': 7, 'weights': 'uniform'} with a runtime of 4.84 seconds.
#The best parameter for SVC is {'C': 2, 'decision_function_shape': 'ovo', 'gamma': 0.1, 'probability': True, 'random_state': 0} with a runtime of 29.39 seconds.
#The best parameter for XGBClassifier is {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 46.23 seconds.
#Total optimization time was 5.56 minutes.
Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 85.22
Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 82.31
Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.26
----------
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 84.76
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 82.28
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.42
----------
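
Note that the tuning loop above called set_params(**best_param) on the estimator objects stored in vote_est, so grid_hard and grid_soft were built from the already-tuned estimators rather than the defaults. A quick spot-check sketch (the parameter names filtered here are just a convenient subset for readability):

#confirm the estimators inside the tuned ensemble carry the tuned hyper-parameters
for name, est in grid_hard.estimators:
    tuned = {k: v for k, v in est.get_params().items()
             if k in ('n_estimators', 'max_depth', 'learning_rate', 'C', 'n_neighbors', 'alpha')}
    print(name, tuned)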
In [35]:
#prepare data for modeling
print(data_val.info())
print("-"*10)
#data_val.sample(10)



#handmade decision tree - submission score = 0.77990
data_val['Survived'] = mytree(data_val).astype(int)


#decision tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_dt = tree.DecisionTreeClassifier()
#submit_dt = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv = cv_split)
#submit_dt.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_dt.best_params_) #Best Parameters:  {'criterion': 'gini', 'max_depth': 4, 'random_state': 0}
#data_val['Survived'] = submit_dt.predict(data_val[data1_x_bin])


#bagging w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77990
#submit_bc = ensemble.BaggingClassifier()
#submit_bc = model_selection.GridSearchCV(ensemble.BaggingClassifier(), param_grid= {'n_estimators':grid_n_estimator, 'max_samples': grid_ratio, 'oob_score': grid_bool, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_bc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_bc.best_params_) #Best Parameters:  {'max_samples': 0.25, 'n_estimators': 500, 'oob_score': True, 'random_state': 0}
#data_val['Survived'] = submit_bc.predict(data_val[data1_x_bin])


#extra tree w/full dataset modeling submission score: defaults= 0.76555, tuned= 0.77990
#submit_etc = ensemble.ExtraTreesClassifier()
#submit_etc = model_selection.GridSearchCV(ensemble.ExtraTreesClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_etc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_etc.best_params_) #Best Parameters:  {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_etc.predict(data_val[data1_x_bin])


#random forest w/full dataset modeling submission score: defaults= 0.71291, tuned= 0.73205
#submit_rfc = ensemble.RandomForestClassifier()
#submit_rfc = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), param_grid={'n_estimators': grid_n_estimator, 'criterion': grid_criterion, 'max_depth': grid_max_depth, 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_rfc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_rfc.best_params_) #Best Parameters:  {'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 100, 'random_state': 0}
#data_val['Survived'] = submit_rfc.predict(data_val[data1_x_bin])



#ada boosting w/full dataset modeling submission score: defaults= 0.74162, tuned= 0.75119
#submit_abc = ensemble.AdaBoostClassifier()
#submit_abc = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid={'n_estimators': grid_n_estimator, 'learning_rate': grid_ratio, 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_abc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_abc.best_params_) #Best Parameters:  {'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0}
#data_val['Survived'] = submit_abc.predict(data_val[data1_x_bin])


#gradient boosting w/full dataset modeling submission score: defaults= 0.75119, tuned= 0.77033
#submit_gbc = ensemble.GradientBoostingClassifier()
#submit_gbc = model_selection.GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid={'learning_rate': grid_ratio, 'n_estimators': grid_n_estimator, 'max_depth': grid_max_depth, 'random_state':grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_gbc.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_gbc.best_params_) #Best Parameters:  {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 50, 'random_state': 0}
#data_val['Survived'] = submit_gbc.predict(data_val[data1_x_bin])

#extreme boosting w/full dataset modeling submission score: defaults= 0.73684, tuned= 0.77990
#submit_xgb = XGBClassifier()
#submit_xgb = model_selection.GridSearchCV(XGBClassifier(), param_grid= {'learning_rate': grid_learn, 'max_depth': [0,2,4,6,8,10], 'n_estimators': grid_n_estimator, 'seed': grid_seed}, scoring = 'roc_auc', cv = cv_split)
#submit_xgb.fit(data1[data1_x_bin], data1[Target])
#print('Best Parameters: ', submit_xgb.best_params_) #Best Parameters:  {'learning_rate': 0.01, 'max_depth': 4, 'n_estimators': 300, 'seed': 0}
#data_val['Survived'] = submit_xgb.predict(data_val[data1_x_bin])


#hard voting classifier w/full dataset modeling submission score: defaults= 0.75598, tuned = 0.77990
#data_val['Survived'] = vote_hard.predict(data_val[data1_x_bin])
data_val['Survived'] = grid_hard.predict(data_val[data1_x_bin])


#soft voting classifier w/full dataset modeling submission score: defaults= 0.73684, tuned = 0.74162
#data_val['Survived'] = vote_soft.predict(data_val[data1_x_bin])
#data_val['Survived'] = grid_soft.predict(data_val[data1_x_bin])


#submit file
submit = data_val[['PassengerId','Survived']]
submit.to_csv("../working/submit.csv", index=False)

print('Validation Data Distribution: \n', data_val['Survived'].value_counts(normalize = True))
submit.sample(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
Validation Data Distribution: 
 0    0.633971
1    0.366029
Name: Survived, dtype: float64
Out[35]:
PassengerId Survived
113 1005 1
69 961 1
70 962 1
191 1083 0
376 1268 1
37 929 1
46 938 1
236 1128 0
294 1186 0
366 1258 0
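
Before uploading, it's worth confirming the file matches Kaggle's expected submission format: exactly the PassengerId and Survived columns, 418 rows (one per test passenger), and 0/1 integer labels. A minimal sanity-check sketch, reading back the same path written by to_csv above:

import pandas as pd

check = pd.read_csv("../working/submit.csv")
assert list(check.columns) == ['PassengerId', 'Survived'], 'unexpected columns'
assert len(check) == 418, 'Titanic test set has 418 rows'
assert check['Survived'].isin([0, 1]).all(), 'labels must be 0/1'
print('Submission format looks good:', check.shape)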

Step 7: Optimize and Strategize

Conclusion

Iteration one of the Data Science Framework seems to converge on a 0.77990 submission accuracy. Using the same dataset with different tree-based implementations (AdaBoost, random forest, gradient boosting, XGBoost, etc.) and tuning does not exceed a 0.77990 submission accuracy. Interestingly, for this dataset the simple decision tree had the best default submission score, and with tuning it achieved the same best accuracy score.

While no general conclusions can be drawn from testing a handful of algorithms on a single dataset, several observations stand out for this dataset.

  1. The train dataset has a different distribution than the test/validation dataset (and the underlying population). This created a wide gap between the cross-validation (CV) accuracy score and the Kaggle submission accuracy score (see the sketch after this list).
  2. Given the same dataset, decision-tree-based algorithms seemed to converge on the same accuracy score after proper tuning.
  3. Despite tuning, no machine learning algorithm exceeded the handmade algorithm. My working theory is that for small datasets, a handcrafted algorithm is the bar to beat.
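
A quick way to eyeball observation 1 is to compare how the modeling features are distributed in the training data versus the validation data. Below is a minimal sketch, assuming the data1, data_val, and data1_x_bin objects defined earlier in this kernel; it is a rough check, not a formal test of distribution shift.

import pandas as pd

#compare train vs. validation distributions for each binned/encoded feature used in modeling
for feature in data1_x_bin:
    train_dist = data1[feature].value_counts(normalize = True).sort_index()
    val_dist = data_val[feature].value_counts(normalize = True).sort_index()
    print(feature)
    print(pd.concat([train_dist, val_dist], axis = 1, keys = ['train', 'validation']).round(3))
    print('-'*10)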

With that in mind, for iteration two I would spend more time on preprocessing and feature engineering in order to better align the CV score with the Kaggle score and improve overall accuracy.

Change Log:

11/22/17 Please note, this kernel is currently in progress, but open to feedback. Thanks!
11/23/17 Cleaned up published notebook and updated through step 3.
11/25/17 Added enhancements to published notebook and started step 4.
11/26/17 Skipped ahead to data model, since this is a published notebook. Accuracy with (very) simple data cleaning and logistic regression is ~82%. Continue to up vote and I will continue to develop this notebook. Thanks!
12/2/17 Updated section 4 with exploratory analysis and section 5 with more classifiers. Improved model to ~85% accuracy.
12/3/17 Updated section 4 with improved graphical statistics.
12/7/17 Updated section 5 with Data Science 101 Lesson.
12/8/17 Reorganized section 3 & 4 with cleaner code.
12/9/17 Updated section 5 with model optimization how-tos. Initial competition submission with Decision Tree; will update with better algorithm later.
12/10/17 Updated section 3 & 4 with cleaner code and better datasets.
12/11/17 Updated section 5 with better how-tos.
12/12/17 Cleaned section 5 to prep for hyper-parameter tuning.
12/13/17 Updated section 5 to focus on learning data modeling via decision tree.
12/20/17 Updated section 4 - Thanks @Daniel M. for suggestion to split up visualization code. Started working on section 6 for "super" model.
12/23/17 Edited section 1-5 for clarity and more concise code.
12/24/17 Updated section 5 with random_state and score for more consistent results.
12/31/17 Completed data science framework iteration 1 and added section 7 with conclusion.

Credits

Programming is all about "borrowing" code, because knife sharpens knife. Nonetheless, I want to give credit where credit is due.