Machine Learning Workflow for House Prices

Quite Practical and Far from any Theoretical Concepts

last update: 05/01/2019

You may be interested in having a look at 10 Steps to Become a Data Scientist:

1. Machine Learning

You can fork and run this kernel on GitHub:

GitHub

I hope you find this kernel helpful and some UPVOTES would be very much appreciated



1- Introduction

This is a comprehensive ML workflow for the House Prices dataset. It is clear that everyone in this community is familiar with the house prices dataset, but if you need to review your information about it, please visit this link.

I have tried to show Machine Learning fans on Kaggle how to face machine learning problems, and I think it is a great opportunity for anyone who wants to learn the machine learning workflow with Python from start to finish.

I want to cover most of the methods that were implemented for house prices up to 2018. You can start to learn and review your knowledge about ML with a simple dataset, and try to learn and memorize the workflow for your journey in the data science world.

Before we get into the notebook, let me introduce some helpful resources.


1-2 Kaggle kernels

I want to thank the Kaggle team and all of the kernel authors who develop this huge resource for data scientists. I have learned from the work of others, and I have listed some of the most important kernels that inspired my work and that I used in this kernel:

  1. Comprehensive Data Exploration with python
  2. A study on Regression applied to the Ames dataset
  3. Regularized Linear Models


2- Machine Learning Workflow

If you have already read some machine learning books, you have noticed that there are different ways to stream data into machine learning.

Most of these books share the following steps:

  1. Define Problem
  2. Specify Inputs & Outputs
  3. Exploratory data analysis
  4. Data Collection
  5. Data Preprocessing
  6. Data Cleaning
  7. Visualization
  8. Model Design, Training, and Offline Evaluation
  9. Model Deployment, Online Evaluation, and Monitoring
  10. Model Maintenance, Diagnosis, and Retraining

Of course, the same solution cannot be provided for all problems, so the best way is to create a general framework and adapt it to each new problem.

You can see my workflow in the image below:

Data Science has so many techniques and procedures that can confuse anyone.


2-2 Real world Application Vs Competitions

We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the courses on Coursera, partly makes this comparison.

As you can see, there are a lot more steps to solve in real problems.


3- Problem Definition

I think one of the most important things when you start a new machine learning project is defining your problem: that means you should understand the business problem (problem formalization).

Problem definition has four steps, which are illustrated in the picture below:

3-1 Problem Feature

We will use the house prices data set. This dataset contains information about house prices and the target value is:

  1. SalePrice

Why am I using House price dataset:

  1. This is a good project because it is so well understood.
  2. Attributes are numeric and categorical so you have to figure out how to load and handle data.
  3. It is a Regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  4. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 
  5. Creative feature engineering.


3-1-1 Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
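
As a concrete reference, the metric can be computed with a few lines of numpy; this is a minimal sketch, where y_true and y_pred are hypothetical arrays of observed and predicted sale prices:

import numpy as np

def rmsle(y_true, y_pred):
    # RMSE between the logs of the predicted and observed prices;
    # log1p is used so that a zero value cannot break the log
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# a $10,000 miss costs far more on a cheap house than on an expensive one
print(rmsle(np.array([100000.0]), np.array([110000.0])))  # ~0.095
print(rmsle(np.array([500000.0]), np.array([510000.0])))  # ~0.020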

3-2 Aim

It is our job to predict the sale price for each house: for each Id in the test set, you must predict the value of the SalePrice variable.


3-3 Variables

The variables are :

  • SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale


4- Inputs & Outputs

For every machine learning problem, you should ask yourself: what are the inputs and outputs for the model?


4-1 Inputs

  • train.csv - the training set
  • test.csv - the test set


4-2 Outputs

  • sale prices for every record in test.csv
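
For orientation, this is what producing that output looks like once the packages and data are loaded (sections 5 and 6); a minimal sketch, where predictions is a hypothetical array of model outputs aligned with the rows of test:

# hypothetical submission file: one SalePrice per Id in test.csv
submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
submission.to_csv('submission.csv', index=False)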


5- Loading Packages

In this kernel we are using the following packages:


5-1 Import

In [1]:
from sklearn.linear_model import (Ridge, RidgeCV, ElasticNet, Lasso, LassoCV,
                                  LassoLarsCV, LassoLarsIC, BayesianRidge)
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (KFold, cross_val_score, train_test_split,
                                     GridSearchCV)
from sklearn.metrics import (make_scorer, accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error)
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline, make_pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os


5-2 Version

In [2]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
matplotlib: 2.2.3
sklearn: 0.20.2
scipy: 1.1.0
seaborn: 0.9.0
pandas: 0.23.4
numpy: 1.15.4
Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0]


5-3 Setup

A few tiny adjustments for better code readability

In [3]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline


6- Exploratory Data Analysis (EDA)

In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

  1. Which variables suggest interesting relationships?
  2. Which observations are unusual?

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review the analytical and statistical operations:

  1. Data Collection
  2. Visualization
  3. Data Cleaning
  4. Data Preprocessing


6-1 Data Collection

Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.[techopedia]

<< Note >>

The rows are the samples and the columns are the attributes.

In [4]:
# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')

The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.

In [5]:
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

<< Note 1 >>

  1. Each row is an observation (also known as: sample, example, instance, record)
  2. Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate)

After loading the data via pandas, we should check what the content is; we can start with the type of each object:

In [6]:
type(train),type(test)
Out[6]:
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)


6-1-1 Statistical Summary

1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.[7]

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

In [7]:
# shape
print(train.shape)
(1460, 81)

Train has one column more than test. Why? Because train also contains the target value.

In [8]:
# shape
print(test.shape)
(1459, 80)
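
We can confirm that the extra column in train is exactly the target:

# the only column in train that is missing from test
print(set(train.columns) - set(test.columns))  # {'SalePrice'}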

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 1460 instances and 81 attributes for train, and 1459 instances and 80 attributes for test.

To get some information about the dataset you can use the info() command.

In [9]:
print(train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None

If you want to see the unique values of a column, you can use the following script:

In [10]:
train['Fence'].unique()
Out[10]:
array([nan, 'MnPrv', 'GdWo', 'GdPrv', 'MnWw'], dtype=object)
In [11]:
train["Fence"].value_counts()
Out[11]:
MnPrv    157
GdPrv     59
GdWo      54
MnWw      11
Name: Fence, dtype: int64

Copy the Id column of the train and test data sets:

In [12]:
train_id=train['Id'].copy()
test_id=test['Id'].copy()

To check the first 5 rows of the data set, we can use head(5).

In [13]:
train.head(5) 
Out[13]:
[first 5 rows × 81 columns; wide output truncated]

To check the last 5 rows of the data set, we use the tail() function.

In [14]:
train.tail() 
Out[14]:
[last 5 rows × 81 columns; wide output truncated]

To pull up 5 random rows from the data set, we can use the sample(5) function.

In [15]:
train.sample(5) 
Out[15]:
[5 random rows × 81 columns; wide output truncated]

To get a statistical summary of the dataset, we can use describe().

In [16]:
train.describe() 
Out[16]:
[summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for the 38 numeric columns; wide output truncated. Notably, SalePrice has mean 180921, std 79442, min 34900, max 755000.]

To check how many null values are in the dataset, we can use isnull().sum().

In [17]:
train.isnull().sum().head(2)
Out[17]:
Id            0
MSSubClass    0
dtype: int64
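
head(2) only shows the first two columns; to see where the missing values actually are, sort the counts (a small sketch):

# columns with the most missing values first
missing = train.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
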
In [18]:
train.groupby('SaleType').count()
Out[18]:
[per-column non-null counts for each SaleType group (WD: 1267 rows, New: 122, COD: 43, ConLD: 9, ConLI: 5, ConLw: 5, CWD: 4, Oth: 3, Con: 2); wide output truncated]

To print the dataset columns, we can use the columns attribute.

In [19]:
train.columns
Out[19]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
In [20]:
type((train.columns))
Out[20]:
pandas.core.indexes.base.Index

<< Note 2 >> On a pandas DataFrame you can perform queries, much like a SQL "where" clause:

In [21]:
train[train['SalePrice']>700000]
Out[21]:
[2 rows × 81 columns: Id 692 and Id 1183, both NoRidge 2Story homes with OverallQual 10, sold for 755000 and 745000; wide output truncated]


6-1-2 Target Value Analysis

As you know, SalePrice is our target value, the one we must predict, so let's take a look at it now.

In [22]:
train['SalePrice'].describe()
Out[22]:
count     1460.000
mean    180921.196
std      79442.503
min      34900.000
25%     129975.000
50%     163000.000
75%     214000.000
max     755000.000
Name: SalePrice, dtype: float64

seaborn's distplot flexibly plots a univariate distribution of observations:

In [23]:
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);


6-1-3 Skewness vs Kurtosis

  1. Skewness
    1. It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.
  2. Kurtosis
    1. Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one versus the other tail. It is actually the measure of outliers present in the distribution.
In [24]:
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
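
A skewness of 1.88 is far from the 0 of a symmetric distribution, which is why the target is log-transformed later (section 6-3-1); a quick check of the effect:

# skewness of the log-transformed target is much closer to 0
print("Skewness after log1p: %f" % np.log1p(train['SalePrice']).skew())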


6-2 Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]

In this section I show you 11 plots with matplotlib and seaborn, as listed in the picture below:


6-2-1 Scatter plot

The purpose of a scatter plot is to identify the type of relationship (if any) between two quantitative variables.

In [25]:
# Color each point by its OverallQual value
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
g = sns.FacetGrid(train[columns], hue="OverallQual", size=5)
g = g.map(plt.scatter, "OverallQual", "SalePrice", edgecolor="w").add_legend()
plt.show()


6-2-2 Box

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]

In [26]:
data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
In [27]:
ax= sns.boxplot(x="OverallQual", y="SalePrice", data=train[columns])
ax= sns.stripplot(x="OverallQual", y="SalePrice", data=train[columns], jitter=True, edgecolor="gray")
plt.show()


6-2-3 Histogram

We can also create a histogram of each input variable to get an idea of the distribution.

In [28]:
# histograms
train.hist(figsize=(15,20))
plt.figure()
Out[28]:
<Figure size 648x504 with 0 Axes>
<Figure size 648x504 with 0 Axes>
In [29]:
mini_train=train[columns]
f,ax=plt.subplots(1,2,figsize=(20,10))
mini_train[mini_train['SalePrice']>100000].GarageArea.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('SalePrice>100000')
x1=list(range(0,1500,250))  # GarageArea is in square feet, up to ~1400
ax[0].set_xticks(x1)
mini_train[mini_train['SalePrice']<100000].GarageArea.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('SalePrice<100000')
x2=list(range(0,1500,250))
ax[1].set_xticks(x2)
plt.show()
In [30]:
 
mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()
 
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1de5dc470>
In [31]:
train['OverallQual'].value_counts().plot(kind="bar");

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.


6-2-4 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

In [32]:
# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.figure()
Out[32]:
<Figure size 648x504 with 0 Axes>
<Figure size 648x504 with 0 Axes>

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.


6-2-5 violinplots

In [33]:
# violin plot of SalePrice for each Functional category
sns.violinplot(data=train,x="Functional", y="SalePrice")
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1de8bdc88>


6-2-6 pairplot

In [34]:
# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],size = 2 ,kind ='scatter')
plt.show()


6-2-7 kdeplot

In [35]:
# seaborn's kdeplot plots univariate or bivariate density estimates.
# Size can be changed by tweaking the value used
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", size=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()


6-2-8 jointplot

In [36]:
# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], size=10,ratio=10, kind='hex',color='green')
plt.show()
In [37]:
# we will use seaborn jointplot shows bivariate scatterplots and univariate histograms with Kernel density 
# estimation in the same figure
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], size=6, kind='kde', color='#800000', space=0)
Out[37]:
<seaborn.axisgrid.JointGrid at 0x7fe1deb5bdd8>


6-2-9 Heatmap

In [38]:
plt.figure(figsize=(7,4)) 
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.heatmap(train[columns].corr(),annot=True,cmap='cubehelix_r') # draw a heatmap of the correlation matrix computed by train[columns].corr()
plt.show()


6-2-10 radviz

In [39]:
from pandas.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1de71c978>


6-2-11 Factorplot

In [40]:
sns.factorplot('OverallQual','SalePrice',hue='Functional',data=train)
plt.show()


6-3 Data Preprocessing

Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

Data Preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format that is not feasible for analysis. There are plenty of steps for data preprocessing, and we have listed just some of them:

  1. Removing the Id column
  2. Sampling (without replacement)
  3. Making part of the data unbalanced and balancing it (with undersampling and SMOTE)
  4. Introducing missing values and treating them (replacing them with average values)
  5. Noise filtering
  6. Data discretization
  7. Normalization and standardization (see the sketch after this list)
  8. PCA analysis
  9. Feature selection (filter, embedded, wrapper)
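
As a small illustration of item 7, here is a minimal standardization sketch; GrLivArea and LotArea are just two example numeric columns, and in a real pipeline the scaler should be fit on train only:

# a minimal standardization sketch (item 7 above); fit on train only
scaler = StandardScaler()
scaled = scaler.fit_transform(train[['GrLivArea', 'LotArea']])
print(scaled.mean(axis=0), scaled.std(axis=0))  # columns now have mean ~0, std ~1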


6-3-1 Noise filtering (Outliers)

An outlier is a data point that is distant from other similar points. Put simply, an outlier is an abnormal observation lying amongst the normal observations in a sample. In statistics, an outlier is an observation point that is distant from other observations.

In [41]:
# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

train = train[train.GrLivArea < 4000]

There are 2 extreme outliers on the bottom right.

In [42]:
#deleting points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
In [43]:
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
In [44]:
all_data = pd.get_dummies(all_data)
In [45]:
# Log transform the target for official scoring
#The key point is to log_transform the numeric variables since most of them are skewed.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.

In [46]:
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()


6-4 Data Cleaning

When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.


6-4-1 Handle missing values

Firstly, understand that there is NO good way to deal with missing data.

In [47]:
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
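
Filling every column with its mean is the bluntest option. A slightly more careful scheme, sketched below under the assumption that it runs on the raw all_data frame before pd.get_dummies (which is not what this kernel does), would impute medians for numeric columns and modes for categorical ones:

# hypothetical alternative: median for numeric columns, mode for categorical ones
# (would need to run before pd.get_dummies to still see the object columns)
for col in all_data.columns:
    if all_data[col].dtype == "object":
        all_data[col] = all_data[col].fillna(all_data[col].mode()[0])
    else:
        all_data[col] = all_data[col].fillna(all_data[col].median())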


7- Model Deployment

In this section, plenty of learning algorithms are applied; they play an important role in building your experience and improving your knowledge of ML techniques.

<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.



7-1 Families of ML algorithms

There are several categories for machine learning algorithms, below are some of these categories:

  • Linear
    • Linear Regression
    • Logistic Regression
    • Support Vector Machines
  • Tree-Based
    • Decision Tree
    • Random Forest
    • GBDT
  • KNN
  • Neural Networks

And if we want to categorize ML algorithms by the type of learning, there are the types below:

  • Classification

    • k-Nearest Neighbors
    • Logistic Regression
    • SVM
    • DT
    • NN
  • Clustering

    • K-means
    • HCA
    • Expectation Maximization
  • Visualization and dimensionality reduction:

    • Principal Component Analysis(PCA)
    • Kernel PCA
    • Locally -Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning

    • Apriori
    • Eclat
  • Semisupervised learning
  • Reinforcement Learning
    • Q-learning
  • Batch learning & Online learning
  • Ensemble Learning

<< Note >>

There is no method which outperforms all others for all tasks.


7-2 Accuracy and precision

One of the most important questions to ask as a machine learning engineer when evaluating a model is: how do we judge it? Each machine learning model tries to solve a problem with a different objective using a different dataset, and hence it is important to understand the context before choosing a metric.

7-2-1 RMSE

Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)


In [48]:
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
In [49]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 0 to 1455
Columns: 288 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(25), int64(11), uint8(252)
memory usage: 779.2 KB


7-3 Ridge

In [50]:
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
In [51]:
model_ridge = Ridge()
In [52]:
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() for alpha in alphas]


7-3-1 Root Mean Squared Error

In [53]:
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")
Out[53]:
Text(0,0.5,'rmse')
In [54]:
# steps
steps = [('scaler', StandardScaler()),
         ('ridge', Ridge())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'ridge__alpha':np.logspace(-4, 0, 50)}

# Create the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y)

#predict on train set
y_pred_train=cv.predict(X_train)

# Predict test set
y_pred_test=cv.predict(X_test)

# rmse on train set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))
Root Mean Squared Error: 0.32811446712445086
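
It is also worth looking at which alpha the grid search actually picked; GridSearchCV exposes the winning settings after fitting:

# the hyperparameters chosen by 3-fold cross-validation, and the CV score
print(cv.best_params_)
print(cv.best_score_)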


7-4 RandomForestRegressor

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).

In [55]:
num_test = 0.3
# hold out 30% for validation; assigning to X_val/y_val (rather than X_test)
# keeps the competition test matrix from being overwritten
X_train, X_val, y_train, y_val = train_test_split(X_train, y, test_size=num_test, random_state=100)
In [56]:
# Fit Random Forest on Training Set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)

# Score model
regressor.score(X_train, y_train)
Out[56]:
0.877294261920777
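
Scoring on the training set is optimistic for a random forest, since each tree has already seen most of these rows; a cross-validated RMSE on the training split gives a more honest estimate (a small sketch):

# 5-fold cross-validated RMSE for the forest (lower is better)
scores = np.sqrt(-cross_val_score(regressor, X_train, y_train,
                                  scoring="neg_mean_squared_error", cv=5))
print(scores.mean())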


7-5 XGBoost

XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.

7-5-1 But what makes XGBoost so popular?

  1. Speed and performance: Originally written in C++, it is comparatively faster than other ensemble classifiers.

  2. Core algorithm is parallelizable: Because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It is also parallelizable onto GPUs and across networks of computers, making it feasible to train on very large datasets as well.

  3. Consistently outperforms other algorithm methods: It has shown better performance on a variety of machine learning benchmark datasets.

  4. Wide variety of tuning parameters: XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, a scikit-learn compatible API, etc.[10]

XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.
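
In short, boosting adds trees one at a time, each one fitted to the errors of the ensemble built so far. One practical consequence is that the number of boosting rounds matters; here is a hedged sketch of choosing it with XGBoost's built-in cross-validation (the parameter values are illustrative assumptions, not tuned settings):

# let xgb.cv pick the number of boosting rounds via early stopping
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'objective': 'reg:linear', 'max_depth': 3, 'eta': 0.1}
cv_result = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                   metrics='rmse', early_stopping_rounds=20)
print(len(cv_result))  # rounds kept before the validation RMSE stopped improving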

In [57]:
# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()                  

# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)
Out[57]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
In [58]:
# Score model
XGB_Regressor.score(X_train, y_train)
Out[58]:
0.4598271920204544


7-6 LassoCV

Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.

In [59]:
lasso=LassoCV()
In [60]:
# Fit the model on our data
lasso.fit(X_train, y_train)
Out[60]:
LassoCV(alphas=None, copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
    positive=False, precompute='auto', random_state=None,
    selection='cyclic', tol=0.0001, verbose=False)
In [61]:
# Score model
lasso.score(X_train, y_train)
Out[61]:
0.1171814974164107
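
A score this low suggests the default alpha path shrank almost everything away; inspecting the chosen penalty and the surviving coefficients can confirm it (a quick sketch):

# the penalty chosen by cross-validation and how many features survived
print("alpha:", lasso.alpha_)
print("non-zero coefficients:", (lasso.coef_ != 0).sum())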


7-7 GradientBoostingRegressor

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

In [62]:
boostingregressor=GradientBoostingRegressor()
In [63]:
# Fit the model on our data
boostingregressor.fit(X_train, y_train)
Out[63]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
In [64]:
# Score model
boostingregressor.score(X_train, y_train)
Out[64]:
0.4931895744123065


7-8 DecisionTree

In [72]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)
In [73]:
# Fit model
dt.fit(X_train, y_train)
Out[73]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')
In [74]:
dt.score(X_train, y_train)
Out[74]:
0.9999999994606554
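
A training score of essentially 1.0 is the signature of an unpruned tree memorizing the data rather than of a good model; a cross-validated check (a quick sketch) will report a noticeably lower R^2:

# 5-fold cross-validated R^2 for the same tree
print(cross_val_score(dt, X_train, y_train, cv=5).mean())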


7-9 ExtraTreeRegressor

In [78]:
from sklearn.tree import ExtraTreeRegressor

dtr = ExtraTreeRegressor()
In [79]:
# Fit model
dtr.fit(X_train, y_train)
Out[79]:
ExtraTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
          max_leaf_nodes=None, min_impurity_decrease=0.0,
          min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, random_state=None,
          splitter='random')
In [80]:
# Fit model
dtr.score(X_train, y_train)
Out[80]:
0.999999998257654


8- Conclusion

This kernel is not complete yet. I will try to cover all the parts related to the ML process with a variety of Python packages. I know that there are still some problems, and I hope to get your feedback to improve it.

You can follow me on:

GitHub


I hope you find this kernel helpful and some UPVOTES would be very much appreciated

Go to first step: Course Home Page

Go to next step: Titanic