You may be interested in having a look at 10 Steps to Become a Data Scientist:
1. Machine Learning
You can fork and run this kernel on GitHub:
GitHub
I hope you find this kernel helpful, and some UPVOTES would be very much appreciated.
This is a comprehensive ML workflow for the House Prices data set. Everyone in this community is probably familiar with the house prices dataset, but if you need to refresh your knowledge about it, please visit this link.
I have tried to show fans of machine learning on Kaggle how to approach machine learning problems, and I think it is a great opportunity for anyone who wants to learn the machine learning workflow with Python end to end.
I want to cover most of the methods implemented for house prices up to 2018, so you can start learning and reviewing your ML knowledge on a simple dataset and memorize the workflow for your journey in the data science world.
Before we get into the notebook, let me introduce some helpful resources.
There are a lot of online courses that can help you develop your knowledge; here I have listed just some of them:
Machine Learning Certification by Stanford University (Coursera)
Machine Learning A-Z™: Hands-On Python & R In Data Science (Udemy)
Deep Learning Certification by Andrew Ng from deeplearning.ai (Coursera)
Python for Data Science and Machine Learning Bootcamp (Udemy)
Complete Guide to TensorFlow for Deep Learning Tutorial with Python
Data Science and Machine Learning Tutorial with Python – Hands On
I want to thank the Kaggle team and all of the kernel authors who have developed this huge resource for data scientists. I have learned from the work of others, and I have listed some of the most important kernels that inspired my work and that I have used in this kernel:
If you love reading, here are 10 free machine learning books.
I am open to your feedback for improving this kernel.
If you have already read some machine learning books, you have noticed that there are different ways to stream data into a machine learning workflow.
Most of these books share the following steps:
Of course, the same solution cannot be applied to every problem, so the best way is to create a general framework and adapt it to each new problem.
You can see my workflow in the image below:
Data Science has so many techniques and procedures that can confuse anyone.
We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the courses on Coursera, partly illustrates this comparison.
As you can see, there are a lot more steps to solve in real problems.
I think one of the most important things when you start a new machine learning project is defining your problem; that means you should understand the business problem (problem formalization).
Problem definition has four steps, which are illustrated in the picture below:
We will use the house prices data set. This dataset contains information about houses and their sale prices, and the target value is SalePrice.
Why am I using the house prices dataset?
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
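For concreteness, here is a minimal sketch of this metric; it assumes two hypothetical arrays, y_true and y_pred, holding the observed and predicted sale prices.
import numpy as np

# Minimal sketch of the competition metric: RMSE computed on log-transformed prices.
# y_true and y_pred are hypothetical arrays of observed and predicted SalePrice values.
def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

print(rmsle(np.array([200000.0, 100000.0]), np.array([210000.0, 95000.0])))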
It is our job to predict the sale price of each house: for each Id in the test set, you must predict the value of the SalePrice variable.
The variables are:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, Lasso, LassoCV, LassoLarsCV, LassoLarsIC, BayesianRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, classification_report, confusion_matrix, mean_squared_error
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline, make_pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline
In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.
By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:
Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.[techopedia]
<< Note >>
The rows are the samples and the columns are the attributes.
# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')
The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
test.loc[:,'MSSubClass':'SaleCondition']))
<< Note 1 >>
After loading the data with pandas, we should check what the content is and get a description of it, as follows:
type(train),type(test)
1- Dimensions of the dataset.
2- Peek at the data itself.
3- Statistical summary of all attributes.
4- Breakdown of the data by the class variable.[7]
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
# shape
print(train.shape)
Train has one more column than test. Why? Because it contains the target value (SalePrice).
# shape
print(test.shape)
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
You should see 1460 instances and 81 attributes for train and 1459 instances and 80 attributes for test
To get some information about the dataset, you can use the info() command.
print(train.info())
If you want to see the data type of a column and its unique values, you can use the following script:
train['Fence'].unique()
train["Fence"].value_counts()
Copy the Id column for the train and test data sets:
train_id=train['Id'].copy()
test_id=test['Id'].copy()
To check the first 5 rows of the data set, we can use head(5).
train.head(5)
To check the last 5 rows of the data set, we use the tail() function.
train.tail()
To sample 5 random rows from the data set, we can use the sample(5) function.
train.sample(5)
To get a statistical summary of the dataset, we can use describe().
train.describe()
To check how many null values are in the dataset, we can use isnull().sum().
train.isnull().sum().head(2)
train.groupby('SaleType').count()
To print the dataset columns, we can use the columns attribute.
train.columns
type((train.columns))
<< Note 2 >> In a pandas DataFrame you can perform queries, such as a "where" filter:
train[train['SalePrice']>700000]
As you know, SalePrice is the target value that we should predict, so let's take a look at it now.
train['SalePrice'].describe()
distplot flexibly plots a univariate distribution of observations.
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]
In this section I show you 11 plots with matplotlib and seaborn, which are listed in the picture below:
Scatter plot: its purpose is to identify the type of relationship (if any) between two quantitative variables.
# Color each point by its OverallQual value.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
g = sns.FacetGrid(train[columns], hue="OverallQual", size=5)
g = g.map(plt.scatter, "OverallQual", "SalePrice", edgecolor="w").add_legend()
plt.show()
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]
data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.boxplot(x="OverallQual", y="SalePrice", data=data)
ax = sns.stripplot(x="OverallQual", y="SalePrice", data=data, jitter=True, edgecolor="gray")
plt.show()
We can also create a histogram of each input variable to get an idea of the distribution.
# histograms
train.hist(figsize=(15,20))
plt.figure()
mini_train = train[columns]
f, ax = plt.subplots(1, 2, figsize=(20, 10))
mini_train[mini_train['SalePrice'] > 100000].GarageArea.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('SalePrice>100000')
mini_train[mini_train['SalePrice'] < 100000].GarageArea.plot.hist(ax=ax[1], color='green', bins=20, edgecolor='black')
ax[1].set_title('SalePrice<100000')
plt.show()
mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()
train['OverallQual'].value_counts().plot(kind="bar");
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.figure()
Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
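To back up this visual impression with numbers, the small sketch below prints the correlation of each selected feature with SalePrice (an addition of mine; columns is the feature list defined above):
# correlation of each selected feature with SalePrice, highest first
print(train[columns].corr()['SalePrice'].sort_values(ascending=False))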
# violin plot of SalePrice for each Functional category
sns.violinplot(data=train,x="Functional", y="SalePrice")
# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],size = 2 ,kind ='scatter')
plt.show()
# seaborn's kdeplot plots univariate or bivariate density estimates.
# The plot size can be changed by tweaking the size value used.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", size=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()
# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], size=10,ratio=10, kind='hex',color='green')
plt.show()
# seaborn's jointplot shows bivariate scatterplots and univariate distributions with kernel density
# estimation in the same figure
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], size=6, kind='kde', color='#800000', space=0)
plt.figure(figsize=(7,4))
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.heatmap(train[columns].corr(), annot=True, cmap='cubehelix_r')  # heatmap of the correlation matrix computed by train[columns].corr()
plt.show()
from pandas.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")
sns.factorplot('OverallQual','SalePrice',hue='Functional',data=train)
plt.show()
Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.
Data preprocessing is a technique that is used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format that is not feasible for analysis. There are plenty of steps in data preprocessing, and we list just some of them:
An outlier is a data point that is distant from other similar points. Put more simply, an outlier is an abnormal observation that lies among the normal observations in a sample from a population. In statistics, an outlier is an observation point that is distant from other observations.
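Before looking at the plot below, here is a quick numeric sketch of one common way to flag outlier candidates, the 1.5 * IQR rule applied to GrLivArea (an illustrative assumption on my part, not the approach this kernel actually uses):
# Flag GrLivArea values outside 1.5 * IQR of the quartiles as outlier candidates (illustration only).
q1, q3 = train['GrLivArea'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = train[(train['GrLivArea'] < q1 - 1.5 * iqr) | (train['GrLivArea'] > q3 + 1.5 * iqr)]
print("{} candidate outliers by the IQR rule".format(iqr_outliers.shape[0]))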
# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
train = train[train.GrLivArea < 4000]
There are two extreme outliers on the bottom right: houses with a very large living area but an unusually low sale price.
#deleting points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
all_data = pd.get_dummies(all_data)
# Log transform the target for official scoring
#The key point is to to log_transform the numeric variables since most of them are skewed.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice
Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.
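Before imputing, it helps to see which columns actually contain missing values; a short sketch:
# columns of all_data with missing values, most affected first
missing = all_data.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False).head(10))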
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
In this section, plenty of learning algorithms are applied; they play an important role in building your experience and improving your knowledge of ML techniques.
<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.
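If you want more reproducible runs, one common practice (an assumption on my part, not something this kernel depends on) is to fix the random seeds up front:
# Fix random seeds for more reproducible results (illustrative practice, not required by this kernel).
import random
random.seed(0)
np.random.seed(0)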
There are several categories for machine learning algorithms, below are some of these categories:
And if we want to categorize ML algorithms by the type of learning, these are some of the types:
Classification
Clustering
Visualization and dimensionality reduction
Association rule learning
<< Note >>
There is no method that outperforms all others for all tasks.
One of the most important questions to ask as a machine learning engineer when evaluating a model is: how do we judge our own model?
Each machine learning model tries to solve a problem with a different objective using a different dataset; hence, it is important to understand the context before choosing a metric.
Here, submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
X_train.info()
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")
# steps
steps = [('scaler', StandardScaler()),
('ridge', Ridge())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'ridge__alpha':np.logspace(-4, 0, 50)}
# Create the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)
# Fit to the training set
cv.fit(X_train, y)
#predict on train set
y_pred_train=cv.predict(X_train)
# Predict test set
y_pred_test=cv.predict(X_test)
# rmse on train set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=num_test, random_state=100)
# Fit Random Forest on Training Set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)
# Score model
regressor.score(X_train, y_train)
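A quick look at which features the forest relies on most (a small addition; the importances come from the regressor fitted above):
# top 10 feature importances of the fitted random forest
importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))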
XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.
Speed and performance: originally written in C++, it is comparatively faster than other ensemble classifiers.
Core algorithm is parallelizable: because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It is also parallelizable onto GPUs and across networks of computers, making it feasible to train on very large datasets as well.
Consistently outperforms other methods: it has shown better performance on a variety of machine learning benchmark datasets.
Wide variety of tuning parameters: XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, a scikit-learn compatible API, etc.[10]
XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.
# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()
# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)
# Score model
XGB_Regressor.score(X_train, y_train)
Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.
lasso=LassoCV()
# Fit the model on our data
lasso.fit(X_train, y_train)
# Score model
lasso.score(X_train, y_train)
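LassoCV picks its regularization strength by cross-validation; we can print the alpha it settled on (a small addition):
# regularization strength selected by LassoCV
print("Chosen alpha: {}".format(lasso.alpha_))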
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
boostingregressor=GradientBoostingRegressor()
# Fit the model on our data
boostingregressor.fit(X_train, y_train)
# Score model
boostingregressor.score(X_train, y_train)
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)
# Fit model
dt.fit(X_train, y_train)
dt.score(X_train, y_train)
from sklearn.tree import ExtraTreeRegressor
dtr = ExtraTreeRegressor()
# Fit model
dtr.fit(X_train, y_train)
# Score model
dtr.score(X_train, y_train)
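To wrap up this section, here is a small summary of my own that collects the training-set R^2 scores computed above; note that these are training scores only and say nothing about generalization.
# collect the training-set R^2 scores of the models fitted above (training scores only)
models = {'Random Forest': regressor, 'XGBoost': XGB_Regressor, 'LassoCV': lasso,
          'Gradient Boosting': boostingregressor, 'Decision Tree': dt, 'Extra Tree': dtr}
for name, model in models.items():
    print("{:18s} train R^2: {:.4f}".format(name, model.score(X_train, y_train)))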
This kernel is not complete yet. I will try to cover all the parts of the ML process with a variety of Python packages, and since I know there are still some problems, I hope to get your feedback to improve it.
Go to first step: Course Home Page
Go to next step : Titanic