Machine Learning Workflow for House Prices

Quite Practical and Far from any Theoretical Concepts

last update: **05/01/2019**

## You may be interested in having a look at 10 Steps to Become a Data Scientist:

## 1. Machine Learning

You can fork and run this kernel on GitHub:

## GitHub

**I hope you find this kernel helpful and some UPVOTES would be very much appreciated**

This is **A Comprehensive ML Workflow for House Prices**. It is clear that everyone in this community is familiar with the house prices dataset, but if you need to refresh your knowledge of the dataset, please visit this link.

I have tried to show **fans of machine learning** on Kaggle how to approach machine learning problems, and I think it is a great opportunity for anyone who wants to learn the machine learning workflow with Python **completely**.

I want to cover most of the methods that have been applied to house prices up to **2018**, so you can start to learn and review your knowledge of ML with a simple dataset, and learn and memorize the workflow for your journey in the data science world.

Before we get into the notebook, let me introduce some helpful resources.

There are a lot of Online courses that can help you develop your knowledge, here I have just listed some of them:

Machine Learning Certification by Stanford University (Coursera)

Machine Learning A-Z™: Hands-On Python & R In Data Science (Udemy)

Deep Learning Certification by Andrew Ng from deeplearning.ai (Coursera)

Python for Data Science and Machine Learning Bootcamp (Udemy)

Complete Guide to TensorFlow for Deep Learning Tutorial with Python

Data Science and Machine Learning Tutorial with Python – Hands On

- Creative Applications of Deep Learning with TensorFlow
- Neural Networks for Machine Learning
- Practical Deep Learning For Coders, Part 1
- Machine Learning

I want to thank the **Kaggle team** and all of the **kernel authors** who have developed these huge resources for data scientists. I have learned from the work of others, and I have listed some of the more important kernels that inspired my work and that I have used in this kernel:

If you love reading, here are **10 free machine learning books**:

- Probability and Statistics for Programmers
- Bayesian Reasoning and Machine Learning
- An Introduction to Statistical Learning
- Understanding Machine Learning
- A Programmer's Guide to Data Mining
- Mining of Massive Datasets
- A Brief Introduction to Neural Networks
- Deep Learning
- Natural Language Processing with Python
- Machine Learning Yearning

I am open to your feedback for improving this **kernel**.

If you have already read some machine learning books, you have noticed that there are different ways to stream data into a machine learning pipeline.

Most of these books share the following steps:

- Define Problem
- Specify Inputs & Outputs
- Exploratory data analysis
- Data Collection
- Data Preprocessing
- Data Cleaning
- Visualization
- Model Design, Training, and Offline Evaluation
- Model Deployment, Online Evaluation, and Monitoring
- Model Maintenance, Diagnosis, and Retraining

Of course, the same solution cannot be applied to every problem, so the best approach is to create a **general framework** and adapt it to each new problem.
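The "Model Design, Training, and Offline Evaluation" part of such a framework can be packaged as one reusable object, for example with a scikit-learn pipeline. This is only a minimal sketch on hypothetical toy data (the feature values and prices below are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# hypothetical toy data standing in for engineered features and log-prices
X = np.array([[1500.0, 7.0], [2000.0, 8.0], [1200.0, 5.0], [1800.0, 6.0]])
y = np.log1p(np.array([200000.0, 300000.0, 120000.0, 220000.0]))

# scale, then fit a regularized linear model: one reusable "framework" object
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.predict(X[:1]))
```

Because the scaling and the model live in one pipeline, the same object can be refit on a new problem without rewriting the surrounding code.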

**You can see my workflow in the below image** :

**Data Science has so many techniques and procedures that can confuse anyone.**

We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the Coursera courses, partly illustrates this comparison.

As you can see, there are a lot more steps to solve in real problems.

I think one of the important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (**Problem Formalization**).

Problem definition has four steps, which are illustrated in the picture below:

We will use the house prices data set. This dataset contains information about house prices and the target value is:

- SalePrice

**Why am I using House price dataset:**

- This is a good project because it is so well understood.
- Attributes are numeric and categorical, so you have to figure out how to load and handle the data.
- It is a Regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
- This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
- It rewards creative feature engineering.

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
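The competition metric can be sketched in a few lines. The helper name `rmsle` below is my own choice, not something defined by the competition:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root-Mean-Squared-Error between the logs of predictions and actuals."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# because of the logs, a 10% miss on an expensive house costs
# the same as a 10% miss on a cheap one
print(rmsle(np.array([100000.0]), np.array([110000.0])))
print(rmsle(np.array([100.0]), np.array([110.0])))
```

Both calls print the same value, which is exactly the "errors affect the result equally" property described above.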

It is our job to predict the sale price for each house. For each Id in the test set, you must predict the value of the **SalePrice** variable.

The variables are :

- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

In [1]:

```
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os
```

In [2]:

```
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
```

In [3]:

```
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline
```

In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

- Which variables suggest interesting relationships?
- Which observations are unusual?

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

- Data Collection
- Visualization
- Data Cleaning
- Data Preprocessing

**Data collection** is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.[techopedia]

**<< Note >>**

The rows are the samples and the columns are the attributes.

In [4]:

```
# import the dataset to play with it
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
```

The **concat** function does all of the heavy lifting of performing concatenation operations along an axis. Let us create `all_data`:

In [5]:

```
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
```
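The column slice above keeps everything from `MSSubClass` through `SaleCondition`, dropping `Id` and (in train) `SalePrice`. A tiny stand-in example shows what `concat` does with the row indices, which matters later when you split `all_data` back apart:

```python
import pandas as pd

# stand-in frames with a single shared column
a = pd.DataFrame({'MSSubClass': [60, 20]})
b = pd.DataFrame({'MSSubClass': [70]})

combined = pd.concat((a, b))                  # keeps original indices: 0, 1, 0
flat = pd.concat((a, b), ignore_index=True)   # reindexes: 0, 1, 2
print(len(combined), list(flat.index))
```

Without `ignore_index=True` the test rows keep their own 0-based index, so duplicate index labels appear in the combined frame; positional slicing with `iloc` is the safe way to split it again.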

**<< Note 1 >>**

- Each row is an observation (also known as : sample, example, instance, record)
- Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

After loading the data via **pandas**, we should check out the content, dimensions and description of the dataset via the following:

In [6]:

```
type(train),type(test)
```

Out[6]:

1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.[7]

Don't worry, each look at the data is **one command**. These are useful commands that you can use again and again on future projects.

In [7]:

```
# shape
print(train.shape)
```

Train has one more column than test. Why? (Yes ==>> the **target value**.)

In [8]:

```
# shape
print(test.shape)
```

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

**You should see 1460 instances and 81 attributes for train and 1459 instances and 80 attributes for test**
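The extra train column can be identified programmatically with a set difference on the column names. A sketch on tiny stand-in frames (the real `train` and `test` work the same way):

```python
import pandas as pd

# stand-in frames: train carries the target column, test does not
train_df = pd.DataFrame({'Id': [1], 'LotArea': [8450], 'SalePrice': [208500]})
test_df = pd.DataFrame({'Id': [2], 'LotArea': [9600]})

# columns present in train but not in test
extra = set(train_df.columns) - set(test_df.columns)
print(extra)  # {'SalePrice'}
```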

For getting some information about the dataset you can use **info()** command.

In [9]:

```
print(train.info())
```

**If you want to see the data type and unique values of a column, you can use the following script:**

In [10]:

```
train['Fence'].unique()
```

Out[10]:

In [11]:

```
train["Fence"].value_counts()
```

Out[11]:

Copy the Id column for the train and test data sets:

In [12]:

```
train_id = train['Id'].copy()
test_id = test['Id'].copy()
```

**To check the first 5 rows of the data set, we can use head(5).**

In [13]:

```
train.head(5)
```

Out[13]:

**To check out the last 5 rows of the data set, we use the tail() function.**

In [14]:

```
train.tail()
```

Out[14]:

To pull 5 random rows from the data set, we can use the **sample(5)** function.

In [15]:

```
train.sample(5)
```

Out[15]:

To get a **statistical summary** of the dataset, we can use **describe()**.

In [16]:

```
train.describe()
```

Out[16]:

To check how many null values there are in the dataset, we can use **isnull().sum()**.

In [17]:

```
train.isnull().sum().head(2)
```

Out[17]:
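`head(2)` above only shows the first two columns. A fuller view sorts the per-column counts and keeps only the columns that actually have gaps; sketched here on a tiny stand-in frame with made-up values:

```python
import numpy as np
import pandas as pd

# stand-in frame: PoolQC and Fence have gaps, LotArea is complete
df = pd.DataFrame({'PoolQC': [np.nan, np.nan, 'Gd'],
                   'Fence': [np.nan, 'MnPrv', 'GdWo'],
                   'LotArea': [8450, 9600, 11250]})

# count missing values per column, most-missing first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
```

Run on the real `train`, the same two lines surface sparse columns like PoolQC, MiscFeature, Alley and Fence at the top of the list.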

In [18]:

```
train.groupby('SaleType').count()
```

Out[18]:

To print the dataset **columns**, we can use the columns attribute.

In [19]:

```
train.columns
```

Out[19]:

In [20]:

```
type((train.columns))
```

Out[20]:

**<< Note 2 >>**
In a pandas DataFrame you can perform queries such as a "where" filter:

In [21]:

```
train[train['SalePrice']>700000]
```

Out[21]:

As you know, **SalePrice** is the target value that we should predict, so let's now take a look at it.

In [22]:

```
train['SalePrice'].describe()
```

Out[22]:

Flexibly plot a univariate distribution of observations.

In [23]:

```
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);
```

- **Skewness**: the degree of distortion from the symmetrical bell curve, the normal distribution. It measures the lack of symmetry in a data distribution and differentiates extreme values in one tail versus the other. A symmetrical distribution has a skewness of 0.

- **Kurtosis**: kurtosis is all about the tails of the distribution, not the peakedness or flatness. It describes the extreme values in one tail versus the other, and is effectively a measure of the outliers present in the distribution.

In [24]:

```
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
```
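Since SalePrice comes out strongly right-skewed, a common remedy in house-prices kernels is to model `log1p(SalePrice)` instead (which is also consistent with the log-based competition metric). A sketch on synthetic log-normal "prices", since the real column is not loaded here:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# synthetic right-skewed "prices": exp of a normal is log-normal
prices = np.exp(rng.normal(12, 0.4, size=1000))

print('raw skew:   %.2f' % skew(prices))       # clearly positive
print('log1p skew: %.2f' % skew(np.log1p(prices)))  # near zero
```

The inverse transform `np.expm1` turns model predictions back into dollar prices at submission time.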

**Data visualization** is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it's processed.[SAS]

In this section I show you **11 plots** with **matplotlib** and **seaborn**, as listed in the picture below:

**Scatter plot**: its purpose is to identify the type of relationship (if any) between two quantitative variables.

In [25]:

```
# Scatter SalePrice against OverallQual, coloring each quality level individually
columns = ['SalePrice', 'OverallQual', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'FullBath', 'YearBuilt', 'YearRemodAdd']
g = sns.FacetGrid(train[columns], hue="OverallQual", size=5)
g = g.map(plt.scatter, "OverallQual", "SalePrice", edgecolor="w").add_legend()
plt.show()
```

In descriptive statistics, a **box plot** or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]

In [26]:

```
data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
```

In [27]:

```
ax= sns.boxplot(x="OverallQual", y="SalePrice", data=train[columns])
ax= sns.stripplot(x="OverallQual", y="SalePrice", data=train[columns], jitter=True, edgecolor="gray")
plt.show()
```

We can also create a **histogram** of each input variable to get an idea of the distribution.

In [28]:

```
# histograms
train.hist(figsize=(15,20))
plt.figure()
```

Out[28]:

In [29]:

```
mini_train = train[columns]
f, ax = plt.subplots(1, 2, figsize=(20, 10))
# GarageArea distribution for expensive vs. cheap houses
mini_train[mini_train['SalePrice'] > 100000].GarageArea.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('SalePrice>100000')
mini_train[mini_train['SalePrice'] < 100000].GarageArea.plot.hist(ax=ax[1], color='green', bins=20, edgecolor='black')
ax[1].set_title('SalePrice<100000')
plt.show()
```

In [30]:

```
mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()
```

Out[30]:

In [31]:

```
train['OverallQual'].value_counts().plot(kind="bar");
```

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Now we can look at the interactions between the variables.

First, let's look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

In [32]:

```
# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.figure()
```

Out[32]:

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
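That visual impression can be quantified with Pearson correlations against the target. A sketch on a tiny stand-in frame (on the real data you would call `train[columns].corr()` the same way; the numbers below are made up):

```python
import pandas as pd

# stand-in frame for train[columns]
df = pd.DataFrame({'GrLivArea': [1710, 1262, 1786, 1717],
                   'TotalBsmtSF': [856, 1262, 920, 756],
                   'SalePrice': [208500, 181500, 223500, 140000]})

# pairwise Pearson correlations with the target, strongest first
corr = df.corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(corr)
```

On the full training set, OverallQual and GrLivArea typically top this ranking, which matches the tight diagonal clouds in the scatter matrix.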

In [33]:

```
# violin plot of SalePrice for each Functional rating
sns.violinplot(data=train,x="Functional", y="SalePrice")
```

Out[33]:

In [34]:

```
# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],size = 2 ,kind ='scatter')
plt.show()
```