Warehouse optimization

In this kernel we will analyse the sales data of a UK online retailer. Since storage space is expensive and fast, on-time delivery is important to prevail over the competition, we would like to help the retailer by predicting the daily number of products sold as well as the revenue gained. This way we will try to answer:

  • How many items of each product will be sold tomorrow, next week, next month?
  • Which products yield the highest revenues and which ones are best-sellers?
  • Which products have unstable prices, and how does this affect the purchase behaviour of the retailer's customers?
  • What are the customer preferences? Can we gain a basic understanding of how the requested products would change if we added new customers of a specific kind?

Prepare to start

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import TimeSeriesSplit
from catboost import MetricVisualizer

from os import listdir

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import shap
shap.initjs()
In [2]:
print(listdir("../input"))
['data.csv']
In [3]:
data = pd.read_csv("../input/data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})
data.shape
Out[3]:
(541909, 8)

The data has 541909 entries and 8 variables.

Get familiar with the data

Sneak a peek - features and targets

In [4]:
data.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom

We can see that the data file contains one entry per transaction line. Take a look at the InvoiceNo and the CustomerID of the first entries: a single customer with ID 17850 from the United Kingdom placed one order with the InvoiceNo 536365. The customer ordered several products with different stock codes, descriptions, unit prices and quantities. In addition, we can see that the InvoiceDate is the same for all of these products.
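If you want to double-check this grouping, you can filter the raw data on that invoice number (a quick ad-hoc check, not part of the original analysis; it assumes InvoiceNo is parsed as a string column, which it is here since cancelled invoices carry a "C" prefix):

# Show all line items that belong to the single order 536365
data.loc[data.InvoiceNo == "536365"]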

Aggregating daily product sales and revenues

As we would like to predict the daily number of products sold as well as the daily revenues, we need to compute a daily aggregation of this data. For this purpose we compute the revenue per transaction line and extract temporal features from the InvoiceDate:

In [5]:
data["Revenue"] = data.Quantity * data.UnitPrice
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate,
                                     cache=True)

data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear

As the key task of this kernel is to predict the revenue gained and the number of products sold per day, we can sum up the daily quantities and revenues per product stock code and description:

In [6]:
grouped_features = ["Year", "Quarter","Month", "Week", "Weekday", "Dayofyear", "Day",
                    "StockCode", "Description"]

This way we lose information about customers and countries, but we will recover it later in this kernel.

In [7]:
daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum(),
                          columns=["Quantity"])
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()
daily_data.head()
Out[7]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08

Gain first impressions by exploration

Let's explore the daily data a little bit to obtain first insights that can be useful for modelling or understanding.

First of all, which timespan is covered by the data?

In [8]:
daily_data["Date"] = pd.to_datetime(daily_data[['Year', 'Month', 'Day']])
In [9]:
daily_data.Date.max() - daily_data.Date.min()
Out[9]:
Timedelta('373 days 00:00:00')
In [10]:
print("Datafile starts with timepoint {}".format(daily_data.Date.min()))
print("Datafile ends with timepoint {}".format(daily_data.Date.max()))
Datafile starts with timepoint 2010-12-01 00:00:00
Datafile ends with timepoint 2011-12-09 00:00:00

We can see that the data covers roughly one year, from December 2010 to December 2011.

How many entries do we have?

In [11]:
samples = daily_data.shape[0]
samples
Out[11]:
280224

How many different products do we have (unique stockcodes)?

In [12]:
daily_data.StockCode.nunique()
Out[12]:
3958

How many different product descriptions do we have?

In [13]:
daily_data.Description.nunique()
Out[13]:
4223

We have more descriptions than product codes:

In [14]:
daily_data.Description.nunique() - daily_data.StockCode.nunique()
Out[14]:
265
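This means that some stock codes occur with more than one description. A small, hypothetical check (not in the original kernel) to list these ambiguous codes:

# Count distinct descriptions per stock code and keep codes with more than one
desc_per_code = daily_data.groupby("StockCode").Description.nunique()
desc_per_code[desc_per_code > 1].sort_values(ascending=False).head()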

How are the product sales and revenues distributed?

In [15]:
daily_data.loc[:, ["Quantity", "Revenue"]].describe()
Out[15]:
Quantity Revenue
count 280224.000000 280224.000000
mean 18.521108 34.785557
std 81.977877 205.556779
min -19200.000000 -39243.080000
25% 2.000000 5.000000
50% 6.000000 14.525000
75% 18.000000 31.080000
max 12540.000000 39619.500000

As the min and max values show, both target variables exhibit extreme outliers (the negative values probably correspond to cancellations or returns). If we want to use these variables as targets, we should exclude the outliers, as they would mislead our validation. And since I would like to use early stopping, they would directly influence the training of the predictive models as well.

In [16]:
low_quantity = daily_data.Quantity.quantile(0.05)
high_quantity = daily_data.Quantity.quantile(0.95)
print((low_quantity, high_quantity))
(1.0, 72.0)
In [17]:
low_revenue = daily_data.Revenue.quantile(0.05)
high_revenue = daily_data.Revenue.quantile(0.95)
print((low_revenue, high_revenue))
(1.25, 129.8)

Let's only use the target ranges occupied by 90 % of the data entries. This is a first, easy strategy to exclude heavy outliers, but we should always be aware that we lose the information carried by the roughly 10 % of entries we exclude. In general it would be useful to understand and analyse what caused these outliers.

In [18]:
daily_data = daily_data.loc[
    (daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
In [19]:
daily_data = daily_data.loc[
    (daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]

How many entries have we lost?

In [20]:
samples - daily_data.shape[0]
Out[20]:
32960
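Relative to the 280224 original entries this is a loss of roughly 12 %, slightly more than 10 % because we filtered on both targets sequentially. A quick way to compute the fraction:

# Fraction of daily entries removed by the two outlier filters
(samples - daily_data.shape[0]) / samples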

Let's take a look at the remaining distributions of daily quantities and revenues:

In [21]:
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
# sns.distplot is deprecated in recent seaborn versions; histplot is its replacement
sns.histplot(daily_data.Quantity.values, kde=False, ax=ax[0], color="Orange");
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_ylabel("Frequency");
ax[0].set_title("How many products are sold per day?");
sns.histplot(daily_data.Revenue.values, kde=True, ax=ax[1], color="Dodgerblue");
ax[1].set_xlabel("Daily revenue");
ax[1].set_ylabel("Frequency");
ax[1].set_title("How much revenue is gained per day?");

We can see that both distributions are right-skewed: lower values are more common. In addition, the daily sales quantities seem to be multimodal: a daily sale of 1 is common, but so are quantities of 12 and 24. This pattern is very interesting and suggests that products are often ordered in typical pack sizes such as a dozen or two dozen. The daily revenues are right-skewed as well, with low values being the most common. In a nutshell: specific products are often bought as single units or in small bunches, and the revenue gained per product and day is relatively low.
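We can back up the multimodality claim by looking at the most frequent daily quantities directly (a quick check, not in the original kernel):

# The most common daily sale quantities; values like 1, 12 and 24 should rank high
daily_data.Quantity.value_counts().head(10)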

How did the daily product sales change over time?

In [22]:
daily_data.head()
Out[22]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue Date
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00 2010-12-01
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70 2010-12-01
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25 2010-12-01
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51 2010-12-01
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08 2010-12-01
In [23]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Quantity.sum(), marker='+', c="darkorange")
plt.plot(daily_data.groupby("Date").Quantity.sum().rolling(window=30, center=True).mean(),
        c="red")
plt.xticks(rotation=90);
plt.title("How many quantities are sold per day over the given time?");

The daily number of products sold shows the expected increase in the months before Christmas and seems to be lowest in February. We see high fluctuations between days, ranging from values close to 2500 up to roughly 19000 products per day. The average number of daily product sales is around 7500; it increases strongly after September and has roughly doubled by the end of November. This emphasizes the need for optimized storekeeping: knowing which products will be in high demand for important customers would be an advantage, allowing fast delivery without unnecessarily overfull stocks.
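The numbers quoted above can be verified with a short summary of the daily totals (an ad-hoc check):

# Spread and level of total daily product sales
daily_totals = daily_data.groupby("Date").Quantity.sum()
print(daily_totals.min(), daily_totals.mean(), daily_totals.max())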

In [24]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Revenue.sum(), marker='+', c="royalblue")
plt.plot(daily_data.groupby("Date").Revenue.sum().rolling(window=30, center=True).mean(), 
         c="darkblue")
plt.xticks(rotation=90);
plt.title("How many revenue is gained per day over the given time?");

The daily revenue shows roughly the same temporal pattern as the quantities.

In [25]:
fig, ax = plt.subplots(1,2,figsize=(20,5))

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
yearmonth = ["Dec-2010", "Jan-2011", "Feb-2011", "Mar-2011", "Apr-2011", "May-2011",
             "Jun-2011", "Jul-1011", "Aug-2011", "Sep-2011", "Oct-2011", "Nov-2011", 
             "Dec-2011"]

daily_data.groupby("Weekday").Quantity.sum().plot(
    ax=ax[0], marker='o', label="Quantity", c="darkorange");
daily_data.groupby("Weekday").Revenue.sum().plot(
    ax=ax[0], marker='o', label="Revenue", c="royalblue");
ax[0].legend();
ax[0].set_xticks(np.arange(0,7))
ax[0].set_xticklabels(weekdays);
ax[0].set_title("Total sales per weekday");

ax[1].plot(daily_data.groupby(["Year", "Month"]).Quantity.sum().values,
    marker='o', label="Quantities", c="darkorange");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Revenue.sum().values,
    marker='o', label="Revenue", c="royalblue");
ax[1].set_xticks(np.arange(0, len(yearmonth)))
ax[1].set_xticklabels(yearmonth, rotation=90)
ax[1].legend();
ax[1].set_title("Total sales per month");

Both visualisations yield further interesting insights:

  • Thursday seems to be the day on which most products are sold.
  • In contrast, Friday and Sunday have very low transaction volumes.
  • On Saturday there are no transactions at all (see the quick check below).
  • The pre-Christmas season starts in September and peaks in November.
  • Indeed, February and April are months with very low sales.
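A simple count per weekday confirms the missing Saturday, which is encoded as weekday 5:

# Number of daily entries per weekday; Monday=0 ... Sunday=6, Saturday (5) is absent
daily_data.Weekday.value_counts().sort_index()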

How to predict daily product sales?

In this kernel I would like to use CatBoost as the predictive model. Predicting daily quantities and revenues are both regression tasks, so I will use the CatBoost regressor. The loss and metric I would like to use is the root mean squared error (RMSE):

$$ E = \sqrt{ \frac{1}{N}\sum_{n=1}^{N} (t_{n} - y_{n})^{2}}$$

It computes the square root of the mean squared difference between the true targets $t_{n}$ and the predictions $y_{n}$ over all $N$ samples. Due to the squaring, large deviations are penalized more heavily than small ones.
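As a preview of the modelling step, here is a minimal sketch of how a CatBoost regressor with RMSE could be trained and validated with the TimeSeriesSplit imported above. The feature list, iteration count and early-stopping settings (od_type, od_wait) are illustrative placeholders, not the kernel's final configuration:

# Minimal sketch: CatBoost regression with RMSE and a temporal validation split.
features = ["Year", "Quarter", "Month", "Week", "Weekday", "Dayofyear", "Day", "StockCode"]

daily_sorted = daily_data.sort_values("Date")  # TimeSeriesSplit assumes temporal order
X = daily_sorted[features]
y = daily_sorted.Quantity

tss = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tss.split(X):
    model = CatBoostRegressor(loss_function="RMSE",
                              eval_metric="RMSE",
                              iterations=1000,
                              od_type="Iter", od_wait=50,  # early stopping on the eval set
                              cat_features=["StockCode"],
                              verbose=False)
    model.fit(X.iloc[train_idx], y.iloc[train_idx],
              eval_set=(X.iloc[val_idx], y.iloc[val_idx]))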