Warehouse optimization

In this kernel we will analyse the sales data of a UK online retailer. As storage space can be expensive and fast, on-time delivery is important to prevail over the competition, we would like to help the retailer by predicting the daily amounts of products sold as well as the revenue gained. Along the way we will try to answer:

  • How many items of each product will be sold tomorrow, next week, next month?
  • Which products yield the highest revenues and which ones are best-sellers?
  • Which products have unstable prices and how does this affect the purchase behavior of the retailer's customers?
  • What are the customer preferences? Can we gain a basic understanding of how the requested products would change if we gained new customers of a specific kind?

Prepare to start

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import TimeSeriesSplit
from catboost import MetricVisualizer

from os import listdir

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import shap
shap.initjs()
In [2]:
print(listdir("../input"))
['data.csv']
In [3]:
data = pd.read_csv("../input/data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})
data.shape
Out[3]:
(541909, 8)

The data has 541909 entries and 8 variables.

Get familiar with the data

Sneak a peek - features and targets

In [4]:
data.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom

We can see that the data file contains one row per transaction item. Take a look at the InvoiceNo and the CustomerID of the first entries: a single customer with ID 17850 from the United Kingdom placed one order with the InvoiceNo 536365. The customer ordered several products with different stock codes, descriptions, unit prices and quantities. In addition we can see that the InvoiceDate is the same for all of these products.

Aggregating daily product sales and revenues

As we would like to predict the daily amount of product sales as well as revenues, we need to aggregate this data per day. For this purpose we first compute the revenue per transaction and extract temporal features from the InvoiceDate:

In [5]:
data["Revenue"] = data.Quantity * data.UnitPrice
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate,
                                     cache=True)

data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear

As the key task of this kernel is to predict the revenue gained and the amount of products sold per day, we can sum up the daily quantities and revenues per product stock code and description:

In [6]:
grouped_features = ["Year", "Quarter","Month", "Week", "Weekday", "Dayofyear", "Day",
                    "StockCode", "Description"]

This way we lose information about customers and countries, but we will recover it later on in this kernel.

In [7]:
daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum(),
                          columns=["Quantity"])
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()
daily_data.head()
Out[7]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08

Gain first impressions by exploration

Let's explore the daily data a little bit to obtain first insights that can be useful for modelling or understanding.

First of all, which timespan is covered by the data?

In [8]:
daily_data["Date"] = pd.to_datetime(daily_data[['Year', 'Month', 'Day']])
In [9]:
daily_data.Date.max() - daily_data.Date.min()
Out[9]:
Timedelta('373 days 00:00:00')
In [10]:
print("Datafile starts with timepoint {}".format(daily_data.Date.min()))
print("Datafile ends with timepoint {}".format(daily_data.Date.max()))
Datafile starts with timepoint 2010-12-01 00:00:00
Datafile ends with timepoint 2011-12-09 00:00:00

We can see that the data covers roughly one year, from December 2010 to December 2011.

How many entries do we have?

In [11]:
samples = daily_data.shape[0]
samples
Out[11]:
280224

How many different products do we have (unique stockcodes)?

In [12]:
daily_data.StockCode.nunique()
Out[12]:
3958

How many different product descriptions do we have?

In [13]:
daily_data.Description.nunique()
Out[13]:
4223

We have more descriptions than product codes:

In [14]:
daily_data.Description.nunique() - daily_data.StockCode.nunique()
Out[14]:
265
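
This means that at least some stock codes carry more than one description (for example slightly different spellings of the same product name). A quick way to inspect the most affected codes could look like this (a small sketch; output not shown):

daily_data.groupby("StockCode").Description.nunique().sort_values(ascending=False).head(10)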

How are the product sales and revenues distributed?

In [15]:
daily_data.loc[:, ["Quantity", "Revenue"]].describe()
Out[15]:
Quantity Revenue
count 280224.000000 280224.000000
mean 18.521108 34.785557
std 81.977877 205.556779
min -19200.000000 -39243.080000
25% 2.000000 5.000000
50% 6.000000 14.525000
75% 18.000000 31.080000
max 12540.000000 39619.500000

As we can see from their min and max values, both target variables exhibit extreme outliers. Since we would like to use them as targets, we should exclude the outliers, as they would mislead our validation. And as I like to use early stopping, they would directly influence the training of the predictive models as well.

In [16]:
low_quantity = daily_data.Quantity.quantile(0.05)
high_quantity = daily_data.Quantity.quantile(0.95)
print((low_quantity, high_quantity))
(1.0, 72.0)
In [17]:
low_revenue = daily_data.Revenue.quantile(0.05)
high_revenue = daily_data.Revenue.quantile(0.95)
print((low_revenue, high_revenue))
(1.25, 129.8)

Let's only keep entries whose target values lie within the range occupied by the central 90 % of the data. This is a first and easy strategy to exclude heavy outliers, but we should always be aware of the fact that we lose the information carried by the entries we exclude. In general it could be useful to understand and analyse what caused these outliers.

In [18]:
daily_data = daily_data.loc[
    (daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
In [19]:
daily_data = daily_data.loc[
    (daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]

How many entries have we lost?

In [20]:
samples - daily_data.shape[0]
Out[20]:
32960

Let's take a look at the remaining distributions of daily quantities and revenues:

In [21]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.distplot(daily_data.Quantity.values, kde=False, ax=ax[0], color="Orange");
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_ylabel("Frequency");
ax[0].set_title("How many products are sold per day?");
sns.distplot(daily_data.Revenue.values, kde=True, ax=ax[1], color="Dodgerblue");
ax[1].set_xlabel("Daily revenue");
ax[1].set_ylabel("Frequency");
ax[1].set_title("How much revenue is gained per day?");

We can see that the distributions are right skewed: lower values are more common. In addition, the daily sales quantities seem to be multimodal. A daily sale of 1 is common, as are quantities of 12 and 24. This pattern is very interesting and leads to the conclusion that quantities are probably often divisible by 2 or 3. The daily revenues are right skewed as well, with low values being most common. In a nutshell: most products are bought as single items or in small bunches, and the revenue gained this way is relatively low.
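
To double-check the multimodal pattern, we could simply count the most frequent daily quantities (a quick sketch; output omitted):

daily_data.Quantity.value_counts().head(10)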

How did the daily product sales change over time?

In [22]:
daily_data.head()
Out[22]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue Date
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00 2010-12-01
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70 2010-12-01
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25 2010-12-01
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51 2010-12-01
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08 2010-12-01
In [23]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Quantity.sum(), marker='+', c="darkorange")
plt.plot(daily_data.groupby("Date").Quantity.sum().rolling(window=30, center=True).mean(),
        c="red")
plt.xticks(rotation=90);
plt.title("How many products are sold per day over the given time?");

The daily amounts of sold products show the expected increase in the months before Christmas and seem to be lowest in February. We have high fluctuations between days, ranging from roughly 2500 to roughly 19000 products per day. The average amount of product sales is around 7500; it increases strongly after September and is roughly doubled by the end of November. This emphasizes the need for optimized storekeeping: knowing which products will be in high demand for important customers could be an advantage for delivering fast without unnecessarily overfull stocks.

In [24]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Revenue.sum(), marker='+', c="royalblue")
plt.plot(daily_data.groupby("Date").Revenue.sum().rolling(window=30, center=True).mean(), 
         c="darkblue")
plt.xticks(rotation=90);
plt.title("How much revenue is gained per day over the given time?");

The daily revenue shows roughly the same temporal pattern as the quantities.
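
A quick way to quantify this similarity is the correlation between the two daily totals (a sketch; the exact value is not shown here):

daily_totals = daily_data.groupby("Date")[["Quantity", "Revenue"]].sum()
daily_totals.corr()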

In [25]:
fig, ax = plt.subplots(1,2,figsize=(20,5))

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
yearmonth = ["Dec-2010", "Jan-2011", "Feb-2011", "Mar-2011", "Apr-2011", "May-2011",
             "Jun-2011", "Jul-2011", "Aug-2011", "Sep-2011", "Oct-2011", "Nov-2011",
             "Dec-2011"]

daily_data.groupby("Weekday").Quantity.sum().plot(
    ax=ax[0], marker='o', label="Quantity", c="darkorange");
daily_data.groupby("Weekday").Revenue.sum().plot(
    ax=ax[0], marker='o', label="Revenue", c="royalblue");
ax[0].legend();
ax[0].set_xticks(np.arange(0,7))
ax[0].set_xticklabels(weekdays);
ax[0].set_title("Total sales per weekday");

ax[1].plot(daily_data.groupby(["Year", "Month"]).Quantity.sum().values,
    marker='o', label="Quantities", c="darkorange");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Revenue.sum().values,
    marker='o', label="Revenue", c="royalblue");
ax[1].set_xticks(np.arange(0, len(yearmonth)))
ax[1].set_xticklabels(yearmonth, rotation=90)
ax[1].legend();
ax[1].set_title("Total sales per month");

Both visualisations yield further interesting insights:

  • Thursday seems to be the day on which most products are sold.
  • In contrast, Friday and Sunday show very low transaction counts.
  • On Saturday there are no transactions at all.
  • The pre-Christmas season starts in September and peaks in November.
  • Indeed, February and April are months with very low sales.

How to predict daily product sales?

In this kernel I would like to use CatBoost as the predictive model. Predicting daily quantities and revenues are both regression tasks, and consequently I will use the CatBoost regressor. The loss and metric I like to use is the root mean square error (RMSE):

$$ E = \sqrt{ \frac{1}{N}\sum_{n=1}^{N} (t_{n} - y_{n})^{2}}$$

It computes the error between the target value $t_{n}$ and the predicted value $y_{n}$ per sample and squares it, so that positive and negative deviations contribute to the sum in the same way. Then the mean is taken by dividing by the total number $N$ of samples (entries) in the data, and finally the square root is taken to obtain an impression of the error of a single prediction. What should we keep in mind when working with this loss and metric? :-) It is heavily influenced by outliers! If some predictions are far away from their targets, they pull the mean towards higher values. Hence we may make good predictions for the majority of samples while the RMSE is still high due to large errors on a minority of samples.
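
As a minimal illustration of this outlier sensitivity (a sketch with made-up numbers, not taken from this dataset):

def rmse(t, y):
    # root mean square error between targets t and predictions y
    return np.sqrt(np.mean((np.asarray(t) - np.asarray(y)) ** 2))

print(rmse([10, 12, 11, 9], [11, 11, 10, 10]))   # every error is 1 -> RMSE = 1.0
print(rmse([10, 12, 11, 90], [11, 11, 10, 10]))  # one extreme target -> RMSE jumps to roughly 40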

Validation strategy

As the data covers only one year and the number of sold products increases strongly during the pre-Christmas period, we need to select the validation data carefully. To start, I would like to use all days of November and December 2011 for validation. Doing so we cover the trend of the pre-Christmas period and can hopefully still explain the further increase of quantities and revenues during November.
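
In code, this hold-out split boils down to a simple mask on the daily data (the same criterion that the model class below applies internally):

val_mask = (daily_data.Year == 2011) & (daily_data.Month.isin([11, 12]))
print("training entries: {}, validation entries: {}".format((~val_mask).sum(), val_mask.sum()))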

Catboost model and hyperparameter class

To easily generate comparable new model instances, I wrote a class that accepts either CatBoost data pools or pandas dataframes to train the CatBoost model. It yields a root mean square error evaluation score and can show feature importances.

In [26]:
class Catmodel:
    
    def __init__(self, name, params):
        self.name = name
        self.params = params
    
    def set_data_pool(self, train_pool, val_pool):
        self.train_pool = train_pool
        self.val_pool = val_pool
    
    def set_data(self, X, y):
        val_criteria = (X.Month.isin([11, 12])) & (X.Year == 2011)
        cat_features_idx = np.where(X.dtypes != np.float64)[0]  # np.float is deprecated/removed in newer numpy
        x_train, self.x_val = X.loc[val_criteria==False], X.loc[val_criteria==True]
        y_train, self.y_val = y.loc[val_criteria==False], y.loc[val_criteria==True]
        self.train_pool = Pool(x_train, y_train, cat_features=cat_features_idx)
        self.val_pool = Pool(self.x_val, self.y_val, cat_features=cat_features_idx)
    
    def prepare_model(self):
        self.model = CatBoostRegressor(
                loss_function = self.params.loss[0],
                random_seed = self.params.seed,
                logging_level = 'Silent',
                iterations = self.params.iterations,
                max_depth = self.params.max_depth[0],
                #learning_rate = self.params.learning_rate[0],
                l2_leaf_reg = self.params.l2_leaf_reg[0],
                od_type='Iter',
                od_wait=40,
                train_dir=self.name
            )
    
    def learn(self, plot=False):
        self.prepare_model()
        self.model.fit(self.train_pool, eval_set=self.val_pool, plot=plot);
        print("{}, early-stopped model tree count {}".format(
            self.name, self.model.tree_count_
        ))
    
    def score(self):
        return self.model.score(self.val_pool)
    
    def show_importances(self, kind="bar"):
        explainer = shap.TreeExplainer(self.model)
        shap_values = explainer.shap_values(self.val_pool)
        if kind=="bar":
            return shap.summary_plot(shap_values, self.x_val, plot_type="bar")
        return shap.summary_plot(shap_values, self.x_val)
    
    def get_val_results(self):
        self.results = pd.DataFrame(self.y_val)
        self.results["prediction"] = self.predict(self.x_val)
        self.results["error"] = np.abs(
            self.results[self.results.columns.values[0]].values - self.results.prediction)
        self.results["Month"] = self.x_val.Month
        self.results["SquaredError"] = self.results.error.apply(lambda l: np.power(l, 2))
        
    def show_val_results(self):
        self.get_val_results()
        fig, ax = plt.subplots(1,2,figsize=(20,5))
        sns.distplot(self.results.error, ax=ax[0])
        ax[0].set_xlabel("Single absolute error")
        ax[0].set_ylabel("Density")
        ax[0].axvline(np.median(self.results.error), c="black")
        ax[1].scatter(self.results.prediction.values,
                      self.results[self.results.columns[0]].values,
                      c=self.results.error, cmap="RdYlBu_r")
        ax[1].set_xlabel("Prediction")
        ax[1].set_ylabel("Target")
        return ax
    
    def get_monthly_RMSE(self):
        return self.results.groupby("Month").SquaredError.mean().apply(lambda l: np.sqrt(l))
        
    def predict(self, x):
        return self.model.predict(x)

I like to use one class for hyperparameters that can be passed directly to the model or via hyperparameter search:

In [27]:
class CatHyperparameter:
    
    def __init__(self,
                 loss="RMSE",
                 metric="RMSE",
                 iterations=400,
                 max_depth=6,
                 l2_leaf_reg=3,
                 #learning_rate=0.05,
                 seed=0):
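        # Note: the trailing commas below intentionally wrap most values in one-element tuples;
        # the model later reads them via params.loss[0], params.max_depth[0], params.l2_leaf_reg[0].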
        self.loss = loss,
        self.metric = metric,
        self.max_depth = max_depth,
        self.l2_leaf_reg = l2_leaf_reg,
        #self.learning_rate = learning_rate,
        self.iterations=iterations
        self.seed = seed

Baseline model & result analysis

Let's see how well this model performs without feature engineering and hyperparameter search:

In [28]:
X = daily_data.drop(["Quantity", "Revenue", "Date"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()

model = Catmodel("baseline", params)
model.set_data(X,y)
model.learn(plot=True)
baseline, early-stopped model tree count 252

If you have forked this kernel and are in interactive mode, you can see that the model loss has converged. How big is the root mean square error evaluated on the validation data?

In [29]:
model.score()
Out[29]:
12.07717034115981

This value close to 12 is high, but as already mentioned the RMSE is influenced by outliers, so we should take a look at the distribution of individual absolute errors:

In [30]:
model.show_val_results();
  • We can see that the distribution of absolute errors of single predictions is right skewed.
  • The median single error (black line) is about half of the RMSE score, i.e. significantly lower.
  • By plotting the target versus the prediction we can see that we make larger errors for validation entries with high true quantity values above 30. The strong blue band marks the identity region where predictions are close to the target values. To improve, we need to make better predictions for products with truly high quantities during the validation period.
In [31]:
model.show_importances()
The model has complex ctrs, so the SHAP values will be calculated approximately.
  • We can see that the stock code as well as the description of the products are very important. They have no color because they are not numerical and therefore have no low or high values.
  • The weekday is an important feature as well. We have already seen this while exploring the data: on days with low values (Monday up to Thursday) the retailer sells most of its products, whereas high values (Friday to Sunday) only yield a few sales.
In [32]:
model.show_importances(kind=None)
The model has complex ctrs, so the SHAP values will be calculated approximately.
  • Take a look at the weekday to understand this plot: low values (0 to 3) correspond to Monday, Tuesday, Wednesday and Thursday. These are days with a high amount of product sales (high quantity target values). They are colored in blue and push towards higher SHAP values and consequently towards higher predicted quantity values. Higher weekday values correspond to Friday, Saturday and Sunday. They are colored in red and push towards negative SHAP values and lower predicted values. This confirms the observations we made while exploring the weekday and the sum of daily quantities.
  • The StockCode and the Description are important features, but they are also very complex. We have seen that there are close to 4000 different stock codes and even more descriptions. To improve, we should try to engineer features that describe the products in a more general way.

Bayesian Hyperparameter Search with GPyOpt

In [33]:
import GPyOpt

class Hypertuner:
    
    def __init__(self, model, max_iter=10, max_time=10):
        self.bounds = [{'name': 'depth','type': 'discrete','domain': (1, 16)},
                       {'name': 'l2_leaf_reg','type': 'discrete','domain': (1, 80)}]
        self.model = model
        self.max_iter=max_iter
        self.max_time=max_time
    
    def objective(self, params):
        params = params[0]
        params = CatHyperparameter(
            max_depth=params[0],
            l2_leaf_reg=params[1]
        )
        self.model.params = params
        self.model.learn()
        return self.model.score()
        
    def learn(self):
        np.random.seed(777)
        optimizer = GPyOpt.methods.BayesianOptimization(
            f=self.objective, domain=self.bounds,
            acquisition_type ='MPI',
            acquisition_par = 0.1,
            exact_eval=True)
        optimizer.run_optimization(self.max_iter, self.max_time)
        optimizer.plot_convergence()
        print(optimizer.X[np.argmin(optimizer.Y)])

        print('RMSE:', np.min(optimizer.Y))

Run the search to find a more suitable tree depth and l2-leaf-reg (to reduce overfitting). But this takes some time... ;-)

In [34]:
search = Hypertuner(model)
#search.learn()

What do we know about the products?

Given the data, we are able to include much more information in our daily aggregation. Let's start by engineering new features based on the StockCode:

In [35]:
products = pd.DataFrame(index=daily_data.StockCode.unique())

Currently we use November and December 2011 as our validation data in a hold-out fashion. To prevent leakage of future information, we need to perform all feature engineering on data with timestamps before that period:

In [36]:
fe_selector = (data.Year == 2011) & (data.Month.isin([11,12]))
fe_data = data.loc[fe_selector==False].copy()
# daily_data has its own index, so it needs its own selector
fe_daily_selector = (daily_data.Year == 2011) & (daily_data.Month.isin([11,12]))
fe_daily_data = daily_data.loc[fe_daily_selector==False].copy()

Product attraction

  • How many customers purchase a specific product?
  • When do customers buy a specific product? In which month? On which weekday?
In [37]:
attraction_map = fe_data.groupby("StockCode").CustomerID.nunique()
quarter_attraction = fe_data.loc[fe_data.Year==2010].groupby(["StockCode", "Quarter"]).CustomerID.nunique()
monthly_attraction = fe_data.groupby(["StockCode", "Month"]).CustomerID.nunique().unstack()
weekday_attraction = fe_data.groupby(["StockCode", "Weekday"]).CustomerID.nunique().unstack().fillna(0)
week_attraction = fe_data.groupby(["StockCode", "Week"]).CustomerID.nunique().unstack().fillna(0)
In [38]:
products["Attraction"] = products.index.map(attraction_map)
for col in monthly_attraction.columns.values:
    products["Attraction_Month_" + str(col)] = products.index.map(monthly_attraction.loc[:, col])
    
In [39]:
products = products.fillna(0)

Unit Price

  • What's the average unit price of the product and its standard deviation? Is the price stable?
  • Does the price vary over time, between customers, or both?
  • What's the mean/median unit price during and outside the pre-Christmas period?
In [40]:
monthly_prices = fe_data.groupby(
    ["StockCode", "Month"]
).UnitPrice.median().reset_index()

pre_christmas_prices = monthly_prices.loc[
    monthly_prices.Month.isin([9,10,11])
].groupby("StockCode").UnitPrice.median()
not_christmas_prices = monthly_prices.loc[
    monthly_prices.Month.isin([9,10,11]) == False
].groupby("StockCode").UnitPrice.median()
month_medianprice_map = monthly_prices.groupby("StockCode").UnitPrice.std()

mean_price_map = fe_data.groupby("StockCode").UnitPrice.mean()
median_price_map = fe_data.groupby("StockCode").UnitPrice.median()
std_price_map = fe_data.groupby("StockCode").UnitPrice.std()

customer_medianprice_map = fe_data.groupby(
    ["StockCode", "CustomerID"]
).UnitPrice.median().reset_index().groupby("StockCode").UnitPrice.std()
In [41]:
## Unit Price information
products["MeanPrice"] = products.index.map(mean_price_map)
products["MedianPrice"] = products.index.map(median_price_map)
products["StdPrice"] = products.index.map(std_price_map)
products.StdPrice = products.StdPrice.fillna(0)
products["PriceStability"] = np.where(products.StdPrice==0, 1, 0)
In [42]:
products["CustomerMedianPriceVariation"] = products.index.map(customer_medianprice_map)
products["MonthMedianPriceVariation"] = products.index.map(month_medianprice_map)
products.CustomerMedianPriceVariation = products.CustomerMedianPriceVariation.fillna(0)

products["PreChristmasMedianPrice"] = products.index.map(pre_christmas_prices)
products["NotChristmasMedianPrice"] = products.index.map(not_christmas_prices)
products["ChristmasPriceDifference"] = products.PreChristmasMedianPrice - products.NotChristmasMedianPrice
products.head()
Out[42]:
Attraction Attraction_Month_1 Attraction_Month_2 Attraction_Month_3 Attraction_Month_4 Attraction_Month_5 Attraction_Month_6 Attraction_Month_7 Attraction_Month_8 Attraction_Month_9 Attraction_Month_10 Attraction_Month_12 MeanPrice MedianPrice StdPrice PriceStability CustomerMedianPriceVariation MonthMedianPriceVariation PreChristmasMedianPrice NotChristmasMedianPrice ChristmasPriceDifference
10002 40.0 18.0 5.0 5.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 15.0 1.056849 0.85 0.404244 0 0.000000 0.000000 NaN 0.85 NaN
10125 47.0 7.0 0.0 9.0 6.0 6.0 2.0 8.0 4.0 4.0 4.0 7.0 0.860111 0.85 0.278626 0 0.177886 0.018974 0.85 0.85 0.00
10133 101.0 4.0 5.0 7.0 2.0 4.0 30.0 31.0 21.0 5.0 0.0 12.0 0.649045 0.79 0.248819 0 0.178759 0.218947 0.42 0.85 -0.43
10135 82.0 15.0 7.0 8.0 5.0 2.0 8.0 9.0 4.0 8.0 10.0 17.0 1.365263 1.25 0.615934 0 0.395704 0.000000 1.25 1.25 0.00
11001 38.0 2.0 1.0 3.0 3.0 5.0 5.0 4.0 4.0 6.0 9.0 6.0 1.771800 1.69 0.806065 0 0.270123 0.371386 1.69 1.69 0.00

We need to add our engineered features to the entries of the training and validation data:

In [43]:
for col in products.columns:
    daily_data[col] = daily_data.StockCode.map(products.loc[:, col])
In [44]:
X = daily_data.drop(["Quantity", "Revenue", "Date", "StockCode", "Description"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()

model = Catmodel("product_features", params)
model.set_data(X,y)
model.learn(plot=True)
product_features, early-stopped model tree count 400
In [45]:
model.show_importances()
The model has complex ctrs, so the SHAP values will be calculated approximately.
In [46]:
model.show_val_results();

A catboost cross validation family

In [47]:
class CatCvFamily:
    
    def __init__(self, params, X, y, n_splits=2):
        self.family = {}
        self.cat_features_idx = np.where(X.dtypes != np.float64)[0]  # np.float is deprecated/removed in newer numpy
        self.X = X.values
        self.y = y.values
        self.n_splits = n_splits
        self.params = params
    
    def set_validation_strategy(self):
        self.cv = TimeSeriesSplit(max_train_size = None,
                                  n_splits = self.n_splits)
        self.gen = self.cv.split(self.X)
    
    def get_split(self):
        train_idx, val_idx = next(self.gen)
        x_train, x_val = self.X[train_idx], self.X[val_idx]
        y_train, y_val = self.y[train_idx], self.y[val_idx]
        train_pool = Pool(x_train, y_train, cat_features=self.cat_features_idx)
        val_pool = Pool(x_val, y_val, cat_features=self.cat_features_idx)
        return train_pool, val_pool
    
    def learn(self):
        self.set_validation_strategy()
        self.model_names = []
        self.model_scores = []
        for split in range(self.n_splits):
            name = 'Model_cv_' + str(split) + '/'
            train_pool, val_pool = self.get_split()
            self.model_names.append(name)
            self.family[name], score = self.fit_catmodel(name, train_pool, val_pool)
            self.model_scores.append(score)
            
    def fit_catmodel(self, name, train_pool, val_pool):
        # Catmodel expects (name, params); the pools are attached separately
        cat = Catmodel(name, self.params)
        cat.set_data_pool(train_pool, val_pool)
        cat.learn()
        score = cat.score()
        return cat, score
    
    def score(self):
        return np.mean(self.model_scores)
    
    def show_learning(self):
        widget = MetricVisualizer(self.model_names)
        widget.start()

    def show_importances(self):
        name = self.model_names[-1]
        cat = self.family[name]
        explainer = shap.TreeExplainer(cat.model)
        shap_values = explainer.shap_values(cat.val_pool)
        # use only the SHAP values here; the global X does not align with the validation pool rows
        return shap.summary_plot(shap_values, plot_type="bar")

About returns, cancellations and give-aways

To understand the quantity outliers we should take a closer look at some of them. To continue, we can focus on the 0.01 % with the lowest quantities and the 0.01 % with the highest:

Very low negative quantities

In [48]:
low = data.Quantity.quantile(0.0001)
high = data.Quantity.quantile(0.9999)
In [49]:
low_outliers = data.loc[(data.Quantity < low)]
low_outliers.head(5)
Out[49]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
4287 C536757 84347 ROTATING SILVER ANGELS T-LIGHT HLDR -9360 2010-12-02 14:23:00 0.03 15838 United Kingdom -280.8 2010 4 12 48 3 2 336
50849 540564 22617 mouldy, thrown away. -2600 2011-01-10 10:36:00 0.00 NaN United Kingdom -0.0 2011 1 1 2 0 10 10
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2011-01-18 10:17:00 1.04 12346 United Kingdom -77183.6 2011 1 1 3 1 18 18
65063 541685 22351 Given away -1400 2011-01-20 15:41:00 0.00 NaN United Kingdom -0.0 2011 1 1 3 3 20 20
82794 543257 84611B thrown away -1430 2011-02-04 16:06:00 0.00 NaN United Kingdom -0.0 2011 1 2 5 4 4 35

Very interesting! Take a look at the product descriptions as well as the customer IDs. Sometimes the latter is missing and the description is not a product description but a return note like "mouldy, thrown away". In these cases the unit price is zero as well, and the InvoiceNo does not start with a "C". The "C" tells us that a transaction was cancelled. You can see that the two remaining example outliers are cancelled orders, i.e. product returns.

Very high positive quantities

In [50]:
high_outliers = data.loc[(data.Quantity > high)]
high_outliers.head(5)
Out[50]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
4850 536809 84950 ASSORTED COLOUR T-LIGHT HOLDER 1824 2010-12-02 16:48:00 0.55 15299 United Kingdom 1003.20 2010 4 12 48 3 2 336
4945 536830 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 2880 2010-12-02 17:38:00 0.18 16754 United Kingdom 518.40 2010 4 12 48 3 2 336
19871 537899 22328 ROUND SNACK BOXES SET OF 4 FRUITS 1488 2010-12-09 10:44:00 2.55 12755 Japan 3794.40 2010 4 12 49 3 9 343
25920 538420 17096 ASSORTED LAQUERED INCENSE HOLDERS 1728 2010-12-12 12:03:00 0.17 12875 United Kingdom 293.76 2010 4 12 49 6 12 346
32671 539101 22693 GROW A FLYTRAP OR SUNFLOWER IN TIN 2400 2010-12-16 10:35:00 0.94 16029 United Kingdom 2256.00 2010 4 12 50 3 16 350

In contrast to the findings above, these seem to be valid transactions, although some of them may later be returned as erroneous purchases.

Why are products returned or cancelled?

To analyse these kinds of entries we need to generate new features in our data:

  • A binary feature indicating cancelled products
  • A binary feature indicating unknown customers
  • A binary feature indicating zero unit prices
  • A binary feature indicating a negative quantity (return)

In addition we should check whether there are entries with a negative unit price or a zero quantity:

In [51]:
data[data.UnitPrice < 0]
Out[51]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom -11062.06 2011 3 8 32 4 12 224
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom -11062.06 2011 3 8 32 4 12 224

Ah! :-) It's also possible that the retailer has to pay for an open debt.

In [52]:
data[data.Quantity == 0]
Out[52]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear

Ok, we have no entries with zero quantities. Now, let's create the new features:

In [53]:
data["Cancelled"] = np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), 1, 0)
data["UnknownCustomer"] = np.where(data.CustomerID.isna(), 1, 0)
data["ZeroPrice"] = np.where(data.UnitPrice == 0, 1, 0)
data["Return"] = np.where(data.Quantity <= 0, 1, 0)

Which pathways do returned products take?

By diving into the outliers of quantities and unit prices we have already found out that some of them are related to returns. To answer the question of which pathways returned products can take, let's take a look at the interaction of these new features:

In [54]:
marker = ["Cancelled", "UnknownCustomer", "ZeroPrice", "Return"]
In [55]:
marker_counts = 100 * data.loc[:, marker].sum().sort_values(ascending=False) / data.shape[0]

interactions = pd.DataFrame(index=marker, columns=marker, dtype=np.float64)
for feat in marker:
    interactions.loc[feat] = 100 * data.loc[data[feat] == 1, marker].sum() / data.loc[:, feat].sum()
    
interactions = interactions.apply(np.round)

fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.heatmap(data=interactions,
            cmap="PuRd",
            annot=True, fmt="g",
            linewidth=2,
            cbar=False, ax=ax[0]);
sns.barplot(x=marker_counts.index, y=marker_counts.values,
           palette="PuRd_r", ax=ax[1]);
ax[1].set_ylabel("% in data");

Insights

  • Cancelled products are always returns. In contrast not all returns are marked as cancelled.
  • The occurrences of unknown customers can't be explained solely by returns, cancellations or zero-price products.
  • Entries with zero price are very likely to have an unknown customer and half of them are returns. The latter is interesting as there must be entries with a zero price but a positive quantity count. Does the retailer sometimes offer give-aways?
  • Returns are often cancellations but there are some that correspond to zero-price products probably joined together with unknown customers.
  • The percentage of unknown customers is, at roughly 25 %, very high. In contrast, zero-price entries are very rare.

An official return is indicated by a cancellation of the transaction. In these cases the customer is usually known and the product has an official price that contributes to the revenue values. In contrast there exists at least one more pathway for returns: in the case of zero prices and negative quantities the products are often damaged or lost. If you browse through these entries you can see that most of them have missing descriptions (indicated by NaN). On the other hand, products with zero prices and positive quantities are often marked as found, as needing to be checked, or have "amazon" in their description.
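
If you want to verify the point about missing descriptions yourself, a quick check could look like this (a sketch; output omitted):

zero_price_returns = data.loc[(data.ZeroPrice == 1) & (data.Return == 1)]
print("Share with missing description: {:.2f}".format(zero_price_returns.Description.isna().mean()))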

In [56]:
data.loc[(data.ZeroPrice==1) & (data.Return == 0)].head()
Out[56]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear Cancelled UnknownCustomer ZeroPrice Return
622 536414 22139 NaN 56 2010-12-01 11:52:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1970 536545 21134 NaN 1 2010-12-01 14:32:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1971 536546 22145 NaN 1 2010-12-01 14:33:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1972 536547 37509 NaN 1 2010-12-01 14:33:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1987 536549 85226A NaN 1 2010-12-01 14:34:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0

Let's close this section by introducing new features that fuse what we have found:

In [57]:
data["ReturnedCancelled"] = np.where((data.Return == 1) & (data.Cancelled == 1),1,0)
data["ReturnedZeroPrice"] = np.where((data.Return == 1) & (data.ZeroPrice == 1),1,0)
data["NotReturnedZeroPrice"] = np.where((data.Return == 0) & (data.ZeroPrice == 1),1,0)
In [58]:
data.loc[:, ["ReturnedCancelled", "ReturnedZeroPrice", "NotReturnedZeroPrice"]].sum()
Out[58]:
ReturnedCancelled       9288
ReturnedZeroPrice       1336
NotReturnedZeroPrice    1179
dtype: int64

Why should we bother about outliers?

Products that were mistakenly purchased in very high amounts cause extreme outliers in the quantities sold and the revenues gained. Such outliers can be very misleading for predictive models and we should always be aware of them. To set up a model for the retailer we need to understand the different kinds of outliers. Doing so, we might be able to build strategies that detect them and avoid a negative impact on the forecast.

By exploring the data, we have already found some differences in how products can be returned. To understand how the data was collected and how the company deals with outliers, let's take a look at the most extreme quantity outliers we can observe:

In [59]:
data.loc[(data.Quantity == data.Quantity.min()) | (data.Quantity == data.Quantity.max())]
Out[59]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear Cancelled UnknownCustomer ZeroPrice Return ReturnedCancelled ReturnedZeroPrice NotReturnedZeroPrice
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2011-12-09 09:15:00 2.08 16446 United Kingdom 168469.6 2011 4 12 49 4 9 343 0 0 0 0 0 0 0
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2011-12-09 09:27:00 2.08 16446 United Kingdom -168469.6 2011 4 12 49 4 9 343 1 0 0 1 1 0 0

We can see that this purchase was made by mistake: a customer ordered a product in a very high quantity and cancelled the transaction a few minutes later. The absolute quantities are the same, as is the stock code. But unfortunately the invoice number of the cancellation (without the leading C) does not match the original invoice number. A first piece of advice for the retailer would be to change this, as it would make it much easier to detect such pairs.
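
One possible way to detect such purchase/cancellation pairs despite the differing invoice numbers is to match cancellations to earlier purchases of the same customer, product and absolute quantity. This is only a sketch of such a matching, not a procedure used by the retailer:

cancels = data.loc[(data.Cancelled == 1) & (data.UnknownCustomer == 0),
                   ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]].copy()
cancels["Quantity"] = cancels.Quantity.abs()
purchases = data.loc[(data.Cancelled == 0) & (data.Quantity > 0),
                     ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]]
# candidate pairs: same customer, same product, same absolute quantity,
# with the purchase taking place before the cancellation
pairs = cancels.merge(purchases, on=["CustomerID", "StockCode", "Quantity"],
                      suffixes=("_cancel", "_purchase"))
pairs = pairs.loc[pairs.InvoiceDate_purchase <= pairs.InvoiceDate_cancel]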

Product sales and revenues per country

To how many different countries does the retailer deliver in this data?

In [60]:
data.Country.nunique()
Out[60]:
38

Which countries purchase the most products and which ones yield the highest revenues?

To answer these questions, let's distinguish between the retailer's home country, the United Kingdom, and all other countries by using a new feature. I expect that most of the customers reside in the UK, so it might be useful to distinguish between domestic and foreign orders.

In [61]:
data["UK"] = np.where(data.Country == "United Kingdom", 1, 0)

In addition we need a feature that represents the revenue the retailer gained from each transaction. It is given by the product of the quantity ordered and the unit price of the product:

In [62]:
data["Revenue"] = data.Quantity * data.UnitPrice
In [63]:
country_quantities = data.groupby("Country").Quantity.sum()
country_revenues = data.groupby("Country").Revenue.sum()
revenue_per_quantity = country_revenues / country_quantities

uk_quantities = data.groupby("UK").Quantity.sum().sort_values(ascending=False)
uk_revenues = data.groupby("UK").Revenue.sum().sort_values(ascending=False)

top5_foreign_most_products = country_quantities.sort_values(ascending=False).iloc[1:6]
top5_foreign_highest_revenue = country_revenues.sort_values(ascending=False).iloc[1:6]

top5_revenue_per_quantity = revenue_per_quantity.loc[
    top5_foreign_most_products.index
].sort_values(ascending=False)

uk_revenue_per_quantity = data.groupby("UK").Revenue.sum() / data.groupby("UK").Quantity.sum() 
In [64]:
fig, ax = plt.subplots(3,2,figsize=(20,19))

total_quantities = data.Quantity.sum() * 1/100
total_revenues = data.Revenue.sum() * 1/100

ax[0,0].set_title("Out- and inside UK - Product sales");
sns.barplot(uk_quantities.index, uk_quantities.values/total_quantities, ax=ax[0,0], palette="Purples")
ax[0,0].set_ylabel("% of total product sales")
ax[0,0].set_xticklabels(["outside", "inside"])
ax[0,0].set_xlabel("")

ax[0,1].set_title("Top 5 foreign countries - Product sales");
sns.barplot(top5_foreign_most_products.index,
            top5_foreign_most_products.values/total_quantities,
            ax=ax[0,1], palette="Purples_r")
ax[0,1].set_xlabel("");
ax[0,1].set_ylabel("% of total product sales")

ax[1,0].set_title("Out- and inside UK - Revenues");
sns.barplot(uk_revenues.index, uk_revenues.values/total_revenues, ax=ax[1,0], palette="Blues")
ax[1,0].set_ylabel("% of total revenue")
ax[1,0].set_xticklabels(["outside", "inside"])
ax[1,0].set_xlabel("")

ax[1,1].set_title("Top 5 foreign countries - Revenues");
sns.barplot(top5_foreign_highest_revenue.index,
            top5_foreign_highest_revenue.values/total_revenues,
            ax=ax[1,1], palette="Blues_r")
ax[1,1].set_xlabel("");
ax[1,1].set_ylabel("% of total revenue")

ax[2,0].set_title("Out- and inside UK - Revenue per quantity");
sns.barplot(uk_revenue_per_quantity.index, uk_revenue_per_quantity.values,
            ax=ax[2,0], palette="Greens")
ax[2,0].set_ylabel("revenue/quantity")
ax[2,0].set_xticklabels(["outside", "inside"])
ax[2,0].set_xlabel("")

ax[2,1].set_title("Top5 foreign countries - Revenue per quantity")
sns.barplot(top5_revenue_per_quantity.index,
            top5_revenue_per_quantity.values, ax=ax[2,1], palette="Greens_r");
ax[2,1].set_xlabel("");
ax[2,1].set_ylabel("revenue/quantity");

Insights

  • The retailer is selling most of its products and gains most of the revenue inside the UK.
  • The next top 4 countries are located in Europe.
  • Australia is a former British colony and perhaps this is still a reason why it has a closer connection to the retailer than other English-speaking countries. But this is just an assumption that need not be true. It could also be that the retailer sells some products that are, for some other reason, very interesting for the Australian market.
  • Furthermore we can see that the retailer gains more revenue per item in Germany than in the Netherlands. Are German customers buying more expensive products? A quick check follows below. If we would like to know which countries are worth targeting with marketing campaigns to win new customers, it could be good to analyse the specific needs of these top 5 countries first.
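
The quick check of the question raised above could compare the typical unit prices of both countries (a sketch; output omitted):

data.loc[data.Country.isin(["Germany", "Netherlands"])].groupby("Country").UnitPrice.median()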

Product best-sellers?

Imagine you need to know which kinds of products will be sold tomorrow, next week or next month to optimize your inventory and storage strategy. A simple way to start is to focus on the products that are best-sellers or show seasonal patterns (for example, products that are only sold around Christmas but in very high amounts). A next step could then be to include more and more products so that the optimization becomes better and better for the whole stock.

In [65]:
products = pd.DataFrame(data.groupby("StockCode").Quantity.sum(),
                        columns=["Quantity"])
products["Revenue"] = data.groupby("StockCode").Revenue.sum()
products["MeanUnitPrice"] = data.groupby("StockCode").UnitPrice.mean()
products["StdUnitPrice"] = data.groupby("StockCode").UnitPrice.std().fillna(0)
products["MedianUnitPrice"] = data.groupby("StockCode").UnitPrice.median()
products["Q25UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.25)
products["Q75UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.75)
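
With these aggregates in place, a first best-seller ranking could simply sort the products by total quantity or total revenue (a sketch; output omitted):

best_by_quantity = products.Quantity.sort_values(ascending=False).head(10)
best_by_revenue = products.Revenue.sort_values(ascending=False).head(10)
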
In [66]:
#data.loc[data.Year==2011].groupby(["StockCode", "Month"]).Quantity.sum().unstack().fillna(0)
In [67]:
#data.loc[data.Year==2011].groupby(["StockCode", "Weekday"]).Quantity.sum().unstack().fillna(0)