Warehouse optimization

In this kernel we will analyse the sales data of a UK online retailer. As storage space can be expensive and fast, on-time delivery is important to prevail over the competition, we would like to help the retailer by predicting the daily amounts of products sold as well as the revenue gained. Along the way we will try to answer:

  • How many items of each product will be sold tomorrow, next week, next month?
  • Which products yield the highest revenues and which ones are best-sellers?
  • Which products have unstable prices and how does this affect the purchase behavior of the retailer's customers?
  • What are the customer preferences? Can we gain a basic understanding of how the requested products would change if we gained new customers of a specific kind?

Prepare to start

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import TimeSeriesSplit
from catboost import MetricVisualizer

from os import listdir

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

import shap
shap.initjs()
In [2]:
print(listdir("../input"))
['data.csv']
In [3]:
data = pd.read_csv("../input/data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})
data.shape
Out[3]:
(541909, 8)

The data has 541909 entries and 8 variables.

Get familiar with the data

Sneak a peek - features and targets

In [4]:
data.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom

We can see that the data file contains one row per transaction item. Take a look at the InvoiceNo and the CustomerID of the first entries: a single customer with ID 17850 from the United Kingdom placed one order with the InvoiceNo 536365. The customer ordered several products with different stock codes, descriptions, unit prices and quantities. In addition we can see that the InvoiceDate is the same for all of these products.

Aggregating daily product sales and revenues

As we would like to predict the daily amount of product sales as well as revenues, we need to aggregate this data per day. For this purpose we first compute the revenue per transaction and extract temporal features from the InvoiceDate:

In [5]:
data["Revenue"] = data.Quantity * data.UnitPrice
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate,
                                     cache=True)

data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear

As the key task of this kernel is to predict the revenue gained and the amount of products sold per day, we can sum up the daily quantities and revenues per product stock code and description:

In [6]:
grouped_features = ["Year", "Quarter","Month", "Week", "Weekday", "Dayofyear", "Day",
                    "StockCode", "Description"]

This way we lose information about customers and countries, but we will recover it later on in this kernel.

In [7]:
daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum(),
                          columns=["Quantity"])
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()
daily_data.head()
Out[7]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08

Gain first impressions by exploration

Let's explore the daily data a little bit to obtain first insights that can be useful for modelling or understanding.

First of all, which timespan is covered by the data?

In [8]:
daily_data["Date"] = pd.to_datetime(daily_data[['Year', 'Month', 'Day']])
In [9]:
daily_data.Date.max() - daily_data.Date.min()
Out[9]:
Timedelta('373 days 00:00:00')
In [10]:
print("Datafile starts with timepoint {}".format(daily_data.Date.min()))
print("Datafile ends with timepoint {}".format(daily_data.Date.max()))
Datafile starts with timepoint 2010-12-01 00:00:00
Datafile ends with timepoint 2011-12-09 00:00:00

We can see that the data covers roughly one year, from December 2010 to December 2011.

How many entries do we have?

In [11]:
samples = daily_data.shape[0]
samples
Out[11]:
280224

How many different products do we have (unique stockcodes)?

In [12]:
daily_data.StockCode.nunique()
Out[12]:
3958

How many different product descriptions do we have?

In [13]:
daily_data.Description.nunique()
Out[13]:
4223

We have more descriptions than product codes:

In [14]:
daily_data.Description.nunique() - daily_data.StockCode.nunique()
Out[14]:
265
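
This means that at least some stock codes carry more than one description (for example slightly different spellings of the same product name). A quick way to inspect the most affected codes could look like this (a small sketch; output not shown):

daily_data.groupby("StockCode").Description.nunique().sort_values(ascending=False).head(10)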

How are the product sales and revenues distributed?

In [15]:
daily_data.loc[:, ["Quantity", "Revenue"]].describe()
Out[15]:
Quantity Revenue
count 280224.000000 280224.000000
mean 18.521108 34.785557
std 81.977877 205.556779
min -19200.000000 -39243.080000
25% 2.000000 5.000000
50% 6.000000 14.525000
75% 18.000000 31.080000
max 12540.000000 39619.500000

As we can see from their min and max values, both target variables exhibit extreme outliers. Since we would like to use them as targets, we should exclude the outliers, as they would mislead our validation. And as I like to use early stopping, they would directly influence the training of the predictive models as well.

In [16]:
low_quantity = daily_data.Quantity.quantile(0.05)
high_quantity = daily_data.Quantity.quantile(0.95)
print((low_quantity, high_quantity))
(1.0, 72.0)
In [17]:
low_revenue = daily_data.Revenue.quantile(0.05)
high_revenue = daily_data.Revenue.quantile(0.95)
print((low_revenue, high_revenue))
(1.25, 129.8)

Let's only keep entries whose target values lie within the range occupied by the central 90 % of the data. This is a first and easy strategy to exclude heavy outliers, but we should always be aware of the fact that we lose the information carried by the entries we exclude. In general it could be useful to understand and analyse what caused these outliers.

In [18]:
daily_data = daily_data.loc[
    (daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
In [19]:
daily_data = daily_data.loc[
    (daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]

How many entries have we lost?

In [20]:
samples - daily_data.shape[0]
Out[20]:
32960

Let's take a look at the remaining distributions of daily quantities and revenues:

In [21]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.distplot(daily_data.Quantity.values, kde=False, ax=ax[0], color="Orange");
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_ylabel("Frequency");
ax[0].set_title("How many products are sold per day?");
sns.distplot(daily_data.Revenue.values, kde=True, ax=ax[1], color="Dodgerblue");
ax[1].set_xlabel("Daily revenue");
ax[1].set_ylabel("Frequency");
ax[1].set_title("How much revenue is gained per day?");

We can see that the distributions are right skewed: lower values are more common. In addition, the daily sales quantities seem to be multimodal. A daily sale of 1 is common, as are quantities of 12 and 24. This pattern is very interesting and leads to the conclusion that quantities are probably often divisible by 2 or 3. The daily revenues are right skewed as well, with low values being most common. In a nutshell: most products are bought as single items or in small bunches, and the revenue gained this way is relatively low.
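
To double-check the multimodal pattern, we could simply count the most frequent daily quantities (a quick sketch; output omitted):

daily_data.Quantity.value_counts().head(10)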

How did the daily product sales change over time?

In [22]:
daily_data.head()
Out[22]:
Year Quarter Month Week Weekday Dayofyear Day StockCode Description Quantity Revenue Date
0 2010 4 12 48 2 335 1 10002 INFLATABLE POLITICAL GLOBE 60 51.00 2010-12-01
1 2010 4 12 48 2 335 1 10125 MINI FUNKY DESIGN TAPES 2 1.70 2010-12-01
2 2010 4 12 48 2 335 1 10133 COLOURING PENCILS BROWN TUBE 5 4.25 2010-12-01
3 2010 4 12 48 2 335 1 10135 COLOURING PENCILS BROWN TUBE 1 2.51 2010-12-01
4 2010 4 12 48 2 335 1 11001 ASSTD DESIGN RACING CAR PEN 3 10.08 2010-12-01
In [23]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Quantity.sum(), marker='+', c="darkorange")
plt.plot(daily_data.groupby("Date").Quantity.sum().rolling(window=30, center=True).mean(),
        c="red")
plt.xticks(rotation=90);
plt.title("How many products are sold per day over the given time?");

The daily amounts of sold products show the expected increase in the months before Christmas and seem to be lowest in February. We have high fluctuations between days, ranging from roughly 2500 to roughly 19000 products per day. The average amount of product sales is around 7500; it increases strongly after September and is roughly doubled by the end of November. This emphasizes the need for optimized storekeeping: knowing which products will be in high demand for important customers could be an advantage for delivering fast without unnecessarily overfull stocks.

In [24]:
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Revenue.sum(), marker='+', c="royalblue")
plt.plot(daily_data.groupby("Date").Revenue.sum().rolling(window=30, center=True).mean(), 
         c="darkblue")
plt.xticks(rotation=90);
plt.title("How much revenue is gained per day over the given time?");

The daily revenue shows roughly the same temporal pattern as the quantities.
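
A quick way to quantify this similarity is the correlation between the two daily totals (a sketch; the exact value is not shown here):

daily_totals = daily_data.groupby("Date")[["Quantity", "Revenue"]].sum()
daily_totals.corr()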

In [25]:
fig, ax = plt.subplots(1,2,figsize=(20,5))

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
yearmonth = ["Dec-2010", "Jan-2011", "Feb-2011", "Mar-2011", "Apr-2011", "May-2011",
             "Jun-2011", "Jul-2011", "Aug-2011", "Sep-2011", "Oct-2011", "Nov-2011",
             "Dec-2011"]

daily_data.groupby("Weekday").Quantity.sum().plot(
    ax=ax[0], marker='o', label="Quantity", c="darkorange");
daily_data.groupby("Weekday").Revenue.sum().plot(
    ax=ax[0], marker='o', label="Revenue", c="royalblue");
ax[0].legend();
ax[0].set_xticks(np.arange(0,7))
ax[0].set_xticklabels(weekdays);
ax[0].set_title("Total sales per weekday");

ax[1].plot(daily_data.groupby(["Year", "Month"]).Quantity.sum().values,
    marker='o', label="Quantities", c="darkorange");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Revenue.sum().values,
    marker='o', label="Revenue", c="royalblue");
ax[1].set_xticks(np.arange(0, len(yearmonth)))
ax[1].set_xticklabels(yearmonth, rotation=90)
ax[1].legend();
ax[1].set_title("Total sales per month");

Both visualisations yield further interesting insights:

  • Thursday seems to be the day on which most products are sold.
  • In contrast, Friday and Sunday show very low transaction counts.
  • On Saturday there are no transactions at all.
  • The pre-Christmas season starts in September and peaks in November.
  • Indeed, February and April are months with very low sales.

How to predict daily product sales?

In this kernel I would like to use CatBoost as the predictive model. Predicting daily quantities and revenues are both regression tasks, and consequently I will use the CatBoost regressor. The loss and metric I like to use is the root mean square error (RMSE):

$$ E = \sqrt{ \frac{1}{N}\sum_{n=1}^{N} (t_{n} - y_{n})^{2}}$$

It computes the error between the target value $t_{n}$ and the predicted value $y_{n}$ per sample and squares it, so that positive and negative deviations contribute to the sum in the same way. Then the mean is taken by dividing by the total number $N$ of samples (entries) in the data, and finally the square root is taken to obtain an impression of the error of a single prediction. What should we keep in mind when working with this loss and metric? :-) It is heavily influenced by outliers! If some predictions are far away from their targets, they pull the mean towards higher values. Hence we may make good predictions for the majority of samples while the RMSE is still high due to large errors on a minority of samples.
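
As a minimal illustration of this outlier sensitivity (a sketch with made-up numbers, not taken from this dataset):

def rmse(t, y):
    # root mean square error between targets t and predictions y
    return np.sqrt(np.mean((np.asarray(t) - np.asarray(y)) ** 2))

print(rmse([10, 12, 11, 9], [11, 11, 10, 10]))   # every error is 1 -> RMSE = 1.0
print(rmse([10, 12, 11, 90], [11, 11, 10, 10]))  # one extreme target -> RMSE jumps to roughly 40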

Validation strategy

As the data covers only one year and the number of sold products increases strongly during the pre-Christmas period, we need to select the validation data carefully. To start, I would like to use all days of November and December 2011 for validation. Doing so we cover the trend of the pre-Christmas period and can hopefully still explain the further increase of quantities and revenues during November.
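
In code, this hold-out split boils down to a simple mask on the daily data (the same criterion that the model class below applies internally):

val_mask = (daily_data.Year == 2011) & (daily_data.Month.isin([11, 12]))
print("training entries: {}, validation entries: {}".format((~val_mask).sum(), val_mask.sum()))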

Catboost model and hyperparameter class

To easily generate comparable new model instances, I wrote a class that accepts either CatBoost data pools or pandas dataframes to train the CatBoost model. It yields a root mean square error evaluation score and can show feature importances.

In [26]:
class Catmodel:
    
    def __init__(self, name, params):
        self.name = name
        self.params = params
    
    def set_data_pool(self, train_pool, val_pool):
        self.train_pool = train_pool
        self.val_pool = val_pool
    
    def set_data(self, X, y):
        val_criteria = (X.Month.isin([11, 12])) & (X.Year == 2011)
        cat_features_idx = np.where(X.dtypes != np.float64)[0]  # np.float is deprecated/removed in newer numpy
        x_train, self.x_val = X.loc[val_criteria==False], X.loc[val_criteria==True]
        y_train, self.y_val = y.loc[val_criteria==False], y.loc[val_criteria==True]
        self.train_pool = Pool(x_train, y_train, cat_features=cat_features_idx)
        self.val_pool = Pool(self.x_val, self.y_val, cat_features=cat_features_idx)
    
    def prepare_model(self):
        self.model = CatBoostRegressor(
                loss_function = self.params.loss[0],
                random_seed = self.params.seed,
                logging_level = 'Silent',
                iterations = self.params.iterations,
                max_depth = self.params.max_depth[0],
                #learning_rate = self.params.learning_rate[0],
                l2_leaf_reg = self.params.l2_leaf_reg[0],
                od_type='Iter',
                od_wait=40,
                train_dir=self.name
            )
    
    def learn(self, plot=False):
        self.prepare_model()
        self.model.fit(self.train_pool, eval_set=self.val_pool, plot=plot);
        print("{}, early-stopped model tree count {}".format(
            self.name, self.model.tree_count_
        ))
    
    def score(self):
        return self.model.score(self.val_pool)
    
    def show_importances(self, kind="bar"):
        explainer = shap.TreeExplainer(self.model)
        shap_values = explainer.shap_values(self.val_pool)
        if kind=="bar":
            return shap.summary_plot(shap_values, self.x_val, plot_type="bar")
        return shap.summary_plot(shap_values, self.x_val)
    
    def get_val_results(self):
        self.results = pd.DataFrame(self.y_val)
        self.results["prediction"] = self.predict(self.x_val)
        self.results["error"] = np.abs(
            self.results[self.results.columns.values[0]].values - self.results.prediction)
        self.results["Month"] = self.x_val.Month
        self.results["SquaredError"] = self.results.error.apply(lambda l: np.power(l, 2))
        
    def show_val_results(self):
        self.get_val_results()
        fig, ax = plt.subplots(1,2,figsize=(20,5))
        sns.distplot(self.results.error, ax=ax[0])
        ax[0].set_xlabel("Single absolute error")
        ax[0].set_ylabel("Density")
        ax[0].axvline(np.median(self.results.error), c="black")
        ax[1].scatter(self.results.prediction.values,
                      self.results[self.results.columns[0]].values,
                      c=self.results.error, cmap="RdYlBu_r")
        ax[1].set_xlabel("Prediction")
        ax[1].set_ylabel("Target")
        return ax
    
    def get_monthly_RMSE(self):
        return self.results.groupby("Month").SquaredError.mean().apply(lambda l: np.sqrt(l))
        
    def predict(self, x):
        return self.model.predict(x)

I like to use one class for hyperparameters that can be passed directly to the model or via hyperparameter search:

In [27]:
class CatHyperparameter:
    
    def __init__(self,
                 loss="RMSE",
                 metric="RMSE",
                 iterations=400,
                 max_depth=6,
                 l2_leaf_reg=3,
                 #learning_rate=0.05,
                 seed=0):
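        # Note: the trailing commas below intentionally wrap most values in one-element tuples;
        # the model later reads them via params.loss[0], params.max_depth[0], params.l2_leaf_reg[0].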
        self.loss = loss,
        self.metric = metric,
        self.max_depth = max_depth,
        self.l2_leaf_reg = l2_leaf_reg,
        #self.learning_rate = learning_rate,
        self.iterations=iterations
        self.seed = seed

Baseline model & result analysis

Let's see how well this model performs without feature engineering and hyperparameter search:

In [28]:
X = daily_data.drop(["Quantity", "Revenue", "Date"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()

model = Catmodel("baseline", params)
model.set_data(X,y)
model.learn(plot=True)
baseline, early-stopped model tree count 252

If you have forked this kernel and are in interactive mode, you can see that the model loss has converged. How big is the root mean square error evaluated on the validation data?

In [29]:
model.score()
Out[29]:
12.07717034115981

This value close to 12 is high, but as already mentioned the RMSE is influenced by outliers, so we should take a look at the distribution of individual absolute errors:

In [30]:
model.show_val_results();
  • We can see that the distribution of absolute errors of single predictions is right skewed.
  • The median single error (black line) is about half of the RMSE score, i.e. significantly lower.
  • By plotting the target versus the prediction we can see that we make larger errors for validation entries with high true quantity values above 30. The strong blue band marks the identity region where predictions are close to the target values. To improve, we need to make better predictions for products with truly high quantities during the validation period.
In [31]:
model.show_importances()
The model has complex ctrs, so the SHAP values will be calculated approximately.
  • We can see that the stock code as well as the description of the products are very important. They have no color because they are not numerical and therefore have no low or high values.
  • The weekday is an important feature as well. We have already seen this while exploring the data: on days with low values (Monday up to Thursday) the retailer sells most of its products, whereas high values (Friday to Sunday) only yield a few sales.
In [32]:
model.show_importances(kind=None)
The model has complex ctrs, so the SHAP values will be calculated approximately.
  • Take a look at the weekday to understand this plot: low values (0 to 3) correspond to Monday, Tuesday, Wednesday and Thursday. These are days with a high amount of product sales (high quantity target values). They are colored in blue and push towards higher SHAP values and consequently towards higher predicted quantity values. Higher weekday values correspond to Friday, Saturday and Sunday. They are colored in red and push towards negative SHAP values and lower predicted values. This confirms the observations we made while exploring the weekday and the sum of daily quantities.
  • The StockCode and the Description are important features, but they are also very complex. We have seen that there are close to 4000 different stock codes and even more descriptions. To improve, we should try to engineer features that describe the products in a more general way.

Bayesian Hyperparameter Search with GPyOpt

In [33]:
import GPyOpt

class Hypertuner:
    
    def __init__(self, model, max_iter=10, max_time=10):
        self.bounds = [{'name': 'depth','type': 'discrete','domain': (1, 16)},
                       {'name': 'l2_leaf_reg','type': 'discrete','domain': (1, 80)}]
        self.model = model
        self.max_iter=max_iter
        self.max_time=max_time
    
    def objective(self, params):
        params = params[0]
        params = CatHyperparameter(
            max_depth=params[0],
            l2_leaf_reg=params[1]
        )
        self.model.params = params
        self.model.learn()
        return self.model.score()
        
    def learn(self):
        np.random.seed(777)
        optimizer = GPyOpt.methods.BayesianOptimization(
            f=self.objective, domain=self.bounds,
            acquisition_type ='MPI',
            acquisition_par = 0.1,
            exact_eval=True)
        optimizer.run_optimization(self.max_iter, self.max_time)
        optimizer.plot_convergence()
        print(optimizer.X[np.argmin(optimizer.Y)])

        print('RMSE:', np.min(optimizer.Y))

Run the search to find a more suitable tree depth and l2-leaf-reg (to reduce overfitting). But this takes some time... ;-)

In [34]:
search = Hypertuner(model)
#search.learn()

What do we know about the products?

Given the data, we are able to include much more information in our daily aggregation. Let's start by engineering new features based on the StockCode:

In [35]:
products = pd.DataFrame(index=daily_data.StockCode.unique())

Currently we use November and December 2011 as our validation data in a hold-out fashion. To prevent leakage of future information, we need to perform all feature engineering on data with timestamps before that period:

In [36]:
fe_selector = (data.Year == 2011) & (data.Month.isin([11,12]))
fe_data = data.loc[fe_selector==False].copy()
# daily_data has its own index, so it needs its own selector
fe_daily_selector = (daily_data.Year == 2011) & (daily_data.Month.isin([11,12]))
fe_daily_data = daily_data.loc[fe_daily_selector==False].copy()

Product attraction

  • How many customers purchase a specific product?
  • When do customers buy a specific product? In which month? On which weekday?
In [37]:
attraction_map = fe_data.groupby("StockCode").CustomerID.nunique()
quarter_attraction = fe_data.loc[fe_data.Year==2010].groupby(["StockCode", "Quarter"]).CustomerID.nunique()
monthly_attraction = fe_data.groupby(["StockCode", "Month"]).CustomerID.nunique().unstack()
weekday_attraction = fe_data.groupby(["StockCode", "Weekday"]).CustomerID.nunique().unstack().fillna(0)
week_attraction = fe_data.groupby(["StockCode", "Week"]).CustomerID.nunique().unstack().fillna(0)
In [38]:
products["Attraction"] = products.index.map(attraction_map)
for col in monthly_attraction.columns.values:
    products["Attraction_Month_" + str(col)] = products.index.map(monthly_attraction.loc[:, col])
    
In [39]:
products = products.fillna(0)

Unit Price

  • What's the average unit price of the product and its standard deviation? Is the price stable?
  • Does the price vary over time, between customers, or both?
  • What's the mean/median unit price during and outside the pre-Christmas period?
In [40]:
monthly_prices = fe_data.groupby(
    ["StockCode", "Month"]
).UnitPrice.median().reset_index()

pre_christmas_prices = monthly_prices.loc[
    monthly_prices.Month.isin([9,10,11])
].groupby("StockCode").UnitPrice.median()
not_christmas_prices = monthly_prices.loc[
    monthly_prices.Month.isin([9,10,11]) == False
].groupby("StockCode").UnitPrice.median()
month_medianprice_map = monthly_prices.groupby("StockCode").UnitPrice.std()

mean_price_map = fe_data.groupby("StockCode").UnitPrice.mean()
median_price_map = fe_data.groupby("StockCode").UnitPrice.median()
std_price_map = fe_data.groupby("StockCode").UnitPrice.std()

customer_medianprice_map = fe_data.groupby(
    ["StockCode", "CustomerID"]
).UnitPrice.median().reset_index().groupby("StockCode").UnitPrice.std()
In [41]:
## Unit Price information
products["MeanPrice"] = products.index.map(mean_price_map)
products["MedianPrice"] = products.index.map(median_price_map)
products["StdPrice"] = products.index.map(std_price_map)
products.StdPrice = products.StdPrice.fillna(0)
products["PriceStability"] = np.where(products.StdPrice==0, 1, 0)
In [42]:
products["CustomerMedianPriceVariation"] = products.index.map(customer_medianprice_map)
products["MonthMedianPriceVariation"] = products.index.map(month_medianprice_map)
products.CustomerMedianPriceVariation = products.CustomerMedianPriceVariation.fillna(0)

products["PreChristmasMedianPrice"] = products.index.map(pre_christmas_prices)
products["NotChristmasMedianPrice"] = products.index.map(not_christmas_prices)
products["ChristmasPriceDifference"] = products.PreChristmasMedianPrice - products.NotChristmasMedianPrice
products.head()
Out[42]:
Attraction Attraction_Month_1 Attraction_Month_2 Attraction_Month_3 Attraction_Month_4 Attraction_Month_5 Attraction_Month_6 Attraction_Month_7 Attraction_Month_8 Attraction_Month_9 Attraction_Month_10 Attraction_Month_12 MeanPrice MedianPrice StdPrice PriceStability CustomerMedianPriceVariation MonthMedianPriceVariation PreChristmasMedianPrice NotChristmasMedianPrice ChristmasPriceDifference
10002 40.0 18.0 5.0 5.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 15.0 1.056849 0.85 0.404244 0 0.000000 0.000000 NaN 0.85 NaN
10125 47.0 7.0 0.0 9.0 6.0 6.0 2.0 8.0 4.0 4.0 4.0 7.0 0.860111 0.85 0.278626 0 0.177886 0.018974 0.85 0.85 0.00
10133 101.0 4.0 5.0 7.0 2.0 4.0 30.0 31.0 21.0 5.0 0.0 12.0 0.649045 0.79 0.248819 0 0.178759 0.218947 0.42 0.85 -0.43
10135 82.0 15.0 7.0 8.0 5.0 2.0 8.0 9.0 4.0 8.0 10.0 17.0 1.365263 1.25 0.615934 0 0.395704 0.000000 1.25 1.25 0.00
11001 38.0 2.0 1.0 3.0 3.0 5.0 5.0 4.0 4.0 6.0 9.0 6.0 1.771800 1.69 0.806065 0 0.270123 0.371386 1.69 1.69 0.00

We need to add our engineered features to the entries of the training and validation data:

In [43]:
for col in products.columns:
    daily_data[col] = daily_data.StockCode.map(products.loc[:, col])
In [44]:
X = daily_data.drop(["Quantity", "Revenue", "Date", "StockCode", "Description"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()

model = Catmodel("product_features", params)
model.set_data(X,y)
model.learn(plot=True)
product_features, early-stopped model tree count 400
In [45]:
model.show_importances()
The model has complex ctrs, so the SHAP values will be calculated approximately.
In [46]:
model.show_val_results();

A catboost cross validation family

In [47]:
class CatCvFamily:
    
    def __init__(self, params, X, y, n_splits=2):
        self.family = {}
        self.cat_features_idx = np.where(X.dtypes != np.float64)[0]  # np.float is deprecated/removed in newer numpy
        self.X = X.values
        self.y = y.values
        self.n_splits = n_splits
        self.params = params
    
    def set_validation_strategy(self):
        self.cv = TimeSeriesSplit(max_train_size = None,
                                  n_splits = self.n_splits)
        self.gen = self.cv.split(self.X)
    
    def get_split(self):
        train_idx, val_idx = next(self.gen)
        x_train, x_val = self.X[train_idx], self.X[val_idx]
        y_train, y_val = self.y[train_idx], self.y[val_idx]
        train_pool = Pool(x_train, y_train, cat_features=self.cat_features_idx)
        val_pool = Pool(x_val, y_val, cat_features=self.cat_features_idx)
        return train_pool, val_pool
    
    def learn(self):
        self.set_validation_strategy()
        self.model_names = []
        self.model_scores = []
        for split in range(self.n_splits):
            name = 'Model_cv_' + str(split) + '/'
            train_pool, val_pool = self.get_split()
            self.model_names.append(name)
            self.family[name], score = self.fit_catmodel(name, train_pool, val_pool)
            self.model_scores.append(score)
            
    def fit_catmodel(self, name, train_pool, val_pool):
        # Catmodel expects (name, params); the pools are attached separately
        cat = Catmodel(name, self.params)
        cat.set_data_pool(train_pool, val_pool)
        cat.learn()
        score = cat.score()
        return cat, score
    
    def score(self):
        return np.mean(self.model_scores)
    
    def show_learning(self):
        widget = MetricVisualizer(self.model_names)
        widget.start()

    def show_importances(self):
        name = self.model_names[-1]
        cat = self.family[name]
        explainer = shap.TreeExplainer(cat.model)
        shap_values = explainer.shap_values(cat.val_pool)
        # use only the SHAP values here; the global X does not align with the validation pool rows
        return shap.summary_plot(shap_values, plot_type="bar")

About returns, cancellations and give-aways

To understand the quantity outliers we should take a closer look at some of them. To continue, we can focus on the 0.01 % with the lowest quantities and the 0.01 % with the highest:

Very low negative quantities

In [48]:
low = data.Quantity.quantile(0.0001)
high = data.Quantity.quantile(0.9999)
In [49]:
low_outliers = data.loc[(data.Quantity < low)]
low_outliers.head(5)
Out[49]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
4287 C536757 84347 ROTATING SILVER ANGELS T-LIGHT HLDR -9360 2010-12-02 14:23:00 0.03 15838 United Kingdom -280.8 2010 4 12 48 3 2 336
50849 540564 22617 mouldy, thrown away. -2600 2011-01-10 10:36:00 0.00 NaN United Kingdom -0.0 2011 1 1 2 0 10 10
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2011-01-18 10:17:00 1.04 12346 United Kingdom -77183.6 2011 1 1 3 1 18 18
65063 541685 22351 Given away -1400 2011-01-20 15:41:00 0.00 NaN United Kingdom -0.0 2011 1 1 3 3 20 20
82794 543257 84611B thrown away -1430 2011-02-04 16:06:00 0.00 NaN United Kingdom -0.0 2011 1 2 5 4 4 35

Very interesting! Take a look at the product descriptions as well as the customer IDs. Sometimes the latter is missing and the description is not a product description but a return note like "mouldy, thrown away". In these cases the unit price is zero as well, and the InvoiceNo does not start with a "C". The "C" tells us that a transaction was cancelled. You can see that the two remaining example outliers are cancelled orders, i.e. product returns.

Very high positive quantities

In [50]:
high_outliers = data.loc[(data.Quantity > high)]
high_outliers.head(5)
Out[50]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
4850 536809 84950 ASSORTED COLOUR T-LIGHT HOLDER 1824 2010-12-02 16:48:00 0.55 15299 United Kingdom 1003.20 2010 4 12 48 3 2 336
4945 536830 84077 WORLD WAR 2 GLIDERS ASSTD DESIGNS 2880 2010-12-02 17:38:00 0.18 16754 United Kingdom 518.40 2010 4 12 48 3 2 336
19871 537899 22328 ROUND SNACK BOXES SET OF 4 FRUITS 1488 2010-12-09 10:44:00 2.55 12755 Japan 3794.40 2010 4 12 49 3 9 343
25920 538420 17096 ASSORTED LAQUERED INCENSE HOLDERS 1728 2010-12-12 12:03:00 0.17 12875 United Kingdom 293.76 2010 4 12 49 6 12 346
32671 539101 22693 GROW A FLYTRAP OR SUNFLOWER IN TIN 2400 2010-12-16 10:35:00 0.94 16029 United Kingdom 2256.00 2010 4 12 50 3 16 350

In contrast to the findings above, these seem to be valid transactions, although some of them may later be returned as erroneous purchases.

Why are products returned or cancelled?

To analyse these kinds of entries we need to generate new features in our data:

  • A binary feature indicating cancelled products
  • A binary feature indicating unknown customers
  • A binary feature indicating zero unit prices
  • A binary feature indicating a negative quantity (return)

In addition we should check whether there are entries with a negative unit price or a zero quantity:

In [51]:
data[data.UnitPrice < 0]
Out[51]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom -11062.06 2011 3 8 32 4 12 224
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom -11062.06 2011 3 8 32 4 12 224

Ah! :-) It's also possible that the retailer has to pay for an open debt.

In [52]:
data[data.Quantity == 0]
Out[52]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear

Ok, we have no entries with zero quantities. Now, let's create the new features:

In [53]:
data["Cancelled"] = np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), 1, 0)
data["UnknownCustomer"] = np.where(data.CustomerID.isna(), 1, 0)
data["ZeroPrice"] = np.where(data.UnitPrice == 0, 1, 0)
data["Return"] = np.where(data.Quantity <= 0, 1, 0)

Which pathways do returned products take?

By diving into the outliers of quantities and unit prices we have already found out that some of them are related to returns. To answer the question of which pathways returned products can take, let's take a look at the interaction of these new features:

In [54]:
marker = ["Cancelled", "UnknownCustomer", "ZeroPrice", "Return"]
In [55]:
marker_counts = 100 * data.loc[:, marker].sum().sort_values(ascending=False) / data.shape[0]

interactions = pd.DataFrame(index=marker, columns=marker, dtype=np.float64)
for feat in marker:
    interactions.loc[feat] = 100 * data.loc[data[feat] == 1, marker].sum() / data.loc[:, feat].sum()
    
interactions = interactions.apply(np.round)

fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.heatmap(data=interactions,
            cmap="PuRd",
            annot=True, fmt="g",
            linewidth=2,
            cbar=False, ax=ax[0]);
sns.barplot(x=marker_counts.index, y=marker_counts.values,
           palette="PuRd_r", ax=ax[1]);
ax[1].set_ylabel("% in data");

Insights

  • Cancelled products are always returns. In contrast not all returns are marked as cancelled.
  • The occurrences of unknown customers can't be explained solely by returns, cancellations or zero-price products.
  • Entries with zero price are very likely to have an unknown customer and half of them are returns. The latter is interesting as there must be entries with a zero price but a positive quantity count. Does the retailer sometimes offer give-aways?
  • Returns are often cancellations but there are some that correspond to zero-price products probably joined together with unknown customers.
  • The percentage of unknown customers is, at roughly 25 %, very high. In contrast, zero-price entries are very rare.

An official return is indicated by a cancellation of the transaction. In these cases the customer is usually known and the product has an official price that contributes to the revenue values. In contrast there exists at least one more pathway for returns: in the case of zero prices and negative quantities the products are often damaged or lost. If you browse through these entries you can see that most of them have missing descriptions (indicated by NaN). On the other hand, products with zero prices and positive quantities are often marked as found, as needing to be checked, or have "amazon" in their description.
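
If you want to verify the point about missing descriptions yourself, a quick check could look like this (a sketch; output omitted):

zero_price_returns = data.loc[(data.ZeroPrice == 1) & (data.Return == 1)]
print("Share with missing description: {:.2f}".format(zero_price_returns.Description.isna().mean()))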

In [56]:
data.loc[(data.ZeroPrice==1) & (data.Return == 0)].head()
Out[56]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear Cancelled UnknownCustomer ZeroPrice Return
622 536414 22139 NaN 56 2010-12-01 11:52:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1970 536545 21134 NaN 1 2010-12-01 14:32:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1971 536546 22145 NaN 1 2010-12-01 14:33:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1972 536547 37509 NaN 1 2010-12-01 14:33:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0
1987 536549 85226A NaN 1 2010-12-01 14:34:00 0.0 NaN United Kingdom 0.0 2010 4 12 48 2 1 335 0 1 1 0

Let's close this section by introducing new features that fuse what we have found:

In [57]:
data["ReturnedCancelled"] = np.where((data.Return == 1) & (data.Cancelled == 1),1,0)
data["ReturnedZeroPrice"] = np.where((data.Return == 1) & (data.ZeroPrice == 1),1,0)
data["NotReturnedZeroPrice"] = np.where((data.Return == 0) & (data.ZeroPrice == 1),1,0)
In [58]:
data.loc[:, ["ReturnedCancelled", "ReturnedZeroPrice", "NotReturnedZeroPrice"]].sum()
Out[58]:
ReturnedCancelled       9288
ReturnedZeroPrice       1336
NotReturnedZeroPrice    1179
dtype: int64

Why should we bother about outliers?

Products that were mistakenly purchased in very high amounts cause extreme outliers in the quantities sold and the revenues gained. Such outliers can be very misleading for predictive models and we should always be aware of them. To set up a model for the retailer we need to understand the different kinds of outliers. Doing so, we might be able to build strategies that detect them and avoid a negative impact on the forecast.

By exploring the data, we have already found some differences in how products can be returned. To understand how the data was collected and how the company deals with outliers, let's take a look at the most extreme quantity outliers we can observe:

In [59]:
data.loc[(data.Quantity == data.Quantity.min()) | (data.Quantity == data.Quantity.max())]
Out[59]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country Revenue Year Quarter Month Week Weekday Day Dayofyear Cancelled UnknownCustomer ZeroPrice Return ReturnedCancelled ReturnedZeroPrice NotReturnedZeroPrice
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2011-12-09 09:15:00 2.08 16446 United Kingdom 168469.6 2011 4 12 49 4 9 343 0 0 0 0 0 0 0
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2011-12-09 09:27:00 2.08 16446 United Kingdom -168469.6 2011 4 12 49 4 9 343 1 0 0 1 1 0 0

We can see that this purchase was made by mistake: a customer ordered a product in a very high quantity and cancelled the transaction a few minutes later. The absolute quantities are the same, as is the stock code. But unfortunately the invoice number of the cancellation (without the leading C) does not match the original invoice number. A first piece of advice for the retailer would be to change this, as it would make it much easier to detect such pairs.
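
One possible way to detect such purchase/cancellation pairs despite the differing invoice numbers is to match cancellations to earlier purchases of the same customer, product and absolute quantity. This is only a sketch of such a matching, not a procedure used by the retailer:

cancels = data.loc[(data.Cancelled == 1) & (data.UnknownCustomer == 0),
                   ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]].copy()
cancels["Quantity"] = cancels.Quantity.abs()
purchases = data.loc[(data.Cancelled == 0) & (data.Quantity > 0),
                     ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]]
# candidate pairs: same customer, same product, same absolute quantity,
# with the purchase taking place before the cancellation
pairs = cancels.merge(purchases, on=["CustomerID", "StockCode", "Quantity"],
                      suffixes=("_cancel", "_purchase"))
pairs = pairs.loc[pairs.InvoiceDate_purchase <= pairs.InvoiceDate_cancel]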

Product sales and revenues per country

To how many different countries does the retailer deliver in this data?

In [60]:
data.Country.nunique()
Out[60]:
38

Which countries purchase the most products and which ones yield the highest revenues?

To answer these questions, let's distinguish between the retailer's home country, the United Kingdom, and all other countries by using a new feature. I expect that most of the customers reside in the UK, so it might be useful to distinguish between domestic and foreign orders.

In [61]:
data["UK"] = np.where(data.Country == "United Kingdom", 1, 0)

In addition we need a feature that represents the revenue the retailer gained from each transaction. It is given by the product of the quantity ordered and the unit price of the product:

In [62]:
data["Revenue"] = data.Quantity * data.UnitPrice
In [63]:
country_quantities = data.groupby("Country").Quantity.sum()
country_revenues = data.groupby("Country").Revenue.sum()
revenue_per_quantity = country_revenues / country_quantities

uk_quantities = data.groupby("UK").Quantity.sum().sort_values(ascending=False)
uk_revenues = data.groupby("UK").Revenue.sum().sort_values(ascending=False)

top5_foreign_most_products = country_quantities.sort_values(ascending=False).iloc[1:6]
top5_foreign_highest_revenue = country_revenues.sort_values(ascending=False).iloc[1:6]

top5_revenue_per_quantity = revenue_per_quantity.loc[
    top5_foreign_most_products.index
].sort_values(ascending=False)

uk_revenue_per_quantity = data.groupby("UK").Revenue.sum() / data.groupby("UK").Quantity.sum() 
In [64]:
fig, ax = plt.subplots(3,2,figsize=(20,19))

total_quantities = data.Quantity.sum() * 1/100
total_revenues = data.Revenue.sum() * 1/100

ax[0,0].set_title("Out- and inside UK - Product sales");
sns.barplot(uk_quantities.index, uk_quantities.values/total_quantities, ax=ax[0,0], palette="Purples")
ax[0,0].set_ylabel("% of total product sales")
ax[0,0].set_xticklabels(["outside", "inside"])
ax[0,0].set_xlabel("")

ax[0,1].set_title("Top 5 foreign countries - Product sales");
sns.barplot(top5_foreign_most_products.index,
            top5_foreign_most_products.values/total_quantities,
            ax=ax[0,1], palette="Purples_r")
ax[0,1].set_xlabel("");
ax[0,1].set_ylabel("% of total product sales")

ax[1,0].set_title("Out- and inside UK - Revenues");
sns.barplot(uk_revenues.index, uk_revenues.values/total_revenues, ax=ax[1,0], palette="Blues")
ax[1,0].set_ylabel("% of total revenue")
ax[1,0].set_xticklabels(["outside", "inside"])
ax[1,0].set_xlabel("")

ax[1,1].set_title("Top 5 foreign countries - Revenues");
sns.barplot(top5_foreign_highest_revenue.index,
            top5_foreign_highest_revenue.values/total_revenues,
            ax=ax[1,1], palette="Blues_r")
ax[1,1].set_xlabel("");
ax[1,1].set_ylabel("% of total revenue")

ax[2,0].set_title("Out- and inside UK - Revenue per quantity");
sns.barplot(uk_revenue_per_quantity.index, uk_revenue_per_quantity.values,
            ax=ax[2,0], palette="Greens")
ax[2,0].set_ylabel("revenue/quantity")
ax[2,0].set_xticklabels(["outside", "inside"])
ax[2,0].set_xlabel("")

ax[2,1].set_title("Top5 foreign countries - Revenue per quantity")
sns.barplot(top5_revenue_per_quantity.index,
            top5_revenue_per_quantity.values, ax=ax[2,1], palette="Greens_r");
ax[2,1].set_xlabel("");
ax[2,1].set_ylabel("revenue/quantity");

Insights

  • The retailer is selling most of its products and gains most of the revenue inside the UK.
  • The next top 4 countries are located in Europe.
  • Australia is a former British colony and perhaps this is still a reason why it has a closer connection to the retailer than other English-speaking countries. But this is just an assumption that need not be true. It could also be that the retailer sells some products that are, for some other reason, very interesting for the Australian market.
  • Furthermore we can see that the retailer gains more revenue per item in Germany than in the Netherlands. Are German customers buying more expensive products? A quick check follows below. If we would like to know which countries are worth targeting with marketing campaigns to win new customers, it could be good to analyse the specific needs of these top 5 countries first.
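
The quick check of the question raised above could compare the typical unit prices of both countries (a sketch; output omitted):

data.loc[data.Country.isin(["Germany", "Netherlands"])].groupby("Country").UnitPrice.median()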

Product best-sellers?

Imagine you need to know which kinds of products will be sold tomorrow, next week or next month to optimize your inventory and storage strategy. A simple way to start is to focus on the products that are best-sellers or show seasonal patterns (for example, products that are only sold around Christmas but in very high amounts). A next step could then be to include more and more products so that the optimization becomes better and better for the whole stock.

In [65]:
products = pd.DataFrame(data.groupby("StockCode").Quantity.sum(),
                        columns=["Quantity"])
products["Revenue"] = data.groupby("StockCode").Revenue.sum()
products["MeanUnitPrice"] = data.groupby("StockCode").UnitPrice.mean()
products["StdUnitPrice"] = data.groupby("StockCode").UnitPrice.std().fillna(0)
products["MedianUnitPrice"] = data.groupby("StockCode").UnitPrice.median()
products["Q25UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.25)
products["Q75UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.75)
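
With these aggregates in place, a first best-seller ranking could simply sort the products by total quantity or total revenue (a sketch; output omitted):

best_by_quantity = products.Quantity.sort_values(ascending=False).head(10)
best_by_revenue = products.Revenue.sort_values(ascending=False).head(10)
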
In [66]:
#data.loc[data.Year==2011].groupby(["StockCode", "Month"]).Quantity.sum().unstack().fillna(0)
In [67]:
#data.loc[data.Year==2011].groupby(["StockCode", "Weekday"]).Quantity.sum().unstack().fillna(0)