Within this kernel we will analyse the sales data of a UK online retailer. As storage space can be expensive and on-time delivery is important to stay ahead of the competition, we would like to help the retailer by predicting the daily number of products sold as well as the revenue gained. This way we will try to answer:

- How many items of each product will be sold tomorrow, next week, next month?
- Which products yield the highest revenues and which ones are best-sellers?
- Which products have unstable prices and how does this affect the purchase behavior of the retailer's customers?
- What are the customer preferences? Can we gain a basic understanding of how the requested products would change if we add new customers of a specific kind?

- Prepare to start (complete)
- Get familiar with the data
- Sneak a peek - features and targets (complete)
- Aggregating daily product sales and revenues (complete)
- Gain first impressions by exploration (complete)

**Work under construction**

How to predict daily product sales? (almost complete)

- Catboost model and hyperparameter class (add more explanations)
- Baseline model & result analysis (add more explanations)
- Bayesian Hyperparameter Search with GPyOpt (add more explanations)

What do we know about the products?

- Explorative analysis and feature engineering
- Model improvement step
- Result analysis
- Conclusion

- Product sales and revenues per country

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import TimeSeriesSplit
from catboost import MetricVisualizer
from os import listdir
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
import shap
shap.initjs()
```

In [2]:

```
print(listdir("../input"))
```

In [3]:

```
data = pd.read_csv("../input/data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})
data.shape
```

Out[3]:

The data has 541909 entries and 8 variables.

In [4]:

```
data.head()
```

Out[4]:

We can see that the data file holds information for **each single transaction line**. Take a look at the InvoiceNo and the CustomerID of the first entries: **one customer with ID 17850 from the United Kingdom placed a single order with the InvoiceNo 536365**. The customer ordered **several products with different stock codes, descriptions, unit prices and quantities**. In addition, we can see that the InvoiceDate is the same for all of these products.
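
To make this structure visible, we can look at all rows that share this invoice number (a quick sketch; it assumes InvoiceNo was parsed as a string, which pandas does for this file since some invoice numbers carry a "C" prefix for cancellations):

```
# All transaction lines that belong to the single order 536365;
# each row is one product of the same invoice
data.loc[data.InvoiceNo == "536365"]
```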

As we would like to predict the daily number of product sales as well as revenues, we need a daily aggregation of this data. For this purpose we compute the revenue per transaction and extract temporal features from the InvoiceDate:

In [5]:

```
data["Revenue"] = data.Quantity * data.UnitPrice
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate,
cache=True)
data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear
```

As the key task of this kernel is to predict the revenue gained and the number of products sold per day, we can sum up the daily quantities and revenues per product stock code and description:

In [6]:

```
grouped_features = ["Year", "Quarter","Month", "Week", "Weekday", "Dayofyear", "Day",
"StockCode", "Description"]
```

This way we lose information about customers and countries, but we will recover it later in this kernel.

In [7]:

```
daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum(),
                          columns=["Quantity"])
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()
daily_data.head()
```

Out[7]:

In [8]:

```
daily_data["Date"] = pd.to_datetime(daily_data[['Year', 'Month', 'Day']])
```

In [9]:

```
daily_data.Date.max() - daily_data.Date.min()
```

Out[9]:

In [10]:

```
print("Datafile starts with timepoint {}".format(daily_data.Date.min()))
print("Datafile ends with timepoint {}".format(daily_data.Date.max()))
```

We can see that the data covers one year, from December 2010 to December 2011.

In [11]:

```
samples = daily_data.shape[0]
samples
```

Out[11]:

In [12]:

```
daily_data.StockCode.nunique()
```

Out[12]:

In [13]:

```
daily_data.Description.nunique()
```

Out[13]:

We have more descriptions than product codes:

In [14]:

```
daily_data.Description.nunique() - daily_data.StockCode.nunique()
```

Out[14]:
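
A quick way to see which stock codes carry more than one description (a sketch; the variable name is ad hoc):

```
# Count distinct descriptions per stock code and show the most ambiguous ones
desc_per_code = daily_data.groupby("StockCode").Description.nunique()
desc_per_code[desc_per_code > 1].sort_values(ascending=False).head()
```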

In [15]:

```
daily_data.loc[:, ["Quantity", "Revenue"]].describe()
```

Out[15]:

As we can see from their min and max values, both **target variables exhibit extreme outliers**. If we would like to use them as targets, we **should exclude these outliers, as they would mislead our validation. Since I would like to use early stopping, this would directly influence the training of predictive models as well**.

In [16]:

```
low_quantity = daily_data.Quantity.quantile(0.05)
high_quantity = daily_data.Quantity.quantile(0.95)
print((low_quantity, high_quantity))
```

In [17]:

```
low_revenue = daily_data.Revenue.quantile(0.05)
high_revenue = daily_data.Revenue.quantile(0.95)
print((low_revenue, high_revenue))
```

Let's **only use target values within the range occupied by 90 % of the data entries**. This is a first and easy strategy to exclude heavy outliers, but we should always be aware of the fact that we lose the information carried by the roughly 10 % of entries we exclude per target. In general it could be useful to understand and analyse what caused these outliers.
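
As a quick sanity check before dropping them, we can peek at a few of the most extreme rows (a minimal sketch):

```
# The rows with the largest daily quantities - candidates for the cut-off
daily_data.nlargest(3, "Quantity")[["Date", "StockCode", "Description",
                                    "Quantity", "Revenue"]]
```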

In [18]:

```
daily_data = daily_data.loc[
    (daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
```

In [19]:

```
daily_data = daily_data.loc[
    (daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]
```

How many entries have we lost?

In [20]:

```
samples - daily_data.shape[0]
```

Out[20]:

Let's take a look at the remaining distributions of daily quantities and revenues:

In [21]:

```
fig, ax = plt.subplots(1,2,figsize=(20,5))
# distplot is deprecated in newer seaborn; histplot is the replacement
sns.histplot(daily_data.Quantity.values, kde=False, ax=ax[0], color="Orange");
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_ylabel("Frequency");
ax[0].set_title("How many products are sold per day?");
sns.histplot(daily_data.Revenue.values, kde=True, ax=ax[1], color="Dodgerblue");
ax[1].set_xlabel("Daily revenue");
ax[1].set_ylabel("Frequency");
ax[1].set_title("How much revenue is gained per day?");
```

We can see that the distributions are **right skewed: lower values are more common**. In addition, the daily sales quantities seem to be **multimodal**: a daily sale of 1 is common, as are quantities of 12 and 24. This pattern is very interesting and suggests that quantities are often divisible by 2 or 3. The daily revenues are right skewed as well, with low values being most common. In a nutshell, specific products are often bought as single units or in small bunches, and the revenue gained this way is relatively low.
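
We can check this observation directly by counting the most frequent daily quantities (a quick sketch):

```
# The most common daily sales quantities; if the multimodality observation
# holds, values like 1, 12 and 24 should rank high here
daily_data.Quantity.value_counts().head(10)
```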

In [22]:

```
daily_data.head()
```

Out[22]:

In [23]:

```
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Quantity.sum(), marker='+', c="darkorange")
plt.plot(daily_data.groupby("Date").Quantity.sum().rolling(window=30, center=True).mean(),
         c="red")
plt.xticks(rotation=90);
plt.title("How many products are sold per day over the given time?");
```

The daily number of products sold shows an expected **increase in the months before Christmas and seems to be lowest in February**. We see **high fluctuations** between days, ranging from close to 2500 to roughly 19000 products per day. The average **number of products sold per day** is around 7500; it strongly increases after September and **roughly doubles by the end of November**. This emphasizes the need for optimized storekeeping: knowing which products will be in high demand by important customers could be an advantage, enabling fast delivery without keeping stocks unnecessarily full.
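
The figures mentioned above can be verified with a quick aggregation (a sketch; the exact numbers may differ slightly):

```
# Summary statistics of the total number of products sold per day
daily_totals = daily_data.groupby("Date").Quantity.sum()
daily_totals.agg(["min", "max", "mean"])
```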

In [24]:

```
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Revenue.sum(), marker='+', c="royalblue")
plt.plot(daily_data.groupby("Date").Revenue.sum().rolling(window=30, center=True).mean(),
         c="darkblue")
plt.xticks(rotation=90);
plt.title("How much revenue is gained per day over the given time?");
```

The daily revenue shows roughly the same temporal pattern as the quantities.
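
A quick check of how closely the two daily series move together (a sketch):

```
# Correlation between daily total quantity and daily total revenue
daily_data.groupby("Date").agg({"Quantity": "sum", "Revenue": "sum"}).corr()
```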

In [25]:

```
fig, ax = plt.subplots(1,2,figsize=(20,5))
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
yearmonth = ["Dec-2010", "Jan-2011", "Feb-2011", "Mar-2011", "Apr-2011", "May-2011",
             "Jun-2011", "Jul-2011", "Aug-2011", "Sep-2011", "Oct-2011", "Nov-2011",
             "Dec-2011"]
daily_data.groupby("Weekday").Quantity.sum().plot(
    ax=ax[0], marker='o', label="Quantity", c="darkorange");
daily_data.groupby("Weekday").Revenue.sum().plot(
    ax=ax[0], marker='o', label="Revenue", c="royalblue");
ax[0].legend();
ax[0].set_xticks(np.arange(0,7))
ax[0].set_xticklabels(weekdays);
ax[0].set_title("Total sales per weekday");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Quantity.sum().values,
           marker='o', label="Quantities", c="darkorange");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Revenue.sum().values,
           marker='o', label="Revenue", c="royalblue");
# set the ticks before the labels, otherwise the labels are overwritten
ax[1].set_xticks(np.arange(0, len(yearmonth)))
ax[1].set_xticklabels(yearmonth, rotation=90)
ax[1].legend();
ax[1].set_title("Total sales per month");
```

Both visualisations yield further interesting insights:

- Thursday seems to be the day on which most products are sold.
- In contrast, Friday and Sunday show very few transactions.
- On Saturday there are no transactions at all (verified in the sketch below).
- The pre-Christmas season starts in September and shows a peak in November.
- Indeed, February and April are months with very low sales.
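
The Saturday observation is easy to verify (a minimal sketch; in pandas, weekday 5 corresponds to Saturday):

```
# Number of distinct dates with transactions per weekday (0=Monday ... 6=Sunday);
# weekday 5 (Saturday) should be absent entirely
daily_data.groupby("Weekday").Date.nunique()
```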

In this kernel I would like to use CatBoost as the predictive model. The predictions of daily quantities and revenues are both regression tasks, and consequently I will use the CatBoost regressor. The loss and evaluation metric I would like to use is the root mean squared error (RMSE):

$$ E = \sqrt{ \frac{1}{N}\sum_{n=1}^{N} (t_{n} - y_{n})^{2}}$$

It computes the square root of the mean squared difference between the targets $t_{n}$ and the predictions $y_{n}$ over all $N$ samples.
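
As a preview of the modelling part, here is a minimal sketch of such a regressor. The hyperparameter values are illustrative placeholders, not the tuned ones, and `X_train`/`y_train` in the commented usage are hypothetical feature and target arrays:

```
# Minimal CatBoost setup with RMSE as loss and evaluation metric;
# all hyperparameter values below are illustrative, not tuned
model = CatBoostRegressor(
    loss_function="RMSE",
    eval_metric="RMSE",
    iterations=1000,       # illustrative
    learning_rate=0.1,     # illustrative
    random_seed=0,
    verbose=False
)

# Hypothetical usage with early stopping on a validation pool:
# train_pool = Pool(X_train, y_train, cat_features=["StockCode"])
# valid_pool = Pool(X_valid, y_valid, cat_features=["StockCode"])
# model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=50)
```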