Within this kernel we will analyse the sales data of a UK online retailer. Since storage space can be expensive and fast, on-time delivery is important to stay ahead of the competition, we would like to help the retailer by predicting the daily amounts of products sold as well as the revenues gained. This way we will try to answer:
Work under construction:
- How to predict daily product sales? (almost complete)
- What do we know about the products?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from catboost import CatBoostRegressor, Pool, cv
from sklearn.model_selection import TimeSeriesSplit
from catboost import MetricVisualizer
from os import listdir
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
import shap
shap.initjs()
print(listdir("../input"))
data = pd.read_csv("../input/data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})
data.shape
The data has 541909 entries and 8 variables.
data.head()
We can see that the data file contains one row per transaction line item. Take a look at the InvoiceNo and the CustomerID of the first entries. Here we can see that one customer with ID 17850 from the United Kingdom placed a single order with the InvoiceNo 536365. The customer ordered several products with different stock codes, descriptions, unit prices and quantities. In addition, we can see that the InvoiceDate is the same for all of these products.
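If you like, you can verify this directly with a quick check (just a sketch, assuming InvoiceNo was parsed as a string column):

# All line items that belong to the first invoice (assumes InvoiceNo is an object/string column)
data.loc[data.InvoiceNo == "536365"]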
As we would like to predict the daily amount of product sales as well as revenues, we need to aggregate this data per day. For this purpose we compute the revenue per transaction and extract temporal features from the InvoiceDate:
data["Revenue"] = data.Quantity * data.UnitPrice
data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate,
cache=True)
data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear
As the key task of this kernel is to predict the revenue gained and the number of products sold per day, we can sum up the daily quantities and revenues per product stock code and description:
grouped_features = ["Year", "Quarter","Month", "Week", "Weekday", "Dayofyear", "Day",
"StockCode", "Description"]
This way we lose information about customers and countries, but we will recover it later on in this kernel.
daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum(),
columns=["Quantity"])
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()
daily_data.head()
daily_data["Date"] = pd.to_datetime(daily_data[['Year', 'Month', 'Day']])
daily_data.Date.max() - daily_data.Date.min()
print("Datafile starts with timepoint {}".format(daily_data.Date.min()))
print("Datafile ends with timepoint {}".format(daily_data.Date.max()))
We can see that the data covers roughly one year, from December 2010 to December 2011.
samples = daily_data.shape[0]
samples
daily_data.StockCode.nunique()
daily_data.Description.nunique()
We have more descriptions than product codes:
daily_data.Description.nunique() - daily_data.StockCode.nunique()
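To see where this surplus comes from, a small check (just a sketch) lists the stock codes that carry more than one distinct description:

# Stock codes with more than one distinct description
desc_per_code = daily_data.groupby("StockCode").Description.nunique()
desc_per_code[desc_per_code > 1].sort_values(ascending=False).head(10)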
daily_data.loc[:, ["Quantity", "Revenue"]].describe()
As we can see from their min and max values, both target variables exhibit extreme outliers. If we want to use them as targets, we should exclude these outliers, as they will mislead our validation. Since I would like to use early stopping, outliers in the validation data would directly influence the training of the predictive models as well.
low_quantity = daily_data.Quantity.quantile(0.05)
high_quantity = daily_data.Quantity.quantile(0.95)
print((low_quantity, high_quantity))
low_revenue = daily_data.Revenue.quantile(0.05)
high_revenue = daily_data.Revenue.quantile(0.95)
print((low_revenue, high_revenue))
Let's only use the target ranges that are occupied by 90 % of the data entries per variable. This is a first, easy strategy to exclude heavy outliers, but we should always be aware that we lose the information carried by the entries we have excluded. In general it could be useful to understand and analyse what has caused these outliers.
daily_data = daily_data.loc[
(daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
daily_data = daily_data.loc[
(daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]
How many entries have we lost?
samples - daily_data.shape[0]
Let's take a look at the remaining distributions of daily quantities and revenues:
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.distplot(daily_data.Quantity.values, kde=False, ax=ax[0], color="Orange");
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_ylabel("Frequency");
ax[0].set_title("How many products are sold per day?");
sns.distplot(daily_data.Revenue.values, kde=True, ax=ax[1], color="Dodgerblue");
ax[1].set_xlabel("Daily revenue");
ax[1].set_ylabel("Frequency");
ax[1].set_title("How many revenue is gained per day?");
We can see that the distributions are right-skewed: lower values are more common. In addition, the daily sales quantities seem to be multimodal. A daily sale of 1 is common, as are quantities of 12 and 24. This pattern is very interesting and suggests that many products are ordered in fixed pack sizes such as dozens. The daily revenues are right-skewed as well, and low values are most common. In a nutshell, specific products are often bought as single units or in small bunches, and the revenue gained this way is relatively low.
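A quick look at the most frequent daily quantities (a small check, not part of the modelling pipeline) makes these modes easy to inspect:

# Most frequent daily sales quantities per product
daily_data.Quantity.value_counts().head(10)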
daily_data.head()
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Quantity.sum(), marker='+', c="darkorange")
plt.plot(daily_data.groupby("Date").Quantity.sum().rolling(window=30, center=True).mean(),
c="red")
plt.xticks(rotation=90);
plt.title("How many quantities are sold per day over the given time?");
The daily amount of products sold shows the expected increase in the months before Christmas and seems to be lowest in February. We see high fluctuations between days, ranging from roughly 2500 to roughly 19000 products per day. The average amount of product sales is around 7500; it increases strongly after September and roughly doubles by the end of November. This emphasizes the need for optimized storekeeping. Knowing which products will be in high demand by important customers could be an advantage, allowing fast delivery without unnecessarily overcrowded stocks.
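To quantify this seasonal increase, a small sketch computes the monthly mean of the daily totals:

# Monthly mean of the daily totals; a quick check of the pre-Christmas increase described above
daily_totals = daily_data.groupby("Date").Quantity.sum()
daily_totals.resample("M").mean()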
plt.figure(figsize=(20,5))
plt.plot(daily_data.groupby("Date").Revenue.sum(), marker='+', c="royalblue")
plt.plot(daily_data.groupby("Date").Revenue.sum().rolling(window=30, center=True).mean(),
c="darkblue")
plt.xticks(rotation=90);
plt.title("How many revenue is gained per day over the given time?");
The daily revenue shows roughly the same temporal pattern as the quantities.
fig, ax = plt.subplots(1,2,figsize=(20,5))
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
yearmonth = ["Dec-2010", "Jan-2011", "Feb-2011", "Mar-2011", "Apr-2011", "May-2011",
"Jun-2011", "Jul-1011", "Aug-2011", "Sep-2011", "Oct-2011", "Nov-2011",
"Dec-2011"]
daily_data.groupby("Weekday").Quantity.sum().plot(
ax=ax[0], marker='o', label="Quantity", c="darkorange");
daily_data.groupby("Weekday").Revenue.sum().plot(
ax=ax[0], marker='o', label="Revenue", c="royalblue");
ax[0].legend();
ax[0].set_xticks(np.arange(0,7))
ax[0].set_xticklabels(weekdays);
ax[0].set_title("Total sales per weekday");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Quantity.sum().values,
marker='o', label="Quantities", c="darkorange");
ax[1].plot(daily_data.groupby(["Year", "Month"]).Revenue.sum().values,
marker='o', label="Revenue", c="royalblue");
ax[1].set_xticks(np.arange(0, len(yearmonth)))
ax[1].set_xticklabels(yearmonth, rotation=90)
ax[1].legend();
ax[1].set_title("Total sales per month");
Both visualisations yield further interesting insights into the weekday and monthly sales patterns.
In this kernel I would like to use CatBoost as the predictive model. Predicting daily quantities and revenues are both regression tasks, so I will use the CatBoost regressor. The loss and metric I will use is the root mean square error (RMSE):
$$ E = \sqrt{ \frac{1}{N}\sum_{n=1}^{N} (t_{n} - y_{n})^{2}}$$

It computes the error between the target value $t_{n}$ and the predicted value $y_{n}$ for each sample and squares it, so that positive and negative deviations contribute to the sum in the same way. Then the mean is taken by dividing by the total number $N$ of samples (entries) in the data. Finally, to obtain an impression of the error of a single prediction, the square root is taken. What should we keep in mind when working with this loss and metric function? :-) It's heavily influenced by outliers! If some predictions are far away from their targets, they will pull the mean towards higher values. Hence we may make nice predictions for the majority of samples while the RMSE is still high due to large errors on a minority of samples.
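As a small illustration (not part of the original pipeline), the RMSE can be computed with numpy; note how a single outlier dominates the score:

# Minimal RMSE sketch: one large error drives the whole score
def rmse(targets, predictions):
    targets, predictions = np.asarray(targets), np.asarray(predictions)
    return np.sqrt(np.mean((targets - predictions) ** 2))

rmse([10, 12, 11, 10], [10, 12, 11, 30])  # = 10.0, caused almost entirely by the single outlier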
As the data covers only one year and product sales increase strongly during the pre-Christmas period, we need to select the validation data carefully. To start, I will use all days of November and December 2011 for validation. Doing so we cover the trend of the pre-Christmas period and can check whether the model is able to explain the strong increase of quantities and revenues during November.
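Expressed in pandas, the hold-out could look like this small sketch (the same criterion is used inside the Catmodel class below):

# Hold-out mask: all days of November and December 2011 go to validation
val_mask = (daily_data.Year == 2011) & (daily_data.Month.isin([11, 12]))
daily_data.loc[~val_mask].shape, daily_data.loc[val_mask].shape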
To easily generate comparable new model instances, I wrote a class that can take catboost data pools or pandas dataframes to train the catboost model. It yields a root mean square error evaluation score and can show feature importances.
class Catmodel:
def __init__(self, name, params):
self.name = name
self.params = params
def set_data_pool(self, train_pool, val_pool):
self.train_pool = train_pool
self.val_pool = val_pool
def set_data(self, X, y):
val_criteria = (X.Month.isin([11, 12])) & (X.Year == 2011)
        cat_features_idx = np.where(X.dtypes != float)[0]  # treat every non-float column as categorical
x_train, self.x_val = X.loc[val_criteria==False], X.loc[val_criteria==True]
y_train, self.y_val = y.loc[val_criteria==False], y.loc[val_criteria==True]
self.train_pool = Pool(x_train, y_train, cat_features=cat_features_idx)
self.val_pool = Pool(self.x_val, self.y_val, cat_features=cat_features_idx)
def prepare_model(self):
        self.model = CatBoostRegressor(
            loss_function=self.params.loss,
            random_seed=self.params.seed,
            logging_level='Silent',
            iterations=self.params.iterations,
            max_depth=self.params.max_depth,
            #learning_rate=self.params.learning_rate,
            l2_leaf_reg=self.params.l2_leaf_reg,
            od_type='Iter',
            od_wait=40,
            train_dir=self.name
        )
def learn(self, plot=False):
self.prepare_model()
self.model.fit(self.train_pool, eval_set=self.val_pool, plot=plot);
print("{}, early-stopped model tree count {}".format(
self.name, self.model.tree_count_
))
def score(self):
return self.model.score(self.val_pool)
def show_importances(self, kind="bar"):
explainer = shap.TreeExplainer(self.model)
shap_values = explainer.shap_values(self.val_pool)
if kind=="bar":
return shap.summary_plot(shap_values, self.x_val, plot_type="bar")
return shap.summary_plot(shap_values, self.x_val)
def get_val_results(self):
self.results = pd.DataFrame(self.y_val)
self.results["prediction"] = self.predict(self.x_val)
self.results["error"] = np.abs(
self.results[self.results.columns.values[0]].values - self.results.prediction)
self.results["Month"] = self.x_val.Month
self.results["SquaredError"] = self.results.error.apply(lambda l: np.power(l, 2))
def show_val_results(self):
self.get_val_results()
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.distplot(self.results.error, ax=ax[0])
ax[0].set_xlabel("Single absolute error")
ax[0].set_ylabel("Density")
ax[0].axvline(np.median(self.results.error), c="black")
ax[1].scatter(self.results.prediction.values,
self.results[self.results.columns[0]].values,
c=self.results.error, cmap="RdYlBu_r")
ax[1].set_xlabel("Prediction")
ax[1].set_ylabel("Target")
return ax
def get_monthly_RMSE(self):
return self.results.groupby("Month").SquaredError.mean().apply(lambda l: np.sqrt(l))
def predict(self, x):
return self.model.predict(x)
I would like to use a single class for the hyperparameters, which can either be passed directly to the model or be set by a hyperparameter search:
class CatHyperparameter:
def __init__(self,
loss="RMSE",
metric="RMSE",
iterations=400,
max_depth=6,
l2_leaf_reg=3,
#learning_rate=0.05,
seed=0):
        self.loss = loss
        self.metric = metric
        self.max_depth = max_depth
        self.l2_leaf_reg = l2_leaf_reg
        #self.learning_rate = learning_rate
        self.iterations = iterations
        self.seed = seed
Let's see how well this model performs without feature engineering and hyperparameter tuning:
X = daily_data.drop(["Quantity", "Revenue", "Date"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()
model = Catmodel("baseline", params)
model.set_data(X,y)
model.learn(plot=True)
If you have forked this kernel and are in interactive mode, you can see that the model loss has converged. How big is the root mean square error evaluated on the validation data?
model.score()
This value close to 12 is high, but as already mentioned the RMSE is influenced by outliers, so we should take a look at the distribution of the individual absolute errors:
model.show_val_results();
model.show_importances()
model.show_importances(kind=None)
import GPyOpt
class Hypertuner:
def __init__(self, model, max_iter=10, max_time=10):
        self.bounds = [
            # GPyOpt discrete domains enumerate all allowed values explicitly
            {'name': 'depth', 'type': 'discrete', 'domain': tuple(range(1, 17))},
            {'name': 'l2_leaf_reg', 'type': 'discrete', 'domain': tuple(range(1, 81))}
        ]
self.model = model
self.max_iter=max_iter
self.max_time=max_time
def objective(self, params):
        params = params[0]
        params = CatHyperparameter(
            max_depth=int(params[0]),   # GPyOpt passes float values; CatBoost expects an integer depth
            l2_leaf_reg=params[1]
        )
self.model.params = params
self.model.learn()
return self.model.score()
def learn(self):
np.random.seed(777)
optimizer = GPyOpt.methods.BayesianOptimization(
f=self.objective, domain=self.bounds,
acquisition_type ='MPI',
acquisition_par = 0.1,
exact_eval=True)
optimizer.run_optimization(self.max_iter, self.max_time)
optimizer.plot_convergence()
print(optimizer.X[np.argmin(optimizer.Y)])
print('RMSE:', np.min(optimizer.Y))
Set it to learn to find a more suitable tree depth and l2-leaf-reg (to reduce overfitting). But this takes some time... ;-)
search = Hypertuner(model)
#search.learn()
Given the data, we are able to include much more information in our daily aggregation. Let's start by engineering new features from the StockCode:
products = pd.DataFrame(index=daily_data.StockCode.unique())
Currently we use November and December 2011 as our validation data in a hold-out fashion. To prevent leakage of future information, we need to perform all feature engineering on data with timestamps before this validation period:
fe_selector = (data.Year == 2011) & (data.Month.isin([11, 12]))
fe_data = data.loc[fe_selector == False].copy()
# the mask above is indexed like data; build a separate one for the daily aggregation
fe_daily_selector = (daily_data.Year == 2011) & (daily_data.Month.isin([11, 12]))
fe_daily_data = daily_data.loc[fe_daily_selector == False].copy()
attraction_map = fe_data.groupby("StockCode").CustomerID.nunique()
quarter_attraction = fe_data.loc[fe_data.Year==2010].groupby(["StockCode", "Quarter"]).CustomerID.nunique()
monthly_attraction = fe_data.groupby(["StockCode", "Month"]).CustomerID.nunique().unstack()
weekday_attraction = fe_data.groupby(["StockCode", "Weekday"]).CustomerID.nunique().unstack().fillna(0)
week_attraction = fe_data.groupby(["StockCode", "Week"]).CustomerID.nunique().unstack().fillna(0)
products["Attraction"] = products.index.map(attraction_map)
for col in monthly_attraction.columns.values:
products["Attraction_Month_" + str(col)] = products.index.map(monthly_attraction.loc[:, col])
products = products.fillna(0)
monthly_prices = fe_data.groupby(
["StockCode", "Month"]
).UnitPrice.median().reset_index()
pre_christmas_prices = monthly_prices.loc[
monthly_prices.Month.isin([9,10,11])
].groupby("StockCode").UnitPrice.median()
not_christmas_prices = monthly_prices.loc[
monthly_prices.Month.isin([9,10,11]) == False
].groupby("StockCode").UnitPrice.median()
month_medianprice_map = monthly_prices.groupby("StockCode").UnitPrice.std()
mean_price_map = fe_data.groupby("StockCode").UnitPrice.mean()
median_price_map = fe_data.groupby("StockCode").UnitPrice.median()
std_price_map = fe_data.groupby("StockCode").UnitPrice.std()
customer_medianprice_map = fe_data.groupby(
["StockCode", "CustomerID"]
).UnitPrice.median().reset_index().groupby("StockCode").UnitPrice.std()
## Unit Price information
products["MeanPrice"] = products.index.map(mean_price_map)
products["MedianPrice"] = products.index.map(median_price_map)
products["StdPrice"] = products.index.map(std_price_map)
products.StdPrice = products.StdPrice.fillna(0)
products["PriceStability"] = np.where(products.StdPrice==0, 1, 0)
products["CustomerMedianPriceVariation"] = products.index.map(customer_medianprice_map)
products["MonthMedianPriceVariation"] = products.index.map(month_medianprice_map)
products.CustomerMedianPriceVariation = products.CustomerMedianPriceVariation.fillna(0)
products["PreChristmasMedianPrice"] = products.index.map(pre_christmas_prices)
products["NotChristmasMedianPrice"] = products.index.map(not_christmas_prices)
products["ChristmasPriceDifference"] = products.PreChristmasMedianPrice - products.NotChristmasMedianPrice
products.head()
We now need to map the engineered product features onto the entries of the training and validation data:
for col in products.columns:
daily_data[col] = daily_data.StockCode.map(products.loc[:, col])
X = daily_data.drop(["Quantity", "Revenue", "Date", "StockCode", "Description"], axis=1)
y = daily_data.Quantity
params = CatHyperparameter()
model = Catmodel("product_features", params)
model.set_data(X,y)
model.learn(plot=True)
model.show_importances()
model.show_val_results();
class CatCvFamily:
def __init__(self, params, X, y, n_splits=2):
self.family = {}
        self.cat_features_idx = np.where(X.dtypes != float)[0]
        # keep X and y as pandas objects so the integer (categorical) columns are not cast to float
        self.X = X
        self.y = y
self.n_splits = n_splits
self.params = params
def set_validation_strategy(self):
self.cv = TimeSeriesSplit(max_train_size = None,
n_splits = self.n_splits)
self.gen = self.cv.split(self.X)
    def get_split(self):
        train_idx, val_idx = next(self.gen)
        x_train, x_val = self.X.iloc[train_idx], self.X.iloc[val_idx]
        y_train, y_val = self.y.iloc[train_idx], self.y.iloc[val_idx]
        self.last_x_val = x_val  # keep the latest validation features for the SHAP summary plot
        train_pool = Pool(x_train, y_train, cat_features=self.cat_features_idx)
        val_pool = Pool(x_val, y_val, cat_features=self.cat_features_idx)
        return train_pool, val_pool
def learn(self):
self.set_validation_strategy()
self.model_names = []
self.model_scores = []
for split in range(self.n_splits):
name = 'Model_cv_' + str(split) + '/'
train_pool, val_pool = self.get_split()
self.model_names.append(name)
self.family[name], score = self.fit_catmodel(name, train_pool, val_pool)
self.model_scores.append(score)
    def fit_catmodel(self, name, train_pool, val_pool):
        # Catmodel only takes (name, params); the data pools are attached separately
        cat = Catmodel(name, self.params)
        cat.set_data_pool(train_pool, val_pool)
        cat.learn()
        score = cat.score()
        return cat, score
def score(self):
return np.mean(self.model_scores)
def show_learning(self):
widget = MetricVisualizer(self.model_names)
widget.start()
    def show_importances(self):
        name = self.model_names[-1]
        cat = self.family[name]
        explainer = shap.TreeExplainer(cat.model)
        shap_values = explainer.shap_values(cat.val_pool)
        # use the stored validation features instead of the global X
        return shap.summary_plot(shap_values, self.last_x_val, plot_type="bar")
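A possible usage of this cross-validation wrapper could look like the following sketch; it is only meant to illustrate the intended workflow and is not executed here:

# Sketch only: time-series cross-validation with the wrapper defined above
cv_params = CatHyperparameter()
cv_family = CatCvFamily(cv_params, X, y, n_splits=2)
cv_family.learn()
print("Mean validation RMSE over the folds:", cv_family.score())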
To understand the quantity outliers, we should take a closer look at some of them. We can focus on the 0.01 % with the lowest quantities and the 0.01 % with the highest:
low = data.Quantity.quantile(0.0001)
high = data.Quantity.quantile(0.9999)
low_outliers = data.loc[(data.Quantity < low)]
low_outliers.head(5)
Very interesting! Take a look at the product descriptions as well as the customer IDs. Sometimes the latter is missing and the description is not a product description but a return note like "mouldy, thrown away". In these cases the unit price is zero as well, and these entries do not carry the "C" in front of the InvoiceNo. The "C" tells us that a transaction was cancelled. You can see that both of the remaining example outliers are product returns.
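A quick check (just a sketch, assuming InvoiceNo was parsed as strings) shows which share of all transactions carries the cancellation prefix:

# Share of transactions whose InvoiceNo starts with "C" (cancellations)
data.InvoiceNo.astype(str).str.startswith("C").mean()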
high_outliers = data.loc[(data.Quantity > high)]
high_outliers.head(5)
In contrast to the findings above, these seem to be valid transactions, but perhaps they were later returned as erroneous purchases.
To analyse these kinds of entries we need to generate new features in our data. In addition, we should check whether there are entries with a negative unit price or a quantity of zero:
data[data.UnitPrice < 0]
Ah! :-) It's also possible that the retailer has to pay for an open debt.
data[data.Quantity == 0]
Ok, we have no entries with zero quantities. Now, let's create the new features:
data["Cancelled"] = np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), 1, 0)
data["UnknownCustomer"] = np.where(data.CustomerID.isna(), 1, 0)
data["ZeroPrice"] = np.where(data.UnitPrice == 0, 1, 0)
data["Return"] = np.where(data.Quantity <= 0, 1, 0)
By diving into the outliers of quantities and unit prices, we have already found out that some of them are related to returns. To answer the question which pathways returned products can take, let's look at the interaction of these new features:
marker = ["Cancelled", "UnknownCustomer", "ZeroPrice", "Return"]
marker_counts = 100 * data.loc[:, marker].sum().sort_values(ascending=False) / data.shape[0]
interactions = pd.DataFrame(index=marker, columns=marker, dtype=float)
for feat in marker:
interactions.loc[feat] = 100 * data.loc[data[feat] == 1, marker].sum() / data.loc[:, feat].sum()
interactions = interactions.apply(np.round)
fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.heatmap(data=interactions,
cmap="PuRd",
annot=True, fmt="g",
linewidth=2,
cbar=False, ax=ax[0]);
sns.barplot(x=marker_counts.index, y=marker_counts.values,
palette="PuRd_r", ax=ax[1]);
ax[1].set_ylabel("% in data");
An official return is indicated by a cancellation of the transaction. In these cases the customer is often known and the product has an official price that contributes to the revenue values. In contrast, there exists at least one more pathway for returns: in the case of zero prices and negative quantities, the products are often damaged or lost. If you browse through the entries, you can see that most of them have missing descriptions (indicated by NaN). On the other hand, products with zero prices and positive quantities are often marked as found, have to be checked, or have "amazon" in their description.
data.loc[(data.ZeroPrice==1) & (data.Return == 0)].head()
Let's close this section by introducing new features that fuse what we have found:
data["ReturnedCancelled"] = np.where((data.Return == 1) & (data.Cancelled == 1),1,0)
data["ReturnedZeroPrice"] = np.where((data.Return == 1) & (data.ZeroPrice == 1),1,0)
data["NotReturnedZeroPrice"] = np.where((data.Return == 0) & (data.ZeroPrice == 1),1,0)
data.loc[:, ["ReturnedCancelled", "ReturnedZeroPrice", "NotReturnedZeroPrice"]].sum()
Products that were purchased mistakenly in very high amounts will cause very high outliers in the quantities sold and the revenues gained. Such outliers can be very misleading for predictive models, and we should always be aware of them. To set up a model for the retailer, we need to understand the different kinds of outliers. Doing so, we might be able to build strategies that detect them and avoid a negative impact on the forecast.
By exploring the data, we have already found some differences in how products can be returned. To understand how the data was collected and how the company deals with outliers, let's take a look at the most extreme quantity outlier we can observe:
data.loc[(data.Quantity == data.Quantity.min()) | (data.Quantity == data.Quantity.max())]
We can see that this purchase was made by mistake: a customer ordered a product in a very high quantity and cancelled the transaction a few minutes later. The absolute quantities are the same, as is the stock code. But unfortunately the invoice number of the cancellation does not match the original invoice number once you remove the leading "C". A first piece of advice for the retailer would be to change this, as it would make it much easier to detect such pairs.
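One possible detection strategy, sketched here as an assumption and not used further in this kernel, would be to match cancellations to earlier purchases of the same customer with the same stock code and absolute quantity:

# Sketch: pair cancellations with purchases of the same customer, stock code and absolute quantity
cancels = data.loc[(data.Cancelled == 1) & data.CustomerID.notna(),
                   ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]].copy()
cancels["AbsQuantity"] = cancels.Quantity.abs()
purchases = data.loc[(data.Cancelled == 0) & data.CustomerID.notna(),
                     ["CustomerID", "StockCode", "Quantity", "InvoiceDate"]].copy()
purchases["AbsQuantity"] = purchases.Quantity.abs()
pairs = cancels.merge(purchases, on=["CustomerID", "StockCode", "AbsQuantity"],
                      suffixes=("_cancel", "_purchase"))
# keep only pairs where the purchase happened before (or at the same time as) the cancellation
pairs = pairs.loc[pairs.InvoiceDate_purchase <= pairs.InvoiceDate_cancel]
pairs.head()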
How many different countries are delivered by the retailer in this data?
data.Country.nunique()
To analyse the countries further, let's distinguish between the retailer's home country, the United Kingdom, and all other countries by introducing a new feature. I expect that most customers reside in the UK, and consequently it might be useful to distinguish between domestic and foreign orders.
data["UK"] = np.where(data.Country == "United Kingdom", 1, 0)
In addition, we need a feature for the revenue the retailer gained with each transaction. It is given by the product of the quantity ordered and the unit price of the product:
data["Revenue"] = data.Quantity * data.UnitPrice
country_quantities = data.groupby("Country").Quantity.sum()
country_revenues = data.groupby("Country").Revenue.sum()
revenue_per_quantity = country_revenues / country_quantities
uk_quantities = data.groupby("UK").Quantity.sum().sort_values(ascending=False)
uk_revenues = data.groupby("UK").Revenue.sum().sort_values(ascending=False)
top5_foreign_most_products = country_quantities.sort_values(ascending=False).iloc[1:6]
top5_foreign_highest_revenue = country_revenues.sort_values(ascending=False).iloc[1:6]
top5_revenue_per_quantity = revenue_per_quantity.loc[
top5_foreign_most_products.index
].sort_values(ascending=False)
uk_revenue_per_quantity = data.groupby("UK").Revenue.sum() / data.groupby("UK").Quantity.sum()
fig, ax = plt.subplots(3,2,figsize=(20,19))
total_quantities = data.Quantity.sum() * 1/100
total_revenues = data.Revenue.sum() * 1/100
ax[0,0].set_title("Out- and inside UK - Product sales");
sns.barplot(uk_quantities.index, uk_quantities.values/total_quantities, ax=ax[0,0], palette="Purples")
ax[0,0].set_ylabel("% of total product sales")
ax[0,0].set_xticklabels(["outside", "inside"])
ax[0,0].set_xlabel("")
ax[0,1].set_title("Top 5 foreign countries - Product sales");
sns.barplot(top5_foreign_most_products.index,
top5_foreign_most_products.values/total_quantities,
ax=ax[0,1], palette="Purples_r")
ax[0,1].set_xlabel("");
ax[0,1].set_ylabel("% of total product sales")
ax[1,0].set_title("Out- and inside UK - Revenues");
sns.barplot(uk_revenues.index, uk_revenues.values/total_revenues, ax=ax[1,0], palette="Blues")
ax[1,0].set_ylabel("% of total revenue")
ax[1,0].set_xticklabels(["outside", "inside"])
ax[1,0].set_xlabel("")
ax[1,1].set_title("Top 5 foreign countries - Revenues");
sns.barplot(top5_foreign_highest_revenue.index,
top5_foreign_highest_revenue.values/total_revenues,
ax=ax[1,1], palette="Blues_r")
ax[1,1].set_xlabel("");
ax[1,1].set_ylabel("% of total revenue")
ax[2,0].set_title("Out- and inside UK - Revenue per quantity");
sns.barplot(uk_revenue_per_quantity.index, uk_revenue_per_quantity.values,
ax=ax[2,0], palette="Greens")
ax[2,0].set_ylabel("revenue/quantity")
ax[2,0].set_xticklabels(["outside", "inside"])
ax[2,0].set_xlabel("")
ax[2,1].set_title("Top5 foreign countries - Revenue per quantity")
sns.barplot(top5_revenue_per_quantity.index,
top5_revenue_per_quantity.values, ax=ax[2,1], palette="Greens_r");
ax[2,1].set_xlabel("");
ax[2,1].set_ylabel("revenue/quantity");
Imagine you need to know what kind of products will be sold tomorrow, next week or next month to optimize the inventory and storage strategy for your products. A simple way to start is to focus on the products that are best-sellers or that show seasonal patterns (for example, products that are only sold before Christmas but in very high amounts). A next step could then be to include more and more products such that the optimization covers more and more of your stock.
products = pd.DataFrame(data.groupby("StockCode").Quantity.sum(),
columns=["Quantity"])
products["Revenue"] = data.groupby("StockCode").Revenue.sum()
products["MeanUnitPrice"] = data.groupby("StockCode").UnitPrice.mean()
products["StdUnitPrice"] = data.groupby("StockCode").UnitPrice.std().fillna(0)
products["MedianUnitPrice"] = data.groupby("StockCode").UnitPrice.median()
products["Q25UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.25)
products["Q75UnitPrice"] = data.groupby("StockCode").UnitPrice.quantile(0.75)
#data.loc[data.Year==2011].groupby(["StockCode", "Month"]).Quantity.sum().unstack().fillna(0)
#data.loc[data.Year==2011].groupby(["StockCode", "Weekday"]).Quantity.sum().unstack().fillna(0)
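Following the best-seller idea from above, a short sketch lists the products with the highest total quantities sold:

# A quick look at potential best-sellers: the ten products with the highest total quantity sold
products.sort_values("Quantity", ascending=False).head(10)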