This notebook contains my solution to the Kaggle Recruit Restaurant Visitor Forecasting competition. I went for simplicity and used a single, global model. I spent most of my time doing feature engineering, mainly focused on extracting so-called rolling statistics.
By itself the output of this notebook scores 0.512 on the private leaderboard, which is enough to be part of the top 25. I then averaged my solution with the Surprise Me public kernel to score 0.507. You could say that I only did half the work. However, if the public kernel had not been available I would simply have built another model and used it instead. Blending public kernels has become ubiquitous for doing well on Kaggle.
For obvious reasons I haven't committed the data files to GitHub. To run this kernel yourself as it is, you will have to organise your files in a certain way. All the files provided by Kaggle should be stored in the data/kaggle directory. Additionally, the weather data located here has to be kept in the data/weather directory.
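If you want a quick sanity check before running anything, here is a small sketch that verifies the layout assumed throughout this notebook (the file names are simply the ones loaded in the code cells further down):

from pathlib import Path

# Files loaded later in this notebook; adjust the list if your layout differs
expected = [
    'data/kaggle/air_visit_data.csv',
    'data/kaggle/date_info.csv',
    'data/kaggle/sample_submission.csv',
    'data/weather/air_store_info_with_nearest_active_station.csv',
]

for path in expected:
    assert Path(path).exists(), 'Missing file: {}'.format(path)

# The per-station weather files live in a subdirectory of data/weather
assert any(Path('data/weather/1-1-16_5-31-17_Weather_Translated').glob('*.csv'))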
First of all let's load the visit data. I did something very important in the following code block: instead of taking the visit data as is, I resampled it by day so that days with no data points get a visit count of 0. This is important for computing rolling features based on time. I also keep track of whether a data point was included in the original dataset or was added by the resampling procedure.
import pandas as pd
air_visit = pd.read_csv('data/kaggle/air_visit_data.csv')
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
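To see what the resampling does, here is a toy illustration with made-up data. The block above relies on empty daily bins coming back as NaN so that was_nil can flag them; depending on your pandas version, sum() may return 0 for empty bins instead, in which case passing min_count=1 (as done below, not part of the original code) restores the NaN behaviour:

# Toy example, not part of the competition data
toy = pd.Series([5, 3], index=pd.to_datetime(['2017-01-01', '2017-01-03']))

# 2017-01-02 has no data point, so it comes back as NaN,
# which would then be flagged by was_nil and filled with 0
print(toy.resample('1d').sum(min_count=1))
# 2017-01-01    5.0
# 2017-01-02    NaN
# 2017-01-03    3.0
# Freq: D, dtype: float64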
Next let's take care of the calendar information. Apart from renaming the columns for practical reasons, I added two features indicating whether the previous or the next day is a holiday.
date_info = pd.read_csv('data/kaggle/date_info.csv')
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
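The direction of the shifts can be a bit confusing, so here is a tiny made-up example: shift() pulls in the previous day's flag, while shift(-1) pulls in the next day's flag.

# Made-up holiday flags for three consecutive days
flags = pd.Series([0, 1, 0], index=['2016-01-01', '2016-01-02', '2016-01-03'])

print(flags.shift().fillna(0).tolist())    # [0.0, 0.0, 1.0] -> previous day's flag
print(flags.shift(-1).fillna(0).tolist())  # [1.0, 0.0, 0.0] -> next day's flag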
As for the store information, I used the preprocessed version coming from the weather data instead of the "official" Kaggle version. The reason is that the preprocessed version contains the nearest weather station of each restaurant, which is needed for joining the weather features.
air_store_info = pd.read_csv('data/weather/air_store_info_with_nearest_active_station.csv')
air_store_info.head()
Now let's handle the test set. Because of the timeseries nature of the competition, the test split is implicitly contained in the example submission. We can extract the store id and the date from the id column. I also keep track of the order of the test set because I'm a maniac and I want to make sure my submission is sorted in the right order.
import numpy as np
submission = pd.read_csv('data/kaggle/sample_submission.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
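The slicing works because every id in the sample submission has the form <air_store_id>_<visit_date>, and the store id is always 20 characters long. A quick check on a made-up id:

# Made-up id following the format of the sample submission
example_id = 'air_0123456789abcdef_2017-04-23'

assert example_id[:20] == 'air_0123456789abcdef'  # air_store_id
assert example_id[21:] == '2017-04-23'            # visit_date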
Now that we have the training set and the test set in the same format, we can merge them together. Merging the training and test sets is a good idea because it lets us extract features in one go. Let's also merge the full dataset with the calendar information and the store information extracted previously.
data = pd.concat((air_visit, submission.drop('id', axis='columns')))
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
Now for the messy part. The weather data is stored in individual files, one file per weather station. We also know each restaurant's closest weather station. To make things even more enjoyable, there are some missing values for each weather station. Let's start by loading the data.
import glob
import os

weather_dfs = []

for path in glob.glob('data/weather/1-1-16_5-31-17_Weather_Translated/*.csv'):
    weather_df = pd.read_csv(path)
    # Use the file name without its .csv extension as the station id; this is
    # more portable than splitting on '\\', which only works on Windows
    weather_df['station_id'] = os.path.splitext(os.path.basename(path))[0]
    weather_dfs.append(weather_df)
weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
After some testing and browsing the Kaggle forums, it seemed worthwhile to use only the precipitation and temperature features. I handled the missing values by replacing them with the global daily average. If I wasn't so lazy I would have taken the average of the $n$ closest stations.
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
Finally let's tidy the final dataframe before extracting some juicy features.
data['visit_date'] = pd.to_datetime(data['visit_date'])
data.index = data['visit_date']
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
After a few iterations I noticed that there were many outliers in the dataset. For example, the number of visits around New Year is clearly not representative of the rest of the year. In this case outliers are values that are extremely high (there are no negative values in the dataset, duh). I didn't go out of my way and did something really simple: I defined outliers as values lying outside a confidence interval, supposing that the visits follow a normal distribution per restaurant. The 2.4 in the following code block is simply a high quantile of the normal distribution, corresponding to roughly the top 1% of values. You could just as well have used 1.96 if you wanted to flag roughly the top 2.5% of values as outliers. Once the outliers were detected, I created a new variable called visitors_capped where the outlier values are replaced with the maximum of the non-outlier values. You can read more about my method (and others) by checking out this thread.
def find_outliers(series):
    return (series - series.mean()) > 2.4 * series.std()


def cap_values(series):
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series
stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
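To make the capping concrete, here is what the two helpers do on a small made-up series. Note that in a short series a single spike inflates the standard deviation, so the spike has to be quite extreme before it gets flagged.

# Made-up series of daily visits with one extreme spike
toy_visits = pd.Series([10, 12, 11, 13, 9, 10, 12, 11, 200], dtype=float)

print(find_outliers(toy_visits))
# Only the 200 is more than 2.4 standard deviations above the mean
print(cap_values(toy_visits.copy()))
# The 200 is replaced by 13, the maximum of the non-outlier values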
Next I extracted some simple features. The day of the month feature is quite interesting because it can be seen as a proxy for when people get paid during the month, supposing they get paid on a monthly basis.
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
The following code block is probably the most important part of this notebook. Exponentially weighted means (EWM) are a very good way to capture the trend of a timeseries. However, they rely on a parameter alpha which controls the weighting of the mean: if alpha is close to 1 then recent values weigh more than older ones. Choosing alpha is very important and there is no obvious way to do it. I decided to run an optimization procedure to determine the best alpha for each store's timeseries, per day of the week.
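To give a feel for what alpha does, here is a small made-up example comparing a high and a low alpha on a series with a level shift: the high alpha tracks the latest values almost immediately, while the low alpha averages over a much longer history.

# Made-up series that jumps from 10 to 30 halfway through
toy = pd.Series([10.0] * 5 + [30.0] * 5)

print(toy.ewm(alpha=0.9, adjust=False).mean().round(1).tolist())
# [10.0, 10.0, 10.0, 10.0, 10.0, 28.0, 29.8, 30.0, 30.0, 30.0]
print(toy.ewm(alpha=0.1, adjust=False).mean().round(1).tolist())
# [10.0, 10.0, 10.0, 10.0, 10.0, 12.0, 13.8, 15.4, 16.9, 18.2]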
The optimization is done with differential evolution. The main reason why I chose this method is that it ensures the search space is limited between 0 and 1 (the domain of alpha). The objective function is the mean squared error between the timeseries and its EWM. The EWM is computed on the timeseries shifted by 1 to make sure that the EWM can't "look" at the current value (otherwise it wouldn't work on the test set). As for the test set, the EWM values will be the same for the first week as well as the second, the third, etc. Of course this isn't ideal, but in my opinion there is not much else that can be done.
One important thing is that I applied the following procedure per store and per day of the week, but also per store and weekend indicator. The latter is less granular and is helpful for timeseries without much information. We also apply the procedure both to the capped visitor counts (calculated previously) and to their logarithm. The reason why I'm doing this for the logarithm is that it smooths the timeseries and can help a bit. Variety is the spice of life.
from scipy import optimize
def calc_shifted_ewm(series, alpha, adjust=True):
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()


def find_best_signal(series, adjust=False, eps=10e-5):

    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        mse = np.mean(np.power(series - shifted_ewm, 2))
        return mse

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])

    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
data.head()
I also extracted some more "naive" rolling features. I calculated the mean, median, standard deviation, number of values, minimum, maximum, and EWMs with several fixed alpha values per timeseries. Because the logic is always the same, I put it inside a function and made the most of pandas's groupby and apply.
def extract_precedent_statistics(df, on, group_by):

    df.sort_values(group_by + ['visit_date'], inplace=True)

    groups = df.groupby(group_by, sort=False)

    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }

    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})

    for _, group in groups:

        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)

        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())

        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())

    suffix = '_&_'.join(group_by)

    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)
data.sort_values(['air_store_id', 'visit_date']).head()
I'm going to go with boosting in the end so the categorical variables have to be one-hot encoded.
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
Now let's define the training and test sets so that we can train a model. A few columns in the dataset have to be dropped because they are useless. I went for predicting the log of the visitors; during prediction the model also outputs the log of the visitors, and to get back to the original magnitude I simply apply the inverse transform, which works because $\mathrm{expm1}(\mathrm{log1p}(x)) = x$. I'm pretty sure anyone serious did this during the competition. One reason it works is that the competition metric is RMSLE, so minimizing the squared error on the log-transformed target optimizes the metric directly. My intuition also tells me that it helps a decision tree to pack values in a leaf because the values are "closer" to each other.
data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')
to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors', 'air_area_name',
           'station_id', 'station_latitude', 'station_longitude', 'station_vincenty',
           'station_great_circle', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')
X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']
The following code block just does a few sanity checks.
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019
Finally let's do the actual machine learning. I went for a LightGBM cross-validation setup where I train 6 models, each on a random sample of the dataset. The sampling is done without replacement, so technically this is known as pasting. I like doing parameter tuning myself and I don't use grid-search techniques because I find them too burdensome. This is pretty classic stuff and isn't the most interesting part of my notebook. A lot of Kagglers use this setup.
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection
np.random.seed(42)
model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=5 ** 2 - 1,
    learning_rate=0.007,
    n_estimators=30000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits
sub = submission['id'].to_frame()
sub['visitors'] = 0
feature_importances = pd.DataFrame(index=X_train.columns)
for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):

    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]

    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )

    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_

    print('Fold {} RMSLE: {:.5f}'.format(i+1, val_scores[i]))
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])
val_mean = np.mean(val_scores)
val_std = np.std(val_scores)
print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Now let's save the submission. I like annotating the submission with the local score to compare it with the leaderboard score.
sub.to_csv('submissions/lgbm_{:.5f}_{:.5f}.csv'.format(val_mean, val_std), index=False)
Here are the feature importances, to get a feeling for what worked well and what didn't.
feature_importances.sort_values(0, ascending=False)
That's all there is to it! Thanks for reading. I hope you picked up some useful stuff in this notebook. Feel free to shoot me an email at [email protected] if you have any questions.