Kaggle recruit restaurant solution

This notebook contains my solution to the Kaggle Recruit Restaurant Visitor Forecasting. I went for simplicity and used a single, global, model. I spent most of my time doing feature engineering. I mainly focused on extracting so-called rolling statistics.

By itself the output of this notebook scores 0.512 on the private leaderboard, which is enough to be part of the top 25. I then averaged my solution with the Surprise Me public kernel to score 0.507. You could say that I only did half the work. However if the public kernel was not available I would simply have built another model and used it instead. Blending kernels to score has become ubiquitous to do well on Kaggle.

Data loading

For obvious reasons I haven't commited the data files on GitHub. To run this kernel yourself as it is you will have to organise your files in a certain way. All the files provided by Kaggle should be stored in the data/kaggle directory. Additionally, the weather data located here has to be kept in the data/weather directory.

First of all let's load the visit data. I did something very important in the following code block. Instead of taking the visiting data as is, I resampled it by day so that for days where there are no data points the number of visits is 0. This is important for computing rolling features based on time. I also keep track of whether a point was included in the original dataset or was added by the resampling procedure.

In [2]:
import pandas as pd

air_visit = pd.read_csv('data/kaggle/air_visit_data.csv')
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)

air_visit.head()
Out[2]:
air_store_id visit_date visitors was_nil
0 air_00a91d42b08b08d9 2016-07-01 35.0 False
1 air_00a91d42b08b08d9 2016-07-02 9.0 False
2 air_00a91d42b08b08d9 2016-07-03 0.0 True
3 air_00a91d42b08b08d9 2016-07-04 20.0 False
4 air_00a91d42b08b08d9 2016-07-05 25.0 False

Next let's take care of the calendar information. Apart from remaining the column names for practical reasons I added two features indicating if the previous or the next day is a holiday.

In [3]:
date_info = pd.read_csv('data/kaggle/date_info.csv')
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)

date_info.head()
Out[3]:
visit_date day_of_week is_holiday prev_day_is_holiday next_day_is_holiday
0 2016-01-01 Friday 1 0.0 1.0
1 2016-01-02 Saturday 1 1.0 1.0
2 2016-01-03 Sunday 1 1.0 0.0
3 2016-01-04 Monday 0 1.0 0.0
4 2016-01-05 Tuesday 0 0.0 0.0

As for the store information I used the preprocessed version coming from the weather data instead of the "official" Kaggle version. The reason why is that the preprocessed version contains weather station data important for joining the weather features.

In [5]:
air_store_info = pd.read_csv('data/weather/air_store_info_with_nearest_active_station.csv')

air_store_info.head()
Out[5]:
air_store_id air_genre_name air_area_name latitude longitude station_id station_latitude station_longitude station_vincenty station_great_circle
0 air_0f0cdeee6c9bf3d7 Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852 hyogo__kobe-kana__koube 34.696667 135.211667 1.277232 1.274882
1 air_7cc17a324ae5c7dc Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852 hyogo__kobe-kana__koube 34.696667 135.211667 1.277232 1.274882
2 air_fee8dcf4d619598e Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852 hyogo__kobe-kana__koube 34.696667 135.211667 1.277232 1.274882
3 air_a17f0778617c76e2 Italian/French Hyōgo-ken Kōbe-shi Kumoidōri 34.695124 135.197852 hyogo__kobe-kana__koube 34.696667 135.211667 1.277232 1.274882
4 air_83db5aff8f50478e Italian/French Tōkyō-to Minato-ku Shibakōen 35.658068 139.751599 tokyo__tokyo-kana__tonokyo 35.691667 139.750000 3.730672 3.739835

Now let's handle the test set. Because of the timeseries nature of the competition the test split is implicitely contained in the example submission. We can extract the store id and the date from the id column. I also keep track of the order of the test because I'm a maniac and I want to make sure my submission is sorted in the right order.

In [7]:
import numpy as np

submission = pd.read_csv('data/kaggle/sample_submission.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))

submission.head()
Out[7]:
id visitors air_store_id visit_date is_test test_number
0 air_00a91d42b08b08d9_2017-04-23 NaN air_00a91d42b08b08d9 2017-04-23 True 0
1 air_00a91d42b08b08d9_2017-04-24 NaN air_00a91d42b08b08d9 2017-04-24 True 1
2 air_00a91d42b08b08d9_2017-04-25 NaN air_00a91d42b08b08d9 2017-04-25 True 2
3 air_00a91d42b08b08d9_2017-04-26 NaN air_00a91d42b08b08d9 2017-04-26 True 3
4 air_00a91d42b08b08d9_2017-04-27 NaN air_00a91d42b08b08d9 2017-04-27 True 4

Now that we have the training set and the test set in same format we can merge them together. Merging the training and test sets is a good idea to extract features in one go. Let's also merge the full dataset with the calendar information and the store information extracted previously.

In [8]:
data = pd.concat((air_visit, submission.drop('id', axis='columns')))
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)

data.head()
Out[8]:
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude station_id station_latitude station_longitude station_vincenty station_great_circle
0 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
1 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
2 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
3 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
4 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906

Now for the messy part. The weather data is stored in individual files, one file for weather station. We also know each restaurant's closest weather station. To make things even more enjoyable there are some missing values for each weather station. Let's first start by loading the data.

In [12]:
import glob

weather_dfs = []

for path in glob.glob('data/weather/1-1-16_5-31-17_Weather_Translated/*.csv'):
    weather_df = pd.read_csv(path)
    weather_df['station_id'] = path.split('\\')[-1].rstrip('.csv')
    weather_dfs.append(weather_df)

weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)

weather.head()
Out[12]:
visit_date precipitation avg_temperature total_snowfall deepest_snowfall hours_sunlight avg_wind_speed avg_vapor_pressure avg_humidity avg_sea_pressure avg_local_pressure solar_radiation cloud_cover high_temperature low_temperature station_id
0 2016-01-01 0.0 5.7 NaN NaN 5.1 1.9 NaN NaN NaN NaN NaN NaN 11.0 2.1 data/weather/1-1-16_5-31-17_Weather_Translated...
1 2016-01-02 0.5 10.2 NaN NaN 1.3 2.2 NaN NaN NaN NaN NaN NaN 15.3 6.1 data/weather/1-1-16_5-31-17_Weather_Translated...
2 2016-01-03 0.0 11.2 NaN NaN 0.5 0.9 NaN NaN NaN NaN NaN NaN 15.5 7.1 data/weather/1-1-16_5-31-17_Weather_Translated...
3 2016-01-04 0.0 9.4 NaN NaN 4.2 1.8 NaN NaN NaN NaN NaN NaN 16.1 5.9 data/weather/1-1-16_5-31-17_Weather_Translated...
4 2016-01-05 1.5 8.8 NaN NaN 0.0 0.8 NaN NaN NaN NaN NaN NaN 12.3 7.1 data/weather/1-1-16_5-31-17_Weather_Translated...

After some testing and browsing the Kaggle forums it seemed only worthwhile to use the precipitation and the temperature features. I handled the missing values by replacing them with the global daily average. If I wasn't so lazy I would have taken the average of the $n$ closest stations.

In [13]:
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)

weather[['visit_date', 'avg_temperature', 'precipitation']].head()
Out[13]:
visit_date avg_temperature precipitation
0 2016-01-01 5.7 0.0
1 2016-01-02 10.2 0.5
2 2016-01-03 11.2 0.0
3 2016-01-04 9.4 0.0
4 2016-01-05 8.8 1.5

Finally let's tidy the final dataframe before extracting some juicy features.

In [14]:
data['visit_date'] = pd.to_datetime(data['visit_date'])
data.index = data['visit_date']
data.sort_values(['air_store_id', 'visit_date'], inplace=True)

data.head()
Out[14]:
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday air_genre_name air_area_name latitude longitude station_id station_latitude station_longitude station_vincenty station_great_circle
visit_date
2016-07-01 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
2016-07-02 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
2016-07-03 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
2016-07-04 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906
2016-07-05 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 Italian/French Tōkyō-to Chiyoda-ku Kudanminami 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906

Preprocessing

After a few iterations I noticed that there were many outliers in the datasets. For example the number of visits around New Year is clearly not representative of the rest of the year. In this case outliers are values that are extremely high (there are no negative values in the dataset, duh). I didn't go out of my way and did something really simple. I simply defined outliers as values that were outside of a confidence interval, supposing that the visits follow a normal distribution per restaurant. The 2.4 in the following code block is simply a high quantile of the normal distribution. You could just as well have used 1.96 if you wanted to choose the top 5% of values as outliers. Once I have detected outliers, I created a new variable called visitors_capped where the outlier values are replaced with the maximum of the non-outlier values. You can read more about my method (and others) by checking out this thread.

In [15]:
def find_outliers(series):
    return (series - series.mean()) > 2.4 * series.std()


def cap_values(series):
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series


stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])

data.head()
Out[15]:
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday ... latitude longitude station_id station_latitude station_longitude station_vincenty station_great_circle is_outlier visitors_capped visitors_capped_log1p
visit_date
2016-07-01 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 ... 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906 False 35.0 3.583519
2016-07-02 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 ... 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906 False 9.0 2.302585
2016-07-03 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 ... 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906 False 0.0 0.000000
2016-07-04 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 ... 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906 False 20.0 3.044522
2016-07-05 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 ... 35.694003 139.753595 tokyo__tokyo-kana__tonokyo 35.691667 139.75 0.416011 0.415906 False 25.0 3.258097

5 rows × 22 columns

Next I extracted some simple features. The day of the month feature is quite interesting because it can be seen as a proxy of when people get paid in the month, supposing they get paid on a monthly basis.

In [17]:
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day

The following code block is probably the most important part of this notebook. Exponentially weighted means (EWM) are a very way to capture the trend of a timeseries. However they rely on a paramater alpha which models the weighting of the mean. For example if alpha is close to 1 then recent values are more weighted then older ones. Choosing alpha is very important and there is no obvious way to choose it. I decided to do an optimization procedure to determine the best alpha for each store's timeseries, per day of the week.

The optimization is done with differential evolution. The main reason why I chose this method is that it ensures that the search space is limited between 0 and 1 (the domain of alpha). The objective function is the mean squared error between the timeseries and it's EWM. The EWM is computed on the timeseries shifted by 1 to make sure that the EWM can't "look" at the current value (else it won't work on the test set). As for how on the test, the values will be the same for the first week as well as the second, the third, etc. Of course this isn't ideal, but in my opinion there is not much else that can be done.

One important thing is that I applied the following procedure per store and per day of the week but also per store and per weekend. The latter is less granular and is helpful for timeseries with not much information. Also we apply the procedure to the capped visitor information (calculated previously) and to it's logarithm. The reason why I'm doing this for the logarithm is that it smooths the timeseries and can help a bit. Variety is the spice of life.

In [18]:
from scipy import optimize


def calc_shifted_ewm(series, alpha, adjust=True):
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()


def find_best_signal(series, adjust=False, eps=10e-5):
    
    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        corr = np.mean(np.power(series - shifted_ewm, 2))
        return corr
     
    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])
    
    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)


roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values

data.head()
Out[18]:
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday ... station_great_circle is_outlier visitors_capped visitors_capped_log1p optimized_ewm_by_air_store_id_&_day_of_week is_weekend day_of_month optimized_ewm_by_air_store_id_&_is_weekend optimized_ewm_log1p_by_air_store_id_&_day_of_week optimized_ewm_log1p_by_air_store_id_&_is_weekend
visit_date
2016-07-01 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 ... 0.415906 False 35.0 3.583519 NaN 0 1 NaN NaN NaN
2016-07-02 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 ... 0.415906 False 9.0 2.302585 NaN 1 2 NaN NaN NaN
2016-07-03 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 ... 0.415906 False 0.0 0.000000 NaN 1 3 9.000000 NaN 2.302585
2016-07-04 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 ... 0.415906 False 20.0 3.044522 NaN 0 4 35.000000 NaN 3.583519
2016-07-05 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 ... 0.415906 False 25.0 3.258097 NaN 0 5 33.554228 NaN 3.429215

5 rows × 28 columns

I also extracted some more "naive" rolling features. I calculated the mean, median, standard deviation, number of values, minimum, maximum, and EWM with fixed a alpha parameter per timeseries. Because it's always the same I put all the logic inside a function and made the most of pandas's groupby and apply.

In [22]:
def extract_precedent_statistics(df, on, group_by):
    
    df.sort_values(group_by + ['visit_date'], inplace=True)
    
    groups = df.groupby(group_by, sort=False)
    
    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }
    
    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})
    
    for _, group in groups:
        
        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)
        
        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())
        
        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())
    
    suffix = '_&_'.join(group_by)
    
    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values


extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)

data.sort_values(['air_store_id', 'visit_date']).head()
Out[22]:
air_store_id is_test test_number visit_date visitors was_nil day_of_week is_holiday prev_day_is_holiday next_day_is_holiday ... visitors_capped_log1p_median_by_air_store_id visitors_capped_log1p_std_by_air_store_id visitors_capped_log1p_count_by_air_store_id visitors_capped_log1p_max_by_air_store_id visitors_capped_log1p_min_by_air_store_id visitors_capped_log1p_exp_0.1_mean_by_air_store_id visitors_capped_log1p_exp_0.25_mean_by_air_store_id visitors_capped_log1p_exp_0.3_mean_by_air_store_id visitors_capped_log1p_exp_0.5_mean_by_air_store_id visitors_capped_log1p_exp_0.75_mean_by_air_store_id
visit_date
2016-07-01 air_00a91d42b08b08d9 False NaN 2016-07-01 35.0 False Friday 0 0.0 0.0 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
2016-07-02 air_00a91d42b08b08d9 False NaN 2016-07-02 9.0 False Saturday 0 0.0 0.0 ... 3.583519 NaN 1.0 3.583519 3.583519 3.583519 3.583519 3.583519 3.583519 3.583519
2016-07-03 air_00a91d42b08b08d9 False NaN 2016-07-03 0.0 True Sunday 0 0.0 0.0 ... 2.943052 0.905757 2.0 3.583519 2.302585 3.455426 3.263285 3.199239 2.943052 2.622819
2016-07-04 air_00a91d42b08b08d9 False NaN 2016-07-04 20.0 False Monday 0 0.0 0.0 ... 2.302585 1.815870 3.0 3.583519 0.000000 3.109883 2.447464 2.239467 1.471526 0.655705
2016-07-05 air_00a91d42b08b08d9 False NaN 2016-07-05 25.0 False Tuesday 0 0.0 0.0 ... 2.673554 1.578354 4.0 3.583519 0.000000 3.103347 2.596729 2.480984 2.258024 2.447318

5 rows × 94 columns

I'm going to go with boosting in the end so the categorical variables have to be one-hot encoded.

In [23]:
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])

Train/test split

Now let's define the training and sets so that we can train a model. A few columns in the dataset have to be dropped because they are useless. I went for predicting the log of the visitors. During prediction I also predict the log of the visitors, to get back to the original magnitude I simply apply the exponential function to the prediction. This works because $exp(log(x)) = x$. I'm pretty sure anyone serious did this during the competition. I'm not sure why this works theoretically. My intuition tells me that it helps a decision tree to pack values in a leaf because the values are "closer" to each other.

In [24]:
data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')

to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors', 'air_area_name',
           'station_id', 'station_latitude', 'station_longitude', 'station_vincenty',
           'station_great_circle', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')

X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']

The following code block just does a few sanity checks.

In [25]:
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019

Finally let's do the actual machine learning. I went for a LightGVM cross-validation where I train 6 models on a random sample of the dataset. The sampling is done without replacement so technically this is known as pasting. I like doing parameter tuning myself and I don't use grid-search techniques as I find there are too burdensome. This is pretty classic stuff and isn't the most part of my notebook. A lot of Kagglers use this setup.

In [27]:
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection


np.random.seed(42)

model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=5 ** 2 - 1,
    learning_rate=0.007,
    n_estimators=30000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)

n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)

val_scores = [0] * n_splits

sub = submission['id'].to_frame()
sub['visitors'] = 0

feature_importances = pd.DataFrame(index=X_train.columns)

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]
    
    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )
    
    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_
    
    print('Fold {} RMSLE: {:.5f}'.format(i+1, val_scores[i]))
    
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])

val_mean = np.mean(val_scores)
val_std = np.std(val_scores)

print('Local RMSLE: {:.5f}{:.5f})'.format(val_mean, val_std))
Fold 1 RMSLE: 0.48486
Fold 2 RMSLE: 0.48360
Fold 3 RMSLE: 0.48666
Fold 4 RMSLE: 0.48522
Fold 5 RMSLE: 0.48259
Fold 6 RMSLE: 0.48010
Local RMSLE: 0.48384 (±0.00210)

Now let's save the submission. I like annotating the submission with the local score to compare it with the leaderboard score.

In [16]:
sub.to_csv('submissions/lgbm_{:.5f}_{:.5f}.csv'.format(val_mean, val_std), index=False)

Here are the feature importances so to get a feeling of what worked well or not.

In [28]:
feature_importances.sort_values(0, ascending=False)
Out[28]:
0 1 2 3 4 5
day_of_month 7565 6490 7200 7041 6583 7424
visitors_capped_count_by_air_store_id_&_is_weekend 7137 6101 6570 6306 6193 7160
optimized_ewm_by_air_store_id_&_day_of_week 7097 6057 6338 6428 6124 7023
visitors_capped_log1p_std_by_air_store_id 6834 5766 6552 6110 5680 6651
optimized_ewm_log1p_by_air_store_id_&_day_of_week 6755 5862 6476 6118 6171 7006
visitors_capped_std_by_air_store_id_&_day_of_week 6662 5524 6273 5860 5582 6728
visitors_capped_log1p_std_by_air_store_id_&_day_of_week 6651 5739 6575 6131 5838 7237
visitors_capped_std_by_air_store_id 6582 5873 6382 5958 5813 6795
visitors_capped_std_by_air_store_id_&_is_weekend 6053 5200 5630 5385 5309 6215
visitors_capped_count_by_air_store_id 6052 4889 5589 5638 5186 6126
visitors_capped_exp_0.1_mean_by_air_store_id 5959 5168 5343 5207 5106 6046
visitors_capped_log1p_std_by_air_store_id_&_is_weekend 5940 4870 5335 5304 5127 5701
latitude 5940 4690 5497 5174 4716 5698
visitors_capped_max_by_air_store_id 5735 5162 5709 5381 5065 5931
optimized_ewm_by_air_store_id_&_is_weekend 5593 4776 5395 4888 4784 5421
visitors_capped_exp_0.1_mean_by_air_store_id_&_day_of_week 5544 4696 5160 5120 4983 5815
longitude 5182 4293 4727 4343 4220 5033
visitors_capped_log1p_exp_0.1_mean_by_air_store_id_&_day_of_week 5138 4087 4617 4416 4128 4840
visitors_capped_log1p_mean_by_air_store_id 5070 3872 4579 4315 4366 5113
visitors_capped_mean_by_air_store_id_&_day_of_week 4995 4255 4610 4299 4312 5161
visitors_capped_log1p_exp_0.1_mean_by_air_store_id 4848 3616 4288 4202 3742 4680
visitors_capped_log1p_exp_0.75_mean_by_air_store_id_&_day_of_week 4833 3933 4324 4172 3856 4760
visitors_capped_log1p_mean_by_air_store_id_&_day_of_week 4800 4095 4493 4248 4090 4895
visitors_capped_mean_by_air_store_id 4643 3885 4353 3988 3856 4646
visitors_capped_exp_0.75_mean_by_air_store_id 4457 3501 4094 3846 3411 4546
optimized_ewm_log1p_by_air_store_id_&_is_weekend 4375 3277 3979 3912 3697 4342
visitors_capped_max_by_air_store_id_&_day_of_week 4313 3691 3930 3711 3819 4418
visitors_capped_exp_0.75_mean_by_air_store_id_&_day_of_week 4301 3429 4061 3835 3582 4665
visitors_capped_log1p_exp_0.75_mean_by_air_store_id 4276 3505 4297 3879 3504 4477
visitors_capped_log1p_exp_0.75_mean_by_air_store_id_&_is_weekend 4070 3335 3751 3527 3242 4046
... ... ... ... ... ... ...
visitors_capped_log1p_median_by_air_store_id 785 578 776 705 689 849
air_genre_name_Izakaya 736 674 751 733 650 840
day_of_week_Sunday 665 586 624 614 540 621
air_genre_name_Dining bar 649 545 631 590 595 738
is_weekend 586 561 631 596 597 611
day_of_week_Thursday 506 444 520 465 453 539
day_of_week_Tuesday 496 446 465 481 415 418
air_genre_name_Bar/Cocktail 479 438 420 438 442 506
air_genre_name_Italian/French 317 299 301 318 280 381
visitors_capped_min_by_air_store_id_&_is_weekend 309 266 298 289 218 258
day_of_week_Wednesday 280 230 269 271 209 262
air_genre_name_Japanese food 262 228 265 200 209 276
visitors_capped_min_by_air_store_id 255 217 246 289 164 217
air_genre_name_Yakiniku/Korean food 190 143 159 177 199 189
air_genre_name_Western food 186 158 149 202 157 148
air_genre_name_Okonomiyaki/Monja/Teppanyaki 170 137 94 114 110 134
air_genre_name_Creative cuisine 121 115 118 106 136 161
air_genre_name_Karaoke/Party 99 128 73 83 70 167
air_genre_name_Other 99 88 127 92 117 100
air_genre_name_Asian 18 4 7 12 7 25
air_genre_name_International cuisine 1 7 19 0 1 18
visitors_capped_log1p_min_by_air_store_id 0 0 0 0 0 0
visitors_capped_log1p_max_by_air_store_id 0 0 0 0 0 0
visitors_capped_log1p_count_by_air_store_id 0 0 0 0 0 0
visitors_capped_log1p_min_by_air_store_id_&_is_weekend 0 0 0 0 0 0
visitors_capped_log1p_max_by_air_store_id_&_is_weekend 0 0 0 0 0 0
visitors_capped_log1p_count_by_air_store_id_&_is_weekend 0 0 0 0 0 0
visitors_capped_log1p_max_by_air_store_id_&_day_of_week 0 0 0 0 0 0
visitors_capped_log1p_count_by_air_store_id_&_day_of_week 0 0 0 0 0 0
visitors_capped_log1p_min_by_air_store_id_&_day_of_week 0 0 0 0 0 0

98 rows × 6 columns

That's all there is to it! Thanks for reading. I hope you picked up some useful stuff in this notebook. Feel free to shoot me an email at [email protected] if you have any questions.