This notebook contains my solution to the Kaggle Recruit Restaurant Visitor Forecasting competition. I went for simplicity and used a single, global model. I spent most of my time doing feature engineering, mainly focused on extracting so-called rolling statistics.
By itself the output of this notebook scores 0.512 on the private leaderboard, which is enough to be part of the top 25. I then averaged my solution with the Surprise Me public kernel to score 0.507. You could say that I only did half the work. However, if the public kernel had not been available I would simply have built another model and used it instead. Blending public kernels has become ubiquitous for doing well on Kaggle.
For obvious reasons I haven't committed the data files to GitHub. To run this kernel yourself as it is, you will have to organise your files in a certain way. All the files provided by Kaggle should be stored in the data/kaggle directory. Additionally, the weather data located here has to be kept in the data/weather directory.
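If you want a quick sanity check before running anything, here is a small sketch that verifies the layout assumed throughout this notebook (the file names are simply the ones loaded in the code cells further down):

from pathlib import Path

# Files loaded later in this notebook; adjust the list if your layout differs
expected = [
    'data/kaggle/air_visit_data.csv',
    'data/kaggle/date_info.csv',
    'data/kaggle/sample_submission.csv',
    'data/weather/air_store_info_with_nearest_active_station.csv',
]

for path in expected:
    assert Path(path).exists(), 'Missing file: {}'.format(path)

# The per-station weather files live in a subdirectory of data/weather
assert any(Path('data/weather/1-1-16_5-31-17_Weather_Translated').glob('*.csv'))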
First of all let's load the visit data. I did something very important in the following code block: instead of taking the visit data as is, I resampled it by day so that days with no data points get a visit count of 0. This is important for computing rolling features based on time. I also keep track of whether a data point was included in the original dataset or was added by the resampling procedure.
import pandas as pd
air_visit = pd.read_csv('data/kaggle/air_visit_data.csv')
air_visit.index = pd.to_datetime(air_visit['visit_date'])
air_visit = air_visit.groupby('air_store_id').apply(lambda g: g['visitors'].resample('1d').sum()).reset_index()
air_visit['visit_date'] = air_visit['visit_date'].dt.strftime('%Y-%m-%d')
air_visit['was_nil'] = air_visit['visitors'].isnull()
air_visit['visitors'].fillna(0, inplace=True)
air_visit.head()
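To see what the resampling does, here is a toy illustration with made-up data. The block above relies on empty daily bins coming back as NaN so that was_nil can flag them; depending on your pandas version, sum() may return 0 for empty bins instead, in which case passing min_count=1 (as done below, not part of the original code) restores the NaN behaviour:

# Toy example, not part of the competition data
toy = pd.Series([5, 3], index=pd.to_datetime(['2017-01-01', '2017-01-03']))

# 2017-01-02 has no data point, so it comes back as NaN,
# which would then be flagged by was_nil and filled with 0
print(toy.resample('1d').sum(min_count=1))
# 2017-01-01    5.0
# 2017-01-02    NaN
# 2017-01-03    3.0
# Freq: D, dtype: float64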
Next let's take care of the calendar information. Apart from renaming the columns for practical reasons, I added two features indicating whether the previous or the next day is a holiday.
date_info = pd.read_csv('data/kaggle/date_info.csv')
date_info.rename(columns={'holiday_flg': 'is_holiday', 'calendar_date': 'visit_date'}, inplace=True)
date_info['prev_day_is_holiday'] = date_info['is_holiday'].shift().fillna(0)
date_info['next_day_is_holiday'] = date_info['is_holiday'].shift(-1).fillna(0)
date_info.head()
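The direction of the shifts can be a bit confusing, so here is a tiny made-up example: shift() pulls in the previous day's flag, while shift(-1) pulls in the next day's flag.

# Made-up holiday flags for three consecutive days
flags = pd.Series([0, 1, 0], index=['2016-01-01', '2016-01-02', '2016-01-03'])

print(flags.shift().fillna(0).tolist())    # [0.0, 0.0, 1.0] -> previous day's flag
print(flags.shift(-1).fillna(0).tolist())  # [1.0, 0.0, 0.0] -> next day's flag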
As for the store information, I used the preprocessed version coming from the weather data instead of the "official" Kaggle version. The reason is that the preprocessed version contains the nearest weather station of each restaurant, which is needed for joining the weather features.
air_store_info = pd.read_csv('data/weather/air_store_info_with_nearest_active_station.csv')
air_store_info.head()
Now let's handle the test set. Because of the timeseries nature of the competition, the test split is implicitly contained in the example submission. We can extract the store id and the date from the id column. I also keep track of the order of the test set because I'm a maniac and I want to make sure my submission is sorted in the right order.
import numpy as np
submission = pd.read_csv('data/kaggle/sample_submission.csv')
submission['air_store_id'] = submission['id'].str.slice(0, 20)
submission['visit_date'] = submission['id'].str.slice(21)
submission['is_test'] = True
submission['visitors'] = np.nan
submission['test_number'] = range(len(submission))
submission.head()
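The slicing works because every id in the sample submission has the form <air_store_id>_<visit_date>, and the store id is always 20 characters long. A quick check on a made-up id:

# Made-up id following the format of the sample submission
example_id = 'air_0123456789abcdef_2017-04-23'

assert example_id[:20] == 'air_0123456789abcdef'  # air_store_id
assert example_id[21:] == '2017-04-23'            # visit_date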
Now that we have the training set and the test set in the same format, we can merge them together. Merging the training and test sets is a good idea because it lets us extract features in one go. Let's also merge the full dataset with the calendar information and the store information extracted previously.
data = pd.concat((air_visit, submission.drop('id', axis='columns')))
data['is_test'].fillna(False, inplace=True)
data = pd.merge(left=data, right=date_info, on='visit_date', how='left')
data = pd.merge(left=data, right=air_store_info, on='air_store_id', how='left')
data['visitors'] = data['visitors'].astype(float)
data.head()
Now for the messy part. The weather data is stored in individual files, one file per weather station. We also know each restaurant's closest weather station. To make things even more enjoyable, there are some missing values for each weather station. Let's start by loading the data.
import glob
import os

weather_dfs = []

for path in glob.glob('data/weather/1-1-16_5-31-17_Weather_Translated/*.csv'):
    weather_df = pd.read_csv(path)
    # Use the file name without its .csv extension as the station id; this is
    # more portable than splitting on '\\', which only works on Windows
    weather_df['station_id'] = os.path.splitext(os.path.basename(path))[0]
    weather_dfs.append(weather_df)
weather = pd.concat(weather_dfs, axis='rows')
weather.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
weather.head()
After some testing and browsing the Kaggle forums, it seemed worthwhile to use only the precipitation and temperature features. I handled the missing values by replacing them with the global daily average. If I wasn't so lazy I would have taken the average of the $n$ closest stations.
means = weather.groupby('visit_date')[['avg_temperature', 'precipitation']].mean().reset_index()
means.rename(columns={'avg_temperature': 'global_avg_temperature', 'precipitation': 'global_precipitation'}, inplace=True)
weather = pd.merge(left=weather, right=means, on='visit_date', how='left')
weather['avg_temperature'].fillna(weather['global_avg_temperature'], inplace=True)
weather['precipitation'].fillna(weather['global_precipitation'], inplace=True)
weather[['visit_date', 'avg_temperature', 'precipitation']].head()
Finally let's tidy the final dataframe before extracting some juicy features.
data['visit_date'] = pd.to_datetime(data['visit_date'])
data.index = data['visit_date']
data.sort_values(['air_store_id', 'visit_date'], inplace=True)
data.head()
After a few iterations I noticed that there were many outliers in the dataset. For example, the number of visits around New Year is clearly not representative of the rest of the year. In this case outliers are values that are extremely high (there are no negative values in the dataset, duh). I didn't go out of my way and did something really simple: I defined outliers as values lying outside a confidence interval, supposing that the visits follow a normal distribution per restaurant. The 2.4 in the following code block is simply a high quantile of the normal distribution, corresponding to roughly the top 1% of values. You could just as well have used 1.96 if you wanted to flag roughly the top 2.5% of values as outliers. Once the outliers were detected, I created a new variable called visitors_capped where the outlier values are replaced with the maximum of the non-outlier values. You can read more about my method (and others) by checking out this thread.
def find_outliers(series):
    return (series - series.mean()) > 2.4 * series.std()


def cap_values(series):
    outliers = find_outliers(series)
    max_val = series[~outliers].max()
    series[outliers] = max_val
    return series
stores = data.groupby('air_store_id')
data['is_outlier'] = stores.apply(lambda g: find_outliers(g['visitors'])).values
data['visitors_capped'] = stores.apply(lambda g: cap_values(g['visitors'])).values
data['visitors_capped_log1p'] = np.log1p(data['visitors_capped'])
data.head()
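To make the capping concrete, here is what the two helpers do on a small made-up series. Note that in a short series a single spike inflates the standard deviation, so the spike has to be quite extreme before it gets flagged.

# Made-up series of daily visits with one extreme spike
toy_visits = pd.Series([10, 12, 11, 13, 9, 10, 12, 11, 200], dtype=float)

print(find_outliers(toy_visits))
# Only the 200 is more than 2.4 standard deviations above the mean
print(cap_values(toy_visits.copy()))
# The 200 is replaced by 13, the maximum of the non-outlier values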
Next I extracted some simple features. The day of the month feature is quite interesting because it can be seen as a proxy for when people get paid during the month, supposing they get paid on a monthly basis.
data['is_weekend'] = data['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
data['day_of_month'] = data['visit_date'].dt.day
The following code block is probably the most important part of this notebook. Exponentially weighted means (EWM) are a very good way to capture the trend of a timeseries. However, they rely on a parameter alpha which controls the weighting of the mean: if alpha is close to 1 then recent values weigh more than older ones. Choosing alpha is very important and there is no obvious way to do it. I decided to run an optimization procedure to determine the best alpha for each store's timeseries, per day of the week.
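To give a feel for what alpha does, here is a small made-up example comparing a high and a low alpha on a series with a level shift: the high alpha tracks the latest values almost immediately, while the low alpha averages over a much longer history.

# Made-up series that jumps from 10 to 30 halfway through
toy = pd.Series([10.0] * 5 + [30.0] * 5)

print(toy.ewm(alpha=0.9, adjust=False).mean().round(1).tolist())
# [10.0, 10.0, 10.0, 10.0, 10.0, 28.0, 29.8, 30.0, 30.0, 30.0]
print(toy.ewm(alpha=0.1, adjust=False).mean().round(1).tolist())
# [10.0, 10.0, 10.0, 10.0, 10.0, 12.0, 13.8, 15.4, 16.9, 18.2]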
The optimization is done with differential evolution. The main reason why I chose this method is that it ensures the search space is limited between 0 and 1 (the domain of alpha). The objective function is the mean squared error between the timeseries and its EWM. The EWM is computed on the timeseries shifted by 1 to make sure that the EWM can't "look" at the current value (otherwise it wouldn't work on the test set). As for the test set, the EWM values will be the same for the first week as well as the second, the third, etc. Of course this isn't ideal, but in my opinion there is not much else that can be done.
One important thing is that I applied the following procedure per store and per day of the week, but also per store and weekend indicator. The latter is less granular and is helpful for timeseries without much information. We also apply the procedure both to the capped visitor counts (calculated previously) and to their logarithm. The reason why I'm doing this for the logarithm is that it smooths the timeseries and can help a bit. Variety is the spice of life.
from scipy import optimize
def calc_shifted_ewm(series, alpha, adjust=True):
    return series.shift().ewm(alpha=alpha, adjust=adjust).mean()


def find_best_signal(series, adjust=False, eps=10e-5):

    def f(alpha):
        shifted_ewm = calc_shifted_ewm(series=series, alpha=min(max(alpha, 0), 1), adjust=adjust)
        mse = np.mean(np.power(series - shifted_ewm, 2))
        return mse

    res = optimize.differential_evolution(func=f, bounds=[(0 + eps, 1 - eps)])

    return calc_shifted_ewm(series=series, alpha=res['x'][0], adjust=adjust)
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped']))
data['optimized_ewm_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'day_of_week']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_day_of_week'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
roll = data.groupby(['air_store_id', 'is_weekend']).apply(lambda g: find_best_signal(g['visitors_capped_log1p']))
data['optimized_ewm_log1p_by_air_store_id_&_is_weekend'] = roll.sort_index(level=['air_store_id', 'visit_date']).values
data.head()
I also extracted some more "naive" rolling features. I calculated the mean, median, standard deviation, number of values, minimum, maximum, and EWMs with several fixed alpha values per timeseries. Because the logic is always the same, I put it inside a function and made the most of pandas's groupby and apply.
def extract_precedent_statistics(df, on, group_by):

    df.sort_values(group_by + ['visit_date'], inplace=True)

    groups = df.groupby(group_by, sort=False)

    stats = {
        'mean': [],
        'median': [],
        'std': [],
        'count': [],
        'max': [],
        'min': []
    }

    exp_alphas = [0.1, 0.25, 0.3, 0.5, 0.75]
    stats.update({'exp_{}_mean'.format(alpha): [] for alpha in exp_alphas})

    for _, group in groups:

        shift = group[on].shift()
        roll = shift.rolling(window=len(group), min_periods=1)

        stats['mean'].extend(roll.mean())
        stats['median'].extend(roll.median())
        stats['std'].extend(roll.std())
        stats['count'].extend(roll.count())
        stats['max'].extend(roll.max())
        stats['min'].extend(roll.min())

        for alpha in exp_alphas:
            exp = shift.ewm(alpha=alpha, adjust=False)
            stats['exp_{}_mean'.format(alpha)].extend(exp.mean())

    suffix = '_&_'.join(group_by)

    for stat_name, values in stats.items():
        df['{}_{}_by_{}'.format(on, stat_name, suffix)] = values
extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped',
    group_by=['air_store_id']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'day_of_week']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id', 'is_weekend']
)

extract_precedent_statistics(
    df=data,
    on='visitors_capped_log1p',
    group_by=['air_store_id']
)
data.sort_values(['air_store_id', 'visit_date']).head()
I'm going to go with boosting in the end so the categorical variables have to be one-hot encoded.
data = pd.get_dummies(data, columns=['day_of_week', 'air_genre_name'])
Now let's define the training and test sets so that we can train a model. A few columns in the dataset have to be dropped because they are useless. I went for predicting the log of the visitors; during prediction the model also outputs the log of the visitors, and to get back to the original magnitude I simply apply the inverse transform, which works because $\mathrm{expm1}(\mathrm{log1p}(x)) = x$. I'm pretty sure anyone serious did this during the competition. One reason it works is that the competition metric is RMSLE, so minimizing the squared error on the log-transformed target optimizes the metric directly. My intuition also tells me that it helps a decision tree to pack values in a leaf because the values are "closer" to each other.
data['visitors_log1p'] = np.log1p(data['visitors'])
train = data[(data['is_test'] == False) & (data['is_outlier'] == False) & (data['was_nil'] == False)]
test = data[data['is_test']].sort_values('test_number')
to_drop = ['air_store_id', 'is_test', 'test_number', 'visit_date', 'was_nil',
           'is_outlier', 'visitors_capped', 'visitors', 'air_area_name',
           'station_id', 'station_latitude', 'station_longitude', 'station_vincenty',
           'station_great_circle', 'visitors_capped_log1p']
train = train.drop(to_drop, axis='columns')
train = train.dropna()
test = test.drop(to_drop, axis='columns')
X_train = train.drop('visitors_log1p', axis='columns')
X_test = test.drop('visitors_log1p', axis='columns')
y_train = train['visitors_log1p']
The following code block just does a few sanity checks.
assert X_train.isnull().sum().sum() == 0
assert y_train.isnull().sum() == 0
assert len(X_train) == len(y_train)
assert X_test.isnull().sum().sum() == 0
assert len(X_test) == 32019
Finally let's do the actual machine learning. I went for a LightGBM cross-validation setup where I train 6 models, each on a random sample of the dataset. The sampling is done without replacement, so technically this is known as pasting. I like doing parameter tuning myself and I don't use grid-search techniques because I find them too burdensome. This is pretty classic stuff and isn't the most interesting part of my notebook. A lot of Kagglers use this setup.
import lightgbm as lgbm
from sklearn import metrics
from sklearn import model_selection
np.random.seed(42)
model = lgbm.LGBMRegressor(
    objective='regression',
    max_depth=5,
    num_leaves=5 ** 2 - 1,
    learning_rate=0.007,
    n_estimators=30000,
    min_child_samples=80,
    subsample=0.8,
    colsample_bytree=1,
    reg_alpha=0,
    reg_lambda=0,
    random_state=np.random.randint(10e6)
)
n_splits = 6
cv = model_selection.KFold(n_splits=n_splits, shuffle=True, random_state=42)
val_scores = [0] * n_splits
sub = submission['id'].to_frame()
sub['visitors'] = 0
feature_importances = pd.DataFrame(index=X_train.columns)
for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):

    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]

    model.fit(
        X_fit,
        y_fit,
        eval_set=[(X_fit, y_fit), (X_val, y_val)],
        eval_names=('fit', 'val'),
        eval_metric='l2',
        early_stopping_rounds=200,
        feature_name=X_fit.columns.tolist(),
        verbose=False
    )

    val_scores[i] = np.sqrt(model.best_score_['val']['l2'])
    sub['visitors'] += model.predict(X_test, num_iteration=model.best_iteration_)
    feature_importances[i] = model.feature_importances_

    print('Fold {} RMSLE: {:.5f}'.format(i+1, val_scores[i]))
sub['visitors'] /= n_splits
sub['visitors'] = np.expm1(sub['visitors'])
val_mean = np.mean(val_scores)
val_std = np.std(val_scores)
print('Local RMSLE: {:.5f} (±{:.5f})'.format(val_mean, val_std))
Now let's save the submission. I like annotating the submission with the local score to compare it with the leaderboard score.
sub.to_csv('submissions/lgbm_{:.5f}_{:.5f}.csv'.format(val_mean, val_std), index=False)
Here are the feature importances, to get a feeling for what worked well and what didn't.
feature_importances.sort_values(0, ascending=False)
That's all there is to it! Thanks for reading. I hope you picked up some useful stuff in this notebook. Feel free to shoot me an email at [email protected] if you have any questions.