Insurance Claims Fraud Detection

Business Problem

An insurance company has approached you with a dataset of previous claims of their clients. The insurance company wants you to develop a model to help them predict which claims look fraudulent. By doing so you hope to save the company millions of dollars annually.

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [301]:
#Load the dataset into a dataframe
df = pd.read_csv('insurance_claims.csv')
df.head()
Out[301]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported
0 328 48 521585 2014-10-17 OH 250/500 1000 1406.91 0 466132 MALE MD craft-repair sleeping husband 53300 0 2015-01-25 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 9935 4th Drive 5 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 2004 Y
1 228 42 342868 2006-06-27 IN 250/500 2000 1197.22 5000000 468176 MALE MD machine-op-inspct reading other-relative 0 0 2015-01-21 Vehicle Theft ? Minor Damage Police VA Riverwood 6608 MLK Hwy 8 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 2007 Y
2 134 29 687698 2000-09-06 OH 100/300 2000 1413.14 5000000 430632 FEMALE PhD sales board-games own-child 35100 0 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police NY Columbus 7121 Francis Lane 7 3 NO 2 3 NO 34650 7700 3850 23100 Dodge RAM 2007 N
3 256 41 227811 1990-05-25 IL 250/500 2000 1415.74 6000000 608117 FEMALE PhD armed-forces board-games unmarried 48900 -62400 2015-01-10 Single Vehicle Collision Front Collision Major Damage Police OH Arlington 6956 Maple Drive 5 1 ? 1 2 NO 63400 6340 6340 50720 Chevrolet Tahoe 2014 Y
4 228 44 367455 2014-06-06 IL 500/1000 1000 1583.91 6000000 610706 MALE Associate sales board-games unmarried 66000 -46000 2015-02-17 Vehicle Theft ? Minor Damage None NY Arlington 3041 3rd Ave 20 1 NO 0 1 NO 6500 1300 650 4550 Accura RSX 2009 N
In [302]:
#Check the shape of the dataframe
df.shape
Out[302]:
(1000, 39)
In [303]:
#check the data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
months_as_customer             1000 non-null int64
age                            1000 non-null int64
policy_number                  1000 non-null int64
policy_bind_date               1000 non-null object
policy_state                   1000 non-null object
policy_csl                     1000 non-null object
policy_deductable              1000 non-null int64
policy_annual_premium          1000 non-null float64
umbrella_limit                 1000 non-null int64
insured_zip                    1000 non-null int64
insured_sex                    1000 non-null object
insured_education_level        1000 non-null object
insured_occupation             1000 non-null object
insured_hobbies                1000 non-null object
insured_relationship           1000 non-null object
capital-gains                  1000 non-null int64
capital-loss                   1000 non-null int64
incident_date                  1000 non-null object
incident_type                  1000 non-null object
collision_type                 1000 non-null object
incident_severity              1000 non-null object
authorities_contacted          1000 non-null object
incident_state                 1000 non-null object
incident_city                  1000 non-null object
incident_location              1000 non-null object
incident_hour_of_the_day       1000 non-null int64
number_of_vehicles_involved    1000 non-null int64
property_damage                1000 non-null object
bodily_injuries                1000 non-null int64
witnesses                      1000 non-null int64
police_report_available        1000 non-null object
total_claim_amount             1000 non-null int64
injury_claim                   1000 non-null int64
property_claim                 1000 non-null int64
vehicle_claim                  1000 non-null int64
auto_make                      1000 non-null object
auto_model                     1000 non-null object
auto_year                      1000 non-null int64
fraud_reported                 1000 non-null object
dtypes: float64(1), int64(17), object(21)
memory usage: 304.8+ KB
In [309]:
df.columns[df.isnull().any()]
Out[309]:
Index([], dtype='object')
In [310]:
df.fraud_reported.value_counts()
Out[310]:
N    753
Y    247
Name: fraud_reported, dtype: int64
In [17]:
sns.pairplot(df)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x151dc6d8>
In [311]:
df.policy_deductable.unique()
Out[311]:
array([1000, 2000,  500], dtype=int64)
In [312]:
df.number_of_vehicles_involved.unique()
Out[312]:
array([1, 3, 4, 2], dtype=int64)
In [313]:
df.number_of_vehicles_involved.value_counts()
Out[313]:
1    581
3    358
4    31 
2    30 
Name: number_of_vehicles_involved, dtype: int64
In [314]:
df.bodily_injuries.unique()
Out[314]:
array([1, 0, 2], dtype=int64)
In [315]:
df.witnesses.unique()
Out[315]:
array([2, 0, 3, 1], dtype=int64)
In [316]:
df[df['capital-gains'] == 0].shape[0]
Out[316]:
508
In [317]:
df[(df['capital-gains'] > 0) & (df['capital-gains'] <=50000)].shape[0]
Out[317]:
233
In [318]:
df[df['capital-gains'] > 50000].shape[0]
Out[318]:
259
In [319]:
df[df['capital-loss'] == 0].shape[0]
Out[319]:
475
In [320]:
df[(df['capital-loss'] < 0) & (df['capital-loss'] >=-50000)].shape[0]
Out[320]:
255
In [321]:
df[(df['capital-loss'] >= -100000) & (df['capital-loss'] < -50000)].shape[0]
Out[321]:
269
In [322]:
df.umbrella_limit.value_counts()
Out[322]:
 0           798
 6000000     57 
 5000000     46 
 4000000     39 
 7000000     29 
 3000000     12 
 8000000     8  
 9000000     5  
 2000000     3  
 10000000    2  
-1000000     1  
Name: umbrella_limit, dtype: int64
In [323]:
sns.distplot(df.age, bins=np.arange(19,64,5))
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6448: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Out[323]:
<matplotlib.axes._subplots.AxesSubplot at 0x3caa2390>
In [324]:
objectcols = [col for col in df.columns if df[col].dtype == 'O']
objectcols
Out[324]:
['policy_bind_date',
 'policy_state',
 'policy_csl',
 'insured_sex',
 'insured_education_level',
 'insured_occupation',
 'insured_hobbies',
 'insured_relationship',
 'incident_date',
 'incident_type',
 'collision_type',
 'incident_severity',
 'authorities_contacted',
 'incident_state',
 'incident_city',
 'incident_location',
 'property_damage',
 'police_report_available',
 'auto_make',
 'auto_model',
 'fraud_reported']
In [96]:
#Plot the distribution of categorical variables
fig = plt.figure(figsize=(12, 12))

sub1 = plt.subplot(5, 4, 1)
sns.countplot(x='policy_state', data=df)

sub2 = plt.subplot(5, 4, 2)
sns.countplot(x='policy_csl', data=df)

sub3 = plt.subplot(5, 4, 3)
sns.countplot(x='insured_sex', data=df)

sub4 = plt.subplot(5, 4, 4)
sns.countplot(x='insured_education_level', data=df)

sub5 = plt.subplot(5, 4, 5)
sns.countplot(x='insured_occupation', data=df)

sub6 = plt.subplot(5, 4, 6)
sns.countplot(x='insured_hobbies', data=df)

sub7 = plt.subplot(5, 4, 7)
sns.countplot(x='insured_relationship', data=df)

sub8 = plt.subplot(5, 4, 8)
sns.countplot(x='incident_type', data=df)

sub9 = plt.subplot(5, 4, 9)
sns.countplot(x='collision_type', data=df)

sub9 = plt.subplot(5, 4, 10)
sns.countplot(x='incident_severity', data=df)
sub9 = plt.subplot(5, 4, 11)
sns.countplot(x='authorities_contacted', data=df)
sub9 = plt.subplot(5, 4, 12)
sns.countplot(x='incident_state', data=df)
sub9 = plt.subplot(5, 4, 13)
sns.countplot(x='incident_city', data=df)
sub9 = plt.subplot(5, 4, 14)
sns.countplot(x='incident_location', data=df)
sub9 = plt.subplot(5, 4, 15)
sns.countplot(x='property_damage', data=df)
sub9 = plt.subplot(5, 4, 16)
sns.countplot(x='police_report_available', data=df)
sub9 = plt.subplot(5, 4, 17)
sns.countplot(x='auto_make', data=df)
sub9 = plt.subplot(5, 4, 18)
sns.countplot(x='auto_model', data=df)
sub9 = plt.subplot(5, 4, 19)
sns.countplot(x='fraud_reported', data=df)
    
fig.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:1428: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  stat_data = remove_na(group_data)
In [325]:
df.fraud_reported = df.fraud_reported.map({'Y':1, 'N':0})
df.property_damage = df.property_damage.map({'?':'UNKNOWN'})
df.police_report_available = df.police_report_available.map({'?':'UNKNOWN'})
df.collision_type = df.collision_type.map({'?':'UNKNOWN'})
df.insured_sex = df.insured_sex.map({'MALE':1,'FEMALE':0})
df.insured_education_level = df.insured_education_level.map({'High School': 1, 'College': 2, 'Associate': 2, 'JD': 3,\
                                                             'Masters': 4, 'MD': 5, 'PhD': 6})
df.incident_severity = df.incident_severity.map({'Minor Damage': 1, 'Major Damage': 2, 'Total Loss': 3, 'Trivial Damage': 4})
In [327]:
df['csl_per_person'] = df.policy_csl.str.split('/', expand=True)[0]
df['csl_per_accident'] = df.policy_csl.str.split('/', expand=True)[1]
In [345]:
features = ['months_as_customer','age','policy_state','csl_per_person','csl_per_accident','policy_deductable',\
            'policy_annual_premium','umbrella_limit','insured_sex','insured_education_level','insured_occupation',\
            'insured_hobbies','insured_relationship','capital-gains','capital-loss','incident_type','collision_type',\
            'incident_severity','authorities_contacted','incident_state','incident_hour_of_the_day','number_of_vehicles_involved',\
            'property_damage','bodily_injuries','witnesses','police_report_available','total_claim_amount','auto_make',\
            'auto_model','fraud_reported']
df_final = pd.get_dummies(df[features],drop_first=True)
df_final.head()
Out[345]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit insured_sex insured_education_level capital-gains capital-loss incident_severity incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount fraud_reported policy_state_IN policy_state_OH csl_per_person_250 csl_per_person_500 csl_per_accident_300 csl_per_accident_500 insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6
0 328 48 1000 1406.91 0 1 5 53300 0 2 5 1 1 2 71610 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 228 42 2000 1197.22 5000000 1 5 0 0 1 8 1 0 0 5070 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 134 29 2000 1413.14 5000000 0 6 35100 0 1 7 3 2 3 34650 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 256 41 2000 1415.74 6000000 0 6 48900 -62400 2 5 1 1 2 63400 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
4 228 44 1000 1583.91 6000000 1 2 66000 -46000 1 20 1 0 1 6500 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
In [356]:
#We use sklearn’s train_test_split to split the data into a training set and a test set.
from sklearn.model_selection import train_test_split

X = df_final.drop('fraud_reported', axis=1)
y = df['fraud_reported'].values
feature_name = X.columns.tolist()

Due to the massive amounts of computations taking place in deep learning, feature scaling is compulsory. Feature scaling standardizes the range of our independent variables.

In [333]:
print(X.shape)
print(y.shape)
(1000, 122)
(1000,)
In [334]:
print(type(X))
print(type(y))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
In [347]:
# apply PCA on X_train
from sklearn.decomposition import PCA
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
#plt.annotate('90',xy=(90, .90))
Out[347]:
Text(0,0.5,'Cumulative explained variance')
In [348]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=10)
X_r2 = lda.fit(X, y).transform(X)
# Percentage of variance explained for each components
print('LDA explained variance ratio (first two components): %s'
      % str(lda.explained_variance_ratio_))
LDA explained variance ratio (first two components): [ 1.]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:388: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
In [349]:
#Check the correlations among feature variables
corrmat = df_final.corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat);