Insurance Claims Fraud Detection

Business Problem

An insurance company has approached you with a dataset of past claims filed by its clients. The company wants you to develop a model that predicts which claims are likely to be fraudulent. Flagging such claims early could save the company millions of dollars annually.

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [301]:
#Load the dataset into a dataframe
df = pd.read_csv('insurance_claims.csv')
df.head()
Out[301]:
| months_as_customer | age | policy_number | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | insured_sex | insured_education_level | insured_occupation | insured_hobbies | insured_relationship | capital-gains | capital-loss | incident_date | incident_type | collision_type | incident_severity | authorities_contacted | incident_state | incident_city | incident_location | incident_hour_of_the_day | number_of_vehicles_involved | property_damage | bodily_injuries | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud_reported
0 | 328 | 48 | 521585 | 2014-10-17 | OH | 250/500 | 1000 | 1406.91 | 0 | 466132 | MALE | MD | craft-repair | sleeping | husband | 53300 | 0 | 2015-01-25 | Single Vehicle Collision | Side Collision | Major Damage | Police | SC | Columbus | 9935 4th Drive | 5 | 1 | YES | 1 | 2 | YES | 71610 | 6510 | 13020 | 52080 | Saab | 92x | 2004 | Y
1 | 228 | 42 | 342868 | 2006-06-27 | IN | 250/500 | 2000 | 1197.22 | 5000000 | 468176 | MALE | MD | machine-op-inspct | reading | other-relative | 0 | 0 | 2015-01-21 | Vehicle Theft | ? | Minor Damage | Police | VA | Riverwood | 6608 MLK Hwy | 8 | 1 | ? | 0 | 0 | ? | 5070 | 780 | 780 | 3510 | Mercedes | E400 | 2007 | Y
2 | 134 | 29 | 687698 | 2000-09-06 | OH | 100/300 | 2000 | 1413.14 | 5000000 | 430632 | FEMALE | PhD | sales | board-games | own-child | 35100 | 0 | 2015-02-22 | Multi-vehicle Collision | Rear Collision | Minor Damage | Police | NY | Columbus | 7121 Francis Lane | 7 | 3 | NO | 2 | 3 | NO | 34650 | 7700 | 3850 | 23100 | Dodge | RAM | 2007 | N
3 | 256 | 41 | 227811 | 1990-05-25 | IL | 250/500 | 2000 | 1415.74 | 6000000 | 608117 | FEMALE | PhD | armed-forces | board-games | unmarried | 48900 | -62400 | 2015-01-10 | Single Vehicle Collision | Front Collision | Major Damage | Police | OH | Arlington | 6956 Maple Drive | 5 | 1 | ? | 1 | 2 | NO | 63400 | 6340 | 6340 | 50720 | Chevrolet | Tahoe | 2014 | Y
4 | 228 | 44 | 367455 | 2014-06-06 | IL | 500/1000 | 1000 | 1583.91 | 6000000 | 610706 | MALE | Associate | sales | board-games | unmarried | 66000 | -46000 | 2015-02-17 | Vehicle Theft | ? | Minor Damage | None | NY | Arlington | 3041 3rd Ave | 20 | 1 | NO | 0 | 1 | NO | 6500 | 1300 | 650 | 4550 | Accura | RSX | 2009 | N
In [302]:
#Check the shape of the dataframe
df.shape
Out[302]:
(1000, 39)
In [303]:
#Check the data type and non-null count of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
months_as_customer             1000 non-null int64
age                            1000 non-null int64
policy_number                  1000 non-null int64
policy_bind_date               1000 non-null object
policy_state                   1000 non-null object
policy_csl                     1000 non-null object
policy_deductable              1000 non-null int64
policy_annual_premium          1000 non-null float64
umbrella_limit                 1000 non-null int64
insured_zip                    1000 non-null int64
insured_sex                    1000 non-null object
insured_education_level        1000 non-null object
insured_occupation             1000 non-null object
insured_hobbies                1000 non-null object
insured_relationship           1000 non-null object
capital-gains                  1000 non-null int64
capital-loss                   1000 non-null int64
incident_date                  1000 non-null object
incident_type                  1000 non-null object
collision_type                 1000 non-null object
incident_severity              1000 non-null object
authorities_contacted          1000 non-null object
incident_state                 1000 non-null object
incident_city                  1000 non-null object
incident_location              1000 non-null object
incident_hour_of_the_day       1000 non-null int64
number_of_vehicles_involved    1000 non-null int64
property_damage                1000 non-null object
bodily_injuries                1000 non-null int64
witnesses                      1000 non-null int64
police_report_available        1000 non-null object
total_claim_amount             1000 non-null int64
injury_claim                   1000 non-null int64
property_claim                 1000 non-null int64
vehicle_claim                  1000 non-null int64
auto_make                      1000 non-null object
auto_model                     1000 non-null object
auto_year                      1000 non-null int64
fraud_reported                 1000 non-null object
dtypes: float64(1), int64(17), object(21)
memory usage: 304.8+ KB
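The info() output lists policy_bind_date and incident_date as object (string) columns. A minimal sketch of one way to parse them, assuming the dates follow the YYYY-MM-DD pattern seen in df.head(). The sample frame holds the first three rows shown above, and days_to_incident is a hypothetical derived column used only for illustration:

```python
import pandas as pd

# Small sample copied from the first rows shown by df.head();
# in the notebook the same calls would be applied to the full df.
sample = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2006-06-27", "2000-09-06"],
    "incident_date": ["2015-01-25", "2015-01-21", "2015-02-22"],
})

# Parse the string columns into datetime values
# (assumes every value matches the YYYY-MM-DD format)
for col in ["policy_bind_date", "incident_date"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

# Hypothetical derived feature: days between policy start and incident
sample["days_to_incident"] = (
    sample["incident_date"] - sample["policy_bind_date"]
).dt.days

print(sample.dtypes)
print(sample)
```

Whether an elapsed-time feature like this is useful for the fraud model is an open question; the point is only that date arithmetic becomes available once the columns are parsed.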
In [309]:
#List any columns that contain null (NaN) values
df.columns[df.isnull().any()]
Out[309]:
Index([], dtype='object')
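The empty result only says that no column contains NaN. The rows shown by df.head() contain '?' in collision_type, property_damage and police_report_available, which looks like a missing-value placeholder that isnull() does not detect. A minimal sketch of how those placeholders could be counted, using a small sample copied from those five rows; treating '?' as missing is an assumption, not something the dataset documents:

```python
import numpy as np
import pandas as pd

# Values copied from the five rows shown by df.head();
# in the notebook the same calls would be applied to the full df.
sample = pd.DataFrame({
    "collision_type": ["Side Collision", "?", "Rear Collision",
                       "Front Collision", "?"],
    "property_damage": ["YES", "?", "NO", "?", "NO"],
    "police_report_available": ["YES", "?", "NO", "NO", "NO"],
    "auto_make": ["Saab", "Mercedes", "Dodge", "Chevrolet", "Accura"],
})

# Count '?' placeholders per column, keeping only columns that have any
placeholder_counts = (sample == "?").sum()
placeholder_counts = placeholder_counts[placeholder_counts > 0]
print(placeholder_counts)

# Assumption: '?' means "missing". Converting it to NaN makes
# isnull() report the affected columns.
cleaned = sample.replace("?", np.nan)
print(cleaned.columns[cleaned.isnull().any()].tolist())
```

An alternative under the same assumption would be to pass na_values='?' to pd.read_csv when loading the file, so the placeholders arrive as NaN from the start.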
In [310]:
#Count fraudulent (Y) and non-fraudulent (N) claims
df.fraud_reported.value_counts()
Out[310]:
N    753
Y    247
Name: fraud_reported, dtype: int64
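The counts show an imbalanced target. A short sketch of the class shares, rebuilt here from the two counts above so that it runs on its own; in the notebook, df.fraud_reported.value_counts(normalize=True) would give the same proportions directly:

```python
import pandas as pd

# Counts taken from the value_counts() output above
counts = pd.Series({"N": 753, "Y": 247}, name="fraud_reported")

# Share of each class among the 1000 claims
shares = counts / counts.sum()
print(shares)

# Accuracy of a trivial model that always predicts the majority class "N"
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly one claim in four is labelled fraudulent, so any model should be compared against this majority-class baseline, and accuracy alone is unlikely to be an informative metric.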
In [17]:
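#Plot pairwise relationships between the numeric columns
#(slow: the 18 numeric columns produce an 18x18 grid of plots)
sns.pairplot(df)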
Out[17]:
<seaborn.axisgrid.PairGrid at 0x151dc6d8>