In this tutorial, I explain only what you need to become a data scientist, neither more nor less.
A data scientist needs these skills:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns # visualization tool
data = pd.read_csv('../input/pokemon.csv')
#correlation map
f,ax = plt.subplots(figsize=(18, 18))
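# keep only numeric and boolean columns, because newer pandas versions raise an error on text columns
sns.heatmap(data.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)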
Matplotlib is a Python library that helps us plot data. The simplest and most basic plots are line, scatter and histogram plots.
# Line Plot
# color = color, label = label, linewidth = width of line, alpha = opacity, grid = grid, linestyle = style of line
data.Speed.plot(kind = 'line', color = 'g',label = 'Speed',linewidth=1,alpha = 0.5,grid = True,linestyle = ':')
data.Defense.plot(color = 'r',label = 'Defense',linewidth=1, alpha = 0.5,grid = True,linestyle = '-.')
plt.legend(loc='upper right') # legend = puts label into plot
plt.xlabel('x axis') # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot') # title = title of plot
# Scatter Plot
# x = attack, y = defense
data.plot(kind='scatter', x='Attack', y='Defense',alpha = 0.5,color = 'red')
plt.xlabel('Attack') # label = name of label
plt.title('Attack Defense Scatter Plot') # title = title of plot
# Histogram
# bins = number of bar in figure
data.Speed.plot(kind = 'hist',bins = 50,figsize = (12,12))
# clf() = clears the figure so you can start fresh
data.Speed.plot(kind = 'hist',bins = 50)
plt.clf()
# We cannot see the plot because of clf()
Why do we need dictionaries?
#create dictionary and look its keys and values
dictionary = {'spain' : 'madrid','usa' : 'vegas'}
# Keys have to be immutable objects like string, boolean, float, integer or tuples
# A list is mutable, so it cannot be used as a key
# Keys are unique
dictionary['spain'] = "barcelona" # update existing entry
dictionary['france'] = "paris" # Add new entry
del dictionary['spain'] # remove entry with key 'spain'
print('france' in dictionary) # check whether the key is in the dictionary
dictionary.clear() # remove all entries in dict
# To run all the code, keep the next line commented out
# del dictionary # delete entire dictionary
print(dictionary) # prints {} here; it would raise a NameError if the dictionary were deleted
data = pd.read_csv('../input/pokemon.csv')
series = data['Defense'] # data['Defense'] = series
data_frame = data[['Defense']] # data[['Defense']] = data frame
# Comparison operator
print(3 > 2)
# Boolean operators
print(True and False)
print(True or False)
# 1 - Filtering Pandas data frame
x = data['Defense']>200 # There are only 3 pokemons who have higher defense value than 200
# 2 - Filtering pandas with logical_and
# There are only 2 pokemons that have a defense value higher than 200 and an attack value higher than 100
data[np.logical_and(data['Defense']>200, data['Attack']>100 )]
# This is the same as the previous code line. Therefore we can also use '&' for filtering.
data[(data['Defense']>200) & (data['Attack']>100)]
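The two filtering styles above can be compared on a tiny table. This is only a sketch: the frame `toy` and its values are made up for illustration and are not taken from the pokemon data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pokemon data (values are made up)
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Attack": [50, 120, 130, 90],
    "Defense": [210, 230, 100, 220],
})

# Style 1: numpy's logical_and on two boolean Series
mask_np = np.logical_and(toy["Defense"] > 200, toy["Attack"] > 100)
# Style 2: the & operator; each comparison needs its own parentheses
mask_amp = (toy["Defense"] > 200) & (toy["Attack"] > 100)

filtered = toy[mask_amp]
print(filtered)
print((mask_np == mask_amp).all())  # both masks select the same rows
```

The parentheses matter in the second style because `&` binds more tightly than `>` in Python.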
# Stay in loop while the condition (i is not equal to 5) is true
i = 0
while i != 5 :
    print('i is: ',i)
    i +=1
print(i,' is equal to 5')
# Iterate over every element of the list
lis = [1,2,3,4,5]
for i in lis:
    print('i is: ',i)
# Enumerate index and value of list
# index : value = 0:1, 1:2, 2:3, 3:4, 4:5
for index, value in enumerate(lis):
    print(index," : ",value)
# For dictionaries
# We can use a for loop to access the keys and values of a dictionary. We learned about keys and values in the dictionary part.
dictionary = {'spain':'madrid','france':'paris'}
for key,value in dictionary.items():
    print(key," : ",value)
# For pandas we can access index and value
for index,value in data[['Attack']][0:1].iterrows():
    print(index," : ",value)
In this part, you learned:
What we need to know about functions:
# example of what we learned above
def tuble_ex():
    """ return the defined tuple t"""
    t = (1,2,3)
    return t
a,b,c = tuble_ex()
What we need to know about scope:
# guess what this prints
x = 2
def f():
    x = 3
    return x
print(x) # x = 2 global scope
print(f()) # x = 3 local scope
# What if there is no local scope
x = 5
def f():
    y = 2*x # there is no local scope x
    return y
print(f()) # it uses global scope x
# First the local scope is searched, then the global scope; if the name is found in neither, the built-in scope is searched last.
# How can we see what is in the built-in scope
import builtins
dir(builtins)
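The search order described above can be checked with a small sketch. The function names are hypothetical, and the enclosing (nested function) scope, which sits between local and global, is included here as an extra step that the text does not mention.

```python
import builtins

x = "global"

def outer():
    x = "enclosing"
    def inner():
        return x          # not local to inner, so the enclosing scope is used
    return inner()

def use_global():
    return x              # no local x, so the global one is used

def use_builtin():
    return len([1, 2, 3])  # len is found last, in the built-in scope

found = (outer(), use_global(), use_builtin())
print(found)
print("len" in dir(builtins))
```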
#nested function
def square():
    """ return square of value """
    def add():
        """ add two local variable """
        x = 2
        y = 3
        z = x + y
        return z
    return add()**2
Default arguments: in def f(a, b = 1), b = 1 is a default argument.
Flexible arguments: *args can be one or more positional arguments.
**kwargs is a dictionary of keyword arguments.
Let's write some code to practice.
# default arguments
def f(a, b = 1, c = 2):
    y = a + b + c
    return y
# what if we want to change default arguments
# flexible arguments *args
def f(*args):
    for i in args:
        print(i)
# flexible arguments **kwargs that is dictionary
def f(**kwargs):
    """ print key and value of dictionary"""
    for key, value in kwargs.items(): # If this is unclear, go back to the for loop part and look at the dictionary example
        print(key, " ", value)
f(country = 'spain', capital = 'madrid', population = 123456)
# lambda function
square = lambda x: x**2 # where x is name of argument
tot = lambda x,y,z: x+y+z # where x,y,z are names of arguments
An anonymous function works like a lambda function. With map(func, seq), the function is applied to every item in a sequence.
number_list = [1,2,3]
y = map(lambda x:x**2,number_list)
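As a short sketch of how map behaves: it returns a lazy iterator, so the result is wrapped in list() to see the values. The second list, `other_list`, is made up for illustration and shows that map can also feed a function that takes more than one argument.

```python
number_list = [1, 2, 3]

# map returns an iterator; list() collects the results
squares = list(map(lambda x: x ** 2, number_list))

# With several iterables, the function receives one argument from each
other_list = [10, 20, 30]  # made-up values
sums = list(map(lambda x, y: x + y, number_list, other_list))

print(squares)
print(sums)
```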
# iteration example
name = "ronaldo"
it = iter(name)
print(next(it)) # print next iteration
print(*it) # print remaining iteration
zip(): zip lists
# zip example
list1 = [1,2,3,4]
list2 = [5,6,7,8]
z = zip(list1,list2)
z_list = list(z)
un_zip = zip(*z_list)
un_list1,un_list2 = list(un_zip) # unzip returns tuples
One of the most important topics of this kernel.
We often use list comprehensions for data analysis.
List comprehension: collapses for loops that build lists into a single line.
Ex: num1 = [1,2,3] and we want to make it num2 = [2,3,4]. This can be done with a for loop, but that is unnecessarily long. A list comprehension does it in one line of code.
# Example of list comprehension
num1 = [1,2,3]
num2 = [i + 1 for i in num1 ]
[i + 1 for i in num1 ]: list comprehension
i + 1: list comprehension syntax
for i in num1: for loop syntax
i: iterator
num1: iterable object
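To make the comparison concrete, here is the same transformation written both ways. The variable names `num2_loop` and `num2_comp` are only illustrative.

```python
num1 = [1, 2, 3]

# Explicit for loop: three lines to build the list
num2_loop = []
for i in num1:
    num2_loop.append(i + 1)

# List comprehension: the same result in one line
num2_comp = [i + 1 for i in num1]

print(num2_loop, num2_comp)
```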
# Conditionals on iterable
num1 = [5,10,15]
num2 = [i**2 if i == 10 else i-5 if i < 7 else i+5 for i in num1]
# let's return to the pokemon csv and make one more list comprehension example
# let's classify pokemons by whether they have high or low speed. Our threshold is the average speed.
threshold = sum(data.Speed)/len(data.Speed)
data["speed_level"] = ["high" if i > threshold else "low" for i in data.Speed]
data.loc[:10,["speed_level","Speed"]] # we will cover loc in more detail later
Up to now, you have learned:
We need to diagnose and clean data before exploring.
Unclean data:
We will use head, tail, columns, shape and info methods to diagnose data
data = pd.read_csv('../input/pokemon.csv')
data.head() # head shows first 5 rows
# tail shows last 5 rows
data.tail()
# columns gives column names of features
data.columns
# shape gives number of rows and columns in a tuple
data.shape
# info gives the data type (such as dataframe), number of samples or rows, number of features or columns, feature types and memory usage
data.info()
value_counts(): Frequency counts
outliers: values that are considerably higher or lower than the rest of the data
What is a quantile?
The median is the number in the middle of the sorted sequence. In this case it would be 11.
The lower quartile is the median of the values between the smallest number and the median, i.e. between 1 and 11, which is 6.
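The quartiles can also be computed with numpy. The sequence below is an assumption, chosen only because it is consistent with the numbers quoted above (smallest value 1, median 11, lower quartile 6).

```python
import numpy as np

# Assumed example sequence, consistent with the values quoted in the text
sequence = [1, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17]

# 25th, 50th and 75th percentiles = lower quartile, median, upper quartile
lower_quartile, median, upper_quartile = np.percentile(sequence, [25, 50, 75])
print(lower_quartile, median, upper_quartile)
```

The same three numbers are what describe() reports in its 25%, 50% and 75% rows.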
# For example, let's look at the frequency of pokemon types
print(data['Type 1'].value_counts(dropna =False)) # if there are nan values that also be counted
# As it can be seen below there are 112 water pokemon or 70 grass pokemon
# For example max HP is 255 or min defense is 5
data.describe() #ignore null entries
# For example: compare attack of pokemons that are legendary or not
# Black line at top is max
# Blue line at top is 75%
# Red line is median (50%)
# Blue line at bottom is 25%
# Black line at bottom is min
# There are no outliers
data.boxplot(column='Attack',by = 'Legendary')
We tidy data with melt(). Describing melt in words is confusing, so let's use an example to understand it.
# First I create new data from the pokemon data to explain melt more easily.
data_new = data.head() # I only take 5 rows into new data
# lets melt
# id_vars = what we do not wish to melt
# value_vars = what we want to melt
melted = pd.melt(frame=data_new,id_vars = 'Name', value_vars= ['Attack','Defense'])
# Index is name
# I want to make that columns are variable
# Finally values in columns are value
melted.pivot(index = 'Name', columns = 'variable',values='value')
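A self-contained sketch of the melt and pivot round trip, using a tiny made-up frame (`toy`) instead of the pokemon csv so that the shapes are easy to follow:

```python
import pandas as pd

# Hypothetical two-row frame (values are made up)
toy = pd.DataFrame({
    "Name": ["a", "b"],
    "Attack": [49, 62],
    "Defense": [49, 63],
})

# Wide to long: one row per (Name, variable) pair
melted = pd.melt(frame=toy, id_vars="Name", value_vars=["Attack", "Defense"])
print(melted)

# Long back to wide: the reverse of melt
restored = melted.pivot(index="Name", columns="variable", values="value")
print(restored)
```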
# Firstly lets create 2 data frame
data1 = data.head()
data2= data.tail()
conc_data_row = pd.concat([data1,data2],axis =0,ignore_index =True) # axis = 0 : adds dataframes in row
data1 = data['Attack'].head()
data2= data['Defense'].head()
conc_data_col = pd.concat([data1,data2],axis =1) # axis = 1 : adds dataframes as columns
There are 5 basic data types: object (string), boolean, integer, float and categorical.
We can convert between data types, for example from str to categorical or from int to float.
Why is category important:
# lets convert object(str) to categorical and int to float.
data['Type 1'] = data['Type 1'].astype('category')
data['Speed'] = data['Speed'].astype('float')
# As you can see Type 1 is converted from object to categorical
# And Speed is converted from int to float
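One reason the categorical type matters is memory. The sketch below compares the footprint of the same labels stored as text and as category; the Series `types` is made up, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# 1000 made-up labels with only three distinct values
types = pd.Series(["Water", "Grass", "Fire", "Water"] * 250)

as_text = types.memory_usage(deep=True)
as_category = types.astype("category").memory_usage(deep=True)

print(as_text, as_category)  # the categorical version is expected to be smaller
```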
If we encounter missing data, here is what we can do:
# Let's check whether the pokemon data has NaN values
data.info()
# As you can see there are 800 entries. However Type 2 has 414 non-null objects, so it has 386 null values.
# Let's check Type 2
data["Type 2"].value_counts(dropna =False)
# As you can see, there are 386 NaN values
# Let's drop NaN values
data1 = data.copy() # we will also use data to fill missing values, so I work on a copy named data1
data1 = data1.dropna(subset = ["Type 2"]) # keep only the rows where Type 2 is not NaN
# So does it work?
# Let's check with an assert statement
# Assert statement:
assert 1==1 # returns nothing because it is true
# In order to run all the code, we need to keep this line commented out
# assert 1==2 # raises an error because it is false
assert data1['Type 2'].notnull().all() # returns nothing because we dropped the NaN values
data["Type 2"] = data["Type 2"].fillna('empty') # assign the result back instead of using inplace on a column
assert data['Type 2'].notnull().all() # returns nothing because we do not have NaN values
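The two ways of handling missing values, dropping and filling, can be seen side by side on a small made-up frame. The name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in "Type 2"
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Type 2": ["Poison", np.nan, "Flying"],
})

dropped = toy.dropna(subset=["Type 2"])   # removes the row with the missing value
filled = toy.fillna({"Type 2": "empty"})  # keeps every row and replaces NaN

# Both results are free of missing values; the original frame is unchanged
assert dropped["Type 2"].notnull().all()
assert filled["Type 2"].notnull().all()
print(len(toy), len(dropped), len(filled))
```

Both calls return new objects, so assigning the result to a new name avoids the pitfalls of inplace changes on a single column.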
# With assert statements we can check a lot of things. For example
# assert data.columns[1] == 'Name'
# assert data.Speed.dtypes ==
In this part, you learned:
As you may have noticed, I do not present every idea at once. Although we have learned some basics of pandas, we will now go deeper into pandas.
# data frames from dictionary
country = ["Spain","France"]
population = ["11","12"]
list_label = ["country","population"]
list_col = [country,population]
zipped = list(zip(list_label,list_col))
data_dict = dict(zipped)
df = pd.DataFrame(data_dict)
# Add new columns
df["capital"] = ["madrid","paris"]
# Broadcasting
df["income"] = 0 #Broadcasting entire column
# Plotting all data
data1 = data.loc[:,["Attack","Defense","Speed"]]
data1.plot()
# it is confusing
# subplots
data1.plot(subplots = True)
# scatter plot
data1.plot(kind = "scatter",x="Attack",y = "Defense")
# hist plot
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True) # density replaces the removed normed argument
# histogram subplot with non cumulative and cumulative
fig, axes = plt.subplots(nrows=2,ncols=1)
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True,ax = axes[0])
data1.plot(kind = "hist",y = "Defense",bins = 50,range= (0,250),density = True,ax = axes[1],cumulative = True)
I already explained this in previous parts. However, let's look at it one more time.
time_list = ["1992-03-08","1992-04-12"]
print(type(time_list[1])) # As you can see date is string
# however we want it to be datetime object
datetime_object = pd.to_datetime(time_list)
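# suppress warnings
import warnings
warnings.filterwarnings("ignore")
# In order to practice, let's take the head of the pokemon data and add a time list to it
data2 = data.head().copy() # copy so that adding a column does not touch a view of data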
date_list = ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"]
datetime_object = pd.to_datetime(date_list)
data2["date"] = datetime_object
# lets make date as index
data2= data2.set_index("date")
# Now we can select according to our date index
# We will use data2 that we create at previous part
# Lets resample with month
# As you can see there are a lot of NaN values because data2 does not include all months
# With real data (not created by us like data2) we can solve this problem with interpolate
# We can interpolate from the first value
# Or we can interpolate with mean()
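A minimal sketch of resampling and interpolation on a date index. The frame `toy` and its HP values are made up; only the dates are the ones used above. The "YS" and "MS" aliases (year start, month start) are used here because they are accepted by both older and newer pandas versions, so the labels fall on the first day of each period.

```python
import pandas as pd

dates = pd.to_datetime(["1992-01-10", "1992-02-10", "1992-03-10",
                        "1993-03-15", "1993-03-16"])
toy = pd.DataFrame({"HP": [45, 60, 80, 80, 78]}, index=dates)  # made-up values

yearly = toy.resample("YS").mean()    # one row per year
monthly = toy.resample("MS").mean()   # one row per month, NaN where no data exists

# Fill the empty months by drawing a straight line between known values
monthly_filled = monthly.interpolate(method="linear")

print(yearly)
print(monthly_filled)
```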
# read data
data = pd.read_csv('../input/pokemon.csv')
data= data.set_index("#")
# indexing using square brackets
data["HP"][1]
# using column attribute and row label
data.HP[1]
# using loc accessor
data.loc[1,["HP"]]
# Selecting only some columns
data[["HP","Attack"]]
# Difference between selecting columns: series and dataframes
print(type(data["HP"])) # series
print(type(data[["HP"]])) # data frames
# Slicing and indexing series
data.loc[1:10,"HP":"Defense"] # 10 and "Defense" are inclusive
# Reverse slicing
# From something to end
# Creating boolean series
boolean = data.HP > 200
# Combining filters
first_filter = data.HP > 150
second_filter = data.Speed > 35
data[first_filter & second_filter]
# Filtering a column based on another column
# Plain python functions
def div(n):
    return n/2
data.HP.apply(div)
# Or we can use lambda function
data.HP.apply(lambda n : n/2)
# Defining column using other columns
data["total_power"] = data.Attack + data.Defense
# our index name is this:
print(data.index.name)
# let's change it to "index_name"
data.index.name = "index_name"
# Overwrite index
# if we want to modify the index, we need to change all of its values.
# first copy our data to data3, then change the index
data3 = data.copy()
# let's make the index start from 100. It is not a remarkable change, just an example
data3.index = range(100,900,1)
# We can make one of the column as index. I actually did it at the beginning of manipulating data frames with pandas section
# It was like this
# data= data.set_index("#")
# also you can use
# data.index = data["#"]
# lets read data frame one more time to start from beginning
data = pd.read_csv('../input/pokemon.csv')
# As you can see there is an index. However, we want to set one or more columns as the index
# Setting index : Type 1 is the outer and Type 2 is the inner index
data1 = data.set_index(["Type 1","Type 2"])
# data1.loc["Fire","Flying"] # how to use the indexes
dic = {"treatment":["A","A","B","B"],"gender":["F","M","F","M"],"response":[10,45,5,9],"age":[15,4,72,65]}
df = pd.DataFrame(dic)
# pivoting
df.pivot(index="treatment",columns = "gender",values="response")
df1 = df.set_index(["treatment","gender"])
# lets unstack it
# level determines indexes
# change inner and outer level index position
df2 = df1.swaplevel(0,1)
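As a sketch of what the level argument does, the block below rebuilds the same small treatment table and unstacks each index level in turn. The result names (`by_treatment`, `by_gender`, `swapped`) are only illustrative.

```python
import pandas as pd

dic = {"treatment": ["A", "A", "B", "B"], "gender": ["F", "M", "F", "M"],
       "response": [10, 45, 5, 9], "age": [15, 4, 72, 65]}
df1 = pd.DataFrame(dic).set_index(["treatment", "gender"])

by_treatment = df1.unstack(level=0)  # outer level (treatment) moves to the columns
by_gender = df1.unstack(level=1)     # inner level (gender) moves to the columns
swapped = df1.swaplevel(0, 1)        # gender becomes the outer index level

print(by_treatment)
print(by_gender)
print(swapped)
```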
# df.pivot(index="treatment",columns = "gender",values="response")
# We will use df
# according to treatment take means of other features
df.groupby("treatment").mean(numeric_only=True) # mean is an aggregation / reduction method; numeric_only skips the text column gender
# there are other methods like sum, std, max or min
# we can choose only one of the features
# Or we can choose multiple features
# as you can see gender is object
# However, before using groupby we can convert it to categorical data,
# because categorical data uses less memory and speeds up operations like groupby
#df["gender"] = df["gender"].astype("category")
#df["treatment"] = df["treatment"].astype("category")
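The groupby ideas listed above (aggregating all numeric features, one feature, or several features) can be sketched on the same treatment table. The choice of max for the single feature is only an example; any of the aggregation methods could be used.

```python
import pandas as pd

df = pd.DataFrame({"treatment": ["A", "A", "B", "B"],
                   "gender": ["F", "M", "F", "M"],
                   "response": [10, 45, 5, 9],
                   "age": [15, 4, 72, 65]})

# Mean of several chosen features per treatment group
means = df.groupby("treatment")[["response", "age"]].mean()

# A single feature with a different aggregation
max_age = df.groupby("treatment")["age"].max()

print(means)
print(max_age)
```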
