Week 4 Pandas

Week 4: Pandas

What is pandas?¶

pandas is a Python library used for data manipulation and analysis that buil on top of the Python programming language.

Pandas work well with CSV file (e.g., excel, SQL)

Key features of Python Pandas¶

Import Pandas in Python¶

It is recommended to install and run pandas from a virtual environment

Create a virtual environment¶

Right click the Anaconda prompt and Run as Administrator (This step will ensure that your package is installed in Anaconda/env folder)

Create a virtual environment named as 'pandas_v' with python version 3.9

conda create -n pda python=3.9

List virtual environment¶

now if you list virgual envionment, you can see the virtual environment that you just create

conda env list

Activate virtual environment¶

if you virtual environment named as 'arcgis_python'

activate pda

Install pandas¶

conda install pandas

Download data in CSV format¶

Data Source: The city of Worcester

Search: Worcester Police Use of Force Incidents (July 2024)

Download CSV

Loading a Pandas DataFrame from CSV file¶

In [ ]:

            
                Copied!
                
import os
import os

In [ ]:

            
                Copied!
                
os.getcwd()
os.getcwd()

In [ ]:

            
                Copied!
                
import pandas as pd
import pandas as pd

In [ ]:

            
                Copied!
                
## use the dir() function to list all the attributes (include functions) in python
print(dir(pd))
## use the dir() function to list all the attributes (include functions) in python
print(dir(pd))

read csv¶

In [ ]:

            
                Copied!
                
incidents = pd.read_csv(r'Worcester_Police_Use_of_Force_Incidents_July_2024.csv')
incidents.tail(n=5)
incidents = pd.read_csv(r'Worcester_Police_Use_of_Force_Incidents_July_2024.csv')
incidents.tail(n=5)

A concise summary of a DataFrame¶

In [ ]:

            
                Copied!
                
incidents.info()
incidents.info()

In [ ]:

            
                Copied!
                
incidents.shape
incidents.shape

Series and DataFrame¶

Series: A one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.).

DataFrame: A two-dimensional tabular data structure with labeled axes (rows and columns).

Access a column¶

In [ ]:

            
                Copied!
                
# Access a column as series
col_series = incidents['Narr']
print(col_series)
type(col_series)
# Access a column as series
col_series = incidents['Narr']
print(col_series)
type(col_series)

In [ ]:

            
                Copied!
                
# Access a column as Series (2nd syntax)
col_series02 = incidents.Narr
print(type(col_series02))
col_series02
# Access a column as Series (2nd syntax)
col_series02 = incidents.Narr
print(type(col_series02))
col_series02

In [ ]:

            
                Copied!
                
## Access a column as DataFrame
col_df = incidents[['Narr']]
print(col_df)
type(col_df)
## Access a column as DataFrame
col_df = incidents[['Narr']]
print(col_df)
type(col_df)

Access a row¶

The iloc() function is an indexed-based selecting method which means that we have to pass an integer index in the method to select a specific row/column.

In [ ]:

            
                Copied!
                
# Access a row as Series
row_series = incidents.iloc[0,:] 
print(row_series)
type(row_series)
# Access a row as Series
row_series = incidents.iloc[0,:] 
print(row_series)
type(row_series)

In [ ]:

            
                Copied!
                
# Access a row as DataFrame
row_df = incidents.iloc[[0],:] 
print(type(row_df))
row_df
# Access a row as DataFrame
row_df = incidents.iloc[[0],:] 
print(type(row_df))
row_df

Access multiple rows and columns¶

In [ ]:

            
                Copied!
                
##### Access multiple row using iloc()
sub_inci = incidents.iloc[0:5,0:3]
print(type(sub_inci))
sub_inci
##### Access multiple row using iloc()
sub_inci = incidents.iloc[0:5,0:3]
print(type(sub_inci))
sub_inci

Access columns or rows using loc()¶

The loc() function is label based data selecting method which means that we have to pass the name of the row or column which we want to select.

In [ ]:

            
                Copied!
                
rst = incidents.loc[0:1,['Incident_No', 'Date___Time', 'Location']]
rst
rst = incidents.loc[0:1,['Incident_No', 'Date___Time', 'Location']]
rst

Exercise 1: Get record as series and Datafram from incidents¶

Get the record as series: 'Location' column
Get the record as DataFrame using index based method (.iloc): row 1 : 5 and column 2 : 3
Get the record as DataFrame using label based method (.loc): row 1:5 and column ['Supervisor', 'Notified']

In [ ]:

            
                Copied!
                
incidents.loc[1:6,['Supervisor', 'Notified']]
incidents.loc[1:6,['Supervisor', 'Notified']]

In [ ]:

            
                Copied!
                
incidents.iloc[1:5,2:3]
incidents.iloc[1:5,2:3]

In [ ]:

            
                Copied!
                
incidents.Location
incidents.Location

Data cleaning¶

Rename the column name¶

Check the columns names based on incidents.columns

In [ ]:

            
                Copied!
                
incidents.columns
incidents.columns

In [ ]:

            
                Copied!
                
# #format of columns can be the dictionary: old column name: new column name
# print('column name are {} '.format(incidents.columns))

# incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

# print('updated column name are {} '.format(incidents.columns))
# #format of columns can be the dictionary: old column name: new column name
# print('column name are {} '.format(incidents.columns))

# incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

# print('updated column name are {} '.format(incidents.columns))

In [ ]:

            
                Copied!
                
#format of columns can be the dictionary: old column name: new column name
print('column name are {} '.format(incidents.columns))

incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

print('updated column name are {} '.format(incidents.columns))
#format of columns can be the dictionary: old column name: new column name
print('column name are {} '.format(incidents.columns))

incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

print('updated column name are {} '.format(incidents.columns))

Check Null value¶

pandas.DataFrame.isnull check missing value

pandas.DataFrame.sum the sum of the values over the requested axis.

In [ ]:

            
                Copied!
                
# check null value in each cell
incidents.isnull().head(n = 5)
# check null value in each cell
incidents.isnull().head(n = 5)

In [ ]:

            
                Copied!
                
# calcualte the sum for each column
incidents.isnull().sum()
# calcualte the sum for each column
incidents.isnull().sum()

Get unique value of the columns¶

unique() function in pandas

pandas.unique

In [ ]:

            
                Copied!
                
multi_col = incidents.loc[:,['Injury', 'Narr']]
print(type(multi_col))
pd.unique(multi_col)
multi_col = incidents.loc[:,['Injury', 'Narr']]
print(type(multi_col))
pd.unique(multi_col)

Exericse 2: Loop throug the columns in incidents to calculate the unique value¶

In [ ]:

            
                Copied!
                
for col_names in multi_col.columns:
    print(col_names)
for col_names in multi_col.columns:
    print(col_names)

In [ ]:

            
                Copied!
                
multi_col['Injury']
multi_col['Injury']

In [ ]:

            
                Copied!
                
# use the column to get the series from dataframe
# calculate the unique vale for each series
for col_names in multi_col.columns:
    print(pd.unique(multi_col[col_names]))
# use the column to get the series from dataframe
# calculate the unique vale for each series
for col_names in multi_col.columns:
    print(pd.unique(multi_col[col_names]))

DataFrame Manipulation¶

Data sorting¶

pandas.DataFrame.sort_values

pandas.Series.sort_values

In [ ]:

            
                Copied!
                
inct_sort_narr = incidents.sort_values(by="Narr", ascending=False)
inct_sort_narr.head(n = 6)
inct_sort_narr = incidents.sort_values(by="Narr", ascending=False)
inct_sort_narr.head(n = 6)

Exercise 2: sort values based on Officer (String)¶

sort_values() function sort the number from smallest to largest (ascending) descending

how about sort the String?

Data query: single condition¶

In [ ]:

            
                Copied!
                
unique_injury = incidents.loc[:,"Injury"].unique()
print('unique value: {}'.format(unique_injury))
unique_injury = incidents.loc[:,"Injury"].unique()
print('unique value: {}'.format(unique_injury))

In [ ]:

            
                Copied!
                
none_incident = incidents.query('Injury == "_None"')
print(none_incident.head(n=5))
print(none_incident.shape)
none_incident = incidents.query('Injury == "_None"')
print(none_incident.head(n=5))
print(none_incident.shape)

Exercise 3: How many incidents' injury type are 'Visible Injury'¶

In [ ]:

            
                Copied!
                
none_incident = incidents.query('Injury == "_None"')
print(none_incident.head(n=5))
print(none_incident.shape)
none_incident = incidents.query('Injury == "_None"')
print(none_incident.head(n=5))
print(none_incident.shape)

In [ ]:

            
                Copied!
                
none_incident = incidents.query('Injury == "Visible Injury"')
print(none_incident.head(n=5))
print(none_incident.shape)
none_incident = incidents.query('Injury == "Visible Injury"')
print(none_incident.head(n=5))
print(none_incident.shape)

Data query: multiple condition¶

Query how many incidents's Narr are larger than 6 and Implement are 'I. Display of Firearm'

In [ ]:

            
                Copied!
                
none_firearm = incidents.query('Narr > 6 & Implement == "I. Display of Firearm"')
print(none_firearm.head(n=5))
none_firearm.shape
none_firearm = incidents.query('Narr > 6 & Implement == "I. Display of Firearm"')
print(none_firearm.head(n=5))
none_firearm.shape

DateTime¶

Pandas provides the pd.to_datetime() function to convert string data into a datetime object:

In [ ]:

            
                Copied!
                
incidents['DateTime']
incidents['DateTime']

Documentation about strftime

In [ ]:

            
                Copied!
                
# create a new column to store the DateTime
incidents['DT'] = pd.to_datetime(incidents['DateTime'], format = "%m/%d/%y %I:%M %p")
# create a new column to store the DateTime
incidents['DT'] = pd.to_datetime(incidents['DateTime'], format = "%m/%d/%y %I:%M %p")

In [ ]:

            
                Copied!
                
incidents['DT']
incidents['DT']

In [ ]:

            
                Copied!
                
# sort value based on 'DateTime' or 'DT' to see the difference
incidents.sort_values(by='DT', ascending=True).head(n=5)
# sort value based on 'DateTime' or 'DT' to see the difference
incidents.sort_values(by='DT', ascending=True).head(n=5)

In [ ]:

            
                Copied!
                
incidents.info()
incidents.info()

Get weekday from the DateTime¶

pandas.Series.dt

In [ ]:

            
                Copied!
                
incidents['wday'] = incidents.DT.dt.weekday
incidents['wday'] = incidents.DT.dt.weekday

In [ ]:

            
                Copied!
                
incidents['wday'].unique()
incidents['wday'].unique()

Exercise 4: Get year and hour from the DateTime and assign to a new column¶

incidents['yr']

incidents[hr]

In [ ]:

            
                Copied!
                
incidents['year'] = incidents.DT.dt.year
incidents['hr'] = incidents.DT.dt.hour
incidents['year'] = incidents.DT.dt.year
incidents['hr'] = incidents.DT.dt.hour

In [ ]:

            
                Copied!
                
incidents['year']
incidents['year']

Aggregating statistics¶

Aggregating Statistics of a single column¶

pandas.DataFrame.mean

In [ ]:

            
                Copied!
                
incidents['Narr'].mean()
incidents['Narr'].mean()

In [ ]:

            
                Copied!
                
incidents[["Narr"]].describe()
incidents[["Narr"]].describe()

Aggregating statistics grouped by category¶

In [ ]:

            
                Copied!
                
incidents[["Narr", "wday"]].groupby("wday").mean()
incidents[["Narr", "wday"]].groupby("wday").mean()

Exercise 5: Count the incidents based on day of week¶

using count() to count the incidents for each day of week

Group IncidentNo based on wday, then count the occurrence of incidentNo for each wday

Sort the wday based on the total number of incidents

In [ ]:

            
                Copied!
                
incidents.groupby('wday')['IncidentNo'].count()
incidents.groupby('wday')['IncidentNo'].count()

In [ ]:

            
                Copied!
                
incidents[["Narr", "wday"]].groupby("wday").count()
incidents[["Narr", "wday"]].groupby("wday").count()

In [ ]:

            
                Copied!
                
incidents[["Narr", "wday"]].groupby("wday").count()
incidents[["Narr", "wday"]].groupby("wday").count()

Spatially Enabled Data Frame¶

The Spatially Enabled DataFrame (SEDF) creates a simple, intutive object that can easily manipulate attribute and geometric data.

Create SEDF from a Shapefile

An SEDF still includes access to all the Pandas DataFrame functionality

Shapefile¶

A shapefile is made up of multiple files and all files must be found in the same folder with the same name

A shapefile must have the following:

.shp – this file stores the geometry of the feature
.shx – this file stores the index of the geometry
.dbf – this file stores the attribute information for the feature