Week 4 Pandas
Week 4: Pandas
What is pandas?¶
pandas is a Python library used for data manipulation and analysis that buil on top of the Python programming language.
Pandas work well with CSV file (e.g., excel, SQL)
Key features of Python Pandas¶
Import Pandas in Python¶
It is recommended to install and run pandas from a virtual environment
Create a virtual environment¶
Right click the Anaconda prompt and Run as Administrator (This step will ensure that your package is installed in Anaconda/env folder)
Create a virtual environment named as 'pandas_v' with python version 3.9
conda create -n pda python=3.9
List virtual environment¶
now if you list virgual envionment, you can see the virtual environment that you just create
conda env list
Activate virtual environment¶
if you virtual environment named as 'arcgis_python'
activate pda
Install pandas¶
conda install pandas
Download data in CSV format¶
Data Source: The city of Worcester
Search: Worcester Police Use of Force Incidents (July 2024)
Download CSV
Loading a Pandas DataFrame from CSV file¶
import pandas as pd
## use the dir() function to list all the attributes (include functions) in python
print(dir(pd))
['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'io', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'offsets', 'option_context', 'options', 'pandas', 'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'testing', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries', 'unique', 'util', 'value_counts', 'wide_to_long']
read csv¶
incidents = pd.read_csv(r'D:\Teaching_Data\Worcester_Police_Use_of_Force_Incidents_July_2024.csv')
incidents.head(n=3)
Incident_No | Date___Time | Location | Officer | Narr | Implement | Injury | Supervisor | Notified | ObjectId | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2024000078279 | 7/29/24 1:09 PM | 9 LINCOLN SQ Worcester, MA | Daniel Adjei | 1 | R. Empty Hand Compliance Techniques / Level 3 | Visible Injury | Neil F O'Connor | 7/29/24 4:27 PM | 1 |
1 | 2024000078796 | 7/30/24 4:38 PM | 6 SHANNON ST Apt #: 2 Worcester, MA | Rachel E Frisch | 3 | R. Empty Hand Compliance Techniques / Level 3 | _None | Michael A Cappabianca Jr | 7/30/24 6:57 PM | 2 |
2 | 2024000078796 | 7/30/24 4:38 PM | 6 SHANNON ST Apt #: 2 Worcester, MA | Michael Genese | 2 | R. Empty Hand Compliance Techniques / Level 3 | _None | Michael A Cappabianca Jr | 7/30/24 6:55 PM | 3 |
3 | 2024000078953 | 7/30/24 9:50 PM | 79 MAYFIELD ST Worcester, MA | David Green | 3 | I. Display of Firearm | _None | Stephen L Roche | 7/30/24 10:26 PM | 4 |
4 | 2024000079115 | 7/31/24 11:05 AM | 70 JACKSON ST Worcester, MA | DUY CHAU | 6 | I. Display of Firearm | _None | Shawn M Barbale | 7/31/24 1:20 PM | 5 |
A concise summary of a DataFrame¶
incidents.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 74 entries, 0 to 73 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Incident_No 74 non-null int64 1 Date___Time 74 non-null object 2 Location 74 non-null object 3 Officer 74 non-null object 4 Narr 74 non-null int64 5 Implement 74 non-null object 6 Injury 74 non-null object 7 Supervisor 74 non-null object 8 Notified 74 non-null object 9 ObjectId 74 non-null int64 dtypes: int64(3), object(7) memory usage: 5.9+ KB
incidents.shape
(74, 10)
Series and DataFrame¶
Series: A one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.).
DataFrame: A two-dimensional tabular data structure with labeled axes (rows and columns).
Access a column¶
# Access a column as series
col_series = incidents['Narr']
print(col_series)
type(col_series)
0 1 1 3 2 2 3 3 4 6 .. 69 5 70 2 71 1 72 4 73 2 Name: Narr, Length: 74, dtype: int64
pandas.core.series.Series
# Access a column as Series (2nd syntax)
col_series02 = incidents.Narr
print(type(col_series02))
col_series02
<class 'pandas.core.series.Series'>
0 1 1 3 2 2 3 3 4 6 .. 69 5 70 2 71 1 72 4 73 2 Name: Narr, Length: 74, dtype: int64
## Access a column as DataFrame
col_df = incidents[['Narr']]
print(col_df)
type(col_df)
Narr 0 1 1 3 2 2 3 3 4 6 .. ... 69 5 70 2 71 1 72 4 73 2 [74 rows x 1 columns]
pandas.core.frame.DataFrame
Access a row¶
The iloc() function is an indexed-based selecting method which means that we have to pass an integer index in the method to select a specific row/column.
# Access a row as Series
row_series = incidents.iloc[0,:]
print(row_series)
type(row_series)
Incident_No 2024000078279 Date___Time 7/29/24 1:09 PM Location 9 LINCOLN SQ Worcester, MA Officer Daniel Adjei Narr 1 Implement R. Empty Hand Compliance Techniques / Level 3 Injury Visible Injury Supervisor Neil F O'Connor Notified 7/29/24 4:27 PM ObjectId 1 Name: 0, dtype: object
pandas.core.series.Series
# Access a row as DataFrame
row_df = incidents.iloc[[0],:]
print(type(row_df))
row_df
<class 'pandas.core.frame.DataFrame'>
Incident_No | Date___Time | Location | Officer | Narr | Implement | Injury | Supervisor | Notified | ObjectId | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2024000078279 | 7/29/24 1:09 PM | 9 LINCOLN SQ Worcester, MA | Daniel Adjei | 1 | R. Empty Hand Compliance Techniques / Level 3 | Visible Injury | Neil F O'Connor | 7/29/24 4:27 PM | 1 |
Access multiple rows and columns¶
##### Access multiple row using iloc()
sub_inci = incidents.iloc[0:5,0:3]
print(type(sub_inci))
sub_inci
<class 'pandas.core.frame.DataFrame'>
Incident_No | Date___Time | Location | |
---|---|---|---|
0 | 2024000078279 | 7/29/24 1:09 PM | 9 LINCOLN SQ Worcester, MA |
1 | 2024000078796 | 7/30/24 4:38 PM | 6 SHANNON ST Apt #: 2 Worcester, MA |
2 | 2024000078796 | 7/30/24 4:38 PM | 6 SHANNON ST Apt #: 2 Worcester, MA |
3 | 2024000078953 | 7/30/24 9:50 PM | 79 MAYFIELD ST Worcester, MA |
4 | 2024000079115 | 7/31/24 11:05 AM | 70 JACKSON ST Worcester, MA |
Access columns or rows using loc()¶
The loc() function is label based data selecting method which means that we have to pass the name of the row or column which we want to select.
rst = incidents.loc[0:1,['Incident_No', 'Date___Time', 'Location']]
rst
Incident_No | Date___Time | Location | |
---|---|---|---|
0 | 2024000078279 | 7/29/24 1:09 PM | 9 LINCOLN SQ Worcester, MA |
1 | 2024000078796 | 7/30/24 4:38 PM | 6 SHANNON ST Apt #: 2 Worcester, MA |
Exercise 1: Get record as series and Datafram from incidents¶
Get the record as series: 'Location' column
Get the record as DataFrame using index based method (.iloc): row 1 - 5 and column 2 - 3
Get the record as DataFrame using label based method (.loc): row 1-5 and column ['Supervisor', 'Notified']
#format of columns can be the dictionary: old column name: new column name
print('column name are {} '.format(incidents.columns))
incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)
print('updated column name are {} '.format(incidents.columns))
column name are Index(['IncidentNo', 'DateTime', 'Location', 'Officer', 'Narr', 'Implement', 'Injury', 'Supervisor', 'Notified', 'ObjectId'], dtype='object') updated column name are Index(['IncidentNo', 'DateTime', 'Location', 'Officer', 'Narr', 'Implement', 'Injury', 'Supervisor', 'Notified', 'ObjectId'], dtype='object')
Check Null value¶
pandas.DataFrame.isnull check missing value
pandas.DataFrame.sum the sum of the values over the requested axis.
# check null value in each cell
incidents.isnull().head(n = 5)
IncidentNo | DateTime | Location | Officer | Narr | Implement | Injury | Supervisor | Notified | ObjectId | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
# calcualte the sum for each column
incidents.isnull().sum()
IncidentNo 0 DateTime 0 Location 0 Officer 0 Narr 0 Implement 0 Injury 0 Supervisor 0 Notified 0 ObjectId 0 dtype: int64
multi_col = incidents.loc[:,['Injury', 'Narr']]
pd.unique(multi_col)
C:\Users\yanawu\AppData\Local\Temp\ipykernel_8664\3584398706.py:2: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version. pd.unique(multi_col)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[15], line 2 1 multi_col = incidents.loc[:,['Injury', 'Narr']] ----> 2 pd.unique(multi_col) File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:401, in unique(values) 307 def unique(values): 308 """ 309 Return unique values based on a hash table. 310 (...) 399 array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object) 400 """ --> 401 return unique_with_mask(values) File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:429, in unique_with_mask(values, mask) 427 def unique_with_mask(values, mask: npt.NDArray[np.bool_] | None = None): 428 """See algorithms.unique for docs. Takes a mask for masked arrays.""" --> 429 values = _ensure_arraylike(values, func_name="unique") 431 if isinstance(values.dtype, ExtensionDtype): 432 # Dispatch to extension dtype's unique. 433 return values.unique() File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:238, in _ensure_arraylike(values, func_name) 236 if isinstance(values, tuple): 237 values = list(values) --> 238 values = construct_1d_object_array_from_listlike(values) 239 else: 240 values = np.asarray(values) File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\dtypes\cast.py:1601, in construct_1d_object_array_from_listlike(values) 1598 # numpy will try to interpret nested lists as further dimensions, hence 1599 # making a 1D array that contains list-likes is a bit tricky: 1600 result = np.empty(len(values), dtype="object") -> 1601 result[:] = values 1602 return result ValueError: could not broadcast input array from shape (74,2) into shape (74,)
unique_col = {}
for col in multi_col.columns:
unique_col[col] = pd.unique(multi_col[col])
unique_col
{'Injury': array(['Visible Injury', '_None', 'Complaint of Injury'], dtype=object), 'Narr': array([ 1, 3, 2, 6, 4, 12, 7, 5, 8, 9, 10], dtype=int64)}
inct_sort_narr = incidents.sort_values(by="Narr", ascending=False)
inct_sort_narr.head(n = 6)
Incident_No | Date___Time | Location | Officer | Narr | Implement | Injury | Supervisor | Notified | ObjectId | |
---|---|---|---|---|---|---|---|---|---|---|
24 | 2024000067866 | 7/6/24 12:33 AM | 466 HARDING ST Worcester, MA | Santino Simone | 12 | G. Display of Taser or Sparks Display | _None | Terrence G Cahill | 8/18/24 12:39 AM | 25 |
55 | 2024000074652 | 7/20/24 11:52 PM | 3 JEFFERSON ST Worcester, MA | Tyler R Sterner | 10 | G. Display of Taser or Sparks Display | _None | Michael A Cappabianca Jr | 7/24/24 9:23 PM | 56 |
47 | 2024000072086 | 7/15/24 10:27 AM | 442 GRAFTON ST Worcester, MA | Dimitrios Gaitanidis | 9 | I. Display of Firearm | _None | Jeffrey P Carlson | 7/15/24 1:05 PM | 48 |
48 | 2024000072443 | 7/16/24 6:23 AM | 5 TOWNSEND ST Apt #: 2 Worcester, MA | Terrence O Gaffney | 9 | I. Display of Firearm | _None | Terrence G Cahill | 7/16/24 7:41 PM | 49 |
43 | 2024000072086 | 7/15/24 10:27 AM | 442 GRAFTON ST Worcester, MA | Jose M Lugo-Gardner | 8 | I. Display of Firearm | _None | Shawn M Barbale | 7/15/24 2:01 PM | 44 |
Exercise 2: sort values based on Officer (String)¶
sort_values() function sort the number from smallest to largest (ascending) descending
how about sort the String?
Data query: single condition¶
unique_injury = incidents.loc[:,"Injury"].unique()
print('unique value: {}'.format(unique_injury))
unique value: ['Visible Injury' '_None' 'Complaint of Injury']
none_incident = incidents.query('Injury == "_None"')
print(none_incident.shape)
(68, 10)
Exercise 3: How many incidents' injury type are 'Visible Injury'¶
Data query: multiple condition¶
Query how many incidents's Narr are larger than 6 and Implement are 'I. Display of Firearm'
none_firearm = incidents.query('Narr > 6 & Implement == "I. Display of Firearm"')
none_firearm.shape
(4, 10)
DateTime¶
Pandas provides the pd.to_datetime()
function to convert string data into a datetime object:
incidents['DateTime']
0 7/29/24 1:09 PM 1 7/30/24 4:38 PM 2 7/30/24 4:38 PM 3 7/30/24 9:50 PM 4 7/31/24 11:05 AM ... 69 7/27/24 12:43 AM 70 7/29/24 11:41 AM 71 7/29/24 11:41 AM 72 7/29/24 1:09 PM 73 7/29/24 1:09 PM Name: DateTime, Length: 74, dtype: object
# create a new column to store the DateTime
incidents['DT'] = pd.to_datetime(incidents['DateTime'], format = "%m/%d/%y %I:%M %p")
# sort value based on 'DateTime' or 'DT' to see the difference
incidents.sort_values(by='DT', ascending=True).head(n=5)
IncidentNo | DateTime | Location | Officer | Narr | Implement | Injury | Supervisor | Notified | ObjectId | DT | |
---|---|---|---|---|---|---|---|---|---|---|---|
7 | 2024000065816 | 7/1/24 9:46 AM | 25 QUEEN ST Worcester, MA | Joseph L Ford | 1 | R. Empty Hand Compliance Techniques / Level 3 | _None | Justin Bennes | 7/1/24 4:00 PM | 8 | 2024-07-01 09:46:00 |
9 | 2024000066207 | 7/2/24 10:11 AM | 275 PLEASANT ST Apt #: 713 Worcester, MA | John Biancaniello | 1 | G. Display of Taser or Sparks Display | _None | MIGUEL DIAZ | 7/2/24 11:45 AM | 10 | 2024-07-02 10:11:00 |
8 | 2024000066207 | 7/2/24 10:11 AM | 275 PLEASANT ST Apt #: 713 Worcester, MA | David C McAtee | 2 | I. Display of Firearm | _None | MIGUEL DIAZ | 7/2/24 12:11 PM | 9 | 2024-07-02 10:11:00 |
10 | 2024000066256 | 7/2/24 12:02 PM | 25 TOBIAS BOLAND WAY Worcester, MA | James P Ciru | 1 | R. Empty Hand Compliance Techniques / Level 3 | _None | Jeffrey P Carlson | 7/2/24 1:04 PM | 11 | 2024-07-02 12:02:00 |
11 | 2024000066380 | 7/2/24 5:16 PM | 810 MAIN ST Worcester, MA | Stephen J Mitchell | 2 | R. Empty Hand Compliance Techniques / Level 3 | _None | Christopher A Panarello | 7/2/24 5:11 PM | 12 | 2024-07-02 17:16:00 |
incidents.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 74 entries, 0 to 73 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 IncidentNo 74 non-null int64 1 DateTime 74 non-null object 2 Location 74 non-null object 3 Officer 74 non-null object 4 Narr 74 non-null int64 5 Implement 74 non-null object 6 Injury 74 non-null object 7 Supervisor 74 non-null object 8 Notified 74 non-null object 9 ObjectId 74 non-null int64 10 DT 74 non-null datetime64[ns] dtypes: datetime64[ns](1), int64(3), object(7) memory usage: 6.5+ KB
Get weekday from the DateTime¶
incidents['wday'] = incidents.DT.dt.weekday
incidents['wday'].unique()
array([0, 1, 2, 3, 4, 5, 6])
Exercise 4: Get year and hour from the DateTime and assign to a new column¶
incidents['yr']
incidents[hr]
incidents['yr'] = incidents.DT.dt.year
incidents['hr'] = incidents.DT.dt.hour
Aggregating statistics¶
Aggregating Statistics of a single column¶
incidents['Narr'].mean()
2.9324324324324325
incidents[["Narr"]].describe()
Narr | |
---|---|
count | 74.000000 |
mean | 2.932432 |
std | 2.377584 |
min | 1.000000 |
25% | 1.000000 |
50% | 2.000000 |
75% | 3.000000 |
max | 12.000000 |
Aggregating statistics grouped by category¶
incidents[["Narr", "wday"]].groupby("wday").mean()
Narr | |
---|---|
wday | |
0 | 3.142857 |
1 | 2.700000 |
2 | 2.307692 |
3 | 2.111111 |
4 | 2.833333 |
5 | 4.200000 |
6 | 2.428571 |
Exercise 5: Count the incidents based on day of week¶
using count() to count the incidents for each day of week
Group IncidentNo based on wday, then count the occurrence of incidentNo for each wday
Sort the wday based on the total number of incidents
incidents[["IncidentNo", "wday"]].groupby("wday").count()
IncidentNo | |
---|---|
wday | |
0 | 14 |
1 | 10 |
2 | 13 |
3 | 9 |
4 | 6 |
5 | 15 |
6 | 7 |
Spatially Enabled Data Frame¶
The Spatially Enabled DataFrame (SEDF) creates a simple, intutive object that can easily manipulate attribute and geometric data.
Create SEDF from a Shapefile
An SEDF still includes access to all the Pandas DataFrame functionality
Shapefile¶
A shapefile is made up of multiple files and all files must be found in the same folder with the same name
A shapefile must have the following:
- .shp – this file stores the geometry of the feature
- .shx – this file stores the index of the geometry
- .dbf – this file stores the attribute information for the feature