Week 4 Pandas

&nbsp &nbsp

Week 4: Pandas

What is pandas?¶

pandas is a Python library used for data manipulation and analysis that buil on top of the Python programming language.

Pandas work well with CSV file (e.g., excel, SQL)

Key features of Python Pandas¶

Import Pandas in Python¶

It is recommended to install and run pandas from a virtual environment

Create a virtual environment¶

Right click the Anaconda prompt and Run as Administrator (This step will ensure that your package is installed in Anaconda/env folder)

Create a virtual environment named as 'pandas_v' with python version 3.9

conda create -n pda python=3.9

List virtual environment¶

now if you list virgual envionment, you can see the virtual environment that you just create

conda env list

Activate virtual environment¶

if you virtual environment named as 'arcgis_python'

activate pda

Install pandas¶

conda install pandas

Download data in CSV format¶

Data Source: The city of Worcester

Search: Worcester Police Use of Force Incidents (July 2024)

Download CSV

Loading a Pandas DataFrame from CSV file¶

In [2]:

            
                Copied!
                
import pandas as pd
import pandas as pd

In [4]:

            
                Copied!
                
## use the dir() function to list all the attributes (include functions) in python
print(dir(pd))
## use the dir() function to list all the attributes (include functions) in python
print(dir(pd))

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'io', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'offsets', 'option_context', 'options', 'pandas', 'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'testing', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries', 'unique', 'util', 'value_counts', 'wide_to_long']

read csv¶

In [3]:

            
                Copied!
                
incidents = pd.read_csv(r'D:\Teaching_Data\Worcester_Police_Use_of_Force_Incidents_July_2024.csv')
incidents.head(n=3)
incidents = pd.read_csv(r'D:\Teaching_Data\Worcester_Police_Use_of_Force_Incidents_July_2024.csv')
incidents.head(n=3)

Out[3]:

	Incident_No	Date___Time	Location	Officer	Narr	Implement	Injury	Supervisor	Notified	ObjectId
0	2024000078279	7/29/24 1:09 PM	9 LINCOLN SQ Worcester, MA	Daniel Adjei	1	R. Empty Hand Compliance Techniques / Level 3	Visible Injury	Neil F O'Connor	7/29/24 4:27 PM	1
1	2024000078796	7/30/24 4:38 PM	6 SHANNON ST Apt #: 2 Worcester, MA	Rachel E Frisch	3	R. Empty Hand Compliance Techniques / Level 3	_None	Michael A Cappabianca Jr	7/30/24 6:57 PM	2
2	2024000078796	7/30/24 4:38 PM	6 SHANNON ST Apt #: 2 Worcester, MA	Michael Genese	2	R. Empty Hand Compliance Techniques / Level 3	_None	Michael A Cappabianca Jr	7/30/24 6:55 PM	3
3	2024000078953	7/30/24 9:50 PM	79 MAYFIELD ST Worcester, MA	David Green	3	I. Display of Firearm	_None	Stephen L Roche	7/30/24 10:26 PM	4
4	2024000079115	7/31/24 11:05 AM	70 JACKSON ST Worcester, MA	DUY CHAU	6	I. Display of Firearm	_None	Shawn M Barbale	7/31/24 1:20 PM	5

A concise summary of a DataFrame¶

In [4]:

            
                Copied!
                
incidents.info()
incidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Incident_No  74 non-null     int64 
 1   Date___Time  74 non-null     object
 2   Location     74 non-null     object
 3   Officer      74 non-null     object
 4   Narr         74 non-null     int64 
 5   Implement    74 non-null     object
 6   Injury       74 non-null     object
 7   Supervisor   74 non-null     object
 8   Notified     74 non-null     object
 9   ObjectId     74 non-null     int64 
dtypes: int64(3), object(7)
memory usage: 5.9+ KB

In [5]:

            
                Copied!
                
incidents.shape
incidents.shape

Out[5]:

(74, 10)

Series and DataFrame¶

Series: A one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, etc.).

DataFrame: A two-dimensional tabular data structure with labeled axes (rows and columns).

Access a column¶

In [28]:

            
                Copied!
                
# Access a column as series
col_series = incidents['Narr']
print(col_series)
type(col_series)
# Access a column as series
col_series = incidents['Narr']
print(col_series)
type(col_series)

0     1
1     3
2     2
3     3
4     6
     ..
69    5
70    2
71    1
72    4
73    2
Name: Narr, Length: 74, dtype: int64

Out[28]:

pandas.core.series.Series

In [38]:

            
                Copied!
                
# Access a column as Series (2nd syntax)
col_series02 = incidents.Narr
print(type(col_series02))
col_series02
# Access a column as Series (2nd syntax)
col_series02 = incidents.Narr
print(type(col_series02))
col_series02

<class 'pandas.core.series.Series'>

Out[38]:

0     1
1     3
2     2
3     3
4     6
     ..
69    5
70    2
71    1
72    4
73    2
Name: Narr, Length: 74, dtype: int64

In [6]:

            
                Copied!
                
## Access a column as DataFrame
col_df = incidents[['Narr']]
print(col_df)
type(col_df)
## Access a column as DataFrame
col_df = incidents[['Narr']]
print(col_df)
type(col_df)

    Narr
0      1
1      3
2      2
3      3
4      6
..   ...
69     5
70     2
71     1
72     4
73     2

[74 rows x 1 columns]

Out[6]:

pandas.core.frame.DataFrame

Access a row¶

The iloc() function is an indexed-based selecting method which means that we have to pass an integer index in the method to select a specific row/column.

In [7]:

            
                Copied!
                
# Access a row as Series
row_series = incidents.iloc[0,:] 
print(row_series)
type(row_series)
# Access a row as Series
row_series = incidents.iloc[0,:] 
print(row_series)
type(row_series)

Incident_No                                    2024000078279
Date___Time                                  7/29/24 1:09 PM
Location                         9 LINCOLN SQ Worcester, MA 
Officer                                        Daniel Adjei 
Narr                                                       1
Implement      R. Empty Hand Compliance Techniques / Level 3
Injury                                        Visible Injury
Supervisor                                  Neil F O'Connor 
Notified                                     7/29/24 4:27 PM
ObjectId                                                   1
Name: 0, dtype: object

Out[7]:

pandas.core.series.Series

In [8]:

            
                Copied!
                
# Access a row as DataFrame
row_df = incidents.iloc[[0],:] 
print(type(row_df))
row_df
# Access a row as DataFrame
row_df = incidents.iloc[[0],:] 
print(type(row_df))
row_df

<class 'pandas.core.frame.DataFrame'>

Out[8]:

	Incident_No	Date___Time	Location	Officer	Narr	Implement	Injury	Supervisor	Notified	ObjectId
0	2024000078279	7/29/24 1:09 PM	9 LINCOLN SQ Worcester, MA	Daniel Adjei	1	R. Empty Hand Compliance Techniques / Level 3	Visible Injury	Neil F O'Connor	7/29/24 4:27 PM	1

Access multiple rows and columns¶

In [9]:

            
                Copied!
                
##### Access multiple row using iloc()
sub_inci = incidents.iloc[0:5,0:3]
print(type(sub_inci))
sub_inci
##### Access multiple row using iloc()
sub_inci = incidents.iloc[0:5,0:3]
print(type(sub_inci))
sub_inci

<class 'pandas.core.frame.DataFrame'>

Out[9]:

	Incident_No	Date___Time	Location
0	2024000078279	7/29/24 1:09 PM	9 LINCOLN SQ Worcester, MA
1	2024000078796	7/30/24 4:38 PM	6 SHANNON ST Apt #: 2 Worcester, MA
2	2024000078796	7/30/24 4:38 PM	6 SHANNON ST Apt #: 2 Worcester, MA
3	2024000078953	7/30/24 9:50 PM	79 MAYFIELD ST Worcester, MA
4	2024000079115	7/31/24 11:05 AM	70 JACKSON ST Worcester, MA

Access columns or rows using loc()¶

The loc() function is label based data selecting method which means that we have to pass the name of the row or column which we want to select.

In [10]:

            
                Copied!
                
rst = incidents.loc[0:1,['Incident_No', 'Date___Time', 'Location']]
rst
rst = incidents.loc[0:1,['Incident_No', 'Date___Time', 'Location']]
rst

Out[10]:

	Incident_No	Date___Time	Location
0	2024000078279	7/29/24 1:09 PM	9 LINCOLN SQ Worcester, MA
1	2024000078796	7/30/24 4:38 PM	6 SHANNON ST Apt #: 2 Worcester, MA

Exercise 1: Get record as series and Datafram from incidents¶

Get the record as series: 'Location' column
Get the record as DataFrame using index based method (.iloc): row 1 - 5 and column 2 - 3
Get the record as DataFrame using label based method (.loc): row 1-5 and column ['Supervisor', 'Notified']

Data cleaning¶

Rename the column name¶

Check the columns names based on incidents.columns

In [12]:

            
                Copied!
                
#format of columns can be the dictionary: old column name: new column name
print('column name are {} '.format(incidents.columns))

incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

print('updated column name are {} '.format(incidents.columns))
#format of columns can be the dictionary: old column name: new column name
print('column name are {} '.format(incidents.columns))

incidents.rename(columns={'Date___Time':'DateTime', 'Incident_No':"IncidentNo"}, inplace=True)

print('updated column name are {} '.format(incidents.columns))

column name are Index(['IncidentNo', 'DateTime', 'Location', 'Officer', 'Narr', 'Implement',
       'Injury', 'Supervisor', 'Notified', 'ObjectId'],
      dtype='object') 
updated column name are Index(['IncidentNo', 'DateTime', 'Location', 'Officer', 'Narr', 'Implement',
       'Injury', 'Supervisor', 'Notified', 'ObjectId'],
      dtype='object')

Check Null value¶

pandas.DataFrame.isnull check missing value

pandas.DataFrame.sum the sum of the values over the requested axis.

In [13]:

            
                Copied!
                
# check null value in each cell
incidents.isnull().head(n = 5)
# check null value in each cell
incidents.isnull().head(n = 5)

Out[13]:

	IncidentNo	DateTime	Location	Officer	Narr	Implement	Injury	Supervisor	Notified	ObjectId
0	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False

In [14]:

            
                Copied!
                
# calcualte the sum for each column
incidents.isnull().sum()
# calcualte the sum for each column
incidents.isnull().sum()

Out[14]:

IncidentNo    0
DateTime      0
Location      0
Officer       0
Narr          0
Implement     0
Injury        0
Supervisor    0
Notified      0
ObjectId      0
dtype: int64

Get unique value of the columns¶

unique() function in pandas

pandas.unique

In [15]:

            
                Copied!
                
multi_col = incidents.loc[:,['Injury', 'Narr']]
pd.unique(multi_col)
multi_col = incidents.loc[:,['Injury', 'Narr']]
pd.unique(multi_col)

C:\Users\yanawu\AppData\Local\Temp\ipykernel_8664\3584398706.py:2: FutureWarning: unique with argument that is not not a Series, Index, ExtensionArray, or np.ndarray is deprecated and will raise in a future version.
  pd.unique(multi_col)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 2
      1 multi_col = incidents.loc[:,['Injury', 'Narr']]
----> 2 pd.unique(multi_col)

File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:401, in unique(values)
    307 def unique(values):
    308     """
    309     Return unique values based on a hash table.
    310 
   (...)
    399     array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)
    400     """
--> 401     return unique_with_mask(values)

File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:429, in unique_with_mask(values, mask)
    427 def unique_with_mask(values, mask: npt.NDArray[np.bool_] | None = None):
    428     """See algorithms.unique for docs. Takes a mask for masked arrays."""
--> 429     values = _ensure_arraylike(values, func_name="unique")
    431     if isinstance(values.dtype, ExtensionDtype):
    432         # Dispatch to extension dtype's unique.
    433         return values.unique()

File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\algorithms.py:238, in _ensure_arraylike(values, func_name)
    236     if isinstance(values, tuple):
    237         values = list(values)
--> 238     values = construct_1d_object_array_from_listlike(values)
    239 else:
    240     values = np.asarray(values)

File d:\Anaconda\envs\pandas_vp\lib\site-packages\pandas\core\dtypes\cast.py:1601, in construct_1d_object_array_from_listlike(values)
   1598 # numpy will try to interpret nested lists as further dimensions, hence
   1599 # making a 1D array that contains list-likes is a bit tricky:
   1600 result = np.empty(len(values), dtype="object")
-> 1601 result[:] = values
   1602 return result

ValueError: could not broadcast input array from shape (74,2) into shape (74,)

In [66]:

            
                Copied!
                
unique_col = {}
for col in multi_col.columns:
    unique_col[col] = pd.unique(multi_col[col])
unique_col
unique_col = {}
for col in multi_col.columns:
    unique_col[col] = pd.unique(multi_col[col])
unique_col

Out[66]:

{'Injury': array(['Visible Injury', '_None', 'Complaint of Injury'], dtype=object),
 'Narr': array([ 1,  3,  2,  6,  4, 12,  7,  5,  8,  9, 10], dtype=int64)}

DataFrame Manipulation¶

Data sorting¶

pandas.DataFrame.sort_values

pandas.Series.sort_values

In [11]:

            
                Copied!
                
inct_sort_narr = incidents.sort_values(by="Narr", ascending=False)
inct_sort_narr.head(n = 6)
inct_sort_narr = incidents.sort_values(by="Narr", ascending=False)
inct_sort_narr.head(n = 6)

Out[11]:

	Incident_No	Date___Time	Location	Officer	Narr	Implement	Injury	Supervisor	Notified	ObjectId
24	2024000067866	7/6/24 12:33 AM	466 HARDING ST Worcester, MA	Santino Simone	12	G. Display of Taser or Sparks Display	_None	Terrence G Cahill	8/18/24 12:39 AM	25
55	2024000074652	7/20/24 11:52 PM	3 JEFFERSON ST Worcester, MA	Tyler R Sterner	10	G. Display of Taser or Sparks Display	_None	Michael A Cappabianca Jr	7/24/24 9:23 PM	56
47	2024000072086	7/15/24 10:27 AM	442 GRAFTON ST Worcester, MA	Dimitrios Gaitanidis	9	I. Display of Firearm	_None	Jeffrey P Carlson	7/15/24 1:05 PM	48
48	2024000072443	7/16/24 6:23 AM	5 TOWNSEND ST Apt #: 2 Worcester, MA	Terrence O Gaffney	9	I. Display of Firearm	_None	Terrence G Cahill	7/16/24 7:41 PM	49
43	2024000072086	7/15/24 10:27 AM	442 GRAFTON ST Worcester, MA	Jose M Lugo-Gardner	8	I. Display of Firearm	_None	Shawn M Barbale	7/15/24 2:01 PM	44

Exercise 2: sort values based on Officer (String)¶

sort_values() function sort the number from smallest to largest (ascending) descending

how about sort the String?

Data query: single condition¶

In [18]:

            
                Copied!
                
unique_injury = incidents.loc[:,"Injury"].unique()
print('unique value: {}'.format(unique_injury))
unique_injury = incidents.loc[:,"Injury"].unique()
print('unique value: {}'.format(unique_injury))

unique value: ['Visible Injury' '_None' 'Complaint of Injury']

In [13]:

            
                Copied!
                
none_incident = incidents.query('Injury == "_None"')
print(none_incident.shape)
none_incident = incidents.query('Injury == "_None"')
print(none_incident.shape)

Out[13]:

(68, 10)

Exercise 3: How many incidents' injury type are 'Visible Injury'¶

Data query: multiple condition¶

Query how many incidents's Narr are larger than 6 and Implement are 'I. Display of Firearm'

In [16]:

            
                Copied!
                
none_firearm = incidents.query('Narr > 6 & Implement == "I. Display of Firearm"')
none_firearm.shape
none_firearm = incidents.query('Narr > 6 & Implement == "I. Display of Firearm"')
none_firearm.shape

Out[16]:

(4, 10)

DateTime¶

Pandas provides the pd.to_datetime() function to convert string data into a datetime object:

In [16]:

            
                Copied!
                
incidents['DateTime']
incidents['DateTime']

Out[16]:

0      7/29/24 1:09 PM
1      7/30/24 4:38 PM
2      7/30/24 4:38 PM
3      7/30/24 9:50 PM
4     7/31/24 11:05 AM
            ...       
69    7/27/24 12:43 AM
70    7/29/24 11:41 AM
71    7/29/24 11:41 AM
72     7/29/24 1:09 PM
73     7/29/24 1:09 PM
Name: DateTime, Length: 74, dtype: object

In [17]:

            
                Copied!
                
# create a new column to store the DateTime
incidents['DT'] = pd.to_datetime(incidents['DateTime'], format = "%m/%d/%y %I:%M %p")
# create a new column to store the DateTime
incidents['DT'] = pd.to_datetime(incidents['DateTime'], format = "%m/%d/%y %I:%M %p")

In [18]:

            
                Copied!
                
# sort value based on 'DateTime' or 'DT' to see the difference
incidents.sort_values(by='DT', ascending=True).head(n=5)
# sort value based on 'DateTime' or 'DT' to see the difference
incidents.sort_values(by='DT', ascending=True).head(n=5)

Out[18]:

	IncidentNo	DateTime	Location	Officer	Narr	Implement	Injury	Supervisor	Notified	ObjectId	DT
7	2024000065816	7/1/24 9:46 AM	25 QUEEN ST Worcester, MA	Joseph L Ford	1	R. Empty Hand Compliance Techniques / Level 3	_None	Justin Bennes	7/1/24 4:00 PM	8	2024-07-01 09:46:00
9	2024000066207	7/2/24 10:11 AM	275 PLEASANT ST Apt #: 713 Worcester, MA	John Biancaniello	1	G. Display of Taser or Sparks Display	_None	MIGUEL DIAZ	7/2/24 11:45 AM	10	2024-07-02 10:11:00
8	2024000066207	7/2/24 10:11 AM	275 PLEASANT ST Apt #: 713 Worcester, MA	David C McAtee	2	I. Display of Firearm	_None	MIGUEL DIAZ	7/2/24 12:11 PM	9	2024-07-02 10:11:00
10	2024000066256	7/2/24 12:02 PM	25 TOBIAS BOLAND WAY Worcester, MA	James P Ciru	1	R. Empty Hand Compliance Techniques / Level 3	_None	Jeffrey P Carlson	7/2/24 1:04 PM	11	2024-07-02 12:02:00
11	2024000066380	7/2/24 5:16 PM	810 MAIN ST Worcester, MA	Stephen J Mitchell	2	R. Empty Hand Compliance Techniques / Level 3	_None	Christopher A Panarello	7/2/24 5:11 PM	12	2024-07-02 17:16:00

In [19]:

            
                Copied!
                
incidents.info()
incidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   IncidentNo  74 non-null     int64         
 1   DateTime    74 non-null     object        
 2   Location    74 non-null     object        
 3   Officer     74 non-null     object        
 4   Narr        74 non-null     int64         
 5   Implement   74 non-null     object        
 6   Injury      74 non-null     object        
 7   Supervisor  74 non-null     object        
 8   Notified    74 non-null     object        
 9   ObjectId    74 non-null     int64         
 10  DT          74 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(7)
memory usage: 6.5+ KB

Get weekday from the DateTime¶

pandas.Series.dt

In [22]:

            
                Copied!
                
incidents['wday'] = incidents.DT.dt.weekday
incidents['wday'] = incidents.DT.dt.weekday

In [21]:

            
                Copied!
                
incidents['wday'].unique()
incidents['wday'].unique()

Out[21]:

array([0, 1, 2, 3, 4, 5, 6])

Exercise 4: Get year and hour from the DateTime and assign to a new column¶

incidents['yr']

incidents[hr]

In [23]:

            
                Copied!
                
incidents['yr'] = incidents.DT.dt.year
incidents['hr'] = incidents.DT.dt.hour
incidents['yr'] = incidents.DT.dt.year
incidents['hr'] = incidents.DT.dt.hour

Aggregating statistics¶

Aggregating Statistics of a single column¶

pandas.DataFrame.mean

In [24]:

            
                Copied!
                
incidents['Narr'].mean()
incidents['Narr'].mean()

Out[24]:

2.9324324324324325

In [25]:

            
                Copied!
                
incidents[["Narr"]].describe()
incidents[["Narr"]].describe()

Out[25]:

	Narr
count	74.000000
mean	2.932432
std	2.377584
min	1.000000
25%	1.000000
50%	2.000000
75%	3.000000
max	12.000000

Aggregating statistics grouped by category¶

In [13]:

            
                Copied!
                
incidents[["Narr", "wday"]].groupby("wday").mean()
incidents[["Narr", "wday"]].groupby("wday").mean()

Out[13]:

	Narr
wday
0	3.142857
1	2.700000
2	2.307692
3	2.111111
4	2.833333
5	4.200000
6	2.428571

Exercise 5: Count the incidents based on day of week¶

using count() to count the incidents for each day of week

Group IncidentNo based on wday, then count the occurrence of incidentNo for each wday

Sort the wday based on the total number of incidents

In [19]:

            
                Copied!
                
incidents[["IncidentNo", "wday"]].groupby("wday").count()
incidents[["IncidentNo", "wday"]].groupby("wday").count()

Out[19]:

	IncidentNo
wday
0	14
1	10
2	13
3	9
4	6
5	15
6	7

Spatially Enabled Data Frame¶

The Spatially Enabled DataFrame (SEDF) creates a simple, intutive object that can easily manipulate attribute and geometric data.

Create SEDF from a Shapefile

An SEDF still includes access to all the Pandas DataFrame functionality

Shapefile¶

A shapefile is made up of multiple files and all files must be found in the same folder with the same name

A shapefile must have the following:

.shp – this file stores the geometry of the feature
.shx – this file stores the index of the geometry
.dbf – this file stores the attribute information for the feature

Week 4 Pandas

Week 4: Pandas

What is pandas?¶

Key features of Python Pandas¶

Import Pandas in Python¶

Create a virtual environment¶

List virtual environment¶

Activate virtual environment¶

Install pandas¶

Download data in CSV format¶

Loading a Pandas DataFrame from CSV file¶

read csv¶

A concise summary of a DataFrame¶

Series and DataFrame¶

Access a column¶

Access a row¶

Access multiple rows and columns¶

Access columns or rows using loc()¶

Exercise 1: Get record as series and Datafram from incidents¶

Data cleaning¶

Rename the column name¶

Check Null value¶

Get unique value of the columns¶

DataFrame Manipulation¶

Data sorting¶

Exercise 2: sort values based on Officer (String)¶

Data query: single condition¶

Exercise 3: How many incidents' injury type are 'Visible Injury'¶

Data query: multiple condition¶

DateTime¶

Get weekday from the DateTime¶

Exercise 4: Get year and hour from the DateTime and assign to a new column¶

Aggregating statistics¶

Aggregating Statistics of a single column¶

Aggregating statistics grouped by category¶

Exercise 5: Count the incidents based on day of week¶

Spatially Enabled Data Frame¶

Shapefile¶

Next, do the task about Process SEDF in the notebook in ArcGIS Pro¶