Python EDA - London Bus Safety Part 1/2

Transport for London provided their data on bus safety, and we will explore it with Python in a Jupyter notebook (Python 3). First, the library imports, data import, and a preview:
import pandas as pd
import numpy as np
#For formatting output
BOLD = '\033[1m'
END = '\033[0m'
df = pd.read_csv("TFL Bus Safety.csv",encoding='cp1252')
df.head()
OUTPUT:
| | Year | Date Of Incident | Route | Operator | Group Name | Bus Garage | Borough | Injury Result Description | Incident Event Type | Victim Category | Victims Sex | Victims Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1/1/2015 | 1 | London General | Go-Ahead | Garage Not Available | Southwark | Injuries treated on scene | Onboard Injuries | Passenger | Male | Child |
| 1 | 2015 | 1/1/2015 | 4 | Metroline | Metroline | Garage Not Available | Islington | Injuries treated on scene | Onboard Injuries | Passenger | Male | Unknown |
| 2 | 2015 | 1/1/2015 | 5 | East London | Stagecoach | Garage Not Available | Havering | Taken to Hospital – Reported Serious Injury or... | Onboard Injuries | Passenger | Male | Elderly |
| 3 | 2015 | 1/1/2015 | 5 | East London | Stagecoach | Garage Not Available | None London Borough | Taken to Hospital – Reported Serious Injury or... | Onboard Injuries | Passenger | Male | Elderly |
We will work through six tasks to explore this data:
- How many unique values are there for each variable?
- How much nullity is there for each variable?
- Create a new variable to describe whether the person was taken to hospital.
- Do any operators have a higher incidence of overall incidents?
- Do any operators have a higher incidence of hospitalizations?*
- Compare the 'Slip Trip Fall' counts across operators, ordered from most to least. Which operator has the most of these incidents?*
*Tasks 5 and 6 will be finished in the next post.
1a. How many unique values are there for each variable?
This is very straightforward as pandas has the nunique() method.
df.nunique()
OUTPUT:
Year 4
Date Of Incident 45
Route 612
Operator 25
Group Name 14
Bus Garage 84
Borough 35
Injury Result Description 4
Incident Event Type 10
Victim Category 17
Victims Sex 3
Victims Age 5
dtype: int64
1b. London does not have 35 boroughs. I know this because I googled it: there are 32, plus the City of London, which is not formally a borough.
So I put a list of London boroughs in a dataframe and compared it to the 'list' of boroughs in this dataset.
#read in the copied list of boroughs
dfboroughs = pd.read_csv("londonboroughlist.csv")
#remove whitespace (copying the borough names introduced trailing whitespace)
dfboroughs['borough'] = dfboroughs['borough'].str.strip()
#convert the dataframe column to a list
dfboroughs = dfboroughs['borough'].to_list()
#values that appear in df['Borough'] but not in the borough list
nonboroughs = df['Borough'][~df['Borough'].isin(dfboroughs)].unique()
#show which 'boroughs' in the dataframe are not real boroughs
print('Dataframe contains these non-borough fields as boroughs:\n')
for i in nonboroughs:
    print("'" + i + "'")
OUTPUT:
Dataframe contains these non-borough fields as boroughs:
'None London Borough'
'City of London'
'Not specified'
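The same membership check can be written with plain Python sets; a minimal sketch (not in the original notebook), assuming dfboroughs is the cleaned list built above:
#equivalent check via set difference
nonboroughs = set(df['Borough'].unique()) - set(dfboroughs)
for name in sorted(nonboroughs):
    print("'" + name + "'")
Note that sorted() changes the reporting order, but the set of flagged values is the same.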
1c. Examine unique values on a subset of columns
With this we can understand which variables may be categorical. We will also need to examine how this dataset reports nulls, so the exhaustive list of values will be helpful.
examine_unique_list = ['Injury Result Description', 'Incident Event Type', 'Victim Category', 'Victims Sex', 'Victims Age']
print("Examining a few variables' unique values:\n\n")
for i in examine_unique_list:
    print(BOLD + i.upper() + ':' + END)
    print(*df[i].unique(), sep='\n')
    print('\n')
OUTPUT: (image omitted; the unique values printed for each of the five columns)
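A handy variant (not in the original notebook) is value_counts(dropna=False), which also shows how often each value occurs; this makes null placeholders easy to spot:
#frequency of each unique value, including NaN
for i in examine_unique_list:
    print(BOLD + i.upper() + ':' + END)
    print(df[i].value_counts(dropna=False))
    print('\n')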
2a. How much nullity is there in each variable?
'''Using the output above, we can see how null is reported in each column;
we will use a dictionary to map each column to its null placeholder.'''
column_null_dictionary = {
'Route': '(blank)',
'Bus Garage': 'Garage Not Available',
'Borough': 'Not specified',
'Victim Category': 'Insufficient Data',
'Victims Sex': 'Unknown',
'Victims Age': 'Unknown'
}
print("NULL COUNTS\n")
#row count of the dataframe, used for the percent calculations
row_count = df.shape[0]
'''Iterate through the dictionary, replace placeholder values with NaN,
and report nulls as we go'''
for x, y in column_null_dictionary.items():
    #replace the placeholder values with NaN where appropriate
    df[x].replace(y, np.nan, inplace=True)
    #count the nulls
    null_count = df[x].isna().sum()
    null_count_percent = round((null_count / row_count) * 100, 2)
    #report the count
    print(BOLD + x + ':' + END + '\n' + str(null_count) + ' / ' + str(row_count))
    #report the percent
    print(str(null_count_percent) + '%' + '\n')
OUTPUT: (image omitted; the null count and percentage for each column in the dictionary)
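As an aside, the per-column replacement can be collapsed into a single call; a sketch (not in the original notebook), reusing the same column_null_dictionary:
#replace each column's null placeholder with NaN in one call
#(dict keys name the columns, dict values are the strings to replace)
df = df.replace(column_null_dictionary, np.nan)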
Continuing, we will run a report on nulls across all variables:
#Report on all variables as a 'sanity check'
df.isna().sum()
OUTPUT:
Year 0
Date Of Incident 0
Route 14
Operator 0
Group Name 0
Bus Garage 8572
Borough 553
Injury Result Description 0
Incident Event Type 0
Victim Category 2
Victims Sex 3602
Victims Age 7135
dtype: int64
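If you want percentages rather than raw counts, a one-liner (not in the original notebook) does it:
#share of nulls per column, as a percentage
print((df.isna().mean() * 100).round(2))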
2b. Visualize nullity
I will use the missingno library to output a nullity matrix.
import missingno as msno
#only plot the columns that contain missing values
msno.matrix(df[list(column_null_dictionary.keys())])
OUTPUT: (nullity matrix image omitted)
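missingno also provides a bar-chart view, msno.bar, if per-column counts are easier to read than the matrix; a sketch:
#bar chart of non-null counts for the same columns
msno.bar(df[list(column_null_dictionary.keys())])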
3. Create a new variable to describe whether the person was taken to hospital.
#if 'Injury Result Description' contains the string 'Hospital', flag True/False
df['Hospitalized'] = df['Injury Result Description'].str.contains('Hospital')
#convert the True/False flag into yes/no responses
df['HospitalizedYN'] = df['Hospitalized'].replace({
    True: "Yes",
    False: "No"
})
'''The True/False column is kept for future aggregations. (This is with
the hindsight of proofreading; originally I converted the yes/no column
back to boolean to do the aggregations.)'''
print('View of the new variables\n')
print(df['HospitalizedYN'].head())
print('\n')
print(df['Hospitalized'].head())
In the example above I created two new variables: one human-readable ('Yes'/'No' values) and one boolean. We can use the boolean with aggregate functions.
OUTPUT:
View of the new variables
0 No
1 No
2 Yes
3 Yes
4 Yes
Name: HospitalizedYN, dtype: object
0 False
1 False
2 True
3 True
4 True
Name: Hospitalized, dtype: bool
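Because True behaves as 1 and False as 0 in pandas arithmetic, the boolean column aggregates directly; a quick illustration (not part of the original notebook):
#sum() counts the True values, i.e. the hospitalizations
print(df['Hospitalized'].sum())
#mean() gives the share of incidents that led to hospitalization
print(round(df['Hospitalized'].mean() * 100, 2))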
4. Do any operators have a higher incidence of overall incidents?
'''Create a new dataframe with all the operators. Operator is needed;
the other column exists only to be transformed into a count column.'''
#new dataframe
operator_incidents = df[['Operator']].copy()
#create a count column in the new dataframe
operator_incidents['Count'] = 1
#value check for the code above (maybe thorough, maybe noob stuff, IDK)
print(operator_incidents)
print('\nCheck to make sure the sum of Count = the # of rows above')
print(operator_incidents['Count'].sum())
OUTPUT:
Operator Count
0 London General 1
1 Metroline 1
2 East London 1
3 East London 1
4 Metroline 1
... ... ...
23153 East London 1
23154 London United 1
23155 London United 1
23156 London United 1
23157 Metroline 1
[23158 rows x 2 columns]
Check to make sure the sum of Count = the # of rows above
23158
Then we use this for the aggregate function:
#Use an aggregate function and the sum of the new 'Count' variable
operator_incidents = operator_incidents.groupby(['Operator'], as_index=False).agg(incident_count = ('Count', 'sum'))
operator_incidents.sort_values(by=['incident_count'], ascending=False)
OUTPUT: (image omitted; the operators sorted by incident_count, descending)
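For what it's worth, pandas can produce the same ranking in one line; a sketch (not in the original notebook):
#incident counts per operator, already sorted descending
print(df['Operator'].value_counts())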
We will finish with tasks 5 and 6 in the next post.
The code for this post can be found here on GitHub.
Data Citation:
data.world [@vizwiz]. (2018). 2018/W51: London Bus Safety Performance [Dataset]. https://data.world/makeovermonday/2018w51