Predictive Analytics on NYC Collision Data

Anushka Sandesara
14 min read · Dec 15, 2021

Is NYC a safe city for walking? Are people afraid of getting hit by cars? Which borough has the deadliest collisions? Are motorists injured more often than cyclists? How many injuries end in death?

All of the situations above can ultimately result in injuries and deaths. We believe we can reduce both by using the open data available, giving users effective insights ahead of time, and surfacing the factors that lead to collisions.

Photo by Fabien Bazanegue on Unsplash

Introduction

A New York Times article reported that streets in New York have grown more precarious: traffic deaths rose 10.5 percent to 8,730 in the first three months of 2021, up from 7,900 in the same period of 2020. New York City is home to more than 8 million people and approximately 45 million people visit every year, so traffic will keep surging, which can cause more collisions. The objective of this project is to gather insights, perform comprehensive exploratory data analysis to understand the relationships between the various features of the dataset, and use them to build better machine learning predictive models.

1. Data Acquisition

The dataset contains information about collisions reported in NYC by the police department. It has 1.84M rows and 29 columns, last updated as of November 7, with columns such as crash date, time, location, borough, collision ID, number of persons injured, number of pedestrians injured, number of pedestrians killed, number of cyclists killed, number of cyclists injured, contributing factors behind the collision, vehicle types, and many other relevant features. Furthermore, we scraped weather data from this website, which contains information on temperature, wind, humidity, wind speed, wind gust, pressure, precipitation, and condition.
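A minimal sketch of how the two raw tables could be loaded with pandas (the file names here are placeholders, not the exact paths we used):

```python
import pandas as pd

# Motor Vehicle Collisions - Crashes, exported as CSV from NYC Open Data
collisions = pd.read_csv("motor_vehicle_collisions_crashes.csv", low_memory=False)

# Hourly weather observations scraped from the weather website
weather = pd.read_csv("nyc_weather_scraped.csv")

print(collisions.shape)                  # roughly (1_840_000, 29) at the time of writing
print(collisions.columns.tolist()[:10])  # peek at the first few columns
```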

Proposed Methodology

First, we downloaded the NYC collision data and scraped the weather data. Then we performed data wrangling and preprocessing on the two datasets separately to prepare them for analysis and the model-building pipeline.

We joined the preprocessed and cleaned datasets and performed extensive EDA and visualization, exploring various Python libraries for interactive and appealing visualizations such as Plotly, Folium, and GeoPandas.
For modeling, we first used unsupervised techniques: PCA to reduce the high number of correlated features to a lower number of uncorrelated features, K-means clustering, and t-SNE for cluster visualization. Then we used classification models: Logistic Regression, Random Forest, and a Stacked Model.

2. Data Wrangling and Preprocessing

The data we collected is in raw format, so it is not feasible to analyze it directly; it is essential to clean and preprocess it first.

Data Wrangling and Preprocessing on NYC Collision Data:-

▹ Explored structure, statistics, and dtypes of columns

▹ Checked for Nulls in all columns

▹ Manually explored the usefulness of each column

▹ Removed redundant columns. For instance, Contributing Factor 1 contained all the required information and Contributing Factors 2–5 were redundant, so we dropped them and kept the relevant columns (~1.2 million records)

▹Checked for correlation among the columns of the dataset

Correlation plot- Collision Data

▹ The original dataset had two separate columns for collision date and collision time, with the time in 24-hour format and the date in mm/dd/yyyy format. We converted these to Python datetime format and floored them to the nearest hour to make joins and visualization uncomplicated (a small sketch follows this list).

▹ Then we derived the part of day from the converted columns, using Merriam-Webster's parts of the day as a reference:

Morning (5AM–12PM), Afternoon (12PM–5PM), Evening (5PM–9PM), and Night (9PM–5AM)

▹ There were too many contributing factors with similar meanings, which made the factors redundant, so we grouped similar ones together into a general list of contributing factors.
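Here is a small sketch of the datetime conversion and the part-of-day derivation; the crash date/time column names are assumptions based on the public dataset's schema:

```python
import pandas as pd

# Combine crash date (mm/dd/yyyy) and 24-hour crash time into a single datetime,
# then floor to the nearest hour so it lines up with the hourly weather records.
collisions["datetime"] = pd.to_datetime(
    collisions["CRASH DATE"] + " " + collisions["CRASH TIME"],
    format="%m/%d/%Y %H:%M",
)
collisions["datetime_hour"] = collisions["datetime"].dt.floor("H")

def part_of_day(hour: int) -> str:
    """Map an hour of the day to Merriam-Webster's parts of the day."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"

collisions["part_of_day"] = collisions["datetime"].dt.hour.map(part_of_day)
```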

Data Wrangling and Preprocessing on Scraped Weather Data:-

▹ Explored structure, statistics, and dtypes of columns

▹ Checked for nulls in all columns; only the wind column had nulls. We dropped the wind column because wind direction is highly location specific and not useful when analyzing data recorded at single location points.

▹Checked for correlation among the columns of the dataset

Correlation Plot- Weather Data

▹ We converted the raw datetime of the dataset to Python datetime format, floored it to the nearest hour, and dropped unwanted columns.

▹ We observed that some columns stored values with different units of measurement as text, so we converted the data values from string to float.

▹ There were too many weather conditions with similar meanings, which made the data redundant, so we grouped similar ones together into a general list of weather conditions.

We then joined the cleaned datasets on the datetime column, trying both a pandas join and a SQL join.
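A minimal sketch of that join, assuming both cleaned frames carry the floored `datetime_hour` column from the earlier step:

```python
# Join every collision with the weather observed during that hour.
# A left join keeps collisions even when a weather reading is missing.
merged = collisions.merge(weather, on="datetime_hour", how="left")

# Equivalent SQL, if the cleaned tables live in a database:
# SELECT * FROM collisions c LEFT JOIN weather w ON c.datetime_hour = w.datetime_hour;
```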

Here ends the boring but important part, and the interesting part begins. Plus: we are already tired :(

Dead Inside

3. EDA and Visualization

The primary goal of EDA and visualization is to maximize the end user's insight into a dataset and its underlying structure. We started by brainstorming the questions we intended to analyze:

  • Most important factors contributing to accidents
  • Proportion of deaths and injuries
  • Trends on holidays and special occasions
  • Relationship between the time a collision occurs and the factors contributing to it
  • Types of vehicles involved
  • Daily, weekly, monthly trends? How do trends change over time? What is the most unsafe time to drive in a day?
  • Most collision-prone regions/streets/boroughs
  • Correlation between weather condition and chances of accident

3.1 Basic Visualizations

Which factors contributed to the most injuries/deaths, and how frequent is each contributing factor?

Highest Contributing Factors
Frequency vs contributing factor

Projection:- The plot shows that the average number of people injured per contributing factor is approximately 6,500. The top 10 contributing factors are shown to understand which factors contribute the most to collisions, so that serious action can be taken accordingly. From this, we learned that Driver Inattention, Failure to Yield Right-of-Way, and Traffic Control Disregarded are the factors under which the most people were injured.

Which borough has the most collisions, and how many people were injured and killed?

Injured/killed vs Borough

Projection:- The two bar plots show that the most people were killed/injured in Brooklyn, followed by Queens. Reading various published articles, we learned that Brooklyn has the deadliest intersections and that major collisions there happen due to driver inattention. We also observed that Staten Island has the fewest collisions, as its population density is about one fifth of Brooklyn's.

What is the distribution of people killed/injured in collisions?

Distribution of people killed and injured

Projection:- From the two bar plots we observed that, by frequency of injuries, most of the injured were motorists, but by frequency of deaths, most of those killed were pedestrians. We also noted that cyclists were the least injured/killed in collisions compared to motorists and pedestrians.

What is the distribution of Collisions by Part of Day?

Collisions by Part of Day

Projection:- From the pie chart we deduced that the most collisions happened in the afternoon, followed by the morning. Digging deeper into existing studies, we found that the most collisions happen during rush hour, when most people are leaving for work, and that for all the boroughs the most collisions happened in the afternoon followed by the morning.

What is the impact of weather on collisions?

Collisions by Weather conditions

Projection:- From the pie chart we observed that most collisions happen on a cloudy day, followed by fair weather conditions. We wanted to understand the reason behind this, so we read various published articles and learned that cloudy conditions are the No. 1 weather condition in fatal collisions, according to data from the National Highway Traffic Safety Administration (NHTSA). Some reasons behind this: cloudy skies can make driving more dangerous because they decrease visibility and make it harder to see potholes, black ice, and unplowed roads.

3.2 Advanced and Interactive Visualizations

Which streets see collisions most frequently?

Wordcloud of streets

Projection:- We wanted to identify the most dangerous streets in NYC, so we used Counter to build a dictionary of street:count pairs, sorted it in descending order to surface the most dangerous streets, and plotted a word cloud from the result. The main benefit of this visual is that the bigger a street name appears in the word cloud, the more often collisions happen there.
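A small sketch of how such a word cloud could be built; the street column name is an assumption about the dataset's schema:

```python
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Count how often each street name appears in the collision records.
street_counts = Counter(merged["ON STREET NAME"].dropna())

# The word cloud sizes each street name by its collision count.
wc = WordCloud(width=1200, height=600, background_color="white")
wc.generate_from_frequencies(street_counts)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```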

What does the weekly analysis of collisions per hour look like for each borough?

We used Plotly for this visual and created the plots for all 5 boroughs (a small sketch follows the analysis below).

Weekly Analysis of Collisions per Hour

Analysis:- 1. The most collisions happened on weekdays (Monday-Friday) compared to weekends (Saturday and Sunday). We deduced that people have busy schedules on weekdays, being occupied with work and other routine activities, which leads to heavy traffic and more collisions.

2. The number of collisions rose sharply during 9-11 AM, which is considered rush hour, and then dropped comparatively.

3. The number of collisions rose sharply again during the evening; the reason we deduced is that people are returning from work and heading home, creating heavy traffic, after which the count drops again.
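A minimal Plotly sketch of this per-hour, per-weekday breakdown for one borough (the borough column name is an assumption):

```python
import plotly.express as px

# Collision counts per weekday and hour for a single borough.
brooklyn = merged[merged["BOROUGH"] == "BROOKLYN"].copy()
brooklyn["weekday"] = brooklyn["datetime"].dt.day_name()
brooklyn["hour"] = brooklyn["datetime"].dt.hour

counts = (
    brooklyn.groupby(["weekday", "hour"])
    .size()
    .reset_index(name="collisions")
)

fig = px.line(
    counts,
    x="hour",
    y="collisions",
    color="weekday",
    title="Brooklyn: collisions per hour by day of week",
)
fig.show()
```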

How many collisions occurred in each borough, and how can we view them on the NYC map?

We used the GeoPandas library along with shapefiles and the necessary GeoJSON files, which we downloaded from the website.

Collisions by Borough on a Geopandas Map

Analysis :- For this plot we experimented with various colormaps like viridis, jet, inferno, Purples, YlGnBu, and others. With the jet colormap we could see dark red highlighted portions in the Brooklyn and Queens boroughs, where the most collisions happened.
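A sketch of such a choropleth with GeoPandas; the GeoJSON file name and its "boro_name" attribute are assumptions about the boundary file, as is the borough column in our joined data:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Borough boundaries from a downloaded GeoJSON file.
boroughs = gpd.read_file("nyc_borough_boundaries.geojson")

# Collision counts per borough, title-cased to match the boundary file's names.
counts = (
    merged["BOROUGH"].str.title().value_counts()
    .rename_axis("boro_name")
    .reset_index(name="collisions")
)
boroughs = boroughs.merge(counts, on="boro_name", how="left")

# Choropleth: with the jet colormap the highest-count boroughs show up dark red.
ax = boroughs.plot(column="collisions", cmap="jet", legend=True, figsize=(10, 8))
ax.set_axis_off()
plt.show()
```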

How can we create an interactive map visualization to drill down to the location where a collision happened?

We used Folium, as it makes it easy to visualize data that has been manipulated in Python on an interactive Leaflet map. We created such maps for all the boroughs.

Brooklyn Borough Collision Locations

Analysis :- 1. The markers in the plots are interactive: clicking them keeps expanding the clusters, and the exact locations and streets where collisions happened can be visualized.

2. This is very effective for reducing collisions, because the police department can easily access the high-alert streets and the exact locations on the Folium plot where the accidents happened. Moreover, precautions and extra safety measures can be put in place in the specific areas where they are needed.
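A small Folium sketch for one borough, assuming the `brooklyn` subset from the earlier step and the latitude/longitude/street column names shown (they are assumptions about the dataset's schema):

```python
import folium
from folium.plugins import MarkerCluster

# Rough Brooklyn center; marker clusters expand as you click and zoom in.
brooklyn_map = folium.Map(location=[40.65, -73.95], zoom_start=11)
cluster = MarkerCluster().add_to(brooklyn_map)

# Plot a sample of collision locations (plotting every row would be slow).
sample = brooklyn.dropna(subset=["LATITUDE", "LONGITUDE"]).sample(2000, random_state=42)
for _, row in sample.iterrows():
    folium.Marker(
        location=[row["LATITUDE"], row["LONGITUDE"]],
        popup=str(row.get("ON STREET NAME", "")),
    ).add_to(cluster)

brooklyn_map.save("brooklyn_collisions.html")
```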

What about other details of where a collision happened, along with other relevant information about it?

Detailed Analysis of Collision Spots

Analysis :- We further wanted to create something like Google Maps using Folium, showing the exact collision spot along with important information such as latitude, longitude, zip code, street, borough, and number of collisions.

In this visualization, if the user hovers over the blue markers (fewer collisions than the red spots), the relevant collision information can be obtained easily; the red markers mark the high-alert spots, where a higher number of collisions happened compared to the blue marker points.

The interactive plots will be beneficial in real time, as many things can be interpreted easily just by glancing at the visual 👀

4. Model Development

4.1 Unsupervised Modeling

After getting our data into a format that is ingestible by our computers, we ask our machines to wave their wand and, through the magic of mathematics, bring out the inner structure and patterns within the data.

  1. PCA for Dimensionality Reduction

We used this technique before building models to reduce the high number of correlated features to a lower number of uncorrelated features. It is generally applied before building supervised classification models. In essence, PCA helps us identify patterns in the data based on the correlation between features, find the directions of maximum variance in high-dimensional data, and project the data onto fewer dimensions than the original.

Explained Variance Ratio vs Components

After PCA we obtained the plot above, which shows that the explained variance ratio starts to flatten out after 12 components, so we selected 12 as our n_components and then fit and transformed the data.
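A minimal sketch of that step, assuming `X` is the numeric/one-hot-encoded feature matrix built from the joined data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the features so no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Inspect how much variance each additional component explains.
pca_full = PCA().fit(X_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# The curve flattens out around 12 components, so keep 12.
pca = PCA(n_components=12)
X_pca = pca.fit_transform(X_scaled)
```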

2. K-Means Clustering

Clustering, being one of the most popular and oldest exploratory data analysis techniques, came as a natural choice. We used the scikit-learn library to run the K-means clustering algorithm on our dataset. Initially, we computed the SSE as the number of clusters ranges from 1 to 50 to identify the optimal number of clusters for our data.

SSE vs Number of Clusters

From the plot we see an elbow at N=21 clusters, indicating that it might be the optimal number of clusters. Furthermore, as the rate of decrease of the SSE drops significantly at N=21, we set it as our "K" for the K-means clustering algorithm.
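A sketch of the elbow search and the final fit (on ~1.2 million rows it is common to run this loop on a sample to keep it fast):

```python
from sklearn.cluster import KMeans

# SSE (inertia) for a range of cluster counts.
sse = {}
for k in range(1, 51):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_pca)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

# The elbow appears around k = 21, so refit with that value.
kmeans = KMeans(n_clusters=21, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_pca)
```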

Plots in 2D show some extent of clustering. However, the clusters appear to be highly spread out, indicating the possible absence of natural clustering in our data. To perform another level of analysis and confirm our judgment, we visualize the clusters with t-SNE components.

3. t-SNE (T-Distributed Stochastic Neighbor Embedding) for Cluster Visualization

t-SNE is a popular unsupervised algorithm used for reducing dimensionality while preserving the neighbourhood relationships between data points in the low-dimensional space. Therefore it fits our use case perfectly, where we need to visualize the clusters in 3D.
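A sketch of how such a 3D view could be produced; because t-SNE is expensive, this embeds only a random sample of the PCA-reduced data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed a 10k-row sample of the PCA-reduced data into 3 dimensions.
rng = np.random.RandomState(42)
idx = rng.choice(len(X_pca), size=10_000, replace=False)
X_3d = TSNE(n_components=3, random_state=42).fit_transform(X_pca[idx])

# Color the points by their K-means cluster label.
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2],
           c=cluster_labels[idx], s=2, cmap="tab20")
ax.set_title("K-means clusters in t-SNE space")
plt.show()
```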

Visualizing clusters in 3D

The 3D plots agree with the conclusions drawn from the 2D plots. There appear to be no clearly demarcated clusters from the K-means algorithm, as all the clusters are interspersed with each other.

Therefore we steer away from visualizing the K-means clusters and instead check whether the data clusters with respect to some other attributes.

Visualization of clustering with respect to collisions and different parts of day

Clustering by Parts of Day

Visualization of clustering with respect to collisions and whether injuries were caused

Clustering by injuries in collisions

From the visualizations we don’t discern any noticeable clusters or structure in the underlying data with respect to different attributes. We end our journey with unsupervised learning techniques at this point and foray into the territory of Supervised Learning.

4.2 Supervised Modeling

We wanted to build models that can classify whether a collision will result in an injury or not, since at the end of the day a human life is more valuable than a car getting crashed or broken in a collision.

So, the end goal of decreasing collisions and their effects is to make sure that as few people as possible are injured. We decided to predict whether a collision will lead to an injury or not. Before building the models, we took certain steps to make the data model-ready, such as one-hot encoding, and we also performed PCA for dimensionality reduction.

We split the dataset into train and test sets in a 70:30 ratio using scikit-learn's train_test_split and then performed PCA on it. We implemented three models:

  1. Logistic Regression

We trained the Logistic Regression model with ElasticNet regularization to penalize the loss function. Then we evaluated the model using classification metrics such as precision, recall, F-score, accuracy, and the confusion matrix.
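A minimal sketch of the split, PCA, and ElasticNet-regularized logistic regression, assuming `X` (one-hot-encoded features) and `y` (1 if the collision caused at least one injury, else 0):

```python
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# 70:30 train/test split, stratified on the injury label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit PCA on the training split only, then apply it to both splits.
pca = PCA(n_components=12)
X_train_p = pca.fit_transform(X_train)
X_test_p = pca.transform(X_test)

# ElasticNet regularization requires the saga solver and an l1_ratio in [0, 1].
logreg = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=1000
)
logreg.fit(X_train_p, y_train)

y_pred = logreg.predict(X_test_p)
print(classification_report(y_test, y_pred))  # precision, recall, F-score, accuracy
print(confusion_matrix(y_test, y_pred))
```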

2. Random Forest using GridSearchCV

Random Forest is an ensemble of Decision Trees. The ‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.

For hyperparameter tuning we used GridSearchCV, which loops through predefined hyperparameters and fits the estimator on the training set. In the end, we selected the best parameters from the listed hyperparameters.
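A sketch of the grid search; the grid shown here is illustrative, not the exact one we searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters for the random forest.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="f1",
    n_jobs=-1,
)
grid.fit(X_train_p, y_train)

print(grid.best_params_)          # best combination found on the training folds
rf_best = grid.best_estimator_    # refit on the full training split
```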

3. Stacked Ensemble Model

The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions with better performance than any single model in the ensemble. It uses a meta-learning algorithm to combine the predictions of two or more base machine learning models.

We stacked Random Forest, KNeighborsClassifier, and Decision Tree models with Logistic Regression as the final estimator.
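A sketch of that stack with scikit-learn's StackingClassifier (the base-model hyperparameters here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Base learners are stacked; logistic regression combines their predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=10, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train_p, y_train)
print(stack.score(X_test_p, y_test))
```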

Analysis from Model Implementation:-
We observed that although the accuracy was moderately good at approximately ~80%, the training data was imbalanced and heavily biased towards class 0. To overcome this issue, we applied an undersampling technique to balance the training data, which removes examples belonging to the majority class from the training dataset to better balance the class distribution.
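One common way to do this uses the imbalanced-learn library; a sketch, applied to the training split only:

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class (no-injury) rows from the training split
# so both classes are equally represented before refitting the models.
rus = RandomUnderSampler(random_state=42)
X_train_bal, y_train_bal = rus.fit_resample(X_train_p, y_train)

print(pd.Series(y_train).value_counts())      # imbalanced
print(pd.Series(y_train_bal).value_counts())  # balanced after undersampling
```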

Furthermore, we repeated the same steps to make the data model-ready and built the same three models as above. The accuracy dropped to ~56%, but we understood the importance of the data analysis phase: instead of focusing only on accuracy metrics to evaluate a classifier, the whole dataset and its descriptive statistics need to be taken into consideration to build bias-free and reliable models.

Conclusion

New York is sometimes called the biggest collection of villages in the world. This is the reason why this project and the datasets caught our attention. A city that is known for its hustle and bustle also inevitably sees collisions. With this project we embarked on a journey to uncover the patterns behind these collisions and the factors that might contribute to them. Ultimately we wanted to propose a solution which could help in reducing the number of accidents by suggesting preventive measures. We believe that we were able to uncover many important facts and patterns about the collisions in NYC which will assist in putting preventive measures in place and lead to safer roads.

Future Scope

In this project, we did not have time series geospatial information regarding normal traffic activity. This data can prove to be very useful for charting out routes to avoid collision prone areas. Our future efforts can be directed to obtaining this information and using it to build a guidance system for the average New Yorker for safer road journeys. In addition, more information about the people involved in the collision can be used by the authorities to improve their process of issuing driver licenses and changing driver tests.

We: while waiting for grades
