Coursera Capstone project, 2020.

6 October 2020


Contents

Introduction
A description of the problem and a discussion of the background.
Data

Introduction

A description of the problem and a discussion of the background.

We live in a world shaped by technology. We cannot imagine our lives without machines, but some of them can be quite dangerous, causing injuries or even death when used improperly.

For example, according to the WHO (World Health Organization), road accidents cause up to 1.4 million deaths worldwide annually. Road accidents are the 8th most frequent cause of death overall and the most frequent one in the 5-29 age group. People all over the world are working on reducing the probability and severity of accidents, because human life is the most precious thing on Earth.

In this work we analyse collected car accident statistics to answer the following question:

"How might we predict the probability of a road accident in actual conditions to warn the human about the upcoming danger before the road accident occurs?"

Data

A description of the data and how it will be used to solve the problem.

To answer the business question, we need data on different types of road accidents, including their severity and the conditions that may have caused them. In this work we explore the dataset provided by Coursera, which contains information about road accidents in Seattle. The dataset has 38 columns and 194,673 rows. We assume that we can pick relevant indicators to build a supervised model that will be able to predict a possible road accident with sufficient probability. The general plan is:

Inspect the dataset
Fix missing data and type mismatch problems
Investigate relations between parameters and a target variable (Severity)
Pick the most illustrative parameters
Build a model based on picked parameters
Check model quality
Improve the model
Deliver the final model with recommendations

Dataset description

The initial dataset consists of 194,673 rows and 38 columns. Let's check the data types:

SEVERITYCODE      int64
X                 float64
Y                 float64
OBJECTID          int64
INCKEY            int64
COLDETKEY         int64
REPORTNO          object
STATUS            object
ADDRTYPE          object
INTKEY            float64
LOCATION          object
EXCEPTRSNCODE     object
EXCEPTRSNDESC     object
SEVERITYCODE.1    int64
SEVERITYDESC      object
COLLISIONTYPE     object
PERSONCOUNT       int64
PEDCOUNT          int64
PEDCYLCOUNT       int64
VEHCOUNT          int64
INCDATE           object
INCDTTM           object
JUNCTIONTYPE      object
SDOT_COLCODE      int64
SDOT_COLDESC      object
INATTENTIONIND    object
UNDERINFL         object
WEATHER           object
ROADCOND          object
LIGHTCOND         object
PEDROWNOTGRNT     object
SDOTCOLNUM        float64
SPEEDING          object
ST_COLCODE        object
ST_COLDESC        object
SEGLANEKEY        int64
CROSSWALKKEY      int64
HITPARKEDCAR      object
dtype: object

We have selected the following parameters for our model, bearing in mind that they should be measurable while driving:

Weather
Road conditions
Light conditions
Speeding

and the dependent variable SEVERITY. Other parameters may well play an important role, but they cannot be measured while driving (for example, "Drugs influence" and the characteristics of the incident being predicted). Let's construct the reduced dataset, excluding rows with Unknown or missing data.

After these operations the dataset consists of 170,510 rows and 5 columns. Since our variables are mainly categorical, let us apply one-hot encoding.
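The filtering and encoding steps described above could look roughly like the sketch below. It is only an illustration on a tiny made-up sample, not the code used for this report: the sample rows are hypothetical, and it assumes that SPEEDING is recorded as "Y" when flagged and left empty otherwise, so that column is mapped to 1/0 rather than filtered.

```python
import pandas as pd

# Tiny hypothetical sample mimicking the relevant columns of the dataset.
raw = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "WEATHER": ["Clear", "Raining", "Unknown", "Clear", None],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight",
                  "Daylight", "Daylight"],
    "SPEEDING": [None, "Y", None, None, "Y"],
})

condition_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]
reduced = raw[["SEVERITYCODE"] + condition_cols + ["SPEEDING"]].copy()

# Assumption: an empty SPEEDING value means "not speeding".
reduced["SPEEDING"] = (reduced["SPEEDING"] == "Y").astype(int)

# Keep only rows whose condition columns are neither "Unknown" nor missing.
known = ~reduced[condition_cols].isin(["Unknown"]).any(axis=1)
complete = reduced[condition_cols].notna().all(axis=1)
reduced = reduced[known & complete].rename(columns={"SEVERITYCODE": "SEVERITY"})

# One-hot encode the categorical columns; the prefixes mirror names
# such as "RoadCond_Wet" used in the correlation table.
encoded = pd.get_dummies(
    reduced,
    columns=condition_cols,
    prefix=["Weather", "RoadCond", "LightCond"],
    dtype=int,
)

print(encoded.shape)
print(sorted(encoded.columns))
```

Each category value becomes its own 0/1 column, which is why the five original columns expand into many more features after encoding.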

Methodology

Let us examine the correlation of our variables with the dependent variable in order to choose an appropriate model. We will use the Pearson correlation coefficient for this.
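For reference, the Pearson coefficient between a one-hot indicator and the severity code can be computed as in the following minimal sketch. The function name pearson and the toy data are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: a 0/1 indicator and a severity code (1 or 2).
road_wet = [0, 1, 0, 1, 0, 1, 0, 0]
severity = [1, 2, 1, 1, 2, 2, 1, 1]

r = pearson(road_wet, severity)
print(round(r, 3))
```

With the data in a pandas DataFrame, the same coefficients for all columns at once are available through DataFrame.corr(), whose default method is Pearson.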

Discussion section

Unfortunately, none of the variables correlates well with Severity. That might mean that a linear model will not perform well. Let us examine the parameters with the strongest negative and positive correlations with SEVERITY.

Parameter                            Correlation with SEVERITY
LightCond_Dark - Street Lights On    -0.029372
RoadCond_Snow/Slush                  -0.021316
LightCond_Dark - No Street Lights    -0.019214
Weather_Snowing                      -0.019056
RoadCond_Ice                         -0.014229
RoadCond_Wet                          0.010186
Weather_Raining                       0.012265
SPEEDING                              0.026915
LightCond_Daylight                    0.031597

As we can see, the strongest correlation with SEVERITY (that of LightCond_Daylight) is only about 0.032, or 3.2%. That means that non-linear approaches might result in better prediction accuracy.

Statistical testing

Let us construct our models, starting by defining the train and test datasets.

The sizes of the train and test arrays are given below:

Train arrays: (136408, 27) (136408,)

Test arrays: (34102, 27) (34102,)

Let us build several models and examine their quality on the train and test data. First, we should normalise the data.
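A split and normalisation step of this kind might be sketched as follows. The 80/20 ratio matches the array sizes reported above (136408 of 170510 rows), but everything else here is an assumption: the report does not say which tools or which scaler were used, so scikit-learn's train_test_split and StandardScaler stand in for them, and the random data is a hypothetical placeholder for the encoded dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoded data: 1000 rows, 27 binary features.
X = rng.integers(0, 2, size=(1000, 27)).astype(float)
y = rng.integers(1, 3, size=1000)  # severity codes 1 or 2

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train arrays:", X_train_scaled.shape, y_train.shape)
print("Test arrays:", X_test_scaled.shape, y_test.shape)
```

Fitting the scaler on the training part alone is a common precaution; whether the original analysis normalised before or after splitting is not stated.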

Linear Ridge Regression

Accuracy Ridge Linear Regression train:0.0033

Accuracy Ridge Linear Regression test:0.0035

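As we predicted, the linear model does not give good results because of the low correlation. Note that for a regression model this score is most likely the coefficient of determination (R²) rather than a classification accuracy, so it is not directly comparable with the accuracies of the classifiers below.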

Logistic Regression

Accuracy Logistic Regression train:0.6722

Accuracy Logistic Regression test:0.6732

Logistic regression gives much better results, possibly because it models the class probability through a non-linear (sigmoid) function instead of fitting the severity code directly.

Decision Trees

Accuracy Tree train:0.6723

Accuracy Tree test:0.6732

Decision trees give almost the same result as logistic regression: 67% accuracy.

K Nearest Neighbours

Accuracy Neigh train:0.6721

Accuracy Neigh test:0.6733
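The comparison of classifiers above could be reproduced along the lines of the sketch below. It assumes scikit-learn; the synthetic data, the hyperparameters (max_depth, n_neighbors) and the added majority-class baseline are all hypothetical choices made for illustration and are not taken from this report. The baseline is included because it gives a reference point for reading accuracy figures on imbalanced classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Hypothetical data: 5 binary features, severity 2 is the minority class
# and is slightly more likely when the first feature is set.
X = rng.integers(0, 2, size=(2000, 5)).astype(float)
p_severe = 0.3 + 0.1 * X[:, 0]
y = np.where(rng.random(2000) < p_severe, 2, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Neigh": KNeighborsClassifier(n_neighbors=25),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For classifiers, score() returns the fraction of correct predictions.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"Accuracy {name} train:{scores[name][0]:.4f} test:{scores[name][1]:.4f}")
```

If a model's accuracy is close to that of the majority baseline, the features add little beyond the class frequencies, which is worth checking before interpreting the figures.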

Results section

As we have seen, there is no linear relationship between the selected variables and the target variable. That is why Linear Regression gave poor results. Classification methods (Logistic Regression and Decision Trees) handled the problem far better: both reached 67% accuracy on the train and test datasets. K Nearest Neighbours resulted in approximately the same accuracy.

This result could certainly be improved, but it is already a good accuracy, bearing in mind that we are dealing with human health and even human lives.

Conclusion section

We have finished our examination of the accidents dataset and found that predicting car accident severity may be a complicated problem, because the relationship between the chosen variables and the target variable is not linear. A predictive model that gives correct results in about 2/3 of cases was implemented using Logistic Regression, and its accuracy was later matched by Decision Trees and K Nearest Neighbours.

This model could be used in future automobiles to warn the driver after analysing multiple variables. Using it might lead to a rapid fall in deaths and injuries on the road (as an optimistic upper bound, 2/3 × 1.4 million ≈ 0.93 million, i.e. up to about 1 million lives saved annually).

However, a deeper investigation is needed, because the initial dataset was a learning one and many vital parameters, such as tire pressure, driver experience and drive type (rear/front/4WD), were missing. Including those parameters may lead to better predictive models.