Genesis Generators
Stroke Prediction

About our Project

Welcome to our Project!

Our project uses a stroke prediction dataset from Kaggle to predict a person's chances of having a stroke. For this prediction, our program uses a person's gender, age, marital and smoking status, presence of heart disease or hypertension, work and residence type, BMI, and average glucose level.
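As a rough illustration of how inputs like these might be made model-ready, here is a minimal sketch assuming pandas. The column names and the three rows are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the column names are illustrative assumptions.
people = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "age": [67, 45, 80],
    "hypertension": [0, 1, 0],
    "smoking_status": ["never smoked", "smokes", "formerly smoked"],
    "bmi": [36.6, 28.1, 32.5],
})

# One-hot encode the text columns so that every feature is numeric.
encoded = pd.get_dummies(people, columns=["gender", "smoking_status"])
print(encoded.columns.tolist())
```

One-hot encoding is only one option; it avoids implying an order between categories such as smoking statuses.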

Motivation

Our Inspiration!

In the United States, someone has a stroke every 40 seconds, and someone dies of a stroke every 3 ½ minutes. Worldwide, 1 in 4 people over the age of 25 will experience a stroke. People who reach the emergency room within 3 hours of a stroke are much more likely to avoid serious disability. Our project aims to help people, especially in the healthcare industry, identify who has a higher risk of stroke, so they can spot warning signs and symptoms and access treatment more quickly.

MVP

Minimum Viable Product

In 3 weeks, our final product will have completed an analysis of the characteristics that result in stroke, and it will make accurate predictions of whether a person will have a stroke in the near future based on these characteristics.

Tech Stack

Tech we will need in order to create our project

This project uses HTML, CSS, and Bootstrap for front-end development and website creation. Alongside these, we used Python as our backbone: Pandas for data management, NumPy for mathematical operations, scikit-learn (sklearn) for machine learning, and Plotly for generating our interactive plots.

Graphs

Our data simplified

...
Visualizing Count of Categorical Columns
...
Pie Chart
...
Dual Bar Graph
...
Multifaceted Bar Graphs
...
Scatter Plot
...
Box Plot
...
Correlation Heatmap
...
Clustered Bar Graph

The Process

How we turned data into information and information into insight.

Machine Learning

Our machine learning models

Regression vs. Classification

Since we are trying to predict whether or not a person will have a stroke, this is a classification problem.


Supervised vs. Unsupervised

Because we fed our model data that was already labeled, this problem is an example of supervised machine learning.


Machine Learning Models

We tested eleven models (KNN, SVC, RFC, LR, GNB, DTC, SGDC, RC, NC, GPC, MLPC) and found three (KNN, SVC, LR) to perform best on our data. We then tuned the hyperparameters of these three models to find the best version of each.
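A comparison and tuning loop of this kind could look roughly like the sketch below, assuming scikit-learn. The synthetic data, the default model settings, and the grid of n_neighbors values are all illustrative assumptions, not the project's actual data or search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 500 rows, 10 numeric features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit each candidate and record its accuracy on the held-out test set.
candidates = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "LR": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)

# Cross-validated hyperparameter search for one candidate.
# The grid values are hypothetical.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 10, 15]},
    cv=5,
    scoring="accuracy",
)
search.fit(x_train, y_train)
best_k = search.best_params_["n_neighbors"]
print(scores, best_k)
```

Searching with cross-validation on the training split keeps the test split untouched until the final comparison.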

K-Nearest Neighbors (KNN)

For the KNN model, we set n_neighbors and leaf_size to 10. We then fit the model to x_train and y_train and tested it. This gave the lowest mean squared error we observed, which supports its accuracy on our data.
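A minimal sketch of this setup, assuming scikit-learn's KNeighborsClassifier. The synthetic data and the split settings stand in for the project's own x_train and y_train and are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), split into train and test sets.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Parameters as described in the text: n_neighbors and leaf_size set to 10.
knn = KNeighborsClassifier(n_neighbors=10, leaf_size=10)
knn.fit(x_train, y_train)

# Predict on held-out rows and measure the error.
knn_pred = knn.predict(x_test)
knn_mse = mean_squared_error(y_test, knn_pred)
print(knn_mse)
```

Note that leaf_size only affects the speed and memory of the tree-based neighbor search, not the predictions themselves.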

Logistic Regression (LR)

For the LR model, we set solver to 'liblinear' and random_state to 0. We then fit the model to x_train and y_train and tested it. This gave the same lowest observed mean squared error, which supports its accuracy on our data.
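Logistic regression also produces a probability for each class, which fits the idea of estimating someone's chances. The sketch below assumes scikit-learn's LogisticRegression and uses synthetic data as a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split into train and test sets.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Parameters as described in the text.
lr = LogisticRegression(solver="liblinear", random_state=0)
lr.fit(x_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1.
positive_probability = lr.predict_proba(x_test)[:, 1]
print(positive_probability[:5])
```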

Support Vector Machine (SVM)

For the SVC model, we set kernel to 'linear'. We then fit the model to x_train and y_train and tested it. This again gave the lowest observed mean squared error, which supports its accuracy on our data.
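With a linear kernel, the fitted model exposes one weight per feature, which gives a rough sense of which inputs push a prediction up or down. This is a minimal sketch assuming scikit-learn's SVC, again on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split into train and test sets.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Parameter as described in the text: a linear kernel.
svc = SVC(kernel="linear")
svc.fit(x_train, y_train)
svc_accuracy = svc.score(x_test, y_test)

# coef_ is only available for the linear kernel: one weight per feature.
feature_weights = svc.coef_
print(svc_accuracy, feature_weights.shape)
```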

Comparisons

Deciding on a machine learning model


Accuracy (metric 1)

Accuracy is calculated as (tp + tn) / (tp + fp + tn + fn), where tp, tn, fp, and fn are the counts of true positives, true negatives, false positives, and false negatives. The closer accuracy is to 1, the better the model.
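The accuracy formula can be worked through with a small example. The counts below are made up for illustration and are not results from this project.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total

# Hypothetical counts: 80 correct predictions out of 100.
example = accuracy(tp=30, tn=50, fp=10, fn=10)
print(example)  # 0.8
```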


Mean Squared Error (metric 2)

MSE is calculated by squaring the difference between each predicted value and the actual value, then averaging the results. We want it to be as close to 0 as possible. For a classifier with 0/1 labels, each squared difference is either 0 or 1, so MSE equals the share of wrong predictions.
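For 0/1 class labels, MSE and accuracy carry the same information, because each squared error is 1 for a wrong prediction and 0 for a right one. The short sketch below checks this on made-up labels, which are not project data.

```python
# Hypothetical true and predicted labels (1 = stroke, 0 = no stroke).
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Mean squared error: average of the squared differences.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy: share of predictions that match the true label.
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse, acc)  # 0.25 0.75
```

Because the two always sum to 1 in this setting, ranking models by lowest MSE gives the same order as ranking them by highest accuracy.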

These graphs make it clear that the KNN model is best.

Conclusion

Our findings

Our results show that the variables in our dataset that correlate positively with stroke include being female, having no heart disease, working in the private sector, having hypertension, having never smoked, being older, being married, and having a slightly higher body mass index. These variables indicate that a person may be at increased risk of a stroke. We found that the most accurate machine learning model, with the lowest mean squared error, was K-Nearest Neighbors (KNN). We hope that our findings will help the healthcare industry predict and classify strokes.

Our Amazing Team

The brilliant minds behind this project

Anas Ahmad

...

Greetings all! I am an Accounting & Finance major in my sophomore year at the University of Kansas. I am a motorhead at heart and a K-Pop/Dramas fan, and I am pursuing a future in corporate law.

Aria Hoesley

...

Hi guys! I'm a rising freshman at Northwestern University planning on majors in data science and biology. I love coding games, biking, and reading.

Owen Yeung

...

I am a rising junior in high school who enjoys video games, playing piano, reading, and learning new things.

Shreya Sareen

...

Hi, I am a college freshman majoring in Computer/Data Science with a minor in Bioinformatics. I enjoy Nintendo, piano, art, stargazing, and listening to lofi music.

Bryce Reid

...

William Lin

...

Hi! I am a freshman attending Chabot Community College and will be a transfer student to the University of California. After college, I hope to attend medical school to become a doctor.