Genesis Generators Project

About our Project

Welcome to our Project!

Our project uses a stroke prediction dataset from Kaggle to predict a person's chance of having a stroke. For this prediction, our program uses a person's gender, age, marital status, smoking status, presence of heart disease or hypertension, work type, residence type, BMI, and average glucose level.
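Because several of these inputs are categories rather than numbers, they usually have to be converted to numeric columns before a Sklearn model can use them. The snippet below is a minimal sketch of one common way to do that with Pandas (one-hot encoding). The column names and the two rows are made up for illustration and are assumptions, not values taken from our dataset or a description of our exact preprocessing.

```python
import pandas as pd

# Two made-up people for illustration only; the column names are assumptions.
people = pd.DataFrame({
    "gender": ["Female", "Male"],
    "age": [67, 42],
    "ever_married": ["Yes", "No"],
    "smoking_status": ["never smoked", "smokes"],
    "heart_disease": [0, 0],
    "hypertension": [1, 0],
    "work_type": ["Private", "Self-employed"],
    "residence_type": ["Urban", "Rural"],
    "bmi": [28.1, 24.3],
    "avg_glucose_level": [105.9, 88.2],
})

# Text-valued columns that need to become numeric indicator columns.
categorical = ["gender", "ever_married", "smoking_status",
               "work_type", "residence_type"]

# One-hot encoding: each category value becomes its own column.
features = pd.get_dummies(people, columns=categorical)
print(features.columns.tolist())
```

The numeric columns (age, BMI, glucose level, and the 0/1 flags) pass through unchanged, while each text column is replaced by one column per category value.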

Motivation

Our Inspiration!

In the United States, someone has a stroke every 40 seconds, and someone dies of a stroke every 3 ½ minutes. Worldwide, 1 in 4 people over the age of 25 will experience a stroke. People who reach the emergency room within 3 hours of a stroke are much more likely to avoid serious disability. Our project aims to help people, especially those in the healthcare industry, identify who is at higher risk of a stroke so they can spot warning signs and symptoms and get treatment sooner.

MVP

Minimum Viable Product

In 3 weeks, our final product will have completed an analysis of the characteristics associated with stroke, and it will make accurate predictions of whether a person will have a stroke in the near future based on these characteristics.

Tech Stack

Tech we will need in order to create our project

This project uses HTML, CSS, and Bootstrap for front end development and website creation. Along with these tools, we used Python as our backbone: Pandas for data management, NumPy for mathematical operations, Sklearn for machine learning, and Plotly for generating our interactive plots.

Graphs

Our data simplified

Visualizing Count of Categorical Columns

Pie Chart

Dual Bar Graph

Multifaceted Bar Graphs

Scatter Plot

Box Plot

Correlation Heatmap

Clustered Bar Graph

The Process

How we turned data into information and information into insight.

Week 1

Exploring and Gathering Data

For the first week of AI camp, our team was introduced to Data Science and Data Analytics. We learned the basics of the Python programming language and of data analysis. At the end of week 1, we explored various datasets, and our team chose Stroke Prediction.

Week 2

Data Analysis

After choosing our dataset, we applied the skills we learned in week 1 to clean our data. We then performed exploratory data analysis, making plots and graphs to find patterns and trends and to determine which variables were correlated. We learned about the different types of machine learning models and implemented several classification models. We evaluated each model by computing its mean squared error and creating a confusion matrix, and then compared metrics including precision, accuracy, F1 score, and recall.

Week 3

Front End Development

After interpreting our dataset and acquiring the knowledge we needed to understand our dataset, we moved on to front end development. In week 3, we learned about HTML (Hypertext Markup Language) and CSS (Cascading Style Sheets) which we used to start developing our website. We compiled all of our data and built our website into our final product.

Machine Learning

Our machine learning models

Regression vs. Classification

Since we are trying to predict whether or not a person will have a stroke, this is a classification problem.

Supervised vs. Unsupervised

Because we fed our model data that was already labeled, this problem is an example of supervised machine learning.

Machine Learning Models

We tested eleven models (KNN, SVC, RFC, LR, GNB, DTC, SGDC, RC, NC, GPC, MLPC) and found three of them (KNN, SVC, LR) to be the best for our data. We also tuned the hyperparameters of these three models to find the best version of each.
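As an illustration of what hyperparameter tuning can look like in Sklearn, here is a minimal sketch using GridSearchCV on a KNN model. The search method, the candidate values, the 5-fold cross-validation, and the synthetic data are all assumptions made for this example; they are not a record of the exact search we ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 rows, 10 numeric features,
# and a binary target. It is not the stroke dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameter values (illustrative assumptions).
param_grid = {"n_neighbors": [5, 10, 15], "leaf_size": [10, 30]}

# Try every combination with 5-fold cross-validation and keep the one
# with the highest mean accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print("Best parameters:", best_params)
print(f"Mean cross-validated accuracy: {best_score:.3f}")
```

The same pattern works for the other models by swapping in a different estimator and a parameter grid that matches its settings.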

K-Nearest Neighbors (KNN)

For the KNN model, we set n_neighbors and leaf_size to 10. We then fit the model to x_train and y_train and tested it. This gave the minimum mean squared error we observed, which supports its accuracy on our data.
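The steps above can be sketched as follows. The n_neighbors and leaf_size values come from our description; the synthetic data, the 80/20 split, and the x_test and y_test names are assumptions added so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 500 rows, 10 numeric features,
# and a binary target. The real project used the Kaggle stroke dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters named in the text above.
knn = KNeighborsClassifier(n_neighbors=10, leaf_size=10)
knn.fit(x_train, y_train)

# Predict on the held-out rows and score the predictions.
# With 0/1 labels, MSE is the fraction of misclassified test rows.
knn_predictions = knn.predict(x_test)
knn_mse = mean_squared_error(y_test, knn_predictions)
print(f"KNN test MSE: {knn_mse:.3f}")
```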

Logistic Regression (LR)

For the LR model, we set solver to 'liblinear' and random_state to 0. We then fit the model to x_train and y_train and tested it. This gave the same minimum observed mean squared error, which supports its accuracy on our data.
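A minimal sketch of this setup is shown below. The solver and random_state values come from our description. The synthetic data and the split are assumptions, and the predict_proba call is included only to show how a logistic regression model can report a probability in addition to a yes/no label.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the stroke dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters named in the text above.
lr = LogisticRegression(solver="liblinear", random_state=0)
lr.fit(x_train, y_train)

# Hard 0/1 predictions for each test row.
lr_predictions = lr.predict(x_test)

# predict_proba returns one column per class; column 1 holds the
# estimated probability of the positive class.
lr_probabilities = lr.predict_proba(x_test)[:, 1]
print("First five probabilities:", lr_probabilities[:5].round(3))
```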

Support Vector Machine (SVM)

For the SVC model, we set kernel to 'linear'. We then fit the model to x_train and y_train and tested it. This again gave the minimum observed mean squared error, which supports its accuracy on our data.
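The sketch below shows this setup together with a confusion matrix, one of the checks we used when evaluating models. The kernel setting comes from our description; the synthetic data and the split are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the stroke dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Kernel named in the text above.
svc = SVC(kernel="linear")
svc.fit(x_train, y_train)
svc_predictions = svc.predict(x_test)

# Rows are true classes and columns are predicted classes:
# [[tn, fp],
#  [fn, tp]]
svc_matrix = confusion_matrix(y_test, svc_predictions, labels=[0, 1])
tn, fp, fn, tp = svc_matrix.ravel()
print(svc_matrix)
```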

Comparisons

Deciding on a machine learning model

Accuracy (metric 1)

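Accuracy is calculated as (tp + tn) / (tp + fp + tn + fn), where tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives. You want accuracy to be as close to 1 as possible for the best model.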
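Here is a small worked example of the accuracy formula in plain Python, along with the precision, recall, and F1 score we also looked at. The labels are made up for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Made-up labels for illustration only (1 = stroke, 0 = no stroke).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)   # of the predicted strokes, how many were real
recall = tp / (tp + fn)      # of the real strokes, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)  # 2 4 1 1
print(accuracy)        # 0.75
```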

Mean Squared Error (metric 2)

MSE is calculated by squaring the difference between each predicted value and the actual value and then averaging those squared differences. It tells us the average size of the model's errors, and we want it to be as close to 0 as possible.
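For a classifier that outputs 0 or 1, each squared error is 1 for a wrong prediction and 0 for a correct one, so MSE is simply the fraction of wrong predictions, or 1 minus accuracy. The plain Python sketch below shows this on made-up labels; the function names are hypothetical.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fraction_correct(y_true, y_pred):
    """Share of predictions that match the actual label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels for illustration only (1 = stroke, 0 = no stroke).
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]

model_mse = mse(actual, predicted)
model_accuracy = fraction_correct(actual, predicted)

print(model_mse)           # 0.25
print(1 - model_accuracy)  # 0.25
```

Two of the eight predictions are wrong, so MSE is 0.25 and accuracy is 0.75.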

These graphs make it clear that the KNN model is best.

Conclusion

Our findings

Our results show that the variables in our dataset that have a positive correlation with stroke include being female, having no heart disease, working in the private sector, having hypertension, never having smoked, being older, being married, and having a slightly higher body mass index. These variables indicate that a person may be at an increased risk of having a stroke. We found that the most accurate machine learning model, with the lowest mean squared error, was K-Nearest Neighbors (KNN). We hope that our findings will aid the healthcare industry in the prediction and classification of strokes.

Our Amazing Team

The brilliant minds behind this project

Anas Ahmad

Greetings all! I am an Accounting & Finance major in my Sophomore year at the University of Kansas. I am a Motorhead at heart and a K-Pop/Dramas fan, and I am pursuing a future in Corporate Law.

Aria Hoesley

Hi guys! I'm a rising freshman at Northwestern University planning on majors in data science and biology. I love coding games, biking, and reading.

Owen Yeung

I am a rising junior in high school who enjoys video games, playing piano, reading, and learning new things.

Shreya Sareen

Hi, I am a college freshman majoring in Computer/Data Science with a minor in Bioinformatics. I enjoy playing Nintendo and piano, making art, stargazing, and listening to lofi music.

Bryce Reid

William Lin

Hi! I am a freshman attending Chabot Community College and will be a transfer student to the University of California. After college I hope to attend Medical School to become a Doctor.