Find the drivers of heart disease or attack and build a model that predicts heart disease or attack more accurately than baseline.
Using the heart disease prediction dataset from Kaggle, look for features with a statistically significant relationship to heart disease or attack that can be used in a predictive model.
I believe that a majority of the feature columns will be good predictors of heart disease or attack.
| Feature | Definition |
|---|---|
| bmi | Body Mass Index |
| smoker | Whether the respondent has smoked at least 100 cigarettes in their lifetime |
| menthlth | Number of days during the past 30 days where mental health was not good |
| physhlth | Number of days during the past 30 days where physical health was not good |
| sex | Sex of respondent |
| age | Age of respondent |
| heartdiseaseorattack | Respondents that have ever reported having coronary heart disease or attack |
| highbp | Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional |
| highchol | Adults who have been told they have high cholesterol by a doctor, nurse, or other health professional |
| diabetes | Whether the respondent has diabetes |
| hvyalcoholconsump | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
- Data was acquired from: https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset
- Clone this repo and run through the final report.
- Acquire dataset from link
- Cache full df for future use
- View data with .info(), .describe(), and .shape
  - There are 253,680 rows and 11 columns
- View/correct datatypes
  - Changed age from integers to the correct age bins
- There were no null values
- Visualize full dataset for univariate exploration (histograms and boxplots)
- Handle outliers
  - Removed BMI outliers by dropping the top 1 percent of values
- Verified datatypes
- Made all of the column names lowercase
- Split the data on the target variable (heartdiseaseorattack)
- Scaling data on train
- Encoding any necessary columns
- Document how I'm changing the data
- Use unscaled data for multivariate exploration
- Hypothesize
- Visualize
- Run stats tests
  - Run chi-squared test on categorical features vs. target
  - Run comparison of means test on continuous features vs. target
- Summarize
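
The preparation steps in the list above (lowercasing column names, dropping the top 1 percent of BMI, and splitting the data) can be sketched roughly as follows. This is a minimal illustration and not the code from the final report: the `prepare` helper, the 60/20/20 split sizes, the random seed, the reading of "split on the target" as a stratified split, and the synthetic `demo` frame are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare(df: pd.DataFrame, target: str = "heartdiseaseorattack", seed: int = 42):
    """Lowercase columns, trim the top 1% of BMI, and split into train/validate/test."""
    df = df.copy()
    df.columns = [col.lower() for col in df.columns]

    # Keep rows at or below the 99th percentile of BMI (drops the top 1 percent).
    df = df[df["bmi"] <= df["bmi"].quantile(0.99)]

    # Assumed sizes: 20% test, then 25% of the remainder for validate (60/20/20 overall).
    # Stratifying on the target keeps the class balance the same in every split.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed, stratify=train_validate[target]
    )
    return train, validate, test


# Synthetic stand-in for the Kaggle data (hypothetical values, illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "BMI": rng.normal(28, 6, 1000).round(),
        "HeartDiseaseorAttack": rng.binomial(1, 0.1, 1000),
    }
)

train, validate, test = prepare(demo)
print(train.shape, validate.shape, test.shape)
```

Any scaler or encoder would then be fit on `train` only and applied to `validate` and `test`, so that no information from the held-out splits leaks into training.
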
- All of the features had a statistically significant relationship with heart disease or attack
- More people with high blood pressure had heart disease or a heart attack than those without high blood pressure
- More people with high cholesterol had heart disease or a heart attack than those without high cholesterol
- More people who smoked had heart disease or a heart attack than those who did not
- People without diabetes had heart attacks 8.6% more than those who did have diabetes
- Female respondents were 5% more likely to have had a heart attack
- People who consumed heavy amounts of alcohol had heart disease or heart attacks 6.81% more than those who did not
- Higher age brackets have a higher percentage of people who have had heart disease or a heart attack
- People who have had heart attacks had more days of bad physical health
- People who have had heart attacks had more days of bad mental health
- People who have had heart attacks and people who have not have a similar mean BMI
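
Findings like the ones above could be checked with tests along these lines. This is a hedged sketch, not the report's actual code: the helper names, the 0.05 significance level, the choice of Welch's t-test as the "comparison of means" test, and the synthetic `demo` data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # assumed significance level
TARGET = "heartdiseaseorattack"


def chi2_vs_target(df: pd.DataFrame, feature: str, target: str = TARGET):
    """Chi-squared test of independence between a categorical feature and the target."""
    observed = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    return chi2, p


def means_vs_target(df: pd.DataFrame, feature: str, target: str = TARGET):
    """Welch's t-test comparing a continuous feature between the two target groups."""
    with_hd = df.loc[df[target] == 1, feature]
    without_hd = df.loc[df[target] == 0, feature]
    t, p = stats.ttest_ind(with_hd, without_hd, equal_var=False)
    return t, p


# Synthetic data with a built-in association (hypothetical values, illustration only).
rng = np.random.default_rng(0)
n = 5000
highbp = rng.binomial(1, 0.4, n)
target = rng.binomial(1, np.where(highbp == 1, 0.17, 0.04))
physhlth = np.clip(rng.poisson(np.where(target == 1, 9, 4)), 0, 30)
demo = pd.DataFrame({"highbp": highbp, "physhlth": physhlth, TARGET: target})

chi2, p_chi2 = chi2_vs_target(demo, "highbp")
t, p_t = means_vs_target(demo, "physhlth")
rate_by_highbp = demo.groupby("highbp")[TARGET].mean()  # share with the target per group

print(f"highbp: chi2={chi2:.1f}, significant={p_chi2 < ALPHA}")
print(f"physhlth: t={t:.1f}, significant={p_t < ALPHA}")
print(rate_by_highbp)
```

The per-group rates from `groupby(...).mean()` are what statements such as "more people with high blood pressure had heart disease" compare. If the day counts are strongly skewed, a rank-based test such as `scipy.stats.mannwhitneyu` may be a reasonable alternative to the t-test.
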
- Will the model that uses every feature outperform a model built on newly created feature columns (a possible next step)?
- Use scaled/encoded data
- Split into X_variables and y_variables
- Determine evaluation metrics
- Establishing baseline
- Run different models on train/validate
- Pick best model and evaluate on test
- My top model beat baseline by 0.02%
- I would not recommend using the model because it did not beat baseline by a significant margin
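
The baseline comparison described above can be sketched as follows. The modeling choices here are assumptions, since the text does not name them: accuracy as the evaluation metric, a most-frequent-class baseline, `MinMaxScaler` for scaling, logistic regression as the example model, and synthetic data in place of the real features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced data standing in for the prepared dataset (illustration only).
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))
logits = -2.5 + 1.2 * X[:, 0]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on train only, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)

# Baseline: always predict the most common class seen in train.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_scaled, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

baseline_acc = accuracy_score(y_validate, baseline.predict(X_validate_scaled))
model_acc = accuracy_score(y_validate, model.predict(X_validate_scaled))

print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"model accuracy:    {model_acc:.4f}")
print(f"difference:        {model_acc - baseline_acc:+.4f}")
```

When the target is imbalanced, the most-frequent-class baseline already scores high on accuracy, so the margin a model can gain on that metric is small. Reporting recall or precision for the positive class alongside accuracy may give a fuller picture.
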