Features Associated with Diabetes
Significant Results
An exploratory analysis found that individuals with high cholesterol, high blood pressure, a history of stroke or heart attack, or serious difficulty walking are more likely to have diabetes than the general population. The analysis also showed that people who experience more than 25 days of physical and mental distress per month have a higher probability of developing diabetes. Consistently eating fruits and vegetables was associated with a lower risk of developing diabetes.
Gender and age also play a significant role in the prevalence of diabetes. Males are more likely to have diabetes than females, and individuals over the age of 60 are more likely to have it than younger age groups. Individuals with low income levels who cannot afford regular doctor visits are also at high risk. One possible explanation is that low income affects mental health, which in turn contributes indirectly to diabetes.
Body Mass Index (BMI) is another crucial factor: individuals with a high BMI are at higher risk of developing diabetes.
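The group comparisons above boil down to computing the share of diabetic individuals within each level of a feature. A minimal sketch of that calculation follows. The function name, the column names (`high_bp`, `diabetes`), and the records are all made up for illustration and are not taken from the dataset used in this analysis.

```python
from collections import defaultdict

def prevalence_by_group(records, feature, target="diabetes"):
    """Return {feature value: share of records where target == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        totals[row[feature]] += 1
        positives[row[feature]] += row[target]
    return {value: positives[value] / totals[value] for value in totals}

# Toy records (hypothetical, for illustration only).
records = [
    {"high_bp": 1, "diabetes": 1},
    {"high_bp": 1, "diabetes": 1},
    {"high_bp": 1, "diabetes": 0},
    {"high_bp": 0, "diabetes": 0},
    {"high_bp": 0, "diabetes": 1},
    {"high_bp": 0, "diabetes": 0},
    {"high_bp": 0, "diabetes": 0},
]

rates = prevalence_by_group(records, "high_bp")
for value, rate in sorted(rates.items()):
    print(f"high_bp={value}: diabetes prevalence {rate:.2f}")
```

A higher prevalence in one group than another indicates an association only; it does not by itself show that the feature causes diabetes.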
Selection of Machine Learning Models
After computing cross-validation scores for each model, the Keras sequential model was found to have the highest accuracy for the diabetes use case. The classification reports showed that it also obtained the highest F1-score. The Multilayer Perceptron achieved the highest AUC score, so the AUC and F1-score had to be weighed against each other to determine the best model.
Although the target feature in the initial dataset was balanced, the classes became imbalanced after the data was split into training and testing sets. For imbalanced datasets, or when communicating results to end users, the F1-score is typically considered more informative than AUC. Therefore, the Keras sequential model was chosen as the best model for predicting the presence of diabetes, based on the cross-validation results and classification reports.
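To make the two metrics concrete, here is a small sketch that computes both from scratch on made-up labels and scores. It is not the evaluation code behind the results above. The helper names are hypothetical, and libraries such as scikit-learn provide equivalent functions (`f1_score`, `roc_auc_score`). The F1-score depends on the hard predictions at a chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is why the two can favor different models.

```python
def f1_from_labels(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_from_scores(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form equals the area
    under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold is an assumption

f1 = f1_from_labels(y_true, y_pred)
auc = auc_from_scores(y_true, y_score)
print(f"F1 = {f1:.3f}, AUC = {auc:.3f}")
```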
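One common way to keep the class ratio the same in the training and testing sets is a stratified split, which samples each class separately. The write-up does not say whether stratification was used, so the following is only a sketch of the idea under that assumption. The function and its parameters are hypothetical, and scikit-learn's `train_test_split` offers the same behavior through its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy balanced target: 8 negatives followed by 8 positives (hypothetical).
labels = [0] * 8 + [1] * 8
train_idx, test_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(f"train: {train_pos}/{len(train_idx)} positive")
print(f"test:  {test_pos}/{len(test_idx)} positive")
```

With a split like this, a dataset that starts balanced stays balanced in both parts, so accuracy, F1-score, and AUC are all measured under the same class ratio.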