29 Sept Project work progress

Today I continued to preprocess the data and engineer relevant features. This included merging the tables to represent the combined effects of obesity and inactivity on diabetes risk. I found that only 354 FIPS codes are common to the obesity, inactivity, and diabetes tables, whereas 1370 FIPS codes are common to the inactivity and diabetes tables. So, I decided to analyze the inactivity and diabetes data first to get a better picture of the dataset after merging.
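A minimal sketch of this kind of merge, assuming each table carries a FIPS column. The column names and the values below are made up for illustration and are not the CDC data.

```python
import pandas as pd

# Toy stand-ins for the CDC tables; column names and values are assumptions.
inactivity = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007],
    "pct_inactive": [17.0, 16.2, 18.9, 15.4],
})
diabetes = pd.DataFrame({
    "FIPS": [1003, 1005, 1007, 1009],
    "pct_diabetic": [8.1, 9.4, 7.6, 8.8],
})

# An inner join keeps only the FIPS codes present in both tables.
merged = inactivity.merge(diabetes, on="FIPS", how="inner")

print(len(merged))              # number of FIPS codes in common
print(merged["FIPS"].tolist())  # which codes survived the join
```

The row count of the merged frame is the number of FIPS codes the two tables share, which is how a figure like 1370 can be checked.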

Building on my initial EDA, I created visualizations to illustrate the relationship between inactivity and diabetes. These visualizations will play a crucial role in communicating our findings. I did a bivariate analysis and a heatmap analysis of inactivity and diabetes to better understand the relationship between the two. I also plotted the probability density function and cumulative distribution function of each variable to understand how the data points are distributed.
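The cumulative distribution can be estimated directly from the data without any plotting library. The sketch below computes an empirical CDF. The sample values are invented, and the helper name `ecdf` is only illustrative.

```python
def ecdf(values):
    """Return the sorted values and, for each, the fraction of points <= it."""
    xs = sorted(values)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps

# Made-up inactivity percentages, for illustration only.
sample = [15.4, 16.2, 17.0, 18.9, 16.2]
xs, ps = ecdf(sample)
for x, p in zip(xs, ps):
    print(f"{x:5.1f}  {p:.2f}")
```

For tied values, the last of the repeated rows gives the empirical CDF at that value.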

This week I’m planning to develop machine learning models to predict diabetes risk based on inactivity levels. My preliminary model will include logistic regression.

Wed 27th Sept Project work

Today marked the beginning of my project analyzing the CDC 2018 diabetes data, with a specific focus on obesity, inactivity, and diabetes. I outlined the project’s scope and objectives.

I performed preliminary data cleaning. This involved handling missing values, standardizing data formats, and ensuring data integrity. Using Jupyter Notebook with the Pandas and Seaborn Python libraries, I conducted an initial EDA to gain insights into the dataset. I created visualizations to understand the distributions of obesity, inactivity, and diabetes and to illustrate the relationships among them. These visualizations will play a crucial role in communicating our findings.

After the initial analysis, I found that there are 354 FIPS codes for which the 2018 CDC data covers all three of diabetes, obesity, and inactivity. I will perform an inner join on the obesity, inactivity, and diabetes data for further analysis.

Sept 25 Cross-Validation

Today we learned about cross-validation and the bootstrap. Both are resampling techniques commonly used to assess the performance of models and estimate their generalization error. Cross-validation measures how well a predictive model performs on data it was not trained on. The basic idea is to divide the dataset into two parts: a training set and a testing set.

We also learned about K-fold cross-validation. In this, the dataset is divided into K equally sized subsets, or folds. The model is trained on K-1 of these folds and tested on the remaining fold. This process is repeated K times, with each fold serving as the test set exactly once. The results are then averaged to get an overall performance measure.
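The procedure can be sketched in plain Python. This is an illustration under assumptions, not the code from class: the model is a one-predictor least-squares line, the score is mean squared error, and the data are synthetic. The names `fit_line` and `k_fold_mse` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average test mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Fit on the other k-1 folds, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errs) / len(errs))
    return sum(scores) / k

# Synthetic data: y is roughly 2 + 0.5*x plus noise (not CDC data).
rng = random.Random(1)
xs = [rng.uniform(10, 30) for _ in range(50)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
print(k_fold_mse(xs, ys, k=5))
```

Each index lands in exactly one fold, so every point is used for testing once and for training K-1 times.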

Bootstrap is a statistical resampling technique used to estimate the sampling distribution of a statistic by repeatedly resampling the dataset with replacement.
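As a rough sketch of the idea, the code below bootstraps the sample mean and reads off a percentile interval. The data values, the number of resamples, and the function name `bootstrap_means` are all assumptions chosen for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each the size of data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Made-up values, for illustration only.
data = [7.6, 8.1, 8.8, 9.4, 7.9, 8.3, 9.0, 8.5]
means = sorted(bootstrap_means(data))

# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean = {statistics.fmean(data):.2f}")
print(f"95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

The spread of the resampled means approximates the sampling distribution of the mean without assuming a particular shape for the data.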

Sept 20 Excellent Linear model

Today we learned that an excellent linear model fit to data with non-normally distributed, skewed, high-variance, and high-kurtosis variables typically indicates that the model is robust and effectively captures the underlying relationships between the variables despite the challenging characteristics of the data.

We also learned about the t-test, a statistical hypothesis test used to determine if there is a significant difference between the means of two groups or populations. It is employed when you have two sets of data and want to assess whether the means of these two sets are statistically different from each other. The t-test produces a test statistic and a p-value. The t-value quantifies the difference between the group means relative to the variability within the groups, while the p-value is the probability of observing a difference at least as large as the one seen if there were no true difference between the populations from which the samples were drawn.
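To make the t-value concrete, here is a small sketch that computes it by hand. It uses Welch's unequal-variance form, which is a choice made here and not necessarily the version covered in class. The two groups are invented numbers, and turning the statistic into a p-value would additionally require the t distribution, which is left out.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Made-up measurements for two groups, for illustration only.
group_a = [8.1, 9.4, 7.6, 8.8, 9.0]
group_b = [7.0, 7.4, 6.9, 7.8, 7.2]
t, df = welch_t(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.1f}")
```

A large absolute t means the gap between the group means is large compared with the noise within the groups.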

Sept 18 Interaction model: Linear regression with two predictor variables, interaction terms and quadratic terms

Today we learned about the interaction model. An interaction arises when three or more variables are involved and at least two of them combine to influence a third in a way that is not merely additive. In other words, their combined effect differs from the simple sum of their individual effects. Equivalently, an interaction effect occurs when the effect of one variable depends on the value of another variable.

Interaction models are essential for understanding and accounting for complex relationships in data. They help uncover patterns and make the relationships between variables easier to interpret.
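A small numeric sketch of an interaction term in a two-predictor linear model. The coefficients are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Linear model with an interaction term: b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Hypothetical coefficients, for illustration only.
coef = dict(b0=1.0, b1=0.5, b2=0.25, b3=0.125)

# Effect of raising x1 by one unit, measured at two different values of x2.
effect_at_x2_0 = predict(1, 0, **coef) - predict(0, 0, **coef)
effect_at_x2_4 = predict(1, 4, **coef) - predict(0, 4, **coef)
print(effect_at_x2_0, effect_at_x2_4)
```

The one-unit effect of x1 equals b1 + b3*x2, so it changes with x2. With b3 set to zero the two effects would be identical, which is the purely additive case.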

 

Wed Sept 13

In today’s class, we learned the concepts of the null hypothesis and the p-value.

Hypothesis testing follows a logic similar to proof by contradiction. The null hypothesis is the assumption we start from, and we then check whether the data are consistent with it. If the data would be very unlikely under the null hypothesis, we reject it in favor of the alternative hypothesis. If not, we fail to reject the null hypothesis, which is not the same as proving it true.

The p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. If the p-value is very low, the observed data would be very unlikely under the null hypothesis. In such cases, we reject the null hypothesis in favor of the alternative hypothesis.
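A coin-flip example makes the definition concrete. The scenario is an illustration added here and not from class: the null hypothesis is that the coin is fair, and the observed result is 9 heads in 10 flips. The function name `binomial_p_value` is hypothetical.

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """One-sided p-value: P(at least `heads` heads) if the null is true."""
    return sum(
        comb(flips, k) * p**k * (1 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Probability of 9 or more heads in 10 flips of a fair coin.
p_value = binomial_p_value(9, 10)
print(f"{p_value:.4f}")
```

The result is 11/1024, about 0.011, so an outcome this lopsided would be rare if the coin were fair. That is evidence against the null hypothesis, not a direct statement of the probability that the null hypothesis is true.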

Monday 11 Sept Exploring the CDC 2018 diabetes data

Today we analyzed the CDC 2018 diabetes data with a specific focus on obesity, inactivity, and diabetes. Because inactivity and diabetes have the most data points in the dataset, we can explore whether there is a relationship between these two parameters. The Diabetes and Inactivity features have 1370 data points in common.

We analyzed the smoothed histograms for the Diabetes and Inactivity data and examined the pair plot for the two. We also learned the concept of heteroscedasticity.
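Heteroscedasticity means the spread of the residuals is not constant across the range of the predictor. The sketch below builds synthetic data whose noise grows with x and compares the residual spread in the lower and upper halves. The data and the split-in-half check are assumptions for illustration, not a formal test.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic data whose noise grows with x (heteroscedastic by construction).
xs = [rng.uniform(1, 10) for _ in range(400)]
ys = [2 * x + rng.gauss(0, 0.3 * x) for x in xs]

# Residuals from the known line y = 2x, ordered by x.
residuals = sorted((x, y - 2 * x) for x, y in zip(xs, ys))
half = len(residuals) // 2
low_spread = statistics.stdev([r for _, r in residuals[:half]])
high_spread = statistics.stdev([r for _, r in residuals[half:]])
print(low_spread, high_spread)
```

A clearly larger spread in the upper half is the fan shape that a residual plot would show for heteroscedastic data.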