Mon 23 Oct

In this lecture, I learned about unsupervised learning: a machine learning method where algorithms explore data without predefined labels or outcomes. The main objective is to uncover patterns and structures within the data, primarily through clustering and dimensionality reduction techniques. I also learned that unsupervised learning has numerous real-world applications, and that it plays a crucial role in discovering insights from data where the inherent structure isn’t immediately apparent.

I also learned about clustering in unsupervised learning: the process of grouping similar data points together without predefined categories. The goal is to identify natural groupings, or clusters, within the dataset, allowing for the discovery of underlying structures and relationships. Once the clusters are formed, the results are interpreted by examining the characteristics of the data points within each cluster, which can provide insights into the natural groupings or patterns present in the data. Various clustering algorithms, such as K-means and hierarchical clustering, are used to achieve this.
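
To make this concrete, here is a minimal K-means sketch on synthetic data (scikit-learn is assumed, and the blob parameters are invented for illustration, not taken from the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate unlabeled data with three natural groupings (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means with k=3; the algorithm only ever sees X, never any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Interpret the clusters by examining their centroids and sizes.
print(kmeans.cluster_centers_)
print(np.bincount(labels))  # how many points landed in each cluster
```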

Fri 20 Oct

After attending the lecture, I learned about multiple logistic regression. I’ve learned that this statistical method allows me to analyze the influence of multiple predictor variables on a binary outcome. I can estimate coefficients for each predictor, which reveal how they impact the likelihood of the event occurring. I understand that positive coefficients increase the likelihood, while negative coefficients decrease it. This knowledge equips me to apply multiple logistic regression for a deeper understanding of complex relationships.
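
A rough sketch of fitting such a model with statsmodels (the predictors and data below are simulated for illustration; nothing here is from the lecture):

```python
import numpy as np
import statsmodels.api as sm

# Simulate two predictors and a binary outcome (true coefficients invented).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
logit_p = 0.5 + 1.2 * x1 - 0.8 * x2             # positive and negative effects
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Fit the multiple logistic regression and inspect the coefficients.
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.Logit(y, X).fit()
print(model.params)  # positive values raise the log odds, negative lower them
```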

I also learned about multinomial logistic regression. I’ve learned that this statistical technique is used when the outcome variable has more than two categories, and it allows me to model the probability of each category based on predictor variables. I can estimate coefficients for each predictor and category, helping me understand how these predictors influence the likelihood of being in a specific category relative to a reference category. This knowledge equips me to apply multinomial logistic regression for analyzing and predicting categorical outcomes with multiple categories.
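
Here is a small multinomial sketch using scikit-learn on the three-class iris dataset (note that scikit-learn’s default parameterization differs slightly from the reference-category formulation described above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three outcome categories (0, 1, 2) and four predictors.
X, y = load_iris(return_X_y=True)

# For multi-class targets, scikit-learn fits a multinomial model with
# one coefficient vector per category.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_.shape)           # (3 categories, 4 predictors)
print(clf.predict_proba(X[:3]))  # per-category probabilities for 3 samples
```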

Wed 18 Oct

In this lecture, I have learned about logistic regression coefficients. I can now interpret these coefficients as values that reveal how predictor variables affect the log odds of a binary outcome. I understand that positive coefficients indicate an increased likelihood of the outcome, while negative coefficients suggest a decreased likelihood. I also recognize that the magnitude and significance of coefficients provide insight into the strength and importance of these relationships. This knowledge equips me to effectively analyze and make informed decisions in various fields where logistic regression is applied.
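
A tiny numeric sketch of this interpretation (the coefficient values are invented): exponentiating a coefficient turns a change in log odds into a multiplicative change in odds.

```python
import numpy as np

beta = 0.7  # hypothetical coefficient for one predictor
# A one-unit increase in the predictor multiplies the odds by exp(beta).
print(np.exp(beta))   # ~2.01: the odds of the outcome roughly double
print(np.exp(-0.7))   # ~0.50: a negative coefficient halves the odds instead
```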

I also learned to estimate logistic regression coefficients. I have learned that this process involves finding values that depict the relationship between predictor variables and the log odds of a binary outcome in logistic regression. I now know how each coefficient signifies the change in log odds for a one-unit change in the associated predictor. Moreover, I understand the importance of assessing the magnitude and significance of coefficients for meaningful interpretation. This knowledge equips me to use logistic regression coefficients to predict binary outcomes.
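
To see what estimation means here, the sketch below finds the coefficients by maximizing the log-likelihood directly with scipy (data simulated; in practice a library routine would do this):

```python
import numpy as np
from scipy.optimize import minimize

# Simulate one predictor and a binary outcome with known true coefficients.
rng = np.random.default_rng(1)
x = rng.normal(size=400)
X = np.column_stack([np.ones_like(x), x])  # intercept + one predictor
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.9 * x))))

def neg_log_likelihood(beta):
    # The logistic function maps log odds to probabilities (clipped for safety).
    p = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)  # estimates should land near the true values (0.3, 0.9)
```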

Mon 16 Oct

After attending the lecture, I now have an understanding of permutation tests. I’ve learned that these tests are a valuable non-parametric statistical approach used when we can’t make assumptions about a specific data distribution. I’ve grasped the process of creating a null distribution by shuffling data, calculating a test statistic, and using it to compute p-values by comparing it to the distribution of test statistics from shuffled data. I can now apply this method to make valid statistical inferences based on observed data.
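
A minimal permutation-test sketch for a difference in group means (all data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=0.5, size=50)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Build the null distribution by repeatedly shuffling the group labels.
n_perm = 10_000
null_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)  # breaks any real group structure
    null_stats[i] = shuffled[50:].mean() - shuffled[:50].mean()

# Two-sided p-value: how often does shuffling look as extreme as the data?
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(observed, p_value)
```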

I also learned about logistic regression, a statistical technique for predicting binary outcomes using predictor variables. I understand how it employs the logistic function to model probabilities and how it estimates coefficients for these predictor variables to determine their impact on the outcome. I can see its applications in various fields, such as binary classification, and how it can be adapted for different types of categorical outcomes, making it a valuable tool for data analysis and prediction.
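
The logistic function itself is simple to sketch; it squashes any real-valued linear predictor into a probability between 0 and 1:

```python
import numpy as np

def logistic(z):
    """Map a real-valued linear predictor z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(logistic(0))                         # 0.5: log odds of zero
print(logistic(np.array([-4, -1, 1, 4])))  # values squash toward 0 and 1
```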

Fri 13 Oct

After attending this lecture, I now understand the process of estimating a p-value from a simulation as a statistical method for assessing the significance of observed data. I have learned that it involves the formulation of a null hypothesis and an alternative hypothesis, and then simulating numerous datasets assuming the null hypothesis is true. These simulations help create a distribution of test statistics, from which I can calculate a p-value. I have gained the knowledge that a small p-value (typically less than 0.05) indicates statistical significance, allowing me to confidently reject the null hypothesis in favor of the alternative hypothesis.
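
A small simulation sketch of this idea: suppose we observed 60 heads in 100 coin flips and want a p-value under the null hypothesis of a fair coin (numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
observed_heads = 60
n_sims = 100_000

# Simulate the test statistic many times assuming the null is true.
sim_heads = rng.binomial(n=100, p=0.5, size=n_sims)

# Two-sided p-value: fraction of simulations at least as extreme as observed.
p_value = np.mean(np.abs(sim_heads - 50) >= abs(observed_heads - 50))
print(p_value)  # roughly 0.057, so not quite significant at the 0.05 level
```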

I also learned the concept of hypothesis testing with permutation tests. I have learned that this statistical approach is particularly useful when dealing with data for which the underlying distribution is uncertain or doesn’t conform to standard assumptions. I understand the process of creating a null distribution by shuffling or reordering the data, calculating a test statistic, and determining the p-value by comparing the observed test statistic to the distribution from permuted data.
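
As a second sketch, the same shuffling idea applied to a correlation rather than a difference in means (simulated data again):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = 0.3 * x + rng.normal(size=80)

observed_r = np.corrcoef(x, y)[0, 1]

# Shuffling y destroys any real x-y association, giving a null distribution.
n_perm = 10_000
null_r = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                   for _ in range(n_perm)])

p_value = np.mean(np.abs(null_r) >= abs(observed_r))
print(observed_r, p_value)
```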

Wed 11 Oct

Today I learned about Monte Carlo testing. Monte Carlo testing is a statistical method that uses random sampling techniques to approximate complex mathematical results or solve problems that might be deterministic in principle. It derives its name from the Monte Carlo Casino in Monaco, known for its games of chance, reflecting the random nature of the method.

Monte Carlo testing is particularly useful when dealing with problems that are analytically intractable or too complex to solve directly. It aims to provide numerical approximations for problems involving uncertainty, variability, and random processes. Instead of solving a problem analytically, Monte Carlo testing relies on generating a large number of random samples or simulations to estimate a solution. Monte Carlo methods are effective for analyzing complex systems with many interacting components and uncertainties, and they are well suited to parallel computing, making them computationally scalable.
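
The classic toy example is approximating pi by random sampling; the problem has an analytical answer, but the sketch shows the idea:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample points uniformly in the unit square and count how many fall
# inside the quarter circle of radius 1.
x, y = rng.random(n), rng.random(n)
inside = (x**2 + y**2) <= 1.0

# The quarter circle covers pi/4 of the square, so scale the fraction by 4.
print(4 * inside.mean())  # close to 3.14159; accuracy improves with n
```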

Fri 6 Oct

Today I conducted a more in-depth EDA to uncover patterns, correlations, and potential multicollinearity among variables. Visualization techniques were employed to better understand how obesity and inactivity relate to diabetes prevalence. Later on, I started developing my linear regression model, using obesity and inactivity as independent variables and diabetes as the dependent variable. This involved testing the assumptions of linear regression, such as linearity, independence, homoskedasticity, and normality of residuals. Next, I assessed the preliminary performance of the model, examining metrics like R-squared, adjusted R-squared, and residual plots. This step allowed me to identify areas for improvement and fine-tune the model for better predictive accuracy.
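
A hedged sketch of this regression step with statsmodels; the CSV path and the column names (obesity, inactivity, diabetes) are my placeholders here, not the actual project files:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file name; columns assumed to match the variables above.
df = pd.read_csv("cdc_diabetes.csv")

model = smf.ols("diabetes ~ obesity + inactivity", data=df).fit()
print(model.rsquared, model.rsquared_adj)  # preliminary fit quality
print(model.summary())                     # coefficients and diagnostics
```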

Recognizing the potential interaction between obesity and inactivity, I explored interaction terms in my model. This allowed me to capture the combined effect of these variables on diabetes prevalence, providing a more nuanced understanding. I also implemented cross-validation techniques to assess the generalizability of my model. Additionally, I used various validation metrics to evaluate model performance.
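
Sketched under the same assumed column names, the interaction term and a cross-validation check might look like this (the fold count is arbitrary):

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cdc_diabetes.csv")  # hypothetical file name

# obesity * inactivity expands to both main effects plus their interaction.
inter_model = smf.ols("diabetes ~ obesity * inactivity", data=df).fit()
print(inter_model.params)

# 5-fold cross-validated R^2 as a check on generalizability.
X = df[["obesity", "inactivity"]]
scores = cross_val_score(LinearRegression(), X, df["diabetes"],
                         cv=5, scoring="r2")
print(scores.mean())
```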

Wed 4 Oct

I continued to preprocess the data and engineer relevant features. This included creating new variables to represent the combined effects of obesity and inactivity on diabetes risk. Feature engineering is essential for building more accurate predictive models.
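
One plausible reading of that feature engineering, sketched with the same assumed columns (the exact variables created in the project may differ):

```python
import pandas as pd

df = pd.read_csv("cdc_diabetes.csv")  # hypothetical file name

# A product term capturing the joint burden of obesity and inactivity.
df["obesity_x_inactivity"] = df["obesity"] * df["inactivity"]

# A simple binary risk flag as another example of an engineered feature.
df["high_risk"] = ((df["obesity"] > df["obesity"].median()) &
                   (df["inactivity"] > df["inactivity"].median())).astype(int)
```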

Later I started developing machine learning models to predict diabetes risk based on inactivity levels. My preliminary model is a linear regression. I am working on fine-tuning the model and plan to explore more complex algorithms in the coming weeks. Next, building on my initial EDA, I created more advanced visualizations to illustrate the relationships between obesity, inactivity, and diabetes. These visualizations will play a crucial role in communicating my findings.
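
A sketch of that preliminary model with a held-out test set (scikit-learn; the column names remain the assumptions from the earlier sketches):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("cdc_diabetes.csv")  # hypothetical file name
X, y = df[["inactivity"]], df["diabetes"]

# Hold out 20% of the data to check out-of-sample performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out test set
```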

Mon 2 Oct: Residuals and Heteroskedasticity

Today, when I plotted the residuals against the predicted values from the linear model, I noticed a distinctive fanning-out pattern in the residuals as the fitted values increased. This observation led me to conclude that, in this situation, the heteroskedasticity of the residuals (specifically, their increasing variability with larger predicted values) serves as a major warning sign that the linear model may not be reliable.
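
The plot in question can be sketched like this (matplotlib; same assumed file and column names as in the earlier sketches):

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cdc_diabetes.csv")  # hypothetical file name
model = smf.ols("diabetes ~ obesity + inactivity", data=df).fit()

# Residuals vs fitted values; a fanning-out spread signals heteroskedasticity.
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```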

When there’s heteroskedasticity, OLS (Ordinary Least Squares) isn’t the best choice because it treats all data points equally, even though some points are more reliable than others. If some data points are more uncertain (have a larger disturbance variance), they should be given less weight, but OLS doesn’t do that. The coefficient estimates remain unbiased, but they are no longer efficient, and the usual standard errors become unreliable, which can distort our inferences. So I decided to test for heteroskedasticity analytically using the Breusch-Pagan test.
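
The Breusch-Pagan test is available in statsmodels; here is a sketch with the same assumed data (a small p-value would confirm heteroskedasticity analytically):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("cdc_diabetes.csv")  # hypothetical file name
model = smf.ols("diabetes ~ obesity + inactivity", data=df).fit()

# Regresses squared residuals on the explanatory variables; a small
# p-value rejects the null hypothesis of constant error variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    model.resid, model.model.exog)
print(lm_pvalue, f_pvalue)
```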