Project Report
After reviewing the findings, I concluded that the employee earnings analysis has important implications for organizational management. It highlights the need to address income inequality, prompting a review of compensation structures. Insights into departmental earnings inform strategic financial management and resource allocation. Overtime policies should be evaluated for fairness, and the identification of alternative income sources suggests opportunities for diversification. Implementing these measures can enhance employee satisfaction, engagement, and overall strategic human resource planning.
Later, I started compiling all the issues, findings, their implications, and the charts and plots used to analyze the data together for the report.
Project report
Today I started working on the project report, beginning with the issues for the EMPLOYEE EARNINGS REPORT 2022 dataset. The issues I want to work on are:
1. Distribution of income from the salaries of the employees.
2. Distribution of total gross earnings of the employees.
3. Departments with the employees having the highest and lowest earnings.
4. Job titles with the employees having the highest and lowest salary earnings.
5. Departments with the employees having the highest and lowest total earnings.
6. Departments with the highest overtime earnings.
7. Departments with the employees having the highest income sources other than salary.
Later on, I started working on compiling the findings together. These findings are as follows:
The analysis of employee earnings data reveals several key findings. Regular and total earnings distributions are right-skewed, indicating a concentration of employees with lower earnings and a smaller group earning significantly more. The Boston Police Department has the highest total earnings, with top earners predominantly from the Police and Fire Departments. Job titles also show considerable variation in average earnings.
Project report work
Today I started working on a predictive model to find out whether an employee’s income influences where the employee lives. To build a logistic regression model, I first removed rows with NaN values in the ‘REGULAR’, ‘TOTAL GROSS’, and ‘POSTAL’ columns. I then trained a logistic regression model to predict postal codes from regular earnings, which resulted in an accuracy of approximately 8%. Next, I trained another logistic regression model to predict postal codes from total gross earnings, which resulted in an accuracy of approximately 7%. The confusion matrix is extensive and shows that the model’s predictions are mostly zeros, which suggests the model is not performing well. This could be due to a variety of factors, including a possible imbalance in the dataset and the difficulty of predicting postal codes from a single feature.
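For my own reference, a minimal sketch of this step is below; the CSV filename and the 80/20 train/test split are my assumptions, and the column names may differ slightly by report year.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical filename; column names as they appear in my copy of the 2022 report.
df = pd.read_csv("employee_earnings_2022.csv")
df = df.dropna(subset=["REGULAR", "TOTAL GROSS", "POSTAL"])

X = df[["REGULAR"]]               # single numeric feature (assumed already numeric)
y = df["POSTAL"].astype(str)      # postal code treated as a categorical label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```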
1 Dec
Now I will proceed with visualizing the distribution of earnings within each department. This will provide a clear visual representation of the spread of earnings. This may help in identifying the patterns and outliers in the data.
Next, I analyzed the Employee Earnings Reports from 2011 to 2022 to understand the increase in total payroll over that period, the decrease in the number of employees during the same period, and the growth in average earnings per employee.
Next, I’m planning to plot the distribution of earnings on a map using the postal codes and will try to find out whether an employee’s income influences where they live.
29 Nov
Today I analyzed the distribution of earnings within the departments to understand the variance and identify any outliers. This involved calculating the standard deviation of earnings within each department and visualizing this information to provide a clear understanding of the earnings spread.
Next, I calculated the standard deviation of earnings by department. The standard deviation gives an idea of the spread of earnings within each department: a higher standard deviation indicates a wider range of earnings among its employees. The Superintendent’s office shows the highest variability in earnings, followed by the Boston Police Department and BPS High School Renewal. This is useful for understanding the pay scale and the earnings distribution within each department.
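A minimal sketch of this calculation, assuming the earnings report is loaded into a DataFrame `df` with ‘DEPARTMENT_NAME’ and ‘TOTAL_GROSS’ columns (exact column names assumed):

```python
import matplotlib.pyplot as plt

# Standard deviation of total gross earnings within each department.
dept_std = (
    df.groupby("DEPARTMENT_NAME")["TOTAL_GROSS"]
      .std()
      .sort_values(ascending=False)
)
print(dept_std.head(10))          # departments with the widest earnings spread

dept_std.head(10).plot(kind="barh", title="Std. dev. of earnings by department")
plt.tight_layout()
plt.show()
```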
Nov 27
The Employee Earnings Report dataset has various columns such as NAME, DEPARTMENT_NAME, and TITLE, along with different categories of earnings; the POSTAL column indicates postal codes.
Today I will proceed with analyzing the distribution of TOTAL_GROSS earnings across different departments, identifying the top earners in the dataset, and summarizing the average earnings by department and title.
Among the top 10 departments by total gross earnings, the Boston Police Department has the highest total gross earnings, followed by the Boston Fire Department and BPS Special Education.
Later I identified the top 10 earners in the dataset, sorted by their total gross earnings. The individuals listed are from various departments, with the Boston Police Department being prominent among the top earners.
Subsequently, I summarized the average earnings by department and title. Looking at average total gross earnings by department, the Superintendent’s office has the highest average earnings, followed by the Boston Fire Department and School Support & Transformation. This summary provides insight into the average compensation levels across different departments.
Nov 24
On this day I furthered my analysis of the Employee earning report dataset. I plotted the distribution of regular earnings across the top 10 departments by median earnings. This visualization helps to compare the spread and central tendency of earnings within these departments.
From my analysis, I found that the dataset tracks seven types of employee earnings. Every type has some rows with no data present, which indicates that the respective employee has no inflow of money through that type of earnings.
Here is the number of rows per type of earnings for which data is not present. It would be better to fill these empty rows with 0 to support further analysis of this dataset; a short sketch of this step follows the list below.
REGULAR 600
RETRO 20112
OTHER 7378
OVERTIME 16392
INJURED 21983
DETAIL 21088
QUINN_EDUCATION 21835
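The fill step I have in mind looks roughly like this, assuming the earnings report is already loaded into a DataFrame `df` with these column names:

```python
earning_cols = ["REGULAR", "RETRO", "OTHER", "OVERTIME",
                "INJURED", "DETAIL", "QUINN_EDUCATION"]

print(df[earning_cols].isna().sum())           # counts of empty rows per earnings type
df[earning_cols] = df[earning_cols].fillna(0)  # treat "no data" as zero income of that type
```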
22 Nov
Today, I focused on analyzing the Employee Earnings Report dataset. This dataset contains employee names, job details, and earnings information including base salary, overtime, and total compensation for employees of the City.
Here, I plotted the top 20 departments by total gross earnings, highlighting where the most earnings are concentrated within the organization.
Later, I plotted a histogram of the distribution of total gross earnings, focusing on the range from the 1st to the 99th percentile to exclude outliers for a clearer view of the data’s spread.
Next, I plotted a scatter plot that visualizes the relationship between regular earnings and overtime earnings, indicating how these two factors correlate across the dataset.
After the scatter plot, I found that the correlation coefficient between regular earnings and overtime earnings is approximately 0.505, indicating a moderate positive relationship. This suggests that, on average, higher regular earnings are associated with higher overtime earnings, but not strongly so.
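A rough sketch of the trimmed histogram and the correlation check, under the same column-name assumptions as in my earlier notes:

```python
import matplotlib.pyplot as plt

# Histogram of total gross earnings, restricted to the 1st–99th percentile range.
low, high = df["TOTAL_GROSS"].quantile([0.01, 0.99])
trimmed = df.loc[df["TOTAL_GROSS"].between(low, high), "TOTAL_GROSS"]
trimmed.plot(kind="hist", bins=50, title="Total gross earnings (1st–99th percentile)")
plt.show()

# Pearson correlation between regular and overtime earnings.
corr = df["REGULAR"].corr(df["OVERTIME"])
print(f"Correlation between regular and overtime earnings: {corr:.3f}")
```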
Nov 15th Basic analysis of datasets from Analyze Boston
Today marked the beginning of my data analysis project focusing on economic indicators provided by Analyze Boston. I collected the economic indicators dataset and did basic data exploration on it.
The dataset contains various economic indicators such as passenger numbers at Logan Airport, international flights, hotel occupancy rates, average daily rates for hotels, total jobs, unemployment rate, labor rate, housing and development metrics, etc.
Here, I performed a descriptive analysis of the dataset to get an overview of the data. The primary objectives for today included setting up the project infrastructure, gathering the dataset, and familiarizing myself with the available economic indicators.
Nov 20
On this day I did a basic analysis of the Crime Incident Reports dataset. Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. It contains incident reports with various details such as incident number, offense code, description, date, and location. To further understand the data, I will create a visualization that shows the frequency of incidents by the day of the week.
Here I plotted the frequency of incidents reported by day of the week. This helped me identify patterns or trends in incident occurrence related to specific days. Later, I plotted the top 20 offense descriptions by frequency, which gave an overview of the most common types of incidents reported. Moreover, I plotted the number of incidents reported by district, which helped me identify areas with higher or lower frequencies of reported incidents.
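A sketch of the day-of-week plot; the filename and the ‘OCCURRED_ON_DATE’ column name are assumptions on my part:

```python
import pandas as pd
import matplotlib.pyplot as plt

crime = pd.read_csv("crime_incident_reports.csv")          # hypothetical filename
crime["OCCURRED_ON_DATE"] = pd.to_datetime(crime["OCCURRED_ON_DATE"])

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_day = crime["OCCURRED_ON_DATE"].dt.day_name().value_counts().reindex(order)
by_day.plot(kind="bar", title="Incidents by day of week")
plt.show()
```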
Nov 17
On this day, to choose a dataset from the Analyze Boston website, I started doing a basic analysis of several candidate datasets.
First of all, I analyzed the property assessment data. It contains various columns related to property assessment, such as property identification, owner information, property value, and characteristics of the buildings. Next, I visualized some distributions from this dataset to get a better understanding of the data. Here I plotted the distribution of total property values on a logarithmic scale due to the wide range of values; this helps to better visualize the spread and concentration of property values across different magnitudes. I also plotted the distribution of the year the properties were built, which provided insight into the age of the properties within the dataset. These visualizations offered a glimpse into the property values and the age distribution of the buildings in the dataset.
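A sketch of the log-scale plot; the filename and the ‘TOTAL_VALUE’ column name are assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

assessments = pd.read_csv("property_assessment.csv")       # hypothetical filename
values = assessments["TOTAL_VALUE"].dropna()
values = values[values > 0]                                 # a log scale needs positive values

plt.hist(np.log10(values), bins=60)
plt.xlabel("log10(total property value)")
plt.ylabel("Number of properties")
plt.title("Distribution of total property values (log scale)")
plt.show()
```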
Nov 13th
Today I learned about time series analysis (TSA), a way of analyzing data points collected at consistent intervals over a period of time.
Time series analysis can help us understand how variables in the data change over time and the underlying causes of trends and patterns. TSA has two families of methods: frequency-domain methods and time-domain methods.
TSA can be further divided into parametric and non-parametric methods. 1) Parametric approaches work under the assumption that the underlying stationary stochastic process has a specific structure that can be described by a limited set of parameters; the objective is then to estimate and evaluate the parameters of the model characterizing the process. 2) Non-parametric approaches, on the other hand, directly estimate the covariance or spectrum of the process without presuming any specific structure.
Project Report on the Analysis of the Washington Post data repository on fatal police shootings in the United States
Nov 10 Project Report work
On this day I started work on a project report on the analysis of the Washington Post data repository on fatal police shootings in the United States. In the project report, I included the issues and the findings from the data analysis and the implications of the findings in the discussion section. Later I mentioned the methods for data collection, variable creation, and data analysis methods.
In the next part of the report, I included the results of the data analysis of the Washington Post data repository on fatal police shootings, along with the code I implemented to get those results.
Nov 8 Project work
After performing the demographic analysis, I moved on to temporal data analysis. After plotting the number of shootings from 2015 to the present, I can see that the number of shootings fluctuates between roughly 60 and 100 over the period.
Moving to armed vs. unarmed fatalities, guns and knives account for the largest share of fatalities, more than 76% of the total number of deaths from 2015 to now.
From the analysis of the number of fatalities with signs of mental illness, I can conclude that around 20% of the fatalities showed signs of mental illness, whereas the rest did not.
6 Nov Project work
Further analyzing the data on shootings by police department, the highest number of police shootings is attributed to the Los Angeles Police Department, followed by the Phoenix Police Department. Similarly, by city, the highest numbers of shootings occur in Los Angeles and Phoenix, followed by Houston.
Moreover, after analyzing the age distribution of individuals involved in shootings, I can conclude that the majority of individuals dying in a shooting incident are within the age range of 20 to 40, and that the distribution is positively (right-) skewed.
Nov 3 Project work
Today I loaded the data into a Python notebook and checked for missing values and outliers. I then calculated summary statistics for variables like age and race, and subsequently plotted histograms of the distribution of fatal encounters by gender, by race, and by state. From the histogram of fatal encounters by gender, I can conclude that men account for the highest number of fatalities, followed by women. From the plot of fatal encounters by race, most fatalities involve people of White ethnicity, followed by people of African American race and then Unknown. Finally, from the histogram of fatal encounters by state, it is clear that California has the highest number of fatalities, followed by Texas and Florida.
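A rough sketch of today’s steps; the filename and column names follow the Washington Post repository’s layout as I remember it, so treat them as assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

shootings = pd.read_csv("fatal-police-shootings-data.csv")  # assumed filename
print(shootings.isna().sum())                               # missing values per column
print(shootings["age"].describe())                          # summary statistics for age

for col in ["gender", "race", "state"]:                     # assumed column names
    shootings[col].value_counts().plot(kind="bar", title=f"Fatal encounters by {col}")
    plt.show()
```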
Nov 1 Project Plan
Today I started working on the project. In this project, I’m planning to perform an analysis of each of the parameters in the Washington Post data repository on fatal police shootings in the United States. This will involve:
1. Overall data analysis: key statistics, such as the total number of incidents, average age, gender distribution, etc.
2. Demographic analysis: analysis of the data by demographics like age, gender, race, and ethnicity.
3. Geospatial analysis: plotting the geographical distribution of incidents on maps.
4. Temporal analysis: analyzing whether there are any noticeable changes in incidents over time.
5. Use of force analysis: the types of force used in each incident.
6. Incident description analysis: analyzing the incident descriptions to find commonalities in the incidents.
30 Oct
On this day I learned about Hierarchical clustering which is a method of cluster analysis that seeks to build a hierarchy of clusters. This hierarchy is represented as a tree. Hierarchical clustering can be performed in two ways:
Agglomerative hierarchical clustering: This method starts with each data point as its own cluster, so N data points give N clusters. The two closest clusters are then merged, and the merging is repeated until only one cluster remains. This is a bottom-up approach.
Divisive hierarchical clustering: This method is the opposite of agglomerative clustering. We begin with a single cluster containing all the points and keep dividing it until every data point is a separate cluster, which makes it a top-down approach.
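A toy sketch of agglomerative clustering with SciPy on made-up 2-D points, just to see the bottom-up merging and the resulting tree:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(points, method="ward")                 # bottom-up merging of the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(set(labels))

dendrogram(Z)                                      # the hierarchy visualized as a tree
plt.show()
```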
27 Oct
On this day I learned about DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The DBSCAN algorithm forms clusters based on the density of the data points, so the number of clusters is determined by the data rather than specified in advance. This method can be useful for identifying outliers, as data points far from the dense clusters can easily be marked as noise.
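A minimal DBSCAN sketch on made-up points; the eps and min_samples values are arbitrary and would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(0, 0.3, (50, 2))        # one dense blob
outliers = rng.uniform(-5, 5, (5, 2))      # a few scattered points
X = np.vstack([dense, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("Cluster labels:", set(labels))      # label -1 marks points treated as noise/outliers
```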
In my opinion, clustering techniques can be very useful for identifying patterns in the data, such as the prevalence of shooting fatalities by location (i.e., regions with a high number of shootings), weapons used, demographics, the intended use of force, and temporal fluctuations in shootings over time.
25 Oct
On this day I learned about the K-means and K-medoids clustering methods. 1) K-means is a method for grouping data points into k clusters. Points are assigned to clusters based on the mean value of their features, and the mean of the points in a cluster acts as its centroid. This method can be affected by outliers, since the mean can shift in the presence of outlier data points.
2) K-medoids is another clustering method in which the most centrally located point in a cluster (the medoid) is used as the cluster’s representative. Unlike K-means, this method is less affected by outliers, as medoids do not shift significantly in the presence of outlier data points.
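A small K-means sketch with scikit-learn on made-up points; K-medoids is not part of scikit-learn itself (it is available in add-on packages such as scikit-learn-extra), so only K-means is shown:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Centroids (cluster means):", km.cluster_centers_)
print("First 10 cluster assignments:", km.labels_[:10])
```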
Mon 23 Oct
In this lecture, I learned about unsupervised learning. I’ve learned that it’s a machine learning method where algorithms explore data without predefined labels or outcomes. I understood that the main objective is to uncover patterns and structures within the data, primarily through clustering and dimensionality reduction techniques. I’ve gained the knowledge that unsupervised learning has numerous real-world applications, and that it plays a crucial role in discovering insights from data where the inherent structure isn’t immediately apparent.
I also learned about clustering in unsupervised learning. Clustering in unsupervised learning is the process of grouping similar data points together without predefined categories. The goal of clustering is to identify natural groupings or clusters within the dataset, allowing for the discovery of underlying structures and relationships. Once the clusters are formed, the results are interpreted by examining the characteristics of data points within each cluster. This can provide insights into the natural groupings or patterns present in the data. Various clustering algorithms, such as K-means and hierarchical clustering, are used to achieve this.
Fri 20 Oct
After attending the lecture, I learned multiple logistic regression. I’ve learned that this statistical method allows me to analyze the influence of multiple predictor variables on a binary outcome. I can estimate coefficients for each predictor, which reveal how they impact the likelihood of the event occurring. I understand that positive coefficients increase the likelihood, while negative coefficients decrease it. This knowledge equips me to apply multiple logistic regression for a deeper understanding of complex relationships.
I also learned about multinomial logistic regression. I’ve learned that this statistical technique is used when the outcome variable has more than two categories, and it allows me to model the probability of each category based on predictor variables. I can estimate coefficients for each predictor and category, helping me understand how these predictors influence the likelihood of being in a specific category relative to a reference category. This knowledge equips me to apply multinomial logistic regression for analyzing and predicting categorical outcomes with multiple categories.
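As a reference for myself, a small sketch of fitting a multinomial logistic regression on a built-in three-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # three outcome categories
clf = LogisticRegression(max_iter=1000)    # fits a multi-class (multinomial) model
clf.fit(X, y)

print(clf.coef_.shape)                     # one row of coefficients per outcome category
print(clf.predict_proba(X[:3]))            # predicted probability of each category
```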
18 Oct Wed
In this lecture, I have learned about logistic regression coefficients. I can now interpret these coefficients as values that reveal how predictor variables affect the log odds of a binary outcome. I understand that positive coefficients indicate an increased likelihood of the outcome, while negative coefficients suggest a decreased likelihood. I also recognize that the magnitude and significance of coefficients provide insight into the strength and importance of these relationships. This knowledge equips me to effectively analyze and make informed decisions in various fields where logistic regression is applied.
I also learned to estimate logistic regression coefficients. I have learned that this process involves finding values that depict the relationship between predictor variables and the log odds of a binary outcome in logistic regression. I now know how each coefficient signifies the change in log odds for a one-unit change in the associated predictor. Moreover, I understand the importance of assessing the magnitude and significance of coefficients for meaningful interpretation. This knowledge equips me to use logistic regression coefficients to predict binary outcomes.
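A tiny sketch of the interpretation step: exponentiating a coefficient gives the odds ratio, i.e. the multiplicative change in the odds for a one-unit increase in that predictor (the coefficient values below are illustrative only):

```python
import numpy as np

coefficients = np.array([0.8, -0.3])       # illustrative values, not from real data
odds_ratios = np.exp(coefficients)
print(odds_ratios)                         # >1 raises the odds of the outcome, <1 lowers them
```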
16 Oct Mon
After attending the lecture, I now have an understanding of permutation tests. I’ve learned that these tests are a valuable non-parametric statistical approach used when we can’t make assumptions about a specific data distribution. I’ve grasped the process of creating a null distribution by shuffling data, calculating a test statistic, and using it to compute p-values by comparing it to the distribution of test statistics from shuffled data. I can now apply this method to make valid statistical inferences based on observed data.
I also learned about Logistic regression, a statistical technique for predicting binary outcomes using predictor variables. I grasp how it employs the logistic function to model probabilities and how it estimates coefficients for these predictor variables to determine their impact on the outcome. I can see its applications in various fields, such as binary classification, and how it can be adapted for different types of categorical outcomes, making it a valuable tool for data analysis and prediction.
Fri 13 Oct
After attending this lecture, I now understand the process of estimating a p-value from a simulation as a statistical method for assessing the significance of observed data. I have learned that it involves the formulation of a null hypothesis and an alternative hypothesis, and then simulating numerous datasets assuming the null hypothesis is true. These simulations help create a distribution of test statistics, from which I can calculate a p-value. I have gained the knowledge that a small p-value (typically less than 0.05) indicates statistical significance, allowing me to confidently reject the null hypothesis in favor of the alternative hypothesis.
I also learned the concept of hypothesis testing with permutation tests. I have learned that this statistical approach is particularly useful when dealing with data for which the underlying distribution is uncertain or doesn’t conform to standard assumptions. I understand the process of creating a null distribution by shuffling or reordering the data, calculating a test statistic, and determining the p-value by comparing the observed test statistic to the distribution from permuted data.
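A minimal permutation-test sketch on made-up samples, following the shuffle-and-compare procedure described above:

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.3, 1.0, 100)
observed = group_b.mean() - group_a.mean()          # observed test statistic

pooled = np.concatenate([group_a, group_b])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)              # reshuffle group labels under the null
    diff = shuffled[100:].mean() - shuffled[:100].mean()
    count += diff >= observed
print("One-sided p-value:", count / n_perm)
```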
Wed 11 Oct
Today I learned about Monte Carlo testing. Monte Carlo testing is a statistical method that uses random sampling techniques to approximate complex mathematical results or solve problems that might be deterministic in principle. It derives its name from the Monte Carlo Casino in Monaco, known for its games of chance, reflecting the random nature of the method.
Monte Carlo testing is particularly useful when dealing with problems that are analytically intractable or too complex to solve directly. It aims to provide numerical approximations for problems involving uncertainty, variability, and random processes. Instead of solving a problem analytically, Monte Carlo testing relies on generating a large number of random samples or simulations to estimate a solution. Monte Carlo methods are effective for analyzing complex systems with many interacting components and uncertainties. It is well-suited for parallel computing, making it computationally scalable.
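A classic illustration of the idea is estimating pi by random sampling; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
x, y = rng.random(n), rng.random(n)            # random points in the unit square
inside = (x**2 + y**2 <= 1.0).mean()           # fraction falling inside the quarter circle
print("Estimate of pi:", 4 * inside)           # converges to ~3.1416 as n grows
```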
Report on Analysis of the Centers for Disease Control and Prevention Data -2018
6 Oct
Today I conducted a more in-depth EDA to uncover patterns, correlations, and potential multicollinearity among variables. Visualization techniques were employed to better understand how obesity and inactivity relate to diabetes prevalence. Later on, I started developing our linear regression model, using obesity and inactivity as independent variables and diabetes as the dependent variable. This involved testing the assumptions of linear regression, such as linearity, independence, homoscedasticity, and normality of residuals. Next, I assessed the preliminary performance of our linear regression model, examining metrics like R-squared, adjusted R-squared, and residual plots. This step allowed us to identify areas for improvement and fine-tune our model for better predictive accuracy.
Recognizing the potential interactions between obesity and inactivity, I explored interaction terms in our model. This step allows me to capture the combined effect of these variables on diabetes prevalence, providing a more nuanced understanding. I also implemented cross-validation techniques to assess the generalizability of my model. Additionally, I used various validation metrics to evaluate model performance.
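A hedged sketch of this modeling step, assuming the merged data is in a DataFrame `cdc` with columns named 'obesity', 'inactivity', and 'diabetes' (the actual column names in my notebook may differ):

```python
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Main effects plus an interaction term between obesity and inactivity.
model = smf.ols("diabetes ~ obesity * inactivity", data=cdc).fit()
print(model.summary())                         # R-squared, adjusted R-squared, coefficients

# 5-fold cross-validation of the two-predictor linear model as a generalizability check.
scores = cross_val_score(LinearRegression(),
                         cdc[["obesity", "inactivity"]], cdc["diabetes"],
                         cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())
```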
4 Oct
I continued to preprocess the data and engineer relevant features. This included creating new variables to represent the combined effects of obesity and inactivity on diabetes risk. Feature engineering is essential for building more accurate predictive models.
Later I started developing machine learning models to predict diabetes risk based on inactivity levels. My preliminary models include linear regression. I am working on fine-tuning the model and plan to explore more complex algorithms in the coming weeks. Next, building on my initial EDA, I created more advanced visualizations to illustrate the relationships between obesity, inactivity, and diabetes. These visualizations will play a crucial role in communicating our findings.
2 Oct Residuals and Heteroskedasticity
Today I plotted the residuals against the predicted values from the linear model and noticed a distinctive fanning-out pattern in the residuals as the fitted values increased. This observation led me to conclude that, in this situation, the heteroscedasticity of the residuals (specifically, their increasing variability with larger predicted values) serves as a major warning sign that the linear model may not be reliable.
When there’s heteroskedasticity, OLS (Ordinary Least Squares) isn’t the best choice because it treats all data points equally, even though some points are more reliable than others. If some data points are more uncertain (have a larger disturbance variance), they should be given less weight, but OLS doesn’t do that: the coefficient estimates remain unbiased, yet they are no longer efficient and the usual standard errors become unreliable. So I decided to test for heteroscedasticity analytically using the Breusch-Pagan test.
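A sketch of both diagnostics, assuming `model` is a fitted statsmodels OLS result like the one sketched in the 6 Oct entry above:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Residuals vs. fitted values: a fanning-out pattern suggests heteroskedasticity.
plt.scatter(model.fittedvalues, model.resid, s=5)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)     # a small p-value suggests heteroskedasticity
```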
29 Sept Project work progress
Today I continued to preprocess the data and engineer relevant features. This included merging the tables to represent the combined effects of obesity and inactivity on diabetes risk. After merging, I found that only 354 FIPS codes are common to the obesity, inactivity, and diabetes data, while 1370 FIPS codes are common between inactivity and diabetes. So I decided to analyze the inactivity and diabetes data first to get a better picture of the merged dataset.
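A sketch of the merge, assuming the three tables are already loaded as DataFrames sharing a 'FIPS' column (exact column name assumed):

```python
# Inner joins keep only the FIPS codes present in every table being merged.
all_three = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
two = diabetes.merge(inactivity, on="FIPS")
print(len(all_three), len(two))    # in my run: 354 and 1370 FIPS codes respectively
```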
Building on my initial EDA, I created visualizations to illustrate the relationships between inactivity and diabetes. These visualizations will play a crucial role in communicating our findings. I did a bivariate analysis and a heatmap analysis of inactivity and diabetes to get a better understanding of the relation between the two. Further, I plotted the probability density function and cumulative distribution function for both to understand the distribution of the data points.
This week I’m planning to develop machine learning models to predict diabetes risk based on inactivity levels. My preliminary model will include logistic regression.
Wed 27th Sept Project work
Today marked the beginning of my project focused on analyzing the CDC 2018 diabetes data, with specific attention to parameters such as obesity, inactivity, and diabetes. Today I outlined the project’s scope and objectives.
I performed preliminary data cleaning. This involved handling missing values, standardizing data formats, and ensuring data integrity. Using Jupyter Notebook I conducted an initial EDA to gain insights into the dataset. I created visualizations to understand the distribution of obesity, inactivity, and diabetes. In the initial EDA, I created visualizations to illustrate the relationships between obesity, inactivity, and diabetes. These visualizations will play a crucial role in communicating our findings. For this, I used Pandas and Seaborn Python libraries for the initial EDA.
After initial analysis, I found there are 354 FIPS for which there is 2018 CDC data on all three of diabetes, obesity, and inactivity. I will perform inner join on obesity, inactivity, and diabetes data to do further analysis.
Sept 25 Cross-Validation
Today we learned about cross-validation and bootstrap. Cross-validation and bootstrap are both resampling techniques commonly used to assess the performance of models and estimate their generalization errors. Cross-validation is used to measure the performance of the predictive model. The basic idea is to divide your dataset into two parts: training and testing sets.
We also learned about K-fold cross-validation. In this approach, the dataset is divided into K equally sized subsets, or folds. The model is trained on K-1 of these folds and tested on the remaining fold. This process is repeated K times, with each fold serving as the test set exactly once, and the results are then averaged to get an overall performance measure.
Bootstrap is a statistical resampling technique used to estimate the sampling distribution of a statistic by repeatedly resampling the dataset with replacement.
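A minimal sketch of both ideas on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# K = 5: train on 4 folds, test on the held-out fold, repeat so each fold is tested once.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Mean CV R^2:", scores.mean())

# Bootstrap: resample with replacement to estimate the sampling variability of the mean.
boot_means = [rng.choice(y, size=len(y), replace=True).mean() for _ in range(2000)]
print("Bootstrap SE of the mean:", np.std(boot_means))
```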
Sept 20 Excellent Linear model
Today we learned that an excellent linear model fit to data with non-normally distributed, skewed, high-variance, and high-kurtosis variables typically indicates that the model is robust and effectively captures the underlying relationships between the variables despite the challenging characteristics of the data.
We also learned about the t-test, a statistical hypothesis test used to determine if there is a significant difference between the means of two groups or populations. It is employed when you have two sets of data and want to assess whether the means of these two sets are statistically different from each other. The t-test produces a test statistic and a p-value. The t-value quantifies the difference between the group means relative to the variability within the groups, while the p-value tells you the probability of observing such a difference if there were no true differences in the populations from which the samples were drawn.
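A small two-sample t-test sketch with SciPy on illustrative data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(10.0, 2.0, 50)
group2 = rng.normal(11.0, 2.0, 50)

t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # a small p-value suggests the means differ
```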
Sept 18 Interaction model: linear regression with two predictor variables, interaction terms, and quadratic terms
Today we learned about the interaction model. Interaction is a unique characteristic observed when three or more variables are involved, wherein at least two of these variables combine in a way that influences a third variable in a manner that is not merely additive. In other words, the two variables interact in such a way that their combined effect exceeds the simple summation of their individual impacts. An interaction effect occurs when the effect of one variable depends on the value of another variable.
Interaction models are essential for understanding and accounting for complex relationships in data. They help uncover patterns and better interpret the relationships between variables in their data.
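Written out, a two-predictor linear model with an interaction term looks like:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon $$

Here the effect of $x_1$ on $y$ is $\beta_1 + \beta_3 x_2$, so it depends on the value of $x_2$, which is exactly the non-additive behavior described above.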
Wed Sept 13
In today’s class, we learned the concept of Null hypothesis and p-value.
A null hypothesis test follows the logic of proof by contradiction. The null hypothesis is the default assumption we make when approaching a problem. We provisionally assume the null hypothesis is true and then examine whether the observed data are consistent with that assumption: if the data would be very unlikely under the null hypothesis, we reject it in favor of the alternative hypothesis; otherwise, we fail to reject the null hypothesis.
The p-value is the probability of observing a result at least as extreme as the one we obtained, assuming the null hypothesis is true. If the p-value is very low, the observed data are very unlikely under the null hypothesis, so we reject the null hypothesis and accept the alternative hypothesis.
Monday 11 Sept Exploring the CDC 2018 diabetes data
Today we analyzed the CDC 2018 diabetes data with a specific focus on parameters such as obesity, inactivity, and diabetes. Since inactivity and diabetes have the highest data point counts in the dataset, we can explore whether there is any relation between these two parameters. The Diabetes and Inactivity features have 1370 data points in common.
We analyzed the smoothed histograms for the diabetes and inactivity data, examined the pair plot for inactivity and diabetes, and learned the concept of heteroscedasticity.
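A sketch of the plots, assuming the merged data is in a DataFrame `cdc` with 'diabetes' and 'inactivity' columns (names assumed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(cdc["diabetes"], label="% diabetic")        # smoothed histogram (kernel density)
sns.kdeplot(cdc["inactivity"], label="% inactive")
plt.legend()
plt.show()

sns.pairplot(cdc[["diabetes", "inactivity"]])           # pairwise scatter plus distributions
plt.show()
```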