29 Nov

Today I analyzed the distribution of earnings within the departments to understand the variance and identify any outliers. This involved calculating the standard deviation of earnings within each department and visualizing this information to provide a clear understanding of the earnings spread.

Next, I calculated the standard deviation of earnings by department, which gives an idea of the spread of earnings within each one: a higher standard deviation indicates a wider range of earnings among that department's employees. The Superintendent's office shows the highest variability in earnings, followed by the Boston Police Department and BPS High School Renewal. This is useful for understanding the pay scale and the distribution of earnings within each department.
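A rough sketch of how this grouping could be done with pandas, assuming the DEPARTMENT_NAME and TOTAL_GROSS column names described elsewhere in this log (the file name is a placeholder):

    import pandas as pd

    # Placeholder file name; the actual Analyze Boston export may differ.
    df = pd.read_csv("employee_earnings_report.csv")

    # Standard deviation of total gross earnings within each department,
    # sorted so the most variable departments appear first.
    std_by_dept = (
        df.groupby("DEPARTMENT_NAME")["TOTAL_GROSS"]
          .std()
          .sort_values(ascending=False)
    )
    print(std_by_dept.head(10))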

Nov 27

The Employee Earnings Report dataset has various columns such as NAME, DEPARTMENT_NAME, and TITLE, along with different categories of earnings; the POSTAL column indicates postal codes.

Today I will proceed with analyzing the distribution of TOTAL_GROSS earnings across different departments, identifying the top earners in the dataset, and summarizing the average earnings by department or title.

Among the top 10 departments by total gross earnings, the Boston Police Department comes first, followed by the Boston Fire Department and BPS Special Education.

Later I identified the top 10 earners in the dataset, sorted by their total gross earnings. The individuals listed are from various departments, with the Boston Police Department being prominent among the top earners.

Subsequently, I summarized the average earnings by department and title. The Superintendent's office has the highest average total gross earnings, followed by the Boston Fire Department and School Support & Transformation. This summary provides insight into the average compensation levels across different departments.
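A sketch of these three summaries in pandas, assuming the column names listed above (TOTAL_GROSS may need to be converted to a numeric type first; the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv("employee_earnings_report.csv")  # placeholder path

    # Top 10 departments by total gross earnings.
    top_departments = df.groupby("DEPARTMENT_NAME")["TOTAL_GROSS"].sum().nlargest(10)

    # Top 10 individual earners by total gross earnings.
    top_earners = df.nlargest(10, "TOTAL_GROSS")[["NAME", "DEPARTMENT_NAME", "TOTAL_GROSS"]]

    # Average total gross earnings by department.
    avg_by_dept = df.groupby("DEPARTMENT_NAME")["TOTAL_GROSS"].mean().sort_values(ascending=False)

    print(top_departments, top_earners, avg_by_dept, sep="\n\n")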

Nov 24

On this day I furthered my analysis of the Employee Earnings Report dataset. I plotted the distribution of regular earnings across the top 10 departments by median earnings. This visualization helps to compare the spread and central tendency of earnings within these departments.

From my analysis, the dataset collects seven types of employee earnings. Every type has some rows with no data present, which indicates that the respective employee has no inflow of money from that type of earnings.

Here is the number of missing rows for each type of earnings. It would be better to fill these empty values with 0 before continuing the analysis of this dataset.

REGULAR: 600
RETRO: 20112
OTHER: 7378
OVERTIME: 16392
INJURED: 21983
DETAIL: 21088
QUINN_EDUCATION: 21835
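A minimal sketch of the proposed fix, assuming the seven earnings columns carry the names shown in the list above (the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv("employee_earnings_report.csv")  # placeholder path

    earning_cols = ["REGULAR", "RETRO", "OTHER", "OVERTIME",
                    "INJURED", "DETAIL", "QUINN_EDUCATION"]

    # Count the missing rows per earnings type, then fill them with 0 so that
    # downstream aggregations treat "no inflow" as zero earnings.
    print(df[earning_cols].isna().sum())
    df[earning_cols] = df[earning_cols].fillna(0)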

22 Nov

Today, I focused on analyzing the Employee Earnings Report dataset. This dataset contains employee names, job details, and earnings information including base salary, overtime, and total compensation for employees of the City.

Here, I plotted the top 20 departments by total gross earnings, highlighting where the most earnings are concentrated within the organization.

Later, I plotted a histogram of the distribution of total gross earnings, focusing on the range from the 1st to the 99th percentile to exclude outliers for a clearer view of the data’s spread.

Next, I made a scatter plot visualizing the relationship between regular earnings and overtime earnings, showing how these two factors correlate across the dataset.

After the scatter plot, I found that the correlation coefficient between regular earnings and overtime earnings is approximately 0.505, indicating a moderate positive relationship. This suggests that, on average, higher regular earnings are associated with higher overtime earnings, but not strongly so.
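The plots and the correlation figure came from a few pandas/matplotlib calls along these lines (a sketch; the file name and exact column names are assumptions):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("employee_earnings_report.csv")  # placeholder path

    # Histogram of total gross earnings, restricted to the 1st-99th
    # percentile so extreme outliers do not dominate the x-axis.
    low, high = df["TOTAL_GROSS"].quantile([0.01, 0.99])
    df.loc[df["TOTAL_GROSS"].between(low, high), "TOTAL_GROSS"].hist(bins=50)
    plt.xlabel("Total gross earnings")
    plt.show()

    # Scatter plot and correlation of regular vs. overtime earnings.
    df.plot.scatter(x="REGULAR", y="OVERTIME", alpha=0.3)
    plt.show()
    print(df["REGULAR"].corr(df["OVERTIME"]))  # roughly 0.505 on this data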

Nov 15th Basic analysis of datasets from Analyze Boston

Today marked the beginning of my data analysis project focusing on economic indicators provided by Analyze Boston. I collected the dataset and did some basic data exploration on the economic indicators dataset from Analyze Boston.

The dataset contains various economic indicators such as passenger numbers at Logan Airport, international flights, hotel occupancy rates, average daily rates for hotels, total jobs, unemployment rate, labor rate, housing and development metrics, etc.

Here, I performed a descriptive analysis of the dataset to get an overview of the data. The primary objectives for today included setting up the project infrastructure, gathering the dataset, and familiarizing myself with the available economic indicators.
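The descriptive pass boiled down to a few standard pandas calls; a sketch, with a placeholder file name:

    import pandas as pd

    econ = pd.read_csv("economic_indicators.csv")  # placeholder path

    print(econ.shape)         # number of rows and columns
    print(econ.dtypes)        # column types
    print(econ.describe())    # summary statistics for the numeric indicators
    print(econ.isna().sum())  # missing values per column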

Nov 20

On this day I did a basic analysis of the Crime Incident Reports dataset. Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. The dataset contains incident reports with various details such as incident number, offense code, description, date, and location. To further understand the data, I will create a visualization that shows the frequency of incidents by day of the week.

Here I plotted the frequency of incidents reported by day of the week, which helped me identify patterns or trends in incident occurrence related to specific days. Later, I plotted the top 20 offense descriptions by frequency, which gave an overview of the most common types of incidents reported. Moreover, I plotted the number of incidents reported by district, which helped me identify areas with higher or lower frequencies of reported incidents.
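A sketch of these three counts with pandas, assuming the BPD export uses the DAY_OF_WEEK, OFFENSE_DESCRIPTION, and DISTRICT column names (the file name is a placeholder):

    import pandas as pd
    import matplotlib.pyplot as plt

    crime = pd.read_csv("crime_incident_reports.csv")  # placeholder path

    # Frequency of incidents by day of the week, in calendar order.
    order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    crime["DAY_OF_WEEK"].value_counts().reindex(order).plot.bar()
    plt.show()

    # Top 20 offense descriptions and incident counts per district.
    crime["OFFENSE_DESCRIPTION"].value_counts().head(20).plot.barh()
    plt.show()
    crime["DISTRICT"].value_counts().plot.bar()
    plt.show()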

Nov 17

On this day, to choose among the various datasets on the Analyze Boston website, I started doing a basic analysis of several of them.

First of all, I analyzed the property assessment data. It contains various columns related to property assessment, such as property identification, owner information, property value, and characteristics of the buildings. Next, I visualized some distributions from this dataset to get a better understanding of the data.

I plotted the distribution of total property values on a logarithmic scale, due to the wide range of values; this helps to better visualize the spread and concentration of property values across different magnitudes. I also plotted the distribution of the year properties were built, which provided insight into the age of the properties within the dataset. These visualizations offered a glimpse into the property values and the age distribution of the buildings in the dataset.
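A sketch of the two plots, with assumed column names (the property assessment files vary by year, so TOTAL_VALUE and YR_BUILT may need adjusting; the file name is a placeholder):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    prop = pd.read_csv("property_assessment.csv")  # placeholder path

    # Total property value on a log10 scale, since values span several
    # orders of magnitude.
    values = pd.to_numeric(prop["TOTAL_VALUE"], errors="coerce").dropna()
    np.log10(values[values > 0]).hist(bins=50)
    plt.xlabel("log10(total property value)")
    plt.show()

    # Distribution of the year each property was built.
    pd.to_numeric(prop["YR_BUILT"], errors="coerce").dropna().hist(bins=60)
    plt.xlabel("Year built")
    plt.show()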

Nov 13th

Today I learned about time series analysis (TSA). It is a way of analyzing data points collected over time, where the data points are collected at consistent intervals.

Time series analysis can help us understand how variables change over time and the underlying causes of trends and patterns. TSA has two families of methods: frequency-domain methods and time-domain methods.

TSA can be further divided into parametric and non-parametric methods. 1) Parametric approaches work with the assumption that the underlying stationary stochastic process has a specific structure that can be described through a limited set of parameters; in these methods, the objective is to estimate and evaluate the parameters of the model characterizing the stochastic process. 2) On the other hand, non-parametric approaches directly determine the covariance or spectrum of the process without presuming any specific structure for it.

Nov 10 Project Report work

On this day I started work on a project report on the analysis of the Washington Post data repository on fatal police shootings in the United States. In the project report, I included the issues and the findings from the data analysis, and discussed the implications of the findings in the discussion section. Later I described the methods for data collection, variable creation, and data analysis.

In the next part of the report, I included the results of the data analysis of the Washington Post data repository on fatal police shootings, along with the code I implemented to produce those results.

Nov 8 Project work

After performing the demographic analysis, I moved forward with temporal data analysis. After plotting the number of shootings from 2015 to the present, we can see that the number of shootings over the years fluctuates between 60 and 100.
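A sketch of how the counts over time can be produced, assuming the Washington Post CSV and its 'date' column (the file name is a placeholder):

    import pandas as pd
    import matplotlib.pyplot as plt

    shootings = pd.read_csv("fatal-police-shootings-data.csv",
                            parse_dates=["date"])  # placeholder path

    # Number of fatal shootings per month from 2015 to the present.
    monthly = shootings.set_index("date").resample("M").size()
    monthly.plot()
    plt.ylabel("Fatal shootings per month")
    plt.show()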

Moving to armed vs. unarmed fatalities, guns and knives account for the largest share of fatalities, more than 76% of the total number of deaths from 2015 to now.

From the analysis of the number of fatalities with signs of mental illness, I can conclude that around 20% of the fatalities involved signs of mental illness, whereas the rest did not.

6 Nov Project work

Further analyzing the data on shootings by police department, the highest number of police shootings involve the Los Angeles Police Department, followed by the Phoenix Police Department. Similarly, by city, the highest numbers of shootings occur in Los Angeles and Phoenix, followed by Houston.

Moreover, after analyzing the age distribution of individuals involved in shootings, I can conclude that the majority of individuals dying in a shooting incident are between 20 and 40 years old. The distribution is positively skewed (right-skewed).
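A sketch of the city counts and the age-distribution check, assuming the 'city' and 'age' columns of the Washington Post data (the file name is a placeholder):

    import pandas as pd
    import matplotlib.pyplot as plt

    shootings = pd.read_csv("fatal-police-shootings-data.csv")  # placeholder path

    # Cities with the most fatal shootings.
    print(shootings["city"].value_counts().head(10))

    # Age distribution and its skewness; a positive skew value matches
    # the right-skewed shape of the histogram.
    shootings["age"].dropna().hist(bins=40)
    plt.xlabel("Age")
    plt.show()
    print(shootings["age"].skew())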

Nov 3 Project work

Today I loaded the data into a Python notebook and checked for missing values and outliers. I then calculated summary statistics for variables like age and race, and subsequently plotted histograms of the distribution of fatal encounters by gender, by race, and by state.

From the fatal encounters by gender histogram, I can conclude that men have the highest number of fatalities, followed by women. From the plot of fatal encounters by race, most fatalities happen to people of white ethnicity, followed by African American people, followed by Unknown. Finally, from the fatal encounters by state histogram, it is clear that California has the highest number of fatalities, followed by Texas and Florida.
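A sketch of this first pass, assuming the 'gender', 'race', and 'state' column names from the Washington Post data (the file name is a placeholder):

    import pandas as pd
    import matplotlib.pyplot as plt

    shootings = pd.read_csv("fatal-police-shootings-data.csv")  # placeholder path

    # Missing values per column and a summary of age.
    print(shootings.isna().sum())
    print(shootings["age"].describe())

    # Counts of fatal encounters by gender, race, and state.
    for col in ["gender", "race", "state"]:
        shootings[col].value_counts().plot.bar(title=col)
        plt.show()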

Nov 1 Project Plan

Today I started working on the project. In this project, I'm planning to perform an analysis of each of the parameters in the Washington Post data repository on fatal police shootings in the United States. This will involve:

1. Overall data analysis: key statistics, such as the total number of incidents, average age, gender distribution, etc.
2. Demographic analysis: analysis of the data by demographics like age, gender, race, and ethnicity.
3. Geospatial analysis: plotting the geographical distribution of incidents on maps.
4. Temporal analysis: analyzing whether there are any noticeable changes in incidents over time.
5. Use of force analysis: the types of force used in each incident.
6. Incident description analysis: analyzing the incident descriptions to find commonalities among the incidents.

30 Oct

On this day I learned about hierarchical clustering, which is a method of cluster analysis that seeks to build a hierarchy of clusters. This hierarchy is represented as a tree (a dendrogram). Hierarchical clustering can be performed in two ways:

Agglomerative Hierarchical Clustering: This method starts with each data point as its own cluster, so N data points give N clusters. It then merges the two closest clusters and keeps repeating the merging process until a single cluster remains. This is a bottom-up approach.

Divisive Hierarchical Clustering: This method is the opposite of agglomerative hierarchical clustering. We begin with one cluster containing all the data points and keep dividing it until every data point is a separate cluster. This is a top-down approach.
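A small sketch of the agglomerative (bottom-up) variant on toy data, using SciPy's linkage and dendrogram functions:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Two well-separated toy groups of points.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    Z = linkage(X, method="ward")  # merge history, i.e. the tree
    dendrogram(Z)                  # visualize the hierarchy
    plt.show()

    # Cut the tree into two clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)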

27 Oct

On this day I learned about DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The DBSCAN algorithm creates clusters based on the density of the data points, so the number of clusters follows from the density structure rather than being fixed in advance. This method can be useful for identifying outliers, as data points far away from the dense clusters can easily be marked as noise.
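A small sketch with scikit-learn's DBSCAN on toy data, showing how a far-away point gets labelled as noise (-1):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense blobs plus one isolated point.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (30, 2)),
                   rng.normal(5, 0.3, (30, 2)),
                   [[20.0, 20.0]]])

    # eps is the neighborhood radius, min_samples the density threshold.
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    print(set(labels))  # e.g. {0, 1, -1}; -1 marks the noise point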

In my opinion, clustering techniques can be very useful for identifying patterns in the data, such as the prevalence of shooting fatalities by location (i.e., regions with a high number of shootings), weapons used, demographics, the intended use of force, and temporal patterns such as the fluctuation of shootings over time.

25 Oct

On this day I learned about the k-means and k-medoids clustering methods. 1) K-means is a method of grouping data points into k clusters. Points are assigned to clusters based on their distance to the cluster means, and the mean of the points in each cluster acts as its centroid. This method can be affected by outliers, since the mean can shift in the presence of outlier data points.

2) K-medoids is another clustering method, in which the most centrally located actual data point in a cluster is used as the cluster's representative. Unlike k-means, this method is much less affected by outliers, since the medoids do not shift significantly in the presence of outlier data points.
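A small sketch of the outlier sensitivity with scikit-learn's KMeans on toy data; k-medoids itself is not in core scikit-learn, but the scikit-learn-extra package provides a KMedoids class with a similar interface:

    import numpy as np
    from sklearn.cluster import KMeans

    # Fifty points near the origin plus one extreme outlier.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), [[100.0, 100.0]]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # With k=2, one centroid typically ends up on or near the outlier,
    # illustrating how mean-based centroids react to extreme points.
    print(km.cluster_centers_)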