<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Will's Data Portfolio]]></title><description><![CDATA[My Data Analytics Portfolio]]></description><link>https://www.williamameyer.com/</link><image><url>https://www.williamameyer.com/favicon.png</url><title>Will&apos;s Data Portfolio</title><link>https://www.williamameyer.com/</link></image><generator>Ghost 5.89</generator><lastBuildDate>Wed, 13 May 2026 10:22:37 GMT</lastBuildDate><atom:link href="https://www.williamameyer.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Data Analytics Capstone - Full Explanation]]></title><description><![CDATA[<p><strong>Research Question</strong></p><p>&#x201C;We are never going to reach equality in America &#x2026; until we achieve equality in education.&#x201D; (The Aspen Institute, 2017). Justice Sotomayor&#x2019;s words in this interview offer an ugly lesson on the current state of affairs in America, inherently stating that the US does</p>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-full-explanation/</link><guid isPermaLink="false">6868506246755058329ed839</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 22:26:14 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Full Explanation"><p><strong>Research Question</strong></p><p>&#x201C;We are never going to reach equality in America &#x2026; until we achieve equality in education.&#x201D; (The Aspen Institute, 2017). Justice Sotomayor&#x2019;s words in this interview offer an ugly lesson on the current state of affairs in America, inherently stating that the US does not pass the test to be called equal, and it cannot truly pass that test until education is equal. After growing up in a minority family in a subsidized housing unit, this Supreme Court justice graduated from Princeton summa cum laude, and now is a lifetime member of the nation&#x2019;s highest court (Wikipedia contributors, 2023). This is one anecdotal example of an exceptional scholar seemingly contradicting her quote. The reason for pursuing the research question is to use data analysis to see past anecdotal examples such as this one. To examine education inequality in the United States, this study asks, &#x201C;Does the proportion of students who are eligible for free lunch in a school have a statistically significant effect on standardized test scores?&#x201D;</p><p>The underlying premise of this question is &#x201C;are we educating children with differing incomes equally?&#x201D; The context of this question deals with Virginia Public Schools, specifically in the 2015-2016 school year. The data contains information compiled by University Libraries, Virginia Tech, and Bradburn (2021), and it includes standardized test scores for each school. 
The study&#x2019;s focus is on a school&#x2019;s average performance in standardized tests compared to the school&#x2019;s proportion of low-income students. The data does not contain the income of student families, but it does contain the proportion of students who are eligible for free lunches. Free lunch eligibility can therefore serve as a proxy for family income. In the 2015-2016 school year, students who were eligible for free lunches came from households with incomes lower than 130% of the Federal income poverty guidelines, as illustrated in the following chart (Department of Agriculture, Food and Nutrition Service, 2015).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-6.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="257" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-6.png 600w, https://www.williamameyer.com/content/images/2025/07/image-6.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Data found at: https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</em></i></figcaption></figure><p>It is hypothesized that income levels will have a statistically significant impact on a school&#x2019;s average test scores. For the two-sample t-test, the null hypothesis is that &#x201C;mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch.&#x201D; The alternate hypothesis is that &#x201C;mean test scores for schools with more students eligible for free lunch are statistically different from schools with fewer students eligible for free lunch.&#x201D;</p><p><strong>Data Collection</strong></p><p>The data comes from an existing data set titled &#x201C;Characterizing Virginia Public Schools (public dataset)&#x201D; (University Libraries, Virginia Tech &amp; Bradburn, 2021). The data set includes a series of four Microsoft Excel files with multiple sheets and a text file with a brief description of the data, covering 2008 through 2017. Of the four Excel files, &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; provides relevant data on the &#x201C;2015-16&#x201D; sheet. &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; contains relevant information on the &#x201C;Data&#x201D; sheet. The following table highlights the variables that are of importance to this study.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-7.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="317" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-7.png 600w, https://www.williamameyer.com/content/images/2025/07/image-7.png 793w" sizes="(min-width: 720px) 720px"></figure><p><br>Utilizing this data set derived from publicly available information was an advantage because the work of compiling the scores from disparate counties and cities had already been completed. A disadvantage of using this data is that the different Excel files contained different ways of reporting nulls.</p><p>The issue of reporting nulls differently in different Excel files was solved by examining the data and the data descriptions.
After importing the files into Pandas data frames, the null representations were standardized, overcoming this issue. Another issue was ensuring that time periods were compared appropriately. To overcome this challenge, the most recent school year that was included in both files was used. This meant including only the &#x201C;2015-16&#x201D; sheet from the &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; file and including the columns from the &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; file with the suffix &#x201C;_2015_16&#x201D;.</p><p><br><strong>Data Extraction and Preparation</strong></p><p>The data preparation steps are outlined below:</p><p>All code is included <s>in a separate file &#x201C;02_D214_Task_2_Code.ipynb&#x201D; </s> <a href="https://www.williamameyer.com/data-analytics-capstone-code-reference/" rel="noreferrer">in this post.</a></p><ol><li>Save Excel Sheets as CSV using Excel</li></ol><p>The data was received in a series of Excel files with multiple sheets. The file &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; provides relevant data on the &#x201C;2015-16&#x201D; sheet. This sheet was saved as &#x201C;scores.csv&#x201D;. &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; contains relevant information on the &#x201C;Data&#x201D; sheet. This sheet was saved as &#x201C;free_lunch_percentages.csv&#x201D;.</p><ol start="2"><li>Import data into a Pandas data frame</li></ol><p>The Pandas &#x201C;read_csv&#x201D; method was used to import both files into a Jupyter notebook using Python version 3.</p><pre><code>#Import data from data set files.

scores = pd.read_csv (&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\scores.csv&apos;,
                 index_col=0)
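#(The absolute Windows paths here and below are machine-specific; a relative
#path such as &apos;Data/scores.csv&apos; would make the notebook more portable.)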

free_lunch_percentages = pd.read_csv \
(&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\free_lunch_percentages.csv&apos;, index_col=0)</code></pre><p></p><ol start="3"><li>Convert null values to standard &#x2018;nan&#x2019;</li></ol><p>&#x201C;#NULL!&#x201D; and &#x201C;-&#x201C; values were changed to &#x201C;nan&#x201D;.</p><pre><code>#Converting all null value representations to &apos;nan&apos;
#&quot;free_lunch_percentages&quot; has nulls reported as &apos;#NULL!&apos;.  Changing &apos;#NULL!&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
free_lunch_percentages.replace(&quot;#NULL!&quot;, np.nan, inplace=True)
#&quot;scores&quot; has nulls reported as &apos;-&apos;.  Changing &apos;-&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
scores.replace(&quot;-&quot;, np.nan, inplace=True)</code></pre><ol start="4"><li>Create new data frames with only relevant information, then join them on the school ID.</li></ol><p>New data frames were created with only the variables that will be used in the study. The &#x201C;.join()&#x201D; method was used to match the data on the school ID.</p><pre><code>#Only include relevant variables
dependent = scores[[&quot;Division Name&quot;,&quot;School Name&quot;,&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;,&quot;History_2015_16&quot;,&quot;Science_2015_16&quot;]]
independent = free_lunch_percentages[[&quot;free_per_1516&quot;]]</code></pre><pre><code>#Join data on both dataframes using join() method (Bobbit, 2021).
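#Note: join() aligns rows on the shared index (the school ID); the default
#left join keeps every school in &quot;dependent&quot; and leaves NaN in
#&apos;free_per_1516&apos; where no lunch record matches.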
df = dependent.join(independent)</code></pre><ol start="5"><li>Report on nulls, and treat (null removal was used), and repeat the report to confirm the removal.</li></ol><p>A function was created to report the numbers and percentages of null information on the data frame. After examining the data it was decided to remove null values with the &#x201C;dropna()&#x201D; method.</p><pre><code>#Define function to count nulls and output counts/percentage
def nullcounter(dataframename):
    print(&quot;\n Your null count for \&quot;{dname}\&quot; is: \n&quot;.format(dname=dataframename))
    #Look the data frame up by name so the report can label itself;
    #this avoids building and exec()-ing a code string.
    frame = globals()[dataframename]
    nullreport = pd.DataFrame(frame.isna().sum(), columns=[&quot;Null Count&quot;])
    nullreport[&apos;Null %&apos;] = frame.isna().mean() * 100
    print(nullreport.loc[nullreport[&apos;Null Count&apos;] &gt; 0])

    
#Output basic statistics on combined data frame.
print(df.describe())

#Output null count on combined data frame.
nullcounter(&quot;df&quot;)

#Output null visualization on combined data frame.
msno.matrix(df)</code></pre><pre><code>#Drop rows that contain nulls
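#dropna() removes any row with at least one missing value; with null rates of
#roughly 3-7% per column, this keeps most schools while avoiding imputation.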
df = df.dropna()</code></pre><ol start="6"><li>Check for duplicates. None found.</li></ol><p>The &#x201C;duplicated()&#x201D; method was used to examine if there was any duplicate data in the data set. There were no duplicates found.</p><pre><code>#Cell 4
#Check for duplicates on all rows

#Check for duplicates
duplicates = df.duplicated()
print(&quot;Duplicate data on all rows combined?&quot;)
print(duplicates.unique())
print(&apos;\n&apos;)

#Check for duplicates on index using duplicated() method, code snippet from Stack Overflow (Matthew, 2013).
df[df.index.duplicated(keep=False)]

print(&quot;Data Types&quot;)
df.dtypes</code></pre><ol start="7"><li>Convert variables to appropriate types, remove the &#x201C;%&#x201D; symbol, and drop any data that is out of bounds.</li></ol><p>The test scores were converted to integers. The percentage of students on free lunch was converted to a float after removing the &quot;%&quot; symbol. Percentage data that was negative or above 1 was deemed out of bounds, and removed using the &#x201C;drop()&#x201D; method.</p><pre><code>#Convert scores to numeric then integers
df[&apos;English_2015_16&apos;] = pd.to_numeric(df[&apos;English_2015_16&apos;]).astype(int)
df[&apos;Mathematics_2015_16&apos;] = pd.to_numeric(df[&apos;Mathematics_2015_16&apos;]).astype(int)
df[&apos;History_2015_16&apos;] = pd.to_numeric(df[&apos;History_2015_16&apos;]).astype(int)
df[&apos;Science_2015_16&apos;] = pd.to_numeric(df[&apos;Science_2015_16&apos;]).astype(int)


#Remove percentage sign and convert to decimal notation using rstrip and astype.  Code snippet from Stack Overflow (Bloom, 2014).
df[&apos;free_per_1516&apos;] = df[&apos;free_per_1516&apos;].str.rstrip(&apos;%&apos;).astype(&apos;float&apos;) / 100.0</code></pre><ol start="8"><li>Create aggregate variable &#x201C;total_score&#x201D;.</li></ol><p>The scores for each test were added together to create the variable &#x201C;total_score&#x201D; using the &#x201C;sum()&#x201D; method.</p><pre><code>#Create a sum of all test scores as &apos;total_score&apos;.
df[&apos;total_score&apos;] = df[[&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;, &quot;History_2015_16&quot;, &quot;Science_2015_16&quot;]].sum(axis=1)
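#axis=1 sums across the four subject columns within each row (school).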
df.head()</code></pre><p></p><ol start="9"><li>Cap outliers in &quot;total_score&quot;.</li></ol><p>Outliers in &#x201C;total_score&#x201D; were capped to remain within 1.5 times the interquartile range.</p><pre><code>#Capping the data in the skewed distribution &quot;total_score&quot; with 1.5 iqr
#Article mentions treating outliers differently depending on the distribution (Goyal, 2022).

skewed_int_outliers =[&quot;total_score&quot;]

#Skewed Distribution Outlier treatment with 1.5iqr
for i in skewed_int_outliers:
    percentile25 = df[i].quantile(0.25)
    percentile75 = df[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
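    #These are Tukey&apos;s fences: values beyond 1.5*IQR outside the quartiles
    #are treated as outliers and pulled back to the nearest fence.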
    #capping the outliers
    df[i] = np.where(
        df[i]&gt;upper_limit,
        math.floor(upper_limit),
        np.where(
            df[i]&lt;lower_limit,
            math.ceil(lower_limit),
            df[i]
        )
    )</code></pre><ol start="10"><li>Calculate the median of &#x201C;free_per_1516&#x201D; and put schools into two groups based on the median.</li></ol><p>The median was used to create two groups of schools. Schools with a percentage of students eligible to receive free lunch greater than or equal to the median were included in group &#x201C;1&#x201D;. The remaining schools were placed in group &#x201C;0&#x201D;. Two data frames were created based on these different populations.</p><pre><code>#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><pre><code>#Create distinct groups of student population based on proportion of students who are eligible for free lunch.
#Values below and above can be replaced based on condition (Komali, 2021).

#Split data at 0.372850 (the median of the cleaned data set)
df.loc[df[&apos;free_per_1516&apos;] &gt;= free_lunch_percent_median[0], &apos;groups&apos;] = 1
df.loc[df[&apos;free_per_1516&apos;] &lt; free_lunch_percent_median[0], &apos;groups&apos;] = 0 
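#Schools exactly at the median land in group 1 because of the &gt;= comparison.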

df[&apos;groups&apos;] = pd.to_numeric(df[&apos;groups&apos;]).astype(int)
df.groups.describe()

low_percentage_df = df.loc[df[&apos;groups&apos;] == 0]
high_percentage_df = df.loc[df[&apos;groups&apos;] == 1]</code></pre><p>Python was utilized because it is a good language for data science and provides the functionality to deal with data science applications (GeeksforGeeks, 2022). Utilizing a Pandas data frame allowed for all the necessary data transformations on the data. An advantage of using a Jupyter notebook for the data extraction and preparation is the division of tasks into individual cells, this allowed for larger tasks to be decomposed and tested in individual cells. A disadvantage of using Python is that visualization is not a strength, especially when compared to the R programming language (Python Vs. R: What&#x2019;s the Difference?, 2021). This limitation was overcome by utilizing additional libraries for data visualization in Python.</p><p><strong>Analysis</strong></p><p>At this point, the data has been placed into two distinct groups. One group in the data frame &#x201C;low_percentage_df&#x201D; represents schools that have lower than the median number of students who are eligible for free lunch, and the other group &#x201C;high_percentage_df&#x201D; represents schools that have higher than the median number of students who are eligible for free lunch.</p><p>To test if these populations have similar test scores, we will compare them using a two-sample t-test. A two-sample t-test &#x201C;is used when you want to compare two independent groups to see if their means are different&#x201D; (Glen, 2022). This study will compare the mean test scores of the schools with low proportions of students who are eligible for free lunch and the mean test scores of the schools with high proportions of students who are eligible for free lunch. This use case aligns with the purpose of a t-test.</p><p>The analysis steps are described below:</p><p>All code is included in a separate file &#x201C;02_D214_Task_2_Code.ipynb&#x201D;:</p><ol><li>Visually examine the data:</li></ol><pre><code>#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><p></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-8.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="279" height="406"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-9.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="297" height="625"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-10.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="405" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-10.png 600w, https://www.williamameyer.com/content/images/2025/07/image-10.png 793w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Test for Equality of variance:</li></ol><p>A Levene&#x2019;s test was used as a preliminary test to check if the two populations had the same variance. The test resulted in a p-value less than .05, meaning we have sufficient evidence to say the variance in test scores between the populations is likely different (Bobbit, 2020).</p><pre><code>#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)</code></pre><ol start="3"><li>Perform the two-sample t-test on both populations:</li></ol><p>Though the Levene&#x2019;s test showed that it is unlikely that the populations have the same variance, the t-test was still performed on the data sets. &#x201C;The t-test is robust to violations of that assumption so long as the sample size isn&#x2019;t tiny and the sample sizes aren&#x2019;t far apart&#x201D; (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.).</p><pre><code>#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
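#equal_var=False requests Welch&apos;s t-test, which does not assume equal
#variances, consistent with the Levene&apos;s test result above.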
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))</code></pre><p>Output:</p><blockquote>Ttest_indResult(statistic=25.734340600381046, pvalue=2.5360699249948364e-122)</blockquote><ol start="4"><li>Visualize population distributions with a Kernel Density Plot.</li></ol><p>A density plot was used to further analyze the distribution of test scores in each of the populations.</p><pre><code>#Run the t-test
#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))
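#Beyond the p-value, an effect size indicates how large the gap is; a minimal
#sketch (not part of the original study) using Cohen&apos;s d with a simple pooled
#standard deviation:
pooled_sd = np.sqrt((low_percentage_df.total_score.var(ddof=1)
                     + high_percentage_df.total_score.var(ddof=1)) / 2)
cohens_d = (low_percentage_df.total_score.mean()
            - high_percentage_df.total_score.mean()) / pooled_sd
print(&quot;Cohen&apos;s d:&quot;, cohens_d)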

#Plot both population&apos;s metrics
low_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
high_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
# Labels
labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
plt.legend(labels)
plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
plt.ylabel(&apos;Score Probability Density&apos;)

plt.show() </code></pre><p>Output:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-11.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="377" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-11.png 600w, https://www.williamameyer.com/content/images/2025/07/image-11.png 793w" sizes="(min-width: 720px) 720px"></figure><p>An advantage of using the two-sample t-test is its simplicity. After the data is cleaned the test can be performed with one line of code. One disadvantage is that the output is not as rigorous as other tests. A regression model would estimate how much of an increase in the proportion of students receiving free school lunches affected the test scores. The two-sample t-test just reports if the populations likely have different means.</p><p><strong>Data Summary and Implications</strong></p><p>The two-sample t-test (with a t-value of 25.734340600381046, and a p-value of 2.5360699249948364e-122) shows that there is likely a statistically significant difference between schools with high proportions of students receiving free lunch and schools with low proportions of these students. The probability density graph highlights this difference. Schools with a high proportion of free lunch eligible students are receiving lower standardized test scores.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-12.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="1171" height="522" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-12.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/image-12.png 1000w, https://www.williamameyer.com/content/images/2025/07/image-12.png 1171w" sizes="(min-width: 720px) 720px"></figure><p>A limitation of this analysis is that performing tests on the entire state of Virginia&#x2019;s schools does not consider unique situations. Some of these schools could specifically cater to the needs of those with developmental issues, possibly skewing the results. Perhaps there exist in Virginia schools whose goal is to specifically help those who are economically disadvantaged, possibly leading to schools that have both high achievement on standardized tests, and above median levels of students eligible for free lunch. These are only a few examples of unique situations that may need further consideration.</p><p>An immediate action would be to further examine cities with both types of schools. Cities that have both, schools with higher than median amounts of students eligible for free lunch, and schools with lower than median amounts of students eligible for free lunch can be compared. Some of these cities may show a statistically significant difference in test scores in these differing populations. Some of these cities may show that they can effectively teach students from lower-income families equally. If the cities that have more equality in education offer a different approach, different programs, or even have a different pedagogy, these things can be tested in cities that have larger gaps in education equality immediately.</p><p>Further examination as to why some schools have large proportions of students eligible for free lunch should also be performed. 
For example, if school districts are drawn in such a way that unfairly concentrates economically disadvantaged students in some schools while concentrating students with more economic advantages in other schools, the districts should be redrawn to attempt to even out these numbers.</p><p>Another aspect to look at when it comes to evening out the incomes in these schools should be housing. The Housing Choice Voucher is a federal program that is meant to assist &#x201C;very low-income families, the elderly, and the disabled to afford decent, safe, and sanitary housing in the private market.&#x201D; (<em>Housing Choice Voucher Program Section 8</em>, 2022). Are these cities fully taking advantage of the program? Does the application of the housing program in the city provide low-income families opportunities to live in districts that have lower levels of poverty, or does the voucher program need to make improvements in this area? Asking these questions, and course correcting when schools have uneven levels of student income is a necessary step.</p><p>Further study of the data set would be to look closely at individual counties and cities to see where income inequality exists, specifically where it is influencing student outcomes. This can be done by performing the same two-sample t-test with a city/county scope as opposed to a statewide scope. Another approach would be to attempt a regression to examine how much of an influence the proportion of students that are eligible for free lunches affects test scores.</p><p>Income inequality is just one measure of inequality in schools. It is important to continue to pursue studies about inequality to quantify the effects, understand the influences, and importantly to try to fix the issues. No form of separate and unequal education in schools should be accepted. When these discrepancies are discovered, it is important that everything is done to understand the issue and as quickly as possible correct the problem.</p><p><strong>References:</strong></p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. https://www.statology.org/levenes-test-python/</p><p>GeeksforGeeks. (2022, November 17).<em>Python for Data Science</em>.<a href="https://www.geeksforgeeks.org/python-for-data-science/"><u>https://www.geeksforgeeks.org/python-for-data-science/</u></a></p><p>Glen, S. (2022, January 12). <em>Two-Sample T-Test: When to Use it</em>. Statistics How To.<a href="https://www.statisticshowto.com/two-sample-t-test-difference-means/"><u>https://www.statisticshowto.com/two-sample-t-test-difference-means/</u></a></p><p><em>Housing Choice Voucher Program Section 8</em>. (2022, January 11). HUD.gov / U.S. Department of Housing And Urban Development (HUD). https://www.hud.gov/topics/housing_choice_voucher_program_section_8</p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics.<a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/"><u>https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</u></a></p><p><em>Python vs. R: What&#x2019;s the Difference?</em> (2021, March 23). IBM. https://www.ibm.com/cloud/blog/python-vs-r</p><p>The Aspen Institute. (2017, March 30). <em>In Conversation: Justice Sonia Sotomayor and Abigail Golden-Vazquez</em> [Video]. 
YouTube.<a href="https://www.youtube.com/watch?v=EaJuyXqGF2E&amp;feature=youtu.be"><u>https://www.youtube.com/watch?v=EaJuyXqGF2E&amp;feature=youtu.be</u></a></p><p>University Libraries, Virginia Tech, &amp; Bradburn, I. (2021, May 18).<em>Characterizing Virginia Public Schools (public dataset)</em>. Figshare.<a href="https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092"><u>https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092</u></a>CC license:<a href="https://creativecommons.org/licenses/by/3.0/us/"><u>CC BY 3.0 US</u></a></p><p>Wikipedia contributors. (2023, February 3). <em>Sonia Sotomayor</em>. Wikipedia.<a href="https://en.wikipedia.org/wiki/Sonia_Sotomayor"><u>https://en.wikipedia.org/wiki/Sonia_Sotomayor</u></a></p><p>Department of Agriculture, Food and Nutrition Service. (2015, March 26).<em>Federal Register: Vol. 80, No. 61</em>. The U.S. Department of Agriculture.<a href="https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf"><u>https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</u></a></p><p><strong>Code References:</strong></p><p>Bloom, G., [GaryMBloom]. (2014, September 4). Convert percent string to float in pandas read_csv. Stack Overflow. https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv</p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. https://www.statology.org/levenes-test-python/</p><p>Bobbit, Z. (2021, November 6). How to Merge Two Pandas DataFrames on Index. Statology. https://www.statology.org/pandas-merge-on-index/</p><p>Chattar, P. (2021, September 2). Find common elements in two lists in python. Java2Blog. https://java2blog.com/find-common-elements-in-two-lists-python/</p><p>GeeksforGeeks. (2022, October 17). How to Conduct a Two Sample T Test in Python. https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/</p><p>Goyal, C. (2022, August 25). Feature Engineering &#x2013; How to Detect and Remove Outliers (with Python Code). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/</p><p>Hadzhiev, B. (n.d.). Remove common elements from two Lists in Python | bobbyhadz. Blog - Bobby Hadz. https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists</p><p>Harikrishnan, R., [Harikrishnan R]. (2020, January 26). Dropping rows with values outside of boundaries. Stack Overflow. https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries</p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics. https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</p><p>Inada, I. (Ed.). (n.d.). Replacing values with NaNs in Pandas DataFrame. https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe</p><p>Komali. (2021, December 27). Pandas Replace Values based on Condition. Spark by {Examples}. https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/</p><p>Matthew. (2013, November 25). Pandas: Get duplicated indexes. Stack Overflow. https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes</p>]]></content:encoded></item><item><title><![CDATA[Data Analytics Capstone - Code Reference]]></title><description><![CDATA[<pre><code>#Cell 1
#Import the packages required to complete the test.

#Pandas is imported to deal with tabular data.
import pandas as pd
#Numpy is imported to deal with mathematical functions.
import numpy as np
#Missingno is imported to deal with nulls.
import missingno as msno
#Pyplot and Seaborn are imported</code></pre>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-code-reference/</link><guid isPermaLink="false">6868474c46755058329ed7cf</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 21:46:34 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<pre><code>#Cell 1
#Import the packages required to complete the test.

#Pandas is imported to deal with tabular data.
import pandas as pd
#Numpy is imported to deal with mathematical functions.
import numpy as np
#Missingno is imported to deal with nulls.
import missingno as msno
#Pyplot and Seaborn are imported to deal with visualizations.
import matplotlib.pyplot as plt
import seaborn as sns
#Math is imported as it will help with capping outliers
import math
#Stats is imported for running the t-test.
import scipy.stats as stats
#For formatting list output in a more readable format
from pprint import pprint

#Warnings is used to suppress messages about future version deprecation.
import warnings
warnings.simplefilter(action=&apos;ignore&apos;, category=FutureWarning)


#Import data from data set files.

scores = pd.read_csv (&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\scores.csv&apos;,
                 index_col=0)

free_lunch_percentages = pd.read_csv \
(&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\free_lunch_percentages.csv&apos;, index_col=0)


#Converting all null value representations to &apos;nan&apos;
#&quot;free_lunch_percentages&quot; has nulls reported as &apos;#NULL!&apos;.  Changing &apos;#NULL!&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
free_lunch_percentages.replace(&quot;#NULL!&quot;, np.nan, inplace=True)
#&quot;scores&quot; has nulls reported as &apos;-&apos;.  Changing &apos;-&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
scores.replace(&quot;-&quot;, np.nan, inplace=True)


#Only include relevant variables
dependent = scores[[&quot;Division Name&quot;,&quot;School Name&quot;,&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;,&quot;History_2015_16&quot;,&quot;Science_2015_16&quot;]]
independent = free_lunch_percentages[[&quot;free_per_1516&quot;]]    


#Output basic statistics on each dataframe.
print(&quot;\n&quot;)
print(&quot;Dependent data frame statistics&quot;)
print(dependent.describe())
print(&quot;\n&quot;)
print(&quot;Independent data frame statistics&quot;)
print(independent.describe())

</code></pre><blockquote>Dependent data frame statistics<br>          Division Name               School Name English_2015_16  \<br>count              1822                      1831            1813   <br>unique              132                      1758              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                192                         6              89   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16  <br>count                 1814            1768            1750  <br>unique                  62              49              71  <br>top                     90              94              83  <br>freq                    95              99              79  <br><br><br>Independent data frame statistics<br>       free_per_1516<br>count           1910<br>unique          1610<br>top          100.00%<br>freq              60<br></blockquote><pre><code>#Cell 2
#Join data on both dataframes using join() method (Bobbit, 2021).
df = dependent.join(independent)


#Define function to count nulls and output counts/percentage
def nullcounter(dataframename):
    print(&quot;\n Your null count for \&quot;{dname}\&quot; is: \n&quot;.format(dname=dataframename))
    #Look the data frame up by name so the report can label itself;
    #this avoids building and exec()-ing a code string.
    frame = globals()[dataframename]
    nullreport = pd.DataFrame(frame.isna().sum(), columns=[&quot;Null Count&quot;])
    nullreport[&apos;Null %&apos;] = frame.isna().mean() * 100
    print(nullreport.loc[nullreport[&apos;Null Count&apos;] &gt; 0])

    
#Output basic statistics on combined data frame.
print(df.describe())

#Output null count on combined data frame.
nullcounter(&quot;df&quot;)

#Output null visualization on combined data frame.
msno.matrix(df)

</code></pre><blockquote>          Division Name               School Name English_2015_16  \<br>count              1822                      1831            1813   <br>unique              132                      1758              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                192                         6              89   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16 free_per_1516  <br>count                 1814            1768            1750          1819  <br>unique                  62              49              71          1555  <br>top                     90              94              83       100.00%  <br>freq                    95              99              79            48  <br><br> Your null count for &quot;df&quot; is: <br><br>                     Null Count    Null %<br>Division Name                55  2.930208<br>School Name                  46  2.450719<br>English_2015_16              64  3.409696<br>Mathematics_2015_16          63  3.356420<br>History_2015_16             109  5.807139<br>Science_2015_16             127  6.766116<br>free_per_1516                58  3.090037<br>&lt;AxesSubplot:&gt;</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1476" height="715" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 1476w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 3
#Drop rows that contain nulls
df = df.dropna()

#Count nulls again
nullcounter(&quot;df&quot;)

#Describe data frame again.
print(df.describe())

#Visualize nulls again
msno.matrix(df)</code></pre><blockquote> Your null count for &quot;df&quot; is: <br><br>Empty DataFrame<br>Columns: [Null Count, Null %]<br>Index: []<br>          Division Name               School Name English_2015_16  \<br>count              1724                      1724            1724   <br>unique              132                      1662              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                190                         5              84   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16 free_per_1516  <br>count                 1724            1724            1724          1724  <br>unique                  61              49              71          1476  <br>top                     90              94              83       100.00%  <br>freq                    90              97              79            47  <br>&lt;AxesSubplot:&gt;</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1476" height="715" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 1476w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 4
#Check for duplicates on all rows

#Check for duplicates
duplicates = df.duplicated()
print(&quot;Duplicate data on all rows combined?&quot;)
print(duplicates.unique())
print(&apos;\n&apos;)

#Check for duplicates on index using duplicated() method, code snippet from Stack Overflow (Matthew, 2013).
df[df.index.duplicated(keep=False)]

print(&quot;Data Types&quot;)
df.dtypes</code></pre><blockquote>Duplicate data on all rows combined?<br>[False]<br><br><br>Data Types<br>Division Name          object<br>School Name            object<br>English_2015_16        object<br>Mathematics_2015_16    object<br>History_2015_16        object<br>Science_2015_16        object<br>free_per_1516          object<br>dtype: object</blockquote><pre><code>
#Cell 5
#Convert scores to numeric then integers
df[&apos;English_2015_16&apos;] = pd.to_numeric(df[&apos;English_2015_16&apos;]).astype(int)
df[&apos;Mathematics_2015_16&apos;] = pd.to_numeric(df[&apos;Mathematics_2015_16&apos;]).astype(int)
df[&apos;History_2015_16&apos;] = pd.to_numeric(df[&apos;History_2015_16&apos;]).astype(int)
df[&apos;Science_2015_16&apos;] = pd.to_numeric(df[&apos;Science_2015_16&apos;]).astype(int)


#Remove percentage sign and convert to decimal notation using rstrip and astype.  Code snippet from Stack Overflow (Bloom, 2014).
df[&apos;free_per_1516&apos;] = df[&apos;free_per_1516&apos;].str.rstrip(&apos;%&apos;).astype(&apos;float&apos;) / 100.0


#Remove data that is out of bounds in &apos;free_per_1516&apos;.  Code snippet from Stack Overflow (Harikrishnan, 2020).
df.drop(df[df.free_per_1516&gt;1.0].index, inplace=True)
df.drop(df[df.free_per_1516&lt;0.0].index, inplace=True)
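#After the conversion above, &apos;free_per_1516&apos; is a proportion, so values
#outside [0, 1] indicate entry errors and are dropped rather than capped.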


#Create a sum of all test scores as &apos;total_score&apos;.
df[&apos;total_score&apos;] = df[[&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;, &quot;History_2015_16&quot;, &quot;Science_2015_16&quot;]].sum(axis=1)
df.head()


df.free_per_1516.describe()</code></pre><blockquote>count    1718.000000<br>mean        0.402871<br>std         0.260002<br>min         0.002800<br>25%         0.204550<br>50%         0.372850<br>75%         0.540700<br>max         1.000000<br>Name: free_per_1516, dtype: float64</blockquote><pre><code>#Cell 6
#Data nearly cleaned, output visualizations

#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-8d49dfc4-e6e0-408f-8ee9-4d94f38f5cac.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-3d2b68bf-9b71-46f8-8272-9bea7e7dcdf1.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-73bef012-037e-41f9-a9bb-d3b02823c68c.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="400" height="278"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-67d90e43-e3c7-49ad-882d-e80b3684bce7.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7b22f32b-f5e5-4d24-b281-deff53dea240.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="353" height="279"></figure><img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Code Reference"><p>Median Percent of &apos;free_per_1516&apos; is:  0.37285 </p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1285" height="646" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 1285w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 7
#Capping the data in the skewed distribution &quot;total_score&quot; with 1.5 iqr
#Article mentions treating outliers differently depending on the distribution (Goyal, 2022).

skewed_int_outliers =[&quot;total_score&quot;]

#Skewed Distribution Outlier treatment with 1.5iqr
for i in skewed_int_outliers:
    percentile25 = df[i].quantile(0.25)
    percentile75 = df[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
    #capping the outliers
    df[i] = np.where(
        df[i]&gt;upper_limit,
        math.floor(upper_limit),
        np.where(
            df[i]&lt;lower_limit,
            math.ceil(lower_limit),
            df[i]
        )
    )


#Data cleaned, start with visualizations
#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-f5d4c6bd-2d07-4f34-ac22-07796e61416f.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7f2fa0fb-24d6-41b0-835d-9b8c273bb4bd.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-20f37c53-c872-43a7-bc25-1bd769547272.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="400" height="278"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-1925b874-3565-4a9b-82aa-7bb8870569f7.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-798406ff-1cbe-4efd-ac2c-babf46a5caec.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><blockquote>Median Percent of &apos;free_per_1516&apos; is:  0.37285 <br><br></blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1285" height="646" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 1285w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 8

#Create distinct groups of student population based on proportion of students who are eligible for free lunch.
#Values below and above can be replaced based on condition (Komali, 2021).

#Split data at 0.372850 (the median of the cleaned data set)
df.loc[df[&apos;free_per_1516&apos;] &gt;= free_lunch_percent_median[0], &apos;groups&apos;] = 1
df.loc[df[&apos;free_per_1516&apos;] &lt; free_lunch_percent_median[0], &apos;groups&apos;] = 0 

df[&apos;groups&apos;] = pd.to_numeric(df[&apos;groups&apos;]).astype(int)
df.groups.describe()

low_percentage_df = df.loc[df[&apos;groups&apos;] == 0]
high_percentage_df = df.loc[df[&apos;groups&apos;] == 1]


#Export a copy of the cleaned data
df.to_csv(&quot;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\cleaned_data.csv&quot;, index = False)
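#Note: index=False omits the school ID index from the exported CSV.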


#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)</code></pre><blockquote>LeveneResult(statistic=54.81184644296979, pvalue=2.0681326644527468e-13)</blockquote><p>Interpreting the LeveneResult: &quot;Ignore the result. With equal, or nearly equal, sample size (and moderately large samples), the assumption of equal standard deviations is not a crucial assumption. The t test work[s] pretty well even with unequal standard deviations. In other words, the t test is robust to violations of that assumption so long as the sample size isn&#x2019;t tiny and the sample sizes aren&#x2019;t far apart.&quot; (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.).</p><pre><code>#Cell 9
#Run the t-test
#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
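#equal_var=False runs Welch&apos;s t-test, dropping the equal-variance assumption
#that the Levene&apos;s test above called into question.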
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))

#Plot both population&apos;s metrics
low_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
high_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
# Labels
labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
plt.legend(labels)
plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
plt.ylabel(&apos;Score Probability Density&apos;)

plt.show() </code></pre><blockquote>Ttest_indResult(statistic=25.734340600381046, pvalue=2.5360699249948364e-122)</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="371" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png 795w" sizes="(min-width: 720px) 720px"></figure><p>Original Outcome Complete</p><pre><code>

#Cell 10
#Make a list of unique divisions
division_list_low = low_percentage_df[&apos;Division Name&apos;].unique().tolist()
division_list_high = high_percentage_df[&apos;Division Name&apos;].unique().tolist()

print(&quot;Count of Divisions with low proportions of students who are eligible for free lunch&quot;)
print(len(division_list_low))
print(&quot;Count of Divisions with high proportions of students who are eligible for free lunch&quot;)
print(len(division_list_high))


#Create a list of divisions that contain schools with high and low proportions of students eligible for free lunch.
#code snippet found online (Chattar, 2021).
divisions_high_and_low_temp = set(division_list_low).intersection(division_list_high)
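#intersection() keeps only the division names present in both lists, i.e.,
#divisions with schools on both sides of the median split.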


print(&quot;Count of Divisions that have schools with high and low proportions of students eligible for free lunch&quot;)
print(len(divisions_high_and_low_temp))

divisions_high_and_low = []
for i in divisions_high_and_low_temp:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    if len(X) &gt; 2 and len(Y) &gt; 2:
        divisions_high_and_low.append(i)

print(&quot;Count of Divisions that have more than 2 of each: schools with high proportions of students eligible \
for free lunch, and schools with low proportions.&quot;)
print(len(divisions_high_and_low))

</code></pre><blockquote>Count of Divisions with low proportions of students who are eligible for free lunch<br>85<br>Count of Divisions with high proportions of students who are eligible for free lunch<br>117<br>Count of Divisions that have schools with high and low proportions of students eligible for free lunch<br>70<br>Count of Divisions that have more than 2 of each: schools with high proportions of students eligible for free lunch, and schools with low proportions.<br>32<br></blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png 789w" sizes="(min-width: 720px) 720px"></figure><p>Ttest_indResult(statistic=6.409940604162604, pvalue=6.681972140013619e-07) <br><br><br></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png 795w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 11
#Create a selection process with divisions that have a difference in mean test outcomes

divisions_with_low_p_values = []

#Count how many divisions have a p-value &lt; .05
for i in divisions_high_and_low:
    #Geeksforgeeks code snippet(GeeksforGeeks, 2022).
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    t,p = stats.ttest_ind(a=X, b=Y, equal_var=False)
    if p &lt; .05:
        divisions_with_low_p_values.append(i)
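#Note (illustrative): running one t-test per division inflates the
#family-wise error rate. A simple Bonferroni correction would tighten the
#threshold to .05 / len(divisions_high_and_low), e.g.:
#    if p &lt; .05 / len(divisions_high_and_low):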


print(&quot;These divisions have multiple schools with high proportions of students eligible for free lunch and multiple\
 schools with low proportions of students eligible for free lunch.&quot;)
print(&quot;They also show a statistically significant difference in mean test scores.\n&quot;)

pprint(divisions_with_low_p_values)  

#Visualize density plot on divisions with statistical differences in populations

for i in divisions_with_low_p_values:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    #Plot both distributions, then print the t-test result after the figure
    X.plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
    Y.plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
    # Labels
    labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
    plt.title(i)
    plt.legend(labels)
    plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
    plt.ylabel(&apos;Score Probability Density&apos;)
    plt.show()
    #Code snippet for t-test derived from website (GeeksforGeeks, 2022).
    print(stats.ttest_ind(a=X, b=Y, equal_var=False), &quot;\n\n&quot;)</code></pre><blockquote>These divisions have multiple schools with high proportions of students eligible for free lunch and multiple schools with low proportions of students eligible for free lunch.<br>They also show a statistically significant difference in mean test scores.<br><br>[&apos;Arlington County &apos;,<br> &apos;Hampton City &apos;,<br> &apos;Chesapeake City &apos;,<br> &apos;Albemarle County &apos;,<br> &apos;Newport News City &apos;,<br> &apos;Alexandria City &apos;,<br> &apos;Montgomery County &apos;,<br> &apos;Spotsylvania County &apos;,<br> &apos;Henrico County &apos;,<br> &apos;Suffolk City &apos;,<br> &apos;Chesterfield County &apos;,<br> &apos;Loudoun County &apos;,<br> &apos;Norfolk City &apos;,<br> &apos;Virginia Beach City &apos;,<br> &apos;Prince William County &apos;,<br> &apos;Williamsburg-James City County &apos;,<br> &apos;Fairfax County &apos;]</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=8.461103427033763, pvalue=4.2564535128111695e-10) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=3.7223282464849152, pvalue=0.0022359663924919103) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.806338810130206, pvalue=0.0014819298232501942) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.5438908135728764, 
pvalue=0.027946988309494193) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png 795w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.625164815841848, pvalue=0.023337646472250697) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=3.373751982093548, pvalue=0.0024706357289852967) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png 795w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=8.196062297626334, pvalue=1.3567694425275851e-11) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="792" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png 792w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4614922901402174, pvalue=0.029589168699507278) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.370459142644502, pvalue=0.00011089458080339332) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="791" 
height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png 791w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.0471544101064145, pvalue=0.00411407734704421) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=7.659633097156011, pvalue=6.205830981719048e-08) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=6.415834167155425, pvalue=5.255180694823957e-07) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=6.162598621901206, pvalue=1.0118260722466614e-07)</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="783" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png 783w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.5713849399256987, pvalue=0.02332371134812441) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png 795w" sizes="(min-width: 720px) 
720px"></figure><blockquote>Ttest_indResult(statistic=9.026177677291102, pvalue=3.656434012894371e-14) </blockquote><pre><code>#Cell 12
#Generate list of remaining divisions that didn&apos;t have low p-values, but had both types of schools.
#Subtract &quot;divisions_with_low_p_values&quot; from &quot;divisions_high_and_low&quot; using symmetric_difference() method (Hadzhiev, n.d.).
other_divisions = list(set(divisions_high_and_low).symmetric_difference(divisions_with_low_p_values))
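#Because divisions_with_low_p_values was built from divisions_high_and_low,
#symmetric_difference() here is equivalent to the plain set difference
#set(divisions_high_and_low) - set(divisions_with_low_p_values).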


#Visualize density plot on divisions that did not show a statistically significant difference

for i in other_divisions:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    #Plot both distributions, then print the t-test result after the figure
    X.plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
    Y.plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
    # Labels
    labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
    plt.title(i)
    plt.legend(labels)
    plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
    plt.ylabel(&apos;Score Probability Density&apos;)
    plt.show() 
    #Code snippet for t-test derived from website (GeeksforGeeks, 2022).
    print(stats.ttest_ind(a=X, b=Y, equal_var=False), &quot;\n\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4205691334017856, pvalue=0.08475385838978879) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.2570834615381291, pvalue=0.8082157582528549) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.6151089286608511, pvalue=0.5499385169889386) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.43228762054052156, pvalue=0.694690937976563) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.006819058031850852, pvalue=0.9946954512431896) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png" class="kg-image" alt="Data Analytics Capstone - Code 
Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.42914107542287455, pvalue=0.6827212468291588) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.496028353478178, pvalue=0.23104670043725686) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.9439871944160111, pvalue=0.35882922068937606) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="783" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png 783w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4589999935373332, pvalue=0.05995075594533983) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.27940572225992494, pvalue=0.794558661937718) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png 789w" 
sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.5999889539606617, pvalue=0.5672948520140612) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.5857229902989585, pvalue=0.577275098608178) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.2074039981491638, pvalue=0.3369371944307884) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.3789234479354537, pvalue=0.2061614549792278) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-1.0811022683078786, pvalue=0.3315247080127098) </blockquote><p></p><p></p><p>References:</p><p>Bloom, G., [GaryMBloom]. (2014, September 4). Convert percent string to float in pandas read_csv. Stack Overflow. <a href="https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv">https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv</a></p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. <a href="https://www.statology.org/levenes-test-python/">https://www.statology.org/levenes-test-python/</a></p><p>Bobbit, Z. (2021, November 6). How to Merge Two Pandas DataFrames on Index. Statology. <a href="https://www.statology.org/pandas-merge-on-index/">https://www.statology.org/pandas-merge-on-index/</a></p><p>Chattar, P. (2021, September 2). 
Find common elements in two lists in python. Java2Blog. <a href="https://java2blog.com/find-common-elements-in-two-lists-python/">https://java2blog.com/find-common-elements-in-two-lists-python/</a></p><p>GeeksforGeeks. (2022, October 17). How to Conduct a Two Sample T Test in Python. <a href="https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/">https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/</a></p><p>Goyal, C. (2022, August 25). Feature Engineering &#x2013; How to Detect and Remove Outliers (with Python Code). Analytics Vidhya. <a href="https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/">https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/</a></p><p>Hadzhiev, B. (n.d.). Remove common elements from two Lists in Python | bobbyhadz. Blog - Bobby Hadz. <a href="https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists">https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists</a></p><p>Harikrishnan, R., [Harikrishnan R]. (2020, January 26). Dropping rows with values outside of boundaries. Stack Overflow. <a href="https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries">https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries</a></p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics. <a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/">https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</a></p><p>Inada, I. (Ed.). (n.d.). Replacing values with NaNs in Pandas DataFrame. <a href="https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe">https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe</a></p><p>Komali. (2021, December 27). Pandas Replace Values based on Condition. Spark by {Examples}. <a href="https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/">https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/</a></p><p>Matthew. (2013, November 25). Pandas: Get duplicated indexes. Stack Overflow. <a href="https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes">https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes</a></p>]]></content:encoded></item><item><title><![CDATA[Data Analytics Capstone - Summary]]></title><description><![CDATA[<p><strong>Statement of the problem and the hypothesis&#xA0;</strong></p><p>Measuring inequality in public schools is not a trivial undertaking. Equality can be simply defined in theory, but in practice, there are many factors to consider. 
A reasonable goal is to always work to provide equal education to students of all races,</p>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-summary/</link><guid isPermaLink="false">68683f5246755058329ed798</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 21:09:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Summary"><p><strong>Statement of the problem and the hypothesis&#xA0;</strong></p><p>Measuring inequality in public schools is not a trivial undertaking. Equality can be simply defined in theory, but in practice, there are many factors to consider. A reasonable goal is to always work to provide equal education to students of all races, genders, sexual orientations, financial backgrounds, and any other societal stratifications where one may find examples of inequality. This study focuses on financial backgrounds. There also exists a multitude of measures to assess equality. One could measure the severity of punishments for one group compared to another, how teachers interact with different groups, the implicit bias found in history lessons for different groups, student educational outcomes of different groups, and numerous other measures for equality. This study will focus on standardized test performance. The research question is &#x201C;Does the proportion of students who are eligible for free lunch in a school have a statistically significant effect on standardized test scores?&#x201D;. Free lunch eligibility is used as a measure of household income. The population of students that are eligible for free lunch represents households that are at the federal poverty level multiplied by 1.3 or lower (Department of Agriculture, Food and Nutrition Service, 2015). The hypothesis is that the means of these two groups&#x2019; standardized test scores have a statistically significant difference.</p><p>This hypothesis will be tested using a two-sample t-test. Regarding this two-sample t-test, we have a null hypothesis that mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch with the alternate hypothesis being that mean test scores for schools with more students eligible for free lunch are statistically different from schools with fewer students eligible for free lunch.</p><p><strong>Summary of the data-analysis process</strong></p><p>The data comes from a data set containing information about Virginia public schools compiled by University Libraries, Virginia Tech, and Bradburn (2021). The data included the variables of interest, shown in the table below, among many other variables. 
The most recent year that contained both test scores and free lunch information was the 2015-2016 school year, so this time frame was used for the analysis.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="317" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image.png 600w, https://www.williamameyer.com/content/images/2025/07/image.png 793w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This extracted data was the focus of this study. Of note is free_per_1516, a field calculated from the data to act as the independent variable.</span></figcaption></figure><p>Before beginning the analysis, the data was extracted and prepared. The data of interest was put into a Pandas data frame, null values were standardized, data was matched where school IDs were equal, nulls were removed, variables were converted to the proper data types, the test scores were added together into an aggregate field &#x201C;total_score&#x201D;, and outliers in &#x201C;total_score&#x201D; were capped at 1.5 times the interquartile range.</p><p>The &#x201C;free_per_1516&#x201D; variable was analyzed to find its median. Schools at or above the median value of 0.37285 were placed into one data frame as group &#x201C;1&#x201D;; the remaining schools were placed in another data frame as group &#x201C;0&#x201D;.</p><p>With two populations, each with their respective standardized test scores available, we can start looking into whether these groups have statistically different mean test scores. Before beginning the test, Levene&#x2019;s test was used to check for equal variance. The result is shown below:</p><pre><code>#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)
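#A p-value this small argues against equal variances, so the subsequent
#two-sample t-test is run as Welch&apos;s t-test (equal_var=False).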
</code></pre><blockquote>LeveneResult(statistic=54.81184644296979, pvalue=2.0681326644527468e-13)</blockquote><p>With this test, we can conclude that each group likely has a different variance in its respective test scores. A t-test operates with the assumption that both groups have equal variances, yet this is not a critical assumption for a t-test, so the comparison using a two-sample t-test will still be carried out (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.). The &#x201C;scipy.stats&#x201D; library was used to carry out the t-test and compare the two groups.</p><p><strong>Outline of the findings</strong></p><p>The output of the t-test and a visualization of the probability density of test scores are shown below:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-1.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="395" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-1.png 600w, https://www.williamameyer.com/content/images/2025/07/image-1.png 793w" sizes="(min-width: 720px) 720px"></figure><p>The very low p-value indicates that we can reject the null hypothesis that mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch. The probability density plot highlights the performance gap: schools with a low proportion of free lunch eligible students perform better on standardized tests.</p><p><strong>Explanation of the limitations of the techniques and tools used</strong></p><p>A two-sample t-test, by definition, compares the means of two groups. This limits our mean comparison to two groups. In this case, above-median and below-median proportions were chosen to define the school groups, but other groupings could be created, perhaps dividing the data into quartiles, and compared using an ANOVA test.</p><p>Another limitation is that this t-test does not explore the relationship between increases in a school&#x2019;s percentage of students who are eligible for free lunch and the corresponding decrease in standardized test scores. In the data analysis phase of the study, the following heatmap and scatterplot were generated from the data. They suggest a correlation, but this is not explored by the t-test.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-2.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="643" height="371" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-2.png 600w, https://www.williamameyer.com/content/images/2025/07/image-2.png 643w"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-3.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="583" height="410"><figcaption><span style="white-space: pre-wrap;">Notice the pattern. Generally, as the percentage of the student body that receives free lunch increases, the test score decreases, with obvious and interesting outliers.</span></figcaption></figure><p><strong>Summary of proposed actions</strong></p><p>Further study of counties that have both types of schools, those with high and those with low median poverty levels, is an effective next step. 
In this subset of counties, one could determine which counties have unequal test scores between the two groupings and which have statistically similar test scores. Exploring the differences in these counties may lead to new ideas as to why this inequality exists. It may bring about ideas or policies that can be carried out in counties whose students are suffering from educational inequalities.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-4.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="441" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-4.png 600w, https://www.williamameyer.com/content/images/2025/07/image-4.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example of a school district with both above-median and below-median schools with statistically different mean test scores.</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-5.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="414" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-5.png 600w, https://www.williamameyer.com/content/images/2025/07/image-5.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example of a school district with both above-median and below-median schools with statistically similar mean test scores.</em></i></figcaption></figure><p>Further research should also be carried out in counties with high levels of inequality to see if there is a systemic housing issue or other contributing factors causing income segregation across schools. The focus of this approach would be analyzing school districts, quantifying the income levels in those districts, and trying to find effective ways to limit the extremes. The extremes here would be schools with very low diversity in student incomes compared to the county&#x2019;s income diversity levels.</p><p><strong>Expected benefits of the study</strong></p><p>This study shows that Virginia Public Schools have statistically different test scores based on the proportion of low-income students in a particular school. Knowing that there is a quantifiable problem to solve is a benefit. In 2022, Virginia public schools had a total of 1,296,817 students (<em>Public Education in Virginia</em>, n.d.). Of these students, &#x201C;Virginia has over 512,000 economically disadvantaged students in its public schools&#x201D; (<em>Weighing Support for Virginia&#x2019;s Students</em>, 2021). Increasing fairness and equality in the school system would be a benefit to all students, but certainly to those who are economically disadvantaged. Examining and applying this study at the state level can help decision-makers allocate funds appropriately in support of promoting income equality in schools.<br></p><p><strong>References</strong></p><p>Department of Agriculture, Food and Nutrition Service. (2015, March 26). <em>Federal Register: Vol. 80, No. 61</em>. The U.S. 
Department of Agriculture.<a href="https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf"><u>https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</u></a></p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics.<a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/"><u>https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</u></a></p><p><em>Public education in Virginia</em>. (n.d.). Ballotpedia. https://ballotpedia.org/Public_education_in_Virginia</p><p>University Libraries, Virginia Tech, &amp; Bradburn, I. (2021, May 18).<em>Characterizing Virginia Public Schools (public dataset)</em>. Figshare.<a href="https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092"><u>https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092</u></a>CC license:<a href="https://creativecommons.org/licenses/by/3.0/us/"><u>CC BY 3.0 US</u></a></p><p><em>Weighing Support for Virginia&#x2019;s Students</em>. (2021, April 13). The Commonwealth Institute. https://thecommonwealthinstitute.org/research/weighing-support-for-virginias-students/</p><p></p>]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning with ML Agents - Part 1/? - Unity, C#, Prompt Engineering, Git]]></title><description><![CDATA[<p>This project will be to simulate a warehouse environment with the aims of training a series of agents.  One agent in charge of navigating the warehouse effectively, another agent in charge of stacking boxes on the pallet optimally, and another agent in charge of utilizing the trained agents to manage</p>]]></description><link>https://www.williamameyer.com/reinforcement-learning-with-ml-agents-part-1-unity-c-prompt-engineering-git/</link><guid isPermaLink="false">67e1c33946755058329ed763</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 24 Mar 2025 20:59:06 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2025/03/thumbnailSmaller.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2025/03/thumbnailSmaller.png" alt="Reinforcement Learning with ML Agents - Part 1/? - Unity, C#, Prompt Engineering, Git"><p>This project will be to simulate a warehouse environment with the aims of training a series of agents.  One agent in charge of navigating the warehouse effectively, another agent in charge of stacking boxes on the pallet optimally, and another agent in charge of utilizing the trained agents to manage the location of items in the warehouse.  </p><figure class="kg-card kg-embed-card"><iframe src="https://player.vimeo.com/video/1068988824?app_id=122963" width="426" height="240" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media" title="Simulated Warehouse Environment"></iframe></figure><p>As this is a larger project for an individual with limited time, I will be breaking the project into manageable chunks.</p><p>The first part of the project was to create the simulated environment to train the navigation agent.</p><p>Full Project as I understand it now:</p>
<!--kg-card-begin: html-->
 <ul>
  <li>Preliminary Game Environment</li>
  <li>Train the Navigation Agent in the draft environment</li>
  <li>Reconfigure Environment for Box Stacking</li>
  <li>Train Stacking Agent</li>
  <li>Train the Warehouse Profiler Agent to optimally place the items in the correct slots for efficient movement</li>
</ul>


<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[Superset - Agriculture Crop Production Visualization]]></title><description><![CDATA[<p>I&apos;ll be using superset to visualize the agricultural crop production dataset that I used in my <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">previous R post</a>.  I&apos;ll make a dashboard that can compare land use efficiency between 1961 and 2018.</p><p>I will do an overview of the work done to create the dashboard,</p>]]></description><link>https://www.williamameyer.com/superset-agriculture-crop-production-visualization/</link><guid isPermaLink="false">66e0ebfb44502a60aa65af52</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Wed, 11 Sep 2024 01:03:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/superset-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/superset-1.webp" alt="Superset - Agriculture Crop Production Visualization"><p>I&apos;ll be using superset to visualize the agricultural crop production dataset that I used in my <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">previous R post</a>.  I&apos;ll make a dashboard that can compare land use efficiency between 1961 and 2018.</p><p>I will do an overview of the work done to create the dashboard, If you would prefer you can  <s>Jump to the dashboard link here</s>.  <em>**Update my self hosted superset container is no longer running.  I have included videos of the dashboard in the post instead of embedding directly to the server.</em></p><p>Superset dashboards are made of charts and layout elements.  I started by making the charts for each element.  I would need a chart for &apos;Tonnes Produced in 1961&apos;, &apos;Acres used in 1961&apos;, &apos;Tonnes Produced in 1961&apos;, and &apos;Acres used in 2018&apos;.  In addition to these charts, I want to include a visualization that would compare the individual crop&apos;s land use productivity.  Also, a large metric that shows the percentage increase of all crops selected.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/Pie-Chart-Settings-1.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="383" height="437"></figure>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097224631?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Superset Pie Chart Example"></iframe>
<!--kg-card-end: html-->
<p>The Tonnes Per Acre chart would be more involved, as it requires some custom SQL and a few filters.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/Numerical-Chart-for-increase-in-productivity-2.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="368" height="608"></figure><p>SQL for the metrics, filters, and sorting is below (full SQL is not required for these settings, as Superset takes care of much of it):</p><pre><code class="language-SQLish">--Metrics
--1961 Tonnes Per Acre
SUM(production1961) / SUM(areaharvested1961)

--2018 Tonnes Per Acre
SUM(production2018) / SUM(areaharvested2018)

--Increase in productivity
ABS(
  (SUM(production1961)/SUM(areaharvested1961)) -
  (SUM(production2018)/SUM(areaharvested2018)))
  /(SUM(production1961)/SUM(areaharvested1961))
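-- Note (illustrative): ABS() reports the magnitude of the change, so a crop
-- whose productivity fell would also show a positive &quot;increase&quot;; dropping
-- ABS() would preserve the sign.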

 /*Filters
For the filters I wanted to do two things: include only items that had values in both 1961 and 2018, and avoid division-by-zero errors. Filtering where the production and area harvested are greater than 0 accomplished both. */

--Sorting
ABS(
 (SUM(production1961)/SUM(areaharvested1961)) -
 (SUM(production2018)/SUM(areaharvested2018)))
 /(SUM(production1961)/SUM(areaharvested1961))
--Making the sort order descending is done through the User Interface.
</code></pre>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097224914?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Superset Crop Production Increase Table View"></iframe>
<!--kg-card-end: html-->
<p>This sorts out the tables and charts, the last thing I want to include on my dashboard is a large number that indicates the total efficiency increase.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/big-number-settings-1.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="377" height="422"></figure>
<p>These are all of the elements of the dashboard save for the filters.  So let&apos;s take a look at the filters we will use: by Continent, by Country, by tonnes produced, and by specific crop.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/filters.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="297" height="707"></figure><p>With the filters and the dashboard&apos;s elements complete, the dashboard is ready to explore at whatever level of detail the user needs.</p><h3 id="superset-dashboard">Superset dashboard</h3><p>I strongly recommend viewing the dashboard on a larger screen than a mobile device.  You can view the dashboard from the Superset instance by clicking <s>this link</s>.</p>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097225257?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Crop Comparison over Time Superset Dashboard"></iframe>
<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[R - Agriculture Crop Production EDA - Part 2]]></title><description><![CDATA[<p>We will be answering 5 questions on the <a href="https://data.world/agriculture/crop-production" rel="noreferrer">Agriculture Crop Production data</a>.</p><ol><li><em>Which country in Africa produced the most coffee in 2018?</em></li><li><em>Which country in Africa produced the most coffee per acre in 2018?</em></li><li><em>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</em></li></ol>]]></description><link>https://www.williamameyer.com/r-post-2/</link><guid isPermaLink="false">66d6895a44502a60aa65aea6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Tue, 03 Sep 2024 07:38:13 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280-1.jpg" alt="R - Agriculture Crop Production EDA - Part 2"><p>We will be answering 5 questions on the <a href="https://data.world/agriculture/crop-production" rel="noreferrer">Agriculture Crop Production data</a>.</p><ol><li><em>Which country in Africa produced the most coffee in 2018?</em></li><li><em>Which country in Africa produced the most coffee per acre in 2018?</em></li><li><em>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</em></li><li><strong>Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</strong></li><li><strong>Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</strong></li></ol><p><strong>*</strong>These final two questions will be answered, for the previous questions, and an explanation of the data, please review <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">the previous post</a>.</p><hr><h3 id="4-which-country-in-the-world-not-specifically-africa-saw-the-highest-increase-in-the-amount-of-coffee-grown-per-acre-since-1961"><strong>4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</strong> </h3><p>Let&apos;s jump into the solution</p><pre><code class="language-R">library(dplyr)

#Import all of the datasets
t1 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)
t2 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv&quot;)
t3 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv&quot;)
t4 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv&quot;)
t5 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv&quot;)
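#Note: rbind() below assumes all five continent files share the same
#column layout, which these exports do.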

#Combine the datasets
q4 &lt;- rbind(t1,t2,t3,t4,t5) %&gt;%
  #Keep necessary variables
  select(&apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Element&apos;,&apos;Y1961&apos;,&apos;Y2018&apos;) %&gt;%
  #Filter &quot;Item&quot; and &quot;Element&quot; on appropriate values
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Yield&apos;) %&gt;%
  #Create a new variable and apply the percent change formula
  mutate(PercentChange = ((Y2018 - Y1961) / Y1961 * 100)) %&gt;%
  #Sort with the highest percent change on top
  arrange(desc(PercentChange))
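#Note (illustrative): any Area with Y1961 of 0 or NA would produce Inf or NA
#in PercentChange; filtering those rows out first would be a safer default.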

#round for better formatting
q4$PercentChange &lt;- round(q4$PercentChange, 2)

</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="26"><col width="219"><col width="91"><col width="45"><col width="62"><col width="51"><col width="51"><col width="105"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="219" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="45" align="LEFT">Unit</td>
			<td width="62" align="LEFT">Element</td>
			<td width="51" align="LEFT">Y1961</td>
			<td width="51" align="LEFT">Y2018</td>
			<td width="105" align="LEFT">PercentChange</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Viet Nam</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="1934" sdnum="1033;">1934</td>
			<td align="RIGHT" sdval="26117" sdnum="1033;">26117</td>
			<td align="RIGHT" sdval="1250.41" sdnum="1033;">1250.41</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Thailand</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="600" sdnum="1033;">600</td>
			<td align="RIGHT" sdval="5725" sdnum="1033;">5725</td>
			<td align="RIGHT" sdval="854.17" sdnum="1033;">854.17</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="1833" sdnum="1033;">1833</td>
			<td align="RIGHT" sdval="12886" sdnum="1033;">12886</td>
			<td align="RIGHT" sdval="603" sdnum="1033;">603</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Malaysia</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="3391" sdnum="1033;">3391</td>
			<td align="RIGHT" sdval="22810" sdnum="1033;">22810</td>
			<td align="RIGHT" sdval="572.66" sdnum="1033;">572.66</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">China, mainland</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5000" sdnum="1033;">5000</td>
			<td align="RIGHT" sdval="29405" sdnum="1033;">29405</td>
			<td align="RIGHT" sdval="488.1" sdnum="1033;">488.1</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="4175" sdnum="1033;">4175</td>
			<td align="RIGHT" sdval="23345" sdnum="1033;">23345</td>
			<td align="RIGHT" sdval="459.16" sdnum="1033;">459.16</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="LEFT">Honduras</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="2767" sdnum="1033;">2767</td>
			<td align="RIGHT" sdval="11195" sdnum="1033;">11195</td>
			<td align="RIGHT" sdval="304.59" sdnum="1033;">304.59</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="LEFT">Nicaragua</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="2762" sdnum="1033;">2762</td>
			<td align="RIGHT" sdval="10672" sdnum="1033;">10672</td>
			<td align="RIGHT" sdval="286.39" sdnum="1033;">286.39</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="LEFT">Brazil</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5084" sdnum="1033;">5084</td>
			<td align="RIGHT" sdval="19060" sdnum="1033;">19060</td>
			<td align="RIGHT" sdval="274.9" sdnum="1033;">274.9</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="LEFT">Lao People&apos;s Democratic Republic</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5000" sdnum="1033;">5000</td>
			<td align="RIGHT" sdval="18611" sdnum="1033;">18611</td>
			<td align="RIGHT" sdval="272.22" sdnum="1033;">272.22</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="6379" sdnum="1033;">6379</td>
			<td align="RIGHT" sdval="20706" sdnum="1033;">20706</td>
			<td align="RIGHT" sdval="224.6" sdnum="1033;">224.6</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="12" sdnum="1033;">12</td>
			<td align="LEFT">Ghana</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5667" sdnum="1033;">5667</td>
			<td align="RIGHT" sdval="16959" sdnum="1033;">16959</td>
			<td align="RIGHT" sdval="199.26" sdnum="1033;">199.26</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="13" sdnum="1033;">13</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5833" sdnum="1033;">5833</td>
			<td align="RIGHT" sdval="16506" sdnum="1033;">16506</td>
			<td align="RIGHT" sdval="182.98" sdnum="1033;">182.98</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="14" sdnum="1033;">14</td>
			<td align="LEFT">French Polynesia</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="869" sdnum="1033;">869</td>
			<td align="RIGHT" sdval="2268" sdnum="1033;">2268</td>
			<td align="RIGHT" sdval="160.99" sdnum="1033;">160.99</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
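<p>As a quick sanity check of the top row: Viet Nam&apos;s yield rose from 1934 to 26117 hg/ha, and applying the percent change formula by hand reproduces the table value.</p><pre><code class="language-R">#Hand check of the Viet Nam row
round((26117 - 1934) / 1934 * 100, 2)  #1250.41, matching PercentChange</code></pre>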
<h3 id="5-which-continent-or-rather-major-grouping-as-presented-in-the-5-data-sets-has-the-largest-increase-in-the-amount-of-coffee-grown-per-acre-since-1961"><strong>5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</strong></h3><p>This required a bit of working to do.  The data needed to be lagged, aggregated, and aggregated some more to get the information on the continental level and &apos;subtractable&apos; and divisible at a row level.</p><pre><code class="language-R">library(dplyr)

#PART 1
#Import all of the datasets
#Before the merge each individual dataset was given a &apos;Continent&apos; column


t1 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)
t1$Continent = &quot;Africa&quot;
t2 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv&quot;)
t2$Continent = &quot;Americas&quot;
t3 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv&quot;)
t3$Continent = &quot;Asia&quot;
t4 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv&quot;)
t4$Continent = &quot;Europe&quot;
t5 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv&quot;)
t5$Continent = &quot;Oceania&quot;




#PART 2
#This part of the code is where I filter the data and use the lag function to essentially pivot it.

#Combine the datasets
q5step1 &lt;- rbind(t1,t2,t3,t4,t5) %&gt;%
  #Select just coffee, and Element &apos;Production&apos; OR &apos;Area harvested&apos; 
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Production&apos; | Element ==&apos;Area harvested&apos;) %&gt;%
  #Select a few variables for now, more will be removed later
  select(&apos;Continent&apos;, &apos;Area&apos;, &apos;Unit&apos;, &apos;Element&apos;, &apos;Y1961&apos;, &apos;Y2018&apos;) %&gt;%
  #We will lag the production of both years partitioned by the &apos;Area&apos;
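  #(within each Area, the &apos;Area harvested&apos; row comes before the &apos;Production&apos; row, so lag() copies the harvested area onto the production row)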
  group_by(Area) %&gt;%
  mutate(&apos;1961AreaHarvested&apos; = lag(Y1961, n=1, order_by=Area)) %&gt;%
  mutate(&apos;2018AreaHarvested&apos; = lag(Y2018, n=1, order_by=Area)) %&gt;%
  #Renaming Y1961, and Y2018 Columns to denote their new role as Production values
  rename( &quot;1961Production&quot; = &quot;Y1961&quot;) %&gt;%
  rename( &quot;2018Production&quot; = &quot;Y2018&quot;) %&gt;%
  #We will filter by Element == &apos;Production&apos;... this will help with the data pivot
  #It will remove the &apos;Area Harvested&apos; elements from the 1961/2018_Production variables
  filter(Element == &apos;Production&apos;) %&gt;%
  #Clean data up a bit and continue in another pipe, because this has become cumbersome
  #We will do maths and such in another pipe
  group_by(Continent) %&gt;%
  select(&apos;Continent&apos;, &apos;1961Production&apos;,&apos;1961AreaHarvested&apos;,&apos;2018Production&apos;,&apos;2018AreaHarvested&apos;)



#PART 3
#The data is further aggregated by continent, yield is calculated as production / area harvested for each year, and then the percent change formula is applied.

q5step2 &lt;- q5step1 %&gt;%
  group_by(Continent) %&gt;%
  summarize_at(vars(&apos;1961Production&apos;,&apos;1961AreaHarvested&apos;,&apos;2018Production&apos;,&apos;2018AreaHarvested&apos;), list(name = sum), na.rm=T )
 
#Create the yield for each continent
q5step3 &lt;- q5step2 %&gt;%
  #Calculate yield, the original document uses hectares and hectograms... but this will be a percent change
  #the unit is not important so long as it&apos;s the same in 1961 and 2018
  mutate(Yield2018 = .$&apos;2018Production_name&apos; / .$&apos;2018AreaHarvested_name&apos;) %&gt;%
  mutate(Yield1961 = .$&apos;1961Production_name&apos; / .$&apos;1961AreaHarvested_name&apos;) %&gt;%
  select(Continent,Yield1961,Yield2018)

#PART 4 FINAL OUTPUT
#Output the final table with the yield-percent-change per continent in descending order

q5 &lt;- q5step3 %&gt;%
  #Apply the percent change calculation
  mutate(Yield_Change_Percent = (Yield2018 - Yield1961)/ Yield1961 * 100) %&gt;%
  #Nice round Numbers
  mutate(Yield_Change_Percent = round(Yield_Change_Percent,2)) %&gt;%
  #sorted by highest yield change to lowest
  arrange(desc(Yield_Change_Percent)) %&gt;%
  #Choose only the variables needed
  select(Continent,Yield_Change_Percent)</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="69"><col width="150"></colgroup>
	<tbody>
		<tr>
			<td width="69" height="17" align="LEFT">Continent</td>
			<td width="150" align="LEFT">Yield_Change_Percent</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Oceania</td>
			<td align="RIGHT" sdval="163.29" sdnum="1033;">163.29</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Americas</td>
			<td align="RIGHT" sdval="137.01" sdnum="1033;">137.01</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Asia</td>
			<td align="RIGHT" sdval="111.31" sdnum="1033;">111.31</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Africa</td>
			<td align="RIGHT" sdval="14.63" sdnum="1033;">14.63</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Europe</td>
			<td align="LEFT">NA</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
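<p>Before wrapping up, a note on the Europe row (my own reasoning, so treat it as an educated guess): the European file has essentially no coffee figures for these years, and summing nothing but NAs with na.rm=TRUE returns 0, so the yield calculation ends up dividing 0 by 0.</p><pre><code class="language-R">#Why Europe comes out as NA (a sketch of the mechanism)
sum(c(NA, NA), na.rm = TRUE)  #0 -- summing only NAs with na.rm=TRUE gives 0
0 / 0                         #NaN, which the table renders as NA</code></pre>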
<p>So with all that we have our final (very small) output.  </p><p>This was my first time using RStudio, or the R language in general, so I am early on the learning curve.  I will definitely revisit R at another time.</p>]]></content:encoded></item><item><title><![CDATA[R - Agriculture Crop Production EDA - Part 1]]></title><description><![CDATA[<p>I found a dataset on <a href="https://data.world/agriculture/crop-production" rel="noreferrer">data.world</a> that had a great deal of information on country crop yields.  It contained 5 data tables divided into different groupings including Africa, The Americas, Asia, Europe, and Oceania.  It also contained 3 more tables that helped explain the dataset and offer context about</p>]]></description><link>https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/</link><guid isPermaLink="false">66d66aff44502a60aa65ae1d</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Tue, 03 Sep 2024 04:00:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280.jpg" alt="R - Agriculture Crop Production EDA - Part 1"><p>I found a dataset on <a href="https://data.world/agriculture/crop-production" rel="noreferrer">data.world</a> that had a great deal of information on country crop yields.  It contained 5 data tables divided into different groupings including Africa, The Americas, Asia, Europe, and Oceania.  It also contained 3 more tables that helped explain the dataset and offer context about the data collection, measurements, etc.</p><p>Let&apos;s preview the data concerning Africa.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="76"><col width="218"><col width="74"><col width="299"><col width="98"><col width="102"><col width="53"><col width="70"></colgroup>
	<tbody>
		<tr>
			<td width="76" height="17" align="LEFT">Area Code</td>
			<td width="218" align="LEFT">Area</td>
			<td width="74" align="LEFT">Item Code</td>
			<td width="299" align="LEFT">Item</td>
			<td width="98" align="LEFT">Element Code</td>
			<td width="102" align="LEFT">Element</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="70" align="LEFT">Y1961</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5312" sdnum="1033;">5312</td>
			<td align="LEFT">Area harvested</td>
			<td align="LEFT">ha</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5419" sdnum="1033;">5419</td>
			<td align="LEFT">Yield</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5510" sdnum="1033;">5510</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5312" sdnum="1033;">5312</td>
			<td align="LEFT">Area harvested</td>
			<td align="LEFT">ha</td>
			<td align="RIGHT" sdval="3400" sdnum="1033;">3400</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5419" sdnum="1033;">5419</td>
			<td align="LEFT">Yield</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="45294" sdnum="1033;">45294</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5510" sdnum="1033;">5510</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="15400" sdnum="1033;">15400</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Each agricultural product in each country has three entries.  The &apos;Element&apos; column describes the measure.  Three elements are included: &apos;Area harvested&apos;, &apos;Yield&apos;, and &apos;Production&apos;.  &apos;Yield&apos; is derived from the &apos;Area harvested&apos; and &apos;Production&apos; values.</p>
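<p>As a quick check of that relationship, take the Algeria almond rows above: dividing production by area harvested, after converting tonnes to hectograms (1 tonne = 10,000 hg), reproduces the reported yield.</p><pre><code class="language-R">#Check the yield relationship on the Algeria almond rows above
production_hg &lt;- 6000 * 10000   #6000 tonnes of almonds, in hectograms
area_ha &lt;- 13300                #hectares harvested
round(production_hg / area_ha)  #4511 hg/ha, matching the Yield row</code></pre><p>The data for the years is ordered by column as well.</p>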
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="59"><col width="70"><col width="59"><col width="70"><col width="59"><col width="70"><col width="59"><col width="70"></colgroup>
	<tbody>
		<tr>
			<td width="59" height="17" align="LEFT">Y1961F</td>
			<td width="70" align="LEFT">Y1962</td>
			<td width="59" align="LEFT">Y1962F</td>
			<td width="70" align="LEFT">Y1963</td>
			<td width="59" align="LEFT">Y1963F</td>
			<td width="70" align="LEFT">Y1964</td>
			<td width="59" align="LEFT">Y1964F</td>
			<td width="70" align="LEFT">Y1965</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="14200" sdnum="1033;">14200</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="13800" sdnum="1033;">13800</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4507" sdnum="1033;">4507</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4493" sdnum="1033;">4493</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6400" sdnum="1033;">6400</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6200" sdnum="1033;">6200</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="RIGHT" sdval="3100" sdnum="1033;">3100</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2800" sdnum="1033;">2800</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2700" sdnum="1033;">2700</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2900" sdnum="1033;">2900</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Fc</td>
			<td align="RIGHT" sdval="45161" sdnum="1033;">45161</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="46429" sdnum="1033;">46429</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="46078" sdnum="1033;">46078</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="45348" sdnum="1033;">45348</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="RIGHT" sdval="14000" sdnum="1033;">14000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="13000" sdnum="1033;">13000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="12441" sdnum="1033;">12441</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="13151" sdnum="1033;">13151</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This continues until 2018.  Notice how each year column is followed by one with an &apos;F&apos; at the end?  These flag columns tell us how each value was collected.  The flag definitions can be found in the &apos;Flags.csv&apos; file included in the dataset.  I include it for those who are curious, but it was not used in this data exploration.  I&apos;ll use an ellipsis to denote where I snipped the table.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="38"><col width="865"></colgroup>
	<tbody>
		<tr>
			<td width="38" height="17" align="LEFT">Flag</td>
			<td width="865" align="LEFT">Flags</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="LEFT">Official data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">*</td>
			<td align="LEFT">Unofficial figure</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">A</td>
			<td align="LEFT">Aggregate, may include official, semi-official, estimated or calculated data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026; </td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="LEFT">FAO estimate</td>
		</tr>
		<tr>
			<td height="18" align="LEFT">Fb</td>
			<td align="LEFT">Data obtained as a balance</td>
		</tr>
		<tr>
			<td height="18" align="LEFT">Fc</td>
			<td align="LEFT">Calculated data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
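<p>Since the flag columns are not used in this exploration, they could be dropped up front.  Here is a minimal sketch of one way to do that with dplyr (my own aside; it assumes, as is the case in these files, that only the flag columns have names ending in &quot;F&quot;):</p><pre><code class="language-R">library(dplyr)

#Sketch: drop the YxxxxF flag columns
#(assumes only flag columns end in &quot;F&quot;; africa_table is imported in question 1 below)
africa_table_no_flags &lt;- africa_table %&gt;% select(-ends_with(&quot;F&quot;))</code></pre>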
<hr><p>TO THE ANALYSIS</p><p>We will be answering 5 questions on this data.</p>
<!--kg-card-begin: html-->
<ol>
&#xA0; <li>Which country in Africa produced the most coffee in 2018?</li>
&#xA0; <li>Which country in Africa produced the most coffee per acre in 2018?</li>
&#xA0; <li>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</li>
&#xA0; <li>Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</li>
&#xA0; <li>Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</li>
</ol>
<!--kg-card-end: html-->
<p>*This post contains solutions for 1-3; the next post details 4 and 5.</p><hr><h3 id="1-which-country-in-africa-produced-the-most-coffee-in-2018">1 Which country in Africa produced the most coffee in 2018?</h3><p>We are going to start by importing the data.  Then we distill the columns that we want to display by subsetting the dataset and including only &apos;Area&apos;, &apos;Item&apos;, &apos;Element&apos;, &apos;Unit&apos;, and the year in question, &apos;Y2018&apos;.</p><p>The idea here is simple: we are going to filter where the &apos;Item Code&apos; is equal to &apos;656&apos;, and filter on the &apos;Element&apos; column where it is equal to &apos;Production&apos;.  After this it is a matter of sorting the data on the &apos;Y2018&apos; column, as this column contains the numerical &apos;Production&apos; values.</p><pre><code class="language-R">#Import dplyr
library(dplyr)

#Import data
africa_table = read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)


q1 &lt;- africa_table %&gt;% 
  #Keep only the columns we want to display
  select(Area, Item.Code, Item, Element, Unit, Y2018) %&gt;%
  #Keep the production rows for coffee (Item Code 656)
  filter(Element == &quot;Production&quot;) %&gt;%
  filter(Item.Code == 656) %&gt;%
  #Sort with the largest 2018 production on top
  arrange(desc(Y2018))

#Output as a csv file
write.csv(q1, &quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\q1.csv&quot;)
</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="218"><col width="74"><col width="91"><col width="77"><col width="53"><col width="56"></colgroup>
	<tbody>
		<tr>
			<td width="218" height="17" align="LEFT">Area</td>
			<td width="74" align="LEFT">Item.Code</td>
			<td width="91" align="LEFT">Item</td>
			<td width="77" align="LEFT">Element</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="56" align="LEFT">Y2018</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Ethiopia</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="494574" sdnum="1033;">494574</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="284225" sdnum="1033;">284225</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Madagascar</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="60114" sdnum="1033;">60114</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">United Republic of Tanzania</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="43193" sdnum="1033;">43193</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Guinea</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="42900" sdnum="1033;">42900</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="41375" sdnum="1033;">41375</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">C&#xFFFD;te d&apos;Ivoire</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="39092" sdnum="1033;">39092</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Rwanda</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="38643" sdnum="1033;">38643</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Democratic Republic of the Congo</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="31145" sdnum="1033;">31145</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cameroon</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="30590" sdnum="1033;">30590</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Sierra Leone</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="20480" sdnum="1033;">20480</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Togo</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="19076" sdnum="1033;">19076</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Angola</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="16308" sdnum="1033;">16308</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Burundi</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="14216" sdnum="1033;">14216</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Malawi</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="11082" sdnum="1033;">11082</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
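<p>Ethiopia tops the list.  Since q1 is already sorted, the answer can also be pulled out programmatically (a small aside, not in the original notebook):</p><pre><code class="language-R">#Top coffee producer in Africa in 2018
q1$Area[1]  #&quot;Ethiopia&quot;</code></pre>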
<h3 id="2-which-country-in-africa-produced-the-most-coffee-per-acre-in-2018">2 Which country in Africa produced the most coffee per acre in 2018?</h3><p>This is a similar problem, there is no need to aggregate the data because the data is already aggregated, it is a matter of not selecting &quot;Production&quot; as the &quot;Element&quot;, but filtering by &quot;Yield&quot;.</p><pre><code class="language-R">library(dplyr)
africa_table = read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)


#Only keep Element values equal to &apos;Yield&apos;
q2 &lt;- filter(africa_table, Element == &apos;Yield&apos;) %&gt;%
  #Our focus is on coffee
  filter(Item ==&apos;Coffee, green&apos;) %&gt;%
  #Select output columns
  select(&apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Y2018&apos;) %&gt;%
  #Sort highest yield to lowest
  arrange(desc(Y2018))</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="26"><col width="218"><col width="91"><col width="45"><col width="51"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="218" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="45" align="LEFT">Unit</td>
			<td width="51" align="LEFT">Y2018</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="23345" sdnum="1033;">23345</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="20706" sdnum="1033;">20706</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Ghana</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="16959" sdnum="1033;">16959</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="16506" sdnum="1033;">16506</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="12886" sdnum="1033;">12886</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Burundi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="9658" sdnum="1033;">9658</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p></p><h3 id="3-which-country-in-africa-has-the-largest-increase-in-the-amount-of-coffee-produced-since-1961">3 Which country in Africa has the largest increase in the amount of coffee produced since 1961?</h3><p>A different type of question.  I need to compare values, and use the percent change formula.</p><pre><code class="language-R">#consider dplyr and africa_table already imported

#Select required columns
q3 &lt;- select(africa_table, &apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Element&apos;,&apos;Y1961&apos;,&apos;Y2018&apos;) %&gt;%
  #filter by coffee and production
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Production&apos;) %&gt;%
  #create a new column for percent change rounded to 2 decimal places
  mutate(PercentChange = round(((Y2018 - Y1961) / Y1961)*100,2)) %&gt;%
  #arrange in descending order using PercentChange
  arrange(desc(PercentChange))  </code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="26"><col width="218"><col width="91"><col width="53"><col width="77"><col width="56"><col width="56"><col width="105"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="218" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="77" align="LEFT">Element</td>
			<td width="56" align="LEFT">Y1961</td>
			<td width="56" align="LEFT">Y2018</td>
			<td width="105" align="LEFT">PercentChange</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="167" sdnum="1033;">167</td>
			<td align="RIGHT" sdval="11082" sdnum="1033;">11082</td>
			<td align="RIGHT" sdval="6535.93" sdnum="1033;">6535.93</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="5103" sdnum="1033;">5103</td>
			<td align="RIGHT" sdval="20480" sdnum="1033;">20480</td>
			<td align="RIGHT" sdval="301.33" sdnum="1033;">301.33</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="10500" sdnum="1033;">10500</td>
			<td align="RIGHT" sdval="38643" sdnum="1033;">38643</td>
			<td align="RIGHT" sdval="268.03" sdnum="1033;">268.03</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Congo</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="900" sdnum="1033;">900</td>
			<td align="RIGHT" sdval="3049" sdnum="1033;">3049</td>
			<td align="RIGHT" sdval="238.78" sdnum="1033;">238.78</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">Uganda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="94100" sdnum="1033;">94100</td>
			<td align="RIGHT" sdval="284225" sdnum="1033;">284225</td>
			<td align="RIGHT" sdval="202.05" sdnum="1033;">202.05</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Guinea</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="15000" sdnum="1033;">15000</td>
			<td align="RIGHT" sdval="42900" sdnum="1033;">42900</td>
			<td align="RIGHT" sdval="186" sdnum="1033;">186</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="LEFT">Togo</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="10300" sdnum="1033;">10300</td>
			<td align="RIGHT" sdval="19076" sdnum="1033;">19076</td>
			<td align="RIGHT" sdval="85.2" sdnum="1033;">85.2</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="LEFT">Kenya</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="28100" sdnum="1033;">28100</td>
			<td align="RIGHT" sdval="41375" sdnum="1033;">41375</td>
			<td align="RIGHT" sdval="47.24" sdnum="1033;">47.24</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="LEFT">Comoros</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="100" sdnum="1033;">100</td>
			<td align="RIGHT" sdval="140" sdnum="1033;">140</td>
			<td align="RIGHT" sdval="40" sdnum="1033;">40</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="LEFT">United Republic of Tanzania</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="33000" sdnum="1033;">33000</td>
			<td align="RIGHT" sdval="43193" sdnum="1033;">43193</td>
			<td align="RIGHT" sdval="30.89" sdnum="1033;">30.89</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="LEFT">Central African Republic</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="8200" sdnum="1033;">8200</td>
			<td align="RIGHT" sdval="9391" sdnum="1033;">9391</td>
			<td align="RIGHT" sdval="14.52" sdnum="1033;">14.52</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="12" sdnum="1033;">12</td>
			<td align="LEFT">Madagascar</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="54000" sdnum="1033;">54000</td>
			<td align="RIGHT" sdval="60114" sdnum="1033;">60114</td>
			<td align="RIGHT" sdval="11.32" sdnum="1033;">11.32</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="13" sdnum="1033;">13</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="1100" sdnum="1033;">1100</td>
			<td align="RIGHT" sdval="1161" sdnum="1033;">1161</td>
			<td align="RIGHT" sdval="5.55" sdnum="1033;">5.55</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="14" sdnum="1033;">14</td>
			<td align="LEFT">Burundi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="14000" sdnum="1033;">14000</td>
			<td align="RIGHT" sdval="14216" sdnum="1033;">14216</td>
			<td align="RIGHT" sdval="1.54" sdnum="1033;">1.54</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="15" sdnum="1033;">15</td>
			<td align="LEFT">Mozambique</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="1000" sdnum="1033;">1000</td>
			<td align="RIGHT" sdval="825" sdnum="1033;">825</td>
			<td align="RIGHT" sdval="-17.5" sdnum="1033;">-17.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="16" sdnum="1033;">16</td>
			<td align="LEFT">Cameroon</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="44700" sdnum="1033;">44700</td>
			<td align="RIGHT" sdval="30590" sdnum="1033;">30590</td>
			<td align="RIGHT" sdval="-31.57" sdnum="1033;">-31.57</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
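<p>As a quick sanity check of the top row: Malawi&apos;s production rose from 167 to 11082 tonnes, and applying the percent change formula by hand reproduces the table value.</p><pre><code class="language-R">#Hand check of the Malawi row
round((11082 - 167) / 167 * 100, 2)  #6535.93, matching PercentChange</code></pre>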
<p>We will continue with questions 4 and 5 in the next post.</p>]]></content:encoded></item><item><title><![CDATA[Power Bi - TFL Bus Safety Viz]]></title><description><![CDATA[<!--kg-card-begin: html-->
<div style="padding:56.25% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1004270373?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Power Bi Visualization on The TFL Bus Incident Dataset"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<!--kg-card-end: html-->
<p>View the interactive visualization <a href="https://app.powerbi.com/groups/me/reports/ef920fa2-72d9-4e3f-8206-52896b2a841e/8feb9e9d9298587bcd8d?ctid=cfa792cf-7768-4341-8857-81754c2afa1f&amp;pbi_source=shareVisual&amp;visual=28c9f2449c4205ea20c3&amp;height=110.00&amp;width=272.50&amp;bookmarkGuid=dd5fcbb0-4878-4bb2-87d2-65d1400da97e" rel="noreferrer">here</a> (I think you may need a Power Bi account to view)</p><p></p>]]></description><link>https://www.williamameyer.com/power-bi-tfl-bus-safety-viz/</link><guid isPermaLink="false">66d0bf2f44502a60aa65ae04</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Thu, 29 Aug 2024 22:31:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/TFLZbusSafety-Viz-Screenshot-min.png" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="padding:56.25% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1004270373?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Power Bi Visualization on The TFL Bus Incident Dataset"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<!--kg-card-end: html-->
<img src="https://www.williamameyer.com/content/images/2024/08/TFLZbusSafety-Viz-Screenshot-min.png" alt="Power Bi - TFL Bus Safety Viz"><p>View the interactive visualization <a href="https://app.powerbi.com/groups/me/reports/ef920fa2-72d9-4e3f-8206-52896b2a841e/8feb9e9d9298587bcd8d?ctid=cfa792cf-7768-4341-8857-81754c2afa1f&amp;pbi_source=shareVisual&amp;visual=28c9f2449c4205ea20c3&amp;height=110.00&amp;width=272.50&amp;bookmarkGuid=dd5fcbb0-4878-4bb2-87d2-65d1400da97e" rel="noreferrer">here</a> (I think you may need a Power Bi account to view)</p><p></p>]]></content:encoded></item><item><title><![CDATA[Python EDA - London Bus Safety Part 2/2]]></title><description><![CDATA[<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare</li></ol>]]></description><link>https://www.williamameyer.com/python-eda-london-bus-safety-part-1-2/</link><guid isPermaLink="false">66cd093b44502a60aa65add6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 26 Aug 2024 23:12:05 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge-1.jpg" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare the &apos;Slip Trip Fall&apos; count with different operators.  
    Order them by operators with most &apos;Slip Trip Fall&apos;s to least.  
    Which operator has the most of these incidents?*</li>
</ol> 
<!--kg-card-end: html-->
<img src="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge-1.jpg" alt="Python EDA - London Bus Safety Part 2/2"><p>*We have already explored questions 1 through 4 in the previous post, we will continue with the final 2 tasks of this analysis.</p><p>Data preview:</p>
<!--kg-card-begin: html-->
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>Year</th>
      <th>Date Of Incident</th>
      <th>Route</th>
      <th>Operator</th>
      <th>Group Name</th>
      <th>Bus Garage</th>
      <th>Borough</th>
      <th>Injury Result Description</th>
      <th>Incident Event Type</th>
      <th>Victim Category</th>
      <th>Victims Sex</th>
      <th>Victims Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>1</td>
      <td>London General</td>
      <td>Go-Ahead</td>
      <td>Garage Not Available</td>
      <td>Southwark</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Child</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>4</td>
      <td>Metroline</td>
      <td>Metroline</td>
      <td>Garage Not Available</td>
      <td>Islington</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Unknown</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>Havering</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>None London Borough</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td></tr></tbody></table>
<!--kg-card-end: html-->
<h2 id="5-do-any-operators-have-a-higher-incident-of-hospitalizations">5. Do any operators have a higher incident of hospitalizations?</h2><p>Fairly straightforward because we have already created a field for hospitalized, let&apos;s aggregate that, and use Seaborn to plot it.</p><pre><code class="language-Python">#Aggregate the count of hospitilizations 
operator_hospitalizations = df.groupby([&apos;Operator&apos;], as_index=False).agg(hospitalized_count = (&apos;Hospitalized&apos;, &apos;sum&apos;))

#Sort values by hospitalizations descending
operator_hospitalizations = operator_hospitalizations.sort_values(by=[&apos;hospitalized_count&apos;], ascending=False)

#Display
operator_hospitalizations</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDcOUNT-2.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="357" height="188"><figcaption><span style="white-space: pre-wrap;">This continues for all operators; only the top of the output is shown above.</span></figcaption></figure><p>Then we will use matplotlib and seaborn to visualize the data.</p><pre><code class="language-Python">#Add a visualization

import seaborn as sns
import matplotlib.pyplot as plt

operator_hospitalizations_chart = sns.barplot(
    x= operator_hospitalizations[&apos;hospitalized_count&apos;], 
    y = operator_hospitalizations[&apos;Operator&apos;]
    )

operator_hospitalizations_chart.set_title(&apos;Hospitalizations Per Operator&apos;)
operator_hospitalizations_chart.set_xlabel(&apos;People Hospitalized&apos;)

plt.show()</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDViz-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="815" height="461" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/HOSPITALIZEDViz-1.png 600w, https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDViz-1.png 815w" sizes="(min-width: 720px) 720px"></figure><p>And now for the final task</p><h2 id="6-compare-the-slip-trip-fall-count-in-incident-event-type-with-different-operators-order-them-by-operators-with-most-slip-trip-falls-to-least">6. Compare the slip trip fall count (in incident event type) with different operators. Order them by operators with most slip trip falls to least.</h2><pre><code class="language-Python">#search for Slip Trip Fall in &apos;Incident Event Type&apos;

df[&apos;SlipTripFall&apos;] = df[&apos;Incident Event Type&apos;].str.contains(&apos;Slip Trip Fall&apos;)
#Confirm that the string is found (if so, the new column will contain two unique values: True and False)

operator_tripslipfall = df.groupby([&apos;Operator&apos;], as_index=False).agg(SlipTripFallCount = (&apos;SlipTripFall&apos;, &apos;sum&apos;))

operator_tripslipfall = operator_tripslipfall.sort_values(by=[&apos;SlipTripFallCount&apos;], ascending=False)

operator_tripslipfall</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/slipfallcounts.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="448" height="268"><figcaption><span style="white-space: pre-wrap;">This continues for all operators; only the top of the output is shown above.</span></figcaption></figure><p>Then we will make a visualization of the data using matplotlib and seaborn.</p><pre><code class="language-Python">import seaborn as sns
import matplotlib.pyplot as plt

#Add a visualization
operator_tripslipfall_chart = sns.barplot(
    x= operator_tripslipfall[&apos;SlipTripFallCount&apos;], 
    y = operator_tripslipfall[&apos;Operator&apos;]
    )

operator_tripslipfall_chart.set_title(&apos;Slip/Trip/Fall Count Per Operator&apos;)
operator_tripslipfall_chart.set_xlabel(&apos;People who Slipped/Tripped/Fell&apos;)

plt.show()



###ANSWER THE QUESTION
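#BOLD and END are the ANSI formatting codes defined during the imports in part 1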

print(&apos;The Operator with the most Slip/Trip/Fall incidents is:\n\n&apos; 
      + BOLD + operator_tripslipfall[&apos;Operator&apos;].iloc[0] + END)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/SlipfallViz.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="831" height="576" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/SlipfallViz.png 600w, https://www.williamameyer.com/content/images/2024/08/SlipfallViz.png 831w" sizes="(min-width: 720px) 720px"></figure><p>And that is the end of our Python Exploratory Data Analysis.</p>]]></content:encoded></item><item><title><![CDATA[Python EDA - London Bus Safety Part 1/2]]></title><description><![CDATA[<p><em>Transport for London</em> provided their data on bus safety.  This data will be explored using Python.</p><p>We will use a Jupyter notebook using python 3.  We do our library imports, import the data, and preview:</p><pre><code class="language-Python">import pandas as pd
import numpy as np

#For formatting output
BOLD = &apos;\033[1m&</code></pre>]]></description><link>https://www.williamameyer.com/python-eda-london-bus-safety/</link><guid isPermaLink="false">66ccddd344502a60aa65ad24</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 26 Aug 2024 21:27:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge.jpg" alt="Python EDA - London Bus Safety Part 1/2"><p><em>Transport for London</em> provided their data on bus safety.  This data will be explored using Python.</p><p>We will use a Jupyter notebook using python 3.  We do our library imports, import the data, and preview:</p><pre><code class="language-Python">import pandas as pd
import numpy as np

#For formatting output
BOLD = &apos;\033[1m&apos;
END = &apos;\033[0m&apos;

df = pd.read_csv(&quot;TFL Bus Safety.csv&quot;,encoding=&apos;cp1252&apos;)
df.head()</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>Year</th>
      <th>Date Of Incident</th>
      <th>Route</th>
      <th>Operator</th>
      <th>Group Name</th>
      <th>Bus Garage</th>
      <th>Borough</th>
      <th>Injury Result Description</th>
      <th>Incident Event Type</th>
      <th>Victim Category</th>
      <th>Victims Sex</th>
      <th>Victims Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>1</td>
      <td>London General</td>
      <td>Go-Ahead</td>
      <td>Garage Not Available</td>
      <td>Southwark</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Child</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>4</td>
      <td>Metroline</td>
      <td>Metroline</td>
      <td>Garage Not Available</td>
      <td>Islington</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Unknown</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>Havering</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>None London Borough</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td></tr></tbody></table>
<!--kg-card-end: html-->
<p>We will be doing 6 tasks to explore this data:</p>
<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare the &apos;Slip Trip Fall&apos; count with different operators.  
    Order them by operators with most &apos;Slip Trip Fall&apos;s to least.  
    Which operator has the most of these incidents?*</li>
</ol> 
<!--kg-card-end: html-->
<p><em>*Items 5 and 6 will be finished in the next post.</em></p><h2 id="1a-how-many-unique-fields-are-there-for-each-variable">1a. How many unique fields are there for each variable?</h2><p>This is very straightforward as pandas has the nunique() method.  </p><pre><code class="language-Python">df.nunique()</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Year                           4
Date Of Incident              45
Route                        612
Operator                      25
Group Name                    14
Bus Garage                    84
Borough                       35
Injury Result Description      4
Incident Event Type           10
Victim Category               17
Victims Sex                    3
Victims Age                    5
dtype: int64</code></pre><h2 id="1b-london-does-not-have-35-boroughs-i-know-this-because-i-googled-it">1b. London does not have 35 Boroughs, I know this because I googled it...</h2><p>so I put a list of London Boroughs in a dataframe and compared this to the &apos;list&apos; of Boroughs in this dataset.</p><pre><code class="language-Python">#copy list of boroughs
dfboroughs = pd.read_csv(&quot;londonboroughlist.csv&quot;)

#remove whitespace (the copying of boroughs caused a trailing whitespace)
dfboroughs[&apos;borough&apos;] = dfboroughs[&apos;borough&apos;].str.strip()

#convert the dataframe to a list
dfboroughs = dfboroughs[&apos;borough&apos;].to_list()

#List values that are in df[&apos;Borough&apos;] but are not in the borough list.
nonboroughs = df[&apos;Borough&apos;][~df[&apos;Borough&apos;].isin(dfboroughs)].unique()


#Show what boroughs are in the dataframe that are not in the borough list.
print(&apos;Dataframe contains these non-borough fields as boroughs:\n&apos;)

for i in nonboroughs:
    print(&quot;&apos;&quot; + i + &quot;&apos;&quot;)
</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Dataframe contains these non-borough fields as boroughs:

&apos;None London Borough&apos;
&apos;City of London&apos;
&apos;Not specified&apos;</code></pre><h2 id="1c-examine-unique-values-on-a-subset-of-values-then-on-all-values">1c. Examine unique values on a subset of values, then on all values</h2><p>With this we can understand what may be categorical.  We will also need to examine how this dataset reports nulls, so the exhaustive list of values will be helpful.</p><pre><code class="language-Python">examine_unique_list = [&apos;Injury Result Description&apos;, &apos;Incident Event Type&apos;, &apos;Victim Category&apos;,&apos;Victims Sex&apos;, &apos;Victims Age&apos;]

print(&apos;Examining a few variables unique values:\n\n&apos;)

for i in examine_unique_list:
    print(BOLD + i.upper() + &apos;:&apos; + END)
    print(*df[i].unique(), sep = &apos;\n&apos;)
    print(&apos;\n&apos;)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/Category-Output.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="631" height="460" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/Category-Output.png 600w, https://www.williamameyer.com/content/images/2024/08/Category-Output.png 631w"><figcaption><span style="white-space: pre-wrap;">... continues, but omitted for the blog</span></figcaption></figure><h2 id="2a-how-much-null-is-there-in-each-variable">2a. How much null is there in each variable?</h2><pre><code class="language-Python">&apos;&apos;&apos;Using the output above, find how null is reported in each column; we will
use a dictionary to map each column to its null representation.&apos;&apos;&apos;

column_null_dictionary = {
    &apos;Route&apos;: &apos;(blank)&apos;, 
    &apos;Bus Garage&apos;: &apos;Garage Not Available&apos;,
    &apos;Borough&apos;: &apos;Not specified&apos;, 
    &apos;Victim Category&apos;: &apos;Insufficient Data&apos;,
    &apos;Victims Sex&apos;: &apos;Unknown&apos;,
    &apos;Victims Age&apos;: &apos;Unknown&apos;
}


print(&quot;NULL COUNTS\n&quot;)

#Calculate row count of dataframe this is used for percent calculations
row_count = df.shape[0]


&apos;&apos;&apos;Iterate through the dictionary and replace values with NaN
and report nulls as we go&apos;&apos;&apos;

for x, y in column_null_dictionary.items():
    #Replace the column values with null where appropriate
    df[x].replace(y, np.nan, inplace=True)
    
    #Set null_count
    null_count = df[x].isna().sum()
    null_count_percent = round(((null_count / row_count)*100),2)
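    #(Equivalently, df[x].isna().mean() * 100 gives this percentage in one
    #step, since the mean of a boolean mask is the fraction of True values.)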
    
    #Count and report the nulls
    print(BOLD + x + &apos;:&apos; + END + &apos;\n&apos; + str(null_count) + &apos; / &apos; + str(row_count))
    
    #Report percent
    print(str(null_count_percent)  + &apos;%&apos; + &apos;\n&apos;)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/Null-count-output-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="178" height="542"></figure><p>Continuing, we will run a report on nulls across all variables:</p><pre><code class="language-Python">#Report on all variables as a &apos;sanity check&apos;
df.isna().sum()</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Year                            0
Date Of Incident                0
Route                          14
Operator                        0
Group Name                      0
Bus Garage                   8572
Borough                       553
Injury Result Description       0
Incident Event Type             0
Victim Category                 2
Victims Sex                  3602
Victims Age                  7135
dtype: int64</code></pre><h2 id="2b-visualize-nullity">2b Visualize nullity</h2><p>I will use the missingno library to output a nullity matrix.</p><pre><code class="language-Python">import missingno as msno 

#only report missing numbers on data with missing values
msno.matrix(df[list(column_null_dictionary.keys())]) </code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/nullitymatrix.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="742" height="332" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/nullitymatrix.png 600w, https://www.williamameyer.com/content/images/2024/08/nullitymatrix.png 742w" sizes="(min-width: 720px) 720px"></figure><h1 id="3-create-a-new-variable-to-describe-if-the-person-was-taken-to-hospital">3. Create a new variable to describe if the person was taken to hospital.</h1><pre><code class="language-Python">#If the Injury Result Description contains the string &apos;Hospital&apos;, create a bool T/F
df[&apos;Hospitalized&apos;] = df[&apos;Injury Result Description&apos;].str.contains(&apos;Hospital&apos;)

#Convert bool T/F into yes/no responses
df[&apos;HospitalizedYN&apos;] = df[&apos;Hospitalized&apos;].replace({
    True: &quot;Yes&quot;,
    False: &quot;No&quot;
})

&apos;&apos;&apos;A column with True/False values is kept for future aggregations.
This is with the hindsight of proofreading; originally I converted the
Yes/No column back to boolean to do the aggregations&apos;&apos;&apos;
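#(A sketch of what the boolean column enables: df.groupby(&apos;Operator&apos;)[&apos;Hospitalized&apos;].sum()
#would count hospitalizations per operator, since True values sum as 1.)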

print(&apos;View of the new variables\n&apos;)
print(df[&apos;HospitalizedYN&apos;].head())
print(&apos;\n&apos;)
print(df[&apos;Hospitalized&apos;].head())</code></pre><p>In the example above I created two new variables, one human-readable (&apos;Yes&apos;/&apos;No&apos; values) and one boolean.  We can use the boolean for aggregate functions.</p><p>OUTPUT:</p><pre><code class="language-PythonOutput">View of the new variables

0     No
1     No
2    Yes
3    Yes
4    Yes
Name: HospitalizedYN, dtype: object


0    False
1    False
2     True
3     True
4     True
Name: Hospitalized, dtype: bool
</code></pre><h1 id="4-do-any-operators-have-a-higher-incidence-of-overall-incidents">4. Do any operators have a higher incidence of overall incidents?</h1><pre><code class="language-Python">&apos;&apos;&apos;Create new Dataframe with all the operators.  Operator is needed, 
the other column is just to tranform into count column&apos;&apos;&apos;

#New DF
operator_incidents = df[[&apos;Operator&apos;]].copy()

#Create a new column in new dataframe
operator_incidents[&apos;Count&apos;] = 1


#Value check for above code (maybe thorough, maybe noob stuff IDK)
print(operator_incidents)
print(&apos;\nCheck to make sure sum of count = the # of rows above&apos;)
print(operator_incidents[&apos;Count&apos;].sum())</code></pre><p>Outputs:</p><pre><code class="language-PythonOutput">             Operator  Count
0      London General      1
1           Metroline      1
2         East London      1
3         East London      1
4           Metroline      1
...               ...    ...
23153     East London      1
23154   London United      1
23155   London United      1
23156   London United      1
23157       Metroline      1

[23158 rows x 2 columns]

Check to make sure sum of count = the # of rows above
23158
</code></pre><p>Then we use this for the aggregate function:</p><pre><code class="language-Python">#Use an aggregate function and the sum of the new &apos;Count&apos; variable

operator_incidents = operator_incidents.groupby([&apos;Operator&apos;], as_index=False).agg(incident_count = (&apos;Count&apos;, &apos;sum&apos;))
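#(Note: df[&apos;Operator&apos;].value_counts() would give the same totals in one
#step; the explicit Count column is kept to show a groupby/agg pattern.)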

operator_incidents.sort_values(by=[&apos;incident_count&apos;], ascending=False)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDcOUNT-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="357" height="188"></figure><p></p><p>We will finish with tasks 5 and 6 in the next post.</p><p>The code for this can be found <a href="https://github.com/williamAM1/Data-Analytics-Portfolio/blob/53dbe5f752674f42378c34e9511b8fbbb2365775/03PythonEDA/03%20Python%20Project%20London%20Bus%20Safety.ipynb" rel="noreferrer">here</a> on github.</p><p>Data Citation:</p><p>data.world [@vizwiz]. (2018). <em>2018/W51: London Bus Safety Performance</em> [Dataset]. https://data.world/makeovermonday/2018w51</p>]]></content:encoded></item><item><title><![CDATA[SQL/Tableau: AFDB Market Trends with Viz.  Part 2]]></title><description><![CDATA[<p>The next visualization request is to be able to view the differences in the initial value of each <em>indicator</em>, and the final value of each <em>indicator</em>.  This will be completed in SQL.</p><p>Requirements:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Extract <em>month</em> and <em>year</em> from <em>date</em></li>
  <li>Average values based on <em>indicator</em>, <em>year</em>, and <em>month</em></li>
  <li>Filter by the</li></ul>]]></description><link>https://www.williamameyer.com/tableau-afdb-market-trends-with-visualization-part-2/</link><guid isPermaLink="false">66c7b89d44502a60aa65ace3</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Thu, 22 Aug 2024 22:26:40 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/durban-3840075_640-1-.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/durban-3840075_640-1-.jpg" alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2"><p>The next visualization request is to be able to view the differences in the initial value of each <em>indicator</em>, and the final value of each <em>indicator</em>.  This will be completed in SQL.</p><p>Requirements:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Extract <em>month</em> and <em>year</em> from <em>date</em></li>
  <li>Average values based on <em>indicator</em>, <em>year</em>, and <em>month</em></li>
  <li>Filter by the first <em>month</em> in the data provided</li>
  <li>Follow the above steps, except filter by the last <em>month</em></li>
  <li>Join those tables together</li>
</ul>
<!--kg-card-end: html-->
<pre><code class="language-SQL">SELECT 
	--3 final output
	-- indicator | initial date | initial average | ending date | ending average
	initial_values.indicatorname,
	make_date(CAST(initial_year AS INT), CAST(initial_month AS INT), 1) AS initial_date,
	initial_average,
	make_date(CAST(ending_year AS INT), CAST(ending_month AS INT), 1) AS ending_date,
	ending_average
FROM (
	/*1  Create a month/year format and aggregate
	Average value per indicator per month/year
	Only include first month using HAVING clause*/
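	--(Note: 2011/1 and 2015/7 are hardcoded from this dataset&apos;s known range;
	--MIN(date)/MAX(date) subqueries would generalize the filter.)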
	SELECT
		indicatorname,
		EXTRACT(&apos;Year&apos; FROM date) AS initial_year,
		EXTRACT(&apos;Month&apos; FROM date) AS initial_month,
		AVG(value) AS initial_average
	FROM
		afdbmarkettrends2015
	GROUP BY
		indicatorname, initial_year, initial_month
	HAVING
		EXTRACT(&apos;Year&apos; FROM date) = 2011
		AND
		EXTRACT(&apos;Month&apos; FROM date) = 1
	ORDER BY
		indicatorname, initial_year, initial_month)
	AS initial_values
JOIN(
	/*2  Same as first except 
	Only include last month using HAVING clause*/
	SELECT
		indicatorname,
		EXTRACT(&apos;Year&apos; FROM date) AS ending_year,
		EXTRACT(&apos;Month&apos; FROM date) AS ending_month,
		AVG(value) AS ending_average
	FROM
		afdbmarkettrends2015
	GROUP BY
		indicatorname, ending_year, ending_month
	HAVING
		EXTRACT(&apos;Year&apos; FROM date) = 2015
		AND
		EXTRACT(&apos;Month&apos; FROM date) = 7
	ORDER BY
		indicatorname, ending_year, ending_month)
	AS ending_values
ON
	initial_values.indicatorname = ending_values.indicatorname</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="268"><col width="81"><col width="126"><col width="85"><col width="126"></colgroup>
	<tbody>
		<tr>
			<td width="268" height="17" align="LEFT">indicatorname</td>
			<td width="81" align="LEFT">initital_date</td>
			<td width="126" align="LEFT">initial_average</td>
			<td width="85" align="LEFT">ending_date</td>
			<td width="126" align="LEFT">ending_average</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="1401.4" sdnum="1033;">1401.4</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="954.9" sdnum="1033;">954.9</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">CFA zone Countries CFA Franc</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="492.8661" sdnum="1033;">492.8661</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="596.559305" sdnum="1033;">596.559305</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cocoa (USD/tonne)</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="3164.863" sdnum="1033;">3164.863</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="3332.6425" sdnum="1033;">3332.6425</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Coffee Brazilian Naturals (US cents/tonne)</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="209.263157894737" sdnum="1033;">209.2631578947</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="111.934210526316" sdnum="1033;">111.9342105263</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Then we will use this data in Tableau for the visualization.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/VizScreenshotCompressed-1.png" class="kg-image" alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2" loading="lazy" width="781" height="753" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/VizScreenshotCompressed-1.png 600w, https://www.williamameyer.com/content/images/2024/08/VizScreenshotCompressed-1.png 781w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This image shows the desktop view of the dashboard. Below is the embedded dashboard. Depending on your browser the visualization may change.</span></figcaption></figure><p>The following dashboard allows the user to select which indicator(s) they would like to view, and shows the initial and final averages in an easy-to-understand chart.</p>
<!--kg-card-begin: html-->
<div class="tableauPlaceholder" id="viz1724365547195" style="position: relative"><noscript><a href="#"><img alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2" src="https://public.tableau.com/static/images/Ma/MarketTrendsDashboard/Dashboard1/1_rss.png" style="border: none"></a></noscript><object class="tableauViz" style="display:none;"><param name="host_url" value="https%3A%2F%2Fpublic.tableau.com%2F"> <param name="embed_code_version" value="3"> <param name="site_root" value><param name="name" value="MarketTrendsDashboard/Dashboard1"><param name="tabs" value="no"><param name="toolbar" value="yes"><param name="static_image" value="https://public.tableau.com/static/images/Ma/MarketTrendsDashboard/Dashboard1/1.png"> <param name="animate_transition" value="yes"><param name="display_static_image" value="yes"><param name="display_spinner" value="yes"><param name="display_overlay" value="yes"><param name="display_count" value="yes"><param name="language" value="en-US"><param name="filter" value="publish=yes"></object></div>                <script type="text/javascript">                    var divElement = document.getElementById('viz1724365547195');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<!--kg-card-end: html-->
<p></p>]]></content:encoded></item><item><title><![CDATA[Tableau: AFDB Market Trends with Viz.  Part 1]]></title><description><![CDATA[<p>I have embedded the visualization of the AFDB Market Trend dataset below.  Tableau dashboards seem to work flawlessly on desktop, but often have trouble on mobile displays.  I will include screenshots, and the direct link to the dashboard on Tableau Public can be found <a href="https://public.tableau.com/views/TableauProblem1AFDB/Dashboard1?:language=en-US&amp;:sid=&amp;:redirect=auth&amp;:display_count=n&amp;:origin=viz_share_link" rel="noreferrer">here</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/dashboard-picture.jpg" class="kg-image" alt loading="lazy" width="480" height="513"><figcaption><span style="white-space: pre-wrap;">This is an example of</span></figcaption></figure>]]></description><link>https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/</link><guid isPermaLink="false">66c236c644502a60aa65acc6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Sun, 18 Aug 2024 18:01:46 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1692248071692-9a75dabf6690?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDY1fHxsaXRoaXVtJTIwbWluaW5nfGVufDB8fHx8MTcyNDM2Njc2NXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1692248071692-9a75dabf6690?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDY1fHxsaXRoaXVtJTIwbWluaW5nfGVufDB8fHx8MTcyNDM2Njc2NXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Tableau: AFDB Market Trends with Viz.  Part 1"><p>I have embedded the visualization of the AFDB Market Trend dataset below.  Tableau dashboards seem to work flawlessly on desktop, but often have trouble on mobile displays.  I will include screenshots, and the direct link to the dashboard on Tableau Public can be found <a href="https://public.tableau.com/views/TableauProblem1AFDB/Dashboard1?:language=en-US&amp;:sid=&amp;:redirect=auth&amp;:display_count=n&amp;:origin=viz_share_link" rel="noreferrer">here</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/dashboard-picture.jpg" class="kg-image" alt="Tableau: AFDB Market Trends with Viz.  Part 1" loading="lazy" width="480" height="513"><figcaption><span style="white-space: pre-wrap;">This is an example of the dashboard, if it looks very different (or doesn&apos;t load below this image at all) it may be better to try another browser.</span></figcaption></figure>
<!--kg-card-begin: html-->
<div class="tableauPlaceholder" id="viz1724200509330" style="position: relative"><noscript><a href="#"><img alt="Tableau: AFDB Market Trends with Viz.  Part 1" src="https://public.tableau.com/static/images/Ta/TableauProblem1AFDB/Dashboard1/1_rss.png" style="border: none"></a></noscript><object class="tableauViz" style="display:none;"><param name="host_url" value="https%3A%2F%2Fpublic.tableau.com%2F"> <param name="embed_code_version" value="3"> <param name="site_root" value><param name="name" value="TableauProblem1AFDB/Dashboard1"><param name="tabs" value="no"><param name="toolbar" value="yes"><param name="static_image" value="https://public.tableau.com/static/images/Ta/TableauProblem1AFDB/Dashboard1/1.png"> <param name="animate_transition" value="yes"><param name="display_static_image" value="yes"><param name="display_spinner" value="yes"><param name="display_overlay" value="yes"><param name="display_count" value="yes"><param name="language" value="en-US"><param name="filter" value="publish=yes"></object></div>                <script type="text/javascript">                    var divElement = document.getElementById('viz1724200509330');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/Draft-Web-optimized.jpg" class="kg-image" alt="Tableau: AFDB Market Trends with Viz.  Part 1" loading="lazy" width="640" height="480" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/Draft-Web-optimized.jpg 600w, https://www.williamameyer.com/content/images/2024/08/Draft-Web-optimized.jpg 640w"><figcaption><span style="white-space: pre-wrap;">I started my dashboard with a draft. It is a single line graph that allows the end user to choose which data and time-frame they want to examine... it was a draft that was never meant to share on the internet... so don&apos;t judge.</span></figcaption></figure><p>In order to turn this draft into a working dashboard I considered a few steps then began to create the table.</p><ol><li>I imported data and made sure the data types were correct.</li><li>I placed the &apos;date&apos; variable in the columns, and the &apos;value&apos; as the rows.</li><li>On the Graph itself I placed the &apos;Indicator Name&apos; as I wanted to be able to display each indicator&apos;s value&apos;s separately as opposed to aggregating the value for all indicators.</li><li>I created a filter for indicator name, so the user can choose which indicator(s) they would like to see.</li><li>The SUM(value) was changed to average (for future aggregations, by month or quarter, etc)</li><li>A calculated field was created to filter out commodities (</li></ol><pre><code class="language-VizQL">IF ISNULL([Unit]) THEN &apos;Main Indicators&apos;
ELSE &apos;Commodity Indicators&apos;
END
</code></pre><ol start="7"><li>Make sure the indictor filter is affected by the &apos;commody indicator&apos;, else it will show all the indicators even when just &apos;commodies&apos; or just &apos;Main indicators&apos; is selected.</li></ol><p></p><p>Before:</p>]]></content:encoded></item><item><title><![CDATA[SQL EDA - AFDB Market Trends - Part 2]]></title><description><![CDATA[<p>I will continue exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of</p>]]></description><link>https://www.williamameyer.com/sql-exploratory-analysis-project-afdb-market-trends-2015-part-2/</link><guid isPermaLink="false">66c0e87544502a60aa65ac4a</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Sat, 17 Aug 2024 19:14:43 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/1000012841-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/1000012841-1.jpg" alt="SQL EDA - AFDB Market Trends - Part 2"><p>I will continue exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of data.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"><col width="114"><col width="76"><col width="79"><col width="67"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">Indicator</td>
			<td width="268" align="LEFT">IndicatorName</td>
			<td width="114" align="LEFT">Unit</td>
			<td width="76" align="LEFT">Frequency</td>
			<td width="79" align="LEFT">Date</td>
			<td width="67" align="LEFT">Value</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-02</td>
			<td align="RIGHT" sdval="7082.4" sdnum="1033;">7082.4</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1.4416" sdnum="1033;">1.4416</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
			<td align="LEFT">USD/Troy Ounce</td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1781.5" sdnum="1033;">1781.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="4495.41" sdnum="1033;">4495.41</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>From this data I became curious to explore 7 different things:</p><ol><li><em>How many unique indicators are there?</em></li><li><em>Is the number of unique indicators equal to the unique indicatorname field? If not explain the discrepancy.</em></li><li><em>Remove all commodities from this table and put them in a separate table.</em></li><li><em>From the remaining indicators, are there any gaps in reporting?</em></li><li><strong>Calculate the  &apos;PercentChangeDaily&apos; of indicators.</strong></li><li><strong>Calculate the  &apos;AverageMonthly&apos; value of different indicators.</strong></li><li><strong>What were the indicators that increased the most, by percentage, OVERALL from the inception of this data?</strong></li></ol><p>I have explored the first four tasks in <a href="https://www.williamameyer.com/project-1a-afdb-market-trends-2015/" rel="noreferrer">another post</a>.  I will finish the final three tasks of this project now.</p><p>Our first task will be to create a new field for our table.  This will be a comparison to previous dates, and will be on a per <em>indicator </em>basis, meaning the data must be partitioned, sorted, and lagged.</p><ol start="5"><li> Create a new field for &quot;PercentChangeDaily&quot;</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname, 
    date, 
    value, 
    yesterday,
	ROUND(((value - yesterday) / yesterday * 100), 5) AS percentchange
FROM(
	SELECT 
		*,
		LAG(value) OVER (PARTITION BY 
						 indicatorname 
    						 ORDER BY 
    						 	indicatorname, 
    						 	date) AS yesterday
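    						 	--(indicatorname is constant within each partition,
    						 	--so ORDER BY date alone would also work here)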
	FROM 
		afdbmarkettrends2015
ORDER BY 
	indicatorname, 
	date) AS yesterdaystable</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="67"><col width="71"><col width="101"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="67" align="LEFT">value</td>
			<td width="71" align="LEFT">yesterday</td>
			<td width="101" align="LEFT">percentchange</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-04</td>
			<td align="RIGHT" sdval="1693" sdnum="1033;">1693</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-05</td>
			<td align="RIGHT" sdval="1621" sdnum="1033;">1621</td>
			<td align="RIGHT" sdval="1693" sdnum="1033;">1693</td>
			<td align="RIGHT" sdval="4.4417" sdnum="1033;">4.4417</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-06</td>
			<td align="RIGHT" sdval="1544" sdnum="1033;">1544</td>
			<td align="RIGHT" sdval="1621" sdnum="1033;">1621</td>
			<td align="RIGHT" sdval="4.98705" sdnum="1033;">4.98705</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-07</td>
			<td align="RIGHT" sdval="1519" sdnum="1033;">1519</td>
			<td align="RIGHT" sdval="1544" sdnum="1033;">1544</td>
			<td align="RIGHT" sdval="1.64582" sdnum="1033;">1.64582</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-10</td>
			<td align="RIGHT" sdval="1495" sdnum="1033;">1495</td>
			<td align="RIGHT" sdval="1519" sdnum="1033;">1519</td>
			<td align="RIGHT" sdval="1.60535" sdnum="1033;">1.60535</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-11</td>
			<td align="RIGHT" sdval="1480" sdnum="1033;">1480</td>
			<td align="RIGHT" sdval="1495" sdnum="1033;">1495</td>
			<td align="RIGHT" sdval="1.01351" sdnum="1033;">1.01351</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-12</td>
			<td align="RIGHT" sdval="1453" sdnum="1033;">1453</td>
			<td align="RIGHT" sdval="1480" sdnum="1033;">1480</td>
			<td align="RIGHT" sdval="1.85822" sdnum="1033;">1.85822</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-13</td>
			<td align="RIGHT" sdval="1438" sdnum="1033;">1438</td>
			<td align="RIGHT" sdval="1453" sdnum="1033;">1453</td>
			<td align="RIGHT" sdval="1.04312" sdnum="1033;">1.04312</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-14</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="1438" sdnum="1033;">1438</td>
			<td align="RIGHT" sdval="-0.06949" sdnum="1033;">-0.06949</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-17</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="0" sdnum="1033;">0</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-18</td>
			<td align="RIGHT" sdval="1432" sdnum="1033;">1432</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="0.48883" sdnum="1033;">0.48883</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This output creates a new column with yesterday&apos;s values, and uses that new column to calculate the percent change from the previous date on a per <em>indicatorname </em>basis.</p><p>The next task is to create an average monthly value field.  To do this I will extract the month and year from the <em>date </em>column, and I will then aggregate the values on the basis of the newly created <em>year </em>and <em>month </em>columns.</p><ol start="6"><li> Calculate the  &apos;AverageMonthly&apos; value of different indicators.</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname, 
	year, 
	month, 
	AVG(value) AS monthlyaveragevalue
FROM
	(SELECT 
		*, 
		EXTRACT(month FROM date) AS month, 
		EXTRACT(year FROM date) AS year
	FROM afdbmarkettrends2015) AS trendswithmonthyear
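--(Alternative: DATE_TRUNC(&apos;month&apos;, date) would produce a single
--month-start column in place of the separate year and month fields.)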
GROUP BY 
	indicatorname, 
	year, 
	month
ORDER BY 
	indicatorname, 
	year, 
	month</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="41"><col width="49"><col width="137"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="41" align="LEFT">year</td>
			<td width="49" align="LEFT">month</td>
			<td width="137" align="LEFT">monthlyaveragevalue</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="RIGHT" sdval="1401.4" sdnum="1033;">1401.4</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="RIGHT" sdval="1181.1" sdnum="1033;">1181.1</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="RIGHT" sdval="1492.6957" sdnum="1033;">1492.6957</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="RIGHT" sdval="1342.5556" sdnum="1033;">1342.5556</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="RIGHT" sdval="1352.4" sdnum="1033;">1352.4</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="RIGHT" sdval="1433.2273" sdnum="1033;">1433.2273</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="RIGHT" sdval="1365.5238" sdnum="1033;">1365.5238</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="RIGHT" sdval="1386.9545" sdnum="1033;">1386.9545</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="RIGHT" sdval="1840.4091" sdnum="1033;">1840.4091</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="RIGHT" sdval="2072.4762" sdnum="1033;">2072.4762</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="RIGHT" sdval="1835.3182" sdnum="1033;">1835.3182</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The final problem is not as straightforward as the other examples.  We will be finding which indicators increased the most, by percentage, overall since the inception of the data.  It has to be noted that not all indicators have the same start date.  </p><p>On a per <em>indicator </em>basis we must calculate:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Earliest date record of <em>indicator</em> as <em>beginningvalue</em></li>
  <li>Latest date record of <em>indicator</em> as <em>endingvalue</em></li>
  <li>Percent change of the new <em>beginningvalue</em> and <em>endingvalue</em> as <em>percentchange</em></li>
</ul>  
<!--kg-card-end: html-->
<p>This will be done with the lag value, filtering to the rows where the <em>beginningvalue </em>and <em>endingvalue </em>fall on each <em>indicator</em>&apos;s first and last recorded dates.</p><p>7.  What were the indicators that increased the most, by percentage, OVERALL from the inception of the data?</p><pre><code class="language-SQL">SELECT 
	indicatorname, 
	date, 
	beginningvalue, 
	endingvalue, 
	ROUND(percentchange, 4) AS percentchange
FROM
	/* 2 At this point we will have two records for each indicator
	one with the beginningvalue matched to the date, and one with
	the ending value matched to the date.  We will use lag to
	create a record for each indicator with the beginningvalue, and 
	the endingvalue, and then we use this to calculate percentage*/
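	--(Aside: FIRST_VALUE and LAST_VALUE window functions could fetch both
	--endpoints in one pass; the LAG approach below sticks to functions
	--already used in this series.)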
	
	(SELECT 
	 	indicatorname, 
	 	date, 
	 	LAG (value) OVER (PARTITION BY 
						  	(indicatorname) 
						ORDER BY 
						  	date) AS beginningvalue, 
	 	value AS endingvalue,
		(value - LAG (value) OVER (PARTITION BY 
							(indicatorname) 
						ORDER BY 
    						date)) 
	 		/ LAG (value) OVER (PARTITION BY 
							(indicatorname) 
						ORDER BY 
							date) * 100 AS percentchange 
		FROM
			--1  Create a mindate and maxdate for each indicator
	 		(SELECT *, 
			 	MIN(date) OVER (PARTITION BY 
							indicatorname) AS mindate, 
			 	MAX(date) OVER (PARTITION BY 
    						indicatorname) AS maxdate
			FROM 
			 	afdbmarkettrends2015
		ORDER BY 
			 date) 
	 	AS trendswithminandmax
        
		/*3 Filter each row using the WHERE clause,
	 	keeping only the records that fall on the beginning
	 	and ending dates for each indicator*/
	 	WHERE 
	 		date = mindate OR 
	 		date = maxdate
	ORDER BY 
	 	indicatorname, 
	 	date) 
	AS trendswithchangepercent
WHERE
	/*4 The lag leaves a null beginningvalue on each
	indicator&apos;s first record; we filter these out so
	only one row per indicator remains*/
	beginningvalue IS NOT NULL
ORDER BY 
	percentchange DESC
LIMIT 
	5
</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="234"><col width="79"><col width="100"><col width="82"><col width="101"></colgroup>
	<tbody>
		<tr>
			<td width="234" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="100" align="LEFT">beginningvalue</td>
			<td width="82" align="LEFT">endingvalue</td>
			<td width="101" align="LEFT">percentchange</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">South Africa Rand</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="6.6275" sdnum="1033;">6.6275</td>
			<td align="RIGHT" sdval="12.5681" sdnum="1033;">12.5681</td>
			<td align="RIGHT" sdval="89.6356" sdnum="1033;">89.6356</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cote d&apos;Ivoire BRVM Composite Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="159.32" sdnum="1033;">159.32</td>
			<td align="RIGHT" sdval="301.22" sdnum="1033;">301.22</td>
			<td align="RIGHT" sdval="89.066" sdnum="1033;">89.066</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda SE All Share index USE</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="1192.57" sdnum="1033;">1192.57</td>
			<td align="RIGHT" sdval="1893.63" sdnum="1033;">1893.63</td>
			<td align="RIGHT" sdval="58.7856" sdnum="1033;">58.7856</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">South Africa JALSH All Share Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="32308.11" sdnum="1033;">32308.11</td>
			<td align="RIGHT" sdval="50758.42" sdnum="1033;">50758.42</td>
			<td align="RIGHT" sdval="57.1074" sdnum="1033;">57.1074</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda Shilling</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="2310" sdnum="1033;">2310</td>
			<td align="RIGHT" sdval="3420" sdnum="1033;">3420</td>
			<td align="RIGHT" sdval="48.0519" sdnum="1033;">48.0519</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>There we have our final SQL analysis of this dataset.</p><hr><p>Key takeaways:</p><p>With SQL it is easy to get the data to report what you want it to.  Some of the positives are that it is quick to query, and the language is quite easy to understand for smaller queries.  Conversely, while trying to accomplish more robust aggregations, the language becomes harder to follow.  Also it is not a &apos;one stop shop&apos; for analyzing and reporting visually.  Tools like Python or R would be useful here, as they can perform the same aggregations and produce the visualizations in one place.</p><p>I will be exploring this data visually with Tableau in <a href="https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/" rel="noreferrer">this post</a>.</p><p></p>]]></content:encoded></item><item><title><![CDATA[SQL EDA - AFDB Market Trends - Part 1]]></title><description><![CDATA[<p>I will be exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of</p>]]></description><link>https://www.williamameyer.com/project-1a-afdb-market-trends-2015/</link><guid isPermaLink="false">66ba6f91b4b8009e25b1bc0f</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 12 Aug 2024 21:15:24 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1655720360377-b97f6715e1ae?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fG1vbmV5JTIwYWZyaWNhfGVufDB8fHx8MTcyMzQ5NDMxMHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1655720360377-b97f6715e1ae?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fG1vbmV5JTIwYWZyaWNhfGVufDB8fHx8MTcyMzQ5NDMxMHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="SQL EDA - AFDB Market Trends - Part 1"><p>I will be exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of data.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"><col width="114"><col width="76"><col width="79"><col width="67"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">Indicator</td>
			<td width="268" align="LEFT">IndicatorName</td>
			<td width="114" align="LEFT">Unit</td>
			<td width="76" align="LEFT">Frequency</td>
			<td width="79" align="LEFT">Date</td>
			<td width="67" align="LEFT">Value</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-02</td>
			<td align="RIGHT" sdval="7082.4" sdnum="1033;">7082.4</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1.4416" sdnum="1033;">1.4416</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
			<td align="LEFT">USD/Troy Ounce</td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1781.5" sdnum="1033;">1781.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="4495.41" sdnum="1033;">4495.41</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>From this data I became curious to explore 7 different things:</p><ol><li>How many unique indicators are there?</li><li>Is the number of unique indicators equal to the unique <em>indicatorname </em>field? If not explain the discrepancy.</li><li>Remove all commodities from this table and put them in a separate table.</li><li>From the remaining indicators, are there any gaps in reporting?</li><li><em>Calculate the  &apos;PercentChangeDaily&apos; of indicators.*</em></li><li><em>Calculate the  &apos;AverageMonthly&apos; value of different indicators.*</em></li><li><em>What were the indicators that increased the most percentage OVERALL from the inception of this data.*</em></li></ol><p><em>*The last 3 items will be covered in the next post.</em></p><p>For a more condensed version, view on <a href="https://github.com/williamAM1/Data-Analytics-Portfolio/tree/main/Project01_SQL_EDA_AFDB" rel="noreferrer">github</a>.</p><hr><p>Let&apos;s start with the first question.</p><ol><li> How many unique indicators are there?</li></ol><pre><code class="language-SQL">SELECT 
	COUNT(DISTINCT(indicator)) AS indicator_count
FROM 
	afdbmarkettrends2015</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="1" rules="NONE" border="0">
	<colgroup><col width="104"></colgroup>
	<tbody>
		<tr>
			<td width="104" height="17" align="LEFT">indicator_count</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="31" sdnum="1033;">31</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Problem 1 is complete. I&apos;ve answered the question, but let&apos;s answer some questions that may be implied.  I will output the 31 indicators along with the indicator names.  We will sort the list alphabetically by the <em>indicatorname</em>.</p><pre><code class="language-SQL">SELECT 
	DISTINCT(indicator), 
	indicatorname
FROM 
	afdbmarkettrends2015
ORDER BY 
	indicatorname</code></pre>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">indicator</td>
			<td width="268" align="LEFT">indicatorname</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244509" sdnum="1033;">91244509</td>
			<td align="LEFT">Baltic Dry Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242409" sdnum="1033;">91242409</td>
			<td align="LEFT">CFA zone Countries CFA Franc</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91241909" sdnum="1033;">91241909</td>
			<td align="LEFT">Cocoa (USD/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244109" sdnum="1033;">91244109</td>
			<td align="LEFT">Coffee Brazilian Naturals (US cents/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242709" sdnum="1033;">91242709</td>
			<td align="LEFT">Coffee Robusta (US cents/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243909" sdnum="1033;">91243909</td>
			<td align="LEFT">Copper (USD/lb)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244709" sdnum="1033;">91244709</td>
			<td align="LEFT">Copper (USD/MT)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243409" sdnum="1033;">91243409</td>
			<td align="LEFT">Cote d&apos;Ivoire BRVM Composite Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243709" sdnum="1033;">91243709</td>
			<td align="LEFT">Cotton (USD/lb)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244809" sdnum="1033;">91244809</td>
			<td align="LEFT">Crude Oil, Brent (USD/bbl)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243809" sdnum="1033;">91243809</td>
			<td align="LEFT">Egypt Pound</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243609" sdnum="1033;">91243609</td>
			<td align="LEFT">Europe EURO</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244909" sdnum="1033;">91244909</td>
			<td align="LEFT">Gold (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244609" sdnum="1033;">91244609</td>
			<td align="LEFT">Iron Ore (USD/Dry MT)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242609" sdnum="1033;">91242609</td>
			<td align="LEFT">Kenya Kenyan Shilling</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242909" sdnum="1033;">91242909</td>
			<td align="LEFT">Mauritius Mauritius AllShares SEMDEX</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244009" sdnum="1033;">91244009</td>
			<td align="LEFT">Mauritius Rupee</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244309" sdnum="1033;">91244309</td>
			<td align="LEFT">Morocco Casa All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243109" sdnum="1033;">91243109</td>
			<td align="LEFT">Morocco Dirham</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243009" sdnum="1033;">91243009</td>
			<td align="LEFT">Nigeria Naira</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242009" sdnum="1033;">91242009</td>
			<td align="LEFT">Nigeria NGSE All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91245009" sdnum="1033;">91245009</td>
			<td align="LEFT">Silver (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242509" sdnum="1033;">91242509</td>
			<td align="LEFT">South Africa JALSH All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243209" sdnum="1033;">91243209</td>
			<td align="LEFT">South Africa Rand</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243309" sdnum="1033;">91243309</td>
			<td align="LEFT">Tunisia Tunis se Tnse Index TUNINDEX</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243509" sdnum="1033;">91243509</td>
			<td align="LEFT">Uganda SE All Share index USE</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244209" sdnum="1033;">91244209</td>
			<td align="LEFT">Uganda Shilling</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The second question outlines a &apos;sanity check&apos; for the data.  We will be comparing the count of unique values in the <em>indicator </em>field with the count of unique values in the <em>indicatorname </em>field, to make sure they are the same.</p><ol start="2"><li> Is the number of unique indicators equal to the unique <em>indicatorname </em>field? If not explain the discrepancy.</li></ol><pre><code class="language-SQL">SELECT 
	COUNT(DISTINCT(indicator)) AS indicator_count,
	COUNT(DISTINCT(indicatorname)) AS indicator_name_count
FROM 
	afdbmarkettrends2015</code></pre>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="104"><col width="145"></colgroup>
	<tbody>
		<tr>
			<td width="104" height="17" align="LEFT">indicator_count</td>
			<td width="145" align="LEFT">indicator_name_count</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="31" sdnum="1033;">31</td>
			<td align="RIGHT" sdval="31" sdnum="1033;">31</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The fields are equal; this aspect of the data is &apos;sane&apos;.</p><p>Then we want to separate the commodities from this data.  The <em>unit </em>column is populated if and only if the item is a commodity.  We will use this information to extract the commodities, and then delete them from the original table once they have been copied to a new one.</p><ol start="3"><li> Remove all commodities from this table and put them in a separate table.  Then we will count the number of items in the new table.</li></ol><pre><code class="language-SQL">CREATE TABLE afdbmarketcommodities2015 AS
	SELECT 
		* 
	FROM 
		afdbmarkettrends2015
	WHERE 
		unit IS NOT NULL;

SELECT 
	COUNT (*) AS newcommoditiescount 
FROM 
	afdbmarketcommodities2015;</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="1" rules="NONE" border="0">
	<colgroup><col width="147"></colgroup>
	<tbody>
		<tr>
			<td width="147" height="17" align="LEFT">newcommoditiescount</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11473" sdnum="1033;">11473</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This makes the new table from the filtered data.  The data was checked for accuracy, and we now need to remove the data from the <em>afdbmarkettrends2015 </em>table.</p><pre><code class="language-SQL">DELETE FROM 
	afdbmarkettrends2015 
WHERE 
	unit IS NOT NULL;</code></pre><p>OUTPUT:</p><blockquote><em>DELETE 11473 Query returned successfully in 11 msec.</em></blockquote><p><em>Good.</em></p><p>When I saw the dates listed, I was genuinely curious if the days had any gaps.  This will require some creative sorting by <em>indicator </em>and <em>date</em>, and the creation of a new column that holds the previous date on record.  This sorting should be done separately for each indicator.</p><ol start="4"><li>From the remaining indicators, are there any gaps in reporting?</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname,
	date,
	previousday,
	CAST(date AS timestamp) - CAST(previousday AS timestamp)  AS timegap
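	--(With DATE columns, date - previousday would return an integer day count;
	--the casts to timestamp make the gap an INTERVAL such as &apos;3 days&apos;.)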
FROM (
	SELECT 
		*,
		LAG(date) OVER(PARTITION BY (indicator) ORDER BY indicator, date) AS previousday
	FROM afdbmarkettrends2015
	ORDER BY 
		indicator, 
		date) 
	AS laggedtable</code></pre><p>OUTPUT (small summary):</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Of interest in the same output is that the lag starts over when a new indicator is listed.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="18" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">2015-07-24</td>
			<td align="LEFT">3 days</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p><em>Good</em></p><p>Let&apos;s look at another:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="18" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">2015-07-24</td>
			<td align="LEFT">3 days</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2015-07-28</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p><em>Still Good</em></p><p>We will continue with <a href="https://www.williamameyer.com/sql-exploratory-analysis-project-afdb-market-trends-2015-part-2/" rel="noreferrer">tasks 5-7</a> in another post.</p><p>I will also be exploring this data visually with Tableau in <a href="https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/" rel="noreferrer">this post</a> if you want to skip ahead.</p>]]></content:encoded></item></channel></rss>