<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Will's Data Portfolio]]></title><description><![CDATA[My Data Analytics Portfolio]]></description><link>https://www.williamameyer.com/</link><image><url>https://www.williamameyer.com/favicon.png</url><title>Will&apos;s Data Portfolio</title><link>https://www.williamameyer.com/</link></image><generator>Ghost 5.89</generator><lastBuildDate>Wed, 13 May 2026 10:22:37 GMT</lastBuildDate><atom:link href="https://www.williamameyer.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Data Analytics Capstone - Full Explanation]]></title><description><![CDATA[<p><strong>Research Question</strong></p><p>&#x201C;We are never going to reach equality in America &#x2026; until we achieve equality in education.&#x201D; (The Aspen Institute, 2017). Justice Sotomayor&#x2019;s words in this interview offer an ugly lesson on the current state of affairs in America, inherently stating that the US does</p>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-full-explanation/</link><guid isPermaLink="false">6868506246755058329ed839</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 22:26:14 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Full Explanation"><p><strong>Research Question</strong></p><p>&#x201C;We are never going to reach equality in America &#x2026; until we achieve equality in education.&#x201D; (The Aspen Institute, 2017). Justice Sotomayor&#x2019;s words in this interview offer an ugly lesson on the current state of affairs in America, inherently stating that the US does not pass the test to be called equal, and it cannot truly pass that test until education is equal. After growing up in a minority family in a subsidized housing unit, this Supreme Court justice graduated from Princeton summa cum laude, and now is a lifetime member of the nation&#x2019;s highest court (Wikipedia contributors, 2023). This is one anecdotal example of an exceptional scholar seemingly contradicting her quote. The reason for pursuing the research question is to use data analysis to see past anecdotal examples such as this one. To examine education inequality in the United States, this study asks, &#x201C;Does the proportion of students who are eligible for free lunch in a school have a statistically significant effect on standardized test scores?&#x201D;</p><p>The underlying premise of this question is &#x201C;are we educating children with differing incomes equally?&#x201D; The context of this question deals with Virginia Public Schools, specifically in the 2015-2016 school year. The data contains information compiled by University Libraries, Virginia Tech, and Bradburn (2021), and it includes standardized test scores for each school. 
The study&#x2019;s focus is on a school&#x2019;s average performance in standardized tests compared to the school&#x2019;s proportion of low-income students. The data does not contain the income of student families, but it does contain the proportion of students who are eligible for free lunches. Free lunch eligibility can therefore serve as a proxy for family income. In the 2015-2016 school year, students who were eligible for free lunches came from households with incomes lower than 130% of the Federal income poverty guidelines, as illustrated in the following chart (Department of Agriculture, Food and Nutrition Service, 2015).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-6.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="257" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-6.png 600w, https://www.williamameyer.com/content/images/2025/07/image-6.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Data found at: https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</em></i></figcaption></figure><p>It is hypothesized that income levels will have a statistically significant impact on a school&#x2019;s average test scores. For the two-sample t-test, the null hypothesis is that &#x201C;mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch.&#x201D; The alternate hypothesis is that &#x201C;mean test scores for schools with more students eligible for free lunch are statistically different from schools with fewer students eligible for free lunch.&#x201D;</p><p><strong>Data Collection</strong></p><p>The data comes from an existing data set titled &#x201C;Characterizing Virginia Public Schools (public dataset)&#x201D; (University Libraries, Virginia Tech &amp; Bradburn, 2021). The data set includes a series of four Microsoft Excel files with multiple sheets and a text file with a brief description of the data, covering 2008 through 2017. Of the four Excel files, &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; provides relevant data on the &#x201C;2015-16&#x201D; sheet. &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; contains relevant information on the &#x201C;Data&#x201D; sheet. The following table highlights the variables that are of importance to this study.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-7.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="317" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-7.png 600w, https://www.williamameyer.com/content/images/2025/07/image-7.png 793w" sizes="(min-width: 720px) 720px"></figure><p><br>Utilizing this data set derived from publicly available information was an advantage because the work of compiling the scores from disparate counties and cities had already been completed. A disadvantage of using this data is that the different Excel files contained different ways of reporting nulls.</p><p>The issue of reporting nulls differently in different Excel files was solved by examining the data and the data descriptions.
After importing the files into Pandas data frames, the null representations were standardized, overcoming this issue. Another issue was ensuring that time periods were compared appropriately. To overcome this challenge, the most recent school year that was included in both files was used. This meant including only the &#x201C;2015-16&#x201D; sheet from the &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; file and including the columns from the &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; file with the suffix &#x201C;_2015_16&#x201D;.</p><p><br><strong>Data Extraction and Preparation</strong></p><p>The data preparation steps are outlined below:</p><p>All code is included <s>in a separate file &#x201C;02_D214_Task_2_Code.ipynb&#x201D; </s> <a href="https://www.williamameyer.com/data-analytics-capstone-code-reference/" rel="noreferrer">in this post.</a></p><ol><li>Save Excel Sheets as CSV using Excel</li></ol><p>The data was received in a series of Excel files with multiple sheets. The file &#x201C;AYP VA schools 2012 - 2016_final.xlsx&#x201D; provides relevant data on the &#x201C;2015-16&#x201D; sheet. This sheet was saved as &#x201C;scores.csv&#x201D;. &#x201C;Free Reduced Lunch by Schools and Grade Structures 2008-2017_final.xlsx&#x201D; contains relevant information on the &#x201C;Data&#x201D; sheet. This sheet was saved as &#x201C;free_lunch_percentages.csv&#x201D;.</p><ol start="2"><li>Import data into a Pandas data frame</li></ol><p>The Pandas &#x201C;read_csv&#x201D; method was used to import both files into a Jupyter notebook using Python version 3.</p><pre><code>#Import data from data set files.

scores = pd.read_csv (&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\scores.csv&apos;,
                 index_col=0)
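#(The absolute Windows paths here and below are machine-specific; a relative
#path such as &apos;Data/scores.csv&apos; would make the notebook more portable.)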

free_lunch_percentages = pd.read_csv \
(&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\free_lunch_percentages.csv&apos;, index_col=0)</code></pre><p></p><ol start="3"><li>Convert null values to standard &#x2018;nan&#x2019;</li></ol><p>&#x201C;#NULL!&#x201D; and &#x201C;-&#x201C; values were changed to &#x201C;nan&#x201D;.</p><pre><code>#Converting all null value representations to &apos;nan&apos;
#&quot;free_lunch_percentages&quot; has nulls reported as &apos;#NULL!&apos;.  Changing &apos;#NULL!&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
free_lunch_percentages.replace(&quot;#NULL!&quot;, np.nan, inplace=True)
#&quot;scores&quot; has nulls reported as &apos;-&apos;.  Changing &apos;-&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
scores.replace(&quot;-&quot;, np.nan, inplace=True)</code></pre><ol start="4"><li>Create new data frames with only relevant information, then join them on the school ID.</li></ol><p>New data frames were created with only the variables that will be used in the study. The &#x201C;.join()&#x201D; method was used to match the data on the school ID.</p><pre><code>#Only include relevant variables
dependent = scores[[&quot;Division Name&quot;,&quot;School Name&quot;,&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;,&quot;History_2015_16&quot;,&quot;Science_2015_16&quot;]]
independent = free_lunch_percentages[[&quot;free_per_1516&quot;]]</code></pre><pre><code>#Join data on both dataframes using join() method (Bobbit, 2021).
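#Note: join() aligns rows on the shared index (the school ID); the default
#left join keeps every school in &quot;dependent&quot; and leaves NaN in
#&apos;free_per_1516&apos; where no lunch record matches.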
df = dependent.join(independent)</code></pre><ol start="5"><li>Report on nulls, and treat (null removal was used), and repeat the report to confirm the removal.</li></ol><p>A function was created to report the numbers and percentages of null information on the data frame. After examining the data it was decided to remove null values with the &#x201C;dropna()&#x201D; method.</p><pre><code>#Define function to count nulls and output counts/percentage
def nullcounter(dataframename):
    print(&quot;\n Your null count for \&quot;{dname}\&quot; is: \n&quot;.format(dname=dataframename))
    #Look the data frame up by name so the report can label itself;
    #this avoids building and exec()-ing a code string.
    frame = globals()[dataframename]
    nullreport = pd.DataFrame(frame.isna().sum(), columns=[&quot;Null Count&quot;])
    nullreport[&apos;Null %&apos;] = frame.isna().mean() * 100
    print(nullreport.loc[nullreport[&apos;Null Count&apos;] &gt; 0])

    
#Output basic statistics on combined data frame.
print(df.describe())

#Output null count on combined data frame.
nullcounter(&quot;df&quot;)

#Output null visualization on combined data frame.
msno.matrix(df)</code></pre><pre><code>#Drop rows that contain nulls
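#dropna() removes any row with at least one missing value; with null rates of
#roughly 3-7% per column, this keeps most schools while avoiding imputation.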
df = df.dropna()</code></pre><ol start="6"><li>Check for duplicates. None found.</li></ol><p>The &#x201C;duplicated()&#x201D; method was used to examine if there was any duplicate data in the data set. There were no duplicates found.</p><pre><code>#Cell 4
#Check for duplicates on all rows

#Check for duplicates
duplicates = df.duplicated()
print(&quot;Duplicate data on all rows combined?&quot;)
print(duplicates.unique())
print(&apos;\n&apos;)

#Check for duplicates on index using duplicated() method, code snippet from Stack Overflow (Matthew, 2013).
df[df.index.duplicated(keep=False)]

print(&quot;Data Types&quot;)
df.dtypes</code></pre><ol start="7"><li>Convert variables to appropriate types, remove the &#x201C;%&#x201D; symbol, and drop any data that is out of bounds.</li></ol><p>The test scores were converted to integers. The percentage of students on free lunch was converted to a float after removing the &quot;%&quot; symbol. Percentage data that was negative or above 1 was deemed out of bounds, and removed using the &#x201C;drop()&#x201D; method.</p><pre><code>#Convert scores to numeric then integers
df[&apos;English_2015_16&apos;] = pd.to_numeric(df[&apos;English_2015_16&apos;]).astype(int)
df[&apos;Mathematics_2015_16&apos;] = pd.to_numeric(df[&apos;Mathematics_2015_16&apos;]).astype(int)
df[&apos;History_2015_16&apos;] = pd.to_numeric(df[&apos;History_2015_16&apos;]).astype(int)
df[&apos;Science_2015_16&apos;] = pd.to_numeric(df[&apos;Science_2015_16&apos;]).astype(int)


#Remove percentage sign and convert to decimal notation using rstrip and astype.  Code snippet from Stack Overflow (Bloom, 2014).
df[&apos;free_per_1516&apos;] = df[&apos;free_per_1516&apos;].str.rstrip(&apos;%&apos;).astype(&apos;float&apos;) / 100.0</code></pre><ol start="8"><li>Create aggregate variable &#x201C;total_score&#x201D;.</li></ol><p>The scores for each test were added together to create the variable &#x201C;total_score&#x201D; using the &#x201C;sum()&#x201D; method.</p><pre><code>#Create a sum of all test scores as &apos;total_score&apos;.
df[&apos;total_score&apos;] = df[[&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;, &quot;History_2015_16&quot;, &quot;Science_2015_16&quot;]].sum(axis=1)
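#axis=1 sums across the four subject columns within each row (school).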
df.head()</code></pre><p></p><ol start="9"><li>Cap outliers in &quot;total_score&quot;.</li></ol><p>Outliers in &#x201C;total_score&#x201D; were capped to remain within 1.5 times the interquartile range.</p><pre><code>#Capping the data in the skewed distribution &quot;total_score&quot; with 1.5 iqr
#Article mentions treating outliers differently depending on the distribution (Goyal, 2022).

skewed_int_outliers =[&quot;total_score&quot;]

#Skewed Distribution Outlier treatment with 1.5iqr
for i in skewed_int_outliers:
    percentile25 = df[i].quantile(0.25)
    percentile75 = df[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
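    #These are Tukey&apos;s fences: values beyond 1.5*IQR outside the quartiles
    #are treated as outliers and pulled back to the nearest fence.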
    #capping the outliers
    df[i] = np.where(
        df[i]&gt;upper_limit,
        math.floor(upper_limit),
        np.where(
            df[i]&lt;lower_limit,
            math.ceil(lower_limit),
            df[i]
        )
    )</code></pre><ol start="10"><li>Calculate the median of &#x201C;free_per_1516&#x201D; and put schools into two groups based on the median.</li></ol><p>The median was used to create two groups of schools. Schools with a percentage of students eligible to receive free lunch greater than or equal to the median were included in group &#x201C;1&#x201D;. The remaining schools were placed in group &#x201C;0&#x201D;. Two data frames were created based on these different populations.</p><pre><code>#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><pre><code>#Create distinct groups of student population based on proportion of students who are eligible for free lunch.
#Values below and above can be replaced based on condition (Komali, 2021).

#Split data at 0.372850 (the median of the cleaned data set)
df.loc[df[&apos;free_per_1516&apos;] &gt;= free_lunch_percent_median[0], &apos;groups&apos;] = 1
df.loc[df[&apos;free_per_1516&apos;] &lt; free_lunch_percent_median[0], &apos;groups&apos;] = 0 
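#Schools exactly at the median land in group 1 because of the &gt;= comparison.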

df[&apos;groups&apos;] = pd.to_numeric(df[&apos;groups&apos;]).astype(int)
df.groups.describe()

low_percentage_df = df.loc[df[&apos;groups&apos;] == 0]
high_percentage_df = df.loc[df[&apos;groups&apos;] == 1]</code></pre><p>Python was utilized because it is a good language for data science and provides the functionality to deal with data science applications (GeeksforGeeks, 2022). Utilizing a Pandas data frame allowed for all the necessary data transformations on the data. An advantage of using a Jupyter notebook for the data extraction and preparation is the division of tasks into individual cells, this allowed for larger tasks to be decomposed and tested in individual cells. A disadvantage of using Python is that visualization is not a strength, especially when compared to the R programming language (Python Vs. R: What&#x2019;s the Difference?, 2021). This limitation was overcome by utilizing additional libraries for data visualization in Python.</p><p><strong>Analysis</strong></p><p>At this point, the data has been placed into two distinct groups. One group in the data frame &#x201C;low_percentage_df&#x201D; represents schools that have lower than the median number of students who are eligible for free lunch, and the other group &#x201C;high_percentage_df&#x201D; represents schools that have higher than the median number of students who are eligible for free lunch.</p><p>To test if these populations have similar test scores, we will compare them using a two-sample t-test. A two-sample t-test &#x201C;is used when you want to compare two independent groups to see if their means are different&#x201D; (Glen, 2022). This study will compare the mean test scores of the schools with low proportions of students who are eligible for free lunch and the mean test scores of the schools with high proportions of students who are eligible for free lunch. This use case aligns with the purpose of a t-test.</p><p>The analysis steps are described below:</p><p>All code is included in a separate file &#x201C;02_D214_Task_2_Code.ipynb&#x201D;:</p><ol><li>Visually examine the data:</li></ol><pre><code>#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><p></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-8.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="279" height="406"></figure><p></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-9.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="297" height="625"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-10.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="405" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-10.png 600w, https://www.williamameyer.com/content/images/2025/07/image-10.png 793w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Test for Equality of variance:</li></ol><p>A Levene&#x2019;s test was used as a preliminary test to check if the two populations had the same variance. The test resulted in a p-value less than .05, meaning we have sufficient evidence to say the variance in test scores between the populations is likely different (Bobbit, 2020).</p><pre><code>#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)</code></pre><ol start="3"><li>Perform the two-sample t-test on both populations:</li></ol><p>Though the Levene&#x2019;s test showed that it is unlikely that the populations have the same variance, the t-test was still performed on the data sets. &#x201C;The t-test is robust to violations of that assumption so long as the sample size isn&#x2019;t tiny and the sample sizes aren&#x2019;t far apart&#x201D; (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.).</p><pre><code>#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
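#equal_var=False requests Welch&apos;s t-test, which does not assume equal
#variances, consistent with the Levene&apos;s test result above.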
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))</code></pre><p>Output:</p><blockquote>Ttest_indResult(statistic=25.734340600381046, pvalue=2.5360699249948364e-122)</blockquote><ol start="4"><li>Visualize population distributions with a Kernel Density Plot.</li></ol><p>A density plot was used to further analyze the distribution of test scores in each of the populations.</p><pre><code>#Run the t-test
#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))
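#Beyond the p-value, an effect size indicates how large the gap is; a minimal
#sketch (not part of the original study) using Cohen&apos;s d with a simple pooled
#standard deviation:
pooled_sd = np.sqrt((low_percentage_df.total_score.var(ddof=1)
                     + high_percentage_df.total_score.var(ddof=1)) / 2)
cohens_d = (low_percentage_df.total_score.mean()
            - high_percentage_df.total_score.mean()) / pooled_sd
print(&quot;Cohen&apos;s d:&quot;, cohens_d)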

#Plot both population&apos;s metrics
low_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
high_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
# Labels
labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
plt.legend(labels)
plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
plt.ylabel(&apos;Score Probability Density&apos;)

plt.show() </code></pre><p>Output:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-11.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="793" height="377" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-11.png 600w, https://www.williamameyer.com/content/images/2025/07/image-11.png 793w" sizes="(min-width: 720px) 720px"></figure><p>An advantage of using the two-sample t-test is its simplicity. After the data is cleaned the test can be performed with one line of code. One disadvantage is that the output is not as rigorous as other tests. A regression model would estimate how much of an increase in the proportion of students receiving free school lunches affected the test scores. The two-sample t-test just reports if the populations likely have different means.</p><p><strong>Data Summary and Implications</strong></p><p>The two-sample t-test (with a t-value of 25.734340600381046, and a p-value of 2.5360699249948364e-122) shows that there is likely a statistically significant difference between schools with high proportions of students receiving free lunch and schools with low proportions of these students. The probability density graph highlights this difference. Schools with a high proportion of free lunch eligible students are receiving lower standardized test scores.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-12.png" class="kg-image" alt="Data Analytics Capstone - Full Explanation" loading="lazy" width="1171" height="522" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-12.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/image-12.png 1000w, https://www.williamameyer.com/content/images/2025/07/image-12.png 1171w" sizes="(min-width: 720px) 720px"></figure><p>A limitation of this analysis is that performing tests on the entire state of Virginia&#x2019;s schools does not consider unique situations. Some of these schools could specifically cater to the needs of those with developmental issues, possibly skewing the results. Perhaps there exist in Virginia schools whose goal is to specifically help those who are economically disadvantaged, possibly leading to schools that have both high achievement on standardized tests, and above median levels of students eligible for free lunch. These are only a few examples of unique situations that may need further consideration.</p><p>An immediate action would be to further examine cities with both types of schools. Cities that have both, schools with higher than median amounts of students eligible for free lunch, and schools with lower than median amounts of students eligible for free lunch can be compared. Some of these cities may show a statistically significant difference in test scores in these differing populations. Some of these cities may show that they can effectively teach students from lower-income families equally. If the cities that have more equality in education offer a different approach, different programs, or even have a different pedagogy, these things can be tested in cities that have larger gaps in education equality immediately.</p><p>Further examination as to why some schools have large proportions of students eligible for free lunch should also be performed. 
For example, if school districts are drawn in such a way that unfairly concentrates economically disadvantaged students in some schools while concentrating students with more economic advantages in other schools, the districts should be redrawn to attempt to even out these numbers.</p><p>Another aspect to look at when it comes to evening out the incomes in these schools should be housing. The Housing Choice Voucher is a federal program that is meant to assist &#x201C;very low-income families, the elderly, and the disabled to afford decent, safe, and sanitary housing in the private market.&#x201D; (<em>Housing Choice Voucher Program Section 8</em>, 2022). Are these cities fully taking advantage of the program? Does the application of the housing program in the city provide low-income families opportunities to live in districts that have lower levels of poverty, or does the voucher program need to make improvements in this area? Asking these questions, and course correcting when schools have uneven levels of student income is a necessary step.</p><p>Further study of the data set would be to look closely at individual counties and cities to see where income inequality exists, specifically where it is influencing student outcomes. This can be done by performing the same two-sample t-test with a city/county scope as opposed to a statewide scope. Another approach would be to attempt a regression to examine how much of an influence the proportion of students that are eligible for free lunches affects test scores.</p><p>Income inequality is just one measure of inequality in schools. It is important to continue to pursue studies about inequality to quantify the effects, understand the influences, and importantly to try to fix the issues. No form of separate and unequal education in schools should be accepted. When these discrepancies are discovered, it is important that everything is done to understand the issue and as quickly as possible correct the problem.</p><p><strong>References:</strong></p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. https://www.statology.org/levenes-test-python/</p><p>GeeksforGeeks. (2022, November 17).<em>Python for Data Science</em>.<a href="https://www.geeksforgeeks.org/python-for-data-science/"><u>https://www.geeksforgeeks.org/python-for-data-science/</u></a></p><p>Glen, S. (2022, January 12). <em>Two-Sample T-Test: When to Use it</em>. Statistics How To.<a href="https://www.statisticshowto.com/two-sample-t-test-difference-means/"><u>https://www.statisticshowto.com/two-sample-t-test-difference-means/</u></a></p><p><em>Housing Choice Voucher Program Section 8</em>. (2022, January 11). HUD.gov / U.S. Department of Housing And Urban Development (HUD). https://www.hud.gov/topics/housing_choice_voucher_program_section_8</p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics.<a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/"><u>https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</u></a></p><p><em>Python vs. R: What&#x2019;s the Difference?</em> (2021, March 23). IBM. https://www.ibm.com/cloud/blog/python-vs-r</p><p>The Aspen Institute. (2017, March 30). <em>In Conversation: Justice Sonia Sotomayor and Abigail Golden-Vazquez</em> [Video]. 
YouTube.<a href="https://www.youtube.com/watch?v=EaJuyXqGF2E&amp;feature=youtu.be"><u>https://www.youtube.com/watch?v=EaJuyXqGF2E&amp;feature=youtu.be</u></a></p><p>University Libraries, Virginia Tech, &amp; Bradburn, I. (2021, May 18).<em>Characterizing Virginia Public Schools (public dataset)</em>. Figshare.<a href="https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092"><u>https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092</u></a>CC license:<a href="https://creativecommons.org/licenses/by/3.0/us/"><u>CC BY 3.0 US</u></a></p><p>Wikipedia contributors. (2023, February 3). <em>Sonia Sotomayor</em>. Wikipedia.<a href="https://en.wikipedia.org/wiki/Sonia_Sotomayor"><u>https://en.wikipedia.org/wiki/Sonia_Sotomayor</u></a></p><p>Department of Agriculture, Food and Nutrition Service. (2015, March 26).<em>Federal Register: Vol. 80, No. 61</em>. The U.S. Department of Agriculture.<a href="https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf"><u>https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</u></a></p><p><strong>Code References:</strong></p><p>Bloom, G., [GaryMBloom]. (2014, September 4). Convert percent string to float in pandas read_csv. Stack Overflow. https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv</p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. https://www.statology.org/levenes-test-python/</p><p>Bobbit, Z. (2021, November 6). How to Merge Two Pandas DataFrames on Index. Statology. https://www.statology.org/pandas-merge-on-index/</p><p>Chattar, P. (2021, September 2). Find common elements in two lists in python. Java2Blog. https://java2blog.com/find-common-elements-in-two-lists-python/</p><p>GeeksforGeeks. (2022, October 17). How to Conduct a Two Sample T Test in Python. https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/</p><p>Goyal, C. (2022, August 25). Feature Engineering &#x2013; How to Detect and Remove Outliers (with Python Code). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/</p><p>Hadzhiev, B. (n.d.). Remove common elements from two Lists in Python | bobbyhadz. Blog - Bobby Hadz. https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists</p><p>Harikrishnan, R., [Harikrishnan R]. (2020, January 26). Dropping rows with values outside of boundaries. Stack Overflow. https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries</p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics. https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</p><p>Inada, I. (Ed.). (n.d.). Replacing values with NaNs in Pandas DataFrame. https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe</p><p>Komali. (2021, December 27). Pandas Replace Values based on Condition. Spark by {Examples}. https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/</p><p>Matthew. (2013, November 25). Pandas: Get duplicated indexes. Stack Overflow. https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes</p>]]></content:encoded></item><item><title><![CDATA[Data Analytics Capstone - Code Reference]]></title><description><![CDATA[<pre><code>#Cell 1
#Import the packages required to complete the test.

#Pandas is imported to deal with tabular data.
import pandas as pd
#Numpy is imported to deal with mathematical functions.
import numpy as np
#Missingno is imported to deal with nulls.
import missingno as msno
#Pyplot and Seaborn are imported</code></pre>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-code-reference/</link><guid isPermaLink="false">6868474c46755058329ed7cf</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 21:46:34 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<pre><code>#Cell 1
#Import the packages required to complete the test.

#Pandas is imported to deal with tabular data.
import pandas as pd
#Numpy is imported to deal with mathematical functions.
import numpy as np
#Missingno is imported to deal with nulls.
import missingno as msno
#Pyplot and Seaborn are imported to deal with visualizations.
import matplotlib.pyplot as plt
import seaborn as sns
#Math is imported as it will help with capping outliers
import math
#Stats is imported for running the t-test.
import scipy.stats as stats
#For formatting list output in a more readable format
from pprint import pprint

#Warnings is used to suppress messages about future version deprecation.
import warnings
warnings.simplefilter(action=&apos;ignore&apos;, category=FutureWarning)


#Import data from data set files.

scores = pd.read_csv (&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\scores.csv&apos;,
                 index_col=0)

free_lunch_percentages = pd.read_csv \
(&apos;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\free_lunch_percentages.csv&apos;, index_col=0)


#Converting all null value representations to &apos;nan&apos;
#&quot;free_lunch_percentages&quot; has nulls reported as &apos;#NULL!&apos;.  Changing &apos;#NULL!&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
free_lunch_percentages.replace(&quot;#NULL!&quot;, np.nan, inplace=True)
#&quot;scores&quot; has nulls reported as &apos;-&apos;.  Changing &apos;-&apos; to &apos;nan&apos; with replace() method (Inada, n.d.).
scores.replace(&quot;-&quot;, np.nan, inplace=True)


#Only include relevant variables
dependent = scores[[&quot;Division Name&quot;,&quot;School Name&quot;,&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;,&quot;History_2015_16&quot;,&quot;Science_2015_16&quot;]]
independent = free_lunch_percentages[[&quot;free_per_1516&quot;]]    


#Output basic statistics on each dataframe.
print(&quot;\n&quot;)
print(&quot;Dependent data frame statistics&quot;)
print(dependent.describe())
print(&quot;\n&quot;)
print(&quot;Independent data frame statistics&quot;)
print(independent.describe())

</code></pre><blockquote>Dependent data frame statistics<br>          Division Name               School Name English_2015_16  \<br>count              1822                      1831            1813   <br>unique              132                      1758              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                192                         6              89   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16  <br>count                 1814            1768            1750  <br>unique                  62              49              71  <br>top                     90              94              83  <br>freq                    95              99              79  <br><br><br>Independent data frame statistics<br>       free_per_1516<br>count           1910<br>unique          1610<br>top          100.00%<br>freq              60<br></blockquote><pre><code>#Cell 2
#Join data on both dataframes using join() method (Bobbit, 2021).
df = dependent.join(independent)


#Define function to count nulls and output counts/percentage
def nullcounter(dataframename):
    print(&quot;\n Your null count for \&quot;{dname}\&quot; is: \n&quot;.format(dname=dataframename))
    #Look the data frame up by name so the report can label itself;
    #this avoids building and exec()-ing a code string.
    frame = globals()[dataframename]
    nullreport = pd.DataFrame(frame.isna().sum(), columns=[&quot;Null Count&quot;])
    nullreport[&apos;Null %&apos;] = frame.isna().mean() * 100
    print(nullreport.loc[nullreport[&apos;Null Count&apos;] &gt; 0])

    
#Output basic statistics on combined data frame.
print(df.describe())

#Output null count on combined data frame.
nullcounter(&quot;df&quot;)

#Output null visualization on combined data frame.
msno.matrix(df)

</code></pre><blockquote>          Division Name               School Name English_2015_16  \<br>count              1822                      1831            1813   <br>unique              132                      1758              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                192                         6              89   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16 free_per_1516  <br>count                 1814            1768            1750          1819  <br>unique                  62              49              71          1555  <br>top                     90              94              83       100.00%  <br>freq                    95              99              79            48  <br><br> Your null count for &quot;df&quot; is: <br><br>                     Null Count    Null %<br>Division Name                55  2.930208<br>School Name                  46  2.450719<br>English_2015_16              64  3.409696<br>Mathematics_2015_16          63  3.356420<br>History_2015_16             109  5.807139<br>Science_2015_16             127  6.766116<br>free_per_1516                58  3.090037<br>&lt;AxesSubplot:&gt;</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1476" height="715" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-37a74f7a-3700-4997-b6cd-f55b64678e00.png 1476w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 3
#Drop rows that contain nulls
df = df.dropna()

#Count nulls again
nullcounter(&quot;df&quot;)

#Describe data frame again.
print(df.describe())

#Visualize nulls again
msno.matrix(df)</code></pre><blockquote> Your null count for &quot;df&quot; is: <br><br>Empty DataFrame<br>Columns: [Null Count, Null %]<br>Index: []<br>          Division Name               School Name English_2015_16  \<br>count              1724                      1724            1724   <br>unique              132                      1662              64   <br>top     Fairfax County   Mountain View Elementary              85   <br>freq                190                         5              84   <br><br>       Mathematics_2015_16 History_2015_16 Science_2015_16 free_per_1516  <br>count                 1724            1724            1724          1724  <br>unique                  61              49              71          1476  <br>top                     90              94              83       100.00%  <br>freq                    90              97              79            47  <br>&lt;AxesSubplot:&gt;</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1476" height="715" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-5342669b-af65-4e96-a9d5-b14e8601f9c4.png 1476w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 4
#Check for duplicates on all rows

#Check for duplicates
duplicates = df.duplicated()
print(&quot;Duplicate data on all rows combined?&quot;)
print(duplicates.unique())
print(&apos;\n&apos;)

#Check for duplicates on index using duplicated() method, code snippet from Stack Overflow (Matthew, 2013).
df[df.index.duplicated(keep=False)]

print(&quot;Data Types&quot;)
df.dtypes</code></pre><blockquote>Duplicate data on all rows combined?<br>[False]<br><br><br>Data Types<br>Division Name          object<br>School Name            object<br>English_2015_16        object<br>Mathematics_2015_16    object<br>History_2015_16        object<br>Science_2015_16        object<br>free_per_1516          object<br>dtype: object</blockquote><pre><code>
#Cell 5
#Convert scores to numeric then integers
df[&apos;English_2015_16&apos;] = pd.to_numeric(df[&apos;English_2015_16&apos;]).astype(int)
df[&apos;Mathematics_2015_16&apos;] = pd.to_numeric(df[&apos;Mathematics_2015_16&apos;]).astype(int)
df[&apos;History_2015_16&apos;] = pd.to_numeric(df[&apos;History_2015_16&apos;]).astype(int)
df[&apos;Science_2015_16&apos;] = pd.to_numeric(df[&apos;Science_2015_16&apos;]).astype(int)


#Remove percentage sign and convert to decimal notation using rstrip and astype.  Code snippet from Stack Overflow (Bloom, 2014).
df[&apos;free_per_1516&apos;] = df[&apos;free_per_1516&apos;].str.rstrip(&apos;%&apos;).astype(&apos;float&apos;) / 100.0


#Remove data that is out of bounds in &apos;free_per_1516&apos;.  Code snippet from Stack Overflow (Harikrishnan, 2020).
df.drop(df[df.free_per_1516&gt;1.0].index, inplace=True)
df.drop(df[df.free_per_1516&lt;0.0].index, inplace=True)
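#After the conversion above, &apos;free_per_1516&apos; is a proportion, so values
#outside [0, 1] indicate entry errors and are dropped rather than capped.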


#Create a sum of all test scores as &apos;total_score&apos;.
df[&apos;total_score&apos;] = df[[&quot;English_2015_16&quot;,&quot;Mathematics_2015_16&quot;, &quot;History_2015_16&quot;, &quot;Science_2015_16&quot;]].sum(axis=1)
df.head()


df.free_per_1516.describe()</code></pre><blockquote>count    1718.000000<br>mean        0.402871<br>std         0.260002<br>min         0.002800<br>25%         0.204550<br>50%         0.372850<br>75%         0.540700<br>max         1.000000<br>Name: free_per_1516, dtype: float64</blockquote><pre><code>#Cell 6
#Data nearly cleaned, output visualizations

#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-8d49dfc4-e6e0-408f-8ee9-4d94f38f5cac.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-3d2b68bf-9b71-46f8-8272-9bea7e7dcdf1.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-73bef012-037e-41f9-a9bb-d3b02823c68c.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="400" height="278"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-67d90e43-e3c7-49ad-882d-e80b3684bce7.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7b22f32b-f5e5-4d24-b281-deff53dea240.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="353" height="279"></figure><img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Code Reference"><p>Median Percent of &apos;free_per_1516&apos; is:  0.37285 </p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1285" height="646" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-e00bb011-c335-414b-a3b9-5ff9e768791b.png 1285w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 7
#Capping the data in the skewed distribution &quot;total_score&quot; with 1.5 iqr
#Article mentions treating outliers differently depending on the distribution (Goyal, 2022).

skewed_int_outliers =[&quot;total_score&quot;]

#Skewed Distribution Outlier treatment with 1.5iqr
for i in skewed_int_outliers:
    percentile25 = df[i].quantile(0.25)
    percentile75 = df[i].quantile(0.75)
    iqr = percentile75 - percentile25
    upper_limit = percentile75 + 1.5 * iqr
    lower_limit = percentile25 - 1.5 * iqr
    #capping the outliers
    df[i] = np.where(
        df[i]&gt;upper_limit,
        math.floor(upper_limit),
        np.where(
            df[i]&lt;lower_limit,
            math.ceil(lower_limit),
            df[i]
        )
    )


#Data cleaned, start with visualizations
#Output histogram on free_per_1516
plt.hist(df[&apos;free_per_1516&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;free_per_1516&quot;))
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output histogram on &apos;total_score&apos;
plt.hist(df[&apos;total_score&apos;])
plt.title(&quot;Histogram {}&quot;.format(&quot;total_score&quot;))
plt.xlabel(&quot;total_score&quot;)
plt.ylabel(&quot;Count&quot;)
plt.show()


#Output regplot to examine possible relationships
sns.regplot(x = df[&apos;free_per_1516&apos;], y = df[&apos;total_score&apos;])
plt.title(&quot;Scatter plot free_per_1516 compared to total_score&quot;)
plt.xlabel(&quot;free_per_1516&quot;)
plt.ylabel(&quot;total_score&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;free_per_1516&apos;
sns.boxplot(df[&apos;free_per_1516&apos;])
plt.title(&quot;Box plot free_per_1516&quot;)
plt.show()


#Output boxplot to examine data characteristics of &apos;total_score&apos;
sns.boxplot(df[&apos;total_score&apos;])
plt.title(&quot;Box plot total_score&quot;)
plt.show()


#Output heatmap to examine correlation
plt.figure(figsize=(25, 11))
plt.title(&quot;Heatmap free_per_1516 correlation with total_score&quot;)
sns.heatmap(df[[&quot;free_per_1516&quot;,&quot;total_score&quot;]].corr(),vmin=-1, vmax=1, annot=True);


#Find median and print
free_lunch_percent_median = df[&apos;free_per_1516&apos;].describe().loc[[&apos;50%&apos;]]
print(&quot;Median Percent of &apos;free_per_1516&apos; is: &quot;, free_lunch_percent_median[0], &quot;\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-f5d4c6bd-2d07-4f34-ac22-07796e61416f.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7f2fa0fb-24d6-41b0-835d-9b8c273bb4bd.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="389" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-20f37c53-c872-43a7-bc25-1bd769547272.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="400" height="278"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-1925b874-3565-4a9b-82aa-7bb8870569f7.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-798406ff-1cbe-4efd-ac2c-babf46a5caec.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="352" height="279"></figure><blockquote>Median Percent of &apos;free_per_1516&apos; is:  0.37285 <br><br></blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="1285" height="646" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 600w, https://www.williamameyer.com/content/images/size/w1000/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 1000w, https://www.williamameyer.com/content/images/2025/07/data-src-image-b0210f98-6a15-4a43-b872-3de908e29f17.png 1285w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 8

#Create distinct groups of student population based on proportion of students who are eligible for free lunch.
#Values below and above can be replaced based on condition (Komali, 2021).

#Split data at 0.372850 (the median of the cleaned data set)
df.loc[df[&apos;free_per_1516&apos;] &gt;= free_lunch_percent_median[0], &apos;groups&apos;] = 1
df.loc[df[&apos;free_per_1516&apos;] &lt; free_lunch_percent_median[0], &apos;groups&apos;] = 0 

df[&apos;groups&apos;] = pd.to_numeric(df[&apos;groups&apos;]).astype(int)
df.groups.describe()

low_percentage_df = df.loc[df[&apos;groups&apos;] == 0]
high_percentage_df = df.loc[df[&apos;groups&apos;] == 1]


#Export a copy of the cleaned data
df.to_csv(&quot;C:\\Users\\will\\Desktop\\Courses\\11 - Capstone - D214\\Task 2\\Data\\cleaned_data.csv&quot;, index = False)
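#Note: index=False omits the school ID index from the exported CSV.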


#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)</code></pre><blockquote>LeveneResult(statistic=54.81184644296979, pvalue=2.0681326644527468e-13)</blockquote><p>Interpreting the LeveneResult: &quot;Ignore the result. With equal, or nearly equal, sample size (and moderately large samples), the assumption of equal standard deviations is not a crucial assumption. The t test work[s] pretty well even with unequal standard deviations. In other words, the t test is robust to violations of that assumption so long as the sample size isn&#x2019;t tiny and the sample sizes aren&#x2019;t far apart.&quot; (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.).</p><pre><code>#Cell 9
#Run the t-test
#Code snippet for t-test derived from website (GeeksforGeeks, 2022).
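#equal_var=False runs Welch&apos;s t-test, dropping the equal-variance assumption
#that the Levene&apos;s test above called into question.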
print(stats.ttest_ind(a=low_percentage_df.total_score, b=high_percentage_df.total_score, equal_var=False))

#Plot both population&apos;s metrics
low_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
high_percentage_df[&apos;total_score&apos;].plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
# Labels
labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
plt.legend(labels)
plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
plt.ylabel(&apos;Score Probability Density&apos;)

plt.show() </code></pre><blockquote>Ttest_indResult(statistic=25.734340600381046, pvalue=2.5360699249948364e-122)</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="371" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-b896eda4-b1d0-4bcd-9894-22b58bc72986.png 795w" sizes="(min-width: 720px) 720px"></figure><p>Original Outcome Complete</p><pre><code>

#Cell 10
#Make a list of unique divisions
division_list_low = low_percentage_df[&apos;Division Name&apos;].unique().tolist()
division_list_high = high_percentage_df[&apos;Division Name&apos;].unique().tolist()

print(&quot;Count of Divisions with low proportions of students who are eligible for free lunch&quot;)
print(len(division_list_low))
print(&quot;Count of Divisions with high proportions of students who are eligible for free lunch&quot;)
print(len(division_list_high))


#Create a list of divisions that contain schools with high and low proportions of students eligible for free lunch.
#code snippet found online (Chattar, 2021).
divisions_high_and_low_temp = set(division_list_low).intersection(division_list_high)
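#intersection() keeps only the division names present in both lists, i.e.,
#divisions with schools on both sides of the median split.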


print(&quot;Count of Divisions that have schools with high and low proportions of students eligible for free lunch&quot;)
print(len(divisions_high_and_low_temp))

divisions_high_and_low = []
for i in divisions_high_and_low_temp:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    if len(X) &gt; 2 and len(Y) &gt; 2:
        divisions_high_and_low.append(i)

print(&quot;Count of Divisions that have more than 2 of each: schools with high proportions of students eligible \
for free lunch, and schools with low proportions.&quot;)
print(len(divisions_high_and_low))

</code></pre><blockquote>Count of Divisions with low proportions of students who are eligible for free lunch<br>85<br>Count of Divisions with high proportions of students who are eligible for free lunch<br>117<br>Count of Divisions that have schools with high and low proportions of students eligible for free lunch<br>70<br>Count of Divisions that have more than 2 of each: schools with high proportions of students eligible for free lunch, and schools with low proportions.<br>32<br></blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-3d407961-1106-40f0-b186-3a8177725932.png 789w" sizes="(min-width: 720px) 720px"></figure><p>Ttest_indResult(statistic=6.409940604162604, pvalue=6.681972140013619e-07) <br><br><br></p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-f080ab7a-2dc4-43f6-a65f-c1cf9ff92dd3.png 795w" sizes="(min-width: 720px) 720px"></figure><pre><code>#Cell 11
#Create a selection process with divisions that have a difference in mean test outcomes

divisions_with_low_p_values = []

#Count how many divisions have a p-value &lt; .05
for i in divisions_high_and_low:
    #Geeksforgeeks code snippet(GeeksforGeeks, 2022).
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    t,p = stats.ttest_ind(a=X, b=Y, equal_var=False)
    if p &lt; .05:
        divisions_with_low_p_values.append(i)
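#Note (illustrative): running one t-test per division inflates the
#family-wise error rate. A simple Bonferroni correction would tighten the
#threshold to .05 / len(divisions_high_and_low), e.g.:
#    if p &lt; .05 / len(divisions_high_and_low):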


print(&quot;These divisions have multiple schools with high proportions of students eligible for free lunch and multiple\
 schools with low proportions of students eligible for free lunch.&quot;)
print(&quot;They also show a statistically significant difference in mean test scores.\n&quot;)

pprint(divisions_with_low_p_values)  

#Visualize density plot on divisions with statistical differences in populations

for i in divisions_with_low_p_values:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    #Plot both distributions, then print the t-test result after the figure
    X.plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
    Y.plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
    # Labels
    labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
    plt.title(i)
    plt.legend(labels)
    plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
    plt.ylabel(&apos;Score Probability Density&apos;)
    plt.show()
    #Code snippet for t-test derived from website (GeeksforGeeks, 2022).
    print(stats.ttest_ind(a=X, b=Y, equal_var=False), &quot;\n\n&quot;)</code></pre><blockquote>These divisions have multiple schools with high proportions of students eligible for free lunch and multiple schools with low proportions of students eligible for free lunch.<br>They also show a statistically significant difference in mean test scores.<br><br>[&apos;Arlington County &apos;,<br> &apos;Hampton City &apos;,<br> &apos;Chesapeake City &apos;,<br> &apos;Albemarle County &apos;,<br> &apos;Newport News City &apos;,<br> &apos;Alexandria City &apos;,<br> &apos;Montgomery County &apos;,<br> &apos;Spotsylvania County &apos;,<br> &apos;Henrico County &apos;,<br> &apos;Suffolk City &apos;,<br> &apos;Chesterfield County &apos;,<br> &apos;Loudoun County &apos;,<br> &apos;Norfolk City &apos;,<br> &apos;Virginia Beach City &apos;,<br> &apos;Prince William County &apos;,<br> &apos;Williamsburg-James City County &apos;,<br> &apos;Fairfax County &apos;]</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-9d7bf0d2-a1b1-4c41-94f2-5c5969c6cdc4.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=8.461103427033763, pvalue=4.2564535128111695e-10) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-de49ab3c-1f35-4c3c-aa7e-df5d48e4b3d8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=3.7223282464849152, pvalue=0.0022359663924919103) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-e9ec00ca-4966-4bce-a803-b0a4acc65372.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.806338810130206, pvalue=0.0014819298232501942) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-970cf9cd-069b-4e9a-b053-b29ffa7300a6.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.5438908135728764, 
pvalue=0.027946988309494193) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-c3e8ab5c-11da-4ec0-bc75-13f8eb1a1f2a.png 795w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.625164815841848, pvalue=0.023337646472250697) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-653b9d9b-c7fa-4fa7-8426-9dfd3818a4d8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=3.373751982093548, pvalue=0.0024706357289852967) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-97283bce-2759-46a2-a89d-07d713f2367f.png 795w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=8.196062297626334, pvalue=1.3567694425275851e-11) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="792" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-4f1ec042-7d1b-4f92-9dad-f2894efce352.png 792w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4614922901402174, pvalue=0.029589168699507278) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-d6ec8632-c0ce-4ff4-b4fb-b325b5a504c8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.370459142644502, pvalue=0.00011089458080339332) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="791" 
height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-7b3ee220-9bb7-41b7-bcd6-f9bb28b7ad10.png 791w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=4.0471544101064145, pvalue=0.00411407734704421) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-0a00391f-e935-4e6f-beee-2165a1fb4152.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=7.659633097156011, pvalue=6.205830981719048e-08) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-697425e5-9b9d-4db7-814f-b6dcd47dcf8e.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=6.415834167155425, pvalue=5.255180694823957e-07) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-18892236-e1f4-4773-8de2-4eb2cf731577.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=6.162598621901206, pvalue=1.0118260722466614e-07)</blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="783" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-45e43359-72e1-496d-9ec1-e1f25060afa2.png 783w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.5713849399256987, pvalue=0.02332371134812441) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="795" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-1b7e3ed1-0eef-4628-95c5-d12773e884f4.png 795w" sizes="(min-width: 720px) 
720px"></figure><blockquote>Ttest_indResult(statistic=9.026177677291102, pvalue=3.656434012894371e-14) </blockquote><pre><code>#Cell 12
#Generate list of remaining divisions that didn&apos;t have low p-values, but had both types of schools.
#Subtract &quot;divisions_with_low_p_values&quot; from &quot;divisions_high_and_low&quot; using symmetric_difference() method (Hadzhiev, n.d.).
other_divisions = list(set(divisions_high_and_low).symmetric_difference(divisions_with_low_p_values))
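#Because divisions_with_low_p_values was built from divisions_high_and_low,
#symmetric_difference() here is equivalent to the plain set difference
#set(divisions_high_and_low) - set(divisions_with_low_p_values).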


#Visualize density plot on divisions that did not show a statistically significant difference

for i in other_divisions:
    X = low_percentage_df.loc[low_percentage_df[&apos;Division Name&apos;] == i].total_score
    Y = high_percentage_df.loc[high_percentage_df[&apos;Division Name&apos;] == i].total_score
    #Plot both distributions, then print the t-test result after the figure
    X.plot(kind=&apos;kde&apos;, c=&apos;red&apos;, linewidth=3, figsize=[13,6])
    Y.plot(kind=&apos;kde&apos;, c=&apos;blue&apos;, linewidth=3, figsize=[13,6])
    # Labels
    labels = [&apos;Low Proportion of Free Lunch Eligible Students&apos;, &apos;High Proportion of Free Lunch Eligible Students&apos;]
    plt.title(i)
    plt.legend(labels)
    plt.xlabel(&apos;Reported Standards of Learning Score&apos;)
    plt.ylabel(&apos;Score Probability Density&apos;)
    plt.show() 
    #Code snippet for t-test derived from website (GeeksforGeeks, 2022).
    print(stats.ttest_ind(a=X, b=Y, equal_var=False), &quot;\n\n&quot;)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-82f40d99-eae9-4299-9c4e-6e5513ec2e40.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4205691334017856, pvalue=0.08475385838978879) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-56e85326-ae83-4fb8-a1b2-09d19ee72a4a.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.2570834615381291, pvalue=0.8082157582528549) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-edeaf422-8083-4765-871d-b7f235246a56.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.6151089286608511, pvalue=0.5499385169889386) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-ff7ba839-dafc-4f45-92be-808c88b6e4c0.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.43228762054052156, pvalue=0.694690937976563) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-c09c3c9e-5760-4894-9ac8-87ff7537fab2.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.006819058031850852, pvalue=0.9946954512431896) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png" class="kg-image" alt="Data Analytics Capstone - Code 
Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-43076e66-1d64-4545-9d92-f9f74cf48351.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.42914107542287455, pvalue=0.6827212468291588) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-9507aae5-d546-45e5-94fb-4133440b93d5.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.496028353478178, pvalue=0.23104670043725686) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-cb3b4a47-58dc-4562-a472-cd621d3cc751.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.9439871944160111, pvalue=0.35882922068937606) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="783" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-12de8e2e-611b-43ae-aafb-35ebdbeb0b72.png 783w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=2.4589999935373332, pvalue=0.05995075594533983) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-57966431-feaa-47d9-b652-3737b38cf78d.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.27940572225992494, pvalue=0.794558661937718) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-ecc2fbac-1138-4b33-b480-ddaf14a55b0e.png 789w" 
sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-0.5999889539606617, pvalue=0.5672948520140612) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-a2044116-7ed8-416d-95a2-c04a9af1bc41.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=0.5857229902989585, pvalue=0.577275098608178) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-09bdd8d8-b530-44cb-a219-794af110e1c8.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.2074039981491638, pvalue=0.3369371944307884) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-58066ed8-0b2a-457c-ba39-358803f03020.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=1.3789234479354537, pvalue=0.2061614549792278) </blockquote><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png" class="kg-image" alt="Data Analytics Capstone - Code Reference" loading="lazy" width="789" height="387" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png 600w, https://www.williamameyer.com/content/images/2025/07/data-src-image-bda63aee-044d-41dc-9e24-473ee3a80ba5.png 789w" sizes="(min-width: 720px) 720px"></figure><blockquote>Ttest_indResult(statistic=-1.0811022683078786, pvalue=0.3315247080127098) </blockquote><p></p><p></p><p>References:</p><p>Bloom, G., [GaryMBloom]. (2014, September 4). Convert percent string to float in pandas read_csv. Stack Overflow. <a href="https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv">https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv</a></p><p>Bobbit, Z. (2020, July 10). How to Perform Levene&#x2019;s Test in Python. Statology. <a href="https://www.statology.org/levenes-test-python/">https://www.statology.org/levenes-test-python/</a></p><p>Bobbit, Z. (2021, November 6). How to Merge Two Pandas DataFrames on Index. Statology. <a href="https://www.statology.org/pandas-merge-on-index/">https://www.statology.org/pandas-merge-on-index/</a></p><p>Chattar, P. (2021, September 2). 
Find common elements in two lists in python. Java2Blog. <a href="https://java2blog.com/find-common-elements-in-two-lists-python/">https://java2blog.com/find-common-elements-in-two-lists-python/</a></p><p>GeeksforGeeks. (2022, October 17). How to Conduct a Two Sample T Test in Python. <a href="https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/">https://www.geeksforgeeks.org/how-to-conduct-a-two-sample-t-test-in-python/</a></p><p>Goyal, C. (2022, August 25). Feature Engineering &#x2013; How to Detect and Remove Outliers (with Python Code). Analytics Vidhya. <a href="https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/">https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/</a></p><p>Hadzhiev, B. (n.d.). Remove common elements from two Lists in Python | bobbyhadz. Blog - Bobby Hadz. <a href="https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists">https://bobbyhadz.com/blog/python-remove-common-elements-from-two-lists</a></p><p>Harikrishnan, R., [Harikrishnan R]. (2020, January 26). Dropping rows with values outside of boundaries. Stack Overflow. <a href="https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries">https://stackoverflow.com/questions/59914605/dropping-rows-with-values-outside-of-boundaries</a></p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics. <a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/">https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</a></p><p>Inada, I. (Ed.). (n.d.). Replacing values with NaNs in Pandas DataFrame. <a href="https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe">https://www.skytowner.com/explore/replacing_values_with_nans_in_pandas_dataframe</a></p><p>Komali. (2021, December 27). Pandas Replace Values based on Condition. Spark by {Examples}. <a href="https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/">https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/</a></p><p>Matthew. (2013, November 25). Pandas: Get duplicated indexes. Stack Overflow. <a href="https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes">https://stackoverflow.com/questions/20199129/pandas-get-duplicated-indexes</a></p>]]></content:encoded></item><item><title><![CDATA[Data Analytics Capstone - Summary]]></title><description><![CDATA[<p><strong>Statement of the problem and the hypothesis&#xA0;</strong></p><p>Measuring inequality in public schools is not a trivial undertaking. Equality can be simply defined in theory, but in practice, there are many factors to consider. 
A reasonable goal is to always work to provide equal education to students of all races,</p>]]></description><link>https://www.williamameyer.com/data-analytics-capstone-summary/</link><guid isPermaLink="false">68683f5246755058329ed798</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Fri, 04 Jul 2025 21:09:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1581726707445-75cbe4efc586?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGNsYXNzcm9vbXxlbnwwfHx8fDE3NTE1ODY2NDh8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=2000" alt="Data Analytics Capstone - Summary"><p><strong>Statement of the problem and the hypothesis&#xA0;</strong></p><p>Measuring inequality in public schools is not a trivial undertaking. Equality can be simply defined in theory, but in practice, there are many factors to consider. A reasonable goal is to always work to provide equal education to students of all races, genders, sexual orientations, financial backgrounds, and any other societal stratifications where one may find examples of inequality. This study focuses on financial backgrounds. There also exists a multitude of measures to assess equality. One could measure the severity of punishments for one group compared to another, how teachers interact with different groups, the implicit bias found in history lessons for different groups, student educational outcomes of different groups, and numerous other measures for equality. This study will focus on standardized test performance. The research question is &#x201C;Does the proportion of students who are eligible for free lunch in a school have a statistically significant effect on standardized test scores?&#x201D;. Free lunch eligibility is used as a measure of household income. The population of students that are eligible for free lunch represents households that are at the federal poverty level multiplied by 1.3 or lower (Department of Agriculture, Food and Nutrition Service, 2015). The hypothesis is that the means of these two groups&#x2019; standardized test scores have a statistically significant difference.</p><p>This hypothesis will be tested using a two-sample t-test. Regarding this two-sample t-test, we have a null hypothesis that mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch with the alternate hypothesis being that mean test scores for schools with more students eligible for free lunch are statistically different from schools with fewer students eligible for free lunch.</p><p><strong>Summary of the data-analysis process</strong></p><p>The data comes from a data set containing information about Virginia public schools compiled by University Libraries, Virginia Tech, and Bradburn (2021). The data included the variables of interest, shown in the table below, among many other variables. 
The most recent year that contained both test scores and free lunch information was the 2015-2016 school year, so this time frame was used for the analysis.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="317" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image.png 600w, https://www.williamameyer.com/content/images/2025/07/image.png 793w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This extracted data was the focus of this study. Of note is free_per_1516, a field calculated from the data to act as the independent variable.</span></figcaption></figure><p>Before beginning the analysis, the data was extracted and prepared. The data of interest was put into a Pandas data frame, null values were standardized, data was matched where school IDs were equal, nulls were removed, variables were converted to the proper data types, the test scores were added together into an aggregate field &#x201C;total_score&#x201D;, and outliers in &#x201C;total_score&#x201D; were capped at 1.5 times the interquartile range.</p><p>The &#x201C;free_per_1516&#x201D; variable was analyzed to find its median. Schools at or above the median value of 0.37285 were placed into one data frame as group &#x201C;1&#x201D;; the remaining schools were placed in another data frame as group &#x201C;0&#x201D;.</p><p>With two populations, each with their respective standardized test scores available, we can start looking into whether these groups have statistically different mean test scores. Before beginning the test, Levene&#x2019;s test was used to check for equal variance. The result is shown below:</p><pre><code>#use levene test to check for equality of variance (Bobbit, 2020).
stats.levene(low_percentage_df.total_score, high_percentage_df.total_score, center=&apos;median&apos;)
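#A p-value this small argues against equal variances, so the subsequent
#two-sample t-test is run as Welch&apos;s t-test (equal_var=False).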
</code></pre><blockquote>LeveneResult(statistic=54.81184644296979, pvalue=2.0681326644527468e-13)</blockquote><p>With this test, we can conclude that each group likely has a different variance in its respective test scores. A t-test operates with the assumption that both groups have equal variances, yet this is not a critical assumption for a t-test, so the comparison using a two-sample t-test will still be carried out (How to Compare Two Means When the Groups Have Different Standard Deviations. - FAQ 1349 - GraphPad, n.d.). The &#x201C;scipy.stats&#x201D; library was used to carry out the t-test and compare the two groups.</p><p><strong>Outline of the findings</strong></p><p>The output of the t-test and a visualization of the probability density of test scores are shown below:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-1.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="395" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-1.png 600w, https://www.williamameyer.com/content/images/2025/07/image-1.png 793w" sizes="(min-width: 720px) 720px"></figure><p>The very low p-value indicates that we can reject the null hypothesis that mean test scores for schools with more students eligible for free lunch are not statistically different from schools with fewer students eligible for free lunch. The probability density plot highlights the performance gap: schools with a low proportion of free lunch eligible students perform better on standardized tests.</p><p><strong>Explanation of the limitations of the techniques and tools used</strong></p><p>A two-sample t-test, by definition, compares the means of two groups. This limits our mean comparison to two groups. In this case, above-median and below-median proportions were chosen to define the school groups, but other groupings could be created, perhaps dividing the data into quartiles, and compared using an ANOVA test.</p><p>Another limitation is that this t-test does not explore the relationship between increases in a school&#x2019;s percentage of students who are eligible for free lunch and the corresponding decrease in standardized test scores. In the data analysis phase of the study, the following heatmap and scatterplot were generated from the data. They suggest a correlation, but this is not explored by the t-test.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2025/07/image-2.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="643" height="371" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-2.png 600w, https://www.williamameyer.com/content/images/2025/07/image-2.png 643w"></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-3.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="583" height="410"><figcaption><span style="white-space: pre-wrap;">Notice the pattern. Generally, as the percentage of the student body that receives free lunch increases, the test score decreases, with obvious and interesting outliers.</span></figcaption></figure><p><strong>Summary of proposed actions</strong></p><p>Further study of counties that have both types of schools, those with high and those with low median poverty levels, is an effective next step. 
In this subset of counties, one could determine which counties have unequal test scores between the two groupings and which have statistically similar test scores. Exploring the differences in these counties may lead to new ideas as to why this inequality exists. It may bring about ideas or policies that can be carried out in counties whose students are suffering from educational inequalities.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-4.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="441" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-4.png 600w, https://www.williamameyer.com/content/images/2025/07/image-4.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example of a school district with both above-median and below-median schools with statistically different mean test scores.</em></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2025/07/image-5.png" class="kg-image" alt="Data Analytics Capstone - Summary" loading="lazy" width="793" height="414" srcset="https://www.williamameyer.com/content/images/size/w600/2025/07/image-5.png 600w, https://www.williamameyer.com/content/images/2025/07/image-5.png 793w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Example of a school district with both above-median and below-median schools with statistically similar mean test scores.</em></i></figcaption></figure><p>Further research should also be carried out in counties with high levels of inequality to see if there is a systemic housing issue or other contributing factors causing income segregation across schools. The focus of this approach would be analyzing school districts, quantifying the income levels in those districts, and trying to find effective ways to limit the extremes. The extremes here would be schools with very low diversity in student incomes compared to the county&#x2019;s income diversity levels.</p><p><strong>Expected benefits of the study</strong></p><p>This study shows that Virginia Public Schools have statistically different test scores based on the proportion of low-income students in a particular school. Knowing that there is a quantifiable problem to solve is a benefit. In 2022, Virginia public schools had a total of 1,296,817 students (<em>Public Education in Virginia</em>, n.d.). Of these students, &#x201C;Virginia has over 512,000 economically disadvantaged students in its public schools&#x201D; (<em>Weighing Support for Virginia&#x2019;s Students</em>, 2021). Increasing fairness and equality in the school system would be a benefit to all students, but certainly to those who are economically disadvantaged. Examining and applying this study at the state level can help decision-makers allocate funds appropriately in support of promoting income equality in schools.<br></p><p><strong>References</strong></p><p>Department of Agriculture, Food and Nutrition Service. (2015, March 26). <em>Federal Register: Vol. 80, No. 61</em>. The U.S. 
Department of Agriculture.<a href="https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf"><u>https://www.govinfo.gov/content/pkg/FR-2015-03-31/pdf/2015-07358.pdf</u></a></p><p>How to compare two means when the groups have different standard deviations. - FAQ 1349 - GraphPad. (n.d.). GraphPad by Dotmatics.<a href="https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/"><u>https://www.graphpad.com/support/faq/how-to-compare-two-means-when-the-groups-have-different-standard-deviations/</u></a></p><p><em>Public education in Virginia</em>. (n.d.). Ballotpedia. https://ballotpedia.org/Public_education_in_Virginia</p><p>University Libraries, Virginia Tech, &amp; Bradburn, I. (2021, May 18).<em>Characterizing Virginia Public Schools (public dataset)</em>. Figshare.<a href="https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092"><u>https://figshare.com/articles/dataset/Characterizing_Virginia_Public_Schools_public_dataset_/14097092</u></a>CC license:<a href="https://creativecommons.org/licenses/by/3.0/us/"><u>CC BY 3.0 US</u></a></p><p><em>Weighing Support for Virginia&#x2019;s Students</em>. (2021, April 13). The Commonwealth Institute. https://thecommonwealthinstitute.org/research/weighing-support-for-virginias-students/</p><p></p>]]></content:encoded></item><item><title><![CDATA[Reinforcement Learning with ML Agents - Part 1/? - Unity, C#, Prompt Engineering, Git]]></title><description><![CDATA[<p>This project will be to simulate a warehouse environment with the aims of training a series of agents.  One agent in charge of navigating the warehouse effectively, another agent in charge of stacking boxes on the pallet optimally, and another agent in charge of utilizing the trained agents to manage</p>]]></description><link>https://www.williamameyer.com/reinforcement-learning-with-ml-agents-part-1-unity-c-prompt-engineering-git/</link><guid isPermaLink="false">67e1c33946755058329ed763</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 24 Mar 2025 20:59:06 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2025/03/thumbnailSmaller.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2025/03/thumbnailSmaller.png" alt="Reinforcement Learning with ML Agents - Part 1/? - Unity, C#, Prompt Engineering, Git"><p>This project will be to simulate a warehouse environment with the aims of training a series of agents.  One agent in charge of navigating the warehouse effectively, another agent in charge of stacking boxes on the pallet optimally, and another agent in charge of utilizing the trained agents to manage the location of items in the warehouse.  </p><figure class="kg-card kg-embed-card"><iframe src="https://player.vimeo.com/video/1068988824?app_id=122963" width="426" height="240" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media" title="Simulated Warehouse Environment"></iframe></figure><p>As this is a larger project for an individual with limited time, I will be breaking the project into manageable chunks.</p><p>The first part of the project was to create the simulated environment to train the navigation agent.</p><p>Full Project as I understand it now:</p>
<!--kg-card-begin: html-->
 <ul>
  <li>Preliminary Game Environment</li>
  <li>Train the Navigation Agent in the draft environment</li>
  <li>Reconfigure Environment for Box Stacking</li>
  <li>Train Stacking Agent</li>
  <li>Train the Warehouse Profiler Agent to optimally place the items in the correct slots for efficient movement</li>
</ul>


<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[Superset - Agriculture Crop Production Visualization]]></title><description><![CDATA[<p>I&apos;ll be using superset to visualize the agricultural crop production dataset that I used in my <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">previous R post</a>.  I&apos;ll make a dashboard that can compare land use efficiency between 1961 and 2018.</p><p>I will do an overview of the work done to create the dashboard,</p>]]></description><link>https://www.williamameyer.com/superset-agriculture-crop-production-visualization/</link><guid isPermaLink="false">66e0ebfb44502a60aa65af52</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Wed, 11 Sep 2024 01:03:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/superset-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/superset-1.webp" alt="Superset - Agriculture Crop Production Visualization"><p>I&apos;ll be using superset to visualize the agricultural crop production dataset that I used in my <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">previous R post</a>.  I&apos;ll make a dashboard that can compare land use efficiency between 1961 and 2018.</p><p>I will do an overview of the work done to create the dashboard, If you would prefer you can  <s>Jump to the dashboard link here</s>.  <em>**Update my self hosted superset container is no longer running.  I have included videos of the dashboard in the post instead of embedding directly to the server.</em></p><p>Superset dashboards are made of charts and layout elements.  I started by making the charts for each element.  I would need a chart for &apos;Tonnes Produced in 1961&apos;, &apos;Acres used in 1961&apos;, &apos;Tonnes Produced in 1961&apos;, and &apos;Acres used in 2018&apos;.  In addition to these charts, I want to include a visualization that would compare the individual crop&apos;s land use productivity.  Also, a large metric that shows the percentage increase of all crops selected.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/Pie-Chart-Settings-1.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="383" height="437"></figure>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097224631?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Superset Pie Chart Example"></iframe>
<!--kg-card-end: html-->
<p>The Tonnes Per Acre chart would be more involved, as it requires some custom SQL and a few filters.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/Numerical-Chart-for-increase-in-productivity-2.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="368" height="608"></figure><p>SQL for the metrics, filters, and sorting is below (full SQL is not required for these settings, as Superset takes care of much of it):</p><pre><code class="language-SQLish">--Metrics
--1961 Tonnes Per Acre
SUM(production1961) / SUM(areaharvested1961)

--2018 Tonnes Per Acre
SUM(production2018) / SUM(areaharvested2018)

--Increase in productivity
ABS(
  (SUM(production1961)/SUM(areaharvested1961)) -
  (SUM(production2018)/SUM(areaharvested2018)))
  /(SUM(production1961)/SUM(areaharvested1961))
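-- Note (illustrative): ABS() reports the magnitude of the change, so a crop
-- whose productivity fell would also show a positive &quot;increase&quot;; dropping
-- ABS() would preserve the sign.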

 /*Filters
For the filters I wanted to do two things: include only items that had values in both 1961 and 2018, and avoid division-by-zero errors. Filtering where the production and area harvested are greater than 0 accomplished both. */

--Sorting
ABS(
 (SUM(production1961)/SUM(areaharvested1961)) -
 (SUM(production2018)/SUM(areaharvested2018)))
 /(SUM(production1961)/SUM(areaharvested1961))
--Making the sort order descending is done through the User Interface.
</code></pre>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097224914?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Superset Crop Production Increase Table View"></iframe>
<!--kg-card-end: html-->
<p>This sorts out the tables and charts, the last thing I want to include on my dashboard is a large number that indicates the total efficiency increase.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/big-number-settings-1.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="377" height="422"></figure>
<p>These are all of the elements of the dashboard save for the filters.  So let&apos;s take a look at the filters we will use: by Continent, by Country, by tonnes produced, and by specific crop.</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/09/filters.png" class="kg-image" alt="Superset - Agriculture Crop Production Visualization" loading="lazy" width="297" height="707"></figure><p>With the filters and the dashboard&apos;s elements complete, the dashboard is ready to explore at whatever level of detail the user needs.</p><h3 id="superset-dashboard">Superset dashboard</h3><p>I strongly recommend viewing the dashboard on a larger screen than a mobile device.  You can view the dashboard from the Superset instance by clicking <s>this link</s>.</p>
<!--kg-card-begin: html-->
<iframe src="https://player.vimeo.com/video/1097225257?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" width="1920" height="1080" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" title="Crop Comparison over Time Superset Dashboard"></iframe>
<!--kg-card-end: html-->
]]></content:encoded></item><item><title><![CDATA[R - Agriculture Crop Production EDA - Part 2]]></title><description><![CDATA[<p>We will be answering 5 questions on the <a href="https://data.world/agriculture/crop-production" rel="noreferrer">Agriculture Crop Production data</a>.</p><ol><li><em>Which country in Africa produced the most coffee in 2018?</em></li><li><em>Which country in Africa produced the most coffee per acre in 2018?</em></li><li><em>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</em></li></ol>]]></description><link>https://www.williamameyer.com/r-post-2/</link><guid isPermaLink="false">66d6895a44502a60aa65aea6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Tue, 03 Sep 2024 07:38:13 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280-1.jpg" alt="R - Agriculture Crop Production EDA - Part 2"><p>We will be answering 5 questions on the <a href="https://data.world/agriculture/crop-production" rel="noreferrer">Agriculture Crop Production data</a>.</p><ol><li><em>Which country in Africa produced the most coffee in 2018?</em></li><li><em>Which country in Africa produced the most coffee per acre in 2018?</em></li><li><em>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</em></li><li><strong>Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</strong></li><li><strong>Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</strong></li></ol><p><strong>*</strong>These final two questions will be answered, for the previous questions, and an explanation of the data, please review <a href="https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/" rel="noreferrer">the previous post</a>.</p><hr><h3 id="4-which-country-in-the-world-not-specifically-africa-saw-the-highest-increase-in-the-amount-of-coffee-grown-per-acre-since-1961"><strong>4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</strong> </h3><p>Let&apos;s jump into the solution</p><pre><code class="language-R">library(dplyr)

#Import all of the datasets
t1 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)
t2 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv&quot;)
t3 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv&quot;)
t4 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv&quot;)
t5 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv&quot;)
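#Note: rbind() below assumes all five continent files share the same
#column layout, which these exports do.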

#Combine the datasets
q4 &lt;- rbind(t1,t2,t3,t4,t5) %&gt;%
  #Keep necessary variables
  select(&apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Element&apos;,&apos;Y1961&apos;,&apos;Y2018&apos;) %&gt;%
  #Filter &quot;Item&quot; and &quot;Element&quot; on appropriate values
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Yield&apos;) %&gt;%
  #Create a new variable and apply the percent change formula
  mutate(PercentChange = ((Y2018 - Y1961) / Y1961 * 100)) %&gt;%
  #Sort with the highest percent change on top
  arrange(desc(PercentChange))
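#Note (illustrative): any Area with Y1961 of 0 or NA would produce Inf or NA
#in PercentChange; filtering those rows out first would be a safer default.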

#round for better formatting
q4$PercentChange &lt;- round(q4$PercentChange, 2)

</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="26"><col width="219"><col width="91"><col width="45"><col width="62"><col width="51"><col width="51"><col width="105"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="219" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="45" align="LEFT">Unit</td>
			<td width="62" align="LEFT">Element</td>
			<td width="51" align="LEFT">Y1961</td>
			<td width="51" align="LEFT">Y2018</td>
			<td width="105" align="LEFT">PercentChange</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Viet Nam</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="1934" sdnum="1033;">1934</td>
			<td align="RIGHT" sdval="26117" sdnum="1033;">26117</td>
			<td align="RIGHT" sdval="1250.41" sdnum="1033;">1250.41</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Thailand</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="600" sdnum="1033;">600</td>
			<td align="RIGHT" sdval="5725" sdnum="1033;">5725</td>
			<td align="RIGHT" sdval="854.17" sdnum="1033;">854.17</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="1833" sdnum="1033;">1833</td>
			<td align="RIGHT" sdval="12886" sdnum="1033;">12886</td>
			<td align="RIGHT" sdval="603" sdnum="1033;">603</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Malaysia</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="3391" sdnum="1033;">3391</td>
			<td align="RIGHT" sdval="22810" sdnum="1033;">22810</td>
			<td align="RIGHT" sdval="572.66" sdnum="1033;">572.66</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">China, mainland</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5000" sdnum="1033;">5000</td>
			<td align="RIGHT" sdval="29405" sdnum="1033;">29405</td>
			<td align="RIGHT" sdval="488.1" sdnum="1033;">488.1</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="4175" sdnum="1033;">4175</td>
			<td align="RIGHT" sdval="23345" sdnum="1033;">23345</td>
			<td align="RIGHT" sdval="459.16" sdnum="1033;">459.16</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="LEFT">Honduras</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="2767" sdnum="1033;">2767</td>
			<td align="RIGHT" sdval="11195" sdnum="1033;">11195</td>
			<td align="RIGHT" sdval="304.59" sdnum="1033;">304.59</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="LEFT">Nicaragua</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="2762" sdnum="1033;">2762</td>
			<td align="RIGHT" sdval="10672" sdnum="1033;">10672</td>
			<td align="RIGHT" sdval="286.39" sdnum="1033;">286.39</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="LEFT">Brazil</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5084" sdnum="1033;">5084</td>
			<td align="RIGHT" sdval="19060" sdnum="1033;">19060</td>
			<td align="RIGHT" sdval="274.9" sdnum="1033;">274.9</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="LEFT">Lao People&apos;s Democratic Republic</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5000" sdnum="1033;">5000</td>
			<td align="RIGHT" sdval="18611" sdnum="1033;">18611</td>
			<td align="RIGHT" sdval="272.22" sdnum="1033;">272.22</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="6379" sdnum="1033;">6379</td>
			<td align="RIGHT" sdval="20706" sdnum="1033;">20706</td>
			<td align="RIGHT" sdval="224.6" sdnum="1033;">224.6</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="12" sdnum="1033;">12</td>
			<td align="LEFT">Ghana</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5667" sdnum="1033;">5667</td>
			<td align="RIGHT" sdval="16959" sdnum="1033;">16959</td>
			<td align="RIGHT" sdval="199.26" sdnum="1033;">199.26</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="13" sdnum="1033;">13</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="5833" sdnum="1033;">5833</td>
			<td align="RIGHT" sdval="16506" sdnum="1033;">16506</td>
			<td align="RIGHT" sdval="182.98" sdnum="1033;">182.98</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="14" sdnum="1033;">14</td>
			<td align="LEFT">French Polynesia</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="LEFT">Yield</td>
			<td align="RIGHT" sdval="869" sdnum="1033;">869</td>
			<td align="RIGHT" sdval="2268" sdnum="1033;">2268</td>
			<td align="RIGHT" sdval="160.99" sdnum="1033;">160.99</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
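<p>As a quick sanity check of the top row: Viet Nam&apos;s yield rose from 1934 to 26117 hg/ha, and applying the percent change formula by hand reproduces the table value.</p><pre><code class="language-R">#Hand check of the Viet Nam row
round((26117 - 1934) / 1934 * 100, 2)  #1250.41, matching PercentChange</code></pre>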
<h3 id="5-which-continent-or-rather-major-grouping-as-presented-in-the-5-data-sets-has-the-largest-increase-in-the-amount-of-coffee-grown-per-acre-since-1961"><strong>5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</strong></h3><p>This required a bit of working to do.  The data needed to be lagged, aggregated, and aggregated some more to get the information on the continental level and &apos;subtractable&apos; and divisible at a row level.</p><pre><code class="language-R">library(dplyr)

#PART 1
#Import all of the datasets
#Before the merge each individual dataset was given a &apos;Continent&apos; column


t1 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)
t1$Continent = &quot;Africa&quot;
t2 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv&quot;)
t2$Continent = &quot;Americas&quot;
t3 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv&quot;)
t3$Continent = &quot;Asia&quot;
t4 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv&quot;)
t4$Continent = &quot;Europe&quot;
t5 &lt;- read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv&quot;)
t5$Continent = &quot;Oceania&quot;




#PART 2
#This part of the code is where I filter the data and use the lag function to essentially pivot it.

#Combine the datasets
q5step1 &lt;- rbind(t1,t2,t3,t4,t5) %&gt;%
  #Select just coffee, and Element &apos;Production&apos; OR &apos;Area harvested&apos; 
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Production&apos; | Element ==&apos;Area harvested&apos;) %&gt;%
  #Select a few variables for now, more will be removed later
  select(&apos;Continent&apos;, &apos;Area&apos;, &apos;Unit&apos;, &apos;Element&apos;, &apos;Y1961&apos;, &apos;Y2018&apos;) %&gt;%
  #We will lag the production of both years partitioned by the &apos;Area&apos;
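  #(within each Area, the &apos;Area harvested&apos; row comes before the &apos;Production&apos; row, so lag() copies the harvested area onto the production row)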
  group_by(Area) %&gt;%
  mutate(&apos;1961AreaHarvested&apos; = lag(Y1961, n=1, order_by=Area)) %&gt;%
  mutate(&apos;2018AreaHarvested&apos; = lag(Y2018, n=1, order_by=Area)) %&gt;%
  #Renaming Y1961, and Y2018 Columns to denote their new role as Production values
  rename( &quot;1961Production&quot; = &quot;Y1961&quot;) %&gt;%
  rename( &quot;2018Production&quot; = &quot;Y2018&quot;) %&gt;%
  #We will filter by Element == &apos;Production&apos;... this will help with the data pivot
  #It will remove the &apos;Area Harvested&apos; elements from the 1961/2018_Production variables
  filter(Element == &apos;Production&apos;) %&gt;%
  #Clean data up a bit and continue in another pipe, because this has become cumbersome
  #We will do maths and such in another pipe
  group_by(Continent) %&gt;%
  select(&apos;Continent&apos;, &apos;1961Production&apos;,&apos;1961AreaHarvested&apos;,&apos;2018Production&apos;,&apos;2018AreaHarvested&apos;)



#PART 3
#The data is further aggregated by continent, yield is calculated as production / area harvested for each year, and then the percent change formula is applied.

q5step2 &lt;- q5step1 %&gt;%
  group_by(Continent) %&gt;%
  summarize_at(vars(&apos;1961Production&apos;,&apos;1961AreaHarvested&apos;,&apos;2018Production&apos;,&apos;2018AreaHarvested&apos;), list(name = sum), na.rm=T )
 
#Create the yield for each continent
q5step3 &lt;- q5step2 %&gt;%
  #Calculate yield, the original document uses hectares and hectograms... but this will be a percent change
  #the unit is not important so long as it&apos;s the same in 1961 and 2018
  mutate(Yield2018 = .$&apos;2018Production_name&apos; / .$&apos;2018AreaHarvested_name&apos;) %&gt;%
  mutate(Yield1961 = .$&apos;1961Production_name&apos; / .$&apos;1961AreaHarvested_name&apos;) %&gt;%
  select(Continent,Yield1961,Yield2018)

#PART 4 FINAL OUTPUT
#Output the final table with the yield-percent-change per continent in descending order

q5 &lt;- q5step3 %&gt;%
  #Apply the percent change calculation
  mutate(Yield_Change_Percent = (Yield2018 - Yield1961)/ Yield1961 * 100) %&gt;%
  #Nice round Numbers
  mutate(Yield_Change_Percent = round(Yield_Change_Percent,2)) %&gt;%
  #sorted by highest yield change to lowest
  arrange(desc(Yield_Change_Percent)) %&gt;%
  #Choose only the variables needed
  select(Continent,Yield_Change_Percent)</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="69"><col width="150"></colgroup>
	<tbody>
		<tr>
			<td width="69" height="17" align="LEFT">Continent</td>
			<td width="150" align="LEFT">Yield_Change_Percent</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Oceania</td>
			<td align="RIGHT" sdval="163.29" sdnum="1033;">163.29</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Americas</td>
			<td align="RIGHT" sdval="137.01" sdnum="1033;">137.01</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Asia</td>
			<td align="RIGHT" sdval="111.31" sdnum="1033;">111.31</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Africa</td>
			<td align="RIGHT" sdval="14.63" sdnum="1033;">14.63</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Europe</td>
			<td align="LEFT">NA</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
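<p>Before wrapping up, a note on the Europe row (my own reasoning, so treat it as an educated guess): the European file has essentially no coffee figures for these years, and summing nothing but NAs with na.rm=TRUE returns 0, so the yield calculation ends up dividing 0 by 0.</p><pre><code class="language-R">#Why Europe comes out as NA (a sketch of the mechanism)
sum(c(NA, NA), na.rm = TRUE)  #0 -- summing only NAs with na.rm=TRUE gives 0
0 / 0                         #NaN, which the table renders as NA</code></pre>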
<p>So with all that we have our final (very small) output.  </p><p>This was my first time using RStudio, or the R language in general, so I am early on the learning curve.  I will definitely revisit R at another time.</p>]]></content:encoded></item><item><title><![CDATA[R - Agriculture Crop Production EDA - Part 1]]></title><description><![CDATA[<p>I found a dataset on <a href="https://data.world/agriculture/crop-production" rel="noreferrer">data.world</a> that had a great deal of information on country crop yields.  It contained 5 data tables divided into different groupings including Africa, The Americas, Asia, Europe, and Oceania.  It also contained 3 more tables that helped explain the dataset and offer context about</p>]]></description><link>https://www.williamameyer.com/r-agriculture-crop-production-eda-specifically-on-coffee/</link><guid isPermaLink="false">66d66aff44502a60aa65ae1d</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Tue, 03 Sep 2024 04:00:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/09/coffee-7561288_1280.jpg" alt="R - Agriculture Crop Production EDA - Part 1"><p>I found a dataset on <a href="https://data.world/agriculture/crop-production" rel="noreferrer">data.world</a> that had a great deal of information on country crop yields.  It contained 5 data tables divided into different groupings including Africa, The Americas, Asia, Europe, and Oceania.  It also contained 3 more tables that helped explain the dataset and offer context about the data collection, measurements, etc.</p><p>Let&apos;s preview the data concerning Africa.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="76"><col width="218"><col width="74"><col width="299"><col width="98"><col width="102"><col width="53"><col width="70"></colgroup>
	<tbody>
		<tr>
			<td width="76" height="17" align="LEFT">Area Code</td>
			<td width="218" align="LEFT">Area</td>
			<td width="74" align="LEFT">Item Code</td>
			<td width="299" align="LEFT">Item</td>
			<td width="98" align="LEFT">Element Code</td>
			<td width="102" align="LEFT">Element</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="70" align="LEFT">Y1961</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5312" sdnum="1033;">5312</td>
			<td align="LEFT">Area harvested</td>
			<td align="LEFT">ha</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5419" sdnum="1033;">5419</td>
			<td align="LEFT">Yield</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="221" sdnum="1033;">221</td>
			<td align="LEFT">Almonds, with shell</td>
			<td align="RIGHT" sdval="5510" sdnum="1033;">5510</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5312" sdnum="1033;">5312</td>
			<td align="LEFT">Area harvested</td>
			<td align="LEFT">ha</td>
			<td align="RIGHT" sdval="3400" sdnum="1033;">3400</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5419" sdnum="1033;">5419</td>
			<td align="LEFT">Yield</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="45294" sdnum="1033;">45294</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Algeria</td>
			<td align="RIGHT" sdval="515" sdnum="1033;">515</td>
			<td align="LEFT">Apples</td>
			<td align="RIGHT" sdval="5510" sdnum="1033;">5510</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="15400" sdnum="1033;">15400</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Each agricultural product in each country has three entries.  The &apos;Element&apos; column describes the measure.  Three elements are included: &apos;Area harvested&apos;, &apos;Yield&apos;, and &apos;Production&apos;.  &apos;Yield&apos; is derived from the &apos;Area harvested&apos; and &apos;Production&apos; values.</p>
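<p>As a quick check of that relationship, take the Algeria almond rows above: dividing production by area harvested, after converting tonnes to hectograms (1 tonne = 10,000 hg), reproduces the reported yield.</p><pre><code class="language-R">#Check the yield relationship on the Algeria almond rows above
production_hg &lt;- 6000 * 10000   #6000 tonnes of almonds, in hectograms
area_ha &lt;- 13300                #hectares harvested
round(production_hg / area_ha)  #4511 hg/ha, matching the Yield row</code></pre><p>The data for the years is ordered by column as well.</p>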
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="59"><col width="70"><col width="59"><col width="70"><col width="59"><col width="70"><col width="59"><col width="70"></colgroup>
	<tbody>
		<tr>
			<td width="59" height="17" align="LEFT">Y1961F</td>
			<td width="70" align="LEFT">Y1962</td>
			<td width="59" align="LEFT">Y1962F</td>
			<td width="70" align="LEFT">Y1963</td>
			<td width="59" align="LEFT">Y1963F</td>
			<td width="70" align="LEFT">Y1964</td>
			<td width="59" align="LEFT">Y1964F</td>
			<td width="70" align="LEFT">Y1965</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="13300" sdnum="1033;">13300</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="14200" sdnum="1033;">14200</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="13800" sdnum="1033;">13800</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4511" sdnum="1033;">4511</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4507" sdnum="1033;">4507</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="4493" sdnum="1033;">4493</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6000" sdnum="1033;">6000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6400" sdnum="1033;">6400</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="6200" sdnum="1033;">6200</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="RIGHT" sdval="3100" sdnum="1033;">3100</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2800" sdnum="1033;">2800</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2700" sdnum="1033;">2700</td>
			<td align="LEFT">F</td>
			<td align="RIGHT" sdval="2900" sdnum="1033;">2900</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Fc</td>
			<td align="RIGHT" sdval="45161" sdnum="1033;">45161</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="46429" sdnum="1033;">46429</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="46078" sdnum="1033;">46078</td>
			<td align="LEFT">Fc</td>
			<td align="RIGHT" sdval="45348" sdnum="1033;">45348</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="RIGHT" sdval="14000" sdnum="1033;">14000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="13000" sdnum="1033;">13000</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="12441" sdnum="1033;">12441</td>
			<td align="LEFT"><br></td>
			<td align="RIGHT" sdval="13151" sdnum="1033;">13151</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This continues until 2018.  Notice how each year column is followed by one with an &apos;F&apos; at the end?  These flag columns tell us how each value was collected.  The flag definitions can be found in the &apos;Flags.csv&apos; file included in the dataset.  I include it for those who are curious, but it was not used in this data exploration.  I&apos;ll use an ellipsis to denote where I snipped the table.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="38"><col width="865"></colgroup>
	<tbody>
		<tr>
			<td width="38" height="17" align="LEFT">Flag</td>
			<td width="865" align="LEFT">Flags</td>
		</tr>
		<tr>
			<td height="17" align="LEFT"><br></td>
			<td align="LEFT">Official data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">*</td>
			<td align="LEFT">Unofficial figure</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">A</td>
			<td align="LEFT">Aggregate, may include official, semi-official, estimated or calculated data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026; </td>
		</tr>
		<tr>
			<td height="17" align="LEFT">F</td>
			<td align="LEFT">FAO estimate</td>
		</tr>
		<tr>
			<td height="18" align="LEFT">Fb</td>
			<td align="LEFT">Data obtained as a balance</td>
		</tr>
		<tr>
			<td height="18" align="LEFT">Fc</td>
			<td align="LEFT">Calculated data</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
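<p>Since the flag columns are not used in this exploration, they could be dropped up front.  Here is a minimal sketch of one way to do that with dplyr (my own aside; it assumes, as is the case in these files, that only the flag columns have names ending in &quot;F&quot;):</p><pre><code class="language-R">library(dplyr)

#Sketch: drop the YxxxxF flag columns
#(assumes only flag columns end in &quot;F&quot;; africa_table is imported in question 1 below)
africa_table_no_flags &lt;- africa_table %&gt;% select(-ends_with(&quot;F&quot;))</code></pre>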
<hr><p>TO THE ANALYSIS</p><p>We will be answering 5 questions on this data.</p>
<!--kg-card-begin: html-->
<ol>
&#xA0; <li>Which country in Africa produced the most coffee in 2018?</li>
&#xA0; <li>Which country in Africa produced the most coffee per acre in 2018?</li>
&#xA0; <li>Which country in Africa has the largest increase in the amount of coffee produced since 1961?</li>
&#xA0; <li>Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*</li>
&#xA0; <li>Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*</li>
</ol>
<!--kg-card-end: html-->
<p>*This post contains solutions for 1-3; the next post details 4 and 5.</p><hr><h3 id="1-which-country-in-africa-produced-the-most-coffee-in-2018">1 Which country in Africa produced the most coffee in 2018?</h3><p>We are going to start by importing the data.  Then we distill the columns that we want to display by subsetting the dataset and including only &apos;Area&apos;, &apos;Item&apos;, &apos;Element&apos;, &apos;Unit&apos;, and the year in question, &apos;Y2018&apos;.</p><p>The idea here is simple: we are going to filter where the &apos;Item Code&apos; is equal to &apos;656&apos;, and filter on the &apos;Element&apos; column where it is equal to &apos;Production&apos;.  After this it is a matter of sorting the data on the &apos;Y2018&apos; column, as this column contains the numerical &apos;Production&apos; values.</p><pre><code class="language-R">#Import dplyr
library(dplyr)

#Import data
africa_table = read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)


q1 &lt;- africa_table %&gt;% 
  #Keep only the columns we want to display
  select(Area, Item.Code, Item, Element, Unit, Y2018) %&gt;%
  #Keep the production rows for coffee (Item Code 656)
  filter(Element == &quot;Production&quot;) %&gt;%
  filter(Item.Code == 656) %&gt;%
  #Sort with the largest 2018 production on top
  arrange(desc(Y2018))

#Output as a csv file
write.csv(q1, &quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\q1.csv&quot;)
</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="218"><col width="74"><col width="91"><col width="77"><col width="53"><col width="56"></colgroup>
	<tbody>
		<tr>
			<td width="218" height="17" align="LEFT">Area</td>
			<td width="74" align="LEFT">Item.Code</td>
			<td width="91" align="LEFT">Item</td>
			<td width="77" align="LEFT">Element</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="56" align="LEFT">Y2018</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Ethiopia</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="494574" sdnum="1033;">494574</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="284225" sdnum="1033;">284225</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Madagascar</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="60114" sdnum="1033;">60114</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">United Republic of Tanzania</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="43193" sdnum="1033;">43193</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Guinea</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="42900" sdnum="1033;">42900</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="41375" sdnum="1033;">41375</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">C&#xFFFD;te d&apos;Ivoire</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="39092" sdnum="1033;">39092</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Rwanda</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="38643" sdnum="1033;">38643</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Democratic Republic of the Congo</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="31145" sdnum="1033;">31145</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cameroon</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="30590" sdnum="1033;">30590</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Sierra Leone</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="20480" sdnum="1033;">20480</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Togo</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="19076" sdnum="1033;">19076</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Angola</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="16308" sdnum="1033;">16308</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Burundi</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="14216" sdnum="1033;">14216</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Malawi</td>
			<td align="RIGHT" sdval="656" sdnum="1033;">656</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">Production</td>
			<td align="LEFT">tonnes</td>
			<td align="RIGHT" sdval="11082" sdnum="1033;">11082</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
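<p>Ethiopia tops the list.  Since q1 is already sorted, the answer can also be pulled out programmatically (a small aside, not in the original notebook):</p><pre><code class="language-R">#Top coffee producer in Africa in 2018
q1$Area[1]  #&quot;Ethiopia&quot;</code></pre>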
<h3 id="2-which-country-in-africa-produced-the-most-coffee-per-acre-in-2018">2 Which country in Africa produced the most coffee per acre in 2018?</h3><p>This is a similar problem, there is no need to aggregate the data because the data is already aggregated, it is a matter of not selecting &quot;Production&quot; as the &quot;Element&quot;, but filtering by &quot;Yield&quot;.</p><pre><code class="language-R">library(dplyr)
africa_table = read.csv(&quot;C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv&quot;)


#Only keep Element values equal to &apos;Yield&apos;
q2 &lt;- filter(africa_table, Element == &apos;Yield&apos;) %&gt;%
  #Our focus is on coffee
  filter(Item ==&apos;Coffee, green&apos;) %&gt;%
  #Select output columns
  select(&apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Y2018&apos;) %&gt;%
  #Sort highest yield to lowest
  arrange(desc(Y2018))</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="26"><col width="218"><col width="91"><col width="45"><col width="51"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="218" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="45" align="LEFT">Unit</td>
			<td width="51" align="LEFT">Y2018</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="23345" sdnum="1033;">23345</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="20706" sdnum="1033;">20706</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Ghana</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="16959" sdnum="1033;">16959</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="16506" sdnum="1033;">16506</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="12886" sdnum="1033;">12886</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Burundi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">hg/ha</td>
			<td align="RIGHT" sdval="9658" sdnum="1033;">9658</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p></p><h3 id="3-which-country-in-africa-has-the-largest-increase-in-the-amount-of-coffee-produced-since-1961">3 Which country in Africa has the largest increase in the amount of coffee produced since 1961?</h3><p>A different type of question.  I need to compare values, and use the percent change formula.</p><pre><code class="language-R">#consider dplyr and africa_table already imported

#Select required columns
q3 &lt;- select(africa_table, &apos;Area&apos;, &apos;Item&apos;, &apos;Unit&apos;, &apos;Element&apos;,&apos;Y1961&apos;,&apos;Y2018&apos;) %&gt;%
  #filter by coffee and production
  filter(Item == &apos;Coffee, green&apos;, Element == &apos;Production&apos;) %&gt;%
  #create a new column for percent change rounded to 2 decimal places
  mutate(PercentChange = round(((Y2018 - Y1961) / Y1961)*100,2)) %&gt;%
  #arrange in descending order using PercentChange
  arrange(desc(PercentChange))  </code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="8" rules="NONE" border="0">
	<colgroup><col width="26"><col width="218"><col width="91"><col width="53"><col width="77"><col width="56"><col width="56"><col width="105"></colgroup>
	<tbody>
		<tr>
			<td width="26" height="17" align="LEFT"><br></td>
			<td width="218" align="LEFT">Area</td>
			<td width="91" align="LEFT">Item</td>
			<td width="53" align="LEFT">Unit</td>
			<td width="77" align="LEFT">Element</td>
			<td width="56" align="LEFT">Y1961</td>
			<td width="56" align="LEFT">Y2018</td>
			<td width="105" align="LEFT">PercentChange</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="LEFT">Malawi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="167" sdnum="1033;">167</td>
			<td align="RIGHT" sdval="11082" sdnum="1033;">11082</td>
			<td align="RIGHT" sdval="6535.93" sdnum="1033;">6535.93</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="LEFT">Sierra Leone</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="5103" sdnum="1033;">5103</td>
			<td align="RIGHT" sdval="20480" sdnum="1033;">20480</td>
			<td align="RIGHT" sdval="301.33" sdnum="1033;">301.33</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="LEFT">Rwanda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="10500" sdnum="1033;">10500</td>
			<td align="RIGHT" sdval="38643" sdnum="1033;">38643</td>
			<td align="RIGHT" sdval="268.03" sdnum="1033;">268.03</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="LEFT">Congo</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="900" sdnum="1033;">900</td>
			<td align="RIGHT" sdval="3049" sdnum="1033;">3049</td>
			<td align="RIGHT" sdval="238.78" sdnum="1033;">238.78</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="LEFT">Uganda</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="94100" sdnum="1033;">94100</td>
			<td align="RIGHT" sdval="284225" sdnum="1033;">284225</td>
			<td align="RIGHT" sdval="202.05" sdnum="1033;">202.05</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="LEFT">Guinea</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="15000" sdnum="1033;">15000</td>
			<td align="RIGHT" sdval="42900" sdnum="1033;">42900</td>
			<td align="RIGHT" sdval="186" sdnum="1033;">186</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="LEFT">Togo</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="10300" sdnum="1033;">10300</td>
			<td align="RIGHT" sdval="19076" sdnum="1033;">19076</td>
			<td align="RIGHT" sdval="85.2" sdnum="1033;">85.2</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="LEFT">Kenya</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="28100" sdnum="1033;">28100</td>
			<td align="RIGHT" sdval="41375" sdnum="1033;">41375</td>
			<td align="RIGHT" sdval="47.24" sdnum="1033;">47.24</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="LEFT">Comoros</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="100" sdnum="1033;">100</td>
			<td align="RIGHT" sdval="140" sdnum="1033;">140</td>
			<td align="RIGHT" sdval="40" sdnum="1033;">40</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="LEFT">United Republic of Tanzania</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="33000" sdnum="1033;">33000</td>
			<td align="RIGHT" sdval="43193" sdnum="1033;">43193</td>
			<td align="RIGHT" sdval="30.89" sdnum="1033;">30.89</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="LEFT">Central African Republic</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="8200" sdnum="1033;">8200</td>
			<td align="RIGHT" sdval="9391" sdnum="1033;">9391</td>
			<td align="RIGHT" sdval="14.52" sdnum="1033;">14.52</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="12" sdnum="1033;">12</td>
			<td align="LEFT">Madagascar</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="54000" sdnum="1033;">54000</td>
			<td align="RIGHT" sdval="60114" sdnum="1033;">60114</td>
			<td align="RIGHT" sdval="11.32" sdnum="1033;">11.32</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="13" sdnum="1033;">13</td>
			<td align="LEFT">Nigeria</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="1100" sdnum="1033;">1100</td>
			<td align="RIGHT" sdval="1161" sdnum="1033;">1161</td>
			<td align="RIGHT" sdval="5.55" sdnum="1033;">5.55</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="14" sdnum="1033;">14</td>
			<td align="LEFT">Burundi</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="14000" sdnum="1033;">14000</td>
			<td align="RIGHT" sdval="14216" sdnum="1033;">14216</td>
			<td align="RIGHT" sdval="1.54" sdnum="1033;">1.54</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="15" sdnum="1033;">15</td>
			<td align="LEFT">Mozambique</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="1000" sdnum="1033;">1000</td>
			<td align="RIGHT" sdval="825" sdnum="1033;">825</td>
			<td align="RIGHT" sdval="-17.5" sdnum="1033;">-17.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="16" sdnum="1033;">16</td>
			<td align="LEFT">Cameroon</td>
			<td align="LEFT">Coffee, green</td>
			<td align="LEFT">tonnes</td>
			<td align="LEFT">Production</td>
			<td align="RIGHT" sdval="44700" sdnum="1033;">44700</td>
			<td align="RIGHT" sdval="30590" sdnum="1033;">30590</td>
			<td align="RIGHT" sdval="-31.57" sdnum="1033;">-31.57</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
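<p>As a quick sanity check of the top row: Malawi&apos;s production rose from 167 to 11082 tonnes, and applying the percent change formula by hand reproduces the table value.</p><pre><code class="language-R">#Hand check of the Malawi row
round((11082 - 167) / 167 * 100, 2)  #6535.93, matching PercentChange</code></pre>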
<p>We will continue with questions 4 and 5 in the next post.</p>]]></content:encoded></item><item><title><![CDATA[Power Bi - TFL Bus Safety Viz]]></title><description><![CDATA[<!--kg-card-begin: html-->
<div style="padding:56.25% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1004270373?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Power Bi Visualization on The TFL Bus Incident Dataset"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<!--kg-card-end: html-->
<p>View the interactive visualization <a href="https://app.powerbi.com/groups/me/reports/ef920fa2-72d9-4e3f-8206-52896b2a841e/8feb9e9d9298587bcd8d?ctid=cfa792cf-7768-4341-8857-81754c2afa1f&amp;pbi_source=shareVisual&amp;visual=28c9f2449c4205ea20c3&amp;height=110.00&amp;width=272.50&amp;bookmarkGuid=dd5fcbb0-4878-4bb2-87d2-65d1400da97e" rel="noreferrer">here</a> (I think you may need a Power Bi account to view)</p><p></p>]]></description><link>https://www.williamameyer.com/power-bi-tfl-bus-safety-viz/</link><guid isPermaLink="false">66d0bf2f44502a60aa65ae04</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Thu, 29 Aug 2024 22:31:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/TFLZbusSafety-Viz-Screenshot-min.png" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
<div style="padding:56.25% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1004270373?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Power Bi Visualization on The TFL Bus Incident Dataset"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>
<!--kg-card-end: html-->
<img src="https://www.williamameyer.com/content/images/2024/08/TFLZbusSafety-Viz-Screenshot-min.png" alt="Power Bi - TFL Bus Safety Viz"><p>View the interactive visualization <a href="https://app.powerbi.com/groups/me/reports/ef920fa2-72d9-4e3f-8206-52896b2a841e/8feb9e9d9298587bcd8d?ctid=cfa792cf-7768-4341-8857-81754c2afa1f&amp;pbi_source=shareVisual&amp;visual=28c9f2449c4205ea20c3&amp;height=110.00&amp;width=272.50&amp;bookmarkGuid=dd5fcbb0-4878-4bb2-87d2-65d1400da97e" rel="noreferrer">here</a> (I think you may need a Power Bi account to view)</p><p></p>]]></content:encoded></item><item><title><![CDATA[Python EDA - London Bus Safety Part 2/2]]></title><description><![CDATA[<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare</li></ol>]]></description><link>https://www.williamameyer.com/python-eda-london-bus-safety-part-1-2/</link><guid isPermaLink="false">66cd093b44502a60aa65add6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 26 Aug 2024 23:12:05 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge-1.jpg" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare the &apos;Slip Trip Fall&apos; count with different operators.  
    Order them by operators with most &apos;Slip Trip Fall&apos;s to least.  
    Which operator has the most of these incidents?*</li>
</ol> 
<!--kg-card-end: html-->
<img src="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge-1.jpg" alt="Python EDA - London Bus Safety Part 2/2"><p>*We have already explored questions 1 through 4 in the previous post, we will continue with the final 2 tasks of this analysis.</p><p>Data preview:</p>
<!--kg-card-begin: html-->
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>Year</th>
      <th>Date Of Incident</th>
      <th>Route</th>
      <th>Operator</th>
      <th>Group Name</th>
      <th>Bus Garage</th>
      <th>Borough</th>
      <th>Injury Result Description</th>
      <th>Incident Event Type</th>
      <th>Victim Category</th>
      <th>Victims Sex</th>
      <th>Victims Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>1</td>
      <td>London General</td>
      <td>Go-Ahead</td>
      <td>Garage Not Available</td>
      <td>Southwark</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Child</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>4</td>
      <td>Metroline</td>
      <td>Metroline</td>
      <td>Garage Not Available</td>
      <td>Islington</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Unknown</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>Havering</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>None London Borough</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td></tr></tbody></table>
<!--kg-card-end: html-->
<h2 id="5-do-any-operators-have-a-higher-incident-of-hospitalizations">5. Do any operators have a higher incident of hospitalizations?</h2><p>Fairly straightforward because we have already created a field for hospitalized, let&apos;s aggregate that, and use Seaborn to plot it.</p><pre><code class="language-Python">#Aggregate the count of hospitilizations 
operator_hospitalizations = df.groupby([&apos;Operator&apos;], as_index=False).agg(hospitalized_count = (&apos;Hospitalized&apos;, &apos;sum&apos;))

#Sort values by hospitalizations descending
operator_hospitalizations = operator_hospitalizations.sort_values(by=[&apos;hospitalized_count&apos;], ascending=False)

#Display
operator_hospitalizations</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDcOUNT-2.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="357" height="188"><figcaption><span style="white-space: pre-wrap;">This continues for all operators; only the top of the output is shown above.</span></figcaption></figure><p>Then we will use matplotlib and seaborn to visualize the data.</p><pre><code class="language-Python">#Add a visualization

import seaborn as sns
import matplotlib.pyplot as plt

operator_hospitalizations_chart = sns.barplot(
    x= operator_hospitalizations[&apos;hospitalized_count&apos;], 
    y = operator_hospitalizations[&apos;Operator&apos;]
    )

operator_hospitalizations_chart.set_title(&apos;Hospitalizations Per Operator&apos;)
operator_hospitalizations_chart.set_xlabel(&apos;People Hospitalized&apos;)

plt.show()</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDViz-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="815" height="461" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/HOSPITALIZEDViz-1.png 600w, https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDViz-1.png 815w" sizes="(min-width: 720px) 720px"></figure><p>And now for the final task</p><h2 id="6-compare-the-slip-trip-fall-count-in-incident-event-type-with-different-operators-order-them-by-operators-with-most-slip-trip-falls-to-least">6. Compare the slip trip fall count (in incident event type) with different operators. Order them by operators with most slip trip falls to least.</h2><pre><code class="language-Python">#search for Slip Trip Fall in &apos;Incident Event Type&apos;

df[&apos;SlipTripFall&apos;] = df[&apos;Incident Event Type&apos;].str.contains(&apos;Slip Trip Fall&apos;)
#Confirm that the string is found (if so, the new column will contain two unique values: True and False)

operator_tripslipfall = df.groupby([&apos;Operator&apos;], as_index=False).agg(SlipTripFallCount = (&apos;SlipTripFall&apos;, &apos;sum&apos;))

operator_tripslipfall = operator_tripslipfall.sort_values(by=[&apos;SlipTripFallCount&apos;], ascending=False)

operator_tripslipfall</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/slipfallcounts.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="448" height="268"><figcaption><span style="white-space: pre-wrap;">This continues for all operators; only the top of the output is shown above.</span></figcaption></figure><p>Then we will make a visualization of the data using matplotlib and seaborn.</p><pre><code class="language-Python">import seaborn as sns
import matplotlib.pyplot as plt

#Add a visualization
operator_tripslipfall_chart = sns.barplot(
    x= operator_tripslipfall[&apos;SlipTripFallCount&apos;], 
    y = operator_tripslipfall[&apos;Operator&apos;]
    )

operator_tripslipfall_chart.set_title(&apos;Slip/Trip/Fall Count Per Operator&apos;)
operator_tripslipfall_chart.set_xlabel(&apos;People who Slipped/Tripped/Fell&apos;)

plt.show()



###ANSWER THE QUESTION
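#BOLD and END are the ANSI formatting codes defined during the imports in part 1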

print(&apos;The Operator with the most Slip/Trip/Fall incidents is:\n\n&apos; 
      + BOLD + operator_tripslipfall[&apos;Operator&apos;].iloc[0] + END)</code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/SlipfallViz.png" class="kg-image" alt="Python EDA - London Bus Safety Part 2/2" loading="lazy" width="831" height="576" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/SlipfallViz.png 600w, https://www.williamameyer.com/content/images/2024/08/SlipfallViz.png 831w" sizes="(min-width: 720px) 720px"></figure><p>And that is the end of our Python Exploratory Data Analysis.</p>]]></content:encoded></item><item><title><![CDATA[Python EDA - London Bus Safety Part 1/2]]></title><description><![CDATA[<p><em>Transport for London</em> provided their data on bus safety.  This data will be explored using Python.</p><p>We will use a Jupyter notebook using python 3.  We do our library imports, import the data, and preview:</p><pre><code class="language-Python">import pandas as pd
import numpy as np

#For formatting output
BOLD = &apos;\033[1m&</code></pre>]]></description><link>https://www.williamameyer.com/python-eda-london-bus-safety/</link><guid isPermaLink="false">66ccddd344502a60aa65ad24</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 26 Aug 2024 21:27:34 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/OptimizedJpgBridge.jpg" alt="Python EDA - London Bus Safety Part 1/2"><p><em>Transport for London</em> provided their data on bus safety.  This data will be explored using Python.</p><p>We will use a Jupyter notebook using python 3.  We do our library imports, import the data, and preview:</p><pre><code class="language-Python">import pandas as pd
import numpy as np

#For formatting output
BOLD = &apos;\033[1m&apos;
END = &apos;\033[0m&apos;

df = pd.read_csv(&quot;TFL Bus Safety.csv&quot;,encoding=&apos;cp1252&apos;)
df.head()</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table border="1" class="dataframe"><thead><tr style="text-align: right;"><th>Year</th>
      <th>Date Of Incident</th>
      <th>Route</th>
      <th>Operator</th>
      <th>Group Name</th>
      <th>Bus Garage</th>
      <th>Borough</th>
      <th>Injury Result Description</th>
      <th>Incident Event Type</th>
      <th>Victim Category</th>
      <th>Victims Sex</th>
      <th>Victims Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>1</td>
      <td>London General</td>
      <td>Go-Ahead</td>
      <td>Garage Not Available</td>
      <td>Southwark</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Child</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>4</td>
      <td>Metroline</td>
      <td>Metroline</td>
      <td>Garage Not Available</td>
      <td>Islington</td>
      <td>Injuries treated on scene</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Unknown</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>Havering</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2015</td>
      <td>1/1/2015</td>
      <td>5</td>
      <td>East London</td>
      <td>Stagecoach</td>
      <td>Garage Not Available</td>
      <td>None London Borough</td>
      <td>Taken to Hospital &#x2013; Reported Serious Injury or...</td>
      <td>Onboard Injuries</td>
      <td>Passenger</td>
      <td>Male</td>
      <td>Elderly</td></tr></tbody></table>
<!--kg-card-end: html-->
<p>We will be doing 6 tasks to explore this data:</p>
<!--kg-card-begin: html-->
 <ol>
  <li>How many unique fields are there for each variable?</li>
  <li>How much nullity is there for each variable?</li>
  <li>Create a new variable to describe if the person was taken to hospital.</li>
  <li>Do any operators have a higher incidence of overall incidents?</li>
  <li>Do any operators have a higher incidence of hospitalizations?*</li>
  <li>Compare the &apos;Slip Trip Fall&apos; count with different operators.  
    Order them by operators with most &apos;Slip Trip Fall&apos;s to least.  
    Which operator has the most of these incidents?*</li>
</ol> 
<!--kg-card-end: html-->
<p><em>*Items 5 and 6 will be finished in the next post.</em></p><h2 id="1a-how-many-unique-fields-are-there-for-each-variable">1a. How many unique fields are there for each variable?</h2><p>This is very straightforward as pandas has the nunique() method.  </p><pre><code class="language-Python">df.nunique()</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Year                           4
Date Of Incident              45
Route                        612
Operator                      25
Group Name                    14
Bus Garage                    84
Borough                       35
Injury Result Description      4
Incident Event Type           10
Victim Category               17
Victims Sex                    3
Victims Age                    5
dtype: int64</code></pre><h2 id="1b-london-does-not-have-35-boroughs-i-know-this-because-i-googled-it">1b. London does not have 35 Boroughs, I know this because I googled it...</h2><p>so I put a list of London Boroughs in a dataframe and compared this to the &apos;list&apos; of Boroughs in this dataset.</p><pre><code class="language-Python">#copy list of boroughs
dfboroughs = pd.read_csv(&quot;londonboroughlist.csv&quot;)

#remove whitespace (the copying of boroughs caused a trailing whitespace)
dfboroughs[&apos;borough&apos;] = dfboroughs[&apos;borough&apos;].str.strip()

#convert the dataframe to a list
dfboroughs = dfboroughs[&apos;borough&apos;].to_list()

#List values that are in df[&apos;Borough&apos;] but are not in the borough list.
nonboroughs = df[&apos;Borough&apos;][~df[&apos;Borough&apos;].isin(dfboroughs)].unique()


#Show what boroughs are in the dataframe that are not in the borough list.
print(&apos;Dataframe contains these non-borough fields as boroughs:\n&apos;)

for i in nonboroughs:
    print(&quot;&apos;&quot; + i + &quot;&apos;&quot;)
</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Dataframe contains these non-borough fields as boroughs:

&apos;None London Borough&apos;
&apos;City of London&apos;
&apos;Not specified&apos;</code></pre><h2 id="1c-examine-unique-values-on-a-subset-of-values-then-on-all-values">1c. Examine unique values on a subset of values, then on all values</h2><p>With this we can understand what may be categorical.  We will also need to examine how this dataset reports nulls, so the exhaustive list of values will be helpful.</p><pre><code class="language-Python">examine_unique_list = [&apos;Injury Result Description&apos;, &apos;Incident Event Type&apos;, &apos;Victim Category&apos;,&apos;Victims Sex&apos;, &apos;Victims Age&apos;]

print(&apos;Examining a few variables unique values:\n\n&apos;)

for i in examine_unique_list:
    print(BOLD + i.upper() + &apos;:&apos; + END)
    print(*df[i].unique(), sep = &apos;\n&apos;)
    print(&apos;\n&apos;)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/Category-Output.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="631" height="460" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/Category-Output.png 600w, https://www.williamameyer.com/content/images/2024/08/Category-Output.png 631w"><figcaption><span style="white-space: pre-wrap;">... continues, but omitted for the blog</span></figcaption></figure><h2 id="2a-how-much-null-is-there-in-each-variable">2a. How much null is there in each variable?</h2><pre><code class="language-Python">&apos;&apos;&apos;Using the output above, find how null is reported in each column; we will
use a dictionary to map each column to its null representation.&apos;&apos;&apos;

column_null_dictionary = {
    &apos;Route&apos;: &apos;(blank)&apos;, 
    &apos;Bus Garage&apos;: &apos;Garage Not Available&apos;,
    &apos;Borough&apos;: &apos;Not specified&apos;, 
    &apos;Victim Category&apos;: &apos;Insufficient Data&apos;,
    &apos;Victims Sex&apos;: &apos;Unknown&apos;,
    &apos;Victims Age&apos;: &apos;Unknown&apos;
}


print(&quot;NULL COUNTS\n&quot;)

#Calculate row count of dataframe this is used for percent calculations
row_count = df.shape[0]


&apos;&apos;&apos;Iterate through the dictionary and replace values with NaN
and report nulls as we go&apos;&apos;&apos;

for x, y in column_null_dictionary.items():
    #Replace the column values with null where appropriate
    df[x].replace(y, np.nan, inplace=True)
    
    #Set null_count
    null_count = df[x].isna().sum()
    null_count_percent = round(((null_count / row_count)*100),2)
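    #(Equivalently, df[x].isna().mean() * 100 gives this percentage in one
    #step, since the mean of a boolean mask is the fraction of True values.)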
    
    #Count and report the nulls
    print(BOLD + x + &apos;:&apos; + END + &apos;\n&apos; + str(null_count) + &apos; / &apos; + str(row_count))
    
    #Report percent
    print(str(null_count_percent)  + &apos;%&apos; + &apos;\n&apos;)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/Null-count-output-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="178" height="542"></figure><p>Continuing, we will run a report on nulls across all variables:</p><pre><code class="language-Python">#Report on all variables as a &apos;sanity check&apos;
df.isna().sum()</code></pre><p>OUTPUT:</p><pre><code class="language-PythonOutput">Year                            0
Date Of Incident                0
Route                          14
Operator                        0
Group Name                      0
Bus Garage                   8572
Borough                       553
Injury Result Description       0
Incident Event Type             0
Victim Category                 2
Victims Sex                  3602
Victims Age                  7135
dtype: int64</code></pre><h2 id="2b-visualize-nullity">2b Visualize nullity</h2><p>I will use the missingno library to output a nullity matrix.</p><pre><code class="language-Python">import missingno as msno 

#only report missing numbers on data with missing values
msno.matrix(df[list(column_null_dictionary.keys())]) </code></pre><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/nullitymatrix.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="742" height="332" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/nullitymatrix.png 600w, https://www.williamameyer.com/content/images/2024/08/nullitymatrix.png 742w" sizes="(min-width: 720px) 720px"></figure><h1 id="3-create-a-new-variable-to-describe-if-the-person-was-taken-to-hospital">3. Create a new variable to describe if the person was taken to hospital.</h1><pre><code class="language-Python">#If the Injury Result Description contains the string &apos;Hospital&apos;, create a bool T/F
df[&apos;Hospitalized&apos;] = df[&apos;Injury Result Description&apos;].str.contains(&apos;Hospital&apos;)

#Convert bool T/F into yes/no responses
df[&apos;HospitalizedYN&apos;] = df[&apos;Hospitalized&apos;].replace({
    True: &quot;Yes&quot;,
    False: &quot;No&quot;
})

&apos;&apos;&apos;A column with True/False values is kept for future aggregations.
This is with the hindsight of proofreading; originally I converted the
Yes/No column back to boolean to do the aggregations&apos;&apos;&apos;
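#(A sketch of what the boolean column enables: df.groupby(&apos;Operator&apos;)[&apos;Hospitalized&apos;].sum()
#would count hospitalizations per operator, since True values sum as 1.)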

print(&apos;View of the new variables\n&apos;)
print(df[&apos;HospitalizedYN&apos;].head())
print(&apos;\n&apos;)
print(df[&apos;Hospitalized&apos;].head())</code></pre><p>In the example above I created two new variables, one human-readable (&apos;Yes&apos;/&apos;No&apos; values) and one boolean.  We can use the boolean for aggregate functions.</p><p>OUTPUT:</p><pre><code class="language-PythonOutput">View of the new variables

0     No
1     No
2    Yes
3    Yes
4    Yes
Name: HospitalizedYN, dtype: object


0    False
1    False
2     True
3     True
4     True
Name: Hospitalized, dtype: bool
</code></pre><h1 id="4-do-any-operators-have-a-higher-incidence-of-overall-incidents">4. Do any operators have a higher incidence of overall incidents?</h1><pre><code class="language-Python">&apos;&apos;&apos;Create new Dataframe with all the operators.  Operator is needed, 
the other column is just to tranform into count column&apos;&apos;&apos;

#New DF
operator_incidents = df[[&apos;Operator&apos;]].copy()

#Create a new column in new dataframe
operator_incidents[&apos;Count&apos;] = 1


#Value check for above code (maybe thorough, maybe noob stuff IDK)
print(operator_incidents)
print(&apos;\nCheck to make sure sum of count = the # of rows above&apos;)
print(operator_incidents[&apos;Count&apos;].sum())</code></pre><p>Outputs:</p><pre><code class="language-PythonOutput">             Operator  Count
0      London General      1
1           Metroline      1
2         East London      1
3         East London      1
4           Metroline      1
...               ...    ...
23153     East London      1
23154   London United      1
23155   London United      1
23156   London United      1
23157       Metroline      1

[23158 rows x 2 columns]

Check to make sure sum of count = the # of rows above
23158
</code></pre><p>Then we use this for the aggregate function:</p><pre><code class="language-Python">#Use an aggregate function and the sum of the new &apos;Count&apos; variable

operator_incidents = operator_incidents.groupby([&apos;Operator&apos;], as_index=False).agg(incident_count = (&apos;Count&apos;, &apos;sum&apos;))
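#(Note: df[&apos;Operator&apos;].value_counts() would give the same totals in one
#step; the explicit Count column is kept to show a groupby/agg pattern.)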

operator_incidents.sort_values(by=[&apos;incident_count&apos;], ascending=False)</code></pre><p>OUTPUT:</p><figure class="kg-card kg-image-card"><img src="https://www.williamameyer.com/content/images/2024/08/HOSPITALIZEDcOUNT-1.png" class="kg-image" alt="Python EDA - London Bus Safety Part 1/2" loading="lazy" width="357" height="188"></figure><p></p><p>We will finish with tasks 5 and 6 in the next post.</p><p>The code for this can be found <a href="https://github.com/williamAM1/Data-Analytics-Portfolio/blob/53dbe5f752674f42378c34e9511b8fbbb2365775/03PythonEDA/03%20Python%20Project%20London%20Bus%20Safety.ipynb" rel="noreferrer">here</a> on github.</p><p>Data Citation:</p><p>data.world [@vizwiz]. (2018). <em>2018/W51: London Bus Safety Performance</em> [Dataset]. https://data.world/makeovermonday/2018w51</p>]]></content:encoded></item><item><title><![CDATA[SQL/Tableau: AFDB Market Trends with Viz.  Part 2]]></title><description><![CDATA[<p>The next visualization request is to be able to view the differences in the initial value of each <em>indicator</em>, and the final value of each <em>indicator</em>.  This will be completed in SQL.</p><p>Requirements:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Extract <em>month</em> and <em>year</em> from <em>date</em></li>
  <li>Average values based on <em>indicator</em>, <em>year</em>, and <em>month</em></li>
  <li>Filter by the</li></ul>]]></description><link>https://www.williamameyer.com/tableau-afdb-market-trends-with-visualization-part-2/</link><guid isPermaLink="false">66c7b89d44502a60aa65ace3</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Thu, 22 Aug 2024 22:26:40 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/durban-3840075_640-1-.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/durban-3840075_640-1-.jpg" alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2"><p>The next visualization request is to be able to view the differences in the initial value of each <em>indicator</em>, and the final value of each <em>indicator</em>.  This will be completed in SQL.</p><p>Requirements:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Extract <em>month</em> and <em>year</em> from <em>date</em></li>
  <li>Average values based on <em>indicator</em>, <em>year</em>, and <em>month</em></li>
  <li>Filter by the first <em>month</em> in the data provided</li>
  <li>Follow the above steps, except filter by the last <em>month</em></li>
  <li>Join those tables together</li>
</ul>
<!--kg-card-end: html-->
<pre><code class="language-SQL">SELECT 
	--3 final output
	-- indicator | initial date | initial average | ending date | ending average
	initial_values.indicatorname,
	make_date(CAST(initial_year AS INT), CAST(initial_month AS INT), 1) AS initial_date,
	initial_average,
	make_date(CAST(ending_year AS INT), CAST(ending_month AS INT), 1) AS ending_date,
	ending_average
FROM (
	/*1  Create a month/year format and aggregate
	Average value per indicator per month/year
	Only include first month using HAVING clause*/
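	--(Note: 2011/1 and 2015/7 are hardcoded from this dataset&apos;s known range;
	--MIN(date)/MAX(date) subqueries would generalize the filter.)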
	SELECT
		indicatorname,
		EXTRACT(&apos;Year&apos; FROM date) AS initial_year,
		EXTRACT(&apos;Month&apos; FROM date) AS initial_month,
		AVG(value) AS initial_average
	FROM
		afdbmarkettrends2015
	GROUP BY
		indicatorname, initial_year, initial_month
	HAVING
		EXTRACT(&apos;Year&apos; FROM date) = 2011
		AND
		EXTRACT(&apos;Month&apos; FROM date) = 1
	ORDER BY
		indicatorname, initial_year, initial_month)
	AS initial_values
JOIN(
	/*2  Same as first except 
	Only include last month using HAVING clause*/
	SELECT
		indicatorname,
		EXTRACT(&apos;Year&apos; FROM date) AS ending_year,
		EXTRACT(&apos;Month&apos; FROM date) AS ending_month,
		AVG(value) AS ending_average
	FROM
		afdbmarkettrends2015
	GROUP BY
		indicatorname, ending_year, ending_month
	HAVING
		EXTRACT(&apos;Year&apos; FROM date) = 2015
		AND
		EXTRACT(&apos;Month&apos; FROM date) = 7
	ORDER BY
		indicatorname, ending_year, ending_month)
	AS ending_values
ON
	initial_values.indicatorname = ending_values.indicatorname</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="268"><col width="81"><col width="126"><col width="85"><col width="126"></colgroup>
	<tbody>
		<tr>
			<td width="268" height="17" align="LEFT">indicatorname</td>
			<td width="81" align="LEFT">initital_date</td>
			<td width="126" align="LEFT">initial_average</td>
			<td width="85" align="LEFT">ending_date</td>
			<td width="126" align="LEFT">ending_average</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="1401.4" sdnum="1033;">1401.4</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="954.9" sdnum="1033;">954.9</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">CFA zone Countries CFA Franc</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="492.8661" sdnum="1033;">492.8661</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="596.559305" sdnum="1033;">596.559305</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cocoa (USD/tonne)</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="3164.863" sdnum="1033;">3164.863</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="3332.6425" sdnum="1033;">3332.6425</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Coffee Brazilian Naturals (US cents/tonne)</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="209.263157894737" sdnum="1033;">209.2631578947</td>
			<td align="LEFT">2011-01-01</td>
			<td align="RIGHT" sdval="111.934210526316" sdnum="1033;">111.9342105263</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
			<td align="LEFT">&#x2026;</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Then we will use this data in Tableau for the visualization.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/VizScreenshotCompressed-1.png" class="kg-image" alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2" loading="lazy" width="781" height="753" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/VizScreenshotCompressed-1.png 600w, https://www.williamameyer.com/content/images/2024/08/VizScreenshotCompressed-1.png 781w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This image shows the desktop view of the dashboard. Below is the embedded dashboard. Depending on your browser the visualization may change.</span></figcaption></figure><p>The following dashboard allows the user to select which indicator(s) they would like to view, and shows the initial and final averages in an easy-to-understand chart.</p>
<!--kg-card-begin: html-->
<div class="tableauPlaceholder" id="viz1724365547195" style="position: relative"><noscript><a href="#"><img alt="SQL/Tableau: AFDB Market Trends with Viz.  Part 2" src="https://public.tableau.com/static/images/Ma/MarketTrendsDashboard/Dashboard1/1_rss.png" style="border: none"></a></noscript><object class="tableauViz" style="display:none;"><param name="host_url" value="https%3A%2F%2Fpublic.tableau.com%2F"> <param name="embed_code_version" value="3"> <param name="site_root" value><param name="name" value="MarketTrendsDashboard/Dashboard1"><param name="tabs" value="no"><param name="toolbar" value="yes"><param name="static_image" value="https://public.tableau.com/static/images/Ma/MarketTrendsDashboard/Dashboard1/1.png"> <param name="animate_transition" value="yes"><param name="display_static_image" value="yes"><param name="display_spinner" value="yes"><param name="display_overlay" value="yes"><param name="display_count" value="yes"><param name="language" value="en-US"><param name="filter" value="publish=yes"></object></div>                <script type="text/javascript">                    var divElement = document.getElementById('viz1724365547195');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<!--kg-card-end: html-->
<p></p>]]></content:encoded></item><item><title><![CDATA[Tableau: AFDB Market Trends with Viz.  Part 1]]></title><description><![CDATA[<p>I have embedded the visualization of the AFDB Market Trend dataset below.  Tableau dashboards seem to work flawlessly on desktop, but often have trouble on mobile displays.  I will include screenshots, and the direct link to the dashboard on Tableau Public can be found <a href="https://public.tableau.com/views/TableauProblem1AFDB/Dashboard1?:language=en-US&amp;:sid=&amp;:redirect=auth&amp;:display_count=n&amp;:origin=viz_share_link" rel="noreferrer">here</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/dashboard-picture.jpg" class="kg-image" alt loading="lazy" width="480" height="513"><figcaption><span style="white-space: pre-wrap;">This is an example of</span></figcaption></figure>]]></description><link>https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/</link><guid isPermaLink="false">66c236c644502a60aa65acc6</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Sun, 18 Aug 2024 18:01:46 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1692248071692-9a75dabf6690?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDY1fHxsaXRoaXVtJTIwbWluaW5nfGVufDB8fHx8MTcyNDM2Njc2NXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1692248071692-9a75dabf6690?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDY1fHxsaXRoaXVtJTIwbWluaW5nfGVufDB8fHx8MTcyNDM2Njc2NXww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Tableau: AFDB Market Trends with Viz.  Part 1"><p>I have embedded the visualization of the AFDB Market Trend dataset below.  Tableau dashboards seem to work flawlessly on desktop, but often have trouble on mobile displays.  I will include screenshots, and the direct link to the dashboard on Tableau Public can be found <a href="https://public.tableau.com/views/TableauProblem1AFDB/Dashboard1?:language=en-US&amp;:sid=&amp;:redirect=auth&amp;:display_count=n&amp;:origin=viz_share_link" rel="noreferrer">here</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/dashboard-picture.jpg" class="kg-image" alt="Tableau: AFDB Market Trends with Viz.  Part 1" loading="lazy" width="480" height="513"><figcaption><span style="white-space: pre-wrap;">This is an example of the dashboard, if it looks very different (or doesn&apos;t load below this image at all) it may be better to try another browser.</span></figcaption></figure>
<!--kg-card-begin: html-->
<div class="tableauPlaceholder" id="viz1724200509330" style="position: relative"><noscript><a href="#"><img alt="Tableau: AFDB Market Trends with Viz.  Part 1" src="https://public.tableau.com/static/images/Ta/TableauProblem1AFDB/Dashboard1/1_rss.png" style="border: none"></a></noscript><object class="tableauViz" style="display:none;"><param name="host_url" value="https%3A%2F%2Fpublic.tableau.com%2F"> <param name="embed_code_version" value="3"> <param name="site_root" value><param name="name" value="TableauProblem1AFDB/Dashboard1"><param name="tabs" value="no"><param name="toolbar" value="yes"><param name="static_image" value="https://public.tableau.com/static/images/Ta/TableauProblem1AFDB/Dashboard1/1.png"> <param name="animate_transition" value="yes"><param name="display_static_image" value="yes"><param name="display_spinner" value="yes"><param name="display_overlay" value="yes"><param name="display_count" value="yes"><param name="language" value="en-US"><param name="filter" value="publish=yes"></object></div>                <script type="text/javascript">                    var divElement = document.getElementById('viz1724200509330');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

<!--kg-card-end: html-->
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.williamameyer.com/content/images/2024/08/Draft-Web-optimized.jpg" class="kg-image" alt="Tableau: AFDB Market Trends with Viz.  Part 1" loading="lazy" width="640" height="480" srcset="https://www.williamameyer.com/content/images/size/w600/2024/08/Draft-Web-optimized.jpg 600w, https://www.williamameyer.com/content/images/2024/08/Draft-Web-optimized.jpg 640w"><figcaption><span style="white-space: pre-wrap;">I started my dashboard with a draft. It is a single line graph that allows the end user to choose which data and time-frame they want to examine... it was a draft that was never meant to share on the internet... so don&apos;t judge.</span></figcaption></figure><p>In order to turn this draft into a working dashboard I considered a few steps then began to create the table.</p><ol><li>I imported data and made sure the data types were correct.</li><li>I placed the &apos;date&apos; variable in the columns, and the &apos;value&apos; as the rows.</li><li>On the Graph itself I placed the &apos;Indicator Name&apos; as I wanted to be able to display each indicator&apos;s value&apos;s separately as opposed to aggregating the value for all indicators.</li><li>I created a filter for indicator name, so the user can choose which indicator(s) they would like to see.</li><li>The SUM(value) was changed to average (for future aggregations, by month or quarter, etc)</li><li>A calculated field was created to filter out commodities (</li></ol><pre><code class="language-VizQL">IF ISNULL([Unit]) THEN &apos;Main Indicators&apos;
ELSE &apos;Commodity Indicators&apos;
END
</code></pre><ol start="7"><li>Make sure the indictor filter is affected by the &apos;commody indicator&apos;, else it will show all the indicators even when just &apos;commodies&apos; or just &apos;Main indicators&apos; is selected.</li></ol><p></p><p>Before:</p>]]></content:encoded></item><item><title><![CDATA[SQL EDA - AFDB Market Trends - Part 2]]></title><description><![CDATA[<p>I will continue exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of</p>]]></description><link>https://www.williamameyer.com/sql-exploratory-analysis-project-afdb-market-trends-2015-part-2/</link><guid isPermaLink="false">66c0e87544502a60aa65ac4a</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Sat, 17 Aug 2024 19:14:43 GMT</pubDate><media:content url="https://www.williamameyer.com/content/images/2024/08/1000012841-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.williamameyer.com/content/images/2024/08/1000012841-1.jpg" alt="SQL EDA - AFDB Market Trends - Part 2"><p>I will continue exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of data.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"><col width="114"><col width="76"><col width="79"><col width="67"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">Indicator</td>
			<td width="268" align="LEFT">IndicatorName</td>
			<td width="114" align="LEFT">Unit</td>
			<td width="76" align="LEFT">Frequency</td>
			<td width="79" align="LEFT">Date</td>
			<td width="67" align="LEFT">Value</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-02</td>
			<td align="RIGHT" sdval="7082.4" sdnum="1033;">7082.4</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1.4416" sdnum="1033;">1.4416</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
			<td align="LEFT">USD/Troy Ounce</td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1781.5" sdnum="1033;">1781.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="4495.41" sdnum="1033;">4495.41</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>From this data I became curious to explore 7 different things:</p><ol><li><em>How many unique indicators are there?</em></li><li><em>Is the number of unique indicators equal to the unique indicatorname field? If not explain the discrepancy.</em></li><li><em>Remove all commodities from this table and put them in a separate table.</em></li><li><em>From the remaining indicators, are there any gaps in reporting?</em></li><li><strong>Calculate the  &apos;PercentChangeDaily&apos; of indicators.</strong></li><li><strong>Calculate the  &apos;AverageMonthly&apos; value of different indicators.</strong></li><li><strong>What were the indicators that increased the most, by percentage, OVERALL from the inception of this data?</strong></li></ol><p>I have explored the first four tasks in <a href="https://www.williamameyer.com/project-1a-afdb-market-trends-2015/" rel="noreferrer">another post</a>.  I will finish the final three tasks of this project now.</p><p>Our first task will be to create a new field for our table.  This will be a comparison to previous dates, and will be on a per <em>indicator </em>basis, meaning the data must be partitioned, sorted, and lagged.</p><ol start="5"><li> Create a new field for &quot;PercentChangeDaily&quot;</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname, 
    date, 
    value, 
    yesterday,
	ROUND(((value - yesterday) / yesterday * 100), 5) AS percentchange
FROM(
	SELECT 
		*,
		LAG(value) OVER (PARTITION BY 
						 indicatorname 
    						 ORDER BY 
    						 	indicatorname, 
    						 	date) AS yesterday
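    						 	--(indicatorname is constant within each partition,
    						 	--so ORDER BY date alone would also work here)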
	FROM 
		afdbmarkettrends2015
ORDER BY 
	indicatorname, 
	date) AS yesterdaystable</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="67"><col width="71"><col width="101"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="67" align="LEFT">value</td>
			<td width="71" align="LEFT">yesterday</td>
			<td width="101" align="LEFT">percentchange</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-04</td>
			<td align="RIGHT" sdval="1693" sdnum="1033;">1693</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-05</td>
			<td align="RIGHT" sdval="1621" sdnum="1033;">1621</td>
			<td align="RIGHT" sdval="1693" sdnum="1033;">1693</td>
			<td align="RIGHT" sdval="4.4417" sdnum="1033;">4.4417</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-06</td>
			<td align="RIGHT" sdval="1544" sdnum="1033;">1544</td>
			<td align="RIGHT" sdval="1621" sdnum="1033;">1621</td>
			<td align="RIGHT" sdval="4.98705" sdnum="1033;">4.98705</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-07</td>
			<td align="RIGHT" sdval="1519" sdnum="1033;">1519</td>
			<td align="RIGHT" sdval="1544" sdnum="1033;">1544</td>
			<td align="RIGHT" sdval="1.64582" sdnum="1033;">1.64582</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-10</td>
			<td align="RIGHT" sdval="1495" sdnum="1033;">1495</td>
			<td align="RIGHT" sdval="1519" sdnum="1033;">1519</td>
			<td align="RIGHT" sdval="1.60535" sdnum="1033;">1.60535</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-11</td>
			<td align="RIGHT" sdval="1480" sdnum="1033;">1480</td>
			<td align="RIGHT" sdval="1495" sdnum="1033;">1495</td>
			<td align="RIGHT" sdval="1.01351" sdnum="1033;">1.01351</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-12</td>
			<td align="RIGHT" sdval="1453" sdnum="1033;">1453</td>
			<td align="RIGHT" sdval="1480" sdnum="1033;">1480</td>
			<td align="RIGHT" sdval="1.85822" sdnum="1033;">1.85822</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-13</td>
			<td align="RIGHT" sdval="1438" sdnum="1033;">1438</td>
			<td align="RIGHT" sdval="1453" sdnum="1033;">1453</td>
			<td align="RIGHT" sdval="1.04312" sdnum="1033;">1.04312</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-14</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="1438" sdnum="1033;">1438</td>
			<td align="RIGHT" sdval="-0.06949" sdnum="1033;">-0.06949</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-17</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="0" sdnum="1033;">0</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="LEFT">2011-01-18</td>
			<td align="RIGHT" sdval="1432" sdnum="1033;">1432</td>
			<td align="RIGHT" sdval="1439" sdnum="1033;">1439</td>
			<td align="RIGHT" sdval="0.48883" sdnum="1033;">0.48883</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This output creates a new column with yesterday&apos;s values, and uses that new column to calculate the percent change from the previous date on a per <em>indicatorname </em>basis.</p><p>The next task is to create an average monthly value field.  To do this I will extract the month and year from the <em>date </em>column, and I will then aggregate the values on the basis of the newly created <em>year </em>and <em>month </em>columns.</p><ol start="6"><li> Calculate the  &apos;AverageMonthly&apos; value of different indicators.</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname, 
	year, 
	month, 
	AVG(value) AS monthlyaveragevalue
FROM
	(SELECT 
		*, 
		EXTRACT(month FROM date) AS month, 
		EXTRACT(year FROM date) AS year
	FROM afdbmarkettrends2015) AS trendswithmonthyear
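--(Alternative: DATE_TRUNC(&apos;month&apos;, date) would produce a single
--month-start column in place of the separate year and month fields.)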
GROUP BY 
	indicatorname, 
	year, 
	month
ORDER BY 
	indicatorname, 
	year, 
	month</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="41"><col width="49"><col width="137"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="41" align="LEFT">year</td>
			<td width="49" align="LEFT">month</td>
			<td width="137" align="LEFT">monthlyaveragevalue</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="1" sdnum="1033;">1</td>
			<td align="RIGHT" sdval="1401.4" sdnum="1033;">1401.4</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="2" sdnum="1033;">2</td>
			<td align="RIGHT" sdval="1181.1" sdnum="1033;">1181.1</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="3" sdnum="1033;">3</td>
			<td align="RIGHT" sdval="1492.6957" sdnum="1033;">1492.6957</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="4" sdnum="1033;">4</td>
			<td align="RIGHT" sdval="1342.5556" sdnum="1033;">1342.5556</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="5" sdnum="1033;">5</td>
			<td align="RIGHT" sdval="1352.4" sdnum="1033;">1352.4</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="6" sdnum="1033;">6</td>
			<td align="RIGHT" sdval="1433.2273" sdnum="1033;">1433.2273</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="7" sdnum="1033;">7</td>
			<td align="RIGHT" sdval="1365.5238" sdnum="1033;">1365.5238</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="8" sdnum="1033;">8</td>
			<td align="RIGHT" sdval="1386.9545" sdnum="1033;">1386.9545</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="9" sdnum="1033;">9</td>
			<td align="RIGHT" sdval="1840.4091" sdnum="1033;">1840.4091</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="10" sdnum="1033;">10</td>
			<td align="RIGHT" sdval="2072.4762" sdnum="1033;">2072.4762</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Baltic Dry Index</td>
			<td align="RIGHT" sdval="2011" sdnum="1033;">2011</td>
			<td align="RIGHT" sdval="11" sdnum="1033;">11</td>
			<td align="RIGHT" sdval="1835.3182" sdnum="1033;">1835.3182</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The final problem is not as straightforward as the other examples.  We will be finding which indicators increased the most, by percentage, overall since the inception of the data.  It has to be noted that not all indicators have the same start date.  </p><p>On a per <em>indicator </em>basis we must calculate:</p>
<!--kg-card-begin: html-->
<ul>
  <li>Earliest date record of <em>indicator</em> as <em>beginningvalue</em></li>
  <li>Latest date record of <em>indicator</em> as <em>endingvalue</em></li>
  <li>Percent change of the new <em>beginningvalue</em> and <em>endingvalue</em> as <em>percentchange</em></li>
</ul>  
<!--kg-card-end: html-->
<p>This will be done with the lag value, filtering to the rows where the <em>beginningvalue </em>and <em>endingvalue </em>fall on each <em>indicator</em>&apos;s first and last recorded dates.</p><p>7.  What were the indicators that increased the most, by percentage, OVERALL from the inception of the data?</p><pre><code class="language-SQL">SELECT 
	indicatorname, 
	date, 
	beginningvalue, 
	endingvalue, 
	ROUND(percentchange, 4) AS percentchange
FROM
	/* 2 At this point we will have two records for each indicator
	one with the beginningvalue matched to the date, and one with
	the ending value matched to the date.  We will use lag to
	create a record for each indicator with the beginningvalue, and 
	the endingvalue, and then we use this to calculate percentage*/
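	--(Aside: FIRST_VALUE and LAST_VALUE window functions could fetch both
	--endpoints in one pass; the LAG approach below sticks to functions
	--already used in this series.)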
	
	(SELECT 
	 	indicatorname, 
	 	date, 
	 	LAG (value) OVER (PARTITION BY 
						  	(indicatorname) 
						ORDER BY 
						  	date) AS beginningvalue, 
	 	value AS endingvalue,
		(value - LAG (value) OVER (PARTITION BY 
							(indicatorname) 
						ORDER BY 
    						date)) 
	 		/ LAG (value) OVER (PARTITION BY 
							(indicatorname) 
						ORDER BY 
							date) * 100 AS percentchange 
		FROM
			--1  Create a mindate and maxdate for each indicator
	 		(SELECT *, 
			 	MIN(date) OVER (PARTITION BY 
							indicatorname) AS mindate, 
			 	MAX(date) OVER (PARTITION BY 
    						indicatorname) AS maxdate
			FROM 
			 	afdbmarkettrends2015
		ORDER BY 
			 date) 
	 	AS trendswithminandmax
        
		/*3 Filter each row using the WHERE clause,
	 	keeping only the records that fall on the beginning
	 	and ending dates for each indicator*/
	 	WHERE 
	 		date = mindate OR 
	 		date = maxdate
	ORDER BY 
	 	indicatorname, 
	 	date) 
	AS trendswithchangepercent
WHERE
	/*4 The lag leaves a null beginningvalue on each
	indicator&apos;s first record; we filter these out so
	only one row per indicator remains*/
	beginningvalue IS NOT NULL
ORDER BY 
	percentchange DESC
LIMIT 
	5
</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="5" rules="NONE" border="0">
	<colgroup><col width="234"><col width="79"><col width="100"><col width="82"><col width="101"></colgroup>
	<tbody>
		<tr>
			<td width="234" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="100" align="LEFT">beginningvalue</td>
			<td width="82" align="LEFT">endingvalue</td>
			<td width="101" align="LEFT">percentchange</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">South Africa Rand</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="6.6275" sdnum="1033;">6.6275</td>
			<td align="RIGHT" sdval="12.5681" sdnum="1033;">12.5681</td>
			<td align="RIGHT" sdval="89.6356" sdnum="1033;">89.6356</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Cote d&apos;Ivoire BRVM Composite Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="159.32" sdnum="1033;">159.32</td>
			<td align="RIGHT" sdval="301.22" sdnum="1033;">301.22</td>
			<td align="RIGHT" sdval="89.066" sdnum="1033;">89.066</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda SE All Share index USE</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="1192.57" sdnum="1033;">1192.57</td>
			<td align="RIGHT" sdval="1893.63" sdnum="1033;">1893.63</td>
			<td align="RIGHT" sdval="58.7856" sdnum="1033;">58.7856</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">South Africa JALSH All Share Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="32308.11" sdnum="1033;">32308.11</td>
			<td align="RIGHT" sdval="50758.42" sdnum="1033;">50758.42</td>
			<td align="RIGHT" sdval="57.1074" sdnum="1033;">57.1074</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Uganda Shilling</td>
			<td align="LEFT">2015-07-28</td>
			<td align="RIGHT" sdval="2310" sdnum="1033;">2310</td>
			<td align="RIGHT" sdval="3420" sdnum="1033;">3420</td>
			<td align="RIGHT" sdval="48.0519" sdnum="1033;">48.0519</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>There we have our final SQL analysis of this dataset.</p><hr><p>Key takeaways:</p><p>With SQL it is easy to get the data to report what you want it to.  Some of the positives are that it is quick to query, and the language is quite easy to understand for smaller queries.  Conversely, while trying to accomplish more robust aggregations, the language becomes harder to follow.  Also it is not a &apos;one stop shop&apos; for analyzing and reporting visually.  Tools like Python or R would be useful here, as they can perform the same aggregations and produce the visualizations in one place.</p><p>I will be exploring this data visually with Tableau in <a href="https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/" rel="noreferrer">this post</a>.</p><p></p>]]></content:encoded></item><item><title><![CDATA[SQL EDA - AFDB Market Trends - Part 1]]></title><description><![CDATA[<p>I will be exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of</p>]]></description><link>https://www.williamameyer.com/project-1a-afdb-market-trends-2015/</link><guid isPermaLink="false">66ba6f91b4b8009e25b1bc0f</guid><dc:creator><![CDATA[William M]]></dc:creator><pubDate>Mon, 12 Aug 2024 21:15:24 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1655720360377-b97f6715e1ae?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fG1vbmV5JTIwYWZyaWNhfGVufDB8fHx8MTcyMzQ5NDMxMHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1655720360377-b97f6715e1ae?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fG1vbmV5JTIwYWZyaWNhfGVufDB8fHx8MTcyMzQ5NDMxMHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="SQL EDA - AFDB Market Trends - Part 1"><p>I will be exploring a dataset found on <em>data.world</em> that was originally found on <a href="https://data.humdata.org/dataset/afdb-market-trends-2015" rel="noreferrer">The Humanitarian Data Exchange</a>.  This dataset contains numerous market indicators of importance to the African Development Bank, along with dates and values for different indicators.  Below is a snippet of the first few rows of data.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="6" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"><col width="114"><col width="76"><col width="79"><col width="67"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">Indicator</td>
			<td width="268" align="LEFT">IndicatorName</td>
			<td width="114" align="LEFT">Unit</td>
			<td width="76" align="LEFT">Frequency</td>
			<td width="79" align="LEFT">Date</td>
			<td width="67" align="LEFT">Value</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-02</td>
			<td align="RIGHT" sdval="7082.4" sdnum="1033;">7082.4</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1.4416" sdnum="1033;">1.4416</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
			<td align="LEFT">USD/Troy Ounce</td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="1781.5" sdnum="1033;">1781.5</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT"><br></td>
			<td align="LEFT">D</td>
			<td align="LEFT">2011-01-03</td>
			<td align="RIGHT" sdval="4495.41" sdnum="1033;">4495.41</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>From this data I became curious to explore 7 different things:</p><ol><li>How many unique indicators are there?</li><li>Is the number of unique indicators equal to the unique <em>indicatorname </em>field? If not explain the discrepancy.</li><li>Remove all commodities from this table and put them in a separate table.</li><li>From the remaining indicators, are there any gaps in reporting?</li><li><em>Calculate the  &apos;PercentChangeDaily&apos; of indicators.*</em></li><li><em>Calculate the  &apos;AverageMonthly&apos; value of different indicators.*</em></li><li><em>What were the indicators that increased the most percentage OVERALL from the inception of this data.*</em></li></ol><p><em>*The last 3 items will be covered in the next post.</em></p><p>For a more condensed version, view on <a href="https://github.com/williamAM1/Data-Analytics-Portfolio/tree/main/Project01_SQL_EDA_AFDB" rel="noreferrer">github</a>.</p><hr><p>Let&apos;s start with the first question.</p><ol><li> How many unique indicators are there?</li></ol><pre><code class="language-SQL">SELECT 
	COUNT(DISTINCT(indicator)) AS indicator_count
FROM 
	afdbmarkettrends2015</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="1" rules="NONE" border="0">
	<colgroup><col width="104"></colgroup>
	<tbody>
		<tr>
			<td width="104" height="17" align="LEFT">indicator_count</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="31" sdnum="1033;">31</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Problem 1 is complete. I&apos;ve answered the question, but let&apos;s answer some questions that may be implied.  I will output the 31 indicators along with the indicator names.  We will sort the list alphabetically by the <em>indicatorname</em>.</p><pre><code class="language-SQL">SELECT 
	DISTINCT(indicator), 
	indicatorname
FROM 
	afdbmarkettrends2015
ORDER BY 
	indicatorname</code></pre>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="70"><col width="268"></colgroup>
	<tbody>
		<tr>
			<td width="70" height="17" align="LEFT">indicator</td>
			<td width="268" align="LEFT">indicatorname</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244509" sdnum="1033;">91244509</td>
			<td align="LEFT">Baltic Dry Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242409" sdnum="1033;">91242409</td>
			<td align="LEFT">CFA zone Countries CFA Franc</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91241909" sdnum="1033;">91241909</td>
			<td align="LEFT">Cocoa (USD/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244109" sdnum="1033;">91244109</td>
			<td align="LEFT">Coffee Brazilian Naturals (US cents/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242709" sdnum="1033;">91242709</td>
			<td align="LEFT">Coffee Robusta (US cents/tonne)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243909" sdnum="1033;">91243909</td>
			<td align="LEFT">Copper (USD/lb)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244709" sdnum="1033;">91244709</td>
			<td align="LEFT">Copper (USD/MT)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243409" sdnum="1033;">91243409</td>
			<td align="LEFT">Cote d&apos;Ivoire BRVM Composite Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243709" sdnum="1033;">91243709</td>
			<td align="LEFT">Cotton (USD/lb)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244809" sdnum="1033;">91244809</td>
			<td align="LEFT">Crude Oil, Brent (USD/bbl)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242809" sdnum="1033;">91242809</td>
			<td align="LEFT">Egypt CASE 30 Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243809" sdnum="1033;">91243809</td>
			<td align="LEFT">Egypt Pound</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243609" sdnum="1033;">91243609</td>
			<td align="LEFT">Europe EURO</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244909" sdnum="1033;">91244909</td>
			<td align="LEFT">Gold (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244609" sdnum="1033;">91244609</td>
			<td align="LEFT">Iron Ore (USD/Dry MT)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242609" sdnum="1033;">91242609</td>
			<td align="LEFT">Kenya Kenyan Shilling</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242309" sdnum="1033;">91242309</td>
			<td align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242909" sdnum="1033;">91242909</td>
			<td align="LEFT">Mauritius Mauritius AllShares SEMDEX</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244009" sdnum="1033;">91244009</td>
			<td align="LEFT">Mauritius Rupee</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244309" sdnum="1033;">91244309</td>
			<td align="LEFT">Morocco Casa All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243109" sdnum="1033;">91243109</td>
			<td align="LEFT">Morocco Dirham</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243009" sdnum="1033;">91243009</td>
			<td align="LEFT">Nigeria Naira</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242009" sdnum="1033;">91242009</td>
			<td align="LEFT">Nigeria NGSE All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242209" sdnum="1033;">91242209</td>
			<td align="LEFT">Platinum (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91245009" sdnum="1033;">91245009</td>
			<td align="LEFT">Silver (USD/Troy Ounce)</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242509" sdnum="1033;">91242509</td>
			<td align="LEFT">South Africa JALSH All Share Index</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243209" sdnum="1033;">91243209</td>
			<td align="LEFT">South Africa Rand</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91242109" sdnum="1033;">91242109</td>
			<td align="LEFT">Tunisia Dinar</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243309" sdnum="1033;">91243309</td>
			<td align="LEFT">Tunisia Tunis se Tnse Index TUNINDEX</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91243509" sdnum="1033;">91243509</td>
			<td align="LEFT">Uganda SE All Share index USE</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="91244209" sdnum="1033;">91244209</td>
			<td align="LEFT">Uganda Shilling</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The second question outlines a &apos;sanity check&apos; for the data.  We will be comparing the count of unique values in the <em>indicator </em>field with the count of unique values in the <em>indicatorname </em>field, to make sure they are the same.</p><ol start="2"><li> Is the number of unique indicators equal to the unique <em>indicatorname </em>field? If not explain the discrepancy.</li></ol><pre><code class="language-SQL">SELECT 
	COUNT(DISTINCT(indicator)) AS indicator_count,
	COUNT(DISTINCT(indicatorname)) AS indicator_name_count
FROM 
	afdbmarkettrends2015</code></pre>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="2" rules="NONE" border="0">
	<colgroup><col width="104"><col width="145"></colgroup>
	<tbody>
		<tr>
			<td width="104" height="17" align="LEFT">indicator_count</td>
			<td width="145" align="LEFT">indicator_name_count</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="31" sdnum="1033;">31</td>
			<td align="RIGHT" sdval="31" sdnum="1033;">31</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>The fields are equal; this aspect of the data is &apos;sane&apos;.</p><p>Then we want to separate the commodities from this data.  The <em>unit </em>column is populated if and only if the item is a commodity.  We will use this information to extract the commodities, and then delete them from the original table once they have been copied to a new one.</p><ol start="3"><li> Remove all commodities from this table and put them in a separate table.  Then we will count the number of items in the new table.</li></ol><pre><code class="language-SQL">CREATE TABLE afdbmarketcommodities2015 AS
	SELECT 
		* 
	FROM 
		afdbmarkettrends2015
	WHERE 
		unit IS NOT NULL;

SELECT 
	COUNT (*) AS newcommoditiescount 
FROM 
	afdbmarketcommodities2015;</code></pre><p>OUTPUT:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="1" rules="NONE" border="0">
	<colgroup><col width="147"></colgroup>
	<tbody>
		<tr>
			<td width="147" height="17" align="LEFT">newcommoditiescount</td>
		</tr>
		<tr>
			<td height="17" align="RIGHT" sdval="11473" sdnum="1033;">11473</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>This makes the new table from the filtered data.  The data was checked for accuracy, and we now need to remove the data from the <em>afdbmarkettrends2015 </em>table.</p><pre><code class="language-SQL">DELETE FROM 
	afdbmarkettrends2015 
WHERE 
	unit IS NOT NULL;</code></pre><p>OUTPUT:</p><blockquote><em>DELETE 11473 Query returned successfully in 11 msec.</em></blockquote><p><em>Good.</em></p><p>When I saw the dates listed, I was genuinely curious if the days had any gaps.  This will require some creative sorting by <em>indicator </em>and <em>date</em>, and the creation of a new column that holds the previous date on record.  This sorting should be done separately for each indicator.</p><ol start="4"><li>From the remaining indicators, are there any gaps in reporting?</li></ol><pre><code class="language-SQL">SELECT 
	indicatorname,
	date,
	previousday,
	CAST(date AS timestamp) - CAST(previousday AS timestamp)  AS timegap
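	--(With DATE columns, date - previousday would return an integer day count;
	--the casts to timestamp make the gap an INTERVAL such as &apos;3 days&apos;.)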
FROM (
	SELECT 
		*,
		LAG(date) OVER(PARTITION BY (indicator) ORDER BY indicator, date) AS previousday
	FROM afdbmarkettrends2015
	ORDER BY 
		indicator, 
		date) 
	AS laggedtable</code></pre><p>OUTPUT (small summary):</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="17" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p>Of interest in the same output is that the lag starts over when a new indicator is listed.</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="18" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">2015-07-24</td>
			<td align="LEFT">3 days</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Nigeria NGSE All Share Index</td>
			<td align="LEFT">2015-07-28</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p><em>Good</em></p><p>Let&apos;s look at another:</p>
<!--kg-card-begin: html-->
<table frame="VOID" cellspacing="0" cols="4" rules="NONE" border="0">
	<colgroup><col width="251"><col width="79"><col width="83"><col width="60"></colgroup>
	<tbody>
		<tr>
			<td width="251" height="18" align="LEFT">indicatorname</td>
			<td width="79" align="LEFT">date</td>
			<td width="83" align="LEFT">previousday</td>
			<td width="60" align="LEFT">timegap</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">2015-07-24</td>
			<td align="LEFT">3 days</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Tunisia Dinar</td>
			<td align="LEFT">2015-07-28</td>
			<td align="LEFT">2015-07-27</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">NULL</td>
			<td align="LEFT">NULL</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">2011-01-03</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">2011-01-04</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">2011-01-05</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">2011-01-06</td>
			<td align="LEFT">1 day</td>
		</tr>
		<tr>
			<td height="17" align="LEFT">Kenya Nairobi SE Index- NSE 20</td>
			<td align="LEFT">2011-01-10</td>
			<td align="LEFT">2011-01-07</td>
			<td align="LEFT">3 days</td>
		</tr>
	</tbody>
</table>
<!--kg-card-end: html-->
<p><em>Still Good</em></p><p>We will continue with <a href="https://www.williamameyer.com/sql-exploratory-analysis-project-afdb-market-trends-2015-part-2/" rel="noreferrer">tasks 5-7</a> in another post.</p><p>I will also be exploring this data visually with Tableau in <a href="https://www.williamameyer.com/tableau-understanding-the-afdb-market-trends-2015-dataset-with-visualization/" rel="noreferrer">this post</a> if you want to skip ahead.</p>]]></content:encoded></item></channel></rss>