By William M — Sep 3, 2024

R - Agriculture Crop Production EDA - Part 2

We will be answering 5 questions on the Agriculture Crop Production data.

Which country in Africa produced the most coffee in 2018?
Which country in Africa produced the most coffee per acre in 2018?
Which country in Africa has the largest increase in the amount of coffee produced since 1961?
Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*
Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*

*These final two questions will be answered, for the previous questions, and an explanation of the data, please review the previous post.

4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*

Let's jump into the solution

library(dplyr)

#Import all of the datasets
t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")

#Combine the datasets
q4 <- rbind(t1,t2,t3,t4,t5) %>%
  #Keep necessary variables
  select('Area', 'Item', 'Unit', 'Element','Y1961','Y2018') %>%
  #Filter "Item" and "Element" on appropriate values
  filter( Item == 'Coffee, green', Element == 'Yield') %>%
  #Create a new variable and apply the percent change formula
  mutate(PercentChange = ((Y2018 - Y1961) / Y1961 * 100)) %>%
  #Sort with the highest percent change on top
  arrange(desc(PercentChange))

#round for better formatting
q4$PercentChange = round(q4$'PercentChange', 2)

OUTPUT:

	Area	Item	Unit	Element	Y1961	Y2018	PercentChange
1	Viet Nam	Coffee, green	hg/ha	Yield	1934	26117	1250.41
2	Thailand	Coffee, green	hg/ha	Yield	600	5725	854.17
3	Nigeria	Coffee, green	hg/ha	Yield	1833	12886	603
4	Malaysia	Coffee, green	hg/ha	Yield	3391	22810	572.66
5	China, mainland	Coffee, green	hg/ha	Yield	5000	29405	488.1
6	Malawi	Coffee, green	hg/ha	Yield	4175	23345	459.16
7	Honduras	Coffee, green	hg/ha	Yield	2767	11195	304.59
8	Nicaragua	Coffee, green	hg/ha	Yield	2762	10672	286.39
9	Brazil	Coffee, green	hg/ha	Yield	5084	19060	274.9
10	Lao People's Democratic Republic	Coffee, green	hg/ha	Yield	5000	18611	272.22
11	Sierra Leone	Coffee, green	hg/ha	Yield	6379	20706	224.6
12	Ghana	Coffee, green	hg/ha	Yield	5667	16959	199.26
13	Rwanda	Coffee, green	hg/ha	Yield	5833	16506	182.98
14	French Polynesia	Coffee, green	hg/ha	Yield	869	2268	160.99

5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*

This required a bit of working to do. The data needed to be lagged, aggregated, and aggregated some more to get the information on the continental level and 'subtractable' and divisible at a row level.

library(dplyr)

#PART 1
#Import all of the datasets
#Before the merge each individual dataset was given a 'Continent' column


t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t1$Continent = "Africa"
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t2$Continent = "Americas"
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t3$Continent = "Asia"
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t4$Continent = "Europe"
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")
t5$Continent = "Oceania"




#PART 2
This part of the code is where I filtered data, and used the lag function to essentially pivot the data.

#Combine the datasets
q5step1 <- rbind(t1,t2,t3,t4,t5) %>%
  #Select just coffee, and Element 'Production' OR 'Area harvested' 
  filter(Item == 'Coffee, green', Element == 'Production' | Element =='Area harvested') %>%
  #Select a few variables for now, more will be removed later
  select('Continent', 'Area', 'Unit', 'Element', 'Y1961', 'Y2018') %>%
  #We will lag the production of both years partitioned by the 'Area'
  group_by(Area) %>%
  mutate('1961AreaHarvested' = lag(Y1961, n=1, order_by=Area)) %>%
  mutate('2018AreaHarvested' = lag(Y2018, n=1, order_by=Area)) %>%
  #Renaming Y1961, and Y2018 Columns to denote their new role as Production values
  rename( "1961Production" = "Y1961") %>%
  rename( "2018Production" = "Y2018") %>%
  #We will get filter by Element = Production... this will help with the data pivot
  #It will remove the 'Area Harvested' elements from the 1961/2018_Production variables
  filter(Element == 'Production') %>%
  #Clean data up a bit and continue in another pipe, because this has become cumbersome
  #We will do maths and such in another pipe
  group_by(Continent) %>%
  select('Continent', '1961Production','1961AreaHarvested','2018Production','2018AreaHarvested')



#PART 3
The data was further aggregated by continent, and yield calculated as production-units/area harvested for each year.  From there the percent change calculation was applied.

q5step2 <- q5step1 %>%
  group_by(Continent) %>%
summarize_at(vars('1961Production','1961AreaHarvested','2018Production','2018AreaHarvested'), list(name = sum), na.rm=T )
 
#Create the yield for each continent
q5step3 <- q5step2 %>%
  #Calculate yield, the original document uses hectares and hectograms... but this will be a percent change
  #the unit is not important so long as its the same in 1961 and 2018
  mutate(Yield2018 = .$'2018Production_name' / .$'2018AreaHarvested_name') %>%
  mutate(Yield1961 = .$'1961Production_name' / .$'1961AreaHarvested_name') %>%
  select(Continent,Yield1961,Yield2018)

#PART 4 FINAL OUTPUT
#Output the final table with the yield-percent-change per continent in descending order

q5 <- q5step3 %>%
  #Apply the percent change calculation
  mutate(Yield_Change_Percent = (Yield2018 - Yield1961)/ Yield1961 * 100) %>%
  #Nice round Numbers
  mutate(Yield_Change_Percent = round(Yield_Change_Percent,2)) %>%
  #sorted by highest yield change to lowest
  arrange(desc(Yield_Change_Percent)) %>%
  #Choose only the variables needed
  select(Continent,Yield_Change_Percent)

OUTPUT:

Continent	Yield_Change_Percent
Oceania	163.29
Americas	137.01
Asia	111.31
Africa	14.63
Europe	NA

So with all that we have our final (very small) output.

This was my first time using RStudio, or the R language in general. So I am early on in the learning curve. I will definitely revisit R at another time.

R - Agriculture Crop Production EDA - Part 2

4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*

5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*

R - Agriculture Crop Production EDA - Part 1

Superset - Agriculture Crop Production Visualization