R - Agriculture Crop Production EDA - Part 2

R - Agriculture Crop Production EDA - Part 2

We will be answering 5 questions on the Agriculture Crop Production data.

  1. Which country in Africa produced the most coffee in 2018?
  2. Which country in Africa produced the most coffee per acre in 2018?
  3. Which country in Africa has the largest increase in the amount of coffee produced since 1961?
  4. Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*
  5. Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*

*These final two questions will be answered, for the previous questions, and an explanation of the data, please review the previous post.


4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*

Let's jump into the solution

library(dplyr)

#Import all of the datasets
t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")

#Combine the datasets
q4 <- rbind(t1,t2,t3,t4,t5) %>%
  #Keep necessary variables
  select('Area', 'Item', 'Unit', 'Element','Y1961','Y2018') %>%
  #Filter "Item" and "Element" on appropriate values
  filter( Item == 'Coffee, green', Element == 'Yield') %>%
  #Create a new variable and apply the percent change formula
  mutate(PercentChange = ((Y2018 - Y1961) / Y1961 * 100)) %>%
  #Sort with the highest percent change on top
  arrange(desc(PercentChange))

#round for better formatting
q4$PercentChange = round(q4$'PercentChange', 2)

OUTPUT:


Area Item Unit Element Y1961 Y2018 PercentChange
1 Viet Nam Coffee, green hg/ha Yield 1934 26117 1250.41
2 Thailand Coffee, green hg/ha Yield 600 5725 854.17
3 Nigeria Coffee, green hg/ha Yield 1833 12886 603
4 Malaysia Coffee, green hg/ha Yield 3391 22810 572.66
5 China, mainland Coffee, green hg/ha Yield 5000 29405 488.1
6 Malawi Coffee, green hg/ha Yield 4175 23345 459.16
7 Honduras Coffee, green hg/ha Yield 2767 11195 304.59
8 Nicaragua Coffee, green hg/ha Yield 2762 10672 286.39
9 Brazil Coffee, green hg/ha Yield 5084 19060 274.9
10 Lao People's Democratic Republic Coffee, green hg/ha Yield 5000 18611 272.22
11 Sierra Leone Coffee, green hg/ha Yield 6379 20706 224.6
12 Ghana Coffee, green hg/ha Yield 5667 16959 199.26
13 Rwanda Coffee, green hg/ha Yield 5833 16506 182.98
14 French Polynesia Coffee, green hg/ha Yield 869 2268 160.99

5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*

This required a bit of working to do. The data needed to be lagged, aggregated, and aggregated some more to get the information on the continental level and 'subtractable' and divisible at a row level.

library(dplyr)

#PART 1
#Import all of the datasets
#Before the merge each individual dataset was given a 'Continent' column


t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t1$Continent = "Africa"
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t2$Continent = "Americas"
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t3$Continent = "Asia"
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t4$Continent = "Europe"
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")
t5$Continent = "Oceania"




#PART 2
This part of the code is where I filtered data, and used the lag function to essentially pivot the data.

#Combine the datasets
q5step1 <- rbind(t1,t2,t3,t4,t5) %>%
  #Select just coffee, and Element 'Production' OR 'Area harvested' 
  filter(Item == 'Coffee, green', Element == 'Production' | Element =='Area harvested') %>%
  #Select a few variables for now, more will be removed later
  select('Continent', 'Area', 'Unit', 'Element', 'Y1961', 'Y2018') %>%
  #We will lag the production of both years partitioned by the 'Area'
  group_by(Area) %>%
  mutate('1961AreaHarvested' = lag(Y1961, n=1, order_by=Area)) %>%
  mutate('2018AreaHarvested' = lag(Y2018, n=1, order_by=Area)) %>%
  #Renaming Y1961, and Y2018 Columns to denote their new role as Production values
  rename( "1961Production" = "Y1961") %>%
  rename( "2018Production" = "Y2018") %>%
  #We will get filter by Element = Production... this will help with the data pivot
  #It will remove the 'Area Harvested' elements from the 1961/2018_Production variables
  filter(Element == 'Production') %>%
  #Clean data up a bit and continue in another pipe, because this has become cumbersome
  #We will do maths and such in another pipe
  group_by(Continent) %>%
  select('Continent', '1961Production','1961AreaHarvested','2018Production','2018AreaHarvested')



#PART 3
The data was further aggregated by continent, and yield calculated as production-units/area harvested for each year.  From there the percent change calculation was applied.

q5step2 <- q5step1 %>%
  group_by(Continent) %>%
summarize_at(vars('1961Production','1961AreaHarvested','2018Production','2018AreaHarvested'), list(name = sum), na.rm=T )
 
#Create the yield for each continent
q5step3 <- q5step2 %>%
  #Calculate yield, the original document uses hectares and hectograms... but this will be a percent change
  #the unit is not important so long as its the same in 1961 and 2018
  mutate(Yield2018 = .$'2018Production_name' / .$'2018AreaHarvested_name') %>%
  mutate(Yield1961 = .$'1961Production_name' / .$'1961AreaHarvested_name') %>%
  select(Continent,Yield1961,Yield2018)

#PART 4 FINAL OUTPUT
#Output the final table with the yield-percent-change per continent in descending order

q5 <- q5step3 %>%
  #Apply the percent change calculation
  mutate(Yield_Change_Percent = (Yield2018 - Yield1961)/ Yield1961 * 100) %>%
  #Nice round Numbers
  mutate(Yield_Change_Percent = round(Yield_Change_Percent,2)) %>%
  #sorted by highest yield change to lowest
  arrange(desc(Yield_Change_Percent)) %>%
  #Choose only the variables needed
  select(Continent,Yield_Change_Percent)

OUTPUT:

Continent Yield_Change_Percent
Oceania 163.29
Americas 137.01
Asia 111.31
Africa 14.63
Europe NA

So with all that we have our final (very small) output.

This was my first time using RStudio, or the R language in general. So I am early on in the learning curve. I will definitely revisit R at another time.