R - Agriculture Crop Production EDA - Part 2

We will be answering 5 questions on the Agriculture Crop Production data.
- Which country in Africa produced the most coffee in 2018?
- Which country in Africa produced the most coffee per acre in 2018?
- Which country in Africa has the largest increase in the amount of coffee produced since 1961?
- Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*
- Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*
*These final two questions will be answered, for the previous questions, and an explanation of the data, please review the previous post.
4 Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*
Let's jump into the solution
library(dplyr)
#Import all of the datasets
t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")
#Combine the datasets
q4 <- rbind(t1,t2,t3,t4,t5) %>%
#Keep necessary variables
select('Area', 'Item', 'Unit', 'Element','Y1961','Y2018') %>%
#Filter "Item" and "Element" on appropriate values
filter( Item == 'Coffee, green', Element == 'Yield') %>%
#Create a new variable and apply the percent change formula
mutate(PercentChange = ((Y2018 - Y1961) / Y1961 * 100)) %>%
#Sort with the highest percent change on top
arrange(desc(PercentChange))
#round for better formatting
q4$PercentChange = round(q4$'PercentChange', 2)
OUTPUT:
Area | Item | Unit | Element | Y1961 | Y2018 | PercentChange | |
1 | Viet Nam | Coffee, green | hg/ha | Yield | 1934 | 26117 | 1250.41 |
2 | Thailand | Coffee, green | hg/ha | Yield | 600 | 5725 | 854.17 |
3 | Nigeria | Coffee, green | hg/ha | Yield | 1833 | 12886 | 603 |
4 | Malaysia | Coffee, green | hg/ha | Yield | 3391 | 22810 | 572.66 |
5 | China, mainland | Coffee, green | hg/ha | Yield | 5000 | 29405 | 488.1 |
6 | Malawi | Coffee, green | hg/ha | Yield | 4175 | 23345 | 459.16 |
7 | Honduras | Coffee, green | hg/ha | Yield | 2767 | 11195 | 304.59 |
8 | Nicaragua | Coffee, green | hg/ha | Yield | 2762 | 10672 | 286.39 |
9 | Brazil | Coffee, green | hg/ha | Yield | 5084 | 19060 | 274.9 |
10 | Lao People's Democratic Republic | Coffee, green | hg/ha | Yield | 5000 | 18611 | 272.22 |
11 | Sierra Leone | Coffee, green | hg/ha | Yield | 6379 | 20706 | 224.6 |
12 | Ghana | Coffee, green | hg/ha | Yield | 5667 | 16959 | 199.26 |
13 | Rwanda | Coffee, green | hg/ha | Yield | 5833 | 16506 | 182.98 |
14 | French Polynesia | Coffee, green | hg/ha | Yield | 869 | 2268 | 160.99 |
5 Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*
This required a bit of working to do. The data needed to be lagged, aggregated, and aggregated some more to get the information on the continental level and 'subtractable' and divisible at a row level.
library(dplyr)
#PART 1
#Import all of the datasets
#Before the merge each individual dataset was given a 'Continent' column
t1 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
t1$Continent = "Africa"
t2 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Americas.csv")
t2$Continent = "Americas"
t3 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Asia.csv")
t3$Continent = "Asia"
t4 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Europe.csv")
t4$Continent = "Europe"
t5 <- read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Oceania.csv")
t5$Continent = "Oceania"
#PART 2
This part of the code is where I filtered data, and used the lag function to essentially pivot the data.
#Combine the datasets
q5step1 <- rbind(t1,t2,t3,t4,t5) %>%
#Select just coffee, and Element 'Production' OR 'Area harvested'
filter(Item == 'Coffee, green', Element == 'Production' | Element =='Area harvested') %>%
#Select a few variables for now, more will be removed later
select('Continent', 'Area', 'Unit', 'Element', 'Y1961', 'Y2018') %>%
#We will lag the production of both years partitioned by the 'Area'
group_by(Area) %>%
mutate('1961AreaHarvested' = lag(Y1961, n=1, order_by=Area)) %>%
mutate('2018AreaHarvested' = lag(Y2018, n=1, order_by=Area)) %>%
#Renaming Y1961, and Y2018 Columns to denote their new role as Production values
rename( "1961Production" = "Y1961") %>%
rename( "2018Production" = "Y2018") %>%
#We will get filter by Element = Production... this will help with the data pivot
#It will remove the 'Area Harvested' elements from the 1961/2018_Production variables
filter(Element == 'Production') %>%
#Clean data up a bit and continue in another pipe, because this has become cumbersome
#We will do maths and such in another pipe
group_by(Continent) %>%
select('Continent', '1961Production','1961AreaHarvested','2018Production','2018AreaHarvested')
#PART 3
The data was further aggregated by continent, and yield calculated as production-units/area harvested for each year. From there the percent change calculation was applied.
q5step2 <- q5step1 %>%
group_by(Continent) %>%
summarize_at(vars('1961Production','1961AreaHarvested','2018Production','2018AreaHarvested'), list(name = sum), na.rm=T )
#Create the yield for each continent
q5step3 <- q5step2 %>%
#Calculate yield, the original document uses hectares and hectograms... but this will be a percent change
#the unit is not important so long as its the same in 1961 and 2018
mutate(Yield2018 = .$'2018Production_name' / .$'2018AreaHarvested_name') %>%
mutate(Yield1961 = .$'1961Production_name' / .$'1961AreaHarvested_name') %>%
select(Continent,Yield1961,Yield2018)
#PART 4 FINAL OUTPUT
#Output the final table with the yield-percent-change per continent in descending order
q5 <- q5step3 %>%
#Apply the percent change calculation
mutate(Yield_Change_Percent = (Yield2018 - Yield1961)/ Yield1961 * 100) %>%
#Nice round Numbers
mutate(Yield_Change_Percent = round(Yield_Change_Percent,2)) %>%
#sorted by highest yield change to lowest
arrange(desc(Yield_Change_Percent)) %>%
#Choose only the variables needed
select(Continent,Yield_Change_Percent)
OUTPUT:
Continent | Yield_Change_Percent |
Oceania | 163.29 |
Americas | 137.01 |
Asia | 111.31 |
Africa | 14.63 |
Europe | NA |
So with all that we have our final (very small) output.
This was my first time using RStudio, or the R language in general. So I am early on in the learning curve. I will definitely revisit R at another time.