R - Agriculture Crop Production EDA - Part 1

I found a dataset on data.world that had a great deal of information on country crop yields. It contained 5 data tables divided into different groupings including Africa, The Americas, Asia, Europe, and Oceania. It also contained 3 more tables that helped explain the dataset and offer context about the data collection, measurements, etc.
Lets preview the data concerning Africa.
Area Code | Area | Item Code | Item | Element Code | Element | Unit | Y1961 |
4 | Algeria | 221 | Almonds, with shell | 5312 | Area harvested | ha | 13300 |
4 | Algeria | 221 | Almonds, with shell | 5419 | Yield | hg/ha | 4511 |
4 | Algeria | 221 | Almonds, with shell | 5510 | Production | tonnes | 6000 |
4 | Algeria | 515 | Apples | 5312 | Area harvested | ha | 3400 |
4 | Algeria | 515 | Apples | 5419 | Yield | hg/ha | 45294 |
4 | Algeria | 515 | Apples | 5510 | Production | tonnes | 15400 |
Each agricultural product in each country has three entries. The 'Element' column describes the measure. Three elements are included 'Area harvested', 'Yield', and 'Production'. 'Yield' is an aggregate of the 'Area harvested', and 'Production'.
The data for the years is ordered by column as well.
Y1961F | Y1962 | Y1962F | Y1963 | Y1963F | Y1964 | Y1964F | Y1965 |
F | 13300 | F | 13300 | F | 14200 | F | 13800 |
Fc | 4511 | Fc | 4511 | Fc | 4507 | Fc | 4493 |
6000 | 6000 | 6400 | 6200 | ||||
F | 3100 | F | 2800 | F | 2700 | F | 2900 |
Fc | 45161 | Fc | 46429 | Fc | 46078 | Fc | 45348 |
14000 | 13000 | 12441 | 13151 |
This continues until 2018. Notice how there is a date with an 'F' at the end of it? This allows us to know how the data was collected. This data can be found in the 'Flags.csv' file included in the dataset. I include it for those are curious, but it was not used in this data exploration. I'll use an ellipse to denote where I snipped the dataset.
Flag | Flags |
Official data | |
* | Unofficial figure |
A | Aggregate, may include official, semi-official, estimated or calculated data |
… | … |
F | FAO estimate |
Fb | Data obtained as a balance |
Fc | Calculated data |
… | … |
TO THE ANALYSIS
We will be answering 5 questions on this data.
- Which country in Africa produced the most coffee in 2018?
- Which country in Africa produced the most coffee per acre in 2018?
- Which country in Africa has the largest increase in the amount of coffee produced since 1961?
- Which country (in the world, not specifically Africa) saw the highest increase in the amount of coffee grown per acre since 1961?*
- Which continent (or rather, major grouping as presented in the 5 data sets) has the largest increase in the amount of coffee grown per acre since 1961?*
*This post contains solutions for 1-3, the next post details 4 and 5
1 Which country in Africa produced the most coffee in 2018?
We are going to start by importing the data. Then we distill the columns that we to display by subsetting the dataset and including only 'Area', 'Item', 'Element', 'Unit', and the year in question 'Y2018'.
The idea here is simple, we are going to filter where the 'Item Code' is equal to '656', and filter on the 'Element' column where it is equal to 'Production'. After this it is a matter of sorting data on the 'Y2018' column, as this column contains the numerical 'Production' values.
#Import dplyr
library(dplyr)
#Import data
africa_table = read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
q1 <- africa_table %>%
select(Area, Item.Code, Item, Element, Unit, Y2018) %>%
filter(Element == "Production") %>%
filter(Item.Code == 656) %>%
arrange(desc(Y2018))
#Output as a csv vile
write.csv(q1africa_table, "C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\q1.csv")
OUTPUT:
Area | Item.Code | Item | Element | Unit | Y2018 |
Ethiopia | 656 | Coffee, green | Production | tonnes | 494574 |
Uganda | 656 | Coffee, green | Production | tonnes | 284225 |
Madagascar | 656 | Coffee, green | Production | tonnes | 60114 |
United Republic of Tanzania | 656 | Coffee, green | Production | tonnes | 43193 |
Guinea | 656 | Coffee, green | Production | tonnes | 42900 |
Kenya | 656 | Coffee, green | Production | tonnes | 41375 |
C�te d'Ivoire | 656 | Coffee, green | Production | tonnes | 39092 |
Rwanda | 656 | Coffee, green | Production | tonnes | 38643 |
Democratic Republic of the Congo | 656 | Coffee, green | Production | tonnes | 31145 |
Cameroon | 656 | Coffee, green | Production | tonnes | 30590 |
Sierra Leone | 656 | Coffee, green | Production | tonnes | 20480 |
Togo | 656 | Coffee, green | Production | tonnes | 19076 |
Angola | 656 | Coffee, green | Production | tonnes | 16308 |
Burundi | 656 | Coffee, green | Production | tonnes | 14216 |
Malawi | 656 | Coffee, green | Production | tonnes | 11082 |
2 Which country in Africa produced the most coffee per acre in 2018?
This is a similar problem, there is no need to aggregate the data because the data is already aggregated, it is a matter of not selecting "Production" as the "Element", but filtering by "Yield".
library(dplyr)
africa_table = read.csv("C:\\Users\\WillPortFolio\\Desktop\\Blog\\Project 05 R and Apache Superset\\Data\\agriculture-crop-production\\Production_Crops_E_Africa.csv")
#Only keep Element values equal to 'Yield'
q2 <- filter(africa_table, Element == 'Yield') %>%
#Our focus is on coffee
filter(Item =='Coffee, green') %>%
#Select output columns
select('Area', 'Item', 'Unit', 'Y2018') %>%
#Sort highest yield to lowest
arrange(desc(Y2018))
OUTPUT:
Area | Item | Unit | Y2018 | |
1 | Malawi | Coffee, green | hg/ha | 23345 |
2 | Sierra Leone | Coffee, green | hg/ha | 20706 |
3 | Ghana | Coffee, green | hg/ha | 16959 |
4 | Rwanda | Coffee, green | hg/ha | 16506 |
5 | Nigeria | Coffee, green | hg/ha | 12886 |
6 | Burundi | Coffee, green | hg/ha | 9658 |
3 Which country in Africa has the largest increase in the amount of coffee produced since 1961?
A different type of question. I need to compare values, and use the percent change formula.
#consider dplyr and africa_table already imported
#Select required columns
q3 <- select(africa_table, 'Area', 'Item', 'Unit', 'Element','Y1961','Y2018') %>%
#filter by coffee and production
filter(Item == 'Coffee, green', Element == 'Production') %>%
#create a new column for percent change rounded to 2 decimal places
mutate(PercentChange = round(((Y2018 - Y1961) / Y1961)*100,2)) %>%
#arrange in descending order using PercentChange
arrange(desc(PercentChange))
OUTPUT:
Area | Item | Unit | Element | Y1961 | Y2018 | PercentChange | |
1 | Malawi | Coffee, green | tonnes | Production | 167 | 11082 | 6535.93 |
2 | Sierra Leone | Coffee, green | tonnes | Production | 5103 | 20480 | 301.33 |
3 | Rwanda | Coffee, green | tonnes | Production | 10500 | 38643 | 268.03 |
4 | Congo | Coffee, green | tonnes | Production | 900 | 3049 | 238.78 |
5 | Uganda | Coffee, green | tonnes | Production | 94100 | 284225 | 202.05 |
6 | Guinea | Coffee, green | tonnes | Production | 15000 | 42900 | 186 |
7 | Togo | Coffee, green | tonnes | Production | 10300 | 19076 | 85.2 |
8 | Kenya | Coffee, green | tonnes | Production | 28100 | 41375 | 47.24 |
9 | Comoros | Coffee, green | tonnes | Production | 100 | 140 | 40 |
10 | United Republic of Tanzania | Coffee, green | tonnes | Production | 33000 | 43193 | 30.89 |
11 | Central African Republic | Coffee, green | tonnes | Production | 8200 | 9391 | 14.52 |
12 | Madagascar | Coffee, green | tonnes | Production | 54000 | 60114 | 11.32 |
13 | Nigeria | Coffee, green | tonnes | Production | 1100 | 1161 | 5.55 |
14 | Burundi | Coffee, green | tonnes | Production | 14000 | 14216 | 1.54 |
15 | Mozambique | Coffee, green | tonnes | Production | 1000 | 825 | -17.5 |
16 | Cameroon | Coffee, green | tonnes | Production | 44700 | 30590 | -31.57 |
We will continue with questions 4 and 5 in the next post.