### Abstract

We want to see whether there is any connection between the products (names) and their prices, and whether any patterns exist. This question is set a priori; as the exploration proceeds, further questions will arise. Some of the data will be removed because it will not be used. We use plots, groupings, and hypothesis testing via ANOVA to look for potential connections. This notebook explores the nature of the products, whereas the previous one went from the countries to their creator.

```
%matplotlib inline
```

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import mstats
```

### Introduction

For the second dataset we use `wfp_market_food_prices` from here: https://www.kaggle.com/jboysen/global-food-prices#wfp_market_food_prices.csv. The file is saved in the data folder.

### Reading the Data and Preprocessing

```
# A forward slash avoids the invalid '\w' escape sequence and works on every OS.
food_price_data = pd.read_csv('data/wfp_market_food_prices.csv', sep=',', encoding='iso-8859-1')
```

The data is read and stored as `food_price_data`, with ',' as the separator. The file could not be read with UTF-8 encoding; this was fixed by specifying the iso-8859-1 encoding. Initially we have 743914 rows and 18 columns.
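The encoding fix can be sketched as a small fallback helper. This is only an illustration under assumptions: `read_csv_with_fallback` is a hypothetical helper name, and the toy file content is made up so that the example runs without the real dataset.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Hypothetical helper: try each encoding in turn.

    Returns the DataFrame together with the encoding that worked.
    """
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, sep=",", encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


# Toy file (made-up content) holding a byte that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    with open(toy_path, "wb") as handle:
        handle.write("cm_name,mp_price\nCafé,1.5\n".encode("iso-8859-1"))
    toy_df, used_encoding = read_csv_with_fallback(toy_path)

print(used_encoding)
print(toy_df)
```

Trying UTF-8 first and falling back only on a `UnicodeDecodeError` makes it explicit why the second encoding is needed.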

```
food_price_data.head()
```

```
food_price_data.shape
```

```
food_price_data.columns
```

Most of the id columns carry little information, so they are removed. We are left with 11 columns.

```
food_price_data = food_price_data.drop(['adm0_id', 'adm1_id', 'cm_id', 'mkt_id', 'cur_id', 'pt_id', 'um_id'] , axis = 1 )
```
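Since every dropped column ends in `_id`, the same step could also be expressed by pattern instead of by an explicit list. The sketch below uses a toy frame with made-up values; only the column names are taken from this notebook.

```python
import pandas as pd

# Toy frame with made-up rows; the real data has many more columns.
toy = pd.DataFrame({
    "adm0_id": [1, 2],
    "cm_id": [55, 56],
    "mkt_id": [266, 267],
    "cm_name": ["Bread", "Wheat"],
    "mp_price": [50.0, 30.0],
})

# Collect every column whose name ends in '_id' and drop them in one call.
id_columns = [name for name in toy.columns if name.endswith("_id")]
toy_without_ids = toy.drop(columns=id_columns)

print(id_columns)
print(toy_without_ids.columns.tolist())
```

The explicit list used above is safer when some id column should be kept; the pattern version is shorter when all of them go.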

```
food_price_data.shape
```

```
food_price_data.head()
```

### EDA - grouping and plotting

```
# Mean price per product name (the deprecated axis=0 argument is not needed).
products = food_price_data.groupby('cm_name')['mp_price'].mean()
```

```
products
```

Let's start the EDA by grouping the data by product name and taking the mean price. We see that there are 321 products, and many of them come in several types. Sorting the values by price shows that livestock (sheep, one-year-old alive female) has by far the highest mean price. Exploring further, we see that mostly meat products, rice, and fish occupy the top 20 spots. Some studies show that a significant part of the biosphere's biomass is made up of livestock animals. This could potentially be evidence in support of that study; for further reading, see the abstract here: https://www.pnas.org/content/early/2018/05/15/1711842115

```
products.sort_values(ascending=False).head(20)
```

```
products.sort_values(ascending=True).head(20)
```

Sorting in the opposite direction suggests that mostly plant foods, as well as water and eggs, have the lowest prices. We will aggregate the data further because there are too many product categories. We merge many of the similar types by keeping only the first word of each product name, which gives the new variable `product2`.

```
# Group by the first word of each product name to merge similar product types.
product2 = food_price_data.groupby([x.split(' ')[0] for x in food_price_data['cm_name']])['mp_price'].mean()
```
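To make the effect of the first-word grouping concrete, here is a minimal sketch on a toy frame. The product names and prices are made up for illustration. It shows both the intended merge of similar types and a side effect: multi-word names are reduced to their first word only.

```python
import pandas as pd

# Made-up product names and prices, for illustration only.
toy = pd.DataFrame({
    "cm_name": ["Rice (imported)", "Rice (local)", "Sweet potatoes", "Sour cream"],
    "mp_price": [10.0, 20.0, 4.0, 6.0],
})

# Keep only the first word of every product name and average within each group.
first_word = toy["cm_name"].str.split(" ").str[0]
by_first_word = toy.groupby(first_word)["mp_price"].mean()

print(by_first_word)
```

The two rice types collapse into one `Rice` entry, while `Sweet potatoes` and `Sour cream` end up under the less informative labels `Sweet` and `Sour`.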

```
product2.shape
```

```
product2.sort_values(ascending=False).head(20)
```

```
# Compute the top 20 once and reuse it for the bars and the tick labels.
top20 = product2.sort_values(ascending=False).head(20)

plt.figure(figsize=(15,10))
plt.title("Top 20 products by price")
plt.bar(top20.index, top20.values, align='center', alpha=0.5)
plt.xticks(top20.index, rotation=90)
plt.ylabel('price')
plt.show()
```

We have now reduced the number of products to 86. The top 20 confirms our suspicion that the mean price of livestock is much higher than the rest; after it come various plant species and some meat and fish. The visualization shows how much higher the livestock price is. Interestingly, the 20th position is Fuel; this may suggest that some of the food produced is used as biofuel, see here: https://en.wikipedia.org/wiki/Biofuel. Unfortunately, the data contains nothing useful for exploring this topic further. Now let's view the bottom 20.

```
product2.sort_values(ascending=False).tail(20)
```

```
# Compute the bottom 20 once and reuse it for the bars and the tick labels.
bottom20 = product2.sort_values(ascending=False).tail(20)

plt.figure(figsize=(15,10))
plt.title("Lowest 20 products by price")
plt.bar(bottom20.index, bottom20.values, align='center', alpha=0.5)
plt.xticks(bottom20.index, rotation=90)
plt.ylabel('price')
plt.show()
```

The lowest 20 are mostly plant foods. Their mean prices are much lower than the rest. The spread of the prices is smaller than in the top 20, but the lowest 6 are noticeably lower than the others. The lowest is water. It is hard to interpret why water has the lowest price; perhaps it is recorded only as a by-product of producing other food products. Considering that many countries are suffering from dwindling water supplies, one would expect the price of water to be higher. See more on the topic here: https://www.theguardian.com/global-development-professionals-network/2017/mar/27/aquifers-worlds-reserve-water-tank-asia

```
plt.boxplot([product2.sort_values(ascending=False).head(20).values, product2.sort_values(ascending=False).tail(20).values], labels = ["top 20 with highest price", "lowest 20"])
plt.ylabel("price")
plt.title('Boxplot')
plt.show()
```

The boxplot shows that the variance is higher for the top 20, and livestock is clearly an outlier. Let's remove this outlier to see whether there is a notable change. The second-highest value is also an outlier, so it is removed as well.

```
plt.boxplot([product2.sort_values(ascending=False).iloc[2:].head(20).values, product2.sort_values(ascending=False).tail(20).values], labels = ["top 20 with highest price", "lowest 20"])
plt.ylabel("price")
plt.title('Boxplot without the two outliers')
plt.show()
```
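The points that a boxplot draws as outliers follow, with matplotlib's default settings, the 1.5 × IQR rule. As a minimal sketch (the helper name `iqr_outliers` and the toy prices are our own assumptions, not part of the dataset), the same rule can be applied numerically:

```python
import numpy as np


def iqr_outliers(values, whisker=1.5):
    """Hypothetical helper: values lying beyond whisker * IQR from the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]


# Made-up prices with one obviously extreme value.
toy_prices = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
print(iqr_outliers(toy_prices))
```

Applied to a price series, such a rule would flag outliers by a fixed criterion rather than by reading them off the plot.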

Even without the outliers there is still a large difference. Now let's plot the lowest 20 separately, because their distribution is not very evident in this boxplot: the scale of the top 20 is far higher.

```
plt.boxplot([product2.sort_values(ascending=False).tail(20).values], labels = ["lowest 20"])
plt.ylabel("price")
plt.title('Boxplot lowest 20')
plt.show()
```

From the latest boxplot we can conclude that the lowest 20 products have a much more symmetric distribution than the top 20, which seem to be skewed to the right. A histogram will make things clearer.

```
plt.hist(product2, edgecolor = 'black')
plt.xlabel("mean price")
plt.ylabel("number of products")
plt.title('Histogram of the products')
plt.show()
```

```
product2.skew()
```
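When reading the value returned by `skew()`, the sign tells us the direction of the long tail: a positive value means a long right tail (most values are small, a few are very large), and a negative value means a long left tail. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up values: one sample with a long right tail, and its mirror image.
right_tailed = pd.Series([1, 1, 2, 2, 3, 50])
left_tailed = -right_tailed

print(right_tailed.skew())  # positive: long tail towards high values
print(left_tailed.skew())   # negative: long tail towards low values
```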

The distribution is strongly skewed to the right: most products sit at the low end, with a long tail of expensive ones. We will remove the outliers and increase the number of bins to get a clearer picture.

```
plt.hist(product2.sort_values(ascending=False).iloc[2:], edgecolor = 'black', bins = 20)
plt.xlabel("mean price")
plt.ylabel("number of products")
plt.title('Histogram of the products without the 2 biggest outliers')
plt.show()
```

```
product2.sort_values(ascending=False).iloc[2:].skew()
```

The majority of the products are still cheap, and there are only a few expensive ones. We will remove the top 15 products to see whether the distribution becomes smoother. Even then, the majority are concentrated in the lower part.

```
plt.hist(product2.sort_values(ascending=False).iloc[15:], edgecolor = 'black', bins = 20)
plt.xlabel("mean price")
plt.ylabel("number of products")
plt.title('Histogram of the products without the 15 biggest outliers')
plt.show()
```

```
product2.sort_values(ascending=False).iloc[15:].skew()
```

```
product2.index
```

Now let us divide the products into 2 groups: animal-derived and plant-derived products. When we get the `product2` index, we see that some products belong to neither category, such as water and salt. There are also other costs that we have not come across earlier, such as Transport, Wage, Charcoal, Exchange, and Fuel (it is debatable whether the fuel is plant-derived or a traditional fossil fuel; let's assume it is plant-derived and leave it here). There is also oil (let us assume it is plant oil, so it stays). We will omit 'Wage', 'Transport', 'Charcoal', and 'Exchange' from now on. It would be wise to repeat the exploration above without these additions.

```
product2[['Wage', 'Transport','Charcoal','Exchange' ]]
```

If we look at Wage, Transport, Charcoal, and Exchange separately, we see that they are not outliers; none of them makes it into the top 20.

```
product2.describe()
```

Now we get the descriptive statistics; both Wage and Transport are below the mean. What if we remove the 15 biggest? The mean and standard deviation drop significantly. We do not think that Wage, Transport, Charcoal, and Exchange have a very big impact, so, considering the time, we leave repeating the above exploration without them for another time. From now on, however, they are removed. Livestock is also removed as the largest outlier.

```
product2.sort_values(ascending=False).iloc[15:].describe()
```

```
product2.drop(['Wage', 'Transport','Charcoal' ,'Exchange']).sort_values(ascending=False).iloc[15:].describe()
```

```
plant_derived = product2[['Apples', 'Avocados', 'Bananas', 'Beans', 'Beans(mash)', 'Beetroots',
'Blackberry', 'Bread', 'Broccoli', 'Buckwheat','Bulgur','Cabbage', 'Carrots', 'Cashew', 'Cassava', 'Cauliflower',
'Chili', 'Cocoa', 'Coffee', 'Cornstarch','Cotton','Cowpeas', 'Cucumbers','Dates', 'Eggplants','Fonio', 'Gari',
'Garlic','Guava','Lentils','Lettuce','Maize', 'Mangoes','Millet', 'Noodles', 'Oil','Onions', 'Oranges', 'Papaya',
'Parsley', 'Passion', 'Pasta', 'Peanut','Peas', 'Peppers', 'Plantains', 'Potatoes', 'Poultry', 'Pulses',
'Pumpkin', 'Rice','Sesame', 'Sorghum', 'Sour', 'Soybeans',
'Spinach', 'Sugar', 'Sweet', 'Tamarillos/tree', 'Tea', 'Tomatoes','Tortilla', 'Wheat', 'Yam', 'Yogurt',
'Zucchini']]
```

```
animal_derived = product2[['Butter','Cheese','Eggs','Curd','Fat', 'Fish','Labaneh','Meat', 'Milk']]
```

```
plant_derived.count()
```

```
animal_derived.count()
```
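As an alternative to two hand-written lists, the split could be sketched with a single label-to-category mapping. The prices and the mapping below are made up for illustration; labels missing from the mapping (here `Wage`) fall into no group and are left out of the summary.

```python
import pandas as pd

# Made-up mean prices indexed by product label.
toy_prices = pd.Series(
    {"Maize": 2.0, "Rice": 4.0, "Milk": 3.0, "Fish": 9.0, "Wage": 7.0},
    name="mp_price",
)

# Hypothetical mapping from product label to category.
category_of = {"Maize": "plant", "Rice": "plant", "Milk": "animal", "Fish": "animal"}

# Passing a dict to groupby maps each index label to its group.
summary = toy_prices.groupby(category_of).agg(["count", "mean"])
print(summary)
```

Keeping the classification in one mapping makes it easier to review which label went where and to spot labels that were not classified at all.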

### Hypothesis Testing

The number of animal-derived products is far lower than the number of plant-derived products.

```
plt.boxplot([plant_derived.values, animal_derived.values], labels = ["plant derived", "animal derived"])
plt.ylabel("price")
plt.title('Boxplot for animal and plant derived')
plt.show()
```

```
plant_derived.skew()
```

```
animal_derived.skew()
```

```
plant_derived.idxmax()
```

The boxplot shows that the plant-derived products are more skewed to the right. We will remove the biggest outlier among the plant-derived products, in our case 'Sesame'.

```
plt.boxplot([plant_derived.sort_values(ascending=False).iloc[1:].values, animal_derived.values], labels = ["plant derived", "animal derived"])
plt.ylabel("price")
plt.title('Boxplot for animal and plant derived')
plt.show()
```

We will try a one-way parametric ANOVA test from the scipy `stats` package to see whether there is a statistically significant difference between the means of the 2 groups; for more information on the test see here: https://www.marsja.se/four-ways-to-conduct-one-way-anovas-using-python/. The null hypothesis H0 is that there is no statistically significant difference between the 2 means, and the alternative hypothesis H1 is that there is. We use the conventional significance level of 0.05: if the p-value is higher, we fail to reject H0; otherwise we reject H0 in favor of H1.
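The decision rule can be illustrated on made-up numbers before applying it to the real groups. This is only a sketch; `anova_decision` is a hypothetical helper, and the three toy groups are not taken from the dataset.

```python
from scipy import stats

ALPHA = 0.05  # significance level used throughout this notebook


def anova_decision(*groups, alpha=ALPHA):
    """Hypothetical helper: run a one-way ANOVA and apply the decision rule."""
    f_value, p_value = stats.f_oneway(*groups)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return f_value, p_value, decision


# Made-up samples: a and b have almost the same mean, c is clearly higher.
group_a = [2.1, 2.5, 1.9, 2.4, 2.2]
group_b = [2.0, 2.6, 2.3, 1.8, 2.5]
group_c = [5.0, 5.2, 4.9, 5.1, 5.3]

print(anova_decision(group_a, group_b))
print(anova_decision(group_a, group_c))
```

Note that a large p-value only means the data give no evidence against H0; it does not show that the means are equal.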

```
F, p = stats.f_oneway(plant_derived.sort_values(ascending=False).iloc[1:].values, animal_derived.values)
```

```
F, p
```

The p-value is higher than 0.05, so we fail to reject H0. However, if we read the criteria for one-way ANOVA, we can see that they are not met here: the two populations are not normally distributed, and it is possible that they are dependent or that the populations are not homogeneous, see here: https://en.wikipedia.org/wiki/Homogeneity_and_heterogeneity. This means that the results are not to be trusted.
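Besides inspecting histograms, the normality and equal-variance assumptions could be checked with formal tests. The sketch below is an assumption on our part rather than something done in this notebook: it uses `scipy.stats.shapiro` (Shapiro-Wilk normality test) and `scipy.stats.levene` (equality of variances) on synthetic samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples: a large right-skewed one and a small, roughly normal one.
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=60)
normal_sample = rng.normal(loc=2.0, scale=0.5, size=10)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed.
w_skewed, p_norm_skewed = stats.shapiro(skewed_sample)
w_normal, p_norm_normal = stats.shapiro(normal_sample)

# Levene: a small p-value suggests the group variances differ.
levene_stat, p_levene = stats.levene(skewed_sample, normal_sample)

print(p_norm_skewed, p_norm_normal, p_levene)
```

With samples as small as the animal-derived group, such tests have little power, so they complement the plots rather than replace them.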

```
plt.hist(plant_derived.sort_values(ascending=False).iloc[1:], edgecolor = 'black', bins = 20)
plt.xlabel("mean price of plant products")
plt.ylabel("number of products")
plt.title('Histogram of plant based products')
plt.show()
```

```
plt.hist(animal_derived, edgecolor = 'black', bins = 20)
plt.xlabel("mean price of animal products")
plt.ylabel("number of products")
plt.title('Histogram of animal based products')
plt.show()
```

Now we will use a non-parametric test, the Kruskal-Wallis H-test. For more information on the test check here: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.kruskalwallis.html The hypotheses are the same. Now we do not assume normal distributions or similar sample sizes.
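The reason the Kruskal-Wallis test copes with skewed prices is that it only uses the ranks of the values. A minimal sketch on made-up numbers shows this: a monotonic transformation such as the logarithm leaves H unchanged, while the ANOVA F statistic changes. Here we use `scipy.stats.kruskal`, the non-masked counterpart of `mstats.kruskalwallis`.

```python
import numpy as np
from scipy import stats

# Made-up samples without ties; the first one has one extreme value.
group_a = np.array([1.0, 3.0, 4.0, 8.0, 120.0])
group_b = np.array([2.0, 5.0, 6.0, 7.0, 9.0, 10.0])

# Kruskal-Wallis on the raw and on the log-transformed values.
H_raw, p_raw = stats.kruskal(group_a, group_b)
H_log, p_log = stats.kruskal(np.log(group_a), np.log(group_b))

# One-way ANOVA on the same two versions of the data, for comparison.
F_raw, _ = stats.f_oneway(group_a, group_b)
F_log, _ = stats.f_oneway(np.log(group_a), np.log(group_b))

print(H_raw, H_log)  # same value: the ranks are unchanged by the logarithm
print(F_raw, F_log)  # different values: ANOVA works on the raw magnitudes
```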

```
H, p_value = mstats.kruskalwallis(plant_derived.sort_values(ascending=False).iloc[1:].values, animal_derived.values)
```

```
H, p_value
```

The p-value is much higher than 0.05, meaning we fail to reject H0: the test finds no statistically significant difference between the prices of the plant-derived and animal-derived products.

```
animal_derived.mean()
```

```
plant_derived.mean()
```

Unfortunately this test is also not ideal, since it assumes that the populations are independent. Some plant-derived food can be used as animal feed. There could also be competition for land, for example, so there is likely some form of interdependence between the populations. The issue should be addressed, for example with https://en.wikipedia.org/wiki/Multivariate_analysis_of_variance. But for now we conclude the work on this notebook.

### Conclusion

The focus of this notebook was on the products and their prices; several groupings and preprocessing steps were used. The products were then divided into plant-based and animal-based. Some of the products were excluded, and several links were added for readers who are more interested in the subject. Finally, a one-way ANOVA test and the non-parametric Kruskal-Wallis H-test were used. The results were inconclusive.

### Further development

In the future, the time, the countries, and the sectors could also be checked. Some countries are exporters and others are importers, and there could be seasonality based on the time of year.

There is also a clear connection between the variables, which should be addressed as well.