### Abstract¶

Exploring this file, we can examine some of the effects of the other variables on the GPA (grade point average). Given the short timespan for this project, we will not delve deeply into the data. The file will be read, some columns and values will be removed, correlations will be measured, plots will be presented, and at the end we will perform some hypothesis testing.

```
%matplotlib inline
```

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats import weightstats
from scipy.stats import mstats
import seaborn as sns
```

### Introduction¶

For the third dataset we choose the food choices survey from https://www.kaggle.com/borapajo/food-choices#food_coded.csv. The file is saved in the data folder.

### Reading the Data and Preprocessing¶

```
# forward slashes work on all platforms; sep=',' is the pandas default
food_choices = pd.read_csv('data/food_coded.csv')
```

```
food_choices.head()
```

```
food_choices.shape
```

This dataset is a lot smaller than the previous two: only 125 rows and 61 columns.

```
food_choices.columns
```

Some of the columns contain free-text answers from the quizzes rather than numerical information; such columns will be removed. Others hold categorical values and will stay.

```
food_choices = food_choices.drop(['comfort_food', 'comfort_food_reasons', 'diet_current', 'eating_changes', 'father_profession',
'fav_cuisine','food_childhood', 'healthy_meal','ideal_diet','meals_dinner_friend',
'mother_profession', 'type_sports'], axis = 1)
```

```
food_choices.shape
```

Now we have only 49 columns. We will reduce the number of columns further for the exploration part; we can return here if we want to include all the information in a regression model.

```
food_choices.columns
```

```
gpa_data = food_choices.drop(['calories_day', 'comfort_food_reasons_coded', 'cook','comfort_food_reasons_coded.1',
'cuisine', 'diet_current_coded','eating_changes_coded', 'eating_changes_coded1', 'eating_out',
'employment','ethnic_food', 'exercise', 'father_education',
'fav_cuisine_coded', 'fav_food', 'fries', 'fruit_day', 'grade_level',
'greek_food', 'healthy_feeling', 'ideal_diet_coded', 'income',
'indian_food', 'italian_food', 'life_rewarding', 'marital_status',
'mother_education', 'nutritional_check', 'on_off_campus',
'parents_cook', 'pay_meal_out', 'persian_food',
'self_perception_weight', 'soup', 'sports'], axis = 1)
```

```
gpa_data.shape
```

Finally we are left with only 14 columns. This can be seen below.

```
gpa_data.head()
```

### Correlation Analysis¶

```
plt.figure(figsize=(20,10))
sns.heatmap(gpa_data.corr(), annot = True)
```

We start by creating a correlation matrix and rendering it as a heat map with the seaborn library. The closer a value is to one, the higher the correlation; values close to zero mark a lack of correlation, while negative values mark negative correlation. For more information about correlation see https://www.kdnuggets.com/2017/02/datascience-introduction-correlation.html. Correlation does not imply a causal relationship between the variables, and spurious correlations can arise by pure coincidence. There is also the issue that the Pearson correlation coefficient only captures linear relationships. The diagonal values are each variable correlated with itself, so they have a coefficient of 1. Looking at the matrix, there does not seem to be any strong correlation. We will plot the highest values.
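To make concrete what each cell of the heat map represents, here is a minimal sketch computing Pearson's r from its definition on synthetic data (the variables `x` and `y` are invented for illustration, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy data: y is partially related to x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=1.0, size=100)

# Pearson's r from its definition: sample covariance over the product of std devs
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# The same value via pandas, which is what DataFrame.corr() computes cell by cell
r_pandas = pd.Series(x).corr(pd.Series(y))

print(round(r_manual, 4), round(r_pandas, 4))
```

Both computations agree; `DataFrame.corr()` simply fills the matrix with these pairwise values.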

```
plt.scatter(gpa_data['turkey_calories'], gpa_data['tortilla_calories'])
plt.xlabel('turkey_calories')
plt.ylabel('tortilla_calories')
plt.title('Scatter Plot, Correlation Coefficient = 0.48')
```

The turkey calories and the tortilla calories have a correlation coefficient of 0.48. It seems turkey may be a major ingredient in the tortilla dishes, although there are always other ingredients. Other scatterplots will likely look random. The probable reason for the lack of correlation is that these are mostly dummy variables created for specific values of categorical ones. For more information on the subject see: https://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html
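As a sketch of what such dummy-variable encoding looks like (the column name and answers below are invented, only the mechanics matter):

```python
import pandas as pd

# Hypothetical categorical survey answers
df = pd.DataFrame({'fav_food': ['home cooked', 'store bought', 'both', 'home cooked']})

# One-hot encoding: each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(df['fav_food'], prefix='fav_food')
print(dummies.columns.tolist())
```

Each row has exactly one indicator set, so pairwise Pearson correlations between such columns are hard to interpret as linear relationships.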

```
plt.scatter(gpa_data['coffee'], gpa_data['veggies_day'])
plt.xlabel('coffee')
plt.ylabel('veggies_day')
plt.title('Scatter Plot, Correlation Coefficient = 0.11')
```

The values don't seem to be correlated, which is expected since some of them are categorical. If we want to explore further we should consider the chi-square test; for more information see https://en.wikipedia.org/wiki/Chi-squared_test
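A sketch of how such a chi-square test of independence would look (the contingency counts below are made up purely to show the mechanics, not derived from the survey):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = one categorical answer,
# columns = another. Counts are invented for illustration.
table = np.array([[30, 10],
                  [20, 25]])

# chi2_contingency tests independence of the row and column variables
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 3), round(p, 4), dof)
```

A small p-value would suggest the two categorical variables are not independent; with real survey columns one would first build the table with `pd.crosstab`.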

### EDA - grouping and plotting¶

Now let's explore the GPA with a histogram. Using the unique function, we see that 3 values are not numeric; they will be removed, leaving only 122 rows. The data is stored as strings, as can be seen in the describe function.

```
gpa_data['GPA'].unique()
```

```
gpa_data = gpa_data[gpa_data['GPA'] != 'Personal ']
gpa_data = gpa_data[gpa_data['GPA'] != 'Unknown']
gpa_data = gpa_data[gpa_data['GPA'] != '3.79 bitch']
```

```
gpa_data.shape
```

```
gpa_data['GPA'].describe()
```

```
plt.figure(figsize=(15,10))
plt.title("GPA histogram")
plt.hist(gpa_data['GPA'].dropna(), bins = 10)
```

```
plt.figure(figsize=(15,10))
plt.title("GPA histogram")
plt.hist(gpa_data['GPA'].dropna(), bins = 20)
```

Two histograms have been plotted, the first with 10 bins and the second with 20; the second is better since its xticks are visible. The data is fairly close to a normal distribution, slightly elongated at the tails, with some rounding in the marks. Now we will convert the GPA data to float and drop the NaN values.

```
gpa = gpa_data['GPA'].astype(float).dropna()
```

```
plt.figure(figsize=(10,8))
plt.title('GPA data violin plot')
sns.set(style="whitegrid")
ax = sns.violinplot(x=gpa)
```

Taking a violin plot of the GPA data, we see it has an elongated left tail. Let's explore further. The skew function gives us a value below 0, confirming our previous observation of a left asymmetry. Also, the mean is smaller than the median; if the distribution were symmetric they would be equal.

```
gpa.skew()
```

```
gpa.describe()
```

```
gpa.median()
```
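As a side check of the mean-below-median rule of thumb used above, here is a sketch on a synthetic left-skewed sample (the Beta distribution and its parameters are an arbitrary illustration, not the survey data):

```python
import numpy as np
import pandas as pd

# A Beta(5, 1.5) sample piles mass near 1 and has a long left tail,
# i.e. negative skewness, similar in shape to the GPA data
rng = np.random.default_rng(1)
sample = pd.Series(rng.beta(a=5, b=1.5, size=5000))

print(round(sample.skew(), 3))          # negative -> left tail
print(sample.mean() < sample.median())  # mean pulled toward the left tail
```

The mean sits below the median because the long left tail drags it down, which is exactly the pattern observed in the GPA column.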

```
gpa_data['Gender'].describe()
```

Now we move to the gender variable. Using the describe function, the mean is less than 1.5, meaning there are more 1s than 2s. Since 1 stands for female, females dominate the sample: there are 74 females and 48 males. Let's use a pie chart to visualize their proportions.

```
gpa_data[gpa_data['Gender'] == 1]['Gender'].count()
```

```
gpa_data[gpa_data['Gender'] == 2]['Gender'].count()
```

```
gender = gpa_data.groupby('Gender').count()
```

```
gender.index = ['Female', 'Male']
```

```
ratio = (gpa_data[gpa_data['Gender'] == 1]['Gender'].count())/(gpa_data[gpa_data['Gender'] == 2]['Gender'].count())
```

```
ratio
```

We set the index labels to Female and Male. The ratio between the two groups is 1.54.

```
plt.title('Gender pie plot')
gender['GPA'].plot.pie()
```

Now let us explore the weight column, which we store in the variable weight. Applying the unique function, we see that some of the entries are not valid numbers, such as "I'm not answering this. ", 'Not sure, 240' or '144 lbs'.

```
weight = gpa_data['weight']
```

```
weight.unique()
```

```
weight = weight[weight.values != "I'm not answering this. "]
weight = weight[weight.values != "Not sure, 240"]
weight = weight[weight.values != "144 lbs"]
```

```
weight = weight.astype(float).dropna()
```

```
weight.skew()
```

```
weight.describe()
```

The values are again converted to float. They are right-skewed, as the skewness coefficient is above 0. We can also see the descriptive statistics.

```
plt.title('Density Plot of the Weight')
plt.xlabel('weight')
weight.plot.density()
```

### Hypothesis testing via Z test¶

Now let's use a different approach: this time we plot a density plot. The plot confirms that the distribution is slightly asymmetrical to the right.

```
average = gpa_data['GPA'].astype(float).dropna().mean()
```

```
average
```

```
gpa_data = gpa_data[gpa_data['weight'] != "I'm not answering this. "]
gpa_data = gpa_data[gpa_data['weight'] != "Not sure, 240"]
gpa_data = gpa_data[gpa_data['weight'] != "144 lbs"]
```

```
high_gpa = gpa_data[gpa_data['GPA'].astype(float) > average]
```

```
low_gpa = gpa_data[gpa_data['GPA'].astype(float) <= average]
```

```
high_gpa.head()
```

```
high_gpa.shape
```

```
low_gpa.head()
```

```
low_gpa.shape
```

Finally, let's split the table into two parts, low and high GPA, with the dividing point being the mean of 3.41. We will test whether there is a difference in weight between the two groups; this will tell us if weight has a statistically significant association with GPA. Let's use a boxplot first. We again need to remove the invalid rows.

```
plt.boxplot([low_gpa['weight'].astype(float).dropna(), high_gpa['weight'].astype(float).dropna()],
            labels = ["low gpa weights", "high gpa weights"])
plt.ylabel("Weight")
plt.title('Boxplot of the weights')
plt.show()
```

At first glance it seems that the low-GPA group has a higher average weight and a more right-asymmetric distribution. Let's use the skew function to see more.

```
low_gpa['weight'].astype(float).dropna().skew()
```

```
high_gpa['weight'].astype(float).dropna().skew()
```

Both are right-skewed, more evidently so in the low_gpa weights. According to this study https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5222954/ there is a negative link between a student's weight and their academic performance. Let's see if our results agree with the study.

This time we will use a z test, since we have more than 30 observations in both samples (according to other sources the threshold is 50); otherwise we would use Student's t test.
The null hypothesis H0 is that there is no difference between the group means; the alternative H1 is that there is one. We assume a normal distribution of the variables.
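To make the test statistic concrete, here is a sketch that recomputes the pooled two-sample z statistic by hand and checks it against weightstats.ztest; the two weight samples below are synthetic stand-ins, not the survey data:

```python
import numpy as np
from statsmodels.stats import weightstats

# Synthetic stand-ins for the two weight samples (invented for illustration)
rng = np.random.default_rng(2)
a = rng.normal(loc=160, scale=25, size=60)
b = rng.normal(loc=155, scale=25, size=60)

# Manual two-sample z statistic with a pooled variance, matching the
# default usevar='pooled' behaviour of weightstats.ztest
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
se = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
z_manual = (a.mean() - b.mean()) / se

z_lib, p_lib = weightstats.ztest(a, b)
print(round(z_manual, 4), round(z_lib, 4))
```

The statistic is just the difference of sample means scaled by its pooled standard error; the library then converts it to a two-sided p-value under the standard normal.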

```
plt.hist(low_gpa['weight'].astype(float).dropna(), bins = 20)
plt.title('low gpa weights histogram')
```

```
plt.hist(high_gpa['weight'].astype(float).dropna(), bins = 20)
plt.title('high gpa weights histogram')
```

The high-GPA histogram is closer to normal, although both have a total of 3 outliers. The samples are large enough that the outliers should not have a major effect, so we decide to leave them in this time. It would be worth removing them to see whether the results change substantially.
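If we later decide to remove those outliers, a minimal sketch using Tukey's 1.5 × IQR rule could look like the following (the weights below are invented, with three obvious outliers appended):

```python
import numpy as np
import pandas as pd

# Hypothetical weights: 60 plausible values plus three clear outliers
weights = pd.Series(list(np.linspace(120, 220, 60)) + [350, 400, 30])

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1
mask = weights.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = weights[mask]

print(len(weights) - len(trimmed))  # number of points removed
```

Rerunning the test on `trimmed` versus the original series would show whether the outliers drive the conclusion.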

Now let's run the z test; you can read more about it at https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html. The samples are assumed to be independent; one person's weight should not affect another's, so this assumption seems reasonable.

```
z, p = weightstats.ztest(low_gpa['weight'].astype(float).dropna(), high_gpa['weight'].astype(float).dropna())
```

```
z, p
```

The p-value is greater than our default significance level of 0.05, so we fail to reject H0. Our result contradicts the previous study. That study likely had a larger sample size, and our sample distributions are not entirely normal even though we assumed they were. More evaluation is needed in the future.
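Given the doubts about normality, a possible robustness check for future work is Welch's t test, which drops the equal-variance assumption; here is a sketch on synthetic stand-in samples (the values are invented, not the survey weights):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the low- and high-GPA weight samples
rng = np.random.default_rng(3)
low = rng.normal(loc=162, scale=26, size=55)
high = rng.normal(loc=158, scale=24, size=55)

# equal_var=False selects Welch's t-test, which does not assume
# the two groups share a common variance
t_stat, p_val = stats.ttest_ind(low, high, equal_var=False)
print(round(t_stat, 3), round(p_val, 4))
```

If the z test and Welch's t test agree, the conclusion is less likely to be an artifact of the distributional assumptions.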

### Conclusion¶

We had a short introduction to this file, gained some understanding of the makeup of several columns, and performed a hypothesis test on whether GPA is associated with weight. The result was negative, contradicting the study mentioned above.

### Further Development¶

Provided we had more time, more of the variables should be explored, a regression model could be tried out, and a more in-depth exploration of the categorical variables is needed.