Popular comments by alex-efremov
Hi, taha-junaid3000 🙂
tomislavk is right… Splitting the work with someone would help you achieve better results and learn much more while collaborating with others 🙂
Given your work so far, I would focus more on the analysis and conclusions regarding data quality, the variables for modelling, etc. This would help in deciding the next steps.
Hi team 🙂 Good work!
You are right that the missing values should be handled in a better way (not by replacing them with the last known value). If there is a single missing value or a few neighbouring ones, we can impute them without distorting the data; but in the case of a long missing interval, instead of imputing we can treat the resulting segments as separate data sets… There are ways to concatenate data from different data sets even when building dynamic models…
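As a minimal sketch of this idea (with a hypothetical daily series, an illustrative gap threshold `MAX_GAP`, and variable names of my choosing), short gaps can be interpolated while long gaps split the series into separate segments:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a short gap (2 points) and a long gap (10 points)
idx = pd.date_range("2023-01-01", periods=30, freq="D")
s = pd.Series(np.arange(30, dtype=float), index=idx)
s.iloc[5:7] = np.nan    # short gap: safe to impute
s.iloc[15:25] = np.nan  # long gap: better to split the series here

MAX_GAP = 3  # impute runs of missing values only up to this length (assumption)

# Label each run of consecutive NaNs and measure its length
is_na = s.isna()
run_id = (is_na != is_na.shift()).cumsum()
run_len = is_na.groupby(run_id).transform("sum")
short_gap = is_na & (run_len <= MAX_GAP)

# Interpolate only the short gaps; long gaps stay NaN
filled = s.copy()
filled[short_gap] = s.interpolate()[short_gap]

# Split the series into contiguous segments at the remaining (long) gaps
gap_id = filled.isna().cumsum()
segments = [seg.dropna() for _, seg in filled.groupby(gap_id)]
segments = [seg for seg in segments if len(seg) > 0]
```

The `segments` list can then be used to fit or validate a model on each contiguous piece separately, instead of bridging a long gap with artificial values.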
Nothing more to add after Agamemnon 🙂
I also like your validation approach, given the small amount of data, and the different scenarios you introduced for the forecasts.
Hi, again 🙂
I like your work very much: the considerations related to the data, the interpretation of the outliers, the conclusions, and also the good business understanding. 🙂
I have some comments and questions about the final model: looking at the p-values, you included many non-significant factors in the model, or perhaps I misunderstood something. To reduce the possibility of overfitting, I would remove some of them so that only significant factors remain in the end. And to check the model for overfitting, we should compare R², adj. R², RMSE, etc. for both the train and test samples. For the cross-validation you did, we should also do this for the averaged model-quality measures. Also, by using linear regression we impose a particular hypothesis about the type of relation between the factors and the dependent variable, so it would be good to check other models as well, especially non-parametric ones. Nevertheless, I really like what you have done.
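To illustrate the train-vs-test check I mean (on synthetic data of my own making, with only three of five factors actually significant), here is a minimal OLS sketch that computes R², adj. R², and RMSE on both samples:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 5
X = rng.normal(size=(n, p))
# only three of the five factors carry signal (hypothetical setup)
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# simple train/test split
n_tr = 140
X_tr, X_te = X[:n_tr], X[n_tr:]
y_tr, y_te = y[:n_tr], y[n_tr:]

# OLS fit with intercept, estimated on the training sample only
A_tr = np.column_stack([np.ones(n_tr), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

def metrics(Xm, ym):
    """R2, adjusted R2 and RMSE for the fitted model on a given sample."""
    pred = np.column_stack([np.ones(len(Xm)), Xm]) @ beta
    ss_res = np.sum((ym - pred) ** 2)
    ss_tot = np.sum((ym - ym.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    k = Xm.shape[1]
    adj_r2 = 1 - (1 - r2) * (len(ym) - 1) / (len(ym) - k - 1)
    rmse = np.sqrt(ss_res / len(ym))
    return r2, adj_r2, rmse

r2_tr, adj_tr, rmse_tr = metrics(X_tr, y_tr)
r2_te, adj_te, rmse_te = metrics(X_te, y_te)
# A much worse test R2 / RMSE than train R2 / RMSE signals overfitting.
```

The same comparison applies under cross-validation: average the per-fold train and validation metrics and compare those averages.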
Hi, Svilen 🙂
Once again, you did a very good job. Trying to predict the outliers is not easy. Usually, appropriate data analysis, data preparation, and data enrichment are critical for the final solution of cases like this one. You did a lot here, and at the same time there is more to do (e.g. improving the balance in the data with respect to the dependent variable, reformulating the dependent variable, as you mentioned…). By the way, about the modelling: I noticed that the mean absolute error is much higher for the test sample than for the train data. This is an indicator of overfitting (if I understand correctly what you presented). So, I recommend that in future you take care of this when building models. Adding more factors and optimizing the model on the training sample usually reduces the predictive power on new data.
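To show what I mean about extra factors hurting the test MAE, here is a toy sketch (with synthetic data and a deliberately over-parameterised model of my own making, not your actual setup): a model with 30 factors where only one matters looks great on the training sample but degrades on the test sample, while the pruned model holds up better.

```python
import numpy as np

rng = np.random.default_rng(7)
n_tr, n_te, p = 40, 40, 30  # small sample, many factors -> prone to overfitting
X_tr = rng.normal(size=(n_tr, p))
X_te = rng.normal(size=(n_te, p))
# only the first factor carries signal; the other 29 are pure noise
y_tr = 2.0 * X_tr[:, 0] + rng.normal(size=n_tr)
y_te = 2.0 * X_te[:, 0] + rng.normal(size=n_te)

def fit_ols(X, y):
    """OLS with intercept via least squares."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mae(X, y, beta):
    """Mean absolute error of the fitted model on a given sample."""
    pred = np.column_stack([np.ones(len(X)), X]) @ beta
    return np.mean(np.abs(y - pred))

beta_full = fit_ols(X_tr, y_tr)          # all 30 factors
beta_small = fit_ols(X_tr[:, :1], y_tr)  # only the one real factor

mae_full_tr = mae(X_tr, y_tr, beta_full)
mae_full_te = mae(X_te, y_te, beta_full)
mae_small_te = mae(X_te[:, :1], y_te, beta_small)
# The full model's test MAE is much higher than its train MAE,
# while the pruned model achieves a lower test MAE.
```

The gap between `mae_full_tr` and `mae_full_te` is exactly the kind of train/test discrepancy I noticed in your results.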