Popular articles by zenpanik
Popular comments by zenpanik
Great – lots of content and interesting charts and numbers.
Nice idea to use BERT. I would be happy to see the results of the Russian model on Bulgarian text.
Please share the results from the RNN model.
1. Data Prep
1.1. What is the result of the describe method in Pandas?
1.2. Pandas is based on dictionaries and NumPy arrays, which are really fast and efficient. I am not sure why the team did not use Pandas' industry-standard methods and instead spent time writing code to work with dictionaries.
Would you mind elaborating on this, please?
```python
a = train['pagePath']
a = a.to_list()
a = set(a)
a = list(a)
```
This is just list(train['pagePath'].unique())
1.2. The dictionary could be formed by applying a function to the URL. You used the column “PagePath” but said that the columns are:
[“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”]
I do not see how you derived this column list.
1.3. I do not see sorting by timestamp in your code.
1.4. I do not see sorting by visitor in the code, so all the loops in the MODEL step are incorrect, since:
while(visitor[j] == visitor[j+1])
compares adjacent rows and therefore depends on the rows being ordered by visitor (probably User ID).
1.5. KNN usually stands for K-Nearest Neighbours … What is it in your case?
1.6. Please show a snippet of your data … this code is puzzling:
What are knn_data and knn_data?
What exactly are you predicting? Why do you use a regression model?
3. Please include charts and samples from your data.
6. Would you bet your own money on your predictions? If so, how much?
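To make points 1.1 to 1.4 concrete, here is a minimal sketch of the data prep using standard Pandas calls. The sample rows and the column names "visitor" and "timestamp" are my assumptions for illustration only (only "pagePath" appears in your code), so adjust them to your real schema.

```python
import pandas as pd

# Hypothetical sample; column names other than "pagePath" are assumptions,
# not the team's real schema.
train = pd.DataFrame(
    {
        "visitor": ["u2", "u1", "u2", "u1", "u1"],
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 10:05",
                "2020-01-01 09:00",
                "2020-01-01 10:00",
                "2020-01-01 09:10",
                "2020-01-01 09:05",
            ]
        ),
        "pagePath": ["/b", "/a", "/a", "/c", "/b"],
    }
)

# 1.1: summary statistics for every column, including non-numeric ones.
summary = train.describe(include="all")

# 1.2: distinct pages in one call instead of list -> set -> list.
unique_pages = list(train["pagePath"].unique())

# 1.3 / 1.4: sort by visitor, then by timestamp, so that each visitor's rows
# are contiguous and in chronological order.
train_sorted = train.sort_values(["visitor", "timestamp"]).reset_index(drop=True)

# Per-visitor page sequences without a manual while loop over adjacent rows.
sessions = train_sorted.groupby("visitor")["pagePath"].agg(list).to_dict()
print(sessions)
```

Once the frame is sorted this way, a comparison such as visitor[j] == visitor[j+1] would behave as intended, although the groupby version avoids the index bookkeeping altogether.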
1. You may want to include some evaluation metrics for your models on both the train and test sets.
2. On the data prep part: simply removing rows with missing values is not the best solution, because this is time-series data and dropping rows could seriously bias your next steps.
3. The assumption you made about the “large number of missing values” is probably poor. Do you have any data or metric to support it?
4. You may want to include a more detailed explanation of why the data is not continuous (here is a link on discrete and continuous data: https://www.mathsisfun.com/data/data-discrete-continuous.html)
5. How would you rank your model? What metrics did you use?
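As a rough illustration of points 1 to 3, the sketch below measures the share of missing values, fills the gaps instead of dropping rows, and reports MAE and RMSE on a chronological train/test split. The series values are made up, and the "previous value" predictor is only a stand-in baseline I chose for the example, not your model.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two gaps; the values are invented for illustration.
idx = pd.date_range("2020-01-01", periods=8, freq="D")
series = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0], index=idx)

# Share of missing values: a number that supports or refutes "many missing values".
missing_share = series.isna().mean()

# Fill the gaps with time-aware linear interpolation instead of dropping rows,
# so the time index stays regular.
filled = series.interpolate(method="time")


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Stand-in baseline (an assumption, not the authors' model):
# predict each value with the previous observation.
pred = filled.shift(1)
y, y_hat = filled.iloc[1:], pred.iloc[1:]  # the first row has no prediction

# Chronological split: earlier rows for train, later rows for test (no shuffling).
y_train, y_hat_train = y.iloc[:5], y_hat.iloc[:5]
y_test, y_hat_test = y.iloc[5:], y_hat.iloc[5:]

print("missing share:", missing_share)
print("train MAE/RMSE:", mae(y_train, y_hat_train), rmse(y_train, y_hat_train))
print("test  MAE/RMSE:", mae(y_test, y_hat_test), rmse(y_test, y_hat_test))
```

Linear interpolation uses the value after each gap, so if that counts as looking into the future for your task, a forward fill (`series.ffill()`) is the more conservative choice. Reporting the same metrics on both splits also gives a direct way to rank one model against another.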