zenpanik

Popular comments by zenpanik

Datathon 2020 – NetInfo Article Recommender – Newbies

1. Data Prep
1.1. What does the Pandas describe method return for your data?

1.2. Pandas is built on top of NumPy arrays (really fast and efficient) and already provides standard methods for this kind of work. I am not sure why the team did not use them and instead spent time writing code to work with dictionaries.
Would you mind elaborating on this, please?

For example:

```python
a = train['pagePath']
a = a.to_list()
a = set(a)
a = list(a)
```

This is just list(train['pagePath'].unique())
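
To make the comparison concrete, here is a minimal sketch on a made-up frame. Only the column name pagePath comes from the code above; the sample values are hypothetical. It checks that both versions yield the same elements.

```python
import pandas as pd

# Hypothetical stand-in for the team's `train` frame (values are invented).
train = pd.DataFrame({"pagePath": ["/a", "/b", "/a", "/c", "/b"]})

# Four-step version from the article: column -> list -> set -> list.
via_set = list(set(train["pagePath"].to_list()))

# One-step Pandas version.
via_unique = list(train["pagePath"].unique())

# Same elements either way. unique() keeps first-appearance order,
# while the order coming out of a set is arbitrary.
assert set(via_set) == set(via_unique)
print(via_unique)
```

If the order of pages matters later on, the unique() version has the additional benefit of being reproducible between runs.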

1.2. The dictionary could be built by applying a function to the URL column. You used the column “pagePath”, but you said that the columns are:
[“User ID”, “Time Stamp”, “URL”, “Page Title” and “Page Views”]

I do not see how you derived this column list.

1.3. I do not see sorting by timestamp in your code.

1.4. I do not see sorting by visitor in the code, so all the loops in the MODEL step are incorrect, since

while(visitor[j] == visitor[j+1])

assumes that rows belonging to the same visitor (probably User ID) are adjacent.
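
One possible way to address points 1.3 and 1.4 together, sketched under assumptions: the column names visitor, timestamp and pagePath and the sample rows below are hypothetical, not taken from the team's data. The idea is to sort by visitor and time first, then let groupby pair each page with the next one for the same visitor instead of comparing neighbouring rows in a while loop.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
train = pd.DataFrame({
    "visitor": ["u2", "u1", "u1", "u2", "u1"],
    "timestamp": pd.to_datetime([
        "2020-05-01 10:05", "2020-05-01 09:00", "2020-05-01 09:10",
        "2020-05-01 10:00", "2020-05-01 09:05",
    ]),
    "pagePath": ["/d", "/a", "/c", "/e", "/b"],
})

# Sort so that each visitor's rows are contiguous and in time order.
train = train.sort_values(["visitor", "timestamp"]).reset_index(drop=True)

# Next page within the same visitor; the last page view of each visitor gets NaN.
train["next_page"] = train.groupby("visitor")["pagePath"].shift(-1)

# Keep only rows that have a following page for the same visitor.
pairs = train.dropna(subset=["next_page"])[["visitor", "pagePath", "next_page"]]
print(pairs)
```

Because the shift happens inside each group, a page is never paired with a page from a different visitor, which is the failure mode of the unsorted loop.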

1.5. KNN usually stands for K-Nearest Neighbours. What does it stand for in your case?

1.6. Please show a snippet of your data … this code is puzzling:

What are knn_data[1] and knn_data[0]?

2. MODEL

What exactly are you predicting? Why do you use a regression model?

3. Please include charts and samples from your data.

Cryptocurrency Prediction by Kautilya

1. You may want to include some evaluation metrics for your models on both the train and test sets.
2. On the data prep part: simply removing rows with missing values is not the best solution, because this is time-series data and dropping rows could seriously bias your next steps.
3. The assumption you made about the “large number of missing values” is probably weak. Do you have any data or metric to support it?
4. You may want to explain in more detail why the data is not continuous (here is a link on discrete and continuous data: https://www.mathsisfun.com/data/data-discrete-continuous.html).
5. How would you rank your model? What metrics did you use?
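
Points 2 and 3 can be illustrated with a minimal sketch. The daily price series below is invented for illustration and is not the competition data. It first measures the share of missing values and then compares dropping rows with filling the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series with gaps (values are made up).
idx = pd.date_range("2020-01-01", periods=8, freq="D")
price = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, 15.0, 16.0, np.nan], index=idx
)

# Quantify the gaps before deciding that they are "large".
missing_share = price.isna().mean()

# Option A: drop rows. The series is no longer evenly spaced in time.
dropped = price.dropna()

# Option B: interpolate along the time index, then forward-fill any trailing gap.
filled = price.interpolate(method="time").ffill()

print(f"missing share: {missing_share:.0%}")
print(filled)
```

Note that interpolation uses later observations to fill earlier gaps, so for a forecasting task a plain forward fill, which relies only on past values, may be the safer choice.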