Case NetInfo/Vesti.bg, Team Army of Ones and Zeroes, Datathon 2020
*Please reach out anytime at +35987877797, [email protected], or caseyp on DSS with any questions; we will answer ASAP and/or add to the article. We tried to keep the article concise rather than overwhelming.
The final prediction/submission is located in this Google Drive folder:
https://drive.google.com/open?id=1UjUC2eyb3SMIjfP2jwlHyJ0qzw5bxBts
File name: permissions.zip (The zipped version is 9x smaller and faster to download)
*Note that the above submission is our best model/prediction for each user for both the next 24 and 48 hours.
To view this article with pictures, navigate to the Google Drive link above and open the Google Docs file.
- Business Understanding
Vesti.bg needs to predict what articles users will like (based on past analytics data) in order to maximize engagement and revenue.
- Data Understanding
23 504 707 Google Analytics pageviews of 2 361 105 users interacting with 58 654 web articles on vesti.bg
All interactions are in the time interval of 2020/04/12 21:06 – 2020/05/12 20:58
- Data Preparation
We used Perl scripts to compress and optimize the data for fast processing and analysis (Appendix C1 – parse.pl).
- Modeling
[skipping many hours of iteration, analysis, and thinking]
The first big insight we discovered was that for MOST articles, page views drop VERY quickly – i.e. over 90% of page views occur within the first 24 hours. For example, look at this article:
www.vesti.bg/bulgaria/volen-siderov-e-obiaven-za-obshtodyrzhavno-izdirvane-6108313
This article got 35352 pageviews (which is relatively high), all of which happened within 2.5 hours of it being published on April 15. The reason is that the article was replaced by an updated version and the traffic went to the updated one:
www.vesti.bg/bulgaria/volen-siderov-e-obiaven-za-obshtodyrzhavno-izdirvane-6108313
Still, this pattern of traffic dropping by orders of magnitude, as seen on the chart above, holds for almost all articles.
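This decay can be checked directly from the compact 01.csv file produced by parse.pl (Appendix C1). The sketch below is not part of our submitted pipeline – it is a minimal, assumption-laden example that computes, for each article, the share of its pageviews falling within 24 hours of its first recorded view (the 24*60-minute window and the two-pass layout are arbitrary choices here):
# sketch: for each article, what fraction of its views fall within 24 hours of its first view?
# input: 01.csv (url_id,time_in_minutes,user_id), as written by parse.pl in Appendix C1
use strict; use warnings;
my (%first, %total, %early);
open my $in, "<", "01.csv" or die $!;
<$in>;                                       # skip the header line
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  $first{$url_id} = $time if !defined $first{$url_id} || $time < $first{$url_id};
}
close $in;
open $in, "<", "01.csv" or die $!;           # second pass: count views inside the first-24h window
<$in>;
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  $total{$url_id}++;
  $early{$url_id}++ if $time - $first{$url_id} <= 24*60;
}
close $in;
for my $id (sort { $total{$b} <=> $total{$a} } keys %total) {
  printf "%s,%d,%.3f\n", $id, $total{$id}, ($early{$id} // 0) / $total{$id};
}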
There IS, however, a small subset of articles that don't lose pageviews as quickly. These articles are more educational than newsworthy.
For example, this article was published back in July 2016 but still consistently averages about 150 views per day.
www.vesti.bg/lyubopitno/lyubopitno/klasicheska-recepta-za-gris-halva-6056434
The article is about making a traditional Bulgarian homemade dessert 🙂
However, these “educational” articles all have relatively few total pageviews per day, meaning that the probability of a user returning to them is very low.
Therefore, our first big insight led us to focus on predicting which article would drop its pageview traffic the slowest for the next 24-48 hours.
Thus, we focused on the articles that had the highest pageviews in the last 21 hours (the dataset ends at 21:00 on May 12th).
This is a good time to note that, as expected, there is almost no traffic during the night:
On the chart below, we only plot the top-candidate articles (we looked at the next 20 articles, which go down to 7k views, but the trends there were much worse).
From this single chart, it is clear that article_id 198 is holding up best against the drop in pageviews.
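For reference, here is a minimal sketch of this candidate selection, assuming the 01.csv layout from Appendix C1 (url_id, time in minutes since April 12 00:00, user_id); the 21-hour window matches the description above, and the hour-by-hour counts are the kind of profile we plotted:
# sketch: rank articles by pageviews in the last 21 hours of the dataset,
# then print an hour-by-hour profile for the top 20 candidates
use strict; use warnings;
my $end_min    = 30*24*60 + 21*60;   # ~21:00 on May 12, in minutes since April 12 00:00
my $window_min = 21*60;              # the last 21 hours
my (%recent, %hourly);
open my $in, "<", "01.csv" or die $!;
<$in>;                               # skip the header line
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  next if $time < $end_min - $window_min || $time > $end_min;
  $recent{$url_id}++;
  $hourly{$url_id}[ int(($time - ($end_min - $window_min)) / 60) ]++;
}
close $in;
my @top = grep { defined } (sort { $recent{$b} <=> $recent{$a} } keys %recent)[0 .. 19];
for my $id (@top) {
  my @h = map { $_ // 0 } @{ $hourly{$id} }[0 .. 20];
  print join(",", $id, $recent{$id}, @h), "\n";
}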
In the end, we are betting that, for every single user, the most likely article to visit next is this one (there will be newly published articles, but we don't have access to them):
We saw a lot of repeat views, so another question we explored was whether a user returns to the same article after reading it. It happens a lot, even after 24/48 hours, but less often for the high-traffic articles, so we would recommend a different article to a user who has already seen the top article.
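A minimal sketch of that fallback rule, again assuming the 01.csv and 01users.csv layout from Appendix C1; article 198 is our pick from the chart above, while the runner-up id and the output file name are hypothetical placeholders:
# sketch: recommend the top article to everyone, except users who have already read it, who get the runner-up instead
use strict; use warnings;
my $top_article    = 198;   # our pick from the decay analysis
my $second_article = 0;     # placeholder: whichever candidate ranks second (hypothetical)
my %seen_top;               # users who already viewed the top article
open my $in, "<", "01.csv" or die $!;
<$in>;
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  $seen_top{$user_id} = 1 if $url_id == $top_article;
}
close $in;
open my $out, ">", "recommendations.csv" or die $!;   # hypothetical output name
open my $users, "<", "01users.csv" or die $!;
<$users>;
while (<$users>) {
  chomp;
  my ($user_id) = split /,/;
  my $rec = $seen_top{$user_id} ? $second_article : $top_article;
  print $out "$user_id,$rec\n";
}
close $users;
close $out;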
- Evaluation
The best way we found to evaluate the models is to:
- drop the data for the last day (5/12) and try to predict/score the most user-article hits for that day; repeat for the previous days (5/11, 5/10, etc.)
- we didn't have time to automate this, so we leave it for later work (a rough sketch of the idea is shown below)
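Since we did not automate this, the following is only a rough sketch of the backtest we have in mind, assuming the 01.csv layout from Appendix C1; the cutoff arithmetic and the simplistic candidate choice (most-viewed article in the 24 hours before the cutoff) are our assumptions, not a finished evaluation harness:
# sketch: hold out the last day, pick the most-viewed article from the 24h before the cutoff,
# and count how many held-out users actually visited that pick
use strict; use warnings;
my $cutoff = 30*24*60;   # minutes since April 12 00:00 ~ start of May 12; shift by 24*60 to repeat for 5/11, 5/10, ...
my (%before, %heldout);
open my $in, "<", "01.csv" or die $!;
<$in>;
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  if ($time < $cutoff) {
    $before{$url_id}++ if $time >= $cutoff - 24*60;   # candidate popularity window
  } else {
    $heldout{$user_id}{$url_id} = 1;                  # held-out (user, article) visits
  }
}
close $in;
my ($pick) = sort { $before{$b} <=> $before{$a} } keys %before;
my ($hits, $users) = (0, 0);
for my $u (keys %heldout) {
  $users++;
  $hits++ if $heldout{$u}{$pick};
}
printf "picked article %s: %d of %d held-out users visited it (%.2f%%)\n",
       $pick, $hits, $users, $users ? 100*$hits/$users : 0;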
*Deployment – optional
We don't see how to deploy this except by A/B testing it on vesti.bg – for example, by diverting 1% of the traffic to this model and seeing how well it performs against the existing/current one.
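One simple way to divert such a slice is to bucket users deterministically by a hash of their id; the sketch below is purely illustrative and assumes nothing about how NetInfo actually routes traffic:
# sketch: deterministically route ~1% of users to the candidate model via a hash of the user id
use strict; use warnings;
use Digest::MD5 qw(md5_hex);
sub recommender_for {
  my ($user_id) = @_;
  my $bucket = hex(substr(md5_hex($user_id), 0, 8)) % 100;    # stable 0..99 bucket per user
  return $bucket == 0 ? "candidate_model" : "current_model";  # ~1% of users see the candidate
}
# usage idea: log recommender_for($user_id) with every served recommendation, then compare engagement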
Future work
We have a ton of ideas for future improvements.
The biggest open question we have is where/how the inbound traffic comes from. That will likely make a huge difference in modeling, as some of it is organic (visitors to the home page or app) vs. traffic from other websites/search engines/Facebook. Modeling these two sources separately will likely reveal very different engagement patterns.
We couldn't find a productive way to use TensorFlow for now. Our best idea is to check whether some users prefer certain types of articles so strongly that they would be much more likely to read the 2nd- and 3rd-best articles when their affinity for the 1st-best article is low. We doubt that will make a significant improvement, but data beats intuition, so we want to test it.
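If we ever test this, one cheap proxy for "types of articles" is the first URL path segment (e.g. bulgaria, lyubopitno in the links above), or the site's own categories mentioned in the comments below. The sketch below is only that – an untested outline in which the category definition, the field positions in 01articles.csv, and the decision rule are all our assumptions:
# sketch: per-user "article type" affinity, using the first URL path segment as the type
use strict; use warnings;
my (%category_of, %user_cat, %user_total);
# map article id -> category from 01articles.csv (column 4 holds the URL, e.g. www.vesti.bg/bulgaria/...)
open my $ia, "<", "01articles.csv" or die $!;
<$ia>;
while (<$ia>) {
  chomp;
  my @f = split /,/;
  (my $path = $f[4]) =~ s{^https?://}{};
  my (undef, $cat) = split m{/}, $path;
  $category_of{ $f[0] } = $cat // "unknown";
}
close $ia;
# accumulate per-user category view counts from the click stream
# (doing this for all ~2.3M users is memory-hungry; restricting to active users would be cheaper)
open my $in, "<", "01.csv" or die $!;
<$in>;
while (<$in>) {
  chomp;
  my ($url_id, $time, $user_id) = split /,/;
  my $cat = $category_of{$url_id} // "unknown";
  $user_cat{$user_id}{$cat}++;
  $user_total{$user_id}++;
}
close $in;
# affinity of a user for a given article's category; if it is low, a runner-up from a
# better-matching category could be recommended instead
sub affinity {
  my ($user_id, $article_id) = @_;
  my $cat = $category_of{$article_id} // "unknown";
  return 0 unless $user_total{$user_id};
  return ($user_cat{$user_id}{$cat} // 0) / $user_total{$user_id};
}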
Appendix C1 – parse.pl
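# parse.pl (overview): reads the raw Google Analytics exports z_orig/vest000000000000 .. vest000000000026
# (columns: [0] article URL, [1] headline, [2] timestamp, [3] user id), assigns compact integer ids to
# every URL and user, and writes:
#   01.csv          - url_id, time in minutes since 2020-04-12 00:00, user_id (one row per pageview)
#   01articles.csv  - articles sorted by view count, plus a 32-bucket per-day view histogram
#   01users.csv     - users sorted by view count
#   top_articles.csv / top_users.csv - just the names of the top 1000 articles / top 2000 users
#   histograms.csv  - sitewide time-of-day, time-of-week, and views-per-user histograms
#   errors.txt      - rows whose timestamp falls outside the expected day buckets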
use Text::CSV_XS;
use B;
use Time::Piece;
if( 0 ) {
$lastfile = 0;
} else {
$lastfile = 26;
}
my $top_users=2000;
my $top_urls=1000;
my $day_count=32;
my $start_time = Time::Piece->strptime("2020-04-12 00:00:00",'%Y-%m-%d %H:%M:%S');
$lc=0; $errc=0;
# histograms/buckets
my @hist_url_day_of_month;
my @hist_url_time_of_day;
my @hist_url_time_of_week;
my $buckets_in_day=24;
my $buckets_in_week=7*12;
for(my $i=0; $i<$buckets_in_day; $i++) { $hist_url_time_of_day[$i]=0; }
for(my $i=0; $i<$buckets_in_week; $i++) { $hist_url_time_of_week[$i]=0; }
my $user_count=0;
my @user_id2name;
my %user_name2id;
my @user_id2viewcount;
my $url_count=0;
my @url_id2name;
my @url_id2articlestring;
my @url_id2viewcount;
my %url_name2id;
my $mindate=222200000000000;
my $maxdate=0;
my @url_mintime;
my @url_minepoch;
my $hist_user_size=2000;
my @hist_user;
for(my $i=0; $i<$hist_user_size; $i++) { $hist_user[$i]=0; }
$delimiter = ",";
open OUTF,">01.csv" or die "Cannot write to 01.csv\n";
print OUTF "url_id${delimiter}time${delimiter}user_id\n";
open OUTE,">errors.txt" or die $!;
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
for($fid=0; $fid<=$lastfile; $fid++){
$fname = sprintf("z_orig/vest%012d",$fid);
open my $fh, "<:encoding(utf8)", "$fname" or die "cannot read $fname: $!";
my $row = $csv->getline ($fh); # skip the header row
while (my $row = $csv->getline ($fh)) {
if($lc%100000==0){ print "$fid.$lc $mindate-$maxdate $url_count $user_count\n";}
# [0]url4article [1]headline [2]time [3]user
# fix the time
my $time_obj = Time::Piece->strptime(substr($row->[2],0,19),'%Y-%m-%d %H:%M:%S');
my $prettydate = $time_obj->yy*100000000 + $time_obj->mon *1000000 + $time_obj->mday*10000 + $time_obj->hour * 100 + $time_obj->min;
if($prettydate>$maxdate) { $maxdate = $prettydate; }
if($prettydate<$mindate) { $mindate = $prettydate; }
my $seconds_from_start_time=$time_obj->epoch-$start_time->epoch;
# process the URL
my $url_name = $row->[0];
if( !defined($url_name2id{$url_name} )){
$url_name2id{ $url_name } = $url_count;
$url_id2name[ $url_count] = $url_name;
$url_id2articlestring[ $url_count ] = $row->[1];
$url_id2viewcount[ $url_count ] = 0;
$url_mintime[ $url_count ] = $prettydate;
$url_minepoch[ $url_count ] = $seconds_from_start_time;
for(my $aday=0; $aday<$day_count; $aday++){
$hist_url_day_of_month[$url_count][$aday]=0;
}
$url_count++;
}
my $url_id = $url_name2id{$url_name};
$url_id2viewcount[ $url_id ]++;
if($url_mintime[$url_id] > $prettydate) {
$url_mintime[$url_id] = $prettydate;
$url_minepoch[$url_id] = $seconds_from_start_time;
}
my $user_name = $row->[3];
if( !defined($user_name2id{$user_name} )){
$user_name2id{ $user_name } = $user_count;
$user_id2name[ $user_count] = $user_name;
$user_id2viewcount[$user_count] = 0;
$user_count++;
}
my $user_id = $user_name2id{$user_name};
$user_id2viewcount[$user_id]++;
#update buckets $buckets_in_day
my $dtmp=int(($time_obj->hour+$time_obj->min/60)* $buckets_in_day/24);
$hist_url_time_of_day[$dtmp]++;
$dtmp=int(($time_obj->day_of_week+$time_obj->hour/24+$time_obj->min/60/24)* $buckets_in_week/7);
$hist_url_time_of_week[$dtmp ] ++;
my $daytime_bucket = int($seconds_from_start_time/ (60*60*24)); # days bucket for histogram
if($daytime_bucket<0 or $daytime_bucket>=$day_count){
$errc++;
print OUTE "Bucket Error: $daytime_bucket $seconds_from_start_time $url_id $user_id\n";
}
$hist_url_day_of_month[$url_id][$daytime_bucket]++;
my $short_print_time = int($seconds_from_start_time/60); # just make it in minutes
print OUTF "$url_id${delimiter}$short_print_time${delimiter}$user_id\n";
$lc++;
}
close $fh;
} # foreach file 0..$lastfile
close OUTF;
#print "MINDATE: $mindate\n MAXDATE: $maxdate\n\n";
# --- PRINT ARTICLES, SORTED ---
#init hash
my %hash_for_sorting;
for(my $i=0; $i<$url_count; $i++){
$hash_for_sorting{$i}=$url_id2viewcount[$i];
}
#sort and make sorted array
my $idcount=0;
my @viewsorted_url_ids;
foreach my $id (sort { $hash_for_sorting{$b} <=> $hash_for_sorting{$a} or $a cmp $b } keys %hash_for_sorting) {
$viewsorted_url_ids[$idcount] = $id;
$idcount++;
}
print "Assert sorted articles ok $idcount == $url_count \n";
open TOPA,">top_articles.csv" or die $!;
open OUTA,">01articles.csv" or die $!;
# note: the columns actually written are id, viewcount, first-click offset (seconds), first-click time, url, then one column per day bucket
print OUTA "id${delimiter}viewcount${delimiter}time_of_click${delimiter}url${delimiter}title\n";
for(my $i=0; $i<$url_count; $i++){
my $id=$viewsorted_url_ids[$i];
if($i<$top_urls){
print TOPA "$url_id2name[$id]\n";
}
if(1){
print OUTA "$id${delimiter}$url_id2viewcount[$id]${delimiter}$url_minepoch[$id]${delimiter}$url_mintime[$id]${delimiter}$url_id2name[$id]";
} else {
print OUTA "$id${delimiter}$url_id2viewcount[$id]${delimiter}$url_minepoch[$id]${delimiter}$url_mintime[$id]${delimiter}$url_id2name[$id]${delimiter}$url_id2articlestring[$id]";
}
for(my $bid=0; $bid<$day_count; $bid++){
print OUTA ",$hist_url_day_of_month[$id][$bid]";
}
print OUTA "\n";
}
close OUTA;
close TOPA;
# --- PRINT USERS, SORTED ---
# this just fixes the hist_user histogram
for(my $i=0; $i<$user_count; $i++){
if($user_id2viewcount[$i]<$hist_user_size){
$hist_user[$user_id2viewcount[$i]]++;
}
if( 0 ){ # OLD UNSORTED USER CODE
print OUTU "$i,$user_id2viewcount[$i],$user_id2name[$i]\n";
}
}
my %hash_for_sorting_users;
for(my $i=0; $i<$user_count; $i++){
$hash_for_sorting_users{$i}=$user_id2viewcount[$i];
}
#sort and make sorted array
my $useridcount=0;
my @viewsorted_user_ids;
foreach my $id (sort { $hash_for_sorting_users{$b} <=> $hash_for_sorting_users{$a} or $a cmp $b } keys %hash_for_sorting_users) {
$viewsorted_user_ids[$useridcount] = $id;
$useridcount++;
}
print "Assert sorted users ok $useridcount == $user_count \n";
open TOPU,">top_users.csv" or die $!;
open OUTU,">01users.csv" or die "Cannot write to 01users.csv\n";
print OUTU "user_id${delimiter}view_count${delimiter}user_name\n";
for(my $i=0; $i<$user_count; $i++){
my $id=$viewsorted_user_ids[$i];
if($i<$top_users){
print TOPU "$user_id2name[$id]\n";
}
if(1){
print OUTU "$id${delimiter}$user_id2viewcount[$id]${delimiter}$user_id2name[$id]";
}
print OUTU "\n";
}
close OUTU;
close TOPU;
close OUTE;
# print histogram buckets
my $histmp;
open OUTH,">histograms.csv" or die $!;
for(my $i=0; $i<$buckets_in_day; $i++) {
$histmp = $i*24/$buckets_in_day; # bucket start, in hours of the day
print OUTH "$histmp,";
}
print OUTH "\n";
for(my $i=0; $i<$buckets_in_day; $i++) { print OUTH "$hist_url_time_of_day[$i],"; }
print OUTH "\n\n\n";
for(my $i=0; $i<$buckets_in_week; $i++) {
$histmp = int(10*$i*7/$buckets_in_week)/10; # bucket start, in days of the week (1 decimal)
print OUTH "$histmp,";
}
print OUTH "\n";
for(my $i=0; $i<$buckets_in_week; $i++) { print OUTH "$hist_url_time_of_week[$i],"; }
print OUTH "\n\n\n";
for(my $i=0; $i<$hist_user_size; $i++){ print OUTH "$i,"; }
print OUTH "\n";
for(my $i=0; $i<$hist_user_size; $i++){ print OUTH "$hist_user[$i],"; }
print OUTH "\n\n\n";
close OUTH;
print "MINDATE: $mindate\n MAXDATE: $maxdate\n\n";
print "LINE1: $lc ERRORS: $errc\n";
printf "ARTICLES: %d\n",$url_count;
printf "USERS: %d\n",$user_count;
Appendix C2 – step01.pl
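# step01.pl (overview): reads 01users.csv, 01articles.csv, and 01.csv produced by parse.pl and
# splits the click stream into small per-entity files for plotting/analysis:
#   out/u<user_id>.csv    - every click of each of the top 100 users (article id, time, article view counts)
#   out/a<article_id>.csv - every click on each of the top 10 articles plus a hand-picked list of
#                           "special" article ids we were investigating (time, user id, user view count)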
my $outdir="out/";
my $delimiter=",";
my @user_lviews;
my @art_lviews;
my @user_views;
my @art_views;
my $user_cnt;
my $art_cnt;
my @user_top_ids;
my @art_top_ids; # the top article ids (by total view count)
my $top_user_count=100;
my $top_art_count=10;
my %is_top_user;
my %is_top_art;
my %is_special_art;
$is_special_art{906}=1;
$is_special_art{320}=1;
$is_special_art{204}=1;
$is_special_art{14}=1;
$is_special_art{47}=1;
$is_special_art{106}=1;
$is_special_art{2}=1;
$is_special_art{233}=1;
$is_special_art{110}=1;
$is_special_art{687}=1;
$is_special_art{28}=1;
$is_special_art{1010}=1;
$is_special_art{198}=1;
$is_special_art{80}=1;
$is_special_art{789}=1;
$is_special_art{91}=1;
$is_special_art{847}=1;
$is_special_art{854}=1;
$is_special_art{906}=1;
$is_special_art{1042}=1;
$is_special_art{729}=1;
$is_special_art{731}=1;
# — load users and articles —
open IU,"<01users.csv" or die $!;
open IA,"<01articles.csv" or die $!;
<IU>; <IA>; $user_cnt=0; $art_cnt=0; # skip the header lines
while(<IU>){ chomp; my @fields = split(',',$_);
if($user_cnt<$top_user_count) {
$user_top_ids[$user_cnt]=$fields[0];
$is_top_user{$fields[0]}=$user_cnt;
$user_cnt++;
}
$user_views[$fields[0]]=$fields[1];
}
while(<IA>){ chomp; my @fields = split(',',$_);
if($art_cnt<$top_art_count) {
$art_top_ids[$art_cnt]=$fields[0];
$is_top_art{$fields[0]} =$art_cnt;
$art_cnt++;
}
$art_views[$fields[0]]=$fields[1];
if(!defined($fields[35])){ die "bad article $art_cnt $_\n";}
$art_lviews[$fields[0]]=$fields[35]; # column 35 = the article's views during the last day of data
}
close IU; close IA;
print "--- Starting the big one\n";
$lc=0;
open IN,"<01.csv" or die $!;
while(<IN>){ chomp; my @fields = split(',',$_); $lc++; if($lc%100000==0){ print $lc/1000000; print "\n"; }
my $art_id=$fields[0];
my $time=$fields[1];
my $user_id=$fields[2];
# dump users
if(defined($is_top_user{$user_id})){
open OU,">>${outdir}u${user_id}.csv" or die $!;
print OU "$art_id${delimiter}$time${delimiter}$art_views[$art_id]${delimiter}$art_lviews[$art_id]\n";
close OU;
}
# dump articles
if(defined($is_top_art{$art_id}) || defined($is_special_art{$art_id})){
open OA,">>${outdir}a${art_id}.csv" or die $!;
print OA "$time${delimiter}$user_id${delimiter}$user_views[$user_id]\n";
close OA;
}
}
close IN;
12 thoughts on “Case NetInfo/Vesti.bg article recommendation — Team Army of Ones and Zeroes — Datathon 2020”
I do not see the graphs. Is this something about my browser, or are they missing? I also see no links to the graphs.
I couldn't figure out how to include the graphs in this document. Here is a link to the Google Drive version that has the graphs: https://docs.google.com/document/d/186Bcv4DbrYLY7m3ZCeGji9QM7DAjm0arY2doV5TTNAs/edit?usp=sharing
Also in PDF format:
https://drive.google.com/file/d/1eFiw4lqNtMXAbQZLwhLPBCOyODPQY4oU/view?usp=sharing
Now it shows; I also looked at the GDoc. No worries.
SQL code in Jupyter – just not readable!
Perl code …
I like the fact that I can not understand what you have done here – you just pasted some code with very ugly formatting 😀
Hi zenpanik,
The copy/paste lost the formatting and pictures. We wrote everything on a Google Doc (see below) which has pictures and fixed-font formatting. But even with the formatting you are right – Perl is hard to read. We tried to add comments, but the main goal is to make it work, and Perl is very fast and easy for prototyping (at least for old-timers like me :-).
Here are the links to the pretty google docs and PDF:
https://docs.google.com/document/d/186Bcv4DbrYLY7m3ZCeGji9QM7DAjm0arY2doV5TTNAs/edit?usp=sharing
Also in PDF format:
https://drive.google.com/file/d/1eFiw4lqNtMXAbQZLwhLPBCOyODPQY4oU/view?usp=sharing
“Our best idea is to see if there is a way to see if some users strongly prefer some types of articles and maybe the preference will be so strong that they would be way more likely to read the 2nd and 3rd best articles, if their affinity for the 1st best article is low.”
That’s a nice idea – but how would it be implemented?
We didn’t have time to test that. So at this point it is just a hypothesis (guess) that needs to be tested. To be clear, we are somewhat confident that it will NOT help FOR THIS PARTICULAR DATASET. (There are clearly many scenarios where it will)
The way to test it would be to erase the last day of data (or the last 2 days, or 3, etc.) and use the erased data for evaluation.
The only other thing needed is some model for what “types of articles users like”. For that we can use words from the title of the article, or the natural categories that vesti.bg has (like coronavirus, bulgaria, sviat, etc), or some other way to model user preference.
With the above two, one can measure the difference in the objective (accurately predicted user-article visits for the next 24 hours) and see which one is better and by how much (and whether it is statistically significant).
We didn’t have time to do all of that 🙂
But we did estimate/guess/wager that it won’t improve significantly our prediction score, so we prioritized it lower and ultimately didn’t do it.
The idea to test on the most recent day(s) is the way to go. This is done right.
I do like your data exploration, based on the time of the day, and the recurrent visits.
Yet, I am missing an exploration by the user, or similar articles by similar users.
Overall, I like this article for the analysis that was done on the data and for pointing to interesting ideas based on the observations. This is how data science works: we have to look at the data! This team is well ahead of any other team on that, and I congratulate them for this! There are tons of nice things, like decay over time, observations of when this happens, how long it lasts, etc.
They have not advanced as much in formulating how this would be operationalized, e.g., what model to use exactly and what it does – is it collaborative or content-based filtering? And their ideas for testing are too high level (A/B testing is OK, but how about some evaluation measure too?).
Yet, they do find some very nice insights, which can help a lot future work on this dataset.