Approach to the Task: Steps

1. Tweet Data

While reviewing the sales data, we noticed outliers in the transaction counts of some collections. The problem was resolved by deleting the zero-value transactions.

We check the correlation between the sum of retweets (daily sum, cumulative sum, cumulative rolling sum) and the daily sum of the USD transaction amounts per collection. We concentrate our effort on retweets because reposting shows stronger engagement and has a bigger influence on spreading the word about NFTs.

There are 8 different users (studios) in the tweet data, which we link to the 8 NFT collections. MAYC (Mutant Ape Yacht Club) either has no tweets or is mentioned together with Bored Ape Yacht Club.

Most of the tweets are for the clone_x collection (from RTFKT Studios).

import pandas as pd

# Load the tweet data and count the tweets per user (studio)
tweet = pd.read_csv('hackathon_tweets.csv')
userscreenname = tweet['userscreenname'].value_counts()
print(userscreenname)

RTFKT Studios                1393
Cool Cats                     649
MeebitsDAO (minting now!)     497
Bored Ape Yacht Club          270
doodles                       246
Pudgy Penguins                 13
Azuki                          12
Penguins                        1
Name: userscreenname, dtype: int64

Because we base our Twitter analysis on summed values, we choose to work with
the collection 'clone_x' (1393 observations). We then obtain the sum of retweets
per day, the cumulative sum of retweets, and the cumulative rolling sum over a
10-day horizon (R code):

library(dplyr)
library(zoo)

# Daily sum of retweets, then the cumulative and 10-day rolling sums
clone_x_tweet_agg_sum <- aggregate(retweets ~ timestamp, data = clone_x_tweet, FUN = sum)
clone_x_tweet_agg_cum <- clone_x_tweet_agg_sum %>%
  mutate(cumsum = cumsum(retweets)) %>%
  mutate(cum_rolling_10 = rollapplyr(retweets, width = 10, FUN = sum, partial = TRUE))
The result is an object with 411 rows and 4 columns: 'timestamp', 'sum of daily retweets',
'cumulative sum of retweets', and 'cumulative rolling sum over a 10-day horizon'.

We then calculate the so-called 'daily sum of transactions' for the collection clone_x,
i.e. the sum of all NFT transactions on each particular day (from the sales file).
We get an object with 107 rows and 2 columns ('timestamp' and 'daily sum of transactions' in USD).

Then we merge the two data sets on the column 'timestamp' and get a new data set that contains
data only for the common days. This data set has size 92x5 (the columns are 'timestamp',
'sum of daily retweets', 'cumulative sum of retweets', 'cumulative rolling sum over a
10-day horizon', and 'daily sum of transactions').

Then we calculate the correlation of the tweet-related columns with the
'daily sum of transactions' for the collection:

#Corr. sum of daily retweets and daily sum of USD amount
cor(clon_x_mrd_sum$retweets, clon_x_mrd_sum$amountusd) # 0.6384357
cor(log(clon_x_mrd_sum$retweets), log(clon_x_mrd_sum$amountusd)) # 0.3564773

#Corr. cumsum retweets and daily sum of USD amount
cor(clon_x_mrd_sum$cumsum, clon_x_mrd_sum$amountusd) # -0.362008
cor(log(clon_x_mrd_sum$cumsum), log(clon_x_mrd_sum$amountusd)) # -0.5066663

#Corr. cum_rolling_10 retweets and daily sum of USD amount
cor(clon_x_mrd_sum$cum_rolling_10, clon_x_mrd_sum$amountusd) # 0.2243003
cor(log(clon_x_mrd_sum$cum_rolling_10), log(clon_x_mrd_sum$amountusd)) # 0.3191343

A high correlation (0.6384357) between the daily sum of retweets and the daily sum of transactions
is detected. We can now try to build a time series model with an exogenous variable (the daily sum
of retweets) that explains the daily sum of transactions for a collection (clone_x).

To this end, we first check the type of process that is thought to generate the daily sum
of transactions. We use the R function auto.arima(), which suggests an ARIMA(0, 1, 2) process.

The output of the ARIMA(0, 1, 2) model with the exogenous variable is:

            ma1      ma2   retweets
        -0.3704  -0.2593  4972.2122
s.e.     0.1107   0.1097   789.5406

All coefficients are statistically significant (each estimate exceeds twice its standard error).

We also try an alternative model, ARIMA(1, 1, 1), with the same exogenous variable:

            ar1      ma1   retweets
         0.5698  -0.9277  5158.4020
s.e.     0.1361   0.0763   684.7399

All coefficients are again statistically significant.
Based on the AIC and the RMSE, we conclude that the alternative model is better (see nft_tweets_R).

2. Calculating Rarity Score:


To calculate the NFT rarity, we use the following formula:

Rarity Score for a Trait = 1 / (Number of NFTs with that trait / Total number of NFTs in the collection)

To calculate the overall rarity score of an NFT, we use the following formula:

Overall Rarity Score = Σ (Rarity Score for trait i), over all traits i of the NFT
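For example (with made-up numbers): if 25 of the 10,000 NFTs in a collection carry a given trait value, that trait's rarity score is 1 / (25 / 10,000) = 400, and an NFT whose traits score 400, 20, and 5 has an overall rarity score of 425.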

To calculate these scores, we go through a couple of steps.


1. Calculate the count of NFTs grouped by collection, trait_type, and trait value:

,count(*) over (partition by [collection],[trait_type],[value]) as [TotalTraitValue]

,count(*) over (partition by [collection],[trait_type]) as [TotalTraitType]

,count(*) over (partition by [collection]) as [TotalCollection]

2. Calculate the ratio between the count of NFTs with a specific trait and the count of all NFTs in the collection:

,(cast(count(*) over (partition by [collection],[trait_type],[value]) as float)/count(*) over (partition by [collection])) as [RarityByCollection]

3. Calculate the Rarity Score for a trait = 1 / [RarityByCollection]

4. Overall Rarity Score = Sum(Rarity Score for a trait) over all traits of an NFT (a pandas equivalent of these steps is sketched below).

!!!See attached file
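As a cross-check of the SQL above, here is a minimal pandas sketch of the same four steps (the file name 'hackathon_traits.csv' and the columns 'collection', 'token_id', 'trait_type', and 'value' are assumptions based on the partition columns in the SQL):

import pandas as pd

# Assumed file and column names; one row per (NFT, trait) pair
traits = pd.read_csv('hackathon_traits.csv')

# Steps 1-2: share of the collection's NFTs carrying each (trait_type, value)
n_with_trait = traits.groupby(['collection', 'trait_type', 'value'])['token_id'].transform('count')
n_in_collection = traits.groupby('collection')['token_id'].transform('nunique')
traits['RarityByCollection'] = n_with_trait / n_in_collection

# Step 3: rarity score for a trait
traits['TraitRarityScore'] = 1 / traits['RarityByCollection']

# Step 4: overall rarity score per NFT
overall_rarity = traits.groupby(['collection', 'token_id'])['TraitRarityScore'].sum()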

3. Creating the image mapping algorithm:

We extracted an image database from the given files (trait data). Each image is given a name that is a combination of the token ID and the collection name.

  1. When a specific image has to be recognized, we search for it in the database and print its features: all transactions with all the available information for them, including the last one, so that we know who the last owner is.
  2. Another test checks the similarity of this image to the others in the collection (the whole database). This can be used to assess the rarity of the image within the collection (see the sketch below).
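A minimal sketch of the similarity test with scikit-image (compare_ssim is the older name of skimage.metrics.structural_similarity; the grayscale conversion, the resize step, and the file names are assumptions):

import cv2
from skimage.metrics import structural_similarity  # compare_ssim in older scikit-image

def image_similarity(path_a, path_b):
    # Load both images in grayscale and bring them to a common shape
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.resize(img_b, (img_a.shape[1], img_a.shape[0]))
    # SSIM is 1.0 for identical images and decreases with dissimilarity
    return structural_similarity(img_a, img_b)

# Example: compare a query image against another image from the same collection
score = image_similarity('1234_clone_x.png', '5678_clone_x.png')  # hypothetical file names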




5 thoughts on "JustForFun"

  1. Overall: you had some nice ideas but probably ran out of time.

     2. It would be nice to do some data analysis and show examples of the rarity score. It would really help us understand if it provides any value for further use.

     3. Nice idea to use compare_ssim, but again, it is not clear what this comparison leads to. The test in the article shows only one image; if there were more, that would validate the idea.

  2. It is great that you tried to take into account the tweet data and the transactional data, and I believe that on a collection level this general trend can be used to support the forecast at the single-NFT level.
